Methods for Measuring Geodiversity in Large Overhead Imagery Datasets
This research introduces some of the first geo-computational methods to address a key gap in the artificial intelligence (AI) and big data literature as it relates to the geosciences and remote sensing: the lack of understanding of the global feature representativeness of labels in large remotely sensed imagery (RSI) datasets. Issues of data fairness, heterogeneity and equitability – often related directly to geographic and demographic under-sampling – have recently come to the fore in multidisciplinary discussions of the ethics of AI. The risks of perpetuating data and models with unknown biases are particularly heightened in the air- and space-borne RSI domain, given…
Teaching Professor Faculty Positions in Data Science and Analytics
Description: The University of Missouri (MU) Institute for Data Science and Informatics (IDSI) is accepting applications for multiple positions of Teaching Professor (Assistant and Associate levels) of Data Science and Analytics. In today’s information-centric world, data are becoming increasingly important for the success of businesses in every industry. Data science at MU is an interdisciplinary field with collaborators from a variety of academic areas, such as computer science, journalism, information science, healthcare, and bioinformatics. The data science and analytics curriculum covers the entire life cycle, from data ingestion and conditions at source through to business intelligence and decision-making products. The…
The Genescape Allele Catalog Development for Precise Identification of Causative SNPs
Next-generation sequencing (NGS) has become increasingly popular in recent years. Large amounts of next-generation resequencing data have been generated and are available online for various organisms, including soybeans. However, current genome-wide association study (GWAS) prediction tools simply identify the most significant SNP based on Manhattan plots and remain limited in pinpointing the exact causative SNPs from SNP array or NGS datasets. Therefore, we are developing a Genescape catalog, a new bioinformatics approach that integrates all potential alleles for all genes in the soybean genome using the genomic variations and phenotypic information from a large subset of cultivated and…
Building a Population-based Childhood Cancer Data Ecosystem: Challenges and Opportunities for Informatics and Data Science
Childhood cancer is a relatively rare disease diagnosed in over 16,000 U.S. children and adolescents (ages 0 – 19) each year. While 84% of children with cancer survive 5 years or more, cancer remains the second leading cause of death in children after accidents. Molecular variations make all childhood cancers extraordinarily rare and difficult to study. The Childhood Cancer Data Initiative (CCDI) of the National Cancer Institute recognizes the critical need to collect, analyze, and share data to address this understudied cancer burden. Innovations in informatics methods and data science are clearly needed to assimilate new data sources, characterize the burden…
Evaluating the effectiveness of transfer-learning with DeepVariant
Genomic data has become ubiquitous for bioinformaticians; however, successfully inferring biological meaning depends upon the sensitive prediction of differences between genomes. The most popular method to infer short sequence variants is the Genome Analysis Toolkit (GATK). While GATK provides rigorous guidelines, the methods require knowledge-intensive refinement as software and sequencing technologies advance. A recent advancement from Google Health Genomics called DeepVariant uses a deep neural network to call variants in human whole-genome sequence (WGS) data. In comparison to GATKv4, after training, the human genome DeepVariant model achieved a significant drop in Mendelian Inheritance Errors (MIE). MIE variants are not passed…
Tool Development for Analyzing Arrhythmias in Fast Cardiac Magnetic Resonance Scans
Cardiac magnetic resonance (CMR) scanning provides a method to diagnose cardiac disease. To obtain an effective image, the standard CMR procedure requires patients to hold their breath during scanning, which is difficult for frail patients in the clinic. Furthermore, standard CMR imaging depends upon averaging together regular cardiac cycles, which is disrupted by irregular heartbeats; this irregularity prevents visualization of arrhythmias. HeartSpeed software will introduce a new strategy to help frail patients, magnetic resonance technicians, and physicians by enabling free-breathing CMR image post-processing that corrects for breathing motion. Closely related algorithms provide a new approach to visualize and…
Using Big Data to Identify Possible External Risk Factors for Poorly Understood Cancers
Worldwide, cancer is the second leading cause of death (Cancer, 2012). There were 17 million new cases and 9.6 million cancer deaths worldwide in 2018, including approximately 1.7 million new U.S. cases and 600,000 U.S. cancer deaths (Cancer Facts & Figures 2018 | American Cancer Society, 2018; Worldwide Cancer Statistics, 2019). The worldwide incidence of cancer is expected to increase to 27.5 million per year by 2040 (Worldwide Cancer Statistics, 2019). The U.S. expects an increase to over 1.9 million new cases per year by 2020 due to an aging Caucasian population and a growing African American population (CDC – Expected New Cancer Cases and Deaths in…
An Evaluation of Physician Burnout by EMR Use Characterization and Correlation
Burnout disproportionately affects healthcare workers and continues to rise. This condition potentially contributes to cost, quality and patient safety risk in an already overburdened United States healthcare system. While the causes of burnout are complex, evidence points to Electronic Medical Record (EMR) use as one major contributor, due to the increased clerical burden that decreases patient contact time and contributes to disruption for the provider. The growth and consolidation of large-scale EMR vendors has given rise to enterprise-scale electronic medical records with workflows applied across disparate venues and specialties, further complicating the ability to optimize the physician EMR experience and leading…
Early Detection of Glaucoma Using Electronic Health Records
Glaucoma is the second leading cause of irreversible blindness worldwide. About 70 million people have glaucoma, and nearly 4.4 million people are blind from optic nerve damage due to undiagnosed glaucoma. Moreover, the current glaucoma growth rate and its economic burden are unsustainable. These trends warrant a systematic evaluation of glaucoma risk assessment and early prediction for better glaucoma care management. Effective use of temporal information across electronic health records (EHR) provides data-driven and evidence-based risk factors linked to glaucoma development and supports an early predictive model. In the present study, we used 830,125 unique patient records from the…