P3DB-4.0: a large-scale plant protein phosphorylation database from diverse species to phosphosite conservation
Since the last version (version 3) of P3DB was published in 2014, we have been encouraged to see more and more plant scientists using our database, including for reanalysis of our data collections. In this new version (version 4.0), we rebuilt the database on a brand-new platform (https://www.p3db.org) to accommodate many more data sets and new bioinformatic modules, many of which were suggested by our users and collaborators. In detail, we would like to highlight our novel developments in the following aspects: (1) The new P3DB altogether harvests ~220,000 phosphosites in ~57,000 phosphoproteins curated across…
Predictive Models for Early Detection of Glaucoma using Electronic Health Records
Glaucoma is the second leading cause of irreversible blindness across the world. Around 70 million people have glaucoma, and 4.4 million people worldwide are blind from optic nerve damage caused by undiagnosed glaucoma. Studies suggest that early detection of glaucoma is essential for preventing irreversible blindness. Effective use of electronic health records (EHR) provides data-driven, evidence-based risk factors for glaucoma progression and would also enable a practical machine learning (ML) model to predict glaucoma before the onset of clinical symptoms. Predictive models for the early detection of glaucoma are critical steps toward subsequent actionable, preventive interventions. We…
Analysis of tumor associated macrophages’ heterogeneity in colorectal cancer patients using single-cell RNA-seq data
Colorectal cancer (CRC) is one of the deadliest malignancies worldwide. Though immune checkpoint inhibition has proven effective for a number of other tumors, it offers benefits in only a small subset of CRC cases. In general, heterogeneous cell groups in the tumor microenvironment (TME) are considered the major barrier to unveiling the causes of low immune response. Therefore, deconvolution of cellular components in highly heterogeneous microenvironments is crucial for understanding those mechanisms. Single-cell sequencing technology has revolutionized TME research by enabling the profiling of cells at high resolution. We have analyzed scRNA-seq data from 23 CRC patients with pre-treatment primary tumors using Seurat…
Explainable Artificial Intelligence for Patient Stratification and Drug Repositioning
Enabling precision medicine requires developing robust patient stratification methods as well as drugs tailored to homogeneous subgroups of patients from a heterogeneous population. Developing de novo drugs is expensive and time-consuming, with an ultimately low FDA approval rate. These limitations make developing new drugs for a small portion of a disease population infeasible. Therefore, drug repositioning is an essential alternative for developing new drugs for a disease subpopulation. There is a crucial need to develop data-driven approaches that find druggable homogeneous subgroups within the disease population and reposition drugs for these subgroups. In this study, we developed an explainable AI…
Scientific illustrations as a source of biological pathway information
The idea of cross-referencing distinct areas of knowledge to glean new insights is not new, but little work has been done on the automatic analysis of different expressions of the same knowledge source, such as a given scientific text and the illustrations that accompany it. The NLM grant “Image-guided Biocuration of Disease Pathways From Scientific Literature” attempts to do just that, using state-of-the-art NLP and graph technologies to extract published knowledge and convert it into both machine- and human-readable formats. With thousands of scientific papers being published every month, such a system can significantly improve the visibility of new research and speed…
Synteny and compositional domains in honey bee genomes
Analysis of synteny, the conservation of gene order between homologs along chromosome segments, is a standard investigative strategy used in many comparative genomic studies to understand genomic conservation and evolution. It allows us to study the evolutionary history between genomes. Conservation of synteny can reflect highly important regions in the genome or critical functional relationships between orthologous genes. Our recent study of the genome compositional features of Hymenoptera provided insights into the genomic composition of Apis and other insects with different levels of eusociality. To further understand the biological meaning of different GC regions within the honey bee genomes, we analyzed…
A Case-Control based Genomic Analysis of Chronic Obstructive Pulmonary Disease
Chronic Obstructive Pulmonary Disease (COPD) is a respiratory illness that affects millions of people all over the world. It is a major cause of chronic morbidity and mortality and a serious global public health problem. COPD is the fourth leading cause of death worldwide. Although the environmental causes of COPD, predominantly cigarette smoking, are well documented, the genetic underpinnings of COPD remain largely unknown to date. Furthermore, in the current landscape of a respiratory pandemic, COPD patients are at a much higher risk of developing other respiratory illnesses and co-morbidities. In this study, we use genomic data from…
Large Scale Study of the Long-Term Effects of COVID-19 Infection Using Aggregated EHR Data
COVID-19 is known to have complex multi-system effects, but the lasting effects of infection remain poorly understood. With the number of confirmed infections globally surpassing 200 million, it is more important than ever to understand the potential long-term implications of COVID-19 infection for patient health. We have created a pipeline to analyze data from over 1.4 billion medical encounters recorded in the Cerner Real World Data EHR database. We apply this pipeline to analyze the effect of COVID-19 infection on the rate of subsequent new dementia diagnoses 30+ days after the infection period in patients hospitalized with pneumonia with no…
FASTAptameR 2.0: A Web Server for Combinatorial Selections Analysis
Combinatorial selection strategies are powerful tools that allow researchers to simulate selective pressures over time on randomized sequence libraries. This is important for lead discovery and optimization and for understanding selection dynamics. Given the evolutionary nature of these experiments, high-fitness sequences will enrich, whereas low-fitness sequences will deplete. These experiments can generate large volumes of data, thus driving a need for high-throughput sequence (HTS) analyses that can utilize sequence-specific evolutionary trajectories. Recently, the selections field has benefitted from several software packages for HTS analysis. However, these packages have a high barrier to entry for many users because they are only accessible through…