DeepVariant – TrioTrain: Developing a transfer learning protocol using non-human genomes – MU Institute for Data Science and Informatics

Published on Dec. 1, 2022

Genomic data are widely available for investigating phenotypes that affect both human and animal health. Although the investigation of human health often begins with model organisms, genomics technologies and software are often developed with only the human genome in mind, which severely limits their applicability to comparative work. Translational research requires robust systems-focused, or “One Health,” solutions that enable mutual progress across the animal, plant, and human genomics communities. Regardless of species, genomics faces a common challenge: data must be continuously re-processed as sample sizes rapidly increase. The Genome Analysis Toolkit (GATK) is currently the preferred method for calling variants from short-read, whole-genome sequencing (WGS) data. While it provides a rigorous template, GATK remains limited by our knowledge of how errors are introduced, and it relies on cohort calling to substantially improve variant quality. DeepVariant (DV) is an alternative method that calls variants from pileup images using a deep neural network. Unlike GATK, DeepVariant can identify both known and unknown error sources and produces highly accurate genotypes from a single sample. However, DeepVariant remains constrained to human, mouse, and mosquito genomes because most non-human species lack the highly accurate genotypes needed as training labels. By leveraging existing WGS samples and transfer learning, our bovid DeepVariant model enables others to benefit from the entire SRA catalog of bovine sequence data (n=5,500) without re-processing massive datasets. By automatically iterating through existing sequenced trios, our open-source re-training pipeline, TrioTrain, serves as a foundation for repeating the training process in any species with the necessary data. Our work includes a protocol for creating silver-standard truth sets in non-human species. Furthermore, TrioTrain improves the portability of re-training to SLURM-based high-performance computing (HPC) clusters. By providing detailed guidance on replicating the transfer learning process, we expedite the development of deep-learning genomics tools for interdisciplinary health.