Published on Dec. 4, 2020
Genomic data has become ubiquitous for bioinformaticians; however, successfully inferring biological meaning depends upon the sensitive prediction of differences between genomes. The most popular method to infer short sequence variants is the Genome Analysis Toolkit (GATK). While GATK provides rigorous guidelines, the methods require knowledge-intensive refinement as software and sequencing technologies advance. A recent advancement from Google Health Genomics called DeepVariant uses a deep neural network to call variants in human whole-genome sequence (WGS) data. In comparison to GATK v4, after training, the human genome DeepVariant model achieved a significant drop in Mendelian Inheritance Errors (MIE). An MIE is a genotype called in an offspring that could not have been inherited from its parents; because such events should occur infrequently, a high MIE rate in a variant calling tool is an indicator of low variant quality. However, the application of DeepVariant remains limited to human, mouse, and mosquito genomes. A comparative evaluation of the performance of DeepVariant in other species will help facilitate the tool's adoption in new disciplines. DeepVariant derives feature weights from genome structure; transferring this pre-trained model to other species can reduce accuracy, which improves after species-specific re-training. We aim to evaluate the ability of DeepVariant to infer variants in non-model organisms using the model previously trained on the human genome. Cattle offer robust family-based WGS data that enable comparison of MIE rates between the two algorithms. Our work demonstrates how the variant sensitivity of DeepVariant changes as the model is successively re-trained on cattle genomes. We highlight the challenges of applying a human-centric model to other species while describing some limitations of DeepVariant and the hurdles to deploying the pipeline on cluster computing architecture outside of the Google Cloud for genomics research.
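To make the Mendelian Inheritance Error rate concrete, the following is a minimal sketch of how it could be computed for parent-offspring trios. It is an illustration only, not the pipeline used in this work: the function names and the tuple-of-alleles genotype encoding are hypothetical, and it assumes diploid autosomal sites with no missing calls.

```python
from itertools import product

# Hypothetical encoding: a genotype is a tuple of two allele indices,
# e.g. (0, 0) = homozygous reference, (0, 1) = heterozygous.

def is_mendelian_consistent(father, mother, child):
    """Return True if the child's genotype can be formed by taking
    one allele from the father and one allele from the mother."""
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) in possible

def mie_rate(trios):
    """Fraction of (father, mother, child) genotype calls that violate
    Mendelian inheritance. Returns 0.0 for an empty input."""
    trios = list(trios)
    if not trios:
        return 0.0
    errors = sum(not is_mendelian_consistent(f, m, c) for f, m, c in trios)
    return errors / len(trios)

# Illustrative calls at four sites for a single trio (made-up genotypes).
example_trios = [
    ((0, 0), (0, 0), (0, 1)),  # error: neither parent carries allele 1
    ((0, 1), (0, 1), (1, 1)),  # consistent
    ((0, 0), (1, 1), (0, 1)),  # consistent
    ((0, 0), (1, 1), (0, 0)),  # error: mother can only transmit allele 1
]
example_rate = mie_rate(example_trios)
```

A lower rate computed this way over the same family-based calls would suggest higher variant quality, which is the basis for comparing the two callers on cattle trios.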
Please contact Robert Sanders (sandersrl@missouri.edu) for Zoom information.