LARGE-SCALE SOYBEAN GENOME-WIDE VARIATION WORKFLOW AND ASSOCIATION ANALYSIS USING DEEP LEARNING

Advances in next-generation sequencing (NGS) technology and a significant reduction in sequencing costs have made it possible to sequence large collections of crop germplasm, detect genome-scale genetic variation, and apply that knowledge to trait improvement. To make large-scale analysis of genomic variation from NGS resequencing data efficient, we developed a systematic solution that combines a high-performance computing environment, cloud data storage resources, and graphics processing unit (GPU) computing with a deep learning approach. The solution comprises an integrated and optimized variant calling workflow called ‘PGen’, a quantitative phenotype prediction model based on a convolutional neural network, and a deep learning-based algorithm for genome-wide association studies. We review and compare statistical and deep learning methods for genomic selection and genome-wide association, present our work on a sequencing dataset of thousands of soybean lines, summarize ongoing progress in large-scale genome-associated studies, and discuss future work and development.