Published on March 6, 2017
Genomic selection is an approach that uses genome-wide molecular markers to improve quantitative traits in plant and animal breeding programs at an early stage, which is especially valuable for species with long life cycles. It is based on the assumption that every quantitative trait locus (QTL) tends to be in linkage disequilibrium with at least one marker. Statistical methods such as ridge regression best linear unbiased prediction (RR-BLUP) [1], Bayes A [2], and Bayesian LASSO [3] are widely used for genomic selection and operate on a SNP genotype matrix. Other machine learning methods (random forest, support vector machine, and neural network) [4] have also been applied to this problem. In this work, we are developing a deep learning method using a long short-term memory (LSTM) recurrent network on a public standard dataset of Pinus taeda (loblolly pine) [5]. The stem height trait (HT, cm) was measured across 861 individuals, derived from 32 parents and genotyped with 4,853 SNPs. Genomic estimated breeding values (GEBV) were calculated using 10-fold cross-validation, and accuracy was measured as the Pearson correlation coefficient between GEBV and observed values.
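To make the evaluation protocol concrete, here is a minimal sketch of 10-fold cross-validated GEBV and Pearson accuracy. It is not the LSTM model: it uses plain ridge regression with a fixed penalty as a stand-in for an RR-BLUP-style baseline (RR-BLUP proper would estimate the penalty from variance components), and it runs on a small synthetic SNP matrix rather than the loblolly pine data. The function names, the penalty `lam`, and the simulated marker effects are all assumptions made for illustration.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Fit ridge regression on centered data; return (marker effects, intercept)."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    n = Xc.shape[0]
    # Dual form: beta = Xc' (Xc Xc' + lam I)^-1 yc.
    # Equivalent to the usual solution, but cheaper when markers >> individuals.
    alpha = np.linalg.solve(Xc @ Xc.T + lam * np.eye(n), yc)
    beta = Xc.T @ alpha
    intercept = y_mean - x_mean @ beta
    return beta, intercept


def cross_validated_gebv(X, y, lam=100.0, n_folds=10, seed=0):
    """Predict each individual from a model trained on the other folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    gebv = np.empty(len(y), dtype=float)
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        beta, intercept = ridge_fit(X[train_idx], y[train_idx], lam)
        gebv[test_idx] = X[test_idx] @ beta + intercept
    return gebv


def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


# Synthetic stand-in data (assumption): SNPs coded 0/1/2, additive effects.
rng = np.random.default_rng(42)
n_individuals, n_markers = 200, 500
X = rng.integers(0, 3, size=(n_individuals, n_markers)).astype(float)
true_effects = rng.normal(0.0, 1.0, size=n_markers)
y = X @ true_effects + rng.normal(0.0, 10.0, size=n_individuals)

gebv = cross_validated_gebv(X, y, lam=100.0, n_folds=10)
accuracy = pearson_r(gebv, y)
print(f"10-fold CV accuracy (Pearson r): {accuracy:.3f}")
```

Any other predictor, including an LSTM, could be swapped in for `ridge_fit` as long as it is trained only on the training folds. Pooling the held-out predictions before computing a single correlation is one common convention; averaging per-fold correlations is another, and the two can give slightly different numbers.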
References:
[1] Hoerl, Arthur E., and Robert W. Kennard. “Ridge regression: biased estimation for nonorthogonal problems.” Technometrics 42.1 (2000): 80-86.
[2] Meuwissen, T. H. E., B. J. Hayes, and M. E. Goddard. “Prediction of total genetic value using genome-wide dense marker maps.” Genetics 157.4 (2001): 1819-1829.
[3] Park, Trevor, and George Casella. “The Bayesian lasso.” Journal of the American Statistical Association 103.482 (2008): 681-686.
[4] Heslot, Nicolas, et al. “Genomic selection in plant breeding: a comparison of models.” Crop Science 52.1 (2012): 146-160.
[5] Resende, Márcio FR, et al. “Accuracy of genomic selection methods in a standard data set of loblolly pine (Pinus taeda L.).” Genetics 190.4 (2012): 1503-1510.