An embarrassingly parallel application: High accuracy mapping of copy number variable regions

Detecting gene copy number variation (CNV) in a species is a cornerstone of genomic research. Most CNV detection tools and methods rely on comparing samples to a reference genome and on detecting characteristic signatures in the alignment data. These methods are robust and generally accurate, but they are not perfect, and different tools achieve different levels of success. Early studies had access to few genome samples, so these shortcomings could be overcome by running multiple tools in each study. With the development of fast, inexpensive sequencing machines, large numbers of samples can now be produced in a short amount of time, which places significant strain on established research pipelines. New ways to map CNV regions onto the genome are needed. One approach is to create a multi-link map, based on the principle of mapping new features in relation to known features on a reference genome. An attempt was made to map transposons in relation to CNV regions and vice versa. This method proved difficult and produced poor results because of the uncertainties associated with both transposons and CNVs. Another approach is to develop a highly scalable parallel processing pipeline that runs multiple instances of selected CNV calling tools. The pipeline would increase accuracy by allowing each genome in the sample set to act as the “reference” genome. Possible CNV events would then be flagged, and those that pass a certain consensus threshold would become candidates for further investigation.
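
The consensus step described above can be illustrated with a minimal Python sketch. It is not the pipeline's implementation. The names `call_cnvs`, `consensus_candidates`, and `toy_calls` are hypothetical, the caller is a stub that looks up precomputed toy results instead of running a real CNV tool, and regions are assumed to be directly comparable `(chromosome, start, end)` tuples. A real pipeline would need to translate coordinates between references and merge overlapping calls. The threshold is assumed here to be the fraction of runs that flag a region.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import permutations


def call_cnvs(reference, sample, toy_calls):
    """Hypothetical stand-in for one CNV caller run.

    A real job would align `sample` against `reference` and invoke an
    external tool; here we only look up precomputed toy results.
    """
    return toy_calls.get((reference, sample), set())


def consensus_candidates(genomes, toy_calls, threshold, workers=4):
    """Return regions flagged in at least `threshold` (a fraction) of runs."""
    # Every ordered (reference, sample) pair: each genome acts as the
    # reference for every other genome in the sample set.
    jobs = list(permutations(genomes, 2))
    if not jobs:
        return set()

    # The jobs are independent of each other, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(
            pool.map(lambda job: call_cnvs(job[0], job[1], toy_calls), jobs)
        )

    # Count how many runs flagged each region, then apply the threshold.
    votes = Counter(region for calls in results for region in calls)
    return {
        region for region, count in votes.items()
        if count / len(jobs) >= threshold
    }


if __name__ == "__main__":
    region_1 = ("chr1", 1000, 5000)
    region_2 = ("chr2", 200, 900)
    toy_calls = {
        ("A", "B"): {region_1, region_2},
        ("A", "C"): {region_1},
        ("B", "A"): {region_1},
        ("C", "A"): {region_1},
    }
    print(consensus_candidates(["A", "B", "C"], toy_calls, threshold=0.5))
```

Because no job depends on the result of another, the same structure would scale to a process pool or a cluster scheduler without changing the voting logic, which is what makes the workload embarrassingly parallel.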