Sequence Identity Study for Operational Taxonomic Unit Classification

In metagenomic studies of microorganisms, the Operational Taxonomic Unit (OTU) is often used as a replacement for the species distinction. This pseudo-species definition is helpful when scientists want to understand the composition and diversity of cultures in different environments. The traditional numerical taxonomy method typically defines an OTU as a cluster in a graph resulting from sequence alignment. According to this method, organisms whose 16S rDNA sequences share more than 97% sequence similarity are connected together to form a cluster.
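As a rough illustration of the clustering rule described above, the following Python sketch links sequences whose pairwise identity exceeds a threshold and reports the connected components as OTUs. It makes several simplifying assumptions that are not part of the study itself: the sequences are already aligned to equal length, identity is the plain fraction of matching positions, and a "cluster" is taken to be a connected component. The function names `identity` and `otu_clusters` are hypothetical.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of matching positions between two pre-aligned sequences.

    Assumption: both sequences come from the same alignment, so they have
    equal, non-zero length.
    """
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def otu_clusters(seqs, threshold=0.97):
    """Group sequence indices into OTUs.

    Two sequences are linked when their identity is greater than
    `threshold`; an OTU is a connected component of the resulting graph.
    """
    parent = list(range(len(seqs)))  # union-find forest over sequence indices

    def find(i):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if identity(seqs[i], seqs[j]) > threshold:
            parent[find(i)] = find(j)  # merge the two components

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Note that under this rule two sequences can fall in the same OTU without being more than 97% identical to each other, as long as a chain of sufficiently similar sequences connects them.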

In this study, we investigate whether the traditional numerical method results in OTUs that behave as a sensible replacement for species. If it does, we then investigate which factors, or metrics, can help determine which sequence similarity threshold to use.

In the study, we first plan to construct a network graph representing the OTUs by connecting the nodes that represent the sequences. We then plan to investigate which metrics (such as clustering coefficient or average degree) indicate a well-behaved graph. We assume that some metrics are more informative than others in evaluating the resulting network of OTUs. A good metric should have a saddle point at a specific identity threshold (see Fig. 1). We plan to construct the graph iteratively at each identity level, compute every candidate metric, and record its value at that level. Finally, we plan to study the sensitivity of the metrics by recording and analyzing their variance over random subsets of the network.
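The planned procedure could be sketched along the following lines in Python. This is only a minimal outline under stated assumptions: the input is a precomputed symmetric matrix of pairwise identities, only the two metrics named above are implemented, and sensitivity is estimated as the population variance of each metric over random node subsets. The helper names, the subset fraction `frac`, and the number of subsets `n_subsets` are hypothetical choices, not values fixed by the study.

```python
import random
from itertools import combinations
from statistics import pvariance


def build_graph(ident, threshold):
    """Adjacency sets: nodes i and j are linked when ident[i][j] > threshold."""
    n = len(ident)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if ident[i][j] > threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def average_degree(adj):
    """Mean number of neighbours per node."""
    return sum(len(nb) for nb in adj.values()) / len(adj) if adj else 0.0


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    if not adj:
        return 0.0
    total = 0.0
    for nb in adj.values():
        k = len(nb)
        if k >= 2:
            # Count edges that exist among the neighbours of this node.
            links = sum(1 for u, v in combinations(nb, 2) if v in adj[u])
            total += 2 * links / (k * (k - 1))
    return total / len(adj)


def induced(adj, nodes):
    """Subgraph induced by the given nodes."""
    keep = set(nodes)
    return {u: adj[u] & keep for u in keep}


def sweep(ident, thresholds, metrics, n_subsets=20, frac=0.8, seed=0):
    """Return rows of (threshold, metric name, full-graph value, subset variance)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(ident)
    size = max(1, int(frac * n))
    rows = []
    for t in thresholds:
        adj = build_graph(ident, t)
        subsets = [rng.sample(range(n), size) for _ in range(n_subsets)]
        for name, fn in metrics.items():
            values = [fn(induced(adj, s)) for s in subsets]
            rows.append((t, name, fn(adj), pvariance(values)))
    return rows
```

Calling `sweep` with a list of identity levels and a dictionary such as `{"avg_degree": average_degree, "clustering": average_clustering}` yields one row per threshold and metric, which can then be plotted against the identity level to look for the behavior described above.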