Hierarchical agglomerative clustering of eusocial bee proteins

The assembly and annotation of the European honey bee (Apis mellifera) genome predicted more than 15,000 protein-coding genes and became the foundation for studies of the nature and evolution of eusociality. Since then, other eusocial bee genomes have been sequenced, providing an excellent opportunity to seek additional insight into the unique traits of eusocial bees at the genomic and proteomic levels. We aim to build a non-redundant organization of protein sequences by applying an unsupervised hierarchical protein-clustering method to the protein sequences of four advanced eusocial honey bees and two primitively eusocial bumblebees. The method groups proteins into clusters according to their sequence similarity, yielding a homogeneous protein domain organization within each cluster. The goal is a final organization whose protein clusters represent biological families. In this seminar presentation, we will discuss the pros and cons of different agglomerative clustering methods and compare the performance of different linkage criteria. We also propose the cluster domain consistency (CDC) score, a new metric for validating protein cluster sets.
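
To make the role of the linkage criterion concrete, the following is a minimal sketch of naive agglomerative clustering over a precomputed distance matrix. It is an illustration only, not the method used in this work: the function name `agglomerative`, the toy `positions`, and the resulting `dist` matrix are all hypothetical, and in practice the distances would be derived from pairwise protein sequence similarity rather than from points on a line.

```python
from itertools import combinations


def agglomerative(dist, n_clusters, linkage="average"):
    """Naive agglomerative clustering over a symmetric distance matrix.

    dist       : square list of lists of pairwise distances (toy input).
    n_clusters : stop merging once this many clusters remain.
    linkage    : "single", "complete", or "average".
    Returns a sorted list of clusters, each a sorted list of item indices.
    """
    reducers = {
        "single": min,                            # closest pair of members
        "complete": max,                          # farthest pair of members
        "average": lambda ds: sum(ds) / len(ds),  # mean over all member pairs
    }
    reduce_fn = reducers[linkage]

    # Start with every item in its own cluster.
    clusters = [[i] for i in range(len(dist))]

    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for a, b in combinations(range(len(clusters)), 2):
            ds = [dist[i][j] for i in clusters[a] for j in clusters[b]]
            d = reduce_fn(ds)
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a (a < b, so index a stays valid).
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]

    return sorted(clusters)


# Hypothetical toy data: five items placed on a line, standing in for
# proteins whose pairwise dissimilarities have already been computed.
positions = [0.0, 1.0, 2.5, 4.4, 6.8]
dist = [[abs(p - q) for q in positions] for p in positions]

single_result = agglomerative(dist, 2, linkage="single")
complete_result = agglomerative(dist, 2, linkage="complete")
average_result = agglomerative(dist, 2, linkage="average")
```

On this toy input, single linkage chains the first four items into one cluster and leaves the last item alone, whereas complete linkage splits the items into a group of two and a group of three. This is the kind of difference between linkage criteria that the comparison in the talk is concerned with.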