Model-, structure-, and sequence-based methods for prediction of protein binding sites

Identifying protein-protein binding sites is important for understanding protein function. Binding site prediction methods that rely on structure are generally more accurate than those that rely on sequence. However, the coverage of structure-based methods is significantly lower than that of sequence-based methods because experimental structures are lacking for many proteins.

Here, we propose a sequence-based protein binding site prediction approach that exploits the benefits of structure-based methods. We use L1-regularized logistic regression to integrate sequence- and structure-based predictions made for comparative models. The method relies on a series of features, including evaluation of the comparative models, geometric features, solvent accessibility, hydrophobicity, secondary structure derived from the comparative models, and residue type. A non-redundant dataset of feature vectors for training and testing is generated automatically from hetero-oligomer structures. The assessment of our binding site prediction strategy demonstrates that it can use protein sequences as the only input and achieve accuracies comparable to those of state-of-the-art structure-based predictors across different quality levels of homology models. Our method could be useful for the large-scale functional annotation of proteins whose structures are represented only by comparative models.
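
To illustrate how L1-regularized logistic regression can combine per-residue predictions into a single binding-site score, the following is a minimal sketch rather than the implementation used in this work. It fits the model by proximal gradient descent with soft-thresholding, using only the Python standard library. The feature names (`seq_score`, `struct_score`, `solvent_acc`, `noise`), the synthetic data, the labeling rule, and all hyperparameters are assumptions made purely for illustration.

```python
import math
import random

# Hypothetical per-residue features; names are illustrative, not from the paper.
FEATURES = ["seq_score", "struct_score", "solvent_acc", "noise"]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def predict_proba(w, b, x):
    """Probability that a residue with feature vector x is in a binding site."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def fit_l1_logistic(X, y, lam=0.02, lr=0.5, n_iter=500):
    """Minimize mean log-loss + lam * ||w||_1 by proximal gradient descent.

    The intercept b is left unpenalized. lam, lr, and n_iter are assumed values.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        # Gradient step on the smooth loss, then soft-threshold for the L1 term.
        w = [soft_threshold(w[j] - lr * grad_w[j] / n, lr * lam) for j in range(d)]
        b -= lr * grad_b / n
    return w, b

def make_toy_data(n=300, seed=0):
    """Synthetic residues: the label depends only on the first two features."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in FEATURES] for _ in range(n)]
    y = [1 if x[0] + x[1] > 1.0 else 0 for x in X]
    return X, y

def accuracy(w, b, X, y):
    """Fraction of residues classified correctly at a 0.5 threshold."""
    hits = sum((predict_proba(w, b, x) >= 0.5) == (yi == 1) for x, yi in zip(X, y))
    return hits / len(y)

if __name__ == "__main__":
    X, y = make_toy_data()
    w, b = fit_l1_logistic(X, y)
    for name, wj in zip(FEATURES, w):
        print(f"{name:>12s}: {wj:+.3f}")
    print("training accuracy:", accuracy(w, b, X, y))
```

In this sketch the L1 penalty shrinks the weights of uninformative features toward zero, which is one reason such a penalty is attractive when many heterogeneous features are combined. In practice, an established library implementation and properly cross-validated regularization strength would likely be preferable to this hand-written optimizer.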