Abstract
RNA-binding hot spots are a small and complementary set of interfacial residues that contribute most of the binding energy at protein-RNA interfaces. Because experimental methods for identifying hot spots are time-consuming, labor-intensive and costly, there is great interest in computational approaches that can predict hot spots on a large scale. In this work, we introduced a sequence-based method that uses an ensemble classifier to predict hot spots in protein-RNA complexes. We first employed three sequence encoding schemes, based on physicochemical properties from the AAindex database, the amino acid substitution matrix (BLOSUM62), and the predicted relative accessible surface area. From these sequence features, 249 individual predictors were developed to identify hot spots, using a radial basis function (RBF)-based support vector machine (SVM), a sigmoid-based SVM, and the k-nearest neighbor (k-NN) algorithm. Combinations of these individual predictors by majority voting were explored comprehensively, and a voting ensemble of 43 individual predictors was selected as the final ensemble classifier. The ensemble classifier outperformed state-of-the-art computational methods, yielding an F1 score of 0.843 and an AUC of 0.893 on the training set, and an F1 score of 0.814 and an AUC of 0.842 on the test set. SPHot is freely available to download and install for academic use.
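The majority-voting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `majority_vote`, the toy label vectors, and the tie-breaking rule are assumptions made for illustration only. With an odd number of predictors, such as the 43 selected here, ties cannot occur.

```python
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine binary labels (1 = hot spot, 0 = non-hot spot) by majority vote.

    predictions[i][j] is the label that predictor i assigns to residue j.
    Ties are resolved as non-hot spot (an assumption of this sketch).
    """
    if not predictions:
        raise ValueError("at least one predictor is required")
    n_residues = len(predictions[0])
    if any(len(p) != n_residues for p in predictions):
        raise ValueError("all predictors must label the same residues")

    n_predictors = len(predictions)
    combined = []
    for j in range(n_residues):
        votes = sum(p[j] for p in predictions)  # number of hot-spot votes
        combined.append(1 if 2 * votes > n_predictors else 0)
    return combined


# Hypothetical outputs of three individual predictors for five residues.
rbf_svm = [1, 0, 1, 1, 0]
sigmoid_svm = [1, 0, 0, 1, 0]
knn = [0, 0, 1, 1, 1]

ensemble = majority_vote([rbf_svm, sigmoid_svm, knn])
print(ensemble)
```

Each residue is labeled a hot spot only when more than half of the individual predictors vote for it, so a single dissenting predictor cannot flip the ensemble decision.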