Background. Although synonymous single nucleotide variants (sSNVs) do not alter protein sequences, they have been shown to
play an important role in human disease. Distinguishing pathogenic sSNVs from neutral ones is challenging because pathogenic sSNVs tend
to have low prevalence. Although many methods have been developed for predicting the functional impact of single nucleotide variants,
only a few have been specifically designed for identifying pathogenic sSNVs.
Results. In this work, we describe a computational model, IDSV (Identification of Deleterious Synonymous Variants), which uses random
forest (RF) to detect deleterious sSNVs in human genomes. We systematically investigate a total of 74 multifaceted features across seven
categories: splicing, conservation, codon usage, sequence, pre-mRNA folding energy, translation efficiency, and functional region annotation
features. Then, to remove redundant and irrelevant features and improve the prediction performance, feature selection is employed using the
sequential backward selection method. Using the 10 selected features, an RF classifier is developed to identify deleterious sSNVs. The
results on benchmark datasets show that IDSV outperforms other state-of-the-art methods in identifying pathogenic sSNVs.
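The feature selection step can be illustrated with a minimal sketch of generic sequential backward selection. This is not the IDSV implementation: the `score` callable, the function name, and the toy data are hypothetical, and the abstract does not state which performance criterion or stopping rule was used. In a setting like the one described, `score` would presumably be a cross-validated performance estimate of an RF classifier trained on the given feature subset, and `n_keep` would be 10.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple


def sequential_backward_selection(
    features: Iterable[str],
    score: Callable[[FrozenSet[str]], float],
    n_keep: int,
) -> Tuple[List[str], List[Tuple[int, float]]]:
    """Greedy backward elimination.

    Start from the full feature set and repeatedly drop the single feature
    whose removal leaves the highest-scoring subset, until n_keep remain.
    `score` is a hypothetical callable (higher is better); it is an
    assumption of this sketch, not something specified in the abstract.
    """
    current = list(dict.fromkeys(features))  # de-duplicate, keep order
    if not 1 <= n_keep <= len(current):
        raise ValueError("n_keep must be between 1 and the number of features")

    # Record (subset size, score) at each step so the trajectory can be inspected.
    history = [(len(current), score(frozenset(current)))]

    while len(current) > n_keep:
        best_score = None
        best_drop = None
        for feature in current:
            candidate = frozenset(f for f in current if f != feature)
            s = score(candidate)
            # Strict comparison: ties keep the earliest candidate.
            if best_score is None or s > best_score:
                best_score, best_drop = s, feature
        current.remove(best_drop)
        history.append((len(current), best_score))

    return current, history


# Toy illustration with a made-up scoring rule: features "a" and "b" are
# treated as informative, and every retained feature carries a small penalty.
INFORMATIVE = {"a", "b"}


def toy_score(subset: FrozenSet[str]) -> float:
    return len(subset & INFORMATIVE) - 0.01 * len(subset)


selected, history = sequential_backward_selection(
    ["a", "b", "c", "d"], toy_score, n_keep=2
)
```

In the toy run, the uninformative features are eliminated first because removing them lowers the penalty without losing signal. With a real classifier each call to `score` is a full model evaluation, so the procedure costs on the order of the square of the number of features in evaluations.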
Conclusions. We have developed an efficient feature-based approach (IDSV) for predicting deleterious sSNVs that draws on a wide variety of
features. From these features, we identify a compact subset that is informative for identifying deleterious sSNVs.
Our results indicate that, besides splicing and conservation features, a new translation efficiency feature is also informative for
identifying deleterious sSNVs. Although the functional region annotation and sequence features are only weakly informative on their own, they may
help discriminate deleterious sSNVs from benign ones when combined with other features.