Methodology
In this article, we develop a novel method to predict disulfide connectivity patterns from protein primary sequence. The method uses support vector regression (SVR) with multiple sequence feature vectors and secondary structure predicted by the PSIPRED program. It achieves an overall prediction accuracy of 74.4% on three well-defined datasets using 4-fold cross-validation.
1. Support Vector Regression (SVR): We applied a machine learning method, support vector regression, to the problem of predicting disulfide connectivity patterns in proteins. We investigated eight different sequence encoding schemes to understand how each one influences the final prediction performance. The main sequence features were: (1) Multiple sequence alignment profiles in the form of position-specific scoring matrices (PSSMs) produced by the PSI-BLAST program, which carry the evolutionary information that improves prediction accuracy; (2) Predicted secondary structure information generated by the PSIPRED program; (3) Global information, namely the twenty amino acid compositions, the normalized protein molecular weight, and the normalized sequence length; (4) The sequential distance of each cysteine-cysteine residue pair (also denoted DOC, Distance of Oxidized Cysteines, in the literature). This value was normalized with the logarithm function before being input into the SVR models.
2. The Architecture of SVR Prediction System:
[Figure: architecture of the SVR prediction system; image not available in this text version.]
3. Detailed computational equations for calculating the different types of information used as input to the SVR prediction system:
1. PSSMs: It is well known that evolutionary information in the form of multiple sequence alignment profiles generated by the PSI-BLAST program can significantly improve the overall prediction performance. The PSSM of a protein sequence is an L × 20 matrix, where L is the target sequence length and 20 is the number of amino acid types. The neighboring sequence environment of each cysteine residue can be extracted with a sliding window of local length M. In this study, for all proteins regardless of their number of disulfide bridges, we set the local window size to M = 13. First, we obtained the NCBI nr database, which combines all non-redundant GenBank translations, SwissProt, PIR, PDB, PRF, and the NCBI RefSeq database. Then the blastpgp program was run to query each protein sequence in our datasets against this NCBI nr database to generate the PSSM profiles, using three iterations of PSI-BLAST with a cutoff E-value of 0.001. These profiles were then scaled to the required 0-1 range by the standard logistic function:
f(x) = 1 / (1 + e^(-x))
where x is the raw profile matrix value. The scaled PSSM profiles of the two cysteines in a pair were then concatenated to represent the cysteine-cysteine residue pair before being input into SVR.
2. PSS: The predicted probability matrices of the secondary structure states from PSIPRED were also taken into consideration to further enhance the prediction performance. PSIPRED is a software package that generates reliability indices (in the 0-1 range) for the three states (helix, strand, and coil) of each residue in a protein sequence. We ran PSIPRED on every protein sequence in the datasets and then extracted the M × 3 matrix around each cysteine using a sliding window scheme, with M = 13 in this study.
3. AA Content: We also used the global content of the twenty amino acids in the protein sequence as input to SVR, as amino acid composition tends to improve prediction accuracy to some extent. Each component of this vector is calculated by:
f_i = n_i / L, i = 1, ..., 20
where n_i is the number of occurrences of amino acid type i in a sequence of length L.
4. Proweight: Incorporating global sequence features such as the normalized protein molecular weight (Proweight) and protein length (Prolength) can yield better prediction performance than local sequence information alone. [The normalization equation for the molecular weight and its symbol definitions are missing from the source.]
5. Prolength: Similar to Proweight, Prolength is a global feature describing the protein sequence length. The normalized Prolength value is calculated with the same equation as Proweight. [The symbol definitions are missing from the source.]
6. DOC: This feature describes the sequential Distance between Oxidized Cysteines (referred to as DOC or Cysteine Separation Distance). It is defined as DOC(i, j) = |i - j|, where i and j are the sequence positions of the two oxidized cysteines that form a disulfide bridge. Previous work by Tsai et al. (2005) indicated that normalizing the DOC value with the logarithm function can significantly improve prediction accuracy. Hence, in this study, we also incorporated this normalized value into the SVR models.
7. Cysorder: This vector describes the sequential order of the cysteines in each cysteine pair and was originally suggested by Chen et al. (2006). For instance, a protein with three disulfide bridges formed between cys 21 and cys 42 (cysteines 1 and 4 in sequence order), cys 28 and cys 60 (cysteines 2 and 5), and cys 36 and cys 66 (cysteines 3 and 6) has the following Cysorder: (1/6, 4/6, 2/6, 5/6, 3/6, 6/6) = (0.1667, 0.6667, 0.3333, 0.8333, 0.5000, 1.0000). For further details on interpreting and constructing this vector, please refer to the paper by Chen et al. (2006) in Proteins.
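The DOC and Cysorder definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the use of the natural logarithm is an assumption, since the text only says "the logarithm function".

```python
import math

def doc(i, j):
    """Sequential distance between two cysteines at sequence positions i and j."""
    return abs(i - j)

def normalized_doc(i, j):
    """Log-normalized DOC. Assumption: natural logarithm (the base is not stated)."""
    return math.log(doc(i, j))

def cysorder(pair, cys_positions):
    """Rank of each cysteine of `pair` among all cysteines, divided by their count."""
    ordered = sorted(cys_positions)
    n = len(ordered)
    return tuple((ordered.index(p) + 1) / n for p in pair)

# The worked example from the text: three bridges over six cysteines.
positions = [21, 28, 36, 42, 60, 66]
bridges = [(21, 42), (28, 60), (36, 66)]
for a, b in bridges:
    ranks = [round(v, 4) for v in cysorder((a, b), positions)]
    print((a, b), "DOC =", doc(a, b), "Cysorder =", ranks)
```

Each pair contributes two Cysorder values, which is why this feature accounts for two dimensions of the final input vector.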
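To summarize, each cysteine-cysteine pair is encoded by PSSMs + PSIPRED + AA Content + Proweight + Prolength + DOC + Cysorder. With a window of 13 residues around each of the two cysteines, the PSSM part contributes 2 × 13 × 20 = 520 values and the PSIPRED part contributes 2 × 13 × 3 = 78 values, so each cysteine-cysteine pair is represented by a 520 + 78 + 20 + 1 + 1 + 1 + 2 = 623-dimensional vector.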
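As a rough check of the dimension count, the sketch below assembles one pair vector from toy inputs. It is illustrative only: the function names are hypothetical, and padding window positions that fall off the ends of the sequence with zeros is an assumption, since the text does not say how sequence termini are handled.

```python
import math

WINDOW = 13          # local window size M from the text
HALF = WINDOW // 2   # 6 residues on each side of the cysteine

def logistic(x):
    """Standard logistic function used to scale raw PSSM values into the 0-1 range."""
    return 1.0 / (1.0 + math.exp(-x))

def window_rows(matrix, center, pad_value=0.0):
    """Rows in the 13-residue window centred on `center` (0-based index).
    Assumption: positions outside the sequence are filled with pad_value."""
    width = len(matrix[0])
    rows = []
    for pos in range(center - HALF, center + HALF + 1):
        if 0 <= pos < len(matrix):
            rows.append(list(matrix[pos]))
        else:
            rows.append([pad_value] * width)
    return rows

def flatten(rows):
    return [v for row in rows for v in row]

def encode_pair(pssm, pss, i, j, aa_content, proweight, prolength, doc_value, cysorder):
    """Concatenate all feature groups for the cysteine pair at indices i and j."""
    scaled = [[logistic(x) for x in row] for row in pssm]
    features = []
    features += flatten(window_rows(scaled, i)) + flatten(window_rows(scaled, j))  # 2 x 13 x 20
    features += flatten(window_rows(pss, i)) + flatten(window_rows(pss, j))        # 2 x 13 x 3
    features += list(aa_content)                                                   # 20
    features += [proweight, prolength, doc_value]                                  # 1 + 1 + 1
    features += list(cysorder)                                                     # 2
    return features

# Toy protein of length 30 with placeholder values (not real profiles).
length = 30
toy_pssm = [[0.0] * 20 for _ in range(length)]
toy_pss = [[1.0 / 3] * 3 for _ in range(length)]
vector = encode_pair(toy_pssm, toy_pss, 2, 20, [0.05] * 20, 0.5, 0.5, math.log(18), (0.25, 0.75))
print(len(vector))
```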
4. Parameter selection for SVR training and testing: The selection of the kernel function parameters is an important step in SVR training and testing, as it implicitly determines the structure of the high-dimensional feature space in which the optimal separating hyperplane (OSH) is constructed. We trained our SVR models with the Radial Basis Function (RBF) kernel. Two parameters must be determined in advance to optimize SVR training: the regularization parameter C and the kernel parameter γ. The former is the cost parameter and the latter determines the width of the RBF kernel. We carried out preliminary studies to select the optimal combination of C and γ.
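The role of the two parameters can be illustrated with a small sketch. The RBF kernel formula is standard; everything else here is an assumption made for illustration: the text does not report the candidate values that were searched, so the power-of-two grid is hypothetical, and `score_fn` stands in for whatever cross-validated performance measure is used.

```python
import math
from itertools import product

def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2); a larger gamma gives a narrower kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def parameter_grid(c_exponents=range(-2, 7, 2), gamma_exponents=range(-8, 1, 2)):
    """Hypothetical grid of (C, gamma) candidates on a log2 scale."""
    return [(2.0 ** c, 2.0 ** g) for c, g in product(c_exponents, gamma_exponents)]

def select_parameters(grid, score_fn):
    """Return the (C, gamma) pair with the highest score.
    score_fn(C, gamma) is a placeholder for a cross-validated accuracy estimate."""
    return max(grid, key=lambda cg: score_fn(*cg))

grid = parameter_grid()
# Dummy score function for demonstration only; it peaks at C = 4, gamma = 2^-4.
best = select_parameters(grid, lambda c, g: -abs(math.log2(c) - 2) - abs(math.log2(g) + 4))
print(best)
```

In practice the score function would train an SVR model on the training folds and return Qc or Qp on the held-out fold.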
5. Scoring the Prediction Performance: In this study, we use two assessment measures, Qc and Qp, to evaluate the predictive power of our SVR models, consistent with previous prediction studies. Qc and Qp are computed at the cysteine pair level and the protein level, respectively. Qc (the cysteine pair-based or disulfide bridge-based measure, i.e. the fraction of disulfide bridges that are correctly predicted) is given by Qc = Nc / Tc, where Nc is the number of disulfide bridges that are correctly predicted and Tc is the total number of disulfide bridges in the testing dataset. Qp (the protein-based measure, i.e. the fraction of proteins whose disulfide connectivity patterns are predicted entirely correctly) is given by Qp = Np / Tp, where Np is the number of proteins whose disulfide connectivity patterns are correctly predicted and Tp is the total number of proteins in the testing dataset. All results were evaluated using a 4-fold cross-validation procedure, i.e. the whole dataset was randomly divided into four subsets, each containing roughly the same number of protein sequences. In the cross-validation step, each subset was singled out in turn as the testing dataset, while the proteins in the remaining subsets were used as the training dataset to build the SVR models.
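The two measures and the fold construction can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, bridges are represented as pairs of cysteine identifiers, and the shuffling seed and round-robin assignment to folds are choices made here, not details from the text.

```python
import random

def qc_qp(true_patterns, predicted_patterns):
    """Return (Qc, Qp) for parallel lists of connectivity patterns.
    Each pattern is a list of bridges; each bridge is a pair of cysteine identifiers."""
    nc = tc = n_p = 0
    for true, pred in zip(true_patterns, predicted_patterns):
        true_set = {frozenset(b) for b in true}
        pred_set = {frozenset(b) for b in pred}
        nc += len(true_set & pred_set)      # correctly predicted bridges
        tc += len(true_set)                 # all true bridges
        n_p += int(true_set == pred_set)    # protein counts only if every bridge is right
    return nc / tc, n_p / len(true_patterns)

def four_fold_splits(protein_ids, k=4, seed=0):
    """Yield (train, test) lists; each of the k folds serves once as the test set."""
    ids = list(protein_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[f::k] for f in range(k)]   # fold sizes differ by at most one
    for f in range(k):
        train = [p for g in range(k) if g != f for p in folds[g]]
        yield train, folds[f]

# Two toy proteins: the first has one of three bridges right, the second is fully correct.
true = [[(1, 4), (2, 5), (3, 6)], [(1, 2), (3, 4)]]
pred = [[(1, 4), (2, 6), (3, 5)], [(1, 2), (3, 4)]]
qc, qp = qc_qp(true, pred)
print(qc, qp)
```

The toy case shows why Qp is the stricter measure: a single wrong bridge makes the whole protein count as incorrect.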
6. Predicting disulfide connectivity patterns: We reduced the problem of predicting disulfide connectivity patterns to a pairwise one, i.e. we trained our SVR models on the disulfide bonding probabilities of cysteine residue pairs, then used the trained models to predict the bonding potential of each cysteine pair of every protein sequence in the testing dataset. We provide a complete txt file of the disulfide connectivity patterns of proteins with 2, 3, 4 and 5 disulfide bridges, from which the total probability score of each candidate pattern of a protein sequence can be calculated. The disulfide connectivity pattern with the largest probability score is taken as the prediction. As an alternative strategy, the disulfide connectivity problem can be solved by finding a maximum-weight matching in a graph whose nodes are the disulfide-bonded cysteines and whose edge weights are the predicted disulfide bonding probabilities of the corresponding cysteine pairs. This strategy has been employed by the majority of previous studies in the literature (Fariselli and Casadio, 2001; Ferre and Clote, 2005a, 2005b; Tsai et al., 2005; Cheng et al., 2006). Our method instead solves the problem directly, by predicting the disulfide bonding probability of each cysteine pair and then enumerating the probability scores of all possible disulfide connectivity patterns, without transforming it into a maximum-weight matching problem.
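The enumeration step can be sketched as below. This is an illustration rather than the authors' implementation: the function names are hypothetical, the pair scores are made-up numbers, and summing the pair scores to obtain a pattern's total score is an assumption, since the text does not state how pair probabilities are combined.

```python
def connectivity_patterns(cysteines):
    """Yield every way of pairing up an even-length list of cysteines."""
    if not cysteines:
        yield []
        return
    first, rest = cysteines[0], cysteines[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for pattern in connectivity_patterns(remaining):
            yield [(first, partner)] + pattern

def best_pattern(cysteines, pair_score):
    """Pick the pattern with the largest total score.
    Assumption: the total score is the sum of the pairwise scores."""
    return max(connectivity_patterns(cysteines),
               key=lambda pattern: sum(pair_score[p] for p in pattern))

# Number of candidate patterns for 2, 3, 4 and 5 disulfide bridges.
counts = [sum(1 for _ in connectivity_patterns(list(range(2 * n)))) for n in (2, 3, 4, 5)]
print(counts)

# Toy pair scores for a protein with two bridges (four cysteines).
toy_scores = {(1, 2): 0.1, (1, 3): 0.8, (1, 4): 0.2,
              (2, 3): 0.3, (2, 4): 0.9, (3, 4): 0.1}
print(best_pattern([1, 2, 3, 4], toy_scores))
```

With at most 945 candidate patterns for five bridges, exhaustive enumeration stays cheap, which is what makes this direct approach practical for the bridge counts considered here.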
References