Methodology

 

In this article, we develop a novel method for predicting disulfide connectivity patterns from protein primary sequence. It uses a support vector regression (SVR) approach based on multiple sequence feature vectors and on secondary structure predicted by the PSIPRED program. The method achieves an overall prediction accuracy of 74.4% when tested on three well-defined datasets using 4-fold cross-validation.


1. Support Vector Regression (SVR):

We applied a machine learning method, support vector regression, to the problem of predicting disulfide connectivity patterns in proteins. We investigated eight different sequence encoding schemes in order to understand their respective influences on the final prediction performance. In particular, the main sequence features were:

(1) Multiple sequence alignment profiles in the form of position-specific scoring matrices (PSSMs) produced by the PSI-BLAST program, which contain evolutionary information that is important for improving the prediction accuracy;

(2) Predicted secondary structure information generated by the PSIPRED program;

(3) Global information in terms of the twenty amino acid compositions, the normalized protein molecular weight, and the normalized sequence length;

(4) Sequential distance between the two cysteines of a cysteine-cysteine residue pair (also denoted as DOC, Distance of Oxidized Cysteines, in the literature). This value was normalized using the logarithm function before being input into the SVR models.

 

2. The Architecture of SVR Prediction System:

 

 

 

 

3. Detailed computational equations for calculating different types of information as the input into SVR prediction system:

1. PSSMs:

It is well known that evolutionary information in the form of multiple sequence alignment profiles generated by the PSI-BLAST program can significantly improve overall prediction performance. The PSSM of a protein sequence is an L × 20 matrix, where L is the target sequence length and 20 is the number of amino acid types. The neighboring sequence environment of each cysteine residue can be extracted using a sliding window of local length M. In this study we set the local window size to M = 13 for all proteins, regardless of their number of disulfide bridges.

First, we obtained the NCBI nr database, which combines the following databases: all non-redundant GenBank translations, SwissProt, PIR, PDB, PRF, and NCBI RefSeq. The blastpgp program was then run to query each protein sequence in our datasets against this database and generate the PSSM profiles, using three iterations of PSI-BLAST with an E-value cutoff of 0.001. These profiles were then scaled to the required 0-1 range by the standard logistic function:

f(x) = 1 / (1 + exp(-x))

where x is the raw profile matrix value. The scaled PSSM profiles of the two cysteines in each candidate pair were then concatenated to represent the cysteine-cysteine residue pair before being input into SVR.
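The scaling and windowing steps described above can be sketched as follows. This is a minimal illustration rather than the original pipeline: the function names are hypothetical, and zero-padding of window positions that fall outside the sequence is an assumption, since the text does not say how sequence ends were handled.

```python
import math

WINDOW = 13  # local window size M used in this study


def logistic(x):
    """Standard logistic function, mapping a raw PSSM score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def cysteine_window(pssm, center, window=WINDOW):
    """Scaled, flattened window of PSSM rows centred on one cysteine.

    `pssm` is a list of rows (one per residue), each holding 20 raw scores.
    Positions beyond the sequence ends are zero-padded (an assumption).
    """
    half = window // 2
    n_cols = len(pssm[0])
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(logistic(v) for v in pssm[pos])
        else:
            features.extend([0.0] * n_cols)
    return features


def pair_features(pssm, cys_i, cys_j, window=WINDOW):
    """Concatenate the two cysteine windows into one pair encoding."""
    return cysteine_window(pssm, cys_i, window) + cysteine_window(pssm, cys_j, window)
```

With M = 13 and 20 columns, each cysteine contributes 13 × 20 = 260 values, so a pair contributes 520 values.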

2. PSS:

The predicted probability matrices of the secondary structure states produced by PSIPRED were also taken into consideration in order to further enhance the prediction performance. PSIPRED is a software package that generates reliability indices (in the 0-1 range) for the three states (helix, strand, and coil) of each residue in a protein sequence.

We ran the PSIPRED program on every protein sequence in the datasets and subsequently extracted an M × 3 matrix around each cysteine using a sliding window scheme, where M = 13 was adopted in this study.

3. AA Content:

We also used the global contents of the twenty amino acids in the protein sequence as input to SVR, as amino acid compositions tend to improve prediction accuracy to some extent. Each element of this vector is calculated by:

Content(i) = ni / L,  i = 1, ..., 20

where ni is the number of occurrences of amino acid type i in the sequence with length L.
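As a small sketch of this feature, assuming the content of amino acid type i is simply the count ni divided by the sequence length L (the function name and the alphabetical ordering of residue types are illustrative choices, not taken from the original work):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty standard residue types


def aa_content(sequence):
    """Return the 20-dimensional amino acid content vector n_i / L.

    Non-standard residue symbols count toward the length L but do not
    contribute to any of the twenty bins (an assumption).
    """
    length = len(sequence)
    return [sequence.count(aa) / length for aa in AMINO_ACIDS]
```

For a sequence made up only of standard residues, the twenty values sum to 1.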

4. Proweight:

Incorporating global sequence features such as the normalized protein molecular weight (Proweight) and protein length (Prolength) can yield better prediction performance than local sequence information alone. The molecular weight of a protein is calculated by summing the individual molecular weights of all its residues, and is then normalized as:

Proweight = (W - Wavg) / SD

where W and Wavg are the raw and average protein molecular weights, respectively. SD is the standard deviation computed over the whole dataset.

5. Prolength:

Similar to Proweight, Prolength is a global feature describing the protein sequence length. The normalized Prolength value is calculated in the same way as the normalized Proweight, using the raw and average protein sequence lengths in place of the molecular weights, together with the standard deviation (SD) of the sequence lengths.
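The normalization shared by Proweight and Prolength can be sketched as a z-score over the dataset. The function name is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not say which was used.

```python
import statistics


def zscore(values):
    """Normalize raw values as (value - mean) / SD over the whole dataset."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD (an assumption)
    if sd == 0:
        raise ValueError("standard deviation is zero; values are all equal")
    return [(v - mean) / sd for v in values]
```

The same function would be applied once to the list of raw molecular weights and once to the list of raw sequence lengths, so that each protein receives one Proweight and one Prolength value.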

6. DOC:

This feature describes the sequential Distance between Oxidized Cysteines (referred to as DOC or Cysteine Separation Distance). It is defined as:

DOC(i, j) = |i - j|

where i and j are the sequence positions of the two oxidized cysteines that form a disulfide bridge. Previous work by Tsai et al. (2005) indicated that normalizing the DOC value using the logarithm function can significantly improve the prediction accuracy. Hence, in this study, we also incorporated this normalized value into the SVR models.
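A minimal sketch of the DOC feature follows. The text states only that the logarithm function is used for normalization, so the choice of the natural logarithm here is an assumption, as are the function names.

```python
import math


def doc(i, j):
    """Sequential distance between two cysteines at positions i and j."""
    return abs(i - j)


def normalized_doc(i, j):
    """Log-normalized DOC (natural logarithm assumed).

    Two distinct cysteines are at least one residue apart, so the
    argument of the logarithm is always >= 1.
    """
    return math.log(doc(i, j))
```

For example, cysteines at positions 21 and 42 have a DOC of 21.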

7. Cysorder:

This vector describes the sequential order of the two cysteines in each cysteine pair and was originally suggested by Chen et al. (2006). For instance, a protein with three disulfide bridges formed between Cys 21 and Cys 42 (cysteine order 1 and 4), Cys 28 and Cys 60 (order 2 and 5), and Cys 36 and Cys 66 (order 3 and 6) will have the following Cysorder: (1/6, 4/6, 2/6, 5/6, 3/6, 6/6) = (0.1667, 0.6667, 0.3333, 0.8333, 0.5000, 1.0000). For further details on how to interpret and construct this vector, please refer to Chen et al. (2006).
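The worked example above can be reproduced with a short sketch. The function name and the input format (a list of position pairs) are illustrative assumptions; Chen et al. (2006) remains the reference for the exact construction.

```python
def cysorder(bridges):
    """Return the order of each bonded cysteine divided by the cysteine count.

    `bridges` is a list of (position, position) pairs, one per disulfide bridge.
    """
    positions = sorted(p for bridge in bridges for p in bridge)
    rank = {pos: k + 1 for k, pos in enumerate(positions)}
    n = len(positions)
    return [rank[p] / n for bridge in bridges for p in bridge]


example = cysorder([(21, 42), (28, 60), (36, 66)])
print([round(v, 4) for v in example])
# [0.1667, 0.6667, 0.3333, 0.8333, 0.5, 1.0]
```

For a single cysteine pair, the two corresponding entries give the 2-dimensional Cysorder feature.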

 

To summarize, each cysteine-cysteine pair is encoded by PSSMs + PSIPRED + AA Content + Proweight + Prolength + DOC + Cysorder, giving a feature vector of 520 + 78 + 20 + 1 + 1 + 1 + 2 = 623 dimensions per pair.
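The dimension count can be checked with a few lines of arithmetic. This sketch reads 520 as 2 cysteines × 13 window positions × 20 PSSM columns and 78 as 2 × 13 × 3 secondary structure states, which is an interpretation consistent with the window size M = 13 stated earlier; the dictionary name is hypothetical.

```python
WINDOW = 13  # local window size M

FEATURE_DIMS = {
    "PSSM": 2 * WINDOW * 20,    # two cysteine windows, 20 amino acid columns
    "PSIPRED": 2 * WINDOW * 3,  # two cysteine windows, 3 structure states
    "AA Content": 20,
    "Proweight": 1,
    "Prolength": 1,
    "DOC": 1,
    "Cysorder": 2,
}

TOTAL_DIM = sum(FEATURE_DIMS.values())
print(TOTAL_DIM)  # 623
```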

 

4. Parameter selection of SVR training and testing:

The selection of the kernel function parameters is an important step in SVR training and testing, as it implicitly determines the structure of the high-dimensional feature space in which the optimal separating hyperplane (OSH) is constructed. We trained our SVR models with the Radial Basis Function (RBF) kernel.

Two parameters need to be determined in advance to optimize SVR training: the regularization parameter C and the kernel parameter γ (gamma). The former is the cost parameter and the latter determines the width of the RBF kernel. We carried out preliminary studies to select the optimal combination of C and γ.
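To illustrate the role of the two parameters, the sketch below defines the RBF kernel in its common parameterization, exp(-γ‖x - y‖²), and enumerates a candidate grid of (C, γ) combinations. The grid values are purely hypothetical placeholders and are not the values explored in this study.

```python
import itertools
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2); larger gamma gives a narrower kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Hypothetical candidate values, for illustration only.
C_GRID = [0.1, 1.0, 10.0]
GAMMA_GRID = [0.001, 0.01, 0.1]

# Every (C, gamma) combination that a preliminary search would evaluate.
candidates = list(itertools.product(C_GRID, GAMMA_GRID))
```

In a parameter search of this kind, each candidate combination would be used to train an SVR model, and the combination with the best cross-validated performance would be retained.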

 

5. Scoring the Prediction Performance:

In this study, we employ two assessment measures, Qc and Qp, to evaluate the predictive power of our SVR models, consistent with previous prediction studies. Qc and Qp are computed at the cysteine pair level and the protein level, respectively.

Qc (the cysteine pair-based or disulfide bridge-based measure, i.e. the fraction of correctly predicted disulfide bridges in a protein) is given by

Qc = Nc / Tc

where Nc is the number of disulfide bridges that are correctly predicted, and Tc is the total number of disulfide bridges in the testing dataset.

Qp (the protein-based measure, i.e. the fraction of proteins whose disulfide connectivity patterns are all predicted correctly) is given by

Qp = Np / Tp

where Np is the number of proteins whose disulfide connectivity patterns are correctly predicted, and Tp is the total number of proteins in the testing dataset.
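Both measures can be computed directly from the true and predicted connectivity patterns. The following is a minimal sketch; the function names and the representation of a pattern as a list of cysteine position pairs are assumptions made for illustration.

```python
def as_bridge_set(pattern):
    """Represent a pattern as a set of unordered cysteine pairs."""
    return {frozenset(pair) for pair in pattern}


def qc_qp(true_patterns, predicted_patterns):
    """Return (Qc, Qp) for parallel lists of per-protein patterns."""
    n_c = t_c = n_p = 0
    for true, pred in zip(true_patterns, predicted_patterns):
        true_set, pred_set = as_bridge_set(true), as_bridge_set(pred)
        n_c += len(true_set & pred_set)   # correctly predicted bridges
        t_c += len(true_set)              # all true bridges
        n_p += int(true_set == pred_set)  # fully correct proteins
    return n_c / t_c, n_p / len(true_patterns)
```

For two proteins with two bridges each, where one protein is predicted perfectly and the other has both bridges wrong, this gives Qc = 2/4 = 0.5 and Qp = 1/2 = 0.5.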

All results were evaluated using a 4-fold cross-validation procedure, i.e. the whole dataset was randomly divided into four subsets, each containing roughly equal numbers of protein sequences. In the cross-validation step, each subset was singled out in turn as the testing dataset, while all the proteins in the remaining subsets were used as the training dataset to build the SVR models.
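The splitting procedure can be sketched as below. The function name, the fixed random seed, and the round-robin assignment of shuffled proteins to folds are illustrative assumptions; the text specifies only that the division is random and the subsets are of roughly equal size.

```python
import random


def four_fold_splits(protein_ids, n_folds=4, seed=0):
    """Yield (train, test) lists of protein IDs, one pair per fold."""
    ids = list(protein_ids)
    random.Random(seed).shuffle(ids)
    # Round-robin assignment keeps fold sizes within one protein of each other.
    folds = [ids[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [p for m, fold in enumerate(folds) if m != k for p in fold]
        yield train, test
```

Splitting at the protein level, rather than at the cysteine pair level, keeps all pairs from one protein on the same side of the train/test boundary.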

 

6. Predicting disulfide connectivity patterns:

To make the prediction task tractable, we reduced the problem of predicting disulfide connectivity patterns to a cysteine pair-wise one: we trained our SVR models on the disulfide bonding probabilities of cysteine residue pairs, then used the trained models to predict the bonding potential of each cysteine pair of every protein sequence in the testing dataset.

We provide a complete text file of the possible disulfide connectivity patterns of proteins with 2, 3, 4 and 5 disulfide bridges, based on which the total probability score of each candidate pattern for a protein sequence can be calculated. The disulfide connectivity pattern with the largest probability score is taken as the prediction.
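The enumeration step can be sketched as follows. Here the patterns are generated recursively instead of being read from a file, and the total score of a pattern is taken to be the sum of its pair scores; both choices, like the function names, are assumptions made for illustration.

```python
def connectivity_patterns(cysteines):
    """Yield every way of pairing up all cysteines (list of sorted positions)."""
    if not cysteines:
        yield []
        return
    first, rest = cysteines[0], cysteines[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for pattern in connectivity_patterns(remaining):
            yield [(first, partner)] + pattern


def best_pattern(cysteines, pair_score):
    """Return the pattern with the largest summed pair score.

    `pair_score` maps (i, j) with i < j to the predicted bonding score.
    """
    return max(connectivity_patterns(cysteines),
               key=lambda pattern: sum(pair_score[p] for p in pattern))
```

Proteins with 2, 3, 4 and 5 disulfide bridges have 3, 15, 105 and 945 possible patterns respectively, so exhaustive enumeration stays cheap in this range.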

As an alternative strategy, the disulfide connectivity prediction problem can be solved by finding a maximum-weight matching in a graph whose nodes are the disulfide-bonded cysteines and whose edge weights are the potential disulfide bonding probabilities of the corresponding cysteine pairs. In fact, this strategy has been employed by the majority of previous studies in the literature (Fariselli and Casadio, 2001; Ferre and Clote, 2005a, 2005b; Tsai et al., 2005; Cheng et al., 2006). In contrast, our method solves the problem directly by predicting the disulfide bonding probability of each cysteine pair and subsequently enumerating the probability scores of all possible disulfide connectivity patterns, without transforming it into a maximum-weight matching problem.

 

 

 

References
Abkevich, V.I. and Shakhnovich, E.I. (2000) What can disulfide bonds tell us about protein energetics, function and folding: simulations and bioinformatics analysis. J Mol Biol, 300, 975-985.
Ahmad, S. and Sarai, A. (2005) PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics, 6, 33.
Baldi, P., Cheng, J. and Vullo, A. (2005) Large-scale prediction of disulphide bond connectivity. In Saul, L.K., Weiss, Y. and Bottou, L. (eds), Advances in neural information processing systems. MIT Press, Cambridge, MA, 97-104.
Ceroni, A., Passerini, A., Vullo, A. and Frasconi, P. (2006) DISULFIND: a disulfide bonding state and cysteine connectivity prediction server. Nucleic Acids Res, 34, W177-181.
Cheek, S., Krishna, S.S. and Grishin, N.V. (2006) Structural classification of small, disulfide-rich protein domains. J Mol Biol, 359, 215-237.
Chen, B.J., Tsai, C.H., Chan, C.H. and Kao, C.Y. (2006) Disulfide connectivity prediction with 70% accuracy using two-level models. Proteins, 64, 246-252.
Chen, Y.C. and Hwang, J.K. (2005) Prediction of disulfide connectivity from protein sequences. Proteins, 61, 507-512.
Chuang, C.C., Chen, C.Y., Yang, J.M., Lyu, P.C. and Hwang, J.K. (2003) Relationship between protein structures and disulfide-bonding patterns. Proteins, 53, 1-5.
Fariselli, P. and Casadio, R. (2001) Prediction of disulfide connectivity in proteins. Bioinformatics, 17, 957-964.
Fariselli, P., Riccobelli, P. and Casadio, R. (2002) A neural network based method for predicting the disulfide connectivity in proteins. In Damiani, E. et al. (eds), Knowledge based intelligent information engineering systems and allied technologies (KES 2002), IOS Press, Amsterdam, 1, 464-468.
Ferre, F. and Clote, P. (2005a) DiANNA: a web server for disulfide connectivity prediction. Nucleic Acids Res, 33, W230-232.
Ferre, F. and Clote, P. (2005b) Disulfide connectivity prediction using secondary structure information and diresidue frequencies. Bioinformatics, 21, 2336-2346.
Gupta, A., Van Vlijmen, H.W. and Singh, J. (2004) A classification of disulfide patterns and its relationship to protein structure and function. Protein Sci, 13, 2045-2058.
Joachims, T. (1999) Making large-scale SVM learning practical. In Schölkopf, B., Burges, C. and Smola, A. (eds), Advances in Kernel Methods - Support Vector Learning. MIT Press. http://svmlight.joachims.org/
Jones, D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol, 292, 195-202.
Lu, C.H., Chen, Y.C., Yu, C.S. and Hwang, J.K. (2007) Predicting disulfide connectivity patterns. Proteins, 67, 262-270.
Sarda, D., Chua, G.H., Li, K.B. and Krishnan, A. (2005) pSLIP: SVM based protein subcellular localization prediction using multiple physicochemical properties. BMC Bioinformatics, 6, 152.
Song, J., Burrage, K., Yuan, Z. and Huber, T. (2006) Prediction of cis/trans isomerization in proteins using PSI-BLAST profiles and secondary structure information. BMC Bioinformatics, 7, 124.
Song, J. and Burrage, K. (2006) Predicting residue-wise contact orders in proteins by support vector regression. BMC Bioinformatics, 7, 425.
Thangudu, R.R., Vinayagam, A., Pugalenthi, G., Manonmani, A., Offmann, B. and Sowdhamini, R. (2005) Native and modeled disulfide bonds in proteins: knowledge-based approaches toward structure prediction of disulfide-rich polypeptides. Proteins, 58, 866-879.
Tsai, C.H., Chen, B.J., Chan, C.H., Liu, H.L. and Kao, C.Y. (2005) Improving disulfide connectivity prediction with sequential distance between oxidized cysteines. Bioinformatics, 21, 4416-4419.
van Vlijmen, H.W., Gupta, A., Narasimhan, L.S. and Singh, J. (2004) A novel database of disulfide patterns and its application to the discovery of distantly related homologs. J Mol Biol, 335, 1083-1092.
Vapnik, V. (2000) The nature of statistical learning theory. Springer, New York, NY.
Vullo, A. and Frasconi, P. (2004) Disulfide connectivity prediction using recursive neural networks and evolutionary information. Bioinformatics, 20, 653-659.
Yuan, Z. and Huang, B. (2004) Prediction of protein accessible surface areas by support vector regression. Proteins, 57, 558-564.
Yuan, Z., Bailey, T.L. and Teasdale, R.D. (2005) Prediction of protein B-factor profiles. Proteins, 58, 905-912.
Yuan, Z. (2005) Better prediction of protein contact number using a support vector regression analysis of amino acid sequence. BMC Bioinformatics, 6, 248.
Zhao, E., Liu, H.L., Tsai, C.H., Tsai, H.K., Chan, C.H. and Kao, C.Y. (2005) Cysteine separations profiles on protein sequences infer disulfide connectivity. Bioinformatics, 21, 1415-1420.