What can the Essential Protein Prediction website do?
This website accompanies the paper "Prediction of Protein Essentiality by the Support Vector Machine with Statistical Tests". It predicts the essentiality of a protein from its sequence (required) and its associated interaction network (optional). The user can also choose whether to use PSSM (Position-Specific Scoring Matrix) features for prediction.
How to use?
On the website, the user first selects the species: S. cerevisiae or Escherichia coli (E. coli). Each species has its own SVM (support vector machine) model.
The website then asks the user to input a protein sequence for prediction. The sequence must be in FASTA format. The website also provides an example button; clicking it displays an example protein sequence.
Note that FASTA data contains two main parts: the sequence name and the amino acid sequence. The sequence name line must start with '>' followed by a meaningful name (or ID). The name may contain only characters in [A-Z,a-z,0-9] or hyphens ("-"). The amino acid sequence may contain only characters in [A-Z,a-z]. Otherwise, the program will not work properly. To avoid this problem, the program performs some preliminary format checking after all data are submitted, to ensure that the data were input correctly.
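The format rules above can be checked before submission. The following is a minimal Python sketch of such a check, not the website's actual validation code; the function name `parse_single_fasta` and the decision to return the sequence in upper case are assumptions made for illustration.

```python
import re

# Rules taken from the text above: the header is '>' followed by a name made of
# letters, digits, or hyphens; the sequence consists of letters only.
HEADER_RE = re.compile(r">[A-Za-z0-9-]+")
SEQUENCE_RE = re.compile(r"[A-Za-z]+")


def parse_single_fasta(text):
    """Return (name, sequence) for one FASTA record, or raise ValueError.

    Hypothetical helper: the real site may apply different or extra checks.
    """
    # Drop blank lines and surrounding whitespace.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected a header line and at least one sequence line")
    header, sequence_lines = lines[0], lines[1:]
    if not HEADER_RE.fullmatch(header):
        raise ValueError("header must be '>' followed by letters, digits, or hyphens")
    # A sequence may be wrapped over several lines; join them first.
    sequence = "".join(sequence_lines)
    if not SEQUENCE_RE.fullmatch(sequence):
        raise ValueError("sequence must contain only the letters A-Z or a-z")
    return header[1:], sequence.upper()
```

For instance, a record whose header is `>YAL001C` passes the check, whereas a header containing a space or an underscore raises `ValueError`.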
In addition to the protein sequence, users can optionally choose whether to use PSSM and PPI (protein-protein physical interaction) information for prediction. For PSSM, users only need to check or uncheck the 'Use PSSM?' checkbox. If the 'Use PSSM?' option is checked, the website invokes PSI-BLAST to calculate the PSSM. Running PSI-BLAST typically takes about 20 minutes for a sequence of 260 residues.
For PPI, once this option is selected, users are required to input the interaction network. The network may span multiple lines, each of the form <ID1><tab><ID2>, where the <tab> character separates the two interacting proteins. Note that the input network must be consistent with the sequence given in the 'Input Protein Sequence' panel.
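As an illustration of the network format, the Python sketch below parses `<ID1><tab><ID2>` lines and lists the interaction partners of a given protein. It is not the website's own parser. The function names are hypothetical, and reading "consistent" as "the query protein's ID appears in the network" is an assumption.

```python
def parse_network(text):
    """Parse '<ID1><tab><ID2>' lines into a list of (id1, id2) pairs."""
    edges = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        # Each line must hold exactly two non-empty IDs separated by one tab.
        if len(fields) != 2 or not all(field.strip() for field in fields):
            raise ValueError(f"line {line_number}: expected <ID1><tab><ID2>")
        edges.append((fields[0].strip(), fields[1].strip()))
    return edges


def neighbors_of(protein_id, edges):
    """Return the set of interaction partners of protein_id (self-loops ignored)."""
    partners = set()
    for id1, id2 in edges:
        if id1 == protein_id and id2 != protein_id:
            partners.add(id2)
        elif id2 == protein_id and id1 != protein_id:
            partners.add(id1)
    return partners
```

Under the assumption stated above, an empty result from `neighbors_of` for the submitted sequence name would suggest that the network and the sequence do not use the same ID.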
Once all data have been input correctly, click the 'Submit' button to perform the prediction. The result is shown in the 'Output' panel. There are two possible outputs: Essential and Nonessential.
About the paper
Essential proteins form the minimal set required to support cell life. Identifying essential proteins helps us understand the cellular processes of an organism. However, identifying essential proteins experimentally is extremely time-consuming and labor-intensive, so an alternative approach is needed. This paper has two goals: identifying the important features and building learning machines for discriminating protein essentiality. Two data sets are used: Saccharomyces cerevisiae and Escherichia coli. We first collect features from a variety of sources. We then propose a modified backward feature selection method and build SVM predictors with the selected features. To evaluate performance, we conduct 10-fold cross-validation on the original imbalanced data set and on the down-sampled balanced data set. Statistical tests are applied to the performance of the obtained feature subsets to confirm their significance. For the first data set, our best values of F-measure and MCC (Matthews correlation coefficient) are 0.549 and 0.495, respectively, in the imbalanced experiments, and 0.770 and 0.545 in the balanced experiments. For the second data set, our best values of F-measure and MCC are 0.421 and 0.407, respectively, in the imbalanced experiments, and 0.718 and 0.448 in the balanced experiments. The experimental results show that our selected features are compact and that the performance is better than previously reported results. We also provide a website for this paper so that researchers can predict essentiality by themselves.
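For readers unfamiliar with the two reported measures, the Python sketch below computes F-measure and MCC from confusion-matrix counts using their standard textbook definitions. It is not the evaluation code used in the paper. The function names and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
import math


def f_measure(tp, fp, fn):
    """F-measure (F1): harmonic mean of precision and recall."""
    denominator = 2 * tp + fp + fn
    # Assumed convention: return 0.0 when the measure is undefined.
    return 2 * tp / denominator if denominator else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to 1."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0
```

Here `tp`, `tn`, `fp`, and `fn` count true positives, true negatives, false positives, and false negatives, with "essential" taken as the positive class. Because MCC uses all four counts, it is less affected by class imbalance than accuracy, which is one reason it is commonly reported alongside F-measure on imbalanced data.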
Below, we show the ROC curves based on the SVM models associated with mRMR (minimal redundancy and maximal relevance), CMIM (conditional mutual information maximization), Hwang et al., Acencio et al., Gustafson et al., and our Nx's. All curves were produced with the ROCR package.
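The curves themselves were drawn with ROCR in R. As a language-neutral illustration of what an ROC curve contains, the Python sketch below derives the curve's points from classifier scores and computes the area under it by the trapezoidal rule. The function names are hypothetical, and it is assumed that a higher score means "more likely essential" and that labels are 1 (essential) or 0 (nonessential).

```python
def roc_points(scores, labels):
    """Return ROC points (false positive rate, true positive rate).

    The decision threshold is swept from the highest score to the lowest.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A classifier that ranks every essential protein above every nonessential one yields an area of 1.0, while scores that carry no information yield about 0.5.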
Experimental data preparation
1. Go to the following website and download the datasets in Additional Files 3 and 4. These two files contain, respectively, part of the S. cerevisiae dataset and the full E. coli dataset used in our study. In the feature file, the first column contains ORF IDs.
2. Go to the following website and download the interaction network file. Please note that each interaction is represented as orf_id1 vs. orf_id2.
For S. cerevisiae:
Click here to download part of physical interaction data
Click here to download all interaction data
Click here to download integrated functional interaction data
Click here to download physical interaction data
Click here to download physical interaction + genomic context data
3. Rearrange the interaction network data into the following format: orf_id orf_id ... Then go to the website below and paste the interaction network data into the "Data Input" field. Some network features, such as degree, MNC, and DMNC, can then be extracted. http://hub.iis.sinica.edu.tw/Hubba
4. Go to the iGraph website and install the software. iGraph can be used to calculate features such as betweenness and closeness.
5. Go to the website and locate the supplement download site. It contains a table showing how ORF IDs (also called Blattner IDs) can be converted into Swiss-Prot IDs (sp ids). For example, b0002 corresponds to P00561. Please note that there may be other ways to obtain this correspondence.
6. Go to the UniProt website and download the sequence files by sp id. For example, the download link for P00561 is: http://www.uniprot.org/uniprot/P00561.fasta
7. Go to the BLAST FTP site. Download and run BLAST+. PSI-BLAST can be used to calculate the PSSM.
In our study, we used the nr database, and the DOS command is:
psiblast.exe -db nr_database_path -num_iterations 3 -query sp_id.fasta -out_ascii_pssm sp_id.pssm
8. Extract features. For the codon features, please visit http://codonw.sourceforge.net/ for more detailed information. The calculation of the other features is detailed in our paper.
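To make the network features mentioned in steps 3 and 4 more concrete, the following Python sketch computes degree and closeness for a protein from a list of interaction pairs. It is an illustration only, not the Hubba or iGraph implementation. The function names are hypothetical, and closeness is computed here over reachable nodes only, which is one common convention and may differ from the values produced by the tools used in our study.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (orf_id1, orf_id2) pairs."""
    graph = defaultdict(set)
    for id1, id2 in edges:
        if id1 == id2:
            continue  # ignore self-interactions
        graph[id1].add(id2)
        graph[id2].add(id1)
    return dict(graph)


def degree(graph, node):
    """Number of distinct interaction partners of node."""
    return len(graph.get(node, ()))


def closeness(graph, node):
    """Reachable-node count divided by the sum of shortest-path distances.

    Assumed convention: unreachable nodes are ignored, and an isolated
    or unknown node gets 0.0.
    """
    distances = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search gives shortest paths in hops
        current = queue.popleft()
        for neighbor in graph.get(current, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    total = sum(distances.values())
    return (len(distances) - 1) / total if total else 0.0
```

On a four-protein chain a-b-c-d, for example, the inner protein b has degree 2 and a higher closeness than the end protein a, reflecting its more central position.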
1. S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997.
2. M. L. Acencio and N. Lemke, “Towards the prediction of essential genes by integration of network topology, cellular localization and biological process information,” BMC Bioinformatics, vol. 10, no. 1, pp. 290–307, 2009.
3. C. C. Chang and C. J. Lin, “LIBSVM: a library for support vector machines,” 2001, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
4. F. Fleuret, “Fast binary feature selection with conditional mutual information,” Journal of Machine Learning Research, vol. 5, pp. 1531–1555, 2004.
5. A. M. Gustafson, E. S. Snitkin, S. C. Parker, C. DeLisi, and S. Kasif, “Towards the identification of essential genes using targeted genome sequencing and comparative analysis,” BMC Genomics, vol. 7, 2006.
6. P. Hu, S. C. Janga, M. Babu, J. J. Diaz-Mejia, G. Butland, W. Yang, O. Pogoutse, X. Guo, S. Phanse, P. Wong, S. Chandran, C. Christopoulos, A. Nazarians-Armavil, N. K. Nasseri, G. Musso, M. Ali, N. Nazemof, V. Eroukova, A. Golshani, A. Paccanaro, J. F. Greenblatt, G. Moreno-Hagelsieb, and A. Emili, “Global functional atlas of Escherichia coli encompassing previously uncharacterized proteins,” PLoS Biology, vol. 7, pp. 929–947, 2009.
7. Y. C. Hwang, C. C. Lin, J. Y. Chang, H. Mori, H. F. Juan, and H. C. Huang, “Predicting essential genes based on network and sequence analysis,” Molecular BioSystems, vol. 5, no. 12, pp. 1672–1678, 2009.
8. H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
9. R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2008, ISBN 3-900051-07-0. [Online]. Available: http://www.R-project.org.
10. T. Sing, O. Sander, N. Beerenwinkel, and T. Lengauer, “ROCR: visualizing classifier performance in R,” Bioinformatics, vol. 21, pp. 3940–3941, 2005.