Protein Life: R & D experience of a bioinformatician

Exploring science typically involves plenty of puzzles, frustrations, and even failures. This weblog is mainly intended to record my work, my thinking, and what I learn along the way. I expect that some reflection will refresh my mind from time to time, motivate me to move further, and hopefully give me a better view of the ever-changing landscape of bioinformatics. You are welcome to leave comments, good or bad, but hopefully constructive. Enjoy your surfing!

Monday, March 28, 2011

Score regularization for peptide identification

Abstract

Background

Peptide identification from tandem mass spectrometry (MS/MS) data is one of the most important problems in computational proteomics. This technique relies heavily on the accurate assessment of the quality of peptide-spectrum matches (PSMs). However, current MS technology and PSM scoring algorithms are far from perfect, leading to the generation of incorrect peptide-spectrum pairs. Thus, it is critical to develop new post-processing techniques that can distinguish true identifications from false identifications effectively.

Results

In this paper, we present a consistency-based PSM re-ranking method to improve the initial identification results. This method uses one additional assumption that two peptides belonging to the same protein should be correlated to each other. We formulate an optimization problem that embraces two objectives through regularization: the smoothing consistency among scores of correlated peptides and the fitting consistency between new scores and initial scores. This optimization problem can be solved analytically. The experimental study on several real MS/MS data sets shows that this re-ranking method improves the identification performance.

Conclusions

The score regularization method can be used as a general post-processing step for improving peptide identifications. Source code and data sets are available at: http://bioinformatics.ust.hk/SRPI.rar.
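To make the two objectives in the Results concrete, here is a minimal sketch of a generic graph-regularized re-scoring. It is not the paper's exact formulation: the objective (a sum of squared differences between scores of peptides from the same protein, plus mu times the squared deviation from the initial scores), the 0/1 peptide similarity, the parameter `mu`, and the function names are all my own assumptions. Setting the gradient to zero gives the linear system (L + mu I) f = mu y, where L is the graph Laplacian and y holds the initial scores, which is why a problem of this shape has a closed-form solution.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b]; copy rows so the caller's data is untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def regularize_scores(initial, proteins, mu=1.0):
    """Re-score PSMs so that peptides of the same protein get similar scores.

    initial  -- initial PSM scores y (one per peptide)
    proteins -- protein label of each peptide (assumed: one protein each)
    mu       -- weight of the fitting term (hypothetical parameter)
    """
    n = len(initial)
    # Assumed similarity: W[i][j] = 1 if peptides i and j share a protein.
    w = [[1.0 if i != j and proteins[i] == proteins[j] else 0.0
          for j in range(n)] for i in range(n)]
    # System matrix L + mu*I, with Laplacian L = D - W.
    a = [[-w[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        a[i][i] = sum(w[i]) + mu
    return solve_linear(a, [mu * y for y in initial])


# Three peptides from protein P1 and one lone peptide from P2.
new_scores = regularize_scores([0.9, 0.8, 0.2, 0.3], ["P1", "P1", "P1", "P2"])
print(new_scores)
```

In this toy input the low-scoring third peptide is pulled up toward its two high-scoring neighbours from the same protein, while the lone P2 peptide keeps its initial score because it has no neighbours to smooth against.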

Baking a mass-spectrometry data PIE with McMC and simulated annealing: predicting protein post-translational modifications from integrated top-down and bottom-up data

Abstract

Motivation: Post-translational modifications are vital to the function of proteins, but are hard to study, especially since several modified isoforms of a protein may be present simultaneously. Mass spectrometers are a great tool for investigating modified proteins, but the data they provide is often incomplete, ambiguous and difficult to interpret. Combining data from multiple experimental techniques—especially bottom-up and top-down mass spectrometry—provides complementary information. When integrated with background knowledge this allows a human expert to interpret what modifications are present and where on a protein they are located. However, the process is arduous and for high-throughput applications needs to be automated.

Results: This article explores a data integration methodology based on Markov chain Monte Carlo and simulated annealing. Our software, the Protein Inference Engine (the PIE), applies these algorithms using a modular approach, allowing multiple types of data to be considered simultaneously and new data types to be added as needed. Even for complicated data representing multiple modifications and several isoforms, the PIE generates accurate modification predictions, including location. When applied to experimental data collected on the L7/L12 ribosomal protein, the PIE was able to make predictions consistent with manual interpretation for several different L7/L12 isoforms using a combination of bottom-up data with experimentally identified intact masses.
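As a reminder of how simulated annealing with a Metropolis acceptance step works, here is a toy sketch. It is not the PIE's model, which integrates several data types: the only "data" here is a single observed intact-mass shift, the energy function, the move set, the cooling schedule, and every name in the code are my own assumptions. The mass shifts are the usual monoisotopic values for phosphorylation, acetylation, and methylation, rounded to five decimals.

```python
import math
import random

# Monoisotopic mass shifts in Da (rounded).
MOD_MASSES = {"phospho": 79.96633, "acetyl": 42.01057, "methyl": 14.01565}


def energy(counts, observed_shift):
    """Absolute error between predicted and observed mass shift (toy score)."""
    predicted = sum(MOD_MASSES[name] * k for name, k in counts.items())
    return abs(predicted - observed_shift)


def anneal(observed_shift, max_count=3, steps=5000,
           t_start=50.0, t_end=0.01, seed=0):
    """Search for modification counts that explain the observed shift."""
    rng = random.Random(seed)
    names = sorted(MOD_MASSES)
    state = {name: 0 for name in names}       # start unmodified
    e = energy(state, observed_shift)
    best, best_e = dict(state), e
    cooling = (t_end / t_start) ** (1.0 / steps)  # geometric schedule
    t = t_start
    for _ in range(steps):
        # Propose adding or removing one modification of a random type.
        name = rng.choice(names)
        proposal = dict(state)
        step = rng.choice((-1, 1))
        proposal[name] = min(max_count, max(0, proposal[name] + step))
        e_new = energy(proposal, observed_shift)
        # Metropolis rule: always accept improvements, sometimes accept
        # worse states, less often as the temperature drops.
        if e_new <= e or rng.random() < math.exp((e - e_new) / t):
            state, e = proposal, e_new
            if e < best_e:
                best, best_e = dict(state), e
        t *= cooling
    return best, best_e


# Shift that one phosphorylation plus two methylations would produce.
shift = MOD_MASSES["phospho"] + 2 * MOD_MASSES["methyl"]
best_counts, best_error = anneal(shift)
print(best_counts, best_error)
```

Accepting occasional uphill moves early on is what lets the search escape locally good but wrong explanations, such as two acetylations plus two methylations, which lands within about 4 Da of the same shift.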

Xlink-Identifier: An Automated Data Analysis Platform for Confident Identifications of Chemically Cross-Linked Peptides Using Tandem Mass Spectrometry

Chemical cross-linking combined with mass spectrometry provides a powerful method for identifying protein−protein interactions and probing the structure of protein complexes. A number of strategies have been reported that take advantage of the high sensitivity and high resolution of modern mass spectrometers. Approaches typically include synthesis of novel cross-linking compounds, and/or isotopic labeling of the cross-linking reagent and/or protein, and label-free methods. We report Xlink-Identifier, a comprehensive data analysis platform that has been developed to support label-free analyses. It can identify interpeptide, intrapeptide, and deadend cross-links as well as underivatized peptides. The software streamlines data preprocessing, peptide scoring, and visualization and provides an overall data analysis strategy for studying protein−protein interactions and protein structure using mass spectrometry. The software has been evaluated using a custom synthesized cross-linking reagent that features an enrichment tag. Xlink-Identifier offers the potential to perform large-scale identifications of protein−protein interactions using tandem mass spectrometry.

Network-Based Pipeline for Analyzing MS Data: An Application toward Liver Cancer

Current limitations in proteome analysis by high-throughput mass spectrometry (MS) approaches have sometimes led to incomplete (or inconclusive) data sets being published or unpublished. In this work, we used an iTRAQ reference data set on hepatocellular carcinoma (HCC) to design a two-stage functional analysis pipeline to widen and improve the proteome coverage and, subsequently, to unveil the molecular changes that occur during HCC progression in human tumorous tissue. The first stage involved functional cluster analysis by incorporating an expansion step on a cleaned integrated network. The second stage used an in-house developed pathway database where recovery of shared neighbors was followed by pathway enrichment analysis. In the original MS data set, over 500 proteins were detected from the tumors of 12 male patients, but in this paper we report an additional 1000 proteins after application of our bioinformatics pipeline. Through an integrative effort of network cleaning, community finding methods, and network analysis, we also uncovered several biologically interesting clusters implicated in HCC. We established that the HCC transition from a moderate to a poor stage involved densely connected clusters comprising PCNA, XRCC5, XRCC6, PARP1, PRKDC, and WRN. From our pathway enrichment analyses, it appeared that the HCC moderate stage, unlike the poor stage, is enriched in proteins involved in immune responses, thus suggesting the acquisition of immuno-evasion. Our strategy illustrates how an original oncoproteome could be expanded to one of a larger dynamic range where current technology limitations prevent or limit comprehensive proteome characterization.

mProphet: automated data processing and statistical validation for large-scale SRM experiments

Selected reaction monitoring (SRM) is a targeted mass spectrometric method that is increasingly used in proteomics for the detection and quantification of sets of preselected proteins at high sensitivity, reproducibility and accuracy. Currently, data from SRM measurements are mostly evaluated subjectively by manual inspection on the basis of ad hoc criteria, precluding the consistent analysis of different data sets and an objective assessment of their error rates. Here we present mProphet, a fully automated system that computes accurate error rates for the identification of targeted peptides in SRM data sets and maximizes specificity and sensitivity by combining relevant features in the data into a statistical model.
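To illustrate the general idea of combining features into one score and attaching an error rate to a score cutoff, here is a minimal sketch. The abstract above does not say how mProphet does either step, so everything here is an assumption: the linear combination with fixed weights (a real system would learn them), the use of decoy measurements to estimate the false discovery rate, the equal number of targets and decoys, and all function names and numbers.

```python
def combined_score(features, weights):
    """Linear combination of per-peak features (weights are assumed given)."""
    return sum(w * x for w, x in zip(weights, features))


def estimate_fdr(target_scores, decoy_scores, threshold):
    """Decoy-based FDR estimate at a score threshold.

    Assumes decoys score like false targets and that there are as many
    decoys as targets, so decoys above the cutoff approximate the number
    of false targets above it.
    """
    targets = sum(s >= threshold for s in target_scores)
    decoys = sum(s >= threshold for s in decoy_scores)
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)


# Made-up scores, for illustration only.
target_scores = [3.1, 2.7, 2.2, 1.0, 0.4]
decoy_scores = [1.2, 0.5, 0.3, 0.1, -0.2]
for cutoff in (2.0, 1.0):
    print(cutoff, estimate_fdr(target_scores, decoy_scores, cutoff))
```

Lowering the cutoff from 2.0 to 1.0 admits one more target but also one decoy, which is the sensitivity versus specificity trade-off that an objective error rate makes visible.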

Natural Language Processing to Play Major Role in Bringing Watson into Clinics

Under the terms of a recently inked agreement between IBM and Nuance, Watson's deep question answering, natural language processing, and machine learning capabilities will be linked with Nuance's speech recognition and Clinical Language Understanding (CLU) solutions to help physicians more accurately diagnose and treat their patients (BI 02/11/2011).

In the months leading up to the first offerings from the collaboration, researchers at IBM and Nuance will work with collaborators at Columbia University and the University of Maryland to figure out how Watson can best help in the clinical setting, as well as to incorporate some healthcare-specific adaptations into the system, Jennifer Chu-Carroll, a member of the Watson Research Team, told BioInform.

"For the most part, the natural language analytics, the machine learning and the whole architecture are domain independent so we expect to be able [to] plug these into the medical domain," she said. However, "there [will] be some ... research and development that is specific to the medical domain that we are going to have to bring in."

Strong Investments to Boost Global Bioinformatics Market

According to our latest report, entitled “Global Bioinformatics Market Outlook”, the global bioinformatics industry has grown remarkably over the past few years. Factors such as increasing R&D investment by companies and regulatory support boosted market revenue to around US$ 2.6 Billion in 2009. As the bioinformatics market is still at a nascent stage, its full potential is yet to be exploited. Moreover, significant advances in technology will boost the bioinformatics industry in the future. The market is expected to grow at a CAGR of nearly 26% during 2011-2013 to reach US$ 6.2 Billion.

Our report has found that the escalating importance of personalized medicine has significantly accelerated growth in the proteomics market. Rising awareness of protein research has opened the market for proteomics-related equipment, technologies, and services. It is estimated that by 2013 the proteomics market will be worth around US$ 17 Billion and will be an important contributor to the growth of bioinformatics.
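For readers who want to sanity-check the growth figures quoted for the bioinformatics market, here is a small arithmetic sketch. It uses only the numbers given above (US$ 2.6 Billion in 2009, US$ 6.2 Billion in 2013, a CAGR of nearly 26%). Reading "2011-2013" as three annual compounding steps from a 2010 base is my own assumption, and the function names are hypothetical.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


def implied_start(end_value, rate, years):
    """Starting value that grows to end_value at the given annual rate."""
    return end_value / (1.0 + rate) ** years


# Rate implied by the two absolute figures (2009 to 2013, four years).
rate_2009_2013 = cagr(2.6, 6.2, 2013 - 2009)

# 2010 base implied by a 26% CAGR over 2011, 2012 and 2013 (assumption).
base_2010 = implied_start(6.2, 0.26, 3)

print(f"implied CAGR 2009-2013: {rate_2009_2013:.1%}")
print(f"implied 2010 market size: US$ {base_2010:.2f} Billion")
```

The four-year rate comes out at roughly 24%, a little below the quoted 26%, which is consistent with the forecast assuming faster growth in 2011-2013 than in 2010.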

Saturday, March 5, 2011

Natural Language Processing to Play Major Role in Bringing Watson into Clinics

Tuesday, March 1, 2011

Strong Investments to Boost Global Bioinformatics Market