Protein Life: R & D experience of a bioinformatician

Exploring science typically involves many puzzles, frustrations and even failures. This weblog is mainly intended to record my work, thinking and knowledge acquisition. I expect that some reflection will refresh my mind from time to time, motivate me to move further, and hopefully give me a better view of the ever-changing landscape of bioinformatics. You are welcome to leave comments, good or bad, but hopefully constructive. Enjoy your surfing!

Thursday, December 30, 2010

Time for a break: happy new year!

Tuesday, December 28, 2010

A performance enhanced PSI-BLAST based on hybrid alignment

Abstract

Motivation: Sequence alignment is one of the most popular tools of modern biology. NCBI's PSI-BLAST utilizes iterative model building in order to better detect distant homologs with greater sensitivity than non-iterative BLAST. However, PSI-BLAST's performance is limited by the fact that it relies on deterministic alignments. Using a semi-probabilistic alignment scheme such as Hybrid alignment should allow for better informed model building and improved identification of homologous sequences, particularly remote homologs.

Results: We have built a new version of the tool in which the Smith-Waterman alignment algorithm core is replaced by the hybrid alignment algorithm. The favorable statistical properties of the hybrid algorithm allow the introduction of position-specific gap penalties in Hybrid PSI-BLAST. This improves the position-specific modeling of protein families and results in an overall improvement of performance.

Availability: Source code is freely available for download at http://bioserv.mps.ohio-state.edu/HybridPSI, implemented in C and supported on linux.

Contact: bundschuh@mps.ohio-state.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

ICPD-a new peak detection algorithm for LC/MS.

BMC Genomics. 2010 Dec 1;11 Suppl 3:S8.

Zhang J, Haskins W.

Department of Electrical Engineering, University of Texas at San Antonio, Texas, USA. michelle.zhang@utsa.edu

Abstract

BACKGROUND: The identification and quantification of proteins using label-free Liquid Chromatography/Mass Spectrometry (LC/MS) play crucial roles in biological and biomedical research. Increasing evidence has shown that biomarkers are often low abundance proteins. However, LC/MS systems are subject to considerable noise and sample variability, whose statistical characteristics are still elusive, making computational identification of low abundance proteins extremely challenging. As a result, the inability to identify low abundance proteins in a proteomic study is the main bottleneck in protein biomarker discovery.

RESULTS: In this paper, we propose a new peak detection method called Information Combining Peak Detection (ICPD) for high resolution LC/MS. In LC/MS, peptides elute during a certain time period and as a result, peptide isotope patterns are registered in multiple MS scans. The key feature of the new algorithm is that the observed isotope patterns registered in multiple scans are combined together for estimating the likelihood of the peptide existence. An isotope pattern matching score based on the likelihood probability is provided and utilized for peak detection.

CONCLUSIONS: The performance of the new algorithm is evaluated based on protein standards with 48 known proteins. The evaluation shows better peak detection accuracy for low abundance proteins than other LC/MS peak detection methods.

On the accuracy and limits of peptide fragmentation spectrum prediction

Anal Chem. 2010 Dec 22. [Epub ahead of print]

On the Accuracy and Limits of Peptide Fragmentation Spectrum Prediction.

Li S, Arnold RJ, Tang H, Radivojac P.

School of Informatics and Computing, Indiana University, Bloomington, Indiana 47408, United States.

Abstract

We estimated the reproducibility of tandem mass spectra for the widely used collision-induced dissociation (CID) of peptide ions. Using the Pearson correlation coefficient as a measure of spectral similarity, we found that the within-experiment reproducibility of fragment ion intensities is very high (about 0.85). However, across different experiments and instrument types/setups, the correlation decreases by more than 15% (to about 0.70). We further investigated the accuracy of current predictors of peptide fragmentation spectra and found that they are more accurate than the ad-hoc models generally used by search engines (e.g., SEQUEST) and, surprisingly, approaching the empirical upper limit set by the average across-experiment spectral reproducibility (especially for charge +1 and charge +2 precursor ions). These results provide evidence that, in terms of accuracy of modeling, predicted peptide fragmentation spectra provide a viable alternative to spectral libraries for peptide identification, with a higher coverage of peptides and lower storage requirements. Furthermore, using five data sets of proteome digests by two different proteases, we find that PeptideART (a data-driven machine learning approach) is generally more accurate than MassAnalyzer (an approach based on a kinetic model for peptide fragmentation) in predicting fragmentation spectra but that both models are significantly more accurate than the ad-hoc models.

PMID: 21175207 [PubMed - as supplied by publisher]

My comments: the ad-hoc model that SEQUEST uses internally is well known to be a simple one. Most prediction models can outperform it with flying colors.
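
For readers who want to try the similarity measure mentioned in the abstract, here is a minimal sketch of the Pearson correlation between two vectors of fragment ion intensities. It assumes the two spectra have already been aligned so that position i refers to the same fragment ion in both; the intensity values are made up for illustration and are not taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length intensity vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical intensities of the same four fragment ions in two spectra.
spec_a = [10.0, 50.0, 100.0, 20.0]
spec_b = [12.0, 45.0, 90.0, 30.0]
r = pearson(spec_a, spec_b)
print(f"spectral similarity r = {r:.3f}")
```

A value near 1 means the two spectra rank and scale their fragment intensities in nearly the same way, which is the sense in which the abstract reports values of about 0.85 and 0.70.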

Peptide synthesis

Click and see our every day special $2.00 per residue

Welcome to Peptide 2.0

- The Second Generation for Peptide Synthesis

Peptide 2.0 strives to provide best custom peptide services. With the state-of-the-art facility and an outstanding management team, Peptide 2.0 is able to provide custom peptide service with the best quality and the best value to our customers at unprecedented price (from $2.00 per amino acid).

Peptide-2.0: From $2.00 per amino acid; MS report included; up to 20 amino acid residues
Peptide-2-Library: 96-well plate format; up to 3 mg scale; from $2880 per plate
Peptide-2-Go: Up to 98% purity; from mg to kg; fast turnaround time

		CALL:	1-800-301-6268
		FAX:	1-703-637-9863
		EMAIL:	order@peptide20.org


	Why Peptide2.0:

	High Quality with Unbeatable Low Price --- From $2.00 per amino acid residue (aa)
	Real-time tracking your order status
	Two to Three weeks delivery time for most peptides.
	Try our risk-free service now! You won't be charged unless you receive your peptides.

A dynamic data structure for flexible molecular maintenance and informatics

Abstract

Motivation: We present the ‘Dynamic Packing Grid’ (DPG), a neighborhood data structure for maintaining and manipulating flexible molecules and assemblies, for efficient computation of binding affinities in drug design or in molecular dynamics calculations.

Results: DPG can efficiently maintain the molecular surface using only linear space and supports quasi-constant time insertion, deletion and movement (i.e. updates) of atoms or groups of atoms. DPG also supports constant time neighborhood queries from arbitrary points. Our results for maintenance of molecular surface and polarization energy computations using DPG exhibit marked improvement in time and space requirements.

Availability: http://www.cs.utexas.edu/~bajaj/cvc/software/DPG.shtml

Contact: bajaj@cs.utexas.edu

Full article
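
To get a feel for what a neighborhood data structure does, here is a minimal sketch of a hashed uniform grid over 3D points. This is a simplified stand-in and not the authors' DPG: the class name `UniformGrid`, its methods and the restriction that the query radius must not exceed the cell size are all my own assumptions for illustration.

```python
from collections import defaultdict


class UniformGrid:
    """Hashed uniform grid over 3D points (illustrative only, not DPG itself)."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(set)  # cell index -> set of atom ids
        self.positions = {}            # atom id -> (x, y, z)

    def _cell(self, p):
        # Floor division keeps negative coordinates in the correct cell.
        return tuple(int(c // self.cell_size) for c in p)

    def insert(self, atom_id, p):
        self.positions[atom_id] = p
        self.cells[self._cell(p)].add(atom_id)

    def remove(self, atom_id):
        key = self._cell(self.positions.pop(atom_id))
        self.cells[key].discard(atom_id)
        if not self.cells[key]:
            del self.cells[key]  # keep memory proportional to occupied cells

    def move(self, atom_id, new_p):
        self.remove(atom_id)
        self.insert(atom_id, new_p)

    def neighbors(self, p, radius):
        """Ids of atoms within `radius` of point p; needs radius <= cell_size."""
        if radius > self.cell_size:
            raise ValueError("radius must not exceed cell_size in this sketch")
        cx, cy, cz = self._cell(p)
        r2 = radius * radius
        found = []
        # Only the 27 cells around p can contain atoms within the radius.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    for atom_id in self.cells.get((cx + dx, cy + dy, cz + dz), ()):
                        q = self.positions[atom_id]
                        if sum((a - b) ** 2 for a, b in zip(p, q)) <= r2:
                            found.append(atom_id)
        return found
```

Insertion, deletion and movement each touch one or two hash buckets, and a query inspects a fixed number of cells, which gives the flavor of the constant-time updates and queries claimed in the abstract as long as the number of atoms per cell stays bounded.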

Thursday, December 23, 2010

A label-free differential quantitative mass spectrometry method for the characterization and identification of protein changes during citrus fruit development

Background

Citrus is one of the most important and widely grown commodity fruit crops. In this study a label-free LC-MS/MS based shot-gun proteomics approach was taken to explore three main stages of citrus fruit development. These approaches were used to identify and evaluate changes occurring in juice sac cells in various metabolic pathways affecting citrus fruit development and quality.

Results

Protein changes in citrus juice sac cells were identified and quantified using label-free shotgun methodologies. Two alternative methods, differential mass-spectrometry (dMS) and spectral counting (SC) were used to analyze protein changes occurring during earlier and late stages of fruit development. Both methods were compared in order to develop a proteomics workflow that could be used in a non-model plant lacking a sequenced genome. In order to resolve the bioinformatics limitations of EST databases from species that lack a full sequenced genome, we established iCitrus. iCitrus is a comprehensive sequence database created by merging three major sources of sequences (HarvEST:citrus, NCBI/citrus/unigenes, NCBI/citrus/proteins) and improving the annotation of existing unigenes. iCitrus provided a useful bioinformatics tool for the high-throughput identification of citrus proteins. We have identified approximately 1500 citrus proteins expressed in fruit juice sac cells and quantified the changes of their expression during fruit development. Our results showed that both dMS and SC provided significant information on protein changes, with dMS providing a higher accuracy.

Conclusion

Our data supports the notion of the complementary use of dMS and SC for label-free comparative proteomics, broadening the identification spectrum and strengthening the identification of trends in protein expression changes during the particular processes being compared.

Full article

Saturday, December 18, 2010

Speeding up tandem mass spectrometry-based database searching by longest common prefix

From my Chinese colleagues.

Abstract

Background

Tandem mass spectrometry-based database searching has become an important technology for peptide and protein identification. One of the key challenges in database searching is the remarkable increase in computational demand, brought about by the expansion of protein databases, semi- or non-specific enzymatic digestion, post-translational modifications and other factors. Some software tools choose peptide indexing to accelerate processing. However, peptide indexing requires a large amount of time and space for construction, especially for the non-specific digestion. Additionally, it is not flexible to use.

Results

We developed an algorithm based on the longest common prefix (ABLCP) to efficiently organize a protein sequence database. The longest common prefix is a data structure that is always coupled to the suffix array. It eliminates redundant candidate peptides in databases and reduces the corresponding peptide-spectrum matching times, thereby decreasing the identification time. This algorithm is based on the property of the longest common prefix. Although enzymatic digestion poses a challenge to this property, some adjustments can be made to this algorithm to ensure that no candidate peptides are omitted. Compared with peptide indexing, ABLCP requires much less time and space for construction and is subject to fewer restrictions.

Conclusions

The ABLCP algorithm can help to improve data analysis efficiency. A software tool implementing this algorithm is available at http://pfind.ict.ac.cn/pfind2dot5/index.htm

full article
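
As background on the data structures named in the abstract, here is a minimal sketch of a suffix array together with its longest common prefix (LCP) array. It is not the ABLCP algorithm from the paper: the suffix array is built by naive sorting (fine only for short strings), the LCP array uses the standard Kasai procedure, and the function names are my own.

```python
def suffix_array(s):
    """Naive suffix array: start positions of all suffixes in sorted order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    """Kasai's algorithm.

    lcp[i] is the length of the common prefix of the suffixes starting at
    sa[i - 1] and sa[i]; lcp[0] is 0 by convention.
    """
    n = len(s)
    rank = [0] * n
    for i, start in enumerate(sa):
        rank[start] = i
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix just before suffix i in sorted order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


# Toy sequence, made up for illustration.
seq = "PEPTIDEPEPTIDER"
sa = suffix_array(seq)
lcp = lcp_array(seq, sa)
```

An entry lcp[i] >= k says that the k-residue string starting at sa[i] also starts at sa[i - 1], so that candidate has already been seen and need not be scored again. That is the kind of redundancy the abstract says the method eliminates.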

Cresset Introduces Newest Cloud Computing Enabled Application with FieldAlign V3.0

Cresset today announced that it has released a major new version of its FieldAlign package. FieldAlign V3.0 is the latest in Cresset’s new generation of “cloud enabled” applications, which support parallel, distributed computing by default. It can be deployed both as a traditional desktop application and as a command-line application distributed onto large computing servers. FieldAlign V3.0 also introduces native support for the Mac for the first time.
FieldAlign is a powerful molecular design and 3D Structure Activity Relationship (SAR) tool which generates biologically relevant molecular comparisons, which can be used to find the root causes of activity or inactivity. FieldAlign helps chemists to gain detailed understanding of the SAR of their lead molecules and to use this to design the best next synthesis.

FieldAlign V3.0 introduces a range of user features that improve productivity and customisability. Its command line interface supports scripting and workflow systems, and is now available on Windows, Linux and Mac platforms. It incorporates a new molecule table enabling filtering and sorting of lead molecules, using imported data or standard physical properties, such as wcLogP, TPSA, Rule of Five violations. The new molecule editor enables rapid design iterations, while the multi-processor support facilitates faster run times on modern computers. With the option to expand computational power by distributing remote FieldEngines and its enhanced integration with other chemistry applications, more flexible licensing options provide advanced user flexibility.

“FieldAlign V3.0 is now much simpler to deploy throughout a company and easier to use. This gives medicinal chemists access to an intuitive and powerful tool that enables them to accurately evaluate the effect of small design changes before synthesis”, said Tim Cheeseright, Products Director at Cresset.

Monday, December 13, 2010

Cytoscape

"Cytoscape is an open source bioinformatics software platform for visualizing molecular interaction networks and integrating with gene expression profiles and other state data. Additional features are available as plugins. Plugins are available for network and molecular profiling analyses, new layouts, additional file format support and connection with databases and searching in large networks. Plugins may be developed using the Cytoscape open Java software architecture by anyone and plugin community development is encouraged"

Official website

Identification of functional modules in a ppi network by bounded diameter clustering.

Dense subgraphs of Protein-Protein Interaction (PPI) graphs are assumed to be potential functional modules and play an important role in inferring the functional behavior of proteins. The increasing amount of available PPI data calls for a fast, accurate approach to biological complex identification, and different models and algorithms have therefore been proposed for identifying functional modules. This paper describes a new graph theoretic clustering algorithm that detects densely connected regions in a large PPI graph. The method is based on finding bounded diameter subgraphs around a seed node. The algorithm has the advantage of being very simple and efficient when compared with other graph clustering methods. This algorithm is tested on the yeast PPI graph and the results are compared with MCL, Core-Attachment, and MCODE algorithms.

Full article
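
To illustrate the idea of a bounded diameter subgraph around a seed node, here is a minimal sketch that collects every node within a fixed number of hops of a seed using breadth-first search. This is only my reading of the general idea, not the algorithm from the paper (which presumably also scores density and chooses seeds); the function name and the toy graph are made up.

```python
from collections import deque


def ball_around_seed(graph, seed, radius):
    """Nodes within `radius` hops of `seed` in an adjacency-list graph.

    Any two returned nodes are joined through the seed by a path inside the
    returned set, so the induced subgraph has diameter at most 2 * radius.
    """
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # do not expand beyond the allowed number of hops
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


# Toy interaction graph: a triangle A-B-C with a tail C-D-E.
ppi = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}
module = ball_around_seed(ppi, "A", 1)
```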

ADEPTS: ADVANCED PEPTIDE DE NOVO SEQUENCING WITH A PAIR OF TANDEM MASS SPECTRA

From the author of PEAKS (the de facto standard for de novo sequencing).

De novo sequencing is an important task in proteomics to identify novel peptide sequences. Traditionally, only one MS/MS spectrum is used for the sequencing of a peptide; however, the use of multiple spectra of the same peptide with different types of fragmentation has the potential to significantly increase the accuracy and practicality of de novo sequencing. Research into the use of multiple spectra is in a nascent stage. We propose a general framework to combine the two different types of MS/MS data. Experiments demonstrate that our method significantly improves the de novo sequencing of existing software.

Read more

Friday, December 10, 2010

Isocratic flow and gradient elution

A separation in which the mobile phase composition remains constant throughout the procedure is termed isocratic (meaning constant composition). The word was coined by Csaba Horvath who was one of the pioneers of HPLC.

The mobile phase composition does not have to remain constant. A separation in which the mobile phase composition is changed during the separation process is described as a gradient elution. One example is a gradient starting at 10% methanol and ending at 90% methanol after 20 minutes. The two components of the mobile phase are typically termed "A" and "B"; A is the "weak" solvent which allows the solute to elute only slowly, while B is the "strong" solvent which rapidly elutes the solutes from the column. Solvent A is often water, while B is an organic solvent miscible with water, such as acetonitrile, methanol, THF, or isopropanol.
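
The example gradient above (10% to 90% methanol over 20 minutes) can be written as a small function. This is a minimal sketch assuming a single linear segment that holds its end values outside the programmed time; the function name and defaults are my own.

```python
def percent_b(t_min, start_b=10.0, end_b=90.0, duration_min=20.0):
    """Percent of strong solvent B at time t_min for one linear gradient."""
    if t_min <= 0:
        return start_b
    if t_min >= duration_min:
        return end_b
    return start_b + (end_b - start_b) * t_min / duration_min


# Composition at a few time points of the 10% -> 90% gradient.
for t in (0, 5, 10, 20):
    print(t, "min:", percent_b(t), "% B")
```

An isocratic run is the special case where `start_b` equals `end_b`.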

In isocratic elution, peak width increases with retention time linearly according to the equation for N, the number of theoretical plates. This leads to the disadvantage that late-eluting peaks get very flat and broad. Their shape and width may keep them from being recognized as peaks.
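
The linear relation can be read directly off the plate-number equation. Assuming the common definition of N in terms of retention time t_R and peak width at base W:

```latex
N = 16\left(\frac{t_R}{W}\right)^{2}
\quad\Longrightarrow\quad
W = \frac{4\,t_R}{\sqrt{N}}
```

For a column with a fixed N, W is therefore proportional to t_R. As an illustrative calculation with N = 10,000, a peak eluting at 5 min is about 0.2 min wide at the base, while one eluting at 20 min is about 0.8 min wide.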

Gradient elution decreases the retention of the later-eluting components so that they elute faster, giving narrower (and taller) peaks for most components. This also improves the peak shape for tailed peaks, as the increasing concentration of the organic eluent pushes the tailing part of a peak forward. This also increases the peak height (the peak looks "sharper"), which is important in trace analysis. The gradient program may include sudden "step" increases in the percentage of the organic component, or different slopes at different times – all according to the desire for optimum separation in minimum time.

In isocratic elution, the selectivity does not change if the column dimensions (length and inner diameter) change – that is, the peaks elute in the same order. In gradient elution, the elution order may change as the dimensions or flow rate change.

The driving force in reversed phase chromatography originates in the high order of the water structure. The role of the organic component of the mobile phase is to reduce this high order and thus reduce the retarding strength of the aqueous component.

Read more

Reversed Phase LC

Reversed-phase chromatography (RPC) has a non-polar stationary phase and an aqueous, moderately polar mobile phase. The name "reversed phase" has a historical background. In the 1970s most liquid chromatography was done on non-modified silica or alumina with a hydrophilic surface chemistry and a stronger affinity for polar compounds - hence it was considered "normal". The introduction of alkyl chains bonded covalently to the support surface reversed the elution order. Now in RPC, polar compounds are eluted first while non-polar compounds are retained - hence "reversed phase". All of the mathematical and experimental considerations used in other chromatographic methods apply (e.g. separation resolution improves with column length). Today, reversed-phase column chromatography accounts for the vast majority of analysis performed in liquid chromatography.

FAIMS

Field Asymmetric Ion Mobility (FAIMS) - Mass Spectrometry

Field Asymmetric Ion Mobility Spectrometry (FAIMS) is a high speed, gas phase ion separation technique. When interfaced to a Mass Spectrometer, the FAIMS chip provides an additional separation stage, making it suitable for applications ranging from drug development to proteomics.

The FAIMS ion filter is orthogonal to both LC and MS, so has the potential to separate analytes that are difficult to distinguish using only LC-MS. In some cases, the FAIMS stage can replace the LC and associated sample preparation steps.

read more

Thursday, December 9, 2010

Orbitrap/LTQ signal threshold

We evaluate the effect of ion-abundance threshold settings for data dependent acquisition on a hybrid LTQ-Orbitrap mass spectrometer, analyzing features such as the total number of spectra collected, the signal to noise ratio of the full MS scans, the spectral quality of the tandem mass spectra acquired, and the number of peptides and proteins identified from a complex mixture. We find that increasing the threshold for data dependent acquisition generally decreases the quantity but increases the quality of the spectra acquired. This is especially true when the threshold setting is set above the noise level of the full MS scan. We compare two distinct experimental configurations: one where full MS scans are acquired in the Orbitrap analyzer, while tandem MS scans are acquired in the LTQ analyzer and one where both full MS and tandem MS scans are acquired in the LTQ analyzer. We examine the number of spectra, peptides, and proteins identified under various threshold conditions, and we find that the optimal threshold setting is at or below the respective noise level of the instrument regardless of whether the full MS scan is performed in the Orbitrap or in the LTQ analyzer. When comparing the high-throughput identification performance of the two analyzers, we conclude that, used at optimal threshold levels, the LTQ and the Orbitrap identify similar numbers of peptides and proteins. The higher scan speed of the LTQ, which results in more spectra being collected, is roughly compensated by the higher mass accuracy of the Orbitrap, which results in improved database searching and peptide validation software performance.

full article
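
To make the role of the threshold concrete, here is a minimal sketch of how an ion-abundance threshold limits which precursors get picked for MS/MS from one full scan. It is a toy model under my own assumptions (a plain top-N rule on intensity) and ignores what real instrument control software also does, such as dynamic exclusion and charge state screening. The function name and the peak values are made up.

```python
def select_precursors(peaks, threshold, top_n=5):
    """Pick up to top_n (m/z, intensity) peaks with intensity >= threshold."""
    eligible = [p for p in peaks if p[1] >= threshold]
    eligible.sort(key=lambda p: p[1], reverse=True)  # most intense first
    return eligible[:top_n]


# Hypothetical full-scan peaks as (m/z, intensity) pairs.
scan = [(400.2, 1000.0), (512.3, 50000.0), (623.8, 200.0), (701.4, 8000.0)]

low = select_precursors(scan, threshold=100.0)    # near the noise level
high = select_precursors(scan, threshold=5000.0)  # well above the noise level
print(len(low), "precursors at the low threshold,", len(high), "at the high one")
```

Raising the threshold shrinks the list of triggered MS/MS scans to the more intense precursors, which matches the trade-off between quantity and quality described in the abstract.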

Monday, December 6, 2010

mass accuracy of Orbitrap: internal communications

From communications with one colleague.

It was mentioned during the training course last month that, while performing DDA, the Orbitrap performs a "pre-scan" of the ions entering the mass spectrometer. This pre-scan is really just a part of the full scan. Rather than waiting for the completion of the full scan at a high resolution, instead, part way through, it registers the pre-scan (at a resolution of about 15,000) for the purpose of selecting ions for MS/MS. The full scan continues to completion, but the MS/MS scans on the linear ion trap already are underway by the time this happens.

The point here is that the precursor masses reported for the MS/MS spectra are from the pre-scan, not the higher resolution (if one was specified in the methods file) full scan.

At my request, he installed a utility named extractMSn on the computer associated with OT1. This utility takes a .RAW data file and extracts all of the MS and MS/MS spectra, creating a set of .dta files. The first line of each .dta file representing a MS/MS file contains the m/z value of the precursor as determined by the full scan, not the pre-scan.

The purpose of the pre-scan is to speed up the process of data-dependent acquisition (DDA). Instead of waiting for an entire full scan to be completed before doing the first MS/MS, it collects data when the full scan is partially complete. This doesn't mean it only has covered a part of the scan range. The full range of ions are present in both the pre-scan and the full scan; they are just better resolved in the full scan (and a little more accurate).

The pre-scan is not inherently more useful than the full scan. In fact, the peaks of the full scan should be better resolved. The problem is that the precursor m/z values reported for the MS/MS spectra (i.e. the values that appear in the .dtas) are the less-resolved pre-scan values. So something is needed to go back to the full scan and pick out the better resolved values if those are what one wants.

I decided to have a look at whether it makes any significant difference. So, for 63 MS/MS spectra representing BSA peptides, I calculated the ppm error for precursor m/z values taken from the pre-scan and the full scan. This data is attached as a spreadsheet.
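
For reference, the ppm error is the relative deviation of the observed m/z from the theoretical m/z, scaled by one million. A minimal sketch follows; the m/z values in the example are made up and are not from the spreadsheet.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def summarize(errors):
    """Minimum, maximum and mean of a list of ppm errors."""
    return min(errors), max(errors), sum(errors) / len(errors)


# Hypothetical (observed, theoretical) precursor m/z pairs.
pairs = [(582.3205, 582.3190), (653.3610, 653.3617), (740.4020, 740.4014)]
errors = [ppm_error(obs, theo) for obs, theo in pairs]
lowest, highest, mean = summarize(errors)
print(f"range {lowest:.1f} to {highest:.1f} ppm, mean {mean:.1f} ppm")
```

A negative value means the observed m/z is below the theoretical one.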

For the pre-scan precursor m/z values, the ppm error ranged from -2.1 ppm to as high as 10.9 ppm in what seems to be a uniform distribution.

For the full scan precursor values, most fell into the range -2.8 ppm to 3.4 ppm, with an average of -0.8 ppm. This matches what I've observed for y ions of MS/MS spectra. However, there was a second range of errors, from -31.9 ppm to -22.1 ppm, for the precursors of 11 peptides.

Possibly, the second range arises from the utility selecting an interfering peak rather than the proper one in certain instances. I'll look into it further.

That is what extractMSn is meant to do (generate peak lists with the full scan m/z value for the precursor included instead of the pre-scan value), but it apparently doesn't always do it well. If it did it with 100% success, I would recommend using it for applications where resolution greater than 15,000 is important. But it doesn't.

In any case, it seems the extractMSn utility isn't very helpful and should not be used.