Protein Life: R & D experience of a bioinformatician

Exploring science typically involves a lot of puzzles, frustrations and even failures. This weblog is mainly intended to record my work, my thinking and what I learn along the way. I expect that some reflection will refresh my mind from time to time, motivate me to move further, and hopefully give me a better view of the ever-changing landscape of bioinformatics. You are welcome to leave comments, good or bad, but hopefully constructive. Enjoy your surfing!

Friday, December 9, 2011

KABOOM! A new suffix array based algorithm for clustering expression data

Abstract

Motivation: Second-generation sequencing technology has reinvigorated research using expression data, and clustering such data remains a significant challenge, with much larger datasets and with different error profiles. Algorithms that rely on all-versus-all comparison of sequences are not practical for large datasets.

Results: We introduce a new filter for string similarity which has the potential to eliminate the need for all-versus-all comparison in clustering of expression data and other similar tasks. Our filter is based on multiple long exact matches between the two strings, with the additional constraint that these matches must be sufficiently far apart. We give details of its efficient implementation using modified suffix arrays. We demonstrate its efficiency by presenting our new expression clustering tool, wcd-express, which uses this heuristic. We compare it to other current tools and show that it is very competitive both with respect to quality and run time.

Availability: Source code and binaries available under GPL at http://code.google.com/p/wcdest. Runs on Linux and MacOS X.
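
To make the filter idea in the abstract more concrete, here is a minimal sketch of a "multiple long exact matches that are sufficiently far apart" test. It is not the wcd-express implementation: it uses a plain k-mer set instead of modified suffix arrays, it measures the distance between matches only in the first string, and the function names and default parameters (`k`, `min_matches`, `min_gap`) are assumptions chosen for illustration.

```python
def exact_match_positions(a: str, b: str, k: int) -> list[int]:
    """Start positions in `a` of length-k substrings that also occur in `b`."""
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return [i for i in range(len(a) - k + 1) if a[i:i + k] in kmers_b]


def passes_filter(a: str, b: str, k: int = 8,
                  min_matches: int = 2, min_gap: int = 12) -> bool:
    """True if `a` shares at least `min_matches` exact k-mers with `b`
    whose start positions in `a` are at least `min_gap` apart.

    Hypothetical parameters: the paper's actual thresholds are not
    given in the abstract.
    """
    chosen = 0
    last = None
    # Positions come out in increasing order, so a greedy left-to-right
    # pick yields the largest possible set of well-separated matches.
    for pos in exact_match_positions(a, b, k):
        if last is None or pos - last >= min_gap:
            chosen += 1
            last = pos
            if chosen >= min_matches:
                return True
    return False
```

Only pairs that pass such a filter would go on to a more expensive comparison. The sketch still touches every pair it is given, so it illustrates the filter criterion only, not the suffix-array machinery that avoids all-versus-all comparison.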

Saturday, December 3, 2011

The Infobiotics Workbench: an integrated in silico modelling platform for Systems and Synthetic Biology

Abstract

Summary: The Infobiotics Workbench is an integrated software suite incorporating model specification, simulation, parameter optimization and model checking for Systems and Synthetic Biology. A modular model specification allows for straightforward creation of large-scale models containing many compartments and reactions. Models are simulated either using stochastic simulation or numerical integration, and visualized in time and space. Model parameters and structure can be optimized with evolutionary algorithms, and model properties calculated using probabilistic model checking.

Availability: Source code and binaries for Linux, Mac and Windows are available at http://www.infobiotics.org/infobiotics-workbench/; released under the GNU General Public License (GPL) version 3.
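
For readers unfamiliar with the stochastic simulation mentioned in the summary, the following is a minimal sketch of the Gillespie direct method for a single species that is produced at a constant rate and degraded in proportion to its count. It is a generic textbook illustration, not code from the Infobiotics Workbench; the function name, the birth-death model and the rate parameters are all assumptions.

```python
import random


def gillespie_birth_death(k_prod, k_deg, x0, t_end, seed=0):
    """Simulate production (rate k_prod) and degradation (rate k_deg * x)
    of one species with the Gillespie direct method.

    Returns two lists: event times and the molecule count after each event.
    """
    rng = random.Random(seed)
    t, x = 0.0, x0
    times, counts = [t], [x]
    while True:
        a_prod = k_prod          # propensity of the production reaction
        a_deg = k_deg * x        # propensity of the degradation reaction
        a_total = a_prod + a_deg
        if a_total == 0:
            break                # nothing can happen any more
        # Waiting time to the next reaction is exponentially distributed.
        t += rng.expovariate(a_total)
        if t > t_end:
            break
        # Pick which reaction fires, in proportion to its propensity.
        if rng.random() * a_total < a_prod:
            x += 1
        else:
            x -= 1
        times.append(t)
        counts.append(x)
    return times, counts


if __name__ == "__main__":
    times, counts = gillespie_birth_death(k_prod=5.0, k_deg=0.1, x0=0, t_end=50.0)
    print(len(times), "events; final count:", counts[-1])
```

Numerical integration of the corresponding rate equation would give a smooth average trajectory, while repeated runs of this sketch with different seeds show the fluctuations around it.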

Monday, November 7, 2011

Gene Ontology-driven inference of protein-protein interactions using inducers

Motivation: Protein-protein interactions (PPI) are pivotal for many biological processes and similarity in Gene Ontology (GO) annotation has been found to be one of the strongest indicators for PPI. Most GO-driven algorithms for PPI inference combine machine learning and semantic similarity techniques. We introduce the concept of inducers as a method to integrate both approaches more effectively, leading to superior prediction accuracies.

Results: An inducer (ULCA) in combination with a Random Forest classifier compares favorably to several sequence-based methods, semantic similarity measures and multi-kernel approaches. On a newly created set of high-quality interaction data, the proposed method achieves high cross-species prediction accuracies (AUC ≤ 0.88), rendering it a valuable companion to sequence-based methods.

Availability: Software and datasets are available at http://bioinformatics.org.au/go2ppi/

Saturday, October 29, 2011

Enhanced peptide quantification using spectral count clustering and cluster abundance

Quantification of protein expression by means of mass spectrometry (MS) has been introduced in various proteomics studies. In particular, two label-free quantification methods, spectral counting and spectra feature analysis, have been extensively investigated in a wide variety of proteomic studies.

The cornerstone of both methods is peptide identification based on a proteomic database search and subsequent estimation of peptide retention time. However, they often suffer from restrictive database search and inaccurate estimation of the liquid chromatography (LC) retention time.

Furthermore, conventional peptide identification methods based on the spectral library search algorithms such as SEQUEST or SpectraST have been found to provide neither the best match nor high-scored matches. Lastly, these methods are limited in the sense that target peptides cannot be identified unless they have been previously generated and stored into the database or spectral libraries. To overcome these limitations, we propose a novel method, namely Quantification method based on Finding the Identical Spectral set for a Homogenous peptide (Q-FISH) to estimate the peptide's abundance from its tandem mass spectrometry (MS/MS) spectra through the direct comparison of experimental spectra.

Intuitively, our Q-FISH method compares all possible pairs of experimental spectra in order to identify both known and novel proteins, significantly enhancing identification accuracy by grouping replicated spectra from the same peptide targets.

Results: We applied Q-FISH to Nano-LC-MS/MS data obtained from human hepatocellular carcinoma (HCC) and normal liver tissue samples to identify differentially expressed peptides between the normal and disease samples. For a total of 44,318 spectra obtained through MS/MS analysis, Q-FISH yielded 14,747 clusters.

Among these, 5,777 clusters were identified only in the HCC sample, 6,648 clusters only in the normal tissue sample, and 2,323 clusters in both the HCC and normal tissue samples. While it will be interesting to investigate peptide clusters found in only one sample, we further examined the spectral clusters identified in both the HCC and normal samples, since our goal is to identify and assess differentially expressed peptides quantitatively.

The next step was to perform a beta-binomial test to isolate differentially expressed peptides between the HCC and normal tissue samples. This test resulted in 84 peptides with significantly differential spectral counts between the HCC and normal tissue samples.

We independently identified 50 and 95 peptides by SEQUEST, of which 24 and 56 peptides, respectively, were found to be known biomarkers for the human liver cancer. Comparing Q-FISH and SEQUEST results, we found 22 of the differentially expressed 84 peptides by Q-FISH were also identified by SEQUEST.

Remarkably, of these 22 peptides discovered both by Q-FISH and SEQUEST, 13 peptides are known for human liver cancer and the remaining 9 peptides are known to be associated with other cancers.

Conclusions: We proposed a novel statistical method, Q-FISH, for accurately identifying protein species and simultaneously quantifying the expression levels of identified peptides from mass spectrometry data. Q-FISH analysis on human HCC and liver tissue samples identified many protein biomarkers that are highly relevant to HCC.

Q-FISH can be a useful tool both for peptide identification and quantification on mass spectrometry data analysis. It may also prove to be more effective in discovering novel protein biomarkers than SEQUEST and other standard methods.

Authors: Seungmook Lee, Min-Seok Kwon, Hyoung-Joo Lee, Young-Ki Paik, Haixu Tang, Jae Lee, Taesung Park
Credits/Source: BMC Bioinformatics 2011, 12:423
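
The abstract above describes comparing all pairs of experimental spectra and grouping replicated spectra from the same peptide. As a rough illustration of that idea only, the sketch below bins each spectrum by m/z, scores pairs with cosine similarity, and merges pairs above a threshold with a union-find structure. This is not the Q-FISH algorithm: the similarity measure, the bin width, the threshold and all function names are assumptions.

```python
from itertools import combinations
from math import sqrt


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins: {bin_index: intensity}."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine(u, v):
    """Cosine similarity between two sparse binned spectra."""
    dot = sum(val * v.get(key, 0.0) for key, val in u.items())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_spectra(spectra, threshold=0.8, bin_width=1.0):
    """Group spectra whose pairwise cosine similarity reaches `threshold`.

    Returns a sorted list of clusters, each a list of spectrum indices.
    Single-linkage merging via union-find (an illustrative choice).
    """
    binned = [bin_spectrum(s, bin_width) for s in spectra]
    parent = list(range(len(spectra)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(spectra)), 2):
        if cosine(binned[i], binned[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(spectra)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

In a sketch like this, the number of spectra in a cluster plays the role of a spectral count for the peptide that the cluster represents, which is the quantity a count-based test would then compare between samples.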

Wednesday, October 5, 2011

mz5: Space- and time-efficient storage of mass spectrometry data sets

"Across a host of mass spectrometry (MS)-driven -omics fields, researchers witness the acquisition of ever increasing amounts of high throughput MS datasets and the need for their compact yet efficiently accessible storage has become clear.
The HUPO proteomics standard initiative (PSI) has defined an ontology and associated controlled vocabulary that specifies the contents of MS data files in terms of an open data format. Current implementations are the mzXML and mzML formats (mzML specification), both of which are based on an XML representation of the data. As a consequence, these formats are not particularly efficient with respect to their storage space requirements or I/O performance.
This contribution introduces mz5, an implementation of the PSI mzML ontology that is based on HDF5, an efficient, industrial strength storage backend.
Compared to the current mzXML and mzML standards, this strategy yields an average file size reduction of a factor of ~2 and increases I/O performance ~3-4 fold.
The format is implemented as part of the ProteoWizard project."

Friday, February 11, 2011

ESS++: a C++ objected-oriented algorithm for Bayesian stochastic search model exploration

Summary: ESS++ is a C++ implementation of a fully Bayesian variable selection approach for single and multiple response linear regression. ESS++ works well both when the number of observations is larger than the number of predictors and in the ‘large p, small n’ case. In the current version, ESS++ can handle several hundred observations, thousands of predictors and a few responses simultaneously. The core engine of ESS++ for the selection of relevant predictors is based on Evolutionary Monte Carlo. Our implementation is open source, allowing community-based alterations and improvements.

Availability: C++ source code and documentation including compilation instructions are available under GNU licence at http://bgx.org.uk/software/ESS.html.

Contact: l.bottolo@imperial.ac.uk


Thursday, February 3, 2011

UNiquant: an alternative to MaxQuant

UNiquant, a Program for Quantitative Proteomics Analysis Using Stable Isotope Labeling

Abstract

Stable isotope labeling (SIL) methods coupled with nanoscale liquid chromatography and high resolution tandem mass spectrometry are increasingly useful for elucidation of the proteome-wide differences between multiple biological samples. Development of more effective programs for the sensitive identification of peptide pairs and accurate measurement of the relative peptide/protein abundance are essential for quantitative proteomic analysis. We developed and evaluated the performance of a new program, termed UNiquant, for analyzing quantitative proteomics data using stable isotope labeling. UNiquant was compared with two other programs, MaxQuant and Mascot Distiller, using SILAC-labeled complex proteome mixtures having either known or unknown heavy/light ratios. For the SILAC-labeled Jeko-1 cell proteome digests with known heavy/light ratios (H/L = 1:1, 1:5, and 1:10), UNiquant quantified a similar number of peptide pairs as MaxQuant for the H/L = 1:1 and 1:5 mixtures. In addition, UNiquant quantified significantly more peptides than MaxQuant and Mascot Distiller in the H/L = 1:10 mixtures. UNiquant accurately measured relative peptide/protein abundance without the need for postmeasurement normalization of peptide ratios, which is required by the other programs.

Keywords:

Quantitative proteomics; Stable isotope labeling; LC-MS/MS; Software Development

Monday, January 31, 2011

MaxQuant comes with a brand-new search engine

J Proteome Res. 2011 Jan 21. [Epub ahead of print]

Andromeda - a peptide search engine integrated into the MaxQuant environment.

Cox J, Neuhauser N, Michalski A, Scheltema RA, Olsen JV, Mann M.

Abstract

A key step in mass spectrometry (MS)-based proteomics is the identification of peptides in sequence databases by their fragmentation spectra. Here we describe Andromeda, a novel peptide search engine using a probabilistic scoring model. On proteome data Andromeda performs as well as Mascot, a widely used commercial search engine, as judged by sensitivity and specificity analysis based on target decoy searches. Furthermore, it can handle data with arbitrarily high fragment mass accuracy, is able to assign and score complex patterns of post-translational modifications, such as highly phosphorylated peptides and accommodates extremely large databases. The algorithms of Andromeda are provided. Andromeda can function independently or as an integrated search engine of the widely used MaxQuant computational proteomics platform and both are freely available at www.maxquant.org. The combination enables analysis of large data sets in a simple analysis workflow on a desktop computer. For searching individual spectra Andromeda is also accessible via a web server. We demonstrate the flexibility of the system by implementing the capability to identify co-fragmented peptides, significantly improving the total number of identified peptides.
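
The abstract judges search engines by target-decoy searches. As a reminder of how that evaluation usually works, here is a minimal sketch of one common false discovery rate estimate: the number of decoy hits divided by the number of target hits at or above a score threshold. It says nothing about Andromeda's own scoring model; the function names and the `(score, is_decoy)` input format are assumptions.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR among peptide-spectrum matches scoring >= threshold.

    `psms` is a list of (score, is_decoy) tuples (hypothetical format).
    Uses the simple decoys / targets estimate.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets


def score_cutoff(psms, max_fdr=0.01):
    """Lowest observed score whose estimated FDR is <= max_fdr, or None."""
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, threshold) <= max_fdr:
            best = threshold
    return best


if __name__ == "__main__":
    example = [(90, False), (80, False), (70, True), (60, False), (50, True)]
    print(score_cutoff(example, max_fdr=0.01))
```

Comparing two engines at the same estimated FDR, by counting how many target matches each one retains, is the kind of sensitivity and specificity comparison the abstract refers to.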

Tuesday, December 28, 2010

A performance enhanced PSI-BLAST based on hybrid alignment

Abstract

Motivation: Sequence alignment is one of the most popular tools of modern biology. NCBI's PSI-BLAST utilizes iterative model building in order to better detect distant homologs with greater sensitivity than non-iterative BLAST. However, PSI-BLAST's performance is limited by the fact that it relies on deterministic alignments. Using a semi-probabilistic alignment scheme such as Hybrid alignment should allow for better informed model building and improved identification of homologous sequences, particularly remote homologs.

Results: We have built a new version of the tool in which the Smith-Waterman alignment algorithm core is replaced by the hybrid alignment algorithm. The favorable statistical properties of the hybrid algorithm allow the introduction of position-specific gap penalties in Hybrid PSI-BLAST. This improves the position-specific modeling of protein families and results in an overall improvement of performance.

Availability: Source code is freely available for download at http://bioserv.mps.ohio-state.edu/HybridPSI, implemented in C and supported on Linux.

Contact: bundschuh@mps.ohio-state.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Monday, December 13, 2010

Cytoscape

"Cytoscape is an open source bioinformatics software platform for visualizing molecular interaction networks and integrating with gene expression profiles and other state data. Additional features are available as plugins. Plugins are available for network and molecular profiling analyses, new layouts, additional file format support and connection with databases and searching in large networks. Plugins may be developed using the Cytoscape open Java software architecture by anyone and plugin community development is encouraged"

Wednesday, October 20, 2010

ProSightPC

ProSightPC 2.0 Software

Thermo Scientific ProSightPC, the first stand-alone software for analyzing top-down proteomics data, has been enhanced to add support for middle-down and bottom-up experiments, making it an all-around tool for identification and characterization of both intact proteins and peptides.
ProSightPC 2.0 software enables high-throughput processing of all accurate-mass MS/MS data, whether from top-down, middle-down or bottom-up experiments, including the characterization of proteins with known PTMs. ProSightPC 2.0 software uses multiple search modes to determine the exact protein sequence including modifications and alternative splicing. It is the only proteomics software that allows the user to search their tandem MS data against proteome warehouses containing the known biological complexity present in UniProt.

ProSightPC 2.0 software is a complete software package for the identification and characterization of proteins, peptides, and PTMs. It features multiple search modes and can accommodate data generated with several different fragmentation techniques.
Supports top-down, middle-down, and bottom-up experiments
Includes five different search modes, including Accurate Mass, Biomarker, Sequence Tag, Single Protein and Gene Restricted search modes
Processes fragmentation data from ECD, IRMPD or CID

The proprietary ProSight Warehouse includes all known post-translational modifications (PTMs), alternative splicing events and single nucleotide polymorphisms (SNPs)
Imports FASTA databases and shotgun-annotates these databases with all possible modifications
Includes Sequence Gazer, which allows users to review search results and add, remove, or change modifications to look for better fits

Compatible with:
LTQ FT family of hybrid mass spectrometers
LTQ Orbitrap family of hybrid mass spectrometers
Proteome Discoverer software
