Wednesday, October 5, 2011

mz5: Space- and time-efficient storage of mass spectrometry data sets

"Across a host of mass spectrometry (MS)-driven -omics fields, researchers witness the acquisition of ever-increasing amounts of high-throughput MS datasets, and the need for their compact yet efficiently accessible storage has become clear.
The HUPO Proteomics Standards Initiative (PSI) has defined an ontology and associated controlled vocabulary that specifies the contents of MS data files in terms of an open data format. Current implementations are the mzXML and mzML formats (mzML specification), both of which are based on an XML representation of the data. As a consequence, these formats are not particularly efficient with respect to their storage space requirements or I/O performance.
This contribution introduces mz5, an implementation of the PSI mzML ontology that is based on HDF5, an efficient, industrial-strength storage backend.
Compared to the current mzXML and mzML standards, this strategy yields an average file size reduction by a factor of ~2 and increases I/O performance ~3-4-fold.
The format is implemented as part of the ProteoWizard project."
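
To give a rough feel for why a text-based XML container costs space, here is a minimal sketch in Python. It assumes that numeric arrays in an XML file are carried as base64 text, which is how mzXML and mzML embed peak data as far as I understand, and compares that with the same values stored as raw packed binary. The m/z values and function names are made up for illustration. This is not the mz5 implementation and it does not reproduce the factor of ~2 quoted in the abstract; it shows only one source of overhead and ignores XML tags and compression.

```python
import base64
import struct


def raw_binary_size(values):
    """Bytes needed to store the values as packed little-endian 64-bit floats."""
    return len(struct.pack(f"<{len(values)}d", *values))


def base64_text_size(values):
    """Bytes needed when the same packed floats are base64-encoded as text."""
    packed = struct.pack(f"<{len(values)}d", *values)
    return len(base64.b64encode(packed))


# Hypothetical m/z values, for illustration only (not real spectrum data).
mz_values = [400.0 + 0.01 * i for i in range(3000)]

raw = raw_binary_size(mz_values)
encoded = base64_text_size(mz_values)

print(f"raw binary:  {raw} bytes")
print(f"base64 text: {encoded} bytes")
print(f"overhead:    {encoded / raw:.2f}x")
```

Base64 turns every 3 bytes into 4 characters, so the text form is about a third larger before any XML markup is added. A binary container such as HDF5 can store the packed values directly.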

