Child pages
  • Abstract: DeepSNAP: Scalable Machine Learning for Mass Spectrometry based Proteomics, Y1

Historically, there have been two opposing approaches to peptide deduction from mass-spectrometry data: de novo sequencing and database searching. The de novo approach tries to transform spectral space into peptide space by predicting individual amino acids from a given spectrum. Database search, on the other hand, tries to associate experimental spectra with existing peptides by transforming peptide space into spectral space and performing the comparisons there. Each approach uses a heuristic similarity function that correlates, more or less, with the match quality between an experimental spectrum and its corresponding peptide. With heuristics, there is no solid reasoning for why one function is chosen over another, or why a given feature within a function carries its particular weight. In this project, we design and implement a deep learning framework called DeepSNAP, which offers a middle ground between these two opposing techniques. DeepSNAP, which stands for "Deep Similarity Network for Proteomics", transforms the spectral and peptide spaces into a shared Euclidean subspace by learning embeddings for both spectra and peptides. Transformation to Euclidean space has numerous advantages, e.g. simpler comparison using distance functions (L1, L2), the ability to cluster data to find similarities, reduced dimensionality, and efficient large-scale comparisons of the kind required in proteogenomics search. We tackle the problem by training a similarity network to learn a similarity function that yields high-quality peptide-spectrum matches. The network is trained on triplets (Q, P, N) using a custom-designed "SNAP-Loss" function. Each triplet consists of a query spectrum Q, a positive peptide P, and a negative peptide N. By training the network on a dataset of nearly 3 million triplets, obtained from the NIST peptide library and the MassIVE-KB spectral library, we achieve an accuracy of 99.7%.
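To make the triplet setup concrete, the following is a minimal sketch of how a loss over (Q, P, N) embeddings can be computed. The exact form of the SNAP-Loss is not given here, so the sketch substitutes the standard triplet margin loss with L2 distance as a stand-in. The function names, the margin value of 0.2, and the two-dimensional toy embeddings are all assumptions made for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(q, p, n, margin=0.2):
    """Standard triplet margin loss, used here as a stand-in for SNAP-Loss.

    q: embedding of the query spectrum Q
    p: embedding of the positive peptide P
    n: embedding of the negative peptide N
    The loss is zero once N is at least `margin` farther from Q than P is.
    """
    return max(0.0, l2_distance(q, p) - l2_distance(q, n) + margin)


# Toy embeddings (hypothetical values): P lies close to Q, N lies far away.
q = [0.0, 0.0]
p = [0.3, 0.4]
n = [3.0, 4.0]

well_separated = triplet_margin_loss(q, p, n)  # no penalty for this triplet
violated = triplet_margin_loss(q, n, p)        # roles swapped, so penalized
```

Minimizing a loss of this shape pulls each spectrum toward its matching peptide and pushes it away from non-matching ones, which is what allows a plain distance comparison to serve as the similarity function at search time.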
We are extending our framework to run on a distributed system to enable large-scale training, iterative deepening of the network, and database search over search spaces orders of magnitude larger than what is possible with the state of the art. Our distributed framework is a generic proteomics database search framework into which any scoring model can be plugged effectively. For the shared-peak-count based model, our framework searches 1.6 million experimental MS/MS spectra against a search space of 1.38 billion theoretical spectra (∼1 terabyte) in open-search mode within 3.7 hours using 72 XSEDE Comet cluster nodes (from another allocation).
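As an illustration of a pluggable scoring model, the sketch below implements a simple shared-peak count and a search routine that accepts any scoring function. It is a single-machine toy under stated assumptions, not the distributed framework itself: the function names, the 0.01 Da tolerance, the one-to-one peak matching rule, and the candidate names are all hypothetical.

```python
def shared_peak_count(experimental_mz, theoretical_mz, tol=0.01):
    """Count theoretical peaks matched by an experimental peak within `tol`.

    Both inputs are lists of m/z values. Each experimental peak is matched
    at most once (an assumption of this sketch).
    """
    exp = sorted(experimental_mz)
    theo = sorted(theoretical_mz)
    i = 0
    count = 0
    for t in theo:
        # Skip experimental peaks that are too small to match this peak.
        while i < len(exp) and exp[i] < t - tol:
            i += 1
        if i < len(exp) and abs(exp[i] - t) <= tol:
            count += 1
            i += 1  # consume the matched experimental peak
    return count


def search(spectrum_mz, database, score_fn=shared_peak_count):
    """Score one spectrum against every candidate, best match first.

    `database` maps a candidate name to its theoretical m/z list, and
    `score_fn` can be any function with the same two-argument interface.
    """
    scored = [(name, score_fn(spectrum_mz, peaks))
              for name, peaks in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy data for illustration only.
spectrum = [100.0, 200.005, 300.5]
database = {
    "candidate_a": [100.004, 200.0, 300.0, 400.0],
    "candidate_b": [150.0, 250.0],
}
ranking = search(spectrum, database)
```

Because `search` only depends on the call signature of `score_fn`, a learned model such as an embedding distance could be swapped in without changing the search loop.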
