Using machine learning to identify antimicrobial peptides

  • Techniques like alphabet reduction can be applied to group amino acids based on their physicochemical properties
  • Distributed vector representation can then be used to transform peptides into numeric forms for ML algorithms

Lowering peptide sequence complexity

The amino acids in a peptide determine its structure and function. Each amino acid has distinct characteristics, largely determined by its side chain.

Alphabet reduction groups amino acids by their physicochemical properties, so that sequences can be studied over a smaller alphabet of groups rather than individual residues. This reduces the computational complexity of processing the sequences. Below is an example of how this takes place.
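As a minimal sketch of the idea, the snippet below collapses the 20 standard amino acids into three hydropathy groups. The grouping itself, the group labels `1`/`2`/`3`, and the name `reduce_alphabet` are illustrative assumptions and not necessarily the scheme used in this work.

```python
# Illustrative three-group hydropathy alphabet (an assumption, not the authors' grouping).
HYDROPATHY_GROUPS = {
    "1": "RKEDQN",   # polar
    "2": "GASTPHY",  # neutral
    "3": "CLVIMFW",  # hydrophobic
}

# Invert the grouping into a residue -> group-label lookup table.
RESIDUE_TO_GROUP = {
    residue: group
    for group, residues in HYDROPATHY_GROUPS.items()
    for residue in residues
}


def reduce_alphabet(sequence, mapping=RESIDUE_TO_GROUP, unknown="X"):
    """Replace each residue with its group label; unmapped symbols become `unknown`."""
    return "".join(mapping.get(residue, unknown) for residue in sequence.upper())


# An example peptide sequence, rewritten over the 3-letter alphabet.
print(reduce_alphabet("GIGKFLHSAKKFGKAFVGEIMNS"))
```

The reduced sequence has the same length as the original but is written over 3 symbols instead of 20, which shrinks the number of possible k-letter "words" from 20^k to 3^k.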

Converting sequences to numeric form

Just as letters string together to form words in a language, amino acids string together to form peptides and proteins. Vector representation, a technique commonly used in natural language processing, helps identify distinctive patterns in peptide sequences more effectively. Let us take a look at how it works.

Fig. 2 Word vectors
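The language analogy can be sketched in a few lines: a sequence is split into overlapping k-mers ("words"), each word is looked up in an embedding table, and the word vectors are combined into one fixed-length vector. Everything named here is an assumption for illustration: `toy_embedding` is a hypothetical stand-in that returns a fixed pseudo-random vector per word, whereas real embeddings would be learned from a large corpus with a word2vec-style model, and averaging is just one simple way to combine word vectors.

```python
import random


def kmer_words(sequence, k=3):
    """Split a sequence into overlapping k-mers, the 'words' of the peptide."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def toy_embedding(word, dim=4):
    """Hypothetical stand-in for a learned embedding: a fixed pseudo-random vector per word."""
    rng = random.Random(word)  # seeding with the word makes the vector reproducible
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def sequence_vector(sequence, k=3, dim=4):
    """Average the word vectors of all k-mers into one fixed-length vector."""
    words = kmer_words(sequence, k)
    if not words:
        raise ValueError("sequence is shorter than k")
    vectors = [toy_embedding(word, dim) for word in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


print(kmer_words("GIGKFLHS"))
print(sequence_vector("GIGKFLHS"))
```

Whatever the length of the peptide, the result has `dim` numbers, which is the fixed-size input that most ML algorithms expect.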

Classifying AMPs smartly

A reduced vector representation can be created by first applying alphabet reduction to the sequences and then converting the reduced sequences to numeric form using distributed vector representation. This can be done for each physicochemical property and for combinations of several properties. For example, hydropathy + conformation similarity is one combination, and hydropathy + contact energy is another. Taken together, these possibilities yielded 256 distinct combinations.
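To illustrate how quickly such combinations multiply, the sketch below enumerates every non-empty subset of a property list. It uses only the three properties named above, since the full list is not given here; the helper name `property_combinations` is an assumption. In general, n properties give 2^n - 1 non-empty subsets.

```python
from itertools import combinations

# Only the three properties named in the text; the full list is not given here.
properties = ["hydropathy", "conformation similarity", "contact energy"]


def property_combinations(props):
    """Every non-empty subset of the given properties, smallest subsets first."""
    return [
        combo
        for size in range(1, len(props) + 1)
        for combo in combinations(props, size)
    ]


combos = property_combinations(properties)
for combo in combos:
    print(" + ".join(combo))
print(len(combos))  # 2**3 - 1 = 7 non-empty subsets
```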

Fig. 3 Workflow for binary classification of AMP
  1. Vector representation of sequences in the curated dataset using the embeddings created from the Swiss-Prot data, with and without reduction techniques (left)
  2. Machine learning models to classify the sequences using their vector representations (right)
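The second stage of the workflow can be sketched with a tiny binary classifier trained on sequence vectors. The choice of logistic regression, the made-up 2-D vectors, and the function names are all assumptions for illustration; the text does not specify which models were used, and a real pipeline would typically rely on an established ML library and proper train/test splits.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(vectors, labels, learning_rate=0.5, epochs=500):
    """Fit weights and a bias with plain stochastic gradient descent on log loss."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            prediction = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = prediction - y  # gradient of log loss w.r.t. the linear score
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def predict(vector, weights, bias, threshold=0.5):
    """Return 1 (AMP) if the predicted probability reaches the threshold, else 0."""
    score = sum(w * xi for w, xi in zip(weights, vector)) + bias
    return 1 if sigmoid(score) >= threshold else 0


# Made-up 2-D sequence vectors: label 1 = AMP, label 0 = non-AMP.
train_x = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.95], [0.1, 0.2], [0.2, 0.1], [0.15, 0.3]]
train_y = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic_regression(train_x, train_y)
print([predict(x, weights, bias) for x in train_x])
```

In the workflow described above, the same classifier would be trained once on vectors from unreduced sequences and once per reduced representation, so the results can be compared.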

Contributing to drug discovery

Scientists and pharma researchers are welcome to use this model to predict antimicrobial activity in their sequences. Sequences flagged as antimicrobial can then be studied further for their potential as drug candidates against microbes. Such computational methods are of immense use in the long and complex process of drug discovery: they can not only reduce the time spent on laboratory work but also surface diverse data-driven patterns and features that would otherwise be overlooked.

Thoughtworks

A community of passionate individuals whose purpose is to revolutionize software design, creation and delivery, while advocating for positive social change.