
Accelerate protein structure prediction with the ESMFold language model on Amazon SageMaker

Proteins drive many biological processes, such as enzyme activity, molecular transport, and cellular support. The three-dimensional structure of a protein provides information about its function and how it interacts with other biomolecules. Experimental methods for determining protein structure, such as X-ray crystallography and NMR spectroscopy, are expensive and time-consuming.

In contrast, recently developed computational methods can quickly and accurately predict the structure of a protein from its amino acid sequence. These methods are critical for proteins that are difficult to study experimentally, such as membrane proteins, the targets of many drugs. A well-known example of this is AlphaFold, a deep learning-based algorithm known for its accurate predictions.

ESMFold is another highly accurate deep learning-based method developed to predict the structure of proteins from their amino acid sequence. ESMFold uses a large protein language model (pLM) as its backbone and works end-to-end. Unlike AlphaFold2, it does not need any search or multiple sequence alignment (MSA) steps, nor does it rely on external databases to generate predictions. Instead, the development team trained the model on millions of protein sequences from UniRef. During training, the model developed patterns of attention that elegantly represent the evolutionary interactions between amino acids in the sequence. This use of a pLM instead of an MSA allows prediction times up to 60 times faster than other state-of-the-art models.

In this post, we use the pre-trained ESMFold model from Hugging Face with Amazon SageMaker to predict the heavy chain structure of trastuzumab, a monoclonal antibody developed by Genentech for the treatment of HER2-positive breast cancer. Quickly predicting the structure of this protein could be useful if researchers wanted to test the effect of sequence modifications. This could lead to improved patient survival or fewer side effects.

This post provides an example Jupyter notebook and related scripts in the following GitHub repository.

Prerequisites

We recommend that you run this example in an Amazon SageMaker Studio notebook with the PyTorch 1.13 Python 3.9 CPU-optimized image on an ml.r5.xlarge instance type.
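The PyTorch image comes with torch pre-installed; the remaining Python dependencies can be installed directly from the notebook. A minimal sketch (check the repository's requirements files for the exact versions used):

%pip install -q transformers accelerate py3Dmol biopython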

View the experimental structure of trastuzumab

To begin with, we use the biopython library and a helper script to download the trastuzumab structure from the RCSB Protein Data Bank:

from Bio.PDB import PDBList, MMCIFParser
from prothelpers.structure import atoms_to_pdb

# Download the trastuzumab-HER2 complex (PDB ID 1N8Z) from RCSB
target_id = "1N8Z"
pdbl = PDBList()
filename = pdbl.retrieve_pdb_file(target_id, pdir="data")

# Parse the mmCIF file and convert the structure to PDB-format text
parser = MMCIFParser()
structure = parser.get_structure(target_id, filename)
pdb_string = atoms_to_pdb(structure)

Next, we use the py3Dmol library to visualize the structure as an interactive 3D visualization:

import py3Dmol

view = py3Dmol.view()
view.addModel(pdb_string)
view.setStyle({"chain": "A"}, {"cartoon": {"color": "orange"}})
view.setStyle({"chain": "B"}, {"cartoon": {"color": "blue"}})
view.setStyle({"chain": "C"}, {"cartoon": {"color": "green"}})
view.show()

The figure below represents the 1N8Z 3D protein structure from the Protein Data Bank (PDB). In this image, the trastuzumab light chain is shown in orange, the heavy chain is blue (with the variable region in light blue), and the HER2 antigen is green.

We will first use ESMFold to predict the structure of the heavy chain (Chain B) from its amino acid sequence. Next, we will compare the prediction with the experimentally determined structure shown above.
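The prediction code below uses a variable named experimental_sequence that holds the heavy chain's amino acid sequence. As a minimal sketch, it could be extracted from the parsed structure with Biopython like this (the filtering logic here is illustrative; the repository's helper scripts may do this differently):

from Bio.SeqUtils import seq1

# Extract the one-letter amino acid sequence of the heavy chain (chain B)
chain_b = structure[0]["B"]
experimental_sequence = "".join(
    seq1(residue.get_resname())
    for residue in chain_b.get_residues()
    if residue.id[0] == " "  # skip waters and other heteroatoms
)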

Predict the structure of trastuzumab heavy chain from its sequence using ESMFold

We use the ESMFold model to predict the heavy chain structure and compare it with the experimental result. To get started, we use a pre-built notebook environment in Studio that comes with several important libraries, such as PyTorch, pre-installed. Although we could use an accelerated instance type to improve notebook performance, we will instead use a non-accelerated instance and run the ESMFold prediction on a CPU.

First, let’s load the pre-trained ESMFold model and tokenizer from the Hugging Face Hub:

from transformers import AutoTokenizer, EsmForProteinFolding

tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1")
model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1", low_cpu_mem_usage=True)

Next, we copy the model to our device (in this case CPU) and set some model parameters:

import torch

# Run on the CPU and cast the language model to full precision for CPU inference
device = torch.device("cpu")
model.esm = model.esm.float()
model = model.to(device)

# Chunk the trunk attention computation to trade speed for lower memory use
model.trunk.set_chunk_size(64)

To prepare the protein sequence for analysis, we need to tokenize it. This translates the amino acid symbols (EVQLV…) into a numeric format that the ESMFold model can understand (6,19,5,10,19,…):

tokenized_input = tokenizer([experimental_sequence], return_tensors="pt", add_special_tokens=False)["input_ids"]
tokenized_input = tokenized_input.to(device)
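To sanity-check the mapping, we can print the first few token IDs, which should match the numeric example above:

print(tokenized_input[0][:5].tolist())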

We then make a prediction and save the result to a file. (The infer_pdb convenience method works from the raw sequence, handling tokenization internally.)

with torch.no_grad():
    notebook_prediction = model.infer_pdb(experimental_sequence)

with open("data/prediction.pdb", "w") as f:
    f.write(notebook_prediction)

This takes about 3 minutes on a non-accelerated instance type, like an r5.

We can check the accuracy of the ESMFold prediction by comparing it with the experimental structure. We do this with the US-Align tool developed by the Zhang Lab at the University of Michigan:

from prothelpers.usalign import tmscore

tmscore("data/prediction.pdb", "data/experimental.pdb", pymol="data/superimposed")

PDB Chain 1              PDB Chain 2                TM-Score
data/prediction.pdb:A    data/experimental.pdb:B    0.802

The template modeling score (TM-score) is a metric for evaluating the similarity of protein structures. A score of 1.0 indicates a perfect match. Scores greater than 0.7 indicate that the proteins share the same backbone structure. Scores greater than 0.9 indicate that the proteins are functionally interchangeable for downstream use. With a TM-score of 0.802, the ESMFold prediction would likely be suitable for applications such as structure scoring or ligand binding experiments, but may not fit use cases such as molecular replacement that require extremely high precision.
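To make those thresholds concrete, here is a small illustrative helper (the function is ours; the cutoffs are the ones quoted above):

def interpret_tm_score(tm: float) -> str:
    # Rough interpretation bands for TM-scores, per the thresholds above
    if tm > 0.9:
        return "functionally interchangeable for downstream use"
    if tm > 0.7:
        return "shares the same backbone structure"
    return "lower structural similarity; inspect before use"

print(interpret_tm_score(0.802))  # shares the same backbone structure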

We can validate this result by visualizing the aligned structures. As the figure below shows, the experimental (blue) and predicted (red) structures of the trastuzumab heavy chain are very similar, but not identical: the two show a high, but not perfect, degree of overlap. Protein structure prediction is a rapidly evolving field, and many research teams are developing increasingly accurate algorithms!
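As a sketch, the superimposed coordinates written by the tmscore helper can be rendered with py3Dmol. The output path below is an assumption; check which files the helper actually writes:

# Assumed output from the tmscore call above -- adjust the path as needed
with open("data/superimposed.pdb") as f:
    aligned_view = py3Dmol.view()
    aligned_view.addModel(f.read())
    aligned_view.setStyle({"cartoon": {"color": "spectrum"}})
    aligned_view.show()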

Deploy ESMFold as a SageMaker inference endpoint

Running model inference in a notebook is fine for experimentation, but what if you need to integrate your model with an application? Or an MLOps pipeline? In this case, a better option is to deploy your model as an inference endpoint. In the following example, we will implement ESMFold as a SageMaker real-time inference endpoint on an accelerated instance. SageMaker real-time endpoints provide a scalable, cost-effective and secure way to deploy and host machine learning (ML) models. With auto-scaling, you can adjust the number of instances running the endpoint to meet the demands of your application, optimizing costs and ensuring high availability.
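For example, after the endpoint exists you can attach a target-tracking scaling policy with the Application Auto Scaling API. A minimal sketch, assuming an endpoint named my-esmfold-endpoint with the default variant name AllTraffic:

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-esmfold-endpoint/variant/AllTraffic"

# Allow SageMaker to scale this variant between one and four instances
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Add or remove instances to hold invocations per instance near the target
autoscaling.put_scaling_policy(
    PolicyName="esmfold-invocations-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 50.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)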

The pre-built SageMaker container for Hugging Face makes it easy to deploy deep learning models for common tasks. However, for new use cases like protein structure prediction, we need to define a custom inference.py script to load the model, run the prediction, and format the output. This script includes much of the same code we used in our notebook. We also create a requirements.txt file to define some Python dependencies for our endpoint to use. You can see the files we created in the GitHub repository.
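For orientation, here is a minimal sketch of the hooks such a script typically implements with the SageMaker Hugging Face inference toolkit; the actual inference.py in the repository is more complete:

# inference.py (illustrative sketch)
import torch
from transformers import EsmForProteinFolding

def model_fn(model_dir):
    # Called once at container startup: load the model weights from model_dir
    model = EsmForProteinFolding.from_pretrained(model_dir, low_cpu_mem_usage=True)
    return model.to("cuda" if torch.cuda.is_available() else "cpu")

def predict_fn(data, model):
    # data is the deserialized request payload -- here, the raw sequence string
    with torch.no_grad():
        return [model.infer_pdb(data)]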


After you have created the necessary files in the code directory, we deploy our model using the SageMaker HuggingFaceModel class. This uses a pre-built container to simplify the process of deploying Hugging Face models to SageMaker. Note that it may take 10 minutes or more to create the endpoint, depending on the availability of ml.g4dn instance types in our Region.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel
from datetime import datetime

# Define the model, pointing at the artifact in S3 and our custom inference code
huggingface_model = HuggingFaceModel(
    model_data=model_artifact_s3_uri,  # Previously staged in S3
    name="esmfold-v1-model-" + datetime.now().strftime("%Y%m%d%s"),
    transformers_version="4.17",
    pytorch_version="1.10",
    py_version="py38",
    role=role,
    source_dir="code",
    entry_point="inference.py",
)

# Deploy to a single GPU-accelerated instance
rt_predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.2xlarge",
    endpoint_name="my-esmfold-endpoint",
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)

When the endpoint deployment is complete, we can resubmit the protein sequence and display the first rows of the prediction:

endpoint_prediction = rt_predictor.predict(experimental_sequence)[0]
print(endpoint_prediction[:900])

Since we deployed our endpoint on an accelerated instance, the prediction should only take a few seconds. Each row of the result corresponds to a single atom and includes the amino acid identity, three spatial coordinates, and a pLDDT score representing the prediction confidence at that location.

ATOM      1  N   GLU A   1      14.578 -19.953   1.470  1.00  0.83           N
ATOM      2  CA  GLU A   1      13.166 -19.595   1.577  1.00  0.84           C
ATOM      3  C   GLU A   1      12.737 -18.693   0.423  1.00  0.86           C
ATOM      4  CB  GLU A   1      12.886 -18.906   2.915  1.00  0.80           C
ATOM      5  O   GLU A   1      13.417 -17.715   0.106  1.00  0.83           O
ATOM      6  CG  GLU A   1      11.407 -18.694   3.200  1.00  0.71           C
ATOM      7  CD  GLU A   1      11.141 -18.042   4.548  1.00  0.68           C
ATOM      8  OE1 GLU A   1      12.108 -17.805   5.307  1.00  0.68           O
ATOM      9  OE2 GLU A   1       9.958 -17.767   4.847  1.00  0.61           O
ATOM     10  N   VAL A   2      11.678 -19.063  -0.258  1.00  0.87           N
ATOM     11  CA  VAL A   2      11.207 -18.309  -1.415  1.00  0.87           C
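Because the pLDDT score lands in the B-factor column of the PDB output, a quick confidence summary takes only a few lines (a sketch; the column offsets follow the fixed-width PDB format):

# Average pLDDT over all atoms; columns 61-66 hold the per-atom confidence
plddts = [
    float(line[60:66])
    for line in endpoint_prediction.splitlines()
    if line.startswith("ATOM")
]
print(f"Mean pLDDT: {sum(plddts) / len(plddts):.2f}")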

Using the same method as before, we see that the notebook and endpoint predictions are identical.
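A sketch of that comparison, assuming we first write the endpoint output to data/endpoint_prediction.pdb:

# Persist the endpoint result, then score it against the notebook prediction
with open("data/endpoint_prediction.pdb", "w") as f:
    f.write(endpoint_prediction)

tmscore("data/endpoint_prediction.pdb", "data/prediction.pdb", pymol="data/superimposed_2")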

PDB Chain 1                       PDB Chain 2              TM-Score
data/endpoint_prediction.pdb:A    data/prediction.pdb:A    1.0

As seen in the figure below, ESMFold predictions generated on the notebook (red) and endpoint (blue) show perfect alignment.

Clean up

To avoid further charges, we delete our inference endpoint and test data:

import os

# Delete the endpoint, then remove the S3 artifacts and local working files
rt_predictor.delete_endpoint()
bucket = boto_session.resource("s3").Bucket(bucket)
bucket.objects.filter(Prefix=prefix).delete()
os.system("rm -rf data obsolete code")

Summary

Computational protein structure prediction is a critical tool for understanding protein function. In addition to basic research, algorithms such as AlphaFold and ESMFold have many applications in medicine and biotechnology. The structural insights generated by these models help us better understand how biomolecules interact. This can lead to better diagnostic tools and therapies for patients.

In this post, we show how to deploy Hugging Face Hub’s ESMFold protein language model as a scalable inference endpoint using SageMaker. For more information about implementing Hugging Face models in SageMaker, see Using Hugging Face with Amazon SageMaker. You can also find more protein science examples in the Awesome Protein Analysis on AWS GitHub repo. Please leave us a comment if there are other examples you’d like to see!


About the Authors

Brian Leal is a senior AI/ML solutions architect on the global health and life sciences team at Amazon Web Services. He has over 17 years of experience in biotechnology and machine learning, and is passionate about helping clients solve genomic and proteomic challenges. In his free time, he likes to cook and eat with his friends and family.

Shamika Ariyawansa is a solutions architect specializing in AI/ML in the global health and life sciences team at Amazon Web Services. He passionately works with customers to accelerate their adoption of AI and ML by providing technical guidance and helping them innovate and build secure cloud solutions on AWS. Outside of work, he loves skiing and off-roading.

Yanjun Qi is a senior manager of applied science in the AWS Machine Learning Solutions Lab. She innovates and applies machine learning to help AWS customers accelerate their AI and cloud adoption.
