Optimized PyTorch 2.0 inference with AWS Graviton processors

New generations of CPUs offer significant performance improvements in machine learning (ML) inference thanks to specialized built-in instructions. Combined with their flexibility, fast development cycles, and low operating cost, these general-purpose processors offer an alternative to existing dedicated hardware solutions.

AWS, Arm, Meta, and others helped optimize PyTorch 2.0 inference performance for Arm-based processors. As a result, we are happy to announce that AWS Graviton-based instance inference performance for PyTorch 2.0 is up to 3.5 times faster for ResNet50 compared to the previous PyTorch release (see the chart below), and up to 1.4 times faster for BERT, making Graviton-based instances the fastest compute-optimized instances on AWS for these models.

AWS measured up to 50% cost savings for PyTorch inference with Amazon Elastic Compute Cloud (Amazon EC2) C7g instances based on AWS Graviton3 on Torch Hub ResNet50 and various Hugging Face models relative to comparable EC2 instances, as shown in the figure below.

In addition, the inference latency is also reduced, as shown in the figure below.

We’ve seen a similar price-performance advantage for other workloads on Graviton, for example video encoding with FFmpeg.

Optimization details

The optimizations focused on three key areas:

  • GEMM kernels – PyTorch supports Arm Compute Library (ACL) GEMM kernels via the oneDNN (formerly MKL-DNN) backend for Arm-based processors. The ACL library provides GEMM kernels optimized for Neon and SVE, for both fp32 and bfloat16 formats. These kernels improve SIMD hardware utilization and reduce end-to-end inference latency.
  • bfloat16 support – bfloat16 support in Graviton3 allows efficient deployment of models trained with bfloat16, fp32, and AMP (Automatic Mixed Precision). Standard fp32 models are run with bfloat16 kernels via the oneDNN fast math mode, without model quantization, providing up to 2x faster performance compared to fp32 model inference without bfloat16 fast math support.
  • Primitive caching – We’ve also implemented primitive caching for the conv, matmul, and inner product operators to avoid redundant GEMM kernel initialization and tensor allocation.

How to take advantage of optimizations

The easiest way to get started is to use AWS Deep Learning Containers (DLC) on Amazon Elastic Compute Cloud (Amazon EC2) or Amazon SageMaker C7g instances. DLCs are available from Amazon Elastic Container Registry (Amazon ECR) for AWS Graviton or x86. For more information about SageMaker, see Run machine learning inference workloads on AWS Graviton-based instances with Amazon SageMaker and Amazon SageMaker adds eight new Graviton-based instances for model deployment.


To use the AWS DLCs, use the following code:

sudo apt-get update
sudo apt-get -y install awscli docker

# Login to ECR to avoid image download throttling
aws ecr get-login-password --region us-east-1 \
| docker login --username AWS --password-stdin

# Pull the AWS DLC for pytorch
# Graviton
docker pull

# x86
docker pull

If you prefer to install PyTorch via pip, install the PyTorch 2.0 wheel from the official repo. In this case, you will need to set two environment variables, as shown in the code below, before starting PyTorch to enable the Graviton optimizations.

Use the Python wheel

To use the Python wheel, see the following code:

# Install Python
sudo apt-get update
sudo apt-get install -y python3 python3-pip

# Upgrade pip3 to the latest version
python3 -m pip install --upgrade pip

# Install PyTorch and extensions
python3 -m pip install torch
python3 -m pip install torchvision torchaudio torchtext

# Turn on Graviton3 optimization
export DNNL_DEFAULT_FPMATH_MODE=BF16
export LRU_CACHE_CAPACITY=1024

Run the inference

You can use PyTorch TorchBench to measure CPU inference performance improvements or to compare different types of instances:

# Pre-requisite: 
# pull and run the AWS DLC
# or 
# pip install the PyTorch 2.0 wheels and set the previously mentioned environment variables

# Clone PyTorch benchmark repo
git clone

# Setup Resnet50 benchmark
cd benchmark
python3 install.py resnet50

# Install the dependent wheels
python3 -m pip install numba

# Run Resnet50 inference in jit mode. On successful completion of the inference runs,
# the script prints the inference latency and accuracy results
python3 run.py resnet50 -d cpu -m jit -t eval --use_cosine_similarity


You can use Amazon SageMaker Inference Recommender to automate performance benchmarking across instance types. With Inference Recommender, you can find the real-time inference endpoint that delivers the best performance at the lowest cost for a given ML model. We collected the above data using the Inference Recommender notebooks by deploying the models on production endpoints. For more details on Inference Recommender, see the GitHub repository. We compared the following models for this post: ResNet50 image classification, DistilBERT sentiment analysis, RoBERTa fill mask, and RoBERTa sentiment analysis.


AWS measured up to 50% cost savings for PyTorch inference with Amazon Elastic Compute Cloud (Amazon EC2) C7g instances based on AWS Graviton3 on Torch Hub ResNet50 and various Hugging Face models relative to comparable EC2 instances. These instances are available on SageMaker and Amazon EC2. The AWS Graviton Technical Guide provides the list of optimized libraries and best practices that will help you achieve cost benefits with Graviton instances across different workloads.

If you find use cases where you don’t see similar performance gains on AWS Graviton, please open an issue in the AWS Graviton Technical Guide to let us know. We will continue to add more performance improvements to make Graviton the most cost-effective and efficient general-purpose processor for inference with PyTorch.

About the author

Sunita Nadampalli is a software development manager at AWS. She leads Graviton software performance optimizations for machine learning, HPC, and multimedia workloads. She is passionate about open source development and delivering cost-effective software solutions with Arm SoCs.


