Running machine learning (ML) workloads with containers is becoming common practice. Containers can fully encapsulate not only your training code, but the entire dependency stack down to the hardware libraries and drivers. What you get is an ML development environment that is consistent and portable. With containers, scaling in a cluster is much easier.
In late 2022, AWS announced the general availability of Amazon EC2 Trn1 instances powered by AWS Trainium accelerators, which are designed for high-performance deep learning training. Trn1 instances offer up to 50% savings in training costs compared to comparable Amazon Elastic Compute Cloud (Amazon EC2) instances. AWS also released the AWS Neuron SDK, which gives developers the tools to compile, run, and profile workloads on these accelerators to achieve high-performance and cost-effective model training.
Amazon Elastic Container Service (Amazon ECS) is a fully managed container orchestration service that simplifies the deployment, management, and scaling of containerized applications. Simply describe your application and the required resources, and Amazon ECS will launch, monitor, and scale your application through flexible compute options with automatic integrations to other supporting AWS services your application needs.
In this post, we show you how to run your ML training jobs in a container using Amazon ECS to deploy, manage, and scale your ML workload.
We explain the following high-level steps:
- Provision an ECS cluster of Trn1 instances with AWS CloudFormation.
- Create a custom container image with the Neuron SDK and submit it to Amazon Elastic Container Registry (Amazon ECR).
- Create a task definition to define an ML training job that Amazon ECS will run.
- Run the ML task on Amazon ECS.
To follow along, you should be familiar with core AWS services such as Amazon EC2 and Amazon ECS.
Provision an ECS cluster of Trn1 instances
To get started, launch the provided CloudFormation template, which will provision the necessary resources such as a VPC, an ECS cluster, and an EC2 Trainium instance.
We use the Neuron SDK to run deep learning workloads on AWS Inferentia and Trainium-based instances. It supports you in your end-to-end ML development lifecycle to build new models, optimize them, and then deploy them to production. To train your model with Trainium, you need to install the Neuron SDK in two places: on the EC2 instances where the ECS tasks will run, to map the NeuronDevice associated with the hardware, and in the Docker image that will be pushed to Amazon ECR, to access the commands needed to train your model.
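As a rough illustration of what mapping the NeuronDevice into a container can look like, the sketch below builds the `linuxParameters` fragment of an ECS container definition. It assumes the Neuron driver exposes devices on the host as `/dev/neuron0`, `/dev/neuron1`, and so on. The helper name `neuron_device_mappings` is hypothetical and is not taken from the post's template.

```python
def neuron_device_mappings(device_count):
    """Build an ECS `linuxParameters` fragment exposing host Neuron devices.

    Assumption: devices appear on the host as /dev/neuron0 .. /dev/neuron{N-1}.
    """
    if device_count < 1:
        raise ValueError("device_count must be at least 1")
    devices = [
        {
            "hostPath": f"/dev/neuron{index}",
            "containerPath": f"/dev/neuron{index}",
            "permissions": ["read", "write"],
        }
        for index in range(device_count)
    ]
    return {"devices": devices}


if __name__ == "__main__":
    # One device is used here purely as an example value.
    print(neuron_device_mappings(1))
```

The returned dictionary would sit under the `linuxParameters` key of a container definition. The right device count depends on the instance size you provision.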
Standard versions of Amazon Linux 2 or Ubuntu 20 do not come with the AWS Neuron drivers installed, so there are two options.
The first option is to use a Deep Learning Amazon Machine Image (DLAMI) that already has the Neuron SDK installed. A sample is available in the GitHub repository. You can choose a DLAMI based on the operating system. Then run the following command to get the AMI ID:
The output will be as follows:
This AMI ID may change over time, so be sure to use the command to get the correct AMI ID.
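Because the AMI ID changes over time, one way to look it up programmatically is sketched below with boto3. This is not the command from the original post. The function names are hypothetical, and `name_pattern` must be supplied by you, since the exact DLAMI name filter is not given here. The selection logic is kept in a pure function so it can be checked without AWS credentials.

```python
def latest_image_id(images):
    """Return the ImageId of the newest entry in a describe_images 'Images' list.

    CreationDate values are ISO 8601 strings, so comparing them as text
    orders them chronologically.
    """
    if not images:
        raise LookupError("no images matched the filters")
    newest = max(images, key=lambda image: image["CreationDate"])
    return newest["ImageId"]


def find_latest_ami(name_pattern, region):
    """Query EC2 for available Amazon-owned AMIs whose name matches name_pattern.

    Requires boto3 and valid AWS credentials; name_pattern is an assumption
    you must provide (wildcards such as * and ? are accepted by the filter).
    """
    import boto3  # imported lazily so latest_image_id works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_images(
        Owners=["amazon"],
        Filters=[
            {"Name": "name", "Values": [name_pattern]},
            {"Name": "state", "Values": ["available"]},
        ],
    )
    return latest_image_id(response["Images"])
```

Calling `find_latest_ami(name_pattern, "us-east-1")` with a pattern for your chosen DLAMI returns a single AMI ID.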
Now you can change this AMI ID in the CloudFormation script and use the ready-to-use Neuron SDK. To do this, search
The second option is to create an instance whose userdata field is populated during stack creation to install the Neuron SDK. You don't need to install it yourself because CloudFormation will set it up. For more information, see the Neuron Configuration Guide.
For this post, we use option 2, in case you need to use a custom image. Complete the following steps:
- Launch the provided CloudFormation template.
- For KeyName, enter the name of your desired key pair, and the parameters will be preloaded. For this post, we use
- Enter a name for your stack.
- If you are running in the us-east-1 Region, you can keep the default values for ALBName and AZIds.
To check which Availability Zones in the Region offer Trn1 instances, run the following command:
- Choose Next and finish creating the stack.
When the stack is complete, you can move on to the next step.
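The Availability Zone check mentioned in the steps above can also be done from Python. The following is a minimal sketch using boto3's `describe_instance_type_offerings`. It is not the command from the original post. The function names are hypothetical, and `trn1.2xlarge` is only an assumed default for the instance size. The filtering step is a pure function so it can be exercised without an AWS account.

```python
def zones_offering(offerings, instance_type):
    """Return the sorted locations that offer instance_type.

    `offerings` is the 'InstanceTypeOfferings' list from the EC2 API.
    """
    return sorted(
        offering["Location"]
        for offering in offerings
        if offering["InstanceType"] == instance_type
    )


def trn1_zone_ids(region, instance_type="trn1.2xlarge"):
    """List the Availability Zone IDs in `region` that offer `instance_type`.

    Requires boto3 and valid AWS credentials. The default instance type is
    an assumption made for this sketch.
    """
    import boto3  # imported lazily so zones_offering works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_instance_type_offerings(
        LocationType="availability-zone-id",
        Filters=[{"Name": "instance-type", "Values": [instance_type]}],
    )
    return zones_offering(response["InstanceTypeOfferings"], instance_type)
```

If you run in a Region other than us-east-1, the returned zone IDs are the kind of value you would supply in place of the template's default AZIds.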
Prepare and push an ECR image using the Neuron SDK
Amazon ECR is a fully managed container registry that provides high-performance hosting, so you can reliably deploy application images and artifacts anywhere. We use Amazon ECR to store a custom Docker image that contains our Neuron scripts and packages needed to train a model with ECS tasks running on Trn1 instances. You can create an ECR repository using the AWS Command Line Interface (AWS CLI) or the AWS Management Console. For this post, we use the console. Complete the following steps:
- In the Amazon ECR console, create a new repository.
- For Visibility settings, select Private.
- For Repository name, enter a name.
- Choose Create repository.
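If you would rather script the console steps above, the sketch below shows roughly the same operation with boto3. The function names and the repository name in the usage note are hypothetical. Repositories created through the `ecr` client are private, which matches the visibility setting chosen above. The client is passed in as a parameter so the helper can be tried with a stand-in object.

```python
def create_private_repository(ecr_client, name):
    """Create an ECR repository and return its URI.

    `ecr_client` is expected to behave like boto3.client("ecr").
    """
    response = ecr_client.create_repository(repositoryName=name)
    return response["repository"]["repositoryUri"]


def image_uri(repository_uri, tag="latest"):
    """Compose the full image reference used for docker tag and docker push."""
    return f"{repository_uri}:{tag}"
```

With real credentials, `create_private_repository(boto3.client("ecr"), "my-neuron-training")` returns the repository URI, and `image_uri(...)` yields the reference to tag and push your image with.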
Now that you have a repository, let's build and push an image. You can build it locally (on your laptop) or in an AWS Cloud9 environment. We are training a multilayer perceptron (MLP) model. For the original code, see the Multilayer Perceptron Training Tutorial.
- Copy the train.py and model.py files into a project.
The code is already compatible with Neuron, so you don't need to change it.
- Create a Dockerfile that contains the commands to install the Neuron SDK and the training scripts: