
Scale your machine learning workloads on Amazon ECS powered by AWS Trainium instances

Running machine learning (ML) workloads with containers is becoming common practice. Containers can fully encapsulate not only your training code, but the entire dependency stack down to the hardware libraries and drivers. What you get is an ML development environment that is consistent and portable. With containers, scaling in a cluster is much easier.

In late 2022, AWS announced the general availability of Amazon EC2 Trn1 instances powered by AWS Trainium accelerators, which are purpose built for high-performance deep learning training. Trn1 instances offer up to 50% savings in training costs compared to other comparable Amazon Elastic Compute Cloud (Amazon EC2) instances. The AWS Neuron SDK was also released to support these accelerators, giving developers tools to compile, run, and profile models for high-performance and cost-effective training.

Amazon Elastic Container Service (Amazon ECS) is a fully managed container orchestration service that simplifies the deployment, management, and scaling of containerized applications. Simply describe your application and the required resources, and Amazon ECS will launch, monitor, and scale your application through flexible compute options with automatic integrations to other supporting AWS services your application needs.

In this post, we show you how to run your ML training jobs in a container using Amazon ECS to deploy, manage, and scale your ML workload.

Solution overview

We explain the following high-level steps:

  1. Provision an ECS cluster of Trn1 instances with AWS CloudFormation.
  2. Create a custom container image with the Neuron SDK and push it to Amazon Elastic Container Registry (Amazon ECR).
  3. Create a task definition to define an ML training job that Amazon ECS will run.
  4. Run the ML task on Amazon ECS.

Prerequisites

To follow along, familiarity with core AWS services such as Amazon EC2 and Amazon ECS is assumed.

Provision an ECS cluster of Trn1 instances

To get started, launch the provided CloudFormation template, which will provide the necessary resources such as a VPC, an ECS cluster, and an EC2 Trainium instance.

We use the Neuron SDK to run deep learning workloads on AWS Inferentia and Trainium-based instances. It supports your end-to-end ML development lifecycle: building new models, optimizing them, and deploying them to production. To train your model with Trainium, the Neuron SDK needs to be installed on the EC2 instances where the ECS tasks will run, so that the NeuronDevice exposed by the hardware is mapped, as well as in the Docker image that will be pushed to Amazon ECR, so that the training commands are available.

Standard versions of Amazon Linux 2 or Ubuntu 20 don't come with the AWS Neuron drivers installed, so we have two options.

The first option is to use a Deep Learning Amazon Machine Image (DLAMI) that already has the Neuron SDK installed. A sample is available in the GitHub repository. You can choose a DLAMI based on the operating system. Then run the following command to get the AMI ID:

aws ec2 describe-images --region us-east-1 --owners amazon --filters 'Name=name,Values=Deep Learning AMI Neuron PyTorch 1.13.? (Amazon Linux 2) ????????' 'Name=state,Values=available' --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' --output text

The output will be as follows:

ami-06c40dd4f80434809

This AMI ID may change over time, so be sure to use the command to get the correct AMI ID.
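
The newest-image selection performed by the JMESPath query above can also be reproduced in code. The following sketch uses a hypothetical sample response (real data would come from the `aws ec2 describe-images` call); it mirrors the `reverse(sort_by(Images, &CreationDate))[:1]` logic:

```python
# Pick the most recently created AMI from a describe-images style response.
# The sample data below is hypothetical; in practice the list would come
# from the aws ec2 describe-images command shown above.
sample_images = [
    {"ImageId": "ami-0aaaaaaaaaaaaaaaa", "CreationDate": "2023-01-15T10:00:00.000Z"},
    {"ImageId": "ami-0bbbbbbbbbbbbbbbb", "CreationDate": "2023-04-02T08:30:00.000Z"},
    {"ImageId": "ami-0cccccccccccccccc", "CreationDate": "2023-02-20T12:45:00.000Z"},
]

def newest_ami(images):
    # ISO-8601 timestamps sort correctly as plain strings, so a string
    # comparison on CreationDate matches the JMESPath sort_by behavior.
    if not images:
        return None
    return max(images, key=lambda img: img["CreationDate"])["ImageId"]

print(newest_ami(sample_images))  # prints the most recently created AMI ID
```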

Now you can change this AMI ID in the CloudFormation script to use an image with the Neuron SDK ready to use. To do this, search for EcsAmiId in Parameters:

"EcsAmiId": {
    "Type": "String",
    "Description": "AMI ID",
    "Default": "ami-09def9404c46ac27c"
}
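
If you script your deployments, you can patch this default programmatically instead of editing the file by hand. The following is a minimal sketch; the template structure shown is a hypothetical fragment matching the parameter above, and loading/saving the actual template file is left to you:

```python
import json

def set_ecs_ami_id(template: dict, ami_id: str) -> dict:
    # Overwrite the default value of the EcsAmiId parameter in a
    # CloudFormation template that has been loaded as a dict.
    template["Parameters"]["EcsAmiId"]["Default"] = ami_id
    return template

# Hypothetical minimal template fragment matching the snippet above.
template = {
    "Parameters": {
        "EcsAmiId": {
            "Type": "String",
            "Description": "AMI ID",
            "Default": "ami-09def9404c46ac27c",
        }
    }
}

updated = set_ecs_ami_id(template, "ami-06c40dd4f80434809")
print(json.dumps(updated["Parameters"]["EcsAmiId"], indent=2))
```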

The second option is to install the Neuron SDK through the EC2 user data field during stack creation. You don't need to install anything manually because CloudFormation will set it up. For more information, see the Neuron Setup Guide.

For this post, we use the second option, in case you need to use a custom image. Complete the following steps:

  1. Start the provided CloudFormation template.
  2. For KeyName, enter the name of your desired key pair, and the template will preload the other parameters. For this post, we use trainium-key.
  3. Enter a name for your stack.
  4. If you are running in the us-east-1 Region, you can keep the default values for ALBName and AZIds.

To check which Availability Zones in your Region offer Trn1 instances, run the following command:

aws ec2 describe-instance-type-offerings --region us-east-1 --location-type availability-zone --filter Name=instance-type,Values=trn1.2xlarge
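
The command returns one entry per Availability Zone that offers the instance type. A small parsing sketch follows; the response embedded here is an illustrative sample, not real output:

```python
import json

# Illustrative sample of a describe-instance-type-offerings response;
# real output comes from the AWS CLI command above.
sample_response = json.loads("""
{
  "InstanceTypeOfferings": [
    {"InstanceType": "trn1.2xlarge", "LocationType": "availability-zone",
     "Location": "us-east-1c"},
    {"InstanceType": "trn1.2xlarge", "LocationType": "availability-zone",
     "Location": "us-east-1d"}
  ]
}
""")

def offered_zones(response, instance_type="trn1.2xlarge"):
    # Collect the Availability Zones that offer the requested instance type.
    return sorted(
        o["Location"]
        for o in response.get("InstanceTypeOfferings", [])
        if o["InstanceType"] == instance_type
    )

print(offered_zones(sample_response))
```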

  5. Choose Next and create the stack.

When the stack is complete, you can move on to the next step.

Prepare and push an ECR image using the Neuron SDK

Amazon ECR is a fully managed container registry that provides high-performance hosting, so you can reliably deploy application images and artifacts anywhere. We use Amazon ECR to store a custom Docker image that contains our Neuron scripts and packages needed to train a model with ECS tasks running on Trn1 instances. You can create an ECR repository using the AWS Command Line Interface (AWS CLI) or the AWS Management Console. For this post, we use the console. Complete the following steps:

  1. In the Amazon ECR console, create a new repository.
  2. For Visibility settings, select Private.
  3. For Repository name, enter a name.
  4. Choose Create repository.

Now that you have a repository, let’s create and push an image, which could be created locally (on your laptop) or in an AWS Cloud9 environment. We are training a multilayer perceptron (MLP) model. For the original code, see the Multilayer Perceptron Training Tutorial.

  1. Copy the train.py and model.py files into a project.

It’s already compatible with Neuron, so you don’t need to change any code.

  2. Create a Dockerfile that has the commands to install the Neuron SDK and the training scripts:
FROM amazonlinux:2

RUN echo $'[neuron]\nname=Neuron YUM Repository\nbaseurl=https://yum.repos.neuron.amazonaws.com\nenabled=1' > /etc/yum.repos.d/neuron.repo

RUN rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

RUN yum install aws-neuronx-collectives-2.* -y
RUN yum install aws-neuronx-runtime-lib-2.* -y
RUN yum install aws-neuronx-tools-2.* -y
RUN yum install -y tar gzip pip
RUN yum install -y python3 python3-pip
RUN yum install -y python3.7-venv gcc-c++
RUN python3.7 -m venv aws_neuron_venv_pytorch

# Activate Python venv
ENV PATH="/aws_neuron_venv_pytorch/bin:$PATH"
RUN python -m pip install -U pip
RUN python -m pip install wget
RUN python -m pip install awscli

RUN python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
RUN python -m pip install torchvision tqdm torch-neuronx neuronx-cc==2.* pillow
RUN mkdir -p /opt/ml/mnist_mlp
COPY model.py /opt/ml/mnist_mlp/model.py
COPY train.py /opt/ml/mnist_mlp/train.py
RUN chmod +x /opt/ml/mnist_mlp/train.py
CMD ["python3", "/opt/ml/mnist_mlp/train.py"]

To build your own Dockerfile with Neuron, see Develop on AWS ML accelerator instance, where you can find guides for other OS and ML frameworks.

  3. Build the image, then push it to Amazon ECR with the following code (provide your Region, account ID, and ECR repository name):
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin your-account-id.dkr.ecr.your-region.amazonaws.com

docker build -t mlp_trainium .

docker tag mlp_trainium:latest your-account-id.dkr.ecr.your-region.amazonaws.com/your-ecr-repo-name:latest

docker push your-account-id.dkr.ecr.your-region.amazonaws.com/your-ecr-repo-name:latest

After that, your image version should be visible in the ECR repository you created.
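
The registry, repository, and tag pieces in those commands follow a fixed URI pattern. A small helper (hypothetical, not part of any AWS SDK) can compose and sanity-check the URI so a mistyped account ID or Region fails fast:

```python
import re

def ecr_image_uri(account_id: str, region: str, repo: str, tag: str = "latest") -> str:
    # Compose the private-registry image URI used by docker tag/push.
    uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"
    # Basic shape check: 12-digit account, region-like token, repo:tag.
    pattern = r"^\d{12}\.dkr\.ecr\.[a-z0-9-]+\.amazonaws\.com/[a-z0-9._/-]+:[\w.-]+$"
    if not re.match(pattern, uri):
        raise ValueError(f"malformed ECR image URI: {uri}")
    return uri

print(ecr_image_uri("123456789012", "us-east-1", "mlp_trainium"))
```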

Run the ML training task as an ECS task

To run the ML training task on Amazon ECS, you must first create a task definition. A task definition is required to run Docker containers on Amazon ECS.

  1. In the Amazon ECS console, choose Task definitions in the navigation pane.
  2. On the Create new task definition menu, choose Create new task definition with JSON.

You can use the following task definition template as a baseline. Note that in the image field, you can use the one generated in the previous step. Make sure it includes your account ID and ECR repository name.

To make sure Neuron is available, check that the device /dev/neuron0 is mapped in the devices block of the task definition. This maps the single NeuronDevice of the trn1.2xlarge instance, which has two NeuronCores.

  3. Create your task definition using the following template:

{
    "family": "mlp_trainium",
    "containerDefinitions": [
        {
            "name": "mlp_trainium",
            "image": "your-account-id.dkr.ecr.us-east-1.amazonaws.com/your-ecr-repo-name",
            "cpu": 0,
            "memoryReservation": 1000,
            "portMappings": [],
            "essential": true,
            "environment": [],
            "mountPoints": [],
            "volumesFrom": [],
            "linuxParameters": {
                "capabilities": {
                    "add": [
                        "IPC_LOCK"
                    ]
                },
                "devices": [
                    {
                        "hostPath": "/dev/neuron0",
                        "containerPath": "/dev/neuron0",
                        "permissions": [
                            "read",
                            "write"
                        ]
                    }
                ]
            },
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-create-group": "true",
                    "awslogs-group": "/ecs/task-logs",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "ecs"
                }
            }
        }
    ],
    "networkMode": "awsvpc",
    "placementConstraints": [
        {
            "type": "memberOf",
            "expression": "attribute:ecs.os-type == linux"
        },
        {
            "type": "memberOf",
            "expression": "attribute:ecs.instance-type == trn1.2xlarge"
        }
    ],
    "requiresCompatibilities": [
        "EC2"
    ],
    "cpu": "1024",
    "memory": "3072"
}
You can also complete this step with the AWS CLI, using the following command:

aws ecs register-task-definition \
--family mlp-trainium \
--container-definitions '[{
    "name": "my-container-1",
    "image": "your-account-id.dkr.ecr.us-east-1.amazonaws.com/your-ecr-repo-name",
    "cpu": 0,
    "memoryReservation": 1000,
    "portMappings": [],
    "essential": true,
    "environment": [],
    "mountPoints": [],
    "volumesFrom": [],
    "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
            "awslogs-create-group": "true",
            "awslogs-group": "/ecs/task-logs",
            "awslogs-region": "us-east-1",
            "awslogs-stream-prefix": "ecs"
        }
    },
    "linuxParameters": {
        "capabilities": {
            "add": [
                "IPC_LOCK"
            ]
        },
        "devices": [{
            "hostPath": "/dev/neuron0",
            "containerPath": "/dev/neuron0",
            "permissions": ["read", "write"]
        }]
    }
}]' \
--requires-compatibilities EC2 \
--cpu "8192" \
--memory "16384" \
--placement-constraints '[{
    "type": "memberOf",
    "expression": "attribute:ecs.instance-type == trn1.2xlarge"
}, {
    "type": "memberOf",
    "expression": "attribute:ecs.os-type == linux"
}]'

Run the task on Amazon ECS

After we’ve created the ECS cluster, pushed the image to Amazon ECR, and created the task definition, we run the task definition to train a model on Amazon ECS.

  1. In the Amazon ECS console, choose Clusters in the navigation pane.
  2. Open your cluster.
  3. On the Tasks tab, choose Run new task.
  4. For Launch type, choose EC2.
  5. For Application type, select Task.
  6. For Family, choose the task definition you created.
  7. In the Networking section, specify the VPC created by the CloudFormation stack, along with a subnet and security group.
  8. Choose Create.

You can monitor your task in the Amazon ECS console.

You can also run the task using the AWS CLI:

aws ecs run-task --cluster <your-cluster-name> --task-definition <your-task-name> --count 1 --network-configuration '{"awsvpcConfiguration": {"subnets": ["<your-subnet-name>"], "securityGroups": ["<your-sg-name>"]}}'

The result will be like the screenshot below.

You can also view training job details using the Amazon CloudWatch log group.

After training your models, you can store them in Amazon Simple Storage Service (Amazon S3).

Clean up

To avoid additional costs, you can set the Auto Scaling group's Minimum capacity and Desired capacity to zero to shut down the Trainium instances. For a full cleanup, delete the CloudFormation stack to remove all resources created by this template.

Conclusion

In this post, we showed how to use Amazon ECS to deploy your ML training jobs. We created a CloudFormation template to create the ECS cluster of Trn1 instances, built a custom Docker image, pushed it to Amazon ECR, and ran the ML training task on the ECS cluster using a Trainium instance.

To learn more about Neuron and what you can do with Trainium, check out the following resources:


About the Authors

Guilherme Ricci is a Senior Startup Solutions Architect at Amazon Web Services, helping startups modernize and cost-optimize their applications. With more than 10 years of experience with companies in the financial sector, he currently works with a team of AI/ML specialists.

Evandro Franco is an AI/ML Solutions Architect at Amazon Web Services. He helps AWS customers overcome AI/ML-related business challenges using AWS. He has more than 15 years of experience in technology, spanning software development, infrastructure, serverless, and machine learning.

Matthew McClean leads Annapurna ML’s Solution Architecture team helping customers adopt AWS Trainium and AWS Inferentia products. He is passionate about generative AI and has been helping customers adopt AWS technologies for the past 10 years.
