Connect Amazon EMR and RStudio on Amazon SageMaker

RStudio on Amazon SageMaker is the first fully managed RStudio Workbench integrated development environment (IDE) in the cloud. You can quickly launch the familiar RStudio IDE and scale the underlying compute resources up or down without interrupting your work, making it easy to build machine learning (ML) and analytics solutions in R at scale.

Alongside tools like RStudio on SageMaker, users analyze, transform, and prepare large amounts of data as part of their data science and ML workflows. Data scientists and data engineers use Apache Spark, Hive, and Presto running on Amazon EMR for large-scale data processing. By using RStudio on SageMaker and Amazon EMR together, you can continue to use the RStudio IDE for analysis and development while using Amazon EMR managed clusters for larger-scale data processing.

In this post, we demonstrate how you can connect your RStudio on SageMaker domain with an EMR cluster.

Solution overview

We use an Apache Livy connection to submit a sparklyr job from RStudio on SageMaker to an EMR cluster. This is demonstrated in the following diagram.

Scope of solution
All the code shown in the post is available in our GitHub repository. We implement the following solution architecture.

Prerequisites

Before deploying any resources, make sure you have all the requirements to set up and use RStudio on SageMaker and Amazon EMR:

We also build a custom RStudio on SageMaker image, so make sure you have Docker running and all the necessary permissions. For more information, see Use a custom image to bring your own development environment to RStudio on Amazon SageMaker.

Build resources with AWS CloudFormation

We use an AWS CloudFormation stack to build the necessary infrastructure.

If you already have an RStudio domain and an EMR cluster, you can skip this step and start building your custom RStudio on SageMaker image. Substitute the details of your own EMR cluster and RStudio domain wherever this post refers to the ones created in this section.

Launching this stack creates the following resources:

  • Two private subnets
  • EMR Spark cluster
  • AWS Glue database and tables
  • SageMaker domain with RStudio
  • SageMaker RStudio user profile
  • IAM service role for the SageMaker RStudio domain
  • IAM service role for the SageMaker RStudio user profile

Follow these steps to create your resources:

Choose Launch Stack to create the stack.

  1. On the Create stack page, choose Next.
  2. On the Specify stack details page, provide a name for your stack and leave the remaining options at their defaults, then choose Next.
  3. On the Configure stack options page, leave the default options and choose Next.
  4. On the Review page, select the following check boxes:
       • I acknowledge that AWS CloudFormation might create IAM resources with custom names
       • I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND
  5. Choose Create stack.

The template generates five stacks.

To see the EMR Spark cluster that was created, go to the Amazon EMR console. You will see a cluster named sagemaker that was created for you. This is the cluster we connect to from RStudio on SageMaker.

Create the custom RStudio on SageMaker image

We have created a custom image that will install all sparklyr dependencies and establish a connection to the EMR cluster we created.

If you are using your own EMR cluster and RStudio domain, modify the scripts accordingly.
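
For orientation, the connection that the image establishes can be sketched in sparklyr roughly as follows. This is a minimal sketch, not the code shipped in the repository: the helper names livy_master_url() and connect_to_emr(), the placeholder host name, and the Spark version in the usage comment are assumptions for illustration. It assumes Livy is listening on its default port 8998 on the EMR primary node and that the sparklyr package is installed.

```r
# Build the Livy endpoint URL for an EMR primary node.
# Assumption: Livy listens on its default port, 8998.
livy_master_url <- function(host, port = 8998L) {
  sprintf("http://%s:%d", host, port)
}

# Hypothetical wrapper: open a sparklyr connection through Livy.
# `host` and `spark_version` must match your own cluster.
connect_to_emr <- function(host, spark_version) {
  sparklyr::spark_connect(
    master  = livy_master_url(host),
    method  = "livy",
    version = spark_version
  )
}

# Example usage (placeholder values, requires a reachable cluster):
# sc <- connect_to_emr("<emr-primary-node-dns>", spark_version = "3.3.0")
# sparklyr::spark_disconnect(sc)
```

If you bring your own cluster, the host name and Spark version are the two values most likely to need changing.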

Make sure Docker is running. Get started by going to our project repository:

Now we build the Docker image and register it with our RStudio on SageMaker domain.

  1. In the SageMaker console, choose Domains in the navigation pane.
  2. Choose the domain rstudio-domain.
  3. On the Environment tab, choose Attach image.

    Now we attach the sparklyr image we created earlier to the domain.
  4. For Choose image source, select Existing image.
  5. Select the sparklyr image we created.
  6. For Image properties, leave the default options.
  7. For Image type, select RStudio image.
  8. Choose Submit.

    Validate that the image has been added to the domain. The image may take a few minutes to fully attach.
  9. When the image is available, log in to the RStudio on SageMaker console using the rstudio-user profile that was created.
  10. From there, create a session using the sparklyr image we created earlier.

    First, we need to connect to our EMR cluster.
  11. In the Connections pane, choose New Connection.
  12. Select the EMR cluster connection code snippet and choose Connect to Amazon EMR Cluster.

    After the connection code has run, you’ll see a Spark connection through Livy, but no tables.
  13. Change the database to credit_card:
    tbl_change_db(sc, "credit_card")
  14. Choose Refresh connection data.
    You can now see the tables.
  15. Now open the rstudio-sparklyr-code-walkthrough.md file.

This file contains a set of Spark transformations that we can apply to our credit card dataset to prepare it for modeling. The following code is an excerpt:

Let's count() how many transactions are in the transactions table. But first, we need a reference to each table, which we create with the tbl() function.

users_tbl <- tbl(sc, "users")
cards_tbl <- tbl(sc, "cards")
transactions_tbl <- tbl(sc, "transactions")

Let’s count the number of rows for each table.

count(users_tbl)
count(cards_tbl)
count(transactions_tbl)

We now register our tables as Spark DataFrames and cache them cluster-wide for better performance. We also filter out the header row that appears as the first row of each table.

users_tbl <- tbl(sc, 'users') %>%
  filter(gender != 'Gender')
sdf_register(users_tbl, "users_spark")
tbl_cache(sc, 'users_spark')
users_sdf <- tbl(sc, 'users_spark')

cards_tbl <- tbl(sc, 'cards') %>%
  filter(expire_date != 'Expires')
sdf_register(cards_tbl, "cards_spark")
tbl_cache(sc, 'cards_spark')
cards_sdf <- tbl(sc, 'cards_spark')

transactions_tbl <- tbl(sc, 'transactions') %>%
  filter(amount != 'Amount')
sdf_register(transactions_tbl, "transactions_spark")
tbl_cache(sc, 'transactions_spark')
transactions_sdf <- tbl(sc, 'transactions_spark')
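
The three blocks above repeat one pattern: drop the header row, register the result, cache it, and take a reference to the cached table. A hedged sketch of a helper that captures this pattern follows. The names spark_table_name() and register_and_cache() are hypothetical and do not appear in the walkthrough; the sketch assumes the sparklyr and dplyr packages are installed and that sc is the open Spark connection.

```r
# Name used for the registered Spark table, following the "<table>_spark"
# convention of the blocks above.
spark_table_name <- function(table_name) {
  paste0(table_name, "_spark")
}

# Hypothetical helper: filter out the header row, register the result as a
# Spark DataFrame, cache it, and return a reference to the cached table.
register_and_cache <- function(sc, table_name, header_col, header_value) {
  spark_name <- spark_table_name(table_name)
  filtered <- dplyr::filter(
    dplyr::tbl(sc, table_name),
    .data[[header_col]] != !!header_value
  )
  sparklyr::sdf_register(filtered, spark_name)
  sparklyr::tbl_cache(sc, spark_name)
  dplyr::tbl(sc, spark_name)
}

# Example usage, intended to mirror the users block above:
# users_sdf <- register_and_cache(sc, "users", "gender", "Gender")
```

Keeping the column name and header value as arguments means the same helper could cover the cards and transactions tables without copying the block.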

For the full list of commands, see the rstudio-sparklyr-code-walkthrough.md file.

Clean up

To avoid incurring recurring costs, clean up your resources by deleting the root CloudFormation stack. Also delete all Amazon Elastic File System (Amazon EFS) mounts that were created, along with any Amazon Simple Storage Service (Amazon S3) buckets and objects.

Conclusion

The integration of RStudio on SageMaker with Amazon EMR provides a powerful solution for data analysis and modeling tasks in the cloud. By connecting RStudio on SageMaker through Livy to Spark on Amazon EMR, you can use the compute resources of both platforms to process large datasets efficiently. RStudio is one of the most widely used IDEs for data analysis, and running it on SageMaker lets you take advantage of SageMaker's fully managed infrastructure, access control, networking, and security capabilities. Meanwhile, the Livy connection to Spark on Amazon EMR gives you distributed processing and a way to scale your data processing tasks.

If you're interested in learning more about how to use these tools together, this post serves as a starting point. For more information, see RStudio on Amazon SageMaker. If you have suggestions or feature improvements, please create a pull request in our GitHub repository or leave a comment on this post.


About the Authors

Ryan Garner is a data scientist with AWS Professional Services. He is passionate about helping AWS customers use R to solve their data science and machine learning problems.


Raj Pathak is a Senior Solutions Architect and Technologist specialized in Financial Services (Insurance, Banking, Capital Markets) and Machine Learning. He specializes in Natural Language Processing (NLP), Large Language Models (LLM), and Machine Learning Infrastructure and Operations (MLOps) projects.


Saiteja Pudi is a solutions architect at AWS, based in Dallas, Texas. He has been working with AWS for over 3 years, helping customers realize the true potential of AWS by being their trusted advisor. He comes from an application development background and is interested in data science and machine learning.
