Amazon SageMaker with TensorBoard: An overview of a hosted TensorBoard experience

Today, data scientists training deep learning models must identify and correct training issues to meet accuracy targets for production deployment, and they need standard tools for debugging model training. Among the data science community, TensorBoard is a popular toolkit that allows data scientists to visualize and analyze various aspects of their machine learning (ML) models and training processes. It provides a set of tools for visualizing training metrics, examining model architectures, exploring embeddings, and more. The TensorFlow and PyTorch projects endorse and use TensorBoard in their official documentation and examples.

Amazon SageMaker with TensorBoard is a capability that brings TensorBoard visualization tools to SageMaker. Integrated with SageMaker domains and training jobs, it gives SageMaker domain users access to TensorBoard data and helps them perform model debugging tasks using the SageMaker TensorBoard visualization plugins. When creating a SageMaker training job, domain users can configure TensorBoard using the SageMaker Python SDK or the Boto3 API. SageMaker with TensorBoard supports the SageMaker Data Manager plugin, which allows domain users to access many training jobs in one place within the TensorBoard application.

In this post, we demonstrate how to set up a TensorBoard training job in SageMaker using the SageMaker Python SDK, access SageMaker TensorBoard, explore training output data displayed in TensorBoard, and delete unused TensorBoard applications.

Solution overview

A typical deep learning training task in SageMaker consists of two main steps: preparing a training script and configuring a SageMaker training job launcher. In this post, we walk you through the changes needed to collect TensorBoard-compatible data from SageMaker training.


To start using SageMaker with TensorBoard, you need to set up a SageMaker domain with an Amazon VPC in your AWS account. Domain user profiles for each individual user are required to access TensorBoard in SageMaker, and the AWS Identity and Access Management (IAM) execution role needs a minimum set of permissions, including the following:

  • sagemaker:CreateApp
  • sagemaker:DeleteApp
  • sagemaker:DescribeTrainingJob
  • sagemaker:Search
  • s3:GetObject
  • s3:ListBucket

For more information about setting up your SageMaker domain and user profiles, see Joining your Amazon SageMaker domain using Quick Setup and Add and Remove User Profiles.
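As a reference, these permissions can be expressed as an IAM policy statement like the following sketch. The statement ID is arbitrary, and `Resource` is left broad here for brevity; in practice, scope it to your own domain and S3 buckets.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SageMakerTensorBoardMinimal",
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateApp",
        "sagemaker:DeleteApp",
        "sagemaker:DescribeTrainingJob",
        "sagemaker:Search",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": "*"
    }
  ]
}
```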

Directory structure

When using Amazon SageMaker Studio, the directory structure can be organized as follows:

├── script
│   └── <your training script>
└── simple_tensorboard.ipynb

Here, the script/ directory contains your training script, and simple_tensorboard.ipynb launches the SageMaker training job.

Modify your training script

You can use any of the following tools to collect tensors and scalars: TensorBoardX, TensorFlow Summary Writer, PyTorch Summary Writer, or Amazon SageMaker Debugger. Specify the data output path as the logging directory (log_dir) in the training container. In this sample code, we use TensorFlow to train a simple fully connected neural network for a classification task. For other options, see Prepare a training job with a TensorBoard output data configuration. In the train() function, we use the tf.keras.callbacks.TensorBoard tool to collect tensors and scalars, specify /opt/ml/output/tensorboard as the log directory in the training container, and pass the callback to the model's callbacks argument in fit(). See the following code:

import argparse
import json

import tensorflow as tf


def parse_args():
    cmdline = argparse.ArgumentParser()
    cmdline.add_argument("--epochs", default=5, type=int, help="Number of epochs.")
    cmdline.add_argument("--optimizer", default="adam", type=str, help="Optimizer type.")
    cmdline.add_argument(
        "--loss",
        default="sparse_categorical_crossentropy",
        type=str,
        help="Loss function.",
    )
    cmdline.add_argument(
        "--metrics",
        default='["accuracy"]',
        type=str,
        help="List of metrics to be evaluated by the model during training and testing.",
    )
    return cmdline


def create_model():
    return tf.keras.models.Sequential(
        [
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(512, activation="relu"),
            tf.keras.layers.Dense(10, activation="softmax"),
        ]
    )


def train(args):
    mnist = tf.keras.datasets.mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    model = create_model()
    model.compile(optimizer=args.optimizer, loss=args.loss, metrics=json.loads(args.metrics))

    # Set up the TensorBoard callback
    LOG_DIR = "/opt/ml/output/tensorboard"
    tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=LOG_DIR, histogram_freq=1)

    # Pass the TensorBoard callback into the model fit
    model.fit(
        x_train,
        y_train,
        epochs=args.epochs,
        validation_data=(x_test, y_test),
        callbacks=[tensorboard_callback],
    )


if __name__ == "__main__":
    cmdline = parse_args()
    args, unknown_args = cmdline.parse_known_args()
    train(args)

Build a SageMaker training launcher with a TensorBoard data configuration

Use sagemaker.debugger.TensorBoardOutputConfig when configuring a SageMaker framework estimator. This configuration maps the Amazon Simple Storage Service (Amazon S3) bucket you specify for saving TensorBoard data to the local path in the training container (for example, /opt/ml/output/tensorboard). You can use a different local container output path; however, it must be consistent with the value of the LOG_DIR variable specified in the previous step, so that SageMaker correctly finds the local path in the training container and saves the TensorBoard data to the S3 output bucket.

Then pass the object to the tensorboard_output_config parameter of the estimator class. The following code snippet shows an example of preparing a TensorFlow estimator with the TensorBoard output configuration parameter.

The following is the boilerplate code:

import os
from datetime import datetime

import boto3
import sagemaker

region = boto3.session.Session().region_name
boto_sess = boto3.Session()
role = sagemaker.get_execution_role()
sm = sagemaker.Session()

base_job_name = "simple-tensorboard"
date_str = datetime.now().strftime("%d-%m-%Y")
time_str = datetime.now().strftime("%d-%m-%Y-%H-%M-%S")
job_name = f"{base_job_name}-{time_str}"

s3_output_bucket = os.path.join("s3://", sm.default_bucket(), base_job_name)

output_path = os.path.join(s3_output_bucket, "sagemaker-output", date_str, job_name)
code_location = os.path.join(s3_output_bucket, "sagemaker-code", date_str, job_name)

The following code is for the training container:

instance_type = "ml.c5.xlarge"
instance_count = 1

# Retrieve the TensorFlow training image for the Region
# (the framework version shown here is an example)
image_uri = sagemaker.image_uris.retrieve(
    framework="tensorflow",
    region=region,
    version="2.11",
    py_version="py39",
    image_scope="training",
    instance_type=instance_type,
)

The following code is the TensorBoard configuration:

from sagemaker.tensorflow import TensorFlow

tensorboard_output_config = sagemaker.debugger.TensorBoardOutputConfig(
    s3_output_path=os.path.join(output_path, "tensorboard"),
    container_local_output_path="/opt/ml/output/tensorboard",
)

hyperparameters = {
    "epochs": 5,
    "optimizer": "adam",
    "loss": "sparse_categorical_crossentropy",
    "metrics": '["accuracy"]',
}

estimator = TensorFlow(
    entry_point="train.py",  # your training script's file name inside script/
    source_dir="script",
    role=role,
    image_uri=image_uri,
    instance_type=instance_type,
    instance_count=instance_count,
    hyperparameters=hyperparameters,
    output_path=output_path,
    code_location=code_location,
    tensorboard_output_config=tensorboard_output_config,
)

Start the training job with the following code:

estimator.fit(job_name=job_name)

Access TensorBoard in SageMaker

You can access TensorBoard in two ways: programmatically, through the sagemaker.interactive_apps.tensorboard module that generates the URL, or through the TensorBoard landing page in the SageMaker console. After you open TensorBoard, SageMaker runs the TensorBoard plugin and automatically finds and loads all training job output data in a TensorBoard-compatible file format from the S3 buckets associated with your training jobs, during or after training.

The following code automatically generates the TensorBoard console landing page URL:

from sagemaker.interactive_apps import tensorboard

app = tensorboard.TensorBoardApp(region)

print("Navigate to the following URL:")
if app._is_studio_user:
    print(app.get_app_url(training_job_name=job_name))
else:
    print(app.get_app_url())

This returns the following message with a URL that opens the TensorBoard landing page.

>>> Navigate to the following URL: https://<sagemaker_domain_id>.studio.<region>

To open TensorBoard from the SageMaker console, see How to access TensorBoard in SageMaker.

When you open the TensorBoard application, TensorBoard opens with the SageMaker Data Manager tab. The following screenshot shows the full view of the SageMaker Data Manager tab in the TensorBoard application.

On the SageMaker Data Manager tab, you can select any training job and load TensorBoard-compatible training output data from Amazon S3.

  1. In the Add a training job section, use the check boxes to choose the training jobs from which you want to extract data and view them for debugging.
  2. Choose Add the selected jobs.

The selected jobs should appear in the Training jobs with monitoring section.

Refresh the view by choosing the refresh icon in the top right corner; the visualization tabs appear after the job data has loaded.
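The SageMaker Data Manager relies on the sagemaker:Search permission listed earlier. If you prefer to locate candidate training jobs programmatically before adding them in the UI, a sketch like the following could work. The helper names here are ours, not part of the SageMaker SDK.

```python
def build_search_expression(name_contains):
    """Build a SageMaker Search expression matching training jobs by name substring."""
    return {
        "Filters": [
            {"Name": "TrainingJobName", "Operator": "Contains", "Value": name_contains}
        ]
    }


def find_training_jobs(name_contains, region):
    """Return the names of training jobs whose name contains the given substring."""
    import boto3  # imported here so the expression builder stays dependency-free

    sm_client = boto3.client("sagemaker", region_name=region)
    response = sm_client.search(
        Resource="TrainingJob",
        SearchExpression=build_search_expression(name_contains),
    )
    return [item["TrainingJob"]["TrainingJobName"] for item in response["Results"]]
```

For example, find_training_jobs("simple-tensorboard", region) would list the jobs launched by the notebook in this post.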

Explore the training output data displayed in TensorBoard

On the Time series and other graphics-based tabs, you can see the list of Training jobs with monitoring in the left pane. You can use the check boxes next to the training jobs to show or hide visualizations. The TensorBoard plugins are activated dynamically, based on how you configured your training script to include summary writers and callbacks for tensor and scalar collection, and the graph tabs also appear dynamically. The following screenshots show sample views of each tab with metrics collected from two training jobs. The metrics include the time series, scalar, graph, distribution, and histogram plugins.

The following screenshot shows the Time series tab view.

The following screenshot shows the Scalars tab view.

The following screenshot shows the Graphs tab view.

The following screenshot shows the Distributions tab view.

The following screenshot shows the Histograms tab view.

Clean up

When you’re done monitoring and experimenting with tasks in TensorBoard, close the TensorBoard application:

  1. In the SageMaker console, choose Domains in the navigation pane.
  2. Choose your domain.
  3. Choose your user profile.
  4. Under Apps, choose Delete app in the TensorBoard row.
  5. Choose Yes, delete app.
  6. Enter delete in the text box, then choose Delete.

A message should appear at the top of the page confirming that the app was deleted.
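If you prefer to clean up programmatically rather than through the console, you can call the SageMaker DeleteApp API directly. The following is a minimal sketch; the domain ID and user profile name are placeholders for your own values, and the app name "default" is an assumption that you should verify against your domain's running apps.

```python
def build_delete_app_request(domain_id, user_profile_name, app_name="default"):
    """Build the keyword arguments for the SageMaker DeleteApp API call."""
    return {
        "DomainId": domain_id,
        "UserProfileName": user_profile_name,
        "AppType": "TensorBoard",
        "AppName": app_name,
    }


def delete_tensorboard_app(domain_id, user_profile_name, region):
    """Delete the TensorBoard app for a domain user profile."""
    import boto3  # imported here so the request builder stays dependency-free

    sm_client = boto3.client("sagemaker", region_name=region)
    sm_client.delete_app(**build_delete_app_request(domain_id, user_profile_name))
```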


Conclusion

TensorBoard is a powerful tool for visualizing, analyzing, and debugging deep learning models. In this post, we provided a guide to using SageMaker with TensorBoard: how to configure TensorBoard for a SageMaker training job using the SageMaker Python SDK, how to access SageMaker TensorBoard, how to explore training output data displayed in TensorBoard, and how to delete unused TensorBoard applications. By following these steps, you can start using TensorBoard in SageMaker for your own work.

We encourage you to experiment with different features and techniques.

About the authors

Dr. Baichuan Sun is a Senior Data Scientist at AWS AI/ML. He is passionate about solving strategic business problems with clients using a cloud data-driven methodology, and has led projects in challenging areas such as robotics computer vision, time series forecasting, pricing optimization, predictive maintenance, pharmaceutical development, and product recommendation systems. In his spare time, he likes to travel and spend time with his family.

Manoj is a Senior Product Manager at Amazon SageMaker. He is passionate about building next-generation AI products and works on software and tools to facilitate large-scale machine learning for customers. He has an MBA from the Haas School of Business and a master’s degree in Information Systems Management from Carnegie Mellon University. In his free time, Manoj enjoys playing tennis and landscape photography.
