Retraining SageMaker models with Chalice and Serverless

Julien Simon
5 min read · May 14, 2018


Amazon SageMaker makes it easy to train (and deploy) Machine Learning models at scale. Thanks to its Python SDK, developers can first experiment with their data set and model using a notebook instance. Once they’re happy with a model, it’s quite likely that they will need to train it again and again with new data or new parameters.

In this post, I will show you how to retrain a SageMaker model in two simple ways:

1 — On-demand, using a web service implemented with AWS Chalice.

2 — Periodically, using a scheduled Lambda function deployed with Serverless.

Sorry, Rock. Training doesn’t ALWAYS require heavy lifting!

Training a SageMaker model

The SageMaker SDK is great for experimenting, but it’s too large to fit in a Lambda deployment package. No worries though: the SageMaker client in boto3 includes a CreateTrainingJob API that will serve our purpose just fine.

The list of parameters may look intimidating, but keep in mind that we’re retraining an existing model. All of these parameters have already been passed: we’ll grab them using the DescribeTrainingJob API and reuse most of them as is, just modifying some of them for extra flexibility.

Let’s get to work :)

Training on-demand with a Chalice web service

We’ve used Chalice before to serve SageMaker predictions, so please check out this post if you need a refresher on Chalice.

Our web service will provide three APIs:

  • /list/{results}: list the last {results} training jobs, sorted by descending creation time. We’ll return only the job name and job status, as provided by the ListTrainingJobs API.
  • /get/{name}: describe a training job. We’ll return the result of the DescribeTrainingJob API.
  • /train/{name}: retrain a job with CreateTrainingJob. In the request body, we’ll pass a new name (required), an S3 output path to store the model (optional), an instance type (optional) and an instance count (optional).

Defining IAM permissions

Chalice is able to auto-generate IAM policies, but that won’t do here. Indeed, we need to add a special permission allowing the Lambda function deployed by Chalice to pass the SageMaker service role.

Rationale: being allowed to call service APIs doesn’t mean that you should have the same permissions as the service itself. To prevent privilege escalation, a user must thus be explicitly allowed to pass the service role, i.e. to inherit the permissions granted to the service.

As a consequence, we’ll use the following custom policy: don’t forget to update it with your own account number and role name.
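The original policy gist is no longer embedded, so here is a policy along these lines; the exact SageMaker actions and the logs statement are my reconstruction, and the account number and role name are placeholders to replace with your own.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:ListTrainingJobs",
        "sagemaker:DescribeTrainingJob",
        "sagemaker:CreateTrainingJob"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::ACCOUNT_NUMBER:role/SAGEMAKER_ROLE_NAME"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
```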

This is what the policy should look like in the IAM console.

Now we’ve got permissions sorted out, let’s start writing our APIs.

Listing jobs

Nothing fancy here: just call the ListTrainingJobs API and return the two fields we’re interested in.

A note of caution here: when running a service locally, Chalice uses the AWS credentials of the current user, not the policy defined in the configuration file. Take this into account when testing and debugging!

Describing a job

Super simple. DescribeTrainingJob is all we need.

Retraining a job

This requires a little more work:

  • Describe the old training job.
  • Read the new job name from the request body, along with the optional new output location, instance type and instance count.
  • Reuse all other parameters. VpcConfig is causing code duplication, as it cannot be set to an empty dictionary (issue created, feel free to +1).
  • Train the new job.
  • Return the response from the CreateTrainingJob API.

Let’s deploy our service and test!
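Something along these lines, with the API Gateway endpoint that Chalice prints replacing the placeholders below:

```bash
$ chalice deploy
$ curl https://API_ID.execute-api.REGION.amazonaws.com/api/list/5
```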

All right, this works! Now we have a simple way to retrain any SageMaker job on demand with new parameters (code on Github).

What if we needed to retrain according to a specific schedule? Don’t you go writing cron jobs deployed on an EC2 instance! There is a much better way :)

Training periodically with a scheduled Lambda

Lambda functions can be triggered by all kinds of different events. One of them is CloudWatch Events, which lets us schedule the execution of a Lambda function at periodic intervals.

Of course, we could set everything up with AWS APIs, but when it comes to deploying Lambda functions, one of the simpler ways is to use the now well-known Serverless framework.

Configuring the Lambda function

The configuration is straightforward:

  • Define permissions. No need to worry about CloudWatch Logs permissions, they are added automatically. Make sure you use the same service role that was used to train the initial job.
  • Define a function named ‘main’.
  • Define an event source with a rate of 5 minutes (*** this is for testing purposes only: please change this! ***). We could also use a cron expression.
  • Define environment variables storing the name of the old job, a name prefix for the new job, the instance type and the instance count. This will provide initial values for new trainings, which can be updated at any time in the console or with the UpdateFunctionConfiguration API.
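A `serverless.yml` along these lines covers the four points above; the environment variable names and role placeholders are my assumptions, not necessarily those from the original gist.

```yaml
service: sagemaker-retrain

provider:
  name: aws
  runtime: python3.6
  iamRoleStatements:
    - Effect: Allow
      Action:
        - sagemaker:DescribeTrainingJob
        - sagemaker:CreateTrainingJob
      Resource: "*"
    - Effect: Allow
      Action: iam:PassRole
      Resource: arn:aws:iam::ACCOUNT_NUMBER:role/SAGEMAKER_ROLE_NAME

functions:
  main:
    handler: handler.main
    events:
      - schedule: rate(5 minutes)   # testing only: please change this!
    environment:
      JOB_NAME: my-training-job
      JOB_PREFIX: retrain
      INSTANCE_TYPE: ml.c4.xlarge
      INSTANCE_COUNT: "1"
```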

Writing the Lambda function

We can reuse the code written for our Chalice service. The main difference is of course that we’re now reading parameters from environment variables.

As this function will run many times, we’ll use the prefix to build a different name for each new job.

Deploying the Lambda function

Just like for Chalice, a single CLI command is all it takes.
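Namely:

```bash
$ serverless deploy
$ serverless invoke --function main
```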

We can check our function in the Lambda console.

The environment variables are visible too.

Everything looks fine. A few minutes later, training jobs start to appear in the SageMaker console. Hurrah!

We now have a completely automated way to retrain SageMaker models (code on Github). If we wanted to retrain different models, we would simply deploy additional Lambda functions with different parameters.

Conclusion

With just a few lines of code, we built two different solutions to retrain SageMaker models, and we deployed each with one single command. Both solutions are simple, reliable, scalable… and have pretty much zero cost. What’s not to like? I guess Serverless ML Ops are a thing, then :)

That’s it for today. Thank you for reading, happy to answer questions here or on Twitter.

For more content, please feel free to check out my YouTube channel.
