Predicting world temperature with time series and DeepAR on Amazon SageMaker

Julien Simon
Feb 1, 2018

Predicting time-based values is a popular use case for Machine Learning. Indeed, a lot of phenomena — from rainfall to fast-food queues to stock prices — exhibit time-based patterns that can be successfully captured by a Machine Learning model.

In this post, you will learn how to predict temperature time-series using DeepAR, one of the latest built-in algorithms added to Amazon SageMaker. As usual, a Jupyter notebook is available on GitHub.

The very nice collection of SageMaker sample notebooks includes another DeepAR example and I strongly encourage you to check it out. Mine differs in two ways: it uses a real-life data set (not a synthetic one) and it doesn’t use the pandas library, which I believe makes it a little easier to understand :)

A word about DeepAR

DeepAR is an algorithm introduced in 2017. It’s quite complex and I won’t go into details here. Let’s just say that unlike other techniques that train a different model for each time-series, DeepAR builds a single model for all time-series and tries to identify similarities across them. Intuitively, this sounds like a good idea for temperature time-series as we could expect them to exhibit similar patterns year after year.

If you’d like to know more about DeepAR, please refer to the original research article as well as the SageMaker documentation.

The Berkeley Earth dataset

This dataset contains a very large number of temperatures recorded across the globe since the 18th century. Here, I will use the Daily Land dataset, which holds a daily temperature measure from 1880 to 2014. Temperatures are reported as a variation (aka anomaly) from the 1951–1980 average (8.68°C), as visible in the sixth column.

Before we go any further, we have to decide how to build our time-series. Given data resolution (one data point per day), we should probably build yearly series: having long enough series (hundreds of samples at least) is one of the requirements for a successful model. Thus, we’re going to build 135 series of 365 samples (or 366 for leap years) using the second and sixth columns in our file.

In addition, we should decide how many samples we’d like to predict: let’s go for 30, i.e. predicting a month’s worth of temperatures.

Loading the dataset

Very little dataset preparation is required (woohoo!). Once we’ve downloaded and cleaned up the file (remove header and empty lines), we can directly load it into a list (for plotting) and a dictionary (for training). Note that we’re also adding the average temperature to all samples in order to work with actual temperatures, not variations.
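
As a rough sketch of this step, here’s how the loading could look in plain Python. It assumes a cleaned-up, whitespace-separated file with the year in the second column and the anomaly in the sixth, as described above. The function name load_series and the three sample rows are made up for illustration; they are not taken from the notebook or the real dataset.

```python
from collections import defaultdict

AVERAGE_TEMP = 8.68  # 1951-1980 average in °C, added back to each anomaly


def load_series(lines, average=AVERAGE_TEMP):
    """Return (all temperatures in file order, {year: [temperatures]})."""
    all_temps = []                 # flat list, handy for plotting
    by_year = defaultdict(list)    # one time series per year, for training
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue               # skip blank or malformed lines
        year = int(fields[1])                  # second column: year
        temp = float(fields[5]) + average      # sixth column: anomaly
        all_temps.append(temp)
        by_year[year].append(temp)
    return all_temps, dict(by_year)


# Made-up rows in the assumed layout:
# date number, year, month, day, day of year, anomaly
sample = [
    "1880.001 1880 1 1 1 -0.50",
    "1880.004 1880 1 2 2 0.25",
    "1881.001 1881 1 1 1 1.00",
]
all_temps, by_year = load_series(sample)
print(sorted(by_year))  # years found in the sample
```

With the real file, you would pass an open file object instead of the sample list.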

Plotting the dataset

Let’s take a quick look at our dataset.

I see an upward trend, but I’ll let each of you come to their own conclusions.

World temperature from 1880 to 2014.

Preparing the training and test sets

We’re not going to split 80/20 like we usually would. Things are a bit different when working with time series:

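  • Training set: we need to remove the last 30 sample points from each time series. Time series should also be shuffled. I skip that step here and rely on dictionary ordering instead, but keep in mind that Python dictionaries preserve insertion order as of Python 3.7, so an explicit shuffle is the safer choice ;)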
  • Test set: we can use the full dataset.
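
The two rules above can be sketched in a few lines. This is only an illustration under my own assumptions: the split_series name, the (year, values) pair layout and the toy data are hypothetical, and the shuffle is done explicitly so the result doesn’t depend on dictionary ordering.

```python
import random

PREDICTION_LENGTH = 30


def split_series(by_year, prediction_length=PREDICTION_LENGTH, seed=None):
    """Build (train, test) lists of (year, values) pairs."""
    # Test set: every series in full.
    test = [(year, list(values)) for year, values in by_year.items()]
    # Training set: same series minus the last `prediction_length` points.
    train = [(year, values[:-prediction_length]) for year, values in test]
    # Shuffle explicitly rather than relying on dictionary ordering.
    random.Random(seed).shuffle(train)
    return train, test


# Toy data: two "years" of 40 points each (not real temperatures).
toy = {1880: list(range(40)), 1881: list(range(100, 140))}
train, test = split_series(toy, seed=0)
print(len(train[0][1]), len(test[0][1]))
```

Holding back the last 30 points from training is what lets the test set measure how well the model predicts values it has never seen.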

Writing the datasets to JSON format

The input format for DeepAR is JSON Lines: one sample per line, such as this one.

{"start":"2009-11-01 00:00:00", "target": [4.3, 10.3, ...]}

Easy enough, let’s take care of it.
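
Here’s a minimal sketch of that conversion. It assumes each yearly series starts on January 1st and is available as a (year, values) pair; the function names and sample values are hypothetical.

```python
import json


def to_json_lines(series):
    """Serialize (year, values) pairs to JSON Lines, one series per line."""
    lines = []
    for year, values in series:
        record = {"start": f"{year}-01-01 00:00:00", "target": list(values)}
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"


def write_json_lines(path, series):
    """Write the serialized series to a local file."""
    with open(path, "w") as f:
        f.write(to_json_lines(series))


# Made-up values, just to show the shape of the output.
text = to_json_lines([(1880, [8.18, 8.93]), (1881, [9.68])])
print(text, end="")
```

Calling write_json_lines once for the training set and once for the test set would give the two files to upload.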

Uploading to S3

Next, let’s upload our data to S3. The SageMaker SDK has a nice little function to do this.
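
As an illustration, the upload could be wrapped as follows. The session argument is assumed to be a sagemaker.Session, whose upload_data method takes a local path, a bucket and a key prefix and returns the S3 URI. The bucket name, prefix and file names are placeholders, and FakeSession is a stand-in so the sketch runs without AWS credentials.

```python
def upload_channels(session, bucket, prefix, files):
    """Upload {channel: local_path} and return {channel: s3_uri}.

    `session` is assumed to be a sagemaker.Session (or anything exposing
    the same upload_data(path, bucket, key_prefix) method).
    """
    uris = {}
    for channel, path in files.items():
        uris[channel] = session.upload_data(
            path=path, bucket=bucket, key_prefix=f"{prefix}/{channel}"
        )
    return uris


class FakeSession:
    """Stand-in used to show the call without touching AWS."""

    def upload_data(self, path, bucket, key_prefix):
        return f"s3://{bucket}/{key_prefix}/{path}"


uris = upload_channels(
    FakeSession(), "my-bucket", "deepar-temperature",
    {"train": "train.json", "test": "test.json"},
)
print(uris["train"])
```

Keeping one prefix per channel makes it easy to point the training job at the train and test locations separately.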

Configuring the training job

As usual with built-in algorithms, we need to select the container corresponding to the region we run in and then create an Estimator. Nothing unusual here.

Now let’s look at hyperparameters.

Defining hyperparameters

Hyperparameters for DeepAR are detailed in the documentation. Let’s focus on the required ones:

  • time_freq: the resolution of time series (from minutes to months). Ours has daily resolution.
  • prediction_length: how many data samples we’re going to predict (30, remember?).
  • context_length: how many data points we’re going to look at before predicting. We’ll use 30 too, which should be enough to figure out temperatures.
  • epochs: I wonder what this does ;)

In addition, after an unreasonable number of different tests, I ended up getting better results with only two layers (default is three), a smaller learning rate and a large number of epochs.
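
Put together, the settings could be collected in a dictionary along these lines. The values for time_freq, prediction_length, context_length and the number of layers come from the text above; the epochs and learning_rate values in the example call are placeholders of my choosing, not the ones used for the results in this post, and the helper name is hypothetical.

```python
def deepar_hyperparameters(epochs, learning_rate, num_layers=2,
                           prediction_length=30, context_length=30):
    """Return DeepAR hyperparameters as a name -> string mapping."""
    params = {
        "time_freq": "D",                       # daily resolution
        "prediction_length": prediction_length,
        "context_length": context_length,
        "num_layers": num_layers,
        "learning_rate": learning_rate,
        "epochs": epochs,
    }
    # Training jobs receive hyperparameters as strings.
    return {name: str(value) for name, value in params.items()}


# Placeholder values for epochs and learning_rate, NOT the ones from the post.
hp = deepar_hyperparameters(epochs=100, learning_rate=0.001)
print(hp)
```

With an Estimator in hand, such a dictionary would typically be applied with estimator.set_hyperparameters(**hp).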

My intuition is that, with a pretty small data set and rather short time-series, three layers tend to overfit more. Two layers don’t learn as well, but letting them learn longer with a smaller learning rate makes up for it. Or something. Curious to hear your own opinion.

Training the model

Super simple :)

Training stops early and the best epoch is selected (#206). Here are my three metrics: loss for p50 and p90 (which tell us how accurate the predicted distribution is), as well as Root Mean Square Error.

[01/31/2018 22:19:52 INFO 140078416930624] #test_score (algo-1, wQuantileLoss[0.5]): 0.0584739
[01/31/2018 22:19:52 INFO 140078416930624] #test_score (algo-1, wQuantileLoss[0.9]): 0.0294685
[01/31/2018 22:19:52 INFO 140078416930624] #test_score (algo-1, RMSE): 0.62858571005
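
To make these metrics a little more concrete, here is a minimal sketch of how a weighted quantile loss and an RMSE can be computed for a single series. It assumes the usual definition of the weighted quantile loss (quantile losses summed over all points, doubled, then divided by the sum of absolute target values). The function names and numbers are illustrative, and the actual training job aggregates over all test series.

```python
import math


def weighted_quantile_loss(actual, predicted, tau):
    """2 * sum of quantile (pinball) losses, normalized by sum of |actual|."""
    total = 0.0
    for y, q in zip(actual, predicted):
        total += tau * max(y - q, 0.0) + (1 - tau) * max(q - y, 0.0)
    return 2 * total / sum(abs(y) for y in actual)


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    squared = [(y - p) ** 2 for y, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


actual = [8.0, 9.0, 10.0]    # toy ground truth
forecast = [8.5, 9.0, 9.0]   # toy forecast
print(weighted_quantile_loss(actual, forecast, 0.5))
print(rmse(actual, forecast))
```

For both metrics, lower is better: a quantile loss close to zero means the predicted quantile sits close to where the actual values fall.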

OK, now let’s deploy this model and use it.

Deploying the model

Nothing complicated here: create an endpoint hosting our model and create a RealTimePredictor to send requests to.

Building a prediction request

According to the inference format for DeepAR, here’s what we should send to the endpoint:

  • JSON-formatted samples (we’ll use only one at a time)
  • an optional configuration listing the series we’d like to receive (mean values, quantiles and raw samples: we’ll take all of them, thank you) as well as the number of samples from which they’re built.
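
A request with these two parts could be assembled like this. It’s a small sketch: the build_request helper and the three target values are hypothetical, while the field names follow the request format described above.

```python
import json


def build_request(start, target, quantiles=("0.1", "0.9"), num_samples=100):
    """Return the JSON body for a single-series prediction request."""
    request = {
        "instances": [{"start": start, "target": list(target)}],
        "configuration": {
            "output_types": ["mean", "quantiles", "samples"],
            "quantiles": list(quantiles),
            "num_samples": num_samples,
        },
    }
    return json.dumps(request)


# Three made-up values, just to keep the output short.
body = build_request("2018-01-01 00:00:00", [8.37, 8.38, 8.86])
print(body)
```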

Here’s an example request, where we provide the first 30 data points.

{"instances": [ {"start": "2018-01-01 00:00:00", "target": [8.371208085491508, 8.38437885371535, 8.860699073980985, 8.047195011672134, 9.42771383264719, 8.02120332304575, 9.839234913116105, 9.237618947392374, 8.214949470821212, 9.814497679561292, 9.052164695305954, 8.102437854966766, 8.928941871965348, 9.844116398312188, 9.221646100693144, 8.853571486995326, 8.560903044968434, 8.240263518568812, 9.221323908588538, 9.448381346299827, 9.996678314417732, 8.520757726306975, 9.978841260562627, 9.196420806291513, 9.587904493744922, 9.367880938747199, 9.606228859687628, 9.277298500001638, 8.694011829622228, 8.264125277439893]}], 
"configuration": {"output_types": ["mean", "quantiles", "samples"], "quantiles": ["0.1", "0.9"], "num_samples": 100}}

Extracting prediction results

Once we get prediction results, we need to extract each time series for plotting. Obviously, we’re not interested in all 100 raw sample series, so let’s just pick one at random.
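
Here’s one way the extraction could look. It assumes the response is a JSON document with a "predictions" list whose entries hold "mean", "quantiles" and "samples", mirroring the output types requested. The response below is hand-written for illustration and did not come from a real endpoint.

```python
import json
import random


def extract_series(response_body, rng=random):
    """Pull mean, quantiles and one raw sample out of a prediction response."""
    prediction = json.loads(response_body)["predictions"][0]
    mean = prediction["mean"]
    q10 = prediction["quantiles"]["0.1"]
    q90 = prediction["quantiles"]["0.9"]
    sample = rng.choice(prediction["samples"])  # one of the raw sample series
    return mean, q10, q90, sample


# Hand-written response in the assumed layout (two predicted points).
fake = json.dumps({"predictions": [{
    "mean": [9.0, 9.1],
    "quantiles": {"0.1": [8.5, 8.6], "0.9": [9.5, 9.6]},
    "samples": [[8.9, 9.2], [9.1, 9.0]],
}]})
mean, q10, q90, sample = extract_series(fake)
print(mean, q10, q90)
```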

Plotting prediction results

Pretty graphs: everyone loves them. They’ll also help us get a sense of how well we’re predicting. We’ll throw in ground truth for good measure.

Predicting some samples

All right, time to put all of this to work!

First, let’s predict the last 30 days of 1984 and compare to ground truth.

Purple vs blue: not too bad!

Let’s try another example. This time, suppose that we have data samples for the first 90 days of 2018 (we’ll just use random values here) and that we want to predict the next 30 days. Here’s how you would do it.
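
The following is a hedged sketch of that flow rather than the notebook’s code. The predictor argument is assumed to be any object whose predict method accepts a JSON request body and returns a JSON response, such as the predictor created earlier. The 8 to 10°C range for the random values is arbitrary, and EchoPredictor is an offline stand-in so the example runs without an endpoint.

```python
import json
import random


def predict_next(predictor, start, history, num_samples=100):
    """Send `history` to the endpoint and return the parsed first prediction."""
    request = {
        "instances": [{"start": start, "target": history}],
        "configuration": {
            "output_types": ["mean", "quantiles", "samples"],
            "quantiles": ["0.1", "0.9"],
            "num_samples": num_samples,
        },
    }
    response = predictor.predict(json.dumps(request).encode("utf-8"))
    return json.loads(response)["predictions"][0]


class EchoPredictor:
    """Offline stand-in: 'forecasts' 30 copies of the last observed value."""

    def predict(self, body):
        target = json.loads(body)["instances"][0]["target"]
        flat = [target[-1]] * 30
        return json.dumps({"predictions": [{
            "mean": flat,
            "quantiles": {"0.1": flat, "0.9": flat},
            "samples": [flat],
        }]}).encode("utf-8")


# 90 random "temperatures" between 8 and 10 °C (an arbitrary range).
history = [random.uniform(8.0, 10.0) for _ in range(90)]
prediction = predict_next(EchoPredictor(), "2018-01-01 00:00:00", history)
print(len(prediction["mean"]))
```

Against the real endpoint, the number of predicted points is determined by the prediction_length the model was trained with.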


As you can see, built-in algorithms like DeepAR are a great way to get the job done quickly: no training code to write, no infrastructure drama to endure. We can thus focus on experimenting with our time series and hyperparameters to get the best result possible.

If you’re curious about other SageMaker built-in algorithms, here are some previous posts on:

As always, thank you for reading. Happy to answer questions on Twitter.

“Do you wanna know the truth, son? Lord, I’ll tell you the truth. Your soul’s gonna burn in a lake of fire”.