ImageNet — part 1: going on an adventure
When it comes to building image classifiers, ImageNet is probably the best-known data set. It’s also used for the annual ILSVRC competition, where researchers from all over the world compete to build the most accurate models.
As previously discussed, models are frequently trained on ImageNet before being fine-tuned on other image sets. Fine-tuning is a much faster process and a great way to get very good accuracy in just a few epochs.
Still, what does it take to actually grab ImageNet and prepare it for training? Let’s find out. We’ll start with Apache MXNet, but who knows where we’ll end up?
ImageNet in numbers
Clocking in at 150 GB, ImageNet is quite a beast. It holds 1,281,167 images for training and 50,000 images for validation, organised in 1,000 categories. We’re pretty far from MNIST or CIFAR-10!
This creates all kinds of interesting problems that need to be solved before training even starts, namely:
- Downloading the data set,
- Organising the data set for training,
- Creating RecordIO files to optimize I/O,
- Backing up the data set, because who wants to download 150GB again?
- Deploying the data set efficiently to GPU instances for training.
We’re going to look at all these steps. As always, the devil is in the details.
Starting up a download instance
We need to download 150GB. That’s gonna take a while (think days) and it’s gonna fill, well, just about 150GB of disk space. So you’d better plan ahead.
Here’s how I did it: I started a t2.large instance (but t2.micro should work too), which is more than powerful enough to handle the download. I also attached a 1000GB EBS volume to it (because we’ll need additional space later on). As I/O performance is not important here, I picked the least expensive volume type (sc1).
Then I ssh’ed into the instance, formatted the volume, mounted it on /data and chown’ed it recursively to ec2-user:ec2-user. I hope you don’t need me to show you these steps, but here’s the tutorial, just in case ;)
Downloading the data set
It starts easy. Head out to the ImageNet website, register, accept conditions and voila. You’ll get a username and an access key which will let you download the data set, which is composed of several very large files: no way we can right click and “save as”.
Fortunately, one of the TensorFlow repositories includes a nice download script, download_imagenet.sh. It’s quite straightforward. This is how to use it.
And then, you have to wait for a bit: my download took about 5 days…
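To put that figure in perspective, here’s a quick back-of-the-envelope calculation of the average throughput implied by 150GB in 5 days. It’s only a sketch: it assumes decimal gigabytes (1 GB = 10^9 bytes) and a perfectly steady transfer, neither of which a real download guarantees.

```python
def average_throughput(total_bytes: float, days: float) -> float:
    """Return the average transfer rate in bytes per second."""
    seconds = days * 24 * 60 * 60
    return total_bytes / seconds


# Assumption: 150 GB means 150 * 10**9 bytes, downloaded at a constant rate.
rate = average_throughput(150 * 10**9, 5)
print(f"{rate / 10**6:.2f} MB/s, or about {rate * 8 / 10**6:.1f} Mbit/s")
```

Plug in your own numbers to estimate how long the download would take on your connection.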
Once the script has completed, your directory should look like this.
Organising the data set for training
Let’s take a look at the data set. If you list the imagenet/train directory, you’ll see 1,000 directories, each of them holding images for a given ImageNet category. Yes, they have weird names, which is why you need to grab this file to know what’s what, e.g. n02510455 is the category for giant pandas.
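If you’d rather look up category names programmatically, here’s a minimal sketch of a parser for such a mapping file. I’m assuming one category per line, with the WordNet ID first, then whitespace, then the description: check the actual layout of the file you grabbed before relying on it. The function name is mine, not part of any library.

```python
from typing import Dict, Iterable


def parse_synsets(lines: Iterable[str]) -> Dict[str, str]:
    """Map WordNet IDs (e.g. 'n02510455') to their human-readable names."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumed layout: ID, whitespace (space or tab), then the description.
        parts = line.split(None, 1)
        wnid = parts[0]
        description = parts[1] if len(parts) > 1 else ""
        mapping[wnid] = description
    return mapping


# Illustrative sample line; a real file would hold 1,000 of these.
sample = ["n02510455 giant panda, panda"]
print(parse_synsets(sample)["n02510455"])
```

With a real file, you’d pass an open file handle instead of the sample list, since iterating over a file yields its lines.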
If you list the imagenet/validation directory, you’ll see 50,000 images in a single directory. That’s not really practical: we’d like to have them in 1,000 directories as well. This script will take care of it: simply run it inside the validation directory.
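For the curious, here’s roughly what such a reorganisation boils down to, sketched in Python. This is not the script mentioned above: the function name and the `labels` dictionary (image file name to category ID) are assumptions of mine, and you’d have to build that dictionary from whatever validation label file you have at hand.

```python
import shutil
from pathlib import Path
from typing import Dict


def organise_validation(val_dir: Path, labels: Dict[str, str]) -> int:
    """Move each image listed in `labels` into val_dir/<category>/.

    `labels` maps an image file name to its category ID (hypothetical input).
    Returns the number of files actually moved.
    """
    moved = 0
    for filename, wnid in labels.items():
        source = val_dir / filename
        if not source.is_file():
            continue  # already moved or missing: leave it alone
        target_dir = val_dir / wnid
        target_dir.mkdir(exist_ok=True)
        shutil.move(str(source), str(target_dir / filename))
        moved += 1
    return moved
```

Because files that are missing or already moved are skipped, running it twice is harmless, which is handy if the first run gets interrupted.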
Creating RecordIO files to optimize I/O
During training, we could definitely load images from disk. However, this would require a lot of I/O, especially when using multiple GPUs: these beasts are very hungry and you need to keep feeding them with data at the proper throughput. Failing to do so will stall the GPUs and your training speed will drop (more on this in a future post).
Another problem arises when distributed training is used, i.e. when multiple GPU instances are training on the same data set. Sure, we could copy the full data set to each instance, but that may be impractical for huge ones.
In order to solve both issues, we’re going to convert the data set into RecordIO files. This compact format is both I/O efficient and easily shareable across instances.
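To get a feel for why a single packed file helps, here’s a toy illustration of the general idea: many small payloads stored back to back, each prefixed with its length, plus a list of offsets for random access. To be clear, this is NOT MXNet’s actual RecordIO layout, just a simplified sketch with function names of my own.

```python
import io
import struct
from typing import BinaryIO, List


def write_records(stream: BinaryIO, payloads: List[bytes]) -> List[int]:
    """Append length-prefixed records to `stream`; return each record's offset."""
    offsets = []
    for payload in payloads:
        offsets.append(stream.tell())
        stream.write(struct.pack("<I", len(payload)))  # 4-byte little-endian length
        stream.write(payload)
    return offsets


def read_record(stream: BinaryIO, offset: int) -> bytes:
    """Seek to `offset` and read back the single record stored there."""
    stream.seek(offset)
    (length,) = struct.unpack("<I", stream.read(4))
    return stream.read(length)


# In-memory demo: the payloads stand in for encoded images.
buffer = io.BytesIO()
index = write_records(buffer, [b"cat", b"panda!"])
print(read_record(buffer, index[1]))
```

Reading the whole file front to back is one long sequential read instead of over a million small file opens, and the offset list still lets you jump straight to any record.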
The process is pretty simple: we’re going to pack the training set and the validation set into their own RecordIO files. We’re also going to resize and compress the images a bit to save some space: this won’t have any impact on training quality, since most ImageNet models require 224x224 images. Feel free to create plenty of threads to maximize throughput :)
Let’s start with the validation set. It only takes a couple of minutes.
Now, let’s do the same thing for the training set. This is going to run for a while (about 1h30 on my t2.xlarge).
At this point, losing all this would suck beyond belief, wouldn’t it? Let’s make sure we back everything up in S3. Just create a bucket and sync it with /data. Yes, that’s going to take a while.
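If you’d like to drive the backup from a script, here’s a minimal sketch that builds and runs the AWS CLI `aws s3 sync` command. It assumes the AWS CLI is installed and has credentials allowing writes to the bucket. The helper names and the bucket name are placeholders of mine.

```python
import subprocess
from typing import List


def s3_sync_command(local_dir: str, bucket: str, prefix: str = "") -> List[str]:
    """Build the `aws s3 sync` command that mirrors local_dir to the bucket."""
    destination = f"s3://{bucket}/{prefix}".rstrip("/")
    return ["aws", "s3", "sync", local_dir, destination]


def backup(local_dir: str, bucket: str, prefix: str = "") -> None:
    """Run the sync. Requires the AWS CLI and valid credentials."""
    subprocess.run(s3_sync_command(local_dir, bucket, prefix), check=True)


# 'my-imagenet-backup' is a hypothetical bucket name.
print(" ".join(s3_sync_command("/data", "my-imagenet-backup")))
```

Since sync only copies what’s missing or changed, you can safely rerun `backup` if the first attempt is interrupted.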
Once the backup is over, you should:
- terminate the download instance,
- create a snapshot of the EBS volume: it’s another long operation but better safe than sorry, plus it’s going to help us deploy the data set to as many instances as we need, including across accounts and regions if needed.
Deploying the data set
Deploying the data set to a new GPU instance is now as easy as:
- creating a new EBS volume from the snapshot (make sure you create it in the same AZ as your instance),
- attaching it to the instance,
- mounting the filesystem.
This will only take a few seconds and can easily be scripted and scaled to as many instances as needed (aws ec2 create-volume, aws ec2 attach-volume). For full automation, you could perform these operations as User Data commands at instance launch.
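Here’s one way the first two steps could be scripted, as a sketch wrapping the AWS CLI from Python. The helper names, the default device name and all IDs are placeholders of mine. It assumes the AWS CLI is installed with suitable EC2 permissions, and mounting the filesystem still has to happen on the instance itself.

```python
import subprocess
from typing import List


def create_volume_command(snapshot_id: str, availability_zone: str) -> List[str]:
    """Build the command creating a volume from the snapshot, printing its ID."""
    return ["aws", "ec2", "create-volume",
            "--snapshot-id", snapshot_id,
            "--availability-zone", availability_zone,
            "--query", "VolumeId", "--output", "text"]


def attach_volume_command(volume_id: str, instance_id: str,
                          device: str = "/dev/sdf") -> List[str]:
    """Build the command attaching the volume to the instance."""
    return ["aws", "ec2", "attach-volume",
            "--volume-id", volume_id,
            "--instance-id", instance_id,
            "--device", device]


def deploy(snapshot_id: str, availability_zone: str, instance_id: str) -> str:
    """Create a volume, wait until it is available, attach it, return its ID."""
    result = subprocess.run(create_volume_command(snapshot_id, availability_zone),
                            capture_output=True, text=True, check=True)
    volume_id = result.stdout.strip()
    # Block until the new volume leaves the 'creating' state.
    subprocess.run(["aws", "ec2", "wait", "volume-available",
                    "--volume-ids", volume_id], check=True)
    subprocess.run(attach_volume_command(volume_id, instance_id), check=True)
    return volume_id
```

Remember that the availability zone passed to `deploy` must be the one your instance runs in, otherwise the attach step will fail.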
Sure, it takes a while to download ImageNet, but thanks to the flexibility of EBS, we’re now able to deploy it as many times as needed in just a few seconds. Of course, you can easily apply this fast and cost-effective technique to other data sets :)
In the next post, we’ll train a model from scratch and focus on making sure that we get as much performance as possible from our GPU instance.
That’s it for today. As always, thank you for reading.
This article was written while listening to vintage AC/DC songs: “It’s a long way to the top if you wanna… train ImageNet” ;)