An introduction to the MXNet API — part 5

In part 4, we saw how easy it was to use a pre-trained version of the Inception v3 model for image classification. In this article, we’re going to load two other famous Convolutional Neural Networks (VGG16 and ResNet-152) and compare them to Inception v3.

Architecture of a CNN (Source: Nvidia)

Published in 2014, VGG16 is a model built from 16 layers (research paper). It was one of the top entries in the 2014 ImageNet challenge, achieving a 7.4% top-5 error rate on image classification.

Published in 2015, ResNet-152 is a model built from 152 layers (research paper). It won the 2015 ImageNet challenge by achieving a record 3.57% top-5 error rate on image classification. That’s much better than the typical human error rate, which is usually measured at around 5%.

Time to visit the model zoo once again. Just like for Inception v3, we need to download model definitions and parameters. All three models have been trained on the same categories, so we can reuse our synset.txt file.

$ wget
$ wget
$ wget
$ wget

All three models have been trained on the ImageNet data set, with a typical image size of 224 x 224. Since data shape and categories are identical, we can reuse our previous code as-is.

All we have to change is the model name :) Let’s just add a parameter to our loadModel() and init() functions.

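import mxnet as mx

def loadModel(modelname):
    # Load the symbol and parameters saved at epoch 0 under the given prefix
    sym, arg_params, aux_params = mx.model.load_checkpoint(modelname, 0)
    mod = mx.mod.Module(symbol=sym)
    # Bind for inference only: a batch of one 3-channel 224 x 224 image
    mod.bind(for_training=False, data_shapes=[('data', (1,3,224,224))])
    mod.set_params(arg_params, aux_params)
    return mod

def init(modelname):
    model = loadModel(modelname)
    cats = loadCategories()  # reused from our previous code
    return model, cats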

Let’s compare these models on a couple of images.

*** VGG16
[(0.58786136, 'n03272010 electric guitar'), (0.29260877, 'n04296562 stage'), (0.013744719, 'n04487394 trombone'), (0.013494448, 'n04141076 sax, saxophone'), (0.00988709, 'n02231487 walking stick, walkingstick, stick insect')]

Good job on the top two categories, but the other three are wildly wrong. Looks like the vertical shape of the microphone stand confused the model.

*** ResNet-152
[(0.91063803, 'n04296562 stage'), (0.039011702, 'n03272010 electric guitar'), (0.031426914, 'n03759954 microphone, mike'), (0.011822623, 'n04286575 spotlight, spot'), (0.0020199812, 'n02676566 acoustic guitar')]

Very high on the top category. The other four are all meaningful.

*** Inception v3
[(0.58039135, 'n03272010 electric guitar'), (0.27168664, 'n04296562 stage'), (0.090769522, 'n04456115 torch'), (0.023762707, 'n04286575 spotlight, spot'), (0.0081428187, 'n03250847 drumstick')]

Very similar results to VGG16 for the top two categories. The other three are a mixed bag.

Let’s try another picture.

*** VGG16
[(0.96909302, 'n04536866 violin, fiddle'), (0.026661994, 'n02992211 cello, violoncello'), (0.0017284016, 'n02879718 bow'), (0.00056815811, 'n04517823 vacuum, vacuum cleaner'), (0.00024804732, 'n04090263 rifle')]
*** ResNet-152
[(0.96826887, 'n04536866 violin, fiddle'), (0.028052919, 'n02992211 cello, violoncello'), (0.0008367821, 'n02676566 acoustic guitar'), (0.00070532493, 'n02787622 banjo'), (0.00039021231, 'n02879718 bow')]
*** Inception v3
[(0.82023674, 'n04536866 violin, fiddle'), (0.15483995, 'n02992211 cello, violoncello'), (0.0044540241, 'n02676566 acoustic guitar'), (0.0020963412, 'n02879718 bow'), (0.0015099624, 'n03447721 gong, tam-tam')]

All three models score very high on the top category. One can suppose that the shape of a violin is a very unambiguous pattern for a neural network.
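To make this kind of comparison less dependent on eyeballing, here is a minimal sketch that checks whether the models agree on the top category and which labels they all share in their top 5. The predictions are copied from the violin output above; the helper names `top1` and `shared_labels` are made up for this illustration.

```python
# Top-5 predictions for the violin picture, copied from the output above.
results = {
    "VGG16": [(0.96909302, 'n04536866 violin, fiddle'), (0.026661994, 'n02992211 cello, violoncello'), (0.0017284016, 'n02879718 bow'), (0.00056815811, 'n04517823 vacuum, vacuum cleaner'), (0.00024804732, 'n04090263 rifle')],
    "ResNet-152": [(0.96826887, 'n04536866 violin, fiddle'), (0.028052919, 'n02992211 cello, violoncello'), (0.0008367821, 'n02676566 acoustic guitar'), (0.00070532493, 'n02787622 banjo'), (0.00039021231, 'n02879718 bow')],
    "Inception v3": [(0.82023674, 'n04536866 violin, fiddle'), (0.15483995, 'n02992211 cello, violoncello'), (0.0044540241, 'n02676566 acoustic guitar'), (0.0020963412, 'n02879718 bow'), (0.0015099624, 'n03447721 gong, tam-tam')],
}

def top1(predictions):
    """Return the label with the highest probability."""
    return max(predictions, key=lambda p: p[0])[1]

def shared_labels(results):
    """Return the labels that appear in every model's top-5 list."""
    label_sets = [{label for _, label in preds} for preds in results.values()]
    return set.intersection(*label_sets)

if __name__ == "__main__":
    for name, preds in results.items():
        print("%s top category: %s" % (name, top1(preds)))
    print("Labels shared by all models:", sorted(shared_labels(results)))
```

Running the same check on a larger set of your own images would give a much better picture than a couple of samples.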

Obviously, it’s impossible to draw conclusions from a couple of samples. If you’re looking for a pre-trained model, you should definitely look at the training set, run tests on your own data and make up your mind!

You’ll find extensive model benchmarks in research papers such as this one. For developers, the two most important factors will probably be:

  • how much memory does the model require?
  • how fast can it predict?

To answer the first question, we could take an educated guess by looking at the size of the parameters file:

  • VGG16: 528MB (about 140 million parameters)
  • ResNet-152: 230MB (about 60 million parameters)
  • Inception v3: 43MB (about 11 million parameters)

As we can see, the current trend is to use deeper networks with fewer parameters. This has a double benefit: faster training time (since the network has fewer parameters to learn) and reduced memory usage.
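The educated guess from file size to parameter count can be sketched as follows. This is only a rough estimate: it assumes that the parameters file holds almost nothing but 32-bit floats (4 bytes each), that "MB" means 2^20 bytes, and it ignores any file format overhead. The function name `estimate_parameters` is made up for this illustration.

```python
BYTES_PER_FLOAT32 = 4  # assumption: weights are stored as 32-bit floats

def estimate_parameters(file_size_mb):
    """Estimate how many float32 parameters fit in a file of the given size (in MB)."""
    return file_size_mb * 1024 * 1024 // BYTES_PER_FLOAT32

# File sizes listed above, in MB
sizes_mb = {"VGG16": 528, "ResNet-152": 230, "Inception v3": 43}

if __name__ == "__main__":
    for name, size in sizes_mb.items():
        millions = estimate_parameters(size) / 1e6
        print("%s: about %.0f million parameters" % (name, millions))
```

Keep in mind that memory usage at prediction time also includes intermediate activations, so the parameter count only gives a lower bound.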

The second question is more complex and depends on many parameters such as batch size. Let’s time the prediction call and run our examples again.

import time

t1 = time.time()
# Run the prediction call from part 4 here
t2 = time.time()
t = 1000*(t2-t1)  # elapsed time in milliseconds
print("Predicted in %2.2f millisecond" % t)

This is what we see (values have been averaged over a few calls).

*** VGG16
Predicted in 0.30 millisecond
*** ResNet-152
Predicted in 0.90 millisecond
*** Inception v3
Predicted in 0.40 millisecond
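The averaging over a few calls could be done with a small helper like the one below. This is a sketch, not the code used for the numbers above: `average_ms` and `predict` are hypothetical names, and `predict` is a stand-in workload to replace with your own prediction call. One caveat: MXNet executes operations asynchronously, so a timed call should probably also fetch the result (for instance by converting the output to a NumPy array), otherwise you may only be measuring the time needed to queue the work.

```python
import time

def average_ms(fn, runs=10, warmup=1):
    """Return the mean wall-clock time of fn() in milliseconds."""
    # Warm-up calls are not timed: the first call often pays one-off costs
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    elapsed = time.perf_counter() - start
    return 1000 * elapsed / runs

def predict():
    # Placeholder workload: replace with the real prediction call
    return sum(i * i for i in range(10000))

if __name__ == "__main__":
    print("Predicted in %2.2f millisecond" % average_ms(predict))
```

Using `time.perf_counter()` instead of `time.time()` is a deliberate choice here: it is a monotonic, high-resolution clock, which matters when the measured durations are well under a millisecond.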

To sum things up (standard disclaimer applies):

  • ResNet-152 has the best accuracy of all three networks (by far) but it’s also 2–3 times slower.
  • VGG16 is the fastest (perhaps due to its small number of layers?), but it has the highest memory usage and the worst accuracy.
  • Inception v3 is almost as fast, while delivering better accuracy and the most conservative memory usage. This last point makes it a good candidate for constrained environments. More on this in part 6 :)

That’s it for today! Full code below.


  • Part 6: Real-time object detection on a Raspberry Pi (and it speaks, too!)
