An introduction to the MXNet API — part 5

Architecture of a CNN (Source: Nvidia)

VGG16

Published in 2014, VGG16 is a model built from 16 layers (research paper). It achieved a 7.4% top-5 error rate on object classification at the 2014 ImageNet challenge, winning the localization task in the process.

ResNet-152

Published in 2015, ResNet-152 is a model built from 152 layers (research paper). It won the 2015 ImageNet challenge with a record 3.57% top-5 error rate on object classification. That’s much better than the typical human error rate, usually measured at around 5%.

Downloading the models

Time to visit the model zoo once again. Just like for Inception v3, we need to download model definitions and parameters. All three models have been trained on the same categories, so we can reuse our synset.txt file.

$ wget http://data.dmlc.ml/models/imagenet/vgg/vgg16-symbol.json
$ wget http://data.dmlc.ml/models/imagenet/vgg/vgg16-0000.params
$ wget http://data.dmlc.ml/models/imagenet/resnet/152-layers/resnet-152-symbol.json
$ wget http://data.dmlc.ml/models/imagenet/resnet/152-layers/resnet-152-0000.params

Loading the models

All three models have been trained on the ImageNet data set, with a typical image size of 224 x 224. Since data shape and categories are identical, we can reuse our previous code as-is.

import mxnet as mx

def loadModel(modelname):
    # Load the symbol definition and trained weights from the checkpoint
    # files downloaded above (prefix + '-symbol.json' / '-0000.params').
    sym, arg_params, aux_params = mx.model.load_checkpoint(modelname, 0)
    # Bind the module for inference on a single 224 x 224 RGB image.
    mod = mx.mod.Module(symbol=sym)
    mod.bind(for_training=False, data_shapes=[('data', (1, 3, 224, 224))])
    mod.set_params(arg_params, aux_params)
    return mod

def init(modelname):
    model = loadModel(modelname)
    cats = loadCategories()
    return model, cats
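For reference, loadCategories() is the helper from the previous parts. A minimal sketch, assuming the synset.txt file downloaded earlier sits in the current directory:

def loadCategories():
    # Read the 1,000 ImageNet labels from synset.txt, one per line
    # (e.g. 'n03272010 electric guitar').
    with open('synset.txt', 'r') as f:
        return [line.rstrip() for line in f]

We can then initialize each model by its checkpoint prefix, e.g.:

vgg16, categories = init("vgg16")
resnet152, categories = init("resnet-152")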

Comparing predictions

Let’s compare these models on a couple of images.
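The top-5 lists below were produced with a predict() helper similar to the one built in the previous parts. Here is a minimal sketch; the cv2-based loading and the resize/transpose preprocessing are assumptions based on the usual MXNet recipe, not necessarily the exact code from the earlier articles:

import numpy as np
import cv2
from collections import namedtuple

Batch = namedtuple('Batch', ['data'])

def predict(filename, model, categories, n=5):
    # Load the image and convert it from BGR/HWC (OpenCV layout)
    # to a 1x3x224x224 RGB/CHW NDArray (MXNet layout).
    img = cv2.imread(filename)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (224, 224))
    img = np.swapaxes(img, 0, 2)
    img = np.swapaxes(img, 1, 2)
    array = mx.nd.array(img.reshape((1, 3, 224, 224)))
    # Forward pass, then sort the 1,000 probabilities in descending order.
    model.forward(Batch([array]))
    prob = model.get_outputs()[0].asnumpy()[0]
    top = np.argsort(prob)[::-1][:n]
    return [(prob[i], categories[i]) for i in top]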

First image:

*** VGG16
[(0.58786136, 'n03272010 electric guitar'), (0.29260877, 'n04296562 stage'), (0.013744719, 'n04487394 trombone'), (0.013494448, 'n04141076 sax, saxophone'), (0.00988709, 'n02231487 walking stick, walkingstick, stick insect')]
*** ResNet-152
[(0.91063803, 'n04296562 stage'), (0.039011702, 'n03272010 electric guitar'), (0.031426914, 'n03759954 microphone, mike'), (0.011822623, 'n04286575 spotlight, spot'), (0.0020199812, 'n02676566 acoustic guitar')]
*** Inception v3
[(0.58039135, 'n03272010 electric guitar'), (0.27168664, 'n04296562 stage'), (0.090769522, 'n04456115 torch'), (0.023762707, 'n04286575 spotlight, spot'), (0.0081428187, 'n03250847 drumstick')]

Second image:

*** VGG16
[(0.96909302, 'n04536866 violin, fiddle'), (0.026661994, 'n02992211 cello, violoncello'), (0.0017284016, 'n02879718 bow'), (0.00056815811, 'n04517823 vacuum, vacuum cleaner'), (0.00024804732, 'n04090263 rifle')]
*** ResNet-152
[(0.96826887, 'n04536866 violin, fiddle'), (0.028052919, 'n02992211 cello, violoncello'), (0.0008367821, 'n02676566 acoustic guitar'), (0.00070532493, 'n02787622 banjo'), (0.00039021231, 'n02879718 bow')]
*** Inception v3
[(0.82023674, 'n04536866 violin, fiddle'), (0.15483995, 'n02992211 cello, violoncello'), (0.0044540241, 'n02676566 acoustic guitar'), (0.0020963412, 'n02879718 bow'), (0.0015099624, 'n03447721 gong, tam-tam')]

Comparing technical performance

You’ll find extensive model benchmarks in research papers such as this one. For developers, the two most important factors will probably be:

  • how much memory does the model require?
  • how fast can it predict?

The size of the parameter file answers the first question:

  • VGG16: 528MB (about 140 million parameters)
  • ResNet-152: 230MB (about 60 million parameters)
  • Inception v3: 43MB (about 25 million parameters)
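These figures are easy to double-check: load_checkpoint() hands us the parameter arrays, so we can simply count weights. A quick sketch, assuming 32-bit floats (4 bytes per parameter):

def countParams(modelname):
    # Sum the element counts of all parameter arrays in the checkpoint.
    _, arg_params, aux_params = mx.model.load_checkpoint(modelname, 0)
    n = sum(p.size for p in arg_params.values())
    n += sum(p.size for p in aux_params.values())
    print("%s: %.0f million parameters, about %dMB as float32"
          % (modelname, n / 1e6, n * 4 / (1024 * 1024)))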
As for the second question, let’s time a single forward pass. This reuses the Batch namedtuple and the 1x3x224x224 array built in the predict() sketch above:

import time

t1 = time.time()
model.forward(Batch([array]))
t2 = time.time()
t = 1000 * (t2 - t1)
print("Predicted in %2.2f milliseconds" % t)
*** VGG16
Predicted in 0.30 milliseconds
*** ResNet-152
Predicted in 0.90 milliseconds
*** Inception v3
Predicted in 0.40 milliseconds

So what does this tell us?
  • ResNet-152 has the best accuracy of all three networks (by far) but it’s also 2–3 times slower.
  • VGG16 is the fastest (perhaps due to its small number of layers?), but it has the highest memory usage and the worst accuracy.
  • Inception v3 is almost as fast, while delivering better accuracy and the most conservative memory usage. This last point makes it a good candidate for constrained environments. More on this in part 6 :)
  • Part 6: Real-time object detection on a Raspberry Pi (and it speaks, too!)
