An introduction to the MXNet API — part 3

Julien Simon
Apr 12, 2017


In part 2, we discussed how Symbols allow us to define computation graphs processing data stored in NDArrays (which we studied in part 1).

In this article, we’re going to use what we learned about Symbols and NDArrays to prepare some data and build a neural network. Then, we’ll use the Module API to train the network and predict results.

Defining our data set

Our (imaginary) data set is composed of 1000 data samples:

  • Each sample has 100 features.
  • A feature is represented by a float value between 0 and 1.
  • Samples are split into 10 categories. The purpose of the network will be to predict the correct category for a given sample.
  • We’ll use 800 samples for training and 200 samples for validation.
  • We’ll use a batch size of 10 for training and validation.

import mxnet as mx
import numpy as np
import logging

logging.basicConfig(level=logging.INFO)

sample_count = 1000
train_count = 800
valid_count = sample_count - train_count

feature_count = 100
category_count = 10
batch = 10  # batch size, used when building the data iterators

Generating the data set

Let’s use a uniform distribution to generate the 1000 samples. They are stored in an NDArray named ‘X’: 1000 rows, 100 columns.

X = mx.nd.uniform(low=0, high=1, shape=(sample_count,feature_count))

>>> X.shape
(1000L, 100L)
>>> X.asnumpy()
array([[ 0.70029777, 0.28444085, 0.46263582, ..., 0.73365158,
0.99670047, 0.5961988 ],
[ 0.34659418, 0.82824177, 0.72929877, ..., 0.56012964,
0.32261589, 0.35627609],
[ 0.10939316, 0.02995235, 0.97597599, ..., 0.20194994,
0.9266268 , 0.25102937],
[ 0.69691515, 0.52568913, 0.21130568, ..., 0.42498392,
0.80869114, 0.23635457],
[ 0.3562004 , 0.5794751 , 0.38135922, ..., 0.6336484 ,
0.26392782, 0.30010447],
[ 0.40369365, 0.89351988, 0.88817406, ..., 0.13799617,
0.40905532, 0.05180593]], dtype=float32)

The categories for these 1000 samples are represented as integers in the 0–9 range. They are randomly generated and stored in an NDArray named ‘Y’.

Y = mx.nd.empty((sample_count,))
# range(sample_count) fills every element, including the last one
for i in range(sample_count):
    Y[i] = np.random.randint(0, category_count)

>>> Y.shape
(1000L,)
>>> Y[0:10].asnumpy()
array([ 3., 3., 1., 9., 4., 7., 3., 5., 2., 2.], dtype=float32)

Splitting the data set

Next, we’re splitting the data set 80/20 for training and validation. We use the NDArray.crop function to do this. Here, the data set is completely random, so we can use the top 80% for training and the bottom 20% for validation. In real life, we’d probably shuffle the data set first, in order to avoid potential bias on sequentially-generated data.

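# Note: 'end' is exclusive, so feature_count-1 keeps 99 of the 100 feature columns
# (hence the 10x99 batches shown later). Use feature_count to keep all 100.
X_train = mx.nd.crop(X, begin=(0,0), end=(train_count,feature_count-1))
X_valid = mx.nd.crop(X, begin=(train_count,0), end=(sample_count,feature_count-1))
Y_train = Y[0:train_count]
Y_valid = Y[train_count:sample_count]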

Our data is now ready!
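To make the "shuffle first" advice a bit more concrete, here is a minimal, framework-free sketch in plain Python. The helper name shuffled_split and the toy data are made up for illustration; they are not part of MXNet. The one point it tries to show is that samples and labels must be shuffled with the same permutation, otherwise each sample loses its label.

```python
import random

def shuffled_split(samples, labels, train_fraction=0.8, seed=42):
    """Shuffle samples and labels together, then split into train/validation parts."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    indices = list(range(len(samples)))
    # A single permutation of the indices keeps each sample paired with its label
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_fraction)
    train_idx, valid_idx = indices[:cut], indices[cut:]
    x_train = [samples[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    x_valid = [samples[i] for i in valid_idx]
    y_valid = [labels[i] for i in valid_idx]
    return x_train, y_train, x_valid, y_valid

# Toy data (hypothetical): sample i carries label i % 10, so a broken pairing is easy to spot
samples = [[float(i)] for i in range(1000)]
labels = [i % 10 for i in range(1000)]

x_train, y_train, x_valid, y_valid = shuffled_split(samples, labels)
print(len(x_train), len(x_valid))
```

With NDArrays, the same idea would apply: build one shuffled index list and use it for both X and Y.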

Building the network

Our network is pretty simple. Let’s look at each layer:

  • The input layer is represented by a Symbol named ‘data’. We’ll bind it to the actual input data later on.
data = mx.sym.Variable('data')
  • fc1, the first hidden layer, is built from 64 fully-connected neurons, i.e. each feature in the input layer is connected to all 64 neurons. As you can see, we use the high-level Symbol.FullyConnected function, which is much more convenient than building each connection manually!
fc1 = mx.sym.FullyConnected(data, name='fc1', num_hidden=64)
  • Each output of fc1 goes through an activation function. Here we use a rectified linear unit, aka ‘relu’. I promised minimal theory, so let’s just say that an activation function is how we decide whether a neuron should “fire” or not, i.e. whether its inputs are meaningful enough in predicting the correct result.
relu1 = mx.sym.Activation(fc1, name='relu1', act_type="relu")
  • fc2, the second hidden layer, is built from 10 fully-connected neurons, which map to our 10 categories. Each neuron outputs a float value of arbitrary scale. The largest of the 10 values represents the most likely category for the data sample.
fc2 = mx.sym.FullyConnected(relu1, name='fc2', num_hidden=category_count)
  • The output layer applies the Softmax function to the 10 values coming from the fc2 layer: they are transformed into 10 values between 0 and 1 that add up to 1. Each value represents the predicted probability for each category, the largest one pointing at the most likely category.
out = mx.sym.SoftmaxOutput(fc2, name='softmax')
mod = mx.mod.Module(out)
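If you're curious about what these layers compute, here is a rough plain-Python sketch of a single sample flowing through the same stack: fully-connected, relu, fully-connected, softmax. It is only an illustration and not how MXNet implements anything: the helper names and the small random weights are assumptions made up for the example.

```python
import math
import random

def fully_connected(inputs, weights, biases):
    """Each output neuron is a weighted sum of all inputs, plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Rectified linear unit: negative values become 0, positive values pass through."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn scores of arbitrary scale into probabilities that add up to 1."""
    peak = max(values)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)

def random_layer(n_out, n_in):
    """Hypothetical helper: small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

sample = [rng.random() for _ in range(100)]  # one sample, 100 features
w1, b1 = random_layer(64, 100)               # stands in for fc1
w2, b2 = random_layer(10, 64)                # stands in for fc2

hidden = relu(fully_connected(sample, w1, b1))
probs = softmax(fully_connected(hidden, w2, b2))
predicted = probs.index(max(probs))          # most likely category
print(len(probs), round(sum(probs), 6), predicted)
```

The weights here are untrained, so the predicted category is meaningless. Training is what adjusts them.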

Building the data iterator

In part 1, we saw that neural networks are not trained one sample at a time, as this is quite inefficient from a performance point of view. Instead, we use batches, i.e. a fixed number of samples.

In order to deliver these batches to the network, we need to build an iterator using the NDArrayIter function. Its parameters are the training data, the categories (MXNet calls these labels) and the batch size.

As you can see, we can indeed iterate on the data set, 10 samples and 10 labels at a time. We then call the reset() function to restore the iterator to its original state.

train_iter = mx.io.NDArrayIter(data=X_train,label=Y_train,batch_size=batch)

>>> for data_batch in train_iter:
...   print(data_batch.data)
...   print(data_batch.label)
...
[<NDArray 10x99 @cpu(0)>]
[<NDArray 10 @cpu(0)>]
[<NDArray 10x99 @cpu(0)>]
[<NDArray 10 @cpu(0)>]
[<NDArray 10x99 @cpu(0)>]
[<NDArray 10 @cpu(0)>]
<edited for brevity>
>>> train_iter.reset()

Our network is now ready for training!
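To picture what a batch iterator does, here is a minimal sketch written as a plain Python generator. The function name iterate_batches and the toy data are hypothetical. The real NDArrayIter does more than this, so take it as an illustration of the idea only.

```python
def iterate_batches(samples, labels, batch_size):
    """Yield (samples, labels) pairs of batch_size items each, in order."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size], labels[start:start + batch_size]

# Toy training set (hypothetical): 800 samples with 3 features each
samples = [[float(i)] * 3 for i in range(800)]
labels = [i % 10 for i in range(800)]

batches = list(iterate_batches(samples, labels, batch_size=10))
print(len(batches))                             # batches per epoch: 800 / 10 = 80
print(len(batches[0][0]), len(batches[0][1]))   # 10 samples and 10 labels per batch
```

A generator like this one is simply recreated for each pass over the data, whereas the MXNet iterator is reused after calling reset().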

Training the model

First, let’s bind the input symbol to the actual data set (samples and labels). This is where the iterator comes in handy.

mod.bind(data_shapes=train_iter.provide_data, label_shapes=train_iter.provide_label)

Next, let’s initialize the neuron weights in the network. This is actually a very important step: initializing them with the “right” technique will help the network learn much faster. The Xavier initializer (named after its inventor, Xavier Glorot) is one of these techniques.

# Allowed, but not efficient
# mod.init_params()
# Much better
mod.init_params(initializer=mx.init.Xavier())
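For the curious, here is a small sketch of the idea behind this kind of initialization, using the uniform variant commonly attributed to Glorot and Bengio: weights are drawn from a range that shrinks as the layer gets more inputs and outputs. The function name glorot_uniform is made up, and mx.init.Xavier has its own parameters, so don't assume this matches MXNet's behavior exactly.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng):
    """Draw a fan_out x fan_in weight matrix from U(-limit, limit)."""
    # The limit depends only on the number of inputs and outputs of the layer
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    weights = [[rng.uniform(-limit, limit) for _ in range(fan_in)]
               for _ in range(fan_out)]
    return weights, limit

rng = random.Random(0)
fc1_weights, fc1_limit = glorot_uniform(fan_in=100, fan_out=64, rng=rng)
fc2_weights, fc2_limit = glorot_uniform(fan_in=64, fan_out=10, rng=rng)
print(round(fc1_limit, 4), round(fc2_limit, 4))
```

The wider layer (fc1) gets the smaller range, which keeps the scale of its outputs under control at the start of training.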

Next, we need to define the optimization parameters:

  • we’re using the Stochastic Gradient Descent algorithm (aka SGD), which has long been used for Machine Learning and Deep Learning applications.
  • we’re setting the learning rate to 0.1, a pretty typical value for SGD.
mod.init_optimizer(optimizer='sgd', optimizer_params=(('learning_rate', 0.1), ))
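Here is a toy sketch of the update rule at the heart of SGD: each weight takes a small step against its gradient, and the learning rate sets the size of that step. The loss function below is an assumption chosen for illustration (a simple bowl with its minimum at (3, -1)), and its gradient is exact. In real training, the gradient is estimated on one batch at a time, which is where the "stochastic" part comes from.

```python
def sgd_step(weights, gradients, learning_rate):
    """One update: move each weight against its gradient."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

def loss(weights):
    # Toy loss (hypothetical) with its minimum at (3, -1)
    return (weights[0] - 3.0) ** 2 + (weights[1] + 1.0) ** 2

def loss_gradient(weights):
    # Partial derivatives of the toy loss above
    return [2.0 * (weights[0] - 3.0), 2.0 * (weights[1] + 1.0)]

weights = [0.0, 0.0]
for step in range(50):
    weights = sgd_step(weights, loss_gradient(weights), learning_rate=0.1)

print([round(w, 3) for w in weights], round(loss(weights), 6))
```

After 50 steps with a learning rate of 0.1, the weights end up very close to the minimum.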

And finally, we can train the network! We’re doing it over 50 epochs, which means the full training set will flow 50 times through the network (in batches of 10 samples).

mod.fit(train_iter, num_epoch=50)

INFO:root:Epoch[0] Train-accuracy=0.097500
INFO:root:Epoch[0] Time cost=0.085
INFO:root:Epoch[1] Train-accuracy=0.122500
INFO:root:Epoch[1] Time cost=0.074
INFO:root:Epoch[2] Train-accuracy=0.153750
INFO:root:Epoch[2] Time cost=0.087
INFO:root:Epoch[3] Train-accuracy=0.162500
INFO:root:Epoch[3] Time cost=0.082
INFO:root:Epoch[4] Train-accuracy=0.192500
INFO:root:Epoch[4] Time cost=0.094
INFO:root:Epoch[5] Train-accuracy=0.210000
INFO:root:Epoch[5] Time cost=0.108
INFO:root:Epoch[6] Train-accuracy=0.222500
INFO:root:Epoch[6] Time cost=0.104
INFO:root:Epoch[7] Train-accuracy=0.243750
INFO:root:Epoch[7] Time cost=0.110
INFO:root:Epoch[8] Train-accuracy=0.263750
INFO:root:Epoch[8] Time cost=0.101
INFO:root:Epoch[9] Train-accuracy=0.286250
INFO:root:Epoch[9] Time cost=0.097
INFO:root:Epoch[10] Train-accuracy=0.306250
INFO:root:Epoch[10] Time cost=0.100
INFO:root:Epoch[20] Train-accuracy=0.507500
INFO:root:Epoch[30] Train-accuracy=0.718750
INFO:root:Epoch[40] Train-accuracy=0.923750
INFO:root:Epoch[50] Train-accuracy=0.998750
INFO:root:Epoch[50] Time cost=0.077

As we can see, the training accuracy rises rapidly and reaches 99+% after 50 epochs. It looks like our network was able to learn the training set. That’s pretty impressive!

But how does it perform against the validation set?

Validating the model

Now we’re going to throw new data samples at the network, i.e. the 20% that haven’t been used for training.

First, we’re building an iterator. This time, we’re using the validation samples and labels.

pred_iter = mx.io.NDArrayIter(data=X_valid,label=Y_valid, batch_size=batch)

Next, using the Module.iter_predict() function, we’re going to run these samples through the network. As we do this, we’re going to compare the predicted label with the actual label. We’ll keep track of the score and display the validation accuracy, i.e. how well the network did on the validation set.

pred_count = valid_count
correct_preds = total_correct_preds = 0

for preds, i_batch, batch in mod.iter_predict(pred_iter):
    # Actual labels for the current batch
    label = batch.label[0].asnumpy().astype(int)
    # Predicted label: index of the highest probability for each sample
    pred_label = preds[0].asnumpy().argmax(axis=1)
    correct_preds = np.sum(pred_label==label)
    total_correct_preds = total_correct_preds + correct_preds

print('Validation accuracy: %2.2f' % (1.0*total_correct_preds/pred_count))

There is quite a bit happening here :)

iter_predict() returns:

  • i_batch: the batch number.
  • batch: the current data batch. Its label attribute is a list of NDArrays which, here, holds a single NDArray storing the labels of the 10 data samples in the current batch. We store them in the label numpy array (10 elements).
  • preds: an array of NDArrays. Here, it holds a single NDArray storing the predictions for the current batch: for each sample, we have predicted probabilities for all 10 categories (10x10 matrix). We use argmax() to find the index of the highest value, i.e. the most likely category. Thus, pred_label is a 10-element array holding the predicted category for each data sample in the current batch.

Then, we count the number of equal values in label and pred_label using numpy.sum().
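The same bookkeeping can be sketched without any framework, which may help if the NumPy one-liners feel dense. The helper names and the two tiny batches below are made up for illustration (3 samples and 4 categories per batch instead of 10 and 10).

```python
def argmax(values):
    """Index of the largest value, i.e. the most likely category."""
    return max(range(len(values)), key=lambda i: values[i])

def count_correct(probabilities, labels):
    """Number of samples whose most likely category matches the actual label."""
    return sum(1 for probs, label in zip(probabilities, labels)
               if argmax(probs) == label)

# Two toy batches (hypothetical): (predicted probabilities, actual labels)
batches = [
    ([[0.1, 0.7, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1], [0.2, 0.2, 0.5, 0.1]], [1, 0, 3]),
    ([[0.25, 0.25, 0.4, 0.1], [0.1, 0.1, 0.1, 0.7], [0.9, 0.05, 0.03, 0.02]], [2, 3, 1]),
]

total_correct = sum(count_correct(probs, labels) for probs, labels in batches)
total_samples = sum(len(labels) for _, labels in batches)
print('Toy accuracy: %2.2f' % (1.0 * total_correct / total_samples))  # 4 out of 6: 0.67
```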

Finally, we compute and display the validation accuracy.

Validation accuracy: 0.09

What? 9%? This is really bad! If you needed proof that our data set was random, there you have it!

The bottom line is that you can indeed train a neural network to learn anything, but if your data is meaningless (like ours here), it won’t be able to predict anything. Garbage in, garbage out!
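A quick sanity check shows why a score around 10% is exactly what we should expect here: with 10 equally likely categories and labels that carry no information, any predictor does about as well as random guessing. The simulation below is a sketch under that assumption and uses only the Python standard library.

```python
import math
import random

rng = random.Random(0)
category_count = 10
trials = 100000

# Random labels and random guesses, both drawn uniformly from the 10 categories
labels = [rng.randrange(category_count) for _ in range(trials)]
guesses = [rng.randrange(category_count) for _ in range(trials)]
accuracy = sum(1 for g, l in zip(guesses, labels) if g == l) / float(trials)

# Expected spread of the accuracy when measured on only 200 validation samples
p = 1.0 / category_count
spread = math.sqrt(p * (1 - p) / 200)

print('Accuracy of random guessing: %.3f' % accuracy)
print('Standard deviation over 200 samples: %.3f' % spread)
```

The spread over 200 samples is about 0.02, so a validation accuracy of 0.09 is well within the noise around 0.10.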

If you read this far, I guess you deserve to get the full code for this example ;) Please take some time to use it on your own data; it’s the best way to learn.

Next:

  • Part 4: Using a pre-trained model for image classification (Inception v3)
  • Part 5: More pre-trained models (VGG16 and ResNet-152)
  • Part 6: Real-time object detection on a Raspberry Pi (and it speaks, too!)


