Johnny Pi, I am your father — part 4: adding cloud-based vision

Julien Simon
Sep 11, 2017 · 5 min read

In the previous post, we learned how to use Amazon Polly to let our robot speak. I hope you had fun with that :)

In this post, I’ll show you how to take a picture with the robot’s camera and how to use Amazon Rekognition to identify faces and objects… and yeah, we’ll send some tweets.

Let’s get going.

What we’ll need

Here’s the shopping list:

  • the robot we’ve built in the previous parts,
  • a Raspberry Pi camera module,
  • a Twitter developer account with API credentials (only needed for the tweeting part).

Allowing the robot to subscribe to a new MQTT topic

Once again, all commands will be sent through a dedicated MQTT topic, named JohnnyPi/see. We have to update the thing’s IAM policy to allow it to subscribe to this new topic.

Just go to the IAM console, locate the proper policy and add the following statement.
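
Here’s a sketch of what that statement could look like (REGION and ACCOUNT_ID are placeholders for your own values). Subscribing requires both the Subscribe and Receive actions:

{
  "Effect": "Allow",
  "Action": [
    "iot:Subscribe",
    "iot:Receive"
  ],
  "Resource": [
    "arn:aws:iot:REGION:ACCOUNT_ID:topicfilter/JohnnyPi/see",
    "arn:aws:iot:REGION:ACCOUNT_ID:topic/JohnnyPi/see"
  ]
}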

Allowing the robot to invoke Rekognition

As we did for Polly, we have to allow the robot to call the Rekognition API. Let’s use the AWS CLI again and grant the robot the appropriate IAM permissions.

$ aws iam attach-user-policy --user-name johnny-pi --policy-arn arn:aws:iam::aws:policy/AmazonRekognitionReadOnlyAccess

Code overview

OK, with IAM out of the way, let’s write some code. Here’s what should happen when we send an MQTT message to the JohnnyPi/see topic:

  • take a picture with the Pi camera,
  • send it to Rekognition for face and label detection,
  • draw a rectangle around each face, add a legend and save the new image,
  • build a text message about faces and another one about labels,
  • send both messages to Polly for speech generation,
  • play both sound files on the Pi,
  • last but not least, tweet the new image if the MQTT message contains ‘tweet’.

As usual, we need to add a callback for messages posted to the topic.
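
Here’s a minimal sketch of what that callback could look like. The helper names are illustrative, not the actual code from the repository: most of them are sketched in the sections below, and speak() simply reuses the Polly code from part 3.

def seeCallback(client, userdata, message):
    payload = message.payload.decode()
    # Take a picture with the Pi camera
    imageFile = takePicture()
    # Upload the picture to S3, then detect faces and labels with Rekognition
    faces, labels = detectFacesAndLabels(imageFile)
    # Draw a rectangle and a legend around each face, save the new image
    newImageFile, faceCount = processImage(imageFile, faces)
    # Build the face and label messages
    faceMessage, labelMessage = generateMessages(faceCount, labels)
    # Generate speech with Polly and play both sound files (see part 3)
    speak(faceMessage)
    speak(labelMessage)
    # Last but not least, tweet the new image if requested
    if 'tweet' in payload:
        sendTweet(newImageFile, labelMessage)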

It should be quite self-explanatory. Let’s look at the important steps in more detail.

Taking a picture

The Pi camera API is nice and simple. Open the camera, take a picture, close the camera. Why can’t programming always be this simple?
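
In Python, a minimal version with the picamera module could look like this:

import picamera

def takePicture(filename='image.jpg'):
    # Open the camera, take a picture, close the camera
    camera = picamera.PiCamera()
    try:
        camera.capture(filename)
    finally:
        camera.close()
    return filename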

Detecting faces and labels with Rekognition

First, we have to copy the picture to S3. Make sure to use your own bucket in awsUtils.py.

Next, we invoke two Rekognition APIs:

  • detect_faces() returns a list of face details: position, landmarks, age range, etc.
  • detect_labels() returns a list of labels and confidence scores. By default, we request at most 10 labels, with a confidence score of 80% or higher (see the sketch below).
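
Here’s a minimal boto3 sketch of these steps. The bucket name and the detectFacesAndLabels() function are placeholders of mine; the actual code lives in awsUtils.py:

import boto3

s3 = boto3.client('s3')
rekognition = boto3.client('rekognition')

def detectFacesAndLabels(filename, bucket='my-johnny-pi-bucket'):
    # Copy the picture to S3 so that Rekognition can read it
    s3.upload_file(filename, bucket, filename)
    image = {'S3Object': {'Bucket': bucket, 'Name': filename}}
    # Detect faces, requesting all attributes (age range, emotions, etc.)
    faces = rekognition.detect_faces(Image=image, Attributes=['ALL'])
    # Detect at most 10 labels with a confidence score of 80% or higher
    labels = rekognition.detect_labels(Image=image, MaxLabels=10, MinConfidence=80)
    return faces['FaceDetails'], labels['Labels']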

Generating the new image

Thanks to the face details provided by Rekognition, we’re now able to locate the position of each face in the picture. Using the Pillow library, we’re going to draw a rectangle around each face and add a legend with the face count (‘Face0’, ‘Face1’, etc.).

Drawing a rectangle around each face requires a bit more work than I’d have liked. First, Rekognition returns fractional coordinates for the bounding box, which need to be converted into absolute pixel values. Second, the Pillow API to draw rectangles doesn’t allow the line width to be set: for high-resolution pictures, the resulting rectangle tends to be invisible :-/ Thus, I’m drawing lines instead. If you’re curious about Pillow, all the code is located in RekognitionUtils.py. If not, no worries, you can happily ignore this. It seems to work fine ;)
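
Here’s a minimal sketch of the conversion and drawing logic. The drawFaceBox() name is mine; the real version is in RekognitionUtils.py:

from PIL import ImageDraw

def drawFaceBox(image, boundingBox, index, lineWidth=4):
    # Convert Rekognition's fractional coordinates into absolute pixel values
    imageWidth, imageHeight = image.size
    left = int(boundingBox['Left'] * imageWidth)
    top = int(boundingBox['Top'] * imageHeight)
    right = int((boundingBox['Left'] + boundingBox['Width']) * imageWidth)
    bottom = int((boundingBox['Top'] + boundingBox['Height']) * imageHeight)
    draw = ImageDraw.Draw(image)
    # line() accepts a width parameter, unlike rectangle(), so draw four lines
    draw.line([(left, top), (right, top)], fill='red', width=lineWidth)
    draw.line([(right, top), (right, bottom)], fill='red', width=lineWidth)
    draw.line([(right, bottom), (left, bottom)], fill='red', width=lineWidth)
    draw.line([(left, bottom), (left, top)], fill='red', width=lineWidth)
    # Add the legend above the rectangle
    draw.text((left, top - 12), 'Face%d' % index, fill='red')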

Once rectangles and legends have been added, the new image is saved locally and the number of faces is returned.

Generating the face and label messages

Using the number of faces, we build a text string that Polly will speak. In the same way, we build a text string about labels: more details in generateMessages().
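
Judging from the output shown in the testing section below, generateMessages() could look something like this sketch:

def generateMessages(faceCount, labels):
    # Build a sentence about the number of detected faces
    if faceCount == 0:
        faceMessage = 'No face has been detected.'
    elif faceCount == 1:
        faceMessage = 'A single face has been detected.'
    else:
        faceMessage = '%d faces have been detected.' % faceCount
    # Build a sentence listing the detected labels
    labelMessage = 'Here are some keywords about this picture: ' \
        + ', '.join(label['Name'] for label in labels)
    return faceMessage, labelMessage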

Generating and playing the sound files

We simply reuse the code we wrote in part 3.

Sending a tweet

The first step is to create a Twitter developer account and get API credentials. Then, we can use the super simple Tweepy library to send a tweet.
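
Here’s a minimal sketch with Tweepy, assuming your four credentials are stored as shown. The status text is just an example:

import tweepy

# Twitter API credentials (placeholders: use your own)
consumer_key = 'REPLACE_ME'
consumer_secret = 'REPLACE_ME'
access_token = 'REPLACE_ME'
access_token_secret = 'REPLACE_ME'

def sendTweet(imageFile, status='Look at what I just saw!'):
    # Authenticate with the Twitter API
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    # Upload the image and post the tweet in a single call
    api.update_with_media(imageFile, status=status)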

That’s it, we have everything we need. Let’s test!

Testing

As before, I’m using MQTT.fx to publish messages to the JohnnyPi/see topic. Here’s the output for Rekognition only.

Lemme guess… LCD screens?
Topic=JohnnyPi/see
Picture uploaded
Label People, confidence: 99.1184310913
Label Person, confidence: 99.1184387207
Label Human, confidence: 99.0959320068
Label Computer, confidence: 98.671875
Label Electronics, confidence: 98.671875
Label LCD Screen, confidence: 98.671875
Label Laptop, confidence: 98.671875
Label Pc, confidence: 98.671875
*** Face 0 detected, confidence: 99.9929733276
Gender: Male
Age: 48-68
HAPPY 99.4530334473
ANGRY 1.54959559441
CALM 0.563991069794
Face message: A single face has been detected.
Label message: Here are some keywords about this picture: People, Person, Human, Computer, Electronics, LCD Screen, Laptop, Pc

48–68? WTF. I should have a chat with the Product Manager about the age range. Or maybe I simply need some sleep :)

Here’s a second try with both Rekognition and Twitter.

Topic=JohnnyPi/see
Picture uploaded
Label People, confidence: 98.8171768188
Label Person, confidence: 98.8172149658
Label Human, confidence: 98.7540512085
Label Computer, confidence: 98.6133422852
Label Electronics, confidence: 98.6133422852
Label LCD Screen, confidence: 98.6133422852
Label Laptop, confidence: 98.6133422852
Label Pc, confidence: 98.6133422852
*** Face 0 detected, confidence: 99.9958724976
Gender: Male
Age: 26-43
Beard
CONFUSED 56.1017799377
SAD 16.9187488556
ANGRY 5.65937137604
Face message: A single face has been detected.
Label message: Here are some keywords about this picture: People, Person, Human, Computer, Electronics, LCD Screen, Laptop, Pc,
Tweet sent

All right, that age range is more like it. I guess the moral of the story is: don’t smile.

What’s next

As usual, you’ll find the full code on GitHub.

In the next part, we’ll keep expanding our robot’s vision skills with a pre-trained MXNet model for image recognition. More silliness in sight, no doubt!

Until then, have fun and as always, thanks for reading.

This article was written while playing way too many songs by Lynyrd Skynyrd. Must be the 68-year-old redneck in me.
