A quick look at Natural Language Processing with Amazon Comprehend

Julien Simon
5 min read · Dec 20, 2017

Amazon Comprehend is a new service announced at AWS re:Invent 2017. At the time of writing, it is available in the US (Virginia, Ohio, Oregon) and in Europe (Ireland).

Features

This is what Amazon Comprehend is capable of right now:

  • Language detection (single document or batch mode),
  • Entity detection (single document or batch mode),
  • Key phrase extraction (single document or batch mode),
  • Sentiment analysis (single document or batch mode),
  • Topic modeling on large document collections.

Detection, extraction and analysis

Let’s look at the first four features with a simple example in the console. Here’s a sample text from an Associated Press data set:

“A suspect bit the ear of a 4-year-old police dog and injured the animal’s neck during a chase and arrest, police said today. The dog, Rex, was on patrol with Constable Philip Rajah in the Natal provincial capital during the weekend when they came across two suspicious individuals, police said. While Rajah searched one man, Rex chased the other and got the worst of it when his quarry turned on the animal and bit him. Rajah had to yank the man off the dog, police said. They said the dog was being treated for a serious neck injury at a veterinary clinic. The man who bit the dog may face a charge of malicious injury to state property”.

What does Comprehend make of this crazy story?

As you can see, key entities are properly identified: the police officer, Rex, the two individuals, the date and the location. Language is also detected as ‘English’.

What about key phrases?

Once again, key phrases are understood. It’s quite likely that indexing them in a backend would let us find this article quickly and accurately.

Now, what about sentiment?

Tone is indeed pretty neutral, although I’m quite sure poor Rex would think this is a pretty negative story :D

Language detection

Let’s take another shot at language detection. Comprehend is able to detect 100 languages, so let’s use tweets in three different languages: Swahili, Tagalog and Ukrainian. As you can see, these are rather short sentences, which are more difficult to detect than longer texts.

aws comprehend detect-dominant-language --text "Wabunge wapitisha muswada wa ukomo wa rais Uganda kusomwa kwa mara ya pili"
{
    "Languages": [
        {
            "LanguageCode": "sw",
            "Score": 0.994857907295227
        }
    ]
}

aws comprehend detect-dominant-language --text "Nakabibili pala ng durian sa U.S. supermarkets kasama ng mga epol. Galing siguro sa Thailand."
{
    "Languages": [
        {
            "LanguageCode": "tl",
            "Score": 0.9984232187271118
        }
    ]
}

aws comprehend detect-dominant-language --text "Помер відомий тренер та функціонер київського Динамо"
{
    "Languages": [
        {
            "LanguageCode": "uk",
            "Score": 0.9999969005584717
        }
    ]
}

All three are correctly detected. Good job, Comprehend!
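The service returns plain JSON, so picking the dominant language in code is straightforward. Here's a minimal Python sketch that selects the highest-scoring entry from a DetectDominantLanguage response, using the Swahili response shown above:

```python
import json

def dominant_language(response):
    """Return (language_code, score) for the highest-scoring language
    in a DetectDominantLanguage response."""
    best = max(response["Languages"], key=lambda lang: lang["Score"])
    return best["LanguageCode"], best["Score"]

# The Swahili response from the CLI call above:
swahili = json.loads("""
{
  "Languages": [
    {"LanguageCode": "sw", "Score": 0.994857907295227}
  ]
}
""")

print(dominant_language(swahili))  # ('sw', 0.994857907295227)
```

The same pattern works for the entity, key phrase and sentiment responses, which are also flat JSON lists with confidence scores.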

Topic modeling

Now, let’s take a look at topic modeling, a powerful technique allowing us to automatically:

  • build a list of topics from a large document collection,
  • group documents according to these topics.

First of all, we need a data set. We’ll use a collection of 2,246 news items from Associated Press (source), stored in a TSV file.

First, we have to remove the first two columns (we don’t need them) and save the file in UTF-8 format, which is what Comprehend expects.

Amazon Comprehend expects documents to be stored in an S3 bucket: either one file per document, or a single file with one document per line. We’ll use the second option, as this is how our data set is structured.
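The two preprocessing steps above can be sketched in a few lines of Python. This is one way to do it (the file names and the sample rows are illustrative, not from the actual data set):

```python
import csv
import io

def tsv_to_one_doc_per_line(tsv_text, skip_columns=2):
    """Drop the first `skip_columns` columns of each TSV row and
    return one document per line, ready for Comprehend's
    one-document-per-line input format."""
    lines = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        doc = "\t".join(row[skip_columns:]).replace("\n", " ").strip()
        if doc:
            lines.append(doc)
    return "\n".join(lines)

# Illustrative rows: id, label, then the article text.
sample = ("1\tAP880212\tA suspect bit the ear of a police dog...\n"
          "2\tAP880213\tSavings institution depositors withdrew...\n")
print(tsv_to_one_doc_per_line(sample))

# Writing with encoding="utf-8" gives Comprehend the encoding it expects:
# with open("ap-news.txt", "w", encoding="utf-8") as f:
#     f.write(tsv_to_one_doc_per_line(open("ap-news.tsv").read()))
```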

Creating the job is straightforward: S3 bucket, file format, number of topics to detect, and so on. Once the job is complete, we can download the results from S3.
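The same job can be created programmatically via the StartTopicsDetectionJob API. Here's a hedged boto3 sketch; the bucket name, role ARN and topic count are placeholders for illustration, not values from this article:

```python
# Hypothetical bucket and IAM role names, for illustration only.
job_params = {
    "InputDataConfig": {
        "S3Uri": "s3://my-bucket/ap-news.txt",
        "InputFormat": "ONE_DOC_PER_LINE",
    },
    "OutputDataConfig": {"S3Uri": "s3://my-bucket/output/"},
    "DataAccessRoleArn": "arn:aws:iam::123456789012:role/ComprehendS3Access",
    "NumberOfTopics": 25,
}

# With AWS credentials configured, you would submit the job like this:
# import boto3
# comprehend = boto3.client("comprehend", region_name="us-east-1")
# response = comprehend.start_topics_detection_job(**job_params)
print(job_params["InputDataConfig"]["InputFormat"])
```

The role must allow Comprehend to read the input prefix and write to the output prefix in S3.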

The output archive contains two CSV files: doc-topics.csv and topic-terms.csv.

Documents and topics

The doc-topics.csv file stores the list of topics and weights detected in each document (here, each line in the input file).

For example, let’s look at topics for document #648:

Here are the first few lines of this document: “Savings institution depositors withdrew more money from their accounts than they deposited in May for the first time in seven months, while losses continued to erode the ailing industry’s capital, the government said Monday. The Federal Home Loan Bank Board said net withdrawals at the nation’s 3,102 federally insured S&Ls totaled $941 million in May. But the decline came after seven consecutive months of net deposit gains totaling $28 billion”.

Looks like ‘finance’ is the key topic here.

Topics and terms

The topic-terms.csv file stores the list of topics that have been detected in the document collection. For each topic, we see a list of words and their weights.

The dominant topic in document #648 is topic #3: let’s check what it’s about.

These words could reasonably be associated with a ‘finance’ topic, so it’s a pretty good match for the document above.
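Joining the two CSV files in code is a simple exercise. Here's a Python sketch assuming the docname/topic/proportion and topic/term/weight column layout; the rows below are illustrative, not the actual job output:

```python
import csv
import io

# Illustrative rows in the doc-topics.csv / topic-terms.csv layout.
doc_topics_csv = """docname,topic,proportion
ap-news.txt:648,3,0.81
ap-news.txt:648,7,0.19
"""
topic_terms_csv = """topic,term,weight
3,bank,0.041
3,deposit,0.033
3,loan,0.027
"""

def dominant_topic(doc_topics, docname):
    """Return the topic number with the highest proportion for a document."""
    rows = [r for r in csv.DictReader(io.StringIO(doc_topics))
            if r["docname"] == docname]
    return int(max(rows, key=lambda r: float(r["proportion"]))["topic"])

def top_terms(topic_terms, topic):
    """Return a topic's terms, heaviest first."""
    rows = [r for r in csv.DictReader(io.StringIO(topic_terms))
            if int(r["topic"]) == topic]
    return [r["term"] for r in sorted(rows, key=lambda r: -float(r["weight"]))]

topic = dominant_topic(doc_topics_csv, "ap-news.txt:648")
print(topic, top_terms(topic_terms_csv, topic))  # 3 ['bank', 'deposit', 'loan']
```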

Curious about how this works? Read on :)

Latent Dirichlet Allocation (LDA)

LDA is an unsupervised learning algorithm which is commonly used to discover a user-specified number of topics shared by documents within a text collection.
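To see the idea at toy scale, scikit-learn ships its own LDA implementation (a different codebase from Comprehend's, but the same family of algorithm). A minimal sketch, with made-up documents echoing our two stories:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "bank deposit loan savings interest withdrawal",
    "police dog chase arrest suspect injury",
    "deposit withdrawal bank capital loss savings",
    "police dog bite arrest constable patrol",
]

# Turn the documents into word-count vectors, then fit a 2-topic LDA model.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each row is a probability distribution over the 2 topics,
# exactly the doc-topics table Comprehend produced above.
print(doc_topics.shape)  # (4, 2)
```

The number of topics is user-specified here too, which is why the Comprehend job asked for it up front.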

As a side note, LDA is one of the built-in algorithms available in Amazon SageMaker. You’ll find a high-level description here.

If you want to dive deeper (and I mean ‘deeper’) on topic modeling and LDA, I strongly recommend this re:Invent video by my colleague Anima Anandkumar.

That’s it for today. Thank you for reading.
