Lab 2.3: Text analysis
Objective:
In this lab, you'll explore text analysis. You'll learn how to use the `_analyze` API and create a custom analyzer.
- From Kibana's main menu, select Dev Tools to open Console if it is not already open.
- To help you understand how text analysis works, Elasticsearch provides an `_analyze` API. For example, to see what would happen to the string "United Kingdom" if you applied the `standard` analyzer, you can use the following in Console:

  ```
  GET _analyze
  {
    "text": "United Kingdom",
    "analyzer": "standard"
  }
  ```

  What two tokens are the output of the request above?

  Solution: The two tokens are `united` and `kingdom` (both lowercased).
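  If you want to see which part of the analyzer produced each token, the `_analyze` API also accepts an `explain` parameter. This is not part of the lab, just a sketch of a useful variation:

  ```
  GET _analyze
  {
    "text": "United Kingdom",
    "analyzer": "standard",
    "explain": true
  }
  ```

  The `explain` output breaks the analysis down by tokenizer and token filter, which makes it easier to see exactly where the lowercasing happens.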
- Let's take a closer look at analyzers. Compare the output of the `_analyze` API on the string "Nodes and Shards" using the `standard` analyzer and using the `english` analyzer.

  Solution:

  ```
  GET _analyze
  {
    "text": "Nodes and Shards",
    "analyzer": "standard"
  }

  GET _analyze
  {
    "text": "Nodes and Shards",
    "analyzer": "english"
  }
  ```

  The `standard` analyzer outputs three tokens: `nodes`, `and`, and `shards`. The `english` analyzer, however, only outputs two tokens: `node` and `shard`. Two things happened here (besides lowercasing all tokens): First, the `english` analyzer removed so-called "stop words".
  Certain words, like the word "and" in English, do not add much relevance to queries: almost every document will contain the word "and" if your documents are in English. That's why you can choose to ignore those words, and save disk space, by not indexing them. You can see this behavior in isolation with the sketch below.
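  As an illustration (not part of the lab), you can apply the built-in `stop` token filter directly in an `_analyze` request; it defaults to the English stop word list:

  ```
  GET _analyze
  {
    "tokenizer": "standard",
    "filter": ["lowercase", "stop"],
    "text": "Nodes and Shards"
  }
  ```

  This returns only `nodes` and `shards`: the stop word "and" is dropped, but no stemming is applied yet.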
  Secondly, the tokens `nodes` and `shards` were stemmed by the `english` analyzer, meaning they were reduced to their root form. In this case, the plurals `nodes` and `shards` became the singular `node` and `shard`. By applying stemming to both the indexed data and the queries, the user does not need to worry about whether a search term is singular or plural. The sketch below shows the stemming step on its own.
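  Again purely as an illustration, you can apply a `stemmer` token filter by itself. The inline filter definition here is one way you might test it, not something the lab asks for:

  ```
  GET _analyze
  {
    "tokenizer": "standard",
    "filter": ["lowercase", { "type": "stemmer", "language": "english" }],
    "text": "Nodes and Shards"
  }
  ```

  This returns `node`, `and`, and `shard`: the plurals are stemmed, but "and" survives because no stop filter is applied here.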
- Using the `_analyze` API, see what the `standard` analyzer does with the following HTML snippet: "<b>Is</b> this <a href='/blogs'>clean</a> text?"

  Solution:

  ```
  GET _analyze
  {
    "analyzer": "standard",
    "text": "<b>Is</b> this <a href='/blogs'>clean</a> text?"
  }
  ```
- Notice how the HTML tags get parsed and indexed as if they were actual terms in the string. The `content` field of `blogs` has a lot of HTML code written into it. Run a query on the `content` field of the `blogs` index for the term "quot". You get a lot of hits: 826!

  ```
  GET blogs/_search
  {
    "query": {
      "match": {
        "content": "quot"
      }
    }
  }
  ```

  The "quot" string comes from `&quot;`, the HTML entity used to encode quotes. End users who query the blogs are most likely not interested in HTML tags and attributes. Let's see how you can improve the search experience by removing those.
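  As a quick preview (not required by the lab), you can see how the `html_strip` character filter handles such entities by passing it directly to `_analyze`; it removes tags and decodes entities like `&quot;` before tokenization:

  ```
  GET _analyze
  {
    "char_filter": ["html_strip"],
    "tokenizer": "standard",
    "text": "Some &quot;quoted&quot; text"
  }
  ```

  The tokens are `Some`, `quoted`, and `text`: no stray `quot` token is produced.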
- EXAM PREP: The `html_strip` character filter strips out HTML code before indexing the data. As a result, you will have cleaner data to search against. To use this filter, you need to create a custom analyzer. Create a new index named `blogs_test` that defines an analyzer named `content_analyzer` that uses:

  - the `html_strip` character filter
  - the `standard` tokenizer
  - the `lowercase` filter

  Solution:

  ```
  PUT blogs_test
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "content_analyzer": {
            "tokenizer": "standard",
            "filter": ["lowercase"],
            "char_filter": ["html_strip"]
          }
        }
      }
    }
  }
  ```
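  Note that at this point the analyzer is only defined; no field uses it yet. If you wanted the `content` field of `blogs_test` to use it, a minimal sketch would look like the following (the field name is an assumption, since the lab does not define a mapping):

  ```
  PUT blogs_test/_mapping
  {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "content_analyzer"
      }
    }
  }
  ```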
- Use the `_analyze` API to test the new `content_analyzer` in the `blogs_test` index with the following text: "<b>Is</b> this <a href='/blogs'>clean</a> text?"

  Solution:

  ```
  GET blogs_test/_analyze
  {
    "analyzer": "content_analyzer",
    "text": "<b>Is</b> this <a href='/blogs'>clean</a> text?"
  }
  ```

  The output only includes the tokens `is`, `this`, `clean`, and `text`. None of the HTML tags get indexed.
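  To actually benefit from the cleaner analysis, the `blogs` documents would need to be indexed into `blogs_test`. That goes beyond this lab, but as a sketch (assuming the mapping above is in place), `_reindex` could copy them over:

  ```
  POST _reindex
  {
    "source": { "index": "blogs" },
    "dest": { "index": "blogs_test" }
  }
  ```

  After reindexing, the same "quot" query against `blogs_test` should return far fewer (ideally zero) hits.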
Summary:
In this lab, you've learned how to test analyzers with the `_analyze` API and how to define a custom analyzer.