Lab 2.3: Text analysis
Objective:
In this lab, you'll explore text analysis. You'll learn how to use the `_analyze` API and create a custom analyzer.
- From Kibana's main menu, select Dev Tools to open Console if it is not already open.
- To help you understand how text analysis works, Elasticsearch provides an `_analyze` API. For example, to see what would happen to the string "United Kingdom" if you applied the `standard` analyzer, you can use the following in Console:

  ```
  GET _analyze
  {
    "text": "United Kingdom",
    "analyzer": "standard"
  }
  ```

  What two tokens are the output of the request above?

  Solution: The two tokens are `united` and `kingdom` (both lowercased).
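  If you want to see which part of the analyzer produced each token, the `_analyze` API also accepts an `explain` parameter. This is not part of the lab, just a sketch of a useful variation:

  ```
  GET _analyze
  {
    "text": "United Kingdom",
    "analyzer": "standard",
    "explain": true
  }
  ```

  The `explain` output breaks the analysis down by tokenizer and token filter, which makes it easier to see exactly where the lowercasing happens.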
- Let's take a closer look at analyzers. Compare the output of the `_analyze` API on the string "Nodes and Shards" using the `standard` analyzer and using the `english` analyzer.

  Solution:

  ```
  GET _analyze
  {
    "text": "Nodes and Shards",
    "analyzer": "standard"
  }

  GET _analyze
  {
    "text": "Nodes and Shards",
    "analyzer": "english"
  }
  ```

  The `standard` analyzer outputs three tokens: `nodes`, `and`, and `shards`. The `english` analyzer, however, only outputs two tokens: `node` and `shard`. Two things happened here (besides lowercasing all tokens): First, the `english` analyzer removed so-called "stop words".
  Certain words, like the word "and" in English, do not add much relevance to queries: almost every document will contain the word "and" if your documents are in English. That's why you can choose to ignore those words, and save disk space, by not indexing them. You can see this behavior in isolation with the sketch below.
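  As an illustration (not part of the lab), you can apply the built-in `stop` token filter directly in an `_analyze` request; it defaults to the English stop word list:

  ```
  GET _analyze
  {
    "tokenizer": "standard",
    "filter": ["lowercase", "stop"],
    "text": "Nodes and Shards"
  }
  ```

  This returns only `nodes` and `shards`: the stop word "and" is dropped, but no stemming is applied yet.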
  Secondly, the tokens `nodes` and `shards` were stemmed by the `english` analyzer, meaning they were reduced to their root form. In this case, the plurals `nodes` and `shards` became the singular `node` and `shard`. By applying stemming to both the indexed data and the queries, the user does not need to worry about whether a search term is singular or plural. The sketch below shows the stemming step on its own.
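  Again purely as an illustration, you can apply a `stemmer` token filter by itself. The inline filter definition here is one way you might test it, not something the lab asks for:

  ```
  GET _analyze
  {
    "tokenizer": "standard",
    "filter": ["lowercase", { "type": "stemmer", "language": "english" }],
    "text": "Nodes and Shards"
  }
  ```

  This returns `node`, `and`, and `shard`: the plurals are stemmed, but "and" survives because no stop filter is applied here.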
- Using the `_analyze` API, see what the `standard` analyzer does with the following HTML snippet: "<b>Is</b> this <a href='/blogs'>clean</a> text?"

  Solution:

  ```
  GET _analyze
  {
    "analyzer": "standard",
    "text": "<b>Is</b> this <a href='/blogs'>clean</a> text?"
  }
  ```
- Notice how the HTML tags get parsed and indexed as if they were actual terms in the string. The `content` field of `blogs` has a lot of HTML code written into it. Run a query on the `content` field of the `blogs` index for the term "quot". You get a lot of hits: 826!

  ```
  GET blogs/_search
  {
    "query": {
      "match": {
        "content": "quot"
      }
    }
  }
  ```

  The "quot" string comes from `&quot;`, the HTML entity used to encode quotes. End users who query the blogs are most likely not interested in HTML tags and attributes. Let's see how you can improve the search experience by removing those.
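  As a quick preview (not required by the lab), you can see how the `html_strip` character filter handles such entities by passing it directly to `_analyze`; it removes tags and decodes entities like `&quot;` before tokenization:

  ```
  GET _analyze
  {
    "char_filter": ["html_strip"],
    "tokenizer": "standard",
    "text": "Some &quot;quoted&quot; text"
  }
  ```

  The tokens are `Some`, `quoted`, and `text`: no stray `quot` token is produced.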
- EXAM PREP: The `html_strip` character filter strips out HTML code before indexing the data. As a result, you will have cleaner data to search against. To use this filter, you need to create a custom analyzer. Create a new index named `blogs_test` that defines an analyzer named `content_analyzer` that uses:

  - the `html_strip` character filter
  - the `standard` tokenizer
  - the `lowercase` filter

  Solution:

  ```
  PUT blogs_test
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "content_analyzer": {
            "tokenizer": "standard",
            "filter": ["lowercase"],
            "char_filter": ["html_strip"]
          }
        }
      }
    }
  }
  ```
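  Note that at this point the analyzer is only defined; no field uses it yet. If you wanted the `content` field of `blogs_test` to use it, a minimal sketch would look like the following (the field name is an assumption, since the lab does not define a mapping):

  ```
  PUT blogs_test/_mapping
  {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "content_analyzer"
      }
    }
  }
  ```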
- Use the `_analyze` API to test the new `content_analyzer` in the `blogs_test` index with the following text: "<b>Is</b> this <a href='/blogs'>clean</a> text?"

  Solution:

  ```
  GET blogs_test/_analyze
  {
    "analyzer": "content_analyzer",
    "text": "<b>Is</b> this <a href='/blogs'>clean</a> text?"
  }
  ```

  The output only includes the tokens `is`, `this`, `clean`, and `text`. None of the HTML tags get indexed.
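  To actually benefit from the cleaner analysis, the `blogs` documents would need to be indexed into `blogs_test`. That goes beyond this lab, but as a sketch (assuming the mapping above is in place), `_reindex` could copy them over:

  ```
  POST _reindex
  {
    "source": { "index": "blogs" },
    "dest": { "index": "blogs_test" }
  }
  ```

  After reindexing, the same "quot" query against `blogs_test` should return far fewer (ideally zero) hits.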
Summary:
In this lab, you've learned how to test analyzers with the `_analyze` API and how to define a custom analyzer.