Lab 4.1: Changing data

Objective:

In this lab, you will learn how to reindex and delete documents. You will also learn how to define an ingest node pipeline.

The blogs index on the remote cluster cluster2 contains 7 documents. Index those 7 documents into your existing blogs_fixed2 index on cluster1 using the Reindex API. You will need the following details:
- the username is training
- the password is nonprodpwd
- the hostname for cluster2 is node5, which uses SSL on port 9204
Note that the elasticsearch.yml file on cluster1 has all the necessary settings, so you will not need to change it. Here are the settings that were added for the remote reindex to work:
```
reindex.remote.whitelist: node5:9204
reindex.ssl.certificate_authorities: /usr/share/elasticsearch/config/certificates/ca/ca.crt
reindex.ssl.verification_mode: none
```
You can view the entire file by running the following command in the terminal:
```
cat /home/ubuntu/elasticsearch/elasticsearch1.yml
```
Solution
```
POST _reindex
{
  "source": {
    "remote": {
      "host": "https://node5:9204",
      "username": "training",
      "password": "nonprodpwd"
    },
    "index": "blogs"
  },
  "dest": {
    "index": "blogs_fixed2"
  }
}
```
Run a _count request on blogs_fixed2 and verify that it now contains 4,726 documents, which is 7 more than you had before:
```
GET blogs_fixed2/_count
```
Delete all the documents in blogs_fixed2 where tags.use_case equals uptime monitoring. You should see 67 documents deleted.
Solution
```
POST blogs_fixed2/_delete_by_query
{
  "query": {
    "match": {
      "tags.use_case": "uptime monitoring"
    }
  }
}
```
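To confirm the deletion, one option is to re-run the same query as a count. This is a minimal sketch rather than part of the lab itself; it assumes the index has refreshed since the delete completed, in which case the count should be 0.

```
GET blogs_fixed2/_count
{
  "query": {
    "match": {
      "tags.use_case": "uptime monitoring"
    }
  }
}
```

If the figures above hold, an unfiltered GET blogs_fixed2/_count should now report 4,659 documents (4,726 minus the 67 deleted).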

EXAM PREP: Elastic's web team has been tracking the visitors to our blogs. Here is an example of a log from one of those visits:

{
  "@timestamp": "2021-03-21T19:25:05.000-06:00",
  "bytes_sent": 26774, 
  "content_type": "text/html; charset=utf-8", 
  "geoip_location_lat": 39.1029, 
  "geoip_location_lon": -94.5713, 
  "is_https": true, 
  "request": "/blog/introducing-elastic-endpoint-security", 
  "response": 200, 
  "runtime_ms": 191, 
  "user_Agent": "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)", 
  "verb": "GET"
}

Using the Ingest Node Pipeline UI, create an ingest pipeline that satisfies the following requirements:

- the name of the pipeline is web_traffic_pipeline
- removes the field is_https
- renames the field request to url.original
- renames the field verb to http.request.method
- renames the field response to http.response.status_code
- renames the field geoip_location_lat to geo.location.lat
- renames the field geoip_location_lon to geo.location.lon
- uses the user_agent processor on the field user_Agent
- removes the field user_Agent
- the processors above should ignore documents that do not have the specified fields

Test the pipeline on the sample document above. The output should look like this:

{
  "geo": {
    "location": {
      "lon": -94.5713,
      "lat": 39.1029
    }
  },
  "@timestamp": "2021-03-21T19:25:05.000-06:00",
  "content_type": "text/html; charset=utf-8",
  "runtime_ms": 191,
  "http": {
    "request": {
      "method": "GET"
    },
    "response": {
      "status_code": 200
    }
  },
  "bytes_sent": 26774,
  "url": {
    "original": "/blog/introducing-elastic-endpoint-security"
  },
  "user_agent": {
    "name": "MJ12bot",
    "original": "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)",
    "device": {
      "name": "Spider"
    },
    "version": "1.4.8"
  }
}

Solution

In Kibana, go to Stack Management > Ingest Node Pipelines. Click Create new pipeline, then add each processor. To bypass the UI and define the pipeline with an HTTP request instead, copy and paste the following into Console:

PUT _ingest/pipeline/web_traffic_pipeline
{
  "processors": [
    {
      "remove": {
        "field": "is_https",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "request",
        "target_field": "url.original",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "verb",
        "target_field": "http.request.method",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "response",
        "target_field": "http.response.status_code",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "geoip_location_lat",
        "target_field": "geo.location.lat",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "geoip_location_lon",
        "target_field": "geo.location.lon",
        "ignore_missing": true
      }
    },
    {
      "user_agent": {
        "field": "user_Agent",
        "ignore_missing": true
      }
    },
    {
      "remove": {
        "field": "user_Agent",
        "ignore_missing": true
      }
    }
  ]
}
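
If you built the pipeline through the UI, one way to compare it with the definition above is to retrieve it in Console. This is an optional check, not a required lab step. The response should list eight processors, with the user_agent processor appearing before the final remove of user_Agent (otherwise there would be nothing left for it to parse).

```
GET _ingest/pipeline/web_traffic_pipeline
```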

You can test your pipeline by running the following command:

GET _ingest/pipeline/web_traffic_pipeline/_simulate
{
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "@timestamp": "2021-03-21T19:25:05.000-06:00",
        "bytes_sent": 26774,
        "content_type": "text/html; charset=utf-8",
        "geoip_location_lat": 39.1029,
        "geoip_location_lon": -94.5713,
        "is_https": true,
        "request": "/blog/introducing-elastic-endpoint-security",
        "response": 200,
        "runtime_ms": 191,
        "user_Agent": "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)",
        "verb": "GET"
      }
    }
  ]
}
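
The last requirement (processors should ignore documents that lack the specified fields) can be checked the same way. The sketch below simulates a hypothetical document that contains none of the fields the processors act on; the _index and _id values are placeholders, as in the request above. If every processor sets ignore_missing to true, the simulation should return the _source unchanged rather than an error.

```
GET _ingest/pipeline/web_traffic_pipeline/_simulate
{
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "@timestamp": "2021-03-21T19:25:05.000-06:00",
        "bytes_sent": 26774
      }
    }
  ]
}
```

If one of the processors is missing ignore_missing, the response for this document should contain an error naming the absent field, which points to the processor that needs fixing.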

EXAM PREP: Once your pipeline is working, define a new index named web_traffic. Configure web_traffic_pipeline as the default pipeline for web_traffic. Use the following mapping for web_traffic (copy-and-paste it into your PUT request):

  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "geo": {
        "properties": {
          "location": {
            "type": "geo_point"
          }
        }
      },
      "http": {
        "properties": {
          "request": {
            "properties": {
              "method": {
                "type": "keyword"
              }
            }
          },
          "response": {
            "properties": {
              "status_code": {
                "type": "keyword"
              }
            }
          }
        }
      },
      "runtime_ms": {
        "type": "long"
      },
      "url": {
        "properties": {
          "original": {
            "type": "keyword",
            "fields": {
              "text": {
                "type": "text"
              }
            }
          }
        }
      },
      "user_agent": {
        "properties": {
          "device": {
            "properties": {
              "name": {
                "type": "keyword"
              }
            }
          },
          "name": {
            "type": "keyword"
          },
          "original": {
            "type": "keyword",
            "fields": {
              "text": {
                "type": "text"
              }
            }
          },
          "version": {
            "type": "keyword"
          }
        }
      }
    }
  }

Solution

PUT web_traffic
{
  "settings": {
    "default_pipeline": "web_traffic_pipeline"
  },
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "geo": {
        "properties": {
          "location": {
            "type": "geo_point"
          }
        }
      },
      "http": {
        "properties": {
          "request": {
            "properties": {
              "method": {
                "type": "keyword"
              }
            }
          },
          "response": {
            "properties": {
              "status_code": {
                "type": "keyword"
              }
            }
          }
        }
      },
      "runtime_ms": {
        "type": "long"
      },
      "url": {
        "properties": {
          "original": {
            "type": "keyword",
            "fields": {
              "text": {
                "type": "text"
              }
            }
          }
        }
      },
      "user_agent": {
        "properties": {
          "device": {
            "properties": {
              "name": {
                "type": "keyword"
              }
            }
          },
          "name": {
            "type": "keyword"
          },
          "original": {
            "type": "keyword",
            "fields": {
              "text": {
                "type": "text"
              }
            }
          },
          "version": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
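
To confirm that the default pipeline was registered, you can inspect the index settings. This is an optional check that assumes the index was created with the request above; the response should show index.default_pipeline set to web_traffic_pipeline.

```
GET web_traffic/_settings
```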

Test your pipeline by running the following script within a Terminal window in Strigo. The script indexes a few log events into the web_traffic index:
```
cd datasets
./test_webtraffic.sh
```
Check your web_traffic index and verify everything is working.
Solution

Run a search request and verify that your documents have the correct structure.
```
GET web_traffic/_search
```
If the sample documents look correct, delete them by running the following command in Console:
```
POST web_traffic/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}
```
You will now index about 1.4 million log events, which is about a month's worth of logs. From the same folder, run the provided load_webtraffic.sh script:
```
./load_webtraffic.sh
```
It will take several minutes for all 1,462,658 documents to be indexed. Let it run and move on to the next step.
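While the script runs, you can track its progress from Console with a count request. This is an optional check; the number should climb toward 1,462,658 as indexing proceeds.

```
GET web_traffic/_count
```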
Create a data view in Kibana for your web_traffic index, using @timestamp as the time field.
Go to Discover in Kibana and select your web_traffic data view. Set the time picker to April 1 through April 30, 2021. The date histogram should show the count of all the documents in that range.
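
The Discover view can be approximated in Console. The sketch below is one assumed way to mirror it, not part of the lab: it buckets the documents in the same date range by day with a date_histogram aggregation (the aggregation name per_day is arbitrary). Dates in the request are interpreted as UTC unless a time_zone is supplied, so the daily buckets may differ slightly from what Kibana shows in your browser's time zone.

```
GET web_traffic/_search
{
  "size": 0,
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2021-04-01",
        "lte": "2021-04-30"
      }
    }
  },
  "aggs": {
    "per_day": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "day"
      }
    }
  }
}
```

Setting size to 0 suppresses the individual hits, so the response contains only the per-day doc_count values.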

Summary:

In this lab, you reindexed documents from a remote cluster using the Reindex API and deleted documents with _delete_by_query. You also created an ingest node pipeline to transform log data and created an index that uses it as its default pipeline. You now have a new index called web_traffic that contains log data about visitors to our blog.