grimoirelab-elk
[mappings] Add Truncate Token filter to keyword fields
Long texts cause errors when indexed as keyword. To avoid this kind of error, we would need to truncate those fields to something like 256 chars.
To do that, I suggest using mapping features. In this case, we can use a normalizer together with a token filter. It should be something like the following (I didn't try it; I'm guessing the syntax to pass the argument to the filter):
From https://www.elastic.co/guide/en/elasticsearch/reference/5.4/analysis-snowball-tokenfilter.html :
```json
{
  "settings": {
    "analysis": {
      "normalizer": {
        "keyword_normalizer": {
          "type": "custom",
          "filter": ["trunc_256"]
        }
      },
      "filter": {
        "trunc_256": {
          "type": "truncate",
          "length": 256
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "foo": {
          "type": "keyword",
          "normalizer": "keyword_normalizer"
        }
      }
    }
  }
}
```
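As a sketch of how this could be generated from code (untested; the `truncate_keyword_body` helper, the field name `foo`, and the mapping type name are all illustrative, not part of gelk), a small Python function could build the same settings/mappings body for any truncation length:

```python
# Sketch: build the index-creation body that wires a "truncate" token
# filter into a custom normalizer applied to a keyword field.
# Helper name, field name, and doc type are illustrative assumptions.
def truncate_keyword_body(length=256, field="foo", doc_type="type"):
    """Return a settings/mappings dict truncating a keyword field."""
    filter_name = "trunc_%d" % length
    return {
        "settings": {
            "analysis": {
                "normalizer": {
                    "keyword_normalizer": {
                        "type": "custom",
                        "filter": [filter_name],
                    }
                },
                "filter": {
                    filter_name: {"type": "truncate", "length": length},
                },
            }
        },
        "mappings": {
            doc_type: {
                "properties": {
                    field: {
                        "type": "keyword",
                        "normalizer": "keyword_normalizer",
                    }
                }
            }
        },
    }

body = truncate_keyword_body()
# The index would then be created with a PUT to the cluster, e.g.
# requests.put("http://localhost:9200/myindex", json=body)
```

This keeps the truncation length in one place, so changing 256 to another limit only touches one argument.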
I would also suggest having a look at copy_to fields for those cases when we want to have the same text stored with different types (keyword and text):
https://www.elastic.co/guide/en/elasticsearch/reference/5.4/copy-to.html
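Roughly, the idea could look like this (a sketch only; the field names `title` and `title_keyword` are made up, and I haven't tested combining copy_to with the normalizer above): the full value is indexed as `text` for search, and copied into a truncated `keyword` field for aggregations.

```python
# Sketch (untested, names illustrative): "title" is full-text searchable,
# and its value is also copied into "title_keyword", a keyword field that
# could reuse the truncating normalizer from the settings above.
mappings = {
    "type": {
        "properties": {
            "title": {
                "type": "text",
                "copy_to": "title_keyword",
            },
            "title_keyword": {
                "type": "keyword",
                "normalizer": "keyword_normalizer",
            },
        }
    }
}
```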
I like the idea as a middle-term solution. A short-term solution (while no decision has been taken) is to modify the code of the elk connectors to limit the size of keyword fields where needed (https://github.com/chaoss/grimoirelab-elk/pull/254/files)
Maybe a long-term solution could be to refine/review the panels/visualizations to not truncate any data.
@acs, what do you think?
> I like the idea as a middle-term solution. A short-term solution (while no decision has been taken) is to modify the code of the elk connectors to limit the size of keyword fields where needed (https://github.com/chaoss/grimoirelab-elk/pull/254/files)
I have proposed a change on top of this approach. If we follow it, the panels CSV must also specify for which text fields this truncation must be done.
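The short-term, code-side approach could be sketched like this (illustrative only; the function name, the 256-char limit, and the field list are assumptions, with the real per-panel field list coming from the panels CSV as proposed):

```python
# Sketch of the short-term approach: truncate the configured text
# fields in an enriched item before it is sent to the index.
KEYWORD_MAX_LENGTH = 256  # assumed limit, mirroring the mapping idea

def truncate_fields(item, fields, max_length=KEYWORD_MAX_LENGTH):
    """Return a copy of item with the given string fields cut to max_length."""
    truncated = dict(item)
    for field in fields:
        value = truncated.get(field)
        if isinstance(value, str) and len(value) > max_length:
            truncated[field] = value[:max_length]
    return truncated

# Example: only "title" is listed for truncation, so "author" is untouched.
item = {"title": "x" * 1000, "author": "jane"}
short = truncate_fields(item, ["title"])
```

One drawback the thread already points at: this hides the schema decision inside connector code rather than in the mapping itself.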
> Maybe a long-term solution could be to refine/review the panels/visualizations to not truncate any data.
Yes, the issue to be fixed is why panels need to aggregate using long text fields.
> @acs, what do you think?

Not sure, let's try to talk about it today! Thanks @valeriocos!
> Yes, the issue to be fixed is why panels need to aggregate using long text fields.
Because some tables need to use things like issue titles.
There are two benefits of using mappings to truncate this data:
- It allows indexing data without using gelk. This would allow updating the indexes with no need of re-enrichment.
- The data schema is clear and not hidden in code, and it would be easy to test, as all the information we need is available in the mapping file.
Of course, this is just my opinion :)
> Because some tables need to use things like issue titles.
Yes, issue titles could be right. But fields with 65K chars... I suppose they are exceptions, like the commit description, which is the only text field in a commit with a kind of "title" but which in some cases is larger than 65K.
> There are two benefits of using mappings to truncate this data:
>
> - It allows indexing data without using gelk. This would allow updating the indexes with no need of re-enrichment.
But could the already existing mappings be changed without recreating them? It would be great to test this kind of stuff.
> - The data schema is clear and not hidden in code, and it would be easy to test, as all the information we need is available in the mapping file.
Not sure if the data schema in our approach is more hidden than in the code :)
> Of course, this is just my opinion :)
An interesting one!