[Transform] We might need a patch transform

Open machadoum opened this issue 6 months ago • 4 comments

Description

TLDR

We need flexible field-level customizable logic when ingesting documents. Could we achieve that with a painless script that receives the old and new documents as parameters?

Context

The Security Entity Analytics team is working on creating entity-based views inside the Security Solution. We are using the entity-manager framework developed by the Obs team: https://github.com/elastic/kibana/pull/183205.

The framework creates two indices. The first is built by a pivot transform that runs several time series aggregations (every x minutes). The second contains one document per entity and is generated by another pivot transform that searches the first index.



The goal of the security solution entity store is to have a single index with several entity properties that can be sorted and filtered. Some fields in the Security Solution entity store come from events that could have been emitted only once, so we must preserve their value in the store forever. Therefore, we must generate the document based on the entity's entire history.

The problem with this solution is that the time series index will grow over time and slow down the second transform. Discussions are happening about ILM policies and using a time-based filter for the second transform, but that wouldn't work for the security solution team.

The proposal

We believe that the current Transform implementation can't solve our problems, and we need to change it. Put simply, we need a pivot transform that adds fields to the existing entity document instead of overwriting it every time it runs. Different fields could require different patch logic: some fields should always be replaced, others should preserve the oldest value permanently, and others could be accumulated as an array.
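
To make the three patch behaviours concrete, here is a minimal Python sketch of field-level merge logic. It is only an illustration of the idea, not Transform or Painless code: the function names (`replace`, `keep_oldest`, `accumulate`, `patch_document`) and the field names in the usage note are hypothetical.

```python
from typing import Any, Callable, Dict, Optional

Doc = Dict[str, Any]
Strategy = Callable[[Any, Any], Any]


def replace(old: Any, new: Any) -> Any:
    """Always take the newest value."""
    return new


def keep_oldest(old: Any, new: Any) -> Any:
    """Keep the first value ever stored; only fill it in when missing."""
    return new if old is None else old


def accumulate(old: Any, new: Any) -> list:
    """Collect distinct values into a list, preserving insertion order."""
    if old is None:
        values = []
    elif isinstance(old, list):
        values = list(old)
    else:
        values = [old]
    for item in new if isinstance(new, list) else [new]:
        if item is not None and item not in values:
            values.append(item)
    return values


def patch_document(
    old_doc: Optional[Doc], new_doc: Doc, strategies: Dict[str, Strategy]
) -> Doc:
    """Merge new_doc into old_doc field by field (hypothetical helper).

    Fields without an explicit strategy fall back to `replace`.
    """
    old_doc = old_doc or {}
    merged = dict(old_doc)
    for field, new_value in new_doc.items():
        strategy = strategies.get(field, replace)
        merged[field] = strategy(old_doc.get(field), new_value)
    return merged
```

For example, with `strategies = {"first_seen": keep_oldest, "ips": accumulate}`, patching an existing entity document keeps its original `first_seen`, appends any new entries to `ips`, and replaces every other field with the incoming value.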

That way, the security solution entity store could be simplified like:

The POC

To illustrate one possible solution, we implemented a transform POC. It performs poorly and could be significantly improved, but it does the job of showing what we need. These are the changes we made:

  • Add a new script field to the pivot transform config
  • Query the destination index by ID to retrieve the current value of the document
  • Execute a Painless script that receives the old and new documents as parameters
  • Store the script result in the index
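
The read, script, write cycle described in the list above can be sketched roughly as follows. This is a simplified Python model, not the POC code: a plain dict stands in for the destination index, a Python callable stands in for the Painless script, and `apply_patch_step` and `add_missing_fields` are hypothetical names.

```python
from typing import Any, Callable, Dict, Optional

Doc = Dict[str, Any]
PatchScript = Callable[[Optional[Doc], Doc], Doc]


def apply_patch_step(
    destination: Dict[str, Doc], doc_id: str, new_doc: Doc, script: PatchScript
) -> Doc:
    """One patch cycle for a single entity document."""
    # Look up the current document by ID (None if it does not exist yet).
    old_doc = destination.get(doc_id)
    # Run the user-supplied script with the old and new documents.
    result = script(old_doc, new_doc)
    # Store the script result back under the same ID.
    destination[doc_id] = result
    return result


def add_missing_fields(old_doc: Optional[Doc], new_doc: Doc) -> Doc:
    """Example script: keep existing values, only add fields not yet present."""
    if old_doc is None:
        return dict(new_doc)
    merged = dict(old_doc)
    for field, value in new_doc.items():
        merged.setdefault(field, value)
    return merged
```

Calling `apply_patch_step` twice for the same ID with different incoming documents leaves the values from the first run in place and only adds fields that appear later.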

You can see the POC code here:

Please keep in mind that it was just a viability exercise. A production-ready implementation will require several improvements, like batching the search by ID and caching the Painless script compilation.
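
As a rough illustration of those two improvements, the Python sketch below fetches the old documents for a whole batch in one lookup and compiles the script source only once. Everything here is an assumption made for illustration: a dict stands in for the destination index, `fetch_many` stands in for a single multi-document lookup, and a Python expression evaluated with `eval` stands in for a compiled Painless script.

```python
from functools import lru_cache
from typing import Any, Callable, Dict, List, Optional

Doc = Dict[str, Any]


@lru_cache(maxsize=32)
def compile_script(source: str) -> Callable[[Optional[Doc], Doc], Doc]:
    """Stand-in for script compilation; runs once per distinct source string."""
    code = compile(source, "<patch-script>", "eval")

    def run(old: Optional[Doc], new: Doc) -> Doc:
        # Evaluate the expression with only `old` and `new` in scope.
        return eval(code, {"__builtins__": {}}, {"old": old, "new": new})

    return run


def fetch_many(destination: Dict[str, Doc], ids: List[str]) -> Dict[str, Optional[Doc]]:
    """Stand-in for one batched lookup by ID instead of one request per document."""
    return {doc_id: destination.get(doc_id) for doc_id in ids}


def apply_batch(
    destination: Dict[str, Doc], new_docs: Dict[str, Doc], script_source: str
) -> None:
    """Patch a whole batch of entity documents (hypothetical helper)."""
    script = compile_script(script_source)  # served from the cache after the first call
    old_docs = fetch_many(destination, list(new_docs))  # single lookup for the batch
    for doc_id, new_doc in new_docs.items():
        destination[doc_id] = script(old_docs[doc_id], new_doc)
```

With a script source such as `"{**(old or {}), **new}"`, repeated calls to `apply_batch` reuse the cached compiled script, which can be checked through `compile_script.cache_info()`.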

Questions

Could something like this be implemented? Would the performance be good enough? Is there a better way to solve the problem that we are missing?

machadoum • Aug 26 '24 11:08