OpenSearch
Infer and cache date field format instead of re-parsing it for every document
The date field with the default format uses high CPU during parsing. A huge portion of CPU time (close to 7.12% in profiles) goes into date parsing, which generally happens when certain segments of the date format are optional. Our customers don’t often set the date parser, but rely on the unoptimized default one. When I changed the date parsing format to a strict one for the same data set, the indexing throughput increased by 8%.
For logs, the date format does not change across different log lines. Hence, it is pretty inefficient to compute the date format for every single document. For such users, we could infer and set a stricter date format after parsing a few documents.
Additionally, 7% CPU seems too high just for date parsing. Maybe the Java formatter has improved since the time I ran these tests. The CPU profile shows that most of the time goes into parsing the optional segments of the date.
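To make the optional-segment cost concrete, here is a small, self-contained java.time sketch (plain JDK code, not OpenSearch's DateFormatter; the patterns are only illustrative). The lenient formatter has to evaluate and back out of each optional branch while parsing, which is where the profile points:

```java
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;

public class OptionalVsStrict {
    // Lenient formatter: time, seconds and offset are all optional sections,
    // roughly in the spirit of strict_date_optional_time.
    static final DateTimeFormatter OPTIONAL = new DateTimeFormatterBuilder()
        .appendPattern("yyyy-MM-dd")
        .optionalStart().appendPattern("'T'HH:mm")
        .optionalStart().appendPattern(":ss").optionalEnd()
        .optionalStart().appendPattern("XXX").optionalEnd()
        .optionalEnd()
        .toFormatter();

    // Strict formatter: every segment is mandatory, nothing to try and back out of.
    static final DateTimeFormatter STRICT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssXXX");

    public static void main(String[] args) {
        String sample = "2022-04-05T22:00:12Z";
        // Both formatters accept the same input, but the optional one does more
        // work per parse because it evaluates each optional branch.
        System.out.println(OPTIONAL.parse(sample));
        System.out.println(STRICT.parse(sample));
    }
}
```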
Solutions?
- We should definitely improve our documentation to clearly call out that the date mapping should be set to a stricter format if known well in advance.
- Infer the date-time parsing format when it is set to optional and re-use it across requests?
I wonder whether we can leverage the fact that most of the time all documents have the same date format. Maybe the date parser code can cache the date format and attempt to reuse it, only falling back to re-computing the date format when that fails?
Yes, we should cache it.
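A minimal sketch of that caching idea, with hypothetical names (the real change would live in the date mapper / DateFormatter code, which is not shown here): remember the last formatter that succeeded, try it first for the next document, and fall back to the full configured list when it fails.

```java
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.temporal.TemporalAccessor;
import java.util.List;

// Hypothetical helper, not an OpenSearch class: tries the last successful
// formatter first and falls back to the full configured list on failure.
public class CachingDateParser {
    private final List<DateTimeFormatter> configuredFormats;
    // volatile so concurrent indexing threads see the latest hint; a stale
    // value is harmless because we always fall back to the full list.
    private volatile DateTimeFormatter cached;

    public CachingDateParser(List<DateTimeFormatter> configuredFormats) {
        this.configuredFormats = configuredFormats;
    }

    public TemporalAccessor parse(String value) {
        DateTimeFormatter hint = cached;
        if (hint != null) {
            try {
                return hint.parse(value);
            } catch (DateTimeParseException e) {
                // Cached format no longer matches; fall through to the full list.
            }
        }
        for (DateTimeFormatter candidate : configuredFormats) {
            try {
                TemporalAccessor result = candidate.parse(value);
                cached = candidate;   // remember what worked for the next document
                return result;
            } catch (DateTimeParseException e) {
                // try the next configured format
            }
        }
        throw new IllegalArgumentException("failed to parse date value [" + value + "]");
    }
}
```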
Started working on this issue; will share baseline benchmarking numbers soon to highlight the differences in CPU for different datetime field formats.
Microbenchmarks
Experiment (datetime format) | Data | Average Time (ns/op) | Std Dev |
---|---|---|---|
epoch_millis | 123456789 | 915.87 | 2.613 |
strict_date_optional_time\|\|epoch_millis | 123456789 | 1141.29 | 68.7 |
strict_date_optional_time | "2022-04-05T22:00:12Z" | 3019.366 | 40.168 |
yyyy-MM-ddThh:mm:ssZ | "2022-04-05T22:00:12Z" | 669.83 | 13.83 |
strict_date_time_no_millis | "2022-04-05T22:00:12Z" | 1155.38 | 30.03 |
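For reference, numbers like these can be collected with a JMH harness along the following lines; this is only a sketch using plain java.time formatters as stand-ins, whereas the actual benchmark would go through the named OpenSearch formats:

```java
import java.time.format.DateTimeFormatter;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class DateParseBenchmark {
    private static final String SAMPLE = "2022-04-05T22:00:12Z";

    // Fully strict pattern, comparable to the yyyy-MM-ddThh:mm:ssZ row above.
    private static final DateTimeFormatter STRICT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssXXX");

    // Lenient ISO parser with optional sections, standing in for
    // strict_date_optional_time.
    private static final DateTimeFormatter OPTIONAL =
        DateTimeFormatter.ISO_DATE_TIME;

    @Benchmark
    public Object strictPattern() {
        return STRICT.parse(SAMPLE);
    }

    @Benchmark
    public Object optionalPattern() {
        return OPTIONAL.parse(SAMPLE);
    }
}
```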
CPU % Diff Matrix:
Baseline \ Candidate | strict_date_time_no_millis | strict_date_optional_time | yyyy-MM-ddThh:mm:ssZ |
---|---|---|---|
strict_date_time_no_millis | NA | $${\color{red}-161}$$ | $${\color{green}42}$$ |
strict_date_optional_time | $${\color{green}61}$$ | NA | $${\color{green}71}$$ |
yyyy-MM-ddThh:mm:ssZ | $${\color{red}-72}$$ | $${\color{red}-350}$$ | NA |
Grab workload benchmark (Grab is synthetically generated data; osb-benchmark didn't have a workload for testing different datetime mappings except http_logs, which has a datetime type fallback where I didn't notice any significant improvement).
We found no significant difference in search performance between the two formats.
With respect to the implementation for this issue, we have a couple of approaches for caching the datetime field format:
- Cache it on a node level: if the last `X` datetime parses continuously succeeded for a specific datetime format on the node, we cache that format and, for each document parse, try the cached format first, provided the field mapping actually contains it. This is a micro-optimization on a per-node basis. It may lead to not honoring the order in which the customer defined the datetime field formatters, and any cached state is reset on a node restart.
- Cache it on a shard level: on each data node we maintain a mapping from shard to cached datetime format. The caching criteria are similar, i.e. if the last `X` datetime parsing requests on a particular shard succeeded for a specific datetime format, that format is cached for the shard and tried first for subsequent doc parsing requests on that shard. This may also lead to OpenSearch not honoring the order of the datetime field formatters provided by the user, and the cached state is again reset on a node restart. It does, however, allow different parsing flows on different shards of the same index, so it may also address shard-level hotspots receiving different types of datetime fields.
- Cache the datetime field format on an index level: after each successful datetime parse we update the index metadata datetime field mapping list with a reordered version that puts the last used date formatter first, so that it is tried first for subsequent indexing requests. This may also lead to not honoring the order of formatters provided by the user in the datetime field mapping. The cached order does not get reset on a node restart and is uniform across the shards for the particular datetime field.
- In the case where the user has provided multiple optional datetime formats but a stricter datetime format sufficed for the last `X` datetime field parses, we may choose to override the datetime format provided by the user with a stricter, more optimized version and cache it on an index level for further use. This would not honor the list of datetime formats provided by the user, so customer expectations need to be aligned on this (a rough sketch of this success-streak check follows this list).
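A rough sketch of the success-streak rule shared by these options, with hypothetical names and threshold; where the counter lives (node, shard, or index metadata) is exactly the trade-off described above:

```java
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.temporal.TemporalAccessor;

// Hypothetical illustration of the "promote after X consecutive successes" rule.
// In a real change the counter would be kept per node, per shard, or in index
// metadata, which is the trade-off discussed above.
public class StrictFormatPromoter {
    private final DateTimeFormatter configured;      // what the user mapped (lenient)
    private final DateTimeFormatter strictCandidate; // stricter format we hope suffices
    private final int threshold;                     // "X" in the discussion above

    private int consecutiveStrictSuccesses = 0;
    private boolean promoted = false;

    public StrictFormatPromoter(DateTimeFormatter configured,
                                DateTimeFormatter strictCandidate,
                                int threshold) {
        this.configured = configured;
        this.strictCandidate = strictCandidate;
        this.threshold = threshold;
    }

    public TemporalAccessor parse(String value) {
        if (promoted) {
            try {
                return strictCandidate.parse(value);
            } catch (DateTimeParseException e) {
                promoted = false;                 // demote and fall back
                consecutiveStrictSuccesses = 0;
            }
        } else {
            try {
                TemporalAccessor result = strictCandidate.parse(value);
                if (++consecutiveStrictSuccesses >= threshold) {
                    promoted = true;              // prefer the strict format from now on
                }
                return result;
            } catch (DateTimeParseException e) {
                consecutiveStrictSuccesses = 0;   // streak broken
            }
        }
        return configured.parse(value);           // user-configured (lenient) path
    }
}
```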
Once we implement caching, a quick win will be to add a stricter format like strict_date_time_no_millis to the default formatter as one of the formatters, so that the overhead of strict_date_optional_time is minimized when the datetime fields conform to the stricter format at runtime.
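As an illustration of that quick win (the exact ordering and naming would need validation), the idea is to prepend the strict variant to the existing default format string so that documents matching it skip the optional-segment path:

```java
// Hypothetical illustration only; format names are taken from the discussion above.
public class DefaultFormatQuickWin {
    // Today's default mapping format for date fields.
    static final String CURRENT = "strict_date_optional_time||epoch_millis";

    // Quick win: try the fully strict no-millis variant first, so documents that
    // match it skip the strict_date_optional_time optional-segment overhead;
    // everything else falls through to the existing behavior.
    static final String PROPOSED = "strict_date_time_no_millis||strict_date_optional_time||epoch_millis";
}
```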
@prabs @tharejas @mgodwan please provide your thoughts on this