Dynamic FieldType inference based on random sampling of documents
Description
This class performs type inference by analyzing the `_source` of documents. It is useful for inferring the types of nested fields within derived fields of the object type. See https://github.com/opensearch-project/OpenSearch/issues/13143 for more details on the requirement.
It uses a random sample of documents to infer the field type, similar to the type-guessing logic of dynamic mapping. Unlike guessing based on the first document, where the field could be missing, this method draws a random sample to make a more accurate inference. This approach is especially useful for handling missing fields, which can be common in nested fields within derived fields of the object type.

The sample size should be chosen carefully to ensure a high probability of selecting at least one document where the field is present. However, it is essential to strike a balance, because a large sample size can lead to performance issues: each sampled document's `_source` is loaded and examined until the field is found. Determining the sample size S is akin to deciding how many balls to draw from a bin so that, with high probability (>= P), we draw at least one green ball (a document with the field) from a mixture of R red balls (documents without the field) and G green balls:
P >= 1 - C(R, S) / C(R + G, S)
Here, C() represents the binomial coefficient. For a high confidence level, we aim for P >= 0.95. For example, with 10^7 documents where the field is present in 2% of them, the sample size S should be around 149 to achieve a probability of 0.95.
I used a small Python script to calculate the sample sizes above.
Related Issues
Resolves https://github.com/opensearch-project/OpenSearch/issues/13143
Check List
- [x] New functionality includes testing.
- [x] All tests pass
- [x] New functionality has been documented.
- [x] New functionality has javadoc added
- [x] Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
- [x] Commits are signed per the DCO using --signoff
- ~~[ ] Commit changes are listed out in CHANGELOG.md file (See: Changelog)~~
- ~~[ ] Public documentation issue/PR created~~
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.
:x: Gradle check result for 31d21522a795f6bafc55847efb48fae6401701bb: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for 5ed477e30bb03ada40ffc42004eb2aaa3c4ad707: FAILURE
:x: Gradle check result for 540ff72e1455fbc41519428655dafa28ab2dabad: FAILURE
:x: Gradle check result for 16c2071c6665a9fad56741d184f1abcbaecc5f19: FAILURE
:white_check_mark: Gradle check result for ca050d0f376fe0125d23455dab5e57988a2af38a: SUCCESS
:white_check_mark: Gradle check result for d9bf558f8d8cfbff9318a522849d8a1221c898f7: SUCCESS
Codecov Report
Attention: Patch coverage is 87.83784% with 9 lines in your changes missing coverage. Please review.
Project coverage is 71.55%. Comparing base (b15cb0c) to head (c546323). Report is 337 commits behind head on main.
| Files | Patch % | Lines |
|---|---|---|
| ...rg/opensearch/index/mapper/FieldTypeInference.java | 87.83% | 4 Missing and 5 partials :warning: |
Additional details and impacted files
```diff
@@             Coverage Diff             @@
##               main   #13592      +/-   ##
============================================
+ Coverage     71.42%   71.55%    +0.13%
- Complexity    59978    61308     +1330
============================================
  Files          4985     5066       +81
  Lines        282275   288198     +5923
  Branches      40946    41729      +783
============================================
+ Hits         201603   206225     +4622
- Misses        63999    64938      +939
- Partials      16673    17035      +362
```
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:x: Gradle check result for b74c3536a0f3bbfa9cf7b4e8b095259a076f4920: FAILURE
~~Also, why skip-changelog? This will be consumed by the DerivedFieldMapper in some way I am assuming, wouldn't we want to keep track of this change?~~ I realized that the final PR will probably have the changelog entry.
:x: Gradle check result for f6ed5e1aa3d672dafdf463870e3b141d669dd44d: null
@msfroh @harshavamsi what do you think about the order in which we should scan the segments here? If we start with the smaller segments and the field is found, it would be pretty fast; whereas if we start with a bigger segment, the odds of finding the field are higher, but at the cost of loading a bigger segment. So for rare fields the latter performs better, whereas for common fields the former performs better.
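To make the tradeoff concrete, here is a toy sketch of the two strategies (hypothetical Python abstracting a segment as a list of doc ids; not the actual OpenSearch/Lucene code, where segments would be `LeafReaderContext`s):

```python
import random

def find_field_doc(segments, has_field, sample_per_segment, smallest_first=True):
    """Scan segments in size order, sampling docs from each until the field is found.

    segments: list of segments, each a list of doc ids.
    has_field: predicate telling whether a doc's _source contains the field.
    Returns (doc_id, docs_examined), or (None, docs_examined) if never found.
    """
    ordered = sorted(segments, key=len, reverse=not smallest_first)
    examined = 0
    for segment in ordered:
        for doc in random.sample(segment, min(sample_per_segment, len(segment))):
            examined += 1  # each examined doc costs one _source load
            if has_field(doc):
                return doc, examined
    return None, examined
```

With `smallest_first=True`, a common field is typically found after loading only a few documents from a cheap segment; with `smallest_first=False`, a rare field is more likely to be found at all (large segments cover more documents), but the first `_source` loads come from the most expensive segment.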
:x: Gradle check result for b0050b963aed0e3fe46726e2eb3c39d2f5a9ab2b: FAILURE
:white_check_mark: Gradle check result for 5e276cb4531a9588faa6ff82b7026c14223fc310: SUCCESS
> @msfroh @harshavamsi what do you think about the order in which we should scan the segments here? If we start with the smaller segments and the field is found, it would be pretty fast; whereas if we start with a bigger segment, the odds of finding the field are higher, but at the cost of loading a bigger segment. So for rare fields the latter performs better, whereas for common fields the former performs better.
I think I would optimize more for the common fields.
I appreciate that an advantage of this feature is that it's another way of handling a mix of different document types, similar to flat_object fields -- just pushing the hard work to search time, rather than flattening at indexing time. But at the same time, I feel like it makes more sense to assume that you want to search on "relatively" common fields (i.e. fields present in at least 5-10% of documents).
@msfroh looking at the holistic picture, I agree that optimizing for common fields is the wiser choice. If you think this isn't super critical, I can take it up in a subsequent PR.
:grey_exclamation: Gradle check result for c546323abedc16f3d222ec968713d6daa8ef2ca8: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.