Dynamic FieldType inference based on random sampling of documents
Description
This class performs type inference by analyzing the `_source` of documents. It is useful for inferring the types of nested fields within derived fields of the object type. See https://github.com/opensearch-project/OpenSearch/issues/13143 for more details on the requirement.
It uses a random sample of documents to infer the field type, similar to the type-guessing logic of dynamic mapping. Unlike guessing based on the first document, where the field could be missing, this method draws a random sample to make a more accurate inference. This approach is especially useful for handling missing fields, which can be common in nested fields within derived fields of the object type.

The sample size should be chosen carefully to ensure a high probability of selecting at least one document where the field is present. However, it is essential to strike a balance, because a large sample size can lead to performance issues: each sampled document's `_source` is loaded and examined until the field is found. Determining the sample size S is akin to deciding how many balls to draw from a bin so that, with high probability (>= P), we draw at least one green ball (a document with the field) from a mixture of R red balls (documents without the field) and G green balls:
P >= 1 - C(R, S) / C(R + G, S)
Here, C() represents the binomial coefficient. For a high confidence level, we aim for P >= 0.95. For example, with 10^7 documents where the field is present in 2% of them, the sample size S should be around 149 to achieve a probability of 0.95.
I used a small Python script to calculate the sample sizes above.
Related Issues
Resolves https://github.com/opensearch-project/OpenSearch/issues/13143
Check List
- [x] New functionality includes testing.
- [x] All tests pass
- [x] New functionality has been documented.
- [x] New functionality has javadoc added
- [x] Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
- [x] Commits are signed per the DCO using --signoff
- ~~[ ] Commit changes are listed out in CHANGELOG.md file (See: Changelog)~~
- ~~[ ] Public documentation issue/PR created~~
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.
:x: Gradle check result for 31d21522a795f6bafc55847efb48fae6401701bb: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for 5ed477e30bb03ada40ffc42004eb2aaa3c4ad707: FAILURE
:x: Gradle check result for 540ff72e1455fbc41519428655dafa28ab2dabad: FAILURE
:x: Gradle check result for 16c2071c6665a9fad56741d184f1abcbaecc5f19: FAILURE
:white_check_mark: Gradle check result for ca050d0f376fe0125d23455dab5e57988a2af38a: SUCCESS
:white_check_mark: Gradle check result for d9bf558f8d8cfbff9318a522849d8a1221c898f7: SUCCESS
Codecov Report
Attention: Patch coverage is 87.83784% with 9 lines in your changes missing coverage. Please review.
Project coverage is 71.55%. Comparing base (b15cb0c) to head (c546323). Report is 337 commits behind head on main.
| Files | Patch % | Lines |
|---|---|---|
| ...rg/opensearch/index/mapper/FieldTypeInference.java | 87.83% | 4 Missing and 5 partials :warning: |
Additional details and impacted files
```diff
@@             Coverage Diff             @@
##               main   #13592      +/-   ##
============================================
+ Coverage     71.42%   71.55%    +0.13%
- Complexity    59978    61308     +1330
============================================
  Files          4985     5066       +81
  Lines        282275   288198     +5923
  Branches      40946    41729      +783
============================================
+ Hits         201603   206225     +4622
- Misses        63999    64938      +939
- Partials      16673    17035      +362
```
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:x: Gradle check result for b74c3536a0f3bbfa9cf7b4e8b095259a076f4920: FAILURE
~~Also, why skip-changelog? This will be consumed by the DerivedFieldMapper in some way I am assuming, wouldn't we want to keep track of this change?~~ I realized that the final PR will probably have the changelog entry.
:x: Gradle check result for f6ed5e1aa3d672dafdf463870e3b141d669dd44d: null
@msfroh @harshavamsi what do you think about the order in which we should scan the segments here? If we start with the smaller segments and the field is found, it would be pretty fast; whereas if we start with a bigger segment, the odds of finding the field are higher, but at the cost of loading a bigger segment. So for rare fields the latter performs better, whereas for common fields the former performs better.
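To make the tradeoff concrete, here is a toy sketch of the two strategies (hypothetical Python abstracting a segment as a list of doc ids; not the actual OpenSearch/Lucene code, where segments would be `LeafReaderContext`s):

```python
import random

def find_field_doc(segments, has_field, sample_per_segment, smallest_first=True):
    """Scan segments in size order, sampling docs from each until the field is found.

    segments: list of segments, each a list of doc ids.
    has_field: predicate telling whether a doc's _source contains the field.
    Returns (doc_id, docs_examined), or (None, docs_examined) if never found.
    """
    ordered = sorted(segments, key=len, reverse=not smallest_first)
    examined = 0
    for segment in ordered:
        for doc in random.sample(segment, min(sample_per_segment, len(segment))):
            examined += 1  # each examined doc costs one _source load
            if has_field(doc):
                return doc, examined
    return None, examined
```

With `smallest_first=True`, a common field is typically found after loading only a few documents from a cheap segment; with `smallest_first=False`, a rare field is more likely to be found at all (large segments cover more documents), but the first `_source` loads come from the most expensive segment.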
:x: Gradle check result for b0050b963aed0e3fe46726e2eb3c39d2f5a9ab2b: FAILURE
:white_check_mark: Gradle check result for 5e276cb4531a9588faa6ff82b7026c14223fc310: SUCCESS
> @msfroh @harshavamsi what do you think about the order in which we should scan the segments here? If we start with the smaller segments and the field is found, it would be pretty fast; whereas if we start with a bigger segment, the odds of finding the field are higher, but at the cost of loading a bigger segment. So for rare fields the latter performs better, whereas for common fields the former performs better.
I think I would optimize more for the common fields.
I appreciate that an advantage of this feature is that it's another way of handling a mix of different document types, similar to flat_object fields -- just pushing the hard work to search time, rather than flattening at indexing time. But at the same time, I feel like it makes more sense to assume that you want to search on "relatively" common fields (i.e. fields present in at least 5-10% of documents).
@msfroh looking at the holistic picture, I agree that optimizing for common fields is the wiser choice. If you think this isn't super critical, I can take it up in a subsequent PR.
:grey_exclamation: Gradle check result for c546323abedc16f3d222ec968713d6daa8ef2ca8: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.