datawave Add FederatedQueryPlanner

Adds a FederatedQueryPlanner that splits a query into multiple sub-queries, each scanning a subset of the original target date range, when field index holes are identified for the query within that range.
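To illustrate the date-range splitting idea, here is a minimal sketch. It is not the code in this PR: `DateRangeSplitter`, `Range`, and `split` are hypothetical names, and the sketch assumes the holes are already known, sorted by start date, and non-overlapping. It cuts the query range so that every sub-range lies either entirely inside or entirely outside a hole.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Datawave.
final class DateRangeSplitter {

    /** An inclusive date range. */
    static final class Range {
        final LocalDate start;
        final LocalDate end;

        Range(LocalDate start, LocalDate end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return start + ".." + end;
        }
    }

    /**
     * Splits [start, end] into consecutive sub-ranges so that each one lies
     * entirely inside or entirely outside a hole.
     * Assumption: holes are sorted by start date and do not overlap.
     */
    static List<Range> split(LocalDate start, LocalDate end, List<Range> holes) {
        List<Range> result = new ArrayList<>();
        LocalDate cursor = start;
        for (Range hole : holes) {
            // Skip holes that fall completely outside the remaining range.
            if (hole.end.isBefore(cursor) || hole.start.isAfter(end)) {
                continue;
            }
            // Clamp the hole to the part of the query range not yet covered.
            LocalDate holeStart = hole.start.isBefore(cursor) ? cursor : hole.start;
            LocalDate holeEnd = hole.end.isAfter(end) ? end : hole.end;
            // Emit the hole-free stretch before the hole, if there is one.
            if (cursor.isBefore(holeStart)) {
                result.add(new Range(cursor, holeStart.minusDays(1)));
            }
            result.add(new Range(holeStart, holeEnd));
            cursor = holeEnd.plusDays(1);
        }
        // Emit whatever remains after the last hole.
        if (!cursor.isAfter(end)) {
            result.add(new Range(cursor, end));
        }
        return result;
    }
}
```

With a query range of 2015-04-04 to 2015-11-11 and a single hypothetical one-day hole on 2015-10-10, `split` returns three sub-ranges: 2015-04-04..2015-10-09, 2015-10-10..2015-10-10, and 2015-10-11..2015-11-11.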

Note: the work in this PR is dependent on the work in:

Closes #825

Jan 12 '24 21:01 lbschanno

I have verified that the FederatedQueryPlanner is properly determining the date ranges that need to be used when creating the sub-queries. However, I am having difficulties getting the results from the sub-queries to return bundled together. At this point I think I'm running into a deficit of Datawave-specific knowledge on where exactly I need to make changes to support this.

I have added FederatedShardQueryConfiguration and ChainedScheduler to ensure that a scheduler is created based off the configs for each sub-query, but so far, only results from the first sub-query are being returned (see FederatedQueryTest). I would appreciate any feedback on this.
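For reference, the behavior I am aiming for can be sketched as a plain chained iterator. This is not the ChainedScheduler in this PR and does not use any Datawave API; `ChainedIterator` is a hypothetical name, and the sketch only assumes that each sub-query exposes its results as an `Iterable`. The key detail is that an exhausted or empty source must not end iteration while later sources remain.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration only; not Datawave's ChainedScheduler.
final class ChainedIterator<T> implements Iterator<T> {

    private final Iterator<? extends Iterable<T>> sources;
    private Iterator<T> current;

    ChainedIterator(List<? extends Iterable<T>> sources) {
        this.sources = sources.iterator();
    }

    @Override
    public boolean hasNext() {
        // Advance past exhausted (or empty) sources until one has an element
        // or there are no sources left.
        while ((current == null || !current.hasNext()) && sources.hasNext()) {
            current = sources.next().iterator();
        }
        return current != null && current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}
```

If iteration stopped as soon as the first source reported no further elements, only the first sub-query's results would come back, which matches the symptom described above. Whether that is the actual cause here is unconfirmed.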

Feb 27 '24 14:02 lbschanno

From @lbschanno

I have been working on getting tests to pass when the FederatedQueryPlanner is the default query planner. Some cases have shown that it may not be enough to simply use the first config returned from a sub-query as the finalized query string.

For instance, in MaxExpansionIndexOnlyQueryTest.testMaxAnyField(), the sub-queries have the following results when the max value expansion threshold is set to 20:

Sub-query 0 over 2015/04/04-2015/10/09
Query String: false
Query Data Iterable: Empty

Sub-query 1 over 2015/10/10-2015/10/10
Query String: (CODE == 'b-code' || CITY == 'b-city' || CITY == 'b-2' || CITY == 'b-1' || STATE == 'b-state') && (CODE == 'a-code' || CITY == 'a-1' || STATE == 'a-state' || STATE == 'a-s2')
Query Data Iterable: contains 3 query datas

Sub-query 2 over 2015/10/11-2015/11/11
Query String: false
Query Data Iterable: Empty

In MaxExpansionIndexOnlyQueryTest.testMaxValueRegexIndexOnly(), we receive the following sub-query results when the max expansion threshold is set to 20:

Sub-query 0 over 2015/04/04-2015/10/09
Query String: CITY == 'a-1' && STATE =~ 'b.*'
Query Data Iterable: Empty

Sub-query 1 over 2015/10/10-2015/10/10
Query String: CITY == 'a-1' && (STATE == 'b3-state' || STATE == 'b-state' || STATE == 'bi-s' || STATE == 'b2-state' || STATE == 'ba-s2')
Query Data Iterable: contains 2 query datas

Sub-query 2 over 2015/10/11-2015/11/11
Query String: CITY == 'a-1' && STATE =~ 'b.*'
Query Data Iterable: Empty

In MaxExpansionIndexOnlyQueryTest.testMaxValueNegAnyField(), we receive the following sub-query results when the max expansion threshold is set to 10:

Sub-query 0 over 2015/04/04-2015/10/09
Query String: false
Query Data Iterable: Empty

Sub-query 1 over 2015/10/10-2015/10/10
Query String: STATE == 'b-state' && !(((Delayed = true) && (ANYFIELD =~ 'a.*')) || CODE == 'a-code' || CITY == 'a-1' || STATE == 'a-state' || STATE == 'a-s2')
Query Data Iterable: contains 4 query datas

Sub-query 2 over 2015/10/11-2015/11/11
Query String: false
Query Data Iterable: Empty

I have seen similar results for MaxExpansionQueryTest and AnyFieldQueryTest. Given that the sub-queries can produce differing query strings, how do you want to determine which query string to set in the original config that is passed into the FederatedQueryPlanner.process() method? Do we need the query string specific to the query data iterable returned from each sub-query when setting up the schedulers in ShardQueryLogic.setUpQuery(GenericQueryConfiguration config)? Is there anywhere else where we need to know the specific query string from each sub-query?

I have pushed an update that adds tests to MaxExpansionIndexOnlyQueryTest with versions of each test using either the DefaultQueryPlanner or FederatedQueryPlanner so that you can see the results for yourself.

I suppose another question would be: do the results above even look correct to you?

Mar 13 '24 18:03 ivakegg

For documentation purposes, here is the response in the conversation we had:

The tests that check the query plan would need to be changed to handle the one returned by the federated query planner.
The federated query plan that gets returned should probably concatenate the plans (as a unique set) into something like this: ((plan = 1) && (<planA>)) || ((plan = 2) && (<planB>)) ... or simply use <planA> if only one plan is in the set.
We can work later to allow the query metrics to handle multiple top-level plans after the sub-plan work is pulled in and proven viable.
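A minimal sketch of the plan concatenation described above, assuming the sub-query plans are available as plain strings in sub-query order. `FederatedPlanCombiner` and `combine` are hypothetical names, not existing Datawave classes, and numbering the plans by first appearance is an assumption.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Datawave.
final class FederatedPlanCombiner {

    /**
     * Combines sub-query plans into one plan string. Duplicate plans are
     * collapsed, keeping first-seen order. A single unique plan is returned
     * as-is; otherwise each plan is tagged with a 1-based plan number.
     */
    static String combine(Collection<String> subPlans) {
        // LinkedHashSet removes duplicates while preserving insertion order.
        Set<String> unique = new LinkedHashSet<>(subPlans);
        if (unique.isEmpty()) {
            throw new IllegalArgumentException("at least one sub-query plan is required");
        }
        if (unique.size() == 1) {
            return unique.iterator().next();
        }
        List<String> clauses = new ArrayList<>();
        int planNumber = 1;
        for (String plan : unique) {
            clauses.add("((plan = " + planNumber + ") && (" + plan + "))");
            planNumber++;
        }
        return String.join(" || ", clauses);
    }
}
```

For the three testMaxValueRegexIndexOnly() sub-query strings, sub-queries 0 and 2 share the same plan, so the unique set has two entries and the combined plan has two clauses. If every sub-query returns the same plan (for example, all `false`), that plan is returned unchanged.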

Mar 15 '24 12:03 ivakegg
