drill
drill copied to clipboard
DRILL-8480: Cleanup before finished. 0 out of 1 streams have finished
DRILL-8480: Make Nested Loop Join operator properly process empty batches and batches with new schema
Description
Nested Loop Join operator (NestedLoopJoinBatch, NestedLoopJoin) unproperly handles batch iteration outcome OK with 0 records. Drill design of the processing of batches involves 5 states:
NONE(batch can have only 0 records)OK(batch can have 0+ records)OK_NEW_SCHEMA(batch can have 0+ records)NOT_YET(undefined)EMIT(batch can have 0+ records) The Nested Loop Join operator in some circumstances could receiveOKoutcome with 0 records, and instead of requesting the next batch, the operator stops data processing and returnsNONEoutcome to upstream batches(operators) without freeing resources of underlying batches.
Solution
Make the Nested Loop Join operator properly handle OK and OK_NEW_SCHEMA outcomes with 0 records and keep processing until NONE and NOT_YET outcomes are received.
Make the Nested Loop Join operator keep processing even if OK_NEW_SCHEMA outcome is received, but the schema wasn’t changed(yes, I know it sounds wild, but it’s possible and it is expected behavior).
Documentation
Testing
Manual testing with a file from the Jira ticket DRILL-8480
@rymarm. What is the status of this PR? Is it ready for merging?
@cgivre Actually, there is one more thing, that I would fix in the scope of this PR: https://github.com/apache/drill/blob/a726a4544dfbf1427f41fb916d3d976bd511189b/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/NestedLoopJoinBatch.java#L396-L399
These few lines seem to be kludge which may not work in some very rare cases, but I forgot what issue occurs if remove it. I would like to take a look at this, but I can do this in a separate PR because the current issue is completely fixed with the current changes.