Fixed ClassCastException during Multi-Stage Queries on real-time segments
Fixed issue when msq queries throw ClassCastException with "includeSegmentSource": "REALTIME" and supervisor running.
Caused by: java.lang.ClassCastException: class [B cannot be cast to class org.apache.druid.query.aggregation.datasketches.hll.HllSketchHolder ([B is in module java.base of loader 'bootstrap'; org.apache.druid.query.aggregation.datasketches.hll.HllSketchHolder is in unnamed module of loader java.net.URLClassLoader @38792286)
at org.apache.druid.query.aggregation.datasketches.hll.HllSketchHolderObjectStrategy.toBytes(HllSketchHolderObjectStrategy.java:31)
at org.apache.druid.segment.serde.ComplexMetricSerde.toBytes(ComplexMetricSerde.java:119)
at org.apache.druid.frame.field.ComplexFieldWriter.writeTo(ComplexFieldWriter.java:65)
at org.apache.druid.frame.write.RowBasedFrameWriter.writeDataUsingFieldWriters(RowBasedFrameWriter.java:291)
at org.apache.druid.frame.write.RowBasedFrameWriter.writeData(RowBasedFrameWriter.java:246)
at org.apache.druid.frame.write.RowBasedFrameWriter.addSelection(RowBasedFrameWriter.java:122)
at org.apache.druid.msq.querykit.scan.ScanQueryFrameProcessor.populateFrameWriterAndFlushIfNeeded(ScanQueryFrameProcessor.java:348)
at org.apache.druid.msq.querykit.scan.ScanQueryFrameProcessor.populateFrameWriterAndFlushIfNeededWithExceptionHandling(ScanQueryFrameProcessor.java:329)
at org.apache.druid.msq.querykit.scan.ScanQueryFrameProcessor.runWithLoadedSegment(ScanQueryFrameProcessor.java:231)
at org.apache.druid.msq.querykit.BaseLeafFrameProcessor.runIncrementally(BaseLeafFrameProcessor.java:87)
at org.apache.druid.msq.querykit.scan.ScanQueryFrameProcessor.runIncrementally(ScanQueryFrameProcessor.java:158)
at org.apache.druid.frame.processor.FrameProcessors$1FrameProcessorWithBaggage.runIncrementally(FrameProcessors.java:75)
at org.apache.druid.frame.processor.FrameProcessorExecutor$1ExecutorRunnable.runProcessorNow(FrameProcessorExecutor.java:230)
... 8 more
After some investigation was found that selector's getObject() for real-time segments was returning a byte array representation of HllSketch but still was trying to convert it to a byte[]. While HllSketchHolderObjectStrategy expects to have HllSketchHolder instance in toBytes() leading to the ClassCastException.
This PR has:
- [x] been self-reviewed.
- [ ] added documentation for new or modified features or behaviors.
- [ ] a release note entry in the PR description.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in licenses.yaml
- [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
- [ ] added integration tests.
- [x] been tested in a test Druid cluster.
Thanks for the changes @nozjkoitop. I'll review them shortly.
Thanks for the changes @nozjkoitop. I'll review them shortly.
@LakshSingla Could I remind you to have a look on that please?
Sorry for the delay @nozjkoitop and thanks for the update 🚀
Can you please add a test in the MSQSelectTest or MSQInsertTest to validate that the PR fails before with the exception you are seeing and passes after this patch? This should be easy to reproduce. LMK if you need help with that.
The new approach is a lot cleaner than the previous one, but I still wanna see if there's a way to restrict the change to MSQ, since I suspect that there's a faulty selector at play. The MSQ test case would help in debugging this as well.
Can you please add a test in the
MSQSelectTestorMSQInsertTestto validate that the PR fails before with the exception you are seeing and passes after this patch? This should be easy to reproduce.
I've added new test-case with hyperUnique column in select from realtime segment, please have a look
@LakshSingla WDYT?
Hey @nozjkoitop apologies for the delay. Let me take a look at the patch and get back to you in a couple of days. It seems correct for the most part. I just need to think if there's a less invasive way to do this given that the regression is in MSQ. I wasn't able to get to it due to a few other patches I was working on, but I'll prioritize this change. Thanks a lot for being patient with me!
Hey @LakshSingla, just a kindly reminder about this PR :)
@nozjkoitop I think @LakshSingla's comment was to only make changes that affect MSQ, since the bug is seen only there. I think your changes might affect other queries as well.
@nozjkoitop I think @LakshSingla's comment was to only make changes that affect MSQ, since the bug is seen only there. I think your changes might affect other queries as well.
Hi @adarshsanjeev ! The main issue is in ComplexFieldWriter, and it's caused by the ValueSelector potentially returning a byte[], which the writer doesn't expect. This needs to be handled, and could be addressed it in either the writeTo() method or serde.toBytes(). Both of these classes are part of Druid's processing, and honestly, I don’t see any other places that would be more isolated for handling this.
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.
Hi @LakshSingla I've checked the latest druid and described issue is still there, although now there is a NPE on the ui, but provided patch still resolves an issue, so I merged the changes from master. Please take a look.
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.
Issue is still there in v32, merged master to solve conflicts
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.
This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
Something seems off with the rebase on master, there's a bunch of extra files. That being said, it seems to me like this error points at a situation where some aspect of the design is sketchy. I will reopen it. @adarshsanjeev has been looking recently into realtime queries in MSQ- would you mind taking a look at this bug?
Sure, I can take a look at this bug
@adarshsanjeev i see that i messed up the merging before, now ive reset the branch to here and merged it properly. Shall i push that?
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.
This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.