Ability to configure Druid to fail a query that tries to reach an unavailable segment
Description
A configuration option (e.g. druid.sql.planner.failOnMissingSegments) that, when set to true, makes Druid check the segment metadata for the __time range specified in the query and fail the query if there's an unavailable realtime segment for that datasource.
Motivation
I recognize this isn't a perfect system, but realtime ingestion typically ingests data for the current time, so if it fails (e.g. a realtime segment gets dropped due to a failure during handover) then data that queries were previously returning appears to disappear. Our users see this as inconsistent/unreliable data, and trust in Druid falls. Our users would prefer an error here, something akin to "Data is missing for __time specified".
As additional motivation: if ingestion is failing, reduced load on the failing nodes might help. Failing queries before they ever hit the ingestion nodes would reduce that load.
Would be great. A few thoughts:
- I think we'd want the option to be settable in query context. If we do this we'll also get a server-wide setting for free, since you can set server-wide defaults for query context parameters.
- I think we'd want the option to apply to published, unavailable, non-realtime segments too. That suggests the Broker should keep a list of published segments and verify that they are all available. This would also cover the handoff case, since segments being handed off are published.
- The case of segments that haven't been handed off yet is tricky: the list of segments that should exist isn't currently registered anywhere central that the Broker would be able to get at. There's the pending segments table, but that also includes segments that shouldn't exist (because they can be abandoned). Perhaps a solution here would involve ensuring the pending segments table only includes segments that should exist (i.e., delete abandoned records immediately).
- We'd need to make sure we correctly handle the case of time-chunk replacement. In this case, while the new set of segments isn't yet fully available, we still want to ensure we recognize the old set as valid. We shouldn't throw errors just because the new set isn't pushed out yet.
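To make the points above concrete, here is a rough Python sketch of the kind of check the Broker could run. It is only an illustration under stated assumptions, not Druid code: `Segment`, `resolve_flag`, `check_segments`, `MissingSegmentsError`, and the `failOnMissingSegments` context key are all hypothetical names, and the replacement handling assumes old and new segment sets share identical time-chunk boundaries, which is simpler than Druid's real overshadowing rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    # Hypothetical, simplified stand-in for a segment descriptor.
    datasource: str
    start: int      # interval start, inclusive (e.g. epoch millis)
    end: int        # interval end, exclusive
    version: str    # assumed to sort lexicographically, newest last
    partition: int


class MissingSegmentsError(Exception):
    def __init__(self, missing):
        super().__init__(f"Data is missing for __time specified: {missing}")
        self.missing = missing


def resolve_flag(query_context, server_defaults, key="failOnMissingSegments"):
    # The query context wins; otherwise fall back to the server-wide default.
    # The key name is an assumption taken from the example in this issue.
    return bool(query_context.get(key, server_defaults.get(key, False)))


def check_segments(published, available, datasource, start, end):
    """Raise MissingSegmentsError if the queried interval is not fully served."""
    available = set(available)

    # Group the published segments that overlap the query by time chunk,
    # then by version.
    chunks = {}
    for seg in published:
        if seg.datasource == datasource and seg.start < end and start < seg.end:
            by_version = chunks.setdefault((seg.start, seg.end), {})
            by_version.setdefault(seg.version, set()).add(seg)

    missing = []
    for by_version in chunks.values():
        versions = sorted(by_version, reverse=True)  # newest first
        # A chunk is servable if ANY version of it is completely available.
        # This keeps the old set valid while a replacement is still loading.
        if not any(by_version[v] <= available for v in versions):
            newest = by_version[versions[0]]
            missing.extend(s for s in newest if s not in available)

    if missing:
        raise MissingSegmentsError(missing)
```

A caller would run `check_segments` only when `resolve_flag(query_context, server_defaults)` is true. The sketch does not address segments that have not been handed off yet, since that needs a trustworthy central list that does not exist today.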
Nice! This would be a great feature. I cannot really comment on (2), (3), (4), but big 👍 for (1).
Agreed on all 4 of @gianm's points
This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the [email protected] list. Thank you for your contributions.