Druid returns rows with null values even though `is not null` is used along with other conditions
Affected Version
Versions 25.0.0 and 29.0.1
Description
Please see the following screenshots for details.
- When we apply both of these conditions: `payment_card_type is not null AND payment_card_type <> 'AMEX'`, rows with `payment_card_type = null` will still show up.
- When we apply both of these conditions: `payment_card_type != 'null' AND payment_card_type <> 'AMEX'`, rows with `payment_card_type = null` will still show up.
- When we apply only this condition: `payment_card_type is not null`, rows with `payment_card_type = null` will NOT show up.
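For context, under standard SQL three-valued logic both combined filters above should exclude the null rows, which is why this behavior looks like a bug. A minimal Python model of SQL's comparison semantics (my own illustration for this thread, not Druid code):

```python
# Model SQL three-valued logic: comparisons with NULL yield UNKNOWN (here: None),
# and a WHERE clause keeps a row only when the predicate is strictly True.
def sql_ne(a, b):
    """SQL '<>' / '!=': UNKNOWN (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a != b

def sql_is_not_null(a):
    return a is not None

def sql_and(x, y):
    # SQL AND: False if either side is False; UNKNOWN if any side is UNKNOWN.
    if x is False or y is False:
        return False
    if x is None or y is None:
        return None
    return True

rows = ['VISA', 'AMEX', None]

# Filter 1: payment_card_type IS NOT NULL AND payment_card_type <> 'AMEX'
kept1 = [r for r in rows if sql_and(sql_is_not_null(r), sql_ne(r, 'AMEX')) is True]

# Filter 2: payment_card_type != 'null' AND payment_card_type <> 'AMEX'
# Note 'null' here is the *string* 'null', so a real NULL makes the predicate
# UNKNOWN, and the row is still excluded from the result.
kept2 = [r for r in rows if sql_and(sql_ne(r, 'null'), sql_ne(r, 'AMEX')) is True]

print(kept1)  # ['VISA'] -- NULL rows excluded under SQL-compatible semantics
print(kept2)  # ['VISA'] -- NULL row dropped because NULL != 'null' is UNKNOWN
```

In both cases the null row should be filtered out, so a query that still returns it suggests the null handling mode of the cluster is involved.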
Hi any updates?
Hi @carltal,
Do you have any sample data for debugging this issue? I tried to generate some sample data and ran a few queries on Druid 29.0.1, and I see that the queries return correct values in all cases. Screenshots below:
- `"merchant_id" = 'M101' AND "__time" >= '2023-01-01' AND "__time" < '2023-12-01'` (all records)
- `"merchant_id" = 'M101' AND "__time" >= '2023-01-01' AND "__time" < '2023-12-01' AND "payment_card_type" is not null AND "payment_card_type" <> 'AMEX'`
- `"merchant_id" = 'M101' AND "__time" >= '2023-01-01' AND "__time" < '2023-12-01' AND "payment_card_type" != null AND "payment_card_type" <> 'AMEX'`
- `"merchant_id" = 'M101' AND "__time" >= '2023-01-01' AND "__time" < '2023-12-01' AND ("payment_card_type" != null OR "payment_card_type" <> 'AMEX')`
- `"merchant_id" = 'M101' AND "__time" >= '2023-01-01' AND "__time" < '2023-12-01' AND "payment_card_type" is not null`
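In case it helps reproduce my setup, the sample data I generated was along these lines (a sketch only; the merchant ID, card types, and timestamps are invented for illustration):

```python
import csv
import random
from datetime import datetime, timedelta

# Generate a small synthetic CSV matching the columns discussed in this thread.
# None becomes an empty CSV field, which Druid can ingest as a null value.
random.seed(42)
card_types = ['VISA', 'AMEX', 'MASTERCARD', None]

start = datetime(2023, 1, 1)
with open('druid_data.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['ts', 'merchant_id', 'payment_card_type'])
    for i in range(200):
        ts = (start + timedelta(hours=i)).strftime('%Y-%m-%dT%H:%M:%SZ')
        w.writerow([ts, 'M101', random.choice(card_types)])
```

This writes 200 hourly rows, a mix of real card types and nulls, so all five filter variations above have something to match.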
Hi @ashwintumma23, greetings. May I know what kind of sample data you would like to have?
I just reproduced the issue on Druid 25.0.0. On Druid 29.0.1, I am getting an empty string instead of null.
Hi @carltal, can you share the format of the underlying data? And in terms of sample data, based on the format, can you export a couple of hundred records and upload them to Gist, Dropbox, or Google Drive? Thanks!
Hi @ashwintumma23, so payment_card_type is just VARCHAR
In terms of sample data, do you need all the rows in that particular datasource? I might not be able to provide them for compliance reasons (though the data are fake, our risk team may still not approve).
Also, I am not sure how to export data from Druid.
Thanks.
Hi @carltal, we won't need all the rows; a sample of a couple of hundred records should suffice. Not all columns are required either, and it is fine if the data is synthetically generated. In the interim, can you also try to load the sample data that I referenced above for my tests into your cluster, and run the queries to see if you get the same results as on my cluster? Thanks!
Hi @ashwintumma23, I see. I wouldn't be able to load that data as merchant_id is a BIGINT in our schema. Also, it does not contain all the columns in my schema.
For the sample data from my end, can you tell me which columns are needed? Also, is there a way to export that data from the Druid UI?
Thanks.
Hi @carltal,
> I wouldn't be able to load that data as merchant_id is a BIGINT in our schema. Also it does not contain all the columns in my schema.
Would it be possible to load the data as a new data source itself, and then try to run the queries?
> Also is there a way to export that data from Druid UI?

Not sure if it would help entirely with debugging, but to gain insight into what your data looks like, we can run a `SELECT * FROM` query and then click "Export results as CSV" on the right side.
Example:
@ashwintumma23 Hi, I will check with our risk team and get back to you as soon as possible. Thanks.
@ashwintumma23 Hi, I got approval to submit the sample data. It just contains 20+ rows. I hope this helps.
Hi @carltal ,
Thanks for sharing the sample data. I did try to load it into my local cluster, and I do see the correct results. The only changes I had to make to your CSV above were to remove the quotes (`"`) before loading, and to explicitly specify the schema.
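For reference, the quote removal I did amounts to something like this (a small sketch; the example rows are invented, assuming every field in the received CSV was wrapped in plain double quotes):

```python
import csv

# A tiny example of the quoted CSV as received (values invented for illustration).
with open('druid_data_quoted.csv', 'w', newline='') as f:
    f.write('"ts","merchant_id","payment_card_type"\n')
    f.write('"2023-01-01T00:00:00Z","M101","VISA"\n')
    f.write('"2023-01-01T01:00:00Z","M101",""\n')

# csv.reader strips the surrounding quotes while parsing, so rewriting the rows
# with csv.writer (minimal quoting) yields an unquoted file for ingestion.
with open('druid_data_quoted.csv', newline='') as src:
    rows = list(csv.reader(src))

with open('druid_data_clean.csv', 'w', newline='') as dst:
    csv.writer(dst).writerows(rows)

print(open('druid_data_clean.csv').read().splitlines()[0])
# ts,merchant_id,payment_card_type
```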
Some screenshots for Data Loading
- Here's the ingestion spec that was used:
```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "sqlIssueDebug",
        "filter": "druid_data.csv"
      },
      "inputFormat": {
        "type": "csv",
        "skipHeaderRows": 1,
        "findColumnsFromHeader": false,
        "columns": [
          "ts",
          "merchant_id",
          "payment_card_type"
        ]
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "dynamic"
      }
    },
    "dataSchema": {
      "dataSource": "sample_payment_data_csv_format",
      "timestampSpec": {
        "column": "ts",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": [
          {
            "type": "long",
            "name": "merchant_id"
          },
          "payment_card_type"
        ]
      },
      "granularitySpec": {
        "queryGranularity": "none",
        "rollup": false,
        "segmentGranularity": "hour"
      }
    }
  }
}
```
Querying
- Shows only the `not null` and non-`AMEX` values
We can also have a screen-sharing session to discuss further. You can find me on the Apache Druid Community Slack.
@ashwintumma23 Hi, thank you for the quick turnaround. I can try what you did in a new datasource and let you know what I see.
This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the [email protected] list. Thank you for your contributions.