[Bug] Query result duplicate primary key
Search before asking
- [X] I searched in the issues and found nothing similar.
Paimon version
0.7.0-incubating
Compute Engine
Flink 1.18.0
Minimal reproduce step
We have a Flink job that writes data to a Paimon table. The table options are:
+---------------------------+---------+
| key                       | value   |
+---------------------------+---------+
| bucket                    | 8       |
| scan.remove-normalize     | true    |
| deduplicate.ignore-delete | true    |
| changelog-producer        | none    |
| file.format               | parquet |
+---------------------------+---------+
What doesn't meet your expectations?
After the job has run for some time, we query the table in batch mode, and some query results contain duplicate primary keys.
When we upgrade the Paimon version used to query this table, the results still contain duplicate primary keys.
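A minimal sketch of how the duplicates could be confirmed from an exported batch query result. It assumes the rows are available as dictionaries and that the primary key column is `id`; the function name `find_duplicate_keys` and the sample rows are hypothetical and are not data from the affected table.

```python
from collections import Counter


def find_duplicate_keys(rows, key="id"):
    """Return {key_value: count} for key values that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return {k: n for k, n in counts.items() if n > 1}


# Hypothetical sample rows standing in for an exported query result.
sample_rows = [
    {"id": "a", "value": "1"},
    {"id": "b", "value": "2"},
    {"id": "a", "value": "3"},
]

duplicates = find_duplicate_keys(sample_rows)
```

For a primary key table the returned dictionary should be empty; any entry is a key that appears more than once in the result.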
Anything else?
I want to know what causes this problem. Is it caused by the writer operator? Does a later version fix this issue?
Are you willing to submit a PR?
- [ ] I'm willing to submit a PR!
This may happen if you change the bucket number without overwriting the table first.
The bucket number has not been modified since the table was created.
What is your table schema, and did you delete data before? I see that deduplicate.ignore-delete is true. @herefree
{
  "id" : 2,
  "fields" : [
    { "id" : 0, "name" : "", "type" : "STRING NOT NULL", "description" : "" },
    { "id" : 1, "name" : "", "type" : "STRING", "description" : "" },
    ......
    { "id" : 54, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 55, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 56, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 57, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 58, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 59, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 60, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 61, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 62, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 63, "name" : "", "type" : "STRING", "description" : "" }
  ],
  "highestFieldId" : 63,
  "partitionKeys" : [ ],
  "primaryKeys" : [ "id" ],
  "options" : {
    "bucket" : "8",
    "scan.remove-normalize" : "true",
    "deduplicate.ignore-delete" : "true",
    "changelog-producer" : "none",
    "file.format" : "parquet"
  },
  "comment" : "",
  "timeMillis" : 1722325416673
}
I didn't delete data before, but the changelog of the upstream table may contain -D records. I set deduplicate.ignore-delete to true because I don't want -D records to be written to this table, and so that Flink jobs consuming this table do not receive -D records.
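To make the expected behavior concrete, here is a toy model of deduplication with and without ignoring -D records. This is a sketch under stated assumptions, not Paimon's implementation: the function `merge_deduplicate`, the tuple layout, and the sample changelog are all hypothetical, and only the row kinds `+I`, `+U`, and `-D` are modeled.

```python
def merge_deduplicate(records, ignore_delete):
    """Toy model: keep the latest row per primary key.

    records: iterable of (row_kind, key, value) in write order,
    where row_kind is one of "+I", "+U", "-D".
    """
    state = {}
    for kind, key, value in records:
        if kind == "-D":
            if ignore_delete:
                continue  # drop the delete record, keep the existing row
            state.pop(key, None)  # apply the delete
        else:
            state[key] = value  # a later record overwrites the earlier one
    return state


# Hypothetical changelog, not taken from the affected job.
changelog = [
    ("+I", "k1", "v1"),
    ("-D", "k1", "v1"),
    ("+I", "k1", "v2"),
    ("+I", "k2", "v3"),
    ("-D", "k2", "v3"),
]

with_ignore = merge_deduplicate(changelog, ignore_delete=True)
without_ignore = merge_deduplicate(changelog, ignore_delete=False)
```

In this model both settings yield at most one row per key; the only difference is whether `k2` survives its -D record. That is why duplicate primary keys in the query result are unexpected with either setting.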
I also found that the duplicate rows are in the same bucket.
After I set deduplicate.ignore-delete = false, I no longer see duplicate primary keys, but I am not sure whether later versions fix this problem.
@herefree Could you provide detailed minimal reproduction steps so that we can reproduce this bug?