[Bug] Query result duplicate primary key

Open herefree opened this issue 1 year ago • 7 comments

Search before asking

  • [X] I searched in the issues and found nothing similar.

Paimon version

0.7.0-incubating

Compute Engine

Flink 1.18.0

Minimal reproduce step

We have a Flink job that writes data to a Paimon table. The table options are:

+---------------------------+---------+
| key                       | value   |
+---------------------------+---------+
| bucket                    | 8       |
| scan.remove-normalize     | true    |
| deduplicate.ignore-delete | true    |
| changelog-producer        | none    |
| file.format               | parquet |
+---------------------------+---------+
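A minimal Flink SQL sketch of a table with these options, for anyone trying to reproduce the setup. The table name `my_table` and the column names are placeholders and not taken from the report; only the option keys and values come from the table above.

```sql
-- Hypothetical table; only the WITH options are taken from the report.
CREATE TABLE my_table (
    id   STRING NOT NULL,
    col1 STRING,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'bucket' = '8',
    'scan.remove-normalize' = 'true',
    'deduplicate.ignore-delete' = 'true',
    'changelog-producer' = 'none',
    'file.format' = 'parquet'
);
```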

What doesn't meet your expectations?

After the job has run for some time, we query the table in batch mode, and some of the query results contain duplicate primary keys. [image] When we query this table with a newer Paimon version, the duplicate primary keys still appear.
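A check of this kind can be sketched in Flink SQL as follows. The table name `my_table` is a placeholder, and the primary key column is assumed to be named `id`.

```sql
-- Run the query in batch mode.
SET 'execution.runtime-mode' = 'batch';

-- List primary key values that are returned more than once.
SELECT id, COUNT(*) AS cnt
FROM my_table
GROUP BY id
HAVING COUNT(*) > 1;
```

On a primary-key table, this query would be expected to return no rows.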

Anything else?

I want to know what causes this problem. Is it caused by the writer operator? Does a later version fix this issue?

Are you willing to submit a PR?

  • [ ] I'm willing to submit a PR!

herefree avatar Jul 30 '24 03:07 herefree

It may happen when you change the bucket number without overwriting the table first.

eric666666 avatar Jul 30 '24 06:07 eric666666

It may happen when you change the bucket number without overwriting the table first.

The bucket number has not been modified since the table was created.

herefree avatar Jul 30 '24 06:07 herefree

What is your table schema, and did you delete data before? deduplicate.ignore-delete is set to true. @herefree

xuzifu666 avatar Jul 30 '24 07:07 xuzifu666

What is your table schema, and did you delete data before? deduplicate.ignore-delete is set to true. @herefree

{
  "id" : 2,
  "fields" : [
    { "id" : 0, "name" : "", "type" : "STRING NOT NULL", "description" : "" },
    { "id" : 1, "name" : "", "type" : "STRING", "description" : "" },

......

    { "id" : 54, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 55, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 56, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 57, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 58, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 59, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 60, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 61, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 62, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 63, "name" : "", "type" : "STRING", "description" : "" }
  ],
  "highestFieldId" : 63,
  "partitionKeys" : [ ],
  "primaryKeys" : [ "id" ],
  "options" : {
    "bucket" : "8",
    "scan.remove-normalize" : "true",
    "deduplicate.ignore-delete" : "true",
    "changelog-producer" : "none",
    "file.format" : "parquet"
  },
  "comment" : "",
  "timeMillis" : 1722325416673
}

I didn't delete data before, but the changelog of the upstream table may contain -D records. I set deduplicate.ignore-delete to true only because I don't want -D records to be written to this table, and so that Flink jobs consuming this table don't receive -D records.

herefree avatar Jul 30 '24 08:07 herefree

[image] I also found that the duplicate records are in the same bucket.
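One way to look at the data files behind each bucket is Paimon's `$files` system table. This is only a sketch: `my_table` is a placeholder name, and the exact columns returned (for example bucket, level, and min/max key) should be verified against the Paimon version in use.

```sql
-- Inspect the data files of the table, including which bucket each file belongs to.
SELECT * FROM my_table$files;
```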

herefree avatar Jul 30 '24 08:07 herefree

What is your table schema, and did you delete data before? deduplicate.ignore-delete is set to true. @herefree

After I set deduplicate.ignore-delete = false, the duplicate primary keys no longer appeared, but I am not sure whether later versions fix this problem.
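For reference, an option change like this can be applied in Flink SQL roughly as follows. The table name `my_table` is a placeholder. This reflects the workaround observed above, not a confirmed fix.

```sql
-- Turn off ignoring of delete (-D) records for the table.
ALTER TABLE my_table SET ('deduplicate.ignore-delete' = 'false');
```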

herefree avatar Aug 01 '24 03:08 herefree

@herefree Could you give detailed minimal reproduce steps so that we can reproduce this bug?

discivigour avatar Aug 05 '24 08:08 discivigour