paimon [Bug] Query result duplicate primary key

Search before asking

[X] I searched in the issues and found nothing similar.

Paimon version

0.7.0-incubating

Compute Engine

Flink 1.18.0

Minimal reproduce step

What doesn't meet your expectations?

After the job has been running for some time, we query the table in batch mode, and some of the query results contain duplicate primary keys. When we upgrade Paimon to a newer version and query this table, the duplicate primary keys are still there.
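
One way to confirm the symptom is to count primary-key occurrences in an exported batch query result. The following is a minimal sketch, assuming the rows are already loaded as Python dictionaries; the helper name `find_duplicate_keys` and the sample rows are hypothetical and not taken from the affected table.

```python
from collections import Counter


def find_duplicate_keys(rows, key="id"):
    """Return {key_value: count} for key values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return {k: n for k, n in counts.items() if n > 1}


# Hypothetical sample rows standing in for a batch query result.
rows = [
    {"id": "a", "value": "1"},
    {"id": "b", "value": "2"},
    {"id": "a", "value": "3"},
]

duplicates = find_duplicate_keys(rows)
print(duplicates)
```

An empty result means every primary key is unique in the sampled rows.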

Anything else?

I want to know what causes this problem. Is it caused by the writer operator? Does a later version fix this issue?

Are you willing to submit a PR?

[ ] I'm willing to submit a PR!

Jul 30 '24 03:07 herefree

It may happen when you change the bucket number without overwriting the table first.
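
To illustrate this hypothesis: with hash-based bucketing, a key's bucket depends on the bucket count, so rows written under the old count can sit in a different bucket than newer rows with the same key. The sketch below uses CRC32 as a stand-in hash; it is not Paimon's actual hash function, and `bucket_of` and `buckets_that_moved` are hypothetical helpers.

```python
import zlib


def bucket_of(key: str, num_buckets: int) -> int:
    # Stand-in hash (CRC32). Paimon's real bucket hash is different;
    # only the "hash modulo bucket count" idea is illustrated here.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def buckets_that_moved(keys, old_buckets, new_buckets):
    """Return the keys whose bucket changes when the bucket count changes."""
    return [
        k for k in keys
        if bucket_of(k, old_buckets) != bucket_of(k, new_buckets)
    ]


keys = [f"key-{i}" for i in range(1000)]
moved = buckets_that_moved(keys, 8, 16)
print(f"{len(moved)} of {len(keys)} keys map to a different bucket")
```

Under this model, any key in `moved` could exist once in its old bucket and once in its new bucket if the existing data is not rewritten.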

Jul 30 '24 06:07 eric666666

It may happen when you change the bucket number without overwriting the table first.

The bucket number has not been modified since the table was created.

Jul 30 '24 06:07 herefree

What is your table schema, and did you delete any data before? Is deduplicate.ignore-delete set to true? @herefree

Jul 30 '24 07:07 xuzifu666

What is your table schema, and did you delete any data before? Is deduplicate.ignore-delete set to true? @herefree

{
  "id" : 2,
  "fields" : [
    { "id" : 0, "name" : "", "type" : "STRING NOT NULL", "description" : "" },
    { "id" : 1, "name" : "", "type" : "STRING", "description" : "" },
    ......
    { "id" : 54, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 55, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 56, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 57, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 58, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 59, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 60, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 61, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 62, "name" : "", "type" : "STRING", "description" : "" },
    { "id" : 63, "name" : "", "type" : "STRING", "description" : "" }
  ],
  "highestFieldId" : 63,
  "partitionKeys" : [ ],
  "primaryKeys" : [ "id" ],
  "options" : {
    "bucket" : "8",
    "scan.remove-normalize" : "true",
    "deduplicate.ignore-delete" : "true",
    "changelog-producer" : "none",
    "file.format" : "parquet"
  },
  "comment" : "",
  "timeMillis" : 1722325416673
}

I didn't delete data before, but the changelog of the upstream table may contain -D records. I set deduplicate.ignore-delete to true only because I don't want -D records to be written to this table, and so that Flink jobs consuming this table don't receive -D records.

Jul 30 '24 08:07 herefree

I also found that the duplicate rows are in the same bucket.

Jul 30 '24 08:07 herefree

What is your table schema, and did you delete any data before? Is deduplicate.ignore-delete set to true? @herefree

After I set deduplicate.ignore-delete = false, the duplicate primary keys no longer appeared, but I am not sure whether later versions fix this problem.

Aug 01 '24 03:08 herefree

@herefree Could you give detailed minimal reproduction steps so that we can reproduce this bug?

Aug 05 '24 08:08 discivigour