
Question about AddFile dataChange flag.

Open horizonzy opened this issue 2 years ago • 2 comments

I have a question about the dataChange flag in AddFile. In the txn log the flag is true, but in the program its value is false. Is this a bug?

In the txn log:

~: cat 00000000000000000000.json
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"91a0d9a7-952a-42a8-abdf-73cbf00b1849","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1662000777202}}
{"add":{"path":"part-00000-f7f493e9-7155-4171-91ac-7e046e31c269-c000.snappy.parquet","partitionValues":{},"size":296,"modificationTime":1662000778449,"dataChange":true,"stats":"{\"numRecords\":0,\"minValues\":{},\"maxValues\":{},\"nullCount\":{}}"}}
{"add":{"path":"part-00001-6d4b2c01-6d87-4aa3-a9a5-bb315cd665e8-c000.snappy.parquet","partitionValues":{},"size":478,"modificationTime":1662000778615,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"id\":0},\"maxValues\":{\"id\":0},\"nullCount\":{\"id\":0}}"}}
{"add":{"path":"part-00003-2bfd96d1-aaf5-4cc5-930b-b59d12d17ea9-c000.snappy.parquet","partitionValues":{},"size":478,"modificationTime":1662000778615,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"id\":1},\"maxValues\":{\"id\":1},\"nullCount\":{\"id\":0}}"}}
{"add":{"path":"part-00005-873e9693-7c39-4419-87de-0c27d9f64b37-c000.snappy.parquet","partitionValues":{},"size":478,"modificationTime":1662000778615,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"id\":2},\"maxValues\":{\"id\":2},\"nullCount\":{\"id\":0}}"}}
{"add":{"path":"part-00007-8154cb0e-84b9-4009-9eb2-17ed532a8c82-c000.snappy.parquet","partitionValues":{},"size":478,"modificationTime":1662000778615,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"id\":3},\"maxValues\":{\"id\":3},\"nullCount\":{\"id\":0}}"}}
{"add":{"path":"part-00009-34c9adf2-b652-4450-8b26-dee491ae1ab5-c000.snappy.parquet","partitionValues":{},"size":478,"modificationTime":1662000778615,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"id\":4},\"maxValues\":{\"id\":4},\"nullCount\":{\"id\":0}}"}}
{"commitInfo":{"timestamp":1662000778641,"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists","partitionBy":"[]"},"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{"numFiles":"6","numOutputRows":"5","numOutputBytes":"2686"},"engineInfo":"Apache-Spark/3.2.1 Delta-Lake/1.2.0","txnId":"eb52cd3e-4e5b-4d1f-8c91-27197f59f74f"}}

In the program: image

I noticed that the code forcibly sets dataChange to false. Is this meant for some particular case? image (1)

horizonzy avatar Sep 01 '22 10:09 horizonzy

I don't think it's a bug. Since you're referencing the Standalone library, can you please open this issue in the connectors repo?

FWIW the same behavior is here too so we can leave this open and resolve this when there's a solid answer.

nkarpov avatar Sep 01 '22 17:09 nkarpov

Hi @horizonzy, the dataChange flag is only meaningful when looking at the actions added in a specific version (that is, the actions within a single commit), not when looking at all the AddFiles in a snapshot. AFAIK, here we just set dataChange=false to canonicalize the actions.
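
To make the distinction concrete, here is a minimal Python sketch. It is not the Standalone implementation: the file names, the abridged actions, and the helpers `adds_in_commit` and `adds_in_snapshot` are all hypothetical, and the only assumption taken from this thread is that the snapshot view resets dataChange to false.

```python
import json

# Hypothetical, heavily abridged commit files; real AddFile actions carry more fields.
commit_0 = [
    '{"add":{"path":"a.parquet","dataChange":true}}',
    '{"add":{"path":"b.parquet","dataChange":true}}',
]
# A hypothetical later commit that rewrites the same rows into a single file.
commit_1 = [
    '{"remove":{"path":"a.parquet","dataChange":false}}',
    '{"remove":{"path":"b.parquet","dataChange":false}}',
    '{"add":{"path":"c.parquet","dataChange":false}}',
]


def adds_in_commit(lines):
    """Return the AddFile actions of one commit exactly as they were written."""
    actions = [json.loads(line) for line in lines]
    return [a["add"] for a in actions if "add" in a]


def adds_in_snapshot(commits):
    """Replay commits in order and return the live AddFiles.

    Assumption from the thread: the snapshot view canonicalizes each
    AddFile by setting dataChange to False, because the flag describes
    a single commit rather than the table state.
    """
    live = {}
    for lines in commits:
        for line in lines:
            action = json.loads(line)
            if "add" in action:
                add = dict(action["add"], dataChange=False)
                live[add["path"]] = add
            elif "remove" in action:
                live.pop(action["remove"]["path"], None)
    return list(live.values())


# Per-commit view: the flag is whatever the writer recorded.
print([a["dataChange"] for a in adds_in_commit(commit_0)])
# Snapshot view: same files, flag canonicalized.
print(adds_in_snapshot([commit_0]))
print(adds_in_snapshot([commit_0, commit_1]))
```

Reading the flag from the per-commit view answers "did this version change the table's data?", while the snapshot view only answers "which files are live?", which is why the flag carries no information there.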

allisonport-db avatar Sep 01 '22 20:09 allisonport-db