
[Bug]: GC deleting files but not their references in metadata

Open justas200 opened this issue 1 year ago • 13 comments

What happened

Our team is running Iceberg on S3 with Nessie on Kubernetes. We recently turned on GC and noticed that we can no longer query the full history range of some tables using Spark; we get an error that a file cannot be found in S3:

pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 1621 in stage 11252.0 failed 4 times, most recent failure: Lost task 1621.3 in stage 11252.0 (TID 3116305) (10.32.186.235 executor 4282): org.apache.iceberg.exceptions.NotFoundException: File does not exist: <file_on_s3.parquet>

How to reproduce it

These are the configurations for the GC:

# Marking step

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nessie-gc-mark-step
  namespace: {{ .Release.Namespace }}
spec:
  schedule: "0 0 * * *" # Runs daily at midnight GMT
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: nessie-gc
              image: ghcr.io/projectnessie/nessie-gc:latest
              args:
                - mark
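                # Cutoff policy (ISO-8601 duration): commits older than 48 hours
                # are considered non-live and eligible for expiry.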
                - --default-cutoff=PT48H
                - --uri
                - http://nessie:19120/api/v2
                - --jdbc
                - --jdbc-url
                - "$(JDBC_URL)"
                - --jdbc-user
                - "$(JDBC_USER)"
                - --jdbc-password
                - "$(JDBC_PASSWORD)"
                - --write-live-set-id-to
                - /nessie/live_set_id
              envFrom:
                - secretRef:
                    name: {{ .Values.nessie.jdbc.secret.name }}-gc
              volumeMounts:
                - name: nessie-gc-data-volume
                  mountPath: /nessie
          restartPolicy: Never
          volumes:
            - name: nessie-gc-data-volume
              persistentVolumeClaim:
                claimName: nessie-gc-pvc

---

# Sweep step

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nessie-gc-sweep-step
  namespace: {{ .Release.Namespace }}
spec:
  schedule: "30 0 * * *" # Runs daily at 30 past midnight GMT
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: nessie-sa
          containers:
            - name: nessie-gc
              image: ghcr.io/projectnessie/nessie-gc:latest
              args:
                - sweep
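                # Only record which files to delete; the deferred-deletes job below
                # performs the actual deletion.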
                - --defer-deletes
                - --jdbc
                - --jdbc-url
                - "$(JDBC_URL)"
                - --jdbc-user
                - "$(JDBC_USER)"
                - --jdbc-password
                - "$(JDBC_PASSWORD)"
                - --read-live-set-id-from
                - /nessie/live_set_id
              envFrom:
                - secretRef:
                    name: "{{ .Values.nessie.jdbc.secret.name }}-gc"
              volumeMounts:
                - name: nessie-gc-data-volume
                  mountPath: /nessie
          restartPolicy: Never
          volumes:
            - name: nessie-gc-data-volume
              persistentVolumeClaim:
                claimName: nessie-gc-pvc

---

# Delete step

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nessie-gc-delete-step
  namespace: {{ .Release.Namespace }}
spec:
  schedule: "0 3 * * *" # Runs daily at 3 AM GMT
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: nessie-sa
          containers:
            - name: nessie-gc
              image: ghcr.io/projectnessie/nessie-gc:latest
              args:
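                # Delete the files recorded by the sweep step's --defer-deletes run.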
                - deferred-deletes
                - --jdbc
                - --jdbc-url
                - "$(JDBC_URL)"
                - --jdbc-user
                - "$(JDBC_USER)"
                - --jdbc-password
                - "$(JDBC_PASSWORD)"
                - --read-live-set-id-from
                - /nessie/live_set_id
              envFrom:
                - secretRef:
                    name: "{{ .Values.nessie.jdbc.secret.name }}-gc"
              volumeMounts:
                - name: nessie-gc-data-volume
                  mountPath: /nessie
          restartPolicy: Never
          volumes:
            - name: nessie-gc-data-volume
              persistentVolumeClaim:
                claimName: nessie-gc-pvc

---

# DB initialization
apiVersion: batch/v1
kind: Job
metadata:
  name: nessie-gc-db-init-step
  namespace: {{ .Release.Namespace }}
spec:
  template:
    spec:
      containers:
        - name: nessie-gc
          image: ghcr.io/projectnessie/nessie-gc:latest
          args:
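            # One-time setup: create the nessie-gc tables in the JDBC database
            # if they do not already exist.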
            - create-sql-schema
            - --jdbc
            - --jdbc-url
            - "$(JDBC_URL)"
            - --jdbc-user
            - "$(JDBC_USER)"
            - --jdbc-password
            - "$(JDBC_PASSWORD)"
            - --jdbc-schema
            - "CREATE_IF_NOT_EXISTS"
          envFrom:
            - secretRef:
                name: "{{ .Values.nessie.jdbc.secret.name }}-gc"
      restartPolicy: Never

Nessie server type (docker/uber-jar/built from source) and version

docker

Client type (Ex: UI/Spark/pynessie ...) and version

Spark

Additional information

No response

justas200 · Dec 05 '24

IMHO, GC works as designed. https://projectnessie.org/nessie-0-100-3/gc/

The docs may not be crystal clear, but they do say that Nessie GC is a tool to clean up orphaned files in a Nessie repository. The key here is identifying the orphan files. The GC CLI delegates this task to the user via the "cutoff" arguments/policies: whatever falls beyond the cutoff line is assumed to be no longer needed by the user, so GC deletes those files from storage.

Metadata is not modified, in order to preserve the integrity of Nessie's commit history (commits are immutable).

If the Spark job needs access to a range of historical Iceberg snapshots, please set cutoff policies so that the required data is considered "live" by Nessie GC.
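
For example, a minimal sketch of the mark step from the manifests above with a longer default cutoff (the 30-day value is purely illustrative; pick a window that covers every snapshot your Spark jobs still need to read):

# Mark step args with a longer default cutoff (sketch only)
args:
  - mark
  - --default-cutoff=P30D # ISO-8601 duration, example value
  - --uri
  - http://nessie:19120/api/v2
  # ... remaining --jdbc / --write-live-set-id-to args unchanged

Cutoff policies can also be defined per reference; see nessie-gc mark --help for the exact option syntax.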

dimas-b · Dec 05 '24

If your Spark job must process all snapshots in the Iceberg metadata file, you may want to add another job that removes some (old) snapshots before running GC.

dimas-b · Dec 05 '24

Thank you for the response. Just to clarify, I am running a SELECT * FROM <table> statement, which, I believe, defaults to the latest commit/snapshot on the main branch.

Since my marking policy is configured with a cutoff of at least 48 hours, and GC always leaves at least one snapshot intact for the latest commit, I believe data loss is happening at some stage.

I have been collecting data in this catalog for over a quarter with no issues; however, starting the very day GC was launched last week, I began getting errors on the tables with the highest throughput.

I am running GC once a day, in the recommended production setup, and the number of files missing from each table's metadata seems to correlate with the total number of files in the table.

Specific numbers:

  • Table1 missing 8/2544
  • Table2 missing 160/719285
  • Table3 missing 70/694985

Only two workflows interact with these tables, so I am quite confident the files were not deleted manually by accident:

  • Kafka Connector in append-only mode (runs Java Iceberg library to append data)
  • Spark Connect to query data

We have three more, smaller tables (more than 100x smaller in total storage size) and have no issues with them.

Please advise if some of my understanding is wrong about how GC works, or if I may have missed something.

justas200 · Dec 05 '24

@justas200 : It is pretty hard to say whether it's a bug or not based on the provided data, unfortunately.

Are you able to reproduce the problem in a small fresh test env?

As to the original problem, are you able to identify the full URI of the missing file that Spark wants to access? Is it referenced from "live" snapshots?

dimas-b · Dec 05 '24

I will try reproducing it in a small test env later today or early next week. I am able to identify the full URI of the missing file; it is referenced from the live (current) snapshot.

justas200 · Dec 06 '24

@justas200 Hi! Do you have any news here? At first glance this issue looks quite serious.

Albel-Analyst · Dec 25 '24

Could it be that Nessie GC ran while files were still being written (not yet committed) and marked them as orphaned?

Albel-Analyst · Dec 25 '24

@dimas-b

I suspect there may be a timing/concurrency problem in how Nessie's GC handles data files. Consider this scenario:

  1. A job is in the process of writing data files.
  2. At the same time, Nessie GC runs and marks these (still-being-written) files as orphaned.
  3. The job finishes writing the data files and successfully commits them, so from Iceberg's perspective everything looks correct.
  4. However, Nessie's database has already flagged these files as orphaned, so when the deletion phase runs, it could remove files that are actually valid and were just committed.

This is just a hypothesis, but it highlights a potential race condition: files are labeled orphaned before they are fully committed. To address this, it seems that Nessie GC should not only look at the commit time for pruning older snapshots, but also consider the creation or last-modified time of the “orphaned” files. Otherwise, there’s a risk of deleting files that were recently created and are still in use. Or am I missing something?

Albel-Analyst · Dec 26 '24

I noticed there is a --max-file-modification=<maxFileModificationTime> flag for the expire operation in Nessie GC.

Could this option resolve the concurrency issue described above, where freshly written (but not yet committed) files end up marked as orphaned?

Albel-Analyst · Dec 26 '24

Yes, --max-file-modification is meant to avoid situations where new files are created while GC is running.

This value applies to storage objects' "creation" timestamps as reported by the storage backend (e.g. S3).
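
As a rough, untested sketch (assuming the option is accepted by the sweep command as well, and that the value is an ISO-8601 timestamp), the sweep step above could pass it explicitly instead of relying on the default:

# Sweep step args with an explicit file-modification bound (sketch only)
# MAX_FILE_MODIFICATION is a hypothetical env var, not part of the manifests above;
# it would have to be set (e.g. by a wrapper script) to a timestamp older than any
# write job that might still be committing.
args:
  - sweep
  - --defer-deletes
  - --max-file-modification
  - "$(MAX_FILE_MODIFICATION)"
  # ... remaining --jdbc / --read-live-set-id-from args unchanged

Since the default is already the creation timestamp of the live content set, an explicit value would mainly help when writes can be in flight for a long time before they are committed.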

dimas-b · Dec 27 '24

@dimas-b So could this be the root cause of the original issue, if @justas200 is using the default?

The docs say: "Defaults to the created timestamp of the live-content-set."

Could the timezone in the Nessie GC environment differ from the timezone Nessie GC receives from the S3 objects?

Albel-Analyst · Dec 28 '24

Timezones (or DST) should not matter. Timestamps are manipulated as points on the universal time scale.

dimas-b · Dec 30 '24

In case anyone comes across this issue: what I did was create a table with the same name and schema in the dev environment (another schema should also work) and insert a single row with all null/placeholder values into it (important: they must not match anything in the table you're fixing).

Then, each time the query failed, I copied that parquet file from the dev environment to the path the error was complaining about (e.g. s3a://iceberg/logs/events_b7bb2c4b-c131-4745-b5cd-a7a71d305846/data/0000/0101/0011/10100000/ts_day=2025-05-16/kafka_partition=4/00001-1747379360621-4f80847e-ee2d-47b5-afc6-cd220fd1d36a-00002.parquet) until the query stopped failing. Finally, I ran a delete on the null/placeholder values I had used.

This allowed me to get past the query failures, albeit with data loss. Avoiding data loss would have been slightly more complex but possible: in my situation I used Kafka Connect for the problem table, so I would have launched a new Kafka Connect consumer group for the topic pointing at a temporary table, then used that temporary table to overwrite the partitions where data was lost. That would also have been necessary had I lost metadata files instead of data files.

agrubb86 · May 19 '25