Failed to find SST files during compaction

Open v0y4g3r opened this issue 1 year ago • 4 comments

What type of bug is this?

Data corruption

What subsystems are affected?

Datanode

Description

When using S3 as storage, the datanode complains that it failed to compact a region because it cannot find some of the input files.

Failed to compact region: 5905580032000(1375, 0) err=0: OpenDAL operator failed, at greptimedb/src/mito2/src/sst/parquet/reader.rs:126:14
1: NotFound (persistent) at  => File not found: data/[...]/public/1375/1375_0000000000/62a9180f-ff63-4137-9d5e-6eae7ae5e178.parquet

The missing file was created by an earlier compaction, which may still have been affected by this EntityTooSmall issue. In the normal control flow, however, the SST file is uploaded to S3 before the manifest is updated, so a successfully updated manifest implies the SST file was already uploaded. The observations that do not fit together are:

  • GreptimeDB believes the SST file was uploaded, and there are no error logs about this SST file.
  • The manifest file has been updated.
  • There is no sign that this SST file was ever deleted.
  • No version of this SST file is listed in the S3 console.

In summary, the missing SST file was probably never uploaded successfully, yet for some reason the datanode believed the upload had finished and updated the manifest.

TODO

  • [ ] Revert #2745 once we repair all affected manifests

v0y4g3r avatar Nov 14 '23 11:11 v0y4g3r

We can list all incomplete multipart uploads via awscli, and this missing file is marked as "incomplete":

{
  "UploadId": "[...]",
  "Key": "cluster-prod1-2/data/[...]/public/1205/1205_0000000000/0ad457b9-9f21-4593-9d1f-dd1b968e1813.parquet",
  "Initiated": "2023-11-08T12:47:34+00:00",
  "StorageClass": "STANDARD",
  "Owner": {
    "DisplayName": "[...]",
    "ID": "[...]"
  },
  "Initiator": {
    "ID": "[...]",
    "DisplayName": "[...]"
  }
}
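
A minimal sketch of that check, assuming the output of `aws s3api list-multipart-uploads --bucket <bucket>` has been saved as JSON (the CLI puts the entries under an `Uploads` key). The helper name `find_incomplete_uploads` and the sample data are hypothetical, not taken from the real cluster:

```python
import json


def find_incomplete_uploads(listing_json: str, file_id: str) -> list:
    """Return the incomplete multipart uploads whose key ends with `<file_id>.parquet`."""
    listing = json.loads(listing_json)
    suffix = f"{file_id}.parquet"
    # Use .get() so a listing without an "Uploads" key yields an empty result.
    return [u for u in listing.get("Uploads", []) if u["Key"].endswith(suffix)]


# Illustrative sample shaped like the CLI output; all values are made up.
sample = json.dumps({
    "Uploads": [
        {"UploadId": "example-upload-id",
         "Key": "data/public/1/1_0000000000/aaaa.parquet",
         "Initiated": "2023-11-08T12:47:34+00:00"},
        {"UploadId": "other-upload-id",
         "Key": "data/public/1/1_0000000000/bbbb.parquet",
         "Initiated": "2023-11-08T12:50:00+00:00"},
    ]
})

matches = find_incomplete_uploads(sample, "aaaa")
print([m["UploadId"] for m in matches])
```

Any SST file ID from a manifest that turns up in this listing was started as a multipart upload but never completed, so it is not visible as an object.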

v0y4g3r avatar Nov 14 '23 12:11 v0y4g3r

Added a flag to ignore those files. https://github.com/GreptimeTeam/greptimedb/pull/2745

We can enable this to skip incorrect manifests and remove it after all manifests are fixed.

evenyag avatar Nov 14 '23 14:11 evenyag

As per AWS ticket response:

Using an internal tool, I can see that you initiated a multipart upload at Nov 08 12:47:33 and uploaded about 81 parts, each about 4194304 bytes (the last one 29508).

The total number of bytes uploaded to S3 was 4194304*80+29508=335,573,828, which matches the file size in the manifest:

        "files_to_add": [
          {
            "region_id": 5175435591680,
            "file_id": "0ad457b9-9f21-4593-9d1f-dd1b968e1813",
            "time_range": [
              {
                "value": 1698049810950,
                "unit": "Millisecond"
              },
              {
                "value": 1699447600950,
                "unit": "Millisecond"
              }
            ],
            "level": 1,
            "file_size": 335573828
          }
        ],
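
The arithmetic can be double-checked with a few lines of Python. The constants are copied from the AWS response and the manifest entry above; the variable names are only illustrative:

```python
PART_SIZE = 4194304             # bytes per full part (4 MiB), from the AWS response
FULL_PARTS = 80                 # 81 parts in total, the last one is smaller
LAST_PART_SIZE = 29508          # bytes in the final part
MANIFEST_FILE_SIZE = 335573828  # "file_size" from the manifest entry

uploaded_bytes = PART_SIZE * FULL_PARTS + LAST_PART_SIZE
print(uploaded_bytes, uploaded_bytes == MANIFEST_FILE_SIZE)  # 335573828 True
```

So every part of the file reached S3; only the final step that turns the parts into a visible object is missing.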

We now suspect the following causes:

  • The datanode never invoked "complete multipart upload", which is wrapped inside the shutdown method of OpenDAL's writer.
  • The datanode invoked "complete multipart upload" and the call failed, but the datanode treated it as a success and proceeded to update the manifest.
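
Either cause breaks the same invariant: the manifest must only reference objects that are actually visible in the store. Below is a rough sketch of a defensive check that would catch both cases. It is not GreptimeDB's or OpenDAL's actual code; `InMemoryStore` and `commit_sst` are hypothetical stand-ins used only to illustrate the idea:

```python
class InMemoryStore:
    """Hypothetical stand-in for an object store; not a real S3 or OpenDAL API."""

    def __init__(self):
        self.objects = {}  # key -> bytes, objects visible to readers
        self.pending = {}  # key -> parts of unfinished multipart uploads

    def upload_part(self, key, data):
        self.pending.setdefault(key, []).append(data)

    def complete_upload(self, key):
        # Only after completion do the parts become a visible object.
        self.objects[key] = b"".join(self.pending.pop(key))

    def size(self, key):
        """Return the object's size, or None if it is not visible."""
        obj = self.objects.get(key)
        return None if obj is None else len(obj)


def commit_sst(store, manifest, key, expected_size):
    """Add `key` to the manifest only if the store reports it with the expected size."""
    actual = store.size(key)
    if actual != expected_size:
        raise RuntimeError(
            f"refusing to update manifest: {key} size={actual}, expected {expected_size}"
        )
    manifest.append(key)


store = InMemoryStore()
manifest = []
store.upload_part("a.parquet", b"xxxx")  # parts uploaded, upload never completed

rejected = False
try:
    commit_sst(store, manifest, "a.parquet", 4)
except RuntimeError:
    rejected = True  # the incomplete upload is caught before the manifest changes

store.complete_upload("a.parquet")
commit_sst(store, manifest, "a.parquet", 4)
print(rejected, manifest)
```

The check costs one extra metadata request per SST file, which is the trade-off for never recording a file that readers cannot open.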

v0y4g3r avatar Nov 15 '23 03:11 v0y4g3r

Another file-not-found error: https://github.com/GreptimeTeam/greptimedb/issues/3633. That one, however, was caused by a bug.

evenyag avatar Apr 03 '24 08:04 evenyag

Looks like this has not happened again.

killme2008 avatar Jun 25 '24 21:06 killme2008