Failed to find SST files during compaction
What type of bug is this?
Data corruption
What subsystems are affected?
Datanode
Description
When using S3 as storage, datanodes complain that a region failed to compact because some input files cannot be found:
Failed to compact region: 5905580032000(1375, 0) err=0: OpenDAL operator failed, at greptimedb/src/mito2/src/sst/parquet/reader.rs:126:14
1: NotFound (persistent) at => File not found: data/[...]/public/1375/1375_0000000000/62a9180f-ff63-4137-9d5e-6eae7ae5e178.parquet
The missing file was created by another compaction, which may still be affected by this EntityTooSmall issue. But in the normal control flow, if the manifest file has been updated successfully, the SST file must already have been uploaded to S3 before the manifest update. So the problem is:
- GreptimeDB believes the SST file was uploaded, and there are no error logs about this SST file.
- The manifest file has been updated.
- There is no sign that this SST file was ever deleted.
- No version of this SST file is listed on the S3 console.
In summary, the missing SST file may never have been uploaded successfully at all, yet for some reason the datanode mistakenly considered the upload complete and updated the manifest.
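To make that invariant concrete, here is a minimal sketch of the expected ordering, assuming hypothetical helpers `upload_sst` and `update_manifest` (not the actual mito2 code): the manifest update is only reachable after the upload returns `Ok`, so a manifest entry pointing at a file that was never uploaded should be impossible.

```rust
use anyhow::Result;

// Stand-in for the real SST upload (a multipart upload to S3 via opendal).
async fn upload_sst(_path: &str, _bytes: Vec<u8>) -> Result<()> {
    Ok(())
}

// Stand-in for the real manifest update.
async fn update_manifest(_path: &str) -> Result<()> {
    Ok(())
}

// Expected ordering: the manifest may only reference the new SST after the
// upload (including CompleteMultipartUpload on S3) has fully succeeded.
async fn persist_compaction_output(path: &str, bytes: Vec<u8>) -> Result<()> {
    upload_sst(path, bytes).await?; // must not return Ok unless the object exists
    update_manifest(path).await?;   // only reached if the upload reported success
    Ok(())
}
```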
TODO
- [ ] Revert #2745 once we repair all affected manifests
We can list all incomplete multipart uploads via awscli (`aws s3api list-multipart-uploads`), and this missing file is marked as "incomplete":
{
  "UploadId": "[...]",
  "Key": "cluster-prod1-2/data/[...]/public/1205/1205_0000000000/0ad457b9-9f21-4593-9d1f-dd1b968e1813.parquet",
  "Initiated": "2023-11-08T12:47:34+00:00",
  "StorageClass": "STANDARD",
  "Owner": {
    "DisplayName": "[...]",
    "ID": "[...]"
  },
  "Initiator": {
    "ID": "[...]",
    "DisplayName": "[...]"
  }
}
A flag to ignore those files was added in https://github.com/GreptimeTeam/greptimedb/pull/2745.
We can enable it to skip incorrect manifests and remove it after all manifests are fixed.
As per the AWS support ticket response:
Through internal tools, I can see that you initiated a multipart upload at Nov 08 12:47:33 and uploaded about 81 parts, each about 4194304 bytes (the last one 29508 bytes).
The total number of bytes uploaded to S3 was 4194304 * 80 + 29508 = 335,573,828, which matches the file size recorded in the manifest:
"files_to_add": [
{
"region_id": 5175435591680,
"file_id": "0ad457b9-9f21-4593-9d1f-dd1b968e1813",
"time_range": [
{
"value": 1698049810950,
"unit": "Millisecond"
},
{
"value": 1699447600950,
"unit": "Millisecond"
}
],
"level": 1,
"file_size": 335573828
}
],
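As a quick sanity check (trivial arithmetic, nothing assumed beyond the numbers above), 80 full parts plus the final part sum exactly to the recorded `file_size`:

```rust
fn main() {
    // 80 full parts of 4,194,304 bytes plus a final part of 29,508 bytes.
    let total: u64 = 80 * 4_194_304 + 29_508;
    assert_eq!(total, 335_573_828); // matches `file_size` in the manifest entry above
}
```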
We now suspect one of the following causes (a sketch of the write path follows this list):
- The datanode never invoked "complete multipart upload", which is wrapped inside the opendal writer's shutdown method.
- The datanode did invoke "complete multipart upload", but the call actually failed while the datanode falsely considered it a success and proceeded to update the manifest.
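For context, here is a minimal sketch of how an opendal-style write path maps onto S3 multipart upload, assuming opendal's public `Writer` API (`writer` / `write` / `close`); the exact code path inside mito2 may differ. The point relevant to both hypotheses is that the object only becomes visible on S3 after the completing call succeeds, so skipping it or swallowing its error leaves every part "incomplete" while the datanode proceeds to the manifest update.

```rust
use opendal::Operator;

// Sketch only: assumes opendal's public Writer API, not the exact mito2 code.
async fn upload_sst(op: &Operator, path: &str, data: Vec<u8>) -> opendal::Result<()> {
    let mut writer = op.writer(path).await?;

    // On S3 backends, buffered data is sent as UploadPart requests.
    writer.write(data).await?;

    // Closing the writer is what finalizes the object (CompleteMultipartUpload).
    // If this call is never made (first hypothesis) or its failure is treated as
    // success (second hypothesis), no object ever appears at `path`, yet the
    // caller may go on to update the manifest.
    writer.close().await?;
    Ok(())
}
```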
Another file-not-found error was reported in https://github.com/GreptimeTeam/greptimedb/issues/3633, but it was caused by a bug.
This issue does not seem to have happened again.