
bug: encounter "writer has not been closed or aborted, must be a bug"

Open hzxa21 opened this issue 1 year ago • 6 comments

Describe the bug

Version: 0.47.2. With the following writer configuration, we encounter "writer has not been closed or aborted, must be a bug":

    // Time out each I/O operation after the configured per-attempt timeout.
    let writer = op
        .clone()
        .layer(TimeoutLayer::new().with_io_timeout(Duration::from_millis(
            config.retry.streaming_upload_attempt_timeout_ms,
        )))
        // Retry failed attempts with exponential backoff and jitter, up to 3 times.
        .layer(
            RetryLayer::new()
                .with_min_delay(Duration::from_millis(1000))
                .with_max_delay(Duration::from_millis(10000))
                .with_max_times(3)
                .with_factor(2.0)
                .with_jitter(),
        )
        .writer_with(&path)
        .concurrent(8)
        .executor(Executor::with(monitored_execute))
        .await?;

It seems that this happens when the opendal retry is triggered on writer.close().
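
A minimal sketch of how the writer is driven afterwards (the chunk loop here is illustrative, not the reporter's actual code); the failure is raised by the final close(), i.e. after the individual write() calls have already succeeded:

    // Upload data in chunks; each write() may start a part upload in the background.
    for chunk in chunks {
        writer.write(chunk).await?;
    }
    // close() waits for all in-flight parts and completes the multipart upload.
    // The TimeoutLayer/RetryLayer wrap this call too, so a timed-out close()
    // gets retried; the retried close() is what produces the errors in the logs below.
    writer.close().await?;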

Steps to Reproduce

Expected Behavior

Additional Context

Logs:

2024-09-04T07:42:02.770667396Z WARN opendal::layers::retry: will retry after 1.604547739s because: Unexpected (temporary) at Writer::close, context: { timeout: 10 } => io operation timeout reached

2024-09-04T07:42:04.377279911Z WARN opendal::services: service=s3 operation=Writer::close path=xxx -> data close failed: NotFound (permanent) at Writer::close, context: { uri: ..., response: Parts { status: 404, version: HTTP/1.1, headers: {"accept-ranges": "bytes", "cache-control": "no-cache", "content-length": "467", "content-security-policy": "block-all-mixed-content", "content-type": "application/xml", "server": "MinIO", "strict-transport-security": "max-age=31536000; includeSubDomains", "vary": "Origin", "vary": "Accept-Encoding", "x-accel-buffering": "no", "x-amz-id-2": "..."} }, service: s3, path: xxx, written: 138426184 } => S3Error { code: "NoSuchUpload", message: "The specified multipart upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.", resource: "xxx", request_id: "xxx" }

2024-09-04T07:42:04.377323167Z WARN opendal::layers::complete: writer has not been closed or aborted, must be a bug

2024-09-04T07:42:04.37733207Z ERROR risingwave_object_store::object: streaming_upload_finish failed error=NotFound (persistent) at Writer::close, context: { uri:..., response: Parts { status: 404, version: HTTP/1.1, headers: {"accept-ranges": "bytes", "cache-control": "no-cache", "content-length": "467", "content-security-policy": "block-all-mixed-content", "content-type": "application/xml", "server": "MinIO", "strict-transport-security": "max-age=31536000; includeSubDomains", "vary": "Origin", "vary": "Accept-Encoding", "x-accel-buffering": "no", "x-amz-id-2": "..."} }, service: s3, path: ... } => S3Error { code: "NoSuchUpload", message: "The specified multipart upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.", resource: "...", request_id: "..." }

Are you willing to submit a PR to fix this bug?

  • [ ] Yes, I would like to submit a PR.

hzxa21 avatar Sep 06 '24 06:09 hzxa21

Also reproduced in version 0.49.

wcy-fdu avatar Sep 06 '24 08:09 wcy-fdu

Hi, could you check if this file has been created successfully? I'm wondering if this is the case:

  • A CompleteMultipartUpload request is sent, but it returns a timeout.
  • We cancel this request and send a new one.
  • But the old request has been processed successfully, leading to the new request returning an error.

Do you think this is possible? Can you reproduce this over AWS S3, or does it only happen on MinIO?

Xuanwo avatar Sep 07 '24 11:09 Xuanwo

Can you reproduce this over AWS S3, or does it only happen on MinIO?

This error occurs in our CI, which is based on MinIO. The issue has not yet been reproduced on AWS S3, and it is a bit hard to check whether the file was created successfully on MinIO.

wcy-fdu avatar Sep 09 '24 04:09 wcy-fdu

This error occurs in our CI, which is based on MinIO. The issue has not yet been reproduced on AWS S3, and it is a bit hard to check whether the file was created successfully on MinIO.

Hi, can you run op.stat() on this file when NotFound occurs during w.close(), and ignore the error if the file was actually created? If the error disappears, I believe that's the issue.
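
A rough sketch of that check (the error handling and surrounding names are illustrative; only close(), stat(), and ErrorKind::NotFound are actual opendal API):

    match writer.close().await {
        Ok(_) => {}
        // On NotFound, the earlier (timed-out) CompleteMultipartUpload may in fact
        // have succeeded; check whether the object exists before treating this as
        // a real failure.
        Err(e) if e.kind() == opendal::ErrorKind::NotFound => {
            op.stat(&path).await?; // succeeds => the upload actually completed
        }
        Err(e) => return Err(e.into()),
    }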

Xuanwo avatar Sep 09 '24 04:09 Xuanwo

This error occurs in our CI, which is based on MinIO. The issue has not yet been reproduced on AWS S3, and it is a bit hard to check whether the file was created successfully on MinIO.

Hi, can you run op.stat() on this file when NotFound occurs during w.close(), and ignore the error if the file was actually created? If the error disappears, I believe that's the issue.

The file was not written to MinIO.

wcy-fdu avatar Sep 09 '24 08:09 wcy-fdu

Update: OpenDAL and the AWS SDK handle multipart uploads differently.

  • The AWS SDK uploads all parts fully in parallel,
  • while OpenDAL limits uploads to the configured concurrency via an internal task queue. During the final close, it keeps polling this queue until all tasks have completed.

As a result, in extreme cases, closing the writer in OpenDAL may have to wait through two rounds of I/O: the first waits for task.execute, and the second waits for all tasks in the batch to finish. Currently, the timeout is not applied per request; it also covers the time spent waiting on queued tasks. So when the number of tasks exceeds the concurrency, the timeout is more likely to fire than with the SDK.
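
A simplified, self-contained illustration of that difference (this is not opendal's actual code; the sleeps just stand in for part uploads). Giving each request its own budget behaves very differently from putting one budget around the whole drain-and-close:

    use std::time::Duration;
    use tokio::time::{sleep, timeout};

    // Stand-in for a single part upload.
    async fn upload_part(ms: u64) {
        sleep(Duration::from_millis(ms)).await;
    }

    #[tokio::main]
    async fn main() {
        let budget = Duration::from_millis(300);

        // Per-request budget: three 200ms parts each fit within 300ms.
        for _ in 0..3 {
            timeout(budget, upload_part(200))
                .await
                .expect("no per-part timeout");
        }

        // One budget around the whole drain-and-close: the same three parts now
        // share 300ms of waiting time, so the overall wait (~600ms) exceeds the
        // budget even though no single part was slow.
        let drain_all = async {
            for _ in 0..3 {
                upload_part(200).await;
            }
        };
        assert!(timeout(budget, drain_all).await.is_err());
    }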

Additionally, there may be a MinIO issue: the CompleteMultipartUpload operation is not necessarily idempotent.

wcy-fdu avatar Sep 18 '24 08:09 wcy-fdu

It doesn't seem related to opendal, so let's close it.

Xuanwo avatar Nov 14 '24 07:11 Xuanwo

@wcy-fdu did you manage to solve this? I'm not using MinIO but Garage, and I get the same error:

[2025-09-26T17:02:14Z WARN  opendal::layers::complete] writer has not been closed or aborted, must be a bug

choucavalier avatar Sep 26 '25 17:09 choucavalier