Very large Ceph RadosGW S3 object staging bug
Bug report
Expected behavior and actual behavior
I am running a pipeline using the k8s executor from inside a k8s pod (I know this isn't generally recommended, but it actually works very well) with an S3 URI as a file input; the S3 API is provided by Ceph RadosGW. Before any k8s pods are deployed, Nextflow stages the S3 URI into the work directory. For very large objects (> ~120 GB in my testing) this staging step fails with an extremely generic AWS Java SDK exception that kills the pipeline: `com.amazonaws.SdkClientException: Unable to store object contents to disk: Connection reset by peer`.
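For context, the relevant configuration looks roughly like this; the endpoint URL is a placeholder, not our real value:

```groovy
// nextflow.config — sketch of the relevant settings (values are placeholders)
process.executor = 'k8s'

aws {
    client {
        endpoint = 'https://rgw.example.internal'   // hypothetical Ceph RadosGW S3 endpoint
        s3PathStyleAccess = true                    // RadosGW deployments typically need path-style access
    }
}
```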
Smaller files stage absolutely fine, which is the expected behaviour.
I'm not 100% certain this is actually a problem with Nextflow or the nf-amazon plugin, but I was hoping you might be able to shed some light.
Steps to reproduce the problem
No complex reproduction script required: simply try to stage an extremely large file as a process input. (If you cannot reproduce this, that would be valuable to know too, since it would indicate the issue is something to do with our setup.)
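A minimal sketch of the failing pattern (the S3 path is a placeholder; in our environment any object over roughly 120 GB triggers the error):

```groovy
// repro.nf — stage a single very large S3 object as a file input (path is hypothetical)
params.input = 's3://some-bucket/very-large-object.dat'

process stage_only {
    input:
    path big_file          // staging happens here, before any k8s pod is created

    script:
    """
    ls -lh ${big_file}
    """
}

workflow {
    stage_only(Channel.fromPath(params.input))
}
```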
Program output
Lightly redacted nextflow.log attached below: nextflow-Copy1 (1).log
Environment
- Nextflow version: 23.10.0
- Java version: openjdk 21.0.1 2023-10-17
- Operating system: Linux
- Bash version: GNU bash, version 5.2.21(1)-release (x86_64-alpine-linux-musl)
Additional context
The pipeline is also being executed as a Python subprocess. I'm aware this sounds horrendous, but again, it actually works extremely reliably.