Very large Ceph RadosGW S3 object staging bug
Bug report
Expected behavior and actual behavior
I am running a pipeline using the k8s executor from inside a k8s pod (I know this isn't generally recommended, but it actually works very well) with an S3 URI as a file input; the S3 API is provided by Ceph RadosGW. Before any k8s pods are deployed, Nextflow stages the S3 URI into the work directory. For very large objects (> ~120 GB in my testing) this staging step fails with an extremely generic AWS Java SDK exception that kills the pipeline: `com.amazonaws.SdkClientException: Unable to store object contents to disk: Connection reset by peer`.
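For context, the relevant configuration looks roughly like this; the endpoint URL is a placeholder, not our real value:

```groovy
// nextflow.config — sketch of the relevant settings (values are placeholders)
process.executor = 'k8s'

aws {
    client {
        endpoint = 'https://rgw.example.internal'   // hypothetical Ceph RadosGW S3 endpoint
        s3PathStyleAccess = true                    // RadosGW deployments typically need path-style access
    }
}
```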
Smaller files stage absolutely fine, which is the expected behaviour.
I'm not 100% certain this is actually a problem with Nextflow or the nf-amazon plugin, but I was hoping you might be able to shed some light.
Steps to reproduce the problem
No complex reproduction script required: simply try to stage an extremely large file as a process input. (If you cannot reproduce this, that would be valuable to know too, since it would indicate the issue is something to do with our setup.)
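A minimal sketch of the failing pattern (the S3 path is a placeholder; in our environment any object over roughly 120 GB triggers the error):

```groovy
// repro.nf — stage a single very large S3 object as a file input (path is hypothetical)
params.input = 's3://some-bucket/very-large-object.dat'

process stage_only {
    input:
    path big_file          // staging happens here, before any k8s pod is created

    script:
    """
    ls -lh ${big_file}
    """
}

workflow {
    stage_only(Channel.fromPath(params.input))
}
```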
Program output
Lightly redacted nextflow.log attached below: nextflow-Copy1 (1).log
Environment
- Nextflow version: 23.10.0
- Java version: openjdk 21.0.1 2023-10-17
- Operating system: Linux
- Bash version: GNU bash, version 5.2.21(1)-release (x86_64-alpine-linux-musl)
Additional context
The pipeline is also being executed as a Python subprocess. I'm aware this sounds horrendous, but again, it actually works extremely reliably.