Use of S3() within parallel_map()
Consider this call:
```python
with metaflow.S3() as s3i:
    result = s3i.info_many(s3_path, return_missing=True)
```
Can this be put inside a metaflow.multicore_utils.parallel_map?
i.e., parallel_map(wrapper_for_s3_info_many, s3_paths)
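For concreteness, here is a minimal sketch of the attempted pattern; the wrapper body and the one-path-per-worker chunking are my assumptions, since only the function name appears above:

```python
import metaflow
from metaflow.multicore_utils import parallel_map

s3_paths = ["s3://path/to/something.jpg", "s3://path/to/something_else.jpg"]

def wrapper_for_s3_info_many(s3_path):
    # each worker process opens its own S3 client and queries one full s3:// URL
    # note: as described below, this raises MetaflowS3URLException inside a flow
    with metaflow.S3() as s3i:
        return s3i.info_many([s3_path], return_missing=True)

results = parallel_map(wrapper_for_s3_info_many, s3_paths)
```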
When I try, I get this error:
```
2024-09-30 23:08:29.391 [261693/start/3201226 (pid 1400063)] metaflow.plugins.datatools.s3.s3.MetaflowS3URLException: Specify S3(run=self) when you use S3 inside a running flow. Otherwise you have to use S3 with full s3:// urls.
2024-09-30 23:08:29.391 [261693/start/3201226 (pid 1400063)] Internal error
```
However, s3_paths = ["s3://path/to/something.jpg", "s3://path/to/something_else.jpg", ...], and I know with 100% certainty that every path in s3_paths starts with "s3://".
Putting run=self in the S3 instantiation within the wrapper yields:
```
2024-09-30 23:21:21.832 [261699/start/3201250 (pid 1405459)] S3 non-transient error (attempt #1): s3op failed:
2024-09-30 23:21:21.913 [261699/start/3201250 (pid 1405459)] Invalid url: /
```
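For reference, the run=self variant looks roughly like this; a sketch assuming the wrapper is defined as a closure inside a @step method so that self (the FlowSpec instance) is in scope:

```python
def wrapper_for_s3_info_many(s3_path):
    # run=self ties the client to the current run; this variant produced
    # the "Invalid url: /" error above
    with metaflow.S3(run=self) as s3i:
        return s3i.info_many([s3_path], return_missing=True)
```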
@hasush The s3.xx_many calls are already parallelized behind the scenes, so you shouldn't need parallel_map. Regardless, the error you highlighted looks like a bug that we will address.
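Given that, the simplest approach is likely a single info_many call over the whole list, with no parallel_map at all; a minimal sketch, assuming every entry is a full s3:// URL:

```python
import metaflow

s3_paths = ["s3://path/to/something.jpg", "s3://path/to/something_else.jpg"]

# one call; Metaflow parallelizes the lookups behind the scenes
with metaflow.S3() as s3i:
    results = s3i.info_many(s3_paths, return_missing=True)

# with return_missing=True, missing keys come back as objects with exists == False
for obj in results:
    print(obj.url, obj.exists)
```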