Use of S3() within parallel_map()
Consider this call:
```python
with metaflow.S3() as s3i:
    result = s3i.info_many(s3_path, return_missing=True)
```
Can this be put inside a metaflow.multicore_utils.parallel_map?
i.e., parallel_map(wrapper_for_s3_info_many, s3_paths)
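For concreteness, here is a minimal sketch of the attempted pattern; the wrapper body and the one-path-per-worker chunking are my assumptions, since only the function name appears above:

```python
import metaflow
from metaflow.multicore_utils import parallel_map

s3_paths = ["s3://path/to/something.jpg", "s3://path/to/something_else.jpg"]

def wrapper_for_s3_info_many(s3_path):
    # each worker process opens its own S3 client and queries one full s3:// URL
    # note: as described below, this raises MetaflowS3URLException inside a flow
    with metaflow.S3() as s3i:
        return s3i.info_many([s3_path], return_missing=True)

results = parallel_map(wrapper_for_s3_info_many, s3_paths)
```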
When I try, I get this error:
```
2024-09-30 23:08:29.391 [261693/start/3201226 (pid 1400063)] metaflow.plugins.datatools.s3.s3.MetaflowS3URLException: Specify S3(run=self) when you use S3 inside a running flow. Otherwise you have to use S3 with full s3:// urls.
2024-09-30 23:08:29.391 [261693/start/3201226 (pid 1400063)] Internal error
```
However, s3_paths = ["s3://path/to/something.jpg", "s3://path/to/something_else.jpg", ...], and I know with 100% certainty that every path in s3_paths starts with "s3://".
Putting run=self in the S3 instantiation within the wrapper yields:
```
2024-09-30 23:21:21.832 [261699/start/3201250 (pid 1405459)] S3 non-transient error (attempt #1): s3op failed:
2024-09-30 23:21:21.913 [261699/start/3201250 (pid 1405459)] Invalid url: /
```
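For reference, the run=self variant looks roughly like this; a sketch assuming the wrapper is defined as a closure inside a @step method so that self (the FlowSpec instance) is in scope:

```python
def wrapper_for_s3_info_many(s3_path):
    # run=self ties the client to the current run; this variant produced
    # the "Invalid url: /" error above
    with metaflow.S3(run=self) as s3i:
        return s3i.info_many([s3_path], return_missing=True)
```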
@hasush The s3.xx_many calls are already parallelized behind the scenes, so you shouldn't need parallel_map. Regardless, the error you highlighted looks like a bug that we will address.
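Given that, the simplest approach is likely a single info_many call over the whole list, with no parallel_map at all; a minimal sketch, assuming every entry is a full s3:// URL:

```python
import metaflow

s3_paths = ["s3://path/to/something.jpg", "s3://path/to/something_else.jpg"]

# one call; Metaflow parallelizes the lookups behind the scenes
with metaflow.S3() as s3i:
    results = s3i.info_many(s3_paths, return_missing=True)

# with return_missing=True, missing keys come back as objects with exists == False
for obj in results:
    print(obj.url, obj.exists)
```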