dagster
s3_pickle_io_manager does not work with dynamic outputs
https://dagster.slack.com/archives/C01U954MEER/p1642163479382400:
Hi, I'm experiencing a bug when trying to write an ALS model from pyspark.ml.recommendation to S3 and read it back when this happens within a dynamically executed graph (i.e. via dynamic mapping). I wrote a custom IO manager whose _uri_for_key builds URIs with the pattern f's3a://{self.s3_bucket}/{key}', similar to the one currently implemented in PickledObjectS3IOManager. Because the step identifiers for dynamically generated steps contain square brackets [ and ], these end up in the S3 URI when an object is written. Even though I can clearly see that the model was saved to this path in S3, I get an error when the downstream op tries to load the model, something like:
py4j.protocol.Py4JJavaError: An error occurred while calling o44.load.
: org.apache.hadoop.mapred.InvalidInputException: Input Pattern s3a://....fit_model[...]/model/metadata matches 0 files
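The "matches 0 files" error is consistent with Hadoop treating the s3a input path as a glob pattern, where [...] denotes a character class rather than literal brackets. Python's fnmatch follows the same glob rules, so the failure mode can be illustrated in isolation (the path strings below are made up for the demonstration):

```python
from fnmatch import fnmatch

# The actual key written to S3 contains literal brackets from the
# dynamic-step identifier, e.g. fit_model[a].
path = "fit_model[a]/model/metadata"

# Using that same string as a glob pattern: [a] now means
# "the single character a", not the literal text "[a]".
pattern = "fit_model[a]/model/metadata"

print(fnmatch(path, pattern))                          # the real path does not match
print(fnmatch("fit_modela/model/metadata", pattern))   # but this path would
```

So the pattern derived from the bracketed key can never match the bracketed key itself, which is exactly "Input Pattern ... matches 0 files".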
When I replace/remove the square brackets in _uri_for_key, everything works fine:
f's3a://{self.s3_bucket}/{key}'.replace('[', '_').replace(']', '')