
Support file wildcards in GSProcessing inputs

Open thvasilo opened this issue 1 year ago • 0 comments

Currently, GSProcessing reads files by passing the path from the user's config directly to a spark.read.parquet(filepath) or spark.read.csv(filepath) call. Spark doesn't support wildcards when used like this, but GConstruct does support filepath wildcards.

To ensure better compatibility between the two, we should support wildcards for S3 paths in GSProcessing as well. One option is to use boto to list all files under the parent path, filter that listing with the wildcard pattern, and pass the resulting list of files to the Spark read call.

This can happen at config parsing time.
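
A minimal sketch of the boto-based option, not actual GSProcessing code. The function name `expand_s3_wildcard` and the injectable `s3_client` parameter are hypothetical. It assumes boto3 with the `list_objects_v2` paginator and `fnmatch`-style matching, where `*` also matches `/`, which may differ from GConstruct's wildcard semantics.

```python
import fnmatch

WILDCARD_CHARS = "*?["


def expand_s3_wildcard(s3_path, s3_client=None):
    """Expand an S3 URI containing wildcards into a sorted list of matching URIs.

    Hypothetical helper. Paths without wildcards are returned unchanged.
    """
    if not s3_path.startswith("s3://"):
        raise ValueError(f"Expected an s3:// path, got: {s3_path}")

    # Split "s3://bucket/key/pattern" into the bucket and the key pattern.
    bucket, _, key_pattern = s3_path[len("s3://"):].partition("/")

    present = [ch for ch in WILDCARD_CHARS if ch in key_pattern]
    if not present:
        return [s3_path]

    # The parent "directory" of the first wildcard is the listing prefix.
    first_wildcard = min(key_pattern.find(ch) for ch in present)
    parent = key_pattern[:first_wildcard].rpartition("/")[0]
    prefix = parent + "/" if parent else ""

    if s3_client is None:
        # Imported lazily so the helper can be exercised with a stub client.
        import boto3

        s3_client = boto3.client("s3")

    # Paginate, since a single list call returns a limited number of keys.
    paginator = s3_client.get_paginator("list_objects_v2")
    matches = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # fnmatchcase is case-sensitive; note that "*" also matches "/".
            if fnmatch.fnmatchcase(obj["Key"], key_pattern):
                matches.append(f"s3://{bucket}/{obj['Key']}")
    return sorted(matches)
```

The resulting list could then be handed to Spark, for example `spark.read.parquet(*files)` or `spark.read.csv(files)`, since both readers accept multiple paths. An empty result should probably raise an error at config parsing time rather than reach Spark.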

thvasilo, Dec 06 '24 18:12