graphstorm
Support file wildcards in GSProcessing inputs
Currently GSProcessing reads files by passing the path from the user's config directly to a spark.read.parquet/csv(filepath) call. Spark doesn't support wildcards when used like this, but GConstruct does support filepath wildcards.
To ensure better compatibility between the two, we should support wildcards for S3 paths in GSProcessing as well. One option is to use boto to list all files under the parent path, apply the wildcard rule to that listing, and pass the resulting list of files to the read call as input.
This can happen at config parsing time.
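
A minimal sketch of the boto-based option, to make the proposal concrete. It assumes boto3 and Python's fnmatch for the wildcard rule; the function name expand_s3_wildcard and its s3_client parameter are hypothetical and not part of GSProcessing. Note that fnmatch's `*` also matches `/`, which may differ from how GConstruct interprets wildcards.

```python
import fnmatch


def expand_s3_wildcard(s3_path, s3_client=None):
    """Hypothetical helper: expand a wildcard S3 path into matching object URIs.

    Paths without wildcard characters are returned unchanged as a one-item list.
    """
    if not s3_path.startswith("s3://"):
        raise ValueError(f"Expected an s3:// path, got: {s3_path}")
    bucket, _, key_pattern = s3_path[len("s3://"):].partition("/")

    # Position of the first wildcard character, if any.
    wildcard_pos = min(
        (i for i, ch in enumerate(key_pattern) if ch in "*?["), default=None
    )
    if wildcard_pos is None:
        return [s3_path]

    # Parent prefix: everything up to the last "/" before the first wildcard.
    literal_part = key_pattern[:wildcard_pos]
    prefix = literal_part[: literal_part.rfind("/") + 1]

    if s3_client is None:
        import boto3  # imported lazily so a pre-built client can be injected

        s3_client = boto3.client("s3")

    # list_objects_v2 returns at most 1000 keys per call, so paginate.
    paginator = s3_client.get_paginator("list_objects_v2")
    matches = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # Apply the wildcard rule to the full object key.
            if fnmatch.fnmatchcase(obj["Key"], key_pattern):
                matches.append(f"s3://{bucket}/{obj['Key']}")
    return sorted(matches)
```

At config parsing time, the resulting list could replace the single path, for example spark.read.parquet(*files) or spark.read.csv(files). Whether an empty match list should raise an error is left open here.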