scio
Possible improvements for reading many files
Currently, our built-in file read APIs (sc.avroFile, sc.protobufFile, sc.textFile, ...) take a single String path matching either a specific file or a file pattern. The approach most users seem to take when reading many files is either to use SCollection.unionAll on many individual read transforms (which can produce a job graph that's too large to submit), or to specify a file pattern matching many files (which is not performant if it matches a huge number of files).
Although we do have the SCollection#readFiles[A: Coder](filesTransform: PTransform[PCollection[beam.FileIO.ReadableFile], PCollection[A]]) method, it has the disadvantage that users have to work directly with Beam ReadFiles transforms.
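For reference, here is a minimal sketch of what the readFiles route looks like today for text input. It assumes the readFiles signature quoted above is available on an SCollection[String] of paths and uses Beam's TextIO.readFiles() as the files transform; the object and helper names (ReadManyTextFiles, readManyTextFiles) are hypothetical and not part of Scio.

```scala
import com.spotify.scio.ScioContext
import com.spotify.scio.values.SCollection
import org.apache.beam.sdk.io.TextIO

object ReadManyTextFiles {
  // Hypothetical helper, not a Scio API: reads every path (or file pattern)
  // in `paths` through a single Beam ReadFiles-style transform instead of
  // creating one read transform per path.
  def readManyTextFiles(sc: ScioContext, paths: Seq[String]): SCollection[String] =
    sc.parallelize(paths)            // SCollection[String] of paths/patterns
      .readFiles(TextIO.readFiles()) // Beam transform: ReadableFile => lines
}
```

The job graph stays the same size no matter how many paths are passed in, but the caller still has to know which Beam transform to hand to readFiles, which is the usability gap described above.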
I had two thoughts on optimizing user experience in either case:
- Expose the withHintMatchesManyFiles() option in more ScioIO ReadParams (doc in Beam's AvroIO). When this is set, the Read transform is converted to use a ReadFiles transform.
- Offer sc.avroFile / sc.protobufFile / sc.textFile methods that take a List[String] of paths. I know this adds some clutter/maintainability issues, but it does seem like a VERY common use case, and we don't have to offer it for every ScioIO.
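To illustrate the first idea, here is a rough sketch of how the hint can be applied today by dropping down to Beam directly. It assumes Beam's TextIO.Read#withHintMatchesManyFiles() and Scio's sc.customInput; the object and helper names (HintManyFiles, textFileManyMatches) are made up for this example and are not what a ReadParam-based API would look like.

```scala
import com.spotify.scio.ScioContext
import com.spotify.scio.values.SCollection
import org.apache.beam.sdk.io.TextIO

object HintManyFiles {
  // Hypothetical helper, not a Scio API: wraps a raw Beam TextIO read with
  // the "matches many files" hint and applies it as a custom input.
  def textFileManyMatches(sc: ScioContext, pattern: String): SCollection[String] =
    sc.customInput(
      "ReadTextManyFiles", // transform name, chosen arbitrarily for this sketch
      TextIO.read().from(pattern).withHintMatchesManyFiles()
    )
}
```

Exposing the same flag through a ReadParam would let users keep calling the built-in read methods and opt in with a single argument.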
wdyt?
Main concern re 2, it might cause method overload issues? If there's a way around it then 👍
overloading will be an issue here, but maybe we can find some clever naming? I think we should work on this as I have seen more use cases requiring this.
@alexclare wants to take a stab at this.
I'm not sure if this API would be great looking, but I would be happy using something like this:
val myListOfInputs: List[String]
sc.read(myListOfInputs, TextIO_or_any_other_IO(_))
I.e. the first parameter is an input list (the String type can be made generic) and the second parameter is a transformation from an input item to an IO.
I have no idea how this would be possible with current IO types, like ScioIO.ReadP.
My implicit class version of this in our repo is called avroFiles (plural). No collision issue.
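A sketch of that plural-name pattern, shown for text files rather than Avro to keep it short. This is not the code from the repo mentioned above: ManyFilesSyntax, ManyFilesScioContext and textFiles are hypothetical names, and the body assumes the readFiles method and Beam's TextIO.readFiles() discussed earlier in the thread.

```scala
import com.spotify.scio.ScioContext
import com.spotify.scio.values.SCollection
import org.apache.beam.sdk.io.TextIO

object ManyFilesSyntax {
  // Hypothetical syntax extension: adds a plural `textFiles` method to
  // ScioContext. Because the name differs from `textFile`, it never takes
  // part in overload resolution against the existing single-path method.
  implicit class ManyFilesScioContext(private val self: ScioContext) extends AnyVal {
    def textFiles(paths: Seq[String]): SCollection[String] =
      self.parallelize(paths).readFiles(TextIO.readFiles())
  }
}
```

With import ManyFilesSyntax._ in scope, a call site would read sc.textFiles(List("gs://bucket/a/*.txt", "gs://bucket/b/*.txt")), where the bucket paths are placeholders.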