Possible improvements for reading many files

Open clairemcginty opened this issue 4 years ago • 5 comments

Currently, our built-in file read APIs (sc.avroFile, sc.protobufFile, sc.textFile, ...) take a single String path that matches either a specific file or a file pattern. The approach most users seem to take when reading many files is either to use SCollection.unionAll on many individual read transforms (which can produce a job graph that's too large to submit), or to specify a file pattern matching many files (which is not performant if it matches a huge number of files).

Although we do have the SCollection#readFiles[A: Coder](filesTransform: PTransform[PCollection[beam.FileIO.ReadableFile], PCollection[A]]) method, it has the disadvantage that users have to work directly with Beam ReadFiles transforms.
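For reference, a minimal sketch of what that looks like today for text files, assuming the readFiles signature quoted above and Beam's TextIO.readFiles(). The object name and bucket paths are hypothetical placeholders.

```scala
import com.spotify.scio.ContextAndArgs
import com.spotify.scio.values.SCollection
import org.apache.beam.sdk.io.TextIO

object ReadManyTextFiles {
  // Hypothetical inputs; each entry is a file path or file pattern.
  val inputPaths: List[String] = List(
    "gs://my-bucket/2020-03-16/*.txt",
    "gs://my-bucket/2020-03-17/*.txt"
  )

  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, _) = ContextAndArgs(cmdlineArgs)

    // Turn the paths into an SCollection[String], then hand a Beam
    // ReadFiles-style transform to readFiles. The user has to know
    // which Beam transform to pick for each file format.
    val lines: SCollection[String] =
      sc.parallelize(inputPaths).readFiles(TextIO.readFiles())

    // Downstream transforms on `lines` would go here.
    sc.run()
  }
}
```

This keeps the job graph at a single read step regardless of how many paths are in the list, at the cost of exposing the Beam transform type to the caller.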

I had two thoughts on optimizing user experience in either case:

  1. Expose the withHintMatchesManyFiles() option in more ScioIO ReadParams. (doc in Beam's AvroIO). When this is set, the Read transform is converted to use a ReadFiles transform.

  2. Offer sc.avroFile/sc.protobufFile/sc.textFile methods that take a List[String] of paths. I know this adds some clutter/maintainability issues but it does seem like a VERY common use case, and we don't have to offer it for every ScioIO.
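
As a point of comparison for option 1, here is a rough sketch of how the hint can be applied today by dropping down to Beam's TextIO and wrapping it with sc.customInput. The object name, transform name, and file pattern are hypothetical; exposing the hint through ReadParams would presumably make this wrapper unnecessary.

```scala
import com.spotify.scio.ContextAndArgs
import com.spotify.scio.values.SCollection
import org.apache.beam.sdk.io.TextIO

object ReadWithManyFilesHint {
  // Hypothetical pattern, assumed to match a very large number of files.
  val inputPattern: String = "gs://my-bucket/logs/*/*.txt"

  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, _) = ContextAndArgs(cmdlineArgs)

    // Beam's TextIO.Read with the many-files hint set.
    val read = TextIO.read().from(inputPattern).withHintMatchesManyFiles()

    // Wrap the raw Beam transform as a Scio input.
    val lines: SCollection[String] = sc.customInput("ReadManyTextFiles", read)

    // Downstream transforms on `lines` would go here.
    sc.run()
  }
}
```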

wdyt?

clairemcginty avatar Mar 17 '20 18:03 clairemcginty

Main concern re 2, it might cause method overload issues? If there's a way around it then 👍

nevillelyh avatar May 13 '20 18:05 nevillelyh

overloading will be an issue here, but maybe we can find some clever naming? I think we should work on this as I have seen more use cases requiring this.

regadas avatar May 19 '20 19:05 regadas

@alexclare wants to take a stab at this.

regadas avatar Sep 30 '20 14:09 regadas

I'm not sure if this API would be great looking, but I would be happy using something like this:

val myListOfInputs: List[String]
sc.read(myListOfInputs, TextIO_or_any_other_IO(_))

I.e. the first parameter is an input list (the String type could be made generic) and the second parameter is a function from an input item to an IO.

I have no idea how this would be possible with current IO types, like ScioIO.ReadP.

sisidra avatar Jul 13 '21 07:07 sisidra

My implicit class version of this in our repo is called avroFiles (plural). No collision issue.

kellen avatar Jul 14 '21 03:07 kellen
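
A sketch of the plural-naming idea for text files, to make the shape concrete. This is not the implementation referred to above; the names ManyFilesSyntax, ManyFilesScioContext, and textFiles are hypothetical, and the body assumes the readFiles signature quoted in the issue description.

```scala
import com.spotify.scio.ScioContext
import com.spotify.scio.values.SCollection
import org.apache.beam.sdk.io.TextIO

object ManyFilesSyntax {
  // Extension methods on ScioContext. The plural name means there is no
  // overload of the existing sc.textFile(path, ...) to collide with.
  implicit class ManyFilesScioContext(private val sc: ScioContext) extends AnyVal {
    // Reads every path (or file pattern) in `paths` as lines of text.
    def textFiles(paths: Iterable[String]): SCollection[String] =
      sc.parallelize(paths).readFiles(TextIO.readFiles())
  }
}
```

With `import ManyFilesSyntax._` in scope, a call site would look like `sc.textFiles(List("gs://my-bucket/a/*.txt", "gs://my-bucket/b/*.txt"))`. An Avro or Protobuf variant would follow the same pattern with the corresponding Beam ReadFiles transform.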