scio
scio copied to clipboard
[Discuss]: Decouple read and writer functionality of ScioIO
it seems most of the implementation of ScioIO[T]
is either read-only or write-only. For some IOs, read types are not as same as write types, that reduce the value of our exiting ScioIO implementations since it enforces to have the same read
, write
and tap
type.
What if we decouple the read and write from the top-level ScioIO[T]
trait?. something like the following? what is the reason to choose existing over this?
trait ScioIO[T] {}
trait ScioIOReader[T] extends ScioIO[T] {
type ReadP
def read(sc: ScioContext, params: ReadP): SCollection[T]
}
object ScioIOReader {
type Aux[A, B] = ScioIOReader[A] { type readT = B }
}
trait ScioIOWriter[T, U] extends ScioIO[T] {
type WriteP
def write(scol: SCollection[T], params: WriteP): Tappable[U]
}
trait Tappable[U] {
def tap[R <: ScioIOReader.Aux[U, _]](): R
}
Then the read and write API in ScioContext and SCollection would look something like this.
sc.read[T: Coder](io: ScioIOReader[T])(params: io.ReadP): SCollection[T] = ???
and scol.write[U: Coder](io: ScioIOWriter[T, U])(params: io.WriteP)(implicit coderT = Coder[T]): Tappable[U] =???
While I don't have a strong opinion on the design itself, it will introduce a pretty big breaking change for a benefit that seems, at least to me, unclear.
Unless I missed a big benefit or unless we can make this change in a source compatible way, I'd rather not.
@nevillelyh @regadas @ClaireMcGinty Any opinion ?
Few things to note:
- We are leaking tap API with Readonly IOs. (for read-only IOs tap doesn't make sense)
- With the current implementation,
read
input argument indef tap(read: ReadP): Tap[T] = {..}
is not necessary. (we can't use it sinceTap[T]
impl create another IO instance when opening tap and this ReadP is path-dependent.)
@jto for me the biggest advantage is this will further simplify the ScioIO implementations and improve the code readability a lot. Since most of the ScioIOs are read-only, Read-only IOs shouldn't have either def tap
or def writes
, this will clean a lot of code for us.
Currently exiting API allow users to pass read-only IOs to write APIs which will fail at runtime. But we can do better and catch these mistakes at compile time.
Currently exiting API allow users to pass read-only IOs to write APIs which will fail at runtime. But we can do better and catch these mistakes at compile time.
Isn't this issue prevented at compile time by read-only IOs having a WriteP
of type Nothing
, and vice-versa? I feel like as long as the user is interacting with ScioContextSyntax
/SCollectionSyntax
APIs we provide, it should be pretty fool-proof.
@ClaireMcGinty oh you are right, I think Julien also mentioned this the other day. I keep forgetting that. lol
Looking at the proposed approach, I feel like there are 2 major changes:
- separation of
read
&write
- supporting different return type on
write
Regarding 1) I'm with @jto benefit seems unclear. Today, uses can express the same thing and have compile time guarantees.
Regarding 2) i think is the most important and one that I been thinking about. I've seen couple of use cases where this is important. IO's can return other types besides the normal PDone
i.e BigqueryIO returns a WriteResult
that contains PCollection
with failed elements. Currently it's possible to overcome this by passing a function with the WriteParam
but i feel like it's not very ideal. This leads me to the Tap
subject.
Tap
's how useful are they these days? They tend to require ReadParam
which most of the time is not fully possible to create, so reading from a Tap
might not always give the desired outcome. Users can achieve better semantic without them.
I feel like this is the thing that we should rethink in ScioIO
. wdyt? we can go into further details but probably in another thread.
This exactly. 100% agree with the PDone
thing and indeed the only Tap
I have actually seen used in the wild is InMemoryTap
, for tests.