scio icon indicating copy to clipboard operation
scio copied to clipboard

[Discuss]: Decouple read and writer functionality of ScioIO

Open syodage opened this issue 4 years ago • 8 comments

it seems most of the implementation of ScioIO[T] is either read-only or write-only. For some IOs, read types are not as same as write types, that reduce the value of our exiting ScioIO implementations since it enforces to have the same read, write and tap type.

What if we decouple the read and write from the top-level ScioIO[T] trait?. something like the following? what is the reason to choose existing over this?


trait ScioIO[T] {}

trait ScioIOReader[T] extends ScioIO[T] {
  type ReadP
  def read(sc: ScioContext, params: ReadP): SCollection[T]
}

object ScioIOReader {
  type Aux[A, B] = ScioIOReader[A] { type readT = B }
}

trait ScioIOWriter[T, U] extends ScioIO[T] {
  type WriteP
  def write(scol: SCollection[T], params: WriteP): Tappable[U]
}

trait Tappable[U] {
  def tap[R <: ScioIOReader.Aux[U, _]](): R
}

Then the read and write API in ScioContext and SCollection would look something like this. sc.read[T: Coder](io: ScioIOReader[T])(params: io.ReadP): SCollection[T] = ??? and scol.write[U: Coder](io: ScioIOWriter[T, U])(params: io.WriteP)(implicit coderT = Coder[T]): Tappable[U] =???

syodage avatar Jun 29 '20 19:06 syodage

While I don't have a strong opinion on the design itself, it will introduce a pretty big breaking change for a benefit that seems, at least to me, unclear.

Unless I missed a big benefit or unless we can make this change in a source compatible way, I'd rather not.

jto avatar Jul 01 '20 10:07 jto

@nevillelyh @regadas @ClaireMcGinty Any opinion ?

jto avatar Jul 07 '20 09:07 jto

Few things to note:

  1. We are leaking tap API with Readonly IOs. (for read-only IOs tap doesn't make sense)
  2. With the current implementation, read input argument in def tap(read: ReadP): Tap[T] = {..} is not necessary. (we can't use it since Tap[T] impl create another IO instance when opening tap and this ReadP is path-dependent.)

syodage avatar Jul 07 '20 17:07 syodage

@jto for me the biggest advantage is this will further simplify the ScioIO implementations and improve the code readability a lot. Since most of the ScioIOs are read-only, Read-only IOs shouldn't have either def tap or def writes, this will clean a lot of code for us.

Currently exiting API allow users to pass read-only IOs to write APIs which will fail at runtime. But we can do better and catch these mistakes at compile time.

syodage avatar Jul 07 '20 18:07 syodage

Currently exiting API allow users to pass read-only IOs to write APIs which will fail at runtime. But we can do better and catch these mistakes at compile time.

Isn't this issue prevented at compile time by read-only IOs having a WriteP of type Nothing, and vice-versa? I feel like as long as the user is interacting with ScioContextSyntax/SCollectionSyntax APIs we provide, it should be pretty fool-proof.

clairemcginty avatar Jul 07 '20 18:07 clairemcginty

@ClaireMcGinty oh you are right, I think Julien also mentioned this the other day. I keep forgetting that. lol

syodage avatar Jul 07 '20 19:07 syodage

Looking at the proposed approach, I feel like there are 2 major changes:

  1. separation of read & write
  2. supporting different return type on write

Regarding 1) I'm with @jto benefit seems unclear. Today, uses can express the same thing and have compile time guarantees.

Regarding 2) i think is the most important and one that I been thinking about. I've seen couple of use cases where this is important. IO's can return other types besides the normal PDone i.e BigqueryIO returns a WriteResult that contains PCollection with failed elements. Currently it's possible to overcome this by passing a function with the WriteParam but i feel like it's not very ideal. This leads me to the Tap subject.

Tap's how useful are they these days? They tend to require ReadParam which most of the time is not fully possible to create, so reading from a Tap might not always give the desired outcome. Users can achieve better semantic without them.

I feel like this is the thing that we should rethink in ScioIO. wdyt? we can go into further details but probably in another thread.

regadas avatar Jul 08 '20 09:07 regadas

This exactly. 100% agree with the PDone thing and indeed the only Tap I have actually seen used in the wild is InMemoryTap, for tests.

jto avatar Jul 08 '20 09:07 jto