flink-scala-api

Separate type-information derivation into auto and semiauto

Open gaelrenoux-datadome opened this issue 3 months ago • 5 comments

Before I start explaining: I'm willing to work on the PR if you're interested, but I thought it better to discuss it with you first :-)

So, we're using flink-scala-api for type-information (I work with @arnaud-daroussin). One thing we've noticed is that if we use it "as intended" (by just importing org.apache.flinkx.api.serializers._ everywhere), it leads to very high compilation times. With the old Flink API, the full clean-compile took around 160 seconds, and with flink-scala-api it moved up to 200 seconds. However, we managed to cut quite a lot of it by using semi-auto derivation instead of full-auto derivation: we've reduced the time down to 140 seconds, even less than before the migration.

I'm not sure how familiar you are with semi-auto vs full-auto derivation? The idea is that instead of importing the macro everywhere, we declare implicit TypeInformation vals in the companion objects of all classes, and they're automatically found (hence semi-auto: they're declared manually, but found automatically). In addition to faster compile times, semi-auto also has the advantage of letting us create custom TypeInformations for certain classes where the macro would have worked, but wouldn't have been as optimized for runtime performance. => You trade convenience for control.

So for example, instead of:

import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flinkx.api.serializers._

final case class Alert(message: String)

final case class Notification(alerts: List[Alert])

object Job {
  val info = implicitly[TypeInformation[Notification]]
}

We have:

import org.apache.flink.api.common.typeinfo.TypeInformation
// Don't import deriveTypeInformation
import org.apache.flinkx.api.serializers.{deriveTypeInformation => _, _}

final case class Alert(message: String)

object Alert {
  implicit val alertInfo: TypeInformation[Alert] = org.apache.flinkx.api.serializers.deriveTypeInformation
}

final case class Notification(alerts: List[Alert])

object Notification {
  implicit val notificationInfo: TypeInformation[Notification] = ??? // placeholder: some custom stuff goes here
}

object Job {
  val info = implicitly[TypeInformation[Notification]]
}

The issue is that flink-scala-api doesn't really support semi-auto derivation natively.

So, we had to jump through some hoops. As you can see, we have to be careful to never import deriveTypeInformation, because it would have a higher priority as an implicit (being already in the scope) than the one on the entity's companion object. That's very error-prone: it's easy to miss (we did it a few times), because if you do import it, everything seems to work "mostly" fine. So instead, we just created our own class that copied everything from org.apache.flinkx.api.serializers except deriveTypeInformation.
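A toy sketch of that lookup rule, for readers who haven't run into it. Info and Generic.derive are hypothetical stand-ins for TypeInformation and the library's implicit macro, so this only illustrates Scala's implicit resolution order and is not flink-scala-api code:

```scala
object PriorityDemo {
  // Hypothetical stand-in for TypeInformation
  trait Info[A] { def describe: String }

  object Generic {
    // Hypothetical stand-in for the implicit derivation macro
    implicit def derive[A]: Info[A] = new Info[A] { def describe: String = "derived" }
  }

  final case class Alert(message: String)

  object Alert {
    // Hand-written instance living in the companion (the implicit scope of Info[Alert])
    implicit val alertInfo: Info[Alert] = new Info[Alert] { def describe: String = "custom" }
  }

  object WithImport {
    import Generic._
    // The imported implicit is in lexical scope, so the search stops there
    // and the companion instance is never consulted.
    val picked: String = implicitly[Info[Alert]].describe
  }

  object WithoutImport {
    // Nothing eligible in lexical scope, so the companion instance is used.
    val picked: String = implicitly[Info[Alert]].describe
  }
}
```

With the import, picked is "derived"; without it, picked is "custom". Both versions compile without any warning, which is why the mistake is so easy to miss.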

Another issue is that it doesn't notice when a type-information is missing, because deriveTypeInformation ends up calling itself if necessary. So for example, this shouldn't compile in semi-auto, but it does:

import org.apache.flink.api.common.typeinfo.TypeInformation
// Don't import deriveTypeInformation
import org.apache.flinkx.api.serializers.{deriveTypeInformation => _, _}

final case class Alert(message: String)

object Alert {
  // No TypeInformation declared
}

final case class Notification(alerts: List[Alert])

object Notification {
  // note that deriveTypeInformation is not in the implicit context, we call it by its full name
  // so it shouldn't find a way to get a TypeInformation[Alert]
  implicit val notificationInfo: TypeInformation[Notification] = org.apache.flinkx.api.serializers.deriveTypeInformation
}

object Job {
  val info = implicitly[TypeInformation[Notification]]
}

OK, that was a wall of text, sorry 😅

So: what do you think about supporting both auto and semi-auto derivation?

That's something projects like Circe are already doing. The idea would be to have two separate packages for the derivation of serializers and type-informations, called auto and semiauto. The generic type-informations (for stuff like Option, List, etc.) would be in a parent trait, inherited both by auto and semi-auto, and the macro would be the only thing being different between the two. Note that on the semi-auto derivation, the cache is not necessary, because the declared type-information vals are doing the job.
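A rough sketch of the layout I have in mind, on a toy type class so that it stays self-contained. Every name here (Info, GenericInstances, deriveOneField) is hypothetical, and deriveOneField is a hand-written stand-in for the macro that only handles one-field classes; the real thing would be the existing macro, minus the self-recursion:

```scala
object LayoutDemo {
  // Hypothetical stand-in for TypeInformation
  trait Info[A] { def describe: String }
  def info[A](s: String): Info[A] = new Info[A] { def describe: String = s }

  // Shared by both flavours: instances for standard types
  trait GenericInstances {
    implicit val stringInfo: Info[String] = info("String")
    implicit def listInfo[A](implicit a: Info[A]): Info[List[A]] =
      info(s"List[${a.describe}]")
    implicit def optionInfo[A](implicit a: Info[A]): Info[Option[A]] =
      info(s"Option[${a.describe}]")
  }

  // semiauto: the derivation entry point is NOT implicit and demands an
  // instance for the field, so a missing one is a compile error.
  object semiauto extends GenericInstances {
    def deriveOneField[A, F](name: String)(implicit field: Info[F]): Info[A] =
      info(s"$name(${field.describe})")
  }
  // An `auto` object would extend the same trait and expose the macro as an implicit.

  final case class Alert(message: String)
  object Alert {
    import semiauto._
    implicit val alertInfo: Info[Alert] = deriveOneField[Alert, String]("Alert")
  }

  final case class Notification(alerts: List[Alert])
  object Notification {
    import semiauto._
    // Needs Info[List[Alert]], hence Info[Alert] from Alert's companion.
    implicit val notificationInfo: Info[Notification] =
      deriveOneField[Notification, List[Alert]]("Notification")
  }

  val described: String = implicitly[Info[Notification]].describe
}
```

Here described is "Notification(List[Alert(String)])". If alertInfo is removed from the Alert companion, the Notification line no longer compiles because no Info[Alert] can be found, which is the behaviour we'd want from semiauto.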

gaelrenoux-datadome avatar Aug 28 '25 12:08 gaelrenoux-datadome

Hi @gaelrenoux-datadome ,

It sounds useful. Thanks for the detailed explanation. I guess you are also familiar with this technique: https://github.com/flink-extended/flink-scala-api?tab=readme-ov-file#compile-times. It does not solve all issues with compilation speed, right?

In general, I think it is a good idea to create auto and semi-auto derivation, with clear guidance on how and when people should use each. Please feel free to propose a PR. 🤩

novakov-alexey avatar Aug 29 '25 07:08 novakov-alexey

It sounds useful. Thanks for the detailed explanation. I guess you are also familiar with this technique: https://github.com/flink-extended/flink-scala-api?tab=readme-ov-file#compile-times. It does not solve all issues with compilation speed, right?

That's what we did at first, and it solves most of the performance issue, but it leads to a lot of boilerplate imports all over the place. It also runs into the issue where you can easily forget one import, and it then silently falls back on the default macro. Not a big deal if you're doing it for performance, but bad if you expected to use a custom type-information.

gaelrenoux-datadome avatar Sep 01 '25 14:09 gaelrenoux-datadome

OK. In general, you are more than welcome to contribute this new feature with auto and semi-auto derivation.

novakov-alexey avatar Sep 01 '25 18:09 novakov-alexey

Another issue is that it doesn't notice when a type-information is missing, because deriveTypeInformation ends up calling itself if necessary.

This is the main issue that needs to be solved by a proper semiauto derivation.

During the refactoring to switch to semiauto derivation, I ran into several problems with the same root cause:

  • when a serializer is missing for a case class used inside an immutable List, it results in a stack overflow error (as described in https://github.com/flink-extended/flink-scala-api/pull/280). List has a recursive structure (see scala.collection.immutable.::), so if it is derived like any regular case class instead of using its specific serializer, it causes this error.
  • other case classes from the Scala standard lib that have their own specific serializer can end up being derived for the same reason, though at least without throwing exceptions: scala.Option, scala.util.Either.

The problem is always the same: the choice between a specific serializer and derivation happens at compile time, and once that choice is made, there is no more flexibility. Derivation is performed on all branches of the tree for which specific serializers were not found at compile time.

The catch is that some specific serializers were not found, not because they didn't exist, but because resolving them required the serializer of a case class, and that serializer had not been declared for semiauto derivation.
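A toy model of this silent fallback, in case it helps the discussion. Nothing here is the library's code: Info, listInfo and fieldInfo are hypothetical stand-ins, and the default-valued implicit parameter only imitates what the macro does when its implicit search for a field fails:

```scala
object FallbackDemo {
  // Hypothetical stand-in for TypeInformation
  trait Info[A] { def describe: String }
  def info[A](s: String): Info[A] = new Info[A] { def describe: String = s }

  // "Specific serializer" for List: only resolvable if the element has an instance
  implicit def listInfo[A](implicit a: Info[A]): Info[List[A]] =
    info(s"List[${a.describe}]")

  final case class Declared(value: String)
  object Declared {
    implicit val declaredInfo: Info[Declared] = info("Declared")
  }

  // No instance declared anywhere for this one
  final case class Undeclared(value: String)

  // Imitates the per-field step of the derivation: use the implicit if the
  // search succeeds, otherwise fall back to generic derivation of the field.
  def fieldInfo[F](name: String)(implicit found: Info[F] = null): Info[F] =
    if (found != null) found else info(s"derived($name)")

  // listInfo resolves because Info[Declared] exists
  val withInstance: String = fieldInfo[List[Declared]]("List").describe
  // listInfo cannot resolve, so List itself is treated as something to derive
  val withoutInstance: String = fieldInfo[List[Undeclared]]("List").describe
}
```

withInstance is "List[Declared]" while withoutInstance is "derived(List)": the compiler reports nothing, and the List ends up on the generic derivation path only because its element instance was missing.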

How can we make the semiauto derivation fail at compile time with a clear message when attempting to derive case classes that should have a specific serializer?

Or the other way around, how can we make specific serializers fail at compile time with a clear message when they miss a derivation?

arnaud-daroussin avatar Sep 02 '25 11:09 arnaud-daroussin

As I understand it, the desired behavior of semi-auto derivation is to not fall back to the generic derivation that this library already offers, but rather to fail fast when a specific TypeInformation has not been defined manually by the user?

novakov-alexey avatar Sep 02 '25 19:09 novakov-alexey