vulcan icon indicating copy to clipboard operation
vulcan copied to clipboard

Some issues migrating from Avro4s

Open erikvanoosten opened this issue 3 years ago • 16 comments

Because Avro4s is too slow in writing binary (for our use case), we want to migrate to Vulcan. Compared to Avro4s the performance is very good! We see 10 to 12 times the throughput on a single thread.

Unfortunately, we do have some problems when migrating from Avro4s:

  1. optional union types get nested --> solved in #450
  2. missing the field mapper
  3. AvroName annotation is ignored --> split to #443
  4. field defaults are ignored
  5. generic parameters are not included in a record name --> proposal in #444

Optional union types get nested

  def main(args: Array[String]): Unit = {
    import vulcan.Codec
    import vulcan.generic._

    @AvroNamespace("ns")
    sealed abstract class Location extends Product with Serializable
    @AvroNamespace("ns")
    case class PostalAddress(city: String) extends Location
    @AvroNamespace("ns")
    case class GeographicLocation(lat: Double, long: Double) extends Location
    @AvroNamespace("ns")
    case class Thing(location: Option[Location])
    implicit val locationCodec: Codec[Location] = Codec.derive
    implicit val ThingCodec: Codec[Thing] = Codec.derive
    println(ThingCodec.schema.left.get.message)
  }

prints:

org.apache.avro.AvroRuntimeException: Nested union: [
    "null",
    [
        {"type":"record","name":"GeographicLocation","namespace":"ns","fields":[{"name":"lat","type":"double"},{"name":"long","type":"double"}]},
        {"type":"record","name":"PostalAddress","namespace":"ns","fields":[{"name":"city","type":"string"}]}
    ]
]

(with newlines added for clarity)

I expected Vulcan to silently add the null type to the union, not create a new union with the 'Location' union nested in it.

Missing a field mapper

Unfortunately we need to use case classes that have weird field names. They contain weird characters like @ and :. With Avro4s we defined a field mapper that would map these to valid names.

As a work around we could use the @AvroName annotation on the case class fields. Unfortunately we are still not allowed to use the @ character in the field name even though it is no longer relevant for the schema. This is probably related to the next item:

AvroName annotation is ignored

  def main(args: Array[String]): Unit = {
    import vulcan.Codec
    import vulcan.generic._

    @AvroNamespace("ns")
    case class Item(@AvroName("id2") id1: String)
    println(Codec.derive[Item].schema.right.get.toString(true))
  }

prints:

{
  "type" : "record",
  "name" : "Item",
  "namespace" : "ns",
  "fields" : [ {
    "name" : "id1",       // <----- "id1" instead of expected "id2"
    "type" : "string"
  } ]
}

Field defaults are ignored

  def main(args: Array[String]): Unit = {
    import vulcan.Codec
    import vulcan.generic._

    @AvroNamespace("ns")
    case class Item(id1: String = "foo")
    println(Codec.derive[Item].schema.right.get.toString(true))
  }

expected:

{
  "type" : "record",
  "name" : "Item",
  "namespace" : "ns",
  "fields" : [ {
    "name" : "id1",
    "type" : "string",
    "default": "foo"       // <-- expected but not present
  } ]
}

Generic parameters are not included in a record name

  def main(args: Array[String]): Unit = {
    import vulcan.Codec
    import vulcan.generic._

    @AvroNamespace("ns")
    case class SomeData(foo: String)
    @AvroNamespace("ns")
    case class Event[A](data: A)
    implicit val someDataCodec: Codec[SomeData] = Codec.derive
    println(Codec.derive[Event[SomeData]].schema.right.get.toString(true))
  }

Prints: Event, Avro4s gives Event__SomeData.

The Avro4s approach has the advantage of giving each type instance of Event a different name. This is relevant if the consumer of the schema uses code generation.

Note that we can not solve this with an @AvroName annotation as then the name would be same for every type instance of Event. (If that is what @AvroName is supposed to do, currently it does not seem to change the name when set on a case class.)

All code in this issue tested with Vulcan 1.3.8, Scala 2.12.15.

erikvanoosten avatar Apr 25 '22 13:04 erikvanoosten

If you like, with pointers into the code, I can attempt to fix some of these things.

erikvanoosten avatar Apr 25 '22 14:04 erikvanoosten

That would be great for the AvroName feature (I've opened a ticket at #443). The relevant bit of code (for Scala 2) is at https://github.com/fd4s/vulcan/blob/26aaa0425821156c5b17b4d13491f630ffe50caa/modules/generic/src/main/scala-2/vulcan/generic/package.scala#L88-L92 - you should be able to follow the pattern from AvroDoc for AvroName. It also needs the same to be done in the scala 3 version and a test to be added.

I have something in progress for flattening nested union codecs, I'll dig it out.

bplommer avatar Apr 25 '22 14:04 bplommer

Re field mappers, the general philosophy in Vulcan is to encourage writing codecs manually and keep generic derivation as lightweight as possible, but I'm open to being persuaded 🙂

bplommer avatar Apr 25 '22 14:04 bplommer

I have added a 4th issue.

erikvanoosten avatar Apr 25 '22 15:04 erikvanoosten

I have added a 5th issue. Also, I added a rationale of why we want to move to Vulcan (performance!).

Re. fieldmapper and keeping generic derivation lightweight, we have a workaround so I won't press the issue :)

Actually, we now have workarounds for all 5 issues. Some are not so pretty, in particular we would like to see issues 1 and 4 be resolved. I can look at no. 4 next week.

erikvanoosten avatar Apr 26 '22 17:04 erikvanoosten

I'd like to keep the existing behaviour for defaults as it is, but we could add an @AvroDefault annotation to opt in to using the default field value from the case class. @AvroNullDefault does something similar in a more limited way.

bplommer avatar Apr 26 '22 17:04 bplommer

WDYT of adding variants of the derive method? (Or add a parameter to it?) I have not looked at the code much, I have no idea how hard that would be.

I have to follow a schema definitions that demands dozens of records. It is maintained by multiple people. If we need to put an annotation on every case class, there is a good chance one will be forgotten. In addition, I'd prefer to keep the Vulcan dependency out of the model code (but that is not a hard requirement).

erikvanoosten avatar Apr 26 '22 20:04 erikvanoosten

WDYT of adding variants of the derive method? (Or add a parameter to it?) I have not looked at the code much, I have no idea how hard that would be.

I'd be down for that. Should be feasible. I'll try to look at your PR and write up some more thoughts in the next few days.

bplommer avatar May 05 '22 07:05 bplommer

Compared to Avro4s the performance is very good! We see 10 to 12 times the throughput on a single thread. Even better, Avro4s seems to be doing some synchronization because we can't get higher throughput with multiple threads.

Interesting! The main issues with avro4s that motivated creating Vulcan were to do with slow compile time and undesired results from always-on autoderivation but I don't remember runtime performance being a problem. Have you opened an issue with avro4s?

bplommer avatar May 05 '22 07:05 bplommer

Compared to Avro4s the performance is very good! We see 10 to 12 times the throughput on a single thread. Even better, Avro4s seems to be doing some synchronization because we can't get higher throughput with multiple threads.

Interesting! The main issues with avro4s that motivated creating Vulcan were to do with slow compile time and undesired results from always-on autoderivation but I don't remember runtime performance being a problem. Have you opened an issue with avro4s?

Actually, we have since found out that we were accidentally running our code in a mutex so no multi-threaded speedup was possible. I have to update this part of the story.

Update: removed this part from the description.

erikvanoosten avatar May 05 '22 08:05 erikvanoosten

Re. no 4, from https://github.com/softwaremill/magnolia#limitations (on the Scala 3 version):

Magnolia is not currently able to access default values for case class parameters.

Fortunately, the next version of Magnolia will support it :)

erikvanoosten avatar May 05 '22 18:05 erikvanoosten

Proposal to solve no. 5 in PR #447

erikvanoosten avatar May 09 '22 18:05 erikvanoosten

We now run in production with a fork that contains #444 (without config) and #450. It works pretty well.

I did not get a chance to work on adding defaults yet. Although I found out that it is not that urgent. This is because we add null defaults to the schema before we use it (see https://github.com/sksamuel/avro4s/issues/714#issuecomment-1092787180). This makes it compatible with the schema generated by Avro4s.

erikvanoosten avatar May 26 '22 06:05 erikvanoosten

Hi, I would like to know more about performance difference between vulcan and avro4s. I currently use avro4s in a kafka streams application for reflection based schema generation. Is performance one of the primary considerations for the creation of this project or was it an afterthought in comparison to more of the developer experience based concerns?

yzia2000 avatar Jun 14 '22 06:06 yzia2000

hi @erikvanoosten, did you find a solution for no. 4?

Zhen-hao avatar Aug 24 '23 19:08 Zhen-hao

hi @erikvanoosten, did you find a solution for no. 4?

@Zhen-hao Since I have a work around (see https://github.com/fd4s/vulcan/issues/442#issuecomment-1138223835), I no longer worked on a proper solution. However, now that Magnolia exposes default values, it could be baked directly into Vulcan.

erikvanoosten avatar Aug 31 '23 18:08 erikvanoosten