avro icon indicating copy to clipboard operation
avro copied to clipboard

AVRO-4069: Remove Reader String Cache from Generic Datum Reader

Open belugabehr opened this issue 1 year ago • 8 comments

What is the purpose of the change

  • This pull request improves file read performance by removing an unnecessary caching level, fixing AVRO-4069.

Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

belugabehr avatar Oct 01 '24 03:10 belugabehr

@martin-g Thanks so much for your support on these PRs. Here's another one that needs attention that has a positive performance impact.

belugabehr avatar Oct 08 '24 23:10 belugabehr

stringClassCache was first added in https://github.com/apache/avro/commit/a3b38ebfdaf93eb77860fe80540bbc788e9589df, as part of AVRO-1268 implementation.

AVRO-3531 was a thread safety bug. Its fix https://github.com/apache/avro/pull/1719 changed the types of the caches and moved them into static class ReaderCache, but did not add any new caches.

I don't know whether stringClassCache is beneficial now; but AVRO-3531 doesn't look like a reason to keep it.

KalleOlaviNiemitalo avatar Oct 09 '24 10:10 KalleOlaviNiemitalo

@KalleOlaviNiemitalo @martin-g

Thank you for the correspondence.

I'm taking another look at this. It seems less than ideal that there is synchronized code paths for fast-reading Avro data. I am not sure why this Collection was ever updated to be synchronized as this code is inherently not thread-safe. The GenericDatumReader read() method accepts a Decoder. Decoders are certainly not thread safe as they typically have unsynchronized source reads and internal buffer manipulation.

If we can relax this requirement then performance is measurably better as a read path will have no synchronization overhead.

belugabehr avatar Jan 04 '25 02:01 belugabehr

I am taking another pass at this one. The performance improvements are just too significant to pass up (11%+ in my local benchmarks). Here are my benchmarks:

# JMH version: 1.35
# VM version: JDK 17.0.13, OpenJDK 64-Bit Server VM, 17.0.13+11-Ubuntu-2ubuntu124.04
# VM invoker: /usr/lib/jvm/java-17-openjdk-amd64/bin/java

# This is from AVRO-4069 (the worst result I measured for this branch)
Result "com.szymon_kaluza.protobuf_avro.benchmark.AvroBench.serializeAndDeserializeSmall":
  889717.049 ±(99.9%) 9103.527 ops/s [Average]
  (min, avg, max) = (860194.914, 889717.049, 909429.461), stdev = 13625.731
  CI (99.9%): [880613.522, 898820.576] (assumes normal distribution)
  
  
# This is from master (the best result I measured for this branch)
Result "com.szymon_kaluza.protobuf_avro.benchmark.AvroBench.serializeAndDeserializeSmall":
  796084.662 ±(99.9%) 12190.142 ops/s [Average]
  (min, avg, max) = (764448.008, 796084.662, 850984.160), stdev = 18245.632
  CI (99.9%): [783894.521, 808274.804] (assumes normal distribution)

The reason I even investigated this is because Visual VM profiling was lighting up on IdentityHashMap usage. By migrating to HashMap, I was able to achieve most of the performance gains. Also, another thing that I hadn't considered before is that IdentityHashMap uses object reference-equality for insertions and lookups. This means that the Map can look different for each instantiation of Avro because obviously the reference values will be different every time it runs. This made benchmarking kind of tricky and unreliable as well. If the keys just happened to have a lot of collisions, then performance would suffer, alternatively, benchmarking improved if the keys happen to be laid out more evenly in the bucket space. Yet another reason to simply get off this implementation.

There is one caveat in the JavaDocs:

For many JRE implementations and operation mixes, this class will yield better performance than HashMap (which uses chaining rather than linear-probing).

  • https://docs.oracle.com/javase/8/docs/api/java/util/IdentityHashMap.html

I get around this by exploiting the fact that the number of items in the map is fixed as instantiation. Therefore, I use a Map size sufficiently large such that there are no key collisions and therefore no chaining involved.

belugabehr avatar Feb 02 '25 04:02 belugabehr

One thing to note is that I had to change the Map type. The class key uses reference based equals/hashcode and so had to convert to String to get a consistent key type.

belugabehr avatar Feb 02 '25 04:02 belugabehr

@martin-g

belugabehr avatar Feb 02 '25 04:02 belugabehr

Sorry, I am not familiar with this code to be able to tell whether the changes are good or not.

martin-g avatar Feb 03 '25 08:02 martin-g

@martin-g Thanks for the pass. Any ideas who else might be available to review?

belugabehr avatar Mar 08 '25 18:03 belugabehr