Intermittent build failure with import idl from classpath
My builds intermittently fail on import idl statements that load from classpath resources. When they fail, I have to kill the Gradle daemon to get the build working again.
To Reproduce
I have a project with 3+ submodules: a common module that contains a bunch of enums, plus the other modules that share them.
For instance:
@namespace("com.mynamespace.common")
protocol ColorsProtocol {
  enum Colors {
    RED, YELLOW, BLUE
  }
}
Then my other modules reuse this enum like this:
import idl "com/mynamespace/common/Colors.avdl";
When my build fails, it fails with this stack trace:
Caused by: org.apache.avro.compiler.idl.ParseException: Encountered "<EOF>" at line 1, column 0.
Was expecting one of:
"protocol" ...
"@" ...
at org.apache.avro.compiler.idl.Idl.generateParseException(Idl.java:1700)
at org.apache.avro.compiler.idl.Idl.jj_consume_token(Idl.java:1579)
at org.apache.avro.compiler.idl.Idl.ProtocolDeclaration(Idl.java:219)
at org.apache.avro.compiler.idl.Idl.CompilationUnit(Idl.java:117)
at org.apache.avro.compiler.idl.Idl.ImportIdl(Idl.java:452)
at org.apache.avro.compiler.idl.Idl.ProtocolBody(Idl.java:338)
at org.apache.avro.compiler.idl.Idl.ProtocolDeclaration(Idl.java:227)
at org.apache.avro.compiler.idl.Idl.CompilationUnit(Idl.java:117)
at com.github.davidmc24.gradle.plugin.avro.GenerateAvroProtocolTask.processIDLFile(GenerateAvroProtocolTask.java:96)
... 94 more
I'll try to replicate this in test-project, but I'm not sure I'll be able to, since the error is intermittent.
Expected behavior
Reliable builds and all that.
Environment (please complete the following information):
- Gradle Version [e.g. 6.6.1]
- Apache Avro Version [e.g. 1.10.2]
- Gradle-Avro Plugin Version [e.g. 1.2.1]
- Java Version [e.g. 11.0.6]
- OS: Ubuntu 20.04
I'm sorry to hear you're having issues. Intermittent problems are always the trickiest to diagnose.
Assuming (based on your report) that it is indeed intermittent behavior in whether imported files can be loaded from the classpath, I see 3 possibilities:
- There's a (somewhat surprising) non-deterministic behavior in the Avro code after the plugin calls Idl.CompilationUnit.
- There's something about the build that makes the system classloader's behavior vary.
- There's something about the build that makes the paths passed to the GenerateAvroProtocolTask task's classpath vary.
If it were the first option, I expect it would be in here somewhere, but I'm not seeing it.
The third option seems most likely. You could probably confirm/deny that hypothesis by adding some debug logging to GenerateAvroProtocolTask.assembleClassLoader() in a private build of the plugin, running the build with --debug, and comparing the results from a successful and failed run. If the classpath entries are the same, that means it isn't the third option. If the classpath entries differ, how they differ might be informative in what to look for in your build script to fix the problem.
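For illustration, the kind of logging I mean might look roughly like this; the method shape and the getClasspath() accessor here are guesses based on this thread, not the plugin's verbatim code:
private ClassLoader assembleClassLoader() throws MalformedURLException {
    List<URL> urls = new ArrayList<>();
    for (File file : getClasspath().getFiles()) {
        // Log each classpath entry as it is added to the loader.
        getLogger().info("Assembling classpath with {}", file);
        urls.add(file.toURI().toURL());
    }
    return urls.isEmpty() ? ClassLoader.getSystemClassLoader()
            : new URLClassLoader(urls.toArray(new URL[0]), ClassLoader.getSystemClassLoader());
}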
You could probably apply a similar technique to inspect the contents of the system classloader, if it looks like it might be the second option. ClassLoaderUtils has an example of getting path info out of a URLClassLoader.
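For reference, a minimal sketch of that technique, assuming the loader in question really is a URLClassLoader (on Java 9+, the application class loader no longer is one):
ClassLoader loader = ClassLoader.getSystemClassLoader();
if (loader instanceof URLClassLoader) {
    // Print each classpath entry the loader will search.
    for (URL url : ((URLClassLoader) loader).getURLs()) {
        System.out.println(url);
    }
}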
Thank you for your help. I looked at item 3 closely. The list of files on my classpath is consistent; my logging looks like this:
Assembling classpath with /project-dir/common/build/libs/common-1.20.2.jar
Assembling classpath with ~/.gradle/caches/modules-2/files-2.1/io.confluent/kafka-avro-serializer/5.5.1/e5998aab50d8b89d82eaddf7d844769f78f0066e/kafka-avro-serializer-5.5.1.jar
...
From there I wrapped the classloader and logged each getResource call and the URL it returned; the URL I got back was consistently jar:file:/project-dir/common/build/libs/common-1.20.2.jar!/com/mynamespace/common/Colors.avdl
This looked fine, so I started reading the first few bytes from the URL:
// Tail of assembleClassLoader(): wrap the assembled loader so every
// getResource call is logged along with the first bytes of the resource.
return new DelegatingClassloader(urls.isEmpty() ? ClassLoader.getSystemClassLoader()
        : new URLClassLoader(urls.toArray(new URL[0]), ClassLoader.getSystemClassLoader()));
}

private class DelegatingClassloader extends ClassLoader {
    public DelegatingClassloader(ClassLoader parent) {
        super(parent);
    }

    @Override
    public URL getResource(String name) {
        URL resource = super.getResource(name);
        getLogger().info("getResource: " + name + " returned: " + resource);
        if (resource != null) {
            byte[] firstChars = new byte[64];
            try (InputStream stream = resource.openStream()) {
                // Guard against an empty stream (read returns -1).
                int read = Math.max(stream.read(firstChars, 0, 64), 0);
                String firstSegment = new String(firstChars, 0, read);
                getLogger().info("first64: " + firstSegment.split("\n")[0]);
            } catch (IOException e) {
                getLogger().info("unable to read chars from: " + name, e);
            }
        }
        return resource;
    }
}
When the IDL parsing failed, I got this output from my debug logging; I think this is the root cause:
java.util.zip.ZipException: ZipFile invalid LOC header (bad signature)
at java.base/java.util.zip.ZipFile$ZipFileInputStream.initDataOffset(ZipFile.java:1003)
at java.base/java.util.zip.ZipFile$ZipFileInputStream.read(ZipFile.java:1013)
at java.base/java.util.zip.ZipFile$ZipFileInflaterInputStream.fill(ZipFile.java:468)
at java.base/java.util.zip.InflaterInputStream.read(InflaterInputStream.java:159)
at java.base/java.io.FilterInputStream.read(FilterInputStream.java:133)
at com.github.davidmc24.gradle.plugin.avro.GenerateAvroProtocolTask$DelegatingClassloader.getResource(GenerateAvroProtocolTask.java:145)
at org.apache.avro.compiler.idl.Idl.findFile(Idl.java:101)
at org.apache.avro.compiler.idl.Idl.ImportIdl(Idl.java:450)
at org.apache.avro.compiler.idl.Idl.ProtocolBody(Idl.java:338)
at org.apache.avro.compiler.idl.Idl.ProtocolDeclaration(Idl.java:227)
at org.apache.avro.compiler.idl.Idl.CompilationUnit(Idl.java:117)
at com.github.davidmc24.gradle.plugin.avro.GenerateAvroProtocolTask.processIDLFile(GenerateAvroProtocolTask.java:97)
I think there is a race condition between my submodule's jar being written and that jar being read to pull my resource out. Part of what I couldn't understand about the issue was that I had to restart my Gradle daemon to recover from the problem. I stumbled on this thread in the Gradle forums, and the workaround for that issue also solves this problem. I think that confirms the theory.
As a hack, I tried disabling the URL cache (the workaround from the Gradle forum thread: new URL("jar:file://valid_jar_url_syntax.jar!/").openConnection().setDefaultUseCaches(false)) whenever the read of the first few characters failed. If I then retried the read before letting the getResource call return, I could get things "working".
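Spelled out, the cache-disabling part of the hack looks roughly like this (sketch only; the URL is a throwaway that just needs valid jar-URL syntax):
static void disableUrlConnectionCaching() throws IOException {
    URLConnection conn = new URL("jar:file://valid_jar_url_syntax.jar!/").openConnection();
    // Disables default caching for URL connections JVM-wide, so jar: reads
    // stop being served from an already-open, possibly stale JarFile.
    conn.setDefaultUseCaches(false);
}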
I think this means that there are a few separate problems.
First, Gradle shouldn't be running tasks that depend on other resources before those resources are actually available. I think this falls under Gradle bugs, because the dependency between the task and the runtime classpath is expressed in AvroPlugin.java, and I think that is all that is needed for a jar dependency between modules? See the sketch below.
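Something like this is what I mean by the dependency being expressed; the configureEach/setClasspath shape is my guess at the wiring, not the plugin's verbatim code:
// Hypothetical sketch of the wiring in AvroPlugin.java: making the resolved
// runtimeClasspath jars an input of the task should force :common:jar to
// run before this task.
project.getTasks().withType(GenerateAvroProtocolTask.class).configureEach(task ->
        task.setClasspath(project.getConfigurations().getByName("runtimeClasspath")));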
Second, I think the caching of jar URL reads is problematic for this plugin and its interaction with submodules. If the Gradle daemon is caching jar:file:/project-dir/common/build/libs/common-1.20.2.jar!/com/mynamespace/common/Colors.avdl (which it appears to be doing), then updates to that jar won't get picked up.
Another minor thing that I investigated and determined wasn't relevant but might still technically be a bug was the use of the system classloader as the parent in GenerateAvroProtocolTask. This would leak the buildscript classpath onto the build?
Good investigation; I think we're making some progress. Depending on the runtime classpath should be sufficient, but I'll verify that.
You might be right that using the system class loader as a parent may be a bad idea. I'll consider removing that. Out of curiosity, does removing the parent class loader fix your problem?
For the potential Gradle daemon URL caching issue, it might be a good idea to see if the Gradle build tool team has any more recent ideas, probably by submitting a new Gradle GitHub issue.
Removing the parent class loader did not change anything.
I'll see if I can log an issue over there and then link back here.
So, specifying system ClassLoader as the parent or not specifying it is the same thing... the system ClassLoader is the default delegation parent ClassLoader. I've removed the explicit specification of the parent ClassLoader, as it's redundant. I've also added a bit of debug logging which will be included in the next release.
I set up a simple test project to try to determine if the configuration classpath dependency is sufficient. As far as I can tell, it appears to be. If I explicitly run just my :user:build task, :common:jar runs before it. Adding a sleep to the common project still results in the build waiting for it to complete, regardless of whether --parallel is enabled. A build scan seems to reinforce that there is indeed a dependency established.
> So, specifying system ClassLoader as the parent or not specifying it is the same thing... the system ClassLoader is the default delegation parent ClassLoader. I've removed the explicit specification of the parent ClassLoader, as it's redundant.
I think the behavior I'd expect as a user is to explicitly say that the parent classloader is null. If you can't resolve the resource from the classpath of the URLs you are specifying, then you don't want to resolve it at all, right?
> I think the behavior I'd expect as a user is to explicitly say that the parent classloader is null. If you can't resolve the resource from the classpath of the URLs you are specifying, then you don't want to resolve it at all, right?
That seems reasonable. I've queued up that change for the next release.
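Concretely, the queued change amounts to something like this sketch (urls being the list assembled from the task's classpath, as shown earlier in this thread):
// A null parent skips delegation to the application class loader, so
// resources resolve only from the given URLs (plus the bootstrap loader).
ClassLoader isolated = new URLClassLoader(urls.toArray(new URL[0]), null);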
The implementation of jar file caching, as I read it here, seems relevant: https://github.com/JetBrains/jdk8u_jdk/blob/master/src/share/classes/sun/net/www/protocol/jar/JarURLConnection.java#L122
It looks to me like caching keeps the jar file open, which would mean any subsequent reads of that file come from the older version of the file (at least on Linux/Unix; I think you might see file locking issues on Windows?).
This is consistent with some of the behavior I've seen when making updates; occasionally it seems like the updates just don't take.
Really, this caching behavior seems like the root of the problem.
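A small standalone sketch of the behavior I mean (the jar path is made up; JarURLConnection and JarFile are standard JDK classes):
import java.net.JarURLConnection;
import java.net.URL;

public class JarCacheDemo {
    public static void main(String[] args) throws Exception {
        // Made-up path; point it at any jar that exists locally to try this.
        URL url = new URL("jar:file:/tmp/common.jar!/com/mynamespace/common/Colors.avdl");
        JarURLConnection c1 = (JarURLConnection) url.openConnection();
        JarURLConnection c2 = (JarURLConnection) url.openConnection();
        // With caching enabled (the default), both connections share one
        // cached, open JarFile, so rewriting the jar on disk between reads
        // is not reflected in later reads.
        System.out.println(c1.getJarFile() == c2.getJarFile()); // expect: true
    }
}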
I think the reason I personally am seeing this problem so often is that I am using this plugin in my build: https://plugins.gradle.org/plugin/com.gorylenko.gradle-git-properties. The plugin adds git info into jars as they are built, which means jars are rebuilt far more often than normal.
Lots of builds means lots of concurrency on common.jar, so if there is a race between flushing to the filesystem and handing off the task, I am more likely to see it than other users.
Disabling URL caching would be one fix. Running the Avro build outside of the daemon would also be a solution (I think you have architecture docs proposing that?).
Yes, version 2.0 (if I ever finish it) will run the Avro logic in a separate JVM from the build. The main reason for that is to allow supporting arbitrary Avro versions, but it would also help with problems related to corrupted JVMs.
So I just found out that the project is actually using version 0.1.5.
If I look at the codebase for that tag, I think it's indeed an issue there, so I'm currently upgrading the project.
Yeah, using an older version might explain it. Let me know how the upgrade goes.
As expected, that fixes it.

It's a project that we took over from someone else, and Gradle is not my main build tool. I didn't realize that we were using a 4-year-old dependency. I did try to find out why we hadn't updated it, and we think it's because the package got moved and nobody looked into why the build started failing after upgrading the version, so it was just kept on the old version.
I'm really sorry about the wasted time, but thanks for all your help.
> There may be cases where the schema files contain inline type definitions and it is undesirable to modify them. In this case, the plugin will automatically recognize any duplicate type definitions and check if they match. If any conflicts are identified, it will cause a build failure.
Is there a way to have multiple protocols use the same record type, i.e., import the same definition and reuse it? We have several protocols using the same enum, and this used to work.
I believe that statement was with regard to standalone avsc files. The IDL supports import statements, which hopefully address this need.