avrogen with multile avro schema files
Description
I woule like to use multiple avro files when generating C# classes with avrogen. Looking at apache.Avro code at: https://github.com/apache/avro/blob/e3e41dcd27822ac3fd4692291d1bbea4f8a294dc/lang/csharp/src/apache/main/Schema/Schema.cs#L210
when reading multipe files it is not possible to reuse previous loaded schema because "new SchemaNames()" is passed.
There is a way to use multipe avro files ?
How to reproduce
Checklist
Please provide the following information:
- [ ] A complete (i.e. we can run it), minimal program demonstrating the problem. No need to supply a project file.
- [ latest] Confluent.Kafka nuget version.
- [ latest] Apache Kafka version.
- [ ] Client configuration.
- [ ] Operating system.
- [ ] Provide logs (with "debug" : "..." as necessary in configuration).
- [ ] Provide broker log excerpts.
- [ ] Critical issue.
you'll need to use GenericRecord. yep, I agree this API is less than ideal.
I did some changes in avrogen, available here https://github.com/alfhv/avro, to make it work on my use case. main changes are:
- allow extra parameter to specify multiple schema files (i use a list of files but it could be a directory) the file order is important, at least in this very early version, to load types references in correct order.
- add new method in AvroGen class:
static void GenSchema(List<string> infiles, string outdir, IEnumerable<KeyValuePair<string, string>> namespaceMapping) - add new method in Schema.cs class:
public static Schema Parse(string json, SchemaNames schemaNames)
it could be possible to allow in some way this functionality with avrogen tool ??
Actually, the current API should be more open so we can overload some methods and code generation behavior, my use case is wider than just use multiple schema files. ex: I have a base schema file (to inherit types from it) and then other schemas files from where I want to generate C# classes with avrogen but ignoring (not generating) classes from base schema file. I have already implemented in some way this behavior in my own version of avrogen but I would like to stay close to original Apache version in some way.
@alfhv your changes helped me a lot! Thank you and congratulations for your initiative.
@mhowlett is there any update here? it's been quite some time, and this is fully supported for other platforms. It'd be great to see some activity in supporting this on the C# side
(even if it's just adopting the changes from @alfhv's fork, maybe nothing more formal is needed)
My fork needs some correction/fixes, unfortunately the updated code is inside a client and I can't get it out. I'll try to reproduce the same on my personal repo.
@alfhv thanks for replying to this. If you want help it's possible I can contribute some. I am frankly quite surprised this functionality isn't natively supported, it's pretty critical
@mhowlett can you or one of the other maintainers follow up on this?
Hello! I'm a contributor on the Avro project. It's one of the slower-moving projects, but new features are welcome!
I'm not following the dotnet SDK, so I haven't seen this or the fork -- have you seen or made a JIRA on the ASF issue tracker? If someone makes a PR, this could be fixed in the next release.
Hi @RyanSkraba, thanks so much for replying! I appreciate it. I did make a Jira: https://issues.apache.org/jira/browse/AVRO-3739. I don't have the time for this right now, but if I find it I'll look to contribute. I was surprised to find the spec is basically just not implemented fully.. thank you for putting some eyes on this 😄
@alfhv I made a fork of your project and merged in apache/avro@master, but found some issues (the cross-file schema generation works, but it generates the shared schema pieces for each top-level schema. This works for single-schema projects, but if you have multiple schemas referencing the same building blocks, you end up with compilation errors because there are duplicate definitions of the same classes / enums / etc.
I have updated my fork with some fixes. @alexrosenfeld10 please give me sample avro files to reproduce duplicate definitions.
@alfhv I have created a repository that replicates my problem. If you could, kindly test your updated code against it, and let me know if you encounter the same issues I was. My testing was done on a fork of your fork, that I updated against the upstream repository, as your fork is several thousand commits out of date. I also made some code changes similar to your recent commit - it's entirely possible I introduced some bugs in my testing.
https://github.com/alexrosenfeld10/avro-sourcegen-csharp
I also left a comment on one of your recent commits here that is relevant to my sample project.
I hope this sample project is helpful for your testing. If you need more details I am happy to discuss and collaborate
@alexrosenfeld10 check with my latest commit on my fork, i'm using your json files (check prebuild event on sample project), generated code works ok for me. note: code still not finish, few things to correct, i'm just trying to check that it actually works.
@alfhv Thanks, I've completed my testing. I'm finding the same issues as the comment above
This works for single-schema projects, but if you have multiple schemas referencing the same building blocks, you end up with compilation errors because there are duplicate definitions of the same classes / enums / etc
In other words, when you define an enum in one schema and reference it in another, it gets generated twice.
Please find this commit detailing the problem. I would encourage you to clone my repo down and follow instructions in the readme to fully understand the issue, if you haven't already.
@alfhv I did some more work on this, it seems like if I use provider.GenerateCodeFromCompileUnit to generate the code it generates duplicate definitions. I'm not sure if there's an easy way to support that. However, using codeGen.WriteTypes seems to resolve the issues, it just requires writing to files which is a bit ugly.
Check out the fixes here - https://github.com/alexrosenfeld10/avro-sourcegen-csharp/pull/1. I left some comments, curious what you think.
I have commited a new version of "multiple avro files" on my fork. there are a few changes I would like having integrated in Avro official branch in order to use multiple schema source files. there is no breaking change at all, also changes open the way to more custom traitement, as I have done on my side but not yet commited in my fork. there is only one limitation: files should be read in order of dependencies, as the avro itself enforce inside a single file where dependencies should be declared in order of reference/use.
@alfhv good news thanks for the update.
How does avro handle dependency ordering in other languages? If it's solved for other languages, I'd imagine apache will reject the PR. I think we can solve it one of two ways:
- Complex approach. Construct a DAG of the dependencies, and check for cycles in the graph. If no cycles, then navigate and load the graph one by one.
- Keep track of failed schemas, retry them after reloading, when we hit max failures (pick some number like 30), give up.
@alexrosenfeld10 I was thinking in second approach which simpler but in any case, the current implementation, as I said, enforce dependency order inside a single file (at least for C#)... so even if we ensure reorder dependency on files, inside a single file we can't, except if we change this also in the main Avro project.
@alfhv hmm I see what you mean. I wonder if other languages force the same problem within files too. Do you know? I'll take a look too.
I guess, if this is the case then Apache should be OK with merging these enhancements in
@alfhv it seems the java implementation is MUCH more robust in almost all ways.. Would require a lot of development to get the C# side up to speed.
https://github.com/apache/avro/blob/master/lang/java/compiler/src/main/java/org/apache/avro/compiler/schema/Schemas.java