confluent-kafka-dotnet icon indicating copy to clipboard operation
confluent-kafka-dotnet copied to clipboard

avrogen with multile avro schema files

Open alfhv opened this issue 6 years ago • 21 comments

Description

I woule like to use multiple avro files when generating C# classes with avrogen. Looking at apache.Avro code at: https://github.com/apache/avro/blob/e3e41dcd27822ac3fd4692291d1bbea4f8a294dc/lang/csharp/src/apache/main/Schema/Schema.cs#L210

when reading multipe files it is not possible to reuse previous loaded schema because "new SchemaNames()" is passed.

There is a way to use multipe avro files ?

How to reproduce

Checklist

Please provide the following information:

  • [ ] A complete (i.e. we can run it), minimal program demonstrating the problem. No need to supply a project file.
  • [ latest] Confluent.Kafka nuget version.
  • [ latest] Apache Kafka version.
  • [ ] Client configuration.
  • [ ] Operating system.
  • [ ] Provide logs (with "debug" : "..." as necessary in configuration).
  • [ ] Provide broker log excerpts.
  • [ ] Critical issue.

alfhv avatar Dec 02 '19 19:12 alfhv

you'll need to use GenericRecord. yep, I agree this API is less than ideal.

mhowlett avatar Dec 02 '19 20:12 mhowlett

I did some changes in avrogen, available here https://github.com/alfhv/avro, to make it work on my use case. main changes are:

  • allow extra parameter to specify multiple schema files (i use a list of files but it could be a directory) the file order is important, at least in this very early version, to load types references in correct order.
  • add new method in AvroGen class: static void GenSchema(List<string> infiles, string outdir, IEnumerable<KeyValuePair<string, string>> namespaceMapping)
  • add new method in Schema.cs class: public static Schema Parse(string json, SchemaNames schemaNames)

it could be possible to allow in some way this functionality with avrogen tool ??

Actually, the current API should be more open so we can overload some methods and code generation behavior, my use case is wider than just use multiple schema files. ex: I have a base schema file (to inherit types from it) and then other schemas files from where I want to generate C# classes with avrogen but ignoring (not generating) classes from base schema file. I have already implemented in some way this behavior in my own version of avrogen but I would like to stay close to original Apache version in some way.

alfhv avatar Dec 08 '19 16:12 alfhv

@alfhv your changes helped me a lot! Thank you and congratulations for your initiative.

gustavoferrazfontes avatar May 12 '21 02:05 gustavoferrazfontes

@mhowlett is there any update here? it's been quite some time, and this is fully supported for other platforms. It'd be great to see some activity in supporting this on the C# side

alexrosenfeld10 avatar Apr 26 '23 12:04 alexrosenfeld10

(even if it's just adopting the changes from @alfhv's fork, maybe nothing more formal is needed)

alexrosenfeld10 avatar Apr 26 '23 12:04 alexrosenfeld10

My fork needs some correction/fixes, unfortunately the updated code is inside a client and I can't get it out. I'll try to reproduce the same on my personal repo.

alfhv avatar Apr 26 '23 19:04 alfhv

@alfhv thanks for replying to this. If you want help it's possible I can contribute some. I am frankly quite surprised this functionality isn't natively supported, it's pretty critical

alexrosenfeld10 avatar Apr 26 '23 20:04 alexrosenfeld10

@mhowlett can you or one of the other maintainers follow up on this?

alexrosenfeld10 avatar May 02 '23 03:05 alexrosenfeld10

Hello! I'm a contributor on the Avro project. It's one of the slower-moving projects, but new features are welcome!

I'm not following the dotnet SDK, so I haven't seen this or the fork -- have you seen or made a JIRA on the ASF issue tracker? If someone makes a PR, this could be fixed in the next release.

RyanSkraba avatar May 02 '23 15:05 RyanSkraba

Hi @RyanSkraba, thanks so much for replying! I appreciate it. I did make a Jira: https://issues.apache.org/jira/browse/AVRO-3739. I don't have the time for this right now, but if I find it I'll look to contribute. I was surprised to find the spec is basically just not implemented fully.. thank you for putting some eyes on this 😄

@alfhv I made a fork of your project and merged in apache/avro@master, but found some issues (the cross-file schema generation works, but it generates the shared schema pieces for each top-level schema. This works for single-schema projects, but if you have multiple schemas referencing the same building blocks, you end up with compilation errors because there are duplicate definitions of the same classes / enums / etc.

alexrosenfeld10 avatar May 02 '23 16:05 alexrosenfeld10

I have updated my fork with some fixes. @alexrosenfeld10 please give me sample avro files to reproduce duplicate definitions.

alfhv avatar May 02 '23 21:05 alfhv

@alfhv I have created a repository that replicates my problem. If you could, kindly test your updated code against it, and let me know if you encounter the same issues I was. My testing was done on a fork of your fork, that I updated against the upstream repository, as your fork is several thousand commits out of date. I also made some code changes similar to your recent commit - it's entirely possible I introduced some bugs in my testing.

https://github.com/alexrosenfeld10/avro-sourcegen-csharp

I also left a comment on one of your recent commits here that is relevant to my sample project.

alexrosenfeld10 avatar May 03 '23 13:05 alexrosenfeld10

I hope this sample project is helpful for your testing. If you need more details I am happy to discuss and collaborate

alexrosenfeld10 avatar May 03 '23 13:05 alexrosenfeld10

@alexrosenfeld10 check with my latest commit on my fork, i'm using your json files (check prebuild event on sample project), generated code works ok for me. note: code still not finish, few things to correct, i'm just trying to check that it actually works.

alfhv avatar May 05 '23 20:05 alfhv

@alfhv Thanks, I've completed my testing. I'm finding the same issues as the comment above

This works for single-schema projects, but if you have multiple schemas referencing the same building blocks, you end up with compilation errors because there are duplicate definitions of the same classes / enums / etc

In other words, when you define an enum in one schema and reference it in another, it gets generated twice.

Please find this commit detailing the problem. I would encourage you to clone my repo down and follow instructions in the readme to fully understand the issue, if you haven't already.

alexrosenfeld10 avatar May 06 '23 11:05 alexrosenfeld10

@alfhv I did some more work on this, it seems like if I use provider.GenerateCodeFromCompileUnit to generate the code it generates duplicate definitions. I'm not sure if there's an easy way to support that. However, using codeGen.WriteTypes seems to resolve the issues, it just requires writing to files which is a bit ugly.

Check out the fixes here - https://github.com/alexrosenfeld10/avro-sourcegen-csharp/pull/1. I left some comments, curious what you think.

alexrosenfeld10 avatar May 06 '23 12:05 alexrosenfeld10

I have commited a new version of "multiple avro files" on my fork. there are a few changes I would like having integrated in Avro official branch in order to use multiple schema source files. there is no breaking change at all, also changes open the way to more custom traitement, as I have done on my side but not yet commited in my fork. there is only one limitation: files should be read in order of dependencies, as the avro itself enforce inside a single file where dependencies should be declared in order of reference/use.

alfhv avatar May 10 '23 19:05 alfhv

@alfhv good news thanks for the update.

How does avro handle dependency ordering in other languages? If it's solved for other languages, I'd imagine apache will reject the PR. I think we can solve it one of two ways:

  1. Complex approach. Construct a DAG of the dependencies, and check for cycles in the graph. If no cycles, then navigate and load the graph one by one.
  2. Keep track of failed schemas, retry them after reloading, when we hit max failures (pick some number like 30), give up.

alexrosenfeld10 avatar May 10 '23 19:05 alexrosenfeld10

@alexrosenfeld10 I was thinking in second approach which simpler but in any case, the current implementation, as I said, enforce dependency order inside a single file (at least for C#)... so even if we ensure reorder dependency on files, inside a single file we can't, except if we change this also in the main Avro project.

alfhv avatar May 10 '23 19:05 alfhv

@alfhv hmm I see what you mean. I wonder if other languages force the same problem within files too. Do you know? I'll take a look too.

I guess, if this is the case then Apache should be OK with merging these enhancements in

alexrosenfeld10 avatar May 10 '23 19:05 alexrosenfeld10

@alfhv it seems the java implementation is MUCH more robust in almost all ways.. Would require a lot of development to get the C# side up to speed.

https://github.com/apache/avro/blob/master/lang/java/compiler/src/main/java/org/apache/avro/compiler/schema/Schemas.java

alexrosenfeld10 avatar May 11 '23 12:05 alexrosenfeld10