cobrix
Unifying readers
I'm trying to write a JSON reader based on the new split of packages, but am finding that I'm repeating a lot of the code on the Spark side, so I was wondering whether we can come up with a more generic way to deal with creating readers (fixed vs var len) and bring those methods to the Reader trait. I tried doing something along those lines in https://github.com/tr11/cobrix/tree/unify-readers but I'm not sure if that's the best approach. Ideally, I was looking for a solution that
- creates an iterator of records that is independent of the type of file we have, so all the methods are defined in the trait rather than in specialized Var vs. Fixed readers
- pushes all the logic to the reader package, including parameter processing, etc.
Any thoughts on this?
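To make the two goals above concrete, here is a rough Scala sketch of what I have in mind. Everything in it is hypothetical and is not the current Cobrix API: `SimpleStream` here is a simplified stand-in, `ReaderParameters` and `Readers.fromParameters` are invented for illustration, and the variable-length format is a toy one (a 1-byte length prefix), not a real mainframe record header.

```scala
// Hypothetical sketch: none of these signatures are the actual Cobrix API.

// Minimal byte-stream abstraction so readers do not depend on Spark or HDFS.
trait SimpleStream {
  // Returns fewer bytes than requested once the end of the stream is reached.
  def next(numberOfBytes: Int): Array[Byte]
}

final class ByteArrayStream(data: Array[Byte]) extends SimpleStream {
  private var pos = 0
  def next(numberOfBytes: Int): Array[Byte] = {
    val end = math.min(pos + numberOfBytes, data.length)
    val chunk = java.util.Arrays.copyOfRange(data, pos, end)
    pos = end
    chunk
  }
}

// Parameter processing lives next to the readers (assumed shape).
final case class ReaderParameters(recordLength: Option[Int] = None)

// The only method callers need, regardless of the file type.
trait Reader {
  def getRecordIterator(stream: SimpleStream): Iterator[Array[Byte]]
}

final class FixedLenReader(recordLength: Int) extends Reader {
  require(recordLength > 0, "recordLength must be positive")
  def getRecordIterator(stream: SimpleStream): Iterator[Array[Byte]] =
    Iterator
      .continually(stream.next(recordLength))
      .takeWhile(_.length == recordLength) // drops a trailing partial record
}

// Toy variable-length format: each record is prefixed by a 1-byte length.
final class VarLenReader extends Reader {
  def getRecordIterator(stream: SimpleStream): Iterator[Array[Byte]] =
    Iterator
      .continually {
        val header = stream.next(1)
        if (header.isEmpty) None
        else {
          val len = header(0) & 0xFF
          val body = stream.next(len)
          if (body.length == len) Some(body) else None
        }
      }
      .takeWhile(_.isDefined)
      .map(_.get)
}

// Chooses the reader from the parameters, so callers never branch on file type.
object Readers {
  def fromParameters(params: ReaderParameters): Reader =
    params.recordLength match {
      case Some(len) => new FixedLenReader(len)
      case None      => new VarLenReader
    }
}
```

With something like this, the Spark side would only call `Readers.fromParameters(params).getRecordIterator(stream)` and would not need to know which kind of file it is reading.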
I like the idea of abstracting away all Spark and HDFS specifics in dealing with various file formats. I'm going to try to come up with a solution and then compare it with yours.
I gave it a little thought according to the ideas that you've proposed. Here is my view of the target architecture of Cobrix that minimizes code duplication and holds all of the generic logic in cobol-parser:
CobrixComponents.pptx
I haven't tried to apply this to the code yet, and I haven't looked at your current solution either. Will do it soon.
Dotted lines denote a "uses" relation, solid lines denote "contains", and triangle arrows denote inheritance or instantiation of generics.
At first glance, it seems similar to what I did -- I brought the SimpleStream into the FixedLenReader class instead of letting it interact directly with a bytes array. From that point, it was a matter of making sure the getRecordIterator has the same signature for the FixedLen and VarLen classes. This is the part that could use some work since the fixed len doesn't use those extra parameters.
Looking at it a second time, this is similar but not exactly what I did. It's what I was trying to do, but the flow of what I have seems to mix the IndexBuilder, RawRecordIterator, and ParsedRecordIterator.
The rest of the diagram seems in line with what we already have with the generic extractor and the Extractor<Row> adapter for spark-cobol.
I like the naming RawRecordIterator, and ParsedRecordIterator -- I'll make that change on my branch too.
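For reference, this is roughly how I picture the raw/parsed split working together with a generic extractor. It is only a sketch under my own assumptions: the class names follow the diagram, but the constructor signatures, the `RecordHandler` trait, and the `decode` function are made up for illustration, and the raw iterator handles fixed-length records only.

```scala
// Hypothetical sketch: names follow the diagram, signatures are assumptions.

// Yields undecoded records; it only knows about record boundaries.
final class RawRecordIterator(data: Array[Byte], recordLength: Int)
    extends Iterator[Array[Byte]] {
  require(recordLength > 0, "recordLength must be positive")
  private var pos = 0
  def hasNext: Boolean = pos + recordLength <= data.length
  def next(): Array[Byte] = {
    if (!hasNext) throw new NoSuchElementException("no more records")
    val record = java.util.Arrays.copyOfRange(data, pos, pos + recordLength)
    pos += recordLength
    record
  }
}

// Builds the target row type from decoded field values.
trait RecordHandler[T] {
  def create(values: Seq[Any]): T
}

// Decodes raw records and delegates row construction to the handler,
// so the parsing logic stays independent of Spark.
final class ParsedRecordIterator[T](
    raw: Iterator[Array[Byte]],
    decode: Array[Byte] => Seq[Any],
    handler: RecordHandler[T]
) extends Iterator[T] {
  def hasNext: Boolean = raw.hasNext
  def next(): T = handler.create(decode(raw.next()))
}

// Plain adapter that returns the values as-is; a spark-cobol adapter
// would build a Row here instead.
object SeqHandler extends RecordHandler[Seq[Any]] {
  def create(values: Seq[Any]): Seq[Any] = values
}
```

The point of the split is that an index builder could consume the raw iterator alone, while the parsed iterator wraps any raw iterator (fixed or variable length) without caring where the bytes came from.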
👍 Cool. I'll take a look at the branch soon.
Sorry it took so long. It looks like a step in the right direction. I was thinking that after we align our implementations, I can take a few days to focus on Cobrix and refactor it to match the desired architecture.