cobrix Unifying readers

I'm trying to write a JSON reader based on the new split of packages, but I find myself repeating a lot of the code from the Spark side. Could we come up with a more generic way to create readers (fixed vs. variable length) and move those methods into the Reader trait? I tried something along those lines in https://github.com/tr11/cobrix/tree/unify-readers, but I'm not sure it's the best approach. Ideally, I'm looking for a solution that:

- creates an iterator of records that is independent of the type of file we have, so all the methods are defined in the trait rather than in the specialized Var and Fixed readers
- pushes all the logic to the reader package, including parameter processing, etc.

Any thoughts on this?
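
To make the first point concrete, here is a minimal sketch of what "one iterator, defined in the trait" could look like. Every name in it (ByteSource, RecordReader, FixedLengthReader, LengthPrefixedReader) is made up for illustration and is not the actual Cobrix API, and the 1-byte length prefix is only a toy stand-in for a real variable-length format.

```scala
// Hypothetical names throughout; none of this is the actual Cobrix API.

/** Minimal byte source, standing in for whatever stream abstraction the reader package uses. */
trait ByteSource {
  /** Returns up to `n` bytes; fewer (possibly zero) at end of input. */
  def next(n: Int): Array[Byte]
}

/** In-memory source, so the sketch needs no Spark or HDFS dependency. */
final class ArrayByteSource(data: Array[Byte]) extends ByteSource {
  private var pos = 0
  def next(n: Int): Array[Byte] = {
    val end = math.min(pos + n, data.length)
    val chunk = java.util.Arrays.copyOfRange(data, pos, end)
    pos = end
    chunk
  }
}

/** The single entry point that both reader kinds share. */
trait RecordReader {
  def recordIterator(source: ByteSource): Iterator[Array[Byte]]
}

/** Fixed-length records: every record is exactly `recordSize` bytes.
  * A trailing partial record is dropped. */
final class FixedLengthReader(recordSize: Int) extends RecordReader {
  def recordIterator(source: ByteSource): Iterator[Array[Byte]] =
    Iterator.continually(source.next(recordSize)).takeWhile(_.length == recordSize)
}

/** Variable-length records: a 1-byte length prefix followed by the payload
  * (an assumed toy format, chosen only to keep the example short). */
final class LengthPrefixedReader extends RecordReader {
  def recordIterator(source: ByteSource): Iterator[Array[Byte]] =
    Iterator
      .continually(source.next(1))
      .takeWhile(_.length == 1)
      .map(header => source.next(header(0) & 0xFF))
}
```

With something like this, a caller (a Spark adapter, a JSON writer, etc.) would only ever see `RecordReader` and an `Iterator[Array[Byte]]`, regardless of which concrete reader produced it.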

Apr 18 '20 15:04 tr11

I like the idea of abstracting away all Spark and HDFS specifics in dealing with various file formats. I'm going to try to come up with a solution and then compare it with yours.

Apr 18 '20 18:04 yruslan

I gave it a little thought according to the ideas that you've proposed. Here is my view of the target architecture of Cobrix that minimizes code duplication and holds all of the generic logic in cobol-parser: CobrixComponents.pptx

I haven't tried to apply this to the code yet, and I haven't looked at your current solution either. Will do it soon.

Apr 20 '20 13:04 yruslan

In the diagram, dotted lines specify a "uses" relation, solid lines specify "contains", and triangle arrows specify inheritance or instantiation of generics.

Apr 20 '20 13:04 yruslan

At first glance, it seems similar to what I did -- I brought the SimpleStream into the FixedLenReader class instead of letting it interact directly with a bytes array. From that point, it was a matter of making sure the getRecordIterator has the same signature for the FixedLen and VarLen classes. This is the part that could use some work since the fixed len doesn't use those extra parameters.
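
One possible way to handle the signature mismatch, sketched under assumptions: bundle the extra parameters into a single context object with defaults, so the fixed-length reader can accept and ignore it. ReadContext, its startOffset field, UnifiedReader, and both reader classes are hypothetical and do not mirror the real getRecordIterator parameters.

```scala
// Hypothetical sketch; the type and field names are illustrative, not Cobrix's.

/** Optional extra parameters. The default lets callers that do not need them omit them. */
final case class ReadContext(startOffset: Int = 0)

trait UnifiedReader {
  def getRecordIterator(data: Array[Byte], ctx: ReadContext = ReadContext()): Iterator[Array[Byte]]
}

/** Fixed-length reader: `ctx` is accepted for signature compatibility but deliberately unused. */
final class FixedReader(recordSize: Int) extends UnifiedReader {
  def getRecordIterator(data: Array[Byte], ctx: ReadContext): Iterator[Array[Byte]] =
    data.grouped(recordSize).filter(_.length == recordSize)
}

/** Variable-length reader (toy 1-byte length prefix): starts reading at `ctx.startOffset`. */
final class PrefixedReader extends UnifiedReader {
  def getRecordIterator(data: Array[Byte], ctx: ReadContext): Iterator[Array[Byte]] =
    new Iterator[Array[Byte]] {
      private var pos = ctx.startOffset
      def hasNext: Boolean = pos < data.length
      def next(): Array[Byte] = {
        val len = data(pos) & 0xFF                    // unsigned length prefix
        val rec = data.slice(pos + 1, pos + 1 + len)  // slice clamps at end of input
        pos += 1 + len
        rec
      }
    }
}
```

The trade-off is that the fixed-length side still carries a parameter it never reads; the alternative would be to move such parameters into the reader's constructor so the iterator method takes only the data.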

Apr 20 '20 13:04 tr11

Looking at it a second time, this is similar but not exactly what I did. It's what I was trying to do, but the flow of what I have seems to mix the IndexBuilder, RawRecordIterator, and ParsedRecordIterator.
The rest of the diagram seems in line with what we already have with the generic extractor and the Extractor<Row> adapter for spark-cobol.

I like the names RawRecordIterator and ParsedRecordIterator -- I'll make that change on my branch too.
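
As a rough illustration of how the two iterators could stay separate instead of being mixed together: only the two class names below come from the diagram discussion. The constructor parameters, the fixed-size splitting, and the generic `parse` function are assumptions made for the sketch.

```scala
// Hypothetical sketch: only the two iterator names come from the discussion above.

/** Splits the input into fixed-size raw records; knows nothing about record contents.
  * A trailing partial record is dropped. */
final class RawRecordIterator(data: Array[Byte], recordSize: Int) extends Iterator[Array[Byte]] {
  private val chunks = data.grouped(recordSize).filter(_.length == recordSize)
  def hasNext: Boolean = chunks.hasNext
  def next(): Array[Byte] = chunks.next()
}

/** Turns raw records into values of type T; knows nothing about how records were split. */
final class ParsedRecordIterator[T](raw: Iterator[Array[Byte]], parse: Array[Byte] => T)
    extends Iterator[T] {
  def hasNext: Boolean = raw.hasNext
  def next(): T = parse(raw.next())
}
```

The type parameter T plays the role of the output format here: a Spark-side adapter would supply a `parse` function that produces rows, while a JSON reader would supply one that produces JSON strings, both on top of the same raw iterator.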

Apr 22 '20 02:04 tr11

👍 Cool. I'll take a look at the branch soon.

Apr 22 '20 06:04 yruslan

Sorry it took so long. It looks like a step in the right direction. I was thinking that after we align our implementations, I can take a few days to focus on Cobrix and refactor it to match the desired architecture.

May 06 '20 11:05 yruslan
