cobrix
Is it possible to read a nested Binary Field?
Background
Let's say that I'm reading a "normal" Avro file using Spark. One of the fields in the Avro schema is an EBCDIC-encoded binary that should be decoded using a COBOL copybook referenced by another field in the same schema. Each record can potentially have its own copybook (so the binary might have a different schema for each record), and the goal is to produce a JSON version of the binary field to store somewhere else.
The DF looks something like this:
| ID | SCHEMA_ID | BINARY_FIELD | FIELD1 | FIELD2 | ..... |
|---|---|---|---|---|---|
| 1 | 001 | M1B1N4R11 | valueX | valueZ | .. |
| 2 | 010 | M1B1N4R12 | valueY | valueW | .. |
And in the folder copycobol/ I have:
- 001.cob
- 010.cob
Question
Is it possible to leverage the library to decode a field instead of a file? Or do I have to save the binary field temporarily in a file and decode it from there?
Thank you for any suggestion! :)
Hi, thanks for the interest in the library.
Yes, it is possible to use Cobrix in this case, but it can be quite involved. You can't use the `spark-cobol` Spark data source to decode the data; you have to do it manually, like this:
- You need to parse each copybook to get an AST:
  ```scala
  val copybookForField1 = CopybookParser.parseSimple(copyBookContents)
  ```
- Then, you can decode each value by applying the copybook to the binary field:
  ```scala
  val row = RecordExtractors.extractRecord(copybookForField1.ast, field1Bytes, 0, handler = handler)
  val record = handler.create(row.toArray, copybookForField1.ast)
  ```
  The resulting record will be `Array[Any]`, and for each subfield you can cast to the corresponding Java data type.
- If you want decoding to happen in parallel, handled by Spark SQL, you can write a UDF per field. Each UDF could contain a pre-parsed copybook and can just apply `extractRecord()` and `handler.create()` to each value. The resulting output can be a JSON string. See how Jackson could be used to convert each record to JSON: https://github.com/AbsaOSS/cobrix/blob/68f7362ed55db66a51293de207c4ca0d83af0c83/cobol-converters/src/test/scala/za/co/absa/cobrix/cobol/converters/extra/SerializersSpec.scala#L161
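Since each record in the question may point to a different copybook through `SCHEMA_ID`, one way to organise the steps above is to cache parsed copybooks by ID and pick one per row. The following is only a minimal sketch in plain Scala, not part of Cobrix: `CopybookCache`, `DecodeSketch`, `loadFromDir`, and `decodeRow` are hypothetical names, and the `parse` and `decode` functions are placeholders that you would back with `CopybookParser.parseSimple` and with `extractRecord()` plus `handler.create()` and a JSON serializer.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.collection.concurrent.TrieMap

// Hypothetical helper: caches one parsed copybook per schema ID.
// `Parsed` stands in for whatever the copybook parser returns.
final class CopybookCache[Parsed](load: String => String, parse: String => Parsed) {
  private val cache = TrieMap.empty[String, Parsed]

  // Parses on first use of an ID. Under contention the parse may run more
  // than once, but all callers end up seeing a single cached value.
  def get(schemaId: String): Parsed =
    cache.getOrElseUpdate(schemaId, parse(load(schemaId)))
}

object DecodeSketch {
  // Assumption: copybooks are stored as <dir>/<schemaId>.cob, as in the question.
  def loadFromDir(dir: String)(schemaId: String): String =
    new String(Files.readAllBytes(Paths.get(dir, s"$schemaId.cob")), StandardCharsets.UTF_8)

  // Looks up the copybook for this row and applies the supplied decoder.
  // `decode` is a placeholder for the Cobrix extraction plus JSON conversion.
  def decodeRow[P](cache: CopybookCache[P], decode: (P, Array[Byte]) => String)
                  (schemaId: String, bytes: Array[Byte]): String =
    decode(cache.get(schemaId), bytes)
}
```

In Spark this could be wrapped in a two-argument UDF over `SCHEMA_ID` and `BINARY_FIELD`, for example `udf((id: String, bytes: Array[Byte]) => DecodeSketch.decodeRow(cache, decode)(id, bytes))`. In that case the cache is best created lazily on each executor (for instance inside a Scala `object`), since a parsed copybook and the handler may not be serializable.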
Let me know if you decide to do it and have any issues.