cobrix
Add support for header segments
Background
Header segments are present only at the beginning of each file. Sometimes it is useful to propagate fields of the header segment to all of the records in the file. Currently, windowing functions can be used for this purpose, but this is inefficient.
Feature
Add header records support directly in Cobrix.
Just wanted to create the same issue, but saw this one. When do you plan to start developing this feature?
Sorry, I can't provide a roadmap on this - I'm too swamped with other issues and this issue is not trivial to implement. We are using windowing functions for this purpose at the moment for our use cases as well, so we are in the same boat :)
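For reference, the windowing-function workaround mentioned above can be sketched roughly like this (the column names RECORD_ID/BUSINESS_DT/ACCOUNT and the hand-built Seq are made up for illustration; in a real job the DataFrame would come from the cobol data source):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("header-window").getOrCreate()
import spark.implicits._

// Stand-in for a parsed file: the first record is the header carrying
// BUSINESS_DT; detail records have it null. (Hypothetical column names.)
val df = Seq(
  ("HDR", Some("20240930"), None: Option[String]),
  ("DTL", None: Option[String], Some("acct-1")),
  ("DTL", None: Option[String], Some("acct-2"))
).toDF("RECORD_ID", "BUSINESS_DT", "ACCOUNT")

// Materialize a row id first (nondeterministic expressions are not
// allowed directly in a window's ORDER BY), then carry the last
// non-null BUSINESS_DT forward and drop the header record itself.
val withId = df.withColumn("row_id", monotonically_increasing_id())
val w = Window
  .orderBy("row_id")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val out = withId
  .withColumn("BUSINESS_DT", last($"BUSINESS_DT", ignoreNulls = true).over(w))
  .filter($"RECORD_ID" =!= "HDR")
  .drop("row_id")
```

The single-partition window over the whole file is exactly why this approach is inefficient for large inputs.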
I have a similar need, and here is how I got it to work:
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.lit

/*
Read only the header
*/
val headerData = spark.read
.format("cobol")
.option("schema_retention_policy", "collapse_root")
.option("segment_field", "RECORD-ID")
.option("segment_filter", "HDR")
.option("copybook_contents", copybookLoader("/copybooks/HEADER.cbl"))
.load(inputFile)
val businessDateString = headerData.select("BUSINESS_DT").first()(0).toString
val businessDateFormat = new SimpleDateFormat("yyyyMMdd")
val businessDateTempValue = businessDateFormat.parse(businessDateString)
val businessDate = new java.sql.Date(businessDateTempValue.getTime())
/*
Read Main Record Section
*/
val inputData = spark
.read
.format("cobol")
.option("segment_field", "RECORD-ID")
.option("segment_filter", "BASEDATA")
.option("copybook_contents", copybookLoader("/copybooks/BASEDATA.cbl"))
.option("schema_retention_policy", "collapse_root")
.load(inputFile)
/*
Add column from header
*/
val outputData = inputData
.withColumn("BUSINESS_DATE", lit(businessDate))
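A small aside on the date handling in the snippet above: on Java 8+ the same conversion can be done with java.time, which avoids SimpleDateFormat's mutability and thread-safety pitfalls. A sketch, with a literal standing in for the value read from the header's BUSINESS_DT field:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Same conversion as above, but via the thread-safe java.time API.
// "20240930" stands in for the value read from the header.
val businessDateString = "20240930"
val localDate = LocalDate.parse(businessDateString, DateTimeFormatter.ofPattern("yyyyMMdd"))
val businessDate = java.sql.Date.valueOf(localDate)
```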
Hi @yruslan
I have text-format files with multiple headers and multiple segments.
Spec: header-1 details-1 header-2 details-2 . . . header-n details-n. Each header has a different length. By using a custom record extractor class, I am able to get a df.
Can you help me get a multi-segment df out of this?
Hi, sorry for the late reply. In order to get a multisegment df you need to
- Define a copybook that has a segment id field (and other fields common across segments) and segment GROUPS that redefine each other, like this:
01 RECORD.
   05 SEGMENT-ID PIC X(2).
   05 HEADER.
      10 SOME_HEADER_FIELD PIC X(10).
      ...
   05 DETAILS REDEFINES HEADER.
      10 SOME_DETAILS_FIELD PIC X(10).
      ...
- Let's say the segment id for the header is '1', and for details it is '2'. You can load it like this:
.option("segment_field", "SEGMENT_ID")
.option("redefine_segment_id_map:0", "HEADER => 1")
.option("redefine_segment_id_map:1", "DETAILS => 2")
There are examples in the README, in the examples folder, and in the integration tests of Cobrix.
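To illustrate what such a load gives you: with a redefine-to-segment-id map, the redefined group that does not match a row's segment id comes back null, so each segment can be split off with a plain filter. A sketch with a hand-built stand-in DataFrame (flattened columns for brevity; a real Cobrix load would nest them under the HEADER/DETAILS groups):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("segments").getOrCreate()
import spark.implicits._

// Stand-in for a Cobrix multisegment load: the group that does not match
// the row's segment id is null. (Column names follow the copybook above.)
val df = Seq(
  ("1", Some("2024-09-30"), None: Option[String]),
  ("2", None: Option[String], Some("detail row 1")),
  ("2", None: Option[String], Some("detail row 2"))
).toDF("SEGMENT_ID", "SOME_HEADER_FIELD", "SOME_DETAILS_FIELD")

// One DataFrame per segment
val headers = df.filter($"SEGMENT_ID" === "1").select("SEGMENT_ID", "SOME_HEADER_FIELD")
val details = df.filter($"SEGMENT_ID" === "2").select("SEGMENT_ID", "SOME_DETAILS_FIELD")
```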
Hi @yruslan
In the multi-segment copybook, each header value is different, and the lengths of the header and the details are different: header-1 details-1 header-2 details-2 . . . header-n details-n
I wrote custom logic to align each record as fixed-width, like this:
HDR1 - DETAILS1 HDR2 - DETAILS2 HDR3 - DETAILS3
Here each detail has a different layout. Currently I am creating a temp file and parsing it with the options above.
How do I avoid creating a temp file?
If a copybook is multi-segment, how can I read the copybook segment-wise?
First, if each segment has a different layout, you need either an RDW header for each record or a record length field, so that each record's size can somehow be determined. This is for EBCDIC. For ASCII files you don't need RDWs since the line-ending character is the record boundary. When you have a multisegment file like this, you can read it as in this example: https://github.com/AbsaOSS/cobrix#automatic-segment-redefines-filtering or https://github.com/AbsaOSS/cobrix/blob/484d78f4977ab83ee8cec4b4142632fdf2589ab8/spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test5MultisegmentSpec.scala#L227
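To show why a length header matters, here is a tiny sketch of splitting a byte stream on an RDW-like prefix. It assumes a 2-byte big-endian payload length followed by 2 reserved bytes, and that the length counts only the payload; real RDWs may include the 4-byte header in the count, and Cobrix handles all of this internally (record_format = "V"), so this is purely illustrative:

```scala
// Split a byte array into records using an RDW-like prefix:
// bytes 0-1 = big-endian payload length, bytes 2-3 = reserved.
def splitRdwRecords(data: Array[Byte]): Seq[Array[Byte]] = {
  val records = scala.collection.mutable.ArrayBuffer.empty[Array[Byte]]
  var pos = 0
  while (pos + 4 <= data.length) {
    // Big-endian 16-bit length of the record payload
    val len = ((data(pos) & 0xFF) << 8) | (data(pos + 1) & 0xFF)
    records += data.slice(pos + 4, pos + 4 + len)
    pos += 4 + len
  }
  records.toSeq
}
```

Without such a prefix (or a length field inside the record), there is no way to know where one variable-length EBCDIC record ends and the next begins.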
In this example, my source file is not directly parseable, so I converted the file into a parseable format and followed the example below.
val df = spark
  .read
  .format("cobol")
  .option("copybook_contents", copybook)
  .option("record_format", "D")
  .option("segment_field", "SEGMENT_ID")
  .option("segment_id_level0", "C")
  .option("segment_id_level1", "P")
  .option("redefine_segment_id_map:0", "STATIC-DETAILS => C")
  .option("redefine_segment_id_map:1", "CONTACTS => P")
  .load("examples/multisegment_data/COMP.DETAILS.SEP30.DATA.dat")
But without the temp file I have a fixed-length-format RDD, and I can't use an RDD as input in the example above.
Will you support multiple segment fields, like .option("segment_field", "SEGMENT_ID1,SEGMENT_ID2")?
Sorry, I'm not following what the specific requirement is. Could I ask you to describe what copybook you have and how you would like to parse it?
Cobrix supports multisegment files, but the segment id fields should be present in all of the segments. I think I can help you with the solution if you provide a copybook. Maybe a simplified one.