cobrix
Add support for header segments
Background
Header segments are present only at the beginning of each file. Sometimes it is useful to propagate fields of the header segment to all of the records in the file. Currently, windowing functions can be used for this purpose, but this is inefficient.
Feature
Add header records support directly in Cobrix.
Just wanted to create the same issue, but saw this one. When do you plan to start developing this feature?
Sorry, I can't provide a roadmap on this - I'm too swamped with other issues and this issue is not trivial to implement. We are using windowing functions for this purpose at the moment for our use cases as well, so we are in the same boat :)
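For reference, the windowing-function workaround mentioned above can be sketched roughly like this (the column names RECORD_ID/BUSINESS_DT/ACCOUNT and the hand-built Seq are made up for illustration; in a real job the DataFrame would come from the cobol data source):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("header-window").getOrCreate()
import spark.implicits._

// Stand-in for a parsed file: the first record is the header carrying
// BUSINESS_DT; detail records have it null. (Hypothetical column names.)
val df = Seq(
  ("HDR", Some("20240930"), None: Option[String]),
  ("DTL", None: Option[String], Some("acct-1")),
  ("DTL", None: Option[String], Some("acct-2"))
).toDF("RECORD_ID", "BUSINESS_DT", "ACCOUNT")

// Materialize a row id first (nondeterministic expressions are not
// allowed directly in a window's ORDER BY), then carry the last
// non-null BUSINESS_DT forward and drop the header record itself.
val withId = df.withColumn("row_id", monotonically_increasing_id())
val w = Window
  .orderBy("row_id")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val out = withId
  .withColumn("BUSINESS_DT", last($"BUSINESS_DT", ignoreNulls = true).over(w))
  .filter($"RECORD_ID" =!= "HDR")
  .drop("row_id")
```

The single-partition window over the whole file is exactly why this approach is inefficient for large inputs.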
I have a similar need, and here is how I got it to work:
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.lit

/*
Read only the header
*/
val headerData = spark.read
.format("cobol")
.option("schema_retention_policy", "collapse_root")
.option("segment_field", "RECORD-ID")
.option("segment_filter", "HDR")
.option("copybook_contents", copybookLoader("/copybooks/HEADER.cbl"))
.load(inputFile)
val businessDateString = headerData.select("BUSINESS_DT").first()(0).toString
val businessDateFormat = new SimpleDateFormat("yyyyMMdd")
val businessDateTempValue = businessDateFormat.parse(businessDateString)
val businessDate = new java.sql.Date(businessDateTempValue.getTime())
/*
Read Main Record Section
*/
val inputData = spark
.read
.format("cobol")
.option("segment_field", "RECORD-ID")
.option("segment_filter", "BASEDATA")
.option("copybook_contents", copybookLoader("/copybooks/BASEDATA.cbl"))
.option("schema_retention_policy", "collapse_root")
.load(inputFile)
/*
Add column from header
*/
val outputData = inputData
.withColumn("BUSINESS_DATE", lit(businessDate))
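A small aside on the date handling in the snippet above: on Java 8+ the same conversion can be done with java.time, which avoids SimpleDateFormat's mutability and thread-safety pitfalls. A sketch, with a literal standing in for the value read from the header's BUSINESS_DT field:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Same conversion as above, but via the thread-safe java.time API.
// "20240930" stands in for the value read from the header.
val businessDateString = "20240930"
val localDate = LocalDate.parse(businessDateString, DateTimeFormatter.ofPattern("yyyyMMdd"))
val businessDate = java.sql.Date.valueOf(localDate)
```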
Hi @yruslan
I have text-format files with multiple headers and multiple segments.
Spec: header-1 details-1 header-2 details-2 . . . header-n details-n. Each header has a different length. By using a custom record extractor class, I am able to get a df.
Can you help me get a multi-segment df out of this?
Hi, sorry for the late reply. In order to get a multisegment df you need to
- Define a copybook that has a segment id field (and other fields common across segments) and segment GROUPS that redefine each other, like this:
01 RECORD.
   05 SEGMENT-ID PIC X(2).
   05 HEADER.
      10 SOME_HEADER_FIELD PIC X(10).
      ...
   05 DETAILS REDEFINES HEADER.
      10 SOME_DETAILS_FIELD PIC X(10).
      ...
- Let's say the segment id for the header is '1', and for details it is '2'. You can load it like this:
.option("segment_field", "SEGMENT_ID")
.option("redefine_segment_id_map:0", "HEADER => 1")
.option("redefine_segment_id_map:1", "DETAILS => 2")
There are examples in the README, in the examples folder, and in the integration tests of Cobrix.
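To illustrate what such a load gives you: with a redefine-to-segment-id map, the redefined group that does not match a row's segment id comes back null, so each segment can be split off with a plain filter. A sketch with a hand-built stand-in DataFrame (flattened columns for brevity; a real Cobrix load would nest them under the HEADER/DETAILS groups):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("segments").getOrCreate()
import spark.implicits._

// Stand-in for a Cobrix multisegment load: the group that does not match
// the row's segment id is null. (Column names follow the copybook above.)
val df = Seq(
  ("1", Some("2024-09-30"), None: Option[String]),
  ("2", None: Option[String], Some("detail row 1")),
  ("2", None: Option[String], Some("detail row 2"))
).toDF("SEGMENT_ID", "SOME_HEADER_FIELD", "SOME_DETAILS_FIELD")

// One DataFrame per segment
val headers = df.filter($"SEGMENT_ID" === "1").select("SEGMENT_ID", "SOME_HEADER_FIELD")
val details = df.filter($"SEGMENT_ID" === "2").select("SEGMENT_ID", "SOME_DETAILS_FIELD")
```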
Hi @yruslan
In the multi-segment copybook, each header value is different, and the lengths of the header and the details are different: header-1 details-1 header-2 details-2 . . . header-n details-n
I wrote custom logic to align each record as fixed-width, like this:
HDR1 - DETAILS1 HDR2 - DETAILS2 HDR3 - DETAILS3
Here each detail has a different layout. Currently I am creating a temp file and parsing it with the options above.
How do I avoid creating a temp file?
If a copybook is multi-segment, how can I read the copybook segment-wise?
First, if each segment has a different layout, you need either an RDW header for each record or a record length field, so that each record's size can somehow be determined. This is for EBCDIC. For ASCII files you don't need RDWs since the line-ending character is the record boundary. When you have a multisegment file like this, you can read it as in this example: https://github.com/AbsaOSS/cobrix#automatic-segment-redefines-filtering or https://github.com/AbsaOSS/cobrix/blob/484d78f4977ab83ee8cec4b4142632fdf2589ab8/spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test5MultisegmentSpec.scala#L227
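To show why a length header matters, here is a tiny sketch of splitting a byte stream on an RDW-like prefix. It assumes a 2-byte big-endian payload length followed by 2 reserved bytes, and that the length counts only the payload; real RDWs may include the 4-byte header in the count, and Cobrix handles all of this internally (record_format = "V"), so this is purely illustrative:

```scala
// Split a byte array into records using an RDW-like prefix:
// bytes 0-1 = big-endian payload length, bytes 2-3 = reserved.
def splitRdwRecords(data: Array[Byte]): Seq[Array[Byte]] = {
  val records = scala.collection.mutable.ArrayBuffer.empty[Array[Byte]]
  var pos = 0
  while (pos + 4 <= data.length) {
    // Big-endian 16-bit length of the record payload
    val len = ((data(pos) & 0xFF) << 8) | (data(pos + 1) & 0xFF)
    records += data.slice(pos + 4, pos + 4 + len)
    pos += 4 + len
  }
  records.toSeq
}
```

Without such a prefix (or a length field inside the record), there is no way to know where one variable-length EBCDIC record ends and the next begins.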
In this example, my source file is not directly parseable, so I converted the file into a parseable format and followed the example below.
val df = spark
  .read
  .format("cobol")
  .option("copybook_contents", copybook)
  .option("record_format", "D")
  .option("segment_field", "SEGMENT_ID")
  .option("segment_id_level0", "C")
  .option("segment_id_level1", "P")
  .option("redefine_segment_id_map:0", "STATIC-DETAILS => C")
  .option("redefine_segment_id_map:1", "CONTACTS => P")
  .load("examples/multisegment_data/COMP.DETAILS.SEP30.DATA.dat")
But without the temp file I have a fixed-length-format RDD, and I can't use an RDD as input in the example above.
Will you support multiple segment fields, like .option("segment_field", "SEGMENT_ID1,SEGMENT_ID2")?
Sorry, I'm not following what the specific requirement is. Could I ask you to describe what copybook you have and how you would like to parse it?
Cobrix supports multisegment files, but the segment id fields should be present in all of the segments. I think I can help you with the solution if you provide a copybook. Maybe a simplified one.