
variable length copybook

Open geethab123 opened this issue 6 years ago • 15 comments

I tried to parse a variable-length file with RDW headers like below:

  cobolDataframe = spark
    .read
    .format("cobol")
    .option("copybook", v_copybook)
    .option("schema_retention_policy", "collapse_root") // removes the root record header
    .option("drop_group_fillers", "false")
    .option("generate_record_id", false) // this adds the file id and record id
    .option("is_record_sequence", "true") // reader to use 4 byte record headers to extract records from a mainframe file
    .option("is_rdw_big_endian", "true")
    .option("is_rdw_part_of_record_length", true)
    .option("rdw_adjustment", -4)
    .load(v_data)

After parsing, the data is not correct. I tried different RDW options, but in every case the parsed data was wrong. If I read the same file with the record length option, .option("record_length_field", v_recordLengthField), it parses correctly. But not all of our mainframe files have a record length field. How can I resolve this so that variable-length files work?

geethab123 avatar Jul 19 '19 19:07 geethab123

In order to parse variable record length (VLR) files you need either:

  • An RDW header or
  • A record length field that is located at the same place for every segment.

Please provide the first 4 bytes of your data file so we can see if it contains an RDW and whether it is a big endian one or not.

In your code snippet you have used two options that contradict each other. Please use only one of them:

    .option("is_rdw_part_of_record_length", true)
    .option("rdw_adjustment", -4)
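To see why these two options contradict each other, note that both exist to compensate for an RDW value that counts the 4-byte header itself. A sketch of the arithmetic (the semantics below are my assumption for illustration, not spark-cobol internals):

```python
RDW_SIZE = 4  # the record descriptor word is always 4 bytes

def payload_length(rdw_value, rdw_is_part_of_length, adjustment):
    """Effective record payload length after applying both corrections."""
    length = rdw_value + adjustment
    if rdw_is_part_of_length:
        length -= RDW_SIZE
    return length

# RDW says 104 and includes itself, so the real payload is 100 bytes:
print(payload_length(104, True, 0))    # 100 -- corrected via the flag
print(payload_length(104, False, -4))  # 100 -- same correction via adjustment
print(payload_length(104, True, -4))   # 96  -- both at once over-corrects
```

Using both options subtracts the header twice, which shifts every record boundary by 4 bytes.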

yruslan avatar Jul 22 '19 13:07 yruslan

I will try applying your suggestions.

geethab123 avatar Jul 22 '19 18:07 geethab123

I tried it, but I was not able to parse the data. It is the same as before: only the first row parses correctly, and from the next rows on the data shifts. I have sent you an email with the copybook and a data file with 5 rows. Can you please let me know how to fix this?

geethab123 avatar Jul 23 '19 21:07 geethab123

I tried to parse a variable-length file as well:

    val df = spark
      .read
      .format("cobol")
      .option("copybook", "/U:/CIFMSTCB.cob")
      .option("schema_retention_policy", "collapse_root")
      .option("is_record_sequence", "true")
      .load("/U:/full_scoring.dat")

Only one record was parsed out of 360. The copybook layout:

    GSAM_RECORD                            1   4092   4092
    10 GSAM_HEADER_RECORD    r      7    1   4092   4092
       15 GSAM_APPL_NUM             2    1     12     12
       15 GSAM_REC_TYPE             3   13     16      4
       15 GSAM_CLIENT_NBR           4   17     20      4
       15 GSAM_TRANS_ID             5   21     28      8
       15 GSAM_TRANS_DTE            6   29     36      8
       15 Filler                    7   37   4092   4056
    10 GSAM_DETAIL_RECORD    rR    11    1   4092   4092
       15 GSAM_APPL_NUM             9    1     12     12
       15 GSAM_REC_TYPE            10   13     16      4
       15 GSAM_DATA                11   17   4092   4076
    10 GSAM_TRAILER_RECORD   R     18    1   4092   4092
       15 GSAM_APPL_NUM            13    1     12     12
       15 GSAM_REC_TYPE            14   13     16      4
       15 GSAM_CLIENT_NBR          15   17     20      4
       15 GSAM_TRANS_ID            16   21     28      8
       15 GSAM_TRAILER_AREA        17   29    228    200
       15 FILLER                   18  229   4092   3864

The inferred schema:

    root
     |-- GSAM_HEADER_RECORD: struct (nullable = true)
     |    |-- GSAM_APPL_NUM: string (nullable = true)
     |    |-- GSAM_REC_TYPE: string (nullable = true)
     |    |-- GSAM_CLIENT_NBR: string (nullable = true)
     |    |-- GSAM_TRANS_ID: string (nullable = true)
     |    |-- GSAM_TRANS_DTE: string (nullable = true)
     |-- GSAM_DETAIL_RECORD: struct (nullable = true)
     |    |-- GSAM_APPL_NUM: string (nullable = true)
     |    |-- GSAM_REC_TYPE: string (nullable = true)
     |    |-- GSAM_DATA: string (nullable = true)
     |-- GSAM_TRAILER_RECORD: struct (nullable = true)
     |    |-- GSAM_APPL_NUM: string (nullable = true)
     |    |-- GSAM_REC_TYPE: string (nullable = true)
     |    |-- GSAM_CLIENT_NBR: string (nullable = true)
     |    |-- GSAM_TRANS_ID: string (nullable = true)
     |    |-- GSAM_TRAILER_AREA: string (nullable = true)
The output (truncated):

    +----------------------------------------------+--------------------+---------------------+
    |GSAM_HEADER_RECORD                            |GSAM_DETAIL_RECORD  |GSAM_TRAILER_RECORD  |
    +----------------------------------------------+--------------------+---------------------+
    |[000000009998, 2220, B222, 0MST2020, 0221 00] |[000000009998, 2220, B2220MST20200221 000000037185600A 222000000003718500 ...]|[000000009998, 2220, B222, 0MST2020, 0221 000000037185600A ...]|
    +----------------------------------------------+--------------------+---------------------+

Please advise

marchins1952 avatar Apr 17 '20 17:04 marchins1952

We recently implemented a new way to debug copybook mismatches. Please try version 2.0.7 of spark-cobol. You can add .option("debug", "true") to the list of options. Each parsed field will then have an accompanying field with a _DEBUG suffix containing the HEX values of the original data. This way you can see how fields were parsed and find the displacement between the copybook and the data. You can then eliminate the displacement by adjusting 'rdw_adjustment' or by adding filler fields at the end of the copybook.
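The idea can be illustrated outside Spark: slice the raw record at the copybook offsets and inspect the hex, which is exactly what a _DEBUG column exposes. A sketch (the record bytes below are made up for illustration):

```python
def field_hex(record, offset, length):
    """Hex of one copybook field's slice, like a _DEBUG column value."""
    return record[offset:offset + length].hex().upper()

# Hypothetical 5-byte record: EBCDIC 'ABC' (C1 C2 C3) then zoned digits '12' (F1 F2).
record = bytes.fromhex("C1C2C3F1F2")
print(field_hex(record, 0, 3))  # C1C2C3 -- EBCDIC text, as expected
print(field_hex(record, 3, 2))  # F1F2   -- zoned decimal digits
# If a PIC 9(2) field showed, say, 40C1 instead of F1F2, the record is shifted.
```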

yruslan avatar Apr 18 '20 17:04 yruslan

@yruslan How can we confirm whether a file contains an RDW for VL (variable-length) files?

I tried loading the data using spark-cobol:

    val df = spark
      .read
      .format("cobol")
      .option("is_record_sequence", "true")
      .option("schema_retention_policy", "collapse_root")
      .option("copybook", "gs://test/binzip/sdjwew.txt")
      .load("gs://test/binzip/jhewew.bin.gz")
    df.show()

I can see the order of data is quite different.

Tried using rdw_adjustment but no luck.

How is the record length field different from the copybook field schema? Are these two different?

RamanandJaiswal avatar Sep 07 '21 15:09 RamanandJaiswal

In order to check whether a file has an RDW header, you can open the file in a HEX editor and check the first 4 bytes. If bytes 0 and 1, or bytes 2 and 3, are 0x00, then the file likely has an RDW header. On Linux you can use hexdump to read a binary file.
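That check can be scripted. A sketch of the heuristic described above (the exact placement of the length bytes per endianness is my assumption):

```python
def rdw_guess(first4):
    """Guess whether 4 bytes look like an RDW, using the zero-pair heuristic."""
    if len(first4) < 4:
        return None
    if first4[2] == 0 and first4[3] == 0:
        # Classic big-endian RDW: length in bytes 0-1, bytes 2-3 reserved zeros.
        return ("big-endian", int.from_bytes(first4[:2], "big"))
    if first4[0] == 0 and first4[1] == 0:
        # Zeros leading instead: possibly a little-endian RDW variant.
        return ("little-endian", int.from_bytes(first4[2:], "little"))
    return None

print(rdw_guess(bytes([0x00, 0x68, 0x00, 0x00])))  # ('big-endian', 104)
print(rdw_guess(bytes([0x40, 0x40, 0x40, 0x40])))  # None -- no RDW in sight
```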

yruslan avatar Sep 07 '21 16:09 yruslan

Thanks @yruslan. This is what I can see in the data via hexdump:

00000000  40 40 40 40 40 40 40 40  40 40 f0 f0 f0 f1 40 40  |@@@@@@@@@@....@@|
00000010  40 40 40 40 f2 f0 f2 f1  60 f0 f5 60 f1 f3 f2 f0  |@@@@....`..`....|

RamanandJaiswal avatar Sep 07 '21 17:09 RamanandJaiswal

This looks like a fixed-length EBCDIC file. With spark-cobol 2.4.0 you can use .option("record_format", "F"). In older versions of Cobrix you need to set .option("is_record_sequence", "false") (yes, false).
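The dump above indeed decodes as plain EBCDIC text: 0x40 is the EBCDIC space and 0xF0 to 0xF9 are the digits. For example, with Python's cp037 codec (the exact code page is an assumption; other EBCDIC pages map these bytes the same way):

```python
# The 32 bytes from the hexdump above.
raw = bytes.fromhex(
    "40404040404040404040f0f0f0f14040"
    "40404040f2f0f2f160f0f560f1f3f2f0"
)
print(repr(raw.decode("cp037")))  # '          0001      2021-05-1320'
```

Note there are no 0x00 bytes at the start, so no RDW is present.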

yruslan avatar Sep 07 '21 18:09 yruslan

I tried applying the given options, but it throws an error:

java.io.IOException: FixedLengthRecordReader does not support reading compressed files

RamanandJaiswal avatar Sep 08 '21 05:09 RamanandJaiswal

That's an interesting error. I can't remember ever seeing it. Please, clarify the following:

  • What is the file name you are reading from?
  • What are the current options passed to spark-cobol?
  • What is the full stack trace of the error?

yruslan avatar Sep 08 '21 07:09 yruslan

  • File name: cillold.bin.gz
  • spark-cobol version: 2.4
  • Current Spark options:
SparkSession.read().format("cobol")
                .option("copybook", "copybook.txt")
                .option("record_format", "F")
                .option("schema_retention_policy", "collapse_root")
                .load("cillold.bin.gz").show();
  • Full stack trace of the error:
2021-09-08 13:47:19 INFO  o.a.s.r.NewHadoopRDD:54 - Input split: file:cillold.bin.gz:0+4778620
2021-09-08 13:47:20 ERROR o.a.s.e.Executor:91 - Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: FixedLengthRecordReader does not support reading compressed files
	at org.apache.spark.input.FixedLengthBinaryRecordReader.initialize(FixedLengthBinaryRecordReader.scala:89)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:182)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:179)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

RamanandJaiswal avatar Sep 08 '21 08:09 RamanandJaiswal

I see now, thanks. Yes, unfortunately, compressed files are not supported. If the file is '.gz', you need to uncompress it first.

It is a Spark limitation for binary files.
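A quick workaround is to decompress before loading. A sketch in Python (the file names are taken from the thread and assumed to be local paths):

```python
import gzip
import shutil

def gunzip(src_path, dst_path):
    """Write the decompressed contents of a .gz file to a plain binary file."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

# gunzip("cillold.bin.gz", "cillold.bin")
# ...then point spark-cobol's .load() at "cillold.bin".
```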

yruslan avatar Sep 08 '21 08:09 yruslan

A follow-up question: the hexdump that you posted does not look compressed. Which command did you use (the full command with the file name) to get the hex dump?
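One way to settle this: a real gzip file always begins with the magic bytes 1f 8b, so the very first hexdump line would show them. A sketch of the check:

```python
GZIP_MAGIC = b"\x1f\x8b"

def is_gzipped(first_bytes):
    """True if the data starts with the gzip magic number."""
    return first_bytes[:2] == GZIP_MAGIC

print(is_gzipped(b"\x1f\x8b\x08\x00"))  # True  -- a gzip stream
print(is_gzipped(b"\x40\x40\x40\x40"))  # False -- the EBCDIC spaces seen above
```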

yruslan avatar Sep 08 '21 08:09 yruslan

hexdump -C cillold.bin.gz > test1

RamanandJaiswal avatar Sep 08 '21 09:09 RamanandJaiswal