variable length copybook
When I tried to parse a variable-length file with RDW headers, I used the following:
val cobolDataframe = spark
.read
.format("cobol")
.option("copybook", v_copybook)
.option("schema_retention_policy", "collapse_root") //removes the root record headerc
.option("drop_group_fillers", "false")
.option("generate_record_id", false) // this adds the file id and record id
.option("is_record_sequence", "true") // reader to use 4 byte record headers to extract records from a mainframe file
.option("is_rdw_big_endian", "true")
.option("is_rdw_part_of_record_length", true)
.option("rdw_adjustment", -4)
.load(v_data)
After parsing, the data is not correct. I tried different RDW options, but in all cases the parsed data was wrong. If I use the same file with the record-length option .option("record_length_field", v_recordLengthField), it parses correctly. However, not all of our mainframe files have a record length field. How can I resolve this so that variable-length files work?
In order to parse variable record length (VLR) files you need either:
- An RDW header or
- A record length field that is located at the same place for every segment.
Please provide the first 4 bytes of your data file so we can see if it contains an RDW and whether it is a big endian one or not.
In your code snippet you have used two options that contradict each other. Please use only one of them:
.option("is_rdw_part_of_record_length", true)
.option("rdw_adjustment", -4)
I will try applying your suggestions.
I tried it, but I was not able to parse the data. It is the same as before: only the first row parses correctly, and from the next rows onward the data shifts. I have sent you an email with the copybook and a data file with 5 rows. Can you please let me know how to fix this?
I tried to parse a variable-length file as well.
val df = spark
.read
.format("cobol")
.option("copybook","/U:/CIFMSTCB.cob")
.option("schema_retention_policy", "collapse_root")
.option("is_record_sequence", "true")
.load("/U:/full_scoring.dat")
Only one record was parsed out of 360.
GSAM_RECORD 1 4092 4092
10 GSAM_HEADER_RECORD r 7 1 4092 4092
15 GSAM_APPL_NUM 2 1 12 12
15 GSAM_REC_TYPE 3 13 16 4
15 GSAM_CLIENT_NBR 4 17 20 4
15 GSAM_TRANS_ID 5 21 28 8
15 GSAM_TRANS_DTE 6 29 36 8
15 Filler 7 37 4092 4056
10 GSAM_DETAIL_RECORD rR 11 1 4092 4092
15 GSAM_APPL_NUM 9 1 12 12
15 GSAM_REC_TYPE 10 13 16 4
15 GSAM_DATA 11 17 4092 4076
10 GSAM_TRAILER_RECORD R 18 1 4092 4092
15 GSAM_APPL_NUM 13 1 12 12
15 GSAM_REC_TYPE 14 13 16 4
15 GSAM_CLIENT_NBR 15 17 20 4
15 GSAM_TRANS_ID 16 21 28 8
15 GSAM_TRAILER_AREA 17 29 228 200
15 FILLER 18 229 4092 3864
root
|-- GSAM_HEADER_RECORD: struct (nullable = true)
| |-- GSAM_APPL_NUM: string (nullable = true)
| |-- GSAM_REC_TYPE: string (nullable = true)
| |-- GSAM_CLIENT_NBR: string (nullable = true)
| |-- GSAM_TRANS_ID: string (nullable = true)
| |-- GSAM_TRANS_DTE: string (nullable = true)
|-- GSAM_DETAIL_RECORD: struct (nullable = true)
| |-- GSAM_APPL_NUM: string (nullable = true)
| |-- GSAM_REC_TYPE: string (nullable = true)
| |-- GSAM_DATA: string (nullable = true)
|-- GSAM_TRAILER_RECORD: struct (nullable = true)
| |-- GSAM_APPL_NUM: string (nullable = true)
| |-- GSAM_REC_TYPE: string (nullable = true)
| |-- GSAM_CLIENT_NBR: string (nullable = true)
| |-- GSAM_TRANS_ID: string (nullable = true)
| |-- GSAM_TRAILER_AREA: string (nullable = true)
+----------------------------------------------+--------------------------------------------------------+--------------------------------------------------------+
|GSAM_HEADER_RECORD                            |GSAM_DETAIL_RECORD                                      |GSAM_TRAILER_RECORD                                     |
+----------------------------------------------+--------------------------------------------------------+--------------------------------------------------------+
|[000000009998, 2220, B222, 0MST2020, 0221 00] |[000000009998, 2220, B2220MST20200221 000000037185600...]|[000000009998, 2220, B222, 0MST2020, 0221 000000037185...]|
+----------------------------------------------+--------------------------------------------------------+--------------------------------------------------------+
(show output truncated; the rest of the file's records appear concatenated into GSAM_DATA)
Please advise
We recently implemented a new way to debug copybook mismatches. Please try using version 2.0.7 of spark-cobol. You can add .option("debug", "true") to the list of options. Each parsed field will then have an accompanying field with a _DEBUG suffix containing the HEX values of the original data. This way you can see how each field was parsed and find the displacement between the copybook and the data. You can then eliminate the displacement by adjusting 'rdw_adjustment' or by adding filler fields at the end of the copybook.
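A minimal sketch of such a debug run (the copybook and data paths are placeholders for your own files):

```scala
val debugDf = spark
  .read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cob")   // placeholder path
  .option("schema_retention_policy", "collapse_root")
  .option("is_record_sequence", "true")
  .option("debug", "true")            // adds a companion HEX field for every parsed field
  .load("/path/to/data.dat")          // placeholder path

debugDf.printSchema()                 // look for the *_DEBUG fields
debugDf.show(5, truncate = false)     // compare the HEX values against the copybook layout
```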
@yruslan How can we confirm whether a file contains an RDW, for VL files?
I tried loading the data using spark-cobol:
val df = spark
  .read
  .format("cobol")
  .option("is_record_sequence", "true")
  .option("schema_retention_policy", "collapse_root")
  .option("copybook", "gs://test/binzip/sdjwew.txt")
  .load("gs://test/binzip/jhewew.bin.gz")
df.show()
I can see that the order of the data is quite different.
I tried using rdw_adjustment, but no luck.
How is the record length field different from a copybook field in the schema? Are these two different?
In order to check if a file has an RDW header, you can open the file in a HEX editor and check the first 4 bytes. If bytes 0 and 1, or bytes 2 and 3, are 0x00, then the file is likely to have an RDW header. On Linux you can use hexdump to read a binary file.
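If a HEX editor is not handy, a small Scala sketch along these lines can dump the first 4 bytes (the path is a placeholder):

```scala
// Read only the first 4 bytes of the file and print them in HEX.
// For a big-endian RDW, bytes 0-1 hold the record length and bytes 2-3 are usually zero.
import java.io.{DataInputStream, FileInputStream}

val in = new DataInputStream(new FileInputStream("/path/to/data.dat")) // placeholder path
val header = new Array[Byte](4)
in.readFully(header)
in.close()

println(header.map(b => f"${b & 0xff}%02x").mkString(" "))
```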
Thanks @yruslan. This is what I can see in the data via hexdump:
00000000 40 40 40 40 40 40 40 40 40 40 f0 f0 f0 f1 40 40 |@@@@@@@@@@....@@|
00000010 40 40 40 40 f2 f0 f2 f1 60 f0 f5 60 f1 f3 f2 f0 |@@@@....`..`....|
This looks like a fixed-length EBCDIC file.
With spark-cobol 2.4.0 you can use .option("record_format", "F")
In older versions of Cobrix you need to set .option("is_record_sequence", "false") (yes, false).
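For example, a minimal sketch of a fixed-length read, assuming spark-cobol 2.4.0+ (the paths are placeholders):

```scala
// Fixed-length records: no RDW, the record size comes from the copybook.
val fixedDf = spark
  .read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cob")       // placeholder path
  .option("schema_retention_policy", "collapse_root")
  .option("record_format", "F")                      // older versions: is_record_sequence = "false"
  .load("/path/to/data.dat")                         // placeholder path (uncompressed file)
```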
I tried applying the given options, but it throws an error:
java.io.IOException: FixedLengthRecordReader does not support reading compressed files
That's an interesting error. I can't remember ever seeing it. Please clarify the following:
- What is the file name you are reading from?
- What are the current options passed to spark-cobol?
- What is the full stack trace of the error?
- File name: cillold.bin.gz
- Spark-cobol version: 2.4
- Current Spark options:
SparkSession.read().format("cobol")
.option("copybook", "copybook.txt")
.option("record_format", "F")
.option("schema_retention_policy", "collapse_root")
.load("cillold.bin.gz").show();
- Full stack trace of the error:
2021-09-08 13:47:19 INFO o.a.s.r.NewHadoopRDD:54 - Input split: file:cillold.bin.gz:0+4778620
2021-09-08 13:47:20 ERROR o.a.s.e.Executor:91 - Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: FixedLengthRecordReader does not support reading compressed files
at org.apache.spark.input.FixedLengthBinaryRecordReader.initialize(FixedLengthBinaryRecordReader.scala:89)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:182)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:179)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
I see now, thanks. Yes, unfortunately, compressed files are not supported. If the file is '.gz', you need to uncompress it first.
It is a Spark limitation for binary files.
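For example, a minimal sketch that decompresses the file locally and then reads the uncompressed copy (file names taken from your example; the buffer size is arbitrary):

```scala
import java.io.{FileInputStream, FileOutputStream}
import java.util.zip.GZIPInputStream

// Decompress cillold.bin.gz into cillold.bin.
val in  = new GZIPInputStream(new FileInputStream("cillold.bin.gz"))
val out = new FileOutputStream("cillold.bin")
val buf = new Array[Byte](64 * 1024)
Iterator.continually(in.read(buf)).takeWhile(_ != -1).foreach(n => out.write(buf, 0, n))
in.close(); out.close()

// Read the uncompressed file with spark-cobol.
val df = spark
  .read
  .format("cobol")
  .option("copybook", "copybook.txt")
  .option("record_format", "F")
  .option("schema_retention_policy", "collapse_root")
  .load("cillold.bin")

df.show()
```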
A follow-up question: the hexdump that you posted does not seem to be compressed. Which command did you use (the full command including the file name) to get the hex dump?
hexdump -C cillold.bin.gz > test1