cobrix
Parsing variable length cobol file in which root element has multiple sub elements
Example :
0120156788PKumar Pndey 05201789654rDtr467788999000009988777666 05201789654ABCD467788999000009988777666 06201789654rDtr46778899900000998877766698765444ffghjjjj 088888997544332245t6yuuiiiiiiiiiiiiiiiiiiiiiiiffffffffffffffffffffffffffffffffffffffffffffffgggggggggggggggg
Here the 01 record has two 05 records, which need to be collected into a single column.
The example looks like an ASCII text file. Please correct me if I am wrong.
Currently, the hierarchical record reader is supported only for binary files that have RDW headers.
How do you load such files currently? Could you please post an example Spark Application code from your use case?
Hi Team,
Thanks for your response. Yes, it is an ASCII file with fixed-length records. When I try to load it with the Cobrix library using a copybook, it keeps throwing an error saying that the length of the file is not divisible by a certain number of bytes.
Hence I am currently trying to parse it using fixed-width logic, but I am unable to aggregate it into a single record.
It would be very helpful if you could share the logic to parse it.
Yes, loading fixed record length hierarchical files is possible. Please, refer to this example: https://github.com/AbsaOSS/cobrix#autoims
Basically, you need to define a segment id field, specify a segment redefine for each segment, and specify the parent-child relationships between the segment redefines.
A fully working example including a datafile and a copybook is available inside this repository: https://github.com/AbsaOSS/cobrix/blob/master/examples/spark-cobol-app/src/main/scala/com/example/spark/cobol/app/SparkCobolHierarchical.scala
Here is a README on how to run examples: https://github.com/AbsaOSS/cobrix/tree/master/examples/spark-cobol-app
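As a rough sketch of what those three pieces could look like as reader options: the option keys below follow the pattern used in the Cobrix README and the linked example as far as I recall it, so please verify them against the Cobrix version in use. The field name RECORD_TYPE, the segment names, and the mapping of ids 01/05/06 to them are hypothetical and are not taken from the actual copybook in this thread.

```scala
object HierarchicalOptionsSketch {
  // ASSUMPTIONS (hypothetical, for illustration only):
  //  - RECORD_TYPE is the copybook field holding the two-character segment id
  //  - MASTER-REC, DETAIL-05 and DETAIL-06 are GROUPs that redefine each other
  val options: Map[String, String] = Map(
    // which field contains the segment id
    "segment_field" -> "RECORD_TYPE",
    // which segment id selects which redefined GROUP
    "redefine-segment-id-map:0" -> "MASTER-REC => 01",
    "redefine-segment-id-map:1" -> "DETAIL-05 => 05",
    "redefine-segment-id-map:2" -> "DETAIL-06 => 06",
    // parent-child relationship: both detail segments nest under the master
    "segment-children:0" -> "MASTER-REC => DETAIL-05,DETAIL-06"
  )

  // Intended use (requires Spark and spark-cobol on the classpath):
  //   spark.read.format("cobol")
  //     .option("copybook", "/path/to/copybook.cob")
  //     .options(HierarchicalOptionsSketch.options)
  //     .load("/path/to/data")
}
```

Keeping the options in a plain Map makes it easy to print and review the mapping before handing it to the reader via `.options(...)`.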
Hi Team,
Thanks for the quick response. In the data file, each master record is followed by one or several records that belong to its group. The master record is identified by the first two characters of its line, i.e. 01. All records after a 01 record, up to the next 01 record, are part of that 01 record, and a child record type (05, 06, ...) can occur multiple times.
With this library, the same 05 records are parsed into two different columns.
Let me know if you need any other information.
Thanks & Regards, Pankaj Kumar
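The grouping rule described here (a 01 line opens a group, everything up to the next 01 line belongs to it) can be sketched in plain Scala on lines that are already in file order. This is only an illustration of the rule, not something Cobrix provides; the names MasterGroup, groupByMaster and childrenByType are made up for the example, and it assumes each record sits on its own line with the record type in the first two characters.

```scala
object GroupByMasterSketch {
  /** One master ("01") line together with the child lines that follow it. */
  final case class MasterGroup(master: String, children: Vector[String])

  /** Every line starting with `masterId` opens a new group; all following
    * lines, up to the next master line, are appended as its children.
    * Lines that appear before the first master line are dropped. */
  def groupByMaster(lines: Seq[String], masterId: String = "01"): Vector[MasterGroup] =
    lines.foldLeft(Vector.empty[MasterGroup]) { (groups, line) =>
      if (line.startsWith(masterId)) groups :+ MasterGroup(line, Vector.empty)
      else if (groups.isEmpty) groups
      else groups.init :+ groups.last.copy(children = groups.last.children :+ line)
    }

  /** Children of one group keyed by their two-character record type,
    * so that repeated 05 lines end up together in one collection. */
  def childrenByType(group: MasterGroup): Map[String, Vector[String]] =
    group.children.groupBy(_.take(2))
}
```

Note that this relies on line order, so it only works as written on a local, ordered sequence of lines; lines distributed across Spark partitions would need their order preserved first.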
Please, provide the code you are using to parse the data file.
def parseFile = {
  val txDF = spark.read.format("cobol")
    .option("copybook", "file://C:/Users/Desktop/SmartUM/PVEMCDC1.cob")
    .option("encoding", "ascii")
    // .option("is_record_sequence", "true")
    // .option("is_rdw_part_of_record_length", true)
    .load("file:///C:/Users/Desktop/SmartUM/SampleTXFile.txt")
  txDF.printSchema()
  // val tx = txDF.head()
  txDF.show()
  // txDF.write.format("json").option("header","true").mode(SaveMode.Overwrite).save("file:///C:/Users/pankajkumar66/Desktop/data")
  // val txDF = spark.sparkContext.textFile("file:///C:/Users/pankajkumar66/Desktop/SmartUM/SampleTXFile.txt")
  // val parsedRDD = txDF.map(record => parseRecord(record))
  // parsedRDD.saveAsTextFile("file:///C:/Users/pankajkumar66/Desktop/data/txt")
}
Regards, Pankaj
Reading hierarchical text files (where records are separated by a newline) is not supported in Cobrix.
However, from the code I see that your file looks more like a fixed length file. Let's confirm it, and then I can help you with options that you might need to add.
Could I ask you to provide the output of:
txDF.show()
exactly as it is displayed? I want to make sure that record boundaries are determined correctly.
Hi Ruslan,
Yes, it is a fixed-width file; however, the above code gives the error below.
19/11/12 17:10:24 ERROR FileUtils$: File file:/C:/Users/pankajkumar66/Desktop/SmartUM/SampleTXFile.txt IS NOT divisible by 335.
Exception in thread "main" java.lang.IllegalArgumentException: There are some files in file:///C:/Users/pankajkumar66/Desktop/SmartUM/SampleTXFile.txt that are NOT DIVISIBLE by the RECORD SIZE calculated from the copybook (335 bytes per record). Check the logs for the names of the files.
However, I have checked the width of each line in a hex editor, and all the lines have the same width.
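One possible explanation for this error, not confirmed anywhere in this thread, is that each line ends with a line terminator (LF or CR LF) that the copybook does not account for, so the bytes per line are 336 or 337 rather than 335. A small sketch to test that hypothesis against the file size; the helper name matchingTerminatorLengths is made up for the example.

```scala
object RecordSizeCheckSketch {
  /** Returns the candidate line terminator lengths (0, 1 or 2 bytes) for
    * which `fileSize` is an exact multiple of `recordSize` plus the
    * terminator length. An empty result means none of them fits, for
    * example because the last line has no terminator. */
  def matchingTerminatorLengths(fileSize: Long, recordSize: Int): Seq[Int] =
    Seq(0, 1, 2).filter(extra => fileSize % (recordSize + extra) == 0)

  // Intended use, with the record size reported in the error message:
  //   val size = new java.io.File("/path/to/SampleTXFile.txt").length()
  //   println(matchingTerminatorLengths(size, 335))
}
```

If the result contains 1 or 2, the file is most likely newline-separated text rather than a pure fixed-length binary file, which would also be consistent with every line showing the same width in a hex editor.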
But when I set the option below:
.option("is_record_sequence", "true")
it gives the following output:
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|      PPMD_RECORD_01|      PPMD_RECORD_05|      PPMD_RECORD_06|      PPMD_RECORD_10|      PPMD_RECORD_11|      PPMD_RECORD_12|      PPMD_RECORD_14|      PPMD_RECORD_18|      PPMD_RECORD_20|      PPMD_RECORD_90|      PPMD_RECORD_95|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|[15, 828*****,**, ...|[20, 1582810, 020...|[05, 2015828, 200...|[, [0620158, 28],...|[, [06201, 58], 2...|[, [062, 01], 582...|[, , , 06201582, ...|[, [, 0], 6201582...|[, [, ], 06, 20, ...|[, [, ], 0620158,...|[, [, ], 062015, ...|
Please let me know if you need any further details.
Thanks & Regards, Pankaj Kumar
- The record produced when .option("is_record_sequence", "true") is used looks good. But just to make sure, could you please paste the output of .show() as a preformatted block surrounded by triple backquotes (```)?
- Also, you need to include several rows, not just one. One record is not enough to see record boundaries.
- Also, I have a suspicion that PPMD_RECORD_01, PPMD_RECORD_05, etc. are segment fields. Please make sure they redefine each other.
You can look at/edit this comment to see how this can be done.
paste the text here
It is very important to make sure the record boundaries are properly parsed.
Hi Ruslan,
Please find below the output of show.
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|      PPMD_RECORD_01|      PPMD_RECORD_05|      PPMD_RECORD_06|      PPMD_RECORD_10|      PPMD_RECORD_11|      PPMD_RECORD_12|      PPMD_RECORD_14|      PPMD_RECORD_18|      PPMD_RECORD_20|      PPMD_RECORD_90|      PPMD_RECORD_95|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|[15, 828****, *, ...|[20, 1582810, 020...|[05, 2015828, 200...|[, [0620158, 28],...|[, [06201, 58], 2...|[, [062, 01], 582...|[, , , 06201582, ...|[, [, 0], 6201582...|[, [, ], 06, 20, ...|[, [, ], 0620158,...|[, [, ], 062015, ...|
|[, , , , 06201583...|[, , , , , 062015...|[, , , , , , 0620...|[, [, ], , , , , ...|[, [, ], , , , 06...|[, [, ], , , , , ...|[, , , , , 062015...|[, [, ], , , , 06...|[, [, ], , , , , ...|[, [, ], , , , 06...|[, [, ], , , , , ...|
|[80, 4013999, 1, ...|[01, 8040139, 991...|[62, 0180401, 399...|[10, [PA, ], 201,...|[02, [BPA0618, 3]...|[36, [0209241, 44...|[58, 360, 212, 05...|[01, [5836023, 63...|[02, [0158360, 31...|[1, [1201583, 60]...|[, [142015, 836],...|
|[, , , , 18201584...|[, , , , , 902015...|[, , , , , , 9520...|[, [, ], , , , , ...|[TX, [7746932, 46...|[, [, ], , , , , ...|[, , , , , 902015...|[, [, ], , , , 95...|[, [, ], , , , , ...|[, [, T], X774693...|[, [, ], , , , , ...|
|[, , , 0620158, 7...|[, , , , , 062015...|[, , , , , , 0620...|[, [, ], , , , , ...|[, [, ], , , , 06...|[, [, ], , , , , ...|[, , , , , 062015...|[, [, ], , , , 06...|[, [, ], , , , , ...|[, [, ], , , 0620...|[, [, ], , , , , ...|
|[, , , , 06201587...|[, , , , , 062015...|[, , , , , , 1020...|[, [, ], , , , S,...|[, [, ], , , , 12...|[, [, ], , , , , ...|[, , , , , 162015...|[, [, ], , , SAN ...|[, [, ], , , , , ...|[, [, ], , , SAN ...|[, [, ], , , SAN ...|
|[, , , , 95201587...|[, , , , , 102015...|[02, , , , , , SA...|[, [, ], , , , , ...|[, [, ], , , , 16...|[, [, ], , , , , ...|[, , , , , 162015...|[, [, ], , , , 16...|[, [, ], , , , , ...|[, [, ], , , , 16...|[, [, ], , , , , ...|
|[33, 302 N B, U, ...|[17, , , , , 1420...|[12, 31, , , , , ...|[81, [0980200, 90...|[E1, [, ], , , , ...|[X, [, ], , , , ,...|[D, 3, 200, 90513...|[6, [200, 90], 62...|[20, [0903273, 99...|[93, [4200011, 64...|[21, [8818296, 53...|
|[, , , , 06201591...|[, , , , , 062015...|[, , , , , , 0620...|[, [, ], , , , , ...|[, [, ], , , , 06...|[, [, ], , , , , ...|[, , , , , 062015...|[, [, ], , , , 06...|[, [, ], , , , , ...|[, [, ], , , , 06...|[, [, ], , , , , ...|
|[, , , 9020159190...|[, , , , , 952015...|[, , , , , , 1020...|[, [, ], , TX7, 5...|[, [, ], , , , 11...|[, [, ], , , , , ...|[, , , , , 902015...|[, [, ], , , , 95...|[, [, ], , , , , ...|[, [DALLAS, ], , ...|[, [, ], , , , , ...|
|[, , , , 06201645...|[, , , , , 062016...|[, , , , , , 0620...|[, [, ], , , , , ...|[, [, ], , , , 06...|[, [, ], , , , , ...|[, , , , , 062016...|[, [, ], , , , 06...|[, [, ], , , , , ...|[, [, ], , , , 06...|[, [, ], , , , , ...|
|[, , , , 20201645...|[, , , , , 202016...|[, , , , , , 9020...|[, [, ], , , , , ...|[, [, ], , , , 10...|[, [, ], , , , , ...|[, , , , , 122016...|[, [, ], , , , 16...|[, [, ], , , , , ...|[, [, ], , , 2020...|[, [, ], , , , , ...|
|[, , , , 06201648...|[, , , , , 062016...|[, , , , , , 0620...|[, [, ], , , , , ...|[, [, ], , , , 06...|[, [, ], , , , , ...|[, , , , , 062016...|[, [, ], , , , 06...|[, [, ], , , , , ...|[, [, ], , , , 06...|[31, [, ], , , , ...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
Thanks & Regards, Pankaj Kumar
Thanks for the output. Below you see the same output surrounded by triple backquotes.
Unfortunately, I cannot make sense of the data. It seems the record boundaries are parsed incorrectly. I'm surprised that Cobrix does not throw an exception when you use .option("is_record_sequence", "true").
In order to extract hierarchical records you need to:
- Determine how many segments are in your multisegment file.
- Determine which field acts as the segment id.
- Make sure the copybook contains GROUPs that correspond to segments, and that these GROUPs redefine each other.
- Determine which segment ids belong to the root segment and which belong to child segments.
- Make sure all segments are parsed correctly, i.e. the copybook exactly matches the data file.
Only once the above is defined can you extract hierarchical data. I could help if the case were obvious, but your case is definitely more complicated.
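To illustrate the point about GROUPs that redefine each other, here is a minimal copybook sketch held in a Scala string, which could be passed to the reader through the copybook_contents option. Every field name and size in it is invented for illustration and does not come from the PVEMCDC1 copybook used in this thread; the real layout has to match the data file byte for byte.

```scala
object SegmentCopybookSketch {
  // HYPOTHETICAL layout: a two-character segment id followed by one 30-byte
  // area that is interpreted differently depending on the segment id.
  val copybook: String =
    """       01  RECORD.
      |           05  RECORD-TYPE            PIC X(2).
      |           05  MASTER-REC.
      |               10  MASTER-DATA        PIC X(30).
      |           05  DETAIL-05 REDEFINES MASTER-REC.
      |               10  DETAIL-05-DATA     PIC X(30).
      |           05  DETAIL-06 REDEFINES MASTER-REC.
      |               10  DETAIL-06-DATA     PIC X(30).
      |""".stripMargin

  // Intended use (requires Spark and spark-cobol on the classpath):
  //   spark.read.format("cobol")
  //     .option("copybook_contents", SegmentCopybookSketch.copybook)
  //     ...
}
```

In this shape each segment GROUP overlays the same bytes, so a record is 32 bytes long regardless of its type, rather than the sum of all segment sizes that results when the GROUPs sit side by side without REDEFINES.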
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| PPMD_RECORD_01| PPMD_RECORD_05| PPMD_RECORD_06| PPMD_RECORD_10| PPMD_RECORD_11| PPMD_RECORD_12|PPMD_RECORD_14 | PPMD_RECORD_18| PPMD_RECORD_20|PPMD_RECORD_90 | PPMD_RECORD_95|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|[15, 828****, *, ...|[20, 1582810, 020...|[05, 2015828, 200...|[,[0620158, 28],... |[, [06201, 58], 2...|[, [062, 01], 582...|[, , ,06201582, ... |[, [, 0], 6201582...|[, [, ], 06, 20, ...|[, [, ],0620158,... |[, [, ], 062015, ...|
|[, , , , 06201583...|[, , , , , 062015...|[, , , , , , 0620...|[, [, ], ,, , , ... |[, [, ], , , , 06...|[, [, ], , , , , ...|[, , , , , 062015...|[,[, ], , , , 06...|[, [, ], , , , , ...|[, [, ], , , , 06... |[, [, ], , , ,, ... |
|[80, 4013999, 1, ...|[01, 8040139, 991...|[62, 0180401, 399...|[10, [PA,], 201,... |[02, [BPA0618, 3]...|[36, [0209241, 44...|[58, 360, 212,05... |[01, [5836023, 63...|[02, [0158360, 31...|[1, [1201583, 60]...|[,[142015, 836],... |
|[, , , , 18201584...|[, , , , , 902015...|[, , , , , , 9520...|[, [, ], ,, , , ... |[TX, [7746932, 46...|[, [, ], , , , , ...|[, , , , , 902015...|[,[, ], , , , 95...|[, [, ], , , , , ...|[, [, T], X774693... |[, [, ], , , ,, ... |
|[, , , 0620158, 7...|[, , , , , 062015...|[, , , , , , 0620...|[, [, ], ,, , , ... |[, [, ], , , , 06...|[, [, ], , , , , ...|[, , , , , 062015...|[,[, ], , , , 06...|[, [, ], , , , , ...|[, [, ], , , 0620... |[, [, ], , , ,, ... |
|[, , , , 06201587...|[, , , , , 062015...|[, , , , , , 1020...|[, [, ], ,, , S,... |[, [, ], , , , 12...|[, [, ], , , , , ...|[, , , , , 162015...|[,[, ], , , SAN ...|[, [, ], , , , , ...|[, [, ], , , SAN ... |[, [, ], , ,SAN ... |
|[, , , , 95201587...|[, , , , , 102015...|[02, , , , , , SA...|[, [, ], ,, , , ... |[, [, ], , , , 16...|[, [, ], , , , , ...|[, , , , , 162015...|[,[, ], , , , 16...|[, [, ], , , , , ...|[, [, ], , , , 16... |[, [, ], , , ,, ... |
|[33, 302 N B, U, ...|[17, , , , , 1420...|[12, 31, , , , , ...|[81,[0980200, 90... |[E1, [, ], , , , ...|[X, [, ], , , , ,...|[D, 3, 200,90513...|[6, [200, 90], 62...|[20, [0903273, 99...|[93, [4200011,64... |[21, [8818296, 53...|
|[, , , , 06201591...|[, , , , , 062015...|[, , , , , , 0620...|[, [, ], ,, , , ... |[, [, ], , , , 06...|[, [, ], , , , , ...|[, , , , , 062015...|[,[, ], , , , 06...|[, [, ], , , , , ...|[, [, ], , , , 06... |[, [, ], , , ,, ... |
|[, , , 9020159190...|[, , , , , 952015...|[, , , , , , 1020...|[, [, ], ,TX7, 5... |[, [, ], , , , 11...|[, [, ], , , , , ...|[, , , , , 902015...|[,[, ], , , , 95...|[, [, ], , , , , ...|[, [DALLAS, ], , ... |[, [, ], , , ,, ... |
|[, , , , 06201645...|[, , , , , 062016...|[, , , , , , 0620...|[, [, ], ,, , , ... |[, [, ], , , , 06...|[, [, ], , , , , ...|[, , , , , 062016...|[,[, ], , , , 06...|[, [, ], , , , , ...|[, [, ], , , , 06... |[, [, ], , , ,, ... |
|[, , , , 20201645...|[, , , , , 202016...|[, , , , , , 9020...|[, [, ], ,, , , ... |[, [, ], , , , 10...|[, [, ], , , , , ...|[, , , , , 122016...|[,[, ], , , , 16...|[, [, ], , , , , ...|[, [, ], , , 2020... |[, [, ], , , ,, ... |
|[, , , , 06201648...|[, , , , , 062016...|[, , , , , , 0620...|[, [, ], ,, , , ... |[, [, ], , , , 06...|[, [, ], , , , , ...|[, , , , , 062016...|[,[, ], , , , 06...|[, [, ], , , , , ...|[, [, ], , , , 06... |[31, [, ], , ,, ... |
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+