cobrix
How can I load this multi-segment data from ASCII files?
Hello,
Thank you for your work on this project, it is a great help to me.
So far, I have been able to successfully load ASCII single segment files, but failed with multi-segment ones.
Here is a simplified example of the kind of data I am trying to load into a DataFrame:
Copybook:
01 COMPANY-DETAILS.
05 SEGMENT-ID PIC 9(1).
05 STATIC-DETAILS.
10 NAME PIC X(2).
05 CONTACTS REDEFINES STATIC-DETAILS.
10 PERSON PIC X(3).
Data:
1BB
2CCC
Code:
val copybook =
""" 01 COMPANY-DETAILS.
| 05 SEGMENT-ID PIC 9(1).
| 05 STATIC-DETAILS.
| 10 NAME PIC X(2).
|
| 05 CONTACTS REDEFINES STATIC-DETAILS.
| 10 PERSON PIC X(3).
""".stripMargin
val df = spark.read
.format("cobol")
.option("copybook_contents", copybook)
.option("is_record_sequence", "true")
.option("schema_retention_policy", "collapse_root")
.option("encoding", "ascii")
.load("data_ascii/mini.txt")
Output:
+----------+--------------+--------+
|SEGMENT_ID|STATIC_DETAILS|CONTACTS|
+----------+--------------+--------+
| null| [2C]| [2CC]|
+----------+--------------+--------+
I can see 2 problems in my output:
- a null value for SEGMENT_ID
- only one row: the 2 records seem to be read as if they were one (in my tests with my real data, which contains many records, I always end up with only one row in the DataFrame)
After thoroughly reading your (very nice) README, I have tried modifying the copybook, the data and several options, but I still fail to load my data correctly.
Since I am new to COBOL formats, I suspect that either my use of Cobrix options is incorrect, or my data format (ASCII, no record headers in the data) is incompatible with Cobrix.
Can you see what is wrong here?
Thanks!
The copybook and the program look good. This is exactly how multisegment files can be loaded.
I'm not sure about the data file, however. Multisegment files can only be loaded if the input file is a variable record length file. The is_record_sequence option says that the file is a variable record length file and that each record starts with a 4-byte RDW header. Here is more information on RDWs: the first 2 bytes are always zero and the second 2 bytes define the record length in little-endian format.
RDWs can be big-endian as well. In that case the first 2 bytes contain the record length, and the second 2 bytes are zeros.
You can take a look at data/test4_copybook.cob and the corresponding data file data/test4/COMP.DETAILS.SEP30.DATA.dat for an example of a multisegment file.
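To make the header layout concrete, here is a small Scala sketch that builds both RDW variants for a given record length (the object and method names are mine, not part of the Cobrix API). Note that the header consists of 4 raw bytes, not printable digits:

```scala
// Illustrative helper, not part of Cobrix: builds the 4-byte RDW
// header variants described above. The header holds raw bytes,
// not ASCII digit characters.
object RdwSketch {
  // Big-endian RDW: first 2 bytes hold the record length, last 2 are zero.
  def bigEndianRdw(recordLength: Int): Array[Byte] =
    Array[Byte](((recordLength >> 8) & 0xFF).toByte, (recordLength & 0xFF).toByte, 0, 0)

  // Little-endian RDW: first 2 bytes are zero, last 2 hold the length
  // in little-endian byte order (low byte first).
  def littleEndianRdw(recordLength: Int): Array[Byte] =
    Array[Byte](0, 0, (recordLength & 0xFF).toByte, ((recordLength >> 8) & 0xFF).toByte)
}
```

For example, a record of length 7 gets the big-endian header 0x00 0x07 0x00 0x00.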
Thanks for the quick answer :)
Thanks also for the information on RDW headers, I did not know that. I tried adding such headers to my data file, first little-endian, then big-endian, but I still fail to get more than one row.
Data file I used (still ASCII, headers being the first 4 chars):
00071BB
00082CCC
Output:
+----------+--------------+--------+
|SEGMENT_ID|STATIC_DETAILS|CONTACTS|
+----------+--------------+--------+
| 1| [BB]| [BB]|
+----------+--------------+--------+
I looked at your example files, but they contain binary data (not all characters are human-readable). Do you happen to have a working ASCII multi-segment example I could start from, in order to figure out what is wrong with my data or usage?
Thanks!
An ASCII multisegment example is data/test4_copybook.cob. Still, Cobrix is designed for loading binary files, so many characters won't be readable. If you have a text file with records separated by line-end characters, it might be out of scope of the project, at least for now.
Many binary file viewers support EBCDIC encoding, such as HxD for Windows or Hex Fiend for Mac.
Could you please attach your data example as a file? I'd like to take a look at it to understand why you are getting such results.
I confirm that my data is not a binary file, just a text file with ASCII encoding (text/plain; charset=us-ascii): test.txt
We receive that kind of files from the mainframe. Do we have to "convert" them to binary before being able to process them through Cobrix?
Loading text files, both ASCII and EBCDIC, is possible, but a little more involved. Take a look at this example: https://github.com/AbsaOSS/cobrix/issues/27#issuecomment-453597114
Let me know if you have questions about this approach.
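As a rough sketch of the general idea (not the exact code from the linked comment, and not Cobrix API): convert the plain ASCII text file into a variable-record-length binary file by prefixing each line with a binary RDW header. The paths, the choice of big-endian headers, and whether the stored length should include the 4 header bytes all depend on your data and the options you use:

```scala
import java.io.{BufferedOutputStream, FileOutputStream}
import scala.io.Source

// Sketch: turn an ASCII text file (one record per line) into a
// variable-record-length binary file with big-endian RDW headers.
// The header bytes are raw binary, not printable digit characters.
object TextToRdw {
  def convert(inputPath: String, outputPath: String): Unit = {
    val src = Source.fromFile(inputPath, "US-ASCII")
    val out = new BufferedOutputStream(new FileOutputStream(outputPath))
    try {
      for (line <- src.getLines()) {
        val record = line.getBytes("US-ASCII")
        val len = record.length      // length of the payload only
        out.write((len >> 8) & 0xFF) // big-endian: length in the first 2 bytes...
        out.write(len & 0xFF)
        out.write(0)                 // ...then 2 zero bytes
        out.write(0)
        out.write(record)
      }
    } finally {
      src.close()
      out.close()
    }
  }
}
```

For the sample data above, the line "1BB" would become the bytes 0x00 0x03 0x00 0x00 followed by '1' 'B' 'B'.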
Thanks for the example! With it, I have some promising results :)
I will post the final code as soon as I get a good output.
I have one remaining question though (since I am new to COBOL data, I may be missing something): usually in programming, ASCII is opposed to binary, as both are encodings; but I gather from what you say that we can have:
- text + ASCII (what many programmers call "ASCII")
- binary + ASCII (I don't get this one)
- text + EBCDIC (I also don't get this one)
- binary + EBCDIC (what I thought till now was just "EBCDIC")
Is that correct? If so, do you know of resources where I can learn the differences between those 4?
Thanks again !
Glad it worked for you! I'm considering adding this example to the documentation as loading mainframe text files seems a recurring use case.
Yes, all of the above combinations are possible.
- A text file is one which contains only printable characters, with records separated by line-end characters (LF = 0x0A in ASCII on Unix/Linux, 0x25 in EBCDIC). Line-end characters act as delimiters in text files. From the COBOL perspective, text files should contain only fields in DISPLAY format.
- A binary file is a file where all characters are used (printable and non-printable). There are no delimiters between records; instead, each record has a field specifying its size. Formats such as 'COMP', 'COMP-1', 'COMP-2' and 'COMP-3' are only usable in binary files.
- ASCII and EBCDIC are charsets, i.e. different mappings from a character's ordinal number to its visual representation and meaning.
Both charsets are possible for both text and binary files. But when loading files from a mainframe, binary files are mostly encoded in the EBCDIC charset, while text files are more often encoded in ASCII.
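The charset difference is easy to see on the JVM, which ships both encodings (IBM037 is one common EBCDIC code page available in the JDK; actual mainframe code pages vary):

```scala
import java.nio.charset.Charset

// Same text, two charsets: identical characters, different bytes.
// IBM037 is one common EBCDIC code page shipped with the JDK.
object CharsetDemo {
  val text = "1BB"
  val ascii: Array[Byte]  = text.getBytes(Charset.forName("US-ASCII")) // 0x31 0x42 0x42
  val ebcdic: Array[Byte] = text.getBytes(Charset.forName("IBM037"))   // 0xF1 0xC2 0xC2
}
```

In EBCDIC the digits live at 0xF0-0xF9 and 'A'-'I' at 0xC1-0xC9, so '1' and 'B' map to entirely different byte values than in ASCII, which is why an EBCDIC file looks unreadable in an ASCII viewer.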
Thanks for these clarifications :) I proposed a pull request adding this example to the documentation, hope it helps
Yes, it is great. Thank you!