
How can I load these multi-segment data from ASCII files?

Open bastien-bonnet opened this issue 6 years ago • 9 comments

Hello,

Thank you for your work on this project, it is of great help for me.

So far, I have been able to successfully load ASCII single segment files, but failed with multi-segment ones.

Here is a simplified example of the kind of data I am trying to load into a DataFrame:

Copybook:

01  COMPANY-DETAILS.
    05  SEGMENT-ID PIC 9(1).
    05  STATIC-DETAILS.
        10  NAME PIC X(2).

    05  CONTACTS REDEFINES STATIC-DETAILS.
        10  PERSON PIC X(3).

Data:

1BB
2CCC

Code:

val copybook =
      """       01  COMPANY-DETAILS.
        |            05  SEGMENT-ID		PIC 9(1).
        |            05  STATIC-DETAILS.
        |               10  NAME      	PIC X(2).
        |
        |            05  CONTACTS REDEFINES STATIC-DETAILS.
        |               10  PERSON    	PIC X(3).
      """.stripMargin

val df = spark.read
      .format("cobol")
      .option("copybook_contents", copybook)
      .option("is_record_sequence", "true")
      .option("schema_retention_policy", "collapse_root")
      .option("encoding", "ascii")
      .load("data_ascii/mini.txt")

Output:

+----------+--------------+--------+
|SEGMENT_ID|STATIC_DETAILS|CONTACTS|
+----------+--------------+--------+
|      null|          [2C]|   [2CC]|
+----------+--------------+--------+

I can see 2 problems in my output:

  • null value
  • only one row: the 2 records seem to be read as if they were one (in my tests with real data containing many records, I always end up with only one row in the DataFrame)

After thoroughly reading your (very nice) README, I have tried to modify the copybook, data and several options, but I still fail to load my data correctly.

Since I am new to Cobol formats, I suspect either my use of Cobrix options to be incorrect, or my data format (ASCII, no record header in data) to be incompatible with Cobrix.

Can you see what is wrong here?

Thanks!

bastien-bonnet avatar Jul 18 '19 15:07 bastien-bonnet

The copybook and the program look good. This is exactly how multisegment files can be loaded.

I'm not sure about the data file, however. Multisegment files can only be loaded if the input file is a variable record length file. The is_record_sequence option says that the file is a variable record length file and that each record starts with a 4-byte RDW header. Here is more information on RDWs. In the little-endian layout, the first 2 bytes are always zero and the second 2 bytes define the record length. RDWs can be big-endian as well. In that case the first 2 bytes contain the record length, and the second 2 bytes are zeros.
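The header layout described above can be sketched in plain Scala. This is only an illustration, not Cobrix code: the names `RdwSketch`, `rdw` and `withRdw` are hypothetical, and it assumes the length stored in the header counts the record payload only (whether the header itself is included may differ per file).

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets

object RdwSketch {
  // Builds a 4-byte RDW following the layout described above.
  // Assumption: the stored length is the payload length, header excluded.
  def rdw(payloadLength: Int, bigEndian: Boolean): Array[Byte] = {
    require(payloadLength >= 0 && payloadLength <= 0xFFFF, "length must fit in 2 bytes")
    val hi = ((payloadLength >> 8) & 0xFF).toByte
    val lo = (payloadLength & 0xFF).toByte
    if (bigEndian) Array[Byte](hi, lo, 0, 0) // length first, then two zero bytes
    else Array[Byte](0, 0, lo, hi)           // two zero bytes, then little-endian length
  }

  // Prefixes each ASCII record with its RDW and concatenates everything,
  // with no line separators between records.
  def withRdw(records: Seq[String], bigEndian: Boolean): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    records.foreach { record =>
      val payload = record.getBytes(StandardCharsets.US_ASCII)
      out.write(rdw(payload.length, bigEndian))
      out.write(payload)
    }
    out.toByteArray
  }
}
```

For example, `RdwSketch.withRdw(Seq("1BB", "2CCC"), bigEndian = true)` produces the bytes `00 03 00 00 31 42 42 00 04 00 00 32 43 43 43`. The header bytes are raw binary values, not the digit characters `'0'` to `'9'`, so such a file cannot be typed in a text editor.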

You can take a look at data/test4_copybook.cob and corresponding data file data/test4/COMP.DETAILS.SEP30.DATA.dat for an example of a multisegment file.

yruslan avatar Jul 18 '19 19:07 yruslan

Thanks for the quick answer :)

Thanks also for the information on RDW headers, I did not know that. I tried to add such headers to my data file, first little-endian, then big-endian, but I still fail to get more than one row.

Data file I used (still ASCII, headers being the first 4 chars):

00071BB
00082CCC

Output:

+----------+--------------+--------+
|SEGMENT_ID|STATIC_DETAILS|CONTACTS|
+----------+--------------+--------+
|         1|          [BB]|    [BB]|
+----------+--------------+--------+

I looked at your example files, but they contain binary data (not all the characters are human-readable). Do you happen to have a working ASCII multi-segment example? I would like to start from one in order to figure out what's wrong with my data or usage.

Thanks!

bastien-bonnet avatar Jul 19 '19 10:07 bastien-bonnet

An ASCII multisegment example is data/test4_copybook.cob. Still, Cobrix is designed for loading binary files, so many characters won't be readable. If you have a text file separated by line end characters, it might be out of scope of the project, at least for now.

Many binary file viewers support EBCDIC encoding, such as HxD for Windows or Hex Fiend for Mac.

Could you please attach your data example as a file? I'd like to take a look at it to understand why you are getting such results.

yruslan avatar Jul 19 '19 11:07 yruslan

I confirm that my data is not a binary file, just a text file with ASCII encoding (text/plain; charset=us-ascii): test.txt

We receive that kind of file from the mainframe. Do we have to "convert" them to binary before being able to process them through Cobrix?

bastien-bonnet avatar Jul 19 '19 15:07 bastien-bonnet

Loading text files, both ASCII and EBCDIC, is possible, but a little more involved. Take a look at this example: https://github.com/AbsaOSS/cobrix/issues/27#issuecomment-453597114

Let me know if you have questions about this approach.
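To illustrate the general idea of handling a line-delimited multi-segment text file, here is a minimal plain-Scala sketch. It is not the code from the linked comment and does not use the Cobrix API. It splits the input on line ends and slices each line by the fixed positions from the copybook in the question. The names `TextSegmentsSketch`, `StaticDetails` and `Contacts` are hypothetical, and the mapping of segment id 1 to STATIC-DETAILS and 2 to CONTACTS is an assumption based on the sample data.

```scala
object TextSegmentsSketch {
  sealed trait Segment
  final case class StaticDetails(name: String) extends Segment
  final case class Contacts(person: String) extends Segment

  // Slices one line by the copybook's fixed positions:
  // SEGMENT-ID PIC 9(1), then either NAME PIC X(2) or PERSON PIC X(3).
  // Assumption: id '1' means STATIC-DETAILS and id '2' means CONTACTS.
  def parseLine(line: String): Option[Segment] = {
    val body = line.drop(1)
    line.headOption match {
      case Some('1') => Some(StaticDetails(body.take(2)))
      case Some('2') => Some(Contacts(body.take(3)))
      case _         => None // unknown segment id or empty line
    }
  }

  // Records are delimited by line ends (LF or CRLF), as in a text file.
  def parse(text: String): Seq[Segment] =
    text.split("\r?\n").toSeq.filter(_.nonEmpty).flatMap(parseLine)
}
```

With the sample data, `TextSegmentsSketch.parse("1BB\n2CCC\n")` yields one `StaticDetails("BB")` and one `Contacts("CCC")`, that is, one row per line. In a Spark job the same per-line function could be applied to the lines of a text file, while the actual field decoding would be left to the copybook parser.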

yruslan avatar Jul 19 '19 17:07 yruslan

Thanks for the example! With it, I have some promising results :)

I will post the final code as soon as I get a good output.

I have one remaining question though (since I am new to Cobol data, I may be missing something): usually in programming, ASCII is opposed to binary, as both are encodings; but I get from what you say that we can have:

  • text + ASCII (what many programmers call "ASCII")
  • binary + ASCII (I don't get this one)
  • text + EBCDIC (I also don't get this one)
  • binary + EBCDIC (what I thought till now was just "EBCDIC")

Is that correct? If yes, do you know resources where I can understand the differences between those 4?

Thanks again !

bastien-bonnet avatar Jul 22 '19 16:07 bastien-bonnet

Glad it worked for you! I'm considering adding this example to the documentation, as loading mainframe text files seems to be a recurring use case.

Yes, all of the above combinations are possible.

  • A text file is one that contains only printable characters, with records separated by line end characters (LF = 0x0A in ASCII in Unix/Linux, 0x25 in EBCDIC). Line end characters act as delimiters in text files. From a COBOL perspective, text files should contain only fields in DISPLAY format.
  • A binary file is a file where all characters are used (printable and non-printable). There are no delimiters between records, but each record has a field specifying its size. Formats such as 'COMP', 'COMP-1', 'COMP-2', 'COMP-3' are only usable in binary files.
  • ASCII and EBCDIC are charsets, i.e. different mappings from a character's ordinal number to its visual representation and meaning.

Both charsets are possible for text files and binary files. But when loading files from a mainframe, binary files are mostly encoded in the EBCDIC charset, while text files are more often encoded in ASCII.
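The charset point can be made concrete with a small Scala sketch that encodes the same characters in both charsets and prints the resulting bytes. The name `CharsetSketch` is hypothetical, and `IBM037` is only one of several EBCDIC code pages. Its availability in the running JVM is an assumption (it ships with the extended charsets of standard JDK builds).

```scala
import java.nio.charset.{Charset, StandardCharsets}

object CharsetSketch {
  // Assumption: the IBM037 EBCDIC code page is available in this JVM.
  val ebcdic: Charset = Charset.forName("IBM037")

  // Formats bytes as space-separated uppercase hex pairs.
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xFF}%02X").mkString(" ")

  def asciiHex(s: String): String = hex(s.getBytes(StandardCharsets.US_ASCII))
  def ebcdicHex(s: String): String = hex(s.getBytes(ebcdic))
}
```

Here `CharsetSketch.asciiHex("1BB")` gives `31 42 42`, while `CharsetSketch.ebcdicHex("1BB")` gives `F1 C2 C2`: the same text, stored as different bytes. Whether the file is text or binary is a separate question of how records are delimited.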

yruslan avatar Jul 23 '19 07:07 yruslan

Thanks for these clarifications :) I proposed a pull request adding this example to the documentation, hope it helps.

bastien-bonnet avatar Jul 25 '19 12:07 bastien-bonnet

Yes, it is great. Thank you!

yruslan avatar Jul 25 '19 15:07 yruslan