
Duplicate record_id is generated when the option generate_record_id = "true" is set for US_ASCII

Open · Loganhex2021 opened this issue 3 years ago · 15 comments

We are getting duplicate record_id values while reading a US_ASCII file with the read options below:

'encoding': 'ASCII', 'is_text': 'true', 'generate_record_id': 'true'
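
For context, this is roughly how such options map to a spark-cobol read call. It is only an illustrative sketch; the copybook path and data path below are placeholders, since the real production values are not shared in this issue.

read_options = {
    'copybook': '/path/to/copybook.cpy',   # placeholder, not the real copybook
    'encoding': 'ASCII',
    'is_text': 'true',
    'generate_record_id': 'true'
}

# generate_record_id appends File_Id and Record_Id columns to the output
df = spark.read.format("cobol").options(**read_options).load('/path/to/us_ascii_file.dat')
df.select('File_Id', 'Record_Id').show()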

Only the record_id is duplicated, not the entire record.

Expected behaviour

With the above read options, each record_id should be unique. Instead, we are getting duplicate record_id values, and we noticed that one of the duplicated rows has none values for most of the columns while the other has proper values.

For example, current output:

record_id  col1  col2  col3
1          2     none  none
1          2     UK    Chicago
2          4     none  none
2          4     Asia  XXXX

Expected output:

record_id  col1  col2  col3
1          2     UK    Chicago
2          4     Asia  XXXX

Loganhex2021 avatar Mar 25 '22 14:03 Loganhex2021

@yruslan - could you please take a look and help us find a solution?

Loganhex2021 avatar Mar 25 '22 14:03 Loganhex2021

Hi, thanks for the report. Looks like a very interesting bug. Keen to fix it.

Could I ask you to attach

  • the copybook
  • the file
  • the exact code snippet you used

so it would be easier for us to reproduce?

Also, instead of is_text = true, could you try

.option("record_format", "D")

or

.option("record_format", "D2")

and check whether the issue still happens.
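
In the reporter's PySpark setup the same suggestion would look roughly like this; the paths are placeholders, and only the record_format line differs between the two variants:

df = (spark.read
      .format("cobol")
      .option("copybook", "/path/to/copybook.cpy")   # placeholder path
      .option("record_format", "D")                  # or "D2" for the second variant
      .option("encoding", "ascii")
      .option("generate_record_id", "true")
      .load("/path/to/us_ascii_file.dat"))           # placeholder path

df.show()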

yruslan avatar Mar 25 '22 14:03 yruslan

@yruslan - thanks for quick response.

Sorry, this is a production issue and we are not able to share sample data due to security policy.

Anyway, I will try the options you suggested and let you know the results.

Loganhex2021 avatar Mar 25 '22 14:03 Loganhex2021

I'm going to try to reproduce the issue based on the description you gave; it might just take longer, since it works as expected for our use cases.

yruslan avatar Mar 25 '22 15:03 yruslan

So far I'm unable to reproduce it. However many text files I parse, I get unique record_id values. Could you also check 'File_Id'? 'Record_Id' alone is not unique if multiple files are read, but the pair ('File_Id', 'Record_Id') is always unique.

val copybook =
    """         01  ENTITY.
           05  A    PIC X(1).
           05  B    PIC X(3).
    """

val text =
  """1
    |12
    |123
    |1234
    |12345
    |123456
    |1234567
    |12345678
    |123456789
    |""".stripMargin

withTempTextFile("ascii_nul", ".dat", StandardCharsets.UTF_8, text) { tmpFileName =>
  val df = spark
    .read
    .format("cobol")
    .option("copybook_contents", copybook)
    .option("pedantic", "true")
    .option("record_format", "D")
    .option("encoding", "ascii")
    .option("generate_record_id", "true")
    .load(tmpFileName)

  df.show
}

Output:

+-------+---------+---+---+
|File_Id|Record_Id|  A|  B|
+-------+---------+---+---+
|      0|        0|  1|   |
|      0|        1|  1|  2|
|      0|        2|  1| 23|
|      0|        3|  1|234|
|      0|        4|  1|234|
|      0|        5|  1|234|
|      0|        6|  5|  6|
|      0|        7|  1|234|
|      0|        8|  5| 67|
|      0|        9|  1|234|
|      0|       10|  5|678|
|      0|       11|  1|234|
|      0|       12|  5|678|
+-------+---------+---+---+
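
A quick way to verify the uniqueness of the (File_Id, Record_Id) pair on your side could be a grouped count. This is just a sketch, assuming df is the DataFrame loaded with generate_record_id enabled:

from pyspark.sql import functions as F

# Any pair that occurs more than once is a genuine duplicate across the whole read
duplicates = (df.groupBy("File_Id", "Record_Id")
                .count()
                .filter(F.col("count") > 1))
duplicates.show()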

yruslan avatar Mar 25 '22 15:03 yruslan

Hi @Loganhex2021, how is it going?

The upcoming version (2.4.10-SNAPSHOT, current master) has safeguards against partial record parsing caused by too-long ASCII lines. You can check whether it fixes your issue as well.

Let me know if you have found a solution for your issue.

yruslan avatar Mar 29 '22 07:03 yruslan

hi @yruslan ,

Thanks for following up. We noticed an interesting thing in the source file: the actual file size is ~400 MB and each record length is 102 bytes. Roughly every 32 MB we get a duplicate Record_Id, so we get about 12 duplicate records for this 400 MB file (400 / 32 ≈ 12).
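
Spelled out, the estimate is simple arithmetic based on the figures above; the 32 MB spacing is the reporter's observation, not a confirmed internal constant:

file_size_mb = 400            # approximate source file size
record_length_bytes = 102     # fixed record length reported above
duplicate_interval_mb = 32    # observed spacing between duplicates

expected_duplicates = file_size_mb // duplicate_interval_mb                        # ~12
records_per_interval = duplicate_interval_mb * 1024 * 1024 // record_length_bytes  # ~329,000
print(expected_duplicates, records_per_interval)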

Is this helpful for identifying the root cause of this issue?

Loganhex2021 avatar Mar 29 '22 12:03 Loganhex2021

Yes, it is very helpful! Will try to reproduce

yruslan avatar Mar 29 '22 14:03 yruslan

Still can't reproduce. Could you please send the code snippet you use to load the data, and spark-cobol version?

Are you reading a single file or multiple files? When Record_Id is duplicated, is File_Id duplicated too?

yruslan avatar Mar 30 '22 06:03 yruslan


Still can't reproduce. Could you please send the code snippet you use to load the data, and spark-cobol version?

Please find below the code snippets that generate the sample file and the read options we used.

To generate the file (PySpark on Databricks):

_source_path = '/test_ascii/test/triage_ascii_3.txt'
record = ''
record_id = 1

# Build the data: the first iteration writes one 102-byte line,
# every subsequent iteration doubles the buffer to grow the file quickly
for record_id in range(record_id, 23):
    if record_id == 1:
        record += (str(record_id).zfill(7) + 'dummydata' * 10 + 'dum\r\n')
    else:
        record += record

with open(_source_path, 'w') as testfile:
    testfile.write(record)

# To read the file (got 6 duplicates)
_ro = {'copybook_contents': ' 01 ASCII-FILE.\n 02 ID-COLUMN PIC X(7).\n 02 COL-TWO PIC X(09).\n 02 FILLER PIC X(86).\n',
       'is_text': 'true',
       'encoding': 'ASCII',
       'ebcdic_code_page': 'cp037',
       'string_trimming_policy': 'none',
       'debug_ignore_file_size': 'true',
       'generate_record_id': 'true'}

entity_df = spark.read.format("cobol").options(**_ro).load(_source_path)

# Rows whose Record_Id appears more than once
entity_df.exceptAll(entity_df.drop_duplicates(['Record_Id'])).rdd.collect()

Spark version: 3.1.2, Scala 2.12, Cobrix version: 2.2.2

Loganhex2021 avatar Mar 30 '22 11:03 Loganhex2021

I see, thanks! Please, try the latest master of cobrix (2.4.10-SNAPSHOT ideally), or at least 2.4.9. Your issue might have been fixed already.

In addition,

  • Remove ebcdic_code_page, debug_ignore_file_size, is_text
  • Add record_format: 'D', pedantic: 'true' (see the sketch below)
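
Put together with the snippet above, the adjusted read options would look roughly like this; it is a sketch based on the reporter's own options, not a verified configuration:

_ro = {'copybook_contents': ' 01 ASCII-FILE.\n 02 ID-COLUMN PIC X(7).\n 02 COL-TWO PIC X(09).\n 02 FILLER PIC X(86).\n',
       'record_format': 'D',              # replaces is_text
       'encoding': 'ASCII',
       'string_trimming_policy': 'none',
       'pedantic': 'true',                # fail on unrecognized options
       'generate_record_id': 'true'}
# ebcdic_code_page and debug_ignore_file_size removed as suggested

entity_df = spark.read.format("cobol").options(**_ro).load(_source_path)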

yruslan avatar Mar 30 '22 12:03 yruslan

Thanks @yruslan, I will try the suggested options.

Loganhex2021 avatar Mar 31 '22 09:03 Loganhex2021

I see, thanks! Please, try the latest master of cobrix (2.4.10-SNAPSHOT ideally), or at least 2.4.9. Your issue might have been fixed already.

In addition,

  • Remove ebcdic_code_page, debug_ignore_file_size, is_text
  • Add record_format: 'D', pedantic: 'true'

With the record_format: 'D' option, if the last record has fewer bytes than expected, the record gets skipped.

Loganhex2021 avatar Apr 01 '22 14:04 Loganhex2021

Cool, glad to hear that record_id does not have duplicates.

Will try to reproduce the last record issue you mentioned.

yruslan avatar Apr 01 '22 15:04 yruslan

Hi,

With the record_format: 'D' option, if the last record has fewer bytes than expected, the record gets skipped.

I can't reproduce it; records that have at least one byte (even a single space character) are not skipped.
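
For anyone who wants to check this locally, a minimal sketch could look like the following; the copybook, file contents, and path are made up for illustration, and on a cluster the file would need to be written to storage Spark can read (e.g. DBFS):

# Write a tiny test file whose last line is shorter than the copybook layout
path = '/tmp/short_last_record.txt'
with open(path, 'w') as f:
    f.write('AAAA\n')
    f.write('BBBB\n')
    f.write('C\n')        # last record has fewer bytes than PIC X(4)

copybook = '''
       01  REC.
          05  F1  PIC X(4).
'''

df = (spark.read
      .format("cobol")
      .option("copybook_contents", copybook)
      .option("record_format", "D")
      .option("encoding", "ascii")
      .load(path))

df.show()   # per the comment above, the short last record should still appear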

Probably the best course of action would be to wait for 2.4.10 to be released, update the spark-cobol version, and check whether the error is still there.

yruslan avatar Apr 05 '22 11:04 yruslan