cobrix
Support for code page 930 to read Japanese characters
Trying to read an EBCDIC file received from a mainframe system using a copybook. The data has address lines that contain Japanese characters. When reading, the values come back blank - [, , , , ]. For Japanese characters we cannot pass code page 930, since only the common (default), common_extended, cp037, cp037_extended, and cp875 code pages are supported as of now. All other data is read correctly except this.
Hi,
It is possible to define a custom code page by defining an EBCDIC to Unicode conversion table. For example,
Cobrix option:
.option("ebcdic_code_page_class", "za.co.absa.cobrix.spark.cobol.source.utils.CustomCodePage")
Source code: https://github.com/AbsaOSS/cobrix/blob/ab9ab1492e9d55aaa9003304c0ff2632f9dba332/spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/utils/CustomCodePage.scala#L21-L21
But adding support for 930 would be a good addition, so we are going to look into that.
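For illustration, the EBCDIC-to-Unicode conversion-table idea behind a custom code page can be sketched in plain Scala. Note that the object name and table entries below are made up for the sketch; this is not the actual Cobrix `CodePage` class, just the concept it is built on:

```scala
object ConversionTableSketch {
  // Illustrative 256-entry EBCDIC -> Unicode table. Only a few slots are
  // filled here (the rest decode to '?'); a real table covers all 256 values.
  private val table: Array[Char] = {
    val t = Array.fill[Char](256)('?')
    t(0x40) = ' ' // EBCDIC space
    t(0xC1) = 'A' // EBCDIC 'A'
    t(0xC2) = 'B' // EBCDIC 'B'
    t(0xF1) = '1' // EBCDIC '1'
    t
  }

  // Decode a buffer of single-byte EBCDIC characters to a Unicode string.
  def decode(bytes: Array[Byte]): String =
    bytes.map(b => table(b & 0xFF)).mkString
}
```

For example, `ConversionTableSketch.decode(Array(0xC1, 0xC2, 0x40, 0xF1).map(_.toByte))` yields `"AB 1"`. A custom code page class plugged in via `ebcdic_code_page_class` supplies exactly this kind of per-byte table.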
Hi ,
I am not very sure how the Cobrix conversion works. Does it read the EBCDIC data, convert it to ASCII, and then to UTF-8? I will check the custom code page implementation.
Thanks, Archana
Hi Archana,
After some research, we found that cp930 has 1 byte representation for Katakana, while all other characters are represented by 2 bytes.
Currently, Cobrix assumes that every EBCDIC character is represented by 1 byte, so the solution I proposed earlier won't work for non-Katakana characters.
We can implement the support for cp930, just trying to find a spec for it. Do you have a link to a 930 table by any chance?
Thank you, Ruslan
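To illustrate why the 1-byte assumption breaks: mixed EBCDIC code pages such as cp930 use shift-out (0x0E) and shift-in (0x0F) control bytes to switch between single-byte and double-byte mode, so a decoder has to track the current mode while scanning. A minimal sketch of that loop in plain Scala (the table entries and names below are invented for illustration, not real cp930 mappings):

```scala
object MixedWidthDecoderSketch {
  val ShiftOut: Byte = 0x0E // switch to double-byte (DBCS) mode
  val ShiftIn: Byte  = 0x0F // switch back to single-byte (SBCS) mode

  // Tiny illustrative tables; real cp930 tables have thousands of entries.
  private val singleByte: Map[Int, Char] = Map(0x40 -> ' ', 0xC1 -> 'A')
  private val doubleByte: Map[Int, Char] = Map(0x4481 -> 'あ', 0x4482 -> 'い') // made-up pairs

  def decode(bytes: Array[Byte]): String = {
    val sb = new StringBuilder
    var i = 0
    var dbcs = false
    while (i < bytes.length) {
      bytes(i) match {
        case ShiftOut => dbcs = true; i += 1
        case ShiftIn  => dbcs = false; i += 1
        case b if dbcs && i + 1 < bytes.length =>
          // In DBCS mode, consume two bytes and look up the 16-bit pair.
          val pair = ((b & 0xFF) << 8) | (bytes(i + 1) & 0xFF)
          sb += doubleByte.getOrElse(pair, '?')
          i += 2
        case b =>
          sb += singleByte.getOrElse(b & 0xFF, '?')
          i += 1
      }
    }
    sb.toString
  }
}
```

With this scheme a fixed per-byte table is not enough: the meaning of a byte depends on the shift state, which is why cp930 needs dedicated support rather than a simple custom conversion table.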
Hi Ruslan,
No, I do not have a link to a spec for cp930; I am also searching but have not found one yet. Yes, the multi-byte characters are the issue. We tried reading with different charsets too, but nothing worked.
Thanks, Archana
Hi Ruslan,
Using Cobrix we can read EBCDIC data with a copybook, but does Cobrix provide an option to write it back in EBCDIC format?
Thanks, Archana
Sorry, we don't have the write feature at the moment. It is in the plans, but probably won't be soon.
Regarding the Japanese charset - we can implement support for it, but we need the spec: the mapping of EBCDIC bytes or byte pairs to Unicode characters.
Hi @yruslan,
By any chance, could you also include the following 2-byte code pages?
https://web.archive.org/web/20141201234940/http://www-01.ibm.com/software/globalization/ccsid/ccsid300.html
https://web.archive.org/web/20141129222534/http://www-01.ibm.com/software/globalization/ccsid/ccsid1364.html
https://web.archive.org/web/20141129205408/http://www-01.ibm.com/software/globalization/ccsid/ccsid1388.html
Many thanks for your help.
Br., Bence.
@BenceBenedek ,
Could you please create a separate GitHub issue for these?
It would be easier to manage for us.
Thank you!