cobrix
Support for code page 930 to read Japanese characters
Trying to read an EBCDIC file received from a mainframe system using a copybook. The data has address lines that contain Japanese characters. When reading, the values come back blank - [, , , , ]. For Japanese characters we cannot pass code page 930, since only the common (default), common_extended, cp037, cp037_extended, and cp875 code pages are supported as of now. All other data is read correctly except this.
Hi,
It is possible to define a custom code page by defining an EBCDIC to Unicode conversion table. For example,
Cobrix option:
.option("ebcdic_code_page_class", "za.co.absa.cobrix.spark.cobol.source.utils.CustomCodePage")
Source code: https://github.com/AbsaOSS/cobrix/blob/ab9ab1492e9d55aaa9003304c0ff2632f9dba332/spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/utils/CustomCodePage.scala#L21-L21
But adding support for 930 would be a good addition, so we are going to look into that.
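For illustration, the EBCDIC-to-Unicode conversion-table idea behind a custom code page can be sketched in plain Scala. Note that the object name and table entries below are made up for the sketch; this is not the actual Cobrix `CodePage` class, just the concept it is built on:

```scala
object ConversionTableSketch {
  // Illustrative 256-entry EBCDIC -> Unicode table. Only a few slots are
  // filled here (the rest decode to '?'); a real table covers all 256 values.
  private val table: Array[Char] = {
    val t = Array.fill[Char](256)('?')
    t(0x40) = ' ' // EBCDIC space
    t(0xC1) = 'A' // EBCDIC 'A'
    t(0xC2) = 'B' // EBCDIC 'B'
    t(0xF1) = '1' // EBCDIC '1'
    t
  }

  // Decode a buffer of single-byte EBCDIC characters to a Unicode string.
  def decode(bytes: Array[Byte]): String =
    bytes.map(b => table(b & 0xFF)).mkString
}
```

For example, `ConversionTableSketch.decode(Array(0xC1, 0xC2, 0x40, 0xF1).map(_.toByte))` yields `"AB 1"`. A custom code page class plugged in via `ebcdic_code_page_class` supplies exactly this kind of per-byte table.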
Hi ,
I am not very sure how the Cobrix conversion works. Does it read the EBCDIC data, convert it to ASCII, and then to UTF-8? I will check the custom code page implementation.
Thanks, Archana
Hi Archana,
After some research, we found that cp930 has 1 byte representation for Katakana, while all other characters are represented by 2 bytes.
Currently, Cobrix assumes that every EBCDIC character is represented by 1 byte, so the solution I proposed earlier won't work for non-Katakana characters.
We can implement the support for cp930, just trying to find a spec for it. Do you have a link to a 930 table by any chance?
Thank you, Ruslan
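To illustrate why the 1-byte assumption breaks: mixed EBCDIC code pages such as cp930 use shift-out (0x0E) and shift-in (0x0F) control bytes to switch between single-byte and double-byte mode, so a decoder has to track the current mode while scanning. A minimal sketch of that loop in plain Scala (the table entries and names below are invented for illustration, not real cp930 mappings):

```scala
object MixedWidthDecoderSketch {
  val ShiftOut: Byte = 0x0E // switch to double-byte (DBCS) mode
  val ShiftIn: Byte  = 0x0F // switch back to single-byte (SBCS) mode

  // Tiny illustrative tables; real cp930 tables have thousands of entries.
  private val singleByte: Map[Int, Char] = Map(0x40 -> ' ', 0xC1 -> 'A')
  private val doubleByte: Map[Int, Char] = Map(0x4481 -> 'あ', 0x4482 -> 'い') // made-up pairs

  def decode(bytes: Array[Byte]): String = {
    val sb = new StringBuilder
    var i = 0
    var dbcs = false
    while (i < bytes.length) {
      bytes(i) match {
        case ShiftOut => dbcs = true; i += 1
        case ShiftIn  => dbcs = false; i += 1
        case b if dbcs && i + 1 < bytes.length =>
          // In DBCS mode, consume two bytes and look up the 16-bit pair.
          val pair = ((b & 0xFF) << 8) | (bytes(i + 1) & 0xFF)
          sb += doubleByte.getOrElse(pair, '?')
          i += 2
        case b =>
          sb += singleByte.getOrElse(b & 0xFF, '?')
          i += 1
      }
    }
    sb.toString
  }
}
```

With this scheme a fixed per-byte table is not enough: the meaning of a byte depends on the shift state, which is why cp930 needs dedicated support rather than a simple custom conversion table.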
Hi Ruslan,
No, I do not have a link to a spec for cp930; I am also searching but have not found one yet. Yes, the multi-byte characters are the issue. We tried reading with different charsets too, but nothing worked.
Thanks, Archana
Hi Ruslan,
Using Cobrix we can read EBCDIC data with a copybook, but does Cobrix provide an option to write it back in EBCDIC format?
Thanks, Archana
Sorry, we don't have the write feature at the moment. It is in the plans, but probably won't be soon.
Regarding the Japanese charset - we can implement support for it, but we need the spec: the mapping of EBCDIC bytes or byte pairs to Unicode characters.
Hi @yruslan,
By any chance, could you also include the following 2-byte code pages?
https://web.archive.org/web/20141201234940/http://www-01.ibm.com/software/globalization/ccsid/ccsid300.html
https://web.archive.org/web/20141129222534/http://www-01.ibm.com/software/globalization/ccsid/ccsid1364.html
https://web.archive.org/web/20141129205408/http://www-01.ibm.com/software/globalization/ccsid/ccsid1388.html
Many thanks for your help.
Br., Bence.
@BenceBenedek ,
Could you please create a separate GitHub issue for these?
It would be easier to manage for us.
Thank you!