libredwg: Still NATIVE_WCHAR2 failures. Fixed: Unicode characters in Text entities (unsupported <r2007 codepages, like ANSI_1254)

Hello,

I'm trying to read a DWG file, "2.dwg" (see the attachment 2.zip below).

I have a problem with a text entity:

The Text entity with handle 890 should have the following text_value: "20ƒ10/18 l=420 (alt)". But if you look at the JSON file, the 'ƒ' character is missing.

Please look at the image below.

Searching online tables, I found that the 'ƒ' character should be Unicode "\u0192", but here I'm getting "\u0083".

It is not clear to me whether, and how, I can convert these values. Do you have any suggestions?

2.zip

Mar 09 '23 13:03 gentilinigm

This looks like a bug in our UTF-8 conversion. I'll check...

Mar 09 '23 14:03 rurban

Hello,

I have done some more testing on this. It looks like the Unicode char "\u0192" is in fact very similar to "\u0083". References:
https://www.fileformat.info/info/unicode/char/0192/index.htm
https://www.fileformat.info/info/unicode/char/0083/index.htm

So maybe it's okay and not a bug? If I copy the character from AutoCAD and paste it somewhere else, I get "\u0192" as the right Unicode character.

Sorry, I have very little experience with encodings, so I can't tell whether I'm doing something wrong or whether it's a problem with the LibreDWG library.

If you can confirm that it's not a problem with LibreDWG, we can close the issue.

I'll leave the DXF here in case it's useful. 2.zip

Thank you.

Mar 21 '23 11:03 gentilinigm

This is still a limitation of our codepage conversions. You have ANSI_1254 (codepage 33), which we do not yet decode properly into UTF-8 for any version <r2007. libdxfrw does it via iconv and some handmade mapping tables.

Working on it in the work/iconv branch
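
To make the limitation concrete, here is a minimal sketch in Python (not LibreDWG code) of what goes wrong when a codepage byte is passed through without a mapping table. It assumes the text is stored as single bytes with 'ƒ' as byte 0x83, which is what the reported "\u0083" suggests; the `raw` value is a hypothetical reconstruction, not data read from 2.dwg.

```python
# Hypothetical reconstruction of the stored bytes; assumption, not read from 2.dwg.
raw = b"20\x8310/18 l=420 (alt)"

# Passing bytes through unchanged is equivalent to a Latin-1 decode:
# 0x83 becomes the C1 control character U+0083.
passed_through = raw.decode("latin-1")

# Decoding with the drawing's codepage (Windows-1254, i.e. ANSI_1254)
# maps 0x83 to U+0192, LATIN SMALL LETTER F WITH HOOK.
decoded = raw.decode("cp1254")

print(ascii(passed_through))
print(ascii(decoded))
```

The two results differ only in the third character, which is why the value looked like a missing glyph rather than an obviously wrong string.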

Mar 21 '23 12:03 rurban

This should be fixed now

Apr 12 '23 15:04 rurban

I tested it on the example above and it seems almost correct: [image]

"text_value": " 2010/18 l=420 (alt)",

If I copy and paste the bytes I get "Âƒ". Is that correct? I know that in this case I can just split that and take only the "ƒ", but is that always the case?

I'll do a few more tests and let you know. Thank you for your help.

Also, the new version gives me the error "libiconv-2.dll was not found" when I try to launch any .exe on Windows. This wasn't happening before.

Apr 21 '23 07:04 gentilinigm

Re Âƒ: this still seems to be wrong in the codepage to UTF-8 conversion. In the JSON it should be "text_value": " 20ƒ10/18 l=420 (alt)", I think.

libiconv is new. It optionally provides faster charset conversions. When libiconv is detected at build time, the system the binaries run on requires libiconv as well. Copying the DLL is enough. I've added the missing libiconv-2.dll to appveyor with GH #700.
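
One plausible reading of the "Âƒ" symptom, sketched in Python: the JSON still contains U+0083, its UTF-8 encoding is the two bytes C2 83, and a viewer that interprets those bytes as Windows-1252 shows "Âƒ". Under that assumption, splitting off the "ƒ" only works by coincidence for this byte. The `repair` helper below is hypothetical and assumes the source codepage is known to be Windows-1254.

```python
leaked = "\u0083"                    # C1 control char, as reported from the JSON
as_utf8 = leaked.encode("utf-8")     # two bytes: 0xC2 0x83
pasted = as_utf8.decode("cp1252")    # how a cp1252 viewer renders those bytes


def repair(text: str, codepage: str = "cp1254") -> str:
    """Re-decode C1 code points (U+0080..U+009F) as single bytes of `codepage`.

    Hypothetical helper: only valid if the code points are leaked raw bytes.
    Bytes the codepage leaves undefined become U+FFFD.
    """
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            out.append(bytes([ord(ch)]).decode(codepage, errors="replace"))
        else:
            out.append(ch)
    return "".join(out)


print(ascii(pasted))
print(repair("20\u008310/18 l=420 (alt)"))
```

Re-decoding the leaked code point, rather than trimming the pasted string, generalises to the other characters in the 0x80..0x9F range of the codepage.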

Apr 21 '23 08:04 rurban

I understand. Is there a way to deal with this? Maybe by trying to normalize the character with Unicode NFD?
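
On the NFD idea, a quick check with Python's standard `unicodedata` module suggests normalization will not help here: U+0083 is a control character with no decomposition, so none of the four normalization forms turns it into U+0192. This is only a sketch of the check, not a statement about how LibreDWG handles it.

```python
import unicodedata

leaked = "\u0083"      # what the JSON contains
expected = "\u0192"    # what AutoCAD shows

forms = ("NFC", "NFD", "NFKC", "NFKD")
normalized = {form: unicodedata.normalize(form, leaked) for form in forms}

for form, value in normalized.items():
    print(form, ascii(value), value == expected)

print(unicodedata.category(leaked))   # general category of U+0083
print(unicodedata.name(expected))     # official name of U+0192
```

Normalization only relates different representations of the same character; the two code points here are unrelated characters that happen to share a byte value in the source codepage.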

Also, should we reopen this issue?

Apr 26 '23 13:04 gentilinigm

I recently came across Cyrillic texts and I noticed an error while reading 2007 .dwg files. I'm leaving an update here, in case it might be useful for testing.

Build 0.12.5.5887

I have created a test file with only one text entity containing two Russian words, but when I try to read it in the 2007 version, the text_value is blank. Other versions seem fine.

Inside the attachment cyrillic-test.zip you will find 8 different files for 4 different versions (r14, 2000, 2004, 2007) in both formats dwg/dxf.

Jul 07 '23 11:07 gentilinigm

Well, cyrillic-07_cyrillic-test.json contains "text_value": "\u041d\u043e\u043c\u0435\u0440\u0430 \u0441\u043a\u0432\u0430\u0436\u0438\u043d", which is correct for codepage 30. UTF-8 "Номера скважин" would be better, though.
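
For anyone comparing the two forms: the `\uXXXX` escapes and the literal UTF-8 text are the same JSON string value, so any JSON parser returns the same result. A small Python check, using only the standard `json` module:

```python
import json

# The escaped form, exactly as quoted from cyrillic-07_cyrillic-test.json.
escaped = (
    '"\\u041d\\u043e\\u043c\\u0435\\u0440\\u0430 '
    '\\u0441\\u043a\\u0432\\u0430\\u0436\\u0438\\u043d"'
)

text = json.loads(escaped)                       # parsed string value
as_ascii = json.dumps(text)                      # ASCII-only output with escapes
as_utf8 = json.dumps(text, ensure_ascii=False)   # literal Cyrillic output

print(text)
print(as_ascii)
print(as_utf8)
```

The difference is purely in how the writer serialises the string, not in the decoded text.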

Aug 04 '23 07:08 rurban

I see. Sorry, I forgot to specify that I was testing on Windows.

I tried again with the latest win64 nightly build (libredwg-0.12.5.6183-win64) and the text value is still blank for 2007 files. I can confirm it's all good on Linux.

Aug 22 '23 08:08 gentilinigm

Oh, that must be the NATIVE_WCHAR2 branch optimizations then
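
Some background on why this could be Windows-only, as a hedged sketch: r2007 and later store text as 16-bit Unicode rather than codepage bytes, and judging by its name, NATIVE_WCHAR2 presumably covers platforms where `wchar_t` is 2 bytes wide (Windows) so the stored text can be used as a native wide string. That reading is an assumption, not taken from the LibreDWG source. The Python snippet below only shows the platform difference and a portable UTF-16LE decode.

```python
import ctypes

text = "Номера скважин"
stored = text.encode("utf-16-le")     # 2 bytes per character for BMP text

# Width of the platform's wchar_t: typically 2 on Windows, 4 on Linux/macOS.
wchar_size = ctypes.sizeof(ctypes.c_wchar)

# A 2-byte wchar_t could alias the stored data directly; a 4-byte one
# always needs an explicit conversion step.
can_alias = wchar_size == 2

# The portable route works the same everywhere.
decoded = stored.decode("utf-16-le")

print(wchar_size, can_alias, decoded)
```

A bug confined to the 2-byte path would match the report: blank text on Windows, correct text on Linux.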

Aug 24 '23 07:08 rurban

Hello. Any progress on this issue? Do you need me to do more tests? Korean chars are also involved.

Jan 21 '24 16:01 timoria21
