libredwg
Still NATIVE_WCHAR2 failures. Fixed: Unicode characters in Text entities (unsupported <r2007 codepages, like ANSI_1254)
Hello,
I'm trying to read a dwg file "2.dwg" (see attachment below 2.zip).
I have a problem in a text entity:
The Text entity with handle 890 should have the following text_value: "20ƒ10/18 l=420 (alt)". But if you look at the JSON file, the 'ƒ' character is missing.
Please look at the image below.

Searching in online tables, I found that the 'ƒ' character should be Unicode "\u0192", but here I'm getting "\u0083".
It is not clear to me whether and how I can convert these values. Do you have any suggestions?
This looks like a bug in our utf-8 conversion. I'll check...
Hello,
I have done some more testing on this. It looks like the Unicode character "\u0192" is in fact very similar to "\u0083". References: https://www.fileformat.info/info/unicode/char/0192/index.htm https://www.fileformat.info/info/unicode/char/0083/index.htm
So maybe it's okay and not a bug? If I copy the character from AutoCAD and paste it somewhere else, I get "\u0192" as the right Unicode character.
Sorry, I have very little experience with encodings, so I can't tell whether I'm doing it wrong or it's a problem with the LibreDWG library.
If you can confirm that it's not a problem with LibreDWG, we can close the issue.
I'll leave the dxf here if it's useful. 2.zip
Thank you.
This is still a limitation of our codepage conversions. You have ANSI_1254 (codepage 33), which we do not yet decode properly into UTF-8; this applies to all versions <r2007. libdxfrw does it via iconv / some handmade mapping tables.
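As an illustration of the limitation described above, here is a minimal Python sketch of a consumer-side workaround. It is not LibreDWG code. It assumes the pre-r2007 text bytes were passed through unchanged, so that byte 0x83 surfaced as code point U+0083, and the helper name `redecode` is hypothetical.

```python
# Assumption: each byte of the ANSI_1254 text ended up as the code point
# with the same value, so 0x83 became U+0083 instead of U+0192.
raw = "20\u008310/18 l=420 (alt)"


def redecode(text: str, codepage: str = "cp1254") -> str:
    """Hypothetical helper: turn each code point back into one byte,
    then decode those bytes with the drawing's codepage."""
    # latin-1 maps U+0000..U+00FF to bytes 0x00..0xFF one to one;
    # it raises UnicodeEncodeError if the text holds anything above U+00FF.
    return text.encode("latin-1").decode(codepage)


fixed = redecode(raw)
print(fixed)  # 20ƒ10/18 l=420 (alt)
```

This only works while every character in the string is below U+0100, and only if the codepage stored in the drawing header is known to the caller.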
Working on it in the work/iconv branch
This should be fixed now
I tested it out on the example above and it seems almost correct:

"text_value": " 2010/18 l=420 (alt)", If I copy and paste the bytes I get: "ƒ". Is that correct? I know that in this case I can just split that and take the "ƒ" only, but is it always the case?
I'll do a few more tests and let you know, thank you for your help.
Also, the new version gives me the error "libiconv-2.dll was not found" when I try to launch any .exe on Windows OS, this wasn't happening before.
Regarding ƒ: this still seems to be wrong in the codepage to UTF-8 conversion. In the JSON it should be "text_value": " 20ƒ10/18 l=420 (alt)", I think.
libiconv is new; it optionally provides the faster charset conversions. When the build detects libiconv, the system the binaries run on then also requires libiconv. Copying the DLL is enough. I've added the missing libiconv-2.dll to AppVeyor with GH #700.
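Returning to the four pasted characters quoted above: a short Python sketch can show which code points they likely stand for. It assumes the paste is UTF-8 output that was displayed as Windows-1252; that is a guess based on the characters, not something confirmed in this thread.

```python
# The four pasted characters, written as escapes: Â ƒ Æ ’
pasted = "\u00c2\u0192\u00c6\u2019"

# Assumption: these are UTF-8 bytes that a viewer rendered as Windows-1252.
raw_bytes = pasted.encode("cp1252")
print(raw_bytes)  # b'\xc2\x83\xc6\x92'

# Reading the same bytes as UTF-8 gives two code points.
decoded = raw_bytes.decode("utf-8")
code_points = [f"U+{ord(c):04X}" for c in decoded]
print(code_points)  # ['U+0083', 'U+0192']
```

Under that assumption the output holds both the old pass-through value U+0083 and the correctly converted U+0192, which would match the observation that splitting off the last character yields "ƒ".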
I understand, is there a way to deal with this? Maybe by trying to normalize the character with the unicode NFD?
Also, should we reopen this issue?
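On the NFD question, a minimal Python sketch suggests normalization alone would not help if the extra character is a C1 control such as U+0083, because control characters have no decomposition. The sample value below is reconstructed for illustration, and `strip_c1` is a hypothetical helper that assumes C1 controls never appear legitimately in drawing text.

```python
import unicodedata

# Reconstructed sample: a stray U+0083 directly before the real U+0192 (ƒ).
value = "20\u0083\u019210/18 l=420 (alt)"

# NFD leaves the string unchanged: U+0083 is a control character (category Cc).
nfd = unicodedata.normalize("NFD", value)
print(nfd == value)                   # True
print(unicodedata.category("\u0083"))  # Cc


def strip_c1(text: str) -> str:
    """Hypothetical helper: drop C1 control characters (U+0080..U+009F)."""
    return "".join(c for c in text if not 0x80 <= ord(c) <= 0x9F)


cleaned = strip_c1(value)
print(cleaned)  # 20ƒ10/18 l=420 (alt)
```

Filtering is a stopgap on the reading side; it hides the symptom and does not replace a fix in the conversion itself.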
I recently came across Cyrillic texts and I noticed an error while reading 2007 .dwg files. I'm leaving an update here, in case it might be useful for testing.
Build 0.12.5.5887
I have created a test file with only one text entity containing two Russian words, but when I try to read it in the 2007 version, the text_value is blank. Other versions seem fine.
Inside the attachment cyrillic-test.zip you will find 8 different files for 4 different versions (r14, 2000, 2004, 2007) in both formats dwg/dxf.
Well, cyrillic-07_cyrillic-test.json contains "text_value": "\u041d\u043e\u043c\u0435\u0440\u0430 \u0441\u043a\u0432\u0430\u0436\u0438\u043d", which is correct for codepage 30. UTF-8 "Номера скважин" would be better though.
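The escaped form and the UTF-8 form carry the same text, which a few lines of Python can confirm. This sketch uses only the standard `json` module and a one-field object built from the value quoted above; the real JSON file has more structure.

```python
import json

# One-field object holding the escaped value quoted above.
escaped = (
    r'{"text_value": "\u041d\u043e\u043c\u0435\u0440\u0430 '
    r'\u0441\u043a\u0432\u0430\u0436\u0438\u043d"}'
)

# Any JSON parser resolves the \uXXXX escapes to the actual characters.
obj = json.loads(escaped)
print(obj["text_value"])  # Номера скважин

# Re-serialising without ASCII escaping gives the readable UTF-8 form.
readable = json.dumps(obj, ensure_ascii=False)
print(readable)  # {"text_value": "Номера скважин"}
```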
I see. Sorry, I forgot to specify that I was testing on Windows.
I tried again with the latest win64 nightly build (libredwg-0.12.5.6183-win64) and the text value is still blank for 2007 files. I can confirm it's all good on Linux.
Oh, that must be the NATIVE_WCHAR2 branch optimizations then
Hello. Any progress on this issue? Do you need me to do more tests? Korean chars are also involved.