wikokit

Unrecognizable Russian characters! (Question marks ???)

Open ArmanKabiri opened this issue 5 years ago • 6 comments

I'm trying to extract definitions from the ruwiktionary_parsed.sql file. I followed this page to create an empty wikt_parsed database. Then I loaded the Russian parsed SQL dump into the new database using the command: source RussianParsedFile.sql;

After a while, it finishes executing all queries without any errors. Now I am trying to extract the word list of the Russian Wiktionary using this query:

SELECT page_title FROM ruwikt20140904_parsed.index_native INTO OUTFILE 'C:/temp/result.csv';

The CSV file contains only question marks (?????) instead of Russian words. I have tried opening the file in several different applications, and the result is the same.
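
One way to narrow this down is to look at the raw bytes of the exported file. The sketch below is a hypothetical helper (`classify_export` is not part of wikokit). It separates literal question-mark bytes, which mean the text was already lost before or during export, from valid UTF-8 Cyrillic, which would point to the viewing application instead.

```python
def classify_export(data: bytes) -> str:
    """Roughly classify the bytes of an exported word list."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf8"        # some other encoding, e.g. cp1251
    if any("\u0400" <= ch <= "\u04ff" for ch in text):
        return "utf8-cyrillic"   # data is intact; check the viewer
    if "?" in text:
        return "question-marks"  # characters were replaced before export
    return "ascii-only"

# Usage with the path from the query above (assumed to exist):
# with open("C:/temp/result.csv", "rb") as f:
#     print(classify_export(f.read()))
```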

Does anybody have any idea or suggestion? Thanks.

ArmanKabiri avatar Dec 04 '19 17:12 ArmanKabiri

Dear @componavt , could you please let me know what the problem is in this case?

ArmanKabiri avatar Dec 04 '19 19:12 ArmanKabiri

Hello, Arman Kabiri!

This is a common encoding problem.

Before exporting to the CSV file, run the SELECT and display the result on screen. If you see ??? there as well, the text was loaded from the Wiktionary parsed file into the database with an incorrect encoding.

There is a lot of information available on how to set up encodings correctly in a MySQL database.

See this page: https://github.com/componavt/wikokit/wiki/Encoding

You can also try the MySQL command that sets the connection encoding: SET NAMES some-encoding

Best regards, Andrew.

componavt avatar Dec 06 '19 16:12 componavt

Hi, thanks @componavt for your answer. Actually, when I load the SQL file into MySQL, I get errors like:

ERROR 1062 (23000): Duplicate entry '??????-??????' for key 'foreign_native'
ERROR 1062 (23000): Duplicate entry '?????-avis' for key 'foreign_native'
...

After the SQL file is loaded, I use the connection string mentioned on this page (https://github.com/componavt/wikokit/wiki/Encoding). The result of the SELECT query is still question marks (???).
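
These errors are consistent with the Cyrillic letters being replaced by `?` on the way into the database: once that happens, different words of the same length become the same string, and a unique key such as `foreign_native` rejects the second one. A minimal Python sketch of the effect follows. It uses Python's `latin-1` codec as a stand-in for a connection charset that cannot represent Cyrillic, which is an assumption, since the actual server settings are not shown in this thread. The words and the `lossy` helper are illustrative only.

```python
def lossy(word: str) -> str:
    """Mimic a conversion to a charset that has no Cyrillic letters."""
    return word.encode("latin-1", errors="replace").decode("latin-1")

words = ["слово", "книга", "avis"]  # sample words, not taken from the dump
seen = {}
for w in words:
    key = lossy(w)
    if key in seen:
        # Two distinct words collapsed to the same key, as in ERROR 1062.
        print(f"Duplicate entry {key!r}: {w} collides with {seen[key]}")
    else:
        seen[key] = w
```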

ArmanKabiri avatar Dec 06 '19 22:12 ArmanKabiri

Hi!

Before loading the dump into MySQL, run the MySQL command: SET NAMES utf8;

If that does not work, try another encoding: SET NAMES some-other-encoding;
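
In a script, that means issuing the statement on the same connection that will execute the dump, before the first INSERT. A minimal sketch, assuming a Python DB-API style cursor (`cur`) from whatever MySQL driver is in use; `prepare_session` and its whitelist are hypothetical names for illustration.

```python
# Charset names accepted by this sketch (all are valid MySQL charsets).
ALLOWED_CHARSETS = {"utf8", "utf8mb4", "latin1", "cp1251"}

def prepare_session(cur, charset: str = "utf8") -> None:
    """Set the connection charset before any dump statement runs."""
    if charset not in ALLOWED_CHARSETS:
        raise ValueError(f"unexpected charset: {charset}")
    # SET NAMES tells the server which encoding the client sends and expects.
    cur.execute(f"SET NAMES {charset}")
```

The whitelist is there only because the charset name is interpolated into the statement text rather than passed as a parameter.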

Best regards, Andrew.

componavt avatar Dec 08 '19 13:12 componavt

Thank you so much @componavt for your help. The problem is solved. The encoding that worked for the Russian text was 'latin1'. So I executed these two commands before loading the SQL file:

cur.execute("SET NAMES 'latin1'");
cur.execute("SET CHARACTER SET 'latin1'");
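
Why `latin1` works for Cyrillic is not explained in this thread. One possible explanation (an assumption, not verified against this dump) is that a single-byte charset such as latin1 passes the UTF-8 bytes of the dump through unchanged, so nothing gets replaced by `?`. The Python sketch below only illustrates that byte-level round trip; Python's `latin-1` codec stands in for MySQL's `latin1`, which differs from it in a few code points.

```python
original = "слово"
utf8_bytes = original.encode("utf-8")

# Reading UTF-8 bytes as latin-1 yields mojibake, but no byte is lost...
as_latin1 = utf8_bytes.decode("latin-1")

# ...so reversing the step recovers the original text exactly.
recovered = as_latin1.encode("latin-1").decode("utf-8")
assert recovered == original
```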

ArmanKabiri avatar Dec 08 '19 17:12 ArmanKabiri

Great!

Best regards, Andrew Krizhanovsky.

componavt avatar Dec 09 '19 11:12 componavt