                        Chinese problem
Hello, developers,
I think Polars is an amazing tool for data science, but I can't read a CSV file when it contains Chinese data.
Please help.
The error log looks like the following:
ComputeError                              Traceback (most recent call last)
Input In [113], in <cell line: 1>()
----> 1 b_small = pl.read_csv('test_s.csv', has_header=False, new_columns=['test1', 'test2','test23'])
      2 b_small
File /opt/homebrew/Caskroom/miniforge/base/envs/ML/lib/python3.8/site-packages/polars/io.py:371, in read_csv(file, has_header, columns, new_columns, sep, comment_char, quote_char, skip_rows, dtypes, null_values, ignore_errors, parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, use_pyarrow, storage_options, skip_rows_after_header, row_count_name, row_count_offset, sample_size, eol_char, **kwargs)
    365 dtypes = {
    366     new_to_current.get(column_name, column_name): column_dtype
    367     for column_name, column_dtype in dtypes.items()
    368 }
    370 with _prepare_file_arg(file, **storage_options) as data:
--> 371 df = DataFrame._read_csv(
    372     file=data,
    373     has_header=has_header,
    374     columns=columns if columns else projection,
    375     sep=sep,
    376     comment_char=comment_char,
    377     quote_char=quote_char,
    378     skip_rows=skip_rows,
    379     dtypes=dtypes,
    380     null_values=null_values,
    381     ignore_errors=ignore_errors,
    382     parse_dates=parse_dates,
    383     n_threads=n_threads,
    384     infer_schema_length=infer_schema_length,
    385     batch_size=batch_size,
    386     n_rows=n_rows,
    387     encoding=encoding,
    388     low_memory=low_memory,
    389     rechunk=rechunk,
    390     skip_rows_after_header=skip_rows_after_header,
    391     row_count_name=row_count_name,
    392     row_count_offset=row_count_offset,
    393     sample_size=sample_size,
    394     eol_char=eol_char,
    395 )
    397 if new_columns:
    398     return update_columns(df, new_columns)
File /opt/homebrew/Caskroom/miniforge/base/envs/ML/lib/python3.8/site-packages/polars/internals/frame.py:622, in DataFrame._read_csv(cls, file, has_header, columns, sep, comment_char, quote_char, skip_rows, dtypes, null_values, ignore_errors, parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, skip_rows_after_header, row_count_name, row_count_offset, sample_size, eol_char)
615         raise ValueError(
616             "cannot use glob patterns and integer based projection as columns"
617             " argument; Use columns: List[str]"
618         )
620 projection, columns = handle_projection_columns(columns)
--> 622 self._df = PyDataFrame.read_csv(
623     file,
624     infer_schema_length,
625     batch_size,
626     has_header,
627     ignore_errors,
628     n_rows,
629     skip_rows,
630     projection,
631     sep,
632     rechunk,
633     columns,
634     encoding,
635     n_threads,
636     path,
637     dtype_list,
638     dtype_slice,
639     low_memory,
640     comment_char,
641     quote_char,
642     processed_null_values,
643     parse_dates,
644     skip_rows_after_header,
645     _prepare_row_count_args(row_count_name, row_count_offset),
646     sample_size=sample_size,
647     eol_char=eol_char,
648 )
649 return self
ComputeError: invalid utf8 data in csv
Can you provide a small sample file, with a few rows that demonstrate the error you're seeing?
Yes, of course.
The following link downloads a test.csv file that contains Traditional Chinese characters: https://drive.google.com/file/d/10NvCXoEJ3ZtsifUO9XXWqueitYZP2UpV/view?usp=sharing
I just use pl.read_csv('test.csv').
Your input file is not UTF-8 encoded; it uses a different encoding.
Opening your file with WPS and saving it again as a CSV file encodes the text properly as UTF-8.
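To see concretely what is going on: the Region bytes in the original file decode to Chinese text under Big5 but are rejected as UTF-8, which is exactly what Polars is complaining about. A minimal sketch, using byte values taken from the hexdump of the original file:

```python
# a5 78 a5 5f are the bytes of the first Region value in test_original.csv.
raw = bytes([0xA5, 0x78, 0xA5, 0x5F])

# Under Big5 these decode to readable text.
print(raw.decode("big5"))  # 台北

# Under UTF-8 they are invalid (0xA5 is not a valid start byte).
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc)
```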
❯ hexdump -C test_original.csv 
00000000  56 61 6c 75 65 31 2c 56  61 6c 75 65 32 2c 56 61  |Value1,Value2,Va|
00000010  6c 75 65 33 2c 56 61 6c  75 65 34 2c 52 65 67 69  |lue3,Value4,Regi|
00000020  6f 6e 0d 0a 2d 33 30 2c  37 2e 35 2c 32 35 37 38  |on..-30,7.5,2578|
00000030  2c 31 2c a5 78 a5 5f 0d  0a 2d 33 32 2c 37 2e 39  |,1,.x._..-32,7.9|
00000040  37 2c 33 30 30 36 2c 31  2c a5 78 a4 a4 0d 0a 2d  |7,3006,1,.x....-|
00000050  33 31 2c 38 2c 33 32 34  32 2c 32 2c b7 73 a6 cb  |31,8,3242,2,.s..|
00000060  0d 0a 2d 33 33 2c 37 2e  39 37 2c 33 33 30 30 2c  |..-33,7.97,3300,|
00000070  33 2c b0 aa b6 af 0d 0a  2d 32 30 2c 37 2e 39 31  |3,......-20,7.91|
00000080  2c 33 33 38 34 2c 34 2c  ac fc b0 ea              |,3384,4,....|
0000008c
❯ hexdump -C ~/test_wps.csv 
00000000  56 61 6c 75 65 31 2c 56  61 6c 75 65 32 2c 56 61  |Value1,Value2,Va|
00000010  6c 75 65 33 2c 56 61 6c  75 65 34 2c 52 65 67 69  |lue3,Value4,Regi|
00000020  6f 6e 0d 0a 2d 33 30 2c  37 2e 35 2c 32 35 37 38  |on..-30,7.5,2578|
00000030  2c 31 2c e5 8f b0 e5 8c  97 0d 0a 2d 33 32 2c 37  |,1,........-32,7|
00000040  2e 39 37 2c 33 30 30 36  2c 31 2c e5 8f b0 e4 b8  |.97,3006,1,.....|
00000050  ad 0d 0a 2d 33 31 2c 38  2c 33 32 34 32 2c 32 2c  |...-31,8,3242,2,|
00000060  e6 96 b0 e7 ab b9 0d 0a  2d 33 33 2c 37 2e 39 37  |........-33,7.97|
00000070  2c 33 33 30 30 2c 33 2c  e9 ab 98 e9 9b 84 0d 0a  |,3300,3,........|
00000080  2d 32 30 2c 37 2e 39 31  2c 33 33 38 34 2c 34 2c  |-20,7.91,3384,4,|
00000090  e7 be 8e e5 9c 8b 0d 0a                           |........|
00000098
❯ head test_original.csv test_wps.csv
==> test_original.csv <==
Value1,Value2,Value3,Value4,Region
-30,7.5,2578,1,�x�_
-32,7.97,3006,1,�x��
-31,8,3242,2,�s��
-33,7.97,3300,3,����
-20,7.91,3384,4,����
==> test_wps.csv <==
Value1,Value2,Value3,Value4,Region
-30,7.5,2578,1,台北
-32,7.97,3006,1,台中
-31,8,3242,2,新竹
-33,7.97,3300,3,高雄
-20,7.91,3384,4,美國
With the following iconv command, I managed to convert the CSV to UTF-8 as well:
iconv -f BIG5 -t UTF-8 test_original.csv > test_iconv.csv
Note that other files may require a different source encoding: https://docs.oracle.com/cd/E19455-01/806-3487/6jckovvg1/index.html
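When the source encoding is unknown, one crude dependency-free way to narrow it down from Python is to try decoding a byte sample under a few likely codecs. The candidate list below is only an illustrative assumption, and a wrong codec can still decode "successfully" (latin1 accepts any bytes), so treat the result as a guess, not a guarantee:

```python
# Order matters: strict codecs first, permissive ones (latin1) last.
CANDIDATES = ["utf-8", "big5", "gb18030", "latin1"]

def guess_encoding(data: bytes) -> str:
    """Return the first candidate codec that decodes the sample cleanly."""
    for enc in CANDIDATES:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return "unknown"

sample = "台北".encode("big5")  # the Big5 bytes from the file above
print(guess_encoding(sample))  # prints: big5
```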
There is an encoding argument for read_csv, but it only supports the values utf8 and utf8-lossy. Is there scope to support additional encodings? For example, read_csv in Pandas also has an encoding argument, but it supports all encodings supported by Python, including big5, as mentioned in this issue, and the very common latin1.
- Polars read_csv
- Pandas read_csv
- Standard Python encodings
EDIT: PyArrow has also supported specifying the encoding for read_csv since this PR, which allows the following workaround for loading non-UTF-8 encoded data into Polars:
import pyarrow.csv  # the csv submodule must be imported explicitly
import polars as pl

# Let PyArrow decode the file from Latin-1, then hand the table to Polars.
csv_read_options = pyarrow.csv.ReadOptions(encoding='latin1')
arrow_df = pyarrow.csv.read_csv('file.csv', read_options=csv_read_options)
polars_df = pl.from_arrow(arrow_df)
Thanks, all.
In reality, our data files are usually 2 GB or even larger, so I will decode the CSV files uploaded by users and re-encode them as UTF-8 before reading them with Polars.
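For files that large, the re-encoding can be done in fixed-size chunks so the whole file never has to fit in memory. A sketch: the file names, sample contents, and chunk size are all arbitrary, and the small sample file is written here only to make the example self-contained:

```python
SAMPLE = "Value1,Region\n-30,台北\n"

# Create a small Big5-encoded file just so the example is runnable.
with open("test_big5.csv", "w", encoding="big5", newline="") as fh:
    fh.write(SAMPLE)

def reencode(src_path, dst_path, src_encoding="big5", chunk_size=1 << 20):
    """Copy src_path to dst_path, decoding src_encoding and writing UTF-8."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_size)  # decoded text, at most chunk_size chars
            if not chunk:
                break
            dst.write(chunk)

reencode("test_big5.csv", "test_utf8.csv")

with open("test_utf8.csv", "r", encoding="utf-8", newline="") as fh:
    print(fh.read())  # the original text, now UTF-8 encoded
```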
This will also work:
In [67]: with open('test_original.csv', 'r', encoding='big5') as fh:
    ...:     df = pl.read_csv(fh.read().encode('utf-8'))
    ...: 
In [68]: df
Out[68]: 
shape: (5, 5)
┌────────┬────────┬────────┬────────┬────────┐
│ Value1 ┆ Value2 ┆ Value3 ┆ Value4 ┆ Region │
│ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    │
│ i64    ┆ f64    ┆ i64    ┆ i64    ┆ str    │
╞════════╪════════╪════════╪════════╪════════╡
│ -30    ┆ 7.5    ┆ 2578   ┆ 1      ┆ 台北   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ -32    ┆ 7.97   ┆ 3006   ┆ 1      ┆ 台中   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ -31    ┆ 8.0    ┆ 3242   ┆ 2      ┆ 新竹   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ -33    ┆ 7.97   ┆ 3300   ┆ 3      ┆ 高雄   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ -20    ┆ 7.91   ┆ 3384   ┆ 4      ┆ 美國   │
└────────┴────────┴────────┴────────┴────────┘
https://github.com/pola-rs/polars/pull/4464