                        Chinese problem
Hello, developers,
I think Polars is an amazing tool for data science, but I can't read a CSV file when it contains Chinese data.
Please help.
The error log looks like the following:
ComputeError                              Traceback (most recent call last)
Input In [113], in <cell line: 1>()
----> 1 b_small = pl.read_csv('test_s.csv', has_header=False, new_columns=['test1', 'test2','test23'])
      2 b_small
File /opt/homebrew/Caskroom/miniforge/base/envs/ML/lib/python3.8/site-packages/polars/io.py:371, in read_csv(file, has_header, columns, new_columns, sep, comment_char, quote_char, skip_rows, dtypes, null_values, ignore_errors, parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, use_pyarrow, storage_options, skip_rows_after_header, row_count_name, row_count_offset, sample_size, eol_char, **kwargs)
    365 dtypes = {
    366     new_to_current.get(column_name, column_name): column_dtype
    367     for column_name, column_dtype in dtypes.items()
    368 }
    370 with _prepare_file_arg(file, **storage_options) as data:
--> 371 df = DataFrame._read_csv(
    372     file=data,
    373     has_header=has_header,
    374     columns=columns if columns else projection,
    375     sep=sep,
    376     comment_char=comment_char,
    377     quote_char=quote_char,
    378     skip_rows=skip_rows,
    379     dtypes=dtypes,
    380     null_values=null_values,
    381     ignore_errors=ignore_errors,
    382     parse_dates=parse_dates,
    383     n_threads=n_threads,
    384     infer_schema_length=infer_schema_length,
    385     batch_size=batch_size,
    386     n_rows=n_rows,
    387     encoding=encoding,
    388     low_memory=low_memory,
    389     rechunk=rechunk,
    390     skip_rows_after_header=skip_rows_after_header,
    391     row_count_name=row_count_name,
    392     row_count_offset=row_count_offset,
    393     sample_size=sample_size,
    394     eol_char=eol_char,
    395 )
    397 if new_columns:
    398     return update_columns(df, new_columns)
File /opt/homebrew/Caskroom/miniforge/base/envs/ML/lib/python3.8/site-packages/polars/internals/frame.py:622, in DataFrame._read_csv(cls, file, has_header, columns, sep, comment_char, quote_char, skip_rows, dtypes, null_values, ignore_errors, parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, skip_rows_after_header, row_count_name, row_count_offset, sample_size, eol_char)
615         raise ValueError(
616             "cannot use glob patterns and integer based projection as columns"
617             " argument; Use columns: List[str]"
618         )
620 projection, columns = handle_projection_columns(columns)
--> 622 self._df = PyDataFrame.read_csv(
623     file,
624     infer_schema_length,
625     batch_size,
626     has_header,
627     ignore_errors,
628     n_rows,
629     skip_rows,
630     projection,
631     sep,
632     rechunk,
633     columns,
634     encoding,
635     n_threads,
636     path,
637     dtype_list,
638     dtype_slice,
639     low_memory,
640     comment_char,
641     quote_char,
642     processed_null_values,
643     parse_dates,
644     skip_rows_after_header,
645     _prepare_row_count_args(row_count_name, row_count_offset),
646     sample_size=sample_size,
647     eol_char=eol_char,
648 )
649 return self
ComputeError: invalid utf8 data in csv
Can you provide a small sample file, with a few rows that demonstrate the error you're seeing?
Yes, of course.
The following link downloads a test.csv file that contains Traditional Chinese characters: https://drive.google.com/file/d/10NvCXoEJ3ZtsifUO9XXWqueitYZP2UpV/view?usp=sharing
I just use pl.read_csv('test.csv').
Your input file is not UTF-8 encoded; it uses a different encoding.
Opening your file with WPS and saving it again as a CSV file encodes the text properly as UTF-8.
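To see concretely what is going on: the Region bytes in the original file decode to Chinese text under Big5 but are rejected as UTF-8, which is exactly what Polars is complaining about. A minimal sketch, using byte values taken from the hexdump of the original file:

```python
# a5 78 a5 5f are the bytes of the first Region value in test_original.csv.
raw = bytes([0xA5, 0x78, 0xA5, 0x5F])

# Under Big5 these decode to readable text.
print(raw.decode("big5"))  # 台北

# Under UTF-8 they are invalid (0xA5 is not a valid start byte).
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc)
```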
❯ hexdump -C test_original.csv 
00000000  56 61 6c 75 65 31 2c 56  61 6c 75 65 32 2c 56 61  |Value1,Value2,Va|
00000010  6c 75 65 33 2c 56 61 6c  75 65 34 2c 52 65 67 69  |lue3,Value4,Regi|
00000020  6f 6e 0d 0a 2d 33 30 2c  37 2e 35 2c 32 35 37 38  |on..-30,7.5,2578|
00000030  2c 31 2c a5 78 a5 5f 0d  0a 2d 33 32 2c 37 2e 39  |,1,.x._..-32,7.9|
00000040  37 2c 33 30 30 36 2c 31  2c a5 78 a4 a4 0d 0a 2d  |7,3006,1,.x....-|
00000050  33 31 2c 38 2c 33 32 34  32 2c 32 2c b7 73 a6 cb  |31,8,3242,2,.s..|
00000060  0d 0a 2d 33 33 2c 37 2e  39 37 2c 33 33 30 30 2c  |..-33,7.97,3300,|
00000070  33 2c b0 aa b6 af 0d 0a  2d 32 30 2c 37 2e 39 31  |3,......-20,7.91|
00000080  2c 33 33 38 34 2c 34 2c  ac fc b0 ea              |,3384,4,....|
0000008c
❯ hexdump -C ~/test_wps.csv 
00000000  56 61 6c 75 65 31 2c 56  61 6c 75 65 32 2c 56 61  |Value1,Value2,Va|
00000010  6c 75 65 33 2c 56 61 6c  75 65 34 2c 52 65 67 69  |lue3,Value4,Regi|
00000020  6f 6e 0d 0a 2d 33 30 2c  37 2e 35 2c 32 35 37 38  |on..-30,7.5,2578|
00000030  2c 31 2c e5 8f b0 e5 8c  97 0d 0a 2d 33 32 2c 37  |,1,........-32,7|
00000040  2e 39 37 2c 33 30 30 36  2c 31 2c e5 8f b0 e4 b8  |.97,3006,1,.....|
00000050  ad 0d 0a 2d 33 31 2c 38  2c 33 32 34 32 2c 32 2c  |...-31,8,3242,2,|
00000060  e6 96 b0 e7 ab b9 0d 0a  2d 33 33 2c 37 2e 39 37  |........-33,7.97|
00000070  2c 33 33 30 30 2c 33 2c  e9 ab 98 e9 9b 84 0d 0a  |,3300,3,........|
00000080  2d 32 30 2c 37 2e 39 31  2c 33 33 38 34 2c 34 2c  |-20,7.91,3384,4,|
00000090  e7 be 8e e5 9c 8b 0d 0a                           |........|
00000098
❯ head test_original.csv test_wps.csv
==> test_original.csv <==
Value1,Value2,Value3,Value4,Region
-30,7.5,2578,1,�x�_
-32,7.97,3006,1,�x��
-31,8,3242,2,�s��
-33,7.97,3300,3,����
-20,7.91,3384,4,����
==> test_wps.csv <==
Value1,Value2,Value3,Value4,Region
-30,7.5,2578,1,台北
-32,7.97,3006,1,台中
-31,8,3242,2,新竹
-33,7.97,3300,3,高雄
-20,7.91,3384,4,美國
With the following iconv command, I managed to convert the CSV to UTF-8 as well:
iconv -f BIG5 -t UTF-8 test_original.csv > test_iconv.csv
Note that other files may require a different source encoding: https://docs.oracle.com/cd/E19455-01/806-3487/6jckovvg1/index.html
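When the source encoding is unknown, one crude dependency-free way to narrow it down from Python is to try decoding a byte sample under a few likely codecs. The candidate list below is only an illustrative assumption, and a wrong codec can still decode "successfully" (latin1 accepts any bytes), so treat the result as a guess, not a guarantee:

```python
# Order matters: strict codecs first, permissive ones (latin1) last.
CANDIDATES = ["utf-8", "big5", "gb18030", "latin1"]

def guess_encoding(data: bytes) -> str:
    """Return the first candidate codec that decodes the sample cleanly."""
    for enc in CANDIDATES:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return "unknown"

sample = "台北".encode("big5")  # the Big5 bytes from the file above
print(guess_encoding(sample))  # prints: big5
```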
There is an encoding argument for read_csv, but it only supports the values utf8 and utf8-lossy. Is there scope to support additional encodings? For example, read_csv in Pandas also has an encoding argument, but it supports all encodings supported by Python, including big5, as mentioned in this issue, and the very common latin1.
- Polars read_csv
- Pandas read_csv
- Standard Python encodings
EDIT: PyArrow has also supported specifying the encoding for read_csv since this PR, which allows the following workaround for loading non-UTF-8 encoded data into Polars:
import pyarrow.csv  # the csv submodule must be imported explicitly
import polars as pl

# Let PyArrow decode the file from Latin-1, then hand the table to Polars.
csv_read_options = pyarrow.csv.ReadOptions(encoding='latin1')
arrow_df = pyarrow.csv.read_csv('file.csv', read_options=csv_read_options)
polars_df = pl.from_arrow(arrow_df)
Thanks, all.
In reality, our data files are usually 2 GB or even larger, so I will decode the CSV files uploaded by users and re-encode them as UTF-8 before reading them with Polars.
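For files that large, the re-encoding can be done in fixed-size chunks so the whole file never has to fit in memory. A sketch: the file names, sample contents, and chunk size are all arbitrary, and the small sample file is written here only to make the example self-contained:

```python
SAMPLE = "Value1,Region\n-30,台北\n"

# Create a small Big5-encoded file just so the example is runnable.
with open("test_big5.csv", "w", encoding="big5", newline="") as fh:
    fh.write(SAMPLE)

def reencode(src_path, dst_path, src_encoding="big5", chunk_size=1 << 20):
    """Copy src_path to dst_path, decoding src_encoding and writing UTF-8."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_size)  # decoded text, at most chunk_size chars
            if not chunk:
                break
            dst.write(chunk)

reencode("test_big5.csv", "test_utf8.csv")

with open("test_utf8.csv", "r", encoding="utf-8", newline="") as fh:
    print(fh.read())  # the original text, now UTF-8 encoded
```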
This will also work:
In [67]: with open('test_original.csv', 'r', encoding='big5') as fh:
    ...:     df = pl.read_csv(fh.read().encode('utf-8'))
    ...: 
In [68]: df
Out[68]: 
shape: (5, 5)
┌────────┬────────┬────────┬────────┬────────┐
│ Value1 ┆ Value2 ┆ Value3 ┆ Value4 ┆ Region │
│ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    │
│ i64    ┆ f64    ┆ i64    ┆ i64    ┆ str    │
╞════════╪════════╪════════╪════════╪════════╡
│ -30    ┆ 7.5    ┆ 2578   ┆ 1      ┆ 台北   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ -32    ┆ 7.97   ┆ 3006   ┆ 1      ┆ 台中   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ -31    ┆ 8.0    ┆ 3242   ┆ 2      ┆ 新竹   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ -33    ┆ 7.97   ┆ 3300   ┆ 3      ┆ 高雄   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ -20    ┆ 7.91   ┆ 3384   ┆ 4      ┆ 美國   │
└────────┴────────┴────────┴────────┴────────┘
https://github.com/pola-rs/polars/pull/4464