polars.EagerPolarsDataset "calamine" engine error for reading Excel files
Description
polars.EagerPolarsDataset cannot read excel files using the "calamine" engine.
Context
The "calamine" engine (see: link), is significantly faster at reading and writing Excel files compared to the other polars.read_excel() engines. This is a practical requirement for reading/writing large Excel files without sacrificing performance.
Steps to Reproduce
- Set up a new Kedro project via
kedro new - Create a
.xlsx,.xls, or.xlsbfile indata/directory - Install Polars pyarrow extra
pip install -U polars[pyarrow] - Install fastexcel (required for using the
polars.read_excel()"calamine" engine)pip install -U fastexcel - Add to
catalog.yml:
excel_input_file:
type: polars.EagerPolarsDataset
file_format: excel
filepath: data/my_excel_file.xlsx
load_args:
engine: "calamine"
- Create a Kedro pipeline that reads in the
excel_input_fileas an input and run withkedro run
Expected Result
kedro run to load Excel file via polars.EagerPolarsDataset as per catalog.yml.
Actual Result
Error when using kedro run:
kedro.io.core.DatasetError: Failed while loading data from data set EagerPolarsDataset(file_format=excel, filepath=D:/NickB/repos/PSMO_FILES/psmo-core/data/00_user_input/psmo_inputs.xlsx, load_args={'engine': calamine, 'sheet_id': 0}, protocol=file, save_args={}).
expected str, bytes or os.PathLike object, not LocalFileOpener```
Note that when changing the polars.read_excel() to another engine (e.g., xlsx2csv), the polars.EagerPolarsDataset loads as expected.
Your Environment
- Kedro version used (
pip show kedroorkedro -V): 0.19.3 - Kedro plugin and kedro plugin version used (
pip show kedro-airflow): kedro-datasets: 2.1.0 - Python version used (
python -V): 3.11.7 - Operating system and version: Windows 10
Hi @butterlyn , thanks for reporting!
I suspect that this is a problem of the error message being unhelpful here. IIRC, the calamine engine requires extra dependencies, namely https://pypi.org/project/fastexcel/
Could you try pip install fastexcel in your environment and re-run your pipeline again?
Hi @butterlyn , thanks for reporting!
I suspect that this is a problem of the error message being unhelpful here. IIRC, the
calamineengine requires extra dependencies, namely https://pypi.org/project/fastexcel/Could you try
pip install fastexcelin your environment and re-run your pipeline again?
Hi @astrojuanlu, yes I'm running fastexcel version 0.9.1, still getting the same error.
Thanks for checking. Can this error be reproduced with https://github.com/kedro-org/kedro-starters/blob/main/spaceflights-pandas/%7B%7B%20cookiecutter.repo_name%20%7D%7D/data/01_raw/shuttles.xlsx ?
Thanks for checking. Can this error be reproduced with https://github.com/kedro-org/kedro-starters/blob/main/spaceflights-pandas/%7B%7B%20cookiecutter.repo_name%20%7D%7D/data/01_raw/shuttles.xlsx ?
Hi @astrojuanlu, just downloaded the file and checked, yes getting the same error message.
Kedro datasets (generally) work by using fsspec to abstract filesytem differences. The "issue" is that Polars doesn't support reading from fsspec objects (e.g. LocalFileOpener or HTTPFile) consistently. In its defense, Polars seems to just push this responsibility to the underlying engine, so the "fault" is that fastexcel doesn't support fsspec objects. Therefore, I think the best way to support this is to request them to support fsspec objects.
>>> import fastexcel
>>> import fsspec
>>> import polars as pl
>>>
>>> fs = fsspec.filesystem('http')
>>> path = fs.open('https://github.com/kedro-org/kedro-viz/raw/main/demo-project/data/01_raw/shuttles.xlsx')
>>> pl.read_excel(path, engine='openpyxl')
shape: (77_096, 13)
┌───────┬───────────────────────────┬──────────────┬─────────────┬───┬──────────────────┬─────────────────────────┬──────────┬────────────┐
│ id ┆ shuttle_location ┆ shuttle_type ┆ engine_type ┆ … ┆ d_check_complete ┆ moon_clearance_complete ┆ price ┆ company_id │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ ┆ str ┆ str ┆ str ┆ i64 │
╞═══════╪═══════════════════════════╪══════════════╪═════════════╪═══╪══════════════════╪═════════════════════════╪══════════╪════════════╡
│ 63561 ┆ Niue ┆ Type V5 ┆ Quantum ┆ … ┆ f ┆ f ┆ $1,325.0 ┆ 35029 │
│ 36260 ┆ Anguilla ┆ Type V5 ┆ Quantum ┆ … ┆ t ┆ f ┆ $1,780.0 ┆ 30292 │
│ 57015 ┆ Russian Federation ┆ Type V5 ┆ Quantum ┆ … ┆ f ┆ f ┆ $1,715.0 ┆ 19032 │
│ 14035 ┆ Barbados ┆ Type V5 ┆ Plasma ┆ … ┆ f ┆ f ┆ $4,770.0 ┆ 8238 │
│ 10036 ┆ Sao Tome and Principe ┆ Type V2 ┆ Plasma ┆ … ┆ f ┆ f ┆ $2,820.0 ┆ 30342 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 4368 ┆ Barbados ┆ Type V5 ┆ Quantum ┆ … ┆ t ┆ f ┆ $4,107.0 ┆ 6654 │
│ 2983 ┆ Bouvet Island (Bouvetoya) ┆ Type F5 ┆ Quantum ┆ … ┆ t ┆ f ┆ $1,169.0 ┆ 8000 │
│ 69684 ┆ Micronesia ┆ Type V5 ┆ Plasma ┆ … ┆ t ┆ f ┆ $1,910.0 ┆ 14296 │
│ 21738 ┆ Uzbekistan ┆ Type V5 ┆ Plasma ┆ … ┆ t ┆ f ┆ $2,170.0 ┆ 27363 │
│ 72645 ┆ Malta ┆ Type F5 ┆ Quantum ┆ … ┆ t ┆ f ┆ $1,455.0 ┆ 12542 │
└───────┴───────────────────────────┴──────────────┴─────────────┴───┴──────────────────┴─────────────────────────┴──────────┴────────────┘
>>> pl.read_excel(path, engine='calamine')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/miniconda3/envs/polars-test/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
return function(*args, **kwargs)
File "/opt/miniconda3/envs/polars-test/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
return function(*args, **kwargs)
File "/opt/miniconda3/envs/polars-test/lib/python3.10/site-packages/polars/io/spreadsheet/functions.py", line 253, in read_excel
return _read_spreadsheet(
File "/opt/miniconda3/envs/polars-test/lib/python3.10/site-packages/polars/io/spreadsheet/functions.py", line 469, in _read_spreadsheet
reader_fn, parser, worksheets = _initialise_spreadsheet_parser(
File "/opt/miniconda3/envs/polars-test/lib/python3.10/site-packages/polars/io/spreadsheet/functions.py", line 592, in _initialise_spreadsheet_parser
parser = fxl.read_excel(source, **engine_options)
File "/opt/miniconda3/envs/polars-test/lib/python3.10/site-packages/fastexcel/__init__.py", line 202, in read_excel
return ExcelReader(_read_excel(expanduser(path)))
File "/opt/miniconda3/envs/polars-test/lib/python3.10/posixpath.py", line 232, in expanduser
path = os.fspath(path)
TypeError: expected str, bytes or os.PathLike object, not HTTPFile
>>> fastexcel.read_excel(path)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/miniconda3/envs/polars-test/lib/python3.10/site-packages/fastexcel/__init__.py", line 202, in read_excel
return ExcelReader(_read_excel(expanduser(path)))
File "/opt/miniconda3/envs/polars-test/lib/python3.10/posixpath.py", line 232, in expanduser
path = os.fspath(path)
TypeError: expected str, bytes or os.PathLike object, not HTTPFile
We can of course ask, but the Rust ecosystem has its own alternative to fsspec, called object_store, so there's a chance that this request doesn't fly.