kedro-plugins icon indicating copy to clipboard operation
kedro-plugins copied to clipboard

polars.EagerPolarsDataset "calamine" engine error for reading Excel files

Open butterlyn opened this issue 1 year ago • 7 comments

Description

polars.EagerPolarsDataset cannot read excel files using the "calamine" engine.

Context

The "calamine" engine (see: link), is significantly faster at reading and writing Excel files compared to the other polars.read_excel() engines. This is a practical requirement for reading/writing large Excel files without sacrificing performance.

Steps to Reproduce

  1. Set up a new Kedro project via kedro new
  2. Create a .xlsx, .xls, or .xlsb file in data/ directory
  3. Install Polars pyarrow extra pip install -U polars[pyarrow]
  4. Install fastexcel (required for using the polars.read_excel() "calamine" engine) pip install -U fastexcel
  5. Add to catalog.yml:
excel_input_file:
  type: polars.EagerPolarsDataset
  file_format: excel
  filepath: data/my_excel_file.xlsx
  load_args:
    engine: "calamine"
  1. Create a Kedro pipeline that reads in the excel_input_file as an input and run with kedro run

Expected Result

kedro run to load Excel file via polars.EagerPolarsDataset as per catalog.yml.

Actual Result

Error when using kedro run:

kedro.io.core.DatasetError: Failed while loading data from data set EagerPolarsDataset(file_format=excel, filepath=D:/NickB/repos/PSMO_FILES/psmo-core/data/00_user_input/psmo_inputs.xlsx, load_args={'engine': calamine, 'sheet_id': 0}, protocol=file, save_args={}).
expected str, bytes or os.PathLike object, not LocalFileOpener```

Note that when changing the polars.read_excel() to another engine (e.g., xlsx2csv), the polars.EagerPolarsDataset loads as expected.

Your Environment

  • Kedro version used (pip show kedro or kedro -V): 0.19.3
  • Kedro plugin and kedro plugin version used (pip show kedro-airflow): kedro-datasets: 2.1.0
  • Python version used (python -V): 3.11.7
  • Operating system and version: Windows 10

butterlyn avatar Feb 29 '24 15:02 butterlyn

Hi @butterlyn , thanks for reporting!

I suspect that this is a problem of the error message being unhelpful here. IIRC, the calamine engine requires extra dependencies, namely https://pypi.org/project/fastexcel/

Could you try pip install fastexcel in your environment and re-run your pipeline again?

astrojuanlu avatar Feb 29 '24 15:02 astrojuanlu

Hi @butterlyn , thanks for reporting!

I suspect that this is a problem of the error message being unhelpful here. IIRC, the calamine engine requires extra dependencies, namely https://pypi.org/project/fastexcel/

Could you try pip install fastexcel in your environment and re-run your pipeline again?

Hi @astrojuanlu, yes I'm running fastexcel version 0.9.1, still getting the same error.

butterlyn avatar Feb 29 '24 15:02 butterlyn

Thanks for checking. Can this error be reproduced with https://github.com/kedro-org/kedro-starters/blob/main/spaceflights-pandas/%7B%7B%20cookiecutter.repo_name%20%7D%7D/data/01_raw/shuttles.xlsx ?

astrojuanlu avatar Feb 29 '24 15:02 astrojuanlu

Thanks for checking. Can this error be reproduced with https://github.com/kedro-org/kedro-starters/blob/main/spaceflights-pandas/%7B%7B%20cookiecutter.repo_name%20%7D%7D/data/01_raw/shuttles.xlsx ?

Hi @astrojuanlu, just downloaded the file and checked, yes getting the same error message.

butterlyn avatar Feb 29 '24 15:02 butterlyn

Kedro datasets (generally) work by using fsspec to abstract filesytem differences. The "issue" is that Polars doesn't support reading from fsspec objects (e.g. LocalFileOpener or HTTPFile) consistently. In its defense, Polars seems to just push this responsibility to the underlying engine, so the "fault" is that fastexcel doesn't support fsspec objects. Therefore, I think the best way to support this is to request them to support fsspec objects.

>>> import fastexcel
>>> import fsspec
>>> import polars as pl
>>>
>>> fs = fsspec.filesystem('http')
>>> path = fs.open('https://github.com/kedro-org/kedro-viz/raw/main/demo-project/data/01_raw/shuttles.xlsx')
>>> pl.read_excel(path, engine='openpyxl')
shape: (77_096, 13)
┌───────┬───────────────────────────┬──────────────┬─────────────┬───┬──────────────────┬─────────────────────────┬──────────┬────────────┐
│ id    ┆ shuttle_location          ┆ shuttle_type ┆ engine_type ┆ … ┆ d_check_complete ┆ moon_clearance_complete ┆ price    ┆ company_id │
│ ---   ┆ ---                       ┆ ---          ┆ ---         ┆   ┆ ---              ┆ ---                     ┆ ---      ┆ ---        │
│ i64   ┆ str                       ┆ str          ┆ str         ┆   ┆ str              ┆ str                     ┆ str      ┆ i64        │
╞═══════╪═══════════════════════════╪══════════════╪═════════════╪═══╪══════════════════╪═════════════════════════╪══════════╪════════════╡
│ 63561 ┆ Niue                      ┆ Type V5      ┆ Quantum     ┆ … ┆ f                ┆ f                       ┆ $1,325.0 ┆ 35029      │
│ 36260 ┆ Anguilla                  ┆ Type V5      ┆ Quantum     ┆ … ┆ t                ┆ f                       ┆ $1,780.0 ┆ 30292      │
│ 57015 ┆ Russian Federation        ┆ Type V5      ┆ Quantum     ┆ … ┆ f                ┆ f                       ┆ $1,715.0 ┆ 19032      │
│ 14035 ┆ Barbados                  ┆ Type V5      ┆ Plasma      ┆ … ┆ f                ┆ f                       ┆ $4,770.0 ┆ 8238       │
│ 10036 ┆ Sao Tome and Principe     ┆ Type V2      ┆ Plasma      ┆ … ┆ f                ┆ f                       ┆ $2,820.0 ┆ 30342      │
│ …     ┆ …                         ┆ …            ┆ …           ┆ … ┆ …                ┆ …                       ┆ …        ┆ …          │
│ 4368  ┆ Barbados                  ┆ Type V5      ┆ Quantum     ┆ … ┆ t                ┆ f                       ┆ $4,107.0 ┆ 6654       │
│ 2983  ┆ Bouvet Island (Bouvetoya) ┆ Type F5      ┆ Quantum     ┆ … ┆ t                ┆ f                       ┆ $1,169.0 ┆ 8000       │
│ 69684 ┆ Micronesia                ┆ Type V5      ┆ Plasma      ┆ … ┆ t                ┆ f                       ┆ $1,910.0 ┆ 14296      │
│ 21738 ┆ Uzbekistan                ┆ Type V5      ┆ Plasma      ┆ … ┆ t                ┆ f                       ┆ $2,170.0 ┆ 27363      │
│ 72645 ┆ Malta                     ┆ Type F5      ┆ Quantum     ┆ … ┆ t                ┆ f                       ┆ $1,455.0 ┆ 12542      │
└───────┴───────────────────────────┴──────────────┴─────────────┴───┴──────────────────┴─────────────────────────┴──────────┴────────────┘
>>> pl.read_excel(path, engine='calamine')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/miniconda3/envs/polars-test/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/opt/miniconda3/envs/polars-test/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/opt/miniconda3/envs/polars-test/lib/python3.10/site-packages/polars/io/spreadsheet/functions.py", line 253, in read_excel
    return _read_spreadsheet(
  File "/opt/miniconda3/envs/polars-test/lib/python3.10/site-packages/polars/io/spreadsheet/functions.py", line 469, in _read_spreadsheet
    reader_fn, parser, worksheets = _initialise_spreadsheet_parser(
  File "/opt/miniconda3/envs/polars-test/lib/python3.10/site-packages/polars/io/spreadsheet/functions.py", line 592, in _initialise_spreadsheet_parser
    parser = fxl.read_excel(source, **engine_options)
  File "/opt/miniconda3/envs/polars-test/lib/python3.10/site-packages/fastexcel/__init__.py", line 202, in read_excel
    return ExcelReader(_read_excel(expanduser(path)))
  File "/opt/miniconda3/envs/polars-test/lib/python3.10/posixpath.py", line 232, in expanduser
    path = os.fspath(path)
TypeError: expected str, bytes or os.PathLike object, not HTTPFile
>>> fastexcel.read_excel(path)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/miniconda3/envs/polars-test/lib/python3.10/site-packages/fastexcel/__init__.py", line 202, in read_excel
    return ExcelReader(_read_excel(expanduser(path)))
  File "/opt/miniconda3/envs/polars-test/lib/python3.10/posixpath.py", line 232, in expanduser
    path = os.fspath(path)
TypeError: expected str, bytes or os.PathLike object, not HTTPFile

deepyaman avatar Mar 11 '24 15:03 deepyaman

We can of course ask, but the Rust ecosystem has its own alternative to fsspec, called object_store, so there's a chance that this request doesn't fly.

astrojuanlu avatar Mar 11 '24 16:03 astrojuanlu