polars icon indicating copy to clipboard operation
polars copied to clipboard

Add write_excel and improve read_excel to also use openpyxl for better type inferring

Open bvanelli opened this issue 2 years ago • 8 comments
trafficstars

Closes https://github.com/pola-rs/polars/issues/5568

Improvements on Excel support for polars, including a new exporter and an importer that does better type inferring, i.e. handling null entries, correctly parsing datetimes, etc.

There are some things to still do:

  • [x] Fix pipelines, lint and format it properly, make sure mypy is passing.
  • [x] Improve test coverage
  • [x] Add openpyxl as optional installable
  • [x] Improve the api a little bit
  • [ ] (OPTIONAL) Include basic optional styling

bvanelli avatar Jan 11 '23 22:01 bvanelli

@bvanelli: Thanks for taking a look! However, I suspect that an xlsxwriter / xlsxwriter_rs approach is going to be the best way forward, given the native Rust integration (and previous experience with that library).

alexander-beedie avatar Jan 12 '23 06:01 alexander-beedie

@bvanelli: Thanks for taking a look! However, I suspect that an xlsxwriter / xlsxwriter_rs approach is going to be the best way forward, given the native Rust integration (and previous experience with that library).

Well.. I really don't think we should ship an excel reader / writer in our binary. It is a format I really don't want to push either.

Having this as an optional python dependency could be added as a utility function if it doesn't adds much maintenance and complexity.

ritchie46 avatar Jan 12 '23 07:01 ritchie46

Having this as an optional python dependency could be added as a utility function if it doesn't adds much maintenance and complexity.

Gotcha; in that case I'll take care of it on the Python side (with xlsxwriter), and if it gets traction or there are further pushes for it on the Rust side later we'll be in a good position to translate (though your point about it ending up in the binary is well-taken; my mental model was defaulting to Python-space there, where it's just a pip install away, and the library doesn't actually live in our code ;)

alexander-beedie avatar Jan 12 '23 08:01 alexander-beedie

Currently, the read_excel API uses a non-pythonic array, as the index of sheets start at 1: https://github.com/dilshod/xlsx2csv/blob/3180d9490f64baa25495a8098903acd28f1aa131/xlsx2csv.py#L438

Ideally, the first sheet id should be 0, but I don't want to make changes that could potentially break people's codes, so should I

  • keep the behavior for the new parser the same, starting at 1 or
  • make the new behavior for only openpyxl driver, keeping the same for xlsx2csv or
  • make both behaviors the same, possibly breaking current implementations

Also thanks @alexander-beedie for fixing the format_path issue, it was driving me insane because it always ran fine in my local machine but not on CI 😄

bvanelli avatar Jan 14 '23 16:01 bvanelli

Currently, the read_excel API uses a non-pythonic array, as the index of sheets start at 1: https://github.com/dilshod/xlsx2csv/blob/3180d9490f64baa25495a8098903acd28f1aa131/xlsx2csv.py#L438

Maybe make an issue there that they should use None for all sheets instead of 0.

ghuls avatar Jan 14 '23 22:01 ghuls

I marked the PR as ready, as I have wrote some tests and improved test coverage for the old reader also.

If possible please someone take a look at how I used the methods on tests, for example the following:

    df_by_sheet_id = pl.read_excel(  # type: ignore[call-overload]
        example_file, sheet_id=1
    )

I had to introduce the ignore[call-overload] because mypy was complaining. I'm not exactly sure why because I have little experience with mypy. This was also used on the tests from before (see line below, taken from master) so I used it also for my tests, but there must be a reason why it's triggering.

https://github.com/pola-rs/polars/blob/32c97cf924445b73001df5ff5a6a602dc3acf977/py-polars/tests/unit/io/test_excel.py#L18

bvanelli avatar Jan 21 '23 14:01 bvanelli

Ahh, I feel a little bad as I may not have been clear enough that I was planning to add write_excel functionality myself, as I previously wrote some powerful excel/dataframe export functionality in a previous job and was planning something similar here. I'll check over the PR tomorrow/shortly to see if we can preserve some of it... 😅

alexander-beedie avatar Jan 22 '23 17:01 alexander-beedie

Ahh, I feel a little bad as I may not have been clear enough that I was planning to add write_excel functionality myself, as I previously wrote some powerful excel/dataframe export functionality in a previous job and was planning something similar here. I'll check over the PR tomorrow/shortly to see if we can preserve some of it... 😅

Ah, sorry, I misunderstood then. I implemented it through engines just like with pandas, so multiple engines can be used, so you can have multiple libraries as optional dependencies. Maybe it can be adapted together?

bvanelli avatar Jan 22 '23 18:01 bvanelli

Any update on this? I have some use cases for writing to excel where this would be great to have.

Is there a blocker?

cnpryer avatar Feb 19 '23 05:02 cnpryer

Any update on this? I have some use cases for writing to excel where this would be great to have.

Is there a blocker?

I'd like to know also. I can resolve merge conflicts/adapt changes if someone is willing to review it. 😃

bvanelli avatar Feb 19 '23 09:02 bvanelli

Sorry for the hold-up; I've finished the first iteration of write_excel now, and it's running through CI as we speak: https://github.com/pola-rs/polars/pull/7251. Still very interested in improvements to reading though! 👍

alexander-beedie avatar Feb 28 '23 21:02 alexander-beedie

I took a look at the MR, and it looks pretty good from the exporting perspective, however, I still feel like there is stuff missing on the importing perspective, which I think xlsxwriter does not touch. The importer library is ok for simple scenarios but it does not handle data types well, that openpyxl can do natively. See for example my datatypes test.

Also, I feel like the you should include engine in your exporter just like pandas does for their read/write excel methods, as in the future one parser/writer written in rust could be added to the available drivers for, let's say, very fast exports. One example here.

bvanelli avatar Feb 28 '23 21:02 bvanelli

I still feel like there is stuff missing on the importing perspective

Definitely; if you can update your PR to focus on just the import, I'll review - improvements in that area are very welcome 😄

Also, I feel like the you should include engine in your exporter just like pandas does

I don't see much purpose in doing so. If we were to move to a Rust-based exporter (for instance), it would most likely be the Rust version of xlsxwriter, in which case we could transition the internals/API pretty transparently.

The main reason pandas offers different engines is that it doesn't actually do anything with them itself; it's more like a trivial bootstrap, eg: "here's the data, now you're on your own". The way I've implemented it here, we actually handle all of the table/sheet relative positioning and column range determination so that you can declare what you want on a per-column/dtype basis, while still allowing direct integration with the underlying xlsxwriter library. It's a different (and, imho, a more productive/intuitive) approach...

alexander-beedie avatar Mar 01 '23 08:03 alexander-beedie

@alexander-beedie Finally had some time to fix the MR. There are maybe two open points I needed some input:

  1. I had to import the DataFrame inside the function to avoid circular imports. I wonder if there is a better way to do it

https://github.com/pola-rs/polars/blob/83cf521d657d03f3a9cb0465444582dc471f1630/py-polars/polars/io/excel/functions.py#L228-L236

  1. The readers are fundamentally different, I tried to wrap them on the same API. Unfortunately, the lazy loading of the read_only=True does not allow this behavior, as the wb.close() must be called (reference). If the method raises an exception, then it will not be called. This is better suited for a context handle. If I remove the read_only, then the behavior will match, at the cost of reading speed.

https://github.com/pola-rs/polars/blob/83cf521d657d03f3a9cb0465444582dc471f1630/py-polars/polars/io/excel/functions.py#L190

In any case, could someone review the changes? The biggest improvement of the MR is having native data types like boolean and datetime without manual conversion, which are currently not inferred:

https://github.com/pola-rs/polars/blob/83cf521d657d03f3a9cb0465444582dc471f1630/py-polars/tests/unit/io/test_excel.py#L48-L68

bvanelli avatar Jun 27 '23 19:06 bvanelli

@bvanelli This is good work and I would like to have openpyxl support - it's a more modern/feature rich option when compared to xlsx2csv, in my opinion.

@alexander-beedie Would you mind taking a look if this PR is ready / can be adopted into our current Excel capabilities?

stinodego avatar Aug 26 '23 16:08 stinodego

@alexander-beedie Would you mind taking a look

Will do - I'm definitely in favour of improving the read functionality, and having an openpyxl option is almost certainly the way to go (apologies for missing the update earlier) 👍

@bvanelli: would you mind updating/rebasing the PR? Let's see if we can get it in :)

alexander-beedie avatar Aug 27 '23 14:08 alexander-beedie

@bvanelli: would you mind updating/rebasing the PR? Let's see if we can get it in :)

Hello, thanks for the update. I do not mind mind updating/doing changes but I'm currently in vacation away from any computer for a week, so I won't be able able to do it until the 4th of September.

bvanelli avatar Aug 27 '23 15:08 bvanelli

@alexander-beedie I merged current main into my branch and solved the conflicts. I also slightly updated the documentation to reflect the added features. As a sidenote, here is a benchmark test comparing both libraries:

from __future__ import annotations
import polars as pl
from pathlib import Path


def test_openpyxl() -> None:
    excel_file_path = Path(__file__).parent / "example_benchmark_file.xlsx"
    df = pl.read_excel(excel_file_path, sheet_id=0, engine="openpyxl")


def test_xlsx2csv() -> None:
    excel_file_path = Path(__file__).parent / "example_benchmark_file.xlsx"
    df = pl.read_excel(excel_file_path, sheet_id=0)
  • xlsx2csv: 1.168 seconds
  • openpyxl: 3.303 seconds

I used an excel file with 50k rows and 3 columns.

bvanelli avatar Sep 03 '23 10:09 bvanelli

Ok, the integration looks fine; I'll merge as-is and follow-up shortly (probably later today) with some additional enhancements that will be common to both engines 👍

Thanks for this one @bvanelli!

alexander-beedie avatar Sep 06 '23 13:09 alexander-beedie