polars
polars copied to clipboard
Add write_excel and improve read_excel to also use openpyxl for better type inferring
Closes https://github.com/pola-rs/polars/issues/5568
Improvements on Excel support for polars, including a new exporter and an importer that does better type inferring, i.e. handling null entries, correctly parsing datetimes, etc.
There are some things to still do:
- [x] Fix pipelines, lint and format it properly, make sure mypy is passing.
- [x] Improve test coverage
- [x] Add openpyxl as optional installable
- [x] Improve the api a little bit
- [ ] (OPTIONAL) Include basic optional styling
@bvanelli: Thanks for taking a look! However, I suspect that an xlsxwriter / xlsxwriter_rs approach is going to be the best way forward, given the native Rust integration (and previous experience with that library).
@bvanelli: Thanks for taking a look! However, I suspect that an
xlsxwriter/xlsxwriter_rsapproach is going to be the best way forward, given the native Rust integration (and previous experience with that library).
Well.. I really don't think we should ship an excel reader / writer in our binary. It is a format I really don't want to push either.
Having this as an optional python dependency could be added as a utility function if it doesn't adds much maintenance and complexity.
Having this as an optional python dependency could be added as a utility function if it doesn't adds much maintenance and complexity.
Gotcha; in that case I'll take care of it on the Python side (with xlsxwriter), and if it gets traction or there are further pushes for it on the Rust side later we'll be in a good position to translate (though your point about it ending up in the binary is well-taken; my mental model was defaulting to Python-space there, where it's just a pip install away, and the library doesn't actually live in our code ;)
Currently, the read_excel API uses a non-pythonic array, as the index of sheets start at 1: https://github.com/dilshod/xlsx2csv/blob/3180d9490f64baa25495a8098903acd28f1aa131/xlsx2csv.py#L438
Ideally, the first sheet id should be 0, but I don't want to make changes that could potentially break people's codes, so should I
- keep the behavior for the new parser the same, starting at 1 or
- make the new behavior for only openpyxl driver, keeping the same for xlsx2csv or
- make both behaviors the same, possibly breaking current implementations
Also thanks @alexander-beedie for fixing the format_path issue, it was driving me insane because it always ran fine in my local machine but not on CI 😄
Currently, the read_excel API uses a non-pythonic array, as the index of sheets start at 1: https://github.com/dilshod/xlsx2csv/blob/3180d9490f64baa25495a8098903acd28f1aa131/xlsx2csv.py#L438
Maybe make an issue there that they should use None for all sheets instead of 0.
I marked the PR as ready, as I have wrote some tests and improved test coverage for the old reader also.
If possible please someone take a look at how I used the methods on tests, for example the following:
df_by_sheet_id = pl.read_excel( # type: ignore[call-overload]
example_file, sheet_id=1
)
I had to introduce the ignore[call-overload] because mypy was complaining. I'm not exactly sure why because I have little experience with mypy. This was also used on the tests from before (see line below, taken from master) so I used it also for my tests, but there must be a reason why it's triggering.
https://github.com/pola-rs/polars/blob/32c97cf924445b73001df5ff5a6a602dc3acf977/py-polars/tests/unit/io/test_excel.py#L18
Ahh, I feel a little bad as I may not have been clear enough that I was planning to add write_excel functionality myself, as I previously wrote some powerful excel/dataframe export functionality in a previous job and was planning something similar here. I'll check over the PR tomorrow/shortly to see if we can preserve some of it... 😅
Ahh, I feel a little bad as I may not have been clear enough that I was planning to add
write_excelfunctionality myself, as I previously wrote some powerful excel/dataframe export functionality in a previous job and was planning something similar here. I'll check over the PR tomorrow/shortly to see if we can preserve some of it... 😅
Ah, sorry, I misunderstood then. I implemented it through engines just like with pandas, so multiple engines can be used, so you can have multiple libraries as optional dependencies. Maybe it can be adapted together?
Any update on this? I have some use cases for writing to excel where this would be great to have.
Is there a blocker?
Any update on this? I have some use cases for writing to excel where this would be great to have.
Is there a blocker?
I'd like to know also. I can resolve merge conflicts/adapt changes if someone is willing to review it. 😃
Sorry for the hold-up; I've finished the first iteration of write_excel now, and it's running through CI as we speak: https://github.com/pola-rs/polars/pull/7251. Still very interested in improvements to reading though! 👍
I took a look at the MR, and it looks pretty good from the exporting perspective, however, I still feel like there is stuff missing on the importing perspective, which I think xlsxwriter does not touch. The importer library is ok for simple scenarios but it does not handle data types well, that openpyxl can do natively. See for example my datatypes test.
Also, I feel like the you should include engine in your exporter just like pandas does for their read/write excel methods, as in the future one parser/writer written in rust could be added to the available drivers for, let's say, very fast exports. One example here.
I still feel like there is stuff missing on the importing perspective
Definitely; if you can update your PR to focus on just the import, I'll review - improvements in that area are very welcome 😄
Also, I feel like the you should include
enginein your exporter just like pandas does
I don't see much purpose in doing so. If we were to move to a Rust-based exporter (for instance), it would most likely be the Rust version of xlsxwriter, in which case we could transition the internals/API pretty transparently.
The main reason pandas offers different engines is that it doesn't actually do anything with them itself; it's more like a trivial bootstrap, eg: "here's the data, now you're on your own". The way I've implemented it here, we actually handle all of the table/sheet relative positioning and column range determination so that you can declare what you want on a per-column/dtype basis, while still allowing direct integration with the underlying xlsxwriter library. It's a different (and, imho, a more productive/intuitive) approach...
@alexander-beedie Finally had some time to fix the MR. There are maybe two open points I needed some input:
- I had to import the DataFrame inside the function to avoid circular imports. I wonder if there is a better way to do it
https://github.com/pola-rs/polars/blob/83cf521d657d03f3a9cb0465444582dc471f1630/py-polars/polars/io/excel/functions.py#L228-L236
- The readers are fundamentally different, I tried to wrap them on the same API. Unfortunately, the lazy loading of the
read_only=Truedoes not allow this behavior, as thewb.close()must be called (reference). If the method raises an exception, then it will not be called. This is better suited for a context handle. If I remove theread_only, then the behavior will match, at the cost of reading speed.
https://github.com/pola-rs/polars/blob/83cf521d657d03f3a9cb0465444582dc471f1630/py-polars/polars/io/excel/functions.py#L190
In any case, could someone review the changes? The biggest improvement of the MR is having native data types like boolean and datetime without manual conversion, which are currently not inferred:
https://github.com/pola-rs/polars/blob/83cf521d657d03f3a9cb0465444582dc471f1630/py-polars/tests/unit/io/test_excel.py#L48-L68
@bvanelli This is good work and I would like to have openpyxl support - it's a more modern/feature rich option when compared to xlsx2csv, in my opinion.
@alexander-beedie Would you mind taking a look if this PR is ready / can be adopted into our current Excel capabilities?
@alexander-beedie Would you mind taking a look
Will do - I'm definitely in favour of improving the read functionality, and having an openpyxl option is almost certainly the way to go (apologies for missing the update earlier) 👍
@bvanelli: would you mind updating/rebasing the PR? Let's see if we can get it in :)
@bvanelli: would you mind updating/rebasing the PR? Let's see if we can get it in :)
Hello, thanks for the update. I do not mind mind updating/doing changes but I'm currently in vacation away from any computer for a week, so I won't be able able to do it until the 4th of September.
@alexander-beedie I merged current main into my branch and solved the conflicts. I also slightly updated the documentation to reflect the added features. As a sidenote, here is a benchmark test comparing both libraries:
from __future__ import annotations
import polars as pl
from pathlib import Path
def test_openpyxl() -> None:
excel_file_path = Path(__file__).parent / "example_benchmark_file.xlsx"
df = pl.read_excel(excel_file_path, sheet_id=0, engine="openpyxl")
def test_xlsx2csv() -> None:
excel_file_path = Path(__file__).parent / "example_benchmark_file.xlsx"
df = pl.read_excel(excel_file_path, sheet_id=0)
- xlsx2csv: 1.168 seconds
- openpyxl: 3.303 seconds
I used an excel file with 50k rows and 3 columns.
Ok, the integration looks fine; I'll merge as-is and follow-up shortly (probably later today) with some additional enhancements that will be common to both engines 👍
Thanks for this one @bvanelli!