pandas Support writing/reading notes to/from excel files

trafficstars

[x] closes #58070
[x] [Tests added and passed]
[x] All [code checks passed]
[x] Added [type annotations]
[x] Added an entry in the latest doc/source/whatsnew/v3.0.0.rst file if fixing a bug or adding a new feature.

Me and my colleague @Dacops developed this feature with the use case issued in mind. We believe the solution can be better understood when looking at the examples we provide in the documentation but also in the tests we made.

Unfortunately, not every engine supported by pandas had the capability to handle notes (calamine, odfreader, pyxlsb, odswriter), on the remaining ones we took the liberty to not only add writing capability as requested in #58070 but also add the complementary operation of reading notes.

In summary, now you can write notes to an excel (using either openpyxl or xlswriter) by doing:

df = pandas.DataFrame(...) # DataFrame holding cell content
df_notes = pandas.DataFrame(...)  # DataFrame holding notes
df.style.set_tooltips(df_notes).to_excel(excel_file.xlsx)

The complementary of this (reading using either xlrd or openpyxl) would look something like:

df_notes = pandas.DataFrame() # empty DataFrame
pandas.read_excel(path, engine="xlrd", notes=df_notes) # df_notes now hold the notes as they were in the excel

We will be available and ready to answer any doubts related to the pr.

Hope everything is to your liking.

May 26 '24 17:05 diogomsmiranda

I think this looks like a good PR overall but this is not actually what I originally suggested. The ExcelFormatter class in pandas is capable of detecting a Styler object and what I suggested was that that branching could be extended to handle an attached Styler tooltips object. I.e. I was proposing modifying the Styler.to_excel method and not the core DataFrame.to_excel method.

However, what you have built is the generic ability for the to_excel method to handle comments in the forms of notes which is an additional keyword argument to the underlying function. Additionally when looking at first glance you seem to have added the ability to read notes as well. This may well all be quite useful functionality.

So this is a comprehensive PR. It may have been best to segregate these different components into different successive PRs, to make it easier to properly review and consider all the edge cases.

One case that springs to mind is that the way tooltips works for Styler is to reindex_like so that its columns and index align with the underlying dataframe. This does not appear to be the case for notes, so some kind of consideration for alignment, or properly documenting the difference would probably be needed.

Before doing a full review it will be interesting to hear from other core devs. Also in regards to potential spilt of the PR.

May 26 '24 19:05 attack68

I am -.5 on adding this as a keyword argument to the excel I/O functions; the fact that so few of the underlying libraries supports this is a red flag. Generally we have a really high bar to expand our API, and adding this without supporting all underlying libraries is confusing for end users

May 28 '24 12:05 WillAyd

I think this looks like a good PR overall but this is not actually what I originally suggested. The ExcelFormatter class in pandas is capable of detecting a Styler object and what I suggested was that that branching could be extended to handle an attached Styler tooltips object. I.e. I was proposing modifying the Styler.to_excel method and not the core DataFrame.to_excel method.

@attack68 Thank you for the comment, and sorry for our miss interpretation, I've adjusted the description of the PR accordingly.

One case that springs to mind is that the way tooltips works for Styler is to reindex_like so that its columns and index align with the underlying dataframe. This does not appear to be the case for notes, so some kind of consideration for alignment, or properly documenting the difference would probably be needed.

That's indeed a point we noticed once we were testing the code, however, we didn't know what would be the right way to approach this, we'll await further instructions from core developers as you suggested.

May 28 '24 14:05 diogomsmiranda

I am -.5 on adding this as a keyword argument to the excel I/O functions; the fact that so few of the underlying libraries supports this is a red flag. Generally we have a really high bar to expand our API, and adding this without supporting all underlying libraries is confusing for end users

@WillAyd Thank you for having a look at our pull request.

We acknowledge that introducing a feature without support for all underlying libraries may not appear ideal. Initially, we did not foresee this would be a problem. However, after a detailed examination of the source code from the relevant libraries, we discovered that many are no longer actively developed and/or have limited functionality, python-calamine, however, is exception to this as it's relatively new still. This likely contributes to the current lack of support for notes.

Despite these limitations, implementing this feature in pandas will enable support for notes in Excel formats such as .xls, .xlsx, .xlsm, .xltx, and .xltm. While this support is somewhat restricted, we believe it will still significantly improve user workflows.

We leave links to the libraries so you can see for yourself.

https://pypi.org/project/odswriter/
https://pypi.org/project/odfpy/
https://pypi.org/project/pyxlsb/
https://pypi.org/project/python-calamine/

May 28 '24 15:05 diogomsmiranda

I.e. I was proposing modifying the Styler.to_excel method and not the core DataFrame.to_excel method.

@attack68 Thank you for the comment, and sorry for our miss interpretation, I've adjusted the description of the PR accordingly.

But this PR is still modifying the core DataFrame.to_excel method?

However, after a detailed examination of the source code from the relevant libraries, we discovered that many are no longer actively developed and/or have limited functionality, python-calamine, however, is exception to this as it's relatively new still. This likely contributes to the current lack of support for notes.

While python-calamine is new, the Rust library it uses, calamine has been around since 2016. And in the ReadMe, they state

Many (most) part of the specifications are not implemented, the focus has been put on reading cell values and vba code.

From this, it seems unlikely for calamine to gain the ability to read notes. If I'm understanding this PR correctly, only openpyxl and xlrd support reading comments. Coupled with the fact that the API is, in my opinion, somewhat awkward (passing in a DataFrame to store the result), I am rather negative on supporting reading notes as well.

Somewhat of a separate question - are there engines which are no longer maintained and they provide no additional functionality from libraries that are maintained? These could be deprecated - xlrd comes to mind, but I'm not sure if calamine completely replaces it.

May 30 '24 20:05 rhshadrach

But this PR is still modifying the core DataFrame.to_excel method?

No, to_excel is not modified, the notes are kept as an attribute of ExcelFormatter (defaulted to None) that is then received by the writers.

From this, it seems unlikely for calamine to gain the ability to read notes. If I'm understanding this PR correctly, only openpyxl and xlrd support reading comments. Coupled with the fact that the API is, in my opinion, somewhat awkward (passing in a DataFrame to store the result), I am rather negative on supporting reading notes as well.

Reading notes was initially added for testing purposes as an afterthought. It's a more intrusive method and is supported by fewer libraries. Now that writing notes has demonstrated its intended functionality, this feature could be removed.

Somewhat of a separate question - are there engines which are no longer maintained and they provide no additional functionality from libraries that are maintained? These could be deprecated - xlrd comes to mind, but I'm not sure if calamine completely replaces it.

Based on our research, the writers xlswriter and openpyxl both write to .xlsx files (I believe Pandas uses both in parallel due to some compatibility issues). The odswriter writes to .ods files. For the readers, xlrd reads from old-style Excel formats (.xls), while openpyxl reads from .xlsx, .xlsm, .xltx, and .xltm files. Odfreader handles OpenDocument type files (.odf, .ods, .odt), and pyxlsb reads binary (.xlsb) files. Calamine is the most interesting package as it reads from both Excel (.xls, .xlsx, .xlsm, .xlsb) and OpenDocument (.ods) formats. It is also a newer package. I believe some readers could be replaced by Calamine; it’s just a matter of checking for any compatibility issues.

Jun 03 '24 00:06 Dacops

Rebased to address a conflict with this commit

Jun 06 '24 19:06 Dacops

Reading notes was initially added for testing purposes as an afterthought. It's a more intrusive method and is supported by fewer libraries. Now that writing notes has demonstrated its intended functionality, this feature could be removed.

@attack68 - you're good with just writing notes instead of having reading as well?

Jun 06 '24 21:06 rhshadrach

Yes that sounds good. I think the writing mode is a good feature and relatively speaking it is a fairly neglible extension to the overall API. I think this should be accepted to merge.

I understand that the reading was added probably as a 'round trip' test whcih is sensible but reading is a separate process and this could (and should) be added in a separate, atomic PR, albeit that seems like lesser chance of acceptance.

Jun 07 '24 08:06 attack68

Yes that sounds good. I think the writing mode is a good feature and relatively speaking it is a fairly neglible extension to the overall API. I think this should be accepted to merge.

I understand that the reading was added probably as a 'round trip' test whcih is sensible but reading is a separate process and this could (and should) be added in a separate, atomic PR, albeit that seems like lesser chance of acceptance.

We are also ok with this. @rhshadrach should we now delete the reading from this pr and open a separate one?

Jun 07 '24 18:06 diogomsmiranda

@diogomsmiranda - modifying this PR would be preferable.

Jun 08 '24 11:06 rhshadrach

@rhshadrach We've modified the PR, however we are having a weird issue with the docstrings validation, any idea what might be causing this problem?

Jun 09 '24 07:06 Dacops

@rhshadrach

I put remarks on one of the writers - I think this applies to all writers.

I've updated this in all writers, except the first example, which I didn't understand, could you better explain how do you want us to change that logic?

This also needs tests. While pandas does not have the ability to read the notes, you can use one of the Excel engines that does to read them directly.

I've brought all the tests from the original issue except now the reading is handled directly by the library openpyxl in the testing file.

What happens in the case of merged cells?

The comment writing occurs after all the content is written, so if the notes are passed in mind with cell merging that's how the writing will proceed.

Also some datetime tests started failing on the 32-bits test suites due to overflows, is this a general problem?

Jun 16 '24 23:06 Dacops

This looks a lot cleaner now, more targeted and the tests are written to detect the notes. Is there much outstanding other than fixing any failing tests?

Jul 11 '24 18:07 attack68

This looks a lot cleaner now, more targeted and the tests are written to detect the notes. Is there much outstanding other than fixing any failing tests?

Nope, I believe this is done, the failing tests were something that was failing globally a while back (32bit test suites were suffering from overflows) but I believe this is fixed now. I'll rebase the PR to address the conflict at the whatsnew file and it should pass everything.

Jul 12 '24 16:07 Dacops

Hey, @rhshadrach for some reason Github is still showing that you requested changes, but I believe we already addressed them. Can you confirm everything is ok?

Jul 12 '24 21:07 diogomsmiranda

Can you confirm everything is ok?

I should have some time in the next few days. As a side note, this would be easier on my end if there wasn't force-pushes involved, as then I could just view the diff from the last time I reviewed this PR. However, because there are force-pushes, I need to review the entire PR again.

Jul 14 '24 11:07 rhshadrach

As a side note, this would be easier on my end if there wasn't force-pushes involved, as then I could just view the diff from the last time I reviewed this PR. However, because there are force-pushes, I need to review the entire PR again.

Thanks for the feedback and sorry for the inconvenience, it's noted for any future contributions.

Jul 14 '24 12:07 diogomsmiranda

@rhshadrach can you take a second look when possible to avoid this stalling whilst its green?

Aug 12 '24 17:08 attack68

/preview

Aug 13 '24 10:08 rhshadrach

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Sep 13 '24 00:09 github-actions[bot]

Thanks for the PR, but it appears to have gone stale. If you could address the comments/requests above and we'd be more than happy to reopen!

Sep 14 '24 13:09 rhshadrach

pandas pandas copied to clipboard

Support writing/reading notes to/from excel files

pandas
pandas copied to clipboard