haystack feat: Save the output of pipe.eval() as a preformatted Excel

Related Issues

fixes #2431

Proposed Changes:

There are two new functions in schema.py, save_excel() and load_excel(), which work similarly to save() and load(). The main difference is that they write and read a file in excel format.

save_excel() saves the evaluation result in Excel format. The result of each node is saved in a separate sheet of the out_file file. The output file is generated according to the eval_template.xlsx template file. For each node, if its corresponding sheet appears in the template and contains a header line, the node evaluation results are appended respecting the specified columns. If there is no preexisting sheet or no header line, all metrics are written. Formatting options from the template still appear in the generated file. multilabel_id column of created file is saved as string to prevent automatic rounding of large numbers (numbers that have greater than 15 digits) by excel.
load_excel() loads the evaluation result saved by save_excel()

How did you test it?

Updated test_extractive_qa_eval and test_extractive_qa_eval_multiple_queries test functions.

Notes for the reviewer

This is still a draft; I prefer to make an early PR to get feedback as suggested in the contribution guidelines.

The save_excel() function creates only one file with several sheets, instead of several files like it's done in save(). It seemed to be the "logical" way to do with excel documents.

As I work with Linux, I created the template file and ran the tests with LibreOffice. I believe it would be safer to create the template file and run some more tests with official Microsoft Excel.

I also fixed what I believe to be a small issue with updated test functions: the code saved and reloaded eval results, but then the same tests were run again on the same unchanged variable; so I updated this variable with the loaded data (cf. commit 04afb133bb3178213ec9098355310800cc69141a). I probably should have create separate issue and PR for this, and I can still do it if you prefer.

TODO:

recreate template file with Microsoft Office
add template sheet for each node type (only Retriever and Reader have been created)
delete sheets coming from template which were not used for a given evaluation result instance (there was no corresponding evaluated node)
move template file somewhere else -> where ?
multiple lines are repeated in updated test functions -> factorize them?

Checklist

[X] I have read the contributors guidelines and the code of conduct
[ ] I have updated the related issue with new insights and changes
[X] I added tests that demonstrate the correct behavior of the change
[X] I've used the conventional commit convention for my PR title
[X] I documented my code
[X] I ran pre-commit hooks and fixed any issue

Aug 08 '22 15:08 camillepradel

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

Aug 08 '22 15:08 CLAassistant

Hi @agnieszka-m! Thank you for your review :) I committed all your suggestions.

Aug 09 '22 09:08 camillepradel

Thanks for the feedback @tstadel ! ok, I will try formatting the excel in code instead of using a template.

Aug 09 '22 15:08 camillepradel

@tstadel I added some basic formatting using openpyxl.

pylint still produces an error because I am "instantiating" an abstract class (whose __new__() function actually returns an instance of one of its subclasses). This seems to be [the recommended way](https://pandas.pydata.org/docs/reference/api/pandas.ExcelWriter.html).

I could remove the error by directly instantiating the concrete subclass, replacing

        with pd.ExcelWriter(out_file, engine="openpyxl") as writer:

with

        with pd.io.excel._openpyxl.OpenpyxlWriter(out_file) as writer:

but it doesn't feel right to me. Any opinion on this?

Aug 11 '22 08:08 camillepradel

@tstadel I added some basic formatting using openpyxl.

pylint still produces an error because I am "instantiating" an abstract class (whose __new__() function actually returns an instance of one of its subclasses). This seems to be [the recommended way](https://pandas.pydata.org/docs/reference/api/pandas.ExcelWriter.html).

I could remove the error by directly instantiating the concrete subclass, replacing
        with pd.ExcelWriter(out_file, engine="openpyxl") as writer:
with
        with pd.io.excel._openpyxl.OpenpyxlWriter(out_file) as writer:
but it doesn't feel right to me. Any opinion on this?

There's an issue in pylint for this: https://github.com/PyCQA/pylint/issues/3060 So, let's suppress it like it's suggested here.

Aug 11 '22 09:08 tstadel

I think everything is ok now. Apologies for the multiple runs of the CI actions.

Aug 12 '22 14:08 camillepradel

@tstadel thanks again for the review.

load_excel doesn't work for me if I use save_excel and load_excel in Tutorial 5 instead of the ordinary save and load functions. Let's ensure this works

→ load_excel() indeed lacked robustness; I fixed it for this case.

When saving to excel the resulting xlsx file still has a index column. I'm not sure whether this is a pandas bug, but let's investigate that.

→ It seems that it is not related to save_excel() but to the dataframes returned by pipeline.eval() which contain a column explicitly named index (these dataframes are created here).
The same phenomenon can be observed in files created by save() (for instance when running Tutorial 5).

Aug 16 '22 14:08 camillepradel

hey @camillepradel, thanks a lot for your work, this is awesome! :) Do you think it would also be possible to make the design more generic? Design is a quite subjective thing I think and it would be nice if we could get people to define their own style, e.g. font, header format, row colors, freeze header. You used before a template I saw, was this for the whole structure (column labels etc.), or just the design? If it just uses the design, I think it would be a nice alternative to give the user more customization options if a template was provided

Aug 17 '22 16:08 ju-gu

Hi @ju-gu. The template was used for the design but it depended on the data to be saved as it was required to define a sheet with a header line for each node eval result to save (this in order to specify columns to write and their respective formatting).

We could make a compromise by adding an optional parameter template_file to save_excel().

If the parameter is provided, we use the template file the same way it was done in the early implementation

The output file is generated according to the template file.
For each node, if a sheet with the same name appears in the template and 
contains a header line, the node evaluation results are appended respecting 
the specified columns and the formatting options from the template are kept.
If there is no preexisting sheet or no header line, all metrics are written 
and no formatting is applied.

Otherwise, we apply basic formatting as it is done in current implementation.

Would you like me to implement it that way, or were you thinking about something else?

Aug 18 '22 10:08 camillepradel

that is already pretty cool. But as @tstadel pointed out it would mean a template for each node, which need to be kept up to date. I think it would be better if we could just use the design of a template. Then it would also work if the headers are not the same. Do you have an idea how to do this? Maybe by just defining a header and then overwriting it?

Aug 19 '22 07:08 ju-gu

hey, @camillepradel so I checked again with @ZanSara and it would be also fine if you would just add the previous template option as optional, as you previously described :). I actually will use this a lot, as I am doing a lot of analyzation. So thanks again, for your contribution! looking very much forward to see this :)

Aug 19 '22 14:08 ju-gu

Hi @ju-gu ! I restored the template option. There is now only one template sheet, as you suggested. It is a bit less customizable, but much more simple!

Aug 24 '22 14:08 camillepradel

thanks! That sounds very good :). I did yet not get to test it and won't for this week, so maybe @bogdankostic and @bglearning can have a look?

Aug 29 '22 15:08 ju-gu

Hi @camillepradel, thanks for your PR! I will have a look at this in the following days. In the meantime, could you please delete the setup.cfg file from your branch and add the dependency to the pyproject.toml file? We recently restructured our package to comply with PEP 621.

Aug 31 '22 09:08 bogdankostic

Closing for lack of activity

Nov 23 '23 08:11 masci

haystack haystack copied to clipboard

feat: Save the output of pipe.eval() as a preformatted Excel

Related Issues

Proposed Changes:

How did you test it?

Notes for the reviewer

Checklist

haystack
haystack copied to clipboard