haystack icon indicating copy to clipboard operation
haystack copied to clipboard

feat: Save the output of pipe.eval() as a preformatted Excel

Open camillepradel opened this issue 2 years ago • 14 comments

Related Issues

  • fixes #2431

Proposed Changes:

There are two new functions in schema.py, save_excel() and load_excel(), which work similarly to save() and load(). The main difference is that they write and read a file in excel format.

  • save_excel() saves the evaluation result in Excel format. The result of each node is saved in a separate sheet of the out_file file. The output file is generated according to the eval_template.xlsx template file. For each node, if its corresponding sheet appears in the template and contains a header line, the node evaluation results are appended respecting the specified columns. If there is no preexisting sheet or no header line, all metrics are written. Formatting options from the template still appear in the generated file. multilabel_id column of created file is saved as string to prevent automatic rounding of large numbers (numbers that have greater than 15 digits) by excel.
  • load_excel() loads the evaluation result saved by save_excel()

How did you test it?

Updated test_extractive_qa_eval and test_extractive_qa_eval_multiple_queries test functions.

Notes for the reviewer

This is still a draft; I prefer to make an early PR to get feedback as suggested in the contribution guidelines.

The save_excel() function creates only one file with several sheets, instead of several files like it's done in save(). It seemed to be the "logical" way to do with excel documents.

As I work with Linux, I created the template file and ran the tests with LibreOffice. I believe it would be safer to create the template file and run some more tests with official Microsoft Excel.

I also fixed what I believe to be a small issue with updated test functions: the code saved and reloaded eval results, but then the same tests were run again on the same unchanged variable; so I updated this variable with the loaded data (cf. commit 04afb133bb3178213ec9098355310800cc69141a). I probably should have create separate issue and PR for this, and I can still do it if you prefer.

TODO:

  • recreate template file with Microsoft Office
  • add template sheet for each node type (only Retriever and Reader have been created)
  • delete sheets coming from template which were not used for a given evaluation result instance (there was no corresponding evaluated node)
  • move template file somewhere else -> where ?
  • multiple lines are repeated in updated test functions -> factorize them?

Checklist

camillepradel avatar Aug 08 '22 15:08 camillepradel

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

CLAassistant avatar Aug 08 '22 15:08 CLAassistant

Hi @agnieszka-m! Thank you for your review :) I committed all your suggestions.

camillepradel avatar Aug 09 '22 09:08 camillepradel

Thanks for the feedback @tstadel ! ok, I will try formatting the excel in code instead of using a template.

camillepradel avatar Aug 09 '22 15:08 camillepradel

@tstadel I added some basic formatting using openpyxl.

pylint still produces an error because I am "instantiating" an abstract class (whose __new__() function actually returns an instance of one of its subclasses). This seems to be [the recommended way](https://pandas.pydata.org/docs/reference/api/pandas.ExcelWriter.html).

I could remove the error by directly instantiating the concrete subclass, replacing

        with pd.ExcelWriter(out_file, engine="openpyxl") as writer:

with

        with pd.io.excel._openpyxl.OpenpyxlWriter(out_file) as writer:

but it doesn't feel right to me. Any opinion on this?

camillepradel avatar Aug 11 '22 08:08 camillepradel

@tstadel I added some basic formatting using openpyxl.

pylint still produces an error because I am "instantiating" an abstract class (whose __new__() function actually returns an instance of one of its subclasses). This seems to be [the recommended way](https://pandas.pydata.org/docs/reference/api/pandas.ExcelWriter.html).

I could remove the error by directly instantiating the concrete subclass, replacing

        with pd.ExcelWriter(out_file, engine="openpyxl") as writer:

with

        with pd.io.excel._openpyxl.OpenpyxlWriter(out_file) as writer:

but it doesn't feel right to me. Any opinion on this?

There's an issue in pylint for this: https://github.com/PyCQA/pylint/issues/3060 So, let's suppress it like it's suggested here.

tstadel avatar Aug 11 '22 09:08 tstadel

I think everything is ok now. Apologies for the multiple runs of the CI actions.

camillepradel avatar Aug 12 '22 14:08 camillepradel

@tstadel thanks again for the review.

  • load_excel doesn't work for me if I use save_excel and load_excel in Tutorial 5 instead of the ordinary save and load functions. Let's ensure this works

load_excel() indeed lacked robustness; I fixed it for this case.

  • When saving to excel the resulting xlsx file still has a index column. I'm not sure whether this is a pandas bug, but let's investigate that.

→ It seems that it is not related to save_excel() but to the dataframes returned by pipeline.eval() which contain a column explicitly named index (these dataframes are created here).
The same phenomenon can be observed in files created by save() (for instance when running Tutorial 5).

camillepradel avatar Aug 16 '22 14:08 camillepradel

hey @camillepradel, thanks a lot for your work, this is awesome! :) Do you think it would also be possible to make the design more generic? Design is a quite subjective thing I think and it would be nice if we could get people to define their own style, e.g. font, header format, row colors, freeze header. You used before a template I saw, was this for the whole structure (column labels etc.), or just the design? If it just uses the design, I think it would be a nice alternative to give the user more customization options if a template was provided

ju-gu avatar Aug 17 '22 16:08 ju-gu

Hi @ju-gu. The template was used for the design but it depended on the data to be saved as it was required to define a sheet with a header line for each node eval result to save (this in order to specify columns to write and their respective formatting).

We could make a compromise by adding an optional parameter template_file to save_excel().

  • If the parameter is provided, we use the template file the same way it was done in the early implementation
    The output file is generated according to the template file.
    For each node, if a sheet with the same name appears in the template and 
    contains a header line, the node evaluation results are appended respecting 
    the specified columns and the formatting options from the template are kept.
    If there is no preexisting sheet or no header line, all metrics are written 
    and no formatting is applied.	
    
  • Otherwise, we apply basic formatting as it is done in current implementation.

Would you like me to implement it that way, or were you thinking about something else?

camillepradel avatar Aug 18 '22 10:08 camillepradel

that is already pretty cool. But as @tstadel pointed out it would mean a template for each node, which need to be kept up to date. I think it would be better if we could just use the design of a template. Then it would also work if the headers are not the same. Do you have an idea how to do this? Maybe by just defining a header and then overwriting it?

ju-gu avatar Aug 19 '22 07:08 ju-gu

hey, @camillepradel so I checked again with @ZanSara and it would be also fine if you would just add the previous template option as optional, as you previously described :). I actually will use this a lot, as I am doing a lot of analyzation. So thanks again, for your contribution! looking very much forward to see this :)

ju-gu avatar Aug 19 '22 14:08 ju-gu

Hi @ju-gu ! I restored the template option. There is now only one template sheet, as you suggested. It is a bit less customizable, but much more simple!

camillepradel avatar Aug 24 '22 14:08 camillepradel

thanks! That sounds very good :). I did yet not get to test it and won't for this week, so maybe @bogdankostic and @bglearning can have a look?

ju-gu avatar Aug 29 '22 15:08 ju-gu

Hi @camillepradel, thanks for your PR! I will have a look at this in the following days. In the meantime, could you please delete the setup.cfg file from your branch and add the dependency to the pyproject.toml file? We recently restructured our package to comply with PEP 621.

bogdankostic avatar Aug 31 '22 09:08 bogdankostic

Closing for lack of activity

masci avatar Nov 23 '23 08:11 masci