
CLN/API: implemented to_html in terms of .style

Open jreback opened this issue 9 years ago • 17 comments

Implement to_html / notebook repr based on .style.

Probably need to expand this to take a use argument to select the style; it needs to be 'classic' for a while, to replicate the current .to_html output.

jreback avatar Nov 25 '15 13:11 jreback

Some discussion related to this was going on in https://github.com/pandas-dev/pandas/pull/14975#issuecomment-269956133. Summarizing some elements here:

Barriers: some missing features are needed before such a replacement is possible (see also some elements in https://github.com/pandas-dev/pandas/issues/11610)

  • truncated display
  • writing to a file (#13379)

Advantages:

  • would eliminate a lot of code that gives similar functionality (HTMLFormatter, possibly other formatters) -> converging to one formatting system

Disadvantages:

  • formally adding jinja2 as a dependency.
  • performance?
    • plain HTML rendering of a DataFrame with 10 columns / 10,000 rows of floats: df.style.render() takes 19.6 s vs 2.7 s for df.to_html()
    • for notebook reprs (which are typically truncated) this will probably not be a problem

cc @TomAugspurger For basic html output / notebook repr, it would maybe be useful to have a base class that has a simpler template and does not support all the different customization methods? For example, I can imagine that leaving out all the id=.. (which are not needed for basic display I think?) can improve perf / simplify things.

jorisvandenbossche avatar Jan 03 '17 22:01 jorisvandenbossche

For basic html output / notebook repr, it would maybe be useful to have a base class that has a simpler template and does not support all the different customization methods?

100% agree with your comments here. This wouldn't really be implementing df.to_html using .style. Instead we'd have a common Jinja2 template that would handle the logic of iterating over rows, inserting tags. Then .to_html() and .style would extend that base template. .to_html probably wouldn't change much from the base really.

Also, Jinja depends on MarkupSafe, so that becomes another dependency.
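The base-template idea above can be sketched without Jinja2 at all. The following is only an illustration that uses plain Python classes in place of templates; BaseTableRenderer, PlainRenderer, StyledRenderer and the id naming scheme are hypothetical names, not pandas classes. It shows one shared piece of row-iterating logic, with the basic output leaving out the per-cell id=.. attributes and the styled output adding them through a hook.

```python
from html import escape


class BaseTableRenderer:
    """Shared logic: iterate over rows and emit <table>/<tr>/<td> tags."""

    def cell_attrs(self, row_idx, col_idx):
        # Hook for subclasses; the base renderer adds no attributes.
        return ""

    def render(self, columns, rows):
        head = "".join(f"<th>{escape(str(c))}</th>" for c in columns)
        body = []
        for i, row in enumerate(rows):
            cells = "".join(
                f"<td{self.cell_attrs(i, j)}>{escape(str(v))}</td>"
                for j, v in enumerate(row)
            )
            body.append(f"<tr>{cells}</tr>")
        return (
            f"<table><thead><tr>{head}</tr></thead>"
            f"<tbody>{''.join(body)}</tbody></table>"
        )


class PlainRenderer(BaseTableRenderer):
    """Stand-in for a basic to_html: uses the base output unchanged."""


class StyledRenderer(BaseTableRenderer):
    """Stand-in for .style: adds per-cell ids so CSS rules can target cells."""

    def cell_attrs(self, row_idx, col_idx):
        # Hypothetical id scheme, chosen only for this sketch.
        return f' id="T_row{row_idx}_col{col_idx}"'


columns = ["a", "b"]
rows = [[1, 2.5], [3, "<x>"]]
plain_html = PlainRenderer().render(columns, rows)
styled_html = StyledRenderer().render(columns, rows)
```

In a Jinja2 version the hook would presumably be a template block that the child template overrides, but the division of labour would be the same.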

TomAugspurger avatar Jan 03 '17 22:01 TomAugspurger

Was there ever any progression on these ideas?

FYI the performance disadvantage above is much improved since 2017: instead of 19.6 s vs 2.7 s, I now get about 3.9 s versus 1.9 s.

Also note #39951

attack68 avatar Feb 21 '21 11:02 attack68

I don't agree with the advantage mentioned by @jorisvandenbossche: While I'm all for one convergent formatting system, a templating engine is not the solution. It just does not work for everything: as I said in #21673, there are other formats (like Excel) that cannot (realistically) be built using a templating engine.

Also, I am not enthusiastic about making Jinja a hard dependency to render templates (for both HTML and LaTeX, or anything else).

EDIT: My idea is that the various (styleable) *Formatters (HTMLFormatter, NotebookFormatter, ExcelFormatter, ...) should be extended to get the ability to optionally apply styles to their output (like I described in #21673).

moi90 avatar Mar 10 '21 09:03 moi90

EDIT: My idea is that the various (styleable) *Formatters (HTMLFormatter, NotebookFormatter, ExcelFormatter, ...) should be extended to get the ability to optionally apply styles to their output

Isn't ExcelFormatter already used to do precisely this?

toobaz avatar Mar 10 '21 16:03 toobaz

I don't agree with the advantage mentioned by @jorisvandenbossche: While I'm all for one convergent formatting system, a templating engine is not the solution. It just does not work for everything:

I don't believe the objective here is to have one convergent system for everything, rather this post is about having one convergent formatting system for to_html, as opposed to Styler with jinja2 and DataFrame.to_html with HTMLFormatter.

jinja2 is a go-to for generating HTML in Python thanks to packages like Flask and Django, so it is a logical combination if you are rendering HTML tables from pandas. It also gives users template-extension flexibility that HTMLFormatter cannot.

Since jinja2 is a dependency of Styler, and if we assume that is not going away, any Styler.to_latex method would have jinja2 available to it. Some initial work suggests this is quite easy to incorporate, or at least that it can replicate the existing DataFrame.to_latex() functionality, without having, imo, the horrible subclassing of Formatters. https://github.com/pandas-dev/pandas/compare/master...attack68:latex_styler_mvp

attack68 avatar Mar 10 '21 21:03 attack68

I'm conflicted. On one hand, it's nice to remove code. On the other, I'm not sure how much code we would really save in exchange for a "stronger" dependency on jinja2. In #40344, you say that some of the arguments of to_html() (e.g. min_rows) are pointless because they are "related to console display"... but if the idea is that DataFrame.to_html() and Styler.to_html() are formatted with templates but not DataFrame._repr_html_(), then we are not really gaining much - we still need internal code to produce html for console display, right? And by the way, the fact that Styler._repr_html_() does not truncate data like DataFrame._repr_html_() does should probably be considered a bug.

The possibility to export to other formats via jinja2 is also something potentially interesting but to be better investigated. While your attempt in https://github.com/pandas-dev/pandas/compare/master...attack68:latex_styler_mvp is cool, I suspect the complexity will increase quite a bit once we start supporting formatting (which won't use stuff like css), to the point that what jinja2 actually delivers is only a small part of the task of formatting to LaTeX.

I would be happy to be proven wrong though. How difficult would it be, in https://github.com/pandas-dev/pandas/pull/40312, to run the test suite with DataFrame.to_html() replaced with the jinja2 implementation, just to see what breaks?

toobaz avatar Mar 10 '21 23:03 toobaz

I don't believe the objective here is to have one convergent system for everything, rather this post is about having one convergent formatting system for to_html, as opposed to Styler with jinja2 and DataFrame.to_html with HTMLFormatter.

You're right if it is certain that HTMLFormatter can be completely removed. Is that the case? It seems not, guessing from @toobaz' comment.

moi90 avatar Mar 11 '21 10:03 moi90

You're right if it is certain that HTMLFormatter can be completely removed. Is that the case? It seems not, guessing from @toobaz' comment.

@moi90 If the goal is to replicate all of the functionality from DataFrame.to_html(), then yes, it can be done, and a lot has already been done in my WIP PR. Not all though, because I wanted to raise the issue of blindly replicating a function which in some cases produces deprecated HTML, and instead consider the merits of making some changes, perhaps with a view to pandas 2.0.

While your attempt in master...attack68:latex_styler_mvp is cool, I suspect the complexity will increase quite a bit once we start supporting formatting (which won't use stuff like css), to the point that what jinja2 actually delivers is only a small part of the task of formatting to LaTeX.

@toobaz I progressed the MVP to a state where it now has a lot of general conditional styling capability for LaTeX tables. See my response here. I still want to be able to add some table-level styles like column colouring or odd/even colouring, but these are quite easy extensions.

I would be happy to be proven wrong though. How difficult would it be, in #40312, to run the test suite with DataFrame.to_html() replaced with the jinja2 implementation, just to see what breaks?

Quite easy: I just need to redirect the method. When I push it I will ping you to take a look at the test results.

attack68 avatar Mar 11 '21 15:03 attack68

And by the way, the fact that Styler._repr_html_() does not truncate data like DataFrame._repr_html_() does should probably be considered a bug.

Actually I think the opposite. The docstring for _repr_html_ states it is mainly for IPython / Jupyter, which has its own auto-scrolling feature. I find it a real nuisance when pandas truncates my dataframes, so I always revert to the default df.style display because it shows everything. If you want to view a dataframe in a console, don't use an HTML representation, no?

attack68 avatar Mar 11 '21 18:03 attack68

The docstring for _repr_html_ states it is mainly for IPython / Jupyter, which has its own auto-scrolling feature.

Sure, but passing the notebook a table with millions of rows will just make it crash, whether or not you scroll. We can discuss the optimal number of rows to show (notice that you can easily customize it), but I'm afraid "no limit" is not an option.

If you want to view a dataframe in a console, don't use an HTML representation, no?

Sure, the point is indeed about notebooks.

toobaz avatar Mar 11 '21 18:03 toobaz

Sure, but passing the notebook a table with millions of rows will just make it crash, whether or not you scroll. We can discuss the optimal number of rows to show (notice that you can easily customize it), but I'm afraid "no limit" is not an option.

Does pandas set a limit on the size of a DataFrame you can construct, or is its limit just naturally determined by system constraints? The same logic could be argued here, albeit one is inside native Python and the other is rendering in an external application like Jupyter in a browser (so the error might not be as obvious).

I have seen multiple use cases of wanting to visualise large tables: one is here, with the other up to 20,000 rows. To be honest, that's the largest I've seen, so even if I'm not convinced a limit is necessary, a limit above that would not have affected any use case I have seen so far. From memory that only took seconds to render, so I would be happy with that.

attack68 avatar Mar 12 '21 07:03 attack68

I have seen multiple use cases of wanting to visualise large tables: one is here, with the other up to 20,000 rows.

I regularly use tables with a couple of million rows inside Jupyter and it's great to see them easily. I would hate to crash my notebook every time I view them without thinking about truncating them. I'm sure many people use pandas with much larger databases. Again, I think deprecating the truncated visualization is not an option. I might be wrong on the need to truncate Styler too, however, so we can leave that option out of this discussion.

toobaz avatar Mar 12 '21 08:03 toobaz

Indeed, removing truncation from the default html repr is currently not an option I think (unless we would use a more advanced widget that eg does that automatically, but that's another discussion). There are already settings to change the number of rows to show, if you want to change this as a user.

So if we want to replace the to_html/_repr_html_ with Styler, the truncation functionality will need to be added to Styler (although I don't think that Styler needs to do that by default).
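As a rough illustration of the truncation step that would need to be added, here is a minimal sketch that keeps the first and last rows and inserts a placeholder row in between. The function name truncate_rows and the head/tail split are assumptions made for this example; this is not how pandas or Styler actually decide what to show.

```python
def truncate_rows(rows, max_rows):
    """Return rows unchanged if short enough, else head + marker + tail."""
    if max_rows is None or len(rows) <= max_rows:
        # No limit requested, or the table already fits: show everything.
        return list(rows)
    head_n = (max_rows + 1) // 2      # head gets the extra row when odd
    tail_n = max_rows - head_n
    marker = ["..."] * len(rows[0])   # placeholder row marking the gap
    tail = list(rows[-tail_n:]) if tail_n else []
    return list(rows[:head_n]) + [marker] + tail


rows = [[i, i * i] for i in range(1000)]
shown = truncate_rows(rows, max_rows=4)   # 2 head rows, marker, 2 tail rows
```

Passing max_rows=None returns every row, which matches the suggestion that truncation would not need to be the default for Styler.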

jorisvandenbossche avatar Mar 12 '21 08:03 jorisvandenbossche

OK seems well supported, adding this to the list of things needed.

attack68 avatar Mar 12 '21 10:03 attack68

This wasn't really closed by #40312, which only added a Styler.to_html, and didn't implement the main to_html in terms of Styler.

jorisvandenbossche avatar Aug 25 '21 20:08 jorisvandenbossche

In #45382 I'm proposing changing the signature of DataFrame.to_latex to:

DataFrame.to_latex(hide, format, format_index, render_kwargs)

and this will perform the following:

DataFrame.style.hide(**hide).format(**format).format_index(**format_index).to_latex(**render_kwargs)

This has the advantage of:

  • converting the method to use the Styler implementation
  • not requiring updates to the argument signature of DataFrame.to_latex, since it passes the kwargs through
  • allowing a structured deprecation cycle where all the existing args can be restructured into this format as documented.

Is this reasonable and would it be appropriate to aim for something similar with to_html for v2.0?

attack68 avatar Jan 26 '22 21:01 attack68