Improve Notebook Rendering of DataFrame and Series for html and plain (whitespace, consistency)

Open Julian-J-S opened this issue 1 year ago • 6 comments

Problem description

I currently see 3 problems when rendering DataFrame or Series in Notebooks:

  1. "html": whitespace collapse
  2. "plain": DataFrame does not show quotes for str columns
  3. "plain": Series output is not nice (why not use pretty format of DataFrame?)

1. Whitespace collapse

Problem:

  • Because HTML collapses whitespace, the rendered output does not show the actual data when a value contains multiple consecutive whitespace characters, so data problems can go unnoticed.
  • If you copy & paste a value containing multiple whitespace characters from the rendered output and use it in a filter, you will NOT FIND IT!

Example:

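import polars as pl

DATA = {'text': ['no', '     before', 'after     ', '     both     ', 'bet     ween']}
df = pl.DataFrame(data=DATA)
ser = df.get_column('text')

# A value copied from the rendered html output has its whitespace collapsed,
# so it no longer equals the stored string.
df.filter(pl.col('text') == '<PASTED FROM RENDERED OUTPUT>')  # will NOT work!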
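
To make the mismatch concrete without a notebook, here is a rough stdlib-only sketch. The helper `collapse_whitespace` is hypothetical and only approximates what a browser does under the default `white-space: normal` (runs of whitespace become one space, leading/trailing whitespace is dropped); it is not part of polars.

```python
import re

DATA = {'text': ['no', '     before', 'after     ', '     both     ', 'bet     ween']}


def collapse_whitespace(text: str) -> str:
    """Approximate how default HTML rendering displays a string (assumption)."""
    return re.sub(r"\s+", " ", text).strip()


for original in DATA['text']:
    shown = collapse_whitespace(original)
    # Only values without extra whitespace survive a copy & paste round trip.
    status = "same" if shown == original else "DIFFERENT"
    print(f"{original!r:20} rendered as {shown!r:12} {status}")
```

Only `'no'` comes back unchanged, so a filter built from any of the other pasted values would match nothing.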

Solution:

I am not an HTML expert, but there seems to be a one-line solution for this using CSS (I can make a PR if accepted):

  • use white-space: pre; in the NotebookFormatter style tag
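
A minimal sketch of the idea, assuming the table is emitted as a string with an inline style tag. The function `render_html_table` and the `td` selector are made up for illustration; the real NotebookFormatter may structure its markup and CSS differently.

```python
from html import escape

# Hypothetical style rule: keep whitespace inside table cells exactly as written.
STYLE = "<style>td { white-space: pre; }</style>"


def render_html_table(column: str, values: list[str]) -> str:
    """Build a one-column HTML table whose cells preserve whitespace."""
    rows = "\n".join(f'<tr><td>"{escape(v)}"</td></tr>' for v in values)
    header = f"<tr><th>{escape(column)}</th></tr>"
    return f"{STYLE}\n<table>\n{header}\n{rows}\n</table>"


print(render_html_table("text", ["no", "     before", "bet     ween"]))
```

The spaces are already present in the generated markup either way; the style rule only changes whether the browser shows (and copies) them.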

2. DataFrame str columns missing quotes in "plain" output

Problem

  • you cannot tell whether a value has trailing whitespace because, unlike the html format, the plain format puts no quotes around str values

Example

# current
┌────────────────┐
│ text           │
│ ---            │
│ str            │
╞════════════════╡
│ no             │
│      before    │
│ after          │
│      both      │
│ bet     ween   │
└────────────────┘

# desired
┌──────────────────┐
│ text             │
│ ---              │
│ str              │
╞══════════════════╡
│ "no"             │
│ "     before"    │
│ "after     "     │
│ "     both     " │
│ "bet     ween"   │
└──────────────────┘
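
As an illustration of the desired layout only, the following stdlib sketch draws a one-column table with quoted str values. `render_plain_column` is a hypothetical helper, not polars' actual formatter, and it handles just the single-column case.

```python
def render_plain_column(name: str, dtype: str, values: list[str], quote: bool = True) -> str:
    """Render a single str column as a box-drawn table, optionally quoting values."""
    cells = [f'"{v}"' if quote else v for v in values]
    width = max(len(c) for c in [name, "---", dtype, *cells])

    def row(text: str) -> str:
        # Pad every cell to the widest entry so the right border lines up.
        return f"│ {text.ljust(width)} │"

    lines = [
        "┌" + "─" * (width + 2) + "┐",
        row(name),
        row("---"),
        row(dtype),
        "╞" + "═" * (width + 2) + "╡",
        *[row(c) for c in cells],
        "└" + "─" * (width + 2) + "┘",
    ]
    return "\n".join(lines)


values = ['no', '     before', 'after     ', '     both     ', 'bet     ween']
print(render_plain_column("text", "str", values))
```

With `quote=False` the same function reproduces the current layout, where trailing whitespace cannot be told apart from cell padding.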

3. Series str output in "plain" mode

Problem

  • the output is not nice and is inconsistent with the DataFrame output (a bracketed list instead of a table)

Example

# current
shape: (5,)
Series: 'text' [str]
[
	"no"
	"     before"
	"after     "
	"     both     "
	"bet     ween"
]

# desired (same as DataFrame)
┌──────────────────┐
│ text             │
│ ---              │
│ str              │
╞══════════════════╡
│ "no"             │
│ "     before"    │
│ "after     "     │
│ "     both     " │
│ "bet     ween"   │
└──────────────────┘

Julian-J-S avatar Aug 18 '23 15:08 Julian-J-S

These are really 3 different issues, but let me try to give my two cents:

  1. Agree, I would welcome a PR for this.
  2. This is probably a good idea - not 100% sure. Including quotes can hurt readability.
  3. I also strongly dislike our Series output, but in your proposal it would be hard to distinguish between Series and DataFrame output. There is probably a good solution out there though.

stinodego avatar Aug 19 '23 09:08 stinodego

@stinodego thanks for your feedback!

regarding 2): imo this doesn't hurt readability that much. I feel it is more in line with the polars idea of making everything very explicit and correct (showing potential leading/trailing whitespace). Also, I guess there is a reason for using quotes in the html cell output, which I really like!

regarding 3): you are right! There must be a way to differentiate between a Series and a DataFrame for single-column data, but the current output already handles this with the shape line above, and I would suggest adopting that.

Should I create 3 distinct issues, link them here and close this one? ;)

Julian-J-S avatar Aug 20 '23 17:08 Julian-J-S

created individual issues: #10643 #10646 #10648

Julian-J-S avatar Aug 21 '23 08:08 Julian-J-S

Series might look better horizontal, what do you think? Here are a few options:

                    ╭──────┬───────┬──────┬───────────┬───────╮
╭─────────────┬─────╯  0   │   1   │  2   │     3     │   4   │
│ "my_series" │ i64 │ test │ hello │ nope │ my_string │ where │  
╰─────────────┴─────┴──────┴───────┴──────┴───────────┴───────╯


                    ┌──────┬───────┬──────┬───────────┬───────┐
┌─────────────┬─────┘  0   │   1   │  2   │     3     │   4   │
│ "my_series" │ i64 │ test │ hello │ nope │ my_string │ where │  
└─────────────┴─────┴──────┴───────┴──────┴───────────┴───────┘


┌┄┄┄┄┄┄┄┄┄┄┄┬┄┄┄┄┄┬──────┬───────┬──────┬───────────┬───────┐
┆ my_series ┆ i64 ┆ test │ hello │ nope │ my_string │ where │  
└┄┄┄┄┄┄┄┄┄┄┄┴┄┄┄┄┄┴──────┴───────┴──────┴───────────┴───────┘
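
For anyone who wants to play with the last variant, here is a rough stdlib sketch of how it could be drawn. `render_series_horizontal` is a made-up name and nothing here exists in polars; it ignores truncation of long Series entirely.

```python
def render_series_horizontal(name: str, dtype: str, values: list) -> str:
    """Draw a Series on one row: dashed borders for name/dtype, solid for values."""
    header = [name, dtype]
    cells = header + [str(v) for v in values]

    # Horizontal border characters, one per cell.
    fills = ["┄"] * len(header) + ["─"] * len(values)
    segments = [fill * (len(cell) + 2) for cell, fill in zip(cells, fills)]
    top = "┌" + "┬".join(segments) + "┐"
    bottom = "└" + "┴".join(segments) + "┘"

    # Vertical bar that closes each cell on the right.
    closers = ["┆"] * len(header) + ["│"] * len(values)
    middle = "┆" + "".join(f" {cell} {bar}" for cell, bar in zip(cells, closers))
    return "\n".join([top, middle, bottom])


print(render_series_horizontal("my_series", "str", ["test", "hello", "nope", "my_string", "where"]))
```

The dtype is passed in as plain text here, so the caller decides what label to show.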

mcrumiller avatar Aug 21 '23 21:08 mcrumiller

In terms of HTML display, I really think we should adopt the display strategy used in pandas. Added my two cents here: https://github.com/pola-rs/polars/issues/10643#issuecomment-1687284107

stevenlis avatar Aug 22 '23 01:08 stevenlis

nice! I really like the idea of horizontal Series and especially the last one which is very clean.

Reasoning

  1. very easy to distinguish between Series and DataFrame (with single col)!
  2. you can go vertical just by calling to_frame()
  3. you can then add line numbers if you need by calling with_row_count()

Julian-J-S avatar Aug 22 '23 07:08 Julian-J-S

I'm closing this as there are separate issues for these points.

stinodego avatar Apr 29 '24 07:04 stinodego