polars
Improve Notebook Rendering of DataFrame and Series for html and plain (whitespace, consistency)
Problem description
I currently see 3 problems when rendering DataFrame or Series in Notebooks:
- "html": whitespace collapse
- "plain": DataFrame does not show quotes for str columns
- "plain": Series output is not nice (why not use pretty format of DataFrame?)
1. Whitespace collapse
Problem:
- because HTML collapses consecutive whitespace, you do not see the data as it is actually stored if a value contains multiple whitespace characters (so data problems can go unnoticed)
- if you copy & paste a value containing multiple whitespace characters from the rendered output and use it in a filter, you will NOT FIND IT!
Example:
import polars as pl

DATA = {'text': ['no', ' before', 'after ', ' both ', 'bet ween']}
df = pl.DataFrame(data=DATA)
ser = df.get_column('text')
df.filter(pl.col('text') == '<PASTED FROM RENDERED OUTPUT>') # will NOT work!
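The mismatch can be reproduced without a notebook. The sketch below only approximates what a browser does to text under the default white-space handling (runs of whitespace collapse to a single space, and leading/trailing whitespace in a cell is dropped). `collapse_like_html` is a hypothetical helper, not part of polars, and the two-space value is an assumed example.

```python
import re


def collapse_like_html(text: str) -> str:
    """Approximate how default HTML rendering displays a cell value."""
    # Collapse every run of whitespace to one space, then drop the edges.
    return re.sub(r"\s+", " ", text).strip()


original = "bet  ween"  # assumed value containing two consecutive spaces
pasted = collapse_like_html(original)  # what you would copy from the output

print(repr(pasted))        # 'bet ween'
print(pasted == original)  # False, so an equality filter finds nothing
```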
Solution:
I am not an html expert but there seems to be a 1 line solution for this using css (Can make PR if accepted?!)
- use white-space: pre; in the NotebookFormatter style tag
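As a rough illustration of the proposed fix, here is a standalone sketch that adds the rule to the style tag emitted with a table. It is not the actual NotebookFormatter code; `render_html_cells` is a hypothetical helper and the unscoped `td` selector is an assumption made for brevity.

```python
from html import escape


def render_html_cells(values):
    """Build a tiny HTML table whose cells preserve whitespace."""
    # white-space: pre tells the browser not to collapse runs of whitespace.
    style = "<style>td { white-space: pre; }</style>"
    rows = "".join(f"<tr><td>{escape(v)}</td></tr>" for v in values)
    return f"{style}<table>{rows}</table>"


html = render_html_cells(["no", "bet  ween"])
print(html)
```

In a real formatter the selector would presumably be scoped to the rendered table so that other notebook output is unaffected.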
2. DataFrame str columns missing quotes in "plain" output
Problem
- you cannot tell if there is trailing whitespace in the text because unlike the html format there are no quotes around str columns
Example
# current
┌────────────────┐
│ text           │
│ ---            │
│ str            │
╞════════════════╡
│ no             │
│  before        │
│ after          │
│  both          │
│ bet ween       │
└────────────────┘
# desired
┌──────────────────┐
│ text             │
│ ---              │
│ str              │
╞══════════════════╡
│ "no"             │
│ " before"        │
│ "after "         │
│ " both "         │
│ "bet ween"       │
└──────────────────┘
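To make the desired output concrete, here is a minimal pure-Python sketch of a one-column table renderer with optional quoting. `render_plain` is a hypothetical function written for this example only; it does not reflect how polars formats tables internally.

```python
def render_plain(name: str, dtype: str, values: list[str], quote: bool = True) -> str:
    """Render one string column as a box table, optionally quoting values."""
    cells = [f'"{v}"' if quote else v for v in values]
    # The column is as wide as the longest header line or cell.
    width = max(len(c) for c in [name, "---", dtype, *cells])

    def row(content: str) -> str:
        return f"│ {content.ljust(width)} │"

    top = "┌" + "─" * (width + 2) + "┐"
    sep = "╞" + "═" * (width + 2) + "╡"
    bottom = "└" + "─" * (width + 2) + "┘"
    return "\n".join([top, row(name), row("---"), row(dtype), sep, *map(row, cells), bottom])


print(render_plain("text", "str", ["no", " before", "after "]))
```

With `quote=False` the trailing space in `"after "` is indistinguishable from padding, which is exactly the problem described above.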
3. Series str output in "plain" mode
Problem
- output is not nice (inconsistent with the DataFrame table format)
Example
# current
shape: (5,)
Series: 'text' [str]
[
"no"
" before"
"after "
" both "
"bet ween"
]
# desired (same as DataFrame)
┌──────────────────┐
│ text             │
│ ---              │
│ str              │
╞══════════════════╡
│ "no"             │
│ " before"        │
│ "after "         │
│ " both "         │
│ "bet ween"       │
└──────────────────┘
These are really 3 different issues, but let me try to give my two cents:
1. Agree, I would welcome a PR for this.
2. This is probably a good idea - not 100% sure. Including quotes can hurt readability.
3. I also strongly dislike our Series output, but in your proposal it would be hard to distinguish between Series and DataFrame output. There is probably a good solution out there though.
@stinodego thanks for your feedback!
regarding 2): imo this doesn't hurt readability that much. I feel this is more in line with polars' idea of making everything very explicit and correct (showing potential leading/trailing whitespace). Also, I guess there is a reason for using quotes in the html cell output, which I really like!
regarding 3): you are right! There must be a way to differentiate between Series and DataFrame for 1-column data, but the current output already handles this with the shape line above the values, and I would suggest adopting that.
Should I create 3 distinct issues, link them here and close this one? ;)
created individual issues: #10643 #10646 #10648
Series might look better horizontal, what do you think? Here are a few options:
                    ╭──────┬───────┬──────┬───────────┬───────╮
╭─────────────┬─────╯ 0    │ 1     │ 2    │ 3         │ 4     │
│ "my_series" │ i64 │ test │ hello │ nope │ my_string │ where │
╰─────────────┴─────┴──────┴───────┴──────┴───────────┴───────╯
                    ┌──────┬───────┬──────┬───────────┬───────┐
┌─────────────┬─────┘ 0    │ 1     │ 2    │ 3         │ 4     │
│ "my_series" │ i64 │ test │ hello │ nope │ my_string │ where │
└─────────────┴─────┴──────┴───────┴──────┴───────────┴───────┘
┌┄┄┄┄┄┄┄┄┄┄┄┬┄┄┄┄┄┬──────┬───────┬──────┬───────────┬───────┐
┆ my_series ┆ i64 ┆ test │ hello │ nope │ my_string │ where │
└┄┄┄┄┄┄┄┄┄┄┄┴┄┄┄┄┄┴──────┴───────┴──────┴───────────┴───────┘
In terms of HTML display, I really think we should adopt the display strategy used in pandas. Added my two cents here: https://github.com/pola-rs/polars/issues/10643#issuecomment-1687284107
nice! I really like the idea of horizontal Series and especially the last one which is very clean.
Reasoning
- very easy to distinguish between Series and DataFrame (with single col)!
- you can go vertical just by calling to_frame()
- you can then add line numbers if you need by calling with_row_count()
I'm closing this as there are separate issues for these points.