Replace DataFrame's default `_repr_html_` (closes #76)

Open tcurvelo opened this issue 5 years ago • 2 comments

~I added a subclass of DataFrame in order to override its to_html(), allowing us to define some default styling, like the clickable URLs from #76.~

I changed my approach to this feature. The approach I tried previously doesn't work on new DataFrames created by common pandas operations (e.g. `df.head()`, `df[df['url'].notna()]`). Now I'm replacing DataFrame's default `_repr_html_()` method.

Let me know your thoughts on this one.

tcurvelo avatar Oct 20 '19 22:10 tcurvelo

Codecov Report

Merging #175 into master will increase coverage by 0.2%. The diff coverage is 92.85%.

@@            Coverage Diff            @@
##           master     #175     +/-   ##
=========================================
+ Coverage      81%   81.21%   +0.2%     
=========================================
  Files          24       25      +1     
  Lines        1606     1634     +28     
  Branches      279      281      +2     
=========================================
+ Hits         1301     1327     +26     
- Misses        251      252      +1     
- Partials       54       55      +1

| Impacted Files | Coverage Δ |
|---|---|
| src/arche/__init__.py | 100% <100%> (ø) :arrow_up: |
| src/arche/tools/dataframe.py | 91.66% <91.66%> (ø) |

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 476a9dd...0655121. Read the comment docs.

codecov[bot] avatar Oct 20 '19 22:10 codecov[bot]

Here is a simple benchmark I ran to measure the runtime. Below is the script I used. It loads a dataset of 100K+ items, renders its HTML representation, and prints the time taken and the length of the output. I forced it to render up to 100_000 rows instead of truncating them.

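#!/usr/bin/env python3
# render_links_benchmark.py
import time

import arche  # noqa: F401  imported for its side effect (the `_repr_html_` replacement from this PR)
import pandas as pd

df = pd.read_json("./327565_39_252_items.jl", lines=True)

# Raise both limits so pandas renders up to 100_000 rows instead of truncating.
with pd.option_context("display.min_rows", 100_000, "display.max_rows", 100_000):
    t = time.process_time()  # CPU time of this process, not wall-clock time
    out = df._repr_html_()
    print(f"Time expended on `_repr_html_`: {time.process_time() - t}")
    print(f"Len: {len(out)}")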

Executing it:

$ for branch in master clickable_urls; do git checkout $branch; ./render_links_benchmark.py; done
Already on 'master'
Time expended on `_repr_html_`: 164.999099792
Len: 276061183
Switched to branch 'clickable_urls'
Time expended on `_repr_html_`: 208.39284144
Len: 322665630

It turns out that, for this dataset, rendering links generates about 17% more output and takes about 26% longer to complete.
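The overhead can be re-derived from the figures in the log above. A small sketch, where the variable names and the helper `percent_increase` are only illustrative:

```python
# Figures copied from the benchmark output above.
master_seconds = 164.999099792
links_seconds = 208.39284144
master_chars = 276_061_183
links_chars = 322_665_630


def percent_increase(before, after):
    """Relative growth from `before` to `after`, expressed in percent."""
    return (after - before) / before * 100


time_overhead = percent_increase(master_seconds, links_seconds)
size_overhead = percent_increase(master_chars, links_chars)

print(f"Runtime overhead: {time_overhead:.1f}%")
print(f"Output size overhead: {size_overhead:.1f}%")
```

Note that this is a single run per branch on one dataset, so the numbers are indicative rather than statistically robust.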

tcurvelo avatar Nov 18 '19 19:11 tcurvelo