lux
Redundant data in timeseries analysis
When running the analysis on the NYC Taxi dataset, I found that most of the JSON spec created by the Altair backend (more than 90%) was data for a single temporal (timeseries) plot, in which each datapoint covered a very short time interval across a very wide overall range. This plot took a long time to render when timed separately. The other recommended plots were binned (by month, year, or day of the week) and so were very fast. In cases like this, where a single plot takes the majority of the rendering time, could we give the user an option to render or skip such a chart?
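To illustrate why the unbinned plot dominates the spec, here is a minimal sketch using only the standard library. It is not how Lux or Altair build their specs: the data is synthetic (one record per minute for 30 days), and `pickups` and `bin_by_day` are hypothetical names chosen for this example. It only compares how many points an unbinned temporal chart would have to carry against a day-level binned one.

```python
from collections import Counter
from datetime import datetime, timedelta

# Synthetic stand-in for a pickup-time column (assumption, not the real dataset):
# one record per minute for 30 days.
start = datetime(2020, 1, 1)
pickups = [start + timedelta(minutes=i) for i in range(30 * 24 * 60)]


def bin_by_day(timestamps):
    """Count records per calendar day (hypothetical helper for this sketch)."""
    return Counter(ts.date() for ts in timestamps)


# Distinct timestamps: the points an unbinned temporal chart would need to embed.
raw_points = len(set(pickups))

# Day-level bins: the points a binned chart would need to embed.
daily_counts = bin_by_day(pickups)

print(raw_points, len(daily_counts))
```

With this synthetic input the unbinned series has 43200 distinct points while the daily bins have 30, which is the same kind of gap that separates the slow plot from the fast binned ones.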
To Reproduce
import pandas as pd
import lux

lux.config.sampling = False
lux.config.default_display = "lux"
df = pd.read_csv("./data/nyc_taxi.csv")
df['tpep_pickup_datetime'] = pd.to_datetime(df.tpep_pickup_datetime, format="%Y-%m-%d")
df['tpep_dropoff_datetime'] = pd.to_datetime(df.tpep_dropoff_datetime, format="%Y-%m-%d")
df
This is the graph in particular