
Difference between pandas and hvplot for ecobee dataset

Open michaelaye opened this issue 6 years ago • 11 comments

Versions

Package Version
hvplot 0.4.0
holoviews 1.12.5
bokeh 1.3.4
notebook 6.0.0
python 3.7.3
Browser Chrome 75, Safari 12.1.1

Description of expected behavior and the observed behavior

The basic shape of the graph produced should be the same.

Complete, minimal, self-contained example code that reproduces the issue

Get CSV from here (250 KB): https://www.dropbox.com/s/m4cwi5kdve9x67n/report-319299697687-2019-08-07-to-2019-08-14.csv?dl=1

import pandas as pd
import hvplot.pandas

df = pd.read_csv("report-319299697687-2019-08-07-to-2019-08-14.csv",
                             comment="#")
col = "Thermostat Temperature (C)"
df[col].hvplot()
df[col].plot()

Screenshots or screencasts of the bug in action

Screenshot 2019-08-19 14 58 48

michaelaye avatar Aug 19 '19 20:08 michaelaye

My guess is that pandas was able to detect both the Date and Time columns and automagically combined them into a pd.DatetimeIndex, while hvplot only took the first column Date and plotted that.

ahuang11 avatar Aug 19 '19 23:08 ahuang11

That's some serious assumption making on the part of Pandas, but it makes sense!

jbednar avatar Aug 20 '19 01:08 jbednar

Could it not simply be that within a day, the data points are plotted sequentially?

michaelaye avatar Aug 20 '19 04:08 michaelaye

Ah we can check that by inspecting the returned axis I guess.

michaelaye avatar Aug 20 '19 04:08 michaelaye

image

You can still plot it like pandas if you specify it explicitly (although I don't know why pandas is offsetting the columns by one here).

import pandas as pd
import hvplot.pandas

df = pd.read_csv("report-319299697687-2019-08-07-to-2019-08-14 (1).csv", comment="#")
col = "Thermostat Temperature (C)"
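df.head()
# With the default read_csv call the labels are shifted by one: the index holds the dates.
df[col].hvplot('index', col)
# After reset_index, the 'index' column holds the dates and the 'Date' column holds the times.
df = df.reset_index()
df.index = pd.to_datetime(df['index'] + df['Date'], format='%Y-%m-%d%H:%M:%S')
df[col].hvplot('index', col)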

ahuang11 avatar Aug 20 '19 04:08 ahuang11

I only get one plot (the last one) when I execute the above code?

michaelaye avatar Aug 20 '19 13:08 michaelaye

pandas.read_csv needs the parameter index_col=False; otherwise it uses the first column as the index, outside the parsed column names, so every column label ends up shifted by one. With index_col=False there is no offset in the columns.
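
As a rough sketch of what this looks like, the snippet below reads a tiny made-up CSV (not the ecobee report) with and without index_col=False. It assumes each data row has one more field than the header, for example because of a trailing comma. That is the documented case in which read_csv uses the first column as the index, but nothing in this thread confirms that the ecobee file actually looks like that.

```python
import io

import pandas as pd

# Made-up sample data: the header has 3 fields, each data row has 4
# (note the trailing comma). This layout is an assumption for illustration.
csv_text = (
    "Date,Time,Temp\n"
    "2019-08-07,00:00:00,21.5,\n"
    "2019-08-07,00:05:00,21.4,\n"
)

# Default call: the first field of each row becomes the index,
# so the remaining values sit one header to the left.
shifted = pd.read_csv(io.StringIO(csv_text))

# index_col=False: no column is used as the index, so each value
# stays under its own header.
aligned = pd.read_csv(io.StringIO(csv_text), index_col=False)

print(shifted)
print(aligned)
```

With the default call the dates end up in the index and the "Date" column holds the times; with index_col=False the labels line up with the values again.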

michaelaye avatar Aug 20 '19 14:08 michaelaye

I'm very confused now. Correcting the read_csv parsing aligns the behavior of plot() and hvplot(), but why was it different before, then?

michaelaye avatar Aug 20 '19 14:08 michaelaye

hvplot's x-axis is just an arbitrary sequential index now; it doesn't recreate pandas' automagic merging of Date and Time.
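
If a real time axis is wanted, one option is to build the DatetimeIndex explicitly before plotting. The following is only a sketch on made-up inline data (the two rows and their values are invented for illustration); it assumes the Date and Time columns are read as plain strings under their own headers.

```python
import io

import pandas as pd

# Made-up two-row sample standing in for the real report.
csv_text = (
    "Date,Time,Thermostat Temperature (C)\n"
    "2019-08-07,00:00:00,21.5\n"
    "2019-08-07,00:05:00,21.4\n"
)

df = pd.read_csv(io.StringIO(csv_text), index_col=False)

# Join the two string columns and parse them into a DatetimeIndex.
df.index = pd.to_datetime(
    df["Date"] + " " + df["Time"], format="%Y-%m-%d %H:%M:%S"
)

col = "Thermostat Temperature (C)"
series = df[col]
print(series)

# With hvplot.pandas imported, series.hvplot() should then plot
# against this datetime index rather than a sequential one.
```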

image

ahuang11 avatar Aug 20 '19 23:08 ahuang11

Honestly I don't know how matplotlib does this since when you index with df[col] the times are dropped. I'm guessing it just divides the day by the number of entries for that date in the index and spaces them equally.

philippjfr avatar Jan 14 '21 12:01 philippjfr

Alternatively it simply uses the sequential index and uses the index to label the axes.

philippjfr avatar Jan 14 '21 12:01 philippjfr