Datetime utilities
It would be nice if we could group-by day/week/hour given a timestamp.
We should first discuss a proper API before making an implementation but this would really be a nice feature.
We could add an optional dependency and an optional helper for datetime parsing using datefinder. Given that we have dates in a proper ISO format, it would be nice to have some other helpers to aid in rounding. Something like:
```python
clump.mutate(round_s=round_dt(key='dt', by='second'))
```
We could have variants of it too.
```python
round_dt(key='dt', by='minute')
round_dt(key='dt', by='hour')
round_dt(key='dt', by='day')
```
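For illustration, a minimal sketch of what such a helper could look like, assuming `round_dt` returns a row-callable that `mutate` can consume, that "rounding" means flooring to the start of the unit, and that the `fmt` default is a placeholder rather than an agreed-upon API:

```python
from datetime import datetime

def round_dt(key, by='second', fmt='%Y-%m-%d %H:%M:%S'):
    """Sketch: return a row-callable for `mutate` that floors the
    datetime string stored under `key` to the given unit."""
    # fields to zero out, per unit
    resets = {
        'second': dict(microsecond=0),
        'minute': dict(second=0, microsecond=0),
        'hour':   dict(minute=0, second=0, microsecond=0),
        'day':    dict(hour=0, minute=0, second=0, microsecond=0),
    }

    def _round(row):
        dt = datetime.strptime(row[key], fmt)
        return dt.replace(**resets[by]).strftime(fmt)

    return _round
```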
This sounds like an interesting feature to me as well! How about we take inspiration from the Grouper object in pandas? The responsibility of Grouper would be to find and mark the rows in the data that fall within the given frequency. This marker information can then be used by `.agg(...)` to compute the aggregations.
I can imagine its use like this:

```python
from clumper import Clumper, Grouper

sensor_data = [
    {'date_time': '2020-01-01 19:01', 'temperature': 30, 'humidity': 21},
    {'date_time': '2020-01-01 19:59', 'temperature': 25, 'humidity': 22},
    {'date_time': '2020-01-01 20:01', 'temperature': 25, 'humidity': 40},
    {'date_time': '2020-01-01 20:59', 'temperature': 20, 'humidity': 22},
]

clump = Clumper(sensor_data)

hourly_average = (
    clump
    .group_by(Grouper(key='date_time', freq='H'))  # to start, frequency could be hourly, daily, or monthly
    .agg(
        avg_temperature=('temperature', 'mean'),
        avg_humidity=('humidity', 'mean'),
    )
)

hourly_average.collect()
"""
{'date_time': '2020-01-01 19:00', 'avg_temperature': 27.5, 'avg_humidity': 21.5},
{'date_time': '2020-01-01 20:00', 'avg_temperature': 22.5, 'avg_humidity': 31}
"""
```
What I'm proposing here is certainly similar to the grouper, but I think I'm going for something that is more minimal. I'm also not 100% sure that I like their abbreviation scheme. For example, `freq="M"`: is that month or minute?
Do you have an example in mind? Indeed, the abbreviations can be confusing on the surface, but shouldn't the documentation make them clear? The pandas docs make it clear what each abbreviation means, so I am not against using the same ones. People may already be familiar with these abbreviations, so it would be user-friendly, don't you think?
The docs make it clear, but a user would need to leave the notebook and find that page in order to move on. I'd prefer to just use plain words. I also wouldn't want to assume that users of this package are familiar with pandas. There are no dependencies on the pydata stack and the API is more aligned with dplyr than pandas. It's my understanding that there are folks in infosec who like this tool because it is so lightweight. It seems that because pandas is so dependency-heavy, it's not universally appreciated.
Meta: maybe we should chat about this directly instead of discussing it in the issue, to keep things clear?
I agree that it is lightweight compared to pandas. However, in this case I think using abbreviations makes more sense, because we may be limited in how many frequencies we can force the users to spell out. For example, there is not just a monthly frequency but also derivatives like month-start and month-end frequency. This holds for quarterly and yearly as well. It would be difficult to write them all down in words. Therefore, my question is: what exact frequencies are we going to support? If it's just second, minute, month, and year then your approach is better, but as soon as it gets more complicated it might become difficult to use.
My preferred attitude here is only to start supporting extra features if there is a strong need for it. I added this issue because I had a need for basic rounding. It's a slippery slope to start supporting elaborate features here because it's going to end up in a lot of maintenance.
In my mind, `second`, `minute`, `hour`, `day`, `month`, `year` should just about cover 80% of use-cases. For the other 20%, I'd be fine with requiring the user to customise something themselves using a tool like arrow. This is mainly to keep any technical debt at bay and to prevent this library from requiring dependencies. That is, unless I've missed a cool trick that is implemented in the standard `datetime` module.
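For those 20% cases, arrow's `floor` already does this kind of truncation, so a user could handle it themselves along these lines (the format string here is assumed to match the data):

```python
import arrow

dt = arrow.get('2020-01-01 19:01', 'YYYY-MM-DD HH:mm')
dt.floor('hour').format('YYYY-MM-DD HH:mm')  # -> '2020-01-01 19:00'
```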
Now that I am thinking about it, maybe this is a more consistent API:
```python
clump.mutate(rounded_date=lambda d: round_dt_str(d['dt'], by='hour'))
```
The benefits:
- This keeps the API functional/less magic.
- By calling it `round_dt_str` we don't imply that the object going in needs to be a datetime-object; it can be a string instead.
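A minimal sketch of such a function, assuming a known input format (the `fmt` default is a placeholder) and that "rounding" means flooring to the start of the unit:

```python
from datetime import datetime

def round_dt_str(dt_str, by='hour', fmt='%Y-%m-%d %H:%M'):
    """Sketch: floor a datetime string to the given unit and return
    it as a string in the same format."""
    # fields to zero out, per unit
    resets = {
        'second': dict(microsecond=0),
        'minute': dict(second=0, microsecond=0),
        'hour':   dict(minute=0, second=0, microsecond=0),
        'day':    dict(hour=0, minute=0, second=0, microsecond=0),
        'month':  dict(day=1, hour=0, minute=0, second=0, microsecond=0),
        'year':   dict(month=1, day=1, hour=0, minute=0,
                       second=0, microsecond=0),
    }
    dt = datetime.strptime(dt_str, fmt)
    return dt.replace(**resets[by]).strftime(fmt)
```

`round_dt_str('2020-01-01 19:01', by='hour')` would then return `'2020-01-01 19:00'`.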
> My preferred attitude here is only to start supporting extra features if there is a strong need for it. I added this issue because I had a need for basic rounding. It's a slippery slope to start supporting elaborate features here because it's going to end up in a lot of maintenance.
Fair point!
> In my mind, `second`, `minute`, `hour`, `day`, `month`, `year` should just about cover 80% of use-cases. For the other 20%, I'd be fine with requiring the user to customise something themselves using a tool like arrow. This is mainly to keep any technical debt at bay and to prevent this library from requiring dependencies. That is, unless I've missed a cool trick that is implemented in the standard `datetime` module.
As far as I know, `datetime` doesn't have automatic format detection like datefinder. So does this mean you agree to have datefinder as a dependency?
> Now that I am thinking about it, maybe this is a more consistent API: `clump.mutate(rounded_date=lambda d: round_dt_str(d['dt'], by='hour'))`
> The benefits: 1. This keeps the API functional/less magic. 2. By calling it `round_dt_str` we don't imply that the object going in needs to be a datetime-object; it can be a string instead.
Thanks for the example, it helps me better visualize what you want. However, having this function is not enough, right? We also have to create groups based on these "rounded" data points, which can then be supplied to the `agg` function. Is this a correct assumption?
> As far as I know, `datetime` doesn't have automatic format detection like datefinder. So does this mean you agree to have datefinder as a dependency?
@samarpan-rai if we add such a utility I'd only consider it as an optional dependency, but for the simple case of datetime-string rounding I think we won't need it. Again, I really prefer omitting dependencies where possible. The GitHub Actions currently run in 8s, which is something we can be proud of :)
> Thanks for the example, it helps me better visualize what you want. However, having this function is not enough, right? We also have to create groups based on these "rounded" data points, which can then be supplied to the `agg` function. Is this a correct assumption?
What's wrong with this pattern:
```python
(clump
  .mutate(grp=lambda d: round_dt_str(d['date'], 'hour'))
  .group_by("grp"))
```
> As far as I know, `datetime` doesn't have automatic format detection like datefinder. So does this mean you agree to have datefinder as a dependency?

> @samarpan-rai if we add such a utility I'd only consider it as an optional dependency, but for the simple case of datetime-string rounding I think we won't need it. Again, I really prefer omitting dependencies where possible. The GitHub Actions currently run in 8s, which is something we can be proud of :)
Indeed, it's something to be proud of. We will have to force the user to tell us the datetime format, though. Are you okay with that?
> Thanks for the example, it helps me better visualize what you want. However, having this function is not enough, right? We also have to create groups based on these "rounded" data points, which can then be supplied to the `agg` function. Is this a correct assumption?

> What's wrong with this pattern: `(clump.mutate(grp=lambda d: round_dt_str(d['date'], 'hour')).group_by("grp"))`
Well, in your case, making the group requires the additional step of first rounding it. Should we abstract this away by detecting datetime keys, or should we let the user explicitly specify this two-step process? The former might be easier to use and the latter is easier to implement.
> The former might be easier to use and the latter is easier to implement.
Suppose we would like to implement something like a grouper; we would still need a name for the groups we create. Pandas has an index that allows for unnamed columns, but that won't work here. Instead, we could do something like this:
```python
(clump
  .group_by(grp=lambda d: round_dt_str(d['date'], 'hour')))
```
To me, this feels like the "cleanest" way to implement a grouper-kind-of-thing.
But this also feels like an anti-pattern. Currently we have `mutate` to add columns to a clumper. Whenever a user thinks "gotta add a column" we'd like them to think about `mutate` immediately. By allowing a `group_by` to also add columns we would be adding exceptions to this general pattern. I'm less worried about difficulty of use; I'm more worried about difficulty to conceptualise.
We might still explore this option, but I would need some convincing from a group of users who can give me clear examples of where this is a major pain point. This library isn't meant for performance; it's meant to help clump JSON data in a functional way.
So for now, to keep things simple, I'd prefer to ignore any changes to the `group_by` verb.
>> The former might be easier to use and the latter is easier to implement.
>
> Suppose we would like to implement something like a grouper; we would still need a name for the groups we create. Pandas has an index that allows for unnamed columns, but that won't work here. Instead, we could do something like this: `(clump.group_by(grp=lambda d: round_dt_str(d['date'], 'hour')))`
>
> To me, this feels like the "cleanest" way to implement a grouper-kind-of-thing.
>
> But this also feels like an anti-pattern. Currently we have `mutate` to add columns to a clumper. Whenever a user thinks "gotta add a column" we'd like them to think about `mutate` immediately. By allowing a `group_by` to also add columns we would be adding exceptions to this general pattern. I'm less worried about difficulty of use; I'm more worried about difficulty to conceptualise.
That's a fair point. Being able to tie a concept to a single function is quite important.
> We might still explore this option, but I would need some convincing from a group of users who can give me clear examples of where this is a major pain point. This library isn't meant for performance; it's meant to help clump JSON data in a functional way.
>
> So for now, to keep things simple, I'd prefer to ignore any changes to the `group_by` verb.
Agreed. So shall I start implementing the `round_dt_str` function then?
Sure thing! Go for it :)