summingbird
Add a WithTime node
Summingbird internally keeps a timestamp for all values, but we don't expose that to the user. It is a pain to always thread it through. We could add it back by adding a new node:
case class WithTime[P, T](p: Producer[P, T]) extends Producer[P, (T, Timestamp)]
case class ValueWithTime[P, K, V](p: Producer[P, (K, V)]) extends Producer[P, (K, (V, Timestamp))]
then at plan time, we can just treat this like a map that adds the timestamp, which we know at the time.
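To make the "treat it like a map" idea concrete, here is a minimal sketch on a toy model. It is not Summingbird's real planner: the platform type parameter is dropped, `Timestamp` is a stand-in case class, and `Source` and `plan` are hypothetical names invented for this illustration. It only shows that a `WithTime` node can be evaluated as a map that copies the hidden timestamp into the value.

```scala
// Toy model only: none of these types are Summingbird's real classes.
object WithTimeSketch {
  // Stand-in for Summingbird's Timestamp (assumed here to wrap epoch millis).
  final case class Timestamp(milliSinceEpoch: Long)

  // Every node evaluates to (Timestamp, value) pairs. The timestamp is the
  // hidden part that user-level functions normally never see.
  sealed trait Producer[T] {
    def plan: List[(Timestamp, T)]
  }

  // Hypothetical source node holding already-timestamped events.
  final case class Source[T](events: List[(Timestamp, T)]) extends Producer[T] {
    def plan: List[(Timestamp, T)] = events
  }

  // The proposed node: planned as a plain map that exposes the timestamp.
  final case class WithTime[T](p: Producer[T]) extends Producer[(T, Timestamp)] {
    def plan: List[(Timestamp, (T, Timestamp))] =
      p.plan.map { case (ts, t) => (ts, (t, ts)) }
  }

  // Usage: the user-visible values now carry their event time.
  val src: Source[String] =
    Source(List((Timestamp(10L), "a"), (Timestamp(20L), "b")))
  val visible: List[(String, Timestamp)] = WithTime(src).plan.map(_._2)
}
```

In this sketch `visible` pairs "a" with `Timestamp(10)` and "b" with `Timestamp(20)`, while the hidden timestamp on each event is left unchanged.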
This would clean up some internal APIs we have if summingbird supported it, and it would also close #688, since we can always recover the timestamp at any point.
@pankajroark what do you think of this? I can work on adding it if we can ever get our tests to not OOM.
ping on this @ttim ?
We need the time in user land a lot at Stripe. I can possibly find time to work on this unless you see any blockers.
@johnynek it introduces (conceptually) a notion of time into the core platform.
Pros: it's already the case, and it makes everything more consistent.
Cons:
- We need to change the memory platform (not a big issue, let's assume a 0 timestamp for the memory platform in the beginning)
- I had some thoughts on how to integrate tsar functionality into SB. For example, you can treat sumByKey as something which does aggregation over different ranges of time
In general I like the idea of putting time into the core and building everything else around it.
Will this mean that users will be able to specify a summingbird job without time? That may not be a bad idea, because it would support online-only use cases more efficiently; right now users fix the timestamp for that.
Or are these nodes solely aimed at extracting the time, which is otherwise hidden? The API seems a bit magical in that case. It would be great if you could give an example.
@pankajroark I don't think it helps you run without time as I am conceiving this, almost the opposite: you have to be able to give a time for each event.
So, what I want is this: we have a system similar to tsar which aggregates keys into many buckets. To do this, we need to know the time. We currently carry a copy of the time around in the value. That is a waste, since internally summingbird knows the time. The .withTime method would copy the internal time of the event out, so we could bucket without carrying that copy everywhere (which is especially painful across store/sumByKey boundaries).
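A rough sketch of that bucketing use case, on plain Scala collections rather than Summingbird itself. The names `valueWithTime` and `bucketedSums`, the `Timestamp` stand-in, and the fixed-width bucket scheme are all assumptions made for illustration. The point is only that the bucket is derived from the event's own timestamp just before the sum, so the value never has to carry a copy of the time.

```scala
// Illustration on plain collections; not Summingbird's API.
object BucketSketch {
  // Stand-in for Summingbird's Timestamp (assumed here to wrap epoch millis).
  final case class Timestamp(milliSinceEpoch: Long)

  // Hypothetical analogue of ValueWithTime: expose the hidden event time
  // next to the value of each key-value pair.
  def valueWithTime[K, V](
      events: List[(Timestamp, (K, V))]): List[(Timestamp, (K, (V, Timestamp)))] =
    events.map { case (ts, (k, v)) => (ts, (k, (v, ts))) }

  // Re-key every event by (key, start of its time bucket), then sum per key,
  // which is roughly what a sumByKey over the re-keyed stream would compute.
  def bucketedSums[K](
      events: List[(Timestamp, (K, Long))],
      bucketMillis: Long): Map[(K, Long), Long] =
    valueWithTime(events)
      .map { case (_, (k, (v, ts))) =>
        // Integer division rounds down to the start of the bucket.
        val bucketStart = ts.milliSinceEpoch / bucketMillis * bucketMillis
        ((k, bucketStart), v)
      }
      .groupBy(_._1)
      .map { case (key, kvs) => (key, kvs.map(_._2).sum) }

  // Usage: one-minute buckets over a few hand-written events.
  val events: List[(Timestamp, (String, Long))] = List(
    (Timestamp(1000L), ("a", 1L)),
    (Timestamp(1500L), ("a", 2L)),
    (Timestamp(61000L), ("a", 5L)),
    (Timestamp(2000L), ("b", 7L))
  )
  val sums: Map[(String, Long), Long] = bucketedSums(events, 60000L)
}
```

Here the input values are bare counts with no time field, and the two "a" events in the first minute are summed separately from the one in the second minute.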
Even though the extraction of time out of nothing seems a bit magical to me, I realize the practical utility. I'm on board.