meangirls icon indicating copy to clipboard operation
meangirls copied to clipboard

add timestamps for json representation for GC

Open reiddraper opened this issue 13 years ago • 12 comments
trafficstars

I don't have a suggestion (yet), but wanted to put this issue down while I was thinking about it. Currently the JSON representation for these types, like the or-set, don't have the timestamp information that is needed to perform garbage collection.

reiddraper avatar Jan 26 '12 03:01 reiddraper

I was thinking one could choose logically or time-ordered tags for observed-remove which are amenable to GC.

On 01/25/2012 07:09 PM, Reid Draper wrote:

I don't have a suggestion (yet), but wanted to put this issue down while I was thinking about it. Currently the JSON representation for these types, like the or-set, don't have the timestamp information that is needed to perform garbage collection.


Reply to this email directly or view it on GitHub: https://github.com/aphyr/meangirls/issues/3

aphyr avatar Jan 26 '12 03:01 aphyr

Not sure I follow, could you give an example?

reiddraper avatar Jan 26 '12 03:01 reiddraper

Let's say you use snowflake for unique ID generation. Snowflake is k-ordered and has a time component, so each ID can be roughly located in time. Use snowflake IDs for the tags in your observed-removed set. For GC, remove any tag which has a tag with a time component older than your threshold. Or just order all the tags and keep the highest n.

On 01/25/2012 07:18 PM, Reid Draper wrote:

Not sure I follow, could you give an example?


Reply to this email directly or view it on GitHub: https://github.com/aphyr/meangirls/issues/3#issuecomment-3663780

aphyr avatar Jan 26 '12 03:01 aphyr

Ah, I see, cool idea. What about the two-phase set? I've almost thought about removing it in knockbox, to be honest.

reiddraper avatar Jan 26 '12 03:01 reiddraper

It makes sense for certain ephemeral structures (say, purchase orders), and I'm inclined to keep it for completeness' sake. Not really married to the concept, though.

On 01/25/2012 07:43 PM, Reid Draper wrote:

Ah, I see, cool idea. What about the two-phase set? I've almost thought about removing it in knockbox, to be honest.


Reply to this email directly or view it on GitHub: https://github.com/aphyr/meangirls/issues/3#issuecomment-3663926

aphyr avatar Jan 26 '12 03:01 aphyr

Regarding using k-ordered timestamps for tag, I'm afraid it's a bit of a large dependency. Suppose you want to use CRDTs in javascript on the browserside, or you just don't want your CRDT library to depend on an external ID generation service. It seems there should also be an additional "normal" timestamp field, that can be used across all CRDT libs without assumptions about infrastructure such as snowflake. Unfortunately logically ordered timestamps don't help for what I think is going to be the most common pruning predicate, keeping garbage around longer than your longest expected partition.

reiddraper avatar Jan 26 '12 04:01 reiddraper

Or you could use unix epoch seconds, or use iso8601 timestamps, etc, plus hostnames... there are many possible strategies here.

What bothers me is that you might use timestamps for your tags. Then the timestamp information would be redundant but necessary. LWW-sets imply a clock of some kind too, though it could be logical and not wallclock time... hmmmmmm.

I don't want to define the tags to be a specific kind of timestamp, because you might want to use a logical timestamp or a tag handed to you by some other system. Maybe we should establish special variants of the LWW and OR set types that include a predefined timestamp and GC scheme?

On 01/25/2012 08:29 PM, Reid Draper wrote:

Regarding using k-ordered timestamps for tag, I'm afraid it's a bit of a large dependency. Suppose you want to use CRDTs in javascript on the browserside, or you just don't want your CRDT library to depend on an external ID generation service. It seems there should also be an additional "normal" timestamp field, that can be used across all CRDT libs without assumptions about infrastructure such as snowflake. Unfortunately logically ordered timestamps don't help for what I think is going to be the most common pruning predicate, keeping garbage around longer than your longest expected partition.


Reply to this email directly or view it on GitHub: https://github.com/aphyr/meangirls/issues/3#issuecomment-3664211

aphyr avatar Jan 26 '12 04:01 aphyr

Timestamp + hostname might be good enough for the cases I'm thinking, at least if it's at millisecond resolution. I suppose it also wouldn't be hard to have an incrementing counter for the library, so tags are time + hostname + counter. I definitely like the idea of predefined timestamp/GC schemes. I'm (hopefully) going to have some time this weekend to dive back into knockbox, and GC is high on my TODO list.

reiddraper avatar Jan 26 '12 04:01 reiddraper

I like that idea. I'm cool with implementing a flake-like at the library level. Say default unique tags are

[unix_time, host, pid, unique]

and for lww-sets, just unix-time. Floats OK for times?

Note, unix time can jump backwards, stop, hiccup, etc... we expose people to all of that but at the timescales we're talking (days, right?) it shouldn't be an issue.

Obviously there will be a (gc obj) function to force garbage collection, probably with age or size constraints. If you just go by "preserve n tags" it's easy to implement regardless of tag type: just sort the tags and take the biggest n. If you want to go by time then we filter (< threshold (first tag)).

For users who want to use their own tag strategy, they can easily extend this by just using [unix-time, custom-tag]. The GC culling won't know the difference. Sound good?

--Kyle

On 01/25/2012 08:44 PM, Reid Draper wrote:

Timestamp + hostname might be good enough for the cases I'm thinking, at least if it's at millisecond resolution. I suppose it also wouldn't be hard to have an incrementing counter for the library, so tags are time + hostname + counter. I definitely like the idea of predefined timestamp/GC schemes. I'm (hopefully) going to have some time this weekend to dive back into knockbox, and GC is high on my TODO list.


Reply to this email directly or view it on GitHub: https://github.com/aphyr/meangirls/issues/3#issuecomment-3664329

aphyr avatar Jan 26 '12 04:01 aphyr

What are your thoughts on where to store pruning parameters, ie. # of items to keep and/or # of seconds to keep? My gut says it's fine to have the library just decide this, but I could imagine some use-cases where you'd want to be able to have the object carry around this information, as it might be specific for the object. Thoughts?

reiddraper avatar Feb 02 '12 04:02 reiddraper

I would do both; if provided, you use, say "gc_max_items" and "gc_max_seconds" for history pruning. If not provided, it's library dependent.

--Kyle

On 02/01/2012 08:13 PM, Reid Draper wrote:

What are your thoughts on where to store pruning parameters, ie. # of items to keep and/or # of seconds to keep? My gut says it's fine to have the library just decide this, but I could imagine some use-cases where you'd want to be able to have the object carry around this information, as it might be specific for the object. Thoughts?


Reply to this email directly or view it on GitHub: https://github.com/aphyr/meangirls/issues/3#issuecomment-3772376

aphyr avatar Feb 02 '12 04:02 aphyr

+1 to that idea

reiddraper avatar Feb 02 '12 14:02 reiddraper