
Should TrailDB deduplicate events?

Open ckuethe opened this issue 9 years ago • 4 comments

For discussion: it would be nice if TrailDB could deduplicate events. Below is a simple script that inserts some records twice. Clearly it's a little bit silly to append the exact same database twice, but I could easily end up with duplicate events when merging a bunch of different log types for a given time period.

from traildb import TrailDB, TrailDBConstructor
from uuid import uuid4

fields = ['text']

# Build a small TrailDB: 2 trails with 5 events each.
cons = TrailDBConstructor('/tmp/test1', fields)
for x in range(2):
    uid = uuid4().hex
    for ts in range(5):
        cons.add(uid, ts, ['trail {}, time {}'.format(uid, ts)])

tdb = cons.finalize()
print('{} fields, {} trails, {} events'.format(tdb.num_fields, tdb.num_trails, tdb.num_events))

# Append the same TrailDB twice to a new constructor; every event ends up stored twice.
cons = TrailDBConstructor('/tmp/test2', fields)
cons.append(tdb)
cons.append(tdb)

tdb = cons.finalize()
print('{} fields, {} trails, {} events'.format(tdb.num_fields, tdb.num_trails, tdb.num_events))

prints

2 fields, 2 trails, 10 events
2 fields, 2 trails, 20 events

ckuethe avatar Jul 08 '16 00:07 ckuethe

What did you have in mind for the semantics of deduplication? Are you picturing a flag passed to the constructor that makes it silently drop exact duplicates of previously handled events?

gregn-adroll avatar Jul 08 '16 05:07 gregn-adroll

Yes, a flag to the constructor to silently drop dups would be great. That would allow me to backfill logs and still have unique events.

ckuethe avatar Jul 08 '16 05:07 ckuethe

Does "duplicate" in this context mean that all fields are equal, including the timestamp and the UUID? Implementing dedup logic like this should be quite doable.

tuulos avatar Jul 08 '16 20:07 tuulos

Yes, all the fields, including the timestamp and UUID, would have to be equal for an event to be considered a duplicate.

  • Different UUID? Lightning struck Alice instead of Bob. Log it.
  • Different timestamp? Bob got hit by lightning again. Log it.
  • Alice and Carol both telling me that Bob got hit by lightning at noon? If deduplication is active, I don't care who told me, only that I have a record of the event. (The logged event may or may not have a source host field, as appropriate.)

ckuethe avatar Jul 08 '16 21:07 ckuethe
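
Until something like this exists in TrailDB itself, the semantics agreed on above (an event is a duplicate only if its UUID, timestamp and every field value all match) can be approximated on the caller's side before events reach the constructor. The following is only a rough sketch under that assumption: `dedup_events` is a hypothetical helper, not part of the TrailDB API, and the short names used as UUIDs are placeholders for real hex UUIDs.

```python
def dedup_events(events):
    """Yield each (uuid, timestamp, values) event once, in first-seen order.

    Hypothetical helper, not a TrailDB function. Two events count as
    duplicates only if the UUID, the timestamp and all field values match.
    """
    seen = set()
    for uuid, ts, values in events:
        key = (uuid, ts, tuple(values))  # lists are unhashable, so use a tuple
        if key in seen:
            continue  # exact duplicate: drop it silently
        seen.add(key)
        yield uuid, ts, list(values)


# Placeholder UUIDs for illustration; real ones would be hex strings.
events = [
    ('bob', 12, ['lightning']),    # reported by Alice
    ('bob', 12, ['lightning']),    # same event reported by Carol: duplicate
    ('alice', 12, ['lightning']),  # different UUID: kept
    ('bob', 13, ['lightning']),    # different timestamp: kept
]
unique = list(dedup_events(events))
print(len(unique))  # 3
```

Each surviving event could then be passed to `cons.add(uuid, ts, values)` as in the script at the top of the thread. Note that this sketch keeps every key in memory, so it only suits inputs that fit comfortably in RAM, and a source host field would need to be left out of the key if reports from different hosts should collapse into one event.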