promnesia discussion: browser history export & integration into Promnesia

Follow-up on https://github.com/karlicoss/promnesia/pull/207#issuecomment-786999837

Mar 01 '21 21:03 karlicoss

@gms8994 wrote:

re-indexing the entire file every hour doesn't seem like a great plan from an efficiency perspective as 99% or more will not have changed. Do you have thoughts on extending the Source for Promnesia to allow overriding the query that's generated (or at least adding a "where" clause to it) that would allow this?

By 'adding where clause' you mean something like adding SELECT ... WHERE timestamp > *last previously processed timestamp*?

Short answer is that I've not been doing that, sacrificing some extra CPU for simplicity and avoiding data losses. The longer answer is that it would be possible, perhaps in several ways:

1. Use the WHERE clause when 'snapshotting' the database. So we would need to modify the code somewhere here
   - upside: fairly simple, won't require changes to Promnesia (if the schema stays the same), snapshots take less disk space
   - downside: like you said, would need some extra handling, e.g. to prune the urls table. In my experience, such things can also be error-prone: if you misinterpret the schema, you might lose data without knowing it.
2. Keep indexing intact, merge diffs into a single 'superdatabase' (instead of snapshotting)
   - upside: fast indexing, takes less disk space
   - downside: very risky, easy to lose all data (see https://beepb00p.xyz/unnecessary-db.html#example_chrome). Especially tricky to test for all browsers, so similar downsides to 1.
3. Keep the snapshotting process intact, but change the promnesia.sources.browser code to be more aware of the previously processed timestamp.
   - upside: robust; even if there is a bug in the 'merging' logic, it can't impact the data already kept on disk
   - downside: might be tedious to support for all browsers (but also not really necessary if they aren't around to test, can do gradually). Takes extra disk space (unless you prune the 'snapshots'). Otherwise similar downsides to 1.
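
To make the last option a bit more concrete, here is a minimal sketch of a timestamp-aware reader that leaves the snapshots untouched. The `visits(url, timestamp)` table, the `new_visits` helper and the integer timestamps are all made up for illustration; real browser schemas differ, and this is not the actual promnesia.sources.browser code.

```python
import sqlite3
from pathlib import Path
from typing import Iterable, Iterator, Set, Tuple

Visit = Tuple[str, int]  # (url, timestamp); hypothetical simplified schema


def new_visits(snapshots: Iterable[Path], last_processed: int) -> Iterator[Visit]:
    """Yield visits newer than last_processed, deduplicated across snapshots.

    Snapshots are opened read-only, so a bug here cannot damage data on disk.
    """
    seen: Set[Visit] = set()
    for path in sorted(snapshots):
        # mode=ro makes SQLite refuse any write to the snapshot file
        uri = Path(path).resolve().as_uri() + "?mode=ro"
        conn = sqlite3.connect(uri, uri=True)
        try:
            rows = conn.execute(
                "SELECT url, timestamp FROM visits WHERE timestamp > ? ORDER BY timestamp",
                (last_processed,),
            )
            for visit in rows:
                # overlapping snapshots contain the same visit more than once
                if visit not in seen:
                    seen.add(visit)
                    yield visit
        finally:
            conn.close()
```

The caller would remember the largest timestamp it has seen and pass it as `last_processed` on the next run; where that value is stored is left open here.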

So overall, at least to me, it seems that the safest option is 3. The disk space issue can be dealt with, either with some manual pruning (e.g. 'keep one database per week', which is what I've been doing so far) or with better and more robust pruning strategies that are agnostic of exact database contents.
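
As a rough sketch of what 'keep one database per week' could look like without ever opening the databases: the helper below only looks at file names. It assumes, purely for illustration, that each snapshot name starts with a `YYYY-MM-DD` date; `snapshots_to_prune` is a hypothetical name, and keeping the newest snapshot of each week (rather than the oldest) is an arbitrary choice.

```python
from datetime import date
from typing import Dict, Iterable, List, Tuple


def snapshots_to_prune(names: Iterable[str]) -> List[str]:
    """Return snapshot names to delete, keeping the newest one per ISO week.

    Assumes (hypothetically) that each name starts with YYYY-MM-DD.
    Nothing is deleted here; the caller decides what to do with the list.
    """
    all_names = sorted(names)  # ISO dates sort chronologically as strings
    newest: Dict[Tuple[int, int], str] = {}
    for name in all_names:
        iso = date.fromisoformat(name[:10]).isocalendar()
        # key is (ISO year, ISO week); later names overwrite earlier ones
        newest[(iso[0], iso[1])] = name
    keep = set(newest.values())
    return [n for n in all_names if n not in keep]
```

Since the function only returns a list, it is easy to dry-run it against a directory listing before actually removing anything.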

It would also work nicely with the 'update' indexer policy (currently Promnesia overwrites the index by default). With the 'update' policy, it would be possible to only process the 'diff' without processing the rest of the history.

Let me know if it all makes sense? What do you think?

Mar 01 '21 21:03 karlicoss

Apologies for not being clearer; I was thinking of updating the Source class (or down the line to the Browser class) to allow taking an extra param in the config file that would allow passing a where clause to the data... so the workflow would look something like this (high level):

1. Set up config file pointing to browser path (without where)
2. Manually run promnesia index
3. Update config file to have where="something something dark side" for filtering results
4. Enable cron and automatic indexing continues

This would allow the user to define the amount of data they want to index. Optionally, a flag to index could be made to ignore the where clause (set up config with the where clause, but it's ignored because of the flag, allowing full indexing).
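
Roughly, the query-building side of this might look like the sketch below. To be clear, `where` stands in for the proposed config option and `ignore_where` for the proposed flag; neither exists in Promnesia today, and `build_query` is a made-up helper.

```python
from typing import Optional


def build_query(table: str, where: Optional[str] = None, ignore_where: bool = False) -> str:
    """Build the SELECT used for indexing.

    `where` and `ignore_where` are hypothetical: they model the proposed
    config option and the proposed 'index everything anyway' flag.
    """
    query = f"SELECT * FROM {table}"
    if where and not ignore_where:
        # The clause comes from the user's own config file, so it is
        # treated as trusted input and appended as-is.
        query += f" WHERE {where}"
    return query
```

With no `where` configured, or with the flag set, this falls back to the full query, which matches the 'manual full index first, filtered cron runs afterwards' workflow.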

I'm also totally okay if you think that this is too much effort for not a lot of gain; knowing about the flag from #210, I could conceivably write something that manually splits the database out to only include a specific amount of data, which means that I could make this work without changes to promnesia itself...

Mar 02 '21 15:03 gms8994