HPI local installation caches Reddit exported data and does not refresh

Open sergio-ns opened this issue 2 years ago • 4 comments

I am experimenting with HPI as I was looking for a system that would allow me to create a repository of my digital traces: cool stuff.

I've installed HPI as per the local/editable option.

I'm testing it with Reddit. I've configured the path to the Reddit export file in $HOME/.config/my/my/__init__.py by adding:

export_path = "/home/ubuntu/hpi/reddit/*.json"

Rexport uses the information in secrets.py to dump the Reddit data:

python3 -m rexport.export --secrets $HOME/git/rexport/secrets.py > ./reddit/"export-$(date -I).json"

This piece of code I've found in the documentation should report the list of the 4 subreddits with most saved posts:

import my.reddit.all
from collections import Counter
print(Counter(s.subreddit for s in my.reddit.all.saved()).most_common(4))

But what happens is that the data processed by my.reddit gets cached in $HOME/.cache and is not updated when I rerun the rexport script:

ubuntu@MARS:~/.cache/my$ ls -la
-rw-r--r-- 1 ubuntu ubuntu 1433600 Nov 21 15:35 my.reddit.rexport:comments
-rw-r--r-- 1 ubuntu ubuntu 1400832 Nov 21 15:34 my.reddit.rexport:saved
-rw-r--r-- 1 ubuntu ubuntu   94208 Nov 21 15:35 my.reddit.rexport:submissions
-rw-r--r-- 1 ubuntu ubuntu  561152 Nov 21 15:35 my.reddit.rexport:upvoted

To see the refreshed dump I must first delete the cached files.

What am I missing?

Thanks s.

sergio-ns, Nov 21 '21 15:11

This is probably intended behavior: cachew creates the sqlite files in the .cache folder by design. The question is how to get those files recreated after re-running the rexport script. Perhaps removing them as part of the script execution is the most logical approach.
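A minimal sketch of that workaround, assuming the cache lives in a directory like the $HOME/.cache/my shown in the listing above and that the files share the my.reddit.rexport: prefix. The helper name clear_cachew_files is hypothetical and is not part of HPI or cachew:

```python
from __future__ import annotations

from pathlib import Path


def clear_cachew_files(cache_dir: Path, prefix: str = "my.reddit.rexport:") -> list[str]:
    """Delete cache files whose names start with `prefix`; return the removed names."""
    removed = []
    for entry in sorted(cache_dir.iterdir()):
        # Only touch regular files that belong to the given module prefix.
        if entry.is_file() and entry.name.startswith(prefix):
            entry.unlink()
            removed.append(entry.name)
    return removed
```

It could be called as clear_cachew_files(Path.home() / ".cache" / "my") right after the export finishes, which forces the next query to rebuild the databases from the JSON files.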

sergio-ns, Nov 28 '21 14:11

cachew should automatically detect that new files have been added, recompute the comments, and overwrite that database.

On line 88 in my/reddit/rexport.py:

diff --git a/my/reddit/rexport.py b/my/reddit/rexport.py
index cca3e35..5c4d045 100755
--- a/my/reddit/rexport.py
+++ b/my/reddit/rexport.py
@@ -85,7 +85,7 @@ Upvote     = dal.Upvote
 def _dal() -> dal.DAL:
     inp = list(inputs())
     return dal.DAL(inp)
-cache = mcachew(depends_on=inputs) # depends on inputs only
+cache = mcachew(depends_on=inputs, logger=logger) # depends on inputs only


 @cache

If you modify the line to add the logger (this should probably be done by default), you can then see what cachew is doing by setting the HPI_LOGS variable like this:

HPI_LOGS=debug hpi query my.reddit.all.comments >/dev/null
[my.reddit.rexport:saved] using inferred type <class 'rexport.dal.Save'>
[my.reddit.rexport:comments] using inferred type <class 'rexport.dal.Comment'>
[my.reddit.rexport:submissions] using inferred type <class 'rexport.dal.Submission'>
[my.reddit.rexport:upvoted] using inferred type <class 'rexport.dal.Upvote'>
using /home/sean/.cache/cachew/my.reddit.rexport:comments for db cache
new hash: cachew: 0.9.0, schema: [Column('raw', Json(), table=None)], dependencies: (PosixPath('/home/sean/data/rexport/20200930T214405Z.json'), PosixPath('/home/sean/data/rexport/20211113T102439Z.json'))
old hash: cachew: 0.9.0, schema: [Column('raw', Json(), table=None)], dependencies: (PosixPath('/home/sean/data/rexport/20200930T214405Z.json'), PosixPath('/home/sean/data/rexport/20211113T102439Z.json'))
hash matched: loading from cache

In most cases you'll see the same "hash matched: loading from cache" message, since the input filenames are the same as the last time it ran.

If you then add a new export by running rexport and re-run the query:

HPI_LOGS=debug hpi query my.reddit.all.comments >/dev/null
[my.reddit.rexport:saved] using inferred type <class 'rexport.dal.Save'>
[my.reddit.rexport:comments] using inferred type <class 'rexport.dal.Comment'>
[my.reddit.rexport:submissions] using inferred type <class 'rexport.dal.Submission'>
[my.reddit.rexport:upvoted] using inferred type <class 'rexport.dal.Upvote'>
using /home/sean/.cache/cachew/my.reddit.rexport:comments for db cache
new hash: cachew: 0.9.0, schema: [Column('raw', Json(), table=None)], dependencies: (PosixPath('/home/sean/data/rexport/20200930T214405Z.json'), PosixPath('/home/sean/data/rexport/20211113T102439Z.json'), PosixPath('/home/sean/data/rexport/20211209T191206Z.json'))
old hash: cachew: 0.9.0, schema: [Column('raw', Json(), table=None)], dependencies: (PosixPath('/home/sean/data/rexport/20200930T214405Z.json'), PosixPath('/home/sean/data/rexport/20211113T102439Z.json'))
hash mismatch: computing data and writing to db
[D 211209 11:13:17 dal:167] comments: finished processing /home/sean/data/rexport/20200930T214405Z.json:  999/ 999 new; total: 999
[D 211209 11:13:18 dal:167] comments: finished processing /home/sean/data/rexport/20211113T102439Z.json:   21/1000 new; total: 1020
[D 211209 11:13:18 dal:167] comments: finished processing /home/sean/data/rexport/20211209T191206Z.json:    0/1000 new; total: 1020

You should hopefully see it recalculating the results (hash mismatch: computing data and writing to db) to include the new data.

seanbreckenridge, Dec 09 '21 19:12

Oh -- the only case where I see an issue is if the filenames of the new data are the same as the old. You seem to be using date -I, which returns something like:

date -I
2021-12-09

So if you make multiple exports on the same day, the new one overwrites the old one under the same filename. Since the filename is unchanged, cachew assumes the data is the same and keeps loading from the cache.
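To illustrate the idea, here is a toy sketch of a cache key derived from file names only. It is not cachew's actual implementation; key_from_names and cache_is_stale are hypothetical names that only mimic the matched/mismatch decision visible in the logs above:

```python
import tempfile
from pathlib import Path


def key_from_names(paths):
    """Cache key built from file names only, ignoring file contents."""
    return tuple(sorted(str(p) for p in paths))


def cache_is_stale(old_key, paths):
    """Mimic the 'hash matched' / 'hash mismatch' decision seen in the logs."""
    return key_from_names(paths) != old_key


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        export = Path(tmp) / "export-2021-12-09.json"
        export.write_text('{"saved": 1}')
        old_key = key_from_names([export])

        # A second export on the same day overwrites the same file name.
        export.write_text('{"saved": 2}')
        same_name = cache_is_stale(old_key, [export])

        # An export under a new name changes the key.
        newer = Path(tmp) / "export-2021-12-10.json"
        newer.write_text('{"saved": 3}')
        new_name = cache_is_stale(old_key, [export, newer])
        return same_name, new_name


print(demo())  # (False, True)
```

The first value is False because the overwritten file keeps its name, so the changed contents go unnoticed. The second is True because a new file name alters the key.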

If you change the date command to be specific to the second rather than the day, to something like:

python3 -m rexport.export --secrets /path/to/secrets.py >"export-$(date +%s).json"

... that may fix this issue, though I'm unsure.

seanbreckenridge, Dec 09 '21 19:12

Yep, I think @seanbreckenridge is right -- it would be due to cachew using filenames by default, so it assumes no changes if you only use the date.

There is something experimental to use the file modification time, but we still need to think about how/if we should rely on it by default: https://github.com/karlicoss/cachew/blob/49d349f5c32ae25d6f5a36279c8f0c5090242da2/src/cachew/init.py#L623-L626
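As a rough sketch of what a modification-time-aware dependency could look like (this is not the experimental cachew code linked above, and inputs_with_mtime is a hypothetical helper), the key can pair each path with its mtime so that rewriting a file under the same name still changes it:

```python
import os
import tempfile
from pathlib import Path


def inputs_with_mtime(paths):
    """Return (path, mtime_ns) pairs so that rewriting a file changes the key."""
    return tuple((str(p), p.stat().st_mtime_ns) for p in sorted(paths))


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        export = Path(tmp) / "export-2021-12-09.json"
        export.write_text("first")
        # Pin the timestamps explicitly so the demo does not depend on clock resolution.
        os.utime(export, ns=(1_000_000_000, 1_000_000_000))
        before = inputs_with_mtime([export])

        export.write_text("second")
        os.utime(export, ns=(2_000_000_000, 2_000_000_000))
        after = inputs_with_mtime([export])
        return before != after


print(demo())  # True
```

Something along these lines could presumably be supplied through depends_on in place of the plain list of inputs, though that is an assumption and untested here.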

And yeah, IMO it's best to keep the full timestamp, either via date +%s or date -Iseconds --utc (a bit more human readable).
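If the export is driven from Python rather than the shell, a full UTC timestamp can be built the same way. This is only a sketch: export_filename is a hypothetical helper, and the format is chosen to resemble the file names in the logs above:

```python
from datetime import datetime, timezone


def export_filename(now=None):
    """Build a name like export-20211209T191206Z.json from a UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return "export-" + stamp + ".json"


print(export_filename(datetime(2021, 12, 9, 19, 12, 6, tzinfo=timezone.utc)))
# export-20211209T191206Z.json
```

Because the name changes every second, two exports on the same day end up in separate files and the set of input filenames changes.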

karlicoss, Dec 19 '21 19:12