HPI
HPI local installation caches Reddit exported data and does not refresh
I'm experimenting with HPI as I was looking for a system that would let me build a repository of my digital traces: cool stuff.
I've installed HPI using the local/editable option.
I'm testing it with Reddit. I've configured the path to the Reddit export files in $HOME/.config/my/my/__init__.py by adding:
export_path = "/home/ubuntu/hpi/reddit/*.json"
rexport uses the credentials in secrets.py to dump the Reddit data:
python3 -m rexport.export --secrets $HOME/git/rexport/secrets.py > ./reddit/"export-$(date -I).json"
This snippet, which I found in the documentation, should report the four subreddits with the most saved posts:
import my.reddit.all
from collections import Counter
print(Counter(s.subreddit for s in my.reddit.all.saved()).most_common(4))
But the information processed by my.reddit gets cached in $HOME/.cache and does not update when I rerun the rexport script:
ubuntu@MARS:~/.cache/my$ ls -la
-rw-r--r-- 1 ubuntu ubuntu 1433600 Nov 21 15:35 my.reddit.rexport:comments
-rw-r--r-- 1 ubuntu ubuntu 1400832 Nov 21 15:34 my.reddit.rexport:saved
-rw-r--r-- 1 ubuntu ubuntu 94208 Nov 21 15:35 my.reddit.rexport:submissions
-rw-r--r-- 1 ubuntu ubuntu 561152 Nov 21 15:35 my.reddit.rexport:upvoted
To see the refreshed dump I must first delete the cached files.
What am I missing?
Thanks s.
This is probably intended behavior: the sqlite files in the .cache folder are created by cachew by design. The question is how to get those files recreated after re-running the rexport script; perhaps removing them as part of the script execution is the most logical approach.
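As a stopgap (until the export filenames are made unique, as discussed below), the cache databases could be cleared before re-exporting. This is a minimal sketch based on the cache layout shown in the ls output above; clear_reddit_cache is a hypothetical helper, not part of HPI:

```python
from pathlib import Path


def clear_reddit_cache(cache_dir: Path) -> int:
    """Remove the cachew sqlite databases for my.reddit.rexport.

    Returns the number of files removed. Hypothetical helper; the
    'my.reddit.rexport:*' pattern matches the files listed above.
    """
    removed = 0
    for db in cache_dir.glob("my.reddit.rexport:*"):
        db.unlink()
        removed += 1
    return removed


if __name__ == "__main__":
    # assumes the default cache location from the ls output above
    print(clear_reddit_cache(Path.home() / ".cache" / "my"))
```

Running this before the rexport invocation would force cachew to rebuild the databases on the next query, though (as explained below) it shouldn't be necessary once the inputs change properly.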
cachew should automatically detect that there are new input files, recalculate the new comments, and overwrite the database.
On line 88 in my/reddit/rexport.py:
diff --git a/my/reddit/rexport.py b/my/reddit/rexport.py
index cca3e35..5c4d045 100755
--- a/my/reddit/rexport.py
+++ b/my/reddit/rexport.py
@@ -85,7 +85,7 @@ Upvote = dal.Upvote
def _dal() -> dal.DAL:
inp = list(inputs())
return dal.DAL(inp)
-cache = mcachew(depends_on=inputs) # depends on inputs only
+cache = mcachew(depends_on=inputs, logger=logger) # depends on inputs only
@cache
If you modify the line to add the logger (this should probably be done by default), you can see what cachew is doing by setting the HPI_LOGS environment variable like this:
HPI_LOGS=debug hpi query my.reddit.all.comments >/dev/null
[my.reddit.rexport:saved] using inferred type <class 'rexport.dal.Save'>
[my.reddit.rexport:comments] using inferred type <class 'rexport.dal.Comment'>
[my.reddit.rexport:submissions] using inferred type <class 'rexport.dal.Submission'>
[my.reddit.rexport:upvoted] using inferred type <class 'rexport.dal.Upvote'>
using /home/sean/.cache/cachew/my.reddit.rexport:comments for db cache
new hash: cachew: 0.9.0, schema: [Column('raw', Json(), table=None)], dependencies: (PosixPath('/home/sean/data/rexport/20200930T214405Z.json'), PosixPath('/home/sean/data/rexport/20211113T102439Z.json'))
old hash: cachew: 0.9.0, schema: [Column('raw', Json(), table=None)], dependencies: (PosixPath('/home/sean/data/rexport/20200930T214405Z.json'), PosixPath('/home/sean/data/rexport/20211113T102439Z.json'))
hash matched: loading from cache
In most cases you'll see the same "hash matched: loading from cache", since the input filenames are the same as the last time it ran.
If you then add a new export by running rexport and re-run the query:
HPI_LOGS=debug hpi query my.reddit.all.comments >/dev/null
[my.reddit.rexport:saved] using inferred type <class 'rexport.dal.Save'>
[my.reddit.rexport:comments] using inferred type <class 'rexport.dal.Comment'>
[my.reddit.rexport:submissions] using inferred type <class 'rexport.dal.Submission'>
[my.reddit.rexport:upvoted] using inferred type <class 'rexport.dal.Upvote'>
using /home/sean/.cache/cachew/my.reddit.rexport:comments for db cache
new hash: cachew: 0.9.0, schema: [Column('raw', Json(), table=None)], dependencies: (PosixPath('/home/sean/data/rexport/20200930T214405Z.json'), PosixPath('/home/sean/data/rexport/20211113T102439Z.json'), PosixPath('/home/sean/data/rexport/20211209T191206Z.json'))
old hash: cachew: 0.9.0, schema: [Column('raw', Json(), table=None)], dependencies: (PosixPath('/home/sean/data/rexport/20200930T214405Z.json'), PosixPath('/home/sean/data/rexport/20211113T102439Z.json'))
hash mismatch: computing data and writing to db
[D 211209 11:13:17 dal:167] comments: finished processing /home/sean/data/rexport/20200930T214405Z.json: 999/ 999 new; total: 999
[D 211209 11:13:18 dal:167] comments: finished processing /home/sean/data/rexport/20211113T102439Z.json: 21/1000 new; total: 1020
[D 211209 11:13:18 dal:167] comments: finished processing /home/sean/data/rexport/20211209T191206Z.json: 0/1000 new; total: 1020
You should hopefully see it recalculating the results ("hash mismatch: computing data and writing to db") to include the new data.
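The invalidation logic can be modeled roughly like this. This is a simplified sketch, not cachew's actual implementation: as the debug output above shows, the real hash also includes the cachew version and the value schema, and the results are persisted to sqlite rather than a dict.

```python
from typing import Callable, Dict, List


def cached_compute(inputs: List[str],
                   compute: Callable[[], list],
                   store: Dict[str, object]) -> list:
    """Toy model of cachew-style invalidation.

    The cache key is derived from the input filenames, so adding a new
    export file changes the key and triggers a recompute; re-running
    with the same filenames loads the cached result unchanged.
    """
    new_hash = "|".join(sorted(inputs))  # stand-in for cachew's hash
    if store.get("hash") == new_hash:
        return store["data"]             # "hash matched: loading from cache"
    data = compute()                     # "hash mismatch: computing data"
    store["hash"] = new_hash
    store["data"] = data
    return data
```

This is also why an export that merely overwrites an existing file goes unnoticed in this model: the filename list, and therefore the key, is unchanged.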
Oh -- the only case where I see an issue is if the filenames of the new data are the same as the old. You seem to be using date -I, which returns something like:
date -I
2021-12-09
so if you make multiple exports on the same day, the new one overwrites the old one, but cachew assumes the data is the same, since the dependency filenames haven't changed.
If you change the date command to be precise to the second rather than the day, to something like:
python3 -m rexport.export --secrets /path/to/secrets.py >"export-$(date +%s).json"
...that may fix this issue, though I'm not sure.
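To illustrate why day-resolution names collide while second-resolution names don't, here is a small sketch (the two timestamps are made up, standing in for two exports run on the same day):

```python
from datetime import datetime, timezone

# two hypothetical exports on the same day
t1 = datetime(2021, 12, 9, 10, 0, tzinfo=timezone.utc)
t2 = datetime(2021, 12, 9, 18, 30, tzinfo=timezone.utc)

# `date -I` style: both exports map to one filename (second overwrites first)
day_names = {f"export-{t:%Y-%m-%d}.json" for t in (t1, t2)}

# `date +%s` style: distinct filenames, so cachew sees a new input
sec_names = {f"export-{int(t.timestamp())}.json" for t in (t1, t2)}

print(len(day_names))  # 1 -- collision
print(len(sec_names))  # 2 -- unique names
```

With unique names per export, the dependency list changes on every run and cachew recomputes as expected.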
Yep, I think @seanbreckenridge is right -- it would be due to cachew using filenames by default, so it assumes no changes if you only use the date.
There is something experimental to use the file modification time, but I still need to think about how/if we should rely on it by default: https://github.com/karlicoss/cachew/blob/49d349f5c32ae25d6f5a36279c8f0c5090242da2/src/cachew/__init__.py#L623-L626
And yeah, IMO it's best to keep the full timestamp, either with date +%s or date -Iseconds --utc (a bit more human readable).