
Some feedback and thanks!

Open Cobertos opened this issue 2 years ago • 3 comments

I used this to download a ton of my data from Reddit. Figured I'd give some feedback on how I used it, what worked, and what didn't.

"[X] It looks like you are not authenticated well ..."

Sometimes I would get "[X] It looks like you are not authenticated well. [X] Please check your credentials and retry." Upon further inspection, it was the result of querying a link_id that returned a 403 response, and was not an issue with authentication. As an example, link_id 8vkhv8 fails because the subreddit is now private. That exception probably needs a better error message, or a separate except case.

JSON format

I found the HTML format to be nice, but not machine-digestible. I added a pretty botched JSON export alongside my HTML export. After "Submission downloaded", I added:

    import jsonpickle
    import praw

    # Suppress serialization of the parent Submission and Reddit objects,
    # which would otherwise drag the whole PRAW object graph into the output.
    @jsonpickle.handlers.register(praw.models.reddit.submission.Submission, base=True)
    class SubmissionHandler(jsonpickle.handlers.BaseHandler):
        def flatten(self, obj, data):
            return {}

    @jsonpickle.handlers.register(praw.reddit.Reddit, base=True)
    class RedditHandler(jsonpickle.handlers.BaseHandler):
        def flatten(self, obj, data):
            return {}

    # write_json is a local helper of mine, analogous to the existing HTML writer
    write_json(jsonpickle.encode(submission.comments[:]), submission, submission_id, now, args.output)

I also added jsonpickle as a dependency to make this work. Someone might find this useful; I liked having both exports.
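As an alternative to jsonpickle and its custom handlers, a plain recursive walk over the comment tree could produce the same kind of export with only the standard library. This is just a sketch: the `comment_to_dict`/`comments_to_json` names are mine, and it assumes each comment exposes `author`, `body`, `created_utc`, and `replies` attributes, as PRAW comments do.

```python
import json

def comment_to_dict(comment):
    """Flatten one comment, and its replies recursively, into plain dicts."""
    return {
        "author": str(comment.author) if comment.author else "[deleted]",
        "body": comment.body,
        "created_utc": comment.created_utc,
        "replies": [comment_to_dict(reply) for reply in comment.replies],
    }

def comments_to_json(comments):
    """Serialize a list of top-level comments to a JSON string."""
    return json.dumps([comment_to_dict(c) for c in comments], indent=2)
```

With PRAW you would presumably call `submission.comments.replace_more(limit=None)` first, so that every `MoreComments` placeholder is expanded before the walk.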

-i take an array?

It would be nice if -i could take an array. I used jq and xargs to pipe all link_ids from my rexport dump into RedditArchiver-standalone, in case you're interested:

    jq '[.submissions[].id, .saved[].id, .upvoted[].id, .comments[].link_id[3:]] | unique | .[]' \
        ~/Seafile/archive/ExportedServiceData/reddit-apiexport/export-username-2023-06-11.json \
    | xargs -i python ~/Seafile/projects/FORKED/RedditArchiver-standalone/RedditArchiver.py \
        -c ./config-username.yml -i {} \
        -o /home/username/Seafile/archive/ExportedServiceData/redditarchiver

This gets all submission IDs, saved IDs, upvoted IDs, and the link_id of each comment, and pipes them to xargs against RedditArchiver.py. It would have been nice to send the entire array to RedditArchiver at once, but this worked. It would probably also be faster, since the Python interpreter wouldn't have to spin up for every invocation.

praw.ini / config.yml

I had to make a praw.ini to use refresh_token.py. It's kind of annoying to archive multiple accounts, because now I have three praw.inis and config.ymls. It would be nice if this repo just read praw.ini for all client-specific secrets, or ran refresh_token for you. Probably more work than it's worth, though.
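For reference, the multi-account setup I mean looks roughly like this: one praw.ini with a named section per account (section names here are made up; the keys follow PRAW's documented config format, and a section is selected with `praw.Reddit("account1")`):

```ini
[account1]
client_id=CLIENT_ID_1
client_secret=CLIENT_SECRET_1
user_agent=RedditArchiver (by u/username1)
refresh_token=REFRESH_TOKEN_1

[account2]
client_id=CLIENT_ID_2
client_secret=CLIENT_SECRET_2
user_agent=RedditArchiver (by u/username2)
refresh_token=REFRESH_TOKEN_2
```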

Thanks

Thanks for the library! Really saved my weekend. And no pressure to implement any of this; I just thought I'd share my pain points.

Cobertos avatar Jun 12 '23 08:06 Cobertos

Hello! Thank you for your useful feedback 🦉

Answering each point:

  • I believe you when you say the "You are not authenticated well" message is too general, as I did not implement many checks for it. I will have to run more tests against edge-case situations to see what the possible failure reasons are.

  • It would be possible to add JSON output instead of HTML output, so it could be processed more easily (probably behind a flag argument?). That being said, the JSON structure could become complicated if there are many levels of nested replies, and some information would be lost (if only the comment body/author/date/etc. are saved). I guess it can be done, but in case you are curious, there is another project that produces machine-readable output with all the data: https://github.com/aliparlakci/bulk-downloader-for-reddit

  • Being able to handle multiple submission IDs is indeed interesting (and in the future, I plan to add arguments so someone can save all their saved submissions, or the submissions of a particular person, for example...). It could be done in multiple ways: specifying -i several times with a single ID each time, specifying -i once with delimiters (for example -i 14ect9e,14dkc44,14d6lqz), or reading from stdin (like echo '14ect9e,14dkc44,14d6lqz' | python3 RedditArchiver.py...). What do you think?

  • I am not really familiar with praw.ini, so I am not sure what you mean here. But I agree that being able to get a refresh token directly with the script would be more convenient.

Ailothaen avatar Jun 20 '23 16:06 Ailothaen

Yeah, a lot of what I wrote was more documentation for my future self, or for other people who end up searching by keyword for similar things. It didn't necessarily require action to be taken or code to be written. At least in my personal experience, when I get comments on any of my repos, my initial instinct is to try to find some fix that makes people happy, but that makes me burn out really fast.

Being able to handle multiple submissions ID is indeed interesting... What do you think?

Hmm, I'm not sure; I've never had to implement array-style CLI options like that before. I feel like I've seen repeated arguments (-i several times) more often, and I would lean toward that, just because 1,2,3-style lists feel more complex to me (they require a join(',')-type operation to omit the trailing comma, which is maybe harder to do with sed/awk? I don't really know CLI programming all that well, though).
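For what it's worth, with Python's argparse the repeated-flag form is nearly a one-liner via `action="append"`. This is only an illustrative sketch, not RedditArchiver's actual argument handling; the -i option name matches the thread, the rest is made up:

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(prog="RedditArchiver.py")
    # action="append" collects every -i occurrence into a single list
    parser.add_argument("-i", "--id", action="append", dest="ids",
                        help="submission ID or URL (repeatable)")
    return parser.parse_args(argv)

args = parse_args(["-i", "14ect9e", "-i", "14dkc44", "-i", "14d6lqz"])
print(args.ids)  # ['14ect9e', '14dkc44', '14d6lqz']
```

It also composes nicely with shell pipelines, since repeating a flag avoids the trailing-comma problem entirely.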

I am not really familiar with praw.ini so I am not really sure about what you mean here? But I agree that indeed, being able to get a refresh token directly with the script would be more convenient.

It looks like the setup might have changed since I last did it, so it might not apply anymore, actually.

Thanks again!

Cobertos avatar Jun 27 '23 19:06 Cobertos

Forgot to mention it here, but since your last message I made an update that allows specifying multiple IDs on the command line, like this:

    python3 RedditArchiver.py \
        -i https://www.reddit.com/r/Superbowl/comments/14hczkk/elf_owl_enjoying_our_pond/ \
        -i https://www.reddit.com/r/Superbowl/comments/14gozc4/adult_and_hungry_juvenile_great_horned_owl_norcal/ \
        -i 14iard6

Also, there are now options to download your saved submissions, the ones you upvoted, etc. This probably solves some of the use cases from your first message.

Ailothaen avatar Jul 08 '23 08:07 Ailothaen