Facebook archiving
I've got a Facebook archiver working by using wacz_enricher.py:
https://github.com/djhmateer/auto-archiver/blob/v6-test/src/auto_archiver/enrichers/wacz_enricher.py#L159
I'm using a stored browser profile so the crawl can fetch images that require you to be logged in.
I'm running this archiver from a residential IP, because if it is run from a cloud IP, FB will block the requests.
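For reference, here is a minimal sketch of how a stored browser profile can be handed to browsertrix-crawler from Python. The paths, volume mounts and wiring are my assumptions, not the exact command used in wacz_enricher.py:

```python
import subprocess

def crawl_with_profile(url: str, profile_tar: str, output_dir: str) -> None:
    """Run browsertrix-crawler with a saved logged-in profile (sketch, not the enricher's exact command)."""
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{output_dir}:/crawls/",                 # crawl output (WACZ) lands here
        "-v", f"{profile_tar}:/crawls/profile.tar.gz",  # stored logged-in browser profile
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", url,
        "--profile", "/crawls/profile.tar.gz",          # reuse the saved Facebook session
        "--generateWACZ",
    ]
    subprocess.run(cmd, check=True)
```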
This archiver runs alongside the main archiver (which runs in the cloud), and it:
- looks for any URL that contains facebook.com and has an archive status of wayback: (I've added a new config flag called fb_archiver so that gsheet_feeder.py only picks up the rows we want; see the sketch after this list)
- runs the wacz archiver only
- runs hash_enricher and screenshot_enricher
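A minimal sketch of the row filter this implies. fb_archiver is the config flag mentioned above; the field and function names here are illustrative, not the actual gsheet_feeder.py code:

```python
def should_process(row: dict, fb_archiver: bool) -> bool:
    """Decide whether the residential FB archiver should pick up this spreadsheet row (illustrative sketch)."""
    if not fb_archiver:
        return False  # flag off: behave like the normal feeder
    url = row.get("url", "")
    status = row.get("archive_status", "")
    # only Facebook rows that the main (cloud) archiver has already pushed to wayback
    return "facebook.com" in url and status.startswith("wayback:")
```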
This could be much simpler if I can run everything sequentially (and not on 2 servers). I need to wait for more bandwidth on the residential network, then I can potentially do a PR.
Also, I've found I need to keep testing the profile, as it needs to be re-logged-in after a few weeks.
Looking forward to that PR; we can indeed have an option to run a specific archiver via a residential IP proxy.
Taking another look at this, can you clarify if you're doing any extra downloads/requests or simply parsing data from inside the wacz?
Hi Miguel
From:
https://github.com/djhmateer/auto-archiver/blob/v6-test/src/auto_archiver/enrichers/wacz_enricher.py#L159
Probably best to follow along with the link above.
Apart from the /photo special case, I get the root page, then parse it for resources to get the fb_id and set_id. Then I jump down to
https://github.com/djhmateer/auto-archiver/blob/v6-test/src/auto_archiver/enrichers/wacz_enricher.py#L400
which does another request (and another wacz download), then returns the next fb_id back to the main function above.
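Roughly, the parsing step looks something like this. The regex patterns and function name are illustrative only, not the exact code at the lines linked above:

```python
import re
from typing import Optional, Tuple

def parse_ids(html: str) -> Tuple[Optional[str], Optional[str]]:
    """Pull a Facebook photo id (fb_id) and album/set id out of page HTML (illustrative sketch)."""
    fb_id_match = re.search(r"fbid=(\d+)", html)
    set_id_match = re.search(r"set=([\w.]+)", html)
    fb_id = fb_id_match.group(1) if fb_id_match else None
    set_id = set_id_match.group(1) if set_id_match else None
    return fb_id, set_id
```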
Regards Dave
Facebook archiving using the yt-dlp built-in code is now merged into main. See https://github.com/bellingcat/auto-archiver/pull/223
It's not perfect (one limitation is it only gets the first 100 characters of text posts), but it seems to be working reliably without logging in.
For facebook, I'd still recommend using the screenshot_enricher (with cookies/login info) and WACZ, but this now means that basic facebook support is built into the tool :)
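For anyone landing here, the built-in path comes down to yt-dlp's Facebook extractor, roughly like this. The options shown are my own minimal choices, not the auto-archiver's exact configuration:

```python
import yt_dlp

def fetch_facebook_metadata(url: str) -> dict:
    """Extract Facebook post/video metadata via yt-dlp without logging in (minimal sketch)."""
    opts = {"quiet": True, "skip_download": True}
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
    # note: for text posts only the first ~100 characters of text come through (see the limitation above)
    return info
```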
Closing this as facebook archiving is now part of the generic_extractor. It is not perfect, but I will open another issue for this.