
Facebook archiving

Open djhmateer opened this issue 2 years ago • 3 comments

I've got a Facebook archiver working by using the wacz_enricher.py

https://github.com/djhmateer/auto-archiver/blob/v6-test/src/auto_archiver/enrichers/wacz_enricher.py#L159

I am using a stored browser profile so that I can get images which require you to be logged in.

I am running this archiver from a residential IP because, when it is run from a cloud server, FB blocks the requests.

This archiver runs in addition to the main archiver (which runs on a cloud server). It:

  • looks for any URL that contains facebook.com and has an archive status of wayback: (I have added a new config flag called fb_archiver so that gsheet_feeder.py only picks up the rows we want)
  • runs the wacz archiver only
  • runs hash_enricher and screenshot_enricher
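The row selection described above could look roughly like the sketch below. It is only an illustration: the row keys `url` and `status`, the helper names, and the reading of "archive status of wayback:" as "status starts with `wayback:`" are all assumptions, not the actual gsheet_feeder.py code.

```python
from typing import Dict, Iterable, List


def wants_fb_archiver(row: Dict[str, str]) -> bool:
    """Return True for rows the residential Facebook archiver should handle.

    Assumption: each row is a dict with 'url' and 'status' keys
    (hypothetical names, not taken from gsheet_feeder.py).
    """
    url = (row.get("url") or "").lower()
    status = (row.get("status") or "").strip().lower()
    # Rule from the issue text: URL contains facebook.com and the
    # archive status is a wayback: entry.
    return "facebook.com" in url and status.startswith("wayback:")


def select_rows(rows: Iterable[Dict[str, str]], fb_archiver: bool = False) -> List[Dict[str, str]]:
    """Mimic a feeder that only narrows its rows when fb_archiver is set."""
    if not fb_archiver:
        return list(rows)
    return [row for row in rows if wants_fb_archiver(row)]


rows = [
    {"url": "https://www.facebook.com/photo/?fbid=1", "status": "wayback: ok"},
    {"url": "https://example.com/post", "status": "wayback: ok"},
    {"url": "https://www.facebook.com/some.page", "status": ""},
]
selected = select_rows(rows, fb_archiver=True)
```

With the flag off, `select_rows` returns every row unchanged, so the main archiver's behaviour would not be affected.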

This could be much simpler if I can run everything sequentially (and not on 2 servers). I need to wait for more bandwidth on the residential network, then I can potentially do a PR.

Also, I've found I need to keep testing the profile, as it needs to be logged in again after a few weeks.

djhmateer avatar Nov 02 '23 16:11 djhmateer

Looking forward to that PR, we can indeed have an option to run a specific archiver via a residential IP proxy.

msramalho avatar Nov 13 '23 10:11 msramalho

Taking another look at this, can you clarify whether you're doing any extra downloads/requests or simply parsing data from inside the wacz?

msramalho avatar Feb 20 '24 11:02 msramalho

Hi Miguel

From:

https://github.com/djhmateer/auto-archiver/blob/v6-test/src/auto_archiver/enrichers/wacz_enricher.py#L159

Probably best to follow along on link above.

Apart from the /photo special case, I get the root page, then parse it for resources to extract the fb_id and set_id. Then jump down to

https://github.com/djhmateer/auto-archiver/blob/v6-test/src/auto_archiver/enrichers/wacz_enricher.py#L400

which does another request (and another wacz download), then returns the next fb_id back to the main function above.
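As a rough sketch of that control flow (not the code at the links above): the main function starts from the first fb_id, and a helper makes one more request per image and hands back the next fb_id. Here `fetch_next` is a hypothetical stand-in for that helper, and the `max_items` cap and the cycle check are my own assumptions.

```python
from typing import Callable, List, Optional


def walk_photo_set(
    first_fb_id: str,
    set_id: str,
    fetch_next: Callable[[str, str], Optional[str]],
    max_items: int = 50,
) -> List[str]:
    """Follow the chain of fb_ids in a photo set, one request per step.

    fetch_next(fb_id, set_id) is a hypothetical callable standing in for
    the extra request (and wacz download); it returns the next fb_id,
    or None when there is nothing further to fetch.
    """
    visited: List[str] = []
    seen = set()
    current: Optional[str] = first_fb_id
    # Stop when the chain ends, loops back on itself, or hits the cap.
    while current is not None and current not in seen and len(visited) < max_items:
        seen.add(current)
        visited.append(current)
        current = fetch_next(current, set_id)
    return visited


# Fake chain for illustration: photo 3 links back to photo 1.
fake_chain = {"1": "2", "2": "3", "3": "1"}
order = walk_photo_set("1", "set-a", lambda fb_id, _set_id: fake_chain.get(fb_id))
```

Passing the request function in as a parameter keeps the loop testable without touching Facebook at all.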

Regards Dave

djhmateer avatar Feb 21 '24 15:02 djhmateer

Facebook archiving using the yt-dlp built-in code is now merged into main. See https://github.com/bellingcat/auto-archiver/pull/223

It's not perfect (one limitation is it only gets the first 100 characters of text posts), but it seems to be working reliably without logging in.

For facebook, I'd still recommend using the screenshot_enricher (with cookies/login info), and WACZ, but this now means that basic facebook support is built in to the tool :)

pjrobertson avatar Mar 17 '25 12:03 pjrobertson

Closing this as facebook archiving is now part of the generic_extractor. It is not perfect, but I will open another issue for the remaining limitations.

pjrobertson avatar Mar 26 '25 11:03 pjrobertson