
Feature Request: Allow local text or TSV files instead of Google Spreadsheets

Open kkarhan opened this issue 1 year ago • 5 comments

Hi, as I asked on the fediverse, there is a not-insignificant need to allow self-hosting, which admittedly the project doesn't support as of now.

  • Ideally one could just specify e.g. a --local [filename] flag (similar to the batch option of yt-dlp and curl) and use that.
  • I have built something like this in the past, out of the need to circumvent extra-obnoxious anti-bot firewalls and similar attempts at preventing people from archiving content.
  • It is not always practical or possible to invoke or access Google services, which may be blocked by networks, ISPs and upstreams.

I sincerely hope this will help your project going forward, and if needed I'll gladly provide samples of sites that one may want to archive.

Yours faithfully, Kevin Karhan

kkarhan avatar Aug 05 '24 13:08 kkarhan

Hi kkarhan, thanks for opening the issue - this is something we may look at and would welcome pull requests to add a TSV feeder.

Currently we do support a command line feeder (cli_feeder) if you want to bypass the Google dependency in the interim - you can set this in your orchestration.yaml.

We are planning on working on the documentation of the auto-archiver so hopefully that will help with correctly configuring for different workflows.

GalenReich avatar Aug 06 '24 09:08 GalenReich

Hey @kkarhan thanks for the clear issue and suggestion.

Adding to Galen's answer: for now we have only implemented 2 main feeders: GoogleSheets and CommandLine. Internally, that covers all our needs, so this is not something that will be worked on by us atm (adding wontfix label).

Still, we'll leave this issue open for a while in case you or others find it a valuable addition and want to contribute it to the project.

msramalho avatar Aug 21 '24 11:08 msramalho

Thanks so far for the feedback and keeping the issue open.

Is there any conclusive documentation re: cli_feeder?

Because if it works similarly to wget & curl, I could just iterate over things that way...

kkarhan avatar Aug 22 '24 03:08 kkarhan

No good documentation on it unfortunately.

If you look at the code https://github.com/bellingcat/auto-archiver/blob/b166d57e61285dba585ca3bfd3af2acfb5696501/src/auto_archiver/feeders/cli_feeder.py#L17-L24

it is essentially expecting a --cli_feeder.urls parameter; an example call would be: python -m src.auto_archiver --config secrets/orchestration.yaml --cli_feeder.urls="https://example.com,https://example2.wow"
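As an interim workaround for a local file of URLs, one could build that comma-separated string from the file before calling the archiver. The helper below is a sketch of mine, not part of auto-archiver; the filename, skip rules, and function name are all assumptions:

```python
# Hypothetical helper (not part of auto-archiver): build the
# comma-separated string that --cli_feeder.urls expects from a file
# containing one URL per line.
from pathlib import Path

def urls_from_file(path: str) -> str:
    """Read one URL per line, skipping blank lines and #-comments."""
    urls = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return ",".join(urls)
```

The result can then be substituted into the --cli_feeder.urls="..." argument of the command above, e.g. via shell command substitution.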

What I'd suggest is that you either create a new, very similar feeder that accepts a filename instead of a comma-separated list of hardcoded URLs, OR modify the cli_feeder to take another parameter just for filenames and require at least one of the two to be present.
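The second option could be sketched like this. This is a standalone illustration under my own assumptions (function and parameter names are invented); a real implementation would subclass the feeder machinery shown in the cli_feeder source linked above:

```python
# Sketch of a feeder accepting either a comma-separated `urls` string or a
# `filename` with one URL per line, requiring at least one of the two.
# Illustrative only; the real Feeder/Metadata classes live in
# src/auto_archiver and should be followed instead.
from typing import Iterator, Optional

def feed_urls(urls: Optional[str] = None,
              filename: Optional[str] = None) -> Iterator[str]:
    """Yield URLs from an inline comma-separated string and/or a file."""
    if not urls and not filename:
        raise ValueError("one of `urls` or `filename` must be provided")
    if urls:
        for u in urls.split(","):
            if u.strip():
                yield u.strip()
    if filename:
        with open(filename) as f:
            for line in f:
                if line.strip():
                    yield line.strip()
```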

This should not be hard to achieve, assuming you've been able to run/test the auto-archiver locally in your development environment.

msramalho avatar Aug 26 '24 11:08 msramalho

*this would be preferable to piping, given the current software architecture of the library.

msramalho avatar Aug 26 '24 11:08 msramalho

A new csv_feeder will be added in an upcoming release of auto-archiver.

Clearer documentation will be provided once the new version is released, but as a note, you will be able to use the following command:

auto-archiver --feeders=csv_feeder --csv_feeder.files /path/to/file.csv

It will take an input file that is either a plain list of URLs or a valid CSV with a header row. To choose the column to read URLs from, use the --csv_feeder.column="My Column Name" option.
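The described behaviour (plain URL list, or CSV with a header row and a selectable column) can be approximated with the stdlib csv module. This is an illustration of the idea only, not the shipped csv_feeder; see the project's documentation for the real implementation:

```python
# Illustration (not the shipped csv_feeder): with `column`, parse the text
# as CSV with a header row and yield values from that column; otherwise
# treat every non-empty line as a URL.
import csv
import io
from typing import Iterator, Optional

def read_urls(text: str, column: Optional[str] = None) -> Iterator[str]:
    if column:
        for row in csv.DictReader(io.StringIO(text)):
            if row.get(column):
                yield row[column]
    else:
        for line in text.splitlines():
            if line.strip():
                yield line.strip()
```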

Ref

pjrobertson avatar Feb 04 '25 12:02 pjrobertson

The changes have been merged into master, and we hope to get a new release out within the next ~week. You can view the full documentation on the new CSV feeder here: https://auto-archiver.readthedocs.io/en/latest/modules/autogen/feeder/csv_feeder.html

pjrobertson avatar Feb 12 '25 11:02 pjrobertson