internetarchive-downloader

[Feature Request] NOT boolean for -f?

Open blarg1980 opened this issue 2 years ago • 4 comments

I get that this is roughly what --invertfilefiltering does, but sometimes there are additional files you want to skip that contain a word you cannot filter out other than by filtering out everything else (if that makes any sense). Would a NOT option be possible?

Example: -f "(USA)" NOT "(Beta)"

I get --invertfilefiltering helps, but for something like what I'm doing, I'll need to filter out a good chunk of countries

Example: -f "(Demo)" "(Japan)" "(Korea)" "(Europe)" "(Australia)" "(Greece)" "(Germany)" "(Italy)" "(Spain)" "(France)" "(Europe, Australia)" --invertfilefiltering

^ Some file names include these words elsewhere, so I need to keep the parentheses, since the parenthesized part of a file name is what marks the region.
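The include-but-exclude behaviour requested above can be sketched roughly as follows. This is only an illustration of the idea, not how internetarchive-downloader works: `select_files`, `include` and `exclude` are hypothetical names, the file names are made up, and terms are matched as literal, case-sensitive substrings so that the parentheses in "(USA)" need no escaping.

```python
from typing import Iterable, List


def select_files(names: Iterable[str],
                 include: Iterable[str],
                 exclude: Iterable[str]) -> List[str]:
    """Keep names containing any `include` term and no `exclude` term.

    Hypothetical helper: terms are literal substrings, not regexes.
    An empty `include` list means "keep everything that is not excluded".
    """
    include = list(include)
    exclude = list(exclude)
    kept = []
    for name in names:
        wanted = not include or any(term in name for term in include)
        unwanted = any(term in name for term in exclude)
        if wanted and not unwanted:
            kept.append(name)
    return kept


# Made-up file names, for illustration only.
files = [
    "Game A (USA).zip",
    "Game A (USA) (Beta).zip",
    "Game B (Japan).zip",
    "Game C (USA, Europe).zip",
]

# Equivalent of the requested: -f "(USA)" NOT "(Beta)"
print(select_files(files, include=["(USA)"], exclude=["(Beta)"]))
```

Because the closing parenthesis is part of the term, "(USA)" does not match a name tagged "(USA, Europe)", which is the reason for keeping the parentheses in the filter strings.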

I'm terrible with coding, so I might not make a lot of sense, but I hope I can help others who are having the same issue as I am.

blarg1980, May 26 '22 06:05

Thanks for the note, makes perfect sense! Filtering could be improved in a few ways - I'll have a think about this over the weekend and likely add a few additional options.

john-corcoran, May 26 '22 07:05

Hi, thanks for getting back to me. I didn't see that you had replied. I hardly use git on my end, so it's all new to me. I look forward to any and all improvements. Keep up the great work on it :D

blarg1980, Jun 02 '22 21:06

You are saying you want filenames that match: "(USA)" NOT "(Beta)" and "([not matching 'USA'])" AND "(Beta)"

One way to go about this is to have IA downloader write a line-delimited list of the files an item has to a text file. Next, use vim or a similar editor to delete the unwanted lines, for example by running :g/\cusa.*beta/d and then :g/\cbeta.*usa/d as two separate commands (\c makes the match case-insensitive). Then use that modified text file as the list of what IA downloader should download.
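The edit-the-list step could also be scripted rather than done by hand in an editor. A minimal sketch, assuming the file list has already been saved one name per line; the function name `drop_matching_lines` and the sample names are made up for illustration, and nothing here is part of IA downloader.

```python
import re
from typing import Iterable, List


def drop_matching_lines(lines: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Return the lines that match none of the case-insensitive regex patterns."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [line for line in lines
            if not any(rx.search(line) for rx in compiled)]


# Made-up list contents, standing in for a text file read with
# open(path, encoding="utf-8").read().splitlines().
listing = [
    "Game A (USA).zip",
    "Game A (USA) (Beta).zip",
    "Game B (Beta) (USA).zip",
    "Game C (Japan).zip",
]

# Same two patterns as the editor commands: names with both words, in either order.
remaining = drop_matching_lines(listing, [r"usa.*beta", r"beta.*usa"])
print("\n".join(remaining))
```

Note that this only removes names containing both words; names from other regions stay in the list and would need their own patterns.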

Implementing complex string matching could turn IA downloader into a complex mess of code that does regular expression (regex) matching and more. That is one way to look at it: the matching should be left to a separate program. Alternatively, it might be good for it to have filename pattern matching via regex, as seen in sed and perl on GNU/Linux. The regex could be specified in a text file (as grab-site does) for better compatibility across Linux, Windows, etc.

The whole "(USA)" NOT "(Beta)" string format is probably weak, so use regex instead. Regular expressions are pretty much all you need when it comes to matching patterns of text. A regex filter for everything that does not match usa.*beta (not implemented as of now) could look like: --invertfilefiltering -f file "filter.txt", with the contents of file.txt being: /usa.*beta/gi /beta.*usa/gi

Notice the proposed '-f file "filter.txt"' form for a file of regexes, versus '-f "pattern"' for giving the pattern directly.
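To make the proposal a bit more concrete, here is a rough sketch of how such a filter file might be read. None of this is implemented in IA downloader: the `/pattern/flags` format is taken from the example above, treating each whitespace-separated token as one pattern is an assumption (so patterns cannot contain spaces), and `parse_filter_text` and `matches_any` are hypothetical names. Only the `i` flag is interpreted; `g` has no meaning for a yes/no match and is ignored.

```python
import re
from typing import List


def parse_filter_text(text: str) -> List[re.Pattern]:
    """Compile whitespace-separated /pattern/flags tokens into regex objects."""
    compiled = []
    for token in text.split():
        if not token.startswith("/") or token.count("/") < 2:
            raise ValueError(f"expected /pattern/flags, got {token!r}")
        # Split at the last slash: everything before it is the pattern body.
        body, _, flags = token[1:].rpartition("/")
        options = re.IGNORECASE if "i" in flags else 0
        compiled.append(re.compile(body, options))
    return compiled


def matches_any(name: str, patterns: List[re.Pattern]) -> bool:
    """True if any compiled pattern is found anywhere in the name."""
    return any(rx.search(name) for rx in patterns)


# The text that would be read from the filter file.
filters = parse_filter_text("/usa.*beta/gi /beta.*usa/gi")

# Made-up file names, for illustration only.
names = [
    "Game A (USA).zip",
    "Game A (USA) (Beta).zip",
    "Game B (Beta) (USA).zip",
]

# Inverted filtering: keep only names that match none of the patterns.
print([n for n in names if not matches_any(n, filters)])
```

Keeping the patterns in a file sidesteps shell quoting differences between platforms, which is the compatibility point made above.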

ProximaNova, Mar 03 '23 19:03

Correction: "contents of filter.txt"

Also, I don't think this downloader can download the metadata at https://catalogd.archive.org/history/[item_id] (login required). If it did, it should download it to a folder named "itemid~history". The tilde character (~) is not allowed in IA item IDs, so that folder name could not clash with a real item.

ProximaNova, Mar 03 '23 20:03