Extended glob capabilities?

Open Firehawke opened this issue 5 years ago • 2 comments

I only took a quick glance, but it looks like the globbing in this script is inclusive only and doesn't support the ! exclusion operator.

Here's an example of a case where that'd be useful, at least to me. I'm going to walk you through this step by step just so you get an exact 100% view of what I'm doing and why.

ia search 'collection:wozaday' --parameters="page=1&rows=1" --sort="publicdate desc" -i > currentitems.txt

At this point, currentitems.txt contains "wozaday_Dr_Ruth_Computer_Game_of_Good_Sex".

I'd like to look at JUST the main .zip to parse the disk images therein. Let's look at what all is in the archive.

ia download -d -i --no-directories --itemlist=currentitems.txt

The dry run tells us the files are:

https://archive.org/download/wozaday_Dr_Ruth_Computer_Game_of_Good_Sex/00playable.woz
https://archive.org/download/wozaday_Dr_Ruth_Computer_Game_of_Good_Sex/00playable_screenshot.png
https://archive.org/download/wozaday_Dr_Ruth_Computer_Game_of_Good_Sex/Dr.%20Ruth%27s%20Computer%20Game%20of%20Good%20Sex%20%28woz-a-day%20collection%29.zip
https://archive.org/download/wozaday_Dr_Ruth_Computer_Game_of_Good_Sex/Dr.%20Ruth%27s%20Computer%20Game%20of%20Good%20Sex%20extras%20%28woz-a-day%20collection%29.zip
https://archive.org/download/wozaday_Dr_Ruth_Computer_Game_of_Good_Sex/__ia_thumb.jpg
https://archive.org/download/wozaday_Dr_Ruth_Computer_Game_of_Good_Sex/wozaday_Dr_Ruth_Computer_Game_of_Good_Sex_archive.torrent
https://archive.org/download/wozaday_Dr_Ruth_Computer_Game_of_Good_Sex/wozaday_Dr_Ruth_Computer_Game_of_Good_Sex_files.xml
https://archive.org/download/wozaday_Dr_Ruth_Computer_Game_of_Good_Sex/wozaday_Dr_Ruth_Computer_Game_of_Good_Sex_meta.sqlite
https://archive.org/download/wozaday_Dr_Ruth_Computer_Game_of_Good_Sex/wozaday_Dr_Ruth_Computer_Game_of_Good_Sex_meta.xml

So far so good. Let's glob it down to just the zips.

ia download -d -i --glob=*.zip --no-directories --itemlist=currentitems.txt

https://archive.org/download/wozaday_Dr_Ruth_Computer_Game_of_Good_Sex/Dr.%20Ruth%27s%20Computer%20Game%20of%20Good%20Sex%20%28woz-a-day%20collection%29.zip
https://archive.org/download/wozaday_Dr_Ruth_Computer_Game_of_Good_Sex/Dr.%20Ruth%27s%20Computer%20Game%20of%20Good%20Sex%20extras%20%28woz-a-day%20collection%29.zip

Here's where we hit the problem. I don't want to waste IA bandwidth or my own on snagging the extras .zip file when I only want to parse the main .zip (and the XML, but I'll do that separately and it's beside the point).

I started looking into globbing options a bit when I hit this snag and found that some globbing libraries support extended operators (e.g. http://man7.org/linux/man-pages/man7/glob.7.html) for these kinds of situations, but a quick glance over your codebase suggests you're not implementing any of that.

Any ideas, suggestions? My last resort will be to pipe the output to a text file, use standard unix tools to strip the extras, and pipe it back through wget, but this feels like something that might be better handled at the script side.
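For reference, the filtering step of that last-resort workaround can be sketched in Python rather than with unix text tools. This is only a sketch under assumptions: `filter_urls` and its `include`/`exclude` parameters are hypothetical names for illustration, not part of the `ia` tool, and the URL list stands in for the dry-run output shown above.

```python
from fnmatch import fnmatch
from urllib.parse import unquote, urlsplit


def filter_urls(urls, include="*", exclude=None):
    """Keep URLs whose decoded file name matches `include` but not `exclude`.

    Hypothetical helper: both patterns are plain fnmatch-style globs.
    """
    kept = []
    for url in urls:
        # Take the last path component and undo the percent-encoding.
        name = unquote(urlsplit(url).path.rsplit("/", 1)[-1])
        if not fnmatch(name, include):
            continue
        if exclude is not None and fnmatch(name, exclude):
            continue
        kept.append(url)
    return kept


# Stand-in for a few lines of the dry-run output above.
base = "https://archive.org/download/wozaday_Dr_Ruth_Computer_Game_of_Good_Sex/"
dry_run_urls = [
    base + "00playable.woz",
    base + "Dr.%20Ruth%27s%20Computer%20Game%20of%20Good%20Sex%20%28woz-a-day%20collection%29.zip",
    base + "Dr.%20Ruth%27s%20Computer%20Game%20of%20Good%20Sex%20extras%20%28woz-a-day%20collection%29.zip",
]

# Include every .zip, then drop anything with "extras" in its name.
main_only = filter_urls(dry_run_urls, include="*.zip", exclude="*extras*")
```

Splitting the filter into a separate include and exclude pattern avoids needing any extended glob syntax at all, which is roughly what an exclusion option on the script side would have to do.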

Firehawke, Jul 18 '19 15:07

I'm using the workaround I'd previously mentioned for the time being, so this isn't an immediate need, but I'll leave this open because I still feel expanded glob capabilities would be useful in the long run.

Firehawke, Aug 16 '19 18:08

Python's fnmatch doesn't support these GNU extensions, so this would require some additional dependency. wcmatch would probably be a candidate for that.

Alternatively, what about regex? Python's standard re module is quite powerful, including negative lookaheads, which would work for the specific case above, but also lots of other fun things that fnmatch can't do (I think).
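As a rough illustration of the regex idea for the specific case above: the pattern below is just one possible way to write it, and the file names are the decoded names from the dry run. It also shows that `fnmatch` treats an extglob-style `!(...)` as literal characters rather than as a negation.

```python
import re
from fnmatch import fnmatchcase

names = [
    "00playable.woz",
    "Dr. Ruth's Computer Game of Good Sex (woz-a-day collection).zip",
    "Dr. Ruth's Computer Game of Good Sex extras (woz-a-day collection).zip",
]

# Negative lookahead: reject any name containing "extras", then require .zip.
main_zip = re.compile(r"^(?!.*extras).*\.zip$")
regex_matches = [n for n in names if main_zip.search(n)]

# fnmatch has no negation operator for whole patterns; "!(" and ")" are
# matched as ordinary characters, so this pattern selects none of the names.
extglob_style = "!(*extras*).zip"
fnmatch_matches = [n for n in names if fnmatchcase(n, extglob_style)]
```

Here `regex_matches` ends up holding only the main .zip, while `fnmatch_matches` stays empty.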

JustAnotherArchivist, Feb 11 '22 03:02