internetarchive
Make 'is_collection' return False for non-collection items (rather than simply being absent)
Not sure if I'm missing something, but is it possible to run a query and access the 'is_collection' parameter? My personal example is this: I want to download metadata from a collection, but only for the other collections inside it, not for single items.
I hadn't found a way to do it until I discovered the 'is_collection' parameter in the JSON... which doesn't seem to be accessible.
is_collection should be available as an attribute on the Item object if the item is a collection. If it's absent, it's not a collection. For example:
In [1]: from internetarchive import get_item
In [2]: item = get_item('nasa')
In [3]: item.is_collection
Out[3]: True
If you try to access the attribute on a non-collection item, where the attribute is absent, it will raise an AttributeError. Now that I think of it, that's not very helpful. It should be set to False for non-collection items. I'll keep this issue open, so we can address that in a future release.
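Until that changes, one way to avoid the AttributeError is to read the attribute with a default. This is only a minimal sketch: the helper name is_collection and the SimpleNamespace stand-ins are hypothetical, used here so the example runs without network access.

```python
from types import SimpleNamespace


def is_collection(item):
    """Return True only if the object has a truthy is_collection attribute."""
    # getattr with a default returns False instead of raising AttributeError
    # when the attribute is absent.
    return bool(getattr(item, "is_collection", False))


# Hypothetical stand-ins for Item objects (not real library objects).
collection_like = SimpleNamespace(identifier="nasa", is_collection=True)
plain_like = SimpleNamespace(identifier="some-item")

print(is_collection(collection_like))
print(is_collection(plain_like))
```

With the real library, the same helper could be called as is_collection(get_item('nasa')).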
However, I think it might be easier for you to do the filtering in your query, so you don't even have to bother checking that:
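from internetarchive import search_items

# Download every item in the collection that is not itself a collection.
for item in search_items('collection:nasa AND NOT mediatype:collection').iter_as_items():
    item.download()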
Or, from the command-line:
ia download --search 'collection:nasa AND NOT mediatype:collection'
And, from the command-line with GNU Parallel:
ia search 'collection:nasa AND NOT mediatype:collection' -i > itemlist.txt
parallel 'ia download {}' < itemlist.txt
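All three variants above share the same query string. If you want only the collections rather than only the single items, dropping the NOT should flip the filter. A small sketch of building either query follows; the function name build_query and its only_collections parameter are hypothetical, not part of the library.

```python
def build_query(collection, only_collections=False):
    """Build a search query for members of a collection.

    only_collections=False excludes collections (as in the examples above);
    only_collections=True keeps only members that are collections.
    """
    clause = "mediatype:collection"
    if not only_collections:
        clause = "NOT " + clause
    return f"collection:{collection} AND {clause}"


print(build_query("nasa"))
print(build_query("nasa", only_collections=True))
```

The resulting string can be passed to search_items() or to ia search in place of the literal query.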
I hope this answers your question, let me know if it doesn't.
Thanks, it also works with the Advanced search query, which sometimes is the easiest way to get the CSV I need. I missed the collection:nasa AND NOT mediatype:collection syntax. Thanks again.