collectionsonline

Certain category names cause the API to eliminate all params and filter on category

Open zenlan opened this issue 7 years ago • 7 comments

For instance: https://collection.sciencemuseum.org.uk/search?q=art

Categories: Materia Medica & Pharmacology Surgery Railway Posters, Notices & Handbills Photographs Photographic Technology Classical & Medieval Medicine Art Wellcome Medals Therapeutics Documents

Not all category names cause this behaviour: 'art' and 'photographs' do, while 'surgery' does not.

This is causing issues in 2 projects of mine where I expect results for the query term and cannot handle category results.

The following is from the logs of one project's API calls, where the URLs are generated by paginator buttons. Each page returns exactly the same result set, i.e. that of a category search:

[18-02-05 11:16:40:133 GMT] ** URL **
[18-02-05 11:16:40:134 GMT] "https://collection.sciencemuseum.org.uk/search/objects/images?page[number]=0&page[size]=50&q=art"
[18-02-05 11:16:41:353 GMT] id: co27959
[18-02-05 11:16:41:387 GMT] id: co8023087
[18-02-05 11:16:41:416 GMT] id: co8023088
.......
[18-02-05 11:16:42:841 GMT] id: co65431
[18-02-05 11:16:42:872 GMT] id: co67231
[18-02-05 11:18:20:938 GMT] ** URL **
[18-02-05 11:18:20:938 GMT] "https://collection.sciencemuseum.org.uk/search/objects/images?page[number]=1&page[size]=50&q=art"
[18-02-05 11:18:22:157 GMT] id: co27959
[18-02-05 11:18:22:187 GMT] id: co8023087
[18-02-05 11:18:22:219 GMT] id: co8023088
.......
[18-02-05 11:18:23:749 GMT] id: co65431
[18-02-05 11:18:23:785 GMT] id: co67231
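A client can spot this symptom by comparing the record ids returned for consecutive pages. The following is a minimal Python sketch, assuming the ids for each page have already been extracted into lists; the function name `find_repeated_pages` and the `pages` dictionary are hypothetical, and the ids are the ones visible in the log above.

```python
from typing import Dict, List


def find_repeated_pages(pages: Dict[int, List[str]]) -> List[int]:
    """Return the page numbers whose id list equals the previous page's id list."""
    repeated = []
    previous = None
    for number in sorted(pages):
        ids = pages[number]
        if previous is not None and ids == previous:
            repeated.append(number)
        previous = ids
    return repeated


# Ids abbreviated from the log; distinct pages would normally hold distinct ids.
pages = {
    0: ["co27959", "co8023087", "co8023088", "co65431", "co67231"],
    1: ["co27959", "co8023087", "co8023088", "co65431", "co67231"],
}
print(find_repeated_pages(pages))
```

A non-empty result suggests the paging parameters are being ignored, which matches the behaviour reported here.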

A second project has a page that merges results from a range of museum APIs, getting 5 from each. It was my misfortune to select the word 'art' as the default search term, which leads to the Science Museum results flooding the page and outnumbering all other results. The first set of results is also repeated for every subsequent call.

zenlan, Feb 05 '18 12:02

Ah, yes... this is likely an HTML-only feature creeping into the API/JSON queries/responses. I will take a look.

But is there a reason you're using q=art rather than searching the art category specifically (/search/objects/categories/art) or searching for specific object types, e.g. /search/objects/object_type/oil-painting?

Using q= will return anything that matches the word art (including fuzzy matches), rather than objects that are actually categorised as art. Is that really what you want?
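The three kinds of request mentioned above can be sketched as URL builders. This is only an illustration in Python: the paths are the ones quoted in this thread, while the helper names are hypothetical, and it assumes category and object type names are already in slug form (e.g. `oil-painting`).

```python
from urllib.parse import quote, urlencode

BASE = "https://collection.sciencemuseum.org.uk"


def keyword_search_url(term: str) -> str:
    """Free-text search: matches the term anywhere, including fuzzy matches."""
    return f"{BASE}/search?{urlencode({'q': term})}"


def category_url(slug: str) -> str:
    """Objects that are actually categorised under the given category."""
    return f"{BASE}/search/objects/categories/{quote(slug)}"


def object_type_url(slug: str) -> str:
    """Objects of one specific object type."""
    return f"{BASE}/search/objects/object_type/{quote(slug)}"


print(keyword_search_url("art"))
print(category_url("art"))
print(object_type_url("oil-painting"))
```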

jamieu, Feb 05 '18 13:02

The URLs in the log records I pasted show the actual queries that I use, i.e. exclusively /search/objects/images. The first query was just an ad hoc example.

zenlan, Feb 05 '18 13:02

I think there may be two separate issues here:

  • The hijacking of the keyword “art” to provide a better experience for front-end users (http://collection.sciencemuseum.org.uk/search/categories/art) creeping into the API calls. I can fix that for you this week.

  • Harvesting of large data sets via pagination. We’ve turned this off for now (past 10 pages) and are looking at a better, cursor-based means of providing that functionality. It is on the roadmap, but not something we can get in place this week. In the short term, to get the whole data set you’d be better off using the date stamps in the sitemap.xml file to crawl the site.
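The sitemap suggestion can be sketched as follows, assuming the site's sitemap.xml follows the standard sitemaps.org protocol with a `<lastmod>` date in at least `YYYY-MM-DD` form on each `<url>` entry. The sample document, its example.org URLs, and the function name `urls_modified_since` are made up for illustration; the real file may be organised differently (for example as an index of several sitemaps).

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import List

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample in the standard sitemap format, not the real file.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.org/objects/a</loc>
    <lastmod>2018-01-10</lastmod>
  </url>
  <url>
    <loc>https://example.org/objects/b</loc>
    <lastmod>2018-02-03</lastmod>
  </url>
</urlset>"""


def urls_modified_since(sitemap_xml: str, since: date) -> List[str]:
    """Return the <loc> values whose <lastmod> date is on or after `since`."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if not loc or not lastmod:
            continue  # skip entries without a usable date stamp
        # Keep only the date part, so full timestamps also parse.
        if date.fromisoformat(lastmod.strip()[:10]) >= since:
            urls.append(loc.strip())
    return urls


print(urls_modified_since(SAMPLE, date(2018, 2, 1)))
```

Re-running this with the date of the previous crawl would limit requests to records that changed since then.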

jamieu, Feb 05 '18 13:02

I am aware of the pagination issue and it does not affect this issue. I limit the queries to pages 0 - 9.

Even so, the collections search page itself seems to allow pages 0 - 10: https://collection.sciencemuseum.org.uk/search/objects/images?page[number]=10

zenlan, Feb 05 '18 14:02

Btw there is no harvesting, my projects are search apps. I don't store any data.

zenlan, Feb 05 '18 14:02

Yes, as I explained we 'hijack' the queries for category names and treat those queries differently (effectively sending you off to a different results page). I need to turn that off for the JSON/API query/response. Unlikely to get to it today, but will look at it this week.

As for the pagination issue, I've attached a list of records with images. In the short term it's probably easier for you to modify your app to use a local copy of this data. Although we do add new records/images, this doesn't happen so often that you'll be missing vast numbers of records.

smg-objs-with-images.txt

jamieu, Feb 05 '18 14:02

Thanks but I just found a workaround for this, I think. If I wrap the term in quotes, either single or double, the query remains intact.

https://collection.sciencemuseum.org.uk/search/images?q=art
https://collection.sciencemuseum.org.uk/search/images?q="art"

https://www.zenlan.com/collage/science/#art
https://www.zenlan.com/collage/science/#"art"

This workaround will suffice until there is a proper resolution.
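In a client, the quoting workaround could look roughly like the Python sketch below. The helper name `quoted_search_url` is hypothetical, the path and paging parameters are copied from the logged requests earlier in this thread, and it assumes the server treats the percent-encoded `%22` the same as a literal double quote.

```python
from urllib.parse import urlencode

BASE = "https://collection.sciencemuseum.org.uk"


def quoted_search_url(term: str, page: int = 0, size: int = 50) -> str:
    """Build an image search URL with the term wrapped in double quotes."""
    bare = term.strip("\"'")  # avoid double-wrapping an already quoted term
    params = {
        "page[number]": page,
        "page[size]": size,
        "q": f'"{bare}"',
    }
    # safe="[]" keeps the square brackets readable, as in the logged URLs.
    return f"{BASE}/search/objects/images?" + urlencode(params, safe="[]")


print(quoted_search_url("art"))
```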

zenlan, Feb 05 '18 15:02