biblio.el

Backend for HTML-only search engines

Open oatmealm opened this issue 3 years ago • 5 comments

Looking at the API, it seems it was meant for sites that return XML or JSON. Is there an example of working with CSS selectors directly on the HTML returned for a query, when that's the only option?

oatmealm, Sep 10 '20 14:09

> Looking at the API seems like it was meant for sites that return xml or json

Not really; basically each backend is responsible for extracting the data and returning it in structured form.

> is there an example of working maybe with css selectors directly on the html returned to a query when that's the only option?

There was a PR (https://github.com/cpitclaudel/biblio.el/pull/25/files), but I think it uses regexps. You'd want to use libxml plus some query selector engine (maybe https://github.com/zweifisch/enlive ?) or direct recursion over the parsed tree. I can help if you have a concrete example.
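
To make the "libxml plus direct recursion" suggestion concrete, here is a minimal sketch, assuming an Emacs built with libxml2 support. The `my/` function names, the `result` class, and the shape of the returned alists are all hypothetical; they are not part of biblio.el and do not describe any real site's markup.

```elisp
(require 'dom)
(require 'subr-x)

(defun my/html-to-dom (html)
  "Parse the string HTML into a DOM tree (requires Emacs with libxml2)."
  (with-temp-buffer
    (insert html)
    (libxml-parse-html-region (point-min) (point-max))))

(defun my/collect-by-class (node class)
  "Return all elements at or below NODE whose class attribute contains CLASS.
This is plain recursion over the (TAG ATTRS . CHILDREN) lists built by libxml."
  (when (consp node)                    ; strings are text nodes, skip them
    (let ((classes (split-string (or (dom-attr node 'class) "") " " t)))
      (append (and (member class classes) (list node))
              (mapcan (lambda (child) (my/collect-by-class child class))
                      (dom-children node))))))

(defun my/parse-results (html)
  "Return one alist per element of (hypothetical) class \"result\" in HTML."
  (mapcar (lambda (entry)
            ;; Assumption: each result entry wraps its title in a link.
            (let ((link (car (dom-by-tag entry 'a))))
              `((title . ,(string-trim (dom-texts link)))
                (url . ,(dom-attr link 'href)))))
          (my/collect-by-class (my/html-to-dom html) "result")))
```

`dom.el` also ships `dom-by-class` and `dom-by-tag`, which may be enough on their own for simple pages; the hand-written recursion only shows what such a lookup does underneath.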

cpitclaudel, Sep 10 '20 14:09

Hi. Thanks for the reply.

I'm looking at Israel's "Union List" (National Library), which seems to be a hosted Ex Libris Primo site (I'm guessing). It's a convoluted and rather slow Angular-based site, with no API as far as I can tell.

Here's a sample query (in English):

http://merhav.nli.org.il/primo-explore/search?query=any,contains,postcolonial&tab=default_tab&search_scope=ULI&vid=ULI&lang=en_US&offset=0&fromRedirectFilter=true

Fiddling around, I also found this "bare" query form: http://merhav.nli.org.il/primo_library/libweb/webservices/rest/primo-explore/v1/search.do?mode=Advanced&ct=AdvancedSearch

Would the Google Scholar example be easy to adapt in this case?

oatmealm, Sep 10 '20 15:09

The API seems to be at http://merhav.nli.org.il/primo_library/libweb/webservices/rest/primo-explore/v1/pnxs , but it apparently requires a cookie.

> Would the Google Scholar example be easy to adapt in this case?

I don't think so. This seems to be a dynamic website, so parsing the HTML won't give you anything, since it doesn't contain the results. However, it should be possible to get the JSON that the API returns and the website consumes. I would recommend writing to the website's authors at this point.

cpitclaudel, Sep 10 '20 19:09

I asked around. They held a hackathon a few years back to test out an IIIF-based API, but it seems it didn't go anywhere.

https://github.com/OriHoch/hackathon-tasks/issues/1

oatmealm, Sep 10 '20 21:09

I see. I think you can ask about the current API though: clearly the website is a JavaScript program that downloads JSON data; you should be able to download that same JSON data from ELisp; you just need to figure out the exact query and headers, and they should be able to help with that, I think.
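
As a rough sketch of what "download that same JSON data from ELisp" could look like, assuming the exact query and headers are already known: the `my/` function names, the example.org URL, and the cookie value below are placeholders, not the real Primo query or session cookie.

```elisp
(require 'url)
(require 'json)

(defun my/parse-json-body ()
  "Parse the JSON body of the HTTP response held in the current buffer."
  (goto-char (point-min))
  ;; The response headers end at the first blank line.
  (re-search-forward "^\r?\n")
  (let ((json-object-type 'alist)
        (json-array-type 'list))
    (json-read)))

(defun my/fetch-json (url headers)
  "GET URL with extra HEADERS (an alist of strings) and return the parsed JSON."
  (let* ((url-request-extra-headers headers)
         (buffer (or (url-retrieve-synchronously url)
                     (error "No response from %s" url))))
    (with-current-buffer buffer
      (prog1 (my/parse-json-body)
        (kill-buffer buffer)))))

;; Hypothetical usage; the URL and the cookie are placeholders:
;; (my/fetch-json "https://example.org/api/search?q=postcolonial"
;;                '(("Accept" . "application/json")
;;                  ("Cookie" . "JSESSIONID=PLACEHOLDER")))
```

The browser's network inspector shows the request the site itself sends, which is usually the quickest way to find the query string and headers to copy here.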

cpitclaudel, Sep 10 '20 21:09