translators IEEE Xplore: Don't use export format; save full text PDF

The "endpoint" for export formats (BibTeX, RIS, etc.) may be unreliable unless the referrer is sent, but sending it is not yet feasible for Chromium-family browsers in the Connector.

(See: https://forums.zotero.org/discussion/108011/ieee-xplore-falied; cf. #3150)

The translator is changed to a scraper that reads only the page content, thereby avoiding the endpoint call.

The pages are rendered on the client side, so we read the data source as JSON directly, without scraping the DOM.
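
As a rough illustration of reading embedded page data instead of the DOM, the sketch below pulls a JSON object assigned to a global variable out of raw HTML. The variable name `xplGlobal.document.metadata` and the helper name `extractEmbeddedJSON` are assumptions for illustration; the translator in this PR may locate the data differently.

```javascript
// Hypothetical helper: extract a JSON object literal that a page assigns to a
// global variable inside an inline <script>. The variable name is an assumption.
const METADATA_VAR = "xplGlobal.document.metadata";

function extractEmbeddedJSON(html, varName = METADATA_VAR) {
  const start = html.indexOf(varName);
  if (start === -1) return null;
  const open = html.indexOf("{", start);
  if (open === -1) return null;
  // Walk forward to the matching closing brace, ignoring braces inside strings.
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = open; i < html.length; i++) {
    const ch = html[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
    } else if (ch === '"') {
      inString = true;
    } else if (ch === "{") {
      depth++;
    } else if (ch === "}" && --depth === 0) {
      try {
        return JSON.parse(html.slice(open, i + 1));
      } catch (e) {
        return null; // Not valid JSON after all
      }
    }
  }
  return null;
}
```

Brace matching is used here rather than a regular expression because metadata values such as titles can themselves contain braces.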

In addition, the following changes are made --

The full-text PDF file is now saved as an attachment (if the user has access to it).
The translator now also works on IEEE Xplore's own full-text PDF viewer web app. This is done via an additional network request to the abstract page where the JSON metadata are hosted. A test case for this is added.
Tags (keywords) are deduplicated.
Only the electronic-media ISBN and ISSN are kept.
Other minor bugfixes.
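
A minimal sketch of the tag deduplication mentioned above. Comparing case-insensitively after trimming whitespace is an assumption made here; the thread does not say which comparison the translator uses, and `dedupeTags` is a hypothetical name.

```javascript
// Sketch: drop repeated keywords, keeping the first spelling encountered.
// Case-insensitive, whitespace-trimmed comparison is an assumption.
function dedupeTags(tags) {
  const seen = new Set();
  const result = [];
  for (const tag of tags) {
    const trimmed = tag.trim();
    const key = trimmed.toLowerCase();
    if (!trimmed || seen.has(key)) continue; // skip empties and repeats
    seen.add(key);
    result.push(trimmed);
  }
  return result;
}
```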

Sep 26 '23 10:09 zoe-translates

TODO: Clarify the detectWeb() function's logic.

Sep 26 '23 10:09 zoe-translates

We're fixing the Referer thing (for now) in the Connector. Do we still want to do this if Referer works? It sounds like there are improvements, but could those have been made to the API-based version? Presumably this is more fragile?

Sep 26 '23 11:09 dstillman

@dstillman If the referrer issue can be fixed soon for Chromium, that will help a lot, beyond merely this translator.

The new feature of PDF saving isn't dependent on the data source (API vs page data)

That said, the BibTeX and other export formats offer just the most basic metadata fields. Fields not covered include -

ISBN/ISSN
Item types may be an issue. For this technical standard (https://ieeexplore.ieee.org/document/8332112), the BibTeX identifies it as @ARTICLE; the RIS file looks slightly better, but our RIS import can't handle TY - STD yet.
Keywords are entirely missing

So the export formats will be less fragile, in the sense that the export output will still likely look sensible if and when the site's data models change. But those export formats meanwhile miss information.

In summary -

Requesting export formats -

Pros:

Easier to maintain if other aspects of site model change
More straightforward to understand
Provide some of the most essential metadata info

Cons:

Rely on an informal API; access isn't always straightforward
Some fields are missing
Possible issues with types

On-page scraping -

Pros:

No additional network requests; no reliance on informal access to API
The page already provides everything about the item; we can get as much as possible

Cons:

Harder to explain
Tied to the source format

Sep 26 '23 14:09 zoe-translates

In this case it sounds like page scraping is the better bet, since the "API" returns lossy export-format data and the "page scraper" gets the site's native representation. Page scraping being harder to explain isn't much of a drawback - the explanation for pretty much every bit of logic in a page scraper is "because that's how the website is" - and neither is being tied to the source format, since that's inevitable.

I mean, we could wrap the page-scraping routine in a try-catch and fall back to an export format if it fails, but I think that's more effort than it's worth.
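
For reference, the fallback idea could look roughly like the sketch below. Both callbacks are hypothetical stand-ins (a page-data scraper and an export-format request); neither name comes from the translator itself.

```javascript
// Sketch of the try-catch fallback. scrapePage and scrapeExport are
// hypothetical async callbacks supplied by the caller.
async function scrapeWithFallback(scrapePage, scrapeExport) {
  try {
    return { source: "page", item: await scrapePage() };
  } catch (err) {
    // Page data was missing or had an unexpected shape; use the lossy export.
    return { source: "export", item: await scrapeExport() };
  }
}
```

The cost is that both code paths then need tests and maintenance, which is the extra effort weighed above.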

Sep 26 '23 16:09 AbeJellinek

And, for what it's worth, they have an official API and it seems to be free(?).

Sep 26 '23 16:09 AbeJellinek

The IEEE API doc says that an API key is required:

> The API key you received after completing the registration process MUST be appended to EVERY query. You can find your API key under My Account.

So I guess this means No for us.

Sep 27 '23 07:09 zoe-translates

@AbeJellinek, so I doubled down on this approach, primarily because it makes it easier to account for the variety of item types. I've also added a variety of new test cases to cover them (including, fortunately, a paper that is permanently stuck in "Early Access"). To me these results look OK in Scaffold or the browser, with one exception: the full-text PDF attachment -- a problem that will happen regardless of which metadata source we use.

EDIT: This doesn't appear to be a problem for IP-authenticated subscribers; see https://github.com/zotero/translators/pull/3151#issuecomment-1754228001

The problem is that the server guards the PDF file with some behind-the-scenes "security" mechanism (I think they're using BigIP), and if some request features (strongly suspected to be the referrer + cookie combination, see below) don't match what is expected, you'll get a 302 redirect.

What this means is best illustrated by an example of "normal" browsing vs. translator workflow, for the article at https://ieeexplore.ieee.org/document/8101526 and its PDF file.

Normal browsing:

1. Go to the page at https://ieeexplore.ieee.org/document/8101526; click on the "PDF" button.
2. A "landing" or pdf-viewer page opens (https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8101526), with the PDF file in an iframe.
3. The PDF file is requested from the landing page. This passes the firewall and we're served the PDF.

Translator:

1. User initiates translation -- in this example from the document page https://ieeexplore.ieee.org/document/8101526.
2. The landing page is bypassed entirely, and the actual PDF attachment's URL is computed.
3. The PDF attachment is requested asynchronously using the computed URL by the client. The referrer is set to the page on which the translation is initiated.
4. The application firewall sees this client-initiated request as anomalous, and redirects with 302 to the landing page. This makes the PDF saving fail.

It seems that the client-initiated request for attachment (no. 3 in the above) doesn't match the features of a "natural" browser request, which triggers the application firewall.

This behaviour is perhaps most clearly seen when we initiate the translation from the landing page itself at https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8101526. If we initiate the translation from here, the same attachment URL will be computed and we'll actually be served the real PDF file in response to the attachment request from the client.
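
The two page URLs in this example differ only in where the article number sits, so one can be derived from the other. A small sketch, with URL patterns taken from the example links in this comment; the function names are hypothetical, and it assumes the standard `URL` class is available in the environment:

```javascript
// Sketch: get the article number from a document page URL
// (/document/<number>) or from a viewer URL (?arnumber=<number>).
function getArticleNumber(url) {
  const u = new URL(url);
  const fromPath = u.pathname.match(/^\/document\/(\d+)/);
  if (fromPath) return fromPath[1];
  return u.searchParams.get("arnumber"); // null when absent
}

// Build the PDF-viewer page URL for the same article.
function viewerURL(url) {
  const arnumber = getArticleNumber(url);
  if (!arnumber) return null;
  return `https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=${arnumber}`;
}
```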

I suspect that the mechanism is that the landing page response sets a session cookie which is validated by the server for the PDF request. If this cookie is valid and the referrer is the landing page, the PDF request passes. Otherwise, you're redirected to the landing page so as to receive the cookie and set the referrer. If these conditions (cookie + referrer) cannot be replicated by the client when saving the attachment, the saving will fail.

I don't think there's an easy solution from the translator side. It seems that there's some code for handling Elsevier sciencedirect PDF in xpcom/attachments.js but I'm not sure about what to make of it ATM.

Oct 09 '23 14:10 zoe-translates

OK, I did further experiments with my subscription access and found that the problem described above (https://github.com/zotero/translators/pull/3151#issuecomment-1753175789) is not an issue for the IP-authenticated subscriber. The computed PDF link will work, there will be a redirect to the "stamped" PDF (with a line like Authorized licensed use limited to: [...] Downloaded on [...] on each page), and that redirect will be handled cleanly by the client. Both OA and subscriber-only content are delivered this way. The referrer doesn't matter.

The problem reported above affects only guest access without a subscription. But in that case, the actual impact is limited. For paywalled articles, it will fail anyway. For OA articles, ~~it could be a problem, but I'll double check~~ Edit: it still fails for non-subscriber access to OA articles. In any case, the full-text auto-discovery will serve as a fallback.

Oct 10 '23 02:10 zoe-translates

> In any case, the full-text auto-discovery will serve as a fallback.

Yeah, the OA PDF will work fine for those cases. I don't think this will be that big of a deal.

I might be misunderstanding what triggers this, but https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8101526 loads without a redirect when I navigate to it directly (without first going to the catalog page).

Oct 10 '23 13:10 AbeJellinek

Thanks for helping with this. The HTML page https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8101526 (PDF-viewer "frame" page, which I inaccurately called "landing page") loads without redirection no matter how you arrive at it. But the PDF document inside the iframe on that page may be subject to firewalling. On that page, for example, the PDF file's url is https://ieeexplore.ieee.org/ielx7/83/8103362/08101526.pdf?tp=&arnumber=8101526&isnumber=8103362&ref=

For non-subscribers, if you request this URL directly in a new browsing session, it redirects back to the "frame" or PDF-viewer page (see note below). If you craft a request with the correct cookie and referrer, as if you were loading the iframe in a browsing session, you get the PDF. But if you send the cookie with the wrong referrer, you'll be redirected to the frame page.

It is the "wrong" referrer in the Zotero-client-initiated request that trips the firewall.

For IP-auth'ed subscribers this is not a problem. Only the cookie (and IP, I presume) matters. The referrer can be anything.
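
To summarize the observations so far, here is a toy model of the outcomes. It only restates what was seen in these experiments and is not IEEE's actual logic; all names are hypothetical, and cookie handling for subscribers is left out because it wasn't tested separately.

```javascript
// Toy model of the observed behaviour; NOT the server's real implementation.
function pdfRequestOutcome({ ipAuthenticated, hasSessionCookie, referrer, viewerPage }) {
  if (ipAuthenticated) {
    // Subscribers: the referrer can be anything.
    return "pdf";
  }
  // Guests: both the session cookie and a viewer-page referrer appear to be needed.
  if (hasSessionCookie && referrer === viewerPage) {
    return "pdf";
  }
  return "redirect-to-viewer";
}
```

A client-initiated attachment request from the document page corresponds to the last branch: the cookie may be present, but the referrer is the document page rather than the viewer page.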

So anyway I don't think this will be a breaking issue for non-subscribers, because they still get a second go at the PDF via discovery (not that this is guaranteed to hit). The idea of somehow fine-tuning the client's request headers from translator (e.g. by additional fields in the attachment object) might be a little tempting, but so far I haven't seen another instance where that might be useful.

Note: you may even be redirected to the wrong frame page. I tried this (curl'ing the PDF direct link with the barest request possible, no cookie, no referrer)

curl -D h.txt 'https://ieeexplore.ieee.org/ielx7/83/8103362/08101526.pdf?tp=&arnumber=8101526&isnumber=8103362&ref='

and got the Location: header pointing to the frame page of a wrong article (I got https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7448813 but your experience may vary). This was comical and I'm not sure if the firewall was trolling me.

Oct 10 '23 15:10 zoe-translates

Oh, right, I didn't realize that that page wrapped the PDF in an iframe (and wasn't just the PDF itself). I could've sworn we had code in the client to extract PDFs from iframes, but I can't find it now.

Oct 10 '23 16:10 AbeJellinek

> I could've sworn we had code in the client to extract PDFs from iframes, but I can't find it now.

We have code in the Connector that (if I recall) shows the PDF icon if there's a PDF in a frame and the parent page doesn't detect (and maybe that puts PDF as an option in the Save to Zotero menu even if there is a translator?).

Oct 10 '23 16:10 dstillman

Ah, right.

We could expand our ScienceDirect PDF code to support this case (add the domain, make it detect PDFs in child frames) but it hardly seems worth it, since anyone without full access will only be getting open-access PDFs anyway.

Oct 10 '23 17:10 AbeJellinek
