stumptown-content
Scraping should probably be done via the Document JSON API instead
I'm naive in my understanding of the progress on scrape-mdn.js, but one thing I've understood is that when it pulls down a page and wants to find out what the BCD table identifier was, it does so by making a second HTTP fetch with + '?raw'.
Once this lands, and given some time for pages to re-render, the Document JSON API should have all of this in a much better package. It's a structured JSON document after all. (You'd still need to pull out JSDOM to parse the blobs of HTML text within.)
The Document JSON API can be reached in two ways: either over HTTP (e.g. https://developer.mozilla.org/api/v1/doc/en-US/Web/HTML/Element/progress) or you pull it directly from S3. The name of the bucket is (uh, I need to look this up), which means that you can list objects in big batches. See https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#listObjects-property for example.
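To make the HTTP route concrete, here is a minimal sketch. `documentApiUrl` and `fetchDocument` are hypothetical helper names, not part of scrape-mdn.js, and the sketch assumes a global `fetch` (Node 18+ or a browser). It makes no assumption about the shape of the JSON that comes back.

```javascript
// Sketch only: these helpers are hypothetical, not part of scrape-mdn.js.
const API_BASE = "https://developer.mozilla.org/api/v1/doc";

// Build the Document JSON API URL for a locale and slug,
// e.g. ("en-US", "Web/HTML/Element/progress").
function documentApiUrl(locale, slug) {
  // Encode each path segment on its own so the slashes between them survive.
  const path = slug.split("/").map(encodeURIComponent).join("/");
  return `${API_BASE}/${encodeURIComponent(locale)}/${path}`;
}

// Fetch and parse one document. Assumes a global fetch is available.
async function fetchDocument(locale, slug) {
  const response = await fetch(documentApiUrl(locale, slug));
  if (!response.ok) {
    throw new Error(`HTTP ${response.status} for ${locale}/${slug}`);
  }
  return response.json();
}
```

That is a single request per page, as opposed to the page fetch plus the '?raw' fetch.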
One caveat is that a list of contributors is NOT available in the Document JSON API. Since the new React MDN isn't using a list of contributors, it got removed from the API.
Actually, I wonder if it's all that different. The current scraper is pretty quick to parse and extract the stuff it needs, and it isn't meant to be used for downloading ALL documents, which is only really feasible with the AWS S3 Node SDK.
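For the "ALL documents" case, a rough sketch of batch listing with the Node SDK could look like this. It assumes `s3` is an AWS SDK v2 client (`new AWS.S3()`), and it uses `listObjectsV2` rather than the `listObjects` call linked above, because its continuation token makes paging simpler. `listAllKeys` is a hypothetical name, and the bucket and prefix are left as parameters since the real bucket name isn't known here.

```javascript
// Sketch only: `listAllKeys` is hypothetical; `s3` is assumed to be an
// AWS SDK v2 S3 client. Bucket and prefix are supplied by the caller.
async function listAllKeys(s3, bucket, prefix) {
  const keys = [];
  let token;
  do {
    const params = { Bucket: bucket, Prefix: prefix };
    // Only send a continuation token once a previous page has returned one.
    if (token) params.ContinuationToken = token;

    const page = await s3.listObjectsV2(params).promise();
    for (const object of page.Contents || []) {
      keys.push(object.Key);
    }
    // IsTruncated tells us whether another batch is waiting.
    token = page.IsTruncated ? page.NextContinuationToken : undefined;
  } while (token);
  return keys;
}
```

Passing the client in as an argument keeps the function easy to try out against a stub before pointing it at the real bucket.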
Perhaps this issue is just a bunch of loud notes :)
https://github.com/mdn/kumascript/pull/1247 has landed and pushed to prod.
Pages like this one that have been re-rendered now have <div class="bc-data" id="bcd:html.elements.details"> and it's already in the API too: https://developer.mozilla.org/api/v1/doc/en-US/Web/HTML/Element/details
To test this at scale, we'd have to trigger a mass re-render of a bunch of docs or we can manually Shift-Refresh a bunch of pages.
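As a quick illustration of how the new id could be read back out of rendered HTML, here is a small sketch. `extractBcdQueries` is a hypothetical helper, and it uses a regular expression only to stay dependency-free; a real DOM parser such as JSDOM would be more robust against unusual markup.

```javascript
// Sketch only: `extractBcdQueries` is hypothetical. It returns every BCD
// query found in ids of the form id="bcd:..." within a string of HTML.
function extractBcdQueries(html) {
  const queries = [];
  // Match any tag carrying an id attribute that starts with "bcd:".
  const pattern = /<[^>]*\sid="bcd:([^"]+)"[^>]*>/g;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    queries.push(match[1]);
  }
  return queries;
}

const sample = '<div class="bc-data" id="bcd:html.elements.details"></div>';
console.log(extractBcdQueries(sample)); // [ 'html.elements.details' ]
```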
I can run a re-render of all of the documents that use the compat macro in Kuma, if that helps. A new BCD release will probably be deployed today, so that's another reason to do that.
@peterbe I'll run a re-render of all of the documents that use the compat macro tonight (it takes a while; I started work on a way to speed it up a while ago, but it's not ready yet, sigh), so you should have a clean, fresh slate to test with tomorrow morning. I'll let you know how it goes in this thread.
FYI, in a video meeting @peterbe and I discussed the re-rendering of all of the documents that use the compat macro, and decided that it wasn't necessary at this time.