stumptown-content Scraping should probably be done via the Document JSON API instead

Open peterbe opened this issue 4 years ago • 6 comments

I'm naive in my understanding of the progress on scrape-mdn.js, but one thing I've understood is that when it pulls down a page, it wants to find out what the BCD table identifier is, and the way it does that is with a second HTTP fetch of the URL + '?raw'.

Once this lands, and given some time for pages to re-render, the Document JSON API should have all of this in a much better package. It's a structured JSON document after all. (You'd still need to pull out JSDOM to parse the blobs of HTML text within.)

The Document JSON API can be reached in two ways: either over HTTP (e.g. https://developer.mozilla.org/api/v1/doc/en-US/Web/HTML/Element/progress) or by pulling it directly from S3. The name of the bucket is (uh, I need to look this up), and reading from the bucket means you can list objects in big batches. See https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#listObjects-property for example.
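A rough sketch of both access paths, under stated assumptions. The URL shape is copied from the example above. `documentApiUrl`, `fetchDocument` and `listAllKeys` are hypothetical helper names, not part of scrape-mdn.js. The `s3` argument is assumed to be an AWS SDK v2 `new AWS.S3()` client, and the bucket name and key prefix are left as parameters because they aren't known here.

```javascript
// Sketch only: URL shape taken from the example URL in this issue.
const API_BASE = "https://developer.mozilla.org/api/v1/doc";

// Build the Document JSON API URL for a locale and slug (hypothetical helper).
function documentApiUrl(locale, slug) {
  // Encode each path segment but keep the "/" separators.
  const path = slug.split("/").map(encodeURIComponent).join("/");
  return `${API_BASE}/${locale}/${path}`;
}

// Way 1: fetch one document over HTTP (global fetch assumes Node 18+).
async function fetchDocument(locale, slug) {
  const response = await fetch(documentApiUrl(locale, slug));
  if (!response.ok) {
    throw new Error(`${response.status} for ${locale}/${slug}`);
  }
  return response.json();
}

// Way 2: list every key in a bucket, one batch per listObjects call.
// `s3` is assumed to be an AWS SDK v2 client; `bucket` is whatever the
// real bucket name turns out to be.
async function listAllKeys(s3, bucket, prefix = "") {
  const keys = [];
  let marker;
  for (;;) {
    const params = { Bucket: bucket, Prefix: prefix };
    if (marker) params.Marker = marker;
    const page = await s3.listObjects(params).promise();
    const contents = page.Contents || [];
    for (const object of contents) keys.push(object.Key);
    if (!page.IsTruncated || contents.length === 0) break;
    // NextMarker is only returned when a Delimiter is used, so fall
    // back to the last key of the batch.
    marker = page.NextMarker || contents[contents.length - 1].Key;
  }
  return keys;
}
```

Passing the client in as an argument keeps the listing loop testable without AWS credentials. For a one-off script, creating the client inline would work just as well.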

Sep 25 '19 14:09 peterbe

One caveat is that a list of contributors is NOT available in the Document JSON API. Since the new React MDN isn't using a list of contributors, it got removed from the API.

Sep 25 '19 14:09 peterbe

Actually, I wonder if it's all that different. The current scraper is pretty quick to parse and extract the stuff it needs, and it isn't meant to be used for downloading ALL documents, which is only really feasible with the AWS S3 Node SDK.

Perhaps this issue is just a bunch of loud notes :)

Sep 25 '19 14:09 peterbe

https://github.com/mdn/kumascript/pull/1247 has landed and pushed to prod.

Pages like this one that have been re-rendered now have <div class="bc-data" id="bcd:html.elements.details">, and it's already in the API too: https://developer.mozilla.org/api/v1/doc/en-US/Web/HTML/Element/details
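A minimal sketch of how a scraper might read that identifier straight out of a blob of rendered HTML, instead of doing the second '?raw' fetch. `extractBcdQueries` is a hypothetical helper name. It uses a plain regular expression to stay dependency-free, so it assumes double-quoted attributes as in the markup above. Parsing with JSDOM would be more robust.

```javascript
// Return the BCD query of every <div class="bc-data" id="bcd:..."> in `html`.
// Sketch only: assumes double-quoted attributes, as in the example markup.
function extractBcdQueries(html) {
  const queries = [];
  const divTags = html.match(/<div\b[^>]*>/gi) || [];
  for (const tag of divTags) {
    const classAttr = tag.match(/\sclass="([^"]*)"/);
    const idAttr = tag.match(/\sid="bcd:([^"]+)"/);
    const classes = classAttr ? classAttr[1].split(/\s+/) : [];
    if (classes.includes("bc-data") && idAttr) {
      queries.push(idAttr[1]);
    }
  }
  return queries;
}

const sample = '<div class="bc-data" id="bcd:html.elements.details"></div>';
console.log(extractBcdQueries(sample)); // [ 'html.elements.details' ]
```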

To test this at scale, we'd have to trigger a mass re-render of a bunch of docs or we can manually Shift-Refresh a bunch of pages.

Sep 26 '19 12:09 peterbe

I can run a re-render of all of the documents that use the compat macro in Kuma, if that helps. A new BCD release will probably be deployed today, so that's another reason to do that.

Sep 26 '19 16:09 escattone

@peterbe I'll run a re-render of all of the documents that use the compat macro tonight (it takes a while -- I started work on a way to speed it up a while ago, but it's not ready yet -- sigh), so you should have a clean fresh slate to test with tomorrow morning. I'll let you know how it goes in this thread.

Sep 26 '19 19:09 escattone

FYI, in a video meeting @peterbe and I discussed the re-rendering of all of the documents that use the compat macro, and decided that it wasn't necessary at this time.

Sep 27 '19 00:09 escattone
