Switch from MDN-doc-scraper to custom open data format

Open slimsag opened this issue 6 years ago • 14 comments

Right now our generated packages (elem, prop, etc) are created by scraping the MDN documentation website and pulling relevant information. At first, this gave us good coverage of the entire DOM API and worked well, but now it is a waste of time and gives us very inaccurate results.

MDN Background

As time went on, I noticed that a significant portion of the pages on the MDN use very inconsistent formats, which makes the docs extremely hard to scrape accurately:

  • Some pages use a table layout, while others use separate headers to represent each object method/property.
  • Many pages have incorrect/inconsistent icon markers for experimental/deprecated/donotuse features.
  • Documentation strings do not start in any consistent English way ("background" -- background color, "background" alters the background color, "background" property, etc.) which means our godocs read very strangely.

To resolve the above issues, I spent more than 80 hours contributing to the MDN. I found the best layouts on the most popular MDN pages and ensured that other pages followed the same style consistently in page layout and wording.

Unfortunately, this was mostly in vain. The MDN is a bit like the wild-west: anyone can make changes if they have a GitHub account, without any peer review(!), and anyone can revert changes without any peer review(!).

Although I made over 85 pages use a layout consistent with the rest of the MDN and clearly documented my changes as doing this, almost 18 of those pages were reverted by another MDN contributor without any reason mentioned in the history. He had no public contact information, so I tried to reach him via the IRC channel and mailing lists, but after several weeks I still had no way to contact him.

With no way to contact this contributor, I made attempts to change the page layout on a few of those pages again and directly mentioned in the changelog that his revert made the page not follow the consistent style used on other popular MDN pages, and that I was trying to adopt a consistent MDN format. Again, the changes were reverted.

Better approach

The MDN's content license is permissive enough for us to use their documentation in our godocs, and so I think we should use an alternative method of generating our packages from (initially) the MDN documentation.

What this would look like is creating a separate Vecty repository, maybe github.com/vecty/webdoc with some type of file format (YAML, XML, etc) that documents individual web APIs (objects, data types, function signatures, docstrings, etc) for use in our generators.
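As a rough sketch of what one entry in such a format could look like, here is an XML variant decoded with Go's standard encoding/xml package. The element names, attribute names, and Go types below are all hypothetical; nothing here is an agreed-upon webdoc format, and the sample entry is hand-written rather than taken from the MDN.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Interface describes one web API object. Every field and element name
// here is hypothetical; no such format has been agreed on.
type Interface struct {
	Name       string     `xml:"name,attr"`
	Tag        string     `xml:"tag,attr"`
	Doc        string     `xml:"doc"`
	Properties []Property `xml:"property"`
}

// Property describes a single property of an interface.
type Property struct {
	Name string `xml:"name,attr"`
	Type string `xml:"type,attr"`
	Doc  string `xml:"doc"`
}

// sample is a hand-written illustrative entry.
const sample = `
<interface name="HTMLAnchorElement" tag="a">
  <doc>Anchor creates a hyperlink to another resource.</doc>
  <property name="href" type="string">
    <doc>Href is the URL that the hyperlink points to.</doc>
  </property>
</interface>`

// ParseInterface decodes one interface description.
func ParseInterface(data string) (Interface, error) {
	var iface Interface
	err := xml.Unmarshal([]byte(data), &iface)
	return iface, err
}

func main() {
	iface, err := ParseInterface(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s <%s>: %d properties\n", iface.Name, iface.Tag, len(iface.Properties))
}
```

Keeping the docstrings in the data file, already worded in Go style, would let the generators emit them verbatim instead of rewording scraped text.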

This is similar in concept to how Glow, a Go OpenGL binding generator I have worked on, operates.

slimsag avatar Sep 02 '17 23:09 slimsag

Any thoughts on perhaps using WHATWG documentation instead? Producing a locally maintained custom representation for an evolving specification seems like a pretty serious commitment over time.

pdf avatar Sep 03 '17 01:09 pdf

Yes, that is something to look into.

slimsag avatar Sep 03 '17 02:09 slimsag

That MDN experience sounds awful. It's not acceptable to revert a reasonable change without giving a reason or rationale, much less to do so without any contact information or due process. I'll definitely take it into account in the future and be unlikely to contribute myself.

On the topic of alternative sources, is there anything like WebIDL available for these APIs?

When I was thinking about writing a Go wrapper for the WebGL 2 API, the best source of its API I found was the specification, expressed as WebIDL. E.g., see webgl2.idl.

dmitshur avatar Sep 06 '17 00:09 dmitshur

On the topic of alternative sources, is there anything like WebIDL available for these APIs?

I had a brief look the other day, and AFAICT since HTML5, IDL definitions are pretty sparse, and don't come at all close to covering the full spec.

pdf avatar Sep 06 '17 01:09 pdf

If not WebIDL, what do browsers use as reference to implement these things?

dmitshur avatar Sep 06 '17 01:09 dmitshur

Luck ;-). I also wondered this and went looking for test suites; what I found looked like a total shambles of hand-written stuff.

pdf avatar Sep 06 '17 02:09 pdf

If we can find something that outlines the API signatures (symbol names and data types), then we can use a more additive approach (i.e. to ensure we have good coverage of the ever-changing API).

slimsag avatar Sep 07 '17 04:09 slimsag

Relevant news: https://blogs.windows.com/msedgedev/2017/10/18/documenting-web-together-mdn-web-docs/. /cc @slimsag

Hopefully the consolidation results in improvements to quality and consistency of Web docs.

dmitshur avatar Oct 19 '17 01:10 dmitshur

Since the Blink repository is over 5 GB and cloning it takes quite a while, I've created a subrepo that will host just the *.idl files from the blink repository and created a little Go script to update the repository. Others (/cc @myictv ) may find this useful.

https://github.com/vecty/blink-idl

slimsag avatar Oct 29 '17 23:10 slimsag

@myitcv (spelled the name wrong)

slimsag avatar Oct 29 '17 23:10 slimsag

@slimsag thanks very much for the cc

myitcv avatar Oct 30 '17 12:10 myitcv

That's a good find, much better coverage than anything I found.

pdf avatar Nov 04 '17 00:11 pdf

Yeah my research basically uncovered that those blink IDL files would provide:

  1. All of the JS type names (HTMLBodyElement, equivalents for SVG, etc).
  2. All of their properties (href for an HTMLAnchorElement, title, etc).

But it's not all perfect. We would need:

  1. Documentation for those types and properties (like what the MDN has, but preferably in Go style because right now we do a lot of hacks to reword MDN documentation to match Go style).
  2. Some form of mapping from JS type name (HTMLBodyElement) -> HTML tag name (body).
  3. A way to actually parse those IDL files (a language with inheritance, etc. in itself).
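For the type-name-to-tag mapping in item 2, one possible shape is a naming heuristic backed by a hand-maintained override table. This is only a sketch: TagName and tagOverrides are hypothetical names, the table is deliberately incomplete, and the heuristic guesses wrong for interfaces that do not correspond to exactly one tag (HTMLHeadingElement covers h1 through h6, for example), so real data would still need manual review.

```go
package main

import (
	"fmt"
	"strings"
)

// tagOverrides lists interfaces whose tag name cannot be derived from the
// interface name. It is hand-maintained and deliberately incomplete.
var tagOverrides = map[string]string{
	"HTMLAnchorElement":    "a",
	"HTMLParagraphElement": "p",
	"HTMLImageElement":     "img",
	"HTMLUListElement":     "ul",
	"HTMLOListElement":     "ol",
	"HTMLTableRowElement":  "tr",
}

// TagName guesses the HTML tag for a JS interface name. ok is false when
// the name does not have the form HTML<Something>Element.
func TagName(jsType string) (tag string, ok bool) {
	if t, found := tagOverrides[jsType]; found {
		return t, true
	}
	name := strings.TrimPrefix(jsType, "HTML")
	if name == jsType || !strings.HasSuffix(name, "Element") {
		return "", false
	}
	name = strings.TrimSuffix(name, "Element")
	if name == "" {
		return "", false // HTMLElement itself has no single tag.
	}
	return strings.ToLower(name), true
}

func main() {
	for _, n := range []string{"HTMLBodyElement", "HTMLAnchorElement", "HTMLElement"} {
		tag, ok := TagName(n)
		fmt.Printf("%s -> %q (ok=%v)\n", n, tag, ok)
	}
}
```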

I think the IDL files will be good for validating that we cover the entire (moving) spec going forward. But not good for producing the actual documentation, etc. This will probably be some mixture of a scraper like what we have today for the MDN and manual work -- I'm not sure.
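A coverage check of that kind could be sketched roughly as follows. It pulls interface names out of IDL text with a regular expression and reports the ones the generators do not yet cover. The function names and the sample IDL are made up for illustration, and the regular expression is not a WebIDL parser: it ignores inheritance, extended attributes placed on the same line, and comments.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// interfaceRE matches the start of a declaration such as
// "interface HTMLBodyElement : HTMLElement {". It is a rough heuristic,
// not a WebIDL parser.
var interfaceRE = regexp.MustCompile(`(?m)^\s*(?:partial\s+)?interface\s+(\w+)`)

// InterfaceNames returns the distinct interface names declared in src,
// sorted alphabetically.
func InterfaceNames(src string) []string {
	seen := map[string]bool{}
	for _, m := range interfaceRE.FindAllStringSubmatch(src, -1) {
		seen[m[1]] = true
	}
	names := make([]string, 0, len(seen))
	for n := range seen {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// Missing returns the names in idl that are absent from generated.
func Missing(idl []string, generated map[string]bool) []string {
	var out []string
	for _, n := range idl {
		if !generated[n] {
			out = append(out, n)
		}
	}
	return out
}

// sampleIDL is a simplified, hand-written stand-in for real IDL files.
const sampleIDL = `
interface HTMLBodyElement : HTMLElement {
    attribute DOMString text;
};

interface HTMLAnchorElement : HTMLElement {
    attribute DOMString target;
};
`

func main() {
	// generated stands in for the set of types our packages already cover.
	generated := map[string]bool{"HTMLBodyElement": true}
	fmt.Println(Missing(InterfaceNames(sampleIDL), generated))
}
```

Run against the real IDL files, a report like this would only flag gaps; the documentation and tag names for each missing type would still have to come from somewhere else.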

Also, WHATWG has a 'developer edition' (targeting web developers), but valuable, concise information there seems sparse (although the documentation for events seems quite good): https://html.spec.whatwg.org/dev/indices.html#index

slimsag avatar Nov 04 '17 03:11 slimsag

I don't think we should let recreating documentation get in the way of this. We can link to a relevant page from the godocs. Most are going to be self-explanatory to any web developer. Template systems don't document every HTML element they support. Another IDL we can use is the TypeScript definitions, which are pretty compact and parsable.

progrium avatar Oct 30 '18 21:10 progrium