dartdoc icon indicating copy to clipboard operation
dartdoc copied to clipboard

remove the use of package:html from dartdoc

Open devoncarew opened this issue 3 years ago • 5 comments

We're trying to remove the SDK repo dep on package:html. It looks like this is the last thing consuming package:html from the SDK.

From looking at the code here, it looks like package:html is used for two things here:

Validating that links are correct in the generated html. Dartdoc generates html, and then optionally, re-parses that html and validates that it doesn't have urls pointing to files that don't exists. This feature is not enabled by default for the dart doc command (it's available as an option but off by default). Some options here are to remove the validation entirely, or, implement it before the dartdoc html generation phase (not rely on re-parsing the generated html).

Light modification of the html generated from markdown. It looks like what dartdoc is doing is 1) parsing markdown 2) generating html from that markdown (via a visitor) 3) validation of the generated html and tweaks to some of the generated html. I think this use of package:html could be replaced by slightly modifying the markdown => html visitor - doing validation and tree modifications there, instead of re-parsing the generated html.

devoncarew avatar Jun 27 '22 17:06 devoncarew

Here are the two places where package:html is imported:

lib/src/render/documentation_renderer.dart:import 'package:html/dom.dart' as dom;
lib/src/render/documentation_renderer.dart:import 'package:html/parser.dart' show parseFragment;
lib/src/validator.dart:import 'package:html/parser.dart' show parse;

devoncarew avatar Jun 27 '22 17:06 devoncarew

Some options here are to remove the validation entirely,

Yeah I've considered this. I think it's mostly a tool to make sure that changes to dartdoc don't result in broken links. It's not so useful to end-users, as the code should be tested enough that there are not surprises when generating docs for a user's package. I haven't decided whether we can remove it yet...

or, implement it before the dartdoc html generation phase (not rely on re-parsing the generated html).

I've considered this too. Again, you would lose some functionality, since templates can have links, or can put arbitrary elements in a 'href' attribute. This could be viable though.

Light modification of the html generated from markdown. [...] 3) validation of the generated html and tweaks to some of the generated html.

Is this the validation for security purposes, to ensure we're not generating javascript: links?

I think this use of package:html could be replaced by slightly modifying the markdown => html visitor - doing validation and tree modifications there, instead of re-parsing the generated html.

Might be possible, but might not. See Markdown's XSS Vulnerability which is why we don't do any validation in the Markdown package.

srawlins avatar Jun 27 '22 18:06 srawlins

Some historical notes....

On Mon, Jun 27, 2022 at 11:30 AM Sam Rawlins @.***> wrote:

Some options here are to remove the validation entirely,

Yeah I've considered this. I think it's mostly a tool to make sure that changes to dartdoc don't result in broken links. It's not so useful to end-users, as the code should be tested enough that there are not surprises when generating docs for a user's package. I haven't decided whether we can remove it yet...

This is correct. It is not as useful now as dartdoc is better tested than it once was. Having it off by default (or even in a separate tool) is definitely reasonable and I could see a case for removing it. One advantage of having the link checking in dartdoc is that with the debugger and printf you can figure out more directly why dartdoc put a broken link there, if there is one, because you can find the object that created it at runtime. Again, as dartdoc's code structure incrementally improves this becomes less important as it is easier to figure out.

or, implement it before the dartdoc html generation phase (not rely on re-parsing the generated html).

I've considered this too. Again, you would lose some functionality, since templates can have links, or can put arbitrary elements in a 'href' attribute. This could be viable though.

Light modification of the html generated from markdown. [...] 3) validation of the generated html and tweaks to some of the generated html.

Is this the validation for security purposes, to ensure we're not generating javascript: links?

I think this use of package:html could be replaced by slightly modifying the markdown => html visitor - doing validation and tree modifications there, instead of re-parsing the generated html.

Might be possible, but might not. See Markdown's XSS Vulnerability https://github.com/showdownjs/showdown/wiki/Markdown%27s-XSS-Vulnerability-(and-how-to-mitigate-it) which is why we don't do any validation in the Markdown package.

This validation was never intended to do anything regarding security, was intended for debugging only.

Janice

— Reply to this email directly, view it on GitHub https://github.com/dart-lang/dartdoc/issues/3064#issuecomment-1167719774, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADLWPW3QVGW7AJ5GMFRPTITVRHXLRANCNFSM5Z7HU4GA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

jcollins-g avatar Jun 29 '22 19:06 jcollins-g

This is correct. It is not as useful now as dartdoc is better tested than it once was. Having it off by default (or even in a separate tool) is definitely reasonable

Ah - thanks for the context. It sounds like two possible options here are to:

  • remove the functionality entirely, or
  • keep it, but move it from a user-time flag, to a dev-time / CI check for the libraries that dartdoc integration tests against (itself? flutter?)

For the 2nd use of package:html - validation of markdown generated html:

This validation was never intended to do anything regarding security, was intended for debugging only.

I'm not sure if this could also just be removed, or if we think we do get value here wrt mitigating markdown xss issues. I'm not sure if we run the readme.md content through the above validator as well? Or what pub.dev does wrt sanitizing the user provided markdown (cc @jonasfj ).

devoncarew avatar Jul 01 '22 15:07 devoncarew

Or what pub.dev does wrt sanitizing the user provided markdown

pub.dev uses sanitize_html it's aims to implement the same rules as Github uses when sanitizing GFM.

I'm not particularly proud of it :see_no_evil:, but this is also what dartdoc uses, we just vendored the sanitize_html logic in dartdoc to mitigate any risk of XSS issues: https://github.com/dart-lang/dartdoc/blob/5cb56cd3f4fa7f8473bcbf0126bb1a0d55629cac/lib/src/render/documentation_renderer.dart#L91-L123

We could probably consider promoting sanitize_html to be a tools.dart.dev package, it's not super generic, it's really narrowly aimed at sanitizing rendered GFM.

jonasfj avatar Jul 13 '22 11:07 jonasfj