optimization: tokenize HTML or process textually entirely

Open stapelberg opened this issue 8 years ago • 8 comments

Tokenizing shaves off about 1 minute on a 6 minute rendering of Debian unstable.

The code is not entirely straightforward to port because the cross-reference detection is HTML-tag-agnostic (e.g. it has to recognize <i>crontab</i>(5)), which requires us to keep state after all.
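
To illustrate why state is needed, here is a minimal sketch of tag-agnostic detection over a token stream. It is not debiman's code: the `tokens` splitter is a naive, hand-rolled stand-in for a real HTML tokenizer, and the names `tokens`, `trailingWord` and `crossRefs` are made up for this example. It only covers the case where the name and the section are separated by tags; references inside a single text token would need an additional plain-text match.

```go
package main

import (
	"regexp"
	"strings"
)

// sectionRe matches a section suffix such as "(5)" or "(3p)" at the very
// start of a text token.
var sectionRe = regexp.MustCompile(`^\(([0-9][a-z0-9]*)\)`)

// tokens naively splits an HTML fragment into tag tokens ("<i>") and text
// tokens. Assumption: no "<" or ">" occurs inside attribute values or text.
func tokens(frag string) []string {
	var out []string
	for len(frag) > 0 {
		if frag[0] == '<' {
			end := strings.IndexByte(frag, '>')
			if end < 0 {
				return append(out, frag) // unterminated tag: keep the rest as-is
			}
			out = append(out, frag[:end+1])
			frag = frag[end+1:]
			continue
		}
		end := strings.IndexByte(frag, '<')
		if end < 0 {
			end = len(frag)
		}
		out = append(out, frag[:end])
		frag = frag[end:]
	}
	return out
}

// trailingWord returns the last word of text, or "" if text ends in
// whitespace (a following "(5)" would then not be adjacent to the word).
func trailingWord(text string) string {
	fields := strings.Fields(text)
	if len(fields) == 0 || strings.TrimRight(text, " \t\n") != text {
		return ""
	}
	return fields[len(fields)-1]
}

// crossRefs returns "name(section)" for every reference whose name and
// section suffix are separated only by tags.
func crossRefs(frag string) []string {
	var refs []string
	lastWord := "" // the state carried from one text token to the next
	for _, tok := range tokens(frag) {
		if strings.HasPrefix(tok, "<") {
			continue // tags are skipped and do not reset the state
		}
		if m := sectionRe.FindStringSubmatch(tok); m != nil && lastWord != "" {
			refs = append(refs, lastWord+"("+m[1]+")")
		}
		lastWord = trailingWord(tok)
	}
	return refs
}
```

Calling `crossRefs` on the fragment `see <i>crontab</i>(5) for details` yields the single reference `crontab(5)`, because `lastWord` survives the closing `</i>` tag.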

If we could improve mandoc’s cross reference detection and id generation, we could probably get away with textually processing the HTML, which has the potential to shave off another 30 seconds.

stapelberg avatar Jan 15 '17 14:01 stapelberg

Another interesting measurement: peak memory usage during conversion is reduced by about 150MB when using HTML tokenization instead of HTML parsing (with -concurrency_render=20).

stapelberg avatar Jan 15 '17 14:01 stapelberg

I decided to not work on this for the time being, unless it becomes a blocker for anything. Help welcome :).

stapelberg avatar Jan 28 '17 20:01 stapelberg

The post-processing for cross reference detection is necessary only for the man pages written in the old man(7) language, which is not semantic: references are usually written with the .BR or .IR macros. I think it should really be improved in mandoc itself, also as a way of working on https://github.com/Debian/debiman/issues/56.

Until then, you can probably detect if the manual is written in man(7) or mdoc(7) and post-process only the first case :wink:
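
A rough sketch of such a check, assuming the usual convention that mdoc(7) pages open with the .Dd macro and man(7) pages with .TH. The function name `macroLanguage` is made up for this example, and the sketch ignores roff subtleties such as the alternative `'` control character or whitespace between the dot and the macro name.

```go
package main

import "strings"

// macroLanguage guesses the macro language of roff source from the first
// .Dd (mdoc) or .TH (man) request it encounters.
// It returns "mdoc", "man", or "" if neither macro is found.
func macroLanguage(src string) string {
	for _, line := range strings.Split(src, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue // blank line
		}
		switch fields[0] {
		case ".Dd":
			return "mdoc"
		case ".TH":
			return "man"
		}
	}
	return ""
}
```

The cross-reference post-processing could then be limited to pages for which this returns "man".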

lahwaacz avatar Aug 27 '17 13:08 lahwaacz

That’s orthogonal to the issue in this ticket, I think: we do our own cross-referencing for internationalization.

stapelberg avatar Aug 27 '17 13:08 stapelberg

Could you elaborate on what else is necessary to post-process besides the example <i>crontab</i>(5) above? And where does the internationalization part come in?

lahwaacz avatar Aug 27 '17 14:08 lahwaacz

Have a look at https://github.com/Debian/debiman/blob/7d479b8e5480a069d2898cc6940a4642a3d15395/internal/convert/convert.go#L249

Post-processing consists of 3 steps:

  1. We strip <html>, <head> and <body> tags because we’re inserting the resulting HTML into an existing document.
  2. We set IDs for each heading. I know that mandoc ≥ 1.14.2 does this as well, but unfortunately with a slightly different algorithm than we use, so we need to keep ours in order to not break existing links.
  3. We find cross-references and URLs and turn them into links.

Notably, step 3 finds cross-references even if they include formatting directives (such as the italic tag in the example).

Internationalization in this context means linking to the best language match for the target, as viewed from the source. For example, if the user is browsing manpages in Danish, but the target is only available in Norwegian and English, then we link to the Norwegian version. However, if the target is only available in, say, Italian and English, we’d link to the English version.
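
As a rough illustration of that selection rule (not the actual implementation), here is a sketch with a hand-written table of related languages. The `related` table, the `fallback` constant and the `bestLanguage` function are all hypothetical; a proper language matcher, such as the one in golang.org/x/text/language, would replace the table in practice.

```go
package main

// related maps a language to close relatives a reader is assumed to
// understand, in order of preference. The entries are illustrative only.
var related = map[string][]string{
	"da": {"nb"},
	"nb": {"da"},
}

// fallback is the language linked to when nothing closer is available.
const fallback = "en"

// bestLanguage picks the language to link to for a reader browsing in
// source, given the languages the target manpage is available in.
// It returns "" if no acceptable language is available.
func bestLanguage(source string, available []string) string {
	has := make(map[string]bool, len(available))
	for _, l := range available {
		has[l] = true
	}
	if has[source] {
		return source // exact match wins
	}
	for _, l := range related[source] {
		if has[l] {
			return l // closest related language
		}
	}
	if has[fallback] {
		return fallback
	}
	return ""
}
```

With this table, a Danish reader is sent to the Norwegian page when the target exists in Norwegian and English, and to the English page when it exists only in Italian and English.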

mandoc doesn’t know which manpages are available in which language (at least in the way we’re invoking it), so doing language matching when cross-referencing is out of scope for mandoc, I think.

stapelberg avatar Aug 27 '17 14:08 stapelberg

  1. You're running mandoc with -Ofragment, so stripping <html>, <head> and <body> again should be useless: https://github.com/Debian/debiman/blob/3715b1eaf9c1793b9a8c7b1787e2d6511ca2b004/internal/convert/mandoc.go#L112
  2. I was actually running an older version, thanks for pointing this out!
  3. I admit that post-processing is indeed necessary to get cross-language links on a static site, thanks for the description. Though if mandoc was improved to handle <i>crontab</i>(5) etc., then you could pass -O man=/something/definitely/unique/%N.%S.html to mandoc and do just a (probably much simpler) replacement on the <a href="..."> tags.
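
A sketch of what that replacement could look like, assuming mandoc emits the template verbatim in a double-quoted `href` attribute. The `rewriteLinks` function and its `resolve` callback are made up for this example; `resolve` stands in for whatever logic picks the final (language-aware) URL.

```go
package main

import (
	"html"
	"regexp"
)

// prefix is the placeholder passed to mandoc via -O man=...; it only needs
// to be unlikely to occur in real links.
const prefix = "/something/definitely/unique/"

// hrefRe matches href attributes produced from the %N.%S.html template.
// The name may contain dots (e.g. systemd.unit); the section may not.
var hrefRe = regexp.MustCompile(
	`href="` + regexp.QuoteMeta(prefix) + `([^"]+)\.([^".]+)\.html"`)

// rewriteLinks replaces every placeholder link in doc with the URL that
// resolve returns for the referenced manpage name and section.
func rewriteLinks(doc string, resolve func(name, section string) string) string {
	return hrefRe.ReplaceAllStringFunc(doc, func(m string) string {
		sub := hrefRe.FindStringSubmatch(m)
		// Escape the URL so it remains a valid attribute value.
		return `href="` + html.EscapeString(resolve(sub[1], sub[2])) + `"`
	})
}
```

This stays purely textual: no tokenizer or parser state is involved, only one regular expression pass over the fragment.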

lahwaacz avatar Aug 27 '17 14:08 lahwaacz

Fair point, the stripping must be a remnant of when we didn’t use -Ofragment. We should remove it eventually (pull requests welcome!)

stapelberg avatar Aug 27 '17 19:08 stapelberg