yari icon indicating copy to clipboard operation
yari copied to clipboard

fix(document-extractor): split h4 sections

Open tannerdolby opened this issue 2 years ago • 20 comments

Summary

Fixes #4370

Problem

Currently only <h1>-<h3> are turned into direct links, which are incredible useful for sharing a link directly to a section in the docs. Since the h4 tags aren't apart of the section splitting logic, searching for a <h4> which is deeply nested in a section can be time consuming if your trying to quickly share a direct reference to a section that begins with an h4.

Solution

Add <h4> tags into the existing section splitting logic.


Screenshots

Before

Screen Shot 2022-05-13 at 10 28 17 PM

After

Screen Shot 2022-05-13 at 10 30 42 PM

How did you test this change?

Running Yari locally and checking a few pages manually.


Original description

This branch is up-to-date with the latest main so the changes to the production workflow are included. I'm still in the process of running a full build of all the content to pinpoint where exactly the problematic code was.

After running a full build (mdn/content and mdn/translated-content) using yarn build, it does successfully build the 42k pages in 74.5 minutes but with a bunch of flaws. Which the flaws might be fine, I'm just not sure what the baseline was in order to compare before/after.

There were lots of MacroLiveSampleError saying "unable to find any live code samples for 'X'"

MacroLiveSampleError within /Users/TannerDolby/translated-content/files/de/learn/javascript/first_steps/what_is_javascript/index.html, line 84 column 4 (unable to find any live code samples for "A_high-level_definition" within /de/docs/Learn/JavaScript/First_steps/What_is_JavaScript) 

MacroLiveSampleError within /Users/TannerDolby/translated-content/files/de/web/css/calc()/index.html, line 77 column 4 (unable to find any live code samples for "Relative_Grösse_eines_Objekts_mit_einer_absoluten_Positionierung" within /de/docs/Web/CSS/calc())

but other than that nothing actually made the build break or fail. There were lots of CSSRef issues where the smartlink is broken

In CSSRef the smartLink to /de/docs/Web/CSS/Layout_cookbook/Pagination is broken
In CSSRef the smartLink to /de/docs/Web/CSS/Layout_cookbook/Card is broken
In CSSRef the smartLink to /de/docs/Web/CSS/Layout_cookbook/Grid_wrapper is broken

and this is the output from the full build (content and translated content)

...
/Users/TannerDolby/yari/client/build/en-us/docs/webassembly/understanding_the_text_format
/Users/TannerDolby/yari/client/build/en-us/docs/webassembly/using_the_javascript_api
Built 42,430 pages in 74.5 minutes, at a rate of 9.5 documents per second.
Peak heap memory usage: 1.48 GB

Total_Flaws_Count
bad_bcd_links            2,558
bad_bcd_queries          836
bad_pre_tags             3,166
broken_links             87,346
heading_links            1,842
image_widths             6,758
images                   10,068
macros                   78,801
sectioning               2,204
translation_differences  17,756

✨  Done in 4475.46s.

cc @escattone

tannerdolby avatar Sep 03 '21 03:09 tannerdolby

It looks like within the Functional Testing suite the playwright test headless command is displaying this error:

  1. testing/tests/headless.sitesearch.spec.js:10:3 › Site search submit the autocomplete search form will redirect to site search .. Timeout of 30000ms exceeded.

tannerdolby avatar Sep 03 '21 03:09 tannerdolby

It looks like within the Functional Testing suite the playwright test headless command is displaying this error:

  1. testing/tests/headless.sitesearch.spec.js:10:3 › Site search submit the autocomplete search form will redirect to site search .. Timeout of 30000ms exceeded.

Looks like that was a hiccup. I reran all the actions and everything looks to be green now.

schalkneethling avatar Sep 07 '21 09:09 schalkneethling

@tannerdolby I cannot remember. With this PR, did you manage to address the issue that we ran into the last time we merged this work, or is that still on the to-do list? Thanks!

schalkneethling avatar Sep 27 '21 07:09 schalkneethling

@schalkneethling No problem! I've looked through the proposed changes here and couldn't pinpoint exactly what caused the issue when it was previously merged. If we can figure out how to create a reproducible example of that error then the problematic code should hopefully become more clear to us.

tannerdolby avatar Sep 28 '21 03:09 tannerdolby

@schalkneethling No problem! I've looked through the proposed changes here and couldn't pinpoint exactly what caused the issue when it was previously merged. If we can figure out how to create a reproducible example of that error then the problematic code should hopefully become more clear to us.

@escattone Thoughts?

schalkneethling avatar Sep 28 '21 06:09 schalkneethling

@tannerdolby Thanks for all of your work to date on this! Unfortunately, the engineering team doesn't have much time to look into this further given other priorities. If you have the time and desire, for next steps I would recommend:

  1. Running the build within this branch on the translated content to identify one or more documents that fail.
  2. Selecting one of those documents as your test case to use for debugging the root cause.

escattone avatar Oct 04 '21 19:10 escattone

You're welcome @escattone! I have the time/desire to see this through so I'm happy to stick around until we pinpoint the root cause. I've done the steps you suggested and learned a few things.

When running a full build on translated content within this branch, the first translated content page to raise an error is:

/Users/TannerDolby/translated-content/files/de/learn/html/introduction_to_html/advanced_text_formatting/index.html

Error: MacroLiveSampleError within /Users/TannerDolby/translated-content/files/de/learn/html/introduction_to_html/advanced_text_formatting/index.html, line 145 column 4 (unable to find any live code samples for "Playable_code_1" within /de/docs/Learn/HTML/Introduction_to_HTML/Advanced_text_formatting)

and for the remainder of the build many errors are logged but none of the pages actual fail and cause the build to halt.

Inside of index.js where this MacroLiveSampleError flaw is handled, if (flaw.name === "MacroLiveSampleError") { ... there is a comment mentioning,

// As of April 2021 there are 0 pages in mdn/content that trigger
// a MacroLiveSampleError. So we can be a lot more strict with en-US
// until the translated-content has had a chance to clean up all
// their live sample errors.
// See https://github.com/mdn/yari/issues/2489

As the only major errors when running a build on translated content are MacroLiveSampleErrors, it leads me to believe that this is probably where the error that broke the build might be coming from.

if (document.metadata.locale === "en-US") {
          throw new Error(
            `MacroLiveSampleError within ${flaw.filepath}, line ${flaw.line} column ${flaw.column} (${flaw.error.message})`
          );
        } else {
          console.warn(
            `MacroLiveSampleError within ${flaw.filepath}, line ${flaw.line} column ${flaw.column} (${flaw.error.message})`
          );
        }

tannerdolby avatar Nov 07 '21 23:11 tannerdolby

In this comment, Peter mentions, "So now, for en-US, it will crash the CI build, for translated-content it will just be a console.warn". Which is what I'm seeing, it will just be a console.warn (not an error that will crash the CI build) which I'm receiving in the logs, I just can't reproduce the CI build crashing. Since I didn't see any MacroLiveSampleErrors for en-US pages and only for translated-content, it led me to believe the throw new Error() was never being called since that would produce an error that crashes the build. I'm not entirely sure if this is conflicting the with logic for handling headings in this PR but it seemed worthwhile to learn more about.

There were other errors (aside from the MacroLiveSampleError) in the build for content and translated content like:

  • In CSS_Ref the smartLink to /es/docs/Web/CSS/transform-function/rotateX is broken (many of these)
  • Images missing src attributes: In /es/docs/Learn/Server-side/Configuring_server_MIME_types there's an img tag without src (<img align="right" alt="Example of an incorrect MIME type result">)
  • Can't get English original from fr/conflicting/learn/css/building_blocks/styling_tables
  • Unable to decodeURI 'https://www.ecma-international.org/ecma-262/6.0/#sec-get-%typedarray%.prototype.buffer'. Will proceed without.

@schalkneethling I will use one of these first few translated pages as a test case and scan through my proposed changes in this PR to better figure out what could be causing problems for translated-content. I will do some further investigating.

tannerdolby avatar Nov 08 '21 04:11 tannerdolby

Thanks so much @tannerdolby. Much appreciated.

schalkneethling avatar Nov 08 '21 08:11 schalkneethling

@tannerdolby I am wondering, is this still relevant after the redesign? If so, are you still keen on getting this one over the finish line? Thank so much!

schalkneethling avatar Apr 05 '22 10:04 schalkneethling

@schalkneethling No problem! Thanks for keeping this PR alive. I am beginning to wonder if this is still relevant after the redesign as well. If there was a metric that noted how many "direct link interactions" happen on mdn e.g. # of times users click direct links (to any h2 or h3). That result would sway my decision towards finishing this feature and splitting sections by h4s or maybe retiring this PR if not many users utilize the direct links.

Pages like https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Classes#field_declarations which I originally felt needed the <h4>s linkified, now seem to look really compact/more readable after the redesign which makes reaching h4s in a specific h3 section much easier as there isn't so far to scroll.

I would like to figure out the issue and get this one over the finish line to allow h4s to be direct links, but I'm not sure if the feature request yields that much value for how much code is being added/changed. You know more about user interaction metrics than I, so ultimately I'll let you decide if you think this feature would enhance or hinder page content. I like not every heading having a direct link, but in some situations if the parent h3 is really far away from a child h4 having the direct link would be preferred. We can either continue forward and I can try triaging what originally broke production, or we scrap this. Let me know what you think!

tannerdolby avatar Apr 06 '22 20:04 tannerdolby

Thank you for the detailed feedback @tannerdolby. I am going to loop in @fiji-flo here to make a call. Thanks!

schalkneethling avatar Apr 12 '22 16:04 schalkneethling

@caugner 👋 Sure thing. I filled out the PR template to provide some context for the proposed changes. I will merge this branch with main shortly and resolve conflicting files.

tannerdolby avatar May 14 '22 05:05 tannerdolby

@caugner 👋 Sure thing. I filled out the PR template to provide some context for the proposed changes. I will merge this branch with main shortly and resolve conflicting files.

Thanks so much, @tannerdolby 🎉

schalkneethling avatar May 14 '22 12:05 schalkneethling

No problem! :) @caugner. Let me know if you need me to update anything in the meantime to help with your review.

tannerdolby avatar May 14 '22 15:05 tannerdolby

Thank you for all the review work here @caugner! Also, thank you for your continued work on this, @tannerdolby. Excited to see this one land. 🎉

schalkneethling avatar May 17 '22 09:05 schalkneethling

This pull request has merge conflicts that must be resolved before it can be merged.

github-actions[bot] avatar Jun 20 '22 17:06 github-actions[bot]

@schalkneethling I implemented the requested changes for DisplayHeading to be utilized and things are looking good. It seems everything is passing besides the "Functional Testing" for the Testing Yari / build (pull_request) workflow.

Run yarn test:testing
yarn run v1.22.19
$ jest --rootDir testing
FAIL testing/tests/index.test.ts
  ● Test suite failed to run
testing/tests/index.test.ts:1780:30 - error TS234: Argument of type 'Buffer' is not assignable to parameter of type 'string'.

const { doc } = JSON.parse(fs.readFileSync(jsonFile)); // <--- problematic line

I will investigate this locally to see if DisplayHeading is rendering headings and splitting sections correctly. Then dive into index.test.ts to see why reading the jsonFile would be a string instead of a Buffer.

tannerdolby avatar Jun 22 '22 21:06 tannerdolby

This pull request has merge conflicts that must be resolved before it can be merged.

github-actions[bot] avatar Aug 29 '22 17:08 github-actions[bot]

@caugner Thanks for review so far. I've added the suggestions you had and everything is running now. The only problem is that functional testing is failing and<h4> elements don't have the expected <a> tags, the test cases for h4s are only picking up the text content when .html() is asserted against so the links arent being generated.

Apologies for all the testing/debugging commits in this last batch. The yarn test:testing wasn't producing the live results when running locally so I was relying on the build & testing pipeline logs here on GitHub.

Expected: "<a href=\"#final_heading\" title=\"Permalink to Final heading\">Final heading</a>"
    Received: "Final heading"

1805 |   expect(h4.html()).toBe(
     |                    ^
1806 |     '<a href="#final_heading" title="Permalink to Final heading">Final heading</a>'
1807 |   );

I haven't been in the codebase in awhile so I'm still getting familiarized with things again, but I could imagine the document-extractor.js just needs to be updated to implement similar logic as the h2/h3s within the _addSectionProse function. I tried experimenting but wasn't entirely sure how the whole flow worked. It might be something straightforward that I'm not seeing.

I had to add isH4 as a property to the types in document.ts that could be assigned to the Section type. This was for resolving the errors that were being thrown to get things running but it shouldn't have any destructive behavior.

tannerdolby avatar Sep 05 '22 01:09 tannerdolby

FWIW We need to make sure that doesn't break the EmbedLiveSample macro (see this issue).

caugner avatar Nov 11 '22 16:11 caugner

This pull request has merge conflicts that must be resolved before it can be merged.

github-actions[bot] avatar Nov 23 '22 09:11 github-actions[bot]