Node.js version extracts incorrect content compared to browser version

Open twang3 opened this issue 2 months ago • 0 comments

Expected Behavior

When parsing the same HTML content, the Node.js version should extract only the main article content, excluding metadata, subtitles, and descriptions that appear outside the article body.

Current Behavior

The Node.js version incorrectly includes extra content that should be excluded.

Steps to Reproduce

  1. Parse the same HTML content using both the Node.js and browser versions
  2. Compare the extracted content
  3. Observe that the Node.js version selects incorrect content (e.g., subtitles, descriptions)

Detailed Description

URL tested: https://www.theverge.com/news/766311/anthropic-class-action-ai-piracy-authors-settlement

I'm attaching three HTML files for comparison:

  1. original.html - The original HTML from the source
  2. node-processed.html - HTML after processing by Node.js version
  3. browser-processed.html - HTML after processing by browser version

The Node.js version includes text like:

"The Amazon-backed startup won't have to go to trial over claims it trained AI models on 'millions' of pirated works."

This is a subtitle/description that appears in the page metadata and should not be part of the main article content.

Possible Solution

The convertToParagraphs function was converting <div> elements containing block-level elements (like <main>, <section>, <article>) into <p> tags. This violates the HTML5 specification, which states that <p> tags cannot contain block-level elements.

Example of invalid HTML generated:

<p class="container">
  <main>
    <div>Content</div>
  </main>
</p>

Browser behavior: Browsers automatically fix this invalid structure by closing the <p> tag before the <main> element:

<p class="container"></p>
<main>
  <div>Content</div>
</main>

This causes the <p> tag to become empty, affecting scoring.

Node.js Cheerio behavior: Cheerio allows this invalid structure, preserving the content inside the <p> tag, leading to incorrect scoring and content selection.

Update the DIV_TO_P_BLOCK_TAGS constant to include the missing HTML5 block-level elements (the entries marked "Added").

File: src/extractors/generic/content/scoring/constants.js

export const DIV_TO_P_BLOCK_TAGS = [
  'a',
  'article',      // Added
  'aside',        // Added
  'blockquote',
  'dl',
  'div',
  'footer',       // Added
  'header',       // Added
  'img',
  'main',         // Added
  'nav',          // Added
  'p',
  'pre',
  'section',      // Added
  'table',
].join(',');
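For reference, `.join(',')` turns the array into one comma-separated selector group, which is why the constant can be passed directly to a selector-based lookup. A minimal sketch in plain JavaScript, assuming a shortened tag list; `isBlockTag` is a hypothetical helper for illustration and is not part of the parser:

```javascript
// Shortened list for illustration only; not the full constant.
const BLOCK_TAGS = ['article', 'div', 'main', 'p', 'section'];

// Joining with ',' yields a selector group that matches any listed tag.
const BLOCK_TAGS_SELECTOR = BLOCK_TAGS.join(',');

// Hypothetical helper: checks a single tag name against the selector group.
function isBlockTag(tagName) {
  return BLOCK_TAGS_SELECTOR.split(',').includes(tagName.toLowerCase());
}

console.log(BLOCK_TAGS_SELECTOR);
console.log(isBlockTag('MAIN'));
console.log(isBlockTag('span'));
```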

Also update: src/utils/dom/convert-to-paragraphs.js

Change from children() to find() to check all descendants, not just direct children:

function convertDivs($) {
  $('div').each((index, div) => {
    const $div = $(div);
    // Use find() instead of children() to check all descendants
    const convertible = $div.find(DIV_TO_P_BLOCK_TAGS).length === 0;

    if (convertible) {
      convertNodeTo($div, $, 'p');
    }
  });

  return $;
}
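The difference between the two checks can be illustrated without Cheerio. The sketch below uses a plain-object tree as a stand-in for the DOM; `el`, `hasBlockChild`, and `hasBlockDescendant` are hypothetical names introduced for illustration, and the tag set is shortened:

```javascript
// Plain-object stand-in for a DOM tree; not Cheerio, just an illustration.
const BLOCK_TAGS = new Set(['article', 'div', 'main', 'p', 'section']);

const el = (tag, ...children) => ({ tag, children });

// Mirrors a children()-style check: looks only at direct children.
function hasBlockChild(node) {
  return node.children.some((child) => BLOCK_TAGS.has(child.tag));
}

// Mirrors a find()-style check: looks at every descendant, at any depth.
function hasBlockDescendant(node) {
  return node.children.some(
    (child) => BLOCK_TAGS.has(child.tag) || hasBlockDescendant(child)
  );
}

// Models <div><span><section></section></span></div>
const wrapper = el('div', el('span', el('section')));

console.log(hasBlockChild(wrapper));      // false: the div would be converted to <p>
console.log(hasBlockDescendant(wrapper)); // true: the div would be left alone
```

One consequence of the descendant check is that a <div> stays convertible only when no listed tag appears anywhere inside it, so wrappers around nested block content are no longer turned into <p> tags.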

twang3 avatar Oct 24 '25 03:10 twang3