docling icon indicating copy to clipboard operation
docling copied to clipboard

fix: guess HTML content starting with script tag

Open ceberam opened this issue 7 months ago • 2 comments

This PR improves the heuristic rule to detect the file type from the first section of its content. In particular, the function docling.datamodel.document._DocumentConversionInput._detect_html_xhtml.

Even though the HTML5 specification recommends HTML documents to start with <!DOCTYPE html>, some documents may have other HTML indicators like or

at the beginning of the content. Some documents may start with scripts (tag <script>), like in https://hometheaterhifi.com/reviews/headphone-earphone/hifiman-he1000-unveiled-planar-magnetic-headphone-review/. This PR addresses these cases.

Resolves #1535

Checklist:

  • [x] Documentation has been updated, if necessary.
  • [x] Examples have been added, if necessary.
  • [x] Tests have been added, if necessary.

ceberam avatar May 28 '25 18:05 ceberam

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

mergify[bot] avatar May 28 '25 18:05 mergify[bot]

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

:loudspeaker: Thoughts on this report? Let us know!

codecov[bot] avatar May 28 '25 18:05 codecov[bot]