fix: guess HTML content starting with script tag
This PR improves the heuristic rule to detect the file type from the first section of its content. In particular, the function docling.datamodel.document._DocumentConversionInput._detect_html_xhtml.
Even though the HTML5 specification recommends HTML documents to start with <!DOCTYPE html>, some documents may have other HTML indicators like or
<script>), like in https://hometheaterhifi.com/reviews/headphone-earphone/hifiman-he1000-unveiled-planar-magnetic-headphone-review/. This PR addresses these cases.
Resolves #1535
Checklist:
- [x] Documentation has been updated, if necessary.
- [x] Examples have been added, if necessary.
- [x] Tests have been added, if necessary.
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
- [X]
title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
:loudspeaker: Thoughts on this report? Let us know!