Option to set the default ContentLayer in the Body for html_backend

Open Vdaleke opened this issue 2 months ago • 4 comments

Requested feature

I use Docling in RAG pipelines. One of the input data types is some HTML content that can be retrieved via a specific HTML endpoint, which returns only what's contained in a specific field on the page and may not contain a heading at the beginning.

Currently, the HTML backend works like this: if the text contains headings, everything before the first heading goes to the Furniture content layer (code). But the whole point of retrieving data via a specific endpoint is to control what goes into the Body myself, rather than delegating that decision to Docling's content-layer auto-detection.

I'd like to have an option to set the default layer for the html_backend to Body, so that all content goes there and is then present in chunks and in the Markdown export.
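
To make the request concrete, here is a minimal toy model of the behavior described above. It does not use Docling's actual code: `Layer`, `Item`, `assign_layers`, and the `default_layer` parameter are all hypothetical names chosen for illustration, and the heuristic is only an approximation of what the issue describes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Layer(Enum):
    # Hypothetical stand-in for a content-layer enum.
    BODY = "body"
    FURNITURE = "furniture"


@dataclass
class Item:
    kind: str  # "heading" or "text"
    text: str
    layer: Optional[Layer] = None


def assign_layers(items: List[Item], default_layer: Optional[Layer] = None) -> List[Item]:
    """Toy model of the layer assignment described in this issue.

    default_layer=None mimics the described auto-detection: if the input
    contains a heading, everything before the first heading is FURNITURE.
    Passing default_layer=Layer.BODY models the requested (hypothetical) option.
    """
    has_heading = any(item.kind == "heading" for item in items)
    seen_heading = False
    for item in items:
        if item.kind == "heading":
            seen_heading = True
        if default_layer is not None:
            # Requested behavior: skip auto-detection entirely.
            item.layer = default_layer
        elif has_heading and not seen_heading:
            item.layer = Layer.FURNITURE
        else:
            item.layer = Layer.BODY
    return items


def make_fragment() -> List[Item]:
    # An HTML fragment whose first paragraph precedes any heading.
    return [
        Item("text", "Intro paragraph"),
        Item("heading", "Details"),
        Item("text", "More text"),
    ]


auto = assign_layers(make_fragment())
forced = assign_layers(make_fragment(), default_layer=Layer.BODY)
print([item.layer.value for item in auto])
print([item.layer.value for item in forced])
```

With auto-detection the intro paragraph is classified as Furniture and is lost from Body-only consumers; with the hypothetical default it stays in the Body.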

Alternatives

Manually adding a title to the beginning of the content isn't always convenient, and a suitable title doesn't always exist.

Prompted by the discussion in https://github.com/docling-project/docling/pull/2388

Vdaleke avatar Oct 17 '25 12:10 Vdaleke

Isn't the alternative simply setting included_content_layers? see the docs page at: https://app.dosu.dev/097760a8-135e-4789-8234-90c8837d7f1c/documents/e596ee79-fc7f-43a4-90e2-74891e0cf12f

dolfim-ibm avatar Oct 17 '25 13:10 dolfim-ibm
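
The suggestion above, selecting layers at export time rather than at parse time, can be sketched roughly as follows. This is a toy illustration, not Docling's API: `Layer` and `export_text` are hypothetical, and only the parameter name `included_content_layers` is taken from the comment.

```python
from enum import Enum
from typing import Iterable, Optional, Set, Tuple


class Layer(Enum):
    # Hypothetical stand-in for a content-layer enum.
    BODY = "body"
    FURNITURE = "furniture"


def export_text(
    items: Iterable[Tuple[Layer, str]],
    included_content_layers: Optional[Set[Layer]] = None,
) -> str:
    """Join the text of all items whose layer is in the included set.

    Assumption: when no set is given, only BODY items are exported.
    """
    if included_content_layers is None:
        included_content_layers = {Layer.BODY}
    return "\n\n".join(
        text for layer, text in items if layer in included_content_layers
    )


doc = [(Layer.FURNITURE, "Intro paragraph"), (Layer.BODY, "Main content")]

# Default: the Furniture item is dropped.
print(export_text(doc))

# Opting in to both layers keeps everything, in document order.
print(export_text(doc, {Layer.BODY, Layer.FURNITURE}))
```

The trade-off is that the classification still happens in the backend; the consumer only chooses which layers to read.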

Isn't the alternative simply setting included_content_layers? see the docs page at: https://app.dosu.dev/097760a8-135e-4789-8234-90c8837d7f1c/documents/e596ee79-fc7f-43a4-90e2-74891e0cf12f

HybridChunker doesn't seem to have such an option, and that's my case. And it's probably a conceptually poor solution, because classification occurs in the wrong layer, and setting this option feels like a hack.

Vdaleke avatar Oct 17 '25 13:10 Vdaleke

Then I think it is better to add the option to the HybridChunker. It might be more generic, and it will apply to all document types.

FYI @vagenas

dolfim-ibm avatar Oct 17 '25 14:10 dolfim-ibm

Then I think it is better to add the option to the HybridChunker. It might be more generic, and it will apply to all document types.

These options for including specific layers seem applicable to chunking specific file types, but my chunker is configured universally for all files, and adding an option for specific files doesn't sound right.

For some file types, I wouldn't want the Furniture layer to be present in the chunk data. I'll have to add the required column to the database table to specify the correct layers, since conversion and chunking are separated in my system for caching purposes.

As I understand it, the decision was made to assign data in the HTML backend to the Furniture and Body layers based on the fact that HTML pages typically have a header before the main content. But that's not always the case, is it? I think this option could be useful for more than just me.

Vdaleke avatar Oct 17 '25 14:10 Vdaleke