Michele Dolfi
As discovered in #542, some MS Office XML archives have the meta file `[Content_Types].xml` at the end, so the 8 KB signature does not capture it. One way of improving...
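To illustrate the problem, here is a minimal sketch of a check that does not depend on where the entry sits in the archive: it reads the ZIP central directory instead of a fixed-size prefix. The function name `looks_like_ooxml` is hypothetical and not part of Docling, and this is only one possible approach, not necessarily the one adopted.

```python
import io
import zipfile


def looks_like_ooxml(data: bytes) -> bool:
    """Return True if `data` is a ZIP archive that lists [Content_Types].xml.

    Hypothetical helper: the central directory is consulted, so the entry is
    found regardless of its position inside the archive.
    """
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return "[Content_Types].xml" in archive.namelist()
    except zipfile.BadZipFile:
        # Not a ZIP container at all, so it cannot be an Office XML file.
        return False
```

The trade-off is that the whole archive (or at least its tail) has to be available, whereas a prefix signature only needs the first bytes of the stream.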
There is no doubt the logic has to be fixed and improved, maybe also simplified altogether. The initial use case, which was quite relevant for us, is iterating through a...
Actually, we already have the first steps for this feature. The code that downloads the files allows custom headers; see https://github.com/DS4SD/docling-core/blob/main/docling_core/utils/file.py#L52. We only need to propagate the arguments all the...
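As a generic illustration of what propagating the arguments means, here is a small sketch using only the Python standard library. It is not Docling's code: `build_request` and `convert_from_url` are hypothetical names standing in for the low-level download helper and a top-level entry point.

```python
from typing import Dict, Optional
from urllib.request import Request


def build_request(url: str, headers: Optional[Dict[str, str]] = None) -> Request:
    """Lowest layer (hypothetical): attach caller-supplied headers to the request."""
    return Request(url, headers=headers or {})


def convert_from_url(url: str, headers: Optional[Dict[str, str]] = None) -> Request:
    """Top-level entry point (hypothetical): forwards `headers` unchanged.

    A real implementation would open the request and parse the response;
    this sketch stops at building it so that it runs without network access.
    """
    return build_request(url, headers=headers)


request = convert_from_url(
    "https://example.com/doc.pdf",
    headers={"Authorization": "Bearer <token>"},
)
```

The point is only that every layer between the public API and the download call accepts an optional `headers` argument and passes it down untouched.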
CI cancelled. It seems to have a deadlock.
Tested with concurrent processing on macOS and in a Linux container.
The choice of the models is done at the Pipeline level. For example, the PDF pipeline (called `StandardPdfPipeline`) is defined in [docling/pipeline/standard_pdf_pipeline.py](../blob/main/docling/pipeline/standard_pdf_pipeline.py). You can make your own pipeline with different...
> I am not very familiar with the DCO Action thing, can you help me a bit out on that

The DCO check requires all the commits to be...
@danhertztech Docling supports multiple OCR engines (and you can also bring your own). Out of the box we already have Tesseract, which could cover your use case. See more here: https://ds4sd.github.io/docling/installation/
@FengCeUp can you please share the file?
I confirm this can be reproduced with both parser v1 and v2.

```sh
# parser v1
docling --pdf-backend dlparse_v1 TestDoc3.pdf

# parser v2
docling --pdf-backend dlparse_v2 TestDoc3.pdf
```

The error...