Michele Dolfi
At the moment `convert()` returns a generator over the documents in the input argument, but it does not stream within each document.
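A dependency-free illustration of these semantics (the `convert()` below is a hypothetical stand-in, not docling's actual implementation): one item is yielded lazily per input document, while the pages inside each document come back as a whole.

```python
# Hypothetical stand-in for the converter: a generator with one item per
# input document; pages within a document are NOT streamed individually.
def convert(sources):
    for src in sources:
        # each yielded item represents one fully-converted document
        yield {"source": src, "pages": [f"{src}:page{i}" for i in range(2)]}

results = convert(["a.pdf", "b.pdf"])
first = next(results)  # only "a.pdf" has been converted at this point
```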
On the [HF page](https://huggingface.co/ByteDance/Dolphin?library=transformers) I found this:

```py
# Load model directly
from transformers import AutoTokenizer, AutoModelForImageTextToText

tokenizer = AutoTokenizer.from_pretrained("ByteDance/Dolphin")
model = AutoModelForImageTextToText.from_pretrained("ByteDance/Dolphin")
```

`AutoModelForImageTextToText` is not yet in the...
You can try to add `AutoModelForImageTextToText`.

Enum definition:
https://github.com/docling-project/docling/blob/0432a31b2f7c9fe944c3a1d4b608ef938b4f2299/docling/datamodel/pipeline_options_vlm_model.py#L26-L29

Usage:
https://github.com/docling-project/docling/blob/0432a31b2f7c9fe944c3a1d4b608ef938b4f2299/docling/models/vlm_models_inline/hf_transformers_model.py#L83-L93

And in case you have to use a different prompt, you can use another `if/else` in
https://github.com/docling-project/docling/blob/0432a31b2f7c9fe944c3a1d4b608ef938b4f2299/docling/models/vlm_models_inline/hf_transformers_model.py#L163
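A minimal sketch of the shape of that change (names and values here are illustrative, not docling's exact code; check the linked permalinks for the real enum and loader):

```python
# Hedged sketch: extend the model-type enum with a new member and branch on
# it where the model class is chosen. Enum member names/values are assumed.
from enum import Enum

class TransformersModelType(str, Enum):
    AUTOMODEL = "automodel"
    AUTOMODEL_VISION2SEQ = "automodel-vision2seq"
    AUTOMODEL_IMAGETEXTTOTEXT = "automodel-imagetexttotext"  # proposed addition

def model_class_name(model_type: TransformersModelType) -> str:
    # In docling this would return the actual transformers Auto* class;
    # returning the class name keeps the sketch dependency-free.
    if model_type == TransformersModelType.AUTOMODEL_VISION2SEQ:
        return "AutoModelForVision2Seq"
    elif model_type == TransformersModelType.AUTOMODEL_IMAGETEXTTOTEXT:
        return "AutoModelForImageTextToText"
    return "AutoModel"
```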
This PR should simplify all of it: https://github.com/DS4SD/docling/pull/876
I'm summarizing here the target of this PR; I will submit code proposals later.

## `VlmPipeline`

Specs of the new pipeline:

- Input: (PDF) Document
- Processing: using a vision...
| source | model_id | framework | num_pages | time |
| --- | --- | --- | --- | --- |
| tests/data/pdf/2305.03393v1-pg9.pdf | ds4sd_SmolDocling-256M-preview | InferenceFramework.TRANSFORMERS_VISION2SEQ | 1 | 102.212 |
...
We could check whether the docx API allows detecting something like the checkboxes (top-left in the figure).
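One possible angle, sketched under assumptions: python-docx doesn't expose checkboxes directly, but Word stores checkbox content controls in the WordprocessingML XML as `w:sdt` elements carrying a `w14:checkbox`, so we could scan the raw `word/document.xml` for them. The element names below are the standard Word 2010 extension namespace; whether they cover every checkbox variant in the wild is unverified.

```python
# Sketch: detect checkbox content controls and their checked state by
# scanning WordprocessingML XML (e.g. word/document.xml from the .docx zip).
import xml.etree.ElementTree as ET

W14 = "http://schemas.microsoft.com/office/word/2010/wordml"

def find_checkboxes(document_xml: str) -> list:
    """Return True/False for each checkbox content control found."""
    root = ET.fromstring(document_xml)
    states = []
    for cb in root.iter(f"{{{W14}}}checkbox"):
        checked = cb.find(f"{{{W14}}}checked")
        val = checked.get(f"{{{W14}}}val") if checked is not None else "0"
        states.append(val == "1")
    return states

sample = """<w:document
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml">
  <w:body>
    <w:sdt><w:sdtPr><w14:checkbox><w14:checked w14:val="1"/></w14:checkbox></w:sdtPr></w:sdt>
    <w:sdt><w:sdtPr><w14:checkbox><w14:checked w14:val="0"/></w14:checkbox></w:sdtPr></w:sdt>
  </w:body>
</w:document>"""

print(find_checkboxes(sample))  # [True, False]
```

In practice the XML would come from `zipfile.ZipFile(path).read("word/document.xml")` rather than an inline string.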
Do you mean our `CONTRIBUTING.md`? We are very happy to have the community building these extensions. Thanks a lot for the contribution.
We are planning to address this with custom serializers for picture items, i.e. some use cases need the description, others the text produced by OCR, others the graph...
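To make the idea concrete, here is a minimal sketch of the pattern (hypothetical classes, not docling's actual serializer API): each use case plugs in its own serializer for picture items, so the same item can render as a caption comment in one pipeline and as raw OCR text in another.

```python
# Hypothetical sketch of pluggable picture-item serializers.
from dataclasses import dataclass

@dataclass
class PictureItem:
    description: str  # e.g. produced by a captioning model
    ocr_text: str     # text recovered from the image by OCR

class DescriptionPictureSerializer:
    """Renders the picture as a markdown comment with its description."""
    def serialize(self, item: PictureItem) -> str:
        return f"<!-- image: {item.description} -->"

class OcrPictureSerializer:
    """Renders the picture as the text OCR recovered from it."""
    def serialize(self, item: PictureItem) -> str:
        return item.ocr_text

pic = PictureItem(description="bar chart of revenue", ocr_text="Revenue 2023: 1.2M")
print(DescriptionPictureSerializer().serialize(pic))  # <!-- image: bar chart of revenue -->
print(OcrPictureSerializer().serialize(pic))          # Revenue 2023: 1.2M
```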
Yes, that is what we would like to allow. It is clear that each use case will need a different output, and instead of trying to overload with content we...