[Feature Request] Add ByteDance/Dolphin model for Docling
Requested feature
Add ByteDance/Dolphin, the customizable document parsing model, to Docling. ...
Alternatives
Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Dolphin addresses these challenges through a two-stage approach:
🔍 Stage 1: Comprehensive page-level layout analysis by generating an element sequence in natural reading order
🧩 Stage 2: Efficient parallel parsing of document elements using heterogeneous anchors and task-specific prompts
Dolphin achieves promising performance across diverse page-level and element-level parsing tasks while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism.
The model is implemented as a Hugging Face VisionEncoderDecoderModel for easy integration with the Transformers ecosystem. ...
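For illustration, a minimal standalone sketch of calling Dolphin through plain Transformers might look roughly like this. It assumes the "<s> ... <Answer/>" plain-text prompt format from the model card; the image path, the page-to-markdown prompt, and the post-processing are placeholders, not the final Docling integration:

import torch
from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel

# Load the processor (image preprocessing + tokenizer) and the encoder-decoder model.
processor = AutoProcessor.from_pretrained("ByteDance/Dolphin")
model = VisionEncoderDecoderModel.from_pretrained("ByteDance/Dolphin")
model.eval()

# Placeholder page image; Dolphin expects an RGB page rendering.
image = Image.open("page.png").convert("RGB")

# Dolphin is prompt-driven plain text, not chat-templated: "<s>" + task prompt + " <Answer/>".
prompt = "<s>Convert this page to markdown. Do not miss any text and only output the bare markdown! <Answer/>"

pixel_values = processor(image, return_tensors="pt").pixel_values
prompt_ids = processor.tokenizer(
    prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

with torch.no_grad():
    output_ids = model.generate(
        pixel_values=pixel_values,
        decoder_input_ids=prompt_ids,
        max_new_tokens=2048,
    )

# The generated sequence echoes the prompt, so keep only what follows "<Answer/>".
decoded = processor.tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
markdown = decoded.split("<Answer/>")[-1].strip()
print(markdown)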
Here are more examples with Dolphin:
Once we have finalized this PR (https://github.com/docling-project/docling/pull/1570), I will add support for Dolphin!
something like
DOLPHIN_VISION_TRANSFORMERS = InlineVlmOptions(
    repo_id="ByteDance/dolphin",
    prompt="Convert this page to markdown. Do not miss any text and only output the bare markdown!",
    response_format=ResponseFormat.MARKDOWN,
    inference_framework=InferenceFramework.TRANSFORMERS,
    transformers_model_type=TransformersModelType.AUTOMODEL_VISION2SEQ,
    supported_devices=[
        AcceleratorDevice.CPU,
        AcceleratorDevice.CUDA,
        AcceleratorDevice.MPS,
    ],
    scale=2.0,
    temperature=0.0,
)
fails for me with
ValueError: Cannot use apply_chat_template because this processor does not have a chat template.
On the HF page I found this:
# Load model directly
from transformers import AutoTokenizer, AutoModelForImageTextToText
tokenizer = AutoTokenizer.from_pretrained("ByteDance/Dolphin")
model = AutoModelForImageTextToText.from_pretrained("ByteDance/Dolphin")
AutoModelForImageTextToText is not yet in the list. It could be added.
On the other hand, you could also try the generic AutoModel: TransformersModelType.AUTOMODEL.
Using transformers_model_type=TransformersModelType.AUTOMODEL, as suggested, also fails. What should be changed to enable Dolphin?
DOLPHIN_VISION_TRANSFORMERS = InlineVlmOptions(
    repo_id="ByteDance/dolphin",
    prompt="Convert this page to markdown. Do not miss any text and only output the bare markdown!",
    response_format=ResponseFormat.MARKDOWN,
    inference_framework=InferenceFramework.TRANSFORMERS,
    transformers_model_type=TransformersModelType.AUTOMODEL,
    supported_devices=[
        AcceleratorDevice.CPU,
        AcceleratorDevice.CUDA,
        AcceleratorDevice.MPS,
    ],
    scale=2.0,
    temperature=0.0,
)
ValueError: Unrecognized configuration class <class 'transformers.models.vision_encoder_decoder.configuration_vision_encoder_decoder.VisionEncoderDecoderConfig'> for this kind of AutoModel: AutoModel.
Model type should be one of AlbertConfig, AlignConfig, AltCLIPConfig, AriaConfig, AriaTextConfig, ASTConfig, AutoformerConfig, AyaVisionConfig, BambaConfig, BarkConfig, BartConfig, BeitConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BitConfig, BitNetConfig, BlenderbotConfig, BlenderbotSmallConfig, BlipConfig, Blip2Config, Blip2QFormerConfig, BloomConfig, BridgeTowerConfig, BrosConfig, CamembertConfig, CanineConfig, ChameleonConfig, ChineseCLIPConfig, ChineseCLIPVisionConfig, ClapConfig, CLIPConfig, CLIPTextConfig, CLIPVisionConfig, CLIPSegConfig, ClvpConfig, LlamaConfig, CodeGenConfig, CohereConfig, Cohere2Config, ConditionalDetrConfig, ConvBertConfig, ConvNextConfig, ConvNextV2Config, CpmAntConfig, CsmConfig, CTRLConfig, CvtConfig, DFineConfig, DabDetrConfig, DacConfig, Data2VecAudioConfig, Data2VecTextConfig, Data2VecVisionConfig, DbrxConfig, DebertaConfig, DebertaV2Config, DecisionTransformerConfig, DeepseekV3Config, DeformableDetrConfig, DeiTConfig, DepthProConfig, DetaConfig, DetrConfig, DiffLlamaConfig, DinatConfig, Dinov2Config, Dinov2WithRegistersConfig, DistilBertConfig, DonutSwinConfig, DPRConfig, DPTConfig, EfficientFormerConfig, EfficientNetConfig, ElectraConfig, Emu3Config, EncodecConfig, ErnieConfig, ErnieMConfig, EsmConfig, FalconConfig, FalconMambaConfig, FastSpeech2ConformerConfig, FlaubertConfig, FlavaConfig, FNetConfig, FocalNetConfig, FSMTConfig, FunnelConfig, FuyuConfig, GemmaConfig, Gemma2Config, Gemma3Config, Gemma3TextConfig, GitConfig, GlmConfig, Glm4Config, GLPNConfig, GotOcr2Config, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, GPTSanJapaneseConfig, GraniteConfig, GraniteMoeConfig, GraniteMoeHybridConfig, GraniteMoeSharedConfig, GraphormerConfig, GroundingDinoConfig, GroupViTConfig, HeliumConfig, HGNetV2Config, HieraConfig, HubertConfig, IBertConfig, IdeficsConfig, Idefics2Config, Idefics3Config, Idefics3VisionConfig, IJepaConfig, ImageGPTConfig, InformerConfig, InstructBlipConfig, InstructBlipVideoConfig, InternVLConfig, InternVLVisionConfig, JambaConfig, JanusConfig, JetMoeConfig, JukeboxConfig, Kosmos2Config, LayoutLMConfig, LayoutLMv2Config, LayoutLMv3Config, LEDConfig, LevitConfig, LiltConfig, LlamaConfig, Llama4Config, Llama4TextConfig, LlavaConfig, LlavaNextConfig, LlavaNextVideoConfig, LlavaOnevisionConfig, LongformerConfig, LongT5Config, LukeConfig, LxmertConfig, M2M100Config, MambaConfig, Mamba2Config, MarianConfig, MarkupLMConfig, Mask2FormerConfig, MaskFormerConfig, MaskFormerSwinConfig, MBartConfig, MCTCTConfig, MegaConfig, MegatronBertConfig, MgpstrConfig, MimiConfig, MistralConfig, Mistral3Config, MixtralConfig, MLCDVisionConfig, MllamaConfig, MobileBertConfig, MobileNetV1Config, MobileNetV2Config, MobileViTConfig, MobileViTV2Config, ModernBertConfig, MoonshineConfig, MoshiConfig, MPNetConfig, MptConfig, MraConfig, MT5Config, MusicgenConfig, MusicgenMelodyConfig, MvpConfig, NatConfig, NemotronConfig, NezhaConfig, NllbMoeConfig, NystromformerConfig, OlmoConfig, Olmo2Config, OlmoeConfig, OmDetTurboConfig, OneFormerConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, Owlv2Config, OwlViTConfig, PaliGemmaConfig, PatchTSMixerConfig, PatchTSTConfig, PegasusConfig, PegasusXConfig, PerceiverConfig, PersimmonConfig, PhiConfig, Phi3Config, Phi4MultimodalConfig, PhimoeConfig, PixtralVisionConfig, PLBartConfig, PoolFormerConfig, ProphetNetConfig, PvtConfig, PvtV2Config, QDQBertConfig, Qwen2Config, Qwen2_5_VLConfig, Qwen2_5_VLTextConfig, 
Qwen2AudioEncoderConfig, Qwen2MoeConfig, Qwen2VLConfig, Qwen2VLTextConfig, Qwen3Config, Qwen3MoeConfig, RecurrentGemmaConfig, ReformerConfig, RegNetConfig, RemBertConfig, ResNetConfig, RetriBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RTDetrConfig, RTDetrV2Config, RwkvConfig, SamConfig, SamHQConfig, SamHQVisionConfig, SamVisionConfig, SeamlessM4TConfig, SeamlessM4Tv2Config, SegformerConfig, SegGptConfig, SEWConfig, SEWDConfig, SiglipConfig, Siglip2Config, SiglipVisionConfig, SmolVLMConfig, SmolVLMVisionConfig, Speech2TextConfig, SpeechT5Config, SplinterConfig, SqueezeBertConfig, StableLmConfig, Starcoder2Config, SuperGlueConfig, SwiftFormerConfig, SwinConfig, Swin2SRConfig, Swinv2Config, SwitchTransformersConfig, T5Config, TableTransformerConfig, TapasConfig, TextNetConfig, TimeSeriesTransformerConfig, TimesFmConfig, TimesformerConfig, TimmBackboneConfig, TimmWrapperConfig, TrajectoryTransformerConfig, TransfoXLConfig, TvltConfig, TvpConfig, UdopConfig, UMT5Config, UniSpeechConfig, UniSpeechSatConfig, UnivNetConfig, VanConfig, VideoLlavaConfig, VideoMAEConfig, ViltConfig, VipLlavaConfig, VisionTextDualEncoderConfig, VisualBertConfig, ViTConfig, ViTHybridConfig, ViTMAEConfig, ViTMSNConfig, VitDetConfig, VitsConfig, VivitConfig, Wav2Vec2Config, Wav2Vec2BertConfig, Wav2Vec2ConformerConfig, WavLMConfig, WhisperConfig, XCLIPConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig, YolosConfig, YosoConfig, ZambaConfig, Zamba2Config.
Stack Trace:
File "/site-packages/dagster/_core/execution/plan/utils.py", line 56, in op_execution_error_boundary
yield
File "/site-packages/dagster/_utils/__init__.py", line 392, in iterate_with_context
next_output = next(iterator)
File "/Users/geoheil/development/promonow/jubust/services/data-pipeline-patents/src/code_location_patents/code_location_patents/assets/patents/pipeline.py", line 207, in full_naive_ocr_vlm
conversion_result = compute_full_naive_ocr_vlm(context, raw_patent)
File "/Users/geoheil/development/promonow/jubust/services/data-pipeline-patents/src/code_location_patents/code_location_patents/utils/timing.py", line 40, in wrapper
result, execution_time = timeit(func)(context, *args, **kwargs)
~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/geoheil/development/promonow/jubust/services/data-pipeline-patents/src/code_location_patents/code_location_patents/utils/timing.py", line 23, in wrapper
result = func(*args, **kwargs)
File "/Users/geoheil/development/promonow/jubust/services/data-pipeline-patents/src/code_location_patents/code_location_patents/assets/patents/full_naive_ocr_vlm.py", line 82, in compute_full_naive_ocr_vlm
conv_res = doc_converter.convert(raw_patent)
File "/site-packages/pydantic/_internal/_validate_call.py", line 38, in wrapper_function
return wrapper(*args, **kwargs)
File "/site-packages/pydantic/_internal/_validate_call.py", line 111, in __call__
res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
File "/site-packages/docling/document_converter.py", line 227, in convert
return next(all_res)
File "/site-packages/docling/document_converter.py", line 250, in convert_all
for conv_res in conv_res_iter:
^^^^^^^^^^^^^
File "/site-packages/docling/document_converter.py", line 285, in _convert
for item in map(
~~~^
partial(self._process_document, raises_on_error=raises_on_error),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
input_batch,
^^^^^^^^^^^^
):
^
File "/site-packages/docling/document_converter.py", line 331, in _process_document
conv_res = self._execute_pipeline(in_doc, raises_on_error=raises_on_error)
File "/site-packages/docling/document_converter.py", line 352, in _execute_pipeline
pipeline = self._get_pipeline(in_doc.format)
File "/site-packages/docling/document_converter.py", line 314, in _get_pipeline
self.initialized_pipelines[cache_key] = pipeline_class(
~~~~~~~~~~~~~~^
pipeline_options=pipeline_options
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/site-packages/docling/pipeline/vlm_pipeline.py", line 99, in __init__
HuggingFaceTransformersVlmModel(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
enabled=True, # must be always enabled for this pipeline to make sense.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...<2 lines>...
vlm_options=vlm_options,
^^^^^^^^^^^^^^^^^^^^^^^^
),
^
File "/site-packages/docling/models/vlm_models_inline/hf_transformers_model.py", line 99, in __init__
self.vlm_model = model_cls.from_pretrained(
~~~~~~~~~~~~~~~~~~~~~~~~~^
artifacts_path,
^^^^^^^^^^^^^^^
...<7 lines>...
trust_remote_code=vlm_options.trust_remote_code,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/site-packages/transformers/models/auto/auto_factory.py", line 574, in from_pretrained
raise ValueError(
...<2 lines>...
)
You can try to add AutoModelForImageTextToText.
Enum definition:
https://github.com/docling-project/docling/blob/0432a31b2f7c9fe944c3a1d4b608ef938b4f2299/docling/datamodel/pipeline_options_vlm_model.py#L26-L29
Usage:
https://github.com/docling-project/docling/blob/0432a31b2f7c9fe944c3a1d4b608ef938b4f2299/docling/models/vlm_models_inline/hf_transformers_model.py#L83-L93
And in case you have to use a different prompt, you can use another if/else in https://github.com/docling-project/docling/blob/0432a31b2f7c9fe944c3a1d4b608ef938b4f2299/docling/models/vlm_models_inline/hf_transformers_model.py#L163
I have made the following changes:
class TransformersModelType(str, Enum):
    AUTOMODEL = "automodel"
    AUTOMODEL_VISION2SEQ = "automodel-vision2seq"
    AUTOMODEL_CAUSALLM = "automodel-causallm"
    AUTOMODEL_FORIMAGETEXTTOTEXT = "automodel-forimagetexttotext"

DOLPHIN_VISION_TRANSFORMERS = InlineVlmOptions(
    repo_id="ByteDance/dolphin",
    prompt="Convert this page to markdown. Do not miss any text and only output the bare markdown!",
    response_format=ResponseFormat.MARKDOWN,
    inference_framework=InferenceFramework.TRANSFORMERS,
    transformers_model_type=TransformersModelType.AUTOMODEL_FORIMAGETEXTTOTEXT,
    supported_devices=[
        AcceleratorDevice.CPU,
        AcceleratorDevice.CUDA,
        AcceleratorDevice.MPS,
    ],
    scale=2.0,
)
from transformers import AutoModelForImageTextToText  # new import, alongside the existing Auto* imports

model_cls: Any = AutoModel
if (
    self.vlm_options.transformers_model_type
    == TransformersModelType.AUTOMODEL_CAUSALLM
):
    model_cls = AutoModelForCausalLM
elif (
    self.vlm_options.transformers_model_type
    == TransformersModelType.AUTOMODEL_VISION2SEQ
):
    model_cls = AutoModelForVision2Seq
elif (
    self.vlm_options.transformers_model_type
    == TransformersModelType.AUTOMODEL_FORIMAGETEXTTOTEXT
):
    model_cls = AutoModelForImageTextToText
The error is still the same (albeit the right model type is now used):
ValueError: Cannot use apply_chat_template because this processor does not have a chat template.
hf_transformers_model.py", line 139, in __call__
prompt = self.formulate_prompt()
File "site-packages/docling/models/vlm_models_inline/hf_transformers_model.py", line 202, in formulate_prompt
prompt = self.processor.apply_chat_template(
Then the formulate_prompt was also adapted by adding this condition:
if self.vlm_options.repo_id.lower().startswith("bytedance/dolphin"):
    # Dolphin is a vision-encoder-decoder model, *not* a chat model.
    # It wants plain text: <s> ...prompt... <Answer/>
    # more info here https://huggingface.co/ByteDance/Dolphin
    return f"<s>{self.vlm_options.prompt} <Answer/>"
See a PR with these changes: https://github.com/docling-project/docling/pull/1772
However, for the attached input document, the generated output document (also attached) is missing a lot of content, in particular compared to a normal Docling OCR pipeline with RapidOCR.
Perhaps this is asking too much (i.e. more than just integrating Dolphin into Docling), but to make the integration meaningful it would be rather neat if it delivered similar or better output quality.
What would need to be changed?
As written before:
DOLPHIN_VISION_TRANSFORMERS = InlineVlmOptions(
    repo_id="ByteDance/dolphin",
    prompt="Convert this page to markdown. Do not miss any text and only output the bare markdown!",
    response_format=ResponseFormat.MARKDOWN,
    inference_framework=InferenceFramework.TRANSFORMERS,
    transformers_model_type=TransformersModelType.AUTOMODEL_FORIMAGETEXTTOTEXT,
    supported_devices=[
        AcceleratorDevice.CPU,
        AcceleratorDevice.CUDA,
        AcceleratorDevice.MPS,
    ],
    scale=2.0,
    temperature=0.0,
)
was used.
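For reference, a config like this gets wired into the VLM pipeline roughly as follows (a sketch; the imports and option names assume docling's current VlmPipeline API, and the input path is a placeholder):

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Point the VLM pipeline at the Dolphin options defined above.
pipeline_options = VlmPipelineOptions(vlm_options=DOLPHIN_VISION_TRANSFORMERS)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert("input.pdf")  # placeholder path
print(result.document.export_to_markdown())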