
[Feature Request] Add ByteDance/Dolphin model for Docling

Open · NeroHin opened this issue 6 months ago · 1 comment

Requested feature

Add ByteDance/Dolphin to Docling as a custom document parsing model. ...

Alternatives

Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Dolphin addresses these challenges through a two-stage approach:

🔍 Stage 1: Comprehensive page-level layout analysis by generating the element sequence in natural reading order.
🧩 Stage 2: Efficient parallel parsing of document elements using heterogeneous anchors and task-specific prompts.

Dolphin achieves promising performance across diverse page-level and element-level parsing tasks while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism.

The model is implemented as a Hugging Face VisionEncoderDecoderModel for easy integration with the Transformers ecosystem. ...
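For context, here is a minimal sketch of how a VisionEncoderDecoderModel such as Dolphin is typically driven through the Transformers API. The prompt wording, image path, and generation length are assumptions for illustration, not taken from the Dolphin documentation:

# Sketch: run Dolphin directly as a Hugging Face VisionEncoderDecoderModel.
from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel

processor = AutoProcessor.from_pretrained("ByteDance/Dolphin")
model = VisionEncoderDecoderModel.from_pretrained("ByteDance/Dolphin")

page = Image.open("page.png").convert("RGB")  # hypothetical page image
pixel_values = processor(images=page, return_tensors="pt").pixel_values

# Dolphin is prompted with plain text (assumed "<s>... <Answer/>" format),
# not with a chat template.
prompt = "<s>Convert this page to markdown. <Answer/>"
prompt_ids = processor.tokenizer(
    prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=prompt_ids,
    max_new_tokens=1024,
)
print(processor.tokenizer.decode(outputs[0], skip_special_tokens=True))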

Here are more examples with Dolphin:

(example image attached to the issue)

NeroHin avatar May 21 '25 06:05 NeroHin

Once we have finalized this PR (https://github.com/docling-project/docling/pull/1570), I will add support for Dolphin!

PeterStaar-IBM avatar May 27 '25 05:05 PeterStaar-IBM

> Once we have finalized this PR (#1570), I will add support for Dolphin!

Any updates?

geoHeil avatar Jun 13 '25 13:06 geoHeil

Something like:

DOLPHIN_VISION_TRANSFORMERS = InlineVlmOptions(
    repo_id="ByteDance/dolphin",
    prompt="Convert this page to markdown. Do not miss any text and only output the bare markdown!",
    response_format=ResponseFormat.MARKDOWN,
    inference_framework=InferenceFramework.TRANSFORMERS,
    transformers_model_type=TransformersModelType.AUTOMODEL_VISION2SEQ,
    supported_devices=[
        AcceleratorDevice.CPU,
        AcceleratorDevice.CUDA,
        AcceleratorDevice.MPS,
    ],
    scale=2.0,
    temperature=0.0,
)

fails for me with

ValueError: Cannot use apply_chat_template because this processor does not have a chat template.
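The Docling Transformers VLM model builds its prompt via processor.apply_chat_template (see formulate_prompt further down), and the Dolphin processor does not ship a chat template, hence the error. A quick way to confirm this, assuming AutoProcessor resolves the Dolphin checkpoint:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("ByteDance/Dolphin")
# Prints None: the processor has no chat template, so apply_chat_template raises.
print(getattr(processor, "chat_template", None))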

geoHeil avatar Jun 13 '25 13:06 geoHeil

On the HF page I found this:

# Load model directly
from transformers import AutoTokenizer, AutoModelForImageTextToText

tokenizer = AutoTokenizer.from_pretrained("ByteDance/Dolphin")
model = AutoModelForImageTextToText.from_pretrained("ByteDance/Dolphin")

AutoModelForImageTextToText is not yet in the list. It could be added.

On the other hand, you could also try the generic AutoModel: TransformersModelType.AUTOMODEL.
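A quick way to see which auto class accepts the checkpoint before wiring it into Docling (a sketch; it relies on the model-card snippet above rather than on a confirmed mapping):

from transformers import AutoConfig, AutoModelForImageTextToText

config = AutoConfig.from_pretrained("ByteDance/Dolphin")
print(type(config).__name__)  # VisionEncoderDecoderConfig

# The model card loads the checkpoint through AutoModelForImageTextToText,
# so this class should resolve it where the generic AutoModel may not.
model = AutoModelForImageTextToText.from_pretrained("ByteDance/Dolphin")
print(type(model).__name__)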

dolfim-ibm avatar Jun 13 '25 14:06 dolfim-ibm

Using transformers_model_type=TransformersModelType.AUTOMODEL as suggested also fails. What should be changed to enable Dolphin?

DOLPHIN_VISION_TRANSFORMERS = InlineVlmOptions(
    repo_id="ByteDance/dolphin",
    prompt="Convert this page to markdown. Do not miss any text and only output the bare markdown!",
    response_format=ResponseFormat.MARKDOWN,
    inference_framework=InferenceFramework.TRANSFORMERS,
    transformers_model_type=TransformersModelType.AUTOMODEL,
    supported_devices=[
        AcceleratorDevice.CPU,
        AcceleratorDevice.CUDA,
        AcceleratorDevice.MPS,
    ],
    scale=2.0,
    temperature=0.0,
)

ValueError: Unrecognized configuration class <class 'transformers.models.vision_encoder_decoder.configuration_vision_encoder_decoder.VisionEncoderDecoderConfig'> for this kind of AutoModel: AutoModel.
Model type should be one of AlbertConfig, AlignConfig, AltCLIPConfig, AriaConfig, ..., ZambaConfig, Zamba2Config (a long list of supported config classes follows; VisionEncoderDecoderConfig is not among them).

Stack Trace:
  File "/site-packages/dagster/_core/execution/plan/utils.py", line 56, in op_execution_error_boundary
    yield
  File "/site-packages/dagster/_utils/__init__.py", line 392, in iterate_with_context
    next_output = next(iterator)
  File "/Users/geoheil/development/promonow/jubust/services/data-pipeline-patents/src/code_location_patents/code_location_patents/assets/patents/pipeline.py", line 207, in full_naive_ocr_vlm
    conversion_result = compute_full_naive_ocr_vlm(context, raw_patent)
  File "/Users/geoheil/development/promonow/jubust/services/data-pipeline-patents/src/code_location_patents/code_location_patents/utils/timing.py", line 40, in wrapper
    result, execution_time = timeit(func)(context, *args, **kwargs)
                             ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/geoheil/development/promonow/jubust/services/data-pipeline-patents/src/code_location_patents/code_location_patents/utils/timing.py", line 23, in wrapper
    result = func(*args, **kwargs)
  File "/Users/geoheil/development/promonow/jubust/services/data-pipeline-patents/src/code_location_patents/code_location_patents/assets/patents/full_naive_ocr_vlm.py", line 82, in compute_full_naive_ocr_vlm
    conv_res = doc_converter.convert(raw_patent)
  File "/site-packages/pydantic/_internal/_validate_call.py", line 38, in wrapper_function
    return wrapper(*args, **kwargs)
  File "/site-packages/pydantic/_internal/_validate_call.py", line 111, in __call__
    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
  File "/site-packages/docling/document_converter.py", line 227, in convert
    return next(all_res)
  File "/site-packages/docling/document_converter.py", line 250, in convert_all
    for conv_res in conv_res_iter:
                    ^^^^^^^^^^^^^
  File "/site-packages/docling/document_converter.py", line 285, in _convert
    for item in map(
                ~~~^
        partial(self._process_document, raises_on_error=raises_on_error),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        input_batch,
        ^^^^^^^^^^^^
    ):
    ^
  File "/site-packages/docling/document_converter.py", line 331, in _process_document
    conv_res = self._execute_pipeline(in_doc, raises_on_error=raises_on_error)
  File "/site-packages/docling/document_converter.py", line 352, in _execute_pipeline
    pipeline = self._get_pipeline(in_doc.format)
  File "/site-packages/docling/document_converter.py", line 314, in _get_pipeline
    self.initialized_pipelines[cache_key] = pipeline_class(
                                            ~~~~~~~~~~~~~~^
        pipeline_options=pipeline_options
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/site-packages/docling/pipeline/vlm_pipeline.py", line 99, in __init__
    HuggingFaceTransformersVlmModel(
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        enabled=True,  # must be always enabled for this pipeline to make sense.
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<2 lines>...
        vlm_options=vlm_options,
        ^^^^^^^^^^^^^^^^^^^^^^^^
    ),
    ^
  File "/site-packages/docling/models/vlm_models_inline/hf_transformers_model.py", line 99, in __init__
    self.vlm_model = model_cls.from_pretrained(
                     ~~~~~~~~~~~~~~~~~~~~~~~~~^
        artifacts_path,
        ^^^^^^^^^^^^^^^
    ...<7 lines>...
        trust_remote_code=vlm_options.trust_remote_code,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/site-packages/transformers/models/auto/auto_factory.py", line 574, in from_pretrained
    raise ValueError(
    ...<2 lines>...
    )

geoHeil avatar Jun 13 '25 15:06 geoHeil

You can try to add AutoModelForImageTextToText.

Enum definition:

https://github.com/docling-project/docling/blob/0432a31b2f7c9fe944c3a1d4b608ef938b4f2299/docling/datamodel/pipeline_options_vlm_model.py#L26-L29

Usage:

https://github.com/docling-project/docling/blob/0432a31b2f7c9fe944c3a1d4b608ef938b4f2299/docling/models/vlm_models_inline/hf_transformers_model.py#L83-L93

And in case you have to use a different prompt, you can add another if/else branch in https://github.com/docling-project/docling/blob/0432a31b2f7c9fe944c3a1d4b608ef938b4f2299/docling/models/vlm_models_inline/hf_transformers_model.py#L163

dolfim-ibm avatar Jun 13 '25 15:06 dolfim-ibm

I have made these changes:

class TransformersModelType(str, Enum):
    AUTOMODEL = "automodel"
    AUTOMODEL_VISION2SEQ = "automodel-vision2seq"
    AUTOMODEL_CAUSALLM = "automodel-causallm"
    AUTOMODEL_FORIMAGETEXTTOTEXT = "automodel-forimagetexttotext"


DOLPHIN_VISION_TRANSFORMERS = InlineVlmOptions(
    repo_id="ByteDance/dolphin",
    prompt="Convert this page to markdown. Do not miss any text and only output the bare markdown!",
    response_format=ResponseFormat.MARKDOWN,
    inference_framework=InferenceFramework.TRANSFORMERS,
    transformers_model_type=TransformersModelType.AUTOMODEL_FORIMAGETEXTTOTEXT,
    supported_devices=[
        AcceleratorDevice.CPU,
        AcceleratorDevice.CUDA,
        AcceleratorDevice.MPS,
    ],
    scale=2.0,
)

model_cls: Any = AutoModel
if (
    self.vlm_options.transformers_model_type
    == TransformersModelType.AUTOMODEL_CAUSALLM
):
    model_cls = AutoModelForCausalLM
elif (
    self.vlm_options.transformers_model_type
    == TransformersModelType.AUTOMODEL_VISION2SEQ
):
    model_cls = AutoModelForVision2Seq
elif (
    self.vlm_options.transformers_model_type
    == TransformersModelType.AUTOMODEL_FORIMAGETEXTTOTEXT
):
    model_cls = AutoModelForImageTextToText

The same error still occurs (although the right model class is now used):

ValueError: Cannot use apply_chat_template because this processor does not have a chat template.

hf_transformers_model.py", line 139, in __call__
    prompt = self.formulate_prompt()
  File "site-packages/docling/models/vlm_models_inline/hf_transformers_model.py", line 202, in formulate_prompt
    prompt = self.processor.apply_chat_template(

Then formulate_prompt was also adapted by adding this condition:

if self.vlm_options.repo_id.lower().startswith("bytedance/dolphin"):
    # Dolphin is a vision-encoder-decoder model, *not* a chat model.
    # It wants plain text:  <s> ...prompt...  <Answer/>
    # More info: https://huggingface.co/ByteDance/Dolphin
    return f"<s>{self.vlm_options.prompt} <Answer/>"
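With the DOLPHIN_VISION_TRANSFORMERS options above, the string handed to the model then looks like this (derived from the f-string and the configured prompt, not captured from an actual run):

<s>Convert this page to markdown. Do not miss any text and only output the bare markdown! <Answer/>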

See a PR with these changes: https://github.com/docling-project/docling/pull/1772

geoHeil avatar Jun 14 '25 06:06 geoHeil

However, for an input document of

WO2021041671-eval-small.pdf

an output document like

WO2021041671-eval-small.json

is generated, but a lot of content is missing, in particular compared to a normal Docling OCR pipeline with RapidOCR:

WO2021041671-eval-small.json

Perhaps this is asking for too much (i.e. more than just integrating Dolphin into Docling), but for the integration to be meaningful it would be rather neat if it delivered similar or better output quality.

What would need to be changed?

As written before:

DOLPHIN_VISION_TRANSFORMERS = InlineVlmOptions(
    repo_id="ByteDance/dolphin",
    prompt="Convert this page to markdown. Do not miss any text and only output the bare markdown!",
    response_format=ResponseFormat.MARKDOWN,
    inference_framework=InferenceFramework.TRANSFORMERS,
    transformers_model_type=TransformersModelType.AUTOMODEL_FORIMAGETEXTTOTEXT,
    supported_devices=[
        AcceleratorDevice.CPU,
        AcceleratorDevice.CUDA,
        AcceleratorDevice.MPS,
    ],
    scale=2.0,
    temperature=0.0,
)

was used.
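For reference, this is roughly how such an InlineVlmOptions object is plugged into a Docling VLM pipeline (a sketch following the docling VLM pipeline examples; the input path is a placeholder):

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Assumes DOLPHIN_VISION_TRANSFORMERS is the InlineVlmOptions object defined above.
pipeline_options = VlmPipelineOptions(vlm_options=DOLPHIN_VISION_TRANSFORMERS)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert("WO2021041671-eval-small.pdf")
print(result.document.export_to_markdown())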

geoHeil avatar Jun 14 '25 06:06 geoHeil