[BUG: Breaking] AttributeError: 'ExtractionOutput' object has no attribute 'metadata'

Open kipavy opened this issue 2 months ago • 4 comments

🧨 Describe the Bug

Hi, the docs weren't clear about how to save output, so I assumed I needed to use from marker.output import save_output. It works well with PdfConverter, but not with ExtractionConverter (which I use for structured extraction): I get AttributeError: 'ExtractionOutput' object has no attribute 'metadata'. On top of that, all my attempts to use Ollama for structured extraction have failed, while it works well with Gemini, but I guess that's another issue. (PS: I found this closed issue, which is exactly my second problem with marker + Ollama, but I wonder why it's closed because it's still happening: https://github.com/datalab-to/marker/issues/785)

📄 Input Document

It happens with any PDF, but here's a short 3-page PDF to test: hal.pdf

📤 Output Trace / Stack Trace

Running page extraction: 100%|██████████| 1/1 [00:03<00:00,  3.42s/it]
Traceback (most recent call last):
  File "/home/kp276129/Documents/ontoflow/pdf_analysis/test.py", line 32, in <module>
    save_output(rendered, output_dir=OUTPUT_DIR, fname_base="hal_extracted_structured")
  File "/nobackup/kp276129/envs/ontoflow/lib/python3.12/site-packages/marker/output.py", line 97, in save_output
    f.write(json.dumps(rendered.metadata, indent=2))
                       ^^^^^^^^^^^^^^^^^
  File "/nobackup/kp276129/envs/ontoflow/lib/python3.12/site-packages/pydantic/main.py", line 1026, in __getattr__
    raise AttributeError(f'{type(self).__name__!r} object has no attribute {item!r}')
AttributeError: 'ExtractionOutput' object has no attribute 'metadata'

βš™οΈ Environment

Please fill in all relevant details:

  • Marker version: marker-pdf 1.10.1
  • Surya version: 0.17.0
  • Python version: 3.12.3
  • PyTorch version: 2.9.0+cu126
  • Transformers version: 4.57.1
  • Operating System :
    • Distributor ID: Ubuntu
    • Description: Ubuntu 24.04.3 LTS
    • Release: 24.04
    • Codename: noble

✅ Expected Behavior

I expected Marker to output hal_extracted_structured.json in OUTPUT_DIR without any error.

📟 Command or Code Used

# https://github.com/datalab-to/marker?tab=readme-ov-file#structured-extraction-beta
from pathlib import Path
from marker.models import create_model_dict
from marker.config.parser import ConfigParser
from marker.converters.extraction import ExtractionConverter
from marker.output import save_output
from templates import PaperMetadata

INPUT_DIR = Path("/home/kp276129/Documents/ontoflow/pdf_analysis/input")
OUTPUT_DIR = Path("/home/kp276129/Documents/ontoflow/pdf_analysis/output")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


schema = PaperMetadata.model_json_schema()

config_parser = ConfigParser({
    "page_schema": schema,
    "use_llm": True,
    "disable_image_extraction": True,
    "ollama_base_url": "http://localhost:11434",
    "ollama_model": "gemma3",
    "llm_service": "marker.services.ollama.OllamaService",
})

converter = ExtractionConverter(
    artifact_dict=create_model_dict(),
    config=config_parser.generate_config_dict(),
    llm_service=config_parser.get_llm_service(),
)

rendered = converter(str(INPUT_DIR / "hal.pdf"))
save_output(rendered, output_dir=OUTPUT_DIR, fname_base="hal_extracted_structured")

Oops, forgot to include my PaperMetadata template:

from __future__ import annotations

from typing import List, Optional, Dict
from pydantic import BaseModel, Field


class Figure(BaseModel):
    """Represents a figure, diagram, or image in the document."""

    caption: Optional[str] = Field(
        None, description="The exact caption of the figure, if one exists."
    )
    description: str = Field(
        ...,
        description=(
            "A detailed textual description of what the image shows."
        ),
    )
    page_number: Optional[int] = Field(
        None, description="The page number where the figure appears."
    )


class PaperMetadata(BaseModel):
    """Metadata model for a scientific article / technical report."""

    title: str = Field(..., description="Title of the article")
    authors: List[str] = Field(
        default_factory=list, description="List of authors, order preserved"
    )
    affiliations: Optional[List[str]] = Field(
        default=None, description="List of affiliations"
    )
    abstract: Optional[str] = Field(None, description="Summary / abstract")
    keywords: Optional[List[str]] = Field(default=None, description="Keywords")
    doi: Optional[str] = Field(None, description="DOI if present")
    publication_date: Optional[str] = Field(
        None,
        description=(
            "Publication date (ISO 'YYYY-MM-DD' preferred). "
            "Accepted formats: 'YYYY-MM-DD', '25 Jul 2017', 'Submitted on 25 Jul 2017'"
        ),
    )
    journal: Optional[str] = Field(
        None, description="Name of the journal / conference"
    )
    volume: Optional[str] = Field(None, description="Volume")
    issue: Optional[str] = Field(None, description="Issue number")
    pages: Optional[str] = Field(None, description="Pages, e.g. '123-135'")

    figures: Optional[List[Figure]] = Field(
        default_factory=list,
        description="List of all figures, diagrams, and images found in the document.",
    )

kipavy, Nov 04 '25 08:11

I think the problem is an AttributeError raised inside the save_output function, but only when it tries to save the results from an ExtractionConverter. The fix, which has already been applied to the marker/output.py file, is to make the metadata-saving step conditional.

with open(...) as f:
    f.write(json.dumps(rendered.metadata, indent=2))

Fix code:

# FIX: Check if the 'metadata' attribute exists before trying to access it.
# ExtractionOutput objects do not have this attribute, causing the bug.
if hasattr(rendered, "metadata"):
    with open(
        os.path.join(output_dir, f"{fname_base}_meta.json"),
        "w+",
        encoding=settings.OUTPUT_ENCODING,
    ) as f:
        f.write(json.dumps(rendered.metadata, indent=2))
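
The conditional write described here can be tried out in isolation. The following is only a minimal sketch, not marker's actual save_output: the helper name save_metadata_if_present, its encoding default, and the SimpleNamespace stand-ins are hypothetical, chosen so the example runs without marker installed.

```python
import json
import os
import tempfile
from types import SimpleNamespace


def save_metadata_if_present(rendered, output_dir, fname_base, encoding="utf-8"):
    """Write <fname_base>_meta.json only when `rendered` has a `metadata` attribute.

    Hypothetical helper for illustration; it mirrors the guard proposed above
    but is not marker's own code. Returns the written path, or None if skipped.
    """
    if not hasattr(rendered, "metadata"):
        return None
    path = os.path.join(output_dir, f"{fname_base}_meta.json")
    with open(path, "w+", encoding=encoding) as f:
        f.write(json.dumps(rendered.metadata, indent=2))
    return path


# Stand-ins for the two kinds of output discussed in this issue (assumed shapes).
with_meta = SimpleNamespace(metadata={"page_count": 3})
without_meta = SimpleNamespace(payload="{}")  # deliberately has no `metadata`

out_dir = tempfile.mkdtemp()
written = save_metadata_if_present(with_meta, out_dir, "doc")
skipped = save_metadata_if_present(without_meta, out_dir, "doc2")
```

With the guard in place, an object without metadata is skipped quietly instead of raising AttributeError, while objects that do carry metadata are saved as before.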

gyugut, Nov 07 '25 05:11

Hello, yes, that's it, but is there a PR for this? I don't understand how this hasn't already been fixed. I can open the PR.

kipavy, Nov 07 '25 07:11
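
Until a fix lands, one possible caller-side workaround is to skip save_output for extraction results and serialize the object directly. The sketch below assumes the rendered object is a Pydantic v2 model (the traceback passes through pydantic/main.py), so it should expose model_dump_json; the helper save_structured_output and the FakeExtractionOutput stand-in are hypothetical and exist only so the example runs without marker or Pydantic installed.

```python
import json
import tempfile
from pathlib import Path


def save_structured_output(rendered, output_dir, fname_base):
    """Write `rendered` to <output_dir>/<fname_base>.json and return the path.

    Hypothetical helper, not part of marker. Assumes `rendered` is a
    Pydantic v2 model, which provides model_dump_json().
    """
    out_path = Path(output_dir) / f"{fname_base}.json"
    out_path.write_text(rendered.model_dump_json(indent=2), encoding="utf-8")
    return out_path


class FakeExtractionOutput:
    """Stand-in exposing the same model_dump_json() call as a Pydantic model."""

    def __init__(self, **fields):
        self._fields = fields

    def model_dump_json(self, indent=None):
        return json.dumps(self._fields, indent=indent)


demo = FakeExtractionOutput(title="Example paper", authors=["A. Author"])
demo_path = save_structured_output(demo, tempfile.mkdtemp(), "hal_extracted_structured")
loaded = json.loads(demo_path.read_text(encoding="utf-8"))
```

In the script from the report, the last line would then become a call such as save_structured_output(rendered, OUTPUT_DIR, "hal_extracted_structured"), assuming the converter's return value behaves as described.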

I said it wrong. It hasn't already been applied; I fixed it myself. Good PR.

gyugut, Nov 07 '25 09:11

Could you please provide the code with the LLM?

ankit8347, Nov 08 '25 12:11