Document Intelligence is not working

Open sumitbindra opened this issue 10 months ago • 11 comments

When I use this:

from markitdown import MarkItDown

md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("test.pdf")
print(result.text_content)

I get an error saying: No parameter named "docintel_endpoint"

I have version: markitdown==0.0.1a3

sumitbindra avatar Feb 04 '25 23:02 sumitbindra

I got the same error ~

myyang19770915 avatar Feb 06 '25 06:02 myyang19770915

I found that in markitdown version 0.0.1a3, the _markitdown.py file doesn't contain docintel_endpoint at all, as shown below:

Image

myyang19770915 avatar Feb 07 '25 01:02 myyang19770915

Yeah maybe the readme is out of sync

sumitbindra avatar Feb 07 '25 02:02 sumitbindra

I am also a bit confused how this can work without an API Key parameter. Or does it require Entra ID?

liamca avatar Feb 08 '25 15:02 liamca

Thanks for the report. Let me investigate. It looks like something didn't quite make it in.

afourney avatar Feb 09 '25 05:02 afourney

In v 0.0.1a4 there's now another issue when running it:

.venv\Lib\site-packages\markitdown\_markitdown.py", line 1727, in _convert
    if res is not None:
       ^^^
UnboundLocalError: cannot access local variable 'res' where it is not associated with a value

The same happens when building from source (0.0.2a1).

EmanueleMeazzo avatar Feb 12 '25 17:02 EmanueleMeazzo

OK, thanks for the report, and sorry for the inconvenience. I'm trying to provision a Document Intelligence endpoint to test on and integrate into the CI, so that we can avoid these breaks in the future. Until now, I could only rely on others to test and report findings.

afourney avatar Feb 13 '25 04:02 afourney

This doesn't make sense. The endpoint needs to take an API key of some sort, or the docs should at least tell us whether the API key is read from .env variables.

Physium avatar Feb 14 '25 09:02 Physium

Looking at the code, it currently only supports Azure Identity auth (which could be the cause of some of the above issues if an auth error isn't handled properly). It would be nice, and not too complex, to add key auth too. If I find the time, I can add it via a PR.

EmanueleMeazzo avatar Feb 14 '25 09:02 EmanueleMeazzo

That would be great if that could be sorted out.

to add the Key Auth too

I was thinking either enable support for API key auth as you suggested or simply accept an initialized client directly.

mathieuisabel avatar Feb 24 '25 13:02 mathieuisabel

The UnboundLocalError reported above is due to the exception handling:


                try:
                    res = converter.convert(local_path, **_kwargs)
                except Exception:
                    failed_attempts.append(
                        FailedConversionAttempt(
                            converter=converter, exc_info=sys.exc_info()
                        )
                    )

                if res is not None:
                    # Normalize the content
                    res.text_content = "\n".join(
                        [line.rstrip() for line in re.split(r"\r?\n", res.text_content)]
                    )
                    res.text_content = re.sub(r"\n{3,}", "\n\n", res.text_content)

                    # Todo
                    return res

This should be a try/except/else block or something similar. If converter.convert raises, the variable res is never assigned, so the "if res is not None" check that follows fails with this error (when the point of the exception handling was to record the failure without raising).
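A minimal sketch of the try/except/else shape, not the actual markitdown code: convert_with_fallback, the plain-callable converters, and the (converter, exc_info) tuples are hypothetical stand-ins for the loop in _convert and FailedConversionAttempt. The idea is that res is only read in the else branch, so it is always bound when it is checked.

```python
import sys


def convert_with_fallback(converters, local_path):
    """Try each converter in order and return (result, failed_attempts).

    Hypothetical stand-in for the loop in _convert: converters are plain
    callables here, and failures are stored as (converter, exc_info) tuples.
    """
    failed_attempts = []
    for converter in converters:
        try:
            res = converter(local_path)
        except Exception:
            # Record the failure and move on to the next converter.
            failed_attempts.append((converter, sys.exc_info()))
        else:
            # Only reached when no exception was raised, so res is bound.
            if res is not None:
                return res, failed_attempts
    # Every converter either raised or declined by returning None.
    return None, failed_attempts


# Toy converters, for illustration only.
def broken(path):
    raise ValueError("boom")


def declines(path):
    return None


def works(path):
    return f"converted {path}"
```

Calling convert_with_fallback([broken, declines, works], "test.pdf") records the failure from broken, skips declines, and returns the result from works instead of raising UnboundLocalError. Initializing res = None before the try would also avoid the error; the else branch just keeps the "success" path visibly separate from the failure path.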

oegedijk avatar Mar 04 '25 12:03 oegedijk