
[Feature Request] Magika Dependency Optional

Open RKeelan opened this issue 8 months ago • 10 comments

Would it be possible to make the Magika dependency optional (i.e., pulled in as an extra)? I'm trying to use MarkItDown in the browser via Pyodide and Magika's dependency on ONNX is causing trouble.

My understanding is that Magika is used to determine the stream type, I guess in cases where the application doesn't provide it (or maybe provides incorrect information by accident?). In my case, I'd be happy to trade away that flexibility in exchange for dropping the Magika / ONNX dependency.

I tested with a forked repo where I removed self._magika.identify_stream(file_stream) from _get_stream_info_guesses() and I was able to convert PDFs in the browser.
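A guarded import is one common way to make a dependency like this optional. The sketch below is not MarkItDown's actual code: `load_optional` and `build_detector` are hypothetical helper names, and the only Magika call assumed is the `Magika()` constructor.

```python
import importlib


def load_optional(module_name):
    """Import a module by name, returning None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def build_detector():
    """Return a Magika instance if the package is installed, else None.

    Hypothetical helper: a caller would skip content sniffing when this
    returns None and rely only on the extension or MIME type it was given.
    """
    magika = load_optional("magika")
    return magika.Magika() if magika is not None else None
```

With this pattern, the package could list magika under an extra instead of a hard requirement, and stream identification would simply be skipped when it is absent.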

RKeelan avatar May 04 '25 16:05 RKeelan

👍 We're trying to use this in an AWS Lambda and the >100 MB runtime dependency makes this more challenging than it needs to be. Being able to specify a file type based on extension, MIME type, or something lighter-weight like the file command would be nice.

I forked the repo and removed the magika dependency, and it seems to work okay
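As a rough illustration of the lighter-weight option, the extension and MIME type can be inferred from the file name alone with Python's standard library. This is only a sketch: `guess_from_name` is a hypothetical helper, not part of MarkItDown, and it trusts the file name rather than inspecting the content.

```python
import mimetypes
from pathlib import Path


def guess_from_name(filename):
    """Return (extension, mimetype) inferred from the file name alone."""
    # Lower-case suffix such as ".pdf", or None when the name has no suffix.
    extension = Path(filename).suffix.lower() or None
    # guess_type returns (mimetype, encoding); only the mimetype is needed here.
    mimetype, _encoding = mimetypes.guess_type(filename)
    return extension, mimetype
```

The trade-off is that a wrongly named file is misidentified, which is acceptable when the application already knows what kind of files it processes.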

RKeelan avatar May 26 '25 00:05 RKeelan

I also think this feature is necessary.

Soulter avatar May 31 '25 04:05 Soulter

It would be nice to have the ability to slim down this package. I'm also trying to use this in a Lambda and have resorted to building a container image and deploying that to Lambda, which allows me 10 GB of space. The ability to guess types is useful but seems like overkill if you are using this to process known file types.

coneill-relay avatar Jun 23 '25 20:06 coneill-relay

It would help to make the file detection dependency optional in order to bring down the size of the pkg. Also, if every file type was offered as an optional install that would help bring down the size. My use case was just to process HTML, and didn't need the entire pkg to do that.

aswinbharadwaj avatar Jul 23 '25 16:07 aswinbharadwaj

This would be fantastic! We need to support as many hosting environments as possible, and this required dependency makes it difficult.

htxryan avatar Jul 31 '25 12:07 htxryan

Hi, I created a fork of markitdown a few weeks ago with magika removed. Anyone can use https://github.com/Soulter/markitdown as a temporary solution until the authors fix this issue.

pip install markitdown-no-magika
# or select some of the following features:
# pptx = ["python-pptx"]
# docx = ["mammoth", "lxml"]
# xlsx = ["pandas", "openpyxl"]
# xls = ["pandas", "xlrd"]
# pdf = ["pdfminer.six"]
# outlook = ["olefile"]
# audio-transcription = ["pydub", "SpeechRecognition"]
# youtube-transcription = ["youtube-transcript-api"]
# az-doc-intel = ["azure-ai-documentintelligence", "azure-identity"]
# 
# example: 
# pip install "markitdown-no-magika[pptx,docx,xlsx,pdf]"

Soulter avatar Jul 31 '25 12:07 Soulter

Plus one on this. I can't use it in an AWS Lambda; 140 MB for Magika is ridiculous.

typefox09 avatar Sep 11 '25 07:09 typefox09

FYI for anyone interested (i.e. @typefox09): the markitdown-no-magika repo @Soulter made does work on Lambda, at least in a POC test I did for a future project. However, I suspect a slightly cleaner PR would be better, so that it works with the main repo.

Small reminder, to import using that package, it would be: from markitdown_no_magika import MarkItDown

ClaraLeigh avatar Sep 12 '25 09:09 ClaraLeigh

Digital Ocean Functions user here. They require the zipped environment to be at most 48 MB.

magika ships with a large binary blob for its model (~30 MB if I recall correctly), and it also pulls in https://pypi.org/project/onnxruntime/, which is another heavy dependency.

To remove these two dependencies, I run this post-processing step after pip install markitdown:

pip install markitdown
pip uninstall -y magika onnxruntime

and then replace magika with a shim:

cp magika.py virtualenv/lib/python3.11/site-packages/

Here's the magika.py shim file:

"""Magika shim which always fails stream identification"""

class Magika:
    def __init__(self, *args, **kwargs):
        pass

    def identify_stream(self, stream):
        return MagikaResult()


class MagikaResult:
    def __init__(self):
        self.status = "failed"


__version__ = "0.6.1"
__all__ = ['Magika']

Now you will always need to explicitly pass either file_extension=".html" (deprecated according to the docs) or stream_info=StreamInfo(mimetype="text/html") to .convert(), because the shim never identifies the stream and converter detection would otherwise fail.
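A minimal sketch of what that call might look like for HTML input, assuming a MarkItDown version that exports StreamInfo and accepts stream_info on convert_stream(). The wrapper name `convert_html` is hypothetical, and the import is done inside the function so the module still loads where markitdown is not installed.

```python
import io


def convert_html(html_bytes):
    """Convert HTML bytes to Markdown, stating the stream type explicitly."""
    # Lazy import: assumes markitdown (with or without the shim) is installed.
    from markitdown import MarkItDown, StreamInfo

    converter = MarkItDown()
    # With content detection disabled, the caller must say what the stream is.
    info = StreamInfo(mimetype="text/html", extension=".html")
    result = converter.convert_stream(io.BytesIO(html_bytes), stream_info=info)
    return result.text_content
```

For example, convert_html(b"<h1>Title</h1>") should return Markdown containing the heading text.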

The entire zipped environment is now only ~20 MB and the Digital Ocean Function executes as expected. (I also removed click and a few other dependencies that seem unnecessary for my use case.)

I found the above approach better than forking the markitdown repo as other users did.

wafriq avatar Oct 28 '25 18:10 wafriq