[Feature Request] Magika Dependency Optional
Would it be possible to make the Magika dependency optional (i.e., pulled in as an extra)? I'm trying to use MarkItDown in the browser via Pyodide and Magika's dependency on ONNX is causing trouble.
My understanding is that Magika is used to determine the stream type, I guess in cases where the application doesn't provide it (or maybe provides incorrect information by accident?). In my case, I'd be happy to trade away that flexibility in exchange for dropping the Magika / ONNX dependency.
I tested with a forked repo where I removed self._magika.identify_stream(file_stream) from _get_stream_info_guesses() and I was able to convert PDFs in the browser.
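For illustration, here is a minimal sketch of how an optional Magika import could look. It is not the actual MarkItDown code: `make_detector` and `identify_or_none` are hypothetical helper names, and the only Magika call assumed is `identify_stream`, as mentioned above.

```python
import io
from typing import Any, BinaryIO, Optional

try:
    # Heavy optional dependency (pulls in onnxruntime).
    from magika import Magika
except ImportError:
    Magika = None


def make_detector() -> Optional[Any]:
    """Return a Magika instance if the package is installed, else None."""
    return Magika() if Magika is not None else None


def identify_or_none(detector: Optional[Any], stream: BinaryIO) -> Optional[Any]:
    """Run content-based detection only when a detector is available.

    Returning None tells the caller to rely on the extension or MIME type
    supplied by the application instead.
    """
    if detector is None:
        return None
    return detector.identify_stream(stream)


# Without a detector, no guess is made from the stream contents.
print(identify_or_none(None, io.BytesIO(b"%PDF-1.7")))
```

With something like this in place, the package could list magika under an extra, and callers without it would simply pass the type explicitly.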
👍 We're trying to use this in an AWS lambda and the >100MB runtime dependency makes this more challenging than it needs to be. Being able to specify a file type based on extension, mime type, or something lighter weight like file would be nice.
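As a rough sketch of the lighter-weight option, the Python standard library can already infer an extension and MIME type from a file name alone. `guess_from_name` is a hypothetical helper, not part of MarkItDown, and it trusts the name rather than inspecting the content.

```python
import mimetypes
from pathlib import PurePath
from typing import Optional, Tuple


def guess_from_name(filename: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (extension, mimetype) inferred from the file name only."""
    # Lower-cased suffix, or None when the name has no extension.
    extension = PurePath(filename).suffix.lower() or None
    # guess_type returns (type, encoding); only the type is needed here.
    mimetype, _encoding = mimetypes.guess_type(filename)
    return extension, mimetype


print(guess_from_name("report.pdf"))
```

This adds no dependencies at all, at the cost of being wrong whenever a file is misnamed.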
I forked the repo and removed the magika dependency, and it seems to work okay
I also think this feature is necessary.
It would be nice to have the ability to slim down this package. I'm also trying to use this in a Lambda and have resorted to building a container image and deploying that to Lambda, which allows me 10 GB of space. The ability to guess types is useful, but it seems like overkill if you are using this to process known file types.
It would help to make the file detection dependency optional in order to bring down the size of the pkg. Also, if every file type was offered as an optional install that would help bring down the size. My use case was just to process HTML, and didn't need the entire pkg to do that.
This would be fantastic! We need to support as many hosting environments as possible, and this required dependency makes it difficult.
Hi, I created a fork of markitdown a few weeks ago that removes magika. You can use https://github.com/Soulter/markitdown as a temporary solution until the authors fix this issue.
pip install markitdown-no-magika
# or select some of the following features:
# pptx = ["python-pptx"]
# docx = ["mammoth", "lxml"]
# xlsx = ["pandas", "openpyxl"]
# xls = ["pandas", "xlrd"]
# pdf = ["pdfminer.six"]
# outlook = ["olefile"]
# audio-transcription = ["pydub", "SpeechRecognition"]
# youtube-transcription = ["youtube-transcript-api"]
# az-doc-intel = ["azure-ai-documentintelligence", "azure-identity"]
#
# example:
# pip install "markitdown-no-magika[pptx,docx,xlsx,pdf]"
Plus one on this. I can't use it in an AWS Lambda; 140 MB for Magika is ridiculous.
FYI for anyone interested (i.e. @typefox09): the markitdown-no-magika package that @Soulter made does work on Lambda, at least in a POC test I did for a future project. However, I suspect a slightly cleaner PR against the main repo would be the better fix.
Small reminder, to import using that package, it would be: from markitdown_no_magika import MarkItDown
User of Digital Ocean Functions here which require a zipped environment to be max 48 MB.
magika comes both with a large (~30 MB if I recall correctly) binary blob for its model and it's also pulling in https://pypi.org/project/onnxruntime/ which itself is another heavy dependency.
To remove these two dependencies, I run this post-processing step after pip install markitdown:
pip install markitdown
pip uninstall -y magika onnxruntime
and then replace magika with a shim:
cp magika.py virtualenv/lib/python3.11/site-packages/
Here's the magika.py shim file:
"""Magika shim which always fails stream identification"""
class Magika:
    def __init__(self, *args, **kwargs):
        pass

    def identify_stream(self, stream):
        return MagikaResult()


class MagikaResult:
    def __init__(self):
        self.status = "failed"


__version__ = "0.6.1"
__all__ = ['Magika']
Now you will always need to explicitly pass either file_extension=".html" (deprecated according to the docs) or stream_info=StreamInfo(mimetype="text/html") to .convert() to prevent the converter detection from failing.
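A hedged sketch of what such a call might look like follows. It assumes a markitdown release that exports `StreamInfo` and offers `convert_stream`, and it uses text/html, the registered MIME type for HTML. `html_to_markdown` is a hypothetical wrapper, not part of the library.

```python
import io

try:
    from markitdown import MarkItDown, StreamInfo
except ImportError:  # markitdown is not installed in this environment
    MarkItDown = StreamInfo = None


def html_to_markdown(html_bytes: bytes) -> str:
    """Convert HTML bytes to Markdown, supplying the stream type explicitly."""
    if MarkItDown is None:
        raise RuntimeError("markitdown is not installed")
    # Explicit hints, so nothing depends on content-based detection.
    hints = StreamInfo(extension=".html", mimetype="text/html")
    result = MarkItDown().convert_stream(io.BytesIO(html_bytes), stream_info=hints)
    return result.text_content
```

Passing both the extension and the MIME type is redundant on purpose: either one should be enough for the HTML converter to accept the stream.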
The entire zipped environment is now only ~20 MB and the Digital Ocean Function executes as expected. (I also removed click and a few other dependencies that seem unnecessary for my use case)
I found the above approach better than forking the markitdown repo, as other users did.