optional dependencies
Unsure about this - perhaps should be an optional dep?
Originally posted by @casperdcl in https://github.com/microsoft/markitdown/pull/100#discussion_r1889308792
@gagb @casperdcl , Yeah, more generally, a lot of these should be optional dependencies.
Ideally we would have something like:
pip install markitdown[ocr,openai,yt_transcript]
Etc. to optionally include some of the more esoteric or heavy dependencies. We can then just include or exclude the converters accordingly.
What do you think?
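One way to picture "include or exclude the converters accordingly" is a small registry that only enables a converter when the module behind it can be found. This is a minimal sketch, not markitdown's actual internals: the converter names and the `CONVERTER_REQUIREMENTS` mapping are hypothetical.

```python
import importlib.util

# Hypothetical mapping: converter name -> top-level module it needs
# (None means the converter has no optional requirement).
CONVERTER_REQUIREMENTS = {
    "plain_text": None,
    "llm_image_caption": "openai",
    "youtube_transcript": "youtube_transcript_api",
}


def available_converters(requirements=CONVERTER_REQUIREMENTS):
    """Return the converter names whose required module can be located."""
    enabled = []
    for name, module in requirements.items():
        # find_spec locates a top-level module without importing it.
        if module is None or importlib.util.find_spec(module) is not None:
            enabled.append(name)
    return enabled


print(available_converters())
```

With this shape, a bare `pip install markitdown` would still give the converters that need nothing extra, and the rest would light up as extras are installed.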
I like this, but there is so much appeal in the simplicity of just running pip install markitdown
Aliases are quite easy to implement...
pip install markitdown[all] could be made identical to pip install markitdown[ocr,llm,yt].
Pretty common Pythonicity.
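The alias idea can be sketched as an "all" group computed from the other groups, which is the usual idiom when extras are declared in Python and passed to setuptools as `extras_require`. The group contents below are placeholders, and `example-ocr-package` is a made-up name; with a pyproject.toml-based build the same groups would go under `[project.optional-dependencies]`.

```python
# Placeholder dependency lists; the real groups would be decided by the maintainers.
extras = {
    "ocr": ["example-ocr-package"],  # hypothetical package name
    "llm": ["openai"],
    "yt": ["youtube-transcript-api"],
}

# "all" is the de-duplicated, sorted union of every other group.
extras["all"] = sorted({dep for deps in extras.values() for dep in deps})

print(extras["all"])
```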
btw you should probably rename this issue "optional dependencies" or similar. Also you can use a markdown quote block (>) rather than code block in the description :wink:
I think we can just make all the dependencies optional and have the script install them when needed, the way ultralytics does it: there, if a package is needed it gets installed at runtime, and updates happen at runtime too.
Whoa at most you could do:
try:
    import openai
except ImportError as exc:
    raise ImportError("please `pip/conda install openai` or `pip install markitdown[llm]`") from exc
Meanwhile side-effects like this are highly discouraged:
import os
import sys

try:
    import openai
except ImportError:
    os.system(f"{sys.executable} -m pip install openai")
    import openai
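A side-effect-free version of the first pattern can be wrapped in a helper so that every converter reports a missing dependency the same way. This is only a sketch: `require_optional` is a hypothetical name, and the `markitdown[...]` extras in the message assume the extras proposed in this thread exist.

```python
import importlib


def require_optional(module_name, extra):
    """Import module_name, or raise an ImportError naming the extra to install."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature; "
            f"try `pip install markitdown[{extra}]`"
        ) from exc


# A stdlib module stands in for an optional dependency here.
json = require_optional("json", extra="llm")
print(json.dumps({"ok": True}))
```

Calling the helper inside the converter, rather than at module import time, means the error only appears when someone actually uses the feature.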
i think something like
import importlib
import importlib.metadata
import subprocess
import sys


def run_pip(*args):
    """Run pip in a subprocess and return its exit code (pip has no supported in-process API)."""
    return subprocess.call([sys.executable, "-m", "pip", *args])


def check_and_install_module(module_name, check_for_updates=False):
    """
    Check if a Python module is installed. Optionally check and perform updates.

    Args:
        module_name (str): Name of the module to check and install
        check_for_updates (bool, optional): Whether to check and perform updates. Defaults to False.

    Returns:
        dict: A dictionary with installation/update status
    """
    try:
        # Try to import the module
        __import__(module_name)
        print(f"Module {module_name} is already installed.")
        # Check for updates if requested
        if check_for_updates:
            try:
                # Get current installed version
                current_version = importlib.metadata.version(module_name)
                # Check for available updates
                run_pip("list", "--outdated")
                # Perform update
                print(f"Updating {module_name}...")
                update_result = run_pip("install", "--upgrade", module_name)
                if update_result == 0:
                    # Get new version after update
                    new_version = importlib.metadata.version(module_name)
                    print(f"Updated {module_name} from {current_version} to {new_version}")
                    return {
                        'installed': True,
                        'updated': True,
                        'old_version': current_version,
                        'new_version': new_version
                    }
                else:
                    print(f"Failed to update {module_name}")
                    return {
                        'installed': True,
                        'updated': False
                    }
            except Exception as update_error:
                print(f"Error checking/updating {module_name}: {update_error}")
                return {
                    'installed': True,
                    'updated': False
                }
        return {
            'installed': True,
            'updated': False
        }
    except ImportError:
        print(f"Module {module_name} not found. Attempting to install...")
        try:
            # Use pip to install the module
            install_result = run_pip("install", module_name)
            if install_result == 0:
                # Make the import system notice the new package, then verify it imports
                importlib.invalidate_caches()
                __import__(module_name)
                print(f"Successfully installed {module_name}")
                return {
                    'installed': True,
                    'updated': False
                }
            else:
                print(f"Failed to install {module_name}")
                return {
                    'installed': False,
                    'updated': False
                }
        except Exception as e:
            print(f"Failed to install {module_name}. Error: {e}")
            return {
                'installed': False,
                'updated': False
            }


# Example usage
if __name__ == "__main__":
    # Check and install pandas
    print(check_and_install_module('pandas'))
    # Check, install, and update requests
    print(check_and_install_module('requests', check_for_updates=True))
👍 to the idea of using optional dependencies. i wanted to try using markitdown as a global install (pip install -u markitdown) in my base python environment but when i did that, i got the whole kitchen sink:
Installing collected packages: pytz, pydub, puremagic, XlsxWriter, tzdata, speechrecognition, soupsieve, sniffio, pydantic-core, Pillow, pathvalidate, numpy, lxml, jiter, h11, et-xmlfile, defusedxml, cobble, annotated-types, youtube-transcript-api, python-pptx, pydantic, pandas, openpyxl, mammoth, httpcore, cryptography, beautifulsoup4, anyio, pdfminer-six, markdownify, httpx, openai, markitdown
there's a whole bunch of users out there that won't have the system privs to bring in this many dependencies (or will stay away because it doesn't make sense). it seems like it should be possible to install the minimum set for minimum stated functionality, "MarkItDown is a utility for converting various files to Markdown."
This also increases exposure to a) supply chain attacks and b) CVEs in the whole repo.
Just today, I added markitdown to my repo running safety checks, and got hit with a CVE:
youtube-transcript-api (==0.6.2) [1 vulnerability found]
-> Vuln ID 74190:
Affected versions of youtube_transcript_api are vulnerable to XML External
Entity (XXE) Injection (CWE-611). T...
Update youtube-transcript-api (==0.6.2) to youtube-transcript-api==0.6.3 to fix
Now here I can potentially just add a constraint on the dependency, but there will not always be "quick fixes", which prevents me from reliably using this library in anything production-grade. Additionally, when working with any kind of docker setup / container registry, every additional dependency and every additional MB translates to potentially a LOT of extra cost. Add to that that maybe sometimes I want to make sure that a video is not accidentally leaked to a 3rd party API when using markitdown?
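The "constraint on the dependency" mentioned here boils down to requiring the fixed release or later, e.g. a line like `youtube-transcript-api>=0.6.3` in a requirements or constraints file. As a rough illustration of what that comparison means, here is a sketch with two hypothetical helpers; it only handles plain dotted numeric versions, and real tooling would rely on pip or the `packaging` library instead.

```python
def parse_version(text):
    """Turn a plain dotted version such as '0.6.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_at_least(installed, minimum):
    """True if installed >= minimum (simple numeric versions only)."""
    return parse_version(installed) >= parse_version(minimum)


# Versions taken from the safety report above.
print(is_at_least("0.6.2", "0.6.3"))
print(is_at_least("0.6.3", "0.6.3"))
```

Comparing tuples of ints rather than raw strings matters once a component reaches two digits, since "0.10.0" sorts before "0.6.3" as text.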
openai is only used in the tests. Markitdown is already using hatch, so it should create a test env and require openai there: https://hatch.pypa.io/1.12/config/environment/overview/#dependencies
Maybe there are even more dependencies, haven't checked in detail.
Yeah I want to move to optional dependencies asap. Relatedly, the latest version in main (not PyPI) also supports 3rd party extensions, minimizing -- I hope -- the need for the kitchen sink.
So this is a known problem, and one I'm very keen to solve. The current status quo is a consequence of having lifted the code out of another (also experimental) project -- namely Magentic One. It hasn't yet been sufficiently generalized.
> This also increases exposure to a) supply chain attacks and b) CVEs in the whole repo.
> Just today, I added markitdown to my repo running safety checks, and got hit with a CVE:
> youtube-transcript-api (==0.6.2) [1 vulnerability found]
> -> Vuln ID 74190:
> Affected versions of youtube_transcript_api are vulnerable to XML External
> Entity (XXE) Injection (CWE-611). T...
> Update youtube-transcript-api (==0.6.2) to youtube-transcript-api==0.6.3 to fix
> Now here I can potentially just add a constraint on the dependency, but there will not always be "quick fixes", which prevents me from reliably using this library in anything production-grade. Additionally, when working with any kind of docker setup / container registry, every additional dependency and every additional MB translates to potentially a LOT of extra cost. Add to that that maybe sometimes I want to make sure that a video is not accidentally leaked to a 3rd party API when using markitdown?
Yes, this keeps me up at night. I want to make a series of breaking changes for 0.0.2, and I will include a move to optional dependencies in that.
Folks, I have a potential design here. Let me know what you think -- before I expand it to all the converters we provide (this mechanism is not meant to be used with the plugin extensions.... for that, plugins are responsible for dependencies).
https://github.com/microsoft/markitdown/pull/1079