optional dependencies

Open gagb opened this issue 1 year ago • 12 comments

          Unsure about this - perhaps should be an optional dep?

Originally posted by @casperdcl in https://github.com/microsoft/markitdown/pull/100#discussion_r1889308792

gagb avatar Dec 17 '24 22:12 gagb

@gagb @casperdcl, yeah, more generally, a lot of these should be optional dependencies.

Ideally we would have something like:

pip install markitdown[ocr,openai,yt_transcript]

Etc. to optionally include some of the more esoteric or heavy dependencies. We can then just include or exclude the converters accordingly.

What do you think?
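
A minimal sketch of how converters might be included or excluded depending on which optional packages are installed. The converter names and the `OPTIONAL_CONVERTERS` mapping are hypothetical, not markitdown's actual registry; the only real API used is the standard library's `importlib.util.find_spec`.

```python
import importlib.util

# Hypothetical mapping: converter name -> top-level module it needs.
OPTIONAL_CONVERTERS = {
    "llm_image_caption": "openai",
    "youtube_transcript": "youtube_transcript_api",
    "json_passthrough": "json",  # stdlib, so always available
}


def available_converters(requirements=OPTIONAL_CONVERTERS):
    """Return the converter names whose required module can be imported."""
    enabled = []
    for name, module in requirements.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is not None:
            enabled.append(name)
    return enabled


if __name__ == "__main__":
    print(available_converters())
```

With this shape, a bare install would simply report fewer converters instead of failing at import time.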

afourney avatar Dec 17 '24 23:12 afourney

I like this, but there is so much appeal in the simplicity of just running pip install markitdown.

gagb avatar Dec 18 '24 00:12 gagb

Aliases are quite easy to implement...

pip install markitdown[all] could be made identical to pip install markitdown[ocr,llm,yt].

Pretty common Pythonicity.
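
As a rough illustration of the alias idea, the sketch below builds an `all` extra as the union of the others. The extras table and the package assigned to `ocr` are placeholders, not markitdown's real dependency list; in a setuptools project a dict like this could be passed as `extras_require`, and in a pyproject-based one the same table would live under `[project.optional-dependencies]`.

```python
# Hypothetical extras table; the package names are illustrative only.
EXTRAS = {
    "ocr": ["pytesseract"],
    "llm": ["openai"],
    "yt": ["youtube-transcript-api"],
}


def with_all_alias(extras):
    """Return a copy of `extras` plus an 'all' key that unions every extra."""
    combined = sorted({req for reqs in extras.values() for req in reqs})
    return {**extras, "all": combined}


if __name__ == "__main__":
    for name, reqs in with_all_alias(EXTRAS).items():
        print(f"{name}: {reqs}")
```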

btw you should probably rename this issue "optional dependencies" or similar. Also you can use a markdown quote block (>) rather than code block in the description :wink:

casperdcl avatar Dec 18 '24 03:12 casperdcl

I think we can just make all the dependencies optional and have the script install them when needed, the way ultralytics does: there, if a package is needed, it is installed at runtime, and updates also happen at runtime.

SigireddyBalasai avatar Dec 18 '24 10:12 SigireddyBalasai

Whoa, at most you could do:

try:
    import openai
except ImportError as exc:
    raise ImportError("please `pip/conda install openai` or `pip install markitdown[llm]`") from exc
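
One possible variation on the same guard, sketched here as an assumption rather than anything the project has adopted: defer the import until the feature is actually used, so a missing optional package only matters to callers of that feature. The helper name `require` is hypothetical; the `llm` extra name is borrowed from the message above.

```python
import importlib


def require(module_name, extra):
    """Import `module_name`, or raise an ImportError naming the extra to install."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f"try `pip install markitdown[{extra}]`"
        ) from exc


if __name__ == "__main__":
    # A module that is always present imports normally.
    print(require("json", "llm").__name__)
```

A converter would call `require("openai", "llm")` inside the function that needs it, never at module top level.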

Meanwhile side-effects like this are highly discouraged:

try:
    import openai
except ImportError:
    os.system(f"{sys.executable} -m pip install openai")
    import openai

casperdcl avatar Dec 18 '24 10:12 casperdcl

I think something like:

import importlib.metadata
import subprocess
import sys

def run_pip(*args):
    """Run pip in a subprocess; pip does not support being called in-process."""
    return subprocess.run([sys.executable, "-m", "pip", *args]).returncode

def check_and_install_module(module_name, check_for_updates=False):
    """
    Check if a Python module is installed. Optionally check and perform updates.
    Assumes the import name is the same as the distribution name.
    
    Args:
        module_name (str): Name of the module to check and install
        check_for_updates (bool, optional): Whether to check and perform updates. Defaults to False.
    
    Returns:
        dict: A dictionary with installation/update status
    """
    try:
        # Try to import the module
        __import__(module_name)
        print(f"Module {module_name} is already installed.")
        
        # Check for updates if requested
        if check_for_updates:
            try:
                # Get current installed version
                current_version = importlib.metadata.version(module_name)
                
                # List outdated packages (pip prints them to stdout)
                run_pip('list', '--outdated')
                
                # Perform update
                print(f"Updating {module_name}...")
                update_result = run_pip('install', '--upgrade', module_name)
                
                if update_result == 0:
                    # Get new version after update
                    new_version = importlib.metadata.version(module_name)
                    print(f"Updated {module_name} from {current_version} to {new_version}")
                    return {
                        'installed': True, 
                        'updated': True, 
                        'old_version': current_version, 
                        'new_version': new_version
                    }
                else:
                    print(f"Failed to update {module_name}")
                    return {
                        'installed': True, 
                        'updated': False
                    }
            
            except Exception as update_error:
                print(f"Error checking/updating {module_name}: {update_error}")
                return {
                    'installed': True, 
                    'updated': False
                }
        
        return {
            'installed': True, 
            'updated': False
        }
    
    except ImportError:
        print(f"Module {module_name} not found. Attempting to install...")
        
        try:
            # Use pip to install the module
            install_result = run_pip('install', module_name)
            
            if install_result == 0:
                # Verify the module is now importable
                __import__(module_name)
                print(f"Successfully installed {module_name}")
                return {
                    'installed': True, 
                    'updated': False
                }
            else:
                print(f"Failed to install {module_name}")
                return {
                    'installed': False, 
                    'updated': False
                }
        
        except Exception as e:
            print(f"Failed to install {module_name}. Error: {e}")
            return {
                'installed': False, 
                'updated': False
            }

# Example usage
if __name__ == "__main__":
    # Check and install pandas
    print(check_and_install_module('pandas'))
    
    # Check, install, and update requests
    print(check_and_install_module('requests', check_for_updates=True))

SigireddyBalasai avatar Dec 18 '24 12:12 SigireddyBalasai

👍 to the idea of using optional dependencies. I wanted to try using markitdown as a global install (pip install -u markitdown) in my base Python environment, but when I did that, I got the whole kitchen sink:

Installing collected packages: pytz, pydub, puremagic, XlsxWriter, tzdata, speechrecognition, soupsieve, sniffio, pydantic-core, Pillow, pathvalidate, numpy, lxml, jiter, h11, et-xmlfile, defusedxml, cobble, annotated-types, youtube-transcript-api, python-pptx, pydantic, pandas, openpyxl, mammoth, httpcore, cryptography, beautifulsoup4, anyio, pdfminer-six, markdownify, httpx, openai, markitdown

There's a whole bunch of users out there who won't have the system privileges to bring in this many dependencies (or who will stay away because it doesn't make sense). It seems like it should be possible to install the minimum set for the minimum stated functionality: "MarkItDown is a utility for converting various files to Markdown."
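
To see how much of a package's dependency list is unconditional, one option is to inspect its declared requirements. The sketch below is a simplified assumption: it takes requirement strings in the form returned by `importlib.metadata.requires` and separates those gated behind an `extra == "..."` marker from the rest. The helper name `split_requirements` is hypothetical, and other marker types (such as `python_version`) are ignored rather than evaluated.

```python
import re
from importlib import metadata

_EXTRA_MARKER = re.compile(r"""extra\s*==\s*['"]([^'"]+)['"]""")


def split_requirements(requirement_strings):
    """Split requirement strings into (always installed, {extra: [requirements]})."""
    core, by_extra = [], {}
    for req in requirement_strings:
        name, _, marker = req.partition(";")
        match = _EXTRA_MARKER.search(marker)
        if match:
            by_extra.setdefault(match.group(1), []).append(name.strip())
        else:
            core.append(name.strip())
    return core, by_extra


if __name__ == "__main__":
    try:
        declared = metadata.requires("markitdown") or []
    except metadata.PackageNotFoundError:
        declared = []  # markitdown is not installed in this environment
    core, by_extra = split_requirements(declared)
    print("always installed:", core)
    print("behind extras:", by_extra)
```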

robfitzgerald avatar Dec 31 '24 16:12 robfitzgerald

This also increases exposure to a) supply chain attacks and b) CVEs in the whole repo.

Just today, I added markitdown to my repo running safety checks, and got hit with a CVE:

youtube-transcript-api (==0.6.2)  [1 vulnerability found]                      
  -> Vuln ID 74190:                                                             
     Affected versions of youtube_transcript_api are vulnerable to XML External
     Entity (XXE) Injection (CWE-611). T...                                     
Update youtube-transcript-api (==0.6.2) to youtube-transcript-api==0.6.3 to fix

Here I can just add a constraint on the dependency, but there will not always be a quick fix, and that prevents me from reliably using this library in anything production-grade. Additionally, when working with any kind of Docker setup or container registry, every additional dependency and every additional MB potentially translates to a LOT of extra cost. On top of that, sometimes I want to be sure that a video is not accidentally leaked to a third-party API when using markitdown.
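
For illustration only, a check like the following could flag an installed youtube-transcript-api older than the 0.6.3 release named in the report above. It assumes plain dotted numeric versions; the helper names are hypothetical, and a real project would more likely rely on a constraints file or a proper version parser.

```python
from importlib import metadata


def parse_version(text):
    """Parse a simple dotted release such as '0.6.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_patched(installed, minimum_fixed="0.6.3"):
    """Return True when `installed` is at or above the first fixed release."""
    return parse_version(installed) >= parse_version(minimum_fixed)


if __name__ == "__main__":
    try:
        installed = metadata.version("youtube-transcript-api")
    except metadata.PackageNotFoundError:
        print("youtube-transcript-api is not installed")
    else:
        status = "patched" if is_patched(installed) else "vulnerable"
        print(f"youtube-transcript-api {installed}: {status}")
```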

Zahlii avatar Jan 22 '25 18:01 Zahlii

openai is only used in the tests. Markitdown is already using hatch, so it should create a test env and require openai there: https://hatch.pypa.io/1.12/config/environment/overview/#dependencies

Maybe there are even more dependencies, haven't checked in detail.

AdrianVollmer avatar Feb 27 '25 12:02 AdrianVollmer

Yeah, I want to move to optional dependencies asap. Relatedly, the latest version in main (not PyPI) also supports 3rd party extensions, minimizing -- I hope -- the need for the kitchen sink.

So this is a known problem, and one I'm very keen to solve. The current status quo is a consequence of having lifted the code out of another (also experimental) project -- namely Magentic One. It hasn't yet been sufficiently generalized.

afourney avatar Feb 28 '25 14:02 afourney

> This also increases exposure to a) supply chain attacks and b) CVEs in the whole repo.
>
> Just today, I added markitdown to my repo running safety checks, and got hit with a CVE:
>
> youtube-transcript-api (==0.6.2)  [1 vulnerability found]
>   -> Vuln ID 74190:
>      Affected versions of youtube_transcript_api are vulnerable to XML External
>      Entity (XXE) Injection (CWE-611). T...
> Update youtube-transcript-api (==0.6.2) to youtube-transcript-api==0.6.3 to fix
>
> Now here I can potentially just add a constraint on the dependency, but there will not always be "quick fixes", which prevents me from reliably using this library in anything production-grade. Additionally, when working with any kind of docker setup / container registry, every additional dependency and every additional MB translates to potentially a LOT of extra cost. Add to that that maybe sometimes I want to make sure that a video is not accidentally leaked to a 3rd party API when using markitdown?

Yes, this keeps me up at night. I want to make a series of breaking changes for 0.0.2, and I will include a move to optional dependencies in that.

afourney avatar Feb 28 '25 15:02 afourney

Folks, I have a potential design here. Let me know what you think before I expand it to all the converters we provide. (This mechanism is not meant to be used with the plugin extensions; for those, plugins are responsible for their own dependencies.)

https://github.com/microsoft/markitdown/pull/1079

afourney avatar Mar 01 '25 01:03 afourney