Minimal pyjanitor installation

Open GuiMarthe opened this issue 3 years ago • 7 comments

Hey folks, I recently looked at the package dependencies for pyjanitor, and the list seems too large for a production environment. I saw in the .requirements directory that there are a few sets of dependencies for different use cases, but I don't see anywhere how to actually limit the scope at installation time.

Is the following possible? pip install pyjanitor[base]

Just in case, this is the package's dependency list generated by pipdeptree.

pyjanitor==0.20.10
  black==20.8b1
    appdirs==1.4.4
    click==7.1.2
    mypy-extensions==0.4.3
    pathspec==0.8.1
    regex==2021.4.4
    toml==0.10.2
    typed-ast==1.4.2
    typing-extensions==3.7.4.3
  hypothesis==6.8.9
    attrs==20.3.0
    sortedcontainers==2.3.0
  interrogate==1.3.2
    attrs==20.3.0
    click==7.1.2
    colorama==0.4.4
    py==1.10.0
    tabulate==0.8.9
    toml==0.10.2
  ipykernel==5.5.3
    ipython==7.21.0
      backcall==0.2.0
      decorator==5.0.5
      jedi==0.18.0
        parso==0.8.2
      pexpect==4.8.0
        ptyprocess==0.7.0
      pickleshare==0.7.5
      prompt-toolkit==3.0.18
        wcwidth==0.2.5
      Pygments==2.8.1
      setuptools==54.2.0
      traitlets==5.0.5
        ipython-genutils==0.2.0
    jupyter-client==6.1.13
      jupyter-core==4.7.1
        traitlets==5.0.5
          ipython-genutils==0.2.0
      nest-asyncio==1.5.1
      python-dateutil==2.8.0
        six==1.15.0
      pyzmq==22.0.3
      tornado==6.1
      traitlets==5.0.5
        ipython-genutils==0.2.0
    tornado==6.1
    traitlets==5.0.5
      ipython-genutils==0.2.0
  isort==5.8.0
  jupyter-client==6.1.13
    jupyter-core==4.7.1
      traitlets==5.0.5
        ipython-genutils==0.2.0
    nest-asyncio==1.5.1
    python-dateutil==2.8.0
      six==1.15.0
    pyzmq==22.0.3
    tornado==6.1
    traitlets==5.0.5
      ipython-genutils==0.2.0
  lxml==4.6.3
  natsort==7.1.1
  nbsphinx==0.8.2
    docutils==0.15.2
    Jinja2==2.11.3
      MarkupSafe==1.1.1
    nbconvert==6.0.7
      bleach==3.3.0
        packaging==20.9
          pyparsing==2.4.7
        six==1.15.0
        webencodings==0.5.1
      defusedxml==0.7.1
      entrypoints==0.3
      Jinja2==2.11.3
        MarkupSafe==1.1.1
      jupyter-core==4.7.1
        traitlets==5.0.5
          ipython-genutils==0.2.0
      jupyterlab-pygments==0.1.2
        Pygments==2.8.1
      mistune==0.8.4
      nbclient==0.5.3
        async-generator==1.10
        jupyter-client==6.1.13
          jupyter-core==4.7.1
            traitlets==5.0.5
              ipython-genutils==0.2.0
          nest-asyncio==1.5.1
          python-dateutil==2.8.0
            six==1.15.0
          pyzmq==22.0.3
          tornado==6.1
          traitlets==5.0.5
            ipython-genutils==0.2.0
        nbformat==5.1.3
          ipython-genutils==0.2.0
          jsonschema==3.2.0
            attrs==20.3.0
            importlib-metadata==3.10.0
              typing-extensions==3.7.4.3
              zipp==3.4.1
            pyrsistent==0.17.3
            setuptools==54.2.0
            six==1.15.0
          jupyter-core==4.7.1
            traitlets==5.0.5
              ipython-genutils==0.2.0
          traitlets==5.0.5
            ipython-genutils==0.2.0
        nest-asyncio==1.5.1
        traitlets==5.0.5
          ipython-genutils==0.2.0
      nbformat==5.1.3
        ipython-genutils==0.2.0
        jsonschema==3.2.0
          attrs==20.3.0
          importlib-metadata==3.10.0
            typing-extensions==3.7.4.3
            zipp==3.4.1
          pyrsistent==0.17.3
          setuptools==54.2.0
          six==1.15.0
        jupyter-core==4.7.1
          traitlets==5.0.5
            ipython-genutils==0.2.0
        traitlets==5.0.5
          ipython-genutils==0.2.0
      pandocfilters==1.4.3
      Pygments==2.8.1
      testpath==0.4.4
      traitlets==5.0.5
        ipython-genutils==0.2.0
    nbformat==5.1.3
      ipython-genutils==0.2.0
      jsonschema==3.2.0
        attrs==20.3.0
        importlib-metadata==3.10.0
          typing-extensions==3.7.4.3
          zipp==3.4.1
        pyrsistent==0.17.3
        setuptools==54.2.0
        six==1.15.0
      jupyter-core==4.7.1
        traitlets==5.0.5
          ipython-genutils==0.2.0
      traitlets==5.0.5
        ipython-genutils==0.2.0
    Sphinx==3.5.3
      alabaster==0.7.12
      Babel==2.9.0
        pytz==2020.5
      docutils==0.15.2
      imagesize==1.2.0
      Jinja2==2.11.3
        MarkupSafe==1.1.1
      packaging==20.9
        pyparsing==2.4.7
      Pygments==2.8.1
      requests==2.23.0
        certifi==2020.12.5
        chardet==3.0.4
        idna==2.10
        urllib3==1.25.11
      setuptools==54.2.0
      snowballstemmer==2.1.0
      sphinxcontrib-applehelp==1.0.2
      sphinxcontrib-devhelp==1.0.2
      sphinxcontrib-htmlhelp==1.0.3
      sphinxcontrib-jsmath==1.0.1
      sphinxcontrib-qthelp==1.0.3
      sphinxcontrib-serializinghtml==1.1.4
    traitlets==5.0.5
      ipython-genutils==0.2.0
  pandas-flavor==0.2.0
    pandas==1.1.3
      numpy==1.20.2
      python-dateutil==2.8.0
        six==1.15.0
      pytz==2020.5
    xarray==0.17.0
      numpy==1.20.2
      pandas==1.1.3
        numpy==1.20.2
        python-dateutil==2.8.0
          six==1.15.0
        pytz==2020.5
      setuptools==54.2.0
  pandas-vet==0.2.2
    attrs==20.3.0
    flake8==3.9.0
      importlib-metadata==3.10.0
        typing-extensions==3.7.4.3
        zipp==3.4.1
      mccabe==0.6.1
      pycodestyle==2.7.0
      pyflakes==2.3.1
  pre-commit==2.12.0
    cfgv==3.2.0
    identify==2.2.2
    importlib-metadata==3.10.0
      typing-extensions==3.7.4.3
      zipp==3.4.1
    nodeenv==1.5.0
    PyYAML==5.4.1
    toml==0.10.2
    virtualenv==20.4.3
      appdirs==1.4.4
      distlib==0.3.1
      filelock==3.0.12
      importlib-metadata==3.10.0
        typing-extensions==3.7.4.3
        zipp==3.4.1
      six==1.15.0
  pyspark==3.1.1
    py4j==0.10.9
  pytest==6.2.3
    attrs==20.3.0
    importlib-metadata==3.10.0
      typing-extensions==3.7.4.3
      zipp==3.4.1
    iniconfig==1.1.1
    packaging==20.9
      pyparsing==2.4.7
    pluggy==0.13.1
      importlib-metadata==3.10.0
        typing-extensions==3.7.4.3
        zipp==3.4.1
    py==1.10.0
    toml==0.10.2
  pytest-azurepipelines==0.8.0
    pytest==6.2.3
      attrs==20.3.0
      importlib-metadata==3.10.0
        typing-extensions==3.7.4.3
        zipp==3.4.1
      iniconfig==1.1.1
      packaging==20.9
        pyparsing==2.4.7
      pluggy==0.13.1
        importlib-metadata==3.10.0
          typing-extensions==3.7.4.3
          zipp==3.4.1
      py==1.10.0
      toml==0.10.2
  pytest-cov==2.11.1
    coverage==5.5
    pytest==6.2.3
      attrs==20.3.0
      importlib-metadata==3.10.0
        typing-extensions==3.7.4.3
        zipp==3.4.1
      iniconfig==1.1.1
      packaging==20.9
        pyparsing==2.4.7
      pluggy==0.13.1
        importlib-metadata==3.10.0
          typing-extensions==3.7.4.3
          zipp==3.4.1
      py==1.10.0
      toml==0.10.2
  scikit-learn==0.23.2
    joblib==0.13.2
    numpy==1.20.2
    scipy==1.6.0
      numpy==1.20.2
    threadpoolctl==2.1.0
  seaborn==0.11.1
    matplotlib==3.4.1
      cycler==0.10.0
        six==1.15.0
      kiwisolver==1.3.1
      numpy==1.20.2
      Pillow==8.0.0
      pyparsing==2.4.7
      python-dateutil==2.8.0
        six==1.15.0
    numpy==1.20.2
    pandas==1.1.3
      numpy==1.20.2
      python-dateutil==2.8.0
        six==1.15.0
      pytz==2020.5
    scipy==1.6.0
      numpy==1.20.2
  setuptools==54.2.0
  sphinxcontrib-fulltoc==1.2.0
  unyt==2.8.0
    numpy==1.20.2
    sympy==1.7.1
      mpmath==1.2.1
  xarray==0.17.0
    numpy==1.20.2
    pandas==1.1.3
      numpy==1.20.2
      python-dateutil==2.8.0
        six==1.15.0
      pytz==2020.5
    setuptools==54.2.0

If I'm doing anything wrong, please let me know!

GuiMarthe avatar Apr 07 '21 20:04 GuiMarthe

possible dup of #793

samukweku avatar Apr 07 '21 21:04 samukweku

Hello @GuiMarthe! Thanks for chiming in. Yes, there are a lot of dependencies for pyjanitor. Dependency sprawl is something I haven't managed well in the past; I'm still doing a bit of learning here.

Looks like it might be good for us to split out at least the pip-installable package using the optional dependencies convention that I have just learned from googling on SO. The conda package will have to wait a bit though.

Would you like to help contribute a PR, if you've got the bandwidth? Meanwhile, I'll tag this and the other #793 as being one of the higher priority issues.

ericmjl avatar Apr 08 '21 18:04 ericmjl

Hey @ericmjl, thank you for the prompt response! I think I can help you with this issue. Even though I'm a beginner, I am aware of how friendly pyjanitor is :smile: .

Now, I'd still need to dig into the code, but from the pipdeptree list I've posted above, I can see the following groups of dependencies:

  • development: pytest, hypothesis, sphinx, and a few others
  • base/minimal: pandas-flavor, numpy, and pandas itself
  • notebook: ipython, jupyter, seaborn (?), etc
  • ML: scikit-learn
  • Spark: pyspark, etc.

Perhaps the ML category could be merged with the base category. Does that make sense?

GuiMarthe avatar Apr 12 '21 20:04 GuiMarthe

Also, should we add messages and/or warnings whenever the user tries to use a method that depends on a non-installed dependency?

GuiMarthe avatar Apr 12 '21 20:04 GuiMarthe

> Also, should we add messages and/or warnings whenever the user tries to use a method that depends on a non-installed dependency?

@GuiMarthe yesss! That sounds like a great idea.
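A minimal sketch of how such a message could be raised, using only the standard library. The helper name `require_optional`, the wording of the message, and the example method `to_spark_frame` are hypothetical and are not taken from pyjanitor's code; the check assumes a top-level module name such as `pyspark`.

```python
import importlib.util


def require_optional(module_name: str, extra: str) -> None:
    """Raise a helpful ImportError if an optional dependency is missing.

    Hypothetical helper: `module_name` is the top-level module to look for,
    `extra` is the pip extra that would install it.
    """
    # find_spec returns None when a top-level module cannot be found.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"This method needs the optional dependency '{module_name}'. "
            f"Install it with: pip install pyjanitor[{extra}]"
        )


def to_spark_frame(df):
    """Hypothetical method that only works when pyspark is installed."""
    require_optional("pyspark", extra="spark")
    # ... the pyspark-specific work would go here ...
    return df
```

Calling the check at the top of each method keeps the import of the package itself cheap and defers the error until the user actually reaches for the optional feature.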

> Perhaps the ML category could be merged with the base category. Does that make sense?

That makes sense too.

In terms of the different ways we can group things, would you be kind enough to do the following?

  • pyjanitor[all]
  • pyjanitor[base] - includes base dependencies only
  • pyjanitor[bio] - includes base + bio packages
  • pyjanitor[chem] - includes base + chemistry packages
  • pyjanitor[eng] - includes base + engineering packages
  • pyjanitor[spark] - includes base + spark dependencies

Doing so would mirror the structure in the .requirements directory (https://github.com/pyjanitor-devs/pyjanitor/tree/dev/.requirements); the more patterns we have, the easier it is to follow later on while maintaining the project. That said, I think we can omit the test submodule because that's generally for development purposes.
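The grouping above could be sketched with setuptools' `extras_require` mapping. This is only an illustration: the package lists below are assumptions (only pandas, pandas-flavor, numpy, unyt, and pyspark appear in the pipdeptree output above; `biopython` and `rdkit` are placeholders), and the real lists would come from the files in the .requirements directory.

```python
# Assumed package lists, for illustration only.
BASE = ["pandas", "pandas-flavor", "numpy"]

EXTRAS = {
    "base": [],          # nothing extra: install_requires is always installed
    "bio": ["biopython"],  # placeholder
    "chem": ["rdkit"],     # placeholder
    "eng": ["unyt"],
    "spark": ["pyspark"],
}


def build_extras(extras):
    """Return an extras_require mapping with an added "all" group."""
    result = {name: sorted(pkgs) for name, pkgs in extras.items()}
    # "all" is the union of every optional group, without duplicates.
    result["all"] = sorted({pkg for pkgs in extras.values() for pkg in pkgs})
    return result


extras_require = build_extras(EXTRAS)

# In setup.py this would be passed to setuptools roughly as:
# setup(name="pyjanitor", install_requires=BASE, extras_require=extras_require)
```

Because `install_requires` is installed for every extra, each group automatically means "base + that group's packages", and `pip install pyjanitor[spark]` would then pull in only the base and Spark dependencies.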

I'm looking forward to reviewing the PR! I will be getting my vaccine shot this week, so I might be KO'd for a day or two (depending on whether my immune system kicks off in a big or small way), but I should be able to come back to it later.

ericmjl avatar Apr 12 '21 23:04 ericmjl

Would pyjanitor[base] be the default and require no actual [base] specification?

hectormz avatar Apr 16 '21 13:04 hectormz

Actually, that sounds like a good idea, @hectormz!

@GuiMarthe, can I check in: do you have bandwidth to handle this one? I ran into a bit of a busy patch myself and have dropped the ball here, and I will probably be like this until the end of the week.

ericmjl avatar May 09 '21 13:05 ericmjl