
Feature Proposal: reduce dependencies

Open dovahcrow opened this issue 4 years ago • 7 comments

Summary

Reduce unnecessary dependencies for the whole package.

Design-level Explanation Actions

NA

Design-level Explanation

NA

Implementation-level Explanation

  • Removing Pillow: Pillow is used only to load the ellipse image for the wordcloud. We can remove this dependency by storing the ellipse as an array and reading it directly with NumPy.

  • Removing bottleneck: We need to investigate whether bottleneck's ranking method is equivalent to series.rank.

  • Make tqdm optional: tqdm is used for the progress bar. We can make this library optional and display the progress bar only if tqdm is installed.

  • Removing tornado: I already opened an issue here: https://github.com/Kaggle/docker-python/issues/890. Once it is resolved, we can remove the tornado version restriction.

  • Removing requests: Removing requests requires the connect function to become async. The generator UI might also run into difficulties, since it needs to send out requests. Let's first try the following change to see if http.client satisfies our needs. From:

import requests
requests.get("https://www.python.org").json()

to:

import http.client
import json
conn = http.client.HTTPSConnection("www.python.org")
conn.request("GET", "/")
r1 = conn.getresponse()
json.loads(r1.read())
  • Vendor list:
    • jsonpath-ng
    • nltk
    • wordcloud
    • aiohttp
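
The "make tqdm optional" item above could look roughly like the following. This is only a sketch: the helper name `progress` is hypothetical and not part of dataprep, and it assumes the progress bar is only ever used to wrap an iterable.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

try:
    # tqdm is treated as an optional extra: use it when it is installed.
    from tqdm import tqdm as _tqdm
except ImportError:
    _tqdm = None


def progress(iterable: Iterable[T], **kwargs) -> Iterator[T]:
    """Hypothetical helper: show a tqdm bar if tqdm is available.

    Without tqdm the items are yielded unchanged and kwargs are ignored.
    """
    if _tqdm is None:
        return iter(iterable)
    return iter(_tqdm(iterable, **kwargs))


if __name__ == "__main__":
    for _ in progress(range(1000), desc="working"):
        pass
```

Call sites stay the same whether or not tqdm is installed, so tqdm could move from a hard requirement to an optional extra.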

Rationale and Alternatives

NA

Prior Art

NA

Future Possibilities

NA

Implementation-level Actions

  • [x] removing Pillow
  • [ ] remove bottleneck
  • [ ] make tqdm optional
  • [x] removing tornado
  • [x] remove requests

When vendoring the following packages, make sure the license is copied and followed.

  • [ ] jsonpath-ng
  • [ ] nltk
  • [ ] wordcloud
  • [ ] aiohttp

Additional Tasks

  • [x] This task is put into a correct pipeline (Development Backlog or In Progress).
  • [x] The label of this task is set correctly.
  • [x] The issue is assigned to the correct person.
  • [x] The issue is linked to related Epic.
  • [x] The documentation is changed accordingly.
  • [x] Tests are added accordingly.

dovahcrow avatar Oct 22 '20 02:10 dovahcrow

@dovahcrow do we remove the requests dependency or not? @pallavibharadwaj is just about to do the implementation though.

peiwangdb avatar Oct 27 '20 15:10 peiwangdb

> @dovahcrow do we remove the requests dependency or not? @pallavibharadwaj is just about to do the implementation though.

Let's remove requests and give http.client a try.
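
For reference, a synchronous JSON GET built only on the standard library might look like the sketch below. The function name `get_json` is hypothetical, and the sketch does not follow redirects, retry, or pool connections the way requests does, so treat it as a starting point for the experiment rather than a drop-in replacement.

```python
import http.client
import json
from urllib.parse import urlsplit


def get_json(url: str, timeout: float = 10.0):
    """Hypothetical helper: GET `url` and decode the body as JSON."""
    parts = urlsplit(url)
    # Choose the connection class from the URL scheme.
    conn_cls = (
        http.client.HTTPSConnection
        if parts.scheme == "https"
        else http.client.HTTPConnection
    )
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        # http.client wants the path and query, not the full URL.
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers={"Accept": "application/json"})
        resp = conn.getresponse()
        body = resp.read()
        if resp.status >= 400:
            raise RuntimeError(f"HTTP {resp.status} {resp.reason}")
        return json.loads(body)
    finally:
        conn.close()
```

Unlike requests, http.client requires a full URL with a scheme to be split into host and path first, which is what `urlsplit` handles here.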

dovahcrow avatar Oct 28 '20 21:10 dovahcrow

Removing the tornado dependency if possible would be helpful. Currently dataprep and jupyter notebook don't play well together because dataprep is forcing tornado=5.0.2.

The issue is on Jupyter's side: they have taken a code dependency on higher tornado versions but didn't update their dependency requirements accordingly. As a result, a conda install today with both jupyter and dataprep causes a Jupyter failure with some notebooks when it tries to access a tornado method that doesn't exist.

In the meantime, I saw the dataprep issue with kaggle that caused the pinning of the tornado version. Do you know if dataprep will run ok with the 6.X versions of tornado?

This is the jupyter issue I opened, if you'd like more detail: https://github.com/jupyter/notebook/issues/5920

dhuntley1023 avatar Dec 29 '20 09:12 dhuntley1023

I also see that dataprep is forcing pandas=1.0, numpy=1.18 and scipy=1.4. Given that these are workhorse modules for the dataprep audience, it would be valuable to loosen these to support the latest versions unless there's a strong reason to limit them.

dhuntley1023 avatar Dec 29 '20 09:12 dhuntley1023

@dhuntley1023 thanks for the suggestions! We can definitely loosen pandas, numpy and scipy. However, the reason for pinning tornado is that the Kaggle notebook pins it. See https://github.com/Kaggle/docker-python/blob/master/Dockerfile. @jnwang @jinglinpeng what do you think of dropping Kaggle support, since it seems like more users are using the newer version of Jupyter nowadays?

dovahcrow avatar Dec 30 '20 01:12 dovahcrow

Is it possible to detect which environment it is? If it is Kaggle, we import tornado. This is similar to how pandas handles the dependency on sqlalchemy for read_sql():

https://github.com/pandas-dev/pandas/blob/v1.2.0/pandas/io/sql.py#L40
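
A generic version of that optional-import pattern is sketched below. The helper name `import_optional` is hypothetical, and this is not a copy of the pandas code linked above. It only decides at runtime whether an already installed module gets loaded.

```python
import importlib
from types import ModuleType
from typing import Optional


def import_optional(name: str) -> Optional[ModuleType]:
    """Hypothetical helper: return the module if importable, else None."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# Example: only take the tornado-specific code path when tornado is present.
tornado = import_optional("tornado")
HAS_TORNADO = tornado is not None
```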

jnwang avatar Dec 30 '20 02:12 jnwang

@jnwang Theoretically yes. However, that would require us to write our own package loader, installer, and other facilities; that is, we would not install the package when installing dataprep. Instead, on first run, we would detect the platform through its IP and install the selected version of the packages.

Our case is different from the pandas case: they only decide whether or not to load sqlalchemy, whereas we need to switch package versions.

dovahcrow avatar Jan 04 '21 18:01 dovahcrow