
[RFC] Add module to make datasets IO easier with pandas

Open acroz opened this issue 5 years ago • 4 comments

This needs tests added before merging.

While this is still in draft stage, and prior to implementing tests, I'd like to get review on the API proposed by this PR.

Expected usage looks like:

import faculty.datasets.pandas

# Read
df = faculty.datasets.pandas.read_csv("path/to/object.csv")

# Write
faculty.datasets.pandas.to_csv(df, "path/to/object.csv", index=False)

These closely mirror the pandas API (extra args and kwargs are passed straight through), except that the to_csv functionality in pandas is a DataFrame method and is not (as far as I can tell) available as a standalone function.
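As a rough illustration of the pass-through wrapper pattern described above, here is a minimal runnable sketch. The `_download`/`_upload` helpers are hypothetical stand-ins for `faculty.datasets.get`/`put` (here they just copy files via a local directory posing as object storage), but the wrapper shape is the point: materialise the object locally, then delegate everything else to pandas.

```python
import os
import shutil
import tempfile

import pandas

# Hypothetical stand-ins for faculty.datasets.get/put: a local
# directory poses as object storage so the sketch is runnable.
REMOTE = tempfile.mkdtemp()


def _download(project_path, local_path):
    shutil.copy(os.path.join(REMOTE, project_path), local_path)


def _upload(local_path, project_path):
    shutil.copy(local_path, os.path.join(REMOTE, project_path))


def read_csv(project_path, *args, **kwargs):
    """Download the object, then delegate to pandas.read_csv."""
    with tempfile.TemporaryDirectory() as tmp:
        local = os.path.join(tmp, "object.csv")
        _download(project_path, local)
        return pandas.read_csv(local, *args, **kwargs)


def to_csv(df, project_path, *args, **kwargs):
    """Serialise with DataFrame.to_csv, then upload the result."""
    with tempfile.TemporaryDirectory() as tmp:
        local = os.path.join(tmp, "object.csv")
        df.to_csv(local, *args, **kwargs)
        _upload(local, project_path)
```

Since all extra arguments flow through untouched, pandas options like `index=False` or `usecols` work unchanged.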

Pandas as an optional dependency

faculty does not currently depend on numpy or pandas. It's nice to keep it that way, as the library can be kept lightweight for the majority of applications where the sometimes-expensive installation of numpy is not required. I propose that an optional dependency on pandas be included for this functionality via an extras_require entry in setup.py.

For the main expected use case (inside the platform), pandas is always expected to be available, so users will rarely encounter the case where it's not available. Managing the case where pandas is not installed could be:

  1. (As implemented) Pandas is only imported in this module, and we make sure this module is not imported by any other module in the package. In this case, tests would be implemented to check that other functionality works when pandas is not available.
  2. Pandas is imported at function call-time, with a descriptive error message replacing the default ModuleNotFoundError.

I'm interested in input on the above or other options.

## Possible aliases

Current recommended style when using faculty.datasets is:

from faculty import datasets
datasets.ls("prefix")
# etc..

People seem to prefer shorter aliases for things (it seems the data science community finds even the 5 or 6 characters of numpy/pandas too lengthy!), so we may want to encourage a particular alias, such as:

import faculty.datasets.pandas as faculty_pandas
import faculty.datasets.pandas as datasets_pandas
import faculty.datasets.pandas as ds_pandas
import faculty.datasets.pandas as fdp

Extra ideas welcome.

Alternatively, if we go with option 2 above (import pandas at function call-time), we could import faculty.datasets.pandas in faculty/datasets/__init__.py, and then the pandas functionality appears as some namespaced components of faculty.datasets, e.g.:

from faculty import datasets
datasets.ls("path/")
df = datasets.pandas.read_csv("path/to/object.csv")

acroz avatar Jan 22 '20 13:01 acroz

This is quite interesting! Some initial thoughts, and coming from a place of ignorance:

  • the namespacing in faculty.datasets.pandas.... seems good, though do you think people would find it confusing that only some functions (and not all of pandas) are propagated? I guess not really, but it's probably good to document that in any case, later on.
  • would we consider adding similar functions for the other read_FORMAT and to_FORMAT pairs as well? Looking at the API reference, I can imagine FORMAT being excel, json, HDF, parquet, pickle.... I expect these are less frequent, but it feels like if we include one, we should include the rest as well.
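If more formats were wanted, one way to avoid writing each wrapper by hand would be to generate them from the corresponding `pandas.read_*` functions. This is only a sketch of that idea; `_download_to_temp` is a hypothetical helper (the real version would fetch the object via `datasets.get` and return a local path; here it passes its input through so the sketch runs):

```python
import io

import pandas


def _download_to_temp(project_path):
    """Hypothetical stand-in: the real version would download the
    object to a temporary file and return that file's path. Here it
    passes the input through unchanged so the sketch is runnable."""
    return project_path


def _make_reader(fmt):
    """Build a read_FORMAT wrapper that delegates to pandas.read_FORMAT."""
    pandas_reader = getattr(pandas, "read_" + fmt)

    def reader(project_path, *args, **kwargs):
        local_path = _download_to_temp(project_path)
        return pandas_reader(local_path, *args, **kwargs)

    reader.__name__ = "read_" + fmt
    return reader


# One generated wrapper per supported format.
for _fmt in ["csv", "json", "parquet", "pickle"]:
    globals()["read_" + _fmt] = _make_reader(_fmt)

# Demo: file-like objects pass straight through the stand-in helper,
# so the generated read_csv can parse an in-memory CSV.
demo = read_csv(io.StringIO("a,b\n1,2\n"))
```

The same factory pattern would work for the `to_FORMAT` direction, delegating to the corresponding DataFrame methods.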

imrehg avatar Feb 19 '20 16:02 imrehg

Also, just for clarity: with the 2 options above, if the 2nd is used, do you mean that the whole of pandas would be available as a namespaced component, or just these functions?

For the shorter import, I wonder if either

import faculty.datasets.pandas as fdpd
import faculty.datasets.pandas as fpd

would be more natural (so keeping the original pd convention, but adding some way to highlight the "faculty-ness" of things). Just a thought, no strong preference.

imrehg avatar Feb 20 '20 18:02 imrehg

Thanks @acroz , thought about this a bit more and considered all the comments above. How about the following?

from faculty import datasets

url = datasets.presigned_url("/path/to/any/file")

We can then add a section in the docs (or docstrings) illustrating usage with pd.read_*, and possibly readers from other libraries that support url inputs.

As for writing, you could add something like datasets.put_string that takes the local data as a string (as returned by pd.DataFrame.to_csv(path_or_buf=None), and also by pd.Series.to_csv), and again illustrate usage in the docs. Or perhaps modify datasets.put so that it accepts a string as well as a file path as input. Finally, we could also have the inverse of this: datasets.get_string, or a modified datasets.get.
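To make the proposal above concrete, the following sketch shows the two pandas mechanisms it relies on. For reading, `pd.read_csv` accepts a URL directly, so `pd.read_csv(datasets.presigned_url(...))` would work; since no real presigned URL is available here, an in-memory `StringIO` stands in for the remote object. For writing, `to_csv(path_or_buf=None)` returns the CSV as a string, which is exactly what a hypothetical `datasets.put_string(content, project_path)` would accept:

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

# Writing: with path_or_buf=None, to_csv returns the CSV as a string
# rather than writing to disk. This string is what the hypothetical
# datasets.put_string(content, project_path) would upload.
csv_string = df.to_csv(path_or_buf=None, index=False)

# Reading: pd.read_csv accepts URLs directly, so with
# url = datasets.presigned_url("/path/to/object.csv") one could call
# pd.read_csv(url). Here a StringIO stands in for the remote object.
round_tripped = pd.read_csv(io.StringIO(csv_string))
```

Because the heavy lifting stays inside pandas, faculty itself never needs to import it.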

This satisfies the two requirements:

  1. No dependence on pandas: is general and removes the burden of including pandas as a dependency.
  2. Makes it easier to deal with datasets IO during development. To me, the main burden is dealing with ObjectClient when I want a presigned URL or when I want to upload data that is not sitting on disk.

sbalian avatar May 01 '20 14:05 sbalian

@acroz Also ran a quick test to compare speed for an AWS backend.

For a 139M CSV file,

| Method | Time (seconds) |
| --- | --- |
| pandas.read_csv | 3.29 |
| faculty.datasets.pandas.read_csv | 5.52 |
| pandas.DataFrame.to_csv | 10.2 |
| faculty.datasets.pandas.to_csv | 18.7 |

These numbers are very promising: the ratio of object storage price to workspace price is well below 0.5, yet the workspace speedup over object storage here is not even 2x (of course I am ignoring other advantages of workspace).

sbalian avatar May 01 '20 15:05 sbalian