python-pachyderm
Walk through the "pachyderm user" onboarding track, and make it work with python-pachyderm
General feedback from users has been that python-pachyderm is hard to use. For example, we've heard "there's no way to parse a Pachyderm file as a CSV or a pandas dataframe". Our getfile library supports the Python iterator interface, though, so it's an open question (at least to me) why this doesn't work. Do we support it incorrectly? Are the functions confusingly named or hard to find?
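To make the iterator point concrete, here is a minimal sketch of parsing an iterator of byte chunks as CSV using only the standard library. It does not call python-pachyderm at all: `fake_get_file_chunks` and `rows_from_chunks` are hypothetical names, and the generator only stands in for whatever getfile actually yields, which I'm assuming is a sequence of `bytes` chunks.

```python
import csv
import io


def fake_get_file_chunks():
    """Hypothetical stand-in for a getfile result: yields raw byte chunks."""
    # Chunk boundaries deliberately fall in the middle of fields and lines.
    yield b"name,cou"
    yield b"nt\nalpha,1\nbe"
    yield b"ta,2\n"


def rows_from_chunks(chunks, encoding="utf-8"):
    """Join byte chunks, decode them, and parse the text as CSV rows."""
    # Assumption: the whole file fits in memory; this buffers every chunk.
    text = b"".join(chunks).decode(encoding)
    return list(csv.DictReader(io.StringIO(text)))


rows = rows_from_chunks(fake_get_file_chunks())
for row in rows:
    print(row["name"], row["count"])
```

If the real iterator behaves like this stand-in, the same helper should work on it unchanged, which would suggest the problem is discoverability rather than missing functionality.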
Going through the "pachyderm user" onboarding track, which uses pachctl, and making it equally doable with python-pachyderm should be a big step forward in python-pachyderm usability.
Hmm, I'm also curious as to why loading into pandas doesn't seem to work. I uploaded a sample CSV file to a local Pachyderm cluster, and was able to get_file() it and load it into a pandas dataframe.
An example workflow would be:
import python_pachyderm as pachyderm
import pandas as pd
client = pachyderm.Client()
pf = client.get_file('REPO', 'COMMIT/BRANCH', 'CSV_FILE')
df = pd.read_csv(pf) # this works!
This works because pandas' read_csv() function accepts any file-like object with a read() method, and luckily PFSFile has a read() method.
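For anyone wondering what "file-like object with a read() method" means in practice, here is a rough sketch of an adapter that turns an iterator of byte chunks into a readable stream, using only the standard library. `ChunkStream` is a hypothetical name and not part of python-pachyderm (PFSFile already provides read(), as noted above), and the chunk lists are made-up sample data. I haven't checked this adapter against any particular pandas version, so the sketch reads it with the `csv` module instead.

```python
import csv
import io


class ChunkStream(io.RawIOBase):
    """Hypothetical adapter: exposes an iterator of byte chunks as a stream."""

    def __init__(self, chunks):
        super().__init__()
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buffer):
        # Pull chunks until there is data to hand out or the iterator ends.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # zero bytes read signals end of stream
        n = min(len(buffer), len(self._leftover))
        buffer[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


# read() behaves as it would on any binary file object.
whole = io.BufferedReader(ChunkStream([b"ab", b"cd"])).read()

# Layering a text wrapper on top lets a CSV parser consume the stream.
stream = io.BufferedReader(ChunkStream([b"id,val", b"ue\n1,a\n", b"2,b\n"]))
text = io.TextIOWrapper(stream, encoding="utf-8", newline="")
rows = list(csv.reader(text))
print(rows)
```

Subclassing `io.RawIOBase` and implementing only `readinto()` is enough here because the base class and `io.BufferedReader` build `read()` on top of it.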