[FEA] Pass column indices as `index_col` in `read_csv`
If I want to use the `index_col` parameter to set certain columns as indices when reading a CSV file, I cannot pass a list of column indices (like in pandas). I can pass a list of column labels though:
cudf.read_csv(filepath, index_col=[0])
KeyError: 'None of [0] are in the columns'
cudf.read_csv(filepath, index_col=['family'])
While this is not a huge issue, I imagine the following is a common scenario: you know that the first 3 columns are index columns, but you don't know exactly how each is spelled (`'date'` vs `'Date'`, etc.). In this case, if passing a list of column indices were possible, `index_col=[0, 1, 2]` would have worked fine; otherwise, you have to read the file without specifying index columns and set the index later (or guess the column labels by trial and error).
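Until positions are accepted, one way to avoid guessing the spelling is to read only the header row and look up the labels by position. This is a minimal sketch, assuming a comma-delimited file whose first row is the header; `labels_at_positions` is a hypothetical helper, not part of cudf.

```python
import csv


def labels_at_positions(filepath, positions):
    """Return the header labels found at the given 0-based column positions.

    Hypothetical helper (not part of cudf). Assumes the first row of the
    file is the header and the delimiter is a comma.
    """
    with open(filepath, newline="") as f:
        header = next(csv.reader(f))  # read only the first row
    return [header[i] for i in positions]


# Usage (needs cudf installed and a real file path):
# labels = labels_at_positions(filepath, [0, 1, 2])
# df = cudf.read_csv(filepath, index_col=labels)
```

The resulting labels can then be passed as `index_col`, since a list of labels is already accepted.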
Is it possible for `index_col` to accept a list of indices like in pandas?
Thanks - I think this should be somewhat easy to implement. Here's the part of our CSV reader implementation that handles the `index_col` argument:
https://github.com/rapidsai/cudf/blob/7d2da0e5bd9bc178ab394506e58207667c59eedb/python/cudf/cudf/_lib/csv.pyx#L458-L470
It looks like we should handle lists-of-int here in addition to just ints.
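As a rough illustration of what handling lists-of-int could look like, the sketch below normalizes `index_col` to a list of column labels. It is not the code at the link above: `resolve_index_col` and its signature are hypothetical, and it assumes the column names are already known when `index_col` is processed.

```python
def resolve_index_col(index_col, column_names):
    """Normalize index_col to a list of column labels (or None).

    Hypothetical sketch, not cudf's implementation. Accepts None, a single
    int or str, or a list mixing ints (0-based positions) and str labels.
    """
    if index_col is None:
        return None

    # Treat a scalar as a one-element list so both cases share one path.
    entries = [index_col] if isinstance(index_col, (int, str)) else list(index_col)

    resolved = []
    for entry in entries:
        # bool is a subclass of int, so exclude it from positional lookup.
        if isinstance(entry, int) and not isinstance(entry, bool):
            if not 0 <= entry < len(column_names):
                raise IndexError(f"index_col position {entry} is out of range")
            resolved.append(column_names[entry])
        elif entry in column_names:
            resolved.append(entry)
        else:
            raise KeyError(f"{entry!r} is not in the columns")
    return resolved


columns = ["date", "family", "store", "sales"]
print(resolve_index_col([0, 1, 2], columns))
print(resolve_index_col("family", columns))
```

Resolving positions to labels up front would leave the rest of the reader unchanged, since it could keep working with labels only.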
I can take this. I can't find where to put the test cases though. Where are the `read_*` functions tested? I found python/cudf/cudf/tests/input_output/test_csv.py but it's empty at the moment.
Go ahead and add your tests to that file. Over time we want to migrate the CSV tests to that module anyway.
@amanlai, still working on this?
Hi @er-eis @shwina, if no one is following up on this issue, I'd love to give it a try!
i'm working on other things atm
Great! Then I am giving it a try