[FEA] Pass column indices as `index_col` in `read_csv`

Open amanlai opened this issue 1 year ago • 7 comments

When reading a CSV file, I want to use the index_col parameter to set certain columns as the index, but I cannot pass a list of column indices (as pandas allows). I can pass a list of column labels, though:

cudf.read_csv(filepath, index_col=[0])
KeyError: 'None of [0] are in the columns'

cudf.read_csv(filepath, index_col=['family'])

While this is not a huge issue, I imagine the following is a common scenario: you know that the first 3 columns are index columns, but you don't know exactly how each is spelled ('date' vs 'Date' etc.). In this case, if passing a list of column indices were possible, index_col=[0,1,2] would work fine; as it stands, you have to read the file without specifying index columns and set the index afterwards (or guess the column labels by trial and error).

Is it possible for index_col to accept a list of column indices, like in pandas?
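
Until that is supported, one possible workaround is to look up the labels from the header row first and pass those instead. The sketch below is only an illustration: resolve_index_labels is a hypothetical helper (not part of cudf), and it assumes the file has a single header row and uses the default comma delimiter.

```python
import csv


def resolve_index_labels(path, positions):
    """Map 0-based column positions to the labels in the CSV header row.

    Hypothetical helper: assumes the first row of the file is the header.
    """
    with open(path, newline="") as handle:
        header = next(csv.reader(handle))
    return [header[position] for position in positions]
```

The returned labels could then be passed along as cudf.read_csv(filepath, index_col=resolve_index_labels(filepath, [0, 1, 2])), since a list of labels is already accepted.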

Feb 23 '24 08:02 amanlai

Thanks - I think this should be somewhat easy to implement. Here's the part of our CSV reader implementation that handles the index_col argument:

https://github.com/rapidsai/cudf/blob/7d2da0e5bd9bc178ab394506e58207667c59eedb/python/cudf/cudf/_lib/csv.pyx#L458-L470

It looks like we should handle lists-of-int here in addition to just ints.
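
To make the idea concrete, here is a rough pure-Python sketch of the kind of normalization that could happen there. It is not the actual code in csv.pyx: normalize_index_col and its arguments are hypothetical names, and it assumes the column names are already known when index_col is processed.

```python
def normalize_index_col(index_col, column_names):
    """Resolve index_col to a list of column labels.

    Hypothetical helper: accepts None/False, a single int or str,
    or a list mixing ints (positions) and strs (labels).
    """
    if index_col is None or index_col is False:
        return []
    # Treat a bare int or str as a one-element list.
    if isinstance(index_col, (int, str)):
        index_col = [index_col]

    labels = []
    for item in index_col:
        if isinstance(item, bool):
            # bool is a subclass of int; reject it explicitly.
            raise TypeError(f"invalid index_col entry: {item!r}")
        if isinstance(item, int):
            # Integers are positions into the header.
            if not -len(column_names) <= item < len(column_names):
                raise IndexError(f"column position {item} is out of range")
            labels.append(column_names[item])
        elif item in column_names:
            # Strings must match an existing label.
            labels.append(item)
        else:
            raise KeyError(f"{item!r} is not in the columns")
    return labels
```

With columns ["date", "family", "sales"], this would resolve both index_col=[0, 1] and index_col=["date", "family"] to the same labels.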

Feb 26 '24 16:02 shwina

I can take this. I can't find where to put the test cases though. Where are the read_* functions tested? I found python/cudf/cudf/tests/input_output/test_csv.py but it's empty at the moment.

Mar 01 '24 20:03 amanlai

Go ahead and add your tests to that file. Over time we want to migrate the CSV tests to that module anyway.

Mar 04 '24 01:03 shwina

@amanlai, still working on this?

Apr 30 '24 06:04 er-eis

Hi @er-eis @shwina, if no one is following up on this issue, I'd love to give it a try!

May 04 '24 07:05 MananDoshi1301

i'm working on other things atm

May 04 '24 15:05 er-eis

Great! Then I am giving it a try

May 05 '24 20:05 MananDoshi1301
