kedro-plugins icon indicating copy to clipboard operation
kedro-plugins copied to clipboard

Use same defaults as `pandas.DataFrame.to_csv` in `kedro_datasets.pandas.csv_dataset.CSVDataset`.

Open galenseilisnh opened this issue 6 months ago • 20 comments

Description

I ran a pipeline and saved the results only to find out that the index, which contained required information, was missing!

Context

Users are expecting the pandas.CSVDataset to be a shallow wrapper around pandas.DataFrame.to_csv. It essentially is, but along with that shallow wrapper expectation is the expectation that the defaults will be the same. I found it surprising that the defaults were not preserved. It isn't a huge deal, but it becomes a "gotcha" for anyone onboarding Kedro + pandas CSV.

Possible Alternatives

Naturally, users 'can' use the code in the current state. This isn't a huge problem. Just a footgun for new users.

galenseilisnh avatar May 09 '25 16:05 galenseilisnh

Thanks for flagging this — good point. Ideally, we should match Pandas, but changing this now would be a breaking change. I think the best path is to update the docstring and docs to make this clearer for users.

@merelcht what are your thoughts on this issue?

rashidakanchwala avatar May 23 '25 11:05 rashidakanchwala

I think the best path is to update the docstring and docs to make this clearer for users.

That's a practical solution for the short term. In the long term I'd prefer this change be made in a major release.

Kedro users, including myself, will often accept breaking changes like this if they are communicated in advance and don't require substantial work to fix from the user end of things. The user code that would have to change here could be done for a large number of projects by a script, so from my point of view the changeover costs would be minor even for a large number of projects. That's my opinion as a user of Kedro, and a developer of some internal tools in Python, anyway. Hopefully it is useful to hear that.

galenseilisnh avatar May 23 '25 19:05 galenseilisnh

If the changes only need to happen in kedro-dataset, breaking it is no big deal. The next release is breaking anyway.

merelcht avatar May 26 '25 07:05 merelcht

@galenseilisnh , just to clarify is this specifically for the value "index": False ?

merelcht avatar May 26 '25 09:05 merelcht

@galenseilisnh , just to clarify is this specifically for the value "index": False ?

@merelcht Good question. Yes, that is the parameter and argument implicit in the original post.

galenseilisnh avatar May 26 '25 13:05 galenseilisnh

Those defaults have been like that since "forever" https://github.com/kedro-org/kedro/commit/b7b3301d38fa2f59b04d08dc279617eef42a8142, but the actual rationale is lost. We are okay changing them.

The task is to make this an empty dictionary:

https://github.com/kedro-org/kedro-plugins/blob/3ff2314d7903cf1ee50ce31ad99fcd2e21d9003c/kedro-datasets/kedro_datasets/pandas/csv_dataset.py#L74

@galenseilisnh would you like to send a PR?

astrojuanlu avatar May 26 '25 14:05 astrojuanlu

@astrojuanlu Sure, I will find some time in the next couple of weeks. It looks an unsubstantial implementation effort. I just need to sit down and review the contributing standards and ensure I follow all the required steps.

galenseilisnh avatar May 26 '25 22:05 galenseilisnh

Those defaults have been like that since "forever" kedro-org/kedro@b7b3301, but the actual rationale is lost. We are okay changing them.

Actually rationale isn't lost; the reason is because this allows you to more symmetrically save and load (i.e. the index doesn't end up as another non-index column when you reload).

You can save index and then specify index column to get the behavior you want. For now, I disagree with making this change, as it is quite significant given how long this behavior has been in place--and I think it's very justifiable.

deepyaman avatar May 27 '25 13:05 deepyaman

When I said that it was lost, I mean that the discussion in the original PR https://github.com/McK-Private/private-kedro/pull/355 (sorry, private link) doesn't include any mention of this, and there's no comment in the codebase (which would hopefully help communicate intent).

Thanks for bringing it up regardless @deepyaman . Pasting @galenseilisnh original quote here:

Users are expecting the pandas.CSVDataset to be a shallow wrapper around pandas.DataFrame.to_csv. It essentially is, but along with that shallow wrapper expectation is the expectation that the defaults will be the same. I found it surprising that the defaults were not preserved. It isn't a huge deal, but it becomes a "gotcha" for anyone onboarding Kedro + pandas CSV.

Is there a chance that more users are being tripped up by this?

astrojuanlu avatar May 27 '25 14:05 astrojuanlu

I think the big issue is when you transcode between Spark, Pandas, Polars etc. I like this default behaviour, we can throw a warning if people really feel strongly.

datajoely avatar May 27 '25 15:05 datajoely

Those defaults have been like that since "forever" kedro-org/kedro@b7b3301, but the actual rationale is lost. We are okay changing them.

Actually rationale isn't lost; the reason is because this allows you to more symmetrically save and load (i.e. the index doesn't end up as another non-index column when you reload).

You can save index and then specify index column to get the behavior you want. For now, I disagree with making this change, as it is quite significant given how long this behavior has been in place--and I think it's very justifiable.

That's an interesting point. I am going to think about that some more.

galenseilisnh avatar May 27 '25 16:05 galenseilisnh

Reviewing the public API docs I see that correct and useful info is present:

Image

Image

https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-7.0.0/api/kedro_datasets.pandas.CSVDataset.html

I have a hunch that I would have been more likely to spot this if the class-level docstring explained this deviation from Pandas' defaults. Based on this hunch, I suggest an explanation should go up here:

Image

galenseilisnh avatar May 27 '25 16:05 galenseilisnh

When I said that it was lost, I mean that the discussion in the original PR https://github.com/McK-Private/private-kedro/pull/355 (sorry, private link) doesn't include any mention of this, and there's no comment in the codebase (which would hopefully help communicate intent).

Thanks for bringing it up regardless @deepyaman . Pasting @galenseilisnh original quote here:

Users are expecting the pandas.CSVDataset to be a shallow wrapper around pandas.DataFrame.to_csv. It essentially is, but along with that shallow wrapper expectation is the expectation that the defaults will be the same. I found it surprising that the defaults were not preserved. It isn't a huge deal, but it becomes a "gotcha" for anyone onboarding Kedro + pandas CSV.

Is there a chance that more users are being tripped up by this?

I also think a code comment in kedro_datasets.pandas.CSVDataset would be helpful for some users.

galenseilisnh avatar May 27 '25 16:05 galenseilisnh

After seeing some counterpoints I am convinced that we should keep the current state for the functional code. But I also think this is something that could cause future confusion (it confused me at first, FWIW), so further docs/comments would be my current suggestion.

galenseilisnh avatar May 27 '25 16:05 galenseilisnh

Thanks @galenseilisnh for the extra pointers.

However, now it's me who needs to be convinced that "discard data" is a good default 😄

astrojuanlu avatar May 27 '25 19:05 astrojuanlu

If the intent is to have save + load symmetry, why not leave index=True and make index_col=0 on read_csv?

In [3]: df
Out[3]: 
       value
index       
A         11
B         12
C         13

In [4]: df.to_csv("data.csv")

In [5]: pd.read_csv("data.csv")  # Ugh
Out[5]: 
  index  value
0     A     11
1     B     12
2     C     13

In [6]: pd.read_csv("data.csv", index_col=0)  # Recovers initial result
Out[6]: 
       value
index       
A         11
B         12
C         13

astrojuanlu avatar May 27 '25 19:05 astrojuanlu

If the intent is to have save + load symmetry, why not leave index=True and make index_col=0 on read_csv?

In [3]: df
Out[3]: 
       value
index       
A         11
B         12
C         13

In [4]: df.to_csv("data.csv")

In [5]: pd.read_csv("data.csv")  # Ugh
Out[5]: 
  index  value
0     A     11
1     B     12
2     C     13

In [6]: pd.read_csv("data.csv", index_col=0)  # Recovers initial result
Out[6]: 
       value
index       
A         11
B         12
C         13

Good point. That also seems to obtain the IO symmetry.

galenseilisnh avatar May 27 '25 20:05 galenseilisnh

If the intent is to have save + load symmetry, why not leave index=True and make index_col=0 on read_csv?

In [3]: df
Out[3]: 
       value
index       
A         11
B         12
C         13

In [4]: df.to_csv("data.csv")

In [5]: pd.read_csv("data.csv")  # Ugh
Out[5]: 
  index  value
0     A     11
1     B     12
2     C     13

In [6]: pd.read_csv("data.csv", index_col=0)  # Recovers initial result
Out[6]: 
       value
index       
A         11
B         12
C         13

~I think this would be fine.~

When this dataset was initially developed (and I was primarily a Kedro user), most users didn't use pandas indexes (I remember going through a phase trying to convince people that using indexes was a good thing to do, to no avail :joy:). As a result, it probably seemed most useful to not write a useless column nobody ever used. As a result, I think it's still worth being confident more people are actually using indexes before making this change (if at all).

As I was writing this response, I was like, "What about multi-indexes?" I think it's hard to get right.

deepyaman avatar May 28 '25 13:05 deepyaman

That said, re I/O symmetry, pandas.ParquetDataset writes the index and would load it back as a column...

So the argument is a bit leaky, and there's definitely inconsistencies...

deepyaman avatar May 28 '25 13:05 deepyaman

As a result, it probably seemed most useful to not write a useless column nobody ever used. As a result, I think it's still worth being confident more people are actually using indexes before making this change (if at all).

Good point.

One package I am using which uses pandas' indexing is sktime. I don't have any reliable quantification of use of this package, although it has over 8k stars.

galenseilisnh avatar May 28 '25 17:05 galenseilisnh