kedro-plugins
`mode: "a"` in `pandas.CSVDataset` still overwrites the file
Description
As per title.
Context
Originally reported in https://linen-slack.kedro.org/t/15705930/hi-everyone-i-have-an-easy-question-slightly-smiling-face-wh#b644730d-0683-4426-8e40-4e7ef96d8cc7
Steps to Reproduce
Starting from a pandas-iris starter, I tweaked the pipeline like this:
from kedro.pipeline import Pipeline, node, pipeline


def add_data(df):
    # Take the last row and give it the next index value
    new_data = df.iloc[len(df) - 1:]
    new_data.index = [new_data.index[0] + 1]
    return new_data


def create_pipeline(**kwargs):
    return pipeline(
        [
            node(
                func=add_data,
                inputs=["example_iris_data"],
                outputs="new_data",
            ),
        ]
    )
And the catalog looks like this:
example_iris_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/iris.csv

new_data:
  type: pandas.CSVDataSet
  filepath: data/03_primary/new_data.csv
  save_args:
    mode: a
    header: false
Expected Result
The new rows get appended to the file.
Actual Result
The file is silently overwritten every time.
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
- Kedro version used (`pip show kedro` or `kedro -V`): 0.18.13
- Kedro plugin and kedro plugin version used (`pip show kedro-airflow`):
- Python version used (`python -V`): 3.11
- Operating system and version: macOS Ventura
https://github.com/kedro-org/kedro-plugins/issues/513
The root cause is that we hardcoded `mode="wb"` for the file write, so the mode given by the user never takes effect. This is not done consistently across datasets, so we need to review all the datasets at once.
This is part of the reason why using generators is hard.
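To illustrate the root cause described above, here is a minimal standard-library sketch. It is not the actual Kedro implementation: `save_hardcoded`, `save_with_mode`, and `read_rows` are hypothetical helpers, and the serialize-to-buffer-then-write pattern is an assumption about how such a dataset might save. The first helper accepts a `mode` argument but always opens the target with `"wb"`, so repeated saves overwrite the file. The second forwards the mode to `open()`.

```python
import csv
import io
import os
import tempfile


def save_hardcoded(path, rows, mode="w"):
    """Hypothetical save helper that mimics the bug: `mode` never reaches the file."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)  # serialize into memory first
    # The target is always opened with "wb", so existing content is truncated.
    with open(path, "wb") as f:
        f.write(buf.getvalue().encode("utf-8"))


def save_with_mode(path, rows, mode="w"):
    """Hypothetical variant that forwards the caller's mode ("w" or "a") to open()."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    with open(path, mode + "b") as f:
        f.write(buf.getvalue().encode("utf-8"))


def read_rows(path):
    """Read the CSV back as a list of rows (each row is a list of strings)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


tmp_dir = tempfile.mkdtemp()
overwritten = os.path.join(tmp_dir, "overwritten.csv")
appended = os.path.join(tmp_dir, "appended.csv")

# Save two rows one after the other, asking for append mode each time.
for row in (["5.9", "virginica"], ["6.1", "virginica"]):
    save_hardcoded(overwritten, [row], mode="a")
    save_with_mode(appended, [row], mode="a")

print(read_rows(overwritten))  # only the last row survives
print(read_rows(appended))     # both rows are present
```

Any real fix would also need to decide how append mode interacts with versioned datasets and remote filesystems, which this sketch does not cover.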
Thank you astrojuanlu for looking into this!