
`mode: "a"` in `pandas.CSVDataset` still overwrites the file

Open astrojuanlu opened this issue 1 year ago • 2 comments

Description

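As per the title: setting `mode: "a"` in the `save_args` of `pandas.CSVDataSet` has no effect, and the file is overwritten on every save instead of being appended to.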

Context

Originally reported in https://linen-slack.kedro.org/t/15705930/hi-everyone-i-have-an-easy-question-slightly-smiling-face-wh#b644730d-0683-4426-8e40-4e7ef96d8cc7

Steps to Reproduce

Starting from the pandas-iris starter, I tweaked the pipeline like this:

from kedro.pipeline import node, pipeline


def add_data(df):
    # Take the last row and give it the next index value, simulating one new record.
    new_data = df.iloc[len(df) - 1:].copy()
    new_data.index = [new_data.index[0] + 1]
    return new_data


def create_pipeline(**kwargs):
    return pipeline(
        [
            node(
                func=add_data,
                inputs=["example_iris_data"],
                outputs="new_data",
            ),
        ]
    )

And the catalog looks like this:

example_iris_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/iris.csv

new_data:
  type: pandas.CSVDataSet
  filepath: data/03_primary/new_data.csv
  save_args:
    mode: a
    header: false

Expected Result

The new rows get appended to the file.

Actual Result

The file is silently overwritten every time.

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): 0.18.13
  • Kedro plugin and kedro plugin version used (pip show kedro-airflow):
  • Python version used (python -V): 3.11
  • Operating system and version: macOS Ventura

astrojuanlu, Sep 14 '23 08:09

https://github.com/kedro-org/kedro-plugins/issues/513

The root cause is that we hardcoded mode="wb". This is not done consistently across datasets, so we need to review all the datasets at once.

This is part of the reason why using generators is hard.

noklam, Sep 14 '23 09:09
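
To illustrate the root cause described above, here is a minimal standard-library sketch. It is not Kedro's actual code: `save_overwriting`, `save_honoring_mode`, `count_rows` and `demo` are hypothetical names, and the sketch assumes the save path serializes rows to an in-memory buffer and then writes that buffer to a file opened with a hardcoded "wb". Under that assumption, whatever `mode` the caller passes never reaches the file, so each save replaces the previous contents.

```python
import csv
import io
import os
import tempfile


def save_overwriting(path, rows, mode="w"):
    """Hypothetical stand-in for a save method that ignores the caller's mode."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    # `mode` is never used here: the file is always opened with "wb",
    # so any existing content is truncated.
    with open(path, mode="wb") as f:
        f.write(buf.getvalue().encode("utf-8"))


def save_honoring_mode(path, rows, mode="w"):
    """One possible fix (an assumption): map the requested mode onto the file open."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    file_mode = "ab" if "a" in mode else "wb"
    with open(path, mode=file_mode) as f:
        f.write(buf.getvalue().encode("utf-8"))


def count_rows(path):
    """Count the CSV rows currently stored in the file."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f))


def demo():
    """Save one row twice with mode="a" through each function and count the rows."""
    with tempfile.TemporaryDirectory() as tmp:
        overwritten = os.path.join(tmp, "overwritten.csv")
        appended = os.path.join(tmp, "appended.csv")
        for row in ([5.1, 3.5, "setosa"], [5.9, 3.0, "virginica"]):
            save_overwriting(overwritten, [row], mode="a")
            save_honoring_mode(appended, [row], mode="a")
        return count_rows(overwritten), count_rows(appended)


if __name__ == "__main__":
    print(demo())
```

The first function reproduces the symptom reported here (only the last row survives), while the second keeps both rows. The same mechanism would explain why generator-style nodes are hard to support: if each yielded chunk is saved separately, an overwriting save keeps only the final chunk.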

Thank you astrojuanlu for looking into this!

emilio-gagliardi, Sep 14 '23 19:09