kedro-azureml icon indicating copy to clipboard operation
kedro-azureml copied to clipboard

How to correctly create uri_file data assets ?

Open Gabriel2409 opened this issue 10 months ago • 22 comments

I am trying to save a data asset as a uri_file and the dataset is incorrectly saved as a uri_folder when I launch kedro azureml run

I have the following catalog:

projects_train_raw_local:
  type: pandas.CSVDataSet
  filepath: data/01_raw/dataset.csv

projects_train_raw:
    type: kedro_azureml.datasets.AzureMLAssetDataSet
    azureml_dataset: projects_train_raw
    root_dir: data/00_azurelocals/ 
    versioned: True
    azureml_type: uri_file
    dataset:
        type: pandas.CSVDataSet
        filepath: "dataset.csv"

and the following pipeline which just opens the local file and saves it

def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        nodes=[
            node(
                func=lambda x: x,
                inputs="projects_train_raw_local",
                outputs="projects_train_raw",
                name="create_train_dataasset",
            )
        ]
    )

I expected a new data asset to be created on azure as an uri_file. However, i get the following info on azure image

image

It seems my file is not saved correctly, which seems to correspond to this part in cli.py if I am not mistaken

    # 2. Save dummy outputs
    # In distributed computing, it will only happen on nodes with rank 0
    if not pipeline_data_passing and is_distributed_master_node():
        for data_path in azure_outputs.values():
            (Path(data_path) / "output.txt").write_text("#getindata")
    else:
        logger.info("Skipping saving Azure outputs on non-master distributed nodes")

How can I correctly create a uri_file data asset ?

Gabriel2409 avatar Aug 24 '23 15:08 Gabriel2409