MLOpsPython
Support allow_reuse in repo
Currently all of the pipeline steps have allow_reuse=False. As a developer, it would be great to enable reuse of steps so that only my changes run.
Setting allow_reuse=True does not work in the repo for two reasons:
- The repo would need to stop passing build_id as a parameter to all the steps (or allow the user to build and run with a static/fake build id while iterating on code). Updating any parameter value or parameter default means no reuse of steps.
- All of the pipeline steps also share the same hashed directory, which forces a snapshot rebuild if any file in that directory changes. All the steps in the train pipeline currently use source_directory=e.sources_directory_train. In the repo, train.py appears to be a standalone script. If the repo wanted to optimize further for reuse, it could put the scripts into isolated directories per step, or point to the file instead of the directory. As long as the snapshot is not forced to rebuild, reuse should be possible.
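To make the second point concrete, here is a toy sketch (not Azure ML's actual hashing algorithm, just an illustration of the principle) of why a shared source_directory defeats reuse: if the snapshot hash covers every file in the directory, editing any one step's script invalidates the snapshot for all steps.

```python
import hashlib
import tempfile
from pathlib import Path

def snapshot_hash(source_directory: str) -> str:
    """Hash every file under the directory, as a step snapshot would."""
    digest = hashlib.sha256()
    for path in sorted(Path(source_directory).rglob("*")):
        if path.is_file():
            digest.update(path.name.encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()

# Demo with a temp directory standing in for e.sources_directory_train
with tempfile.TemporaryDirectory() as shared:
    Path(shared, "train.py").write_text("print('train')")
    Path(shared, "evaluate.py").write_text("print('eval')")
    before = snapshot_hash(shared)
    # Touching only evaluate.py still changes the shared snapshot,
    # so the train step cannot be reused either.
    Path(shared, "evaluate.py").write_text("print('eval v2')")
    assert snapshot_hash(shared) != before
```

With one directory per step (or per-file snapshots), editing evaluate.py would leave the train step's snapshot, and therefore its reuse eligibility, untouched.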
Just jotting down ideas:
For the first point, we can specify tags to associate with the run & steps as part of the submission process.
```
PipelineParameters: '"ParameterAssignments": {"model_name": "$(MODEL_NAME)"}, "tags": {"buildid": "$(Build.BuildId)"}, "StepTags": {"buildid": "$(Build.BuildId)"}'
```
In the register model step we can pull the tags from the parent run (like we do for mse value).
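A minimal sketch of what that register-model step could look like: read the buildid tag from the parent pipeline run instead of taking build_id as a step parameter. The real object would be an azureml.core.Run; the FakeRun class below is a hypothetical stand-in mimicking its .parent and .tags attributes so the sketch is self-contained.

```python
class FakeRun:
    """Stand-in for azureml.core.Run: exposes .tags and .parent only."""
    def __init__(self, tags, parent=None):
        self.tags = tags
        self.parent = parent

def get_build_id(run, default="local"):
    """Walk to the parent pipeline run and read its 'buildid' tag,
    falling back to a static id for local iteration on code."""
    parent = run.parent
    if parent is not None and "buildid" in parent.tags:
        return parent.tags["buildid"]
    return default

# A step run whose parent pipeline run was tagged at submission time
step_run = FakeRun(tags={}, parent=FakeRun(tags={"buildid": "20200131.1"}))
assert get_build_id(step_run) == "20200131.1"
```

Because the build id travels as a run tag rather than a step parameter, step fingerprints stay identical between builds and reuse is no longer broken by every new build id.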
@sudivate The change you made doesn't address @xinyi-joffre's second point. Each step in the pipeline should be able to allow_reuse, which can only be achieved if each step uses a "standalone" script directory.
@jotaylo
> Indicates whether the step should reuse previous results when re-run with the same settings. Reuse is enabled by default. If the step contents (scripts/dependencies) as well as inputs and parameters remain unchanged, the output from the previous run of this step is reused. When reusing the step, instead of submitting the job to compute, the results from the previous run are immediately made available to any subsequent steps. If you use Azure Machine Learning datasets as inputs, reuse is determined by whether the dataset's definition has changed, not by whether the underlying data has changed.
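The rule quoted above can be modeled as a fingerprint over script contents, inputs, and parameters; if the fingerprint matches the previous run's, the cached output is reused. A toy model (simplified, not Azure ML's actual mechanism) also shows why passing a fresh build_id as a parameter forecloses reuse:

```python
import hashlib
import json

def step_fingerprint(script: str, inputs: dict, params: dict) -> str:
    """Fingerprint of everything the reuse check covers."""
    payload = json.dumps(
        {"script": script, "inputs": inputs, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

prev = step_fingerprint("train.py v1", {"dataset": "def-3"}, {"model_name": "m"})
# Same settings -> fingerprints match -> reuse
same = step_fingerprint("train.py v1", {"dataset": "def-3"}, {"model_name": "m"})
# A per-build build_id parameter changes the fingerprint every run -> no reuse
fresh = step_fingerprint("train.py v1", {"dataset": "def-3"},
                         {"model_name": "m", "build_id": "20200131.1"})
assert same == prev and fresh != prev
```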
How is reuse achieved and why was this issue closed?
- By setting allow_reuse on the training step, it will reuse the previous result of the training step if no changes were made to the training script, thus saving time by not re-executing the step.
- I don't want the evaluation and registration steps to be reused, mainly because of what these individual steps perform in the context of this repo, and their script content doesn't change often.
> If you use Azure Machine Learning datasets as inputs, reuse is determined by whether the dataset's definition has changed, not by whether the underlying data has changed.

Having allow_reuse set to true in the training pipeline was problematic for us in the development phase, while we were testing the pipelines and scripts. In the train step (where allow_reuse is true), the mse is logged to the parent run. In the eval step (where allow_reuse is false) we fetch the mse from the parent run to evaluate the model. Since the flag is true on one step and false on the other, the parent run ids don't match: the train step just reuses the results from a previous run, while the eval step tries to fetch the mse from the current parent run and fails, because the mse was never logged in this parent run. This happens even when the data changes.
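A minimal simulation of the failure described above (plain Python with a dict standing in for run metrics, not the azureml Run API): the reused train step logged mse to an old parent run, so the current parent run has nothing for the non-reused eval step to fetch.

```python
parent_runs = {
    "run-1": {"mse": 0.42},  # original run: train executed and logged mse
    "run-2": {},             # new run: train step was reused, nothing logged
}

def fetch_mse(parent_run_id: str) -> float:
    """What the eval step does: read mse from its own parent run."""
    metrics = parent_runs[parent_run_id]
    if "mse" not in metrics:
        raise KeyError("mse was never logged in this parent run "
                       "(the train step was reused)")
    return metrics["mse"]

assert fetch_mse("run-1") == 0.42   # original run succeeds
try:
    fetch_mse("run-2")              # reused run fails, even with new data
except KeyError:
    pass
```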
I think the allow_reuse flag should be set to false on all steps to start with, and users should be clearly informed of its existence and how to use it. Anyone adapting this repo will be experimenting with the pipelines and trying to run their own scripts, and this stands in the way because it is a very tricky issue to track down.