firefox-translations-training
Task artifact expiration and long-term data storage
We identified a potential issue related to the expiration dates of task artifacts. Currently, all the logs and artifacts of a task are set to be deleted after 1 year. This time frame is adequate for training runs but does not take into account the need for long-term storage of pipeline artifacts or logs.
To prevent data loss, we should publish these artifacts elsewhere. We could add new pipeline steps that transfer the relevant data to dedicated storage buckets or to archive.mozilla.org.
While the Mozilla archive is fully public, we can publish artifacts to a private bucket if we need to.
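To make the one-year window concrete, here is a minimal Python sketch that estimates when a task's artifacts would be deleted and whether they are close enough to expiry to need archiving. The function names, the fixed 365-day retention, and the 30-day warning window are illustrative assumptions, not values read from the pipeline configuration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: artifacts expire a fixed number of days after task creation.
DEFAULT_RETENTION = timedelta(days=365)


def artifact_expiry(created: datetime, retention: timedelta = DEFAULT_RETENTION) -> datetime:
    """Return the moment a task's artifacts are expected to be deleted."""
    return created + retention


def needs_archiving(
    created: datetime,
    now: datetime,
    warn_before: timedelta = timedelta(days=30),
) -> bool:
    """True when the artifacts expire within `warn_before` of `now`, or already have."""
    return artifact_expiry(created) - now <= warn_before


if __name__ == "__main__":
    created = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(artifact_expiry(created).isoformat())
    print(needs_archiving(created, datetime.now(timezone.utc)))
```

A check like this could run periodically over known task groups to flag runs that have not been copied to long-term storage yet.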
A couple of questions:
- Do we need a dedicated bucket?
- Which tasks are we talking about and what artifacts are we looking to persist?
@eu9ene is this going to be covered by the tracking platform?
No, we should store the artifacts on GCS regardless of the solution for experiment tracking. In the MLflow case, you basically just track a path to an artifact in remote storage. We will need to link things properly, though, so that when the temporary TC storage expires, the link still points to the right data and we can see the artifact in the experiment tracking UI. One solution would be to export key artifacts to long-term storage after the run has succeeded and log those extra artifacts to the corresponding step in experiment tracking (there will be duplicates in this case).
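One way to keep such links stable is to derive the long-term location deterministically from the experiment, the task, and the artifact name, so that the same path can be logged in experiment tracking. A minimal sketch, in which the bucket name, the path layout, and the function name are all hypothetical:

```python
from posixpath import join, normpath

# Hypothetical bucket name; the thread does not name the real one.
ARCHIVE_BUCKET = "example-translations-archive"


def archive_uri(
    experiment: str,
    task_id: str,
    artifact_name: str,
    bucket: str = ARCHIVE_BUCKET,
) -> str:
    """Build a stable gs:// URI for one task artifact.

    `artifact_name` is the artifact's name within the task,
    for example "public/logs/live.log" (an illustrative value).
    """
    relative = normpath(artifact_name).lstrip("/")
    # Refuse names that would escape the task's own prefix.
    if relative == ".." or relative.startswith("../"):
        raise ValueError(f"unsafe artifact name: {artifact_name!r}")
    return "gs://" + join(bucket, experiment, task_id, relative)


if __name__ == "__main__":
    print(archive_uri("exp1", "abc123", "public/logs/live.log"))
```

Because the URI depends only on identifiers that already exist for the run, it can be computed both by the export step and by whatever logs the artifact to experiment tracking, without the two having to share state.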
@gabrielBusta
Do we need a dedicated bucket?
I would prefer a dedicated bucket for the translation artifacts. Also, I think it's best practice not to mix things, because bucket policies might differ depending on the data.
Which tasks are we talking about and what artifacts are we looking to persist?
Let's start with:
- Logs for all tasks.
- All evaluation outputs (doesn't matter that we have experiment tracking, let's keep them)
- All outputs of the training steps (including vocab)
- Outputs of the export step
The use cases would be to investigate how things ran historically, inspect the models and their metrics, and maybe fine-tune some. I don't think we need to preserve datasets that can be downloaded again or cleaning results that can be reproduced from the configs.
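The selection rules above could be expressed as a small filter. This is a sketch only: the task kind labels (`train`, `train-vocab`, `evaluate`, `export`, `clean`) and the log path prefix are hypothetical placeholders, not the pipeline's actual task names.

```python
# Hypothetical task kinds whose outputs should all be kept.
KEEP_ALL_OUTPUTS = {"train", "train-vocab", "evaluate", "export"}

# Assumption: log artifacts live under this prefix.
LOG_PREFIX = "public/logs/"


def should_persist(task_kind: str, artifact_name: str) -> bool:
    """Decide whether one artifact belongs in long-term storage."""
    if artifact_name.startswith(LOG_PREFIX):
        return True  # logs are kept for every task
    return task_kind in KEEP_ALL_OUTPUTS


def select_artifacts(artifacts):
    """Filter (task_kind, artifact_name) pairs down to the ones worth archiving."""
    return [(kind, name) for kind, name in artifacts if should_persist(kind, name)]


if __name__ == "__main__":
    sample = [
        ("train", "public/build/model.npz"),
        ("clean", "public/build/corpus.gz"),
        ("clean", "public/logs/live.log"),
    ]
    for kind, name in select_artifacts(sample):
        print(kind, name)
```

Downloadable datasets and reproducible cleaning outputs fall through the filter, while their logs are still kept.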
Let's sync this work with #312. We'll need to download and publish old artifacts to W&B but what if they become unavailable?
@bhearsum we have a list of old experiments here: https://github.com/mozilla/firefox-translations-training/issues/312#issuecomment-1946802157
I think this issue has outlived its usefulness:
- We already have a bucket for uploading to
- I've done a manual upload of the old experiments already
- #466 is tracking automatic uploads
- I will do more manual uploads in the future in the meantime if needed.
(If there's anything else, let's open a specific issue for such things.)