kedro
kedro copied to clipboard
Improve running kedro as part of an automated workflow (CI/CD)
Description
When any kedro
command is executed for the first time or in a clean environment (which is often the case in CI/CD) the telemetry prompt gets run. The user then has to answer Yes or No to running telemetry. If no user is involved, e.g. in an automated CI/CD workflow a hack needs to be put in place to programatically add a .telemetry
file.
Summary of action (Edited on 2023-07-31)
- [x] kedro-org/kedro-plugins#483
- [ ] Document how a user can automatically give/deny consent to telemetry, by describing how to programatically create a
.telemetry
file (use this issue) - [x] https://github.com/kedro-org/kedro/issues/2867
- Implement logic in the
kedro new
command that detects if it's being run in CI and if so, whether consent has been given and if that's not the case then it will advice the user on how to do this.
- Implement logic in the
Possible Implementation
The above hack works, but isn't well known. An alternative way of solving this issue is by running the command in a form like: yes | kedro new
. We need to document how people can accept/deny the telemetry tracking automatically.
An alternative way of solving this issue is by running the command in a form like:
yes | kedro new
.
There are probably interesting repercussions to recommending a workflow like this. Anybody who followed such a workflow would contribute to the telemetry stats, adding a lot of automated runs (no different to adding a .telemetry
file turning it on programmatically). If you don't want to include stats from CI runs, maybe should prefer yes no | kedro new
. :joy:
I think it would be good to take this opportunity to distinguish CI runs in telemetry. Maybe can read an environment variable; most CI systems (GitHub Actions, CircleCI, Travis CI, etc.) follow the convention of setting CI=true
, so I would recommend detecting presence of the environment variable, solving the telemetry prompt automatically, and recording commands as belonging to a CI run--and ideally not requiring any additional configuration or hacks to avoid this telemetry prompt.
I think the one question this leaves is, what do we do if users want to opt out of telemetry in the CI. Is it OK to have this be documented, but not prompted by default, or does this introduce some risk?
Totally agree with everything @deepyaman said here. If we want to recommend in documentation a command that programmatically disables telemetry in CI then let's go for the clearer echo "consent: false" > .telemetry
rather than yes no
😀
But using the CI
variable is is a much better solution here. I don't think anyone on the team was aware of its existence before (we've just been kind of guessing which telemetry data come from CI jobs) but that is perfect 👍 👍 👍 As Deepyaman says, the only question then is whether CI=true
should disable or enable telemetry by default. In the past I think we've been regarding CI runs as contaminating our telemetry data, so probably it's easiest and safest from the risk perspective to just disable telemetry in this case. But that does mean that in future if we do actually want to look at CI telemetry data we wouldn't have anything available on it... @yetudada?
In the past I think we've been regarding CI runs as contaminating our telemetry data, so probably it's easiest and safest from the risk perspective to just disable telemetry in this case. But that does mean that in future if we do actually want to look at CI telemetry data we wouldn't have anything available on it...
Another thought I had this morning was to enable telemetry by default in CI environments, but to be even more conservative in what's collected by kedro-telemetry
in those cases. Not sure that's necessary, though; I feel like it's quite conservative by default (but I'm also not that familiar with it).
I would advocate as capturing the CI
variable in telemetry regardless of whether the default is to have it on or off.
From a data perspective I would still like to see to understand how many Kedro projects are being run with CI, because it helps give us a sense of projects that have moved into production systems. It would be great to figure out how to identify a "user" that is CI but also still observe the telemetry consent that the user set.
We discussed this issue in a Technical Design session and decided on the following:
- We do want to capture telemetry for projects run on CI. This gives us insights into how many Kedro projects are run as part of an automated CI workflow, which in turn will help us understand how many Kedro projects are part of production systems.
- Use the
CI
environment variable to indicate that data comes from a CI environment - The user still needs to be allowed to decide whether to run telemetry on their CI Kedro projects or not, so we will not implement a system to automatically track the data if it's part of a CI run.
- We need to document how a user can automatically give/deny consent to telemetry, by describing how to programatically create a
.telemetry
file - We want to implement some logic in the
kedro new
command that detects if it's being run in CI and if so, whether consent has been given/not and if not then it will advice the user on how to do this.
To do:
- [x] Update
kedro-telemetry
to use theCI
environment variable to indicate that data comes from a CI environment - [ ] Document how a user can automatically give/deny consent to telemetry, by describing how to programatically create a
.telemetry
file - [ ] Implement logic in the
kedro new
command that detects if it's being run in CI and if so, whether consent has been given and if that's not the case then it will advice the user on how to do this. https://github.com/kedro-org/kedro/issues/2867
Today there was a question about this and I just wanted to confirm that the "Do you opt into usage analytics? [y/N]:` prompt does not block non-interactive terminals. Example:
FROM python:3.10
RUN pip install kedro
WORKDIR /src
RUN echo "project_name: Spaceflights" > kedro.yml
RUN kedro new --config=kedro.yml --starter=spaceflights
WORKDIR /src/spaceflights
RUN pip install -r src/requirements.txt
RUN kedro run
$ docker build . --platform linux/amd64 -t kedro-test:dev
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
Install the buildx component to build images with BuildKit:
https://docs.docker.com/go/buildx/
Sending build context to Docker daemon 796.3MB
Step 1/8 : FROM python:3.10
---> 700e6de1daef
Step 2/8 : RUN pip install kedro
...
Step 8/8 : RUN kedro run
---> Running in e6b62eb0a4cf
As an open-source project, we collect usage analytics.
We cannot see nor store information contained in a Kedro project.
You can find out more by reading our privacy notice:
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice
Do you opt into usage analytics? [y/N]: [06/22/23 07:05:15] WARNING Failed to confirm consent. No data plugin.py:265
was sent to Heap. Exception:
[06/22/23 07:05:15] INFO Kedro project spaceflights session.py:359
As an open-source project, we collect usage analytics.
We cannot see nor store information contained in a Kedro project.
You can find out more by reading our privacy notice:
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice
Do you opt into usage analytics? [y/N]: WARNING Failed to confirm consent. No data plugin.py:265
was sent to Heap. Exception:
INFO Loading data from 'companies' data_catalog.py:345
(CSVDataSet)...
INFO Running node: node.py:331
...
Today there was a question about this and I just wanted to confirm that the "Do you opt into usage analytics? [y/N]:` prompt does not block non-interactive terminals. Example:
FROM python:3.10 RUN pip install kedro WORKDIR /src RUN echo "project_name: Spaceflights" > kedro.yml RUN kedro new --config=kedro.yml --starter=spaceflights WORKDIR /src/spaceflights RUN pip install -r src/requirements.txt RUN kedro run
$ docker build . --platform linux/amd64 -t kedro-test:dev DEPRECATED: The legacy builder is deprecated and will be removed in a future release. Install the buildx component to build images with BuildKit: https://docs.docker.com/go/buildx/ Sending build context to Docker daemon 796.3MB Step 1/8 : FROM python:3.10 ---> 700e6de1daef Step 2/8 : RUN pip install kedro ... Step 8/8 : RUN kedro run ---> Running in e6b62eb0a4cf As an open-source project, we collect usage analytics. We cannot see nor store information contained in a Kedro project. You can find out more by reading our privacy notice: https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice Do you opt into usage analytics? [y/N]: [06/22/23 07:05:15] WARNING Failed to confirm consent. No data plugin.py:265 was sent to Heap. Exception: [06/22/23 07:05:15] INFO Kedro project spaceflights session.py:359 As an open-source project, we collect usage analytics. We cannot see nor store information contained in a Kedro project. You can find out more by reading our privacy notice: https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice Do you opt into usage analytics? [y/N]: WARNING Failed to confirm consent. No data plugin.py:265 was sent to Heap. Exception: INFO Loading data from 'companies' data_catalog.py:345 (CSVDataSet)... INFO Running node: node.py:331 ...
For me it does keep hanging in a Databricks job, with Kedro==0.18.7 installed.
Thanks @juanchodpg2, good to know. You can create a .telemetry
file with the appropriate content to opt in or out of telemetry and avoid the blocking https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#how-do-i-consent-to-the-use-of-kedro-telemetry
@juanchodpg2 reports that creating a .telemetry
file is finicky and doesn't always work, probably because of having to place it at the right directory (still unclear) https://linen-slack.kedro.org/t/16140465/hi-all-i-am-having-issues-trying-to-run-a-databricks-job-usi#71319d4a-493b-46fc-9f02-6fdf0cad3898
We might need to bump the priority of this. Getting a solid consent process is a blocker for https://github.com/kedro-org/kedro-plugins/issues/375
Let's continue the conversation about kedro-telemetry being blocked on Databricks and other places in https://github.com/kedro-org/kedro-plugins/issues/484
Also most CI platforms include the CI=True
env var - we should detect that
My original example on https://github.com/kedro-org/kedro/issues/1640#issuecomment-1602119633 had a mistake, it didn't install kedro-telemetry
. Updated version:
FROM python:3.10
RUN pip install kedro kedro-telemetry
WORKDIR /src
RUN kedro new --name=spaceflights --starter=spaceflights-pandas
WORKDIR /src/spaceflights
RUN pip install -r requirements.txt
RUN kedro run
and I confirm the docker build doesn't block:
$ docker build . --platform linux/amd64 -t kedro-test:dev
...
---> [Warning] The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
---> Running in ffb042e40704
<jemalloc>: MADV_DONTNEED does not work (memset will be used instead)
<jemalloc>: (This is the expected behaviour if you are running under QEMU)
As an open-source project, we collect usage analytics.
We cannot see nor store information contained in a Kedro project.
You can find out more by reading our privacy notice:
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice
Do you opt into usage analytics? [y/N]: [02/19/24 13:33:36] WARNING Failed to confirm consent. No data plugin.py:276
was sent to Heap. Exception:
INFO Kedro project spaceflights session.py:321
As an open-source project, we collect usage analytics.
We cannot see nor store information contained in a Kedro project.
You can find out more by reading our privacy notice:
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice
Do you opt into usage analytics? [y/N]: [02/19/24 13:33:37] WARNING Failed to confirm consent. No data plugin.py:276
was sent to Heap. Exception:
[02/19/24 13:33:39] INFO Using synchronous mode for sequential_runner.py:64
...
Each platform needs to be understood separately.
Any update on this one ? :)
Hi @m-gris! To clarify, are you running Kedro on an unattended setting and being blocked by this? How does your setup look like?
The best workaround for now is to remove kedro-telemetry
from the requirements.
Hi @astrojuanlu
It's just that I had the first node/task in an airflow pipeline that was stuck... 😅
I simply did a manual kedro run
from within the container... (Could have created .telemetry
manually as well)
But your suggestion definitely makes for a better workaround.
Thanks a lot,
M.
To set expectations clear, #2867 is fine for a workaround but the default kedro run
experience should be seamless. We are still working on making that a reality.
Getting back to this after discussing it a bit more. We decided to not go with #2867.
In the future we want that telemetry prompt to disappear. This means that, whenever the user runs a Kedro command, we will assume a default answer, and we will also add more mechanisms for users to override that answer on top of the current .telemetry
file (for example https://consoledonottrack.com/).
However, we are still in the process of minimising the amount of data we collect to make this more palatable to users, so we are not quite there yet.
We will give more details in due time.