kedro Improve running kedro as part of an automated workflow (CI/CD)

Description

When any kedro command is executed for the first time or in a clean environment (which is often the case in CI/CD) the telemetry prompt gets run. The user then has to answer Yes or No to running telemetry. If no user is involved, e.g. in an automated CI/CD workflow a hack needs to be put in place to programatically add a .telemetry file.

Summary of action (Edited on 2023-07-31)

[x] kedro-org/kedro-plugins#483
[ ] Document how a user can automatically give/deny consent to telemetry, by describing how to programatically create a .telemetry file (use this issue)
[x] https://github.com/kedro-org/kedro/issues/2867
- Implement logic in the kedro new command that detects if it's being run in CI and if so, whether consent has been given and if that's not the case then it will advice the user on how to do this.

Possible Implementation

The above hack works, but isn't well known. An alternative way of solving this issue is by running the command in a form like: yes | kedro new. We need to document how people can accept/deny the telemetry tracking automatically.

Jun 21 '22 11:06 merelcht

An alternative way of solving this issue is by running the command in a form like: yes | kedro new.

There are probably interesting repercussions to recommending a workflow like this. Anybody who followed such a workflow would contribute to the telemetry stats, adding a lot of automated runs (no different to adding a .telemetry file turning it on programmatically). If you don't want to include stats from CI runs, maybe should prefer yes no | kedro new. :joy:

I think it would be good to take this opportunity to distinguish CI runs in telemetry. Maybe can read an environment variable; most CI systems (GitHub Actions, CircleCI, Travis CI, etc.) follow the convention of setting CI=true, so I would recommend detecting presence of the environment variable, solving the telemetry prompt automatically, and recording commands as belonging to a CI run--and ideally not requiring any additional configuration or hacks to avoid this telemetry prompt.

I think the one question this leaves is, what do we do if users want to opt out of telemetry in the CI. Is it OK to have this be documented, but not prompted by default, or does this introduce some risk?

Jun 21 '22 16:06 deepyaman

Totally agree with everything @deepyaman said here. If we want to recommend in documentation a command that programmatically disables telemetry in CI then let's go for the clearer echo "consent: false" > .telemetry rather than yes no 😀

But using the CI variable is is a much better solution here. I don't think anyone on the team was aware of its existence before (we've just been kind of guessing which telemetry data come from CI jobs) but that is perfect 👍 👍 👍 As Deepyaman says, the only question then is whether CI=true should disable or enable telemetry by default. In the past I think we've been regarding CI runs as contaminating our telemetry data, so probably it's easiest and safest from the risk perspective to just disable telemetry in this case. But that does mean that in future if we do actually want to look at CI telemetry data we wouldn't have anything available on it... @yetudada?

Jun 22 '22 09:06 antonymilne

In the past I think we've been regarding CI runs as contaminating our telemetry data, so probably it's easiest and safest from the risk perspective to just disable telemetry in this case. But that does mean that in future if we do actually want to look at CI telemetry data we wouldn't have anything available on it...

Another thought I had this morning was to enable telemetry by default in CI environments, but to be even more conservative in what's collected by kedro-telemetry in those cases. Not sure that's necessary, though; I feel like it's quite conservative by default (but I'm also not that familiar with it).

I would advocate as capturing the CI variable in telemetry regardless of whether the default is to have it on or off.

Jun 22 '22 12:06 deepyaman

From a data perspective I would still like to see to understand how many Kedro projects are being run with CI, because it helps give us a sense of projects that have moved into production systems. It would be great to figure out how to identify a "user" that is CI but also still observe the telemetry consent that the user set.

Jun 22 '22 15:06 yetudada

We discussed this issue in a Technical Design session and decided on the following:

We do want to capture telemetry for projects run on CI. This gives us insights into how many Kedro projects are run as part of an automated CI workflow, which in turn will help us understand how many Kedro projects are part of production systems.
Use the CI environment variable to indicate that data comes from a CI environment
The user still needs to be allowed to decide whether to run telemetry on their CI Kedro projects or not, so we will not implement a system to automatically track the data if it's part of a CI run.
We need to document how a user can automatically give/deny consent to telemetry, by describing how to programatically create a .telemetry file
We want to implement some logic in the kedro new command that detects if it's being run in CI and if so, whether consent has been given/not and if not then it will advice the user on how to do this.

To do:

[x] Update kedro-telemetry to use the CI environment variable to indicate that data comes from a CI environment
[ ] Document how a user can automatically give/deny consent to telemetry, by describing how to programatically create a .telemetry file
[ ] Implement logic in the kedro new command that detects if it's being run in CI and if so, whether consent has been given and if that's not the case then it will advice the user on how to do this. https://github.com/kedro-org/kedro/issues/2867

Aug 17 '22 14:08 merelcht

Today there was a question about this and I just wanted to confirm that the "Do you opt into usage analytics? [y/N]:` prompt does not block non-interactive terminals. Example:

FROM python:3.10

RUN pip install kedro

WORKDIR /src

RUN echo "project_name: Spaceflights" > kedro.yml
RUN kedro new --config=kedro.yml --starter=spaceflights

WORKDIR /src/spaceflights

RUN pip install -r src/requirements.txt

RUN kedro run

$ docker build . --platform linux/amd64 -t kedro-test:dev
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
            Install the buildx component to build images with BuildKit:
            https://docs.docker.com/go/buildx/

Sending build context to Docker daemon  796.3MB
Step 1/8 : FROM python:3.10
 ---> 700e6de1daef
Step 2/8 : RUN pip install kedro
...
Step 8/8 : RUN kedro run
 ---> Running in e6b62eb0a4cf
As an open-source project, we collect usage analytics. 
We cannot see nor store information contained in a Kedro project. 
You can find out more by reading our privacy notice: 
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice 
Do you opt into usage analytics?  [y/N]: [06/22/23 07:05:15] WARNING  Failed to confirm consent. No data    plugin.py:265
                             was sent to Heap. Exception:                       
[06/22/23 07:05:15] INFO     Kedro project spaceflights           session.py:359
As an open-source project, we collect usage analytics. 
We cannot see nor store information contained in a Kedro project. 
You can find out more by reading our privacy notice: 
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice 
Do you opt into usage analytics?  [y/N]:                     WARNING  Failed to confirm consent. No data    plugin.py:265
                             was sent to Heap. Exception:                       
                    INFO     Loading data from 'companies'   data_catalog.py:345
                             (CSVDataSet)...                                    
                    INFO     Running node:                           node.py:331
...

Jun 22 '23 07:06 astrojuanlu

Today there was a question about this and I just wanted to confirm that the "Do you opt into usage analytics? [y/N]:` prompt does not block non-interactive terminals. Example:

FROM python:3.10

RUN pip install kedro

WORKDIR /src

RUN echo "project_name: Spaceflights" > kedro.yml
RUN kedro new --config=kedro.yml --starter=spaceflights

WORKDIR /src/spaceflights

RUN pip install -r src/requirements.txt

RUN kedro run

$ docker build . --platform linux/amd64 -t kedro-test:dev
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
            Install the buildx component to build images with BuildKit:
            https://docs.docker.com/go/buildx/

Sending build context to Docker daemon  796.3MB
Step 1/8 : FROM python:3.10
 ---> 700e6de1daef
Step 2/8 : RUN pip install kedro
...
Step 8/8 : RUN kedro run
 ---> Running in e6b62eb0a4cf
As an open-source project, we collect usage analytics. 
We cannot see nor store information contained in a Kedro project. 
You can find out more by reading our privacy notice: 
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice 
Do you opt into usage analytics?  [y/N]: [06/22/23 07:05:15] WARNING  Failed to confirm consent. No data    plugin.py:265
                             was sent to Heap. Exception:                       
[06/22/23 07:05:15] INFO     Kedro project spaceflights           session.py:359
As an open-source project, we collect usage analytics. 
We cannot see nor store information contained in a Kedro project. 
You can find out more by reading our privacy notice: 
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice 
Do you opt into usage analytics?  [y/N]:                     WARNING  Failed to confirm consent. No data    plugin.py:265
                             was sent to Heap. Exception:                       
                    INFO     Loading data from 'companies'   data_catalog.py:345
                             (CSVDataSet)...                                    
                    INFO     Running node:                           node.py:331
...

For me it does keep hanging in a Databricks job, with Kedro==0.18.7 installed.

Dec 11 '23 16:12 juanchodpg2

Thanks @juanchodpg2, good to know. You can create a .telemetry file with the appropriate content to opt in or out of telemetry and avoid the blocking https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#how-do-i-consent-to-the-use-of-kedro-telemetry

Dec 11 '23 17:12 astrojuanlu

@juanchodpg2 reports that creating a .telemetry file is finicky and doesn't always work, probably because of having to place it at the right directory (still unclear) https://linen-slack.kedro.org/t/16140465/hi-all-i-am-having-issues-trying-to-run-a-databricks-job-usi#71319d4a-493b-46fc-9f02-6fdf0cad3898

We might need to bump the priority of this. Getting a solid consent process is a blocker for https://github.com/kedro-org/kedro-plugins/issues/375

Dec 12 '23 18:12 astrojuanlu

Let's continue the conversation about kedro-telemetry being blocked on Databricks and other places in https://github.com/kedro-org/kedro-plugins/issues/484

Dec 18 '23 15:12 astrojuanlu

Also most CI platforms include the CI=True env var - we should detect that

Jan 31 '24 14:01 datajoely

My original example on https://github.com/kedro-org/kedro/issues/1640#issuecomment-1602119633 had a mistake, it didn't install kedro-telemetry. Updated version:

FROM python:3.10

RUN pip install kedro kedro-telemetry

WORKDIR /src

RUN kedro new --name=spaceflights --starter=spaceflights-pandas

WORKDIR /src/spaceflights

RUN pip install -r requirements.txt

RUN kedro run

and I confirm the docker build doesn't block:

$ docker build . --platform linux/amd64 -t kedro-test:dev
...
---> [Warning] The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
 ---> Running in ffb042e40704
<jemalloc>: MADV_DONTNEED does not work (memset will be used instead)
<jemalloc>: (This is the expected behaviour if you are running under QEMU)
As an open-source project, we collect usage analytics. 
We cannot see nor store information contained in a Kedro project. 
You can find out more by reading our privacy notice: 
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice 
Do you opt into usage analytics?  [y/N]: [02/19/24 13:33:36] WARNING  Failed to confirm consent. No data    plugin.py:276
                             was sent to Heap. Exception:                       
                    INFO     Kedro project spaceflights           session.py:321
As an open-source project, we collect usage analytics. 
We cannot see nor store information contained in a Kedro project. 
You can find out more by reading our privacy notice: 
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice 
Do you opt into usage analytics?  [y/N]: [02/19/24 13:33:37] WARNING  Failed to confirm consent. No data    plugin.py:276
                             was sent to Heap. Exception:                       
[02/19/24 13:33:39] INFO     Using synchronous mode for  sequential_runner.py:64
...

Each platform needs to be understood separately.

Feb 19 '24 14:02 astrojuanlu

Any update on this one ? :)

May 08 '24 02:05 m-gris

Hi @m-gris! To clarify, are you running Kedro on an unattended setting and being blocked by this? How does your setup look like?

The best workaround for now is to remove kedro-telemetry from the requirements.

May 08 '24 04:05 astrojuanlu

Hi @astrojuanlu It's just that I had the first node/task in an airflow pipeline that was stuck... 😅 I simply did a manual kedro run from within the container... (Could have created .telemetry manually as well) But your suggestion definitely makes for a better workaround. Thanks a lot, M.

May 09 '24 05:05 m-gris

To set expectations clear, #2867 is fine for a workaround but the default kedro run experience should be seamless. We are still working on making that a reality.

May 09 '24 05:05 astrojuanlu

Getting back to this after discussing it a bit more. We decided to not go with #2867.

In the future we want that telemetry prompt to disappear. This means that, whenever the user runs a Kedro command, we will assume a default answer, and we will also add more mechanisms for users to override that answer on top of the current .telemetry file (for example https://consoledonottrack.com/).

However, we are still in the process of minimising the amount of data we collect to make this more palatable to users, so we are not quite there yet.

We will give more details in due time.

May 24 '24 10:05 astrojuanlu

kedro kedro copied to clipboard

Improve running kedro as part of an automated workflow (CI/CD)

Description

Possible Implementation

kedro
kedro copied to clipboard