Can't reliably deploy in 2024
Hey folks,
I can't deploy this project from tag v2.6.0 because it has a python37 constraint, and App Engine rejects that runtime. Deploying from master with python311 gets much further, but it has completely undocumented requirements, like the Terraform/K8s code that doesn't work out of the box. Do you have any more recent documentation to add to the project, so that we non-Googlers can deploy to GCP?
@dantecl can you drill down on what exactly does not work out of the box, as far as K8S and Terraform go?
When I generate the config, none of the Terraform code gets copied over. If I copy it manually, I have to fill out the variables file, and then the deploy process doesn't actually create any of the resources. I've resorted to deploying with --target appengine to bypass the Terraform stage, and most of the cron jobs in App Engine fail with a 404 from the cron-service service.
Re crons: they are supposed to run on Kubernetes, so App Engine is not expected to succeed there.
As far as the overall deployment goes, we have a pending project that will replace butler.py with Terraform for bootstrapping the infrastructure, so I suggest you follow #3788. Since the deployment strategy will change, docs will come out once that lands.
Also, on the Terraform deployment: with -target=module.clusterfuzz it does not create anything, and if I remove the target, I get "Plan: 7 to add, 0 to change, 0 to destroy", with Terraform trying to clobber my existing VPC, subnet, and NAT gateway. I'll follow #3788 for updates; do you have any idea on timeframes?
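For anyone hitting the same clobbering problem before #3788 lands, one possible workaround is importing the pre-existing network resources into Terraform state, so the plan stops proposing to recreate them. This is only a sketch: the module path, resource names, and project below are placeholders, not the real module layout; check the actual addresses with `terraform state list` and the module source.

```shell
# Import existing network resources into state so `terraform plan` stops
# trying to recreate them. All addresses and names here are placeholders.
terraform import 'module.clusterfuzz.google_compute_network.vpc' \
    projects/YOUR_PROJECT/global/networks/your-existing-vpc
terraform import 'module.clusterfuzz.google_compute_subnetwork.subnet' \
    projects/YOUR_PROJECT/regions/us-central1/subnetworks/your-existing-subnet
```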
This issue has not had any activity for 60 days and will be automatically closed in two weeks
I second @dantecl's request. I've gotten further in the deployment by crafting my own config based on infra/k8s and infra/terraform, but more issues keep coming up:
- some indexes are not added to src/appengine/index.yaml (e.g. WindowRateLimitTask)
- some services are not activated at the time of create_config (e.g. Secret Manager)
- secrets (gcs-signer-key) are not added at the create_config stage
I got stuck on the last problem, as I'm not sure what the key should be.
@vitorguidi @jonathanmetzman, are there any timelines for switching to the new deployment scripts? Anything we could help with to make the project deployable again for new setups?
Hey there.
I am currently working on bootstrapping a development environment for our own use. This will probably help with these deployment pains.
As far as helping us out goes, please document all the problems you are facing in this issue. If you end up solving things on your own, please let us know how you did it.
Re timelines, it is hard to estimate a completion date because this is a section of the system I am unfamiliar with, but there is active effort on this problem right now.
Work on create config will be here > https://github.com/google/clusterfuzz/pull/4724
@varseand
> I second @dantecl's request. I've gotten further in the deployment by crafting my own config based on infra/k8s and infra/terraform, but more issues keep coming up:
> - some indexes are not added to src/appengine/index.yaml (e.g. WindowRateLimitTask)
> - some services are not activated at the time of create_config (e.g. Secret Manager)
> - secrets (gcs-signer-key) are not added at the create_config stage
> I got stuck on the last problem, as I'm not sure what the key should be.
> @vitorguidi @jonathanmetzman, are there any timelines for switching to the new deployment scripts? Anything we could help with to make the project deployable again for new setups?
Re the secret: it is supposed to contain the JSON service account token from a service account with GCS permissions, so that presigning can happen during the preprocess stage of tasks.
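For reference, here is a hedged sketch of populating that secret with gcloud. Only the secret name gcs-signer-key comes from this thread; the service-account name is a placeholder, and it must already have the relevant GCS permissions.

```shell
# Create a key for a service account with GCS permissions, then store the
# JSON in the gcs-signer-key secret. The account name is a placeholder.
gcloud iam service-accounts keys create /tmp/signer-key.json \
    --iam-account="gcs-signer@YOUR_PROJECT.iam.gserviceaccount.com"
gcloud secrets create gcs-signer-key --data-file=/tmp/signer-key.json
rm /tmp/signer-key.json
```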
Re the index: the only one I have seen fail so far is the window rate limit task you mentioned. I added it in the PR above and the trouble went away.
This feedback on what goes wrong when you try to deploy is important to us, please send more our way as you go =)
I managed to bootstrap ClusterFuzz with the changes in this PR > https://github.com/google/clusterfuzz/pull/4793
The desirable state is still to have a Terraform module, but that will only happen later this year. Meanwhile, this achieves the same goal. @varseand, all the concerns you raised in your previous comment are addressed.
@vitorguidi Thanks for getting this in. Does it make sense to tag a new release of Clusterfuzz now? https://github.com/google/clusterfuzz/issues/4709
We haven't done releases in this way for a while, so the policy at the moment is to deploy the state of master. @jonathanmetzman has more context on this, and on whether we will do releases again in the near future.
@vitorguidi I've tried to deploy ClusterFuzz using your latest changes, but I've encountered several issues:
- The `create_config` command expects `config_dir` to be a Git repository. Running it without initializing the directory with `git init` and committing at least once causes it to fail. This can be easily resolved manually.
- During the Terraform apply step, the process fails because the GCS bucket for storing the Terraform state isn't created. This bucket must be created manually before applying the configuration, otherwise Terraform cannot proceed.
- The `_get_redis_ip` function doesn't work correctly because the region returned by App Engine (`us-central`) does not match the region of the Redis instance (`us-central1`). As a result, the `REDIS_HOST` env var is not set correctly, and the App Engine cron service cannot connect to Redis.
- The Kubernetes cronjobs for running bots and other background tasks are not automatically deployed.
I'm currently stuck on the last point. I attempted to deploy the cronjobs manually using kubectl apply, but haven’t been successful so far.
Let me know if I can help provide logs or further details.
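Until the region mismatch is fixed upstream, one workaround is to normalize the App Engine region name before comparing it with the Redis instance's region. This is only a sketch: `normalize_region` and the legacy-region set are my own names, not ClusterFuzz code; the underlying fact is that App Engine reports legacy region names (`us-central`, `europe-west`) that Compute-style APIs spell with a trailing `1`.

```python
# Sketch of a workaround for the _get_redis_ip mismatch: App Engine
# reports legacy region names ("us-central", "europe-west"), while the
# Redis instance lives in the Compute-style region with a trailing "1"
# ("us-central1", "europe-west1"). Illustrative helper, not real
# ClusterFuzz code.

# Legacy App Engine regions whose Compute equivalent appends a "1".
_LEGACY_REGIONS = {"us-central", "europe-west"}


def normalize_region(appengine_region: str) -> str:
    """Map an App Engine region name to its Compute/Redis spelling."""
    if appengine_region in _LEGACY_REGIONS:
        return appengine_region + "1"
    return appengine_region
```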
What was the problem when attempting the k8s deploy? @eduarddfinity
@vitorguidi I successfully deployed all the cronjobs. However, they do not produce the expected behavior when executed. A clear example of this issue is that the bots are not being spawned or connected to ClusterFuzz.
I’ve attempted to debug the problem and noticed several error messages in the GCP logs. For example:
```json
{"message": "Retrying on clusterfuzz._internal.cron.helpers.bot_manager.Resource.execute failed with <HttpError 400 when requesting https://compute.googleapis.com/compute/v1/projects/[REMOVED]/zones/us-central1-f/instanceGroupManagers?alt=json returned \"Invalid value for field 'resource.autoHealingPolicies[0].healthCheck': 'https://www.googleapis.com/compute/v1/projects/[REMOVED]/global/healthChecks/test-check'. https://www.googleapis.com/compute/v1/projects/[REMOVED]/global/healthChecks/test-check does not exist.\". Details: \"[{'message': \"Invalid value for field 'resource.autoHealingPolicies[0].healthCheck': 'https://www.googleapis.com/compute/v1/projects/[REMOVED]/global/healthChecks/test-check'. https://www.googleapis.com/compute/v1/projects/[REMOVED]/global/healthChecks/test-check does not exist.\", 'domain': 'global', 'reason': 'invalid'}]\">. Retrying again.","severity": "INFO", "logging.googleapis.com/labels": {"python_logger": "root"}, "logging.googleapis.com/trace": "", "logging.googleapis.com/spanId": "", "logging.googleapis.com/trace_sampled": false, "logging.googleapis.com/sourceLocation": {"line": 649, "file": "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/metrics/logs.py", "function": "emit"}, "httpRequest": {} }
```
@eduarddfinity my bad, I forgot to create the health check in butler create config. As a workaround, you can manually create one called running-check:
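A hedged gcloud sketch of that manual workaround; the check type, port, and intervals are assumptions on my part, so match them to whatever the managed instance group actually expects.

```shell
# Create the health check the managed instance group references.
# Type/port/intervals here are guesses, not ClusterFuzz defaults.
gcloud compute health-checks create http running-check \
    --port=80 --check-interval=30s --timeout=10s
```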
Another issue I encountered after adding the test-check health check is that the bot fails with this error:
```
Aug 16 00:12:33 clusterfuzz-linux-pre-smt0 bash[12608]: mount: /mnt/scratch0/clusterfuzz/bot/inputs/fuzzer-testcases-disk: mount point does not exist.
```
The latest ClusterFuzz code does not contain a directory called /bot/inputs/fuzzer-testcases-disk. This appears to be a symptom of the default config using a Docker image from 2022.
The default deploy also does not set up firewall rules to allow health checks to pass.
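For reference, health-check probes come from Google's documented source ranges, so a manual rule along these lines is a plausible fix; the rule name, network, and port scope below are placeholders.

```shell
# Allow Google health-check probes to reach the bot instances.
# 130.211.0.0/22 and 35.191.0.0/16 are Google's documented health-check
# source ranges; the rule name and network are placeholders.
gcloud compute firewall-rules create allow-gcp-health-checks \
    --network=default \
    --source-ranges=130.211.0.0/22,35.191.0.0/16 \
    --allow=tcp
```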
There's no Kubernetes job for processing code coverage info.
There's also no Kubernetes job for retrying stuck tasks.
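Until those jobs are generated by the deploy, a missing cron can be stood up by hand with `kubectl create cronjob`. This is only a sketch: the job name, image, schedule, and command are placeholders, not the real ClusterFuzz entrypoints, so substitute the actual image and butler invocation for the job you need.

```shell
# Stand up a missing cron job manually. Name, image, schedule, and the
# command after "--" are all placeholders to be replaced.
kubectl create cronjob coverage \
    --image=gcr.io/YOUR_PROJECT/clusterfuzz:latest \
    --schedule="0 */6 * * *" \
    -- python butler.py run YOUR_CRON_NAME
```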
Automatically closing stale issue