Can't reliably deploy in 2024
Hey folks,
I can't deploy this project from tag v2.6.0 because it has a python37 constraint, and App Engine rejects that runtime. Deploying from master with python311 gets much further, but it has completely undocumented requirements, like the Terraform/K8s code that doesn't work out of the box. Do you have any more recent documentation to add to the project, so that we non-Googlers can deploy to GCP?
@dantecl can you drill down on what exactly does not work out of the box, as far as K8S and Terraform go?
When I generate the config, none of the Terraform code gets copied over. If I copy it manually, I have to fill out the variables file, and then the deploy process doesn't actually create any of the resources. I've resorted to deploying with --target appengine to bypass the Terraform stage, and most of the cron jobs in App Engine fail with a 404 from the cron-service service.
Re crons: they are supposed to run on Kubernetes, so App Engine is not expected to succeed there.
As far as the overall deployment goes, we have a pending project that will replace butler.py with Terraform for bootstrapping the infrastructure, so I suggest you follow #3788. Since the deployment strategy will change, docs will come out once that lands.
Also, on the Terraform deployment: with -target=module.clusterfuzz it does not create anything, and if I remove the target, I get "Plan: 7 to add, 0 to change, 0 to destroy", with Terraform trying to clobber my existing VPC, subnet, and NAT gateway. I'll follow #3788 for updates; do you have any idea on timeframes?
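For anyone hitting the same clobbering problem before #3788 lands, one possible workaround is importing the pre-existing network resources into Terraform state, so the plan stops proposing to recreate them. This is only a sketch: the module path, resource names, and project below are placeholders, not the real module layout; check the actual addresses with `terraform state list` and the module source.

```shell
# Import existing network resources into state so `terraform plan` stops
# trying to recreate them. All addresses and names here are placeholders.
terraform import 'module.clusterfuzz.google_compute_network.vpc' \
    projects/YOUR_PROJECT/global/networks/your-existing-vpc
terraform import 'module.clusterfuzz.google_compute_subnetwork.subnet' \
    projects/YOUR_PROJECT/regions/us-central1/subnetworks/your-existing-subnet
```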
This issue has not had any activity for 60 days and will be automatically closed in two weeks
I second @dantecl's request. I've gotten further in the deployment by crafting my own config based on infra/k8s and infra/terraform, but more issues keep coming up:
- some indexes are not added to src/appengine/index.yaml (e.g. WindowRateLimitTask)
- some services are not activated at the time of create_config (e.g. Secret Manager)
- secrets (gcs-signer-key) are not added at the create_config stage
I got stuck on the last problem, as I'm not sure what the key should be.
@vitorguidi @jonathanmetzman, are there any timelines for switching to the new deployment scripts? Anything we could help with to make the project deployable again for new setups?
Hey there.
I am currently working on bootstrapping a development environment for our own use. This will probably help with these deployment pains.
As far as helping us out goes, please document all the problems you are facing in this issue. If you end up solving things on your own, please let us know how you did it.
Re timelines, it is hard to estimate a completion date because this is a section of the system I am unfamiliar with, but there is active effort on this problem right now.
Work on create config will be here > https://github.com/google/clusterfuzz/pull/4724
@varseand
> I second @dantecl's request. I've gotten further in the deployment by crafting my own config based on infra/k8s and infra/terraform, but more issues keep coming up:
> - some indexes are not added to src/appengine/index.yaml (e.g. WindowRateLimitTask)
> - some services are not activated at the time of create_config (e.g. Secret Manager)
> - secrets (gcs-signer-key) are not added at the create_config stage
> I got stuck on the last problem, as I'm not sure what the key should be.
> @vitorguidi @jonathanmetzman, are there any timelines for switching to the new deployment scripts? Anything we could help with to make the project deployable again for new setups?
Re the secret: it is supposed to contain the JSON service account token from a service account with GCS permissions, so that presigning can happen during the preprocess stage of tasks.
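For reference, here is a hedged sketch of populating that secret with gcloud. Only the secret name gcs-signer-key comes from this thread; the service-account name is a placeholder, and it must already have the relevant GCS permissions.

```shell
# Create a key for a service account with GCS permissions, then store the
# JSON in the gcs-signer-key secret. The account name is a placeholder.
gcloud iam service-accounts keys create /tmp/signer-key.json \
    --iam-account="gcs-signer@YOUR_PROJECT.iam.gserviceaccount.com"
gcloud secrets create gcs-signer-key --data-file=/tmp/signer-key.json
rm /tmp/signer-key.json
```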
Re the index: the only one I have seen fail so far is the window rate limit task you mentioned. I added it in the PR above and the trouble went away.
This feedback on what goes wrong when you try to deploy is important to us, please send more our way as you go =)
I managed to bootstrap ClusterFuzz with the changes in this PR > https://github.com/google/clusterfuzz/pull/4793
The desirable state is still to have a Terraform module, but that will only happen later this year. Meanwhile, this achieves the same goal. @varseand, all the concerns you raised in your previous comment are addressed.
@vitorguidi Thanks for getting this in. Does it make sense to tag a new release of Clusterfuzz now? https://github.com/google/clusterfuzz/issues/4709
We haven't done releases in this way for a while, so the policy at the moment is to deploy the state of master. @jonathanmetzman has more context on this, and on whether we will do releases again in the near future.
@vitorguidi I've tried to deploy ClusterFuzz using your latest changes, but I've encountered several issues:
- The `create_config` command expects `config_dir` to be a Git repository. Running it without initializing the directory with `git init` and committing at least once causes it to fail. This can be easily resolved manually.
- During the Terraform apply step, the process fails because the GCS bucket for storing the Terraform state isn't created. This bucket must be created manually before applying the configuration, otherwise Terraform cannot proceed.
- The `_get_redis_ip` function doesn't work correctly because the region returned by App Engine (`us-central`) does not match the region of the Redis instance (`us-central1`). As a result, the `REDIS_HOST` env var is not set correctly, and the App Engine cron service cannot connect to Redis.
- The Kubernetes cronjobs for running bots and other background tasks are not automatically deployed.
I'm currently stuck on the last point. I attempted to deploy the cronjobs manually using kubectl apply, but haven’t been successful so far.
Let me know if I can help provide logs or further details.
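Until the region mismatch is fixed upstream, one workaround is to normalize the App Engine region name before comparing it with the Redis instance's region. This is only a sketch: `normalize_region` and the legacy-region set are my own names, not ClusterFuzz code; the underlying fact is that App Engine reports legacy region names (`us-central`, `europe-west`) that Compute-style APIs spell with a trailing `1`.

```python
# Sketch of a workaround for the _get_redis_ip mismatch: App Engine
# reports legacy region names ("us-central", "europe-west"), while the
# Redis instance lives in the Compute-style region with a trailing "1"
# ("us-central1", "europe-west1"). Illustrative helper, not real
# ClusterFuzz code.

# Legacy App Engine regions whose Compute equivalent appends a "1".
_LEGACY_REGIONS = {"us-central", "europe-west"}


def normalize_region(appengine_region: str) -> str:
    """Map an App Engine region name to its Compute/Redis spelling."""
    if appengine_region in _LEGACY_REGIONS:
        return appengine_region + "1"
    return appengine_region
```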
What was the problem when attempting the k8s deploy? @eduarddfinity
@vitorguidi I successfully deployed all the cronjobs. However, they do not produce the expected behavior when executed. A clear example of this issue is that the bots are not being spawned or connected to ClusterFuzz.
I’ve attempted to debug the problem and noticed several error messages in the GCP logs. For example:
```json
{"message": "Retrying on clusterfuzz._internal.cron.helpers.bot_manager.Resource.execute failed with <HttpError 400 when requesting https://compute.googleapis.com/compute/v1/projects/[REMOVED]/zones/us-central1-f/instanceGroupManagers?alt=json returned \"Invalid value for field 'resource.autoHealingPolicies[0].healthCheck': 'https://www.googleapis.com/compute/v1/projects/[REMOVED]/global/healthChecks/test-check'. https://www.googleapis.com/compute/v1/projects/[REMOVED]/global/healthChecks/test-check does not exist.\". Details: \"[{'message': \"Invalid value for field 'resource.autoHealingPolicies[0].healthCheck': 'https://www.googleapis.com/compute/v1/projects/[REMOVED]/global/healthChecks/test-check'. https://www.googleapis.com/compute/v1/projects/[REMOVED]/global/healthChecks/test-check does not exist.\", 'domain': 'global', 'reason': 'invalid'}]\">. Retrying again.","severity": "INFO", "logging.googleapis.com/labels": {"python_logger": "root"}, "logging.googleapis.com/trace": "", "logging.googleapis.com/spanId": "", "logging.googleapis.com/trace_sampled": false, "logging.googleapis.com/sourceLocation": {"line": 649, "file": "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/metrics/logs.py", "function": "emit"}, "httpRequest": {} }
```
@eduarddfinity my bad, I forgot to create the health check in butler create config. As a workaround, you can manually create one called running-check:
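A hedged gcloud sketch of that manual workaround; the check type, port, and intervals are assumptions on my part, so match them to whatever the managed instance group actually expects.

```shell
# Create the health check the managed instance group references.
# Type/port/intervals here are guesses, not ClusterFuzz defaults.
gcloud compute health-checks create http running-check \
    --port=80 --check-interval=30s --timeout=10s
```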
Another issue I encountered after adding the test-check health check is that the bot fails with this error:
```
Aug 16 00:12:33 clusterfuzz-linux-pre-smt0 bash[12608]: mount: /mnt/scratch0/clusterfuzz/bot/inputs/fuzzer-testcases-disk: mount point does not exist.
```
The latest ClusterFuzz code does not contain a directory called /bot/inputs/fuzzer-testcases-disk. This appears to be a symptom of the default config using a Docker image from 2022.
The default deploy also does not set up firewall rules to allow health checks to pass.
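For reference, health-check probes come from Google's documented source ranges, so a manual rule along these lines is a plausible fix; the rule name, network, and port scope below are placeholders.

```shell
# Allow Google health-check probes to reach the bot instances.
# 130.211.0.0/22 and 35.191.0.0/16 are Google's documented health-check
# source ranges; the rule name and network are placeholders.
gcloud compute firewall-rules create allow-gcp-health-checks \
    --network=default \
    --source-ranges=130.211.0.0/22,35.191.0.0/16 \
    --allow=tcp
```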
There's no Kubernetes job for processing code coverage info.
There's also no Kubernetes job for retrying stuck tasks.
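Until those jobs are generated by the deploy, a missing cron can be stood up by hand with `kubectl create cronjob`. This is only a sketch: the job name, image, schedule, and command are placeholders, not the real ClusterFuzz entrypoints, so substitute the actual image and butler invocation for the job you need.

```shell
# Stand up a missing cron job manually. Name, image, schedule, and the
# command after "--" are all placeholders to be replaced.
kubectl create cronjob coverage \
    --image=gcr.io/YOUR_PROJECT/clusterfuzz:latest \
    --schedule="0 */6 * * *" \
    -- python butler.py run YOUR_CRON_NAME
```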
Automatically closing stale issue