initialization actions which use apt-get update fail due to purged oldoldstable backports repository

Open olbapjose opened this issue 1 year ago • 10 comments

Very recently, Dataproc clusters started failing at creation time. The Kafka initialization script fails because a Debian repository is no longer available:

https://deb.debian.org/debian buster-backports Release

The error says:

  • Initialization action failed. Failed action 'gs://goog-dataproc-initialization-actions-europe-west1/kafka/kafka.sh', see output in: gs://XXXXX/google-cloud-dataproc-metainfo/a697c722-2bd7-440b-b4da-9494892703ac/XXXXXX-m/dataproc-initialization-script-0_output

The contents of that file are shown below. Any advice or workaround is more than welcome.

image

olbapjose avatar Apr 14 '24 21:04 olbapjose

I am in the same situation with 1.5-debian10.

kishida-yuki avatar Apr 15 '24 10:04 kishida-yuki

Thank you for the report. We are addressing this issue with the highest priority.

cjac avatar Apr 15 '24 18:04 cjac

The fix https://github.com/GoogleCloudDataproc/initialization-actions/pull/1161 for the GPU init actions has been verified. We are already working on the same fix for the other init actions that fail with the same error.

As an urgent fix, customers/developers can clone the init action, add the same lines of code as in the fix to their copy, and use that copy for cluster creation. Please note that we do not encourage customers to use cloned init scripts: a clone does not receive updates, so it has to be re-cloned every time the init actions repository changes. So unless it is urgent, please wait for the other fixes to go in :)
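For anyone patching a cloned copy in the meantime, here is a minimal sketch of the kind of change involved. It is not the code from the pull request above: `disable_backports` is a hypothetical helper name, and the sketch assumes GNU sed and that commenting out the `buster-backports` entries is enough for `apt-get update` to succeed on the affected image.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the code from the pull request): comment out every
# active "<codename>-backports" entry in the given apt source files.
disable_backports() {
  local codename="$1"
  shift
  local file
  for file in "$@"; do
    # Skip paths that do not exist (for example an unmatched glob).
    [ -f "${file}" ] || continue
    # Prefix matching lines with "# "; lines that are already comments are left alone.
    sed -i -E "/^[[:space:]]*#/! s|^(.*[[:space:]]${codename}-backports[[:space:]].*)\$|# \\1|" "${file}"
  done
}

# Intended call inside the cloned init action, before its first apt-get update:
#   disable_backports buster /etc/apt/sources.list /etc/apt/sources.list.d/*.list
```

The entries are commented out rather than deleted so the change is easy to spot and revert when inspecting a node.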

akhanna213 avatar Apr 16 '24 05:04 akhanna213

@akhanna213 I just tried the latest version of install_gpu_driver.sh: I created a Dataproc cluster through the UI, pointed it at that latest version of the script, and I am still running into initialization issues.

ahmedetefy avatar May 14 '24 17:05 ahmedetefy

@akhanna213 @cjac I have run the command and it is still failing. Could you please provide an update? It is very important for us to have this up and running. I am using --image-version 2.0-debian10 which I know is a bit old but I don't think it is related to the issue, correct?

Thanks

olbapjose avatar May 20 '24 16:05 olbapjose

Hi @ahmedetefy @olbapjose, could you confirm whether the error message is still the same? We rolled out the fix a while back.

akhanna213 avatar May 21 '24 09:05 akhanna213

@akhanna213 Please see the image below and the attachment, which is the output file mentioned in the error.

image

google-cloud-dataproc-metainfo_initialization-script-0_output.txt

Long story short, the error says 'Unable to update packages lists.'

olbapjose avatar May 21 '24 09:05 olbapjose

@akhanna213 Yes I can confirm the error is still there

Reproducing the error is quite straightforward:

gcloud dataproc clusters create cluster-e485 \
  --enable-component-gateway \
  --bucket <bucket_name> \
  --region <your-region> \
  --single-node \
  --master-machine-type n1-standard-8 \
  --master-boot-disk-type pd-balanced \
  --master-boot-disk-size 500 \
  --master-accelerator type=nvidia-tesla-t4 \
  --image-version <any 2.1 or above image version> \
  --optional-components JUPYTER \
  --initialization-actions '< gcs_path to latest install GPU driver script >' \
  --project <project_name>

I have also had issues with 2.0-ubuntu18 (although it sometimes succeeds in installing the GPU drivers).

Here are the error logs, in case they help:

E: Repository 'https://packages.cloud.google.com/apt google-cloud-logging-bionic-all InRelease' changed its 'Codename' value from 'google-cloud-logging-stretch-all' to 'google-cloud-logging-bionic-all'

ahmedetefy avatar May 21 '24 15:05 ahmedetefy

Hi @ahmedetefy @olbapjose , this looks like a different issue than what the users were facing earlier. Let me check with the team to understand what is causing this breakage. Appreciate your patience on this, let me get back to you as soon as possible.

akhanna213 avatar May 21 '24 15:05 akhanna213

Hi @akhanna213 do you have updates on this? Initially I was able to do a workaround by adding --allow-releaseinfo-change:

function update_apt_get() {
  retry_apt_command "apt-get update --allow-releaseinfo-change"
}
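
The snippet above relies on `retry_apt_command`, which is not shown in this thread. Here is a minimal sketch of what such a retry wrapper might look like, so the workaround can be tried in a standalone script. The name `retry_command` and the `RETRY_ATTEMPTS` / `RETRY_DELAY_SECONDS` variables are assumptions for illustration; the helper in the actual init action may behave differently.

```shell
#!/usr/bin/env bash
# Assumed shape of a retry helper; the real retry_apt_command may differ.
# RETRY_ATTEMPTS and RETRY_DELAY_SECONDS are hypothetical knobs.
retry_command() {
  local cmd="$1"
  local attempts="${RETRY_ATTEMPTS:-10}"
  local delay="${RETRY_DELAY_SECONDS:-5}"
  local i=1
  while [ "${i}" -le "${attempts}" ]; do
    # eval lets the caller pass a full command line as a single string.
    if eval "${cmd}"; then
      return 0
    fi
    # Sleep only between attempts, not after the last failure.
    if [ "${i}" -lt "${attempts}" ]; then
      sleep "${delay}"
    fi
    i=$((i + 1))
  done
  return 1
}

update_apt_get() {
  # --allow-releaseinfo-change accepts a changed Codename in the Release file.
  retry_command "apt-get update --allow-releaseinfo-change"
}
```

Retrying only helps with transient mirror failures; a repository that has been removed for good will fail on every attempt.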

and it worked, but today it is failing again with a different message:

The following NEW packages will be installed:
  gnupg2
0 upgraded, 1 newly installed, 0 to remove and 3 not upgraded.
Need to get 393 kB of archives.
After this operation, 411 kB of additional disk space will be used.
Err:1 http://deb.debian.org/debian buster/main amd64 gnupg2 all 2.2.12-1+deb10u1
  404  Not Found [IP: 151.101.22.132 80]
E: Failed to fetch http://deb.debian.org/debian/pool/main/g/gnupg2/gnupg2_2.2.12-1+deb10u1_all.deb  404  Not Found [IP: 151.101.22.132 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?

I will try again with --fix-missing, but it looks like the script is not robust: it depends on several external repositories, and each one is a possible point of failure.
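
One plausible reading of the 404 above is that the local package index is stale and still points at a package version the mirror no longer serves, which is what the apt hint about running apt-get update suggests. A minimal sketch under that assumption: refresh the index immediately before installing, and stop if the refresh fails. `install_with_fresh_index` and the `APT_GET` override are hypothetical names introduced for illustration, not part of the init action.

```shell
#!/usr/bin/env bash
# Hypothetical helper: refresh the package index right before installing, so
# that install resolves to versions the mirror still serves.
# APT_GET is an assumed override hook, useful for dry runs and testing.
install_with_fresh_index() {
  local apt_get="${APT_GET:-apt-get}"
  # Abort early if the index cannot be refreshed; installing against a stale
  # index is what produces the 404 on an outdated .deb URL.
  "${apt_get}" update --allow-releaseinfo-change || return 1
  "${apt_get}" install -y "$@"
}

# Example call inside an init action:
#   install_with_fresh_index gnupg2
```

If the refresh itself keeps failing, the problem lies in the repository list rather than in the install step, and this helper will not hide that.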

olbapjose avatar May 27 '24 16:05 olbapjose