terraform-provider-ibm
DNS Resource record creation is too slow to build large clusters
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Terraform CLI and Terraform IBM Provider Version
Terraform version: 1.0.11
Terraform IBM provider version: 1.41
Affected Resource(s)
- ibm_dns_resource_record
Terraform Configuration Files
/*
Creates specified number of IBM Cloud Virtual Server Instance(s).
*/
terraform {
  required_providers {
    ibm = {
      source = "IBM-Cloud/ibm"
    }
  }
}
variable "total_vsis" {}
variable "vsi_name_prefix" {}
variable "vpc_id" {}
variable "zones" {}
variable "dns_service_id" {}
variable "dns_zone_id" {}
variable "dns_domain" {}
variable "vsi_subnet_id" {}
variable "vsi_security_group" {}
variable "vsi_profile" {}
variable "vsi_image_id" {}
variable "vsi_user_public_key" {}
variable "vsi_meta_private_key" {}
variable "vsi_meta_public_key" {}
variable "resource_group_id" {}
data "template_file" "metadata_startup_script" {
  template = <<EOF
#!/usr/bin/env bash
if grep -q "Red Hat" /etc/os-release
then
USER=vpcuser
yum install -y python3 kernel-devel-$(uname -r) kernel-headers-$(uname -r)
elif grep -q "Ubuntu" /etc/os-release
then
USER=ubuntu
fi
sed -i -e "s/^/no-port-forwarding,no-agent-forwarding,no-X11-forwarding,command=\"echo \'Please login as the user \\\\\"$USER\\\\\" rather than the user \\\\\"root\\\\\".\';echo;sleep 10; exit 142\" /" ~/.ssh/authorized_keys
echo "${var.vsi_meta_private_key}" > ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa
echo "${var.vsi_meta_public_key}" >> ~/.ssh/authorized_keys
echo "StrictHostKeyChecking no" >> ~/.ssh/config
echo "DOMAIN=\"${var.dns_domain}\"" >> "/etc/sysconfig/network-scripts/ifcfg-eth0"
systemctl restart NetworkManager
systemctl stop firewalld
firewall-offline-cmd --zone=public --add-port=1191/tcp
firewall-offline-cmd --zone=public --add-port=60000-61000/tcp
firewall-offline-cmd --zone=public --add-port=47080/tcp
firewall-offline-cmd --zone=public --add-port=47080/udp
firewall-offline-cmd --zone=public --add-port=47443/tcp
firewall-offline-cmd --zone=public --add-port=47443/udp
firewall-offline-cmd --zone=public --add-port=4444/tcp
firewall-offline-cmd --zone=public --add-port=4444/udp
firewall-offline-cmd --zone=public --add-port=4739/udp
firewall-offline-cmd --zone=public --add-port=4739/tcp
firewall-offline-cmd --zone=public --add-port=9084/tcp
firewall-offline-cmd --zone=public --add-port=9085/tcp
firewall-offline-cmd --zone=public --add-service=http
firewall-offline-cmd --zone=public --add-service=https
systemctl start firewalld
EOF
}
resource "ibm_is_instance" "itself" {
  for_each = {
    # This assigns a subnet-id to each of the instance
    # iteration.
    for idx, count_number in range(1, var.total_vsis + 1) : idx => {
      sequence_string = tostring(count_number)
      subnet_id       = element(var.vsi_subnet_id, idx)
      zone            = element(var.zones, idx)
    }
  }
  name    = format("%s-%s", var.vsi_name_prefix, each.value.sequence_string)
  image   = var.vsi_image_id
  profile = var.vsi_profile
  primary_network_interface {
    subnet          = each.value.subnet_id
    security_groups = var.vsi_security_group
  }
  vpc            = var.vpc_id
  zone           = each.value.zone
  resource_group = var.resource_group_id
  keys           = var.vsi_user_public_key
  user_data      = data.template_file.metadata_startup_script.rendered
  boot_volume {
    name = format("%s-boot-%s", var.vsi_name_prefix, each.value.sequence_string)
  }
}
resource "ibm_dns_resource_record" "a_itself" {
  for_each = {
    for idx, count_number in range(1, var.total_vsis + 1) : idx => {
      name       = element(tolist([for name_details in ibm_is_instance.itself : name_details.name]), idx)
      network_ip = element(tolist([for ip_details in ibm_is_instance.itself : ip_details.primary_network_interface[0]["primary_ipv4_address"]]), idx)
    }
  }
  instance_id = var.dns_service_id
  zone_id     = var.dns_zone_id
  type        = "A"
  name        = each.value.name
  rdata       = each.value.network_ip
  ttl         = 300
}
resource "ibm_dns_resource_record" "ptr_itself" {
  for_each = {
    for idx, count_number in range(1, var.total_vsis + 1) : idx => {
      name       = element(tolist([for name_details in ibm_is_instance.itself : name_details.name]), idx)
      network_ip = element(tolist([for ip_details in ibm_is_instance.itself : ip_details.primary_network_interface[0]["primary_ipv4_address"]]), idx)
    }
  }
  instance_id = var.dns_service_id
  zone_id     = var.dns_zone_id
  type        = "PTR"
  name        = each.value.network_ip
  rdata       = format("%s.%s", each.value.name, var.dns_domain)
  ttl         = 300
  depends_on  = [ibm_dns_resource_record.a_itself]
}
output "instance_ids" {
  value      = try(toset([for instance_details in ibm_is_instance.itself : instance_details.id]), [])
  depends_on = [ibm_dns_resource_record.a_itself, ibm_dns_resource_record.ptr_itself]
}
output "instance_private_ips" {
  value      = try(toset([for instance_details in ibm_is_instance.itself : instance_details.primary_network_interface[0]["primary_ipv4_address"]]), [])
  depends_on = [ibm_dns_resource_record.a_itself, ibm_dns_resource_record.ptr_itself]
}
Debug Output
https://gist.github.com/gmewhinney/6071c5f490f9e31d02c1a385c4b2c87a#file-terraform-log-txt
Expected Behavior
Outside of Terraform, 100 DNS records can be created in under 1 second. We need to get closer to that within Terraform.
Actual Behavior
Creating a single DNS 'A' record takes over 1 second, so for the 64 compute nodes on this small cluster it took 74 seconds. All of the record creations start within 1 second of each other, so every record is in progress at the same time, but the completions trickle out at a rate of about one per second. From start to finish, the first record completes after about one second and the last one after 74 seconds. PTR records are a little slower: it takes 95 seconds to create all 64 PTR records.
This is borderline for a small cluster, but would take over 30 minutes for a 1000-node cluster.
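As a rough check on that estimate, the sketch below extrapolates from the 64-node measurements above. It assumes creation time grows linearly with the number of records (consistent with completions arriving about one per second) and that the PTR records only start after all A records finish, as the depends_on in the configuration implies. The function name and constants are illustrative only.

```python
# Figures measured on the 64-node cluster described in this report.
NODES_MEASURED = 64
A_SECONDS = 74    # time to create all 64 A records
PTR_SECONDS = 95  # time to create all 64 PTR records


def estimate_dns_seconds(nodes: int) -> float:
    """Linear extrapolation of total DNS record creation time.

    Assumes records complete one at a time at the measured per-record
    rate, and that the PTR batch starts only after the A batch is done.
    """
    per_a_record = A_SECONDS / NODES_MEASURED
    per_ptr_record = PTR_SECONDS / NODES_MEASURED
    return nodes * (per_a_record + per_ptr_record)


if __name__ == "__main__":
    minutes = estimate_dns_seconds(1000) / 60
    print(f"Estimated DNS time for 1000 nodes: {minutes:.0f} minutes")
    # prints: Estimated DNS time for 1000 nodes: 44 minutes
```

Under these assumptions, the A records alone would take roughly 19 minutes and the PTR records roughly 25 minutes for 1000 nodes.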
Steps to Reproduce
The code will be moving to a public repository soon. Right now it resides on an internal repository at: https://github.ibm.com/IBMSpectrumScale/ibm-spectrum-scale-ibm-cloud-schematics
To recreate the problem, build a Scale cluster starting from Schematics, specifying the above repo.
Important Factoids
Cluster creation is a hybrid: Schematics creates part of the cluster and a service machine, then transfers control to the service machine, where the Terraform that creates the DNS records is executed. The log linked above is from the service machine.
References
This issue is a result of the need to serialize record creation, as detailed in: https://github.com/IBM-Cloud/terraform-provider-ibm/issues/1430
I have discussed this issue with @MalarvizhiK, who worked on the above issue. I think she and Vasu from the cloud DNS team have some ideas for improving this.
PR: https://github.com/IBM-Cloud/terraform-provider-ibm/pull/3886/files
The 1.43.0 prototype greatly improved DNS record creation, but there is still a delay of 40+ seconds in creating some of the PTR records. The log from that run is linked below. A good example is the record for instance 21: it starts at 2022-07-08T19:30:46.744Z and ends at 2022-07-08T19:30:46.744Z. This record took over 40 seconds, while most took 3-4 seconds.
https://gist.github.com/gmewhinney/8c2e26c051ebc206cbb3a30ae9ce2114
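To spot outliers like the instance 21 record without reading the whole log, something along these lines could help. It is only a sketch: it assumes the log contains Terraform's usual per-resource "Creation complete after ..." lines, the function name and the 10-second threshold are made up for illustration, and the sample lines are hypothetical rather than copied from the linked log.

```python
import re

# Matches Terraform's per-resource completion line, for example:
#   ibm_dns_resource_record.ptr_itself["21"]: Creation complete after 41s [id=...]
# The elapsed time is a Go-style duration such as "3s", "41s" or "1m2s".
COMPLETE_RE = re.compile(
    r'(?P<addr>\S+): Creation complete after '
    r'(?:(?P<h>\d+)h)?(?:(?P<m>\d+)m)?(?P<s>\d+)s'
)


def slow_creations(lines, threshold_seconds=10):
    """Return (resource address, seconds) for creations above the threshold."""
    slow = []
    for line in lines:
        match = COMPLETE_RE.search(line)
        if not match:
            continue
        seconds = (
            int(match["h"] or 0) * 3600
            + int(match["m"] or 0) * 60
            + int(match["s"])
        )
        if seconds > threshold_seconds:
            slow.append((match["addr"], seconds))
    return slow


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the linked log.
    sample = [
        'ibm_dns_resource_record.ptr_itself["20"]: Creation complete after 3s [id=example-a]',
        'ibm_dns_resource_record.ptr_itself["21"]: Creation complete after 41s [id=example-b]',
        'ibm_dns_resource_record.ptr_itself["22"]: Creation complete after 1m2s [id=example-c]',
    ]
    for address, seconds in slow_creations(sample):
        print(f"{address}: {seconds}s")
```

Running it against a saved copy of the apply output would list only the records that exceeded the threshold, which makes it easier to see whether the slow PTR records cluster around particular instances.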