terraform-aws-couchbase

EBS volumes fail to format on nitro instances

Open tinomen opened this issue 4 years ago • 7 comments

While trying to set up a production cluster using the data node instance type m5a.4xlarge, I'm getting the following error from the couchbase-commons/mount-volume.sh script.

cat /opt/couchbase/var/lib/couchbase/logs/mock-user-data.log

Mounting EBS Volume for the data directory
2020-12-28 17:56:44 [INFO] [part-001] Creating ext4 file system on /dev/xvdh...
mke2fs 1.44.1 (24-Mar-2018)
The file /dev/xvdh does not exist and no size was specified.

There is no file in /dev for xvdh, but the AWS console does show that device name as attached. When I run lsblk on one of the instances, I only see the following:

NAME        TYPE  SIZE FSTYPE   MOUNTPOINT                  LABEL
nvme1n1     disk  200G
nvme0n1     disk   50G
└─nvme0n1p1 part   50G ext4     /                           cloudimg-rootfs

I'm able to manually format the NVMe device using the mount_volume function, but the ASG fails to create instances when I change data_volume_device_name to /dev/nvme1n1:

Launching a new EC2 instance. Status Reason: Invalid device name /dev/nvme1n1. Launching EC2 instance failed.

[update] It appears that AWS Nitro-based instances no longer symlink the xvdX names to the EBS NVMe devices, and the script doesn't account for this. 🤨

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html#identify-nvme-ebs-device

tinomen avatar Dec 28 '20 21:12 tinomen

Indeed. Using nvme is quite a bit more complicated:

  1. You have to detect if the system is using nvme using lsblk.
  2. You then run nvme list to get the list of nvme volumes (note: this requires the nvme utility to be installed).
  3. You then have to find the device you want using nvme id-ctrl.
  4. And then when mounting the volume, you have to mount it using its UUID, as that's the only thing consistent across reboots.
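Steps 1 to 3 can be sketched roughly as follows. This is not the private script mentioned in this thread, and the function names (normalize_device_name, ebs_name_for_nvme, resolve_device) are hypothetical. It assumes nvme-cli is installed and that EBS exposes the requested device name in the first 32 bytes of the controller's vendor-specific area (bytes 3073-3104 of the raw id-ctrl output), which is the convention commonly relied on for Nitro instances.

```shell
#!/usr/bin/env bash
# Sketch only: map the device name requested in Terraform (e.g. /dev/xvdh)
# to the device the kernel actually exposes (e.g. /dev/nvme1n1).

# Strip an optional /dev/ prefix so "xvdh" and "/dev/xvdh" compare equal.
normalize_device_name() {
  local name="$1"
  echo "${name#/dev/}"
}

# Read the requested device name from the NVMe controller data.
# Assumption: the name sits at bytes 3073-3104, padded with spaces.
ebs_name_for_nvme() {
  local nvme_device="$1"
  nvme id-ctrl --raw-binary "$nvme_device" | cut -b 3073-3104 | tr -d '[:space:]\000'
}

# Print the real device path for a requested name, or return 1 if none matches.
resolve_device() {
  local requested dev candidate
  requested="$(normalize_device_name "$1")"

  # Non-Nitro instances: the requested name exists as-is.
  if [ -e "/dev/$requested" ]; then
    echo "/dev/$requested"
    return 0
  fi

  # Nitro instances: ask each NVMe namespace which name it was attached as.
  for dev in /dev/nvme*n1; do
    [ -e "$dev" ] || continue
    candidate="$(normalize_device_name "$(ebs_name_for_nvme "$dev")")"
    if [ "$candidate" = "$requested" ]; then
      echo "$dev"
      return 0
    fi
  done
  return 1
}
```

With this shape, data_volume_device_name can stay at /dev/xvdh for the ASG, and the mount script would call something like resolve_device /dev/xvdh before formatting.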

I don't think we're going to be able to get a fix in soon. Does anyone have some cycles to submit a PR for this in the meantime?

brikis98 avatar Jan 06 '21 15:01 brikis98

Keep in mind that your step 3 assumes you can reuse the Terraform device name. This, however, throws an error when the ASG tries to create the EBS volume. So your solution would have to create a second volume name and pass it forward through the functions.

tinomen avatar Jan 19 '21 06:01 tinomen

Keep in mind that your step 3 assumes you can reuse the Terraform device name. This, however, throws an error when the ASG tries to create the EBS volume.

What error?

brikis98 avatar Jan 19 '21 09:01 brikis98

The ASG is unable to start the instances because EBS volume creation fails. I don't recall the exact wording. Give it a try.

tinomen avatar Jan 19 '21 15:01 tinomen

We use the approach described in https://github.com/gruntwork-io/terraform-aws-couchbase/issues/73#issuecomment-755360785 (which we have in a private script) with a number of ASGs, and it works OK, so I'm not sure how to repro...

brikis98 avatar Jan 21 '21 09:01 brikis98

What device name are you using? I tried nvme1 and nvme1n1, and the instances failed to start due to a failed EBS volume.

Well, hopefully the issue and this closed PR will help someone overcome the hurdles in making these scripts production ready.

tinomen avatar Jan 21 '21 16:01 tinomen

As I wrote above, we use the UUID, which we look up using blkid.
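For anyone landing here, a minimal sketch of mounting by UUID might look like the following. It is an illustration, not the private script referred to above: the function names (fstab_entry_for, mount_by_uuid) and the mount options are assumptions, and it expects the device to already carry a file system so that blkid can report a UUID.

```shell
#!/usr/bin/env bash
# Sketch only: mount a formatted volume by its file system UUID so the
# mount survives reboots even if the NVMe device name changes.

# Build an /etc/fstab line for the given UUID and mount point.
# The file system type defaults to ext4; "nofail" is an assumed choice
# so that a missing volume does not block boot.
fstab_entry_for() {
  local uuid="$1" mount_point="$2" fs_type="${3:-ext4}"
  printf 'UUID=%s %s %s defaults,nofail 0 2\n' "$uuid" "$mount_point" "$fs_type"
}

# Look up the UUID with blkid, record it in /etc/fstab once, and mount.
# Must run as root.
mount_by_uuid() {
  local device="$1" mount_point="$2" uuid
  uuid="$(blkid -s UUID -o value "$device")"
  if [ -z "$uuid" ]; then
    echo "No UUID found for $device (is it formatted?)" >&2
    return 1
  fi

  mkdir -p "$mount_point"
  # Avoid appending a duplicate entry when the script is re-run.
  if ! grep -q "UUID=$uuid" /etc/fstab; then
    fstab_entry_for "$uuid" "$mount_point" >> /etc/fstab
  fi
  mount "$mount_point"
}
```

Usage would be along the lines of mount_by_uuid /dev/nvme1n1 /couchbase-data, where the mount point is only an example path.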

brikis98 avatar Jan 25 '21 18:01 brikis98

This repo is being archived, feel free to use a fork if necessary.

ellisonc avatar Mar 29 '23 18:03 ellisonc