
AWS: The Volume that Wasn't (EBS Create silent failure)

2020-07-30

The Teaser

Imagine this: You create a new EBS Volume – and then it’s gone.

The API did report success. It did return an ID for your new Volume.

And immediately afterwards, that API happily tells you that no such Volume exists. No error message. No trace of it to be found. Just gone.

That’s not supposed to happen.

And it’s not a freak accident. It’s reproducible.

Even worse: In a different AWS Account, the exact same thing works just fine.

What the fuck?

If you want the short answer, jump to tl;dr.

The Story

We’ve been working through the AWS EKS Workshop. It’s really nice, by the way.

For one of the labs, on StatefulSets, we taught Kubernetes to create its own EBS Volumes on demand. Afterwards it would attach them to the EKS Worker Nodes as required, so Pods can use them to persist data.
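
In that lab, dynamic provisioning boils down to a StorageClass backed by the AWS EBS provisioner plus a PersistentVolumeClaim (the workshop uses a volumeClaimTemplate in the StatefulSet). A minimal sketch with made-up names, not the exact workshop manifests:

# minimal sketch of dynamic EBS provisioning: a StorageClass backed by the
# AWS EBS provisioner, and a PVC whose creation triggers an on-demand CreateVolume
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp2
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-demo
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ebs-gp2
  resources:
    requests:
      storage: 10Gi
EOF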

This worked just fine for me. For a colleague, the Pods wouldn’t start. Looking at the Storage Controller log files quickly revealed that something’s fucky with the EBS Volumes:

Could not create volume "pvc-d57184c0-9a8a-4124-81ad-821b6409cb67":
failed to get an available volume in EC2:
InvalidVolume.NotFound: The volume 'vol-092ec6ea1e1eb4d5a' does not exist.

Checking the EC2 Console confirmed that indeed no Volume with that ID exists. But if it has an ID, it must have been created. Wrong region? Failed CreateVolume request? Something unexpected in the API response? Rogue DeleteVolume just after the creation? CloudTrail to the rescue!

But… Nothing. A seemingly successful and correct CreateVolume request and response, matching the Volume ID. Nothing else.

At this point, asking the Big G oracle seemed appropriate: After all, when a successful CreateVolume does not actually create a Volume, that is so weird that many people must have had this issue by now. Or we’re just having a really bad day.

Et voilà: aws-cli issue #2592 was a valuable hint – it loosely matched our problem and suggested that permissions for kms were missing. Could it really be kms permissions? The request was indeed for an encrypted volume, so missing permissions for the Key Management Service seemed plausible. On the other hand, it did not explain why the two AWS Accounts behaved differently.

But we were getting somewhere.

Then I remembered a discussion with my colleague about some settings to make EBS encrypted by default: The ability to Opt-in to Default Encryption for New EBS Volumes has been available since 2019.
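
That opt-in is configured per account and region, and it boils down to two API calls. A minimal sketch (the key alias is a placeholder):

# opt in to encryption by default for new EBS Volumes in this region
aws ec2 enable-ebs-encryption-by-default

# optionally set a Customer Managed Key as the default EBS encryption key
# (alias/my-ebs-key is a placeholder)
aws ec2 modify-ebs-default-kms-key-id --kms-key-id alias/my-ebs-key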

Checking the EBS settings revealed that yes, my colleague had configured a Customer Managed Key as default key for EBS encryption. We found this by pure luck.

That seemed like a good theory: The AWS-managed default EBS key does not need any explicit kms permissions (it has a blanket allowance in its Key Policy). That’s why it worked in the first AWS Account. A custom key does not have these permissions by default, so additional kms permissions need to be granted. That’s why it failed in the second AWS Account.
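
You can inspect that blanket allowance yourself; the Key Policy of the AWS-managed EBS key is readable with a single call (sketch, output omitted):

# look at the Key Policy of the AWS-managed EBS key:
# it contains the blanket allowance mentioned above
aws kms get-key-policy --key-id alias/aws/ebs --policy-name default --output text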

So it seems like CreateVolume fails silently when it cannot use the encryption key.

Adding the kms grants, as suggested by the GitHub issue, immediately solved the issue.

The Proof

To test my theory, I’ve built a small lab environment. If you want to try this at home, there’s Terraform code available.

It consists of an EC2 instance with an Instance Profile that allows it to do what the EKS Storage role would have allowed – basically a bunch of ec2 permissions. I’ve added a second policy to allow switching the EBS default encryption key.
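
That second policy is tiny; it is essentially this (the role name is made up, the real definition lives in the Terraform code):

# allow the lab instance to read and switch the EBS default encryption key
# (role name is a placeholder; see the Terraform code for the real thing)
aws iam put-role-policy --role-name ebs-lab-role \
  --policy-name ebs-default-key-settings \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": [
        "ec2:GetEbsDefaultKmsKeyId",
        "ec2:ModifyEbsDefaultKmsKeyId",
        "ec2:ResetEbsDefaultKmsKeyId"
      ],
      "Resource": "*"
    }]
  }'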

First, let’s try the happy path: No custom key is configured for EBS, it uses the AWS-managed EBS key:

# verify that the EBS default key is the AWS-managed key:
# 1) get default key id
[root@ip-172-31-39-182 ~]# aws ec2 get-ebs-default-kms-key-id --output text
arn:aws:kms:eu-central-1:329261680777:key/2c93167a-ab7a-47d3-8f3e-7d02b317b449

# 2) check that this id is the same as the alias/aws/ebs key
[root@ip-172-31-39-182 ~]# aws kms describe-key --key-id alias/aws/ebs --query KeyMetadata.KeyId
"2c93167a-ab7a-47d3-8f3e-7d02b317b449"

Create an encrypted volume:

[root@ip-172-31-39-182 ~]# aws ec2 create-volume --availability-zone eu-central-1b \
--size 10 --encrypted
[...]
    "Encrypted": true, 
    "VolumeId": "vol-03dc8bf00a729d652", 
    "State": "creating", 
    "KmsKeyId": "arn:aws:kms:eu-central-1:329261680777:key/2c93167a-ab7a-47d3-8f3e-7d02b317b449", 
[...]

# verify that it REALLY worked, i.e. State is available
[root@ip-172-31-39-182 ~]# aws ec2 describe-volumes --volume-ids vol-03dc8bf00a729d652 \
--query Volumes[].State --output text
available

Now let’s change the EBS default key to our custom key (created by Terraform before):

[root@ip-172-31-39-182 ~]# aws ec2 modify-ebs-default-kms-key-id \
--kms-key-id 2f6794ca-ecdb-4f28-a793-04340f989444
{
    "KmsKeyId": "arn:aws:kms:eu-central-1:329261680777:key/2f6794ca-ecdb-4f28-a793-04340f989444"
}

And now let’s try this again:

[root@ip-172-31-39-182 ~]# aws ec2 create-volume --availability-zone eu-central-1b \
--size 10 --encrypted
[...]
    "Encrypted": true, 
    "VolumeId": "vol-09871188b915f6b2f", 
    "State": "creating", 
    "KmsKeyId": "arn:aws:kms:eu-central-1:329261680777:key/2f6794ca-ecdb-4f28-a793-04340f989444", 
[...]

Notice the different KmsKeyId.

As suspected, the volume magically disappears:

[root@ip-172-31-39-182 ~]# aws ec2 describe-volumes --volume-ids vol-09871188b915f6b2f \
--query Volumes[].State --output text
An error occurred (InvalidVolume.NotFound) when calling the DescribeVolumes operation:
The volume 'vol-09871188b915f6b2f' does not exist.

The Analysis (and tl;dr)

So the CreateVolume call doesn’t check encryption key permissions before returning successfully.

This is what happens: CreateVolume accepts the request and returns a Volume ID right away. The actual creation then happens asynchronously, and when EBS is not allowed to use the configured default key, that creation fails silently. The Volume never materializes, and all DescribeVolumes will ever tell you is InvalidVolume.NotFound.

Proper fixes would be to adjust the Custom Key’s Policy so that kms permissions are granted as required – or to grant those kms permissions to the Role that issues the CreateVolume.
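
For the second variant, here is a sketch of such a policy for the custom key from the lab above. The role name is a placeholder, and the set of kms actions is what the EBS encryption docs (and the GitHub issue) list for using a Customer Managed Key; your exact needs may vary:

# grant the Role that issues CreateVolume the kms permissions it needs
# on the custom key (role name is a placeholder)
aws iam put-role-policy --role-name eks-ebs-storage-role \
  --policy-name allow-custom-ebs-key \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": [
        "kms:CreateGrant",
        "kms:Decrypt",
        "kms:DescribeKey",
        "kms:GenerateDataKeyWithoutPlaintext",
        "kms:ReEncrypt*"
      ],
      "Resource": "arn:aws:kms:eu-central-1:329261680777:key/2f6794ca-ecdb-4f28-a793-04340f989444"
    }]
  }'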

Without any hints whatsoever, neither in CloudTrail nor from the API itself, it is next to impossible to debug this (without pure luck and/or help from Big G).

This is something I’d like to see improved, preferably by checking Key permissions beforehand.

The References

I’ve said it before, I’ll say it again: AWS Documentation is really good.

So yeah, all the above is described in the docs. It’s just that you only look there for confirmation – after you’ve figured it out. And even then it’s not easy to find:

The last one also reminds you that you can set up CloudWatch Events (nowadays EventBridge) Rules to get notified when this happens. Yeah, thanks.
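
For completeness, such a rule could look roughly like this; it matches the failed createVolume notification that EBS emits (attach a target, e.g. an SNS topic, separately):

# EventBridge rule matching failed EBS createVolume events
# (add a notification target with: aws events put-targets)
aws events put-rule --name ebs-create-volume-failed \
  --event-pattern '{
    "source": ["aws.ec2"],
    "detail-type": ["EBS Volume Notification"],
    "detail": {
      "event": ["createVolume"],
      "result": ["failed"]
    }
  }'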

