Fixing a stuck TTL on HashiCorp’s Vault PKI
My shop has been using consul-template to rotate the Vault certificates for us each month, but unfortunately this turned out not to be very reliable. Since my current shop is replacing Vault with AWS SSM, this setup hasn’t gotten much TLC lately, and there’s really no reason to pour work into making it more resilient; plus it’s mostly used by our legacy staging environments, which naturally don’t get much TLC anywhere. My plan is to remove consul-template and replace its certificates with 1-year ones, which gives us time to not worry about certificates expiring while we dismantle things over the next few months. The certs I’m talking about are just the ones the Vault servers use to communicate with each other.
So here I am on a Sunday evening (so I don’t interrupt our development workflows during the week), attempting to rip out consul-template and manually rotate the certs to ones with a 1-year TTL. If you haven’t done this before, it’s actually quite simple.
Obviously it would be awkward to rotate the certs you’re currently using, so on all Vault servers in the cluster, you disable TLS in /etc/vault.d/vault.hcl and restart everything so they now communicate over HTTP.
listener "tcp" {
address = "0.0.0.0:8200"
- tls_cert_file = "/etc/ssl/vault/vault.crt"
- tls_key_file = "/etc/ssl/vault/vault.key"
+ tls_disable = 1
}
Once you’ve restarted the daemons, set export VAULT_ADDR='http://127.0.0.1:8200'. Okay, simple enough. Now go through the motions of unsealing with echo $UNSEAL_KEY1 | vault unseal as many times as you need, and finish with a good ole’ vault status. Now we’re in business to rotate some certs. I’m using my own domain name and random hostnames in the examples so I don’t reveal anything. 🙈
$ vault write pki/issue/sudoaccess-dot-com common_name=vault-1.sudoaccess.com alt_names="localhost,*.sudoaccess.com,vault.consul.local" ip_sans="127.0.0.1,192.168.1.200" > vault-1.txt
Cool, let’s look at the file:
Key Value
--- -----
lease_id pki/issue/sudoaccess-dot-com/1234567890token1234
lease_duration 767h59m59s
lease_renewable false
ca_chain [-----BEGIN CERTIFICATE-----
MIIEpQIBAAKCAQEA0onHvatXo8X7Sr5ANkTEnn7ipjpL6z0pSc1uV6F1aLX1I94f
Wait… lease_duration of 767 hours. I don’t want 1 month. Let’s be explicit in our TTL.
$ vault write pki/issue/sudoaccess-dot-com common_name=vault-1.sudoaccess.com alt_names="localhost,*.sudoaccess.com,vault.consul.local" ip_sans="127.0.0.1,192.168.1.200" ttl="8760h" > vault-1.txt
Key Value
--- -----
lease_id pki/issue/sudoaccess-dot-com/1234567890token1234
lease_duration 767h59m59s
lease_renewable false
ca_chain [-----BEGIN CERTIFICATE-----
MIIEpQIBAAKCAQEA0onHvatXo8X7Sr5ANkTEnn7ipjpL6z0pSc1uV6F1aLX1I94f
🤔 well that doesn’t seem right.
And on I go with various ways of writing a 1-year TTL: ttl=8760h, ttl=31536000, etc. Still nothing.
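For reference, those two spellings are meant to describe the same duration. A quick sanity check with plain shell arithmetic, assuming a 365-day year (the variable names are just mine for illustration):

```shell
# One year expressed in hours and in seconds, assuming a 365-day year.
hours_per_year=$((365 * 24))
seconds_per_year=$((hours_per_year * 60 * 60))

echo "ttl=${hours_per_year}h"    # the duration-string form
echo "ttl=${seconds_per_year}"   # the plain-seconds form
```

So whichever form is used, the request really is asking for a full year; the problem is not a unit mix-up.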
Alright fine, I’m going to force the max_ttl by doing:
$ export VAULT_TOKEN=<root_vault_token>
$ vault secrets tune -max-lease-ttl=8760h pki
Success! Tuned the secrets engine at: pki/
$ vault secrets tune -max-lease-ttl=8760h pki/issue/sudoaccess-dot-com
Success! Tuned the secrets engine at: pki/issue/sudoaccess-dot-com/
If you don’t have a root vault token, follow these directions to set one up.
Alright this should work. Well guess what? It didn’t. Same result. At this point I’m saying WTF out loud.
After digging around in the CLI and trying to override this in various ways, it became obvious that the sudoaccess-dot-com role was restricting it to 1 month. After reading through this and seeing “Note that individual roles can restrict this value to be shorter on a per-certificate basis,” I ended up finding this gem of documentation: the PKI API guide.
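To make that note concrete, here is a toy model of the capping. It assumes the issued certificate simply gets the smallest of the requested TTL, the role’s max_ttl, and the mount’s max lease TTL, all in seconds. effective_ttl is a hypothetical helper of mine, not a Vault command, and it ignores details such as unset or zero values.

```shell
# Toy model: the certificate TTL is the smallest of the three limits.
# Arguments are seconds: requested TTL, role max_ttl, mount max lease TTL.
effective_ttl() {
  requested=$1
  role_max=$2
  mount_max=$3
  result=$requested
  if [ "$role_max" -lt "$result" ]; then result=$role_max; fi
  if [ "$mount_max" -lt "$result" ]; then result=$mount_max; fi
  echo "$result"
}

# Requested one year, role capped at 2764800 s, mount tuned to one year.
secs=$(effective_ttl 31536000 2764800 31536000)
echo "$secs seconds = $((secs / 3600)) hours"
```

Under that assumption, tuning the mount changes nothing as long as the role carries the smaller limit, which would match the 767h59m59s I kept getting.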
Let’s check out this role by running:
$ curl -s \
--header "X-Vault-Token:1234567890token1234" \
http://127.0.0.1:8200/v1/pki/roles/sudoaccess-dot-com | jq .
{
"request_id": "bf389343-73d2-414a-96ff-df37ea15ec5d",
"lease_id": "",
"renewable": false,
"lease_duration": 0,
"data": {
.... truncated ....
"key_bits": 2048,
"key_type": "rsa",
"key_usage": [
"DigitalSignature",
"KeyAgreement",
"KeyEncipherment"
],
"locality": null,
"max_ttl": 2764800,
"no_store": false,
"not_before_duration": 0,
"organization": null,
"ou": null,
"policy_identifiers": null,
"postal_code": null,
"province": null,
"require_cn": false,
"server_flag": true,
"street_address": null,
"ttl": 2764800,
"use_csr_common_name": true,
"use_csr_sans": false
},
"wrap_info": null,
"warnings": null,
"auth": null
}
There we go: "max_ttl": 2764800, which turns out to be 768 hours. Let’s save this entire output to a file named payload.json and change max_ttl to 31536000. Once that’s done, let’s post it by running:
$ curl -s \
--header "X-Vault-Token:1234567890token1234" \
--request POST \
--data @payload.json \
http://127.0.0.1:8200/v1/pki/roles/sudoaccess-dot-com
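If hand-editing the file feels error-prone, the same change can be sketched with jq, which is already used above. This runs against a trimmed stand-in for the GET response, not a live server. It also assumes the role endpoint reads the role fields from the top level of the request body, so it keeps only the data object; that is an assumption about the API rather than something verified here. The temp directory and sample values are placeholders.

```shell
workdir=$(mktemp -d)

# Trimmed stand-in for the GET response shown earlier (sample values only).
cat > "$workdir/role.json" <<'EOF'
{
  "request_id": "bf389343-73d2-414a-96ff-df37ea15ec5d",
  "data": {
    "key_bits": 2048,
    "key_type": "rsa",
    "max_ttl": 2764800,
    "ttl": 2764800
  }
}
EOF

# Keep only the role fields and raise max_ttl to one year (in seconds).
jq '.data | .max_ttl = 31536000' "$workdir/role.json" > "$workdir/payload.json"

cat "$workdir/payload.json"
```

Carrying the other role fields along is deliberate: it avoids leaving any of them out of the payload that gets posted back.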
Let’s give the vault write command another shot and look at the cert output:
Key Value
--- -----
lease_id pki/issue/sudoaccess-dot-com/1234567890token1234
lease_duration 8759h59m59s
lease_renewable false
ca_chain [-----BEGIN CERTIFICATE-----
MIIEpQIBAAKCAQEA0onHvatXo8X7Sr5ANkTEnn7ipjpL6z0pSc1uV6F1aLX1I94f
8759h59m59s!
Woohoo, let’s chop this up into a vault.crt and a vault.key and re-enable TLS in /etc/vault.d/vault.hcl.
listener "tcp" {
address = "0.0.0.0:8200"
+ tls_cert_file = "/etc/ssl/vault/vault.crt"
+ tls_key_file = "/etc/ssl/vault/vault.key"
- tls_disable = 1
}
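For the chopping step, here is a sketch that assumes the certificate was issued with vault write -format=json, so the output is JSON rather than the table shown earlier. The sample file and its placeholder PEM bodies stand in for real output, and the output directory is a throwaway temp dir rather than /etc/ssl/vault.

```shell
outdir=$(mktemp -d)

# Stand-in for: vault write -format=json pki/issue/... > vault-1.json
cat > "$outdir/vault-1.json" <<'EOF'
{
  "data": {
    "certificate": "-----BEGIN CERTIFICATE-----\nPLACEHOLDER\n-----END CERTIFICATE-----",
    "private_key": "-----BEGIN RSA PRIVATE KEY-----\nPLACEHOLDER\n-----END RSA PRIVATE KEY-----"
  }
}
EOF

# -r prints the raw string, turning the JSON "\n" escapes into real newlines.
jq -r '.data.certificate' "$outdir/vault-1.json" > "$outdir/vault.crt"
jq -r '.data.private_key' "$outdir/vault-1.json" > "$outdir/vault.key"

# Keep the private key readable by its owner only.
chmod 600 "$outdir/vault.key"
```

Pulling the fields out of JSON avoids copy-pasting PEM blocks out of the table output by hand.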
After the daemons are restarted, export VAULT_ADDR='https://127.0.0.1:8200' and go through your vault unseal shenanigans.
$ vault status
Key Value
--- -----
Seal Type shamir
Initialized true
Sealed false
Total Shares 5
Threshold 3
Version 0.11.6
Cluster Name vault-cluster-c917641f
Cluster ID e9361a1d-6e41-4168-9fea-03600feaa035
HA Enabled true
HA Cluster https://192.168.1.115:8201
HA Mode active
👏
So why couldn’t I override it with the CLI? Why did I have to hack away at the API? After more reading, I found that, at least in version 0.11.6, the role keeps enforcing the TTL of the first certificate issued through it, no matter what. Our consul-template service happened to be the first to use this specific role, and since it was requesting 1-month TTLs, that’s what the role was stuck enforcing.
Full disclaimer: this was only my second time ever having to troubleshoot Vault. But this still seems to be a pretty unknown (and annoying) gotcha, even among my cohorts who are more knowledgeable about this system. As of writing, this wasn’t clearly documented anywhere, I couldn’t find anyone talking about this scenario with all my google-fu, and I had to get a little creative. Hopefully this helps someone. Thanks for reading.