Tailscale is one of those tools that feels like magic the first time you use it. Install it on two machines, and they can talk to each other — no port forwarding, no firewall rules, no VPN concentrators. It is built on WireGuard, it is encrypted end-to-end, and the setup takes about 90 seconds.

But then you try to use it in production on AWS. You set up subnet routing so your team can reach private RDS databases. You enable MagicDNS so people can ssh ec2-user@my-server instead of memorizing IP addresses. You write ACL rules so the research team can only access research servers.

And then someone messages you: "SSH says no such host is known."

This article is everything we learned deploying Tailscale across multiple AWS accounts, dozens of EC2 instances, and a team that ranges from senior engineers to people who have never opened a terminal. It is the guide we wish we had.

What Tailscale actually does (30-second version)

Tailscale creates an encrypted mesh network between your devices. Every device gets a stable IP address in the 100.x.x.x range. Devices find each other through a coordination server, but traffic flows directly between them — Tailscale never sees your data.

Think of it as a VPN that does not suck. No split tunneling configs, no certificate management, no "the VPN dropped again" Slack messages.

Key concepts you will need:

- Tailnet: your private Tailscale network; every device you enroll joins it.
- Auth key: a pre-generated credential that lets a machine join the tailnet without interactive login.
- Tag: a label like tag:production that identifies what a device is, used as the target of ACL rules.
- Subnet router: a device that forwards traffic between your tailnet and a network behind it, such as a VPC.
- MagicDNS: automatic DNS names for every device on the tailnet.
- ACLs: the policy file that controls which users and devices can reach which others.
- Tailscale SSH: SSH authenticated through your identity provider and ACLs instead of key files.

If any of these terms are unfamiliar, that is fine. You will understand each one by the end of this article because we are going to walk through every single one of them with real commands and real mistakes.

Part 1: Installing Tailscale on EC2 instances

The basic setup

On Amazon Linux 2023 or any Linux EC2 instance:

```bash
curl -fsSL https://tailscale.com/install.sh | sh
```

Then bring it up:

```bash
sudo tailscale up \
  --ssh \
  --auth-key=tskey-auth-xxxxx \
  --advertise-tags=tag:dev \
  --hostname=my-app
```

That is it. Your EC2 instance is now on your tailnet, accessible via Tailscale SSH, tagged for ACL purposes, and named my-app in MagicDNS.

Let us break down what each flag does because you will want to know this when things stop working:

- --ssh enables Tailscale SSH on the instance, so SSH access is governed by your ACLs instead of authorized_keys files (covered in Part 5).
- --auth-key joins the tailnet non-interactively using a pre-generated key, which is what you want on a headless server.
- --advertise-tags=tag:dev labels the device for ACL purposes; rules target tags, not individual machines.
- --hostname=my-app sets the name MagicDNS will resolve.

Gotcha: Enable the Tailscale service

After installing, make sure tailscaled starts on boot. The install script usually handles this, but verify:

```bash
sudo systemctl enable tailscaled
sudo systemctl start tailscaled
```

If you skip this and the instance reboots, Tailscale is not running and the device goes offline in your tailnet. You will get "no such host" errors and spend 20 minutes thinking it is a DNS problem before you realize the daemon just is not running.

Automating it in CloudFormation UserData

You do not want to SSH into every new instance to set up Tailscale. Put it in your CloudFormation UserData:

```yaml
UserData:
  Fn::Base64: !Sub |
    #!/bin/bash
    dnf clean packages
    dnf install -y postgresql17

    # Enable IP forwarding (required for subnet routing)
    echo 'net.ipv4.ip_forward = 1' | tee -a /etc/sysctl.conf
    echo 'net.ipv6.conf.all.forwarding = 1' | tee -a /etc/sysctl.conf
    sysctl -p /etc/sysctl.conf

    # Install Tailscale
    curl -fsSL https://tailscale.com/install.sh | sh

    # Get auth key from SSM Parameter Store
    TAILSCALE_AUTH_KEY=$(aws ssm get-parameter \
      --name "/${EnvironmentName}/${ApplicationName}/tailscale-auth-key" \
      --with-decryption --query 'Parameter.Value' --output text \
      --region ${AWS::Region})

    # Join the tailnet
    tailscale up --ssh \
      --auth-key=$TAILSCALE_AUTH_KEY \
      --advertise-tags=tag:${EnvironmentName} \
      --hostname=${EnvironmentName}-${ApplicationName} \
      --accept-routes
```

A few things worth noting here:

- The auth key is read from SSM Parameter Store at boot, so it never appears in the template or the console.
- IP forwarding is enabled before Tailscale starts, so the same UserData works for subnet routers.
- The tag and hostname are built from stack parameters, which keeps naming consistent across environments.
- --accept-routes lets the instance use subnet routes advertised elsewhere on the tailnet, not just its own VPC.

Where to store the auth key

Never hardcode auth keys in your templates. Store them in AWS SSM Parameter Store as a SecureString:

```bash
aws ssm put-parameter \
  --name "/production/my-app/tailscale-auth-key" \
  --value "tskey-auth-xxxxx" \
  --type SecureString \
  --region us-east-2
```

Your EC2 IAM role needs permission to read it:

```yaml
- Effect: Allow
  Action:
    - ssm:GetParameter
    - ssm:GetParameters
  Resource:
    - !Sub 'arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/${EnvironmentName}/${ApplicationName}/tailscale-auth-key'
```

Gotcha: Auth keys expire

Tailscale auth keys have an expiration. The default is 90 days. If you create a key, put it in SSM, and forget about it, your next EC2 instance launch three months later will silently fail to join the tailnet. The instance starts fine, the app runs fine, but the device never shows up in your Tailscale admin console.

Options:

- Set a reminder to rotate the key before it expires, and overwrite the SSM parameter when you do.
- Automate rotation: mint a fresh key on a schedule through the Tailscale API and push it to SSM.
- Use reusable, ephemeral keys for autoscaled instances so short-lived devices clean up after themselves.
- At minimum, alert when an instance launches but never appears in the tailnet, so the failure stops being silent.

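If you automate rotation, the scheduling logic is simple. Here is a sketch of a hypothetical helper (the function name and the idea of recording the key's mint time in SSM metadata are assumptions, not Tailscale features) that a weekly job could use to decide when to re-mint:

```python
from datetime import datetime, timedelta, timezone


def key_needs_rotation(created_at: datetime,
                       expiry_days: int = 90,
                       buffer_days: int = 14) -> bool:
    """True if an auth key minted at `created_at` is within `buffer_days`
    of expiring and should be replaced before the next instance launch."""
    expires_at = created_at + timedelta(days=expiry_days)
    return datetime.now(timezone.utc) >= expires_at - timedelta(days=buffer_days)
```

Wire this into a scheduled Lambda that, when it returns True, creates a new key via the Tailscale API and overwrites the SSM parameter.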
Part 2: Subnet routing — reaching private AWS resources

This is where Tailscale goes from "neat toy" to "essential infrastructure." A subnet router lets every device on your tailnet access resources inside your VPC — RDS databases, ElastiCache clusters, internal APIs — without those resources needing Tailscale installed.

Setting up a subnet router

Designate a small EC2 instance (a t4g.nano is enough — about $3/month) as your subnet router:

```bash
# Enable IP forwarding (required)
echo 'net.ipv4.ip_forward = 1' | sudo tee -a /etc/sysctl.conf
echo 'net.ipv6.conf.all.forwarding = 1' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p /etc/sysctl.conf

# Start Tailscale with subnet routes
sudo tailscale up \
  --advertise-routes=10.0.0.0/16 \
  --hostname=aws-subnet-router \
  --ssh \
  --auth-key=tskey-auth-xxxxx
```

The --advertise-routes=10.0.0.0/16 tells Tailscale this device can route traffic to the 10.0.0.0/16 CIDR range. Adjust this to match your VPC.

Gotcha: Routes must be approved

This is the number one thing that catches people. You advertise routes on the EC2 instance and assume it works. It does not. Not yet.

After the device advertises routes, you must approve those routes in the Tailscale admin console. Go to the Machines page, find your subnet router, click the three-dot menu, and approve the routes.

Until you do this, traffic to 10.0.x.x addresses just disappears. No error message. No timeout. Packets leave your laptop, hit the subnet router, and go nowhere. You will spend 30 minutes checking security groups before you realize the routes were never approved.

You can skip this approval step by adding auto-approvers in your ACL policy:

```json
{
  "autoApprovers": {
    "routes": {
      "10.0.0.0/16": ["tag:infra"]
    }
  }
}
```

This automatically approves any device tagged tag:infra to advertise routes for your VPC CIDR. Set it once and your CloudFormation-deployed subnet routers work immediately.

Gotcha: Security groups still apply

Tailscale does not bypass AWS security groups. The subnet router is an EC2 instance with an ENI and a security group. If your RDS instance's security group only allows traffic from sg-app-server, the subnet router cannot reach it either.

You need to add the subnet router's security group to the inbound rules of every resource you want to reach through Tailscale. This is easy to forget because you think "Tailscale is handling the networking" — it is, but only up to the VPC boundary. Inside the VPC, AWS security groups are still the gatekeeper.
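In CloudFormation, the ingress rule looks something like the following sketch. The resource names (DatabaseSecurityGroup, SubnetRouterSecurityGroup) are hypothetical; adjust the port to the service you are exposing:

```yaml
# Hypothetical resource names: allow the subnet router's security group
# to reach an RDS Postgres instance on port 5432.
DatabaseIngressFromTailscaleRouter:
  Type: AWS::EC2::SecurityGroupIngress
  Properties:
    GroupId: !Ref DatabaseSecurityGroup
    Description: Tailscale subnet router access
    IpProtocol: tcp
    FromPort: 5432
    ToPort: 5432
    SourceSecurityGroupId: !Ref SubnetRouterSecurityGroup
```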

Gotcha: Other devices need --accept-routes

The subnet router advertises routes. Other devices on your tailnet need to opt in to using those routes. On your laptop:

```bash
tailscale up --accept-routes
```

On macOS and Windows, the Tailscale GUI has a checkbox: "Use Tailscale subnets" or "Accept routes." On Linux, you need the flag. On EC2 instances launched with our CloudFormation UserData above, it is already included.

If someone on your team reports they cannot reach an RDS database from their laptop but another person can, the first question is: "Did you accept routes?"

Part 3: MagicDNS — the part that makes everything nice (until it doesn't)

MagicDNS is the feature that turns Tailscale from "useful" to "delightful." Instead of remembering that your staging database is at 100.89.12.47, you just use staging-db.tailnet-name.ts.net.

Enable it in the admin console under DNS settings. Once it is on, every device on your tailnet gets a hostname based on the --hostname you set (or the machine hostname if you did not set one).

Gotcha: "No such host is known"

This is the error that prompted this article. Someone on your team types:

```bash
ssh ec2-user@staging-api
```

And gets: ssh: Could not resolve hostname staging-api: No such host is known.

Nine times out of ten, it is one of these five problems:

  1. Tailscale is not running. Check tailscale status. If it says "Tailscale is stopped," run tailscale up. On Windows, check the system tray icon. On Mac, check the menu bar.
  2. The device fell off the tailnet. Auth keys and device keys can expire. Check the Tailscale admin console — if the device shows as "Expired," you need to re-authenticate it or adjust your key expiry settings.
  3. Short hostname resolution is not enabled. By default, MagicDNS only resolves the full hostname: staging-api.tailnet-name.ts.net. To use just staging-api, you need MagicDNS enabled in the admin console and the device's DNS settings need to include the Tailscale search domain.
  4. DNS caching. Your OS cached a previous DNS failure. On Windows, run ipconfig /flushdns. On macOS, sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder. On Linux, resolvectl flush-caches (or sudo systemd-resolve --flush-caches on older systems), or restart systemd-resolved.
  5. Corporate DNS intercepting. If you are on a corporate network with DNS filtering, their DNS server might be answering before Tailscale's resolver gets a chance. Check tailscale status and try the full .ts.net hostname first to narrow it down.

The debugging checklist

When MagicDNS is not working, run through this in order:

```bash
# 1. Is Tailscale running?
tailscale status

# 2. Can you see the device in your tailnet?
tailscale status | grep staging-api

# 3. Can you ping by Tailscale IP?
ping 100.x.x.x

# 4. Can you resolve the full hostname?
nslookup staging-api.tailnet-name.ts.net

# 5. Can you resolve the short hostname?
nslookup staging-api
```

If step 3 works but step 4 does not, it is a DNS problem. If step 4 works but step 5 does not, you need to set the Tailscale search domain. If step 3 does not work, the device is either offline or your ACLs are blocking the connection.
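Steps 4 and 5 are easy to script if you want to bake them into a health check. A minimal sketch using only the standard library:

```python
import socket


def can_resolve(host: str) -> bool:
    """True if the OS resolver (which MagicDNS plugs into) can turn
    `host` into at least one address."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False
```

Run it against both the full .ts.net name and the short name: if only the short name fails, the Tailscale search domain is missing from the device's DNS settings.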

Pro tip: Consistent hostname naming

Agree on a naming convention early. We use {environment}-{service}: production-api, staging-worker, dev-bastion. Whatever you pick, put it in the CloudFormation template so it is consistent. Ad-hoc hostnames like jeff-test-2 and new-server-final make your tailnet unnavigable within weeks.
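A convention only holds if something enforces it. A tiny CI check along these lines (a sketch assuming our {environment}-{service} scheme; the allowed environment names are illustrative) can lint the hostnames in your templates:

```python
import re

# Allowed environments under the {environment}-{service} convention.
HOSTNAME_RE = re.compile(r"^(production|staging|dev)-[a-z][a-z0-9-]*$")


def valid_hostname(name: str) -> bool:
    """True if `name` follows the {environment}-{service} convention."""
    return HOSTNAME_RE.fullmatch(name) is not None
```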

Part 4: ACLs — where the real complexity lives

Tailscale's ACL system is powerful and, if you are coming from security groups or simple firewall rules, initially confusing. Here is how to think about it.

The default is deny-all

When you first set up ACLs, the default policy allows all traffic between all devices. The moment you add your first ACL rule, Tailscale switches to deny-by-default. This means only traffic explicitly allowed by your ACL rules will flow. Everything else is silently dropped.

This catches every team the first time. Someone adds a rule to give the dev team access to dev servers, and suddenly the production team cannot reach production servers anymore. Not because the new rule broke anything — but because the old "allow everything" default was replaced by "deny everything except what you specified."

Start with a simple policy

Here is a minimal ACL policy that covers most small teams:

```json
{
  "tagOwners": {
    "tag:dev":        ["autogroup:admin"],
    "tag:staging":    ["autogroup:admin"],
    "tag:production": ["autogroup:admin"],
    "tag:infra":      ["autogroup:admin"]
  },
  "acls": [
    // Admins can access everything
    {"action": "accept", "src": ["autogroup:admin"], "dst": ["*:*"]},

    // Dev team can access dev and staging servers
    {"action": "accept", "src": ["group:dev"], "dst": ["tag:dev:*", "tag:staging:*"]},

    // All employees can access production via SSH (port 22)
    {"action": "accept", "src": ["autogroup:member"], "dst": ["tag:production:22"]},

    // Subnet router traffic
    {"action": "accept", "src": ["autogroup:member"], "dst": ["tag:infra:*"]}
  ],
  "autoApprovers": {
    "routes": {
      "10.0.0.0/16": ["tag:infra"]
    }
  },
  "ssh": [
    {"action": "accept", "src": ["autogroup:member"], "dst": ["tag:dev", "tag:staging"], "users": ["ec2-user", "ubuntu"]},
    {"action": "accept", "src": ["autogroup:admin"], "dst": ["tag:production"], "users": ["ec2-user", "ubuntu"]}
  ]
}
```

Key things to notice:

- tagOwners controls who may apply each tag; here, only admins can tag devices.
- The acls block is accept-only. Anything not matched by a rule is silently dropped.
- Network access and SSH access are separate policies; the ssh block has its own rules (see Part 5).
- autoApprovers lives in the same file, so route approval is part of the reviewed policy rather than a console click.

Gotcha: ACLs use HuJSON, not regular JSON

Tailscale ACL files support HuJSON — JSON with comments and trailing commas. This is great for readability but terrible if your tooling strips comments or rejects trailing commas. If you edit ACLs in the admin console, save them, then pull them via the API, the comments survive. If you pipe them through jq or a standard JSON parser, the comments are gone forever.

Keep your ACL source of truth in a file in your Git repo. Edit there, deploy from there, never hand-edit in the admin console.
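If some of your tooling only speaks strict JSON, strip the HuJSON extensions before parsing. Here is a rough sketch; a real HuJSON parser also handles comment-like sequences inside strings, which this does not:

```python
import json
import re


def strip_hujson(text: str) -> dict:
    """Parse a HuJSON ACL file by removing // and /* */ comments and
    trailing commas, then handing the result to the stock json parser."""
    no_comments = re.sub(r"//[^\n]*|/\*.*?\*/", "", text, flags=re.S)
    no_trailing = re.sub(r",\s*([}\]])", r"\1", no_comments)
    return json.loads(no_trailing)
```

Use this only for read-side tooling (diffing, linting). For writes, keep the HuJSON file as the source of truth so the comments survive.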

Gotcha: Testing ACL changes

Before you push an ACL change that locks half your team out, use the Tailscale ACL preview feature. In the admin console, you can test what a specific user or device would be able to access under the new policy before you save it.

You can also validate ACL files against the API before deploying them:

```bash
curl -s -X POST "https://api.tailscale.com/api/v2/tailnet/-/acl/validate" \
  -u "tskey-api-xxxxx:" \
  -H "Content-Type: application/json" \
  -d @acl.json
```

If you manage your tailnet programmatically and want ACL validation as part of a CI/CD pipeline, the open-source tailscale-mcp server from Yaw Labs includes a deploy-acl command that handles ETag fetching, validation, and deployment in one step — and preserves your HuJSON comments through the round-trip.
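Cheap local checks before the API call also catch the obvious mistakes early. A hypothetical pre-flight lint, not a substitute for Tailscale's own validation:

```python
def lint_acl(policy: dict) -> list[str]:
    """Flag structural problems in an ACL policy dict before sending it
    to Tailscale's validate endpoint."""
    errors = []
    for i, rule in enumerate(policy.get("acls", [])):
        # Every rule needs all three fields.
        for field in ("action", "src", "dst"):
            if field not in rule:
                errors.append(f"acls[{i}] is missing {field!r}")
        # Tailscale ACLs are accept-only; there is no "deny" action.
        if "action" in rule and rule["action"] != "accept":
            errors.append(f"acls[{i}]: action must be 'accept'")
    return errors
```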

Part 5: Tailscale SSH — never manage SSH keys again

Traditional SSH key management is a nightmare at scale. Someone leaves the team and you have to remove their public key from every server. Someone's key gets compromised and you are scrambling to figure out which servers it grants access to. Someone rotates their key and forgets to update half their authorized_keys files.

Tailscale SSH replaces all of this. Instead of key-based authentication, SSH access is controlled by your identity provider (Google, Okta, Azure AD, GitHub) through Tailscale. If someone is authenticated in your tailnet and your ACLs say they can SSH to a device, they can. If they leave and you remove them from your IdP, they lose access everywhere immediately.

Enabling Tailscale SSH

On the server side, you already added --ssh when you ran tailscale up. On the client side, it just works:

```bash
ssh ec2-user@production-api
```

No key files. No -i ~/.ssh/my-key.pem. No passphrase prompts. Tailscale intercepts the SSH connection, authenticates you against your identity, checks your ACL permissions, and connects you.

Gotcha: The SSH ACL is separate from network ACLs

Having network access to a device (in the acls block) does not automatically grant SSH access. You need rules in the ssh block too. This is a common "it worked yesterday" scenario — someone adds a network ACL rule for a new server group but forgets the SSH rule. Network pings work, but SSH says permission denied.

Gotcha: The users field matters

In your SSH ACL rules, the users field specifies which Linux users the person can log in as. On Amazon Linux it is ec2-user. On Ubuntu it is ubuntu. If someone tries ssh root@my-server and your SSH rule only allows ec2-user, they will be denied.

You can use ["autogroup:nonroot"] as a users value to allow any non-root user, which is a reasonable default.

Part 6: DynamoDB, ElastiCache, and services without IPs

Subnet routing works beautifully for RDS and other VPC-native services that have private IP addresses. But some AWS services need extra care.

DynamoDB

DynamoDB is not in your VPC by default. It is a public AWS service with public endpoints. Your subnet router routes VPC traffic, not public AWS traffic. To access DynamoDB through your subnet router, you need a VPC gateway endpoint:

```yaml
DynamoDBEndpoint:
  Type: AWS::EC2::VPCEndpoint
  Properties:
    VpcId: !Ref VPC
    ServiceName: !Sub 'com.amazonaws.${AWS::Region}.dynamodb'
    RouteTableIds:
      - !Ref PrivateRouteTable
    VpcEndpointType: Gateway
```

Gateway endpoints are free. Once created, DynamoDB traffic routes through the VPC internally, and your Tailscale subnet router can handle it.

ElastiCache

ElastiCache clusters live inside your VPC and have private IPs. Subnet routing handles them out of the box, as long as the security group allows traffic from the subnet router.

One common issue: ElastiCache cluster endpoints resolve to multiple IPs, and those IPs can change during failovers. This is fine for subnet routing (you route the whole CIDR), but if you have been copying specific IPs into config files, they will break after a failover. Always use the cluster endpoint hostname.

Part 7: Multiple AWS accounts and VPCs

As your AWS footprint grows, you will have multiple accounts and VPCs. Tailscale handles this cleanly: put a subnet router in each VPC.

```plaintext
Account: Production    VPC: 10.0.0.0/16    Subnet router: prod-router
Account: Staging       VPC: 10.1.0.0/16    Subnet router: staging-router
Account: Development   VPC: 10.2.0.0/16    Subnet router: dev-router
```

Each subnet router advertises its VPC CIDR. As long as the CIDRs do not overlap (they should not), devices on your tailnet can reach any resource in any account. One laptop, one Tailscale login, access to every environment.

Gotcha: Overlapping CIDRs

If two VPCs use the same CIDR range (like 10.0.0.0/16), Tailscale cannot route to both. Traffic will go to one or the other, depending on which route was advertised first. This is an AWS VPC planning problem, not a Tailscale problem, but it will surface the moment you try to set up subnet routing.

The fix is to plan non-overlapping CIDRs when you create your VPCs. If you inherited overlapping CIDRs, you will need to either re-IP one of the VPCs (painful) or use separate tailnets (also painful).
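Checking for overlaps before you deploy routers is cheap. A sketch using Python's standard ipaddress module:

```python
import ipaddress


def overlapping_vpcs(cidrs: dict[str, str]) -> list[tuple[str, str]]:
    """Return pairs of VPC names whose CIDR blocks overlap. Any pair
    listed here cannot both be routed on the same tailnet."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    names = sorted(nets)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if nets[a].overlaps(nets[b])]
```

Run it over your planned CIDRs whenever you add an account; an empty result means the subnet routers will not fight over routes.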

Part 8: Monitoring and troubleshooting

Essential commands

```bash
# See all devices and their status
tailscale status

# Check if your device is connected
tailscale status --self

# See which routes are available
tailscale status --peers

# Debug connectivity to a specific peer
tailscale ping production-api

# Check Tailscale's DNS resolver
tailscale dns status

# See detailed network path info
tailscale netcheck
```

tailscale ping is your best friend for debugging. It tells you whether the connection is direct (peer-to-peer) or relayed through Tailscale's DERP servers. Direct connections are faster. If you are always relayed, there might be a firewall between the two devices blocking UDP. Tailscale still works over relay, but latency is higher.

When devices go offline

Common reasons a device disappears from your tailnet:

- The device key expired and the machine needs to re-authenticate.
- tailscaled stopped running, either because it crashed or was never enabled on boot.
- The EC2 instance was stopped or terminated.
- It was registered with an ephemeral key and was automatically removed after going offline.
- An ACL change removed your access to it, which looks like "offline" from your side.

Cleaning up stale devices

After a few months, your tailnet accumulates dead devices — terminated EC2 instances, old laptops, test servers that no longer exist. You can remove them one-by-one in the admin console, or use ephemeral auth keys for short-lived devices. Ephemeral devices are automatically removed from the tailnet when they go offline.

For auth keys used in CloudFormation UserData, making them ephemeral is the right default. The device registers when the instance starts and deregisters when it stops. No ghost devices.

Part 9: Security best practices

Tailscale significantly improves your security posture, but there are things you should still do:

Use tags for everything

Never write ACL rules that reference specific users or device names. Use tags. Tags are durable — they survive device replacement, name changes, and team re-orgs. A rule like "tag:production can access tag:database on port 5432" still works when you replace every EC2 instance and every team member.

Separate admin and member access

Admins who can modify ACLs, approve devices, and create auth keys should be a small group. Use your identity provider's group memberships to map to Tailscale roles. Not everyone who needs SSH access to a dev server should be able to modify the ACL policy.

Enable device approval for new devices

By default, anyone with access to your identity provider can add a device to your tailnet. Enable device approval so new devices need an admin to approve them before they can communicate. This prevents someone who compromises a single user account from silently adding a device to your network.

Use MFA on your identity provider

Tailscale's security is as strong as your identity provider's security. If your Google Workspace or Okta does not require MFA, someone who phishes a password has access to your entire tailnet. This sounds obvious, but we have seen it: teams deploy Tailscale for "better security" while their IdP allows password-only login.

Audit what is connected

Periodically review the devices on your tailnet. Look for personal devices that should not be there, expired devices that should be removed, and devices that have more access (tags) than they need. If your tailnet grows to dozens of devices and you manage it through the admin console, this becomes tedious. You can automate it with the Tailscale API or a tool built on it.
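As a starting point for that automation, here is a sketch that filters a device list shaped like the Tailscale API's /api/v2/tailnet/-/devices response. Only the name and lastSeen fields are assumed; the threshold is arbitrary:

```python
from datetime import datetime, timedelta, timezone


def stale_devices(devices: list[dict], max_age_days: int = 30) -> list[str]:
    """Names of devices whose lastSeen timestamp is older than
    `max_age_days`: candidates for removal, or for an ephemeral key
    the next time they are provisioned."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    stale = []
    for device in devices:
        # lastSeen arrives as RFC 3339, e.g. "2024-01-05T12:00:00Z".
        last_seen = datetime.fromisoformat(device["lastSeen"].replace("Z", "+00:00"))
        if last_seen < cutoff:
            stale.append(device["name"])
    return stale
```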

Part 10: Making it real — the complete setup

Here is a start-to-finish checklist for getting Tailscale running with a team on AWS. This is the order we recommend, and the order that produces the fewest "why is it not working" messages:

  1. Create a Tailscale account and connect it to your identity provider (Google, Okta, Azure AD).
  2. Enable MagicDNS in the admin console under DNS settings.
  3. Install Tailscale on your own laptop first. Get familiar with tailscale status, tailscale ping, and the admin console before you add anyone else.
  4. Launch a subnet router in your primary VPC. A t4g.nano with the CloudFormation template above. Approve the routes.
  5. Test from your laptop. Can you psql to your RDS? Can you redis-cli to your ElastiCache? Fix security groups as needed.
  6. Write your initial ACL policy. Start simple. Admins get everything. Everyone else gets dev and staging. Ship it.
  7. Enable Tailscale SSH on your EC2 instances. Add SSH rules to your ACL.
  8. Invite the team. They install Tailscale, sign in with the identity provider, accept routes, and they have access.
  9. Iterate on ACLs as you learn what access patterns your team actually needs.

What comes next

Once you have the basics running, there are features worth exploring:

- Exit nodes, which let a device route all of its internet traffic through a trusted machine on the tailnet.
- Tailscale Serve and Funnel, for exposing a local service to your tailnet or, selectively, to the public internet.
- Taildrop, for sending files directly between your devices.
- The Tailscale API, for automating device cleanup, key rotation, and ACL deployment.

But do not try to set all of that up on day one. Get the basics working, get your team comfortable, and add features as the need arises. The best thing about Tailscale is that it works incrementally. Every piece you add makes the whole thing more useful, and nothing requires you to rip out what you already have.

The 90-second magic is real. The production deployment takes longer. But once it is set up, the "it just works" feeling scales to the whole team.

Published by Yaw Labs.

Try yaw — a modern terminal with SSH, databases, and AI built in.

winget install YawLabs.yaw


Interested in AI tools and developer workflows? Token Limit News is our weekly newsletter.