Tailscale with Avery Pennarun & Brad Fitzpatrick

“Can I Tailscale my Chromecast?”

You love Tailscale, I love Tailscale, we loved talking to Avery Pennarun and Brad Fitzpatrick from Tailscale about, I dunno, Go generics. Oh, and TAILSCALE! And DNS. And WASM.

If we get to a point where people are debating Go versus Rust, we’ve already won, because nobody’s talking about C, and nobody’s talking about slow-ass scripting languages. If we’re not running Ruby and C, great.

This rough transcript has not been edited and may have errors.

Deirdre: Hello, welcome to Security, Cryptography Whatever. I am Deirdre. Here with us today is David. How are you doing David?

David: I’m doing great.

Deirdre: Cool. Thank you. We also have Thomas. How are you Thomas?

Thomas: I’ve had better days.

Deirdre: Oh, I’m sorry. And we also, for our special guest today, we have Avery Pennarun. Is that the best way to say it?

Avery: That is correct.

Deirdre: Awesome. And we have Brad Fitzpatrick from Tailscale. Hi,

Brad: hello?

Deirdre: we are big fans of Tailscale, but other

Thomas: Some of us.

Deirdre: to this, Hey, Hey, who’s okay. We’ll get into it.

Avery: Love, hate relationship.

Deirdre: for those that don’t know what Tailscale is, who wants to tell us what Tailscale is?

Brad: oh, Avery you get to do it you’re the CEO.

Avery: All right. Well, okay. The super short version is, it’s a VPN, but it’s a mesh VPN. So you don’t have a center point. You just have computers wherever you want them to be. And you install Tailscale on each of those computers and those computers can talk to each other.

Deirdre: Ooh. And I hear that it involves WireGuard. And we did a whole episode with Jason Donenfeld who created WireGuard. How does WireGuard fit into Tailscale and where does WireGuard end and Tailscale begin?

Avery: A good question. WireGuard is what I call the data plane. So it actually moves the packets from point to point. It does the encryption. It does the high bandwidth, high throughput stuff. And what Tailscale does on top of that is it figures out where the nodes are in the world and it does the key distribution automatically for you.

So it sets up WireGuard and then it makes sure that WireGuard can negotiate connections through firewalls, which it normally wouldn’t be able to do.

Deirdre: Right.

This is instead of me adding a little config on each end of WireGuard and dropping in a pub key and an IP address and stuff like that. That is all handled magically.

Brad: you don’t have to forward ports. You don’t have to like, you know, set up some dynDNS service. Like we just deal with all this crap for you.

Avery: Yeah. There’s no config file. There’s no keys. There’s no IP addresses unless you want to deal with that stuff.

Thomas: It’s an immensely frustrating product. I have real problems with it as somebody who has spent so much of my career learning and memorizing the intricacies of IP networking and VPN configuration. My big problem with Tailscale is the installation experience. The setup experience, like you gear yourself up for a little adventure of okay, this is going to be fun. I’m going to like twiddle a bunch of knobs and eventually get Tailscale working. But really you just kind of click the installer and then everything works and there’s nothing to mess with. And

Deirdre: That sounds very appealing,

Thomas: I, I find it very jarring. so I think we’d be the, the millionth people to say that you guys kind of utterly nailed the onboarding experience for it. which is, you know, I mean, it seems like it’s a, it’s kind of a banal thing to say about any random startup, but for VPNs in particular, like having the experience of, you know, I spent like four years working with security teams at startups trying to get VPNs and stuff configured, and it’s a nightmare.

Right. And it just kind of works with Tailscale. I think we could go into more about like the trust boundary stuff, like the implications of using Tailscale, what you’re kind of giving up for, you know, all of the simplicity that you’re getting here. But I also think that, you know, there’s something to the idea of giving the simple pitch about what Tailscale is.

It’s a VPN and it just kind of works with no setup.

Avery: Yeah. So just on the onboarding experience, I, I always find it entertaining because whenever anybody says on the internet, how easy the onboarding experience for Tailscale is we have a designer at work who just says like, oh God, what are these people thinking? This is the most horrible onboarding experience we’ve ever done.

There’s so many things we can improve still. Uh, and I guess that is more or less how our attitude works internally. There’s always more things we can improve, but compared to networking products, it’s not bad.

Deirdre: to, both Thomas and Avery’s point that you just made: how much of this is just being traumatized by networking software for decades, versus Tailscale is actually very good?

Avery: Just remove the networking

Brad: Yeah, and that’s What I was going to say too.

Avery: We have all been traumatized really badly by software.

Brad: I used to have so much patience for computers. It just like, as, as I’ve wasted more and more of my life with computers, my patience has just gone to zero.

David: I didn’t get Tailscale at first. And then I saw Thomas talking about how great it was. And then one day I was like, I have an Android phone that has a dev app on it, but I want it to talk to this thing I’ve deployed on my computer. And I don’t want to use ADB. And then I’m going to read what I messaged to Tom: "fucking Tailscale is so goddamn useful, it pisses me off". To which Tom says in all caps, "I FUCKING KNOW IT’S MADDENING." And I feel like that’s the experience that a lot of people have had with it, or at least me and Thomas have had a similar experience, where it might be a little hard to explain what to do with it, but once you do it, you’re like, wow.

Avery: Yeah, we see a lot of people who post, like, I’ve been procrastinating on this for six months. I can’t believe I procrastinated so long. It only took me five minutes. There’s someone who was like, I set aside my Saturday morning to install Tailscale, and I’m kind of disappointed because I’m 10 minutes in and I’m finished and I don’t have anything to do with my Saturday morning.

Deirdre: Can I Tailscale my Chromecast?

Avery: That is something you can’t directly do right now. You’d have to put Tailscale on a Raspberry Pi next to your Chromecast.

Brad: But the Amazon equivalent of a Chromecast is their, like, Fire Stick that runs Android, and it runs Tailscale. So if you’re bringing it to a hotel or whatever, you can get to your Plex server at home from your Amazon Fire Stick, Chromecast, the HDMI thing.

Deirdre: There is a Google TV, which is literally Android underneath and they, they smashed a Chromecast layer on top of it. And it’s shitty, but I might be motivated to go back to it because I can run Tailscale on it.

Brad: No, we’ve had several people tweeting at us that they have Tailscale running on their, uh, Android TVs.

Avery: I got Tailscale to run on my Oculus Quest. I don’t know what to do with it except like browse our monitoring dashboard. But it’s pretty cool because I

Brad: In 3d though, you can

Avery: Exactly. From anywhere, from a hotel room.

David: Well, no, it just sounds like Tailscale is the first company to have meetings in their VPN, in the metaverse. That’s what that is.

Can I get everybody an Oculus Quest? Put them all on the same Tailscale. There you go.

Avery: Yeah, we’re actually thinking of changing our tagline. It’s going to be like, Tailscale, the network platform that powers the web3 metaverse. I’m just testing it out.

Thomas: What’s the current Tailscale tagline?

Avery: Oh, do you really want to know? Uh, it is, it’s basically "the new internet".

Thomas: That’s okay. Anyways, I have an impression of what the big win is for Tailscale, and you guys can tell me if I’m wrong. Right? But, you know, I feel like if I talked to just normal people about VPNs, a lot of what they’re thinking about is like, okay, this is how I get access to Netflix in other countries, or, you know, how I anonymize my coffee shop browsing. But it doesn’t seem like that’s really the core focus of what you’re doing.

It seems like the big problem that you guys are addressing is access VPNs. It’s the thing a startup would stand up to give their developers access to prod and dev and staging environments. The things that OpenVPN gets used for right now, it seems like that’s mostly what you guys are heading at.

Am I crazy to think that?

Brad: Well, no, I mean, the real problem is that VPN means, like, at least two, but often more, things to different people. So it’s kind of a terrible name, because people either think VPNs suck because they used their corporate VPN that was like a really annoying client that nagged them all the time and was slow to connect and made things not work, or they have an impression that VPNs are these shady VPN reseller companies.

They say they don’t log, but probably actually actively report you to the authorities while you’re trying to pirate movies or whatever. So there’s like two crappy VPNs at different sides of the spectrum. And people kind of hate and are afraid of them both. And then we come in and we’re saying, "we’re a VPN company!"

And everyone’s like, oh,

Deirdre: I hate those things.

Brad: You know, you have to say, like, no, no, but we’re like a cool VPN. And we’re like a different type of

Avery: It’s kind of funny actually, because we initially wanted to avoid the term VPN in our marketing because we know exactly all of the feelings that are attached to the word VPN is like just despair, I guess, is the, is the best way to describe it. But, uh, there’s this technique used in product management, which is you ask your customers how they would describe your product.

And universally, everybody would be like, look, I know you don’t want to be a VPN, but let’s be honest. You’re a VPN. It’s just the best VPN you ever used. It’s the VPN that makes you want to use VPNs. It’s the VPN that, when you install it, totally changes your outlook on life. And I’m like, we should just take that and put it at the top of our website.

And so we toned it down a little, but that’s the kind of stuff that people say, but you have to say it’s a VPN because it is.

And then you get all the confusion about what kind.

Thomas: Everyone kind of hits on, it’s really easy to install, but it does a couple of really big things. Even if you could install OpenVPN as simply as Tailscale, you’d end up with the shitty OpenVPN protocol and not WireGuard, but whatever, even all that aside, right? Two huge things you get kind of out of the box with Tailscale with no effort. One: your access to the VPN is controlled through single sign-on, so you have a single source of truth about identity and login. In particular, you don’t have to have a separate multi-factor setup for Tailscale. Whatever you’ve got configured in Google, like whatever your MFA requirements are, those are the requirements for Tailscale.

Because it’s just drafting off of, I guess it’s OIDC from Google. So there’s that, and then there’s network access controls inside the VPN, which is a thing I never see done correctly with people that try to do their own access VPNs. What I mean there is like you’ve got prod, staging, and a test environment, and you want to give some developers access to testing, but not prod or whatever.

That’s kind of a solved problem with Tailscale. You have a really, really simple ACL system that lets people, you know, decide what... I’m a little bit gushing about this because we rolled this out. We rolled out ACLs, for serious, at Fly a couple of months ago. And it was frustratingly painless.

It’s another one of these projects where I set aside a day for it and it took us, you know, a half an hour. And then we’re just kind of scratching our heads with what to do with the rest of the... you know, that can’t be all there is to it. So those are two big things I don’t see people talking a lot about with Tailscale. But for company access VPNs, for giving people access to internal networks, I was about to use the word game-changer and you know, it’s

Avery: you

Thomas: you

Avery: I try not to use that word, but like,

Thomas: "Changes the game such that the game can never be played the same way again." Yeah.

Deirdre: Paradigm shift.

Avery: Yeah. So the way I describe that is, there’s sort of two kinds of networking products. There’s connectivity products and security products. And the job of a connectivity product is to connect you to things that you might otherwise have not been connected to. The job of a security product is to stop people who shouldn’t be connecting from connecting to those things.

And VPNs, you might imagine, since they’ve got encryption and all of these access controls and logins and keys and things, that they’re security products. But they’re not really, they’re actually connectivity products. The job of a VPN is to connect people into your network who otherwise would not be connected into your network.

Avery: And so most VPNs don’t have an access control layer because that’s the opposite of the job. Right? Access control is what firewalls do. And so you get firewall companies and you get VPN companies and they don’t talk to each other. And their jobs are basically opposed to each other. Right? You have the VPN team, which is probably your IT team, trying to connect people to stuff.

And then you have the firewall team, which is probably a security team, trying to stop them from doing that. And then the two teams are fighting with each other and you’ve got politics and tickets getting filed back and forth. If you can combine the two into one product and just think like, look, my job is not really one thing or the other thing.

I need to connect the right people to the right things. And then Tailscale adds in the like, "and I don’t care where those things are." Right? You’re sort of thinking of it holistically as like, there are people, there are things. How do I connect the people to the things?

Thomas: when I’m connecting into a Tailscale network to get access to my test environment, right. I’m not connecting to my Tailscale, you know, gateway setup that I configured on some crappy box in my internal network. Right? Like I’m connecting to your kind of cloud instance of Tailscale.

So kind of walk us through what’s happening when I do that, like where are my packets going?

And what are the decisions being made along the way?

Brad: So when you first start up Tailscale, the first thing it does is talk to the control plane and say, like, "Hey, I’m going to prove that I own this WireGuard key." The clients always hold on to their private keys, but, you know, you prove that you own the private key.

And then the server tells you what you’re allowed to know, who’s in your network and whatever. And so, you know, all the peers that you’re allowed to see, their names, their DNS names. At this point, the client configures your operating system’s routing tables and its DNS configuration. And if you ever want to send a packet to somebody, we make a decision at that time about how to get that packet to that other WireGuard peer.

We don’t actively bring up tunnels to everybody in your company. If you have like 10,000 people in your company, you don’t need a tunnel to every employee’s iPhone, you know, sending them tons of traffic. So we do that very lazily. And the very first thing we do on the very first packet is we go over DERP, which is kind of like our TURN-like relay network.

Every client is connected to the control plane to figure out, like, what is your map of the world, what we call your netmap. And it’s also connected to a DERP relay in your region. We run like a dozen of them and we’re adding a new one every so often. So like I’m connected to the Seattle one and I’m connected to the control plane.

So if somebody wants to send me a packet, they know that I’m in Seattle and they know what DERP node I’m connected to. So they connect to that and they send me the WireGuard packet, but over that same DERP connection they also send me what we call a disco packet, which is a discovery-level thing that’s unrelated to WireGuard. And that’s all about NAT traversal. So that’s sending out the UDP packets to our STUN servers and to each other, to try to trick our firewalls into opening things up, and then we send to each other. So it’s negotiating that in parallel with copying these packets through DERP.

So ideally we only use DERP for the first few packets and then we find a direct path, and then they heartbeat to each other and keep the firewalls’ NAT mappings open. Because for UDP you only have maybe 30 seconds, sometimes 15 seconds, before your firewall shuts the mapping down.

Deirdre: That’s usually plenty, right?

Brad: Yeah. I mean, it depends how hard it was to set up the mapping about how aggressively we should keep it alive. If it was a very easy NAT on both sides, or one of the sides supported one of the three port mapping protocols, there’s like UPnP, NAT-PMP, and PCP. If your firewall supported one of these, your port was already open and we mapped a port for you.

So if either side is easy, or both sides are easy NAT, then it doesn’t really matter. We can just spin it up whenever there’s packets to go back and forth. So that’s the happy path. The unhappy path is that, you know, you’re in a hotel that blocks all UDP, or both sides are behind FreeBSD routers that don’t do endpoint-independent NAT.

And so they have a different source UDP port number for every different destination, and those ones we just can’t get through. So then those end up getting relayed through our DERP servers in different regions.
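The difference between an "easy" NAT and the per-destination kind Brad mentions can be shown with a toy mapping table. The `natTable` type, the starting port, and the example addresses are all made up for illustration; real NATs have many more behaviors than this.

```go
package main

import "fmt"

// natTable is a toy model of a NAT's outbound UDP mapping table.
// It is an illustration only, not code from Tailscale or any router.
type natTable struct {
	endpointIndependent bool
	nextPort            int
	mappings            map[string]int
}

func newNAT(endpointIndependent bool) *natTable {
	return &natTable{
		endpointIndependent: endpointIndependent,
		nextPort:            40000,
		mappings:            map[string]int{},
	}
}

// externalPort returns the public source port the NAT uses when the
// internal socket src sends to dst.
func (n *natTable) externalPort(src, dst string) int {
	key := src
	if !n.endpointIndependent {
		key = src + "->" + dst // a separate mapping for every destination
	}
	if p, ok := n.mappings[key]; ok {
		return p
	}
	p := n.nextPort
	n.nextPort++
	n.mappings[key] = p
	return p
}

func main() {
	easy := newNAT(true)
	hard := newNAT(false)
	for _, dst := range []string{"stun.example:3478", "peer.example:41641"} {
		fmt.Printf("easy NAT to %s uses port %d\n", dst, easy.externalPort("10.0.0.2:5000", dst))
		fmt.Printf("hard NAT to %s uses port %d\n", dst, hard.externalPort("10.0.0.2:5000", dst))
	}
}
```

With the easy NAT, the port a STUN server observes is the same port a peer can send to. With the per-destination NAT, the port learned via STUN says nothing about the port used toward the peer, which is why those connections end up on the relay.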

Deirdre: And how many regions are you at? Are you deployed in? I’m

Brad: We’re in like 12 geo regions and we run multiple servers per region.

Thomas: So DERP is like, it’s a derivative of one of the TURN protocols, like the WebRTC, like STUN, TURN, whatever things? Like a TCP relay for UDP packets, basically?

Brad: Yeah. So basically it’s like an IP packet relay, but instead of using IP addresses as the source and destination, it’s WireGuard public keys as the unit of addressing. So you send it to a public key and we just route it. We don’t know what the packets are, they’re just encrypted blobs. But when you connect to a DERP server, similarly to control, you prove that you own a private key, and then we’re saying, okay, sure.

Like, you proved you have that public key, so we will route all packets for that public key to that connection. And that’s just, um, that’s just TCP.
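A relay addressed by public key rather than IP address can be sketched in a few lines. This is an in-memory toy with hypothetical names (`relay`, `register`, `send`); it uses channels where the real server uses TCP connections, and it skips the ownership proof entirely.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// publicKey stands in for a 32-byte WireGuard public key.
type publicKey [32]byte

// relay forwards opaque blobs to whichever connection registered a key.
// It never inspects the blobs. Hypothetical sketch, not the DERP server.
type relay struct {
	mu    sync.Mutex
	conns map[publicKey]chan []byte
}

var errNotConnected = errors.New("destination key is not connected")

func newRelay() *relay { return &relay{conns: map[publicKey]chan []byte{}} }

// register would be called only after a client has proven it holds the
// private key for pub; that proof is omitted here.
func (r *relay) register(pub publicKey) <-chan []byte {
	ch := make(chan []byte, 16)
	r.mu.Lock()
	r.conns[pub] = ch
	r.mu.Unlock()
	return ch
}

// send queues blob for the connection that registered dst.
func (r *relay) send(dst publicKey, blob []byte) error {
	r.mu.Lock()
	ch, ok := r.conns[dst]
	r.mu.Unlock()
	if !ok {
		return errNotConnected
	}
	select {
	case ch <- blob:
		return nil
	default:
		return errors.New("destination queue is full, dropping packet")
	}
}

func main() {
	r := newRelay()
	bob := publicKey{1}
	inbox := r.register(bob)

	if err := r.send(bob, []byte("encrypted blob")); err != nil {
		fmt.Println("send failed:", err)
		return
	}
	fmt.Printf("bob received %d bytes\n", len(<-inbox))
	fmt.Println("send to unknown key:", r.send(publicKey{2}, nil))
}
```

Because a key only receives traffic after it has connected and registered, the relay cannot be pointed at an arbitrary victim, which matches the abuse argument made later in the conversation.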

Deirdre: do you just like make a signature over some challenge and if it verifies with the public key that’s—

Brad: Yeah,

Avery: Yeah, it’s a really neat, simple server. Because people connect to it, they say, I own this public key. They prove they own the public key. And then anybody else who’s connected to it can say, I want to send a message to this public key, and it just relays it through the server. So it’s neat in that there’s a lot less possibility of abusing the server, because the only way to get a packet from the server is to have connected to the server in the first place.

Deirdre: Yeah.

Brad: That server is all open source, so we let people run their own if they want to. So a lot of people, when they’re setting up Tailscale, they find that there wasn’t enough to do, so then they’re like, well, I gotta keep configuring it. I guess I gotta run my own DERP server now.

And we’re like, just stop, you don’t need to, it’s extra. And then they’re going to write in support tickets being like, I can’t get this to work. I’m like, yeah, you have to read the docs and open up all the ports and yeah, it’s annoying running your own infrastructure, isn’t it?

Deirdre: I think that’s why you’re doing it for us. Is that on your GitHub org or is it somewhere else?

Brad: Yeah. It’s in the same repo as everything else.

Deirdre: Just the tailscale/tailscale repo.

Brad: Yep.

Avery: like a DERP subdirectory.

Brad: Okay. Uh, cmd/derper.

The binary is called DERPer.

Thomas: From the fact that you can run your own DERP server, I’m kind of assuming there’s no configuration from Tailscale central injected into those things. It’s just anybody with any public key who can prove they own that public key. So, like, other people could,

you know, illicitly build things onto your DERP infrastructure. And right now you don’t have controls about that.

Avery: I had to convince Brad that that was a good idea. I don’t know if he’s fully convinced or not.

Brad: I know nobody has, nobody has abused it yet, cause we’ve never advertised it until five seconds ago. So we’ll, we’ll see.

Avery: so here’s the thing I want to redefine, you know, I want to say not abuse. It’s like taking advantage of a free service that’s out there. Right? So the way I think of it is like anybody who might need to do this kind of like what’s the word?

Um, rendezvous, uh, right. So you’ve got two servers somewhere.

They need to be able to talk to each other. they need to figure out how to do Nat traversal. There is a DERP network out there that can help them do this connection. So what we do is we limit the number of packets we’re willing to forward through the system so that you can’t abuse us by using up all of our bandwidth and spending all of our money.

And we also do some level of fairness between clients. So if somebody is blasting all the packets, it doesn’t ruin the day for anybody else.

Right. And because of that, it doesn’t really matter to us whether you’re using Tailscale, because most of the Tailscale users are using the free plan anyway, or using somebody else’s product that wants to run through our DERP network, right?

We’re sort of providing this helpful internet wide service because most of the time you only need DERP for a few packets.

Brad: We ended up adding a flag to say that only people in your network could use your DERP server, because when we first launched this, people were running them and they’re like, okay, how do I lock this down? And I’m like, I don’t know, you can’t.

Avery: Oh yeah. When you run your own DERP server, people are a little more concerned about it. So there is a flag for this, but our public network doesn’t do that. And that allows us also to create, like, much more privacy. Like, even if somebody broke into one of our DERP servers it wouldn’t really matter, because they never see any decrypted traffic and they never know anything about anybody’s tailnets.

It’s

Thomas: This is highly relevant to my interests. I’ve been nipping at this for, like, the past five minutes you guys have been talking. Right? So we’re an application hosting company. I won’t go into the whole thing here, but think Heroku, and that’s Fly. Right.

And our primary interface to services that people boot up on us is WireGuard. And we’re kind of regularly running into connectivity problems with WireGuard right now in that, like, we’ve got random people on random operating systems doing direct UDP WireGuard to us from behind random firewalls and things like that.

And we don’t have a really good sense of where that breaks down and what blows up there. We have things like, there’s a WebSockets WireGuard relay that we have on all of our gateways that we could potentially downgrade to. But I’m just wondering if I can solve all my problems here with DERP.

Avery: You probably can. There are things that you have to be a little bit careful about. Like for example, it’s our control server that decides which DERP server people on your network should be using for different things. So the simplest thing to do is pick one of our DERP regions and just have everybody rendezvous through that. If you are using multiple regions, then you need a way to decide which region you’re going to rendezvous in, which is a little bit out of band.

Thomas: So in WebRTC, both sides are connecting into the relay server to make that work. For the first packet, for like the first time you talk to a Tailscale peer, how does the other side get, like, what’s the notification to the other side to bring up

Brad: Yeah. So every node is responsible for picking its own home. We don’t use anycast or any fun tricks like that. Literally, the client measures the latency to a whole bunch of regions. It kind of figures out where it’s closest to. And then after that, it tells the control plane, like, hey, I’m using Seattle, that’s my home DERP region.

And so the control plane distributes out the home DERP region to all the peers in the network. So you know that, like, that service is in Frankfurt and that one’s in Singapore. And so it’s up to you to do the whole setup of that DERP connection if you want to send a packet to that person, or that node.

So, you know, that’s TCP and TLS currently, so several round trips to get that connection up. So sometimes, if you’re going from like Bangalore to Seattle or something like that, you’ll see the first packet miss its one second deadline, but then it’s fast after the second packet.

Thomas: Okay. So if I understand this right, then as long as my agent is online on my client or my node or whatever, it’s got a connection to some DERP server somewhere

Brad: yeah, yeah. Two TCP connections are opened.

David: Now I’m imagining someone publishing, say, some sort of distributed hash table or DHT of files. But then instead of communicating over UDP, they’re just doing all of the routing over DERP servers, just for fun, or because it’s an easier way to route packets to people that are behind NATs than trying to write your own

Avery: so they, yeah, they could do that. But the way I think of DERP conceptually is it’s, like some random router on the internet, right? If you are distributing illegal content over the internet, there’s the person who is sending out the packets and then the person who’s receiving the packets. And it goes through a whole bunch of routers along the way, right?

DERP is just a weird kind of router to send things through. We don’t know anything about the content it’s fully encrypted. We never store anything. And we can’t be used to create a denial of service attack because the only way you can get packets from DERP is if you first connect into it yourself, right?

As far as I know, we shouldn’t have any sort of like, you know, if people are using it for this, it’s going to be a very slow way to distribute files because DERP limits your throughput. Right. But in theory, you could share stuff that way.

Brad: Well, technically our DERP servers are also the STUN servers, and that’s UDP. So you could do a reflection DoS through it, kind of, if you fake

Avery: shouldn’t be able to do. Yeah. It shouldn’t do any amplification. So generally that’s okay.

Thomas: But what I could do is I could have all of my WireGuard gateways, I’ve got like seven of them in different cities, right? I would just have each of them connect to its regionally closest DERP server. And then in our database, we would just tell people, if you want to connect to the gateway in Frankfurt, here’s the Frankfurt server.

And that would work.

Avery: Yup. Yup. That would work. And the amount of bandwidth each person gets is certainly usable for things like SSH.

Deirdre: I’m sort of wondering if someone can leverage this, including the WireGuard part, like ephemeral WireGuard devices using a DERP network, for, you know, basically using your Tailscale network, spinning things up ephemerally on the fly, for secure messaging, for secure calling, for

Brad: Yeah. And in fact, I ported Tailscale to WebAssembly. And as part of that, to make it all work, I had to add WebSocket support to DERP. So the DERP servers all also speak WebSockets and our client does that. So you can bring up Tailscale in a browser and get an IP address and run an HTTP server in your browser and ping the browser, and I made the browser flash whenever you got a ping. So you could ping flood it, and then it’s like flash flash flash flash flash.

Deirdre: You’re nerd sniping me into projects, someone. Okay. Free project idea,

Thomas: Hold on. Hold on.

Deirdre: for me, please.

Thomas: Wait, wait, this hadn’t occurred to me. So I did WebSocket WireGuard because I figured the thing least likely to be filtered by the worst firewalls that we would encounter would be actual, real HTTP connections doing WebSockets.

Like if I want to run WireGuard over something, the least offensive thing I could run it over would be WebSockets. It hadn’t occurred to me when I did that, that I could also run WireGuard from my browser through that WebSockets interface, just by giving it the right configuration information.

Brad: great!

Thomas: I can do all of flyctl in a browser with WebSockets?!

Brad: Yeah.

Deirdre: How fast?

Avery: It’s really fast.

Brad: yeah,

Avery: mean,

Brad: no,

Avery: Don’t try to do gigabits per second, but it’s actually quite fine. WASM is way better than I thought.

Thomas: Well, I wouldn’t do WASM. I would like to do a JavaScript implementation of WireGuard.

David: Is there a use case for like WireGuard in the browser that isn’t just like you had something in the terminal, like fly cuddle or fly control and you want to make a web interface.

Brad: Here’s the use case I tell people: let’s say you went on vacation and you didn’t bring a laptop because you’re on vacation. But then you get paged that, you know, there’s some production outage or whatever, and the only computer you can use is a sketchy hotel business center computer. Like, are you really going to type your credentials into that thing?

Like, no, but what you could do is go to like tailscale.com/, I dunno, connect. And it shows you a QR code and you scan the QR code with your phone, which authenticates the Tailscale running in your browser on the sketchy business center computer, and makes it an ephemeral node that’s running in your browser only, where the WireGuard key is just living in that tab.

You connect with an RDP client or SSH client in the browser. Do your work, fix production. You close the tab, and the WireGuard key is gone. So you never typed a password into the sketchy computer.

Deirdre: You need to build this. This is why you need to build this, so that I can use it.

No, no, no, no, no, no, no. I mean the Tailscale people have done all the

Brad: So,

David: This podcast does not condone running code written by Thomas.

Deirdre: Like, it’s a tailscale.com/connect, like, please, please manifest this so that I can use it, because I want to poke at it. This is great.

Brad: Well, just stay tuned, let’s say.

Avery: We had to like very carefully constrain, which order we do things in because it just, you know, there, there’s a reason that, you know, you asked me earlier, what, what was our motto? Right. It’s like the new internet, when you actually start realizing what’s possible, you start realizing that this is the internet that you wish somebody else had built.

Right? All the stuff you thought you could do on the internet. And then when you learned about computers, the more you learn, the more you realize all that stuff was impossible. It’s not supposed to be impossible. Right? It’s just a series of accidents in the way the internet evolved that made all these things not work.

Like, why is it so hard for me to SSH from a web browser? That should not be so difficult, right? It’s hard because of all the stuff that went wrong.

Deirdre: And then because of the “it’s fine for us” decisions that were made like 30, 40 years ago, just sort of like, “This doesn’t need to be encrypted or authenticated or have integrity. It’s fine. We all trust each other, right?”

Avery: Exactly. And then we didn’t, and then we added firewalls, and then we ran out of IP addresses. So we added NATs, and now we just can’t connect to anything except, like, AWS and Google.

David: The other thing that I think was really important is what Jason had said, oh, well, whenever we recorded that episode, which was that the key distribution and the transport are, like, completely separate.

Deirdre: Yeah.

David: And that enables a lot of stuff. Whether it’s, like, the Tailscale product with WireGuard, or even what Brad was just talking about with the QR code.

Again, we’re doing the key distribution and the authentication via a separate channel from the thing that we’re actually trying to use. And that’s what’s enabling a bunch of this stuff.

Avery: Yeah, that’s really valuable because one of the, one of the beautiful things about WireGuard is they ripped out all the parts that they couldn’t figure out how to do in a provably secure way. And they just punted that to somebody else. And then they did only the part that they could actually say, this is absolutely positively the best, most secure thing I can do.

And so that’s, that’s basically what Jason works on. And so WireGuard is super clean and then Tailscale takes on all of this hairy, disgusting stuff that you can’t prove is correct, because it involves like, OAuth, and authentication and that will never be provably correct. But it’s, it’s something that we can put all of our energy into because Jason solved what I call the data plane.

Right. Which normally when you’re building a VPN product is like 90% of the work you do.

David: Plenty of gnarly stuff that he’s doing too, but different shape for sure.

Thomas: Brad, you’re allowed to not answer this question for reasons of competitive intelligence, but am I right that what you basically did was just build go-wireguard with WebSockets and load it into a browser?

Brad: No, like, I started by, uh, you know, I set the GOOS=js GOARCH=wasm environment variables. So I tried to, you know, build things, and I would see what didn’t compile. And I adjusted build tags, added more and more stuff. I mean, the hardest thing was adding the WebSocket wrapper around the DERP protocol, but that wasn’t hard at all.

it went incredibly quickly, like the first day went so far that I was like, well, crap. Now I have to finish this. It turned out to be easy, you know? And then it was like, it took a week after that. But

Deirdre: Like setting up on Tailscale!

David: I’ve never, like, done a cross compile to WASM. Like, what gets output? Do you just get a JavaScript bundle? What actually gets written out by the compiler at the end?

Brad: You get a .wasm file, which is kind of a bytecode thingy. So, I dunno, it’s like a 15 MB uncompressed blob, and it compresses down to like two MB or something like that. And then you have this HTML wrapper with some JavaScript in it, like three lines of JavaScript that say, uh, you know, promise, instantiate, load stream, blah, blah, blah.

And the name of the file. And then it passes control, or, you know, you call a function in that WASM module once it loads. And then we can take over and do whatever. You know, you have access to the DOM, and so you can create elements and all that stuff. So, I mean, I embedded the xterm.js project. I made a fake console, and I made it so you could run the whole Tailscale CLI in a fake console in the page.

So you can run tailscale status and tailscale ping and all of

Deirdre: Yes. Oh my God. I think xterm.js is also, and it might be a different one, what’s used when you SSH into Google Cloud instances in the browser, and they just have, like, a link in there. It’s a very useful little project.

Brad: yeah. There’s a whole bunch of these projects, but they require you to like, you know, run a reflector server somewhere to do this stuff for you. But

Deirdre: yeah.

Brad: Yeah. And the code is up on a branch. It’s all open source. The Tailscale WASM stuff is in, like, a WASM test branch or something like that.

Deirdre: I’m, I’m going to go find it.

Brad: I have some tweets with videos.

David: Back on the control plane: is there anything interesting to know about how the identity and, uh, OAuth and zero trust, for lack of a better word, features work? Or is it just like, yeah, we do OIDC and SAML and then you get a key?

Avery: it is kind of, kind of that boring. The, the hard part about trust, generally, I don’t really like the term zero trust. I realized this is another one of those things where you ask people, you got the right security people and they’re like, oh, this is a zero trust product. And I’m like, okay, it’s a zero trust product.

We will put that on our website, but it’s kind of dumb, because there’s no such thing as not trusting in cryptography, right? You have to decide who you’re going to trust. That’s the hard part. And so what you really want is trust. How you get that trust is this complicated path. And so I think it was Thomas who said something about drafting behind OIDC.

And I think that that’s really important because the initial trust establishment is the most impossible part of any crypto system, right? And you already trust your login provider, whether it’s Google or Microsoft or GitHub or Okta or whatever, if you already trust somebody, then you can trust that somebody to introduce you to somebody else.

And so the control server outsources that trust to Google or whoever. And we just basically say, okay, well, I now know that this public key does belong to this person who successfully logged into Google. Now I can tell other people that. So you trust the control server, and the control server trusts Google.

And it all works. This is also why we don’t do like username and password authentication because that involves a whole new enrollment step, which would have doubled the complexity of our control server, right. With all the account recovery and everything else.
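The trust chain Avery describes (you trust the control server, the control server trusts the login provider) can be illustrated with a toy model. This is a minimal sketch, not Tailscale’s control server: the class names, the token strings, and the `peers_for` method are all hypothetical, and a real provider would verify a signed OIDC token instead of doing a dictionary lookup.

```python
from dataclasses import dataclass, field


@dataclass
class IdentityProvider:
    """Stand-in for an external login provider (hypothetical)."""
    name: str
    sessions: dict = field(default_factory=dict)  # login token -> user email

    def verify(self, token):
        # Assumption: a real provider validates a signed token; this is a lookup.
        return self.sessions.get(token)


@dataclass
class ControlServer:
    """Binds node public keys to identities the provider vouches for."""
    idp: IdentityProvider
    node_owner: dict = field(default_factory=dict)  # public key -> user email

    def register_node(self, public_key, login_token):
        user = self.idp.verify(login_token)
        if user is None:
            raise PermissionError("login provider did not vouch for this token")
        self.node_owner[public_key] = user
        return user

    def peers_for(self, public_key):
        # Only registered nodes learn which other keys belong to whom.
        if public_key not in self.node_owner:
            raise PermissionError("unknown node")
        return {k: u for k, u in self.node_owner.items() if k != public_key}
```

Note that the control server in this sketch never stores a password: enrollment and account recovery stay with the login provider, which is the complexity Avery says they avoided.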

Deirdre: And, like, if you’re using Google or whatever for your identity management, they have this entire global infrastructure around spam and, like, figuring out if you logged in from an IP address in some far-off country that you’ve never logged in from, and then they won’t let you log in because it’s being weird.

They have like all this stuff around identity and protection, and then you can log in with your YubiKey and advanced protection turned on for your account. All that stuff is just like hooked right into Tailscale. just, it’s very nice. Very, very nice.

Avery: And we didn’t have to implement any of it, which is my favorite part.

David: I feel like zero trust, a lot of the time, at least when I would hear it, I would hear, like, ah, that means OAuth instead of a VPN. Then when you build a zero trust VPN, you’re like, oh shit, what am I doing?

Avery: Yeah. So what zero trust actually means: I think the term, like so many things, was a good idea, and then it got stolen and lost all its meaning, right? But the original meaning is that you don’t trust the physical network. Once upon a time, people would go into the office, plug into the ethernet port, and have access to everything.

Because we assumed that if they can get in the front door, then they must be secure, right? And then wifi came along, and they invented this, my favorite engineering team joke ever: they called it WEP encryption, Wired Equivalent Privacy. It was like, wait a minute, wired doesn’t have any privacy.

But you made the equivalent, and sure enough, somebody broke it like six months later. And it’s like, this is terrible. And nobody said, because some legal or marketing team must have stopped them, but nobody said, “This is exactly what we promised.” Right? Then WPA2 is better, but it’s still like, you get into the network and now you have access to everything.

People will have, like, a dashboard with no login process or logging on it, or, you know, a database with default credentials and all this stuff. Like, trusting the network is kind of dangerous. Um, well, the physical network, anyway. So what zero trust really does is say: don’t trust the physical network.

That’s a disaster. You need a way to establish trust before you let people do anything. Once that trust has been established, maybe it’s not so bad to have a database with default credentials because you can control exactly who can access that database with the default credentials.

Thomas: So the net experience of using this is, like, a new developer joins the company, right? And instead of having to have them go through a VPN enrollment step, there’s just, by dint of being in the Google organization and signing on through Google with your MFA requirements and all that stuff.

Right? All they have to do is install Tailscale, and that just works. And then they’re kind of on our network, but, like, by default being on fly.io’s Tailscale network doesn’t give you anything, right? Like, I still have to add you to a group to give you access to anything meaningful on that network.

Teleport is a great example of this, right? Like, we do some Teleport stuff for SSH now that replaces a really janky bastion system that we had before that. Right? I wouldn’t run Teleport on a, you know, star-can-access-star actual system, just because I don’t trust Teleport enough to. But I have, like, a whole extra level of: you had to go to this OIDC provider to prove your identity, and you have to go through WireGuard, and you have to go through ACLs to get to it, and you have to be in the right group to access SSH.

And that I find really reassuring. Like I really,

Avery: I mean,

Thomas: would have a hard time underselling the ACL component of this whole system.

Avery: Yeah. So, I mean, I’m very proud of it, don’t get me wrong. I have to go back to your original question about, like, what can I do other than adding people into groups. We do make this distinction between people and services, right? So we have this thing called ACL tags, which I imagine you’re using. That’s because you configure ACLs that way.

Thomas: Is there a better way to configure ACLs?

Avery: Well, I mean, the default ACLs in Tailscale just allow anybody who’s allowed to connect to connect to everything. So that’s an interesting trust choice, but in a small organization that’s actually relatively safe, right? You’ve got 10 employees, and probably none of the 10 employees are, like, NSA spies. So not doing ACLs just means that all of your stuff is only available to those 10 people, as opposed to, you know, if I’m running a public server, not doing ACLs means my stuff is exposed to like 3 billion people or.

Thomas: But you’re not really using Tailscale. I’m trying to think of the right way to articulate it, but it’s like you’re not really all in on this if you’re using, like, default everyone-gets-access-to-everything ACLs, right? It’s not just that ACLs, you know, restrict down the amount of access that people get to your internal network or whatever. It’s also that it changes the way you design other security controls. So there are things that we expose now in Fly to, you know, team members that I would not expose if we didn’t have ACLs. I just wouldn’t deploy them at all.

Avery: When you bring up a service, you can tag it as, say, you know, dev server or prod server. And then you can make an ACL that says anybody in the eng group can access a dev server, but any person can access a prod server, but dev servers and prod servers are not allowed to talk to each other.

Right? So you get this interesting world where you’ve tagged something as a prod server, and now it accepts only incoming connections and can’t connect outward to anything else in your network. And dev servers, you could set them up to, say, be able to connect to other dev servers, but not connect outside of dev servers. But not everybody in your company will be allowed to connect to your dev server.

Any engineer might be allowed to connect to your dev server. So you don’t necessarily have to name every single person in order to have security. It’s easier if you tag the servers first and then figure out if you really need to lock down by individual person.
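The tag-based rules Avery walks through can be modeled in a few lines. This is only an illustrative sketch under stated assumptions: it is not Tailscale’s policy file format, and the names `GROUPS`, `RULES`, `person:*`, and `allowed` are invented here. Sources are either people (an email address) or tagged servers (a string starting with `tag:`).

```python
# Hypothetical policy: who belongs to which group, and which sources may
# reach which server tags. Nothing here is Tailscale's real syntax.
GROUPS = {"group:eng": {"alice@example.com"}}

RULES = [
    ("group:eng", "tag:dev-server"),       # engineers can reach dev servers
    ("person:*", "tag:prod-server"),       # any person can reach prod servers
    ("tag:dev-server", "tag:dev-server"),  # dev servers may talk to each other
]


def allowed(src, dst_tag, groups=GROUPS, rules=RULES):
    """Return True if src (a person or a tagged server) may connect to dst_tag."""
    is_person = not src.startswith("tag:")
    for who, target in rules:
        if target != dst_tag:
            continue
        if who == "person:*" and is_person:
            return True
        if who == src:
            return True
        if is_person and src in groups.get(who, set()):
            return True
    return False  # default deny: anything not listed is blocked
```

With these rules, `allowed("tag:dev-server", "tag:prod-server")` and `allowed("tag:prod-server", "tag:dev-server")` are both false, so dev and prod servers cannot talk to each other even though no individual person was named in the policy.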

Thomas: yeah, that’s how we do staging environments for like full stack developers, people that spend most of their time doing kind of front end code, have different levels of access than people that are working on prod.

I think like,

I think I,

David: how full-stack means front end developer now.

Thomas: You know, it does. Let’s move on. You know, Lubian, who’s awesome and a full stack developer on our team, would probably smack me for saying that. It’s certainly not the case, but whatever. A thing I kind of want to hit on here is the other side of this coin, about how, like, okay, I feel comfortable with deploying Teleport in our environment because we have this set of Tailscale controls for it.

The flip side of that is Teleport is on prem. Like, we run it ourselves. It’s open-source code, you know, it runs on a server that we control. It goes through GitHub SSO. You can’t do anything with it unless you have a GitHub credential or whatever. Also, there’s not that much in our environment. I would have a hard time thinking of something in our environment that, if Tailscale got owned up, we would transitively be directly owned up with.

But, but it’s, it’s, it’s an interesting thing. Like when we were first talking about doing Tailscale at fly, it was like, well, we’re like morally the same thing as AWS. We’re not practically by any means the same thing as AWS, but morally we’re the same thing as AWS. And it’s hard to imagine AWS deploying Tailscale in that they would have to factor all of the trust they have for every organization that they run into Tailscale.

Which doesn’t seem like a reasonable thing to do. How do you guys think about that? Right? Because essentially, for these network controls, you guys are the source of truth and the authority on who’s allowed to talk to what, who’s going to see which nodes, and all that stuff. Like, how do you wrap your head around the amount of control that you guys have over people’s networks?

Avery: Right. Well, let’s go back to what I was talking about with trust, right? If you can bootstrap your trust from somebody else that you trust, then everything is easy. If you truly have zero trust, then everything is hard, right? So Tailscale fundamentally is about, like, look, you don’t want to trust everybody, but maybe you can trust Tailscale.

If you trust Tailscale, then your life is going to be easy. Now, what does trusting Tailscale really mean? Well, first of all, you have to trust that the software we’re sending you is actually not filled with, uh, Trojans and stuff, right? Which is actually something you trust about every software provider that you download software from.

and that’s actually scary. We can talk about software supply chain stuff. I’m sure you’ve, you’ve thought about software supply chain a lot,

Brad: But being open source mitigates a lot of that.

Avery: It does. But, I mean, if you download Tailscale from the app store, who knows what’s in the Tailscale from the app store, right? There’s no way for you to prove that that came from our open source repository.

Right. You kind of just have to trust that we did that correctly.

Right.

Deirdre: the moment we

Avery: there’s there’s

Deirdre: But yeah, we, the ecosystem that distributes the thing, with transparency logs, and

Avery: Yep, exactly. And we were working on all that stuff because the bigger the customer, the more they think about this kind of thing, and we have to get to that point, but I think we are, we’re always going to be. Like our philosophy is going to be, we have to be the trustworthy person in the room so that you don’t have to deal with all of that stuff because that’s, that’s the job that you’re sort of hiring Tailscale to take care of for you.

If you don’t trust us to exchange your keys for you, then you should exchange your keys yourself. And then we don’t really have much to add.

Thomas: Yeah, I don’t worry at all about the client, right, and the supply chain stuff. ’Cause you’re right, it’s a risk I accept for everything already. But, like, you know, I don’t believe that you guys have server-side vulnerabilities that let people bypass your authorization controls, but if you were any other provider, I would assume that you had them. Like, I would assume that they’d be there and very bad.

And in this case, if you’re running a pretty typical Tailscale configuration, then those vulnerabilities give anybody with an account on Tailscale access to your internal network, which is game over. It’s why server-side request forgery is a game-over vulnerability now. It’s the exact same thing, right? If you’ve got access to somebody’s internal network, that’s the ballgame.

Avery: Right. I mean, if we had a major security hole that, for example, disabled ACLs, right, that would, of course, allow people to disable ACLs on your network and connect from any machine to any machine inside.

Thomas: Yeah, so,

Avery: But that’s true of any key distribution system that you might use for, say, WireGuard. And most likely, for people building their own key distribution system, because they won’t have the time and attention to spend on it that, you know, a standalone company doing this for many, many customers has, a hand-rolled one will probably have more security holes than the one we’re building, at least.

Thomas: Yeah. A hundred percent. Yeah. And just to be clear, like, I trust you guys pretty much completely, and we use you guys. My only problem with you is that, uh, you said earlier that you were the internet that we wish we had when we were first coming into the field, and me and David both feel like you’re the internet that we wish we were building right now.

And we’re kind of upset that you built it for us,

Avery: Well, you’re building a different part of it, right? One of my friends told me a long time ago that there’s only three parts of computers: there’s the processing, the storage, and the connectivity. Right? And we are doing the connectivity, and you guys are kind of doing the processing.

Uh, of course the processing has to be connected to something, but it’s a little bit different than what that, than the kind of connectivity that Tailscale is doing. Right.

And I think at this point, nobody has really solved the storage problem and it drives me crazy.

Thomas: Totally, random question. Are you guys named for the paper?

Avery: Ah, well, glad you asked. So yes, in fact, the original joke was because of, uh, the paper by Google from a few years ago called “The Tail at Scale.” It’s a really interesting, exciting paper, well worth reading, about how, when you have millions of computers and petabytes of data, like terabytes of data transfer, every little tiny, super, incredibly infrequent problem that you can imagine will happen, probably multiple times per day.

And so here’s all of the really neat computer science you can do to prevent those things from hurting your product, which, you know, when you’re Google and you’re running things like Gmail and Google Maps and Google Search, you absolutely have to do. And it’s very exciting. And Google people do talks about all the really neat computer science-y things they do to solve these problems.

Tailscale is a little bit, I very much appreciate the paper, so, you know, we named it out of appreciation, but actually I’m flipping it around. So what we’re saying is that the long tail of products are never going to be that big. Almost everybody building almost everything doesn’t have any of those problems, because they need like one or two servers, and these long tail problems simply won’t happen. So we’re actually talking about the scale at tail, as opposed to the tail at scale. And we basically named the company before we ever built a product or even decided what we were going to build.

Uh, the idea was like, I’m so frustrated by everybody. Overdesigning everything, because Google does such great presentations about computer science stuff. Right. It’s

Deirdre: Yeah.

Avery: that everybody wants to do, like, Raft consensus algorithms. It’s like, well, you probably don’t need a Raft consensus algorithm. You probably need a MySQL server.

Right.

David: I remember when, uh, the MapReduce paper came out. It had, like, that one paragraph in it that was like, sometimes some computers are slow, so stuff gets stuck, and we just kill it and rerun those to make the job finish. And then academia for years was like, let’s optimize that piece of the process.

And it’s just like, literally no one needs to do that. Like, the one person that this was a problem for was Google, and they already told you how they fixed it. And it’s fine.

Deirdre: It works at Google scale. It’s fine for you. This whole thing, that the long tail actually matters when you are a planet-sized computer, or approximately planet-sized computer, reminds me of AWS running into, like, when you are HMACing things billions of times a day for your S3 authentication tokens, you run into the very, very, very, very small numbers of, like, collisions or reuse of nonces or whatever it is that, you know, people doing symmetric cryptography are usually like, this will only happen a very, very tiny fraction of the time.

So it’s probably fine. And it’s like, AWS is like, Yeah, we do that every day. It’s not fine for us. And we had to tweak it and it’s like, that’s really cool. But you are not AWS, so you are fine and you don’t need to worry about it.

Avery: Yeah. So I think one of the reasons Thomas finds Tailscale so infuriating is that we started from this one magical assumption that we are not going to solve the problems for the Googles and the AWSes of the world. We’re going to solve the problem for basically everybody else, who doesn’t have any of those problems.

And when we do that, we can focus on the problems you actually have and make them go away. If we’d spent all of our time on consensus algorithms and just giant distributed networks so we could have a billion nodes, blah, blah, blah, then we still wouldn’t have launched, right? And when we did, it would have, like, a 10,000 line configuration file of where to run all of your replicas and MapReduces, right?

And we wouldn’t have solved any of these problems that we’re solving. We’re solving these problems because we started with a focus on, like, look, I’m a tiny development team building some product, or, like, a dashboard I want to run internally. How can I make that dashboard super easy to deploy?

Thomas: We’re having a lot of Raft heartburn right now. Like, Raft is never the answer. It’s part of a moment that we’re having here. Do you guys have any distributed consensus internally at all? Are you just one SQLite database?

Avery: We use Slack. Um, we argue on Slack a lot, and I guess that counts as distributed

David: Weren’t you guys a single JSON file for a while? Not even

Brad: Yeah, we had a JSON file on disk, and on every mutation we’d grab a mutex in memory and rewrite the file. And that lasted for a while, until it didn’t work at all.
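The “JSON file as the database” pattern Brad mentions can be sketched roughly as follows. This is not Tailscale’s code (theirs was in Go); `JSONFileDB` and `mutate` are hypothetical names, and the write-to-temp-file-then-rename step is an added assumption to avoid leaving a half-written file, not something described in the conversation.

```python
import json
import os
import threading


class JSONFileDB:
    """Whole-database-in-one-JSON-file store, rewritten on every mutation."""

    def __init__(self, path):
        self.path = path
        self.lock = threading.Lock()  # one big lock around all state
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def mutate(self, fn):
        """Apply fn to the in-memory state, then rewrite the entire file."""
        with self.lock:
            fn(self.data)
            tmp = self.path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(self.data, f)
            # Assumption: atomic rename so readers never see a partial file.
            os.replace(tmp, self.path)
```

Every write costs time proportional to the size of the whole database and all writers serialize on one lock, which is consistent with Brad’s remark that the approach lasted for a while and then stopped working.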

Avery: How did we not make the title of that blog post “The Scale at Tail”?

Brad: yeah,

Avery: We’re never going to get that one into an academic paper: “Here’s how we launched our company with a JSON file as the entire production database.”

Brad: yeah.

David: I mean, for years when we were running Censys as a research project at Michigan, it was primarily running inside of a screen session under my username, which we then transferred to the company when we started the company. And then I’m told that some of that stuff was still running in a screen session under my username, even though my account couldn’t log in anymore. That just got killed when they migrated data centers. So there you go.

Deirdre: I kind of hate it. That that gives me the willies. Like what if I don’t know where that was running in the Michigan infrastructure, but I can just imagine like normal service, um, maintenance window:

kicking!

David: No. So here’s the actual risk if you’re running something in, like, a good data center, and, like, at colleges: the answer is that another grad student is going to accidentally come into your rack and unplug your stuff, ’cause they think it’s theirs. That’s why we would lock our rack. It was specifically to keep the other students from coming in and unplugging our stuff.

They still found ways to do it. but that was most of our downtime when we were in academic

Deirdre: Or someone else logs into the very big, beefy machine to run their experiment, and they’re like, why isn’t it running? What else is running? Killing, you know, kill dash whatever.

David: well, the benefit of that was we were the admins of our own networks. We kicked all

Deirdre: Nice. There you go. Okay. That gives me a lot more confidence that it was safe to just leave forever. Wow. Okay.

Thomas: Is it, is it a single big SQLite database right now?

Brad: yeah, we were on etcd for awhile and we ran three etcd nodes and that got annoying and it was weird for people and we just migrated off that. And so now we’re just on

Avery: With Litestream.

Brad: Yeah. So yeah, it streams the write-ahead log into S3 every, like, second or something.

Thomas: Yeah, we we’ve talked about it a bunch. How’s that working out for you guys?

Brad: It seems to work. We have tests that restores work, but we’ve never had to do an emergency restore yet. But, you know, the unit tests say it works.

Avery: Yeah, architecturally, the reason we can get away with building it that way: we switched to etcd because we thought we needed to have at least one part of the system that was really redundant and reliable. And we eventually realized we had architected the entire system so that if the control plane goes away, even for hours at a time, people’s networks are not going to be affected, because the DERP network is completely distributed.

It doesn’t depend on the control server. All of your nodes and the data plane don’t depend on the control server. It’s just inserting and deleting keys that depends on the control server. So a couple of minutes of downtime in the control server is actually pretty much completely harmless. And if you think, oh no, the DNS: ha ha, no. MagicDNS runs entirely inside your Tailscale node and has no dependency on our Tailscale infrastructure whatsoever. That’s a whole different thing we can talk about if you like, but yeah, if the control server is down for a couple of minutes, it’s not that bad. So having a single source of truth that we then replicate up using Litestream, and can then recover really fast if it goes down, is actually good enough for an entire world to run Tailscale, if that were necessary.

Deirdre: Wait. So my Tailscale instance on each device is hosting its own DNS?

Avery: That

Brad: Yeah. Yeah. If you have Tailscale on your iPhone, your iPhone is running a DNS server just for

Avery: And this is so exciting to me, I just want to point it out. The reason we called it MagicDNS, despite everyone at the company trying to convince me not to call it magic, is that it actually solves two extremely major problems with DNS. One of them is cache invalidation, right? Normally DNS has a cache delay time. And if you don’t want to overload your infrastructure, you need to set that delay time higher. But if you actually want to be able to update things, you need to set that delay time lower, and there’s no way to win, right?

So the way MagicDNS works is, when one of the names is supposed to change, we send push notifications from the control server, the same way we send push notifications for everything else. So you get an instant update to your local copy of DNS, and you’re not going to overload any infrastructure, because there are no DNS packets going out over the network. The DNS is resolved locally.

And also, because we’re never sending the DNS packets out over the internet, or even over the private network, we don’t have to worry about DNSSEC or any of that cryptography stuff, because the packets are all internal. So there is no encryption to deal with. The push notifications are sent out over control, which is over TLS, but you don’t have the delay, because they were sent in advance.
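The push model Avery describes, where each node holds a full local copy of the name table and the control server pushes changes instead of nodes polling with a TTL, can be sketched like this. It is a toy illustration only: `LocalResolver`, `PushingControlServer`, and the example hostname and addresses are hypothetical, and a real system would deliver pushes over an authenticated connection.

```python
class LocalResolver:
    """Per-node name table, changed only by pushes from the control server."""

    def __init__(self):
        self.records = {}

    def apply_push(self, name, address):
        # A push with address=None removes the name.
        if address is None:
            self.records.pop(name, None)
        else:
            self.records[name] = address

    def resolve(self, name):
        # Answered entirely from local state: no network query, no TTL.
        return self.records.get(name)


class PushingControlServer:
    """Fans every name change out to all attached nodes immediately."""

    def __init__(self):
        self.nodes = []

    def attach(self, resolver):
        self.nodes.append(resolver)

    def set_name(self, name, address):
        for resolver in self.nodes:
            resolver.apply_push(name, address)
```

Because lookups never leave the node, a control server outage in this sketch only stops new changes from arriving; names that were already pushed keep resolving.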

David: So the local DNS server has a TTL of like one, or something incredibly

Avery: low.

Deirdre: Oh my God.

Thomas: yeah.

Avery: And as with everything DNS, there’s always a catch, which is that integrating this with your operating system’s local DNS turns out to be, like, just updating resolv.conf is where all the problems are. It’s ridiculous.

Brad: But it’s cool because it lets us do things like upgrade: if your operating system doesn’t support, like, DNS over HTTPS or DNS over TLS, we can upgrade it to that. So it thinks it’s speaking, like, plain ol’ UDP port 53 to us, which is on localhost effectively. And then we do DNS over HTTPS to, like, Google or Cloudflare or whatever you want.

So we do that, like, transparently. If you’ve told us that you wanted to use 8.8.8.8 or 1.1.1.1, we’re like, oh yeah, yeah, we know that one, that one supports something better than UDP.
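One way to picture the transparent upgrade Brad describes is a lookup from a configured resolver IP to a known DNS-over-HTTPS endpoint, falling back to plain UDP when the resolver is not recognized. This is a minimal sketch, not Tailscale’s implementation: the table and the `upstream_for` function are hypothetical, though the two endpoint URLs are the publicly documented ones for Google Public DNS and Cloudflare.

```python
# Well-known public resolvers and their documented DNS-over-HTTPS endpoints.
DOH_ENDPOINTS = {
    "8.8.8.8": "https://dns.google/dns-query",
    "8.8.4.4": "https://dns.google/dns-query",
    "1.1.1.1": "https://cloudflare-dns.com/dns-query",
    "1.0.0.1": "https://cloudflare-dns.com/dns-query",
}


def upstream_for(configured_ip):
    """Pick the transport for forwarding queries to the configured resolver."""
    url = DOH_ENDPOINTS.get(configured_ip)
    if url is not None:
        return ("doh", url)  # known resolver: upgrade to DNS over HTTPS
    return ("udp", f"{configured_ip}:53")  # unknown resolver: plain UDP port 53
```

The operating system still sends ordinary UDP queries to the local resolver; only the hop from the node to the upstream resolver changes.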

Deirdre: This is fabulous.

Thomas: A frequent topic of conversation of ours is would we rather be fly or would we rather be Tailscale in terms of the problems that were like your, your breakdown of like the, we would probably just see there’s two problems in computing, right?

There’s the Tailscale problem and the fly problem. So, but w we have a really similar DNS setup to what you’re describing. you know, w we don’t have to worry about DNS encryption because we’re the source of truth for all that information. And it’s all going over WireGuard and all that But ours has to be consistent on hundreds of different machines around the world. And it can’t go down. And when it, when it’s inconsistent, anywhere users immediately noticed, cause their databases stopped working and stuff. Um, and it sickens me because if I was working on your problem, I w I wouldn’t have that problem.

Avery: I mean, Tailscale’s MagicDNS is consistent, right? As soon as any information changes, we distribute that change to everybody through the push notifications. And then if the control server temporarily goes down, it doesn’t matter because you still have your MagicDNS synced locally.

Thomas: Yeah, it might just be the case that your control server can go down. And what that really impacts is you can’t enroll things and that doesn’t happen very often. And for us enrollments happen, you know, many times a minute. So if it goes down, like you can’t bring up a new instance of your application and then you’re kind of, you’re kind of boned

Avery: fair.

Thomas: or else we would be using, I think, Litestream. I mean, I’m very concerned that Kurt is going to hear this and then ask me to replace all the stuff that we’re doing with Litestream. Cause he really

Avery: Yeah. Litestream is really neat because it doesn’t get into the high-speed, high-performance path of SQLite, right? You do all the stuff you normally do with SQLite, exactly like you’d normally do it, and Litestream just picks stuff up and streams it as fast as it can. And you can have a hot spare sitting there synchronized via Litestream in real time.

And so if the main one goes down, you can switch over.

Thomas: Yeah.

Avery: It’s pretty neat, but like, you know, it’s well worth thinking about, and this is very much, like, the Tailscale way of thinking about things, right? It’s well worth thinking about whether two minutes of downtime, extremely rarely, is actually going to be less downtime than you get from a stupid Raft thing. Right? Cause it’s so complicated that these things go down because the complexity just caused it all to crash. Right? When you don’t have the complexity, it’s still going to go down because you have this guaranteed possibility of downtime. Right. But you can make it go down for a very short time and have a very recoverable system.

David: so you just have one instance of the service that talks to SQLite, basically because it’s more or less impossible to have two connections to the same SQLite database

Brad: Yep. We can have as many as we want on the same machine, on the same file system. So, you know, we could do, like, blue-green ones and stuff like that,

Avery: and we can do old fashioned sharding between tailnets.
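As a rough illustration of what sharding between tailnets could look like, here is a minimal Go sketch. It is purely hypothetical: the conversation does not say how Tailscale would assign tailnets to shards, and `shardFor` and the tailnet names are invented for this example. The only idea it relies on is that tailnets are independent of each other, so each one can live entirely on a single database.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor deterministically maps a tailnet identifier to one of numShards
// databases. Hash-modulo is the simplest scheme; it is an assumption here,
// not something described in the episode.
func shardFor(tailnetID string, numShards int) int {
	h := fnv.New32a()
	h.Write([]byte(tailnetID)) // hash.Hash.Write never returns an error
	return int(h.Sum32() % uint32(numShards))
}

func main() {
	for _, id := range []string{"example.com", "alice@example.org"} {
		fmt.Printf("tailnet %q -> shard %d\n", id, shardFor(id, 4))
	}
}
```

Because the mapping depends only on the tailnet identifier, every request for a given tailnet lands on the same single-writer database.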

Thomas: it’s a little frustrating for us because we don’t control our users’ network environment. So like, we rely on a lot of low-level network services that we run on our platform that are exposed through our command line tool.

But the only way you use Fly is with flyctl. So flyctl is talking to a different DNS server. It’s making its own WireGuard connections and all that, but that’s all contained within the flyctl process. Your OS doesn’t know about it. So we rely on being able to override, for instance, what DNS server Go is talking to, like to have a resolver that points to a different DNS server, which is really clunky to do in Go DNS.

I’m actually more curious, just in general, how you guys are feeling about Go. I feel Go versus Rust is one of the big questions where we are undecided. We have some really solid, some really committed Go people and some very committed Rust people on our team.

And we work in both things and, I don’t know. I know it’s come up on your guys’ side. I’ve read presentations from team members of yours that mentioned Rust.

Brad: I gave a talk like six or seven years ago where I said, if we get to a point where people are debating, like, Go versus Rust, we’ve already won because nobody’s talking about C and nobody’s talking about like, you know, slow-ass scripting languages. So like, yeah. Great. If we’re not running like Ruby and C, great.

I don’t care. Like I think Rust and Go are perfectly acceptable alternatives for different types of problems. So.

Deirdre: yeah. And one thing that Go already has in its favor is it has a strong and somewhat opinionated standard library where they are like putting a bunch of stuff in there and you can just pull from it and you can be like, this is the Go, you know, version of a crypto primitive or whatever. Like there is a crypto library in the Go standard library and you can just reach for it and trust it.

And I know the people that maintain that

Brad: oh yeah, it’s not just the standard library’s opinionatedness itself. Also like, you know, the runtime, like there is exactly one scheduler. You don’t get to, like, pick Tokio versus 20 other options. Like, no, like we’re telling you the threading model and it’s safe to block because blocking will never actually stall anything.

So, like, you know, you don’t have to be careful with your scheduler or anything. Likewise, you know, the garbage collection, like you can try to avoid making garbage if you want, but like it’s there.

Avery: We have done a bunch of work to specifically avoid garbage collection in our fast path. Right. And it actually works. You can do this in Go. In most garbage-collected languages, you just wouldn’t be able to avoid producing garbage, and Go is designed well enough that if you write your code very carefully, you can make the fast path not produce a bunch of garbage and therefore.
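One common way to keep a Go fast path from producing garbage is to have the caller own a buffer and append into it, instead of allocating a fresh slice per packet. The sketch below shows that pattern only; it is not taken from Tailscale’s code, and `appendFrame` and its two-byte length prefix are invented for illustration.

```go
package main

import "fmt"

// appendFrame appends a 2-byte big-endian length prefix and the payload to
// dst and returns the extended slice. If dst has enough capacity, no memory
// is allocated and nothing is left behind for the garbage collector.
func appendFrame(dst, payload []byte) []byte {
	n := len(payload)
	dst = append(dst, byte(n>>8), byte(n))
	return append(dst, payload...)
}

func main() {
	// One buffer, allocated once, reused for every packet.
	buf := make([]byte, 0, 1500)

	for _, p := range [][]byte{[]byte("hello"), []byte("tailscale")} {
		frame := appendFrame(buf[:0], p) // buf[:0] resets length, keeps capacity
		fmt.Println(len(frame))
	}
}
```

The care Avery mentions comes from keeping this discipline everywhere on the hot path: one stray conversion or escaping pointer can reintroduce an allocation per packet.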

Thomas: I mean, I feel like the garbage collection stuff is maybe a little bit overblown. Um, when people compare Go and Rust, they’re pretty, I mean, the actual ergonomics of both. You know, in Rust you’re doing a little bit more typing because you’re doing like, you know, Arc mutexes instead of something else, you know, instead of it’s just a simple variable, but you have roughly the same experience, right?

Like it’s the pervasive generics that, like, makes Rust a very different kind of programming experience. Like everything in Rust is modeled as a generic of some sort, is parameterized, you know, three or four different ways, and nothing in Go is, right? Are you guys doing any 1.18 stuff yet?

Yeah. So like we talked a lot about this because for us, we could be using 1.18 right now and using generics if we wanted to, right. Like, we’re not really open source in the sense that, like, we have to work with people using earlier versions of Go and stuff. So we could just do it if we wanted to. Have you guys started that yet?

Brad: We’re not using it, but we will. We switch to the, uh, the latest version of Go as our minimum dependency, like, the day it comes out. So we will depend on 1.18, you know, next month or whenever

Thomas: do you have thoughts about what that’s going to, how that’s going to change your code?

Brad: we’ll use it in a few places here and there, but I don’t think that Go generics will make the call sites… uglier. You’ll see ugly code in the data structures and algorithms that use generics because they have to define, you know, the constraints and stuff. But call site code should look better if anything, I’m hoping

Thomas: I was wondering if you think that just because you’re going to lose a bunch of weird cruddy interfaces that are defined just to pass through, like, sort of thing.

Brad: Yeah, exactly. I mean like the type inference that generics is doing to instantiate things is pretty good. So yeah, I think there won’t be so much crap on the page. Like as a regular user of Go, it won’t like get in your face very much. Like you don’t really have to understand how to write Go generic code to use it.

You just write the obvious code. If, you know, the provider of the library did a good job defining things, you could just write the code straight and it’ll just work, like, you know, a min function or something like that. Right? Like people always expect things like that to work. Now if you read some code, it’ll say, like, min of three comma four or whatever, and it’ll do what you think it does.

And there’s no extra syntax in there.
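Brad’s min example can be written out as a small Go 1.18 sketch. The constraint name `Ordered` is defined locally here as an assumption to keep the example self-contained, and the function is named `Min` rather than `min` because Go 1.21 later added a built-in `min`.

```go
package main

import "fmt"

// Ordered is the "ugly" part that lives with the library author: a
// constraint listing every type that supports the < operator.
type Ordered interface {
	~int | ~int8 | ~int16 | ~int32 | ~int64 |
		~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64 | ~uintptr |
		~float32 | ~float64 | ~string
}

// Min returns the smaller of a and b.
func Min[T Ordered](a, b T) T {
	if b < a {
		return b
	}
	return a
}

func main() {
	// Call sites need no type arguments; T is inferred from the operands.
	fmt.Println(Min(3, 4))         // 3
	fmt.Println(Min("go", "rust")) // go
	fmt.Println(Min(2.5, 1.5))     // 1.5
}
```

The constraint is the only place the generic machinery shows; the three calls in `main` read like ordinary function calls.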

Thomas: except for the first year where everyone using generics is going to do results.

Brad: People abuse everything for a while. Yeah. I mean, you can always, like, read some Go code and say, like, oh, this is a Python person that’s just new to Go, or this is a Java person that’s new to Go. There’s always people whose previous exposure shows through, but the same will happen, I’m sure

David: I feel kind of lucky. Cause you know, I think over the last 10, 13 years, however long Go’s been out, like complaining about lack of generics has been like a key part of the experience of doing Go development. Like at some point you complain to your coworkers, you complain on Twitter, or you make a snarky remark and then someone quote-tweets you and a Go developer responds.

And you’re like, I wasn’t actually trying to be mean to you. Um, I was just complaining. And uh, but anyway, you know, so like you said, generics are going to come out. People are going to overuse them, but conveniently I’m quote unquote pivoting from engineering to product management next month. So I have to deal with none of the blowback of that.

I’ve gotten all of the complaining for 13 years, and I’m not going to have to deal with the generics once they actually land.

Thomas: Which speaking of that, by the way, it kind of blew my mind. I think I only figured this out like a couple of months ago that Avery was the CEO of Tailscale. I figured you might work for Brad for a little while. I’m like,

Avery: I kind of do.

Thomas: is this the first time you’ve run a company?

Avery: Uh, no, actually I started a startup when I was in university with my roommate from university, Dave Coombs, back in the late 1990s. And we made Linux-based network appliances, which interestingly people also described as magical. Uh, it was back in the day before there were containers and cloud and stuff like that.

So, uh, we actually built systems where you would buy our appliance and there was a CD-ROM drive, cause that’s what people used at the time. And you could put in a CD and the CD would basically install an equivalent of a container. So you could have applications that were distributed that way.

And we actually could run multiple of them side-by-side and so on. And so there was a lot of similarity. What happened is I think strategically, I didn’t know as much as I do now about how not to crash your startup into the ground. Um, and so we had some really, really, really beautiful stuff that really actually worked, that people would rave about, but we never got past the early adopter stage because we didn’t understand what it took to commercialize a really good product.

And so eventually we kind of ran out of money. We got acquired by IBM in 2008. Uh, and it was okay. It was like a reasonable exit. It looked okay. But I’m very sad that, of course, after you get acquired, all the beautiful stuff you built sort of gets ripped to shreds by the acquirer. Not really intentionally, just by the accident of being acquired by a big company with a different culture.

Uh, and so all of this, like we had a DNS server that was really good, that was, you know, reminiscent of MagicDNS. Right. And we had this container-like thing and we had, what do they call it? A read-only boot system that’s a lot like CoreOS, that could run containers in it that were the mutable part.

And we had backups that would distribute, you know, you could automatically back up your hard drive in real time and stream it out to the cloud, whatever the cloud was at the time. And then, you know, bring it back if you need to restore it. All of this stuff was there and it’s all gone now. Right? So like,

Deirdre: at the time.

Avery: yeah, well we open sourced some parts, but it wasn’t, you know, the problem is, I could have a whole separate rant about this.

Right. But open source is great for giving you tools. It’s not great for giving you highly integrated environments that make all your problems go away. Right. Because, like, the point of open source is to be able to fiddle with it. Right. And so it was wonderful for things like WireGuard. You have this beautiful component that then you can plug into something else. And even Tailscale, the open source parts of Tailscale are, in my opinion, a beautiful component that you can plug into something else and give that something else connectivity. Right. But making it super, super seamless, it’s not something that open source is great at.

Right. And I’m really sad that we crashed that startup into the ground. And basically one of the, one of my absolute, uh, probably my number one priority at Tailscale is to not screw up at least in the same way over again.

Brad: There’s so many ways to screw up though.

Avery: Well, yeah, there’s so many ways to screw up, but like, look, somebody has to fix this nonsense, right? This is me getting emotional, but a lot of bad stuff has happened in the computer industry in the intervening decades. And a lot of the directions that we’ve been going in are just dumb.

Um, and we need to reverse some of this really just dumb stuff and start doing things the easy way again. Right. And nobody wants to do things the easy way. We kind of have to show them that like, look, the easy way is so much easier that you should just do that. Accept the two minutes of downtime.

Because your users won’t care, right? Put less than 10 layers in your stack. You don’t need Kubernetes because you only have one database server and the database server is not going to lose its data and you can recover it fast.

Deirdre: yo, let me tell you how hard it is to just deploy one container that talks directly to the internet. That is not easy to do on the cloud platform of your choice. And literally the answer is a virtual machine, one container, that’s it. Because all the other options are like, you know, Kubernetes cluster and like a

Thomas: me

Deirdre: and engine and yeah, like I’m literally typing flyctl over.

David: call had a hosting platform that could solve

Deirdre: Excuse me, Thomas. I just ran into the, like the super speed run of fly.io. And it’s asking me for a credit card. So I am running into friction with your onboarding process. So I have a bone to pick with

Avery: reported that to Thomas. When I signed up for fly.io

Thomas: Yes. I

David: w

Thomas: heard about it in real time from Kurt. It’s, uh, we’re, we’re.

David: fly.io the credit card form rejected my card and gave me a 500 and I ended up DMing Thomas about it.

Thomas: Yeah. We would, uh, you guys all know what happens if you have just arbitrary VMs you can sign up for without a credit card.

Avery: blockchain mining.

Thomas: Yeah. A hundred percent of it. Yeah. It’s all just blockchain. It’s immediately just miners everywhere. we’re getting rid of the cards for signups for some things, or it’s like, the white whale of, of product development here is the, is

Deirdre: I see.

Thomas: towards mostly getting rid of the card for people.

But right now that card is all, that’s the thin plastic line between us and complete chaos.

Avery: Yeah, that is the nice part of Tailscale: as far as we know, there’s nothing you can do to, like, cost us a lot of money just by using the free plan.

Deirdre: Hmm.

Thomas: you messaged me the other day about, like, the Go DNS stuff you’re working on. And like, I have my Go DNS complaint and then you have a bunch of other things you’re working on in Go DNS. And I feel like I have totally incomplete thoughts about the Go DNS situation, but now I feel like I should have them and that I should have had them ready for this conversation, but I didn’t and I feel bad about it.

Brad: yeah, I mean, I don’t know what, like what set of complaints you have, because I’ve heard so many over the years that, uh, I could predict like one of the five or 10 that you would say, but

David: I can give one, which is that if you’re using the HTTP library and you’re following redirects and you want to do something like figure out if you’re getting redirected to localhost or to a specific IP that you don’t want to connect to, it is very difficult to set that up.

Brad: I wrote a Perl module for this like 15, 20 years ago, exactly for this problem, for LiveJournal. Yeah. You have to specify a transport with a, uh, with a dialer and check the result in your dialer and filter it there.

David: yes, I believe that is what we did. I don’t remember. Our problem was our, like, crawlers at Censys would occasionally hit redirects back to our own infrastructure on HTTP. And then you would end up with sites in our little crawler that had our own, like, this machine belongs to the University of Michigan or to Censys.

And they’re just doing all of this research and here’s the big link list of papers that we’ve published, showing up as the content of random pages on the internet, because they were redirecting to localhost.

Brad: I mean, the other thing that you can do is, uh, don’t trust your machine to have internet access to begin with and make it, you know, go through a proxy somewhere else. And then you just set your dialer’s proxy to something else, or your HTTP transport’s.

Thomas: apropos of nothing. Did you guys ever end up doing something with the netstack stuff? The user-mode netstack stuff?

Avery: Did we

Brad: Yeah. Yeah. I mean, that’s how the whole WASM stuff works. There ain’t no TCP stack in the browser.

Thomas: This is all. This is all back to the fact that I didn’t realize that I could do wasm with go-wireguard

Brad: and actually we now use it even on, um, Android and stuff. So like, you can run an exit node on Android now. So if you have like an old phone sitting around that doesn’t even have a SIM card, you put it on wifi, plug it into power, and then you can, like, use that. And we use the netstack TCP stack to, um, to be your exit node.

David: which which TCP stack is this? Is this the

Brad: This is the one from gVisor. It was originally a different unit. It’s used in Fuchsia or so.

Thomas: netstack is Fuchsia’s TCP/IP stack?

Brad: Yeah, for now. I think there was some talk about maybe replacing it at some point, but then, um, so it was a standalone project and then it was in Fuchsia and then it was, like, in gVisor, and then I pulled it out of gVisor and made it standalone again. Cause it was in a massive repo that I wanted to cut down.

Thomas: were you involved in that code?

Brad: No, Crawshaw used to be

David: doesn’t gVisor have some, like, intermediate build pass or something that does some codegen for it? I have a vague recollection of reading the gVisor TCP stack at some point to get inspiration for a protocol I was working on and I needed to

Brad: Not much. They have some magic annotations for static analysis in comments that just check various invariants, but like they don’t do any real code generation that I know of.

Deirdre: Hmm.

David: wrong.

Avery: You can do some really neat stuff with the netstack bits. One of my favorite things is if you’re, as we are, concerned that you have really, really high sensitivity production servers, where if Tailscale were to do something weird, like start adding routes to your kernel, it might break connectivity.

So, for example, like on our control server, if you want to use Tailscale to get into our control server, you have this bit of a chicken and egg problem. If something is broken, it’s probably going to really break. You can use the netstack version, and that runs in user space and you can stick it in a cgroup or something like that and prevent it from touching any of your kernel routes.

And it just can proxy your incoming sessions purely in user space. It also lets you do, like, subnet routing and, like Brad said, exit nodes in, like, a macOS App Store container, where you normally wouldn’t have quite so much flexibility in the network.

Brad: and it lets us do fun things like, we’re running as a regular user, so we can’t do, um, like ICMP or like, you know, raw sockets. So instead when we see, like, an ICMP echo request go out, we, like, you know, shell out to the ping command. We run ping -n 1 and if, like, I don’t know, exit is zero, we make an ICMP response.

So like the person behind this testing their internet, that their exit node’s working, with ping 8.8.8.8 or something, they’re like, okay. Yeah, ping worked.
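The trick Brad describes can be sketched in Go as below. This illustrates the idea rather than reproducing Tailscale’s implementation: `pingArgs` and `hostAnswersPing` are hypothetical helpers, the flag choice (`-n 1` on Windows, `-c 1` elsewhere) is an assumption about the local `ping` binary, and building the actual ICMP echo reply is left out.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"runtime"
	"time"
)

// pingArgs returns arguments asking the system ping for exactly one probe.
// The count flag is -n on Windows and -c on most Unix-like systems.
func pingArgs(goos, ip string) []string {
	if goos == "windows" {
		return []string{"-n", "1", ip}
	}
	return []string{"-c", "1", ip}
}

// hostAnswersPing runs the unprivileged ping command and reports whether it
// exited with status 0, i.e. whether the target answered.
func hostAnswersPing(ctx context.Context, ip string) bool {
	ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
	defer cancel()
	err := exec.CommandContext(ctx, "ping", pingArgs(runtime.GOOS, ip)...).Run()
	return err == nil
}

func main() {
	if hostAnswersPing(context.Background(), "127.0.0.1") {
		fmt.Println("ping exited 0: synthesize an ICMP echo reply for the client")
	} else {
		fmt.Println("ping failed: send nothing back")
	}
}
```

The system `ping` is typically installed with the privileges it needs, so an unprivileged process can borrow its result without opening a raw socket itself.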

Avery: but it’s really neat. The direction we’re trying to get to there is where basically any of your Tailscale nodes will someday be able to act as a relay onto your local subnet in some sort of safe way, so that if somebody has like a Mac sitting at home, they can transparently use that Mac to access all the other devices like printers and stuff on their network without ever having to set anything up.

It’s going to be a few steps from here to there.

Deirdre: nice. I don’t like that. I can talk to my printer at all. I just want it Like I’m looking at it from over here and I’m just

Brad: So the other thing we were playing with was, we did a layer two Tailscale implementation where we use tap instead of tun. And we implemented enough of ARP and DHCP that then you can boot up, like, a QEMU Linux VM on a bridge and it comes up and it gets a Tailscale IP address from our DHCP server and all the traffic is only over Tailscale.

The VM only sees the Tailscale world. So then I plugged my HP printer into that bridge and turned on my printer. My printer came up with a Tailscale IP address and co-workers printed to it over Tailscale. So there’s fun stuff. Really, we want to do a better implementation that uses like, you know, mDNS or, you know, service discovery stuff, but, uh,

Deirdre: beautiful. I extremely desire tailscale.com/connect to be available for me to use soon ever.

Brad: well, now, now I have to make it not a 404 and make it say something like.

David: well, the good news is this’ll come out anywhere between tomorrow and like four months from now, depending on who does the editing and how it’s all going.

Deirdre: Yeah,

David: you’ve probably got time.

Deirdre: yeah, you gotta, you gotta rebase that branch, it’s about 200 commits behind. And then, uh, Bob’s your uncle. You’re ready to go.

Thomas: I’m lost in thought about DERP and about. Web sockets from my browser, but those both have taken over my brain.

Deirdre: yep.

Brad: cool.

you.
