Using Paris Traceroute for Modern Load-Balanced Networks

This is a podcast episode titled, Using Paris Traceroute for Modern Load-Balanced Networks. The summary for this episode is: Traditional traceroute is a widely used and very useful tool, but it struggles to accurately trace load-balanced networks. In this episode, Dr. Justin Rohrer joins us to talk about Paris traceroute, an extension of traditional traceroute, and how it's used to trace paths on a network that uses flow-based load balancing, or in other words, most of the internet. Learn more about Paris traceroute in Phil's blog post, The Power of Paris Traceroute for Modern Load-Balanced Networks: https://www.kentik.com/blog/the-power-of-paris-traceroute-for-modern-load-balanced-networks/

Transcript

Philip Gervasi: If you're a network engineer or just work in tech in pretty much any capacity, you're probably already familiar with traceroute, a mechanism that we use to trace the path between a source and destination over a network. And that sounds straightforward enough, but the way that we do networking today, especially when you take the public internet into account, poses some challenges to traceroute or what I'm going to call in this episode legacy or traditional traceroute. With me today is Dr. Justin Rohrer, a subject matter expert on Paris Traceroute, a newer version of traceroute that you might not be familiar with. So today's episode is going to be a little in the weeds, which is just the way I like it, and I hope that you do too. My name is Philip Gervasi, and this is Telemetry Now. Justin, thank you so much for joining me today. I know that traceroute is something that's very familiar to most network engineers and really to a lot of people working in tech in general, help desks, admins, all that, for years and years and years. So I really am interested in talking about this newer version of traceroute, Paris Traceroute, what problems it solves, why it's necessary today. I do appreciate you bringing your subject matter expertise to the show today. So thanks very much.

Dr. Justin Rohrer: Yeah, it's no problem at all. I'm happy to be here.

Philip Gervasi: Justin, before we get started, can you give us a little bit about your background, maybe even your formal education and what you specialized in?

Dr. Justin Rohrer: Right. So my PhD is actually in electrical engineering. That's what I did as an undergrad and worked my way up the stack over time. I was always really interested in networking, but when I was starting to look at graduate school, the hot thing was optical networking, so it was a very physical sciences oriented approach. And then as time went on, I became more and more interested in the science of network resilience and survivability. And a lot of that comes into play with software stacks and the interaction between the layering of the protocols and things like this. So by the time I finished my PhD, I ended up working as a professor in computer science for over a decade and teaching networking from the software side of things rather than from the hardware side of things.

Philip Gervasi: Oh, interesting. I didn't know that you were a professor. I mean, I knew you were in academia for a long time. Anyone who's getting a PhD, especially in the sciences, is going to be in academia for a long time, but I didn't know that you were a teacher, as a professor, so that's pretty interesting. Also interesting, something that you mentioned is that you were approaching networking from the software side of things. And without giving away your age, that happened years ago, right? And so it's just so interesting to me that we're talking about approaching networking in the context of software, infrastructure as code, and all these things today like it's some new thing. But years and decades ago VLANs were invented, which is a construct written in software for networking. So I don't know, it's just interesting to me that we've really kind of been talking about this stuff for years and even decades anyway. So anyway, let's focus on today's episode, a discussion of traceroute and specifically Paris Traceroute. Now, I'm not going to give you a definition of traceroute per se because I think most of our audience is familiar with tools like ping and traceroute, again, that exist on most systems out there from networking devices to most operating systems as well. But let's start with a quick recap of what traceroute is and I guess I'll use the term traditional traceroute, legacy traceroute. What do you think, Justin?

Dr. Justin Rohrer: Yeah, I think traditional traceroute is good. Sometimes we would call it legacy traceroute, but it's not like it's deprecated in any way. It's still the tool that's on all these systems.

Philip Gervasi: Right. So legacy in the sense that there are new versions, but not legacy in the sense that it's completely deprecated and is already replaced on all systems everywhere with some other more modern maybe better solution. Now, we are going to talk about a more modern solution, Paris Traceroute, but it's important to understand that that is not installed everywhere. So folks are still relying very much on traditional or legacy traceroute in their day-to-day network operations. So in that light, Justin, can you give us a little bit of a background, a technical explanation of how traceroute works and what we use it for?

Dr. Justin Rohrer: Right. It's filling in a gap in IP where we don't have any instrumentation about the path that traffic flows over from end-to-end just natively in the IP protocol. And so it's a little bit of a hack that uses the IP TTL field in the header to try and interrogate the different hops on the path. And so it starts with a low value of that TTL so that it will expire at the first hop and it'll get back an error message essentially from that router, which is the ICMP time exceeded message. And then it will send a packet that expires at the second hop and it'll get that same message back hopefully, and then construct the path in terms of the IP addresses that send the traceroute application these ICMP messages.
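
As an illustration of the TTL mechanism described above, here is a minimal simulation sketch in Python. It does not send real packets; the path, the addresses, and the function names are assumptions made up for this example. Each simulated hop decrements the TTL, and the hop where it reaches zero is the one that "replies," as a router would with an ICMP time exceeded message.

```python
from typing import List

# Hypothetical path (documentation addresses); the last entry is the destination.
PATH = ["10.0.0.1", "192.0.2.1", "198.51.100.7", "203.0.113.9"]


def send_probe(path: List[str], ttl: int) -> str:
    """Return the address that answers a probe sent with the given TTL.

    Each hop decrements the TTL; the hop where it reaches zero replies
    (ICMP time exceeded in a real network). If the TTL outlasts the path,
    the destination itself answers.
    """
    for hop in path:
        ttl -= 1
        if ttl == 0:
            return hop
    return path[-1]


def traceroute(path: List[str], max_ttl: int = 30) -> List[str]:
    """Send probes with TTL 1, 2, 3, ... until the destination answers."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        responder = send_probe(path, ttl)
        hops.append(responder)
        if responder == path[-1]:
            break
    return hops
```

Calling `traceroute(PATH)` on this toy path rebuilds the hop list one TTL at a time, which is the same inference a real traceroute makes from the source addresses of the ICMP replies.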

Philip Gervasi: Right, so a hack not in the sense that we're just hobbling together random technologies haphazardly, but we're using technology and mechanisms and information that wasn't necessarily intended to do what we're doing in this case with traceroute to solve that particular problem. But as I guess engineers do, you use what you have in order to solve the problem at hand. So in this case, we're using things like ICMP and TTL messages and all that information that we have available to us to then trace a path, which is something that's lacking inherently in IP, right?

Dr. Justin Rohrer: Exactly, yeah. It's a hack in the traditional terms of using things expediently for nefarious purposes. That's not what the ICMP time exceeded messages were intended for. They're intended to expire packets that might've gotten stuck in a routing loop or something like that. I guess the other thing to mention here is that traceroute sends UDP packets by default. At least on Linux and BSD systems, it's sending out UDP packets. But it can be done with TCP SYN messages or ICMP ping packets. Windows uses the ICMP ping packets by default. So any of these can be used because they're just expiring based on this field in the IP header.

Philip Gervasi: Okay, so we have traceroute that has not been hobbled together, but it is kind of a hack, again, using the technology that we have at hand to solve a problem. So with the advent of newer forms of traceroute, like Paris Traceroute, again the topic of today's episode, does that mean that traceroute is just completely dysfunctional, broken and we needed a solution to fix it or a solution rather to replace it entirely? I mean, it is everywhere on pretty much every operating system, every network device, and I have used it extensively over the course of many years of my career, I would say decades now.

Dr. Justin Rohrer: So I guess I want to mention a couple of things that it does well first, because it's not like you have to completely replace traditional traceroute with Paris Traceroute. It can answer questions for you like, just how far away is this destination? Am I going to my ISP to reach something or one AS away, or am I going across the country, the internet, just to get a notion of how many hops there are? It works just fine for that. If you don't care very specifically about which hops are in the path and just want to get this general notion of distance, it does fine for that. Also, probably the most common thing I use it for is I can't reach something and I want to know where the path is broken. Is the traffic getting out of my house? Is it getting out of my data center? Is it getting out of my ISP? That notion of just in broad terms, where's the path broken?

Philip Gervasi: Okay, so traditional traceroute, I'm not going to use the term legacy because as you said, we're still using it. It's still very useful. So maybe not legacy, but I think we can say traditional traceroute as compared to newer forms is still very, very ubiquitous. It's very commonly used out there and very useful to us. Now, I know that there are ways to trace path in layer two, but here we're talking primarily about layer three. Hop by hop, where are my packets going, assuming that those hops, especially if they're out there on the public internet like ISPs, are returning that information to us. Traditional traceroute still very useful. We get that. Where does it fall short such that we need something new like Paris Traceroute?

Dr. Justin Rohrer: Yeah, so then where you run into issues is when you want to know specifically what path am I taking and characteristics of that path and it involves load balancing.

Philip Gervasi: Well, not to cut you off, but it sounds like load balancing is the primary driver here. Load balancing is where traditional traceroute is lacking, a general inability to be able to trace paths when there is layer three, I assume, load balancing occurring. But what are the technical reasons for that? Why can't traditional traceroute handle that?

Dr. Justin Rohrer: The technical reason that this becomes an issue for traditional traceroute is that it has to encode a sequence number in each packet it sends. So that when it gets the ICMP responses back, they have a snippet of that original packet including the sequence number, and it can match up the responses with the packet that was sent. In the ICMP header, there's a field for the sequence number. And in the UDP header, there's no sequence number, and so it instead encodes the sequence number in the destination port. And then in the TCP header, it can use the sequence number. So that's not an issue for TCP, but for UDP and ICMP, using those fields causes routers to think these packets are part of separate flows. And so when the router goes to do flow-based load balancing, it doesn't keep the packets together on the same path. So one packet may go down one load balance path, the next packet goes down a different one. And when traceroute puts these responses back together, it says, "Oh, here is the path," and it says there's links effectively between these interfaces that may not exist. And that's assuming both paths are the same length. You may just get a random assortment of hops that are actually in different paths and don't actually have links to each other. In the cases of unbalanced load balancing, which is an oxymoron, but where the load balance paths are different lengths, it can actually look like you've got a loop in the path. You might see the same hop appear multiple times. And when you're trying to diagnose something, this looks like, oh, I've got a routing problem here, when actually it's just load balancing, it's all working correctly, but traditional traceroute made the wrong inferences for you.
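
The false-link effect described above can be sketched with a toy model in Python. Everything here is an assumption for illustration: the two-branch topology, the hop names, and especially `pick_branch`, which stands in for a router's flow hash using simple port arithmetic rather than any real hashing scheme. The classic-style trace changes the UDP destination port per probe; the Paris-style trace holds it fixed.

```python
from typing import List

# Two equal-length branches behind load balancer "A" (all names are made up).
BRANCHES = [
    ["A", "B0", "C0", "D"],
    ["A", "B1", "C1", "D"],
]
SRC_PORT = 50000
BASE_DST_PORT = 33434


def pick_branch(src_port: int, dst_port: int) -> List[str]:
    """Toy stand-in for a router's flow hash: same ports, same branch."""
    return BRANCHES[(src_port + dst_port) % len(BRANCHES)]


def responder(ttl: int, dst_port: int) -> str:
    """Hop at which a probe with this TTL expires on its chosen branch."""
    return pick_branch(SRC_PORT, dst_port)[ttl - 1]


def classic_trace() -> List[str]:
    # The sequence number is encoded in the destination port, so it changes per probe.
    return [responder(ttl, BASE_DST_PORT + ttl - 1) for ttl in range(1, 5)]


def paris_trace() -> List[str]:
    # The destination port stays fixed, so every probe hashes to the same branch.
    return [responder(ttl, BASE_DST_PORT) for ttl in range(1, 5)]
```

In this model `classic_trace()` stitches together hops from both branches and implies a B1-to-C0 link that does not exist, while `paris_trace()` returns one real branch end to end.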

Philip Gervasi: So you're getting either completely imaginary links. You're getting maybe indication of a routing loop that doesn't exist. We want to know if there are routing loops, of course, and traceroute is one of the first things that I'll do to look at that and see if the IP addresses basically flip back and forth, back and forth. But ultimately, there might be something in the results that indicates that and it doesn't exist at all. So traceroute, when we're talking about the context of load balancing, specifically flow-based load balancing in particular, is potentially inaccurate. So how do we solve for this? We have Paris Traceroute. And when I say Paris Traceroute, I mean Paris as in the city, P-A-R-I-S. First introduced around I think 2006 is when it first came out and named after the city of Paris because the authors were working in and through several universities in the city of Paris and then in the suburbs just south of Paris. And then some subsequent talks were given in Munich and other cities in Europe to socialize the technology. But ultimately, that's what we're talking about as far as the origin and that time period, so around 2006-2007. Now, how does Paris Traceroute differ from traceroute? How does it solve the problem for us?

Dr. Justin Rohrer: Right. So in the cases where the paths are load balanced... And by the way, we see load balancing in about 65% of all paths on the internet. So it's very prevalent. And of those, about 98% are flow-based. So this is very applicable to the path you actually see in the internet. So basically the goal of Paris Traceroute is to keep the headers consistent across all the probe packets such that they're all treated as part of the same flow. So they got creative in where to put the sequence numbers so that it doesn't change, for example, in UDP it doesn't change the destination port number. It's encoded somewhere else in the header. I'm forgetting off the top of my head. But the key is that it maintains the specific header fields that are used to define a flow by routers doing flow-based load balancing. Those fields, by the way, are the source and destination IP address, the protocol, and the source and destination port number, and typically it's just using byte offsets and looking at a hash of that. And so certain of the ICMP headers will also get caught up in those byte offsets, and that includes the ICMP header checksum. So you can't change anything in the ICMP header that would change the checksum, otherwise it'll get treated as different flows. So they do some magic to still be able to encode the sequence number, but add other data in the header so that the checksum always comes out the same.
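
One way to picture the checksum "magic" is the following Python sketch. It is an assumption-laden illustration, not the code of any particular tool: it builds an 8-byte ICMP echo header and, for each sequence number, picks a compensating identifier so that the standard Internet checksum always lands on the same arbitrary target value (`TARGET_CHECKSUM`, chosen for this example). The function names are hypothetical.

```python
import struct

ICMP_ECHO_REQUEST = 8
TARGET_CHECKSUM = 0x1234  # arbitrary value for this sketch; 0xFFFF is not reachable


def internet_checksum(data: bytes) -> int:
    """One's-complement checksum over 16-bit big-endian words (as in RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def identifier_for(seq: int, target: int = TARGET_CHECKSUM) -> int:
    """Pick the identifier that makes the header checksum equal `target`."""
    wanted_sum = ~target & 0xFFFF       # folded sum the header words must reach
    type_code = ICMP_ECHO_REQUEST << 8  # type=8, code=0 packed as one 16-bit word
    return (wanted_sum - type_code - seq) % 0xFFFF


def echo_header(seq: int) -> bytes:
    """Build an ICMP echo header whose checksum field is always TARGET_CHECKSUM."""
    ident = identifier_for(seq)
    unchecked = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, 0, ident, seq)
    checksum = internet_checksum(unchecked)
    return struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, checksum, ident, seq)
```

The sequence number still varies per probe, so replies can be matched to probes, but the bytes a router might hash (including the checksum) stay the same from one probe to the next.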

Philip Gervasi: Okay, I understand. So the real issue here is that traceroute, or at least the packets that traceroute uses and its probes, operate differently on the network than regular application traffic does. So application traffic is going to be load balanced at least by flow by pinning that flow using sequence numbers and checksums and whatever else to particular links and particular paths, whereas traceroute is just sending packet by packet for their probes, and so it's not being pinned to anything. There's no concept of sending everything down the same path for the multiple probes that traceroute is going to use, therein giving us the problem of imaginary links, false links, routing loops that might not even be there, and ultimately the inaccuracy that we get in flow-based load balancing or rather flow-based load balanced networks and trying to use traceroute on them.

Dr. Justin Rohrer: Right. So by keeping all the paths within a given traceroute, or sorry, all the packets within a given traceroute on the same path, then if you see two interfaces next to each other, you can reasonably infer that there really is a link there, that they didn't go down different paths along the way. It also solves that problem of having the perceived routing loops. Because if your packets took the longer path, they all took the longer path, and so your responses all come back in a reasonable sequence.

Philip Gervasi: So what we're doing here is we're keeping our sequence numbers among all of our path probes used by traceroute consistent, so that way that forwarding decisions are made such that all the probes go down the same path, whether that's using sequence numbers, checksum or whatever other mechanisms. But really the underlying technology doesn't really differ that much. I mean, we're still sending out probes to get back time exceeded messages, and really it operates very similar, if not the same as traditional traceroute does. So Paris Traceroute does not really differ that much from traditional traceroute in that sense, does it?

Dr. Justin Rohrer: It does not, and it can still use UDP, TCP, or ICMP. All the tools that I'm aware of support all three of those packet types.

Philip Gervasi: So then if Paris Traceroute gives us all the same information that we get in traditional traceroute, except it solves those inaccuracies that we have on flow-based load-balanced networks, which as you said before is everything on the internet, wouldn't it make sense to just replace traceroute everywhere with this new modern Paris Traceroute and just use that moving forward everywhere that we can?

Dr. Justin Rohrer: You could if you wanted to go to the trouble of installing it everywhere that you want to use it, but I don't think it's necessary. It's more being aware of, I'm trying to use this for a use case where I might get inaccurate information from traditional traceroute. So really I should go install one of these tools that runs the Paris version and use it in those cases.

Philip Gervasi: Now, when I actually run it and experiment with it, I'm getting multiple responses back. I'm not just getting one particular response per probe or per run. I'm getting multiple results in the output and I'm not exactly sure why. Can you explain what that means?

Dr. Justin Rohrer: So I think you're referring to getting the three different timings per hop. The idea there is that it's just so common for either one of these packets to get lost, especially because ICMP responses have to be handled by the router's CPU typically. They're not a fast path response. And so if that CPU is busy, it may just not respond to a particular packet, or it could just be general congestion or whatever, losing packets on the internet. There's no reliability here for retransmissions. So part of it's for that. It's also the case that when you're looking at latencies and trying to correlate latency to distance, you want the lowest value. You can never beat the speed of light, but your latencies can be highly affected by congestion and buffering on the path. And so if you get three samples back, you can just say, "Well, the lowest one is the closest correlation to distance," and that might be the one that you want.
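
The "take the lowest of the three samples" idea can be written as a short Python sketch. The function name and the convention of using `None` for a lost probe (a star in traceroute output) are assumptions for this example.

```python
from typing import List, Optional


def best_rtt(samples_ms: List[Optional[float]]) -> Optional[float]:
    """Return the lowest round-trip time among the probes that were answered.

    A lost probe is represented as None. Queuing and congestion can only add
    delay, so the minimum is the sample least inflated by them.
    """
    answered = [s for s in samples_ms if s is not None]
    return min(answered) if answered else None
```

For a hop that answered two of three probes, for example `best_rtt([12.4, None, 11.9])`, the function keeps the 11.9 ms sample; if all three probes were lost it returns `None`.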

Philip Gervasi: All right, so what we're talking about right now is testing or monitoring, I'm not sure which term to use, a production network, but we're not using production traffic. It's not like we're looking at an end user's application traffic and then determining what the latency is or what the path is that that application traffic is taking. These are test probes, whether it's traceroute or Paris Traceroute. So this is artificial traffic. I mean, it's real traffic, it's on a production network, but it's not an end user's traffic. So it's active in that sense. And I guess this is the difference between active and passive monitoring, active and passive testing. And I noticed in our show notes and in some of the literature from you actually, you use the term active and passive measurement. You use the term measurement specifically. I do want to understand why. But can you explain a little bit about the difference between active and passive monitoring and then, of course, measurement?

Dr. Justin Rohrer: Right. Yeah, I use the term measurement because I do think of Paris as a measurement tool in terms of measuring topology, some flavor of topology. And in passive measurements, we're just collecting data, whether it's PCAP where you're collecting raw traffic or you're collecting your NetFlow sampled traffic, or even polling SNMP to get counters that are passively collected as traffic flows through an interface. And in the control plane, we have BGP looking glasses where we collect routing information from the internet. So those are all the things where we're not injecting traffic into the network as part of the measurements. And then traceroute, of course, falls into the active category where we're injecting that traffic. When you run one traceroute, you generate something like maybe 50 packets. It's a very small number of packets that you hope is not affecting anything in your network. If your network changes its behavior based on 50 packets, you're really in trouble. But in aggregate, if you are trying to do really large scale measurements, you want to see how the whole internet gets to your AS, for example. You might need to run 100 million traceroutes. And at that point, you can really start affecting things. An example of that comes into play with something called ICMP rate limiting. So routers by default will have a rate limit of how many ICMP responses they can generate. And if you're doing high volume traceroute, it's pretty easy to induce the routers to hit those limits and stop responding to you. And so then you get traceroutes where you're getting a bunch of stars back, and so now you can no longer infer that part of the path. It may also be viewed poorly by service providers that you're causing this load on their CPUs and causing those routers to stop sending the ICMP messages that they should be. So that's something to be really aware of when scaling up active measurements.
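
To put rough numbers on the scaling concern, here is a back-of-the-envelope Python sketch. The figure of about 50 packets per traceroute comes from the conversation; the per-router reply limit and the probe rates in the usage note are purely illustrative assumptions, since real limits vary by vendor and configuration.

```python
PACKETS_PER_TRACE = 50  # rough per-traceroute figure mentioned in the episode


def total_packets(traceroutes: int, per_trace: int = PACKETS_PER_TRACE) -> int:
    """Total probe packets injected by a measurement campaign."""
    return traceroutes * per_trace


def answered_fraction(probes_per_sec: int, icmp_limit_per_sec: int) -> float:
    """Fraction of probes a router answers if it caps ICMP replies per second.

    Simplified model: every probe arriving at this router expires there, and
    the router answers at most `icmp_limit_per_sec` of them each second.
    """
    answered = min(probes_per_sec, icmp_limit_per_sec)
    return answered / probes_per_sec
```

Under these assumptions, `total_packets(100_000_000)` comes to 5 billion probes, and a router with a hypothetical cap of 1,000 replies per second that receives 4,000 expiring probes per second answers only a quarter of them, so the rest show up as stars.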

Philip Gervasi: Yeah, that makes sense. I mean, really it's just basic math. When you have a very small sample size, a small data set, there's a potential for inaccurate results. So you use a larger data set, a larger sample size. In this case with traceroute, Paris Traceroute, you send out more probes and you get hopefully more accurate results. But when you inundate the network with a huge number of probes, there's a potential of some of those devices in the path, presumably routers, dropping your traffic or just skewing the results themselves. You're adversely affecting latency because you're inundating a production network with more packets with traffic. So let me ask you, Justin, what are the limitations of Paris Traceroute? I mean, what I just explained was more of an operational thing, not necessarily a deficiency in the mechanism itself. But you tell me, what are the limitations of Paris Traceroute?

Dr. Justin Rohrer: So the main limitation that comes out of just running a single Paris Traceroute versus a single traditional traceroute is that the Paris Traceroute, by keeping all the packets on the same path through load balancing, actually hides that load balancing from you. So you can't tell where load balancing might be happening. Whereas with traditional traceroute, with those three responses, oftentimes you'll notice that at a given hop, you got different answers because those packets took different paths. And so it's on one hand telling you something incorrect about what's linked to what, but it gave you that hint that yeah, there's load balancing around these hops and maybe you care about that. For Paris Traceroute to do something similar, you need to run it a bunch of times basically with different flow identifiers. So there's an add-on on top of Paris Traceroute called MDA Traceroute, which can do this, but it becomes far more expensive in terms of the number of packets you're sending to interrogate a specific path.
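
The idea of rerunning a Paris-style trace with different flow identifiers can be sketched in Python on a toy topology. This is only an illustration of the principle: the topology, names, and the modulo-based branch selection are assumptions, and the real multipath detection algorithm also decides statistically how many probes to send per hop, which this sketch does not attempt.

```python
from typing import Dict, List, Set

# Toy topology: load balancer "A" spreads flows over two branches (names made up).
BRANCHES = [
    ["A", "B0", "C0", "D"],
    ["A", "B1", "C1", "D"],
]


def trace_with_flow(flow_id: int) -> List[str]:
    """One Paris-style trace: the flow identifier alone selects the branch."""
    return BRANCHES[flow_id % len(BRANCHES)]


def discover_interfaces(flow_ids: List[int]) -> Dict[int, Set[str]]:
    """Union the responders seen at each TTL across several flow identifiers."""
    seen: Dict[int, Set[str]] = {}
    for flow_id in flow_ids:
        for ttl, hop in enumerate(trace_with_flow(flow_id), start=1):
            seen.setdefault(ttl, set()).add(hop)
    return seen
```

With a single flow identifier every TTL shows exactly one responder, so the load balancing is invisible; with several identifiers, TTLs 2 and 3 each show two responders, which reveals where the paths diverge at the cost of proportionally more probes.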

Philip Gervasi: So then is that just the way to solve for that limitation, just send out more packets, more probes? And in this case, you mentioned MDA Paris Traceroute.

Dr. Justin Rohrer: Right. I mean, if you just run Paris with its default settings, it's not going to do any of that for you. And so it'll just look like there's a single path to anywhere.

Philip Gervasi: All right, so this variation of Paris Traceroute called MDA Traceroute, you mentioned, the multipath detection algorithm, that sends a lot more probes to factor in for possible load balancing. I get it. So by sending more probes, you can vary the flow identifiers, I think you mentioned, and you can trace those multiple paths more accurately. So it sounds like the underlying mechanism with even the MDA version or variation of Paris Traceroute is also the same where it's using some of the same underlying or the same underlying technology. Is it any different? That's my first question. And then my follow-up question to that is, should we therefore be seeking to install Paris Traceroute and its variations everywhere and using that as our primary method, especially as network engineers and network operators?

Dr. Justin Rohrer: Oh, no. No. The hop by hop mechanisms are using ICMP that's already part of the IP stack everywhere. What I meant by that is more if you're tracerouting from different servers in your infrastructure, you would need to install it in those places as opposed to traditional traceroute that's probably already there in whatever distribution you're using. And then if you are tracerouting from routers, you may not even have the option to install Paris Traceroute.

Philip Gervasi: Yeah, so that doesn't sound like a limitation of Paris Traceroute to me because it's more of an adoption problem. I mean, there's nothing technically deficient there with traceroute in that sense at least. It's really a matter of operating systems out there, both in compute and network devices not supporting it, or you have to go and manually install it, which is an operational problem. And so we use traditional traceroute everywhere. It's everywhere. It's commonly installed. And so we rely on that at first. And then when we need something like Paris Traceroute, we install it where we can, and where we install it is on the sending device. It's on the device that's generating that Paris Traceroute traffic, those probes, and so whether that's on some compute node, your operating system in front of you, or network devices themselves. So Justin, what would be your advice, your recommendation to those that are managing networks right now and want to improve their strategy of how they can trace application traffic over the network, especially over the public internet? So for those engineers like network engineers, network security, maybe cloud engineers as well, those working in operations, what would your advice be to them?

Dr. Justin Rohrer: Certainly if traceroute is part of your regular operations and troubleshooting and everything, it's worth having that tool in the toolkit. As you were talking about experimenting with it, it's worth mentioning that there's at least three different software packages that implement Paris Traceroute. At Kentik, we use the Scamper package, and that's out there readily available. I put the link in the document to that one. I think that one's the most widely used by folks doing the large scale measurement projects. And then there's also the Yarrp implementation, which is for doing very large scale, if you have 1,000 or 100,000 destinations that you want to traceroute to all at once. Kind of like the ZMap of traceroute. That's the one that I've used the most in the past. But these tools have their strengths and weaknesses and are worth playing with.

Philip Gervasi: Yep, that makes sense. So ultimately, the encouragement is to install it where you can, when you can, perhaps starting on the computer sitting in front of you right now and just get familiar with it and experiment and see the kind of results that you can get back and how Paris Traceroute operates and maybe eventually add it to your overall network operations strategy. So Justin, this has been a very interesting episode. I love getting into the weeds on this kind of stuff, so I appreciate you joining me today and for bringing us your subject matter expertise. And I will make sure to link in the show notes on the website, the various resources that we alluded to today. So if folks want to reach out to you with a question or a comment of some sort, how can they find you online?

Dr. Justin Rohrer: You can Google my name. Some version of a site or social media will come up. My Kentik email is jrohrer@kentik.com, so you can reach me that way.

Philip Gervasi: Great, and that's Rohrer spelled R-O-H-R-E-R, so jrohrer@kentik.com. And you can still find me online on Twitter @network_phil, my blog networkphil.com, and you can search my name in LinkedIn and some other various social media as well. LinkedIn being the primary these days. Now, if you would like to be a guest on Telemetry Now, or if you have an idea for an episode, I'd love to hear from you. You can reach out to us at telemetrynow@kentik.com and we'll start a conversation there. So for now, thanks very much for listening. See you soon.

Today's Host

Phil Gervasi

Head of Technical Evangelism at Kentik

Today's Guests

Justin Rohrer

Senior Software Engineer

Justin P. Rohrer, PhD, currently develops Internet-scale network measurement, monitoring, and analysis systems at Kentik. His career as a network engineer began in the late 90s, which was followed by an academic role as a computer science professor and a decade leading a research lab focused on improving network resilience and survivability, during which he published over 50 scientific papers and articles in the field. As part of his study of Internet resilience, Dr. Rohrer has run more than 100 billion Paris-style traceroutes.

Justin's LinkedIn