CIO Speaks – Episode 6: A Conversation with Bill Norton of DrPeering Part II


Host Steve Ginsberg speaks with Bill Norton of Equinix about network peering and network strategy from a CIO perspective.

This is the second in a two-part episode. Find the first part here.

Guest

Most know Bill Norton as Co-Founder and former Chief Technical Liaison of Equinix, or as DrPeering, author of "The Internet Peering Playbook: Connecting to the Core of the Internet." Some may remember him as the first chairman of NANOG (1995-1998). He is a thought leader and a passionate Internet engineer with deep interest and expertise in the Internet interconnection and colocation sector.

Today, he helps executive teams strategically leverage SDN, BGP peering, and cloud interconnection for their businesses, and assists with market research, sales, lead generation, and software/DevOps/network engineering. Recent engagements have him building Internet peering ecosystems in the Middle East and consulting on peering and transit for large-scale global cloud gaming.

Transcript

Steve Ginsberg: I think a lot of enterprises have moved to the cloud as part of a corporate initiative that had a very fast timeline in some cases. When I talk to peers, sometimes it's driven by the board as much as by the organization itself.

Whoever instantiates it, I think a lot of IT organizations are catching up with that reality, waking up to it, and as we just discussed, with more than one cloud, the kind of traffic you just mentioned is easy to generate the second you start moving... Part of the reason people move to different clouds is differing capabilities, so you might have developers who want to be in Google Cloud, or they might want to be in Azure while their main organization is in Amazon, or it could be exactly the other way around. Once that starts happening, it's easy for teams to start moving a lot of resources between the different locations.

Bill Norton: Yeah, the other thing I've seen is, you might find as you look around that there are some people who were brought up using the AWS portal and they live on that portal. They know how to spin up EC2 instances just like that, move them around, and turn them off. There's even some logic that people have put into their scripts to test the virtual machine they just spun up, because all virtual machines are not created equal.

And sometimes you'll find a virtual machine that does not have the I/O that you need, or does not have the speed you need for whatever reason, and they'll turn it down and go back to the pool hoping to get a better virtual machine. It's kind of interesting. Google and Microsoft are pushing heavily to make it easy to do a 'lift and shift.' They can be less expensive in some cases; in other cases the network connectivity might be more expensive in the new location.
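[Aside for readers: the "spin up, test, keep or discard" logic Bill describes is straightforward to script. Below is a minimal, hypothetical Python sketch of just the qualification step; the benchmarks and thresholds are illustrative assumptions, and in practice you would wrap something like this around your cloud provider's provisioning API rather than run it on its own.]

```python
import os
import time


def cpu_benchmark(iterations: int = 2_000_000) -> float:
    """Rough CPU check: seconds to run a tight arithmetic loop."""
    start = time.perf_counter()
    total = 0
    for i in range(iterations):
        total += i * i
    return time.perf_counter() - start


def disk_benchmark(path: str = "bench.tmp", size_mb: int = 64) -> float:
    """Rough I/O check: MB/s for a sequential write flushed to disk."""
    block = os.urandom(1024 * 1024)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    os.remove(path)
    return size_mb / elapsed


def qualify_vm(max_cpu_seconds: float = 1.0, min_write_mb_s: float = 50.0) -> bool:
    """Return True if this VM meets the (made-up) minimum bar."""
    cpu_s = cpu_benchmark()
    write_mb_s = disk_benchmark()
    print(f"cpu loop: {cpu_s:.2f}s, sequential write: {write_mb_s:.0f} MB/s")
    return cpu_s <= max_cpu_seconds and write_mb_s >= min_write_mb_s


if __name__ == "__main__":
    # In a real workflow you would provision the instance, run this on it,
    # and terminate and re-draw from the pool if it returns False.
    print("keep this VM" if qualify_vm() else "turn it down and draw another")
```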

But what's interesting, Steve, is that you find different personalities leaning towards these different platforms. For example, corporate America might say AWS is perfect, it's what everyone goes for, [so] that's where we're going to put our information. Other folks would say, "You know, we're a Microsoft shop; everything we do day in and day out is Microsoft; we're .NET and all the tools we use…" I worked with a client in cloud gaming that was entirely focused on Microsoft stuff. Those folks would probably be more comfortable and more familiar with the Azure portal interface. I found the Azure interface to be a bit more complicated and different. See, I came up on the AWS side, so Azure looks foreign to me, and that's why I think people will be comfortable with that which they have started to use, because they're using that particular...

It's interesting, so I would normally hear that part of the discussion coming from the people throwing up VMs essentially, you know, the developers and the IT people supporting developers, those teams. But are you saying that the portals make a difference for the networking teams too? Or are you talking more as a developer working in the cloud in that case?

I'm thinking more about the user and what their preference is. For example, I'm spinning up a new version of my website now and I have a choice: Do I want to put it up on a free AWS account, a free Azure account, or a free Google Cloud Platform account? I would probably go where I'm most comfortable, where I'm familiar and I can do things fairly quickly.

Sure. So you're talking really as the corporate end user or developer, folks using the cloud service, not necessarily setting up the infrastructure underneath?

If you talk to the research folks in an enterprise though, you might find them more comfortable with the Google platform. Yeah, because maybe that's where they did their master's work when they were off at school. For whatever reason, there do seem to be some categories of appeal that I've seen in the field today. Maybe it's going to change, but as it stands now it seems to me that AWS is really strong in the corporate market. Microsoft's really strong in the .NET space, and if you're a Microsoft shop that's probably where you're gonna be going. And Google is kind of the researcher, 'the nerd' for lack of a better word, who loves to play with the toys. Google's got some great toys.

Yeah I think that's fair and I think the folks who run each one of those cloud services would want to say “Oh no, we absolutely appeal to these other audiences.” And I think in fairness there probably is some cross-pollination. But I agree that certainly they're all communities ultimately, and the communities have brought in some types of members first, and then I think as these services are going through their rapid development, some of this changes over time and some may grow stronger.

This is one of the really important points that your audience should know about, and that is that each of these cloud connection techniques, Google Cloud Interconnect, Azure ExpressRoute, AWS Direct Connect, does essentially the same kind of thing. They allow you to get access to your resources in their particular clouds. The difference though is they use different terms, and that can be really confusing and tends to make people want to stay in their own cloud of comfort. The terminology is one of the forms of lock-in that you have. Whatever terminology, whatever environment you are comfortable with, that's where you tend to stay.

Yeah, I did some research on Kubernetes and containers in the past year, and when you start to look at that, it takes a while to kind of wrap your head around 'At what level is the computing actually happening here?' And to your point: which services are actually the same as something that's in a computing cloud, and which are actually somewhat distinct in that way?

And the other decision you have to make is: do you want to download and install open source software yourself, or do you want to pay a fee to Amazon, who's already doing that for a lot of other people? It depends on whether you have the in-house expertise and whether it's strategic for you to control your network and control your software stack like that.

You've traveled all over the world to talk to people about peering. Do you have a favorite story from doing that?

Well, I'd say my favorite story is probably the 111 8th Street story. There was a power outage in New York. I'm not sure, it might have been 2006 or 2007, something like that. The power went out for all of New York City. Now in downtown New York City there's a place called 111 8th Street. That's a major carrier hotel; it's where all the fibers from Europe come in and terminate. The point is, it's a major carrier hotel.

Now the power went out for the entire city of New York. And what happens in those cases is, there is a thing called an 'automatic transfer switch' down in the basement of 111 8th Street that kicks over when the power from the street goes off, and then all of the power for the building comes from UPSs, uninterruptible power supplies. So for a short period of time the UPS, these batteries, are powering the entire building until the generators start up on the roof of the building. Those generators, once they kick in, can power the entire building for days. As long as they have fuel, they can provide enough power.

Well, it seemed like the right things were happening. The power went off, the automatic transfer switch kicked over, the UPS took over the building load, and the generators started up on the roof of the building. The other thing you need to know is that up on the roof were 500 gallon tanks to provide fuel for all of the generators. Now, after 9/11 they had to have all the large quantities of fuel underground, not on top of a building, for security reasons. So the big 50,000 gallon tank was down in the basement. And the idea was that when the power went out, the fuel pump down in the basement would pump fuel from the 50,000 gallon [tank] in the basement all the way up to the rooftop and continually refill the 500 gallon tanks that supplied the diesel generators. That's a long setup, are you with me though?

I think so.

OK, so what happened was the building owner sent a note out to everyone saying, "We've lost power, there's a citywide power outage, [with] no estimated time of repair. But everything is fine, the generators are cranking away. Your equipment is fine at 111 Eighth Street." Then another note came out. One of the generators on the roof (there are four of them I think, maybe six) had kicked off, and the building owner said, "What's going on there?" They sent a guy up to the roof to figure out what was going on, and in the time it took him to get to the roof, a second generator kicked off.

Then they sent a note to every tenant saying, "Please turn off unnecessary equipment, we're having a mechanical problem with our generators." Then the third one kicked off, and then a fourth one, and then they sent another note saying you have to turn off all nonessential equipment, we're down to two generators for the entire building, that kind of thing. What's going on here? What is causing these generators to kick out? Well, the guy up on the roof is trying to figure out that very thing. He goes down to the basement and he sees that the fuel pump is cranking away, so that's the right sound a fuel pump should be [making] while refilling the tanks. He goes up to the roof [and] all the tanks are empty. The last generator kicked off, and now they have no power. Why?

Well, it turned out that in the previous month or two they had replaced the fuel pump, and they had tested the fuel pump. But the test involved just seeing if the fuel pump made sounds. It turned out they had the polarity wrong on the pump. Instead of pumping fuel up to the rooftop, it was pumping the 500 gallon tanks' fuel down into the 50,000 gallon tank, so it drained the rooftop tanks. The guy's scratching his head, what's going on here? He goes back down and hears the fuel pump still cranking away, and he tries flipping the polarity, and all of a sudden he hears the fuel going up through the pipes. And they say, "Oh my God, they tested it, but they didn't test to make sure that the polarity was proper and that this was fully working."

So then he goes up to the roof expecting the generators [are] going to be running, [but] they're not. They're actually silent, and there's plenty of fuel; the fuel tanks are all topped off. What happened? Well, it turned out that during all this going up and down the stairs, because remember there's no power, so no elevators, the generators were constantly trying to turn over and they burned out their starter motors. So they had to go out across the partying streets of New York City, everyone was having a big old fun time down there, to find replacement starter motors for their generators, six of them. They finally find them, they go back (this is hours and hours going by, by the way), and they replace the starter motors, and sure enough the starter motors are cranking over, but they can't seem to get the generators to work. Why not? Well, it turns out that when the fuel tanks got drained down to the bottom, all that sludge on the bottom of the diesel tanks clogged the fuel filters on the generators. So now the guy had to go back across the city of partying, celebrating folks to find replacement fuel filters to put back into the generators to get them back up and working.

So to me this is an interesting story, not to pooh-pooh the systems that they had in place. They had all the same credentials that any CIO would look for in a data center: 2N plus one redundancy, fuel trucks contracted to take two different routes to deliver fuel, all of this sort of stuff they had. But look what happened here: a cascading set of failures, one after another. The first one, the reversed polarity, led to the fuel filters getting clogged and the starter motors burning out, and these sorts of things happen in real life.

Now the Internet service providers' and the carrier community's reaction to this 24-hour-plus outage was fascinating to me, because they weren't really annoyed that they had to go without power for days. They were annoyed because they didn't get scheduled updates from the building owner. And I thought that was interesting, not that there was a problem, I guess everyone understands that there will be things that break, bad things will happen, but the real question is: 'How does your vendor treat you during the time when they're having a really tough time? Are you a partner, or are you more like someone they want to hide stuff from?' That was really the feeling that I got from that story.

Thanks, Bill, yeah, that's a great cautionary tale about testing and about thinking through a failure plan, what goes beyond the first few levels of failure. And it's a good reminder too, on the subject of peering, that one of the reasons to connect at various peering exchanges is exactly this. So I think what you are saying, in part, is that a lot of the peering community wasn't too upset, because their peering traffic would have just failed over to another location and been kind of picked up somewhere else.

That's right. That's another interesting tradeoff that you'll find in the Internet peering ecosystem. I asked peering coordinators at one point, "How many exchange points would you like to see in a region for offloading your traffic in either peering or transit relationships?" And I thought the answer would be: "Well, one. We want to build into one location only so we can get our traffic offloaded in that one place." And it turns out that half of the people said, "Yes, we want exactly one location." I said, "Well, how about redundancy?" They said exactly what you said: "Bill, for redundancy, we don't want to have extra cost in our peering infrastructure that we have to pay for. We handle redundancy by routing traffic to another place where we interconnect with the same people; we interconnect with those people in multiple locations. And even if we didn't, having just a short period of time where we don't have the optimal peering path to that destination is something that we can deal with."

That was half of the audience saying exactly one. The other half of the audience said, "We want exactly two exchange points per region," and I said, "That means double the cost." They would say, "Fine, I don't care about double the cost, I want the redundancy. I want there to be different exchange points operated by different people, different networks, different companies that operate them. I want them to use different vendors' switches, and we want them to use different security guards." They want the redundancy to go as far into that peering infrastructure as possible, with different everything. So it's a way of removing systematic risk.

So it's great to hear how companies are approaching their peering strategy in that way, that they might have some differences there. How would you characterize peering exchanges changing over time, into the current day? Are things different than they were five years ago? Transit costs have come down, so maybe there's the idea that peering is a little bit different than it was, certainly 10 years ago?

Oh, absolutely. You know, when I first started working with Equinix, I traveled about 90% of the time, and that's how I developed all these relationships and how I collected all these stories that I put in the book. [I had] a lot of conversations with folks in the field, and when I first started I would ask people, "What's a rough number for the cost to buy Internet transit these days?" Back in 1998 when Equinix started, the price was $1,200 per megabit per second. Four years later the price was $120 per megabit per second. Four years later the price was $12 per megabit per second. And sure enough, we're now four years after that, and it's about $1.20 per meg, and many times less than that.
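[Aside for readers: that "divide by ten every four years" pattern works out to roughly a 44% price decline per year. The snippet below is just that arithmetic, using the price points Bill cites; the extrapolated next step matches the "12 cents per meg" he mentions next.]

```python
# Transit price points Bill cites, each roughly four years apart ($ per Mbps).
prices = [1200.0, 120.0, 12.0, 1.20]

annual_factor = (1 / 10) ** (1 / 4)  # a 10x drop spread over four years
print(f"implied annual price multiplier: {annual_factor:.2f}")  # ~0.56
print(f"implied annual decline: {1 - annual_factor:.0%}")       # ~44%

# Extrapolating one more four-year step from $1.20 per Mbps:
print(f"next stop: ${prices[-1] / 10:.2f} per Mbps")             # $0.12
```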

So we're heading towards 12 cents per meg as the next stopping point. But this has been going on since the beginning of the commercial Internet. Every year the price drops, every year the ISPs say "no one's making any money at these prices," and every year the prices go down yet again. There are efficiencies in optical equipment that allow you to get things like 40 gig or 100 gig, and 400 gig is coming pretty soon. These help make it possible to deliver large amounts of bits at lower and lower prices. So that changes the landscape, but we've always had this tension between the ever-dropping price of transit and the actual cost of peering, because with peering you do need to buy things: you need to have a router, you need to have transport into a place where you can exchange traffic with somebody else, and maybe you need colocation space for a router. All of these different expenses for implementing peering have to go into the mix to figure out: does peering make sense financially? It is a difficult case to make these days that you're going to save money, but it still can be done.
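[Aside for readers: here is a minimal back-of-the-envelope sketch of the "does peering make sense financially" comparison Bill outlines. Every cost input is a made-up placeholder, not a figure from the conversation; the point is only the shape of the calculation: fixed monthly peering costs weighed against the transit spend the peered-away traffic would otherwise consume.]

```python
def monthly_peering_savings(
    transit_price_per_mbps: float,  # e.g. $1.20 per Mbps at the 95th percentile
    peerable_traffic_mbps: float,   # traffic you could offload via peering
    port_cost: float,               # monthly exchange port fee
    transport_cost: float,          # monthly circuit into the exchange
    colo_cost: float,               # monthly colocation space and power
    router_amortized: float,        # monthly share of router capex and support
) -> float:
    """Return estimated monthly savings; negative means peering costs more."""
    transit_avoided = transit_price_per_mbps * peerable_traffic_mbps
    peering_cost = port_cost + transport_cost + colo_cost + router_amortized
    return transit_avoided - peering_cost


# Illustrative placeholder numbers only.
savings = monthly_peering_savings(
    transit_price_per_mbps=1.20,
    peerable_traffic_mbps=5000,
    port_cost=1000,
    transport_cost=1500,
    colo_cost=1200,
    router_amortized=800,
)
print(f"estimated monthly savings from peering: ${savings:,.0f}")
```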

I know some companies that are doing some pretty clever things to minimize their transit costs. I'll share one of the tactics from my book. What you do is take a look at the way you're doing global traffic distribution, and if you are buying transit in different regions from different ISPs, you have a bit of a challenge, because the 95th percentile, which is how things are priced in Internet space, the 95th percentile in Europe will be at a different time than the 95th percentile in the United States, and at a different time from the 95th percentile in Asia. So you can be paying...

So you're saying your peak traffic kind of goes up at the same local time everywhere because it's distributed globally, and also because audiences will naturally... if a service is commonly used at 9:00 at night or at 1:00 pm, that time follows the globe, and therefore your peak will move with it?

That's right. So the clever trick that one enterprise did was they used a single vendor and they told that vendor: "Globally, I want the same time scale to be used for all of my traffic around the world." That way the peaks of the 95th percentile in Europe are offset by the valleys in North America and in Asia, and likewise the peaks in North America are offset by the valleys in the other parts of the world. And by doing so, you end up having a much flatter demand curve that allows you to get better usage at a better price.
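[Aside for readers: the sketch below illustrates the effect with synthetic traffic and a deliberately simplified billing model (5-minute samples, 95th percentile per region versus on the global sum). All numbers are invented; the takeaway is that offset daily peaks make the aggregated 95th percentile come out well below the sum of the regional ones.]

```python
import numpy as np

# One week of 5-minute samples; three regions whose daily peaks sit ~8 hours apart.
samples_per_day = 288
t = np.arange(7 * samples_per_day)
day_phase = 2 * np.pi * t / samples_per_day


def region_traffic(peak_offset_hours: float, base: float = 2.0, peak: float = 8.0) -> np.ndarray:
    """Synthetic diurnal traffic curve in Gbps, peaking at the given offset."""
    shift = 2 * np.pi * peak_offset_hours / 24
    return base + peak * np.clip(np.sin(day_phase - shift), 0, None)


regions = {
    "Europe": region_traffic(0),
    "North America": region_traffic(8),
    "Asia": region_traffic(16),
}

p95_per_region = sum(np.percentile(traffic, 95) for traffic in regions.values())
p95_aggregated = np.percentile(sum(regions.values()), 95)

print(f"billable if each region is measured separately: {p95_per_region:.1f} Gbps")
print(f"billable if measured on the aggregated curve:   {p95_aggregated:.1f} Gbps")
```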

So for global enterprises, that's a pretty powerful piece to get into your contract negotiation early and make sure it's in the contract?

And you have to ask for it. You have to negotiate for it. There are a whole bunch of really interesting techniques that folks use to try to maximize the price/performance.

Well thanks, Bill. I'm really enjoying our conversation. I wonder if maybe you could kind of bring it all back together. Why should enterprises care about peering? What should they be most hopeful about gaining from having an active peering program, as opposed to say, just letting their transit provider deal with it all?

Yeah, it's an interesting question. As I said before, security is the number one reason that people go down the path of pursuing peering. The second reason is reliability. When people want to have a connection into, say, Salesforce.com, the direct path is the one that has the fewest moving parts between you and Salesforce.com: fewer routers, fewer links that could go down, fewer networks that could be involved in the transaction. So number one is security, number two is reliability, and the third reason is performance.

I did some consulting work for a cloud gaming company recently, and these guys have incredibly tight network requirements: extremely low latency, extremely low jitter, a large amount of traffic, and almost no tolerance for packet loss. Packets that are dropped will be retransmitted before TCP would even have a chance to look at whether it needs to retransmit. So these are the types of enterprises that have specific network requirements that require the kind of reliability that being closer to the eyeballs can give you.
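[Aside for readers: a rough way to see why that application-level recovery matters is to compare it with TCP's retransmission timer. The sketch below uses illustrative assumptions only, a 5 ms round trip, a roughly 200 ms minimum TCP retransmission timeout, and a 60 fps frame budget; none of these figures come from the conversation.]

```python
# Illustrative timing only; real stacks, games, and networks will differ.
rtt_ms = 5.0            # assumed round trip when peered close to the players
frame_budget_ms = 16.7  # one frame at 60 fps

# Application-level recovery over UDP: the receiver notices the sequence gap and
# NACKs immediately, so the missing packet arrives roughly one RTT later.
app_recovery_ms = rtt_ms + 2.0  # plus a little processing slack

# TCP's retransmission timeout is conservative; common stacks won't fire it in
# much under ~200 ms, and fast retransmit still needs several later packets.
tcp_rto_ms = 200.0

print(f"app-level NACK recovery: ~{app_recovery_ms:.0f} ms "
      f"({app_recovery_ms / frame_budget_ms:.1f} frames)")
print(f"TCP timeout recovery:    ~{tcp_rto_ms:.0f} ms "
      f"({tcp_rto_ms / frame_budget_ms:.0f} frames)")
```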

So that's an interesting point. I'm probably, like a lot of folks, lulled sometimes into just thinking the Internet is all TCP/IP traffic, but increasingly more things are being accomplished over UDP.

Particularly the real-time stuff, absolutely. The fourth reason that folks go down the peering path is for better visibility. When you send your traffic over the wall to your transit provider, you really don't have any visibility into how that traffic is being handled by the second ISP or the third ISP that they hand the traffic off to. Contrast that with being directly peered with that fourth ISP in the chain, where the traffic goes directly from you to them and on to the final destination. You have the visibility to see how much traffic you're sending to that fourth ISP in the list, and they have visibility into the traffic that's coming back to you. So when you're trying to debug a problem, you're dealing with the principals that are involved in that transaction.

And then finally there are the cost benefits. This used to be a primary driver for peering; you know, everyone wants to be able to offload their traffic for free. But as I said, the price of transit keeps on dropping every single year, and the price of peering drops too, but maybe not quite as fast. So in some cases it becomes tenuous to make peering a cost-justification type of argument. But those are the five reasons that companies generally go down the peering path.

Well, thanks, appreciate that. And having been through it myself in an organization and talked to peers, I would encourage enterprises to have their network teams look into the detail here and see where the cost benefits really are, because there really are some. One more question on a specific point: remote peering. This is a concept that I know you've been involved in, with companies that work on remote peering. It's kind of a special form. Why would enterprises look at remote peering?

This is one of my favorite plays in the Internet Peering Playbook. With remote peering, what you can essentially do as an enterprise is contract with a transport provider to get you connected to the most popular peering points where you want to peer away your traffic. You can do this, and the cost to you will be the cost of the port and the cost of the transport to get to the exchange point, but you don't have to pay for colocation space and you don't have to pay for a router, because the remote peering provider is delivering you directly into that switch.

Now watch what happens here. If you're smart, what you can do is build in and remotely peer at a whole bunch of peering points all over the country or all over the world. Then, when you see how much traffic you're offloading for free at those various exchanges, you can make your decision about whether you want to build in permanently, establish a colocated presence, buy a router, and participate more fully in that exchange point. And where you're not delivering a lot of traffic, you can disconnect; it's just a matter of turning off the transport circuit and the port on the peering fabric that you leased.
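[Aside for readers: the trial-and-decide loop Bill describes might look something like the sketch below. The traffic figures and thresholds are invented for illustration; the idea is simply to rank exchanges by how much traffic the remote-peering trial actually offloads and sort them into build-in, keep-remote, or disconnect buckets.]

```python
# Offload measured during a remote-peering trial, in Mbps at the 95th percentile.
# All names and numbers are hypothetical.
trial_offload = {
    "IX-A": 12000,
    "IX-B": 3500,
    "IX-C": 400,
    "IX-D": 9000,
}

BUILD_IN_THRESHOLD = 8000    # enough traffic to justify colo space and a router
DISCONNECT_THRESHOLD = 1000  # not worth even the remote port and transport

for ix, mbps in sorted(trial_offload.items(), key=lambda kv: -kv[1]):
    if mbps >= BUILD_IN_THRESHOLD:
        decision = "build in: colo space, own router, full participation"
    elif mbps >= DISCONNECT_THRESHOLD:
        decision = "keep the remote peering arrangement for now"
    else:
        decision = "disconnect: turn off the circuit and the port"
    print(f"{ix}: {mbps:>6} Mbps -> {decision}")
```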

One thing that was interesting when we were setting up our peering program, related to all that, was that it's sometimes a little difficult to know exactly who is on a given peering exchange. Data centers typically don't want to give out that information because they want to respect the privacy of their customers. And yet, in order to know it's a good place to peer, you have to have at least some assumption that you have traffic flowing there that you can exchange. Otherwise you'll essentially show up at a party where there's no one to talk to.

I know there's PeeringDB as one way to get visibility into that. Are there other ways that peering teams can find out, to develop their peering strategy, where they should locate?

Yeah, what I advise my clients to do is start attending some of these Internet operations conferences in person. And the reason you do that is, well, a couple of reasons actually. The first is that you want to be able to have an informal conversation about peering with these companies, many times, to find out what their peering requirements are. Do they peer openly? Would they be receptive to peering with you, given the type of traffic that you're going to be exchanging? You can also find out information from other people. Maybe you don't want to talk to a particular cable company because you're afraid they might say "no," or you want to approach that one very carefully, but you can find out from people in the field what it's like to try to negotiate peering with that particular company.

There's all kinds of great market intelligence there. You mentioned that it might be difficult to find out whether people are at a particular data center. It's a bit of a challenge, too, to know how receptive they will be to peering requests. Very often peering requests just go unanswered. Sometimes you get a no: "No, you don't meet the peering prerequisites." But face to face you can find out, well, sometimes they make exceptions, and these are the things you can do; maybe you're not in the three locations where they want to meet you, but you're in two and you plan on going to the third location next quarter. Those are the types of things where there's some wiggle room or not. And that's the type of ground intelligence you can get at these Internet operations conferences.
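[Aside for readers: on the PeeringDB point above, PeeringDB publishes a public read-only API, so a peering team can pull a rough participant list for an exchange programmatically. The sketch below is a minimal example; the exchange name is a placeholder, the code simply takes the first match, and what appears depends on what each network chooses to publish.]

```python
import requests

API = "https://www.peeringdb.com/api"
ix_name = "Equinix Ashburn"  # placeholder: substitute the exchange you care about

# Look up the exchange record, then list the networks with ports on its fabric.
ix = requests.get(f"{API}/ix", params={"name__contains": ix_name}, timeout=30).json()["data"][0]
ports = requests.get(f"{API}/netixlan", params={"ix_id": ix["id"]}, timeout=30).json()["data"]

print(f"{ix['name']}: {len(ports)} connections listed in PeeringDB")
for port in ports[:10]:
    print(f"  AS{port['asn']:<8} {port['speed']:>8} Mbps  {port.get('ipaddr4') or ''}")
```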

Great discussion. Thank you, Bill. That's it for our episode today. I'm your host Steve Ginsberg and I want to thank our guest Bill Norton. You can find more information from Bill on DrPeering.net, and of course we're available at GigaOm.com. Thanks for listening.

Interested in sponsoring one of our podcasts? Have a suggestion for a great guest? Please contact us and let us know.