DæmonNews: News and views for the BSD community

OpenBGPd in OpenBSD

By Henning Brauer

BGP: Why another implementation?

I started OpenBGPd two years ago, after getting completely fed up with Zebra, which we were running before. There were lots of bugs, a bad configuration language, and performance problems, and since I don't speak Japanese, I had problems understanding the documentation. Zebra makes heavy use of cooperative threads, which leads to its main problem: combined with the central event queue, Zebra can lose sessions while busy. The keepalive events can be way down in the queue, so if something else is consuming all the CPU time, Zebra just doesn't process the keepalives until the peer resets the session. Zebra's successor, Quagga, caught up and apparently fixed many of the bugs. However, it still uses Zebra's design, which I think is wrong, so these issues are more or less unfixable.

Designing our BGP Daemon:

Turning a generic Unix machine into a BGP router requires way more than just adding a userland BGP-speaking process. OpenBGPd has three main components: a Session Engine (SE) that just manages BGP sessions, a Route Decision Engine (RDE) that holds the BGP tables and makes the routing decisions for best path selection, and a parent process that enters routes into the kernel and forks the SE and RDE.




The example shows the master talking to the kernel. The red lines indicate the functions that require root privileges.


Session Engine

The BGPd Session Engine maintains TCP sessions to BGP neighbors, and control sessions to the bgpctl utility. Once a session is established, the Session Engine is responsible for sending keepalives out and processing the keepalives from neighbors, but it does not deal with routes at all. Update messages are passed on to the RDE. It's very lightweight - typically under 1MB of RAM on i386. Sometimes you'll see it getting bigger, but it only does that when it has to buffer a lot of events for a slow neighbor. It runs as an unprivileged user (_bgpd) and chroots to /var/empty, which is, of course, empty except for logging sockets.

Route Decision Engine

The Route Decision Engine maintains the Routing Information Base (RIB), which means the prefix table and the AS path table. The BGP filters run here. The RDE calculates the best possible path per prefix, and generates the UPDATE messages as needed.
The RIB layout is split into many tables that are heavily crosslinked. The goal was to avoid table walks. In fact, we almost never need table walks. We need table walks of course when a new peer comes up (where we need to send them the entire table), and when the (userland) BGP control utility requests display of the entire table. Otherwise, there are no table walks.
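
To make the idea of crosslinked tables concrete, here is a minimal Python sketch. It is not bgpd's actual data layout (bgpd is written in C and its real structures are not shown in this article); the class name `Rib`, its three tables, and the `peer_down` operation are assumptions chosen only to illustrate how back-links let you find the affected prefixes without scanning the whole prefix table.

```python
from collections import defaultdict


class Rib:
    """Toy RIB with crosslinked tables (hypothetical layout, for illustration)."""

    def __init__(self):
        self.prefixes = defaultdict(dict)   # prefix -> {peer: as_path}
        self.paths = defaultdict(set)       # (peer, as_path) -> {prefix, ...}
        self.peer_paths = defaultdict(set)  # peer -> {as_path, ...}

    def update(self, peer, prefix, as_path):
        """Install or replace the route for prefix learned from peer."""
        self.withdraw(peer, prefix)
        self.prefixes[prefix][peer] = as_path
        self.paths[(peer, as_path)].add(prefix)
        self.peer_paths[peer].add(as_path)

    def withdraw(self, peer, prefix):
        """Remove one route and clean up the links pointing at it."""
        as_path = self.prefixes.get(prefix, {}).pop(peer, None)
        if as_path is None:
            return
        self.paths[(peer, as_path)].discard(prefix)
        if not self.paths[(peer, as_path)]:
            del self.paths[(peer, as_path)]
            self.peer_paths[peer].discard(as_path)
        if not self.prefixes[prefix]:
            del self.prefixes[prefix]

    def peer_down(self, peer):
        """Drop everything learned from peer by following the links only."""
        affected = set()
        for as_path in list(self.peer_paths.pop(peer, ())):
            for prefix in self.paths.pop((peer, as_path)):
                del self.prefixes[prefix][peer]
                if not self.prefixes[prefix]:
                    del self.prefixes[prefix]
                affected.add(prefix)
        return affected
```

In this sketch, `peer_down` touches only the entries that belong to the peer, however large the prefix table is. Sending the full table to a new peer would still be a walk over `prefixes`, which matches the exceptions listed above.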



The RDE is very memory efficient - these numbers are from before we had soft reconfig enabled, but it didn't get that much worse, and you can still turn soft reconfig off. One full view needs around 20MB of RAM on i386, and two full views need just 25MB. This is not even half of what you need with other implementations. It's also very fast - it takes about 10 seconds to load a full view on a 1GHz P3, and less than 5 seconds to dump a full view to your peer. Just like the Session Engine, it runs as an unprivileged user and chroots to /var/empty.



The BGPd Decision Process

1) Check if the prefix is reachable at all.
2) Check the local preference, localpref. (bigger is better).
3) Check the AS path length (shorter is better).
4) Check origin (lower is better).
5) Multi Exit Discriminator (MED, lower is better). This is only comparable between routes from the same neighboring AS.
6) EBGP is cooler than IBGP, and is one of our extensions.
7) Weight, which is used to force traffic to your preferred uplink (a cheaper or faster one; whatever).
8) Route age - this is disabled by default. This means older routes get preference.
9) Lowest BGP ID (only used to ensure a winner).
10) Lowest peer address (only used to ensure a winner).
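
The ten steps above can be sketched as a pairwise comparison. This is an illustration in Python, not bgpd's C code: the `Route` fields and the `prefer` and `best_path` helpers are assumed names, origin is encoded as 0 = IGP, 1 = EGP, 2 = incomplete, and weight is treated as "bigger is better".

```python
from dataclasses import dataclass
from functools import reduce
from ipaddress import ip_address


@dataclass(frozen=True)
class Route:
    nexthop_reachable: bool = True
    localpref: int = 100
    aspath_len: int = 1
    origin: int = 0          # 0 = IGP, 1 = EGP, 2 = incomplete
    med: int = 0
    neighbor_as: int = 0
    ebgp: bool = True
    weight: int = 0
    age: int = 0             # seconds since the route was learned
    bgp_id: int = 0
    peer_addr: str = "0.0.0.0"


def _addr_key(addr):
    a = ip_address(addr)
    return (a.version, int(a))


def _steps(a, b, use_route_age):
    """Yield one (key_a, key_b) pair per decision step; the lower key wins."""
    yield (not a.nexthop_reachable, not b.nexthop_reachable)  # 1) reachability
    yield (-a.localpref, -b.localpref)                        # 2) bigger is better
    yield (a.aspath_len, b.aspath_len)                        # 3) shorter is better
    yield (a.origin, b.origin)                                # 4) lower is better
    if a.neighbor_as == b.neighbor_as:                        # 5) MED, same AS only
        yield (a.med, b.med)
    yield (not a.ebgp, not b.ebgp)                            # 6) EBGP over IBGP
    yield (-a.weight, -b.weight)                              # 7) weight
    if use_route_age:                                         # 8) off by default
        yield (-a.age, -b.age)
    yield (a.bgp_id, b.bgp_id)                                # 9) lowest BGP ID
    yield (_addr_key(a.peer_addr), _addr_key(b.peer_addr))    # 10) lowest peer address


def prefer(a, b, use_route_age=False):
    """Return the better of two routes for the same prefix."""
    for key_a, key_b in _steps(a, b, use_route_age):
        if key_a != key_b:
            return a if key_a < key_b else b
    return a


def best_path(routes, use_route_age=False):
    return reduce(lambda x, y: prefer(x, y, use_route_age), routes)
```

The comparison is pairwise rather than a single sort key because MED only applies when both routes come from the same neighboring AS, so it cannot be folded into one global ordering.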

Weight can help a lot. More and more often, you'll see equally long AS paths from your uplinks, because they're at the same exchange points. For traffic engineering, we want the possibility to express a preference, and that can't be done with localpref, because localpref comes before the AS path length check. So, we added weight. We know it somewhat clashes with weight implementations by others. We asked others to help us come up with a better keyword, but unfortunately nobody came up with anything, so that's what it is now.




The BGPd Parent Process

The parent process is responsible for getting the routes into the kernel. It does nexthop validation, and maintains its own copy of the kernel routing table. For that, it has to fetch the kernel routing table and the interface list at startup. It then listens to the routing socket, where all changes to the kernel routing table show up as messages, and uses those messages to keep its internal view in sync.

The way we coded it, OpenBGPd notices if you manually fill the routing table, and we cope with it instead of overwriting your manual changes. We have an internal list of interfaces and their status, and that is kept in sync with the kernel as well. We do know about the interface link status - it's often said to be almost impossible in Unix to get a link state - actually, it's not that hard. We use that for next hop verification; an interface that doesn't have a cable plugged in probably doesn't lead to a useable next hop. Yes, that means we notice when you quickly pull the cable, and we invalidate the next hop.
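
The link-state-based next hop verification described above can be sketched roughly like this. It is a simplified Python model, not the parent process's real code: the `KernelView` class and its methods are hypothetical, only directly connected next hops are handled, and a "message" is just a method call standing in for what would arrive on the routing socket.

```python
from ipaddress import ip_address, ip_network


class KernelView:
    """Toy copy of interface state, updated by events instead of polling."""

    def __init__(self):
        self.link_up = {}     # interface name -> bool
        self.connected = {}   # interface name -> directly connected network
        self.nexthops = {}    # nexthop address -> last known validity

    def add_interface(self, name, network, up):
        self.link_up[name] = up
        self.connected[name] = ip_network(network)

    def _valid(self, nexthop):
        # A nexthop is usable if some interface with link covers it.
        addr = ip_address(nexthop)
        return any(self.link_up[name] and addr in net
                   for name, net in self.connected.items())

    def track(self, nexthop):
        """Start tracking a nexthop and return its current validity."""
        self.nexthops[nexthop] = self._valid(nexthop)
        return self.nexthops[nexthop]

    def link_change(self, name, up):
        """Handle one link state message; return nexthops whose validity flipped."""
        self.link_up[name] = up
        changed = {}
        for nexthop, old in self.nexthops.items():
            new = self._valid(nexthop)
            if new != old:
                changed[nexthop] = new
        self.nexthops.update(changed)
        return changed
```

In this model, pulling the cable is a single `link_change("em0", False)` call, and the set of invalidated next hops is known as soon as the message has been handled.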

We do not need periodic next-hop table walks at all, like "a big vendor" and Zebra do. This means we react much faster to interface state changes - there's a delay of up to 30 seconds in Cisco routers and Zebra installations. The internal view of the kernel routing table can be coupled to and decoupled from the kernel. This originally was a debugging feature, because one of my test machines didn't have enough memory, but it turned out to be very useful. It's really fast - with a full table, it takes less than 3 seconds on a 750MHz P3. The parent process needs about 6 to 7MB total with full view configurations.

Features

TCP MD5 Signatures

TCP sessions are typically unauthenticated as we know, so we implemented TCP MD5 signatures as a security association in the IPSEC framework. They are really just a special form of an IPSEC authentication header. This means I had to code a pfkey interface in BGPd, which was not really fun, to interact with the IPSEC framework. TCP MD5 signatures are not a new attack vector. There are people spreading that, but it's pretty much FUD. Of course, by the time you hit the TCP MD5 code, you already have to (correctly) hit the sequence number, the port number, the right addresses - the chance to hit all that is pretty low, and even then, MD5 is really cheap. The conclusion from that is, it's kind of weak - but it's extremely easy to configure, and it works with almost everything out there, so why not go ahead and use it.

IPSEC Integration

Since we have the pfkey interface already, it was not too hard to do real IPSEC. BGPd loads the security associations (keys) into the kernel, and sets up the flows (routes) for IPSEC. Juniper can do static keyed IPSEC as well, and we're compatible with that. As far as I know, Cisco cannot - there might be some expert feature set that you can pay extra for, but I don't know.

Instead of doing static keying, we can also use isakmpd to do the keying for us - which also means the keys are changed on a regular basis. BGPd asks the kernel for an unused pair of SPIs (identifiers), and uses them. BGPd also loads the flows into the kernel. That's usually done by isakmpd, but in this case it's done by BGPd, because BGPd already knows the endpoints. That means isakmpd only needs to handle the keying, so it needs very little configuration. Everyone who's ever had to write an isakmpd configuration file will value that. Here's a complete howto:


1) Copy the keyfiles, which are generated during the first boot of OpenBSD, over to your peer.
2) Run isakmpd -ka
3) Done.






pf Integration

BGP is an efficient way to distribute lists of network prefixes - they don't necessarily need to be routes. BGPd can add prefixes learned from its neighbors to a pf table. The prefixes to add to the table are selected using the filter language.

The tables in pf use a radix tree - which is the same code used by the kernel routing tables; it's very fast, even with a lot of entries. In turn, pf tables can be used for pretty much everything. You can do packet filtering based on that - you can redirect packets, for example, to a userland spam daemon. This in turn means you're using BGP distributed spam blacklists instead of using the stupid DNS-based approach. Or, you can do QoS processing.

Route Labels

BGPd can attach labels to routes. Labels are basically 32 bytes of free-text information that can be attached to the route and stored with the route in the kernel routing table. Well, they're not stored directly, but who cares about the implementation details. PF can then filter based on those labels, so you can write rules to classify traffic for QoS.

For example, you can pick all routes labeled as "MCI" and apply QoS. You can tell your customers that MCI is always very slow, and forget to mention that you play a part in that.



Combining BGP information with PF capabilities is really very powerful. You can limit states per source address, depending on the source AS. Let's say you have your broadband ISP. You know where the hackers are, and can limit DDoS effects by limiting connections per IP address to 10, or something similar. You can also use the maximum source connection rate features in PF to fight off DDoS, and filter based on origin AS numbers.





Integration with CARP

CARP is the Common Address Redundancy Protocol. This allows you to share an IP address in a master/backup scenario. CARP's much like VRRP, but unencumbered by patents. It's actually better, because it's properly authenticated and faster.

A typical [usage] case is exchange points. If you only get one IP address in the exchange point network, why not use two boxes there and share the IP address using CARP? This works without special support from BGPd, but we can do even better: if we make BGPd aware of the CARP master/backup state, we can force sessions that depend on the CARP interface into state idle, so they don't even try to connect as long as we are not master. The moment we become master, all sessions depending on the CARP interface immediately try to connect to the neighbor, which in turn leads to way faster failover.
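
The CARP coupling described above boils down to a small state rule, sketched here in Python. The `Session` class, its `depend_on` field, and the `carp_state_change` function are hypothetical names for illustration, and the session state machine is reduced to just "Idle" and "Connect".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    peer: str
    depend_on: Optional[str] = None   # interface this session depends on, if any
    state: str = "Idle"


def carp_state_change(sessions, ifname, is_master):
    """React to a CARP master/backup transition on ifname.

    Sessions that depend on the interface are held in Idle while we are
    backup, and start connecting the moment we become master.
    Returns the peers whose sessions were touched.
    """
    touched = []
    for session in sessions:
        if session.depend_on != ifname:
            continue
        session.state = "Connect" if is_master else "Idle"
        touched.append(session.peer)
    return touched
```

Sessions without a dependency are left alone, so only the peers reached through the shared address follow the CARP state.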

IPv6 Support

IPv6 support has been implemented since What The Hack 2005. Almost everything "just works" like IPv4. Lots of testing is needed, but so far it doesn't look bad.









BGPd Configuration

The config files are split into 5 sections:

(1) macro definitions - just like in pf.

(2) global settings.

(3) networks to announce.

(4) neighbor definitions.

(5) filters.



Macro Definitions
This is an example where we see some macros defined. We then have the global configuration - the only mandatory global setting is your own AS number. router-id and "listen on" can auto-configure; holdtime has a sensible default. "fib-update yes/no" determines whether or not bgpd updates the kernel routing table, that is, whether it starts out coupled or decoupled. The final part is the "network" definition, where you place the networks you want to announce.



Neighbor Definitions
Here we have the neighbor definition, starting with the IP address, with the remote AS, description, etc. inside the braces. You can see TCP MD5 configured here; it's really very easy. The last setting is the announce keyword, which can save a lot of work. The options are "none" - announce nothing, "self" - announce only your own networks, "all" - announce everything, or "default-route" - announce the default route and nothing else. Many other implementations need filters to do this, but with OpenBGPd, it is just one configuration setting.

Neighbors can be grouped - members inherit settings from the group. A neighbor can also be defined with a whole network instead of a single address. When a connection comes in from any host within the specified network, the AS is taken from the OPEN message, and we clone the neighbor definition. When the session drops, we keep the clone for a while and remove it later, to aid with flap suppression. This is nice for redistributing BOGONs or blacklists, or something similar; you can define your whole router network, instead of configuring all your routers individually.
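
As a rough illustration of the cloning step, here is a Python sketch. The `Neighbor` fields and the `clone_for` helper are assumed names, not bgpd internals, and the delayed removal of clones after a session drop is left out.

```python
from dataclasses import dataclass, replace
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Neighbor:
    address: str                     # a single address, or a prefix for a template
    remote_as: Optional[int] = None  # unknown for a template until OPEN arrives
    descr: str = ""
    announce: str = "none"


def clone_for(templates, peer_addr, open_as):
    """Clone the first template whose network contains the connecting peer.

    The clone keeps every setting of the template, but gets the peer's own
    address and the AS number taken from its OPEN message.
    Returns None if no template covers the peer.
    """
    addr = ip_address(peer_addr)
    for template in templates:
        if addr in ip_network(template.address):
            return replace(template, address=peer_addr, remote_as=open_as)
    return None
```

For example, one template for 10.0.0.0/24 with `announce="all"` is enough to serve every router in that network, each one getting its own cloned definition when it connects.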

Here's an example of IPSEC with static keying. The long keys are a little bit unwieldy of course, which leads us directly into the next example: IKE using dynamic keys. It shows how simple IPSEC configuration really is with OpenBGPd - just put "ipsec esp ike" or "ipsec ah ike" (authentication header), and you're done.



Filter language

This is currently where the most development is being done. It's modeled after pf, and should be pretty easy to follow. Shown here are the filters we have in our default OpenBGPd configuration. The filter language is "last match" - the last matching rule wins. Each rule has three parts:

(1) action: allow, deny, or match. Match does not change the allow or deny state.

(2) match: where you can match based on prefix, prefixlen, or parts of / the entire AS path

(3) set: add prepends, modify localpref, metric, assign a pf label, etc.
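
The "last match" evaluation with its three rule parts can be sketched as follows. This is a Python illustration, not the real filter engine: the `Rule` class and `evaluate` function are assumed names, the verdict used when no allow or deny rule matches is a parameter because this sketch makes no claim about bgpd's default, and applying the set part of every matching rule is likewise an assumption.

```python
from dataclasses import dataclass, field
from ipaddress import ip_network
from typing import Callable


@dataclass
class Rule:
    action: str                                    # "allow", "deny" or "match"
    matches: Callable[[dict], bool]                # the match part
    set_attrs: dict = field(default_factory=dict)  # the set part


def evaluate(rules, route, default="deny"):
    """Run every rule in order; the last matching allow/deny decides."""
    verdict = default
    attrs = dict(route)
    for rule in rules:
        if not rule.matches(attrs):
            continue
        attrs.update(rule.set_attrs)      # set part applies on any match
        if rule.action != "match":        # "match" leaves the verdict alone
            verdict = rule.action
    return verdict, attrs


# Hypothetical rule set: allow everything, drop anything longer than /24,
# and raise localpref for paths that contain AS 65001.
rules = [
    Rule("allow", lambda r: True),
    Rule("deny", lambda r: ip_network(r["prefix"]).prefixlen > 24),
    Rule("match", lambda r: 65001 in r["as_path"], {"localpref": 200}),
]
```

Because evaluation never stops early, rule order matters in the opposite way from a "first match" language: general rules go first and exceptions follow them.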


Userland Tools

bgpctl

The bgpctl program connects to bgpd using a Unix domain socket. You can query runtime information, reload the configuration, couple or decouple the kernel routing table, and bring specific sessions up or down.

Shown here is an example from one of my production routers showing an overview. It should look pretty familiar. In the next slide we see a kernel overview showing connected static routes, and in the next slide we see a view to a specific neighbor. Next, we take a look at the timers as well.

Here's a view of the nexthops. We actually display the nexthop depending on link state and speed.

Forcing a configuration reload or coupling/decoupling the FIB is easy, as shown here. You can bring neighbors up based on their name or IP address. The last example is special: asking for a neighbor that doesn't exist now gets you a "no such neighbor" reply, where there used to be no feedback at all.

Looking Glass

BGPd has a second, restricted control socket now; I coded that two weeks ago. It only allows certain messages - namely those behind the bgpctl "show" operations. When running httpd in a chroot environment, which is the default on OpenBSD, a CGI can call a bgpctl binary placed inside the chroot, passing it the path to this restricted socket. Then you just need the CGI to call that, and the looking glass is done.

The cgi... yeah, someone needs to sit down and hack that, but it should be easy.




The Status Quo:

BGPd is very stable - it has to be, otherwise my company would be offline. It's in use at quite a few sites already, including sites with many, many peers. I don't have hard facts about the number of sites using OpenBGPd - after all, this is free software, so nobody has to tell me that they're using it. On the other hand, quite a few people have mailed me, and they usually express that they're happy with the quality and ease of use. I like getting those emails, so if you're using OpenBGPd, please mail me.

BGPd is around 20,000 lines of code. bgpctl is ~2,000, and the man pages are 2,340 - and yeah, they're in English. :)

OpenBGPd has been part of the OpenBSD distribution since 3.5, which was released May 1st, 2004. There's a new release every 6 months, with pretty thorough testing.

 

Thanks

I want to thank Claudio Jeker, who's writing OpenBGPd with me (and others); Chaos Computer Club, for sponsoring the route-server project; Theo, who kicked my ass until I finally worked on OpenBGPd; Andre Oppermann, who designed the RDE with us and funds much of Claudio's work on BGPd; Wim Vandeputte, for his continued support (and beer supply); and DE-CIX for the great cooperation, tech meetings at the castle of Kransberg, and sponsoring my trip to NANOG 36. This is all free software, but of course funding is needed. So if you can, please donate or at least buy CDs.


Q&A Session:

Audience: Regarding the route server - you said you allow label tagging to allow people to handle redundancy, etc. That has an indication that your labels are transmitted through BGPd. I'm a little confused.
Henning: No; in BGPd, you can label routes. The labels are stored with the route in the kernel routing table. They are not redistributed.

Audience: Then, if I talk to your route server, and routes are labeled, and you say I can make decisions based on them, how do I get them?
Henning: No.. it's just a community thing. For example, we label a community if it's from switch #1, and another community if it's from switch #2, etc. That's it. But labels are not communities. That's different.

Audience: Do you know if CARP can be ported to other distributions?
Henning: There is a userland implementation that works on many Unix systems, but of course userland is the wrong place to solve the issue. I'm not certain of the state of it on FreeBSD; same with NetBSD - there definitely is code, but I'm not sure whether they've finally imported that. It does not run on Linux.

Audience: Correct me if I'm wrong, but it initially started with OpenBSD, correct?
Henning: Yes.

Audience: I was curious about your description of using an IPSEC SA to do the TCP MD5 stuff. You mentioned that this wasn't a [viable] attack because the TCP processing and sequence number stuff occurs before you do the checking... but I'm wondering how you prevent the IPSEC kernel stuff from occurring before the kernel gets ahold of the packet, if you're using an IPSEC SA.
Henning: If you're using encryption, IPSEC ESP for example, of course you cannot avoid this. If you're using TCP MD5, the packets themselves are not encrypted; there's just an authentication checksum. There's the MD5 signature. But, you still have all the TCP information in cleartext already, so it's easy to check that before you bother to check the signature.

Audience: Ok, so if you're using an IPSEC AH... is that what you mean?
Henning: Well ... this is for TCP MD5.

Audience: OK, but you said you were using an IPSEC SA for the TCP MD5...
Henning: Well, we just implemented it in the IPSEC framework, using the same interfaces in userland. If you think about it, it's just a special case of IPSEC AH.

Audience: You said something about picking two SPIs. IPSEC SPIs are usually picked by the destination; do you mean that one side of the peering session picks the SPI for both sides of the conversation?
Henning: They don't need to match. You just need two; you need one per direction, but they don't need to match on both hosts.

Audience: Your discussion about cloning sessions, and accepting connections from anything on the network - how does that work with TCP MD5? Do they have to have the same shared secret on all those hosts?
Henning: Yes. If you want to use TCP MD5 and different passwords, you'll need to configure that for each peer. There's no way around that.

Audience: Do you have any mechanism that would prevent people from allowing that cloning stuff without doing TCP MD5?
Henning: There's really no reason to prevent that, depending on what you use it for. If you set a server that redistributes a BOGON list, there's nothing wrong with everyone connecting to it. I don't see the point in preventing anything like that.

Audience: ... preventing BGP connections from anyone in the world?
Henning: Why not? If you're not accepting anything, and just redistributing a BOGON list, why not? It's just an option. You have to know what you're doing - it's not enabled by default.




    Author maintains all copyrights on this article.
    Images and layout Copyright © 1998-2006 Dæmon News. All Rights Reserved.