<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://askbow.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://askbow.com/" rel="alternate" type="text/html" /><updated>2026-02-18T07:51:41+00:00</updated><id>https://askbow.com/feed.xml</id><title type="html">🐱‍👤Askbow</title><subtitle>Lorem ipsum generator is broken.</subtitle><entry><title type="html"></title><link href="https://askbow.com/2026/02/18/2015-04-20-dual-homing-with-policy-based-routing-a-case-study" rel="alternate" type="text/html" title="" /><published>2026-02-18T07:51:41+00:00</published><updated>2026-02-18T07:51:41+00:00</updated><id>https://askbow.com/2026/02/18/2015-04-20-dual-homing-with-policy-based-routing-a-case-study</id><content type="html" xml:base="https://askbow.com/2026/02/18/2015-04-20-dual-homing-with-policy-based-routing-a-case-study"><![CDATA[<p><a href="https://askbow.com/2015/04/13/policy-based-routing-ip/" title="Policy-based">Policy-based routing</a> allows a network administrator to steer traffic in directions different from those chosen by <a href="https://askbow.com/2015/04/06/general-ip-routing/" title="General">destination-based routing</a> and its <a href="https://askbow.com/tag/routing-protocols/">routing protocols</a>. This can be useful in several scenarios, notably dual-homing to different ISPs, as well as other special cases.</p>

<h2 id="using-policy-based-routing-for-dual-homing">Using policy-based routing for dual-homing</h2>

<h3 id="general-notes-on-dual-homing">General notes on dual-homing</h3>

<p>The term dual-homing, in its most general meaning, refers to a situation in which connectivity to the same resource is built via two (or more) independent paths. Here, dual-homing is about having connectivity to another network (the Internet) via several Internet Service Providers.</p>

<p>There are usually several goals to achieve with this type of connection to the Internet:</p>

<ul>
  <li>Internet connection resiliency - with businesses world-wide relying ever more on the Internet (even more so with the rise of cloud services) to operate, the Internet connection might be as important as air; thus, many prefer to install a secondary link in case the primary link fails;</li>
  <li>More bandwidth - when one connection isn't enough;</li>
  <li>Cost optimization - some ISPs might charge you for traffic volume (<em>others just provide you with limited bandwidth, but with no limit on traffic besides the maximum possible time × bandwidth</em>), and some of those will charge differently for different kinds of traffic;
    <blockquote>
      <p>for example, I remember days when a local (in-country) megabyte of traffic was way cheaper than a foreign one. At the same time, due to poor interconnectivity at the local IX, traffic to a nearby city was routed through an IX in Germany and thus treated as foreign.</p>
    </blockquote>
  </li>
</ul>

<p>Probably the best solution (design-wise) in this case is to get a provider-independent network and an autonomous system number, then use these to peer with both ISPs via BGP. This solution is usually flexible, scalable, and relatively easy to support. On the other hand, it may cost more - especially with IPv4 address depletion at hand.</p>

<h3 id="policy-based-routing-to-the-rescue">Policy-based routing to the rescue</h3>

<p>When there are reasons not to buy address space, but we still want to use all of the links to the Internet simultaneously, policy-based routing (PBR) will help.</p>

<p>With PBR, a network administrator is able to:</p>

<ul>
  <li>(<em>in some topologies</em>) ensure that traffic coming in via one ISP will return via the same ISP</li>
  <li>route HTTP and FTP (or any other port) traffic to a certain ISP, while routing SMTP and DNS via another</li>
  <li>ensure that traffic from some users will be forwarded to a certain ISP, or even load-shared per L4 port</li>
</ul>

<p>Have a web server that you need to always be visible through only one ISP? PBR can do that.</p>

<h2 id="the-case-for-pbr-based-dual-homing">The case for PBR-based dual-homing</h2>

<h3 id="dual-homing-topology">Dual-homing topology</h3>

<p>So, here is my example topology:</p>

<blockquote>
  <p><em>I'm trying my best not to use completely textbook examples here, so I use a fairly simplified topology from my work experience</em></p>
</blockquote>

<p><a href="https://askbow.com/wp-content/uploads/2015/04/pbr-example2.png"><img src="https://askbow.com/wp-content/uploads/2015/04/pbr-example2.png" alt="Example" /></a></p>

<p>(<em>not shown: some less important redundant connections</em>) Here, Routers 1&amp;2 are the border routers, and Router3 performs NAT and firewalling between the Internet and the LAN/DMZ. In that capacity, Router3 terminates all spare IPs in the two /27 networks provided by the ISPs. Also, Router3 cannot perform policy-based routing itself for some reason.</p>

<p>Router 3 has two default routes, pointing at Routers 1&amp;2 (only two, because I'm keen on VRRP) in each of the /27 networks. ISP1 doesn't know (and thus doesn't route) ISP2's network, and vice versa.</p>
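<p>As a sketch (addresses assumed for illustration; 1.1.1.2 and 2.2.2.2 stand in for the VRRP virtual addresses of Routers 1&amp;2 in the two /27 networks), Router3's pair of default routes could look like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ip route 0.0.0.0 0.0.0.0 1.1.1.2
ip route 0.0.0.0 0.0.0.0 2.2.2.2
</code></pre></div></div>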

<p>The ISPs' gateways MUST have a route in their tables pointing traffic destined to the respective /27 networks towards our Routers 1&amp;2.</p>

<p>We need to ensure that traffic originated by Router3 from its IPs in ISP1's /27 network will always go towards the ISP1 router that serves as its default gateway; the same goes for ISP2's network: always forward to the ISP2 gateway.</p>

<h3 id="how-policy-based-routing-is-configured-in-this-case">How policy-based routing is configured in this case</h3>

<p>There are two elements that need to be configured for this to work:</p>

<ol>
  <li>Two access lists, to match traffic origins</li>
  <li>Two route-maps, to do the PBR itself</li>
</ol>

<p>An access list will look like this (Cisco IOS 15.x):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ip access-list standard ISP1-NET
 permit 1.1.1.0 0.0.0.31
</code></pre></div></div>

<p>If the same ISP owns more than one network, the respective ACL will contain either more lines (suggested for ease of management) or a summary network (which may or may not improve speed, depending on the platform).</p>
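<p>For instance, if ISP1 assigned us a second (hypothetical) /27, the ACL would simply grow by a line:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ip access-list standard ISP1-NET
 permit 1.1.1.0 0.0.0.31
 permit 1.1.2.0 0.0.0.31
</code></pre></div></div>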

<p>A route-map will then look like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>route-map ISP-FORWARD permit 10
 match ip address ISP1-NET
 set ip next-hop 1.1.1.1
</code></pre></div></div>

<p>Here, we match addresses listed in the access list created previously, and for any matching packets we set the next hop to 1.1.1.1.</p>

<p>Another option would be to <em>set ip next-hop recursive.</em> The <em>recursive</em> keyword helps when the next-hop is not adjacent (that is, not directly reachable on one of the connected networks). Not our case, but still nice to know.</p>
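<p>As a hedged sketch (192.0.2.1 is a made-up, non-adjacent gateway address, not part of the original topology), the recursive variant would read:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>route-map ISP-FORWARD permit 10
 match ip address ISP1-NET
 set ip next-hop recursive 192.0.2.1
</code></pre></div></div>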

<p>Obviously, the same configuration is made for ISP2, using relevant addresses for networks and hosts/gateways.</p>
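<p>For completeness, the ISP2 half might look like this (the 2.2.2.x addresses are placeholders assumed for illustration):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ip access-list standard ISP2-NET
 permit 2.2.2.0 0.0.0.31
!
route-map ISP-FORWARD permit 20
 match ip address ISP2-NET
 set ip next-hop 2.2.2.1
</code></pre></div></div>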

<p>After that, the route-map (it is possible to create multiple, but I found it more manageable to use just one) is applied to each of the <em>inside-facing</em> interfaces:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>interface gi0/1.42
 ip policy route-map ISP-FORWARD
</code></pre></div></div>

<p>Moreover, it might be necessary to make traffic originated by the router itself behave the same way:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ip local policy route-map ISP-FORWARD
</code></pre></div></div>

<h3 id="how-policy-based-routing-works-in-this-case">How policy-based routing works in this case</h3>

<p>As illustrated by the colorful arrows on the diagram above, PBR makes any traffic originating from the orange subnets be forwarded to the first ISP, and any traffic originating in the purple network be forwarded to the second ISP. That is, it follows this simple procedure:</p>

<ol>
  <li>A packet arrives on an ingress interface; <strong>N.B.:</strong> <em>route-maps, in PBR cases, are always applied on ingress interfaces</em></li>
  <li>The route-map applied to the interface has a PERMIT statement with a MATCH clause referencing an access list</li>
  <li>If the packet matches any of the access-list statements, the SET directive is applied; otherwise the next route-map statement is evaluated</li>
  <li>The SET directive tells the router to forward the packet to the stated next hop.</li>
  <li>The next-hop is evaluated and the egress interface is determined</li>
  <li>The packet is forwarded to the next-hop out of the egress interface</li>
</ol>
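<p>To verify that the policy behaves as intended, a few standard IOS commands help (output omitted here): <em>show route-map</em> displays per-clause match counters, <em>show ip policy</em> lists which route-map is applied to which interface, and <em>debug ip policy</em> traces per-packet decisions (use with care in production).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>show route-map ISP-FORWARD
show ip policy
debug ip policy
</code></pre></div></div>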

<p>The return traffic is forwarded normally, using the regular destination-based routing process.</p>

<h3 id="how-to-live-with-that">How to live with that</h3>

<p>As ISPs and inside networks are added and removed (that happens too), this configuration is relatively easy to maintain:</p>

<ul>
  <li>If an ISP gives you another network, it is easy to add it to that ISP's respective ACL</li>
  <li>If, for some reason, another interface must be added, the PBR-related configuration can be copied from existing interfaces as-is;</li>
  <li>better still, have Router3 terminate them in its logic and only add a new [static] route pointing to Router3 in the Router 1&amp;2 configurations, plus a new line to the ACL</li>
</ul>

<p>There are some problems, though. For example, Router3 must be able to decide which IP to originate traffic from, how to maintain state for the flows traversing it, and how to learn about failures upstream.</p>

<p>The scheme I've drawn here protects from a Router 1 or 2 failure (or a failure of one of their interfaces). Some parts are missing (like VRRP and IP SLA tracking) which are not relevant to the PBR case itself but are, in reality, present in the configuration.</p>

<p>The problem is that this technique doesn't help in case of ISP failure, especially a partial failure (for example, an ISP loses connectivity to half of the Internet) - we can't detect or work around such a failure (BGP peering, receiving full views from both ISPs, would help).</p>

<p>Thus the PBR option is far from perfect and should be used for dual-homing with caution and careful planning.</p>]]></content><author><name></name></author></entry><entry><title type="html">Explain Like I’m 5: ChatGPT</title><link href="https://askbow.com/2025/05/02/eli5-chatgpt" rel="alternate" type="text/html" title="Explain Like I’m 5: ChatGPT" /><published>2025-05-02T08:30:00+00:00</published><updated>2025-05-02T08:30:00+00:00</updated><id>https://askbow.com/2025/05/02/eli5-chatgpt</id><content type="html" xml:base="https://askbow.com/2025/05/02/eli5-chatgpt"><![CDATA[<h1 id="explain-like-im-5-chatgpt">Explain Like I’m 5: ChatGPT</h1>

<p>Scrolling morning Reddit, I stumbled upon a great question in the ELI5 sub.
It went something like this:</p>
<blockquote>
  <p>Why doesn’t ChatGPT admit it doesn’t know stuff?</p>
</blockquote>

<p>Most top answers there were correct, but alas pitched at high-school level or above.</p>

<p>It got me thinking. Can I explain it to a literal five-year-old?
Now, five-year-olds in the year 2025 are probably way smarter than I was.
Here’s an explanation that would have worked for my own self, as far as I was aware of things at that age.</p>

<blockquote>
  <p>What follows is a really basic analogy. Like any model, an analogy has its limits.</p>
</blockquote>

<h1 id="eli5-how-do-they-train-a-model">ELI5: How do they “train” a “model”?</h1>
<p>Imagine a rabbit hopping on a field 🐇 ⛳⛳⛳
The field is uneven: little hills🏞, little valleys🌄⛺</p>

<p>We plant on the field various grasses 🌾🌺🌻🌼🌷 the rabbit might like to chew or hide in.
We also place little rocks to create paths, place food 🥕🌽 and water 🌊 so that the rabbit could go to these things.
All in all, we made a really huge garden for our rabbit to roam.</p>

<p>We first guide the rabbit through the garden by some path. We show it some food and some water. Then, we let it roam free.</p>

<p>It starts hopping around. Depending on how playful it is, it might even jump over our little stone fences to other areas! 
It goes on some journey around our garden. We record its path on our map🗺.</p>

<p>Maybe the path we have drawn on the map is not what we’d like to see. We want the rabbit’s journey to make a pretty picture on our map.</p>

<p>We carefully move the things around in the garden. We hope to guide the rabbit’s journey more to our liking.</p>

<p>We repeat the experiment many times, a million million times, until the rabbit’s journey on the map looks pretty.</p>

<p>The rabbit itself doesn’t see our map. It just chooses where to hop next. And it can only hop from the point where it is to some other nearby point that it sees and wants to get to.</p>

<h2 id="how-does-this-translate">How does this translate?</h2>

<ul>
  <li>The rabbit hopping around carelessly is the LLM’s algorithm.
We can have slightly different algorithms by picking other animals!</li>
  <li>The placement of the things in the garden are called the “model weights”. The weights that made a rabbit’s path pretty may not work for a horse🐎! A horse and a rabbit share some fundamentals. They are both mammals, have four legs, eat grass, etc. But they differ in development and movement ability. The LLM algorithms share the common fundamentals of the neural nets, generators, and the attention feedback loop.</li>
  <li>The initial path by which we guide the rabbit before setting it free is called the “prompt”.</li>
  <li>The training makes the path look “pretty”, just like the model training made the produced text look like it was written by a person.</li>
</ul>

<h1 id="eli5-how-do-people-interact-with-a-model">ELI5: How do people interact with a “model”?</h1>

<p>We make the whole thing into a farm!</p>

<p>We let other people visit our farm and play with the animals on their respective fields.</p>

<p>For example, our visitors can take turns introducing a rabbit to the field: guiding it in, and then letting it roam free. Then it hops around a bit.
Then they guide it some more, until the rabbit’s journey drawn on a map looks pretty to them.</p>

<p>But it turns out, the initial guiding path is important. So we take over guiding the rabbit the first few steps into the garden, before letting our visitors take it further.</p>

<h2 id="how-does-this-translate-1">How does this translate?</h2>

<ul>
  <li>people can choose different underlying models, like they can choose which animal to play with</li>
  <li>people write their question to the model, the same way we let them guide the rabbit into the field</li>
  <li>after letting the rabbit roam free for some time, people can guide it a little more, adding to their interaction – but the whole rabbit’s journey is recorded; the same way you respond to ChatGPT’s initial reply if it wasn’t exactly what you wanted</li>
  <li>when we take over the initial few steps, we create what is called the “system prompt”</li>
</ul>

<h1 id="eli5-why-doesnt-it-stop-and-say-it-doesnt-know">ELI5: Why doesn’t it stop and say it doesn’t “know”?</h1>

<p>The rabbit is careless and doesn’t know anything. The beauty is in the eye of the beholder. The figures we draw on our map as we trace the rabbit’s path look nice to us, but the rabbit doesn’t really see the map. It just chooses where to hop next.
And it always does hop somewhere in the garden, attracted by the things it sees that we placed there.</p>

<h2 id="how-does-this-translate-2">How does this translate?</h2>

<ul>
  <li>an LLM operates on symbols connected by a graph, and all it does is find some pseudo-random path on that graph.</li>
  <li>I like an explanation I’ve read from Kent Beck recently, for coding LLMs. There is a set of all possible programs. The LLM doesn’t know which ones are actually correct. More generally, an LLM finds a nice-looking path in a field of all possible vectors of tokens. It always goes along <strong>some</strong> path.</li>
</ul>

<h1 id="eli5-what-do-the-llm-researchers-and-developers-do">ELI5: What do the LLM researchers and developers do?</h1>

<p>It’s a lot of work.</p>

<p>Some people are herding and nurturing the rabbits, the horses, and the cats. Somebody has to breed those special fluffy chickens too! Some people try to make it work with little mice and hamsters.</p>

<p>Other people are doing some landscape design so that the animals roam the gardens in specific ways. There are other people who make the shovels and excavator machines for the gardeners.</p>

<p>Yet other people create the initial paths that are later useful for the guests.
Then some other people build the farm and post the ads.</p>

<h2 id="how-does-this-translate-3">How does this translate?</h2>

<ul>
  <li>some people design and optimize the algorithms to a variety of performance requirements</li>
  <li>some people run the training to produce the model weights for various needs</li>
  <li>some people make the hardware that runs the computation</li>
  <li>some people develop the prompts, set up the infrastructure (backend, frontend), etc</li>
</ul>]]></content><author><name></name></author><category term="musings" /><summary type="html"><![CDATA[Explain Like I’m 5: ChatGPT]]></summary></entry><entry><title type="html">How to find initial function authors in large Git repo</title><link href="https://askbow.com/2023/12/17/how-to-find-information-in-git" rel="alternate" type="text/html" title="How to find initial function authors in large Git repo" /><published>2023-12-17T09:30:00+00:00</published><updated>2023-12-17T09:30:00+00:00</updated><id>https://askbow.com/2023/12/17/how-to-find-information-in-git</id><content type="html" xml:base="https://askbow.com/2023/12/17/how-to-find-information-in-git"><![CDATA[<p>Sometimes, when onboarding into a new codebase we need to explore a little. Some older documentation references might include only partial commit hashes. Other methods were touched by several commits that may include additional context. Finally, we may just want to find the initial author of a given method. How to go about those?</p>

<blockquote>
  <p>Here and further, I’ll be looking at the <a href="https://github.com/Azure/azure-quickstart-templates">azure-quickstart-templates</a> repo in the state as of this writing in Dec 2023.</p>
</blockquote>

<h2 id="how-to-find-full-commit-hash-from-a-partial">How to find full commit hash from a partial?</h2>

<p>Let’s find the full hash of a commit starting <code class="language-plaintext highlighter-rouge">82a5218</code> that we found in some doc from a few years back:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git show <span class="s2">"82a5218"</span> <span class="nt">--no-patch</span> <span class="nt">--pretty</span><span class="o">=</span><span class="s2">"%H %s"</span>
82a5218d94226a85083c9cf748e8549500cdf405 End-to-end Azure ML <span class="nb">set </span>up reference implementation <span class="o">(</span><span class="c">#12006)</span>
</code></pre></div></div>
<ul>
  <li><code class="language-plaintext highlighter-rouge">%H</code> for the full hash message, <code class="language-plaintext highlighter-rouge">%s</code> for the single-line commit message</li>
</ul>

<p>Notice that this repository is large enough for git to start using slightly longer hashes in its default output compared to what we found in the docs:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git show <span class="s2">"82a5218"</span> <span class="nt">--no-patch</span>
82a5218d9  End-to-end Azure ML <span class="nb">set </span>up reference implementation <span class="o">(</span><span class="c">#12006)</span>
</code></pre></div></div>
<ul>
  <li>git automatically shortens or extends the displayed hash so that it remains unique within the repository, while being as human-readable as possible</li>
</ul>

<p>How about the parents of that commit? We can use <code class="language-plaintext highlighter-rouge">git rev-parse</code> with some modification to the hash reference for that:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git rev-parse 82a5218d9^@
f88c1f77c8340cb914d999c7d005b512ad4ab9c6
</code></pre></div></div>
<ul>
  <li>the <code class="language-plaintext highlighter-rouge">^@</code> literally means “all parents of the specified commit”, or more specifically “anything that is reachable from the commit’s parents” excluding the commit itself.</li>
</ul>

<h2 id="how-to-find-all-commits-that-happened-between-two-other-events">How to find all commits that happened between two other events?</h2>

<p>By events I mean also commits, but in a more general form of “pointers” or refs. These include branch (and <code class="language-plaintext highlighter-rouge">HEAD</code>) pointers, tags, etc.</p>

<p>For a more concise output, let’s count all commits since <code class="language-plaintext highlighter-rouge">82a5218</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git log 82a5218..HEAD <span class="nt">--oneline</span> | <span class="nb">wc</span> <span class="nt">-l</span>
3249
</code></pre></div></div>
<ul>
  <li>the <code class="language-plaintext highlighter-rouge">A..B</code> notation means “commits reachable from B but not from A” - so A itself is excluded from the count</li>
</ul>

<h2 id="how-to-find-the-very-first-commit-for-a-specific-method">How to find the very first commit for a specific method?</h2>

<p>Let’s say we are looking at the origins of the KeyVault usage.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git log <span class="nt">-G</span><span class="s1">'Microsoft.KeyVault/vaults\@.+'</span> <span class="nt">--oneline</span> | <span class="nb">tail</span> <span class="nt">-1</span>
e33bf9904 New from bicep example: 101/function-http-trigger <span class="o">(</span><span class="c">#11759)</span>
</code></pre></div></div>
<ul>
  <li><a href="https://github.com/Azure/azure-quickstart-templates/commit/e33bf9904d599f5b7fd401e8171328d460af2fbc#diff-f8d1aa6307090ce36078ceb0660eef646ab3c4a81380dadb3b93f88a0c2cd1edR180">e33bf9904</a></li>
</ul>

<p>Notably, the built-in regex support of <code class="language-plaintext highlighter-rouge">git log</code> allows us to search through commit diffs. In this case, we are only interested in the very first instance, which is the last one in the log output, hence <code class="language-plaintext highlighter-rouge">tail -1</code>.</p>

<p>Another git built-in search, <code class="language-plaintext highlighter-rouge">git grep</code>, allows us to find, for example, the resource definitions across the code:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git <span class="nb">grep</span> <span class="s1">'resource keyVault'</span> | <span class="nb">tail
</span>quickstarts/microsoft.network/azurefirewall-premium/main.bicep:resource keyVault <span class="s1">'Microsoft.KeyVault/vaults@2019-09-01'</span> <span class="o">=</span> <span class="o">{</span>
quickstarts/microsoft.network/azurefirewall-premium/main.bicep:resource keyVaultName_keyVaultCASecret <span class="s1">'Microsoft.KeyVault/vaults/secrets@2019-09-01'</span> <span class="o">=</span> <span class="o">{</span>
quickstarts/microsoft.storage/storage-blob-encryption-with-cmk/main.bicep:resource keyVault <span class="s1">'Microsoft.KeyVault/vaults@2021-10-01'</span> <span class="o">=</span> <span class="o">{</span>
quickstarts/microsoft.web/function-http-trigger/main.bicep:resource keyVault <span class="s1">'Microsoft.KeyVault/vaults@2019-09-01'</span> <span class="o">=</span> <span class="o">{</span>
quickstarts/microsoft.web/function-http-trigger/main.bicep:resource keyVaultSecret <span class="s1">'Microsoft.KeyVault/vaults/secrets@2019-09-01'</span> <span class="o">=</span> <span class="o">{</span>
quickstarts/microsoft.web/private-webapp-with-app-gateway-and-apim/main.bicep:resource keyVaultPrivateEndpoint <span class="s1">'Microsoft.Network/privateEndpoints@2021-02-01'</span> <span class="o">=</span> <span class="o">{</span>
quickstarts/microsoft.web/private-webapp-with-app-gateway-and-apim/main.bicep:  resource keyVaultPrivateDnsZoneGroup <span class="s1">'privateDnsZoneGroups'</span> <span class="o">=</span> <span class="o">{</span>
quickstarts/microsoft.web/private-webapp-with-app-gateway-and-apim/main.bicep:resource keyVaultPrivateDnsZone <span class="s1">'Microsoft.Network/privateDnsZones@2020-06-01'</span> <span class="o">=</span> <span class="o">{</span>
quickstarts/microsoft.web/private-webapp-with-app-gateway-and-apim/main.bicep:  resource keyVaultPrivateDnsZoneLink <span class="s1">'virtualNetworkLinks'</span> <span class="o">=</span> <span class="o">{</span>
quickstarts/microsoft.web/private-webapp-with-app-gateway-and-apim/main.bicep:resource keyVault <span class="s1">'Microsoft.KeyVault/vaults@2021-04-01-preview'</span> <span class="o">=</span> <span class="o">{</span>
</code></pre></div></div>

<p>This one shows us the locations in the repository – that is, the file names. How do we find the history of modifications to them? It turns out that <code class="language-plaintext highlighter-rouge">git log</code> can search by a function or line reference:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git log <span class="nt">-L</span>:<span class="s2">"resource keyVault:quickstarts/microsoft.web/private-webapp-with-app-gateway-and-apim/main.bicep"</span> <span class="nt">--no-patch</span> <span class="nt">--oneline</span>
ef64cb155 New quickstart showing Application Gateway with internal API Management and Web App <span class="o">(</span><span class="c">#11939)</span>
</code></pre></div></div>
<ul>
  <li>here, I am taking the last line from the previous command’s output</li>
</ul>

<p>What if along with the commit message, we wanted to see the author? For such cases <code class="language-plaintext highlighter-rouge">git log</code> supports formatting:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git log <span class="nt">-L</span>:<span class="s2">"resource keyVault:quickstarts/microsoft.web/private-webapp-with-app-gateway-and-apim/main.bicep"</span> <span class="nt">--no-patch</span> <span class="nt">--oneline</span> <span class="nt">--pretty</span><span class="o">=</span><span class="s2">"%h %s || %an %ae"</span> | <span class="nb">tail</span> <span class="nt">-1</span>
ef64cb155 New quickstart showing Application Gateway with internal API Management and Web App <span class="o">(</span><span class="c">#11939) || Michael S. Collier mcollier@microsoft.com</span>
</code></pre></div></div>]]></content><author><name></name></author><category term="git" /><summary type="html"><![CDATA[Sometimes, when onboarding into a new codebase we need to explore a little. Some older documentation references might include only partial commit hashes. Other methods were touched by several commits that may include additional context. Finally, we may just want to find the initial author of a given method. How to go about those?]]></summary></entry><entry><title type="html">Moving site to Jekyll, trying Mermaid diagrams, Mathjax LaTex</title><link href="https://askbow.com/2023/12/17/try-mermaid-diagrams" rel="alternate" type="text/html" title="Moving site to Jekyll, trying Mermaid diagrams, Mathjax LaTex" /><published>2023-12-17T08:30:00+00:00</published><updated>2023-12-17T08:30:00+00:00</updated><id>https://askbow.com/2023/12/17/try-mermaid-diagrams</id><content type="html" xml:base="https://askbow.com/2023/12/17/try-mermaid-diagrams"><![CDATA[<h1>🐱‍👤🐱‍👤🐱‍👤</h1>

<h2 id="md">MD?</h2>

<p>I hope to later find a good MD editor for Jekyll. Something sensible.</p>

<blockquote>
  <p>Note: I collapsed the older MD post into this one; no need for more than one sandbox</p>
</blockquote>

<h2 id="️-mermaid">🧜‍♀️ Mermaid?</h2>

<p>Mermaid is a JavaScript library for drawing diagrams directly in Markdown.</p>

<p>https://mermaid.live/</p>

<h3 id="a-test-diagram">A test diagram</h3>

<pre><code class="language-mermaid">graph TD;
    A--&gt;B;
    A--&gt;C;
    B--&gt;D;
    C--&gt;D;
</code></pre>

<h2 id="mathjax-latex">Mathjax? LaTeX?</h2>

<p>Mathjax is a framework that implements LaTeX rendering for the web. I use it to display nice formulas directly from Markdown.</p>

<p>https://jbergknoff.github.io/mathjax-sandbox/</p>

<h3 id="a-simple-formula">A simple formula</h3>

<p>A formula can be rendered inline: $y = a\times x^2 + b\times x + c$</p>

<p>Or as a block:</p>

\[L = \frac{\pi^2\times R}{2}\]]]></content><author><name></name></author><category term="sandbox" /><summary type="html"><![CDATA[🐱‍👤🐱‍👤🐱‍👤]]></summary></entry><entry><title type="html">How to safely transform a routing domain</title><link href="https://askbow.com/2018/09/11/how-to-safely-transform-a-routing-domain" rel="alternate" type="text/html" title="How to safely transform a routing domain" /><published>2018-09-11T12:36:00+00:00</published><updated>2018-09-11T12:36:00+00:00</updated><id>https://askbow.com/2018/09/11/how-to-safely-transform-a-routing-domain</id><content type="html" xml:base="https://askbow.com/2018/09/11/how-to-safely-transform-a-routing-domain"><![CDATA[<p>As part of my job as a Senior Network Engineer, I develop procedures for undertakings of varying complexity. In this post I'm describing a technique that greatly simplifies any project where a routing domain is expected to churn (<em>i.e. neighborships going up and down, routes flapping</em>), when such event is undesirable.</p>

<h3 id="motivation">Motivation</h3>

<p>I developed this technique for a client running a critical network operating 24x7, with data flowing across ten timezones. The prime objective was minimizing packet loss during the procedure, so that the real-time application could continue operating.</p>

<blockquote>
  <p><em>Personally, I consider this application way too brittle and in huge need of a redesign. But that is beside the point of this blog post.</em></p>
</blockquote>

<p>The original project for which I developed this procedure was the segmentation of a flat OSPF (<em>i.e. single-area</em>) network into multiple areas of different types. However, the technique is general enough to adapt easily to other similar projects.</p>

<h3 id="design-options-for-transforming-a-routing-domain">Design options for transforming a routing domain</h3>

<p>We start with a routing domain in state A and want to transform it into state B, without losing connectivity in the process.</p>

<p><a href="https://askbow.com/wp-content/uploads/2018/08/routing-domain-A-B.png"><img src="https://askbow.com/wp-content/uploads/2018/08/routing-domain-A-B-300x163.png" alt="" /></a></p>

<blockquote>
  <p><em>Why do that? To isolate less-stable WAN churn from more-stable LAN, for one. Also remember that in OSPF you can effectively enforce policy only at ABRs/ASBRs, so segmentation may make sense for you.</em></p>
</blockquote>

<p>There are a few general ways we could go about that:</p>

<ol>
  <li>Schedule a maintenance window and do the job as quickly as possible.
  Good: you just do the core job.
  Bad: packets will be dropped in the process.</li>
  <li>Spin up a parallel routing domain temporarily over the same network.
  Good: few commands to apply per device, with the routing protocol automatically taking care of connectivity all the way.
  Bad: possible routing loops, and you need to account for the existing routing policy (<em>redistribution, filtering, costs, etc.</em>).</li>
  <li>Convert the existing routing tables into static routes and use them.
  Good: the routing tables are already assumed loop-free and based on policy.
  Bad: there are hundreds or thousands of routes - an overwhelming volume.</li>
</ol>

<p>Luckily, the overwhelming volume part is easily solved (<em>or so I thought, see below</em>) with automation!</p>

<p>Hence, I decided that we must convert the existing routing tables into long lists of static routes, which we add to the configuration of every device in the network we're working on.</p>

<h3 id="but-there-are-thousands-of-them-routes">But there are thousands of them routes!</h3>

<p>Python to the rescue!</p>

<p>The script I wrote to wrangle this task is on GitHub: <a href="https://github.com/askbow/networking-tools/blob/master/routep.py">https://github.com/askbow/networking-tools/blob/master/routep.py</a></p>

<blockquote>
  <p>Note: this is an old post; I wrote this script before TextFSM came to my attention; the script essentially implements a single-purpose finite-state-machine to parse text input (“screen-scraping”).
Nowadays, just use TextFSM.</p>
</blockquote>

<p>The basic idea of the script is this:</p>

<ol>
  <li>load <code class="language-plaintext highlighter-rouge">show ip route</code> output from a file</li>
  <li>parse it line-by-line into a dictionary, where keys are prefixes and values are lists of nexthops and interfaces</li>
  <li>optionally optimize the routes where safely possible</li>
  <li>go through the dictionary and print static route commands to default output</li>
</ol>
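<p>For illustration, here is a minimal sketch of steps 2 and 4 (<em>this is not the actual routep.py: the regex, the default AD of 250, and the output format are simplifying assumptions, and real <code class="language-plaintext highlighter-rouge">show ip route</code> output varies wildly between platforms</em>):</p>

```python
import re
from collections import defaultdict
from ipaddress import ip_network

# Hypothetical single-format parser; the real script needs a state machine
# because "show ip route" output differs between IOS and ASA versions.
ROUTE_RE = re.compile(
    r"^\S+\s+(?P<prefix>\d+\.\d+\.\d+\.\d+/\d+)"
    r"\s+\[\d+/\d+\]\s+via\s+(?P<nexthop>\d+\.\d+\.\d+\.\d+)"
)

def parse_routes(show_output):
    """Step 2: map each prefix to the list of next hops seen for it."""
    routes = defaultdict(list)
    for line in show_output.splitlines():
        match = ROUTE_RE.match(line.strip())
        if match:
            routes[match.group("prefix")].append(match.group("nexthop"))
    return dict(routes)

def static_commands(routes, distance=250):
    """Step 4: emit 'ip route' commands with a high administrative distance."""
    for prefix, nexthops in sorted(routes.items()):
        net = ip_network(prefix)
        for nexthop in nexthops:
            yield f"ip route {net.network_address} {net.netmask} {nexthop} {distance}"
```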

<p>The result is a neat list of all the routes in the routing domain (<em>of which this device is aware</em>) with a high administrative distance.</p>

<p>The most gruesome challenge in writing this script was the sheer inconsistency of Cisco IOS and ASA products of different versions in terms of <code class="language-plaintext highlighter-rouge">show ip route</code> output structure.</p>

<blockquote>
  <p><em>If the network were built of just one type of device running one version of software, the whole script would've been three times shorter and would basically consist of a single RegEx match to extract the information I need. My script is ugly because it must parse the ugly.</em></p>
</blockquote>

<h3 id="known-issues">Known issues</h3>

<p>This simple method was generally successful, simplifying procedures for more than a dozen projects. There were, however, some operational challenges I must make you aware of.</p>

<p>First, the high administrative distance I chose as the script's default is not optimal in some topologies. Such topologies tend to be complex, and the problem lies at the intersection of several routing domains. For example, iBGP takes precedence with its lower AD, steering routes across a different path. Adjust the script accordingly.</p>

<p>Second, the route optimization the script employs is very straightforward: it aggregates adjacent prefixes where possible and drops equal-cost duplicates from the lists. This usually reduces the length of the resulting command list several times over. Yet, as with any aggregation, you lose detailed routing information, and that may introduce some additional risks.</p>
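<p>The aggregation idea can be sketched in a couple of lines with the standard library (<em>the real script predates this, so treat it as an illustration only</em>):</p>

```python
from ipaddress import collapse_addresses, ip_network

# Adjacent prefixes (with the same next hop) collapse into one supernet,
# which is what shrinks the resulting command list several times over.
prefixes = ["10.1.0.0/24", "10.1.1.0/24", "10.2.0.0/24"]
collapsed = [str(net) for net in collapse_addresses(ip_network(p) for p in prefixes)]
print(collapsed)  # ['10.1.0.0/23', '10.2.0.0/24']
```

<p>Note that collapsing is only safe per next hop - mixing prefixes that point different ways is exactly how you lose routing information.</p>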

<p>In my practice, both of these have only played out in <em>more-complicated-than-usual</em> topologies. Your mileage may vary; be prepared and double-check.</p>

<h3 id="the-procedure-to-safely-transform-a-routing-domain">The procedure to safely transform a routing domain</h3>

<p>With all that being said, here's an outline of a procedure which is based on the method described here.</p>

<ol>
  <li>collect fresh show ip route outputs from all devices in the immediate routing domain (<em>i.e. there might be no point in scraping those behind an aggregation/summarization wall</em>)</li>
  <li>parse them through the script to get static route commands</li>
  <li>apply static route commands to all devices</li>
  <li>do the main job (<em>i.e. change the routing protocol, change OSPF areas</em>)</li>
  <li>check that your routing domain is back up as expected (<em>make a checklist ahead of time!</em>)</li>
  <li>remove the static routes</li>
</ol>

<p>Looks simple to me and <em>It does the job</em>.</p>]]></content><author><name></name></author><category term="automation," /><category term="design," /><category term="ospf," /><category term="python," /><category term="routing" /><summary type="html"><![CDATA[As part of my job as a Senior Network Engineer, I develop procedures for undertakings of varying complexity. In this post I'm describing a technique that greatly simplifies any project where a routing domain is expected to churn (i.e. neighborships going up and down, routes flapping), when such event is undesirable.]]></summary></entry><entry><title type="html">Paper: Scanning the Internet for Liveness</title><link href="https://askbow.com/2018/06/14/paper-scanning-the-internet-for-liveness" rel="alternate" type="text/html" title="Paper: Scanning the Internet for Liveness" /><published>2018-06-14T08:20:00+00:00</published><updated>2018-06-14T08:20:00+00:00</updated><id>https://askbow.com/2018/06/14/paper-scanning-the-internet-for-liveness</id><content type="html" xml:base="https://askbow.com/2018/06/14/paper-scanning-the-internet-for-liveness"><![CDATA[<p>An interesting paper where the authors are building a better way to scan the Internet.</p>

<p><a href="https://sheharbano.com/assets/publications/ccr18-scan-liveness.pdf">https://sheharbano.com/assets/publications/ccr18-scan-liveness.pdf</a></p>

<p><em>Shehar Bano et al. Scanning the Internet for Liveness // ACM SIGCOMM Computer Communication Review, Volume 48 Issue 2, April 2018</em></p>

<blockquote>
  <p>Liveness—whether or not a target IP address responds to a probe packet—is a nuanced concept without a simple yes/no answer. Responsiveness directly depends on the probe type, the configuration of the targeted host, as well as on firewalling and filtering behaviors at the edge or within networks.</p>
</blockquote>

<p>Key findings include:</p>

<blockquote>
  <ul>
    <li>TCP and UDP probes increase the population responsive over ICMP by 18%,</li>
    <li>comprehensively capturing reply traffic (i.e., taking into account negative reply packets) increases the responsive population by more than 13%,</li>
    <li>TCP stacks do not consistently respond with a TCP Rst for non-available services—in our measurements only 24% of hosts with an active TCP stack respond to all the probes,</li>
    <li>our concurrent scans allow us to identify nearly 2M tarpits that would bias measurements that do not take them into account, and</li>
    <li>we report on the correlation of responsiveness across protocols uncovering potential filtering practices.</li>
  </ul>
</blockquote>

<p>Other things I found interesting:</p>

<blockquote>
  <ul>
    <li>probe redundancy [sending deferred repeated probes] increases the population of active IP addresses by 2.2%</li>
    <li>scans recorded 487M network alive IPs (IPall) out of 3.6B probed.</li>
    <li>they see that ICMP Echo probes are most effective in discovering network active IPs, revealing 79% of IPall, followed by TCP probes.</li>
    <li>they found that 16% of IPall can only exclusively be discovered via TCP, and a small but significant ≈2% can only be discovered via UDP probes.</li>
  </ul>
</blockquote>]]></content><author><name></name></author><category term="worth_reading," /><category term="networking," /><category term="research" /><summary type="html"><![CDATA[An interesting paper where the authors are building a better way to scan the Internet.]]></summary></entry><entry><title type="html">How many spares do you need?</title><link href="https://askbow.com/2018/05/11/how-many-spares-do-you-need" rel="alternate" type="text/html" title="How many spares do you need?" /><published>2018-05-11T15:08:00+00:00</published><updated>2018-05-11T15:08:00+00:00</updated><id>https://askbow.com/2018/05/11/how-many-spares-do-you-need</id><content type="html" xml:base="https://askbow.com/2018/05/11/how-many-spares-do-you-need"><![CDATA[<p>In designing a network, there is a question that is often missing an answer or at best, answered using some rule-of-thumb. How many spare units you should include in your BOM? Actually, do you need them at all?</p>

<blockquote>
  <p><strong>Disclaimer</strong>: <em>I won't be covering any of the really complex models. People who need them probably know about spare part forecasting and procurement more than I do. But some simple models are useful in general network design work, so here's my take on it.</em></p>
</blockquote>

<h3 id="tldr">TL;DR:</h3>

<p>It depends. The lower the mean time to recovery (MTTR) you want, the more likely it is that you need on-site spares. And the lower the MTTR, the higher the availability you get.</p>

<h2 id="why-discuss-spares">Why discuss spares?</h2>

<p>Let's go with a top-down approach here.</p>

<p>There can be several business drivers for really high network availability. A few examples:</p>

<ul>
  <li>network downtime cost is <strong>very</strong> high - think of a broker connecting to an exchange, or medical equipment during remote procedures (<em>these will become more and more common over the years</em>)</li>
  <li>regulatory / compliance - rules imposed by regulatory body (<em>industry association, state department</em>) upon your information system in general and by extension on the network</li>
  <li>tight SLAs with customers (<em>who then have cost / compliance or other stuff for their reasons</em>)</li>
</ul>

<p>To see why we may consider spare parts as part of high availability equation, let's go a little deeper.</p>

<h3 id="what-is-availability">What is availability</h3>

<p>Availability in its general mathematical form depends on two factors:</p>

<ul>
  <li>MTBF - mean time between failures; many people confuse MTBF with how long a given specimen will work for. A better, more practical understanding of it goes like this: if a vendor has sold 1 000 000 units (<em>power supplies for example</em>) with MTBF 1 000 000 hours, then on average they will be sending one replacement unit every hour.</li>
  <li>MTTR - mean time to recover [from failure] - how long it takes to fix a problem</li>
</ul>

<p>The availability is usually taken as $A = \frac{MTBF}{MTBF + MTTR}$ and the result might look something like 0.99818231.</p>

<blockquote>
  <p>There's a comprehensive article on that topic over at Packet Pushers: <a href="http://packetpushers.net/reliability-basics-part1/">Reliability Basics- Part1 by Diptanshu Singh</a>. There's no point in repeating all of that math background here.</p>
</blockquote>

<p>What it means in practical terms is, you can compute the expected (<em>notice expected - it's all a matter of statistics</em>) downtime by taking $T_d = (1 - A) \times T$, where $T$ is your time budget (<em>most people use a Gregorian year here, as expressed in minutes or seconds</em>).</p>
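<p>Both formulas fit in a few lines of Python (<em>the MTBF and MTTR figures below are made up for illustration</em>):</p>

```python
MINUTES_PER_YEAR = 365 * 24 * 60

def availability(mtbf_h, mttr_h):
    # A = MTBF / (MTBF + MTTR)
    return mtbf_h / (mtbf_h + mttr_h)

def expected_downtime(a, budget_minutes=MINUTES_PER_YEAR):
    # T_d = (1 - A) * T
    return (1 - a) * budget_minutes

a = availability(mtbf_h=300_000, mttr_h=24)  # next-business-day repair
print(f"A = {a:.6f}, expected downtime = {expected_downtime(a):.0f} min/year")
```

<p>Notice how even a day-long MTTR keeps a single well-built device above "three nines" - it's the repeated or prolonged repairs that eat your availability budget.</p>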

<h3 id="how-do-you-increase-your-network-availability">How do you increase your network availability?</h3>

<p>There are several ways to push availability up:</p>

<ul>
  <li>get more reliable equipment (<em>i.e. increase MTBF</em>)</li>
  <li>add redundancy (<em>think failover/cluster/VSS/vPC/stack/VRRP etc</em>) - also sometimes called structural reliability</li>
  <li>decrease MTTR</li>
</ul>

<p>Now, there is some technological level after which it is prohibitively expensive to increase MTBF, plus there's a natural trade-off between features (=complexity) and reliability.</p>

<p>It is also hard to do redundancy: it increases complexity even further and introduces a separate set of distributed systems problems (<em>for example, firewall cluster and VSS state machines have a lot of moving parts</em>). Although some people may push it higher than that, most settle for something manageable, like running two units in parallel.</p>

<p>Seems like the only thing left to do is to try to decrease MTTR.</p>

<h3 id="side-note-what-network-equipment-mtbf-looks-like">Side note: what does network equipment MTBF look like?</h3>

<p>Enterprise-class ethernet switches (fixed) seem to have their MTBF converged around 250000 - 400000 hours (<em>at least based on datasheets referencing Telcordia parts-count methods</em>). Individual linecards for modular switches have about the same figures.</p>

<p>Fixed routers are in the same ballpark or higher. Servers are usually considered to have a lower MTBF, around 75000 hours, while appliances (like many firewalls, which are basically stripped-down servers) are expected to have around 100000-150000 hours of MTBF.</p>

<p>You should always refer to your vendor/manufacturer if you need exact datapoints for precise calculation.</p>

<h2 id="what-is-mttr">What is MTTR?</h2>

<p>In general, MTTR consists of several components:</p>

<ul>
  <li>Failure detection time</li>
  <li>Problem diagnosis time</li>
  <li>Repair time</li>
  <li>Time to test and confirm restoration of service</li>
</ul>

<blockquote>
  <p>Many times, the time to actually repair is the time to reboot the device (i.e. on the order of 5-10-20-30 minutes) or to remove a config line (on the order of 1-5 minutes). On the other end of the spectrum is replacing a whole half-rack-high modular switch (0.5-4 hours). Notice also that the time spread increases with complexity. A corollary: for lower MTTR, you might want to minimize complexity.</p>
</blockquote>

<p>In special circumstances, like remote sites, you might also add to the mix:</p>

<ul>
  <li>Engineering team delivery to the site to perform repairs (<em>for unmanned site</em>)</li>
  <li>Time to deliver spares to the site (<em>if sent separate from the repairs team</em>) - which is also the case if you don't have any spares at all</li>
</ul>

<h3 id="how-do-you-decrease-mttr">How do you decrease MTTR?</h3>

<p>Before we dive deeper into the whole spare part business, let's cover other ways to decrease MTTR first.</p>

<blockquote>
  <p>If you think about it, <em>redundancy</em> is actually a way to decrease MTTR taken to its absolute: the spare unit takes over automatically with minimum switchover delay feasible.</p>
</blockquote>

<p>First things first, depending on your economics and technology, you optimize MTTR down by decreasing detection time. You do it with all sorts of monitoring/telemetry, regular health check-ups and planned maintenance procedures. Same approach works for decreasing diagnostics time. You prepare and use checklists, configuration management procedures [i.e. you always know if any change was made prior to failure], automation. Last but not least - you invest in people by training them. You can also make your critical sites manned 24x7x365, i.e. hire more people.</p>

<p>Similarly, time to actually repair something depends again on procedures and people, but there are limits to that.</p>

<h4 id="other-replacement-options">Other replacement options</h4>

<p>At some point, you will need to replace failed equipment. You don't necessarily need spares for that:</p>

<ul>
  <li><em>Warranty</em> - many honest manufacturers will cover (although without any real SLA) their products for some reasonable time (or for the product's lifetime, i.e. until they declare its End of Life)</li>
  <li><em>Service contracts</em> - these include not only replacements but also some SLA attached to them - for example, shipping the replacement part on the next business day (mean delivery time will be at least 32 hours), the next day (24 hours), or in 4 hours (about 5 hours in practice)</li>
</ul>

<blockquote>
  <p>Time estimates here are rough and include some reasonable same-postcode-expedite-delivery. No vendor has a warehouse in every area, so add some time allowance for UPS / DHL / FedEx to reach you.</p>
</blockquote>

<p>As far as I know, a 4-hour shipping SLA is the top speed available from most vendors. Sometimes, if you have enough leverage, you can squeeze a little more (<em>down to 1 hour maybe</em>) from your local vendor's VAR.</p>

<p>Here we arrive at the final point: if you need to go further down the timescale, you have to have on-site spares.</p>

<h2 id="the-economic-effect-of-having-spares">The economic effect of having spares</h2>

<p>First of all, spares cost money directly, that is - you need to buy them (<em>and possibly cover them with appropriate contracts as well</em>).</p>

<p>Then you need to store them and spend some time regularly testing them (<em>a once-a-year [or more often for more critical systems] smoke test</em>). Moreover, from a financial point of view, spares are stale capital [not the exact term, sorry]: by buying something (<em>to sit in your warehouse</em>) you give up the ability to employ that capital otherwise. And that, in short, can make some of your financial KPIs look not as good.</p>

<p>On the other hand, spares relax service contract requirements. For example, instead of covering your whole fleet of 1000 access points with 24x7@4hrs contracts, with ten spares stored in a wiring closet you would only need 8x5@NBD contracts.</p>

<p>All in all, your mileage will vary, and this motivator is worth due consideration.</p>

<h2 id="spare-part-kit">Spare part kit</h2>

<p>Your spare part inventory consists of one or more spare part kits. Spare part kits are collections of spare parts which serve a particular site or a group of sites. As such, we can distinguish between:</p>

<ul>
  <li><em>local kits</em> - serve one site, stored on that site (zero delivery time)</li>
  <li><em>group kits</em> - serve a set of sites (usually grouped by geography). Vendor's warehouses that ship you a spare part based on service contract can be considered an example of that</li>
  <li><em>multilevel kits</em> - some combination of the above</li>
</ul>

<h3 id="replenishing-spare-part-kit">Replenishing spare part kit</h3>

<p>Another way to classify spare part kits is the way you top them up (i.e. how you drive your <em>spare part procurement process</em>). Basically, you can do it in any of these ways:</p>

<ul>
  <li><em>never</em> - you load up everything you will ever need and fly to the edge of the Solar system.</li>
  <li><em>regularly</em> - every year (month, quarter, other set interval) you buy a bucket of transceivers.</li>
  <li><em>waterline</em> - as soon as the number of spares goes down to some predefined level (waterline) below the base, you buy more to restore status quo.</li>
</ul>

<blockquote>
  <p>There are special cases and combinations of these, but they make sense only for some level of sophistication of supply and support organizations. For example, military organizations probably have very complex schemes with specific goals (given that most of the interesting math for reliability and spare part calculation was initially [and still is] developed for army's needs).</p>

  <p>I expect organizations such as Google and AWS, as well as network equipment vendors, who also happen to have a lot of data about IT systems reliability, to have developed their own complex spare kit configurations as well.</p>
</blockquote>

<p>Down the line, I'll be covering a generic case of a local kit which we replenish on a waterline basis.</p>

<h3 id="how-do-you-evaluate-your-spare-part-kit">How do you evaluate your spare part kit?</h3>

<blockquote>
  <p>Sorry for another interruption, but answering this question early will make later understanding easier.</p>
</blockquote>

<p>There are two important metrics of spare part kits:</p>

<ul>
  <li><em>spare kit efficiency</em> - basically, how many hot devices you are covering with each spare</li>
  <li><em>spare kit readiness</em> - this is a statistical measure of probability to find necessary spare in the kit at the moment of hot device failure</li>
</ul>

<h4 id="efficiency">Efficiency</h4>

<p>Efficiency can be calculated as $Q = 1 - S / N$, where $S$ is the number of spares and $N$ is the total number of devices protected by these spares. The higher, the better (<em>lower economic loss</em>).</p>

<h4 id="readiness">Readiness</h4>

<p>Readiness is a little more complex. For waterline replenishment discipline it goes in two steps:</p>

<ol>
  <li>calculate minimum spares needed as $S_{min} = \frac{N \times T_t }{MTBF}$ - <em>see notes in the next section</em></li>
  <li>insert the result into this slightly bigger formula:</li>
</ol>

\[R = 1 - \frac{S_{min}^{m + 2}}{(S - m + S_{min})\times((1 + S_{min})^{m + 1})}\]

<p>where $S$ is your spares base level and $m$ is your waterline level.</p>

<p>Obviously, you want your Spare Kit readiness as high as possible.</p>

<p>In practice, you would find your own balance between efficiency and readiness by solving some optimization problem specific to your needs.</p>

<h3 id="calculating-minimum-spares-required">Calculating minimum spares required</h3>

<p>This formula comes as a result of developments in mathematical modeling and queuing theory. By modeling the spare part kit as a queue and failures as independent Poisson events, it can be shown that for the general case:</p>

\[S_{min} = \frac{N \times T_t }{MTBF}\]

<p>$T_t$ here is the mean time it takes to replenish the kit, i.e. for the vendor to deliver on the contract (see above).</p>

<blockquote>
  <p>There's a side result from the same science. We can say with high confidence that this kit will be optimal for many typical cases as well; that is, it maximizes both efficiency and readiness at the same time. I'm not sure if this holds for all cases.</p>
</blockquote>

<h3 id="calculating-the-total-number-of-spares-required">Calculating the Total number of spares required</h3>

<p>How many spares would you need if you expect never to replenish the kit?</p>

\[S_T = T \times N / MTBF\]

<p>where $T$ is the total expected lifetime.</p>

<p>By taking $1/MTBF$ we effectively convert it to failure rate, which we then multiply by total system time budget (expressed in machine-hours).</p>

<p>The result of this calculation is also the maximum number of spares. Between the minimum and this maximum, you must choose a point that makes sense to you. A good way to start is to set some expectations about the kit's readiness and efficiency and crunch the numbers, trying to maximize both.</p>
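<p>Putting the formulas from this post together (<em>a sketch only: the fleet size, MTBF, and contract figures are illustrative, and $S_{min}$ is left fractional here - round it up to whole units in practice</em>):</p>

```python
def spares_min(n, replenish_h, mtbf_h):
    # S_min = N * T_t / MTBF  (waterline replenishment discipline)
    return n * replenish_h / mtbf_h

def spares_total(n, lifetime_h, mtbf_h):
    # S_T = T * N / MTBF  (kit that is never replenished)
    return n * lifetime_h / mtbf_h

def efficiency(s, n):
    # Q = 1 - S / N
    return 1 - s / n

def readiness(s, m, s_min):
    # R = 1 - S_min^(m+2) / ((S - m + S_min) * (1 + S_min)^(m+1))
    return 1 - s_min ** (m + 2) / ((s - m + s_min) * (1 + s_min) ** (m + 1))

n = 1000                                                  # access points in the fleet
s_min = spares_min(n, replenish_h=32, mtbf_h=250_000)     # NBD contract, datasheet MTBF
print(round(s_min, 3), round(spares_total(n, 5 * 8766, 250_000), 1))
print(round(efficiency(s=2, n=n), 3), round(readiness(s=2, m=1, s_min=s_min), 4))
```

<p>Crunching numbers like these for a range of $S$ values is exactly the efficiency-versus-readiness balancing act described above.</p>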

<h3 id="conclusion">Conclusion</h3>

<p>Maintaining a spares inventory is a good way to lower MTTR in a highly available system. The models I list here might not be perfectly refined, and they certainly don't take into account every situation possible, but they produce a fair estimate.</p>]]></content><author><name></name></author><category term="design," /><category term="high_availability" /><summary type="html"><![CDATA[In designing a network, there is a question that is often missing an answer or at best, answered using some rule-of-thumb. How many spare units you should include in your BOM? Actually, do you need them at all?]]></summary></entry><entry><title type="html">How many BGP routers can a big AS have?</title><link href="https://askbow.com/2017/10/21/many-bgp-routers" rel="alternate" type="text/html" title="How many BGP routers can a big AS have?" /><published>2017-10-21T12:52:00+00:00</published><updated>2017-10-21T12:52:00+00:00</updated><id>https://askbow.com/2017/10/21/many-bgp-routers</id><content type="html" xml:base="https://askbow.com/2017/10/21/many-bgp-routers"><![CDATA[<p>For iBGP number of peers (<em>i.e. the number of BGP routers inside an AS</em>), the only significant limiting factor is that iBGP peers must be fully meshed (N.B.: not directly interconnected! An iBGP peering can span all the hops you can fit into the IP TTL field) - because it is the only way for iBGP to prevent loops.</p>

<p>The impact of each BGP peer is an open TCP connection, some memory, occasionally some processing, and then some administrative burden.</p>

<blockquote>
  <p>How many connections?</p>

\[\frac{n \times (n-1)}{2}\]

  <p><em>- That's quadratic complexity:</em></p>
</blockquote>

<pre><code class="language-mermaid">
flowchart LR

A &lt;-...-&gt; B &amp; C &amp; D &amp; E &amp; F
B &lt;-...-&gt; C &amp; D &amp; E &amp; F
C &lt;-...-&gt; D &amp; E &amp; F
D &lt;-...-&gt; E &amp; F
E &lt;-...-&gt; F

</code></pre>
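<p>The quadratic growth is easy to feel with a one-liner (<em>counting unordered router pairs</em>):</p>

```python
def ibgp_sessions(n):
    # full mesh: one iBGP session per unordered pair of routers
    return n * (n - 1) // 2

print(ibgp_sessions(6), ibgp_sessions(100))  # 15 and 4950
```

<p>Six routers need a manageable 15 sessions; a hundred routers already need 4950.</p>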

<p>To overcome iBGP scalability problems, two approaches were developed:</p>

<ul>
  <li>Confederations</li>
  <li>BGP Route Reflectors</li>
</ul>

<h3 id="bgp-confederations">BGP Confederations</h3>

<p><em>Confederations</em> basically mean splitting your AS into several sub-ASes. A confederated AS looks like a single entity to its eBGP peers, even though each individual router might belong to a different sub-AS.</p>

<pre><code class="language-mermaid">
flowchart LR


subgraph "AS1"
  C &lt;-...-&gt; D
end

subgraph "AS2"
  A &lt;-...-&gt; B &amp; C
  B &lt;-...-&gt; C 
end

subgraph "AS3"
  D &lt;-...-&gt; E &amp; F
  E &lt;-...-&gt; F
end

</code></pre>

<p>Routers prevent loops inside a confederation by using special CONFED segments in the AS_PATH. Just like regular AS_PATH segments, the CONFED counterparts come in two types: AS_CONFED_SET and AS_CONFED_SEQUENCE.</p>

<p>Importantly, we must still fully mesh the BGP routers inside each sub-AS. Basically, a sub-AS is just an AS in its own right.</p>

<p>Downside: loss of detailed routing information when we cross a sub-AS boundary.</p>

<p>Where is it logical to put confederations in production? My guess would be large enterprises. It is quite normal for one company to own several AS numbers - that usually happens as a result of corporate mergers and acquisitions. At the same time, such a company might want to present itself as a single entity to any outside network.</p>

<h3 id="bgp-route-reflectors">BGP Route reflectors</h3>

<p><em>Route reflectors</em> allow you to build a hierarchy of routers. A route-reflector client doesn't know that it works with a route reflector - for the client it's a normal iBGP peering. Thus the client's algorithm is the same as in a fully meshed iBGP system.</p>

<pre><code class="language-mermaid">
flowchart LR

subgraph "RR-A"
  A &lt;-...-&gt; A_1
  A &lt;-...-&gt; A_2
  A &lt;-...-&gt; A_3
end

subgraph "RR-B"

  B &lt;-...-&gt; B_1
  B &lt;-...-&gt; B_2
  B &lt;-...-&gt; B_3
end

subgraph "RR-C"

  C &lt;-...-&gt; C_1
  C &lt;-...-&gt; C_2
  C &lt;-...-&gt; C_3
end

A(A_RR) &lt;-...-&gt; B
B(B_RR) &lt;-...-&gt; C
C(C_RR) &lt;-...-&gt; A

</code></pre>

<p>A route reflector (RR) acts a little differently, though, because its clients are not fully meshed. So an RR (<em>almost</em>) always relays updates to its clients, even those received from another client.</p>

<p>In order to prevent loops, route reflectors use the ORIGINATOR_ID and CLUSTER_LIST attributes.</p>

<p>Notice that we must fully mesh route reflectors between each other. And for redundancy, we should install at least two.</p>

<p>Moreover, we sometimes place route reflectors outside of the traffic paths. That way, we can use a cheaper router (it still has to receive all the routes we have). This is possible thanks to BGP's third-party next-hop feature.</p>

<p>Downside: loss of detailed routing information, because RR will only send the best routes to its clients. Hence, possible suboptimal routing.</p>

<blockquote>
  <p>Interestingly, BGP RRs are the basic idea behind some SDN implementations. Basically, the RR (SDN control server) is filling client's routing tables via BGP.</p>
</blockquote>

<h3 id="finally">Finally,</h3>

<p>Both schemes allow ASes to grow to hundreds of routers and more, and the two schemes can be used in parallel if desired. The Route reflectors method is perhaps the most widely deployed. The reason is it is easier to design, setup, and support. Also, it allows to build a multi-tier routing hierarchy (<em>core-aggregation-edge, for example</em>) with minimal effort both initially and during scaling.</p>]]></content><author><name></name></author><category term="routing," /><category term="basics," /><category term="bgp," /><category term="design" /><summary type="html"><![CDATA[For iBGP number of peers (i.e. the number of BGP routers inside an AS), the only significant limiting factor is that iBGP peers must be fully meshed (N.B.: not directly interconnected! An iBGP peering can span all the hops you can fit into the IP TTL field) - because it is the only way for iBGP to prevent loops.]]></summary></entry><entry><title type="html">Why Hulc LED process consumes so much CPU on 2960 platforms?</title><link href="https://askbow.com/2017/10/17/hulc-led-process-consumes-much-cpu-2960-platforms" rel="alternate" type="text/html" title="Why Hulc LED process consumes so much CPU on 2960 platforms?" /><published>2017-10-17T13:51:00+00:00</published><updated>2017-10-17T13:51:00+00:00</updated><id>https://askbow.com/2017/10/17/hulc-led-process-consumes-much-cpu-2960-platforms</id><content type="html" xml:base="https://askbow.com/2017/10/17/hulc-led-process-consumes-much-cpu-2960-platforms"><![CDATA[<p>In this post I'll try to make an educated guess about what happens with Hulc LED process and why it appears to consume 20-30% CPU on Cisco 2960(S/X/XR/RX) switches.</p>

<p>(<em>N.B.: the issue appears to be present on Cisco 3750 / 3560 platforms as well</em>)</p>

<h1 id="symptoms">Symptoms</h1>

<p>If you monitor your switch via SNMP, you may quickly notice a constantly elevated CPU at about 20-40% total. To investigate further, you get the relevant command output from the device: <code class="language-plaintext highlighter-rouge">show process cpu sort</code></p>

<p>And the result looks something like this:</p>

<p><a href="https://askbow.com/wp-content/uploads/2017/10/sh-proc-cpu-hulk-led.png" title="sh proc cpu"><img src="https://askbow.com/wp-content/uploads/2017/10/sh-proc-cpu-hulk-led-e1508217172316.png" alt="" /></a></p>

<p>or what we could call <strong>Angry Hulc</strong> process ;-)</p>

<h1 id="what-is-cisco-hulc">What is Cisco HULC?</h1>

<p>According to Cisco document 64641 (<em>it's public somewhere on cisco.com</em>):</p>

<blockquote>
  <p>The Mirage is based on HULC hardware architecture and the Sasquatch switching ASIC chipset from DSBU. HULC is a hardware architecture that you use to build low cost, stackable, 10/100/1000 Ethernet switches.
 …
<strong>Lord Of The Rings (LOTR)</strong> - The first release of the HULC based platforms and software based on the Sasquatch ASIC chip set from DSBU. The Mirage is based on LOTR platforms from DSBU.</p>
</blockquote>

<ul>
  <li><em>N.B.: DSBU - Data Switching Business Unit, a part of Cisco</em></li>
  <li>Also, notice the LOTR reference; for contrast, the 4500 platform was built by Star Wars geeks</li>
</ul>

<p>A quick look around shows that a lot of stuff is based on that machinery, starting with 3750 / 3560 and 2960 series, but later included SMB series like Express 500 switches, industrial Ethernet series. I'd also assume that some spoils of that development went into 3850, 4500, 6800IA and later systems.</p>

<blockquote>
  <p><em>This gave me pause, because, judging by the docs and CiscoLive sessions, the stacks of 3750 and 2960 series differ quite a lot.</em></p>
</blockquote>

<p>It is reasonable to assume that the whole multitude of hulc and <code class="language-plaintext highlighter-rouge">/H.*/</code> (<em>HLPIP, HRPC, HL3U, etc.</em>) processes in IOS are talking to hardware in this architecture. For example, from <a href="https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst2960/software/release/12-2_55_se/configuration/guide/scg_2960/swtrbl.html">docs</a>:</p>

<blockquote>
  <ul>
    <li>Hulc Forwarding TCAM Manager (HFTM)</li>
    <li>Hulc Quality of Service (QoS)/access control list (ACL) TCAM Manager (HQATM)</li>
  </ul>
</blockquote>

<p>We can also infer that the commands under <code class="language-plaintext highlighter-rouge">show platform stack</code> talk directly to HULC.</p>

<h2 id="ok-whats-hulc-led-process-anyway">Ok, what's Hulc LED process anyway?</h2>

<p>According to the descriptions of the two most relevant Cisco bugs (CSCtg86211 and CSCtn42790), Hulc LED process is the thing that monitors port states, including PoE and transceiver (<em>think SFP</em>) status, and sets the LED indicators accordingly. It also communicates with the MODE button, and resets the switch to factory defaults if you hold it for too long:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%SYS-7-NV_BLOCK_INIT: Initialized the geometry of nvram
%SYS-5-RELOAD: Reload requested by Hulc LED Process. Reload Reason: Reason unspecified.
</code></pre></div></div>

<p>(<em>see Cisco FN - 63722 and bug CSCuj69384 for insight into why this is important</em>)</p>

<p>Sadly, the description in these bugs doesn't go beyond <em>this is an expected behavior</em>.</p>

<blockquote>
  <p><em><strong>Disclaimer</strong>: This isn't publicly documented. What follows are my general thoughts and speculations on what might be happening inside. I do not work for Cisco at the time of this writing and cannot have access to that kind of detail, so don't rely on my words too much. I'm probably wrong.</em></p>
</blockquote>

<h2 id="how-come-it-consumes-so-much-cpu">How come it consumes so much CPU?</h2>

<p>Here comes my theory. This might be completely bogus and is definitely based on a lot of assumptions, some of which are made up. But I tried to keep it realistic.</p>

<p><strong>Hulc LED process does not consume your CPU. It's an illusion created by the process's waiting state.</strong></p>

<p>Taking into account that much of today's IOS has Linux blood in its veins, that's not hard to imagine. In that case, the command <code class="language-plaintext highlighter-rouge">show process cpu</code> and its derivatives don't show us actual CPU usage, but something closer in spirit to the Linux load average: a decaying average over three time windows, which for Cisco IOS are 5 seconds, 1 minute, and 5 minutes.</p>

<p>If we <a href="https://github.com/torvalds/linux/blob/master/kernel/sched/loadavg.c">look closely</a> at how data points for calculating load average are collected,</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">calc_load_fold_active</span><span class="p">(</span><span class="k">struct</span> <span class="n">rq</span> <span class="o">*</span><span class="n">this_rq</span><span class="p">,</span> <span class="kt">long</span> <span class="n">adjust</span><span class="p">)</span>
<span class="p">{</span>
	<span class="kt">long</span> <span class="n">nr_active</span><span class="p">,</span> <span class="n">delta</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

	<span class="n">nr_active</span> <span class="o">=</span> <span class="n">this_rq</span><span class="o">-&gt;</span><span class="n">nr_running</span> <span class="o">-</span> <span class="n">adjust</span><span class="p">;</span>
	<span class="n">nr_active</span> <span class="o">+=</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">this_rq</span><span class="o">-&gt;</span><span class="n">nr_uninterruptible</span><span class="p">;</span>

	<span class="k">if</span> <span class="p">(</span><span class="n">nr_active</span> <span class="o">!=</span> <span class="n">this_rq</span><span class="o">-&gt;</span><span class="n">calc_load_active</span><span class="p">)</span> <span class="p">{</span>
		<span class="n">delta</span> <span class="o">=</span> <span class="n">nr_active</span> <span class="o">-</span> <span class="n">this_rq</span><span class="o">-&gt;</span><span class="n">calc_load_active</span><span class="p">;</span>
		<span class="n">this_rq</span><span class="o">-&gt;</span><span class="n">calc_load_active</span> <span class="o">=</span> <span class="n">nr_active</span><span class="p">;</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="n">delta</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>we can see that it takes load measurements for processes not only in the running state, but also in the <em>uninterruptible</em> wait state.</p>

<blockquote>
  <p><em>I would like to thank <a href="http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html">Brendan Gregg and his post</a> for giving me this idea. It is a great writeup about the problem of load averages in Linux.</em></p>
</blockquote>

<p>Let's consider what the Hulc LED process is doing.</p>

<p>It is communicating with peripheral (<em>from the CPU's point of view</em>) devices: the ports are [probably] connected to the CPU complex via some serial bus. Hulc LED process polls every port: it reads the port status register and sets the command register to blink the LED.</p>

<blockquote>
  <p>Or something similar. It's a logical assumption, given that the more ports the switch has, the more load Hulc LED exhibits.</p>
</blockquote>

<p>This is your basic IO operation. IO operations are slow and need to be completed atomically (<em>i.e. without interruption</em>) to prevent corruption / inconsistency. Hence, it's reasonable to assume that during this operation the process is put into the TASK_UNINTERRUPTIBLE state.</p>

<blockquote>
  <p>Refer to <a href="https://access.redhat.com/sites/default/files/attachments/processstates_20120831.pdf">Understanding Linux Process States by Yogesh Babar, RedHat</a> for details on what different states mean.</p>
</blockquote>

<h2 id="the-led-is-not-that-bright">The LED is not that bright</h2>

<p>Despite what I've said so far, Hulc LED process can still potentially consume too much CPU. That might be a symptom of faulty SFP modules, link flapping, or some very specific hardware problems in the switch. One can imagine several kinds of problems that would result in longer delays on the system buses.</p>

<p>This would result in [apparently] very high (<em>beyond 30% of CPU</em>) consumption by Hulc LED process.</p>

<blockquote>
  <p>The 30% figure here is rule-of-thumb arbitrary (<em>it is also given in Table 3 <a href="https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst3750/software/troubleshooting/cpu_util.html">here</a>, so my guess is in the right ballpark</em>). Although other Cisco documentation (<em>see BRKCRS-3141 2011 for example</em>) states normal levels for some devices, you should always consider your own environment and do a baseline analysis.</p>
</blockquote>

<p>Moreover, there is also bug CSCvd68472, which can make Hulc LED process, together with the Hulc running config process, consume up to 100% CPU.</p>

<h2 id="conclusion">Conclusion</h2>

<p>To summarize, most of the time the Hulc LED process on Cisco 3750/2960 platforms does not actually consume 20-30% of the CPU; rather, it is mostly waiting for its IO syscalls to finish for all that time. The system displays that as usage because of the specifics of the algorithm used to calculate the load.</p>

<blockquote>
  <p>I can counter my own argument: this might actually be a quirk of run-to-completion FIFO discipline used by Cisco IOS scheduler and have nothing to do with Linux.</p>
</blockquote>

<p>Also note that Cisco fixed this behavior in a 15-something IOS branch for many hardware platforms, so your mileage may vary.</p>]]></content><author><name></name></author><category term="switching" /><summary type="html"><![CDATA[In this post I'll try to make an educated guess about what happens with Hulc LED process and why it appears to consume 20-30% CPU on Cisco 2960(S/X/XR/RX) switches.]]></summary></entry><entry><title type="html">What happens if you start a Cisco 6500 switch without the fan module?</title><link href="https://askbow.com/2017/05/30/happens-start-cisco-6500-switch-without-fan-module" rel="alternate" type="text/html" title="What happens if you start a Cisco 6500 switch without the fan module?" /><published>2017-05-30T11:44:00+00:00</published><updated>2017-05-30T11:44:00+00:00</updated><id>https://askbow.com/2017/05/30/happens-start-cisco-6500-switch-without-fan-module</id><content type="html" xml:base="https://askbow.com/2017/05/30/happens-start-cisco-6500-switch-without-fan-module"><![CDATA[<p>Recently, I tested a Cisco 6500 switch in a fan-less configuration, to see how long it can go.</p>

<blockquote>
  <p><strong>DISCLAIMER</strong>: <strong>DO NOT TRY THIS</strong>. This is a stupid idea and it will void the warranty / would be a perfectly valid reason for Cisco to decline an RMA (<em>in my opinion at least</em>). Running a switch without fans will directly damage active components (<em>ASICs, TCAM, CPU, etc.</em>) and increase the wear-out of passive ones (<em>capacitors, etc.</em>). I did it in the lab so you don't have to.</p>
</blockquote>

<p>Initially, I would've guessed that you can run the system w/o the fan module for about five minutes. That should give enough time to <a href="https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst6500/hardware/Chassis_Installation/Cat6500/6500_ins/04remrep.html#49856">replace</a> (swap, or clean) the fan module if needed.</p>

<h3 id="fanless-test-results">Fanless test results</h3>

<p>That's not exactly the case:</p>

<ol>
  <li>right away, as the CPUs heated up, the system became slow to respond on the direct console connection</li>
  <li>within a minute, I got these logs:</li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*May 30 07:24:53.619: %C6KENV-SW2_SP-4-INSUFFCOOL: Module 1 cannot be adequately cooled
*May 30 07:24:53.707: %C6KENV-SW2_SP-4-INSUFFCOOL: Module 3 cannot be adequately cooled
*May 30 07:24:55.539: %C6KENV-SW2_SP-4-FANCOUNTFAILED: Required number of fan trays is not present
*May 30 07:25:20.215: %C6KENV-SW2_SP-4-MINORTEMPALARM: switch 2 RP 5/0 inlet temperature crossed threshold #1(=50C. It has exceeded normal operating temperature range.
*May 30 07:27:07.183: %C6KENV-SW2_SP-4-MINORTEMPALARM: switch 2 module 5 asic-1 temperature crossed threshold #1(=. It has exceeded normal operating temperature range
*May 30 07:27:26.387: %C6KENV-SW2_SP-4-MINORTEMPALARM: switch 2 EARL 5/0 outlet temperature crossed threshold #1(=. It has exceeded normal operating temperature range
</code></pre></div></div>

<p>In short, it took a minute w/o fans for the test system to start overheating.</p>

<p>The EnvMon (<em>as far as I understand</em>) would shut the system down if it went too far above the temperature range red line. But I didn't go that far, because this wasn't the purpose of my test.</p>

<h3 id="why-we-need-the-fans">Why we need the fans</h3>

<p>A supervisor in a Cisco 6500 switch may consume around 250-500 W from the power supply. Most of that energy is ultimately burned away, turning into heat <em>(thermal energy)</em>.</p>

<p>The heat is dissipated from the chip surface into the environment. The bigger the surface, the more heat a chip can dissipate. The heatsinks glued on top of the chips increase the dissipating surface.</p>

<p>The form factor of the chassis cards dictates the size of the heatsinks installed on the chips: they are necessarily small to fit within the slot height.</p>

<p>Hence, to provide enough cooling (i.e. to take enough heat away from the chip and its heatsink) we need to force air through. To that end, we use fans (<em>I won't go into liquid cooling here, but the basic principle is the same</em>).</p>

<p>By removing the fan, I let the heat stay in the chip. I wasn't able to find a datasheet for the SR71000 processor (<em>owned by Broadcom, who are too shy to publish anything</em>) used as the SP and RP in Sup720, but as a reference, <a href="https://www-ssl.intel.com/content/www/us/en/processors/core/2nd-gen-core-lga1155-socket-guide.html">Intel CPUs</a> (<em>I'm specifically choosing to look at an older generation of Intel CPUs here, as I hope the SR71000's tech is about the same age</em>) are tested up to about 70 degrees Celsius. Given the warning at 50 degrees I got in the logs, that seems to be a reasonable estimate.</p>

<h3 id="test-environment">Test environment</h3>

<ol>
  <li>WS-C6506-E with Sup720</li>
  <li>Two Gigabit Ethernet linecards inserted so as to have at least one free slot between any two cards</li>
  <li>linecard slot blanks removed</li>
  <li>it was an isolated lab system without any traffic to load it</li>
  <li>AC in the room providing a steady 23 degrees Celsius</li>
</ol>]]></content><author><name></name></author><category term="switching" /><summary type="html"><![CDATA[Recently, I tested a Cisco 6500 switch in a fan-less configuration, to see how long it can go.]]></summary></entry></feed>