<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://askbow.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://askbow.com/" rel="alternate" type="text/html" /><updated>2026-02-18T07:51:41+00:00</updated><id>https://askbow.com/feed.xml</id><title type="html">🐱‍👤Askbow</title><subtitle>Lorem ipsum generator is broken.</subtitle><entry><title type="html"></title><link href="https://askbow.com/2026/02/18/2015-04-20-dual-homing-with-policy-based-routing-a-case-study" rel="alternate" type="text/html" title="" /><published>2026-02-18T07:51:41+00:00</published><updated>2026-02-18T07:51:41+00:00</updated><id>https://askbow.com/2026/02/18/2015-04-20-dual-homing-with-policy-based-routing-a-case-study</id><content type="html" xml:base="https://askbow.com/2026/02/18/2015-04-20-dual-homing-with-policy-based-routing-a-case-study"><![CDATA[<p><a href="https://askbow.com/2015/04/13/policy-based-routing-ip/" title="Policy-based">Policy-based routing</a> allows a network administrator to steer traffic in directions different from those chosen by <a href="https://askbow.com/2015/04/06/general-ip-routing/" title="General">destination-based routing</a> and its <a href="https://askbow.com/tag/routing-protocols/">routing protocols</a>. This can be useful in several scenarios, notably dual-homing to different ISPs, as well as other special cases.</p>

<h2 id="using-policy-based-routing-for-dual-homing">Using policy-based routing for dual-homing</h2>

<h3 id="general-notes-on-dual-homing">General notes on dual-homing</h3>

<p>The term dual-homing, in its most general meaning, refers to a situation in which connectivity to the same resource is built via two (or more) independent paths. Here, dual-homing is about having connectivity to another network (the Internet) via several Internet Service Providers.</p>

<p>There are usually several goals to achieve with this type of connection to the Internet:</p>

<ul>
  <li>Internet connection resiliency - with businesses world-wide relying ever more on the Internet (even more so with the rise of cloud services) to operate, the Internet connection might be as important as air; thus, many prefer to install a secondary link in case the primary link fails;</li>
  <li>More bandwidth - when one connection isn't enough;</li>
  <li>Cost optimization - some ISPs might charge you for traffic volume (<em>others just provide you with limited bandwidth, but with no limit on traffic besides the maximum possible time × bandwidth</em>), and some of those will charge differently for different kinds of traffic;
    <blockquote>
      <p>for example, I remember days when a local (in-country) megabyte of traffic was way cheaper than a foreign one. At the same time, due to poor interconnectivity at the local IX, traffic to a nearby city was routed through an IX in Germany and thus treated as foreign.</p>
    </blockquote>
  </li>
</ul>

<p>Probably the best solution (design-wise) in this case is to get a provider-independent network and an autonomous system number, then use these to peer with both ISPs via BGP. This solution is usually flexible, scalable, and relatively easy to support. On the other hand, it may cost more - especially with IPv4 address depletion at hand.</p>

<h3 id="policy-based-routing-to-the-rescue">Policy-based routing to the rescue</h3>

<p>When there are reasons not to buy address space, but we still want to use all of the links to the Internet simultaneously, policy-based routing (PBR) will help.</p>

<p>With PBR, a network administrator is able to:</p>

<ul>
  <li>(<em>in some topologies</em>) ensure that traffic coming in via one ISP will return via the same ISP</li>
  <li>route HTTP and FTP (or any other port) traffic to a certain ISP, while routing SMTP and DNS via another</li>
  <li>ensure that traffic from some users will be forwarded to a certain ISP, or even load-shared per L4 port</li>
</ul>

<p>Have a web server that you need to always be visible through only one ISP? PBR can do that.</p>

<h2 id="the-case-for-pbr-based-dual-homing">The case for PBR-based dual-homing</h2>

<h3 id="dual-homing-topology">Dual-homing topology</h3>

<p>So, here is my example topology:</p>

<blockquote>
  <p><em>I'm trying my best not to use completely textbook examples here, so I use a fairly simplified topology from my work experience</em></p>
</blockquote>

<p><a href="https://askbow.com/wp-content/uploads/2015/04/pbr-example2.png"><img src="https://askbow.com/wp-content/uploads/2015/04/pbr-example2.png" alt="Example" /></a></p>

<p>(<em>not shown: some less important redundant connections</em>) Here, Routers 1&amp;2 are the border routers, and Router3 performs NAT and firewalling between the Internet and the LAN/DMZ. In that capacity, Router3 terminates all spare IPs in the two /27 networks provided by the ISPs. Also, Router3 cannot perform policy-based routing itself for some reason.</p>

<p>Router 3 has two default routes, pointing at Routers 1&amp;2 (only two, because I'm keen on VRRP) in each of the /27 networks. ISP1 doesn't know (and thus doesn't route) ISP2's network, and vice versa.</p>
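<p>As a sketch (addresses assumed for illustration; 1.1.1.2 and 2.2.2.2 stand in for the VRRP virtual addresses of Routers 1&amp;2 in the two /27 networks), Router3's pair of default routes could look like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ip route 0.0.0.0 0.0.0.0 1.1.1.2
ip route 0.0.0.0 0.0.0.0 2.2.2.2
</code></pre></div></div>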

<p>The ISPs' gateways MUST have a route in their tables pointing traffic destined to the respective /27 networks towards our Routers 1&amp;2.</p>

<p>We need to ensure that traffic originated by Router3 from its IPs in ISP1's /27 network will always go towards the ISP1 router that serves as its default gateway; the same goes for ISP2's network: always forward to the ISP2 gateway.</p>

<h3 id="how-policy-based-routing-is-configured-in-this-case">How policy-based routing is configured in this case</h3>

<p>There are two elements that need to be configured for this to work:</p>

<ol>
  <li>Two access lists, to match traffic origins</li>
  <li>Two route-maps, to do the PBR itself</li>
</ol>

<p>An access list will look like this (Cisco IOS 15.x):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ip access-list standard ISP1-NET
 permit 1.1.1.0 0.0.0.31
</code></pre></div></div>

<p>If the same ISP owns more than one network, the respective ACL will contain either more lines (suggested for ease of management) or a summary network (which may or may not improve speed, depending on the platform).</p>
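<p>For instance, if ISP1 assigned us a second (hypothetical) /27, the ACL would simply grow by a line:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ip access-list standard ISP1-NET
 permit 1.1.1.0 0.0.0.31
 permit 1.1.2.0 0.0.0.31
</code></pre></div></div>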

<p>A route-map will then look like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>route-map ISP-FORWARD permit 10
 match ip address ISP1-NET
 set ip next-hop 1.1.1.1
</code></pre></div></div>

<p>Here, we match addresses listed in the access list created previously, and for any matching packets we set the next hop to 1.1.1.1.</p>

<p>Another option would be to <em>set ip next-hop recursive.</em> The <em>recursive</em> keyword helps when the next-hop is not adjacent (that is, not directly reachable on one of the connected networks). Not our case, but still nice to know.</p>
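<p>As a hedged sketch (192.0.2.1 is a made-up, non-adjacent gateway address, not part of the original topology), the recursive variant would read:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>route-map ISP-FORWARD permit 10
 match ip address ISP1-NET
 set ip next-hop recursive 192.0.2.1
</code></pre></div></div>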

<p>Obviously, the same configuration is made for ISP2, using relevant addresses for networks and hosts/gateways.</p>
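<p>For completeness, the ISP2 half might look like this (the 2.2.2.x addresses are placeholders assumed for illustration):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ip access-list standard ISP2-NET
 permit 2.2.2.0 0.0.0.31
!
route-map ISP-FORWARD permit 20
 match ip address ISP2-NET
 set ip next-hop 2.2.2.1
</code></pre></div></div>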

<p>After that, the route-map (it is possible to create multiple, but I found it more manageable to use just one) is applied to each of the <em>inside-facing</em> interfaces:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>interface gi0/1.42
 ip policy route-map ISP-FORWARD
</code></pre></div></div>

<p>Moreover, it might be necessary to make traffic originated by the router itself behave the same way:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ip local policy route-map ISP-FORWARD
</code></pre></div></div>

<h3 id="how-policy-based-routing-works-in-this-case">How policy-based routing works in this case</h3>

<p>As illustrated by the colorful arrows on the diagram above, PBR makes any traffic originating from the orange subnets be forwarded to the first ISP, and any traffic originating in the purple network be forwarded to the second ISP. That is, it follows this simple procedure:</p>

<ol>
  <li>A packet arrives on an ingress interface; <strong>N.B.:</strong> <em>route-maps, in PBR cases, are always applied on ingress interfaces</em></li>
  <li>The route-map applied to the interface has a PERMIT statement with a MATCH clause referencing an access list</li>
  <li>If the packet matches any of the access-list statements, the SET directive is applied; otherwise the next route-map statement is evaluated</li>
  <li>The SET directive tells the router to forward the packet to the stated next hop.</li>
  <li>The next-hop is evaluated and the egress interface is determined</li>
  <li>The packet is forwarded to the next-hop out of the egress interface</li>
</ol>
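<p>To verify that the policy behaves as intended, a few standard IOS commands help (output omitted here): <em>show route-map</em> displays per-clause match counters, <em>show ip policy</em> lists which route-map is applied to which interface, and <em>debug ip policy</em> traces per-packet decisions (use with care in production).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>show route-map ISP-FORWARD
show ip policy
debug ip policy
</code></pre></div></div>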

<p>The return traffic is forwarded normally, using the regular destination-based routing process.</p>

<h3 id="how-to-live-with-that">How to live with that</h3>

<p>As ISPs and inside networks are added and removed (that happens too), this configuration is relatively easy to maintain:</p>

<ul>
  <li>If an ISP gives you another network, it is easy to add it to that ISP's respective ACL</li>
  <li>If, for some reason, another interface must be added, the PBR-related configuration can be copied from existing interfaces as-is;</li>
  <li>better still, have Router3 terminate them in its logic and only add a new [static] route pointing to Router3 in the Router 1&amp;2 configurations, plus a new line to the ACL</li>
</ul>

<p>There are some problems, though. For example, Router3 must be able to decide which IP to originate traffic from, how to maintain state for the flows traversing it, and how to learn about failures upstream.</p>

<p>The scheme I've drawn here protects from a Router 1 or 2 failure (or a failure of one of their interfaces). Some parts are missing (like VRRP and IP SLA tracking) which are not relevant to the PBR case itself but are, in reality, present in the configuration.</p>

<p>The problem is that this technique doesn't help in case of ISP failure, especially a partial failure (for example, an ISP loses connectivity to half of the Internet) - we can't detect or work around such a failure (BGP peering, receiving full views from both ISPs, would help).</p>

<p>Thus the PBR option is far from perfect and should be used for dual-homing with caution and careful planning.</p>]]></content><author><name></name></author></entry><entry><title type="html">Explain Like I’m 5: ChatGPT</title><link href="https://askbow.com/2025/05/02/eli5-chatgpt" rel="alternate" type="text/html" title="Explain Like I’m 5: ChatGPT" /><published>2025-05-02T08:30:00+00:00</published><updated>2025-05-02T08:30:00+00:00</updated><id>https://askbow.com/2025/05/02/eli5-chatgpt</id><content type="html" xml:base="https://askbow.com/2025/05/02/eli5-chatgpt"><![CDATA[<h1 id="explain-like-im-5-chatgpt">Explain Like I’m 5: ChatGPT</h1>

<p>Scrolling morning Reddit, I stumbled upon a great question in the ELI5 sub.
It went something like this:</p>
<blockquote>
  <p>Why doesn’t ChatGPT admit it doesn’t know stuff?</p>
</blockquote>

<p>Most top answers there were correct, but alas pitched at high-school level or above.</p>

<p>It got me thinking. Can I explain it to a literal five-year-old?
Now, five-year-olds in the year 2025 are probably way smarter than I was.
Here’s an explanation that would have worked for my own self, as far as I was aware of things at that age.</p>

<blockquote>
  <p>What follows is a really basic analogy. Like any model, an analogy has its limits.</p>
</blockquote>

<h1 id="eli5-how-do-they-train-a-model">ELI5: How do they “train” a “model”?</h1>
<p>Imagine a rabbit hopping on a field 🐇 ⛳⛳⛳
The field is uneven: little hills🏞, little valleys🌄⛺</p>

<p>We plant on the field various grasses 🌾🌺🌻🌼🌷 the rabbit might like to chew or hide in.
We also place little rocks to create paths, place food 🥕🌽 and water 🌊 so that the rabbit could go to these things.
All in all, we made a really huge garden for our rabbit to roam.</p>

<p>We first guide the rabbit through the garden by some path. We show it some food and some water. Then, we let it roam free.</p>

<p>It starts hopping around. Depending on how playful it is, it might even jump over our little stone fences to other areas! 
It goes on some journey around our garden. We record its path on our map🗺.</p>

<p>Maybe the path we have drawn on the map is not what we’d like to see. We want the rabbit’s journey to make a pretty picture on our map.</p>

<p>We carefully move the things around in the garden. We hope to guide the rabbit’s journey more to our liking.</p>

<p>We repeat the experiment many times, a million million times, until the rabbit’s journey on the map looks pretty.</p>

<p>The rabbit itself doesn’t see our map. It just chooses where to hop next. And it can only hop from the point where it is to some other nearby point that it sees and wants to get to.</p>

<h2 id="how-does-this-translate">How does this translate?</h2>

<ul>
  <li>The rabbit hopping around carelessly is the LLM’s algorithm.
We can have slightly different algorithms by picking other animals!</li>
  <li>The placement of the things in the garden are called the “model weights”. The weights that made a rabbit’s path pretty may not work for a horse🐎! A horse and a rabbit share some fundamentals. They are both mammals, have four legs, eat grass, etc. But they differ in development and movement ability. The LLM algorithms share the common fundamentals of the neural nets, generators, and the attention feedback loop.</li>
  <li>The initial path by which we guide the rabbit before setting it free is called the “prompt”.</li>
  <li>The training makes the path look “pretty”, just like the model training made the produced text look like it was written by a person.</li>
</ul>

<h1 id="eli5-how-do-people-interact-with-a-model">ELI5: How do people interact with a “model”?</h1>

<p>We make the whole thing into a farm!</p>

<p>We let other people visit our farm and play with the animals on their respective fields.</p>

<p>For example, our visitors can take turns introducing a rabbit to the field: guiding it in, and then letting it roam free. Then it hops around a bit.
Then they guide it some more, until the rabbit’s journey drawn on a map looks pretty to them.</p>

<p>But it turns out, the initial guiding path is important. So we take over guiding the rabbit the first few steps into the garden, before letting our visitors take it further.</p>

<h2 id="how-does-this-translate-1">How does this translate?</h2>

<ul>
  <li>people can choose different underlying models, like they can choose which animal to play with</li>
  <li>people write their question to the model, the same way we let them guide the rabbit into the field</li>
  <li>after letting the rabbit roam free for some time, people can guide it a little more, adding to their interaction – but the whole rabbit’s journey is recorded; the same way you respond to ChatGPT’s initial reply if it wasn’t exactly what you wanted</li>
  <li>when we take over the initial few steps, we create what is called the “system prompt”</li>
</ul>

<h1 id="eli5-why-doesnt-it-stop-and-say-it-doesnt-know">ELI5: Why doesn’t it stop and say it doesn’t “know”?</h1>

<p>The rabbit is careless and doesn’t know anything. The beauty is in the eye of the beholder. The figures we draw on our map as we trace the rabbit’s path look nice to us, but the rabbit doesn’t really see the map. It just chooses where to hop next.
And it always does hop somewhere in the garden, attracted by the things it sees that we placed there.</p>

<h2 id="how-does-this-translate-2">How does this translate?</h2>

<ul>
  <li>an LLM operates on symbols connected by a graph, and all it does is find some pseudo-random path on that graph.</li>
  <li>I like an explanation I’ve read from Kent Beck recently, for coding LLMs. There is a set of all possible programs. The LLM doesn’t know which ones are actually correct. More generally, an LLM finds a nice-looking path in a field of all possible vectors of tokens. It always goes along <strong>some</strong> path.</li>
</ul>

<h1 id="eli5-what-do-the-llm-researchers-and-developers-do">ELI5: What do the LLM researchers and developers do?</h1>

<p>It’s a lot of work.</p>

<p>Some people are herding and nurturing the rabbits, the horses, and the cats. Somebody has to breed those special fluffy chickens too! Some people try to make it work with little mice and hamsters.</p>

<p>Other people are doing some landscape design so that the animals roam the gardens in specific ways. There are other people who make the shovels and excavator machines for the gardeners.</p>

<p>Yet other people create the initial paths that are later useful for the guests.
Then some other people build the farm and post the ads.</p>

<h2 id="how-does-this-translate-3">How does this translate?</h2>

<ul>
  <li>some people design and optimize the algorithms to a variety of performance requirements</li>
  <li>some people run the training to produce the model weights for various needs</li>
  <li>some people make the hardware that runs the computation</li>
  <li>some people develop the prompts, set up the infrastructure (backend, frontend), etc</li>
</ul>]]></content><author><name></name></author><category term="musings" /><summary type="html"><![CDATA[Explain Like I’m 5: ChatGPT]]></summary></entry><entry><title type="html">How to find initial function authors in large Git repo</title><link href="https://askbow.com/2023/12/17/how-to-find-information-in-git" rel="alternate" type="text/html" title="How to find initial function authors in large Git repo" /><published>2023-12-17T09:30:00+00:00</published><updated>2023-12-17T09:30:00+00:00</updated><id>https://askbow.com/2023/12/17/how-to-find-information-in-git</id><content type="html" xml:base="https://askbow.com/2023/12/17/how-to-find-information-in-git"><![CDATA[<p>Sometimes, when onboarding into a new codebase we need to explore a little. Some older documentation references might include only partial commit hashes. Other methods were touched by several commits that may include additional context. Finally, we may just want to find the initial author of a given method. How to go about those?</p>

<blockquote>
  <p>Here and further, I’ll be looking at the <a href="https://github.com/Azure/azure-quickstart-templates">azure-quickstart-templates</a> repo in the state as of this writing in Dec 2023.</p>
</blockquote>

<h2 id="how-to-find-full-commit-hash-from-a-partial">How to find full commit hash from a partial?</h2>

<p>Let’s find the full hash of a commit starting <code class="language-plaintext highlighter-rouge">82a5218</code> that we found in some doc from a few years back:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git show <span class="s2">"82a5218"</span> <span class="nt">--no-patch</span> <span class="nt">--pretty</span><span class="o">=</span><span class="s2">"%H %s"</span>
82a5218d94226a85083c9cf748e8549500cdf405 End-to-end Azure ML <span class="nb">set </span>up reference implementation <span class="o">(</span><span class="c">#12006)</span>
</code></pre></div></div>
<ul>
  <li><code class="language-plaintext highlighter-rouge">%H</code> for the full hash message, <code class="language-plaintext highlighter-rouge">%s</code> for the single-line commit message</li>
</ul>

<p>Notice that this repository is large enough for git to start using slightly longer hashes in its default output compared to what we found in the docs:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git show <span class="s2">"82a5218"</span> <span class="nt">--no-patch</span>
82a5218d9  End-to-end Azure ML <span class="nb">set </span>up reference implementation <span class="o">(</span><span class="c">#12006)</span>
</code></pre></div></div>
<ul>
  <li>git automatically shortens or extends the displayed hash so that it remains unique within the repository, while being as human-readable as possible</li>
</ul>

<p>How about the parents of that commit? We can use <code class="language-plaintext highlighter-rouge">git rev-parse</code> with some modification to the hash reference for that:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git rev-parse 82a5218d9^@
f88c1f77c8340cb914d999c7d005b512ad4ab9c6
</code></pre></div></div>
<ul>
  <li>the <code class="language-plaintext highlighter-rouge">^@</code> literally means “all parents of the specified commit”, or more specifically “anything that is reachable from the commit’s parents” excluding the commit itself.</li>
</ul>

<h2 id="how-to-find-all-commits-that-happened-between-two-other-events">How to find all commits that happened between two other events?</h2>

<p>By events I mean also commits, but in a more general form of “pointers” or refs. These include branch (and <code class="language-plaintext highlighter-rouge">HEAD</code>) pointers, tags, etc.</p>

<p>For a more concise output, let’s count all commits since <code class="language-plaintext highlighter-rouge">82a5218</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git log 82a5218..HEAD <span class="nt">--oneline</span> | <span class="nb">wc</span> <span class="nt">-l</span>
3249
</code></pre></div></div>
<ul>
  <li>the <code class="language-plaintext highlighter-rouge">A..B</code> notation means “commits reachable from B but not from A” - so A itself is excluded from the count</li>
</ul>

<h2 id="how-to-find-the-very-first-commit-for-a-specific-method">How to find the very first commit for a specific method?</h2>

<p>Let’s say we are looking at the origins of the KeyVault usage.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git log <span class="nt">-G</span><span class="s1">'Microsoft.KeyVault/vaults\@.+'</span> <span class="nt">--oneline</span> | <span class="nb">tail</span> <span class="nt">-1</span>
e33bf9904 New from bicep example: 101/function-http-trigger <span class="o">(</span><span class="c">#11759)</span>
</code></pre></div></div>
<ul>
  <li><a href="https://github.com/Azure/azure-quickstart-templates/commit/e33bf9904d599f5b7fd401e8171328d460af2fbc#diff-f8d1aa6307090ce36078ceb0660eef646ab3c4a81380dadb3b93f88a0c2cd1edR180">e33bf9904</a></li>
</ul>

<p>Notably, the built-in regex support of <code class="language-plaintext highlighter-rouge">git log</code> allows us to search through commit diffs. In this case, we are only interested in the very first instance, which is the last one in the log output, hence <code class="language-plaintext highlighter-rouge">tail -1</code>.</p>

<p>Another git built-in search, <code class="language-plaintext highlighter-rouge">git grep</code>, allows us to find, for example, the resource definitions across the code:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git <span class="nb">grep</span> <span class="s1">'resource keyVault'</span> | <span class="nb">tail
</span>quickstarts/microsoft.network/azurefirewall-premium/main.bicep:resource keyVault <span class="s1">'Microsoft.KeyVault/vaults@2019-09-01'</span> <span class="o">=</span> <span class="o">{</span>
quickstarts/microsoft.network/azurefirewall-premium/main.bicep:resource keyVaultName_keyVaultCASecret <span class="s1">'Microsoft.KeyVault/vaults/secrets@2019-09-01'</span> <span class="o">=</span> <span class="o">{</span>
quickstarts/microsoft.storage/storage-blob-encryption-with-cmk/main.bicep:resource keyVault <span class="s1">'Microsoft.KeyVault/vaults@2021-10-01'</span> <span class="o">=</span> <span class="o">{</span>
quickstarts/microsoft.web/function-http-trigger/main.bicep:resource keyVault <span class="s1">'Microsoft.KeyVault/vaults@2019-09-01'</span> <span class="o">=</span> <span class="o">{</span>
quickstarts/microsoft.web/function-http-trigger/main.bicep:resource keyVaultSecret <span class="s1">'Microsoft.KeyVault/vaults/secrets@2019-09-01'</span> <span class="o">=</span> <span class="o">{</span>
quickstarts/microsoft.web/private-webapp-with-app-gateway-and-apim/main.bicep:resource keyVaultPrivateEndpoint <span class="s1">'Microsoft.Network/privateEndpoints@2021-02-01'</span> <span class="o">=</span> <span class="o">{</span>
quickstarts/microsoft.web/private-webapp-with-app-gateway-and-apim/main.bicep:  resource keyVaultPrivateDnsZoneGroup <span class="s1">'privateDnsZoneGroups'</span> <span class="o">=</span> <span class="o">{</span>
quickstarts/microsoft.web/private-webapp-with-app-gateway-and-apim/main.bicep:resource keyVaultPrivateDnsZone <span class="s1">'Microsoft.Network/privateDnsZones@2020-06-01'</span> <span class="o">=</span> <span class="o">{</span>
quickstarts/microsoft.web/private-webapp-with-app-gateway-and-apim/main.bicep:  resource keyVaultPrivateDnsZoneLink <span class="s1">'virtualNetworkLinks'</span> <span class="o">=</span> <span class="o">{</span>
quickstarts/microsoft.web/private-webapp-with-app-gateway-and-apim/main.bicep:resource keyVault <span class="s1">'Microsoft.KeyVault/vaults@2021-04-01-preview'</span> <span class="o">=</span> <span class="o">{</span>
</code></pre></div></div>

<p>This one shows us the locations in the repository – that is, the file names. How do we find the history of modifications to them? It turns out that <code class="language-plaintext highlighter-rouge">git log</code> can search by a function or line reference:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git log <span class="nt">-L</span>:<span class="s2">"resource keyVault:quickstarts/microsoft.web/private-webapp-with-app-gateway-and-apim/main.bicep"</span> <span class="nt">--no-patch</span> <span class="nt">--oneline</span>
ef64cb155 New quickstart showing Application Gateway with internal API Management and Web App <span class="o">(</span><span class="c">#11939)</span>
</code></pre></div></div>
<ul>
  <li>here, I am taking the last line from the previous command’s output</li>
</ul>

<p>What if along with the commit message, we wanted to see the author? For such cases <code class="language-plaintext highlighter-rouge">git log</code> supports formatting:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git log <span class="nt">-L</span>:<span class="s2">"resource keyVault:quickstarts/microsoft.web/private-webapp-with-app-gateway-and-apim/main.bicep"</span> <span class="nt">--no-patch</span> <span class="nt">--oneline</span> <span class="nt">--pretty</span><span class="o">=</span><span class="s2">"%h %s || %an %ae"</span> | <span class="nb">tail</span> <span class="nt">-1</span>
ef64cb155 New quickstart showing Application Gateway with internal API Management and Web App <span class="o">(</span><span class="c">#11939) || Michael S. Collier mcollier@microsoft.com</span>
</code></pre></div></div>]]></content><author><name></name></author><category term="git" /><summary type="html"><![CDATA[Sometimes, when onboarding into a new codebase we need to explore a little. Some older documentation references might include only partial commit hashes. Other methods were touched by several commits that may include additional context. Finally, we may just want to find the initial author of a given method. How to go about those?]]></summary></entry><entry><title type="html">Moving site to Jekyll, trying Mermaid diagrams, Mathjax LaTex</title><link href="https://askbow.com/2023/12/17/try-mermaid-diagrams" rel="alternate" type="text/html" title="Moving site to Jekyll, trying Mermaid diagrams, Mathjax LaTex" /><published>2023-12-17T08:30:00+00:00</published><updated>2023-12-17T08:30:00+00:00</updated><id>https://askbow.com/2023/12/17/try-mermaid-diagrams</id><content type="html" xml:base="https://askbow.com/2023/12/17/try-mermaid-diagrams"><![CDATA[<h1>🐱‍👤🐱‍👤🐱‍👤</h1>

<h2 id="md">MD?</h2>

<p>I hope to later find a good MD editor for Jekyll. Something sensible.</p>

<blockquote>
  <p>Note: I collapsed the older MD post into this one; no need for more than one sandbox</p>
</blockquote>

<h2 id="️-mermaid">🧜‍♀️ Mermaid?</h2>

<p>Mermaid is a JavaScript library for drawing diagrams directly in Markdown.</p>

<p>https://mermaid.live/</p>

<h3 id="a-test-diagram">A test diagram</h3>

<pre><code class="language-mermaid">graph TD;
    A--&gt;B;
    A--&gt;C;
    B--&gt;D;
    C--&gt;D;
</code></pre>

<h2 id="mathjax-latex">Mathjax? LaTeX?</h2>

<p>Mathjax is a framework that implements LaTeX rendering for the web. I use it to display nice formulas directly from Markdown.</p>

<p>https://jbergknoff.github.io/mathjax-sandbox/</p>

<h3 id="a-simple-formula">A simple formula</h3>

<p>A formula can be rendered inline: $y = a\times x^2 + b\times x + c$</p>

<p>Or as a block:</p>

\[L = \frac{\pi^2\times R}{2}\]]]></content><author><name></name></author><category term="sandbox" /><summary type="html"><![CDATA[🐱‍👤🐱‍👤🐱‍👤]]></summary></entry><entry><title type="html">How to safely transform a routing domain</title><link href="https://askbow.com/2018/09/11/how-to-safely-transform-a-routing-domain" rel="alternate" type="text/html" title="How to safely transform a routing domain" /><published>2018-09-11T12:36:00+00:00</published><updated>2018-09-11T12:36:00+00:00</updated><id>https://askbow.com/2018/09/11/how-to-safely-transform-a-routing-domain</id><content type="html" xml:base="https://askbow.com/2018/09/11/how-to-safely-transform-a-routing-domain"><![CDATA[<p>As part of my job as a Senior Network Engineer, I develop procedures for undertakings of varying complexity. In this post I'm describing a technique that greatly simplifies any project where a routing domain is expected to churn (<em>i.e. neighborships going up and down, routes flapping</em>), when such event is undesirable.</p>

<h3 id="motivation">Motivation</h3>

<p>I developed this technique for a client running a critical network operating 24x7, with data flowing across ten timezones. The prime objective was minimizing packet loss during the procedure, so that the real-time application could continue operating.</p>

<blockquote>
  <p><em>Personally, I consider this application way too brittle and in huge need of a redesign. But that is beside the point of this blog post.</em></p>
</blockquote>

<p>The original project for which I developed this procedure was the segmentation of a flat OSPF (<em>i.e. single-area</em>) network into multiple areas of different types. However, the technique is general enough to adapt easily to other similar projects.</p>

<h3 id="design-options-for-transforming-a-routing-domain">Design options for transforming a routing domain</h3>

<p>We start with a routing domain in state A and want to transform it into state B, without losing connectivity in the process.</p>

<p><a href="https://askbow.com/wp-content/uploads/2018/08/routing-domain-A-B.png"><img src="https://askbow.com/wp-content/uploads/2018/08/routing-domain-A-B-300x163.png" alt="" /></a></p>

<blockquote>
  <p><em>Why do that? To isolate less-stable WAN churn from more-stable LAN, for one. Also remember that in OSPF you can effectively enforce policy only at ABRs/ASBRs, so segmentation may make sense for you.</em></p>
</blockquote>

<p>There are a few general ways we could go about that:</p>

<ol>
  <li>Schedule a maintenance window and do the job as quickly as possible.
  Good: you just do the core job.
  Bad: packets will be dropped in the process.</li>
  <li>Spin up a parallel routing domain temporarily over the same network.
  Good: few commands to apply per device, with the routing protocol automatically taking care of connectivity all the way.
  Bad: possible routing loops, and you need to account for the existing routing policy (<em>redistribution, filtering, costs, etc.</em>).</li>
  <li>Convert the existing routing tables into static routes and use them.
  Good: the routing tables are already assumed loop-free and based on policy.
  Bad: there are hundreds or thousands of routes - an overwhelming volume.</li>
</ol>

<p>Luckily, the overwhelming volume part is easily solved (<em>or so I thought, see below</em>) with automation!</p>

<p>Hence, I decided that we must convert the existing routing tables into long lists of static routes, which we add to the configuration of every device in the network we're working on.</p>

<h3 id="but-there-are-thousands-of-them-routes">But there are thousands of them routes!</h3>

<p>Python to the rescue!</p>

<p>The script I wrote to wrangle this task is on GitHub: <a href="https://github.com/askbow/networking-tools/blob/master/routep.py">https://github.com/askbow/networking-tools/blob/master/routep.py</a></p>

<blockquote>
  <p>Note: this is an old post; I wrote this script before TextFSM came to my attention; the script essentially implements a single-purpose finite-state-machine to parse text input (“screen-scraping”).
Nowadays, just use TextFSM.</p>
</blockquote>

<p>The basic idea of the script is this:</p>

<ol>
  <li>load <code class="language-plaintext highlighter-rouge">show ip route</code> output from a file</li>
  <li>parse it line-by-line into a dictionary, where keys are prefixes and values are lists of nexthops and interfaces</li>
  <li>optionally optimize the routes where safely possible</li>
  <li>go through the dictionary and print static route commands to default output</li>
</ol>
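<p>For illustration, here is a minimal sketch of steps 2 and 4 (<em>this is not the actual routep.py: the regex, the default AD of 250, and the output format are simplifying assumptions, and real <code class="language-plaintext highlighter-rouge">show ip route</code> output varies wildly between platforms</em>):</p>

```python
import re
from collections import defaultdict
from ipaddress import ip_network

# Hypothetical single-format parser; the real script needs a state machine
# because "show ip route" output differs between IOS and ASA versions.
ROUTE_RE = re.compile(
    r"^\S+\s+(?P<prefix>\d+\.\d+\.\d+\.\d+/\d+)"
    r"\s+\[\d+/\d+\]\s+via\s+(?P<nexthop>\d+\.\d+\.\d+\.\d+)"
)

def parse_routes(show_output):
    """Step 2: map each prefix to the list of next hops seen for it."""
    routes = defaultdict(list)
    for line in show_output.splitlines():
        match = ROUTE_RE.match(line.strip())
        if match:
            routes[match.group("prefix")].append(match.group("nexthop"))
    return dict(routes)

def static_commands(routes, distance=250):
    """Step 4: emit 'ip route' commands with a high administrative distance."""
    for prefix, nexthops in sorted(routes.items()):
        net = ip_network(prefix)
        for nexthop in nexthops:
            yield f"ip route {net.network_address} {net.netmask} {nexthop} {distance}"
```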

<p>The result is a neat list of all the routes in the routing domain (<em>of which this device is aware</em>) with a high administrative distance.</p>

<p>The most gruesome challenge in writing this script was the sheer inconsistency of Cisco IOS and ASA products of different versions in terms of <code class="language-plaintext highlighter-rouge">show ip route</code> output structure.</p>

<blockquote>
  <p><em>If the network were built of just one type of device running one version of software, the whole script would've been three times shorter and would basically consist of a single RegEx match to extract the information I need. My script is ugly because it must parse the ugly.</em></p>
</blockquote>

<h3 id="known-issues">Known issues</h3>

<p>This simple method was generally successful, simplifying procedures for more than a dozen projects. There were, however, some operational challenges I must make you aware of.</p>

<p>First, the high administrative distance I chose as the script's default is not optimal in some topologies. Such topologies tend to be complex, and the problem lies at the intersection of several routing domains. For example, iBGP takes precedence with its lower AD, steering routes across a different path. Adjust the script accordingly.</p>

<p>Second, the route optimization the script employs is very straightforward: it aggregates adjacent prefixes where possible and drops equal-cost duplicates from the lists. This usually reduces the length of the resulting command list several times over. Yet, as with any aggregation, you lose detailed routing information, and that may introduce some additional risks.</p>
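<p>The aggregation idea can be sketched in a couple of lines with the standard library (<em>the real script predates this, so treat it as an illustration only</em>):</p>

```python
from ipaddress import collapse_addresses, ip_network

# Adjacent prefixes (with the same next hop) collapse into one supernet,
# which is what shrinks the resulting command list several times over.
prefixes = ["10.1.0.0/24", "10.1.1.0/24", "10.2.0.0/24"]
collapsed = [str(net) for net in collapse_addresses(ip_network(p) for p in prefixes)]
print(collapsed)  # ['10.1.0.0/23', '10.2.0.0/24']
```

<p>Note that collapsing is only safe per next hop - mixing prefixes that point different ways is exactly how you lose routing information.</p>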

<p>In my practice, both of these have only played out in <em>more-complicated-than-usual</em> topologies. Your mileage may vary; be prepared and double-check.</p>

<h3 id="the-procedure-to-safely-transform-a-routing-domain">The procedure to safely transform a routing domain</h3>

<p>With all that being said, here's an outline of a procedure which is based on the method described here.</p>

<ol>
  <li>collect fresh show ip route outputs from all devices in the immediate routing domain (<em>i.e. there might be no point in scraping those behind an aggregation/summarization wall</em>)</li>
  <li>parse them through the script to get static route commands</li>
  <li>apply static route commands to all devices</li>
  <li>do the main job (<em>i.e. change the routing protocol, change OSPF areas</em>)</li>
  <li>check that your routing domain is back up as expected (<em>make a checklist ahead of time!</em>)</li>
  <li>remove the static routes</li>
</ol>

<p>Looks simple to me and <em>It does the job</em>.</p>]]></content><author><name></name></author><category term="automation," /><category term="design," /><category term="ospf," /><category term="python," /><category term="routing" /><summary type="html"><![CDATA[As part of my job as a Senior Network Engineer, I develop procedures for undertakings of varying complexity. In this post I'm describing a technique that greatly simplifies any project where a routing domain is expected to churn (i.e. neighborships going up and down, routes flapping), when such event is undesirable.]]></summary></entry><entry><title type="html">Paper: Scanning the Internet for Liveness</title><link href="https://askbow.com/2018/06/14/paper-scanning-the-internet-for-liveness" rel="alternate" type="text/html" title="Paper: Scanning the Internet for Liveness" /><published>2018-06-14T08:20:00+00:00</published><updated>2018-06-14T08:20:00+00:00</updated><id>https://askbow.com/2018/06/14/paper-scanning-the-internet-for-liveness</id><content type="html" xml:base="https://askbow.com/2018/06/14/paper-scanning-the-internet-for-liveness"><![CDATA[<p>An interesting paper where the authors are building a better way to scan the Internet.</p>

<p><a href="https://sheharbano.com/assets/publications/ccr18-scan-liveness.pdf">https://sheharbano.com/assets/publications/ccr18-scan-liveness.pdf</a></p>

<p><em>Shehar Bano et al. Scanning the Internet for Liveness // ACM SIGCOMM Computer Communication Review, Volume 48 Issue 2, April 2018</em></p>

<blockquote>
  <p>Liveness—whether or not a target IP address responds to a probe packet—is a nuanced concept without a simple yes/no answer. Responsiveness directly depends on the probe type, the configuration of the targeted host, as well as on firewalling and filtering behaviors at the edge or within networks.</p>
</blockquote>

<p>Key findings include:</p>

<blockquote>
  <ul>
    <li>TCP and UDP probes increase the population responsive over ICMP by 18%,</li>
    <li>comprehensively capturing reply traffic (i.e., taking into account negative reply packets) increases the responsive population by more than 13%,</li>
    <li>TCP stacks do not consistently respond with a TCP Rst for non-available services—in our measurements only 24% of hosts with an active TCP stack respond to all the probes,</li>
    <li>our concurrent scans allow us to identify nearly 2M tarpits that would bias measurements that do not take them into account, and</li>
    <li>we report on the correlation of responsiveness across protocols uncovering potential filtering practices.</li>
  </ul>
</blockquote>

<p>Other things I found interesting:</p>

<blockquote>
  <ul>
    <li>probe redundancy [sending deferred repeated probes] increases the population of active IP addresses by 2.2%</li>
    <li>scans recorded 487M network alive IPs (IPall) out of 3.6B probed.</li>
    <li>they see that ICMP Echo probes are most effective in discovering network active IPs, revealing 79% of IPall, followed by TCP probes.</li>
    <li>they found that 16% of IPall can only exclusively be discovered via TCP, and a small but significant ≈2% can only be discovered via UDP probes.</li>
  </ul>
</blockquote>]]></content><author><name></name></author><category term="worth_reading," /><category term="networking," /><category term="research" /><summary type="html"><![CDATA[An interesting paper where the authors are building a better way to scan the Internet.]]></summary></entry><entry><title type="html">How many spares do you need?</title><link href="https://askbow.com/2018/05/11/how-many-spares-do-you-need" rel="alternate" type="text/html" title="How many spares do you need?" /><published>2018-05-11T15:08:00+00:00</published><updated>2018-05-11T15:08:00+00:00</updated><id>https://askbow.com/2018/05/11/how-many-spares-do-you-need</id><content type="html" xml:base="https://askbow.com/2018/05/11/how-many-spares-do-you-need"><![CDATA[<p>In designing a network, there is a question that is often missing an answer or at best, answered using some rule-of-thumb. How many spare units you should include in your BOM? Actually, do you need them at all?</p>

<blockquote>
  <p><strong>Disclaimer</strong>: <em>I won't be covering any of the really complex models. People who need them probably know about spare part forecasting and procurement more than I do. But some simple models are useful in general network design work, so here's my take on it.</em></p>
</blockquote>

<h3 id="tldr">TL;DR:</h3>

<p>It depends. The lower the mean time to recovery (MTTR) you want, the more likely it is that you need on-site spares. And the lower the MTTR, the higher the availability you get.</p>

<h2 id="why-discuss-spares">Why discuss spares?</h2>

<p>Let's go with a top-down approach here.</p>

<p>There can be several business drivers for really high network availability. A few examples:</p>

<ul>
  <li>network downtime cost is <strong>very</strong> high - think of a broker connecting to an exchange, or medical equipment during remote procedures (<em>these will become more and more common over the years</em>)</li>
  <li>regulatory / compliance - rules imposed by regulatory body (<em>industry association, state department</em>) upon your information system in general and by extension on the network</li>
  <li>tight SLAs with customers (<em>who then have cost / compliance or other stuff for their reasons</em>)</li>
</ul>

<p>To see why we may consider spare parts as part of high availability equation, let's go a little deeper.</p>

<h3 id="what-is-availability">What is availability</h3>

<p>Availability in its general mathematical form depends on two factors:</p>

<ul>
  <li>MTBF - mean time between failures; many people confuse MTBF with how long a given specimen will work for. A better, more practical understanding of it goes like this: if a vendor has sold 1 000 000 units (<em>power supplies for example</em>) with MTBF 1 000 000 hours, then on average they will be sending one replacement unit every hour.</li>
  <li>MTTR - mean time to recover [from failure] - how long it takes to fix a problem</li>
</ul>

<p>The availability is usually taken as $A = \frac{MTBF}{MTBF + MTTR}$ and the result might look something like 0.99818231.</p>

<blockquote>
  <p>There's a comprehensive article on that topic over at Packet Pushers: <a href="http://packetpushers.net/reliability-basics-part1/">Reliability Basics- Part1 by Diptanshu Singh</a>. There's no point in repeating all of that math background here.</p>
</blockquote>

<p>What it means in practical terms is, you can compute the expected (<em>notice expected - it's all a matter of statistics</em>) downtime by taking $T_d = (1 - A) \times T$, where $T$ is your time budget (<em>most people use a Gregorian year here, as expressed in minutes or seconds</em>).</p>
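<p>Both formulas fit in a few lines of Python (<em>the MTBF and MTTR figures below are made up for illustration</em>):</p>

```python
MINUTES_PER_YEAR = 365 * 24 * 60

def availability(mtbf_h, mttr_h):
    # A = MTBF / (MTBF + MTTR)
    return mtbf_h / (mtbf_h + mttr_h)

def expected_downtime(a, budget_minutes=MINUTES_PER_YEAR):
    # T_d = (1 - A) * T
    return (1 - a) * budget_minutes

a = availability(mtbf_h=300_000, mttr_h=24)  # next-business-day repair
print(f"A = {a:.6f}, expected downtime = {expected_downtime(a):.0f} min/year")
```

<p>Notice how even a day-long MTTR keeps a single well-built device above "three nines" - it's the repeated or prolonged repairs that eat your availability budget.</p>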

<h3 id="how-do-you-increase-your-network-availability">How do you increase your network availability?</h3>

<p>There are several ways to push availability up:</p>

<ul>
  <li>get more reliable equipment (<em>i.e. increase MTBF</em>)</li>
  <li>add redundancy (<em>think failover/cluster/VSS/vPC/stack/VRRP etc</em>) - also sometimes called structural reliability</li>
  <li>decrease MTTR</li>
</ul>

<p>Now, there is some technological level after which it is prohibitively expensive to increase MTBF, plus there's a natural trade-off between features (=complexity) and reliability.</p>

<p>It is also hard to do redundancy: it increases complexity even further and introduces a separate set of distributed systems problems (<em>for example, firewall cluster and VSS state machines have a lot of moving parts</em>). Although some people may push it higher than that, most settle for something manageable, like running two units in parallel.</p>

<p>Seems like the only thing left to do is to try to decrease MTTR.</p>

<h3 id="side-note-what-network-equipment-mtbf-looks-like">Side note: what does network equipment MTBF look like?</h3>

<p>Enterprise-class ethernet switches (fixed) seem to have their MTBF converged around 250000 - 400000 hours (<em>at least based on datasheets referencing Telcordia parts-count methods</em>). Individual linecards for modular switches have about the same figures.</p>

<p>Fixed routers are in the same ballpark or higher. Servers are usually considered to have a lower MTBF, around 75000 hours, while appliances (like many firewalls, which are basically stripped-down servers) are expected to have around 100000-150000 hours of MTBF.</p>

<p>You should always refer to your vendor/manufacturer if you need exact datapoints for precise calculation.</p>

<h2 id="what-is-mttr">What is MTTR?</h2>

<p>In general, MTTR consists of several components:</p>

<ul>
  <li>Failure detection time</li>
  <li>Problem diagnosis time</li>
  <li>Repair time</li>
  <li>Time to test and confirm restoration of service</li>
</ul>

<blockquote>
  <p>Many times, the time to actually repair is the time to reboot the device (i.e. on the order of 5-10-20-30 minutes) or to remove a config line (on the order of 1-5 minutes). On the other end of the spectrum is replacing a whole half-rack-high modular switch (0.5-4 hours). Notice also that the time spread increases with complexity. A corollary: for lower MTTR, you might want to minimize complexity.</p>
</blockquote>

<p>In special circumstances, like remote sites, you might also add to the mix:</p>

<ul>
  <li>Engineering team delivery to the site to perform repairs (<em>for unmanned site</em>)</li>
  <li>Time to deliver spares to the site (<em>if sent separate from the repairs team</em>) - which is also the case if you don't have any spares at all</li>
</ul>

<h3 id="how-do-you-decrease-mttr">How do you decrease MTTR?</h3>

<p>Before we dive deeper into the whole spare part business, let's cover other ways to decrease MTTR first.</p>

<blockquote>
  <p>If you think about it, <em>redundancy</em> is actually a way to decrease MTTR taken to its absolute: the spare unit takes over automatically with minimum switchover delay feasible.</p>
</blockquote>

<p>First things first, depending on your economics and technology, you optimize MTTR down by decreasing detection time. You do it with all sorts of monitoring/telemetry, regular health check-ups and planned maintenance procedures. Same approach works for decreasing diagnostics time. You prepare and use checklists, configuration management procedures [i.e. you always know if any change was made prior to failure], automation. Last but not least - you invest in people by training them. You can also make your critical sites manned 24x7x365, i.e. hire more people.</p>

<p>Similarly, time to actually repair something depends again on procedures and people, but there are limits to that.</p>

<h4 id="other-replacement-options">Other replacement options</h4>

<p>At some point, you will need to replace failed equipment. You don't necessarily need spares for that:</p>

<ul>
  <li><em>Warranty</em> - many honest manufacturers will cover (although without any real SLA) their products for some reasonable time (or for the product's lifetime, i.e. until they declare its End of Life)</li>
  <li><em>Service contracts</em> - these include not only replacements but also some SLA attached to them - for example, shipping the replacement part on the next business day (mean delivery time will be at least 32 hours), the next day (24 hours), or in 4 hours (about 5 hours in practice)</li>
</ul>

<blockquote>
  <p>Time estimates here are rough and include some reasonable same-postcode-expedite-delivery. No vendor has a warehouse in every area, so add some time allowance for UPS / DHL / FedEx to reach you.</p>
</blockquote>

<p>As far as I know, a 4-hour shipping SLA is the top speed available from most vendors. Sometimes, if you have enough leverage, you can squeeze a little more (<em>down to 1 hour maybe</em>) from your local vendor's VAR.</p>

<p>Here we arrive at the final point: if you need to go further down the timescale, you have to have on-site spares.</p>

<h2 id="the-economic-effect-of-having-spares">The economic effect of having spares</h2>

<p>First of all, spares cost money directly, that is - you need to buy them (<em>and possibly cover them with appropriate contracts as well</em>).</p>

<p>Then you need to store them and spend some time regularly testing them (<em>a once-a-year [or more often for more critical systems] smoke test</em>). Moreover, from a financial point of view, spares are stale capital [not the exact term, sorry]: by buying something (<em>to sit in your warehouse</em>) you give up the ability to employ that capital otherwise. And that, in short, can make some of your financial KPIs look not as good.</p>

<p>On the other hand, spares relax service contract requirements. For example, instead of covering your whole fleet of 1000 access points with 24x7@4hrs contracts, with ten spares stored in a wiring closet you would only need 8x5@NBD contracts.</p>

<p>All in all, your mileage will vary, and this motivator is worth due consideration.</p>

<h2 id="spare-part-kit">Spare part kit</h2>

<p>Your spare part inventory consists of one or more spare part kits. Spare part kits are collections of spare parts which serve a particular site or a group of sites. As such, we can distinguish between:</p>

<ul>
  <li><em>local kits</em> - serve one site, stored on that site (zero delivery time)</li>
  <li><em>group kits</em> - serve a set of sites (usually grouped by geography). Vendor's warehouses that ship you a spare part based on service contract can be considered an example of that</li>
  <li><em>multilevel kits</em> - some combination of the above</li>
</ul>

<h3 id="replenishing-spare-part-kit">Replenishing spare part kit</h3>

<p>Another way to classify spare part kits is the way you top them up (i.e. how you drive your <em>spare part procurement process</em>). Basically, you can do it in any of these ways:</p>

<ul>
  <li><em>never</em> - you load up everything you will ever need and fly to the edge of the Solar system.</li>
  <li><em>regularly</em> - every year (month, quarter, other set interval) you buy a bucket of transceivers.</li>
  <li><em>waterline</em> - as soon as the number of spares goes down to some predefined level (waterline) below the base, you buy more to restore status quo.</li>
</ul>

<blockquote>
  <p>There are special cases and combinations of these, but they make sense only for some level of sophistication of supply and support organizations. For example, military organizations probably have very complex schemes with specific goals (given that most of the interesting math for reliability and spare part calculation was initially [and still is] developed for army's needs).</p>

  <p>I expect organizations such as Google and AWS, as well as network equipment vendors, who also happen to have a lot of data about IT systems reliability, to have developed their own complex spare kit configurations as well.</p>
</blockquote>

<p>Down the line, I'll be covering a generic case of a local kit which we replenish on a waterline basis.</p>

<h3 id="how-do-you-evaluate-your-spare-part-kit">How do you evaluate your spare part kit?</h3>

<blockquote>
  <p>Sorry for another interruption, but answering this question early will make later understanding easier.</p>
</blockquote>

<p>There are two important metrics of spare part kits:</p>

<ul>
  <li><em>spare kit efficiency</em> - basically, how many hot devices you are covering with each spare</li>
  <li><em>spare kit readiness</em> - this is a statistical measure of probability to find necessary spare in the kit at the moment of hot device failure</li>
</ul>

<h4 id="efficiency">Efficiency</h4>

<p>Efficiency can be calculated as $Q = 1 - S / N$, where $S$ is the number of spares and $N$ is the total number of devices protected by these spares. The higher, the better (<em>lower economic loss</em>).</p>

<h4 id="readiness">Readiness</h4>

<p>Readiness is a little more complex. For waterline replenishment discipline it goes in two steps:</p>

<ol>
  <li>calculate minimum spares needed as $S_{min} = \frac{N \times T_t }{MTBF}$ - <em>see notes in the next section</em></li>
  <li>insert the result into this slightly bigger formula:</li>
</ol>

\[R = 1 - \frac{S_{min}^{m + 2}}{(S - m + S_{min})\times((1 + S_{min})^{m + 1})}\]

<p>where $S$ is your spares base level and $m$ is your waterline level.</p>

<p>Obviously, you want your Spare Kit readiness as high as possible.</p>

<p>In practice, you would find your own balance between efficiency and readiness by solving some optimization problem specific to your needs.</p>

<h3 id="calculating-minimum-spares-required">Calculating minimum spares required</h3>

<p>This formula comes as a result of developments in mathematical modeling and queuing theory. By modeling the spare part kit as a queue and failures as independent Poisson events, it can be shown that for the general case:</p>

\[S_{min} = \frac{N \times T_t }{MTBF}\]

<p>$T_t$ here is the mean time it takes to replenish the kit, i.e. for the vendor to deliver on the contract (see above).</p>

<blockquote>
  <p>There's a side result from the same science. We can say with high confidence that this kit will be optimal for many typical cases as well; that is, it maximizes both efficiency and readiness at the same time. I'm not sure if this holds for all cases.</p>
</blockquote>

<h3 id="calculating-the-total-number-of-spares-required">Calculating the Total number of spares required</h3>

<p>How many spares would you need if you expect never to replenish the kit?</p>

\[S_T = T \times N / MTBF\]

<p>where $T$ is the total expected lifetime.</p>

<p>By taking $1/MTBF$ we effectively convert it to failure rate, which we then multiply by total system time budget (expressed in machine-hours).</p>

<p>The result of this calculation is also the maximum number of spares. Between the minimum and this maximum, you must choose a point that makes sense to you. A good way to start is to set some expectations about the kit's readiness and efficiency and crunch the numbers, trying to maximize both.</p>
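<p>Putting the formulas from this post together (<em>a sketch only: the fleet size, MTBF, and contract figures are illustrative, and $S_{min}$ is left fractional here - round it up to whole units in practice</em>):</p>

```python
def spares_min(n, replenish_h, mtbf_h):
    # S_min = N * T_t / MTBF  (waterline replenishment discipline)
    return n * replenish_h / mtbf_h

def spares_total(n, lifetime_h, mtbf_h):
    # S_T = T * N / MTBF  (kit that is never replenished)
    return n * lifetime_h / mtbf_h

def efficiency(s, n):
    # Q = 1 - S / N
    return 1 - s / n

def readiness(s, m, s_min):
    # R = 1 - S_min^(m+2) / ((S - m + S_min) * (1 + S_min)^(m+1))
    return 1 - s_min ** (m + 2) / ((s - m + s_min) * (1 + s_min) ** (m + 1))

n = 1000                                                  # access points in the fleet
s_min = spares_min(n, replenish_h=32, mtbf_h=250_000)     # NBD contract, datasheet MTBF
print(round(s_min, 3), round(spares_total(n, 5 * 8766, 250_000), 1))
print(round(efficiency(s=2, n=n), 3), round(readiness(s=2, m=1, s_min=s_min), 4))
```

<p>Crunching numbers like these for a range of $S$ values is exactly the efficiency-versus-readiness balancing act described above.</p>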

<h3 id="conclusion">Conclusion</h3>

<p>Maintaining a spares inventory is a good way to lower MTTR in a highly available system. The models I list here might not be perfectly refined, and they certainly don't take into account every situation possible, but they produce a fair estimate.</p>]]></content><author><name></name></author><category term="design," /><category term="high_availability" /><summary type="html"><![CDATA[In designing a network, there is a question that is often missing an answer or at best, answered using some rule-of-thumb. How many spare units you should include in your BOM? Actually, do you need them at all?]]></summary></entry><entry><title type="html">How many BGP routers can a big AS have?</title><link href="https://askbow.com/2017/10/21/many-bgp-routers" rel="alternate" type="text/html" title="How many BGP routers can a big AS have?" /><published>2017-10-21T12:52:00+00:00</published><updated>2017-10-21T12:52:00+00:00</updated><id>https://askbow.com/2017/10/21/many-bgp-routers</id><content type="html" xml:base="https://askbow.com/2017/10/21/many-bgp-routers"><![CDATA[<p>For iBGP number of peers (<em>i.e. the number of BGP routers inside an AS</em>), the only significant limiting factor is that iBGP peers must be fully meshed (N.B.: not directly interconnected! An iBGP peering can span all the hops you can fit into the IP TTL field) - because it is the only way for iBGP to prevent loops.</p>

<p>The impact of each BGP peer is an open TCP connection, some memory, occasionally some processing, and then some administrative burden.</p>

<blockquote>
  <p>How many connections?</p>

\[\frac{n \times (n-1)}{2}\]

  <p><em>- That's quadratic complexity:</em></p>
</blockquote>

<pre><code class="language-mermaid">
flowchart LR

A &lt;-...-&gt; B &amp; C &amp; D &amp; E &amp; F
B &lt;-...-&gt; C &amp; D &amp; E &amp; F
C &lt;-...-&gt; D &amp; E &amp; F
D &lt;-...-&gt; E &amp; F
E &lt;-...-&gt; F

</code></pre>
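<p>The quadratic growth is easy to feel with a one-liner (<em>counting unordered router pairs</em>):</p>

```python
def ibgp_sessions(n):
    # full mesh: one iBGP session per unordered pair of routers
    return n * (n - 1) // 2

print(ibgp_sessions(6), ibgp_sessions(100))  # 15 and 4950
```

<p>Six routers need a manageable 15 sessions; a hundred routers already need 4950.</p>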

<p>To overcome iBGP scalability problems, two approaches were developed:</p>

<ul>
  <li>Confederations</li>
  <li>BGP Route Reflectors</li>
</ul>

<h3 id="bgp-confederations">BGP Confederations</h3>

<p><em>Confederations</em> basically mean splitting your AS into several sub-ASes. A confederated AS looks like a single entity to its eBGP peers, even though each individual router might belong to a different sub-AS.</p>

<pre><code class="language-mermaid">
flowchart LR


subgraph "AS1"
  C &lt;-...-&gt; D
end

subgraph "AS2"
  A &lt;-...-&gt; B &amp; C
  B &lt;-...-&gt; C 
end

subgraph "AS3"
  D &lt;-...-&gt; E &amp; F
  E &lt;-...-&gt; F
end

</code></pre>

<p>Routers prevent loops inside a confederation by using special CONFED segments in the AS_PATH. Just like regular AS_PATH segments, the CONFED counterparts come in two types: AS_CONFED_SET and AS_CONFED_SEQUENCE.</p>

<p>Importantly, we must still fully mesh the BGP routers inside each sub-AS. Basically, a sub-AS is just an AS in its own right.</p>

<p>Downside: loss of detailed routing information when we cross a sub-AS boundary.</p>

<p>Where is it logical to put confederations in production? My guess would be large enterprises. It is quite normal for one company to own several AS numbers - that usually happens as a result of corporate mergers and acquisitions. At the same time, such a company might want to present itself as a single entity to any outside network.</p>

<h3 id="bgp-route-reflectors">BGP Route reflectors</h3>

<p><em>Route reflectors</em> allow you to build a hierarchy of routers. A route-reflector client doesn't know that it works with a route reflector - for the client it's a normal iBGP peering. Thus the client's algorithm is the same as in a fully meshed iBGP system.</p>

<pre><code class="language-mermaid">
flowchart LR

subgraph "RR-A"
  A &lt;-...-&gt; A_1
  A &lt;-...-&gt; A_2
  A &lt;-...-&gt; A_3
end

subgraph "RR-B"

  B &lt;-...-&gt; B_1
  B &lt;-...-&gt; B_2
  B &lt;-...-&gt; B_3
end

subgraph "RR-C"

  C &lt;-...-&gt; C_1
  C &lt;-...-&gt; C_2
  C &lt;-...-&gt; C_3
end

A(A_RR) &lt;-...-&gt; B
B(B_RR) &lt;-...-&gt; C
C(C_RR) &lt;-...-&gt; A

</code></pre>

<p>A route reflector (RR) acts a little differently, though, because its clients are not fully meshed. So an RR (<em>almost</em>) always relays updates to its clients, even those received from another client.</p>

<p>In order to prevent loops, route reflectors use the ORIGINATOR_ID and CLUSTER_LIST attributes.</p>

<p>Notice that we must fully mesh route reflectors between each other. And for redundancy, we should install at least two.</p>

<p>Moreover, we sometimes place route reflectors outside of the traffic paths. That way, we can use a cheaper router (it still has to receive all the routes we have). This is possible thanks to BGP's third-party next-hop feature.</p>

<p>Downside: loss of detailed routing information, because RR will only send the best routes to its clients. Hence, possible suboptimal routing.</p>

<blockquote>
  <p>Interestingly, BGP RRs are the basic idea behind some SDN implementations. Basically, the RR (SDN control server) is filling client's routing tables via BGP.</p>
</blockquote>

<h3 id="finally">Finally,</h3>

<p>Both schemes allow ASes to grow to hundreds of routers and more, and the two schemes can be used in parallel if desired. The Route reflectors method is perhaps the most widely deployed. The reason is it is easier to design, setup, and support. Also, it allows to build a multi-tier routing hierarchy (<em>core-aggregation-edge, for example</em>) with minimal effort both initially and during scaling.</p>]]></content><author><name></name></author><category term="routing," /><category term="basics," /><category term="bgp," /><category term="design" /><summary type="html"><![CDATA[For iBGP number of peers (i.e. the number of BGP routers inside an AS), the only significant limiting factor is that iBGP peers must be fully meshed (N.B.: not directly interconnected! An iBGP peering can span all the hops you can fit into the IP TTL field) - because it is the only way for iBGP to prevent loops.]]></summary></entry><entry><title type="html">Why Hulc LED process consumes so much CPU on 2960 platforms?</title><link href="https://askbow.com/2017/10/17/hulc-led-process-consumes-much-cpu-2960-platforms" rel="alternate" type="text/html" title="Why Hulc LED process consumes so much CPU on 2960 platforms?" /><published>2017-10-17T13:51:00+00:00</published><updated>2017-10-17T13:51:00+00:00</updated><id>https://askbow.com/2017/10/17/hulc-led-process-consumes-much-cpu-2960-platforms</id><content type="html" xml:base="https://askbow.com/2017/10/17/hulc-led-process-consumes-much-cpu-2960-platforms"><![CDATA[<p>In this post I'll try to make an educated guess about what happens with Hulc LED process and why it appears to consume 20-30% CPU on Cisco 2960(S/X/XR/RX) switches.</p>

<p>(<em>N.B.: the issue appears to be present on Cisco 3750 / 3560 platforms as well</em>)</p>

<h1 id="symptoms">Symptoms</h1>

<p>If you monitor your switch via SNMP, you may quickly notice a constantly elevated CPU at about 20-40% total. To investigate further, you get the relevant command output from the device: <code class="language-plaintext highlighter-rouge">show process cpu sort</code></p>

<p>And the result looks something like this:</p>

<p><a href="https://askbow.com/wp-content/uploads/2017/10/sh-proc-cpu-hulk-led.png" title="sh proc cpu"><img src="https://askbow.com/wp-content/uploads/2017/10/sh-proc-cpu-hulk-led-e1508217172316.png" alt="" /></a></p>

<p>or what we could call <strong>Angry Hulc</strong> process ;-)</p>

<h1 id="what-is-cisco-hulc">What is Cisco HULC?</h1>

<p>According to Cisco document 64641 (<em>it's public somewhere on cisco.com</em>):</p>

<blockquote>
  <p>The Mirage is based on HULC hardware architecture and the Sasquatch switching ASIC chipset from DSBU. HULC is a hardware architecture that you use to build low cost, stackable, 10/100/1000 Ethernet switches.
 …
<strong>Lord Of The Rings (LOTR)</strong> - The first release of the HULC based platforms and software based on the Sasquatch ASIC chip set from DSBU. The Mirage is based on LOTR platforms from DSBU.</p>
</blockquote>

<ul>
  <li><em>N.B.: DSBU - Data Switching Business Unit, a part of Cisco</em></li>
  <li>Also, notice the LOTR reference; for contrast, the 4500 platform was built by Star Wars geeks</li>
</ul>

<p>A quick look around shows that a lot of stuff is based on that machinery, starting with 3750 / 3560 and 2960 series, but later included SMB series like Express 500 switches, industrial Ethernet series. I'd also assume that some spoils of that development went into 3850, 4500, 6800IA and later systems.</p>

<blockquote>
  <p><em>This gave me pause, because, judging by the docs and CiscoLive sessions, the stacks of 3750 and 2960 series differ quite a lot.</em></p>
</blockquote>

<p>It is reasonable to assume that the whole multitude of hulc and <code class="language-plaintext highlighter-rouge">/H.*/</code> (<em>HLPIP, HRPC, HL3U, etc.</em>) processes in IOS are talking to hardware in this architecture. For example, from <a href="https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst2960/software/release/12-2_55_se/configuration/guide/scg_2960/swtrbl.html">docs</a>:</p>

<blockquote>
  <ul>
    <li>Hulc Forwarding TCAM Manager (HFTM)</li>
    <li>Hulc Quality of Service (QoS)/access control list (ACL) TCAM Manager (HQATM)</li>
  </ul>
</blockquote>

<p>We can also infer that the commands under <code class="language-plaintext highlighter-rouge">show platform stack</code> talk directly to HULC.</p>

<h2 id="ok-whats-hulc-led-process-anyway">Ok, what's Hulc LED process anyway?</h2>

<p>According to the descriptions of the two most relevant Cisco bugs (CSCtg86211 and CSCtn42790), Hulc LED process is the thing that monitors port states, including PoE and transceiver (<em>think SFP</em>) status, and sets the LED indicators accordingly. It also communicates with the MODE button, and resets the switch to factory defaults if you hold it for too long:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%SYS-7-NV_BLOCK_INIT: Initialized the geometry of nvram
%SYS-5-RELOAD: Reload requested by Hulc LED Process. Reload Reason: Reason unspecified.
</code></pre></div></div>

<p>(<em>see Cisco FN - 63722 and bug CSCuj69384 for insight into why this is important</em>)</p>

<p>Sadly, the description in these bugs doesn't go beyond <em>this is an expected behavior</em>.</p>

<blockquote>
  <p><em><strong>Disclaimer</strong>: This isn't publicly documented. What follows are my general thoughts and speculations on what might be happening inside. I do not work for Cisco at the time of this writing and cannot have access to that kind of detail, so don't rely on my words too much. I'm probably wrong.</em></p>
</blockquote>

<h2 id="how-come-it-consumes-so-much-cpu">How come it consumes so much CPU?</h2>

<p>Here comes my theory. This might be completely bogus and is definitely based on a lot of assumptions, some of which are made up. But I tried to keep it realistic.</p>

<p><strong>Hulc LED process does not consume your CPU. It's an illusion created by the process's waiting state.</strong></p>

<p>Taking into account that much of today's IOS has Linux blood in its veins, that's not hard to imagine. In that case, the command <code class="language-plaintext highlighter-rouge">show process cpu</code> and its derivatives don't show us actual CPU usage, but something closer in spirit to the Linux load average: a decaying average over three time windows, which for Cisco IOS are 5 seconds, 1 minute, and 5 minutes.</p>

<p>If we <a href="https://github.com/torvalds/linux/blob/master/kernel/sched/loadavg.c">look closely</a> at how data points for calculating load average are collected,</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">calc_load_fold_active</span><span class="p">(</span><span class="k">struct</span> <span class="n">rq</span> <span class="o">*</span><span class="n">this_rq</span><span class="p">,</span> <span class="kt">long</span> <span class="n">adjust</span><span class="p">)</span>
<span class="p">{</span>
	<span class="kt">long</span> <span class="n">nr_active</span><span class="p">,</span> <span class="n">delta</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

	<span class="n">nr_active</span> <span class="o">=</span> <span class="n">this_rq</span><span class="o">-&gt;</span><span class="n">nr_running</span> <span class="o">-</span> <span class="n">adjust</span><span class="p">;</span>
	<span class="n">nr_active</span> <span class="o">+=</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">this_rq</span><span class="o">-&gt;</span><span class="n">nr_uninterruptible</span><span class="p">;</span>

	<span class="k">if</span> <span class="p">(</span><span class="n">nr_active</span> <span class="o">!=</span> <span class="n">this_rq</span><span class="o">-&gt;</span><span class="n">calc_load_active</span><span class="p">)</span> <span class="p">{</span>
		<span class="n">delta</span> <span class="o">=</span> <span class="n">nr_active</span> <span class="o">-</span> <span class="n">this_rq</span><span class="o">-&gt;</span><span class="n">calc_load_active</span><span class="p">;</span>
		<span class="n">this_rq</span><span class="o">-&gt;</span><span class="n">calc_load_active</span> <span class="o">=</span> <span class="n">nr_active</span><span class="p">;</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="n">delta</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>we can see that it takes load measurements for processes not only in the running state, but also in the <em>uninterruptible</em> wait state.</p>

<blockquote>
  <p><em>I would like to thank <a href="http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html">Brendan Gregg and his post</a> for giving me this idea. It is a great writeup about the problem of load averages in Linux.</em></p>
</blockquote>

<p>Let's consider what the Hulc LED process is doing.</p>

<p>It is communicating with peripheral (<em>from the CPU's point of view</em>) devices: the ports are [probably] connected to the CPU complex via some serial bus. Hulc LED process polls every port: it reads the port status register and sets the command register to blink the LED.</p>

<blockquote>
  <p>Or something similar. It's a logical assumption, given that the more ports the switch has, the more load Hulc LED exhibits.</p>
</blockquote>

<p>This is your basic IO operation. IO operations are slow and need to be completed atomically (<em>i.e. without interruption</em>) to prevent corruption / inconsistency. Hence, it's reasonable to assume that during this operation the process is put into the TASK_UNINTERRUPTIBLE state.</p>

<blockquote>
  <p>Refer to <a href="https://access.redhat.com/sites/default/files/attachments/processstates_20120831.pdf">Understanding Linux Process States by Yogesh Babar, RedHat</a> for details on what different states mean.</p>
</blockquote>

<h2 id="the-led-is-not-that-bright">The LED is not that bright</h2>

<p>Despite what I've said so far, Hulc LED process can still potentially consume too much CPU. That might be a symptom of faulty SFP modules, link flapping, or some very specific hardware problems in the switch. One can imagine several kinds of problems that would result in longer delays on the system buses.</p>

<p>This would result in [apparently] very high (<em>beyond 30% of CPU</em>) consumption by Hulc LED process.</p>

<blockquote>
  <p>The 30% figure here is rule-of-thumb arbitrary (<em>it is also given in Table 3 <a href="https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst3750/software/troubleshooting/cpu_util.html">here</a>, so my guess is in the right ballpark</em>). Although other Cisco documentation (<em>see BRKCRS-3141 2011 for example</em>) states normal levels for some devices, you should always consider your own environment and do a baseline analysis.</p>
</blockquote>

<p>Moreover, there is also bug CSCvd68472, which can make Hulc LED process, together with the Hulc running config process, consume up to 100% CPU.</p>

<h2 id="conclusion">Conclusion</h2>

<p>To summarize, most of the time the Hulc LED process on Cisco 3750/2960 platforms does not actually consume 20-30% of the CPU; rather, it is mostly waiting for its IO syscalls to finish for all that time. The system displays that as usage because of the specifics of the algorithm used to calculate the load.</p>

<blockquote>
  <p>I can counter my own argument: this might actually be a quirk of run-to-completion FIFO discipline used by Cisco IOS scheduler and have nothing to do with Linux.</p>
</blockquote>

<p>Also note that Cisco fixed this behavior in a 15-something IOS branch for many hardware platforms, so your mileage may vary.</p>]]></content><author><name></name></author><category term="switching" /><summary type="html"><![CDATA[In this post I'll try to make an educated guess about what happens with Hulc LED process and why it appears to consume 20-30% CPU on Cisco 2960(S/X/XR/RX) switches.]]></summary></entry><entry><title type="html">What happens if you start a Cisco 6500 switch without the fan module?</title><link href="https://askbow.com/2017/05/30/happens-start-cisco-6500-switch-without-fan-module" rel="alternate" type="text/html" title="What happens if you start a Cisco 6500 switch without the fan module?" /><published>2017-05-30T11:44:00+00:00</published><updated>2017-05-30T11:44:00+00:00</updated><id>https://askbow.com/2017/05/30/happens-start-cisco-6500-switch-without-fan-module</id><content type="html" xml:base="https://askbow.com/2017/05/30/happens-start-cisco-6500-switch-without-fan-module"><![CDATA[<p>Recently, I tested a Cisco 6500 switch in a fan-less configuration, to see how long it can go.</p>

<blockquote>
  <p><strong>DISCLAIMER</strong>: <strong>DO NOT TRY THIS</strong>. This is a stupid idea and it will void the warranty / would be a perfectly valid reason for Cisco to decline an RMA (<em>in my opinion at least</em>). Running a switch without fans will directly damage active components (<em>ASICs, TCAM, CPU, etc.</em>) and increase the wear-out of passive ones (<em>capacitors, etc.</em>). I did it in the lab so you don't have to.</p>
</blockquote>

<p>Initially, I would've guessed that you can run the system w/o the fan module for about five minutes. That should give enough time to <a href="https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst6500/hardware/Chassis_Installation/Cat6500/6500_ins/04remrep.html#49856">replace</a> (swap, or clean) the fan module if needed.</p>

<h3 id="fanless-test-results">Fanless test results</h3>

<p>That's not exactly the case:</p>

<ol>
  <li>right away, as the CPUs heated up, the system became slow to respond on the direct console connection</li>
  <li>within a minute, I got these logs:</li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*May 30 07:24:53.619: %C6KENV-SW2_SP-4-INSUFFCOOL: Module 1 cannot be adequately cooled
*May 30 07:24:53.707: %C6KENV-SW2_SP-4-INSUFFCOOL: Module 3 cannot be adequately cooled
*May 30 07:24:55.539: %C6KENV-SW2_SP-4-FANCOUNTFAILED: Required number of fan trays is not present
*May 30 07:25:20.215: %C6KENV-SW2_SP-4-MINORTEMPALARM: switch 2 RP 5/0 inlet temperature crossed threshold #1(=50C. It has exceeded normal operating temperature range.
*May 30 07:27:07.183: %C6KENV-SW2_SP-4-MINORTEMPALARM: switch 2 module 5 asic-1 temperature crossed threshold #1(=. It has exceeded normal operating temperature range
*May 30 07:27:26.387: %C6KENV-SW2_SP-4-MINORTEMPALARM: switch 2 EARL 5/0 outlet temperature crossed threshold #1(=. It has exceeded normal operating temperature range
</code></pre></div></div>

<p>In short, it took a minute w/o fans for the test system to start overheating.</p>

<p>The EnvMon (<em>as far as I understand</em>) would shut the system down if it went too far above the temperature range red line. But I didn't go that far, because this wasn't the purpose of my test.</p>

<h3 id="why-we-need-the-fans">Why we need the fans</h3>

<p>A supervisor in a Cisco 6500 switch may consume around 250-500 W from the power supply. Most of that energy is ultimately burned away, turning into heat <em>(thermal energy)</em>.</p>

<p>The heat is dissipated from the chip surface into the environment. The bigger the surface, the more heat a chip can dissipate. The heatsinks glued on top of the chips increase the dissipating surface.</p>

<p>The form factor of the chassis cards dictates the size of the heatsinks installed on the chips: they are necessarily small to fit within the slot height.</p>

<p>Hence, to provide enough cooling (i.e. to take enough heat away from the chip and its heatsink) we need to force air through. To that end, we use fans (<em>I won't go into liquid cooling here, but the basic principle is the same</em>).</p>

<p>By removing the fan, I let the heat stay in the chip. I wasn't able to find a datasheet for the SR71000 processor (<em>owned by Broadcom, who are too shy to publish anything</em>) used as the SP and RP in Sup720, but as a reference, <a href="https://www-ssl.intel.com/content/www/us/en/processors/core/2nd-gen-core-lga1155-socket-guide.html">Intel CPUs</a> (<em>I'm specifically choosing to look at an older generation of Intel CPUs here, as I hope the SR71000's tech is about the same age</em>) are tested up to about 70 degrees Celsius. Given the warning at 50 degrees I got in the logs, that seems to be a reasonable estimate.</p>

<h3 id="test-environment">Test environment</h3>

<ol>
  <li>WS-C6506-E with Sup720</li>
  <li>Two Gigabit Ethernet linecards inserted so as to have at least one free slot between any two cards</li>
  <li>linecard slot blanks removed</li>
  <li>it was an isolated lab system without any traffic to load it</li>
  <li>AC in the room providing a steady 23 degrees Celsius</li>
</ol>]]></content><author><name></name></author><category term="switching" /><summary type="html"><![CDATA[Recently, I tested a Cisco 6500 switch in a fan-less configuration, to see how long it can go.]]></summary></entry></feed>