Here comes a really tough design question from one of my co-workers: what is a good estimate of the maximum number of hosts per Ethernet domain (i.e. VLAN)? Well, when I was first studying networking in college, we generally accepted that about 250 hosts should be the limit and enough for everybody. That was years ago, so let’s look at how things stand now.
What exactly is the limit and why would we expect one to be there? Right away I can think of two technical factors contributing to this limit:
- Memory limit.
- Bandwidth-hungry broadcasts and unknown unicasts.
Another consideration is that an Ethernet L2 domain is a single failure domain – what happens at one end of the wire is propagated everywhere else. Because that’s how Ethernet works.
Greg Ferro, in his post on the subject, actually considers these limits and says that they are not really limiting us today, because we have lots of both processing power and memory. His advice is to limit the failure domain and keep it under about 1,000 hosts.
But for now, let’s entertain the other two factors. Because I’m curious=)
OK, so how much memory is it?
The memory limit is perhaps the simplest to consider: each host consumes an entry in the ARP cache of every other host (or else generates ARP broadcast requests indefinitely, see factor two), and our switches have a limit on MAC address-table size.
A basic ARP table entry stores this information:
- 32-bit IP address
- 48-bit MAC address
- Local interface to spit traffic out of and [a pointer to] its MAC address (the MAC rewrite information in Cisco Routers)
- Record’s age
- Record’s type (Ethernet)
- Record’s protocol (IP)
I don’t really know how much memory is consumed by points 3-6, but my guesstimate would be a total of 192 bits (round in binary), i.e. 24 bytes per record. Not much, considering gigabytes of memory some people have in their
Each record might be stored for hours on a router and for about 15 minutes on a [Windows-based] host.
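To put the guesstimate above into numbers, here is a minimal sketch. It assumes the 24-byte record size guessed above and that every host caches an entry for every other host in the segment; real ARP implementations almost certainly spend more memory per entry, and the function name is purely illustrative.

```python
# Rough ARP cache footprint, using the post's guesstimate of 24 bytes per record.
# Assumption: every host ends up holding one entry for each other host.
ARP_ENTRY_BYTES = 24  # 192 bits, a guess rather than a measured value


def arp_cache_bytes(hosts: int, entry_bytes: int = ARP_ENTRY_BYTES) -> int:
    """Bytes one host needs to cache an ARP entry for every other host."""
    if hosts < 1:
        raise ValueError("a segment needs at least one host")
    return (hosts - 1) * entry_bytes


if __name__ == "__main__":
    for hosts in (250, 1_000, 10_000, 1_000_000):
        kib = arp_cache_bytes(hosts) / 1024
        print(f"{hosts:>9} hosts -> {kib:,.1f} KiB per ARP cache")
```

Even if the real per-entry cost were ten times the guess, the totals would stay small next to the memory of a modern host.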
Considering a Windows host, a very old (year 2002) Microsoft KB99150 states that:
ARP cache size is controlled by the “arptblsize” parameter in the [tcp_xif] section of the PROTOCOL.INI file. The default is (tcpconnections*2)+6. The range is from 6 to 512.
So the limit was about 500 hosts tops at the time (for some systems). But today it is 2015, so I need to find some more recent sources.
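The formula in the KB quote above is easy to play with. The sketch below assumes that the computed default is clamped to the stated 6 to 512 range; the KB text only gives the formula and the range, so the clamping and the function name are my assumptions.

```python
# Default ARP cache size per the quoted KB text: (tcpconnections * 2) + 6.
# Assumption: the result is clamped to the documented range of 6..512.
ARPTBLSIZE_MIN = 6
ARPTBLSIZE_MAX = 512


def default_arptblsize(tcpconnections: int) -> int:
    """Default 'arptblsize' for a given 'tcpconnections' setting."""
    if tcpconnections < 0:
        raise ValueError("tcpconnections cannot be negative")
    raw = tcpconnections * 2 + 6
    return max(ARPTBLSIZE_MIN, min(ARPTBLSIZE_MAX, raw))


if __name__ == "__main__":
    for conns in (0, 10, 100, 253, 1000):
        print(f"tcpconnections={conns:>4} -> arptblsize={default_arptblsize(conns)}")
```

Under that assumption the default reaches the 512 ceiling at 253 connections, which is where the "about 500 hosts" figure comes from.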
The IOS 15M&T configuration guide has plenty of information to offer to my little investigation in its chapter about ARP.
By the way, this document explicitly states that a large number of ARP entries can be used as an attack vector to mount a DoS against a router. Also, “The maximum limit for the number of learned ARP entries is platform dependent.”
The example in this document is 512 kilorecords, so I think it is safe to consider it a good median estimation.
So far, it doesn’t seem to be much of a problem, I mean – how often would you have hundreds of thousands of hosts on a single segment? An L3-switch will have a more constrained resource allocation policy than a typical CPU-based router. Well, Cat6500 Sup2T has an adjacency table size (and this, I suppose, correlates with ARP table size) of 1000 kilorecords (that is, 1 million). Really, it is hard to consider this a limit for any practical purposes.
At the same time, the CAM table of the Sup2T is 128 K entries in size. So, if the ARP cache is not the memory-limiting factor, the MAC-address table size of the switches we use would be.
Actually, a quick look through the datasheets of several other switches clearly shows that in enterprise networks it is possible to have at least 10 thousand hosts in a segment without burning through the whole MAC-address table on a switch. It’s far from a million, alright, but it is far more than 250 hosts as well.
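A quick way to compare the figures quoted so far is to see which table fills up first. This is only a sketch: it assumes one MAC address and one ARP entry per host, reads "128 K" as 128 × 1024, and uses made-up dictionary keys and function names.

```python
# Table capacities quoted in the post; one entry per host is assumed.
TABLE_CAPACITY = {
    "Sup2T CAM (MAC) table": 128 * 1024,  # "128 K", read as binary K (assumption)
    "Sup2T adjacency table": 1_000_000,   # "1000 kilorecords"
    "IOS guide ARP example": 512_000,     # "512 kilorecords"
}


def utilization(hosts: int, capacity: int) -> float:
    """Fraction of a table consumed by the given number of hosts."""
    return hosts / capacity


def tightest_table(capacities: dict[str, int]) -> str:
    """Name of the table with the smallest capacity, i.e. the first to fill."""
    return min(capacities, key=capacities.get)


if __name__ == "__main__":
    hosts = 10_000
    for name, capacity in TABLE_CAPACITY.items():
        print(f"{name}: {utilization(hosts, capacity):.1%} used by {hosts} hosts")
    print("First to fill:", tightest_table(TABLE_CAPACITY))
```

With these numbers the CAM table is the tightest resource, yet 10 thousand hosts still occupy well under a tenth of it.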
So, to sum it up, memory exhaustion isn’t much of a limit in practical terms. Of course, we can easily exhaust any limit just by spoofing addresses. There are even tools out there to do this in a GUI.
Aside from an attack, MAC flapping would be another way to kill a network. We had a support case recently, and part of the problem was that the need to constantly rewrite the ARP and MAC (CAM) tables (control plane) exhausted the switches’ CPUs in a matter of seconds, rendering the management plane unresponsive.
There were two causes for this: the L2 domain was too big, and then a loop happened. No, STP doesn’t work if there is something in the loop doing a good job of filtering BPDUs. Back to the host count: the network in that support case had only about 300-400 hosts, all put in one segment for application-level reasons.
This example is actually a bandwidth problem, as we drain CPU resources, not memory volume.
I will try to estimate bandwidth in the next post on the subject.