My previous blog post on this subject concluded that I should spend more time looking at backbone traffic patterns in order to recognise when something is “not quite right”.
I have been doing so, but it didn’t really help me with what happened next...
I arrived back in my office at 4:15 pm to see a couple of my colleagues hovering around a monitor, saying things like “latency all over the place”, “ping spikes” and “dropped packets”.
A sense of dread started to threaten my calm.
Sure enough, we were seeing the same sort of behaviour that had occurred the fortnight before - traffic latency, ping spikes and lost packets. Looking closely at the data, we could see that the patterns within each symptom were different, but the end result was the same - poor network performance, documents refusing to save on the first try (though subsequent saves were fine), mail from the IMAP server slow to open, dropped connections to servers, and so on.
Since the wireless access point that caused the last episode was in several pieces in a disposal bin in stores, we knew we had another problem....
To understand the problem, you will need to know about the network setup in our area:
We have four discontiguous “class C” (/24) networks running over the same wires, spread over a main building and a satellite site a few miles away. IPs are allocated on a first-come, first-served basis. These are routable IPs, reachable (in theory) from any other Internet-connected computer. We now also have four ranges in 172.20.0.0/16 which “shadow” the third octet of our “class C” addresses (e.g. 1.2.3.4 and 172.20.3.4). These are IANA private network space addresses that are not routable outside the organisation.
Many groups within the network supply their own infrastructure (switches, cabinets, etc).
Many groups have their own private networks hidden behind NAT’ed gateways.
We have HPC clusters in the building and we have world facing servers supplying standard and non-standard services to other Institutions around the world.
We have a large mobile contingent who need to access local resources, and many collaborations that need controlled access to some services that can’t be world-facing.
We have a wireless network, and a DHCP server for known clients.
As we are part of a larger organisation, we need to allow that organisation to present their services to our users, and our users to present their own services to the larger organisation. We also need to allow the network security team from the centre access to all our networks for security scans etc.
We do not control the border router for these networks.
So, how do you control access to something like this?
We use a firewalling bridge machine. Every packet that comes into or goes out of the network passes through the “firebridge” - but it gets worse: even our local traffic (i.e. cross-subnet traffic) must go in and out of the firebridge, because we don’t control the border router.
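To make that concrete, here is a rough sketch of what the ruleset on such a bridge looks like. It is illustrative only: pf-style syntax, with documentation placeholders rather than our real prefixes.

    # Illustrative sketch only - pf-style syntax, placeholder prefixes.
    # Our four routable /24s plus the private 172.20.0.0/16 shadow space.
    table <ournets> { 192.0.2.0/24, 198.51.100.0/24, 203.0.113.0/24, 172.20.0.0/16 }

    set skip on lo
    block log all                           # default deny, both directions

    # Even traffic between our own subnets crosses the bridge, because the
    # border router (not under our control) does the inter-subnet routing.
    pass quick from <ournets> to <ournets>

    # ...followed by per-service rules for the world-facing servers,
    # collaborators, the central security scanners, and so on.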
Our users are not the most security-minded of individuals, and any new collaborative service inevitably results in a request for an access rule for the collaborators at the other institutions. A request for specific IP addresses will always result in “just let them all through - they could be on any computer at the Institution”.
So the “firebridge” is more about making the users feel better than a real effort at security.
Nevertheless, it is an important machine in the current overall scheme of things for our network.
And it wasn’t working properly.
When you trace a problem like this to a particularly busy machine, and the hardware checks out OK, the problem is usually resource starvation. The quick test is to restart the service (or reboot the machine). If it comes back in perfect working order, it is probably a load-related issue, but over time, as the load grows, it will start exhibiting problems again. You can then fix the resource problem before that happens.
If it doesn’t come back in perfect order, it is usually best to assume a replacement machine is required.
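For a firewall box, the state table and its related caches are the usual suspects for that sort of starvation. Assuming a pf-based firewall (purely for illustration), two quick checks show how much headroom is left before it becomes the bottleneck:

    pfctl -si | grep -i entries    # "current entries" = states in use right now
    pfctl -sm                      # the configured hard limits, including "states"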
Our firebridge came back exhibiting the same problems as before.
No problem, we have a machine prepared for just such an emergency.
We dug it out of storage, checked it, loaded our latest firewall rules, and deployed it.
Within seconds we had the same network problems as before!
OK, regroup, re-think, coffee.
If the firebridge was the culprit, the only way to prove it was to take it out of the loop.
Scary thought for all those machines on the inside.
(I read somewhere that it takes an average of seven minutes for a new Windows machine to be compromised when exposed to the Internet.)
We took it out of the loop.
All network problems disappeared instantly.
We plugged it back in. The network problems re-appeared almost instantly.
So we hit the books, got some low-level diagnostics on the firebridge (packet level) and watched what was happening in real time.
A rule (added two months or so ago) to allow the recently introduced 172.20.xxx.xxx range to enter the network and cross the router had been written as a blanket 172.20.0.0/16 with a “keep state” argument.
This turned out to be a big mistake - but one that only showed up when that network range actually started to be deployed, which has happened just in the last few weeks.
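In pf-style syntax (again for illustration, not a verbatim copy of our rule), it amounted to something like this:

    # One blanket, stateful rule for the whole private /16:
    pass in quick from 172.20.0.0/16 to any keep state

Stateful rules are usually the right choice, but each one commits the bridge to tracking every matching connection, so the state table has to be sized for the whole range in active use - not for the handful of hosts that were there when the rule went in.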
We revised some of our firebridge rules, increased the size of the state table and various other caches on the firebridge, and rebooted.
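The tuning knobs involved look roughly like this (pf-style again, and the numbers are illustrative rather than the values we actually settled on):

    set limit states    100000    # raise the state-table ceiling
    set limit frags      50000    # fragment reassembly cache
    set limit src-nodes  50000    # per-source tracking nodes

After the reboot, the same pfctl -si check shows how much of the new headroom is actually in use.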
Watching closely, we saw the entries in the firebridge state table rise to just over the previous limit (and this with the revised rules!), and then slowly creep upwards.
The network problems did not reappear. We sent 30,000 packets all around the network, and to the outside world. We lost none.
Twenty-four hours on, and no return of the problems.
While the real test will be on Monday around 10am, when things really start to hum in this network, I anticipate (fingers crossed!) that the problems will not reappear.
So what have we learned, grasshopper?

That changes in configuration need to be thoroughly assessed. We were used to putting rules dealing with large blocks of IP addresses through our firebridge, because we knew very few of them would be active at any one time. Not so with our own Institution’s large block of new IPs.

That network problems with similar symptoms do not necessarily have the same cause. This looked like more of the previous problem, but we knew the offending hardware was decommissioned, and besides, the network traffic patterns were different this time.
Hope you enjoyed this. Leave a comment if you have any questions.