Saturday, 20 March 2010

Networks are wonderful, but.... (Take 2)


My previous blog post on this subject concluded that I should spend more time looking at backbone traffic patterns in order to recognise when something is “not quite right”.

I have been doing so, but it didn’t really help with what happened next.....

I arrived back in my office at 4:15 pm to see a couple of my colleagues hovering around a monitor, saying things like “latency all over the place”, “ping spikes” and “dropped packets”.

A sense of dread started to threaten my calm.

Sure enough, we were seeing the same sort of behaviour that had occurred the fortnight before - traffic latency, ping spikes and lost packets. Looking closely at the data, we could see that the patterns within each symptom were different, but the end result was the same - poor network performance, a refusal to save documents on the first try (though subsequent saves were fine), mail from the IMAP server slow to open, dropped connections to servers, and so on.
Knowing that the wireless access point that caused the last episode was in several pieces in a disposal bin in stores, we knew we had another problem....

To understand the problem, you will need to know about the network setup in our area:

We have four discontiguous “class C” or /24 networks running over the same wires, spread over a main building and a satellite site a few miles away. IPs are allocated on a first come, first served basis. These are publicly routable IPs, reachable (in theory) from any other Internet-connected computer. We now have four IP ranges within 172.20.0.0/16 which “shadow” the third octet of our “class C” addresses (e.g. 1.2.3.4 and 172.20.3.4). These are IANA private network space IPs that are not routable outside the organisation.
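For the curious, the “shadow” scheme boils down to keeping the last two octets and swapping the prefix. A quick sketch of the mapping (purely illustrative - it isn’t anything we actually run):

    import ipaddress

    def shadow_address(public_ip: str) -> str:
        """Map a routable address onto its 172.20.x.y shadow by keeping
        the last two octets (e.g. 1.2.3.4 -> 172.20.3.4)."""
        octets = public_ip.split(".")
        return f"172.20.{octets[2]}.{octets[3]}"

    assert shadow_address("1.2.3.4") == "172.20.3.4"
    # The shadow range sits inside IANA private space (172.16.0.0/12),
    # so it will never be routed outside the organisation.
    assert ipaddress.ip_address("172.20.3.4").is_private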

Many groups within the network supply their own infrastructure (switches, cabinets, etc).

Many groups have their own private networks hidden behind NAT’ed gateways.

We have HPC clusters in the building, and we have world-facing servers supplying standard and non-standard services to other institutions around the world.

We have a large mobile contingent who need to access local resources, and many collaborations that need controlled access to some services that can’t be world-facing.

We have a wireless network, and a DHCP server for known clients.

As we are part of a larger organisation, we need to allow that organisation to present their services to our users, and our users to present their own services to the larger organisation. We also need to allow the network security team from the centre access to all our networks for security scans etc.

We do not control the border router for these networks.

So, how do you control access to something like this?

We use a firewalling bridge machine. Every packet that comes into or goes out of the network goes through the “firebridge” - but it gets worse: even our local traffic (i.e. cross-subnet traffic) must go in and out of the firebridge, because we don’t control the border router.
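To make that concrete, here is a rough sketch of how often a packet crosses the firebridge under this layout. The /24s are documentation-range stand-ins for our real ranges, and the assumption that same-subnet traffic stays on the local switches (and so never touches the firebridge) is mine:

    import ipaddress

    # Stand-in prefixes for our routable /24s (only two shown).
    OUR_SUBNETS = [ipaddress.ip_network(n)
                   for n in ("192.0.2.0/24", "198.51.100.0/24")]

    def subnet_of(ip: str):
        addr = ipaddress.ip_address(ip)
        return next((net for net in OUR_SUBNETS if addr in net), None)

    def firebridge_crossings(src: str, dst: str) -> int:
        s, d = subnet_of(src), subnet_of(dst)
        if s is None and d is None:
            return 0   # not our traffic at all
        if s == d:
            return 0   # same /24: stays on the local switches (assumption)
        if s is not None and d is not None:
            return 2   # cross-subnet: out to the border router and back in
        return 1       # one end inside, one end out on the Internet

    print(firebridge_crossings("192.0.2.10", "192.0.2.20"))     # 0
    print(firebridge_crossings("192.0.2.10", "198.51.100.7"))   # 2
    print(firebridge_crossings("192.0.2.10", "203.0.113.5"))    # 1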

Our users are not the most security-minded of individuals, and any new collaborative service will inevitably result in a request for an access rule for the collaborators at the other institutions. A request for specific IP addresses will always be met with “just let them all through - they could be on any computer at the Institution”.

So the “firebridge” is more about making the users feel better than a real effort at security.

Nevertheless, it is an important machine in the current overall scheme of things for our network.

And it wasn’t working properly.

When you trace a problem like this to a particularly busy machine, and the hardware checks out OK, the problem is usually resource starvation. The quick test for this is to restart the service (or reboot the machine). If it comes back in perfect working order, it is probably a load-related issue; over time, as the load grows, it will start exhibiting problems again, but you can fix the resource problem before that happens.
If it doesn’t come back in perfect order, it is usually best to assume a replacement machine is required.

Our firebridge came back exhibiting the same problems as before.

No problem, we have a machine prepared for just such an emergency.

We dug it out of storage, checked it, loaded our latest firewall rules, and deployed it.

Within seconds we had the same network problems as before!

OK, regroup, re-think, coffee.

If the firebridge was the culprit, the only way to prove it was to take it out of the loop.
A scary thought for all those machines on the inside.
(I read somewhere that it takes an average of seven minutes for a new Windows machine to be compromised when exposed to the Internet.)

We took it out of the loop.

All network problems disappeared instantly.

We plugged it back in. The network problems re-appeared almost instantly.

So we hit the books, got some low-level diagnostics on the firebridge (packet level) and watched what was happening in real time.

A rule (added two months or so ago) to allow the recently implemented 172.20.xxx.xxx IP range to enter the network and cross the router had been written as 172.20.0.0/16 with a “keep state” argument.
This turned out to be a big mistake - but only once that network range started to be deployed, which had happened only in the last few weeks.
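A back-of-the-envelope calculation shows why the mistake only bit once the range went live. The figures below (active hosts, flows per host, state-table ceiling) are illustrative guesses, not numbers from our firebridge:

    # Why "keep state" on a whole /16 is fine right up until it isn't.
    # All figures here are illustrative guesses.
    STATE_TABLE_LIMIT = 10_000   # a typical default-sized ceiling; ours may differ

    def states_needed(active_hosts: int, flows_per_host: int) -> int:
        return active_hosts * flows_per_host   # one state entry per tracked flow

    # Before the 172.20 range was deployed, the rule matched almost nothing;
    # once a few hundred clients were live, the ceiling was gone - and new
    # flows that can't get a state entry look exactly like random packet loss.
    for label, hosts in (("before deployment", 20), ("after deployment", 400)):
        need = states_needed(active_hosts=hosts, flows_per_host=30)
        print(f"{label}: ~{need} states needed "
              f"({'over' if need > STATE_TABLE_LIMIT else 'under'} the "
              f"{STATE_TABLE_LIMIT} ceiling)")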

We revised our firebridge rules, increased the allowed state table size and various caches on the firebridge, and rebooted.

Watching closely, we saw the entries in the firebridge state table rise to just over the previous minimum (and this was with the revised rules!), and then slowly creep upwards.
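For anyone wanting to do the same sort of watching, a minimal sketch is below. It assumes a pf-style packet filter where “pfctl -s states” prints one state entry per line (the “keep state” syntax points that way, but your firebridge may well run something else - substitute its own state-table query if so):

    import subprocess
    import time

    def count_states() -> int:
        # Assumes a pf-style firewall: "pfctl -s states" lists one entry per line.
        out = subprocess.run(["pfctl", "-s", "states"],
                             capture_output=True, text=True, check=True)
        return len(out.stdout.splitlines())

    def watch(interval_s: int = 30) -> None:
        while True:
            print(f"{time.strftime('%H:%M:%S')}  state entries: {count_states()}")
            time.sleep(interval_s)

    if __name__ == "__main__":
        watch()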

The network problems did not reappear. We sent 30,000 packets all around the network, and to the outside world. We lost none.

Twenty-four hours on, and no return of the problems.

While the real test will be on Monday around 10am, when things really start to hum in this network, I anticipate (fingers crossed!) that the problems will not reappear.

So what have we learned, grasshopper?

That changes in configuration need to be thoroughly assessed.
We were used to putting rules dealing with large blocks of IP addresses through our firebridge, because we knew very few of those addresses would be active at any one time. Not so with our own Institution’s large block of new IPs.

That network problems with similar symptoms do not necessarily have the same cause.
This looked like a recurrence of the previous problem, but we knew the offending hardware had been “decommissioned”, and besides, the network traffic patterns were different this time.


Hope you enjoyed this. Leave a comment if you have any questions.




Sunday, 7 March 2010

Networks are wonderful, but....


The advent of the LAN (Local Area Network) has been a landmark for computing.
Originally used to share expensive peripherals like disks and printers, it is now also used to control access to data - you get access to only the data you need to do your job, and everything else is locked down by ACLs (Access Control Lists), policy-driven firewalls or share permissions.

And that's great - when it works.

[ NB In all that follows, I was ably assisted by other members of my team - I am not a one-man-band.]

When it doesn't, this is what happens:

I got a phone call about 2:45 on Monday. It was one of the secretaries on the 5th floor. She was having trouble saving a document that she had been working on for several hours, and needed help now! A quick check over the network showed her machine was up and reachable. A visit to her office and a click on the Save button and all was well. So what was she complaining about? And here is where knowing your users is of vital importance - this secretary was not prone to exaggeration; her reported problems were usually real problems and had in the past been early warning signs that something was "not quite right".

I went back down to my office and started my usual "is the network all there" ping scan. (Basically, ping 100 packets over the longest paths and measure latency and return count.)
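For anyone wondering what that scan looks like, it is roughly the sketch below. The target addresses are made up; the real list is the far ends of the longest paths across our subnets. It times the whole ping subprocess, so the figures include process start-up overhead - fine for spotting big swings, not for precision work:

    import subprocess
    import time

    TARGETS = ["192.0.2.1", "198.51.100.20", "203.0.113.5"]   # stand-in hosts
    COUNT = 100

    def probe(host):
        """Return (replies received, rough average round trip in ms)."""
        received, total_ms = 0, 0.0
        for _ in range(COUNT):
            start = time.monotonic()
            try:
                ok = subprocess.run(["ping", "-c", "1", host],
                                    stdout=subprocess.DEVNULL,
                                    stderr=subprocess.DEVNULL,
                                    timeout=2).returncode == 0
            except subprocess.TimeoutExpired:
                ok = False
            if ok:
                received += 1
                total_ms += (time.monotonic() - start) * 1000
        return received, (total_ms / received) if received else float("inf")

    for host in TARGETS:
        got, rtt = probe(host)
        print(f"{host}: {got}/{COUNT} replies, avg round trip ~{rtt:.1f} ms")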

The network was there, but the "round trip times" were way too long.

Time for a more focused approach. I took my trusty MacBook and went to the wiring cabinet on the 4th floor (it serves the front half of the building and all of the 4th, 5th and 6th floors) and plugged it into a switch serving the far corners of the 6th floor. When I didn't get an IP address from our DHCP server after 30 seconds, I started to get a little concerned. I physically checked the DHCP server and it was fine. By the time I got back to my laptop, it had an address. So, from there, I repeated my "is the network all there" scan. And it was. And the results were within specification. Wash, rinse, repeat! Again, perfect. Back down to my office. Wash, rinse, repeat!

Perfect.

OK, an intermittent glitch. These strike fear into any network admin's heart. You never know when they are going to happen, and they don't last long enough to pinpoint.
And since we had a power cut on Sunday, I started to think of hardware problems in the areas where management would not spring for UPS support…

Next morning, logging in from home before travelling in, I got some random freezes on the SSH session I was using. I cut breakfast short and got in asap. Scan round-trip times were all over the place, latency was up across the network, and parts of the network were not visible - trust me folks, this is not good!

Starting with the bits of the network that were not visible, I began checking the switches serving that part of the network. One switch had been replaced a few weeks ago, and this replacement appeared to be working normally but was not passing any packets through its uplink port. I replaced it, and it was fine. Meanwhile, the rest of the network was experiencing the intermittent connectivity seen the previous day.
Then the replacement switch went down. Same as before - it wasn't passing packets through its uplink port. WTF??
We have several of these switches in many different parts of the network. I checked them all. Three out of 12 were not passing packets through their uplinks. I pressed into service the emergency switches (you know, the ones you have replaced with newer, higher-spec models but never got around to throwing out...).
Things started to settle down. By now it is 12 hours after breakfast and I am knackered. I checked a few more times and all was well.

Home.

Next morning, freezes in the SSH session once more. I skipped breakfast and arrived to find the problems manifesting themselves in the 3rd floor comms room (which serves the back half of the building and the 1st, 2nd and 3rd floors).
I immediately zeroed in on the switches I had checked the day before. All bar one were passing packets. I replaced it, and things were well again - for a short while.
We checked servers (lots of NFS mounts in our network), we checked switches. Some switches were not passing traffic on uplink ports, although they were fine on inter-switch traffic. Lots of head scratching, theorising about the power cut affecting only the newer switches, lots of side paths explored, and frenzied testing as the intermittent trouble flared and then subsided. Everything seemed fine by 7pm. Time for home.

Next morning, freezing SSH sessions again. I had already informed everybody that there was going to be downtime from 7am this morning, so I got into work, ran the basic "is the network there?" test and got loopy results.

Into the 3rd floor comms room, I plugged into the backbone switch and started checking packet counts. Actual packet counts were normal, but the proportion of multicast packets was way off on one port. That port led to a switch (of the kind that had been playing up over the previous days) that aggregated several areas of the network. Checking that switch's packet counts showed two ports with a larger than normal share of multicast packets. I shut down the worst offender and things started to settle down - latency didn't spike, round-trip times were closer to normal - all in all, a much healthier network.
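The check itself is simple arithmetic once the per-port counters are in front of you: work out each port's multicast share and see what stands out. The counters, port names and threshold below are invented for illustration - pulling the real numbers off your switches (CLI or SNMP) is left as an exercise:

    # Flag ports whose multicast share is out of line with the rest.
    # These counters and names are invented; real ones come from the switch.
    port_counters = {
        # port: (total packets, multicast packets)
        "port-01": (1_200_000, 9_000),
        "port-02": (950_000, 7_500),
        "port-07": (1_100_000, 410_000),   # the sort of thing that stands out
    }

    THRESHOLD = 0.10   # arbitrary: more than 10% multicast earns a closer look

    for port, (total, mcast) in sorted(port_counters.items()):
        share = mcast / total if total else 0.0
        flag = "  <-- investigate" if share > THRESHOLD else ""
        print(f"{port}: {share:.1%} multicast{flag}")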
Tracing that port to its other end, I found a very old Netgear wireless access point plugged into it. Now, our wireless network is on a different physical network, and there should not be WAPs plugged into the normal wired network. This particular WAP should have been plugged into the socket next to it (marked with red tape).

I ran some tests on the WAP over the course of the day, and sure enough, it would go into a packet-spewing frenzy for an indeterminate amount of time, then be normal for an indeterminate amount of time. It was dismantled with extreme prejudice and a hammer.

Lessons learned? Well, without knowing what was "normal" for the backbone switches I could have missed the elevated multicast packet count, so I guess more time checking traffic patterns on the backbone wouldn't go amiss.
And start with the backbone switches!
Oh yes, and bolt down the WAP Ethernet cables so nobody can unplug them from the ports I put them in….

[postscript] Those switches that weren't passing packets on their uplink ports had "storm control" enabled, whereas the others did not. Switch preparation procedures have been amended so that a known configuration is applied consistently.
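Keeping that "known configuration" honest is easy enough with a diff against a golden copy. A minimal sketch, assuming you already save each switch's config to a text file (the paths here are hypothetical):

    import difflib
    from pathlib import Path

    GOLDEN = Path("configs/golden-edge-switch.cfg")   # hypothetical baseline

    def check_config(switch_cfg: Path) -> bool:
        """Print any differences from the golden config; True if identical."""
        diff = list(difflib.unified_diff(
            GOLDEN.read_text().splitlines(),
            switch_cfg.read_text().splitlines(),
            fromfile=str(GOLDEN), tofile=str(switch_cfg), lineterm=""))
        if diff:
            print("\n".join(diff))
        return not diff

    # e.g. check_config(Path("configs/4th-floor-sw3.cfg"))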