Sunday, 7 March 2010

Networks are wonderful, but....


The advent of the LAN (Local Area Network) has been a landmark for computing.
Originally used to share expensive peripherals like disks and printers, it is now also used to control access to data - you get access only to the data you need to do your job - everything else is locked down by ACLs (Access Control Lists), policy-driven firewalls or share permissions.

And that's great - when it works.

[NB: In all that follows, I was ably assisted by other members of my team - I am not a one-man band.]

When it doesn't, this is what happens:

I got a phone call about 2:45 on Monday. It was one of the secretaries on the 5th floor. She was having trouble saving a document that she had been working on for several hours, and needed help now! A quick check over the network showed her machine was up and reachable. A visit to her office and a click on the Save button and all was well. So what was she complaining about? And here is where knowing your users is of vital importance - this secretary was not prone to exaggeration; reported problems were usually real problems and had in the past been early warning signs that something was "not quite right".

I went back down to my office and started my usual "is the network all there" ping scan. (Basically ping 100 packets over the longest paths and measure latency and return count).
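For the curious, the scan boils down to very little. Here is a minimal Python sketch of the idea - not our actual script; the target addresses and the latency threshold are made-up placeholders, and it assumes a Unix-style ping:

#!/usr/bin/env python3
"""Rough "is the network all there?" scan: ping hosts at the far ends of
the longest paths, record how many packets come back and the average
round-trip time. A sketch only - the host list and thresholds are
illustrative, not our real values."""

import re
import subprocess

TARGETS = ["10.0.6.250", "10.0.3.250", "10.0.1.250"]  # hypothetical far-corner hosts
COUNT = 100          # packets per target
MAX_AVG_MS = 5.0     # assumed "normal" LAN latency ceiling

def ping(host, count=COUNT):
    """Run the system ping (Unix-style flags) and pull out the received
    packet count and the average round-trip time in milliseconds."""
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True).stdout
    received = re.search(r"(\d+)\s+(?:packets\s+)?received", out)
    rtt = re.search(r"= [\d.]+/([\d.]+)/", out)  # min/avg/max line, take avg
    return (int(received.group(1)) if received else 0,
            float(rtt.group(1)) if rtt else None)

for host in TARGETS:
    got, avg = ping(host)
    suspect = got < COUNT or avg is None or avg > MAX_AVG_MS
    print(f"{host}: {got}/{COUNT} returned, avg rtt {avg} ms"
          + ("  <-- investigate" if suspect else ""))

Anything flagged gets a closer look; the point is the return count and the latency, not the tooling.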

The network was there, but the "round trip times" were way too long.

Time for a more focused approach. I took my trusty MacBook and went to the wiring cabinet on the 4th floor (serves the front half of the building and all of the 4th, 5th and 6th floors) and plugged it into a switch serving the far corners of the 6th floor. After 30 seconds, when I still didn't get an IP address from our DHCP server, I started to get a little concerned. I physically checked the DHCP server and it was fine. By the time I got back to my laptop, it had an address. So, from there, I repeated my "is the network all there" scan. And it was. And the results were within specification. Wash, rinse, repeat! Again, perfect. Back down to my office. Wash, rinse, repeat!

Perfect.

OK, intermittent glitch. These will strike fear into any network admin's heart. You never know when they are going to happen, and they don't last long enough to pinpoint.
And since we had a power cut on Sunday, I started to think of hardware problems in the areas where management would not spring for UPS support…

Next morning, logging in from home before travelling in, I got some random freezes on the SSH session I was using. Cut breakfast short and got in ASAP. Scan round trip times were all over the place, latency was up across the network, and parts of the network were not visible - trust me, folks, this is not good!

Starting with the bits of the network that were not visible, I began checking the switches serving that part of the network. One switch had been replaced a few weeks earlier; this replacement appeared to be working normally, but was not passing any packets through its uplink port. I replaced it, and it was fine. Meanwhile, the rest of the network was experiencing the intermittent connectivity seen the previous day.
Then the replacement switch went down. Same as before - it wasn't passing packets through its uplink port. WTF??
We have several of these switches in many different parts of the network. I checked them all. Three out of 12 were not passing packets through their uplinks. I pressed into service the emergency switches (you know, the ones you have replaced with newer, higher-spec models but never got around to throwing out...).
Things started to settle down. By now it is 12 hours after breakfast and I am knackered. I checked a few more times and all was well.

Home.

Next morning, freezes in the SSH session once more. Skipped breakfast, arrived to find the problems manifesting themselves in the 3rd floor comms room (serves the back half of the building and 1st, 2nd, and 3rd floors).
I immediately zeroed in on the switches I had checked yesterday. All bar one were passing packets. Replaced it, things were well again - for a short while.
We checked servers (lots of NFS mounts in our network), we checked switches. Some switches were not passing traffic on uplink ports, although they were fine on inter-switch traffic. Lots of head scratching, theorising about the power cut affecting the newer switches only, lots of side paths explored, and frenzied testing as the intermittent trouble flared and then subsided. Everything seemed fine by 7pm. Time for home.

Next morning, freezing SSH sessions again. I had already informed everybody that there was going to be downtime from 7am this morning, so I got into work, ran the basic "is the network there?" test and got loopy results.

Into the 3rd floor comms room, plugged into the backbone switch and started checking packet counts. Actual packet counts were normal, but on one port the proportion of multicast packets was far higher than it should have been. That port led to a switch (of the kind that had been playing up the previous days) that aggregated several areas of the network. Checking that switch's packet counts showed two ports with larger than normal multicast counts. I shut down the worst offender and things started to settle down - latency didn't spike, round trip times were closer to normal - all in all, a much healthier network. (The check itself is sketched below.)
Tracing that port to its other end, I found a very old Netgear wireless access point plugged into the port. Now, our wireless network is on a different physical network, and there should not be WAPs plugged into the normal wired network. This particular WAP should have been plugged into the socket next to it (marked with red tape).
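For completeness, the proportion check above is just arithmetic on per-port counters. A minimal sketch in Python, assuming the counters have already been pulled from the switch (IF-MIB's ifInMulticastPkts and friends over SNMP, or whatever your CLI shows); the port names, numbers and the 5% threshold are all invented for illustration:

"""Flag switch ports whose multicast traffic is out of proportion to the
total. The numbers below are illustrative - in practice they come from two
snapshots of the switch's per-port counters taken a minute or so apart."""

# (port, total packets in the interval, multicast packets in the interval)
SNAPSHOT = [
    ("port01", 180_000,  1_200),
    ("port02", 150_000,  1_000),
    ("port07", 165_000, 95_000),   # the sort of ratio that led me to the WAP
    ("port12", 140_000,    900),
]

BASELINE_FRACTION = 0.05   # what counts as "normal" here is an assumption

for port, total, mcast in SNAPSHOT:
    fraction = mcast / total if total else 0.0
    marker = "  <-- way above baseline" if fraction > BASELINE_FRACTION else ""
    print(f"{port}: {fraction:.1%} multicast{marker}")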

I ran some tests on the WAP over the course of the day, and sure enough, it would go into a packet spewing frenzy for an indeterminate amount of time, then be normal for an indeterminate amount of time. It was dismantled with extreme prejudice and a hammer.

Lessons learned? Well, without knowing what was "normal" for the backbone switches, I could have missed the elevated multicast packet count, so I guess more time spent checking traffic patterns on the backbone wouldn't go amiss (a rough sketch of keeping that kind of baseline follows these notes).
And start with the backbone switches!
Oh yes, and bolt down the WAPs' Ethernet cables so nobody can unplug them from the ports I put them in….
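On the baseline point: something as dumb as appending periodic counter snapshots to a CSV tells you at a glance what "normal" looks like next time. A rough sketch - the file location and the sample values are placeholders, not anything we actually run:

"""Keep a rolling record of backbone per-port counters so that "normal"
is something you can look up rather than remember. All names and numbers
here are placeholders."""

import csv
import time
from pathlib import Path

LOGFILE = Path("backbone-baseline.csv")   # hypothetical location

def record(samples):
    """Append (port, total, multicast) samples with a timestamp."""
    new_file = not LOGFILE.exists()
    with LOGFILE.open("a", newline="") as fh:
        writer = csv.writer(fh)
        if new_file:
            writer.writerow(["timestamp", "port", "total_pkts", "multicast_pkts"])
        now = int(time.time())
        for port, total, mcast in samples:
            writer.writerow([now, port, total, mcast])

# Example: the same style of made-up counters as the proportion check above.
record([("port01", 180_000, 1_200), ("port07", 165_000, 95_000)])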

[postscript] Those switches that weren't passing packets on their uplink ports had "storm control" enabled, whereas the others did not. Switch preparation procedures have been amended to consistently apply a known configuration.
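One way the amended procedure can catch this kind of drift is to diff each switch's saved configuration against a golden template before the switch goes into service. A hedged sketch - the file names and paths are hypothetical, not our real layout:

"""Compare a switch's saved configuration against a golden template so
that settings like storm control are applied (or omitted) consistently.
The file names are placeholders for whatever your preparation procedure
actually produces."""

import difflib
from pathlib import Path

GOLDEN = Path("configs/golden-edge-switch.cfg")    # hypothetical template
CANDIDATE = Path("configs/4th-floor-sw3.cfg")      # hypothetical switch dump

golden = GOLDEN.read_text().splitlines()
candidate = CANDIDATE.read_text().splitlines()

diff = list(difflib.unified_diff(golden, candidate,
                                 fromfile=str(GOLDEN), tofile=str(CANDIDATE),
                                 lineterm=""))
print("\n".join(diff) if diff else "Config matches the golden template.")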



