For IT Systems Administrators there are no "typical" days. There are "typical" tasks. Log reviews, software patches and updates, data retrieval, data backup, storage management, commissioning new servers, retiring old ones, testing new software, server performance checks, network performance checks, power regulation checks on the Uninterruptible Power Supplies, air conditioning unit checks, handling user queries and complaints, checking that all services are up and running (and running correctly!) and many, many more smaller tasks that fall into the "preventative" category - all form the backbone of the Systems Administrator's raison d'être.
Then there are the atypical tasks - like dealing with a flooded server room, moving your entire inventory of servers from one room to another, crawling around under the floor to find a possible cable break, running emergency power cables because someone forgot to tell you that there would be a power interruption that day and a hundred and one other things that can and do happen only occasionally, but that need to be dealt with by "someone".
Then we have the administrative tasks of the position - the documentation of systems, servers, networks and services, the preparation of proposals for new/upgraded servers and services, the preparation of specifications for desktops, servers and laptops so that they will do the job at the most reasonable cost - and then there is the form filling for management - but enough said about that particular endeavour.
And after all that, we have the real reason why we are needed. When a service goes down, we need to get it back up as soon as possible - and that may mean 24 hour shifts, cannibalising other, less important servers/services and basically doing whatever is needed to ensure that the service is back up with the least amount of downtime possible.
There are many ways that could be used to describe the job of Systems Administrator, but my favourite is this one:
"So, Jock, you say you are the "Official Elephant Hunter and Disposal Person" for the City of Glasgow?"
"That's right"
"But Glasgow has never had a problem with elephants!"
"See what a good job I'm doing!"
You don't know you need me, until you need me.
Saturday, 19 June 2010
Tuesday, 15 June 2010
A tale of two ISPs
I switched to Namesco as my ISP nearly 3 years ago, because they offered a fixed 2M package (my line had always done 2M and no more) with 100G download per month. I paid a year in advance.
Mostly, I got what I paid for - 2M download speeds and virtually unlimited data downloads. Occasionally, the Internet went away, but mostly it was fine. At renewal time, the cost of the 2M fixed package had increased, and the data allowance had gone down to 1G per month, purchase extra as required. After a review of my download habits, I decided 10G extra was enough. I renewed for a year, paid in advance to take advantage of a discount, and expected the same level of service as I had the year before. Didn't happen. Mostly, I got Internet at fluctuating speeds.
After another year, I renewed again, but this time with an 8M ADSL+ package that was cheaper than my fixed 2M package (by about £100 per year). That, if anything, was worse. The connection speed was up and down, the Internet took frequent holidays from me, and my ISP always started the diagnostics with "Please reboot your router" and then had me crawling under the desk to put the connection in the master socket (which was where it was from the last time!), then said the line test was fine etc etc.
Eventually I found mention of an ISP who appeared to be a little more pro-active on its customers' behalf.
I rang their sales line and had a chat to a guy who listened to what I told him about my sorry tale and then said that they did not guarantee to get my speed back to 2M, but that they would at least get BT to run tests before expecting me to pay - and they agreed with me that a line capable of 2M just doesn't drop to .5M without there being a problem somewhere along the path from the exchange to my house.
Long story short. After just two months with these guys, I have had BT replace my line from the exchange to the pole at the back of my house and I now have an Internet connection that has been stable for 96+ hours at roughly 2.5M.
The name of these guys is "Andrews and Arnold" and you can find them at http://aaisp.net.uk . Read their Broadband page, then their Support page and you will get an idea of where these guys are coming from. Their charging model won't suit everyone, but it sure suits the way I work.
Obviously two months is a very short time, and a long term assessment of these guys will take time.
But 2.5M in two months from .5M? That's huge! Especially after the other shower did nothing for 7 months.
Sunday, 6 June 2010
Clearing the clutter
Many years ago I was made redundant. This was quite a shock to me, as I had always considered myself to be a good worker, giving value for money to my employer, going the "extra mile" when required.
I blamed myself, as I thought it was my fault - but when I saw who was kept and who was fired, I saw that there was no reason (other than the management having a "plan" to reduce costs) to choose one over another.
So, what to do?
At that time, things were, economically speaking, quite difficult and the prospect of paid employment was quite low. I knew computers, book-keeping, computer games, the meat industry and how to talk to people. I really liked playing computer (and console) games and, as I had been in the console games sector, I thought I would see what could be done in that area.
Realizing that retail premises were not an option, I went mobile. I started a market stall selling the console games of the day (Super Nintendo, MegaDrive, Neo-Geo (look it up!) etc. ). The business kept me afloat until the PlayStation and other CD based formats came to the fore. Since they were easily copied, they followed the way of the PC and put many legitimate sellers out of business.
Shame, but that's life.
Fast forward 10+ years.
Our loft is getting added insulation soon, so it had to be cleared out. While in the process of doing so, I came upon the remnants of my console games selling business. I found an Atari Jaguar console (no games, so useless), an American SNES (Super Nintendo) still in its original box (no games, no "universal" adapter), an Amstrad 664 home computer (with disks), several hundred Commodore 64 games on cartridge and tape, and an MSX home computer (circa 1983), many spares for many different consoles and a lot of leads etc that were just plain useless.
I suppose I could have had a good old wallow in the nostalgia evoked by these items, but, you know what? I dumped them. Unceremoniously. Without real thought.
Why?
Because they were useless. Because they brought nothing to my life now. Because they had been sitting in my loft for 10 years, never looked at.
The sum total of what I kept from the loft clear out was my chess board and pieces, and some financial papers that needed shredding.
I am currently in paid employment, with the prospect of being made redundant yet again. As we know, economic times are hard (again). My base skill set hasn't changed much, but the world has.
So, what to do?
Haven't got a clue, really.
But at least my loft is empty.
Saturday, 8 May 2010
Money and Value
When I do something, I try to do it as well as I can - probably, so do you, and probably so does everyone else.
So why do we end up with ISPs who can't deliver what they sell?
Why do we have voting booths that run out of ballot papers?
Why do we have politicians who can't manage to be truthful?
Why do we have Public Servants who neither serve, nor care about the Public?
Why does any work you have done on your house have a 10 year guarantee, but anything that goes wrong isn't covered?
Why do we have insurance companies who will take every penny you've got in premiums, but demand you jump through all sorts of hoops before they fail to give you a penny?
Why do we have banks that are now my best friend, when two years ago they would cheerfully have put me in debt for the rest of my life?
If everybody is trying their best, why does it turn out like this?
My take is simple. Everything is measured in terms of money. We no longer have a value system, we only have a monetary system.
Example: I use the internet. I pay a premium to an ISP for a certain level of service. I value that service beyond the price I pay, because it is of importance to me. If that service slips, and I can no longer do what is of value to me, then I need to re-assess. Having re-assessed, I now pay more for my internet, but I only pay for what I use and I am not locked into a long term contract and I have an internet connection I can carry in my pocket and use almost anywhere in the UK. Overall, the value of the service I now have is more than the old one to me. The cost might be more (could be less depending on what I use) but the value is the important thing.
Example: I research a problem at work and come up with the answer that is best for the users of the system. I make a presentation to the "powers that be" and I am told it is "too expensive". When I ask what the cost of the performance degradation on the users' output would be when using a "less expensive" solution, I am met with blank stares. The value of the "too expensive" solution is not the amount we pay for it, but the amount by which the users' productivity goes up as opposed to the "less expensive" solution. Simple concept, but totally alien to "the powers that be".
Let's start valuing again, instead of pricing.
Monday, 5 April 2010
The Internetz and me
I started using dial-up many years ago when UUCP was doing the heavy lifting, and Kermit was fighting with X, Y and ZMODEM.
I have used remote access for a long time.
No technology in my life has ever frustrated me more. If I got a good connect, the file transfer failed. If I got lousy speed then the file transfers took forever and still failed. Mostly things just failed - but - every so often it all came together and it just worked, giving a glimpse of what could be possible.
Fast forward to today. We have blistering fast broadband in every house. Everybody is on-line all the time. We can download as much music, video, and other content as we want. It is always there, always on and always available. That's the picture painted by the ISPs, the media and those trying to sell you stuff.
The reality is that some of us have blistering fast broadband speed (cabled areas, those who live on top of the BT exchanges) - the rest of us have today's technology delivered (slowly) by last century's infrastructure, intermittently.
The ISPs have restricted how much we can download (caps and "fair use" policies), they push ever faster (more expensive) packages when they know the delivering technology cannot deliver any more than what you are getting today, they change the fine print, and they traffic shape your p2p traffic on the grounds that these services are "illegal", when the reality is that their crappy (rented) networks can't handle the traffic on the bandwidth they are overselling.
You are branded a thief if you download music (even from sites where the music is original and freely distributable). Download a video (legally) from anywhere and boy, you better watch it quick, or it will disappear off your computer in 10/20/30 days, whatever. Access a web site and chances are you will be asked to participate in a survey, have a pop-up ad in your face, have to listen to someone else's (not your) choice of music, or have spyware, crapware, trojans or viruses attacking your computer. And don't get me started on spam and spammers. There has to be a special version of Hell for those bastards!
Last October I changed my package with the same ISP from a fixed rate 2Mb package (that was rock steady at 2 and a little bit megabits) to an ADSL+ package capable of up to 8Mb - I knew I would not get any increase in speed (my line is incapable of more), but it was UKP100 (per year) cheaper than the package I had.
Big Mistake.
I now have a rock steady (most days!) internet connection at .5 to .75Mb.
Roughly one quarter to one third the speed I had before.
I have contacted my ISP about this repeatedly over the last 4 months, and they finally (after many tries and me jumping through all the diagnostic hoops they could conjure up) agree that the line is the problem, and that to progress this, a BT engineer should come to the house and inspect my equipment and internal wiring (huh?). Oh, and by the way, they will charge you UKP188 for the privilege - according to them, the charge is levied only if they find a fault in your equipment. They don't charge if they find a fault in their equipment - which is really big of them.
I have been dealing with BT from the days of ISDN data connections, and let me assure you, they *never* admit a fault. Your problem magically disappears some time after BT have been notified of it (usually!), but it was never their fault.
Case in point. I lost my internet connection in 2009 for three days. The ISP was useless, they couldn't do anything - it was a fault at the exchange. Finally, after three days it "just came back on". When I asked for the report from BT, I was given a one sentence report that explained nothing.
So, what's the answer?
Well, for me, I have changed ISPs to a firm that has a reputation for being a little bit more pro-active on its customers' behalf, I have moved onto a month by month contract (I will never get a long term contract again) and I have begun exploring other methods of Internet connection.
Satellite and Mobile Internet are the front runners at the moment.
For you, I don't know what the answer is - but I can surely say that the model of using an ISP who doesn't own the delivering infrastructure is fatally flawed - you end up as the ball in a three-way ping-pong match between you, your ISP and BT.
Not pretty. Which just about sums up the state of the "Information Superhighway" in my part of the UK.
Saturday, 20 March 2010
Networks are wonderful, but.... (Take 2)
My previous blog post on this subject concluded that I should spend more time looking at backbone traffic patterns in order to recognise when something is “not quite right”.
I have been so doing, but it didn’t really help me in what happens next.....
I arrived back in my office at 4:15 pm to see a couple of my colleagues hovering around a monitor, saying things like “latency all over the place”, “ping spikes” and “dropped packets”.
A sense of dread started to threaten my calm.
Sure enough we were seeing the same sort of behaviour that occurred the fortnight before - traffic latency, ping spikes and lost packets. Looking closely at the data, we could see that the patterns were different within each symptom, but the end result was the same - poor network performance, a refusal to save documents on the first try, but subsequent saves OK, mail from the imap server being slow to open, dropped connections to servers etc.
Knowing that the wireless access point that was the cause of the last episode was in several pieces in a disposal bin in stores, we knew we had another problem....
To understand the problem, you will need to know about the network setup in our area:
We have four discontiguous “class c” or /24 networks running over the same wires, spread over a main building and a satellite site a few miles away. IPs are allocated on a first come, first served basis. These IPs are routable, reachable (in theory) from any other Internet connected computer. We now also have four IP ranges in 172.20.xxx.xxx/16 which “shadow” the third octet of our “class c” addresses (e.g. 1.2.3.4 and 172.20.3.4). These are IANA private network space IPs that will not be routable outside the organisation.
Many groups within the network supply their own infrastructure (switches, cabinets, etc).
Many groups have their own private networks hidden behind NAT’ed gateways.
We have HPC clusters in the building and we have world facing servers supplying standard and non-standard services to other Institutions around the world.
We have a large mobile contingent who need to access local resources, and many collaborations that need controlled access to some services that can’t be world-facing.
We have a wireless network, and a DHCP server for known clients.
As we are part of a larger organisation, we need to allow that organisation to present their services to our users, and our users to present their own services to the larger organisation. We also need to allow the network security team from the centre access to all our networks for security scans etc.
We do not control the border router for these networks.
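The “shadow” addressing mentioned above (1.2.3.4 pairing with 172.20.3.4) can be illustrated with a small sketch. This is only my reading of the scheme, assuming the private address keeps the last two octets of the routable one; the function name `shadow_address` is made up for the example.

```python
import ipaddress

def shadow_address(public_ip: str, private_block: str = "172.20.0.0/16") -> ipaddress.IPv4Address:
    """Return the private 'shadow' of a routable address.

    Assumption: the shadow keeps the third and fourth octets of the
    routable address and swaps the first two for the private block.
    """
    public = ipaddress.IPv4Address(public_ip)
    block = ipaddress.IPv4Network(private_block)
    # Keep only the host bits (low 16 bits for a /16) of the routable address
    host_bits = int(public) & int(block.hostmask)
    return ipaddress.IPv4Address(int(block.network_address) | host_bits)

shadow = shadow_address("1.2.3.4")
print(shadow)             # 172.20.3.4
print(shadow.is_private)  # True: 172.20.0.0/16 sits inside the private 172.16.0.0/12 space
```

Because only the last two octets survive, two routable /24s that happened to share a third octet would collide under this mapping, which is presumably why the shadow ranges follow the third octet of each existing /24.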
So, how do you control access to something like this?
We use a firewalling bridge machine. Every packet that comes in or goes out of the network goes through the “firebridge” - but it gets worse - even our local traffic (i.e. cross-subnet traffic) must go in and out of the firebridge (because we don’t control the border router).
Our users are not the most security minded of individuals, and any new service that is a collaboration will inevitably result in a request for an access rule for the collaborators in the other institutions. A request for specific IP addresses will always result in “just let them all through - they could be on any computer at the Institution”.
So the “firebridge” is more for making the users feel better as opposed to a real effort at security.
Nevertheless, it is an important machine in the current overall scheme of things for our network.
And it wasn’t working properly.
When you trace a problem like this to a particularly busy machine, and the hardware checks out OK, the problem is usually a resource starvation one. The quick test for this is to restart the service (or reboot the machine). If it comes back in perfect working order it is probably a load related issue, but over time, as the load grows, it will start exhibiting problems again. You can then fix the resource problem before that happens.
If it doesn’t come back in perfect order, it is usually best to assume a replacement machine is required.
Our firebridge came back exhibiting the same problems as before.
No problem, we have a machine prepared for just such an emergency.
We dug it out of storage, checked it, loaded our latest firewall rules, and deployed it.
Within seconds we had the same network problems as before!
OK, regroup, re-think, coffee.
If the firebridge was the culprit, then the only way to prove it was to take it out of the loop.
Scary thought for all those machines on the inside.
(I read somewhere that it takes an average of seven minutes for a new Windows machine to be compromised when exposed to the Internet.)
We took it out of the loop.
All network problems disappeared instantly.
We plugged it back in. The network problems re-appeared almost instantly.
So we hit the books, got some low-level diagnostics on the firebridge (packet level) and watched what was happening in real time.
A rule (added 2 months or so ago) to allow the recently implemented 172.20.xxx.xxx IP range to enter the network and cross the router had been implemented as 172.20.xxx.xxx/16 with a “keep state” argument.
This turned out to be a big mistake - but only when that network range started to be deployed, which has happened only in the last few weeks.
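A rough back-of-envelope sketch of why this bites only once the range is in use: a stateful rule tracks connections, so the demand on the state table follows the number of hosts that are actually talking, not the size of the block in the rule. The flows-per-host figure and the host counts below are invented for illustration, and "one state entry per active flow" is a simplifying assumption, not a measurement from our firebridge.

```python
import ipaddress

def address_count(cidr: str) -> int:
    """Number of addresses a rule written against this block can match."""
    return ipaddress.ip_network(cidr).num_addresses

def estimated_states(active_hosts: int, flows_per_host: int) -> int:
    """Rough state-table demand, assuming one entry per active flow."""
    return active_hosts * flows_per_host

FLOWS_PER_HOST = 20  # hypothetical average of concurrent connections per host

print(address_count("172.20.0.0/16"))  # 65536 addresses behind one rule
print(address_count("192.0.2.0/24"))   # 256 (documentation range, standing in for one /24)

# A big block where hardly anything is live costs very little state...
print(estimated_states(50, FLOWS_PER_HOST))    # 1000
# ...but the same rule over a block that is filling up with real machines does not.
print(estimated_states(2000, FLOWS_PER_HOST))  # 40000
```

The rule itself did not change between the quiet weeks and the bad day; only the number of live machines behind it did.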
We did some revisions of our firebridge rules, some increasing of the allowed state table and various caches on the firebridge, and rebooted.
Watching closely we saw the entries in the firebridge state table rise to just over the previous minimum (and this is with the revised rules!), and then slowly creep upwards.
The network problems did not reappear. We sent 30,000 packets all around the network, and to the outside world. We lost none.
Twenty-four hours on, and no return of the problems.
While the real test will be on Monday around 10am, when things really start to hum in this network, I anticipate (fingers crossed!) that the problems will not reappear.
So what have we learned, grasshopper?
That changes in configuration need to be thoroughly assessed.
We were used to putting rules dealing with large blocks of IP addresses through our firebridge because we knew very few of them would be active at any one time. Not so with our own Institution’s large block of new IPs.
That network problems with similar symptoms do not necessarily have the same cause.
This looked like more of the previous problem, but we knew the offending hardware was “decommissioned”, and besides, the network traffic patterns were different this time.
Hope you enjoyed this. Leave a comment if you have any questions.
Sunday, 7 March 2010
Networks are wonderful, but....
The advent of the LAN (Local Area Network) has been a landmark for computing.
Originally used to share expensive peripherals like disks and printers, it is now also used to control access to data - you get access to only that data that you need to do your job - everything else is locked down by ACL (Access Control Lists), policy driven firewalls or share permissions.
And that's great - when it works.
[ NB In all that follows, I was ably assisted by other members of my team - I am not a one-man-band.]
When it doesn't, this is what happens:
I got a phone call about 2:45 on Monday. It was one of the secretaries on the 5th floor. She was having trouble saving a document that she had been working on for several hours, and needed help now! A quick check over the network showed her machine was up and reachable. A visit to her office and a click on the Save button and all was well. So what was she complaining about? And here is where knowing your users is of vital importance - this secretary was not prone to exaggeration, her reported problems were usually real problems, and in the past they had been early warning signs that something was "not quite right".
I went back down to my office and started my usual "is the network all there" ping scan. (Basically ping 100 packets over the longest paths and measure latency and return count).
The network was there, but the "round trip times" were way too long.
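For anyone who wants to automate the "is the network all there" scan, the bookkeeping can be sketched in a few lines of Python. This is not the script I actually use; the function names, the baseline figure and the "three times normal" threshold are all assumptions for illustration. It takes the round trip times from one run (with None for a packet that never came back) and reports return count and latency.

```python
from statistics import mean
from typing import Optional, Sequence


def summarise_ping(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarise one scan; None marks a packet that never came back."""
    returned = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    return {
        "sent": sent,
        "returned": len(returned),
        "loss_pct": 100.0 * (sent - len(returned)) / sent if sent else 0.0,
        "avg_ms": mean(returned) if returned else None,
        "max_ms": max(returned) if returned else None,
    }


def looks_unhealthy(summary: dict, baseline_avg_ms: float,
                    factor: float = 3.0) -> bool:
    """Flag a scan with any loss, or an average RTT well above the baseline.

    baseline_avg_ms and factor are hypothetical: use whatever is
    "normal" for the path being measured.
    """
    if summary["returned"] < summary["sent"]:
        return True
    return summary["avg_ms"] is not None and summary["avg_ms"] > factor * baseline_avg_ms


# Example: four packets, one lost.
sample = [1.0, 2.0, None, 3.0]
result = summarise_ping(sample)
print(result, looks_unhealthy(result, baseline_avg_ms=1.5))
```

The point of keeping a baseline is the same one that comes up later in this post: "way too long" only means something if you know what the round trip times normally look like.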
Time for a more focused approach. I took my trusty MacBook and went to the wiring cabinet on the 4th floor (serves the front half of the building and all of the 4th, 5th and 6th floors) and plugged it into a switch serving the far corners of the 6th floor. After 30 seconds when I didn't get an IP address from our DHCP server, I started to get a little concerned. I physically checked the DHCP server and it was fine. I got back to my laptop and it had an address. So, from there, I repeated my "is the network all there" scan. And it was. And the results were within specification. Wash, rinse, repeat! Again, perfect. Back down to my office. Wash, rinse, repeat!
Perfect.
OK, intermittent glitch. These will strike fear into any network admin's heart. You never know when they are going to happen, and they don't last long enough to pinpoint.
And since we had a power cut on Sunday, I started to think of hardware problems in the areas where management would not spring for UPS support…
Next morning, logging in from home before travelling in, I got some random freezes on the SSH session I was using. Cut breakfast short and got in asap. Scan round trip times all over the place, latency up across the network, parts of the network not visible - trust me folks, this is not good!
Starting with the bits of the network that were not visible, I started checking the switches serving that part of the network. One switch had been replaced a few weeks ago and this replacement appeared to be working normally, but was not passing any packets through its uplink port. Replaced it, and it was fine. Meanwhile, the rest of the network was experiencing the intermittent connectivity shown the previous day.
The replaced switch went down. Same as before - wasn't passing packets through its uplink port. WTF??
We have several of these switches in many different parts of the network. I checked them all. Three out of 12 were not passing packets through their uplink. I pressed into service the emergency switches (you know, the ones you have replaced with newer higher spec models, but never got around to throwing out..).
Things started to settle down. By now it is 12 hours after breakfast and I am knackered. I checked a few more times and all was well.
Home.
Next morning, freezes in the SSH session once more. Skipped breakfast, arrived to find the problems manifesting themselves in the 3rd floor comms room (serves the back half of the building and 1st, 2nd, and 3rd floors).
I immediately zeroed in on the switches I had checked yesterday. All bar one were passing packets. Replaced it, things were well again - for a short while.
We checked servers (lots of NFS mounts in our network), we checked switches. Some switches were not passing traffic on uplink ports, although were fine on inter-switch traffic. Lots of head scratching, theorising about power cuts affecting the newer switches only, lots of side paths explored, and frenzied testing as the intermittent trouble flared and then subsided. Everything seemed fine by 7pm. Time for home.
Next morning, freezing SSH sessions again. I had already informed everybody that there was going to be downtime from 7am this morning, so I got into work, ran the basic "is the network there?" test and got loopy results.
Into the 3rd floor comms room, plugged into the backbone switch and started checking packet counts. Actual packet counts were normal, but the proportion of multicast packets was off on one port. That port led to a switch (of the kind that had been playing up on the previous days) that aggregated several areas of the network. Checking that switch's packet counts showed two ports with a larger than normal share of multicast packets. Shut down the worst offender and things started to settle down - latency didn't spike, round trip times were closer to normal - all in all, a much healthier network.
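The check I was doing by eye on the switch counters can be sketched in Python. This is only an illustration: the port names, the counter values and the "twice the normal share" tolerance are invented, and how you pull the counters off your switches (SNMP, CLI scrape, whatever) is left out entirely.

```python
from typing import Dict, List, Tuple


def multicast_share(total_packets: int, multicast_packets: int) -> float:
    """Fraction of a port's packets that were multicast."""
    return multicast_packets / total_packets if total_packets else 0.0


def suspicious_ports(counters: Dict[str, Tuple[int, int]],
                     baseline_share: Dict[str, float],
                     tolerance: float = 2.0) -> List[Tuple[str, float]]:
    """Return (port, share) pairs whose multicast share exceeds
    tolerance times the recorded normal share, worst offender first.

    counters maps port -> (total packets, multicast packets).
    Ports with no recorded baseline are skipped, since there is
    nothing to compare them against.
    """
    flagged = []
    for port, (total, mcast) in counters.items():
        normal = baseline_share.get(port)
        share = multicast_share(total, mcast)
        if normal is not None and share > tolerance * normal:
            flagged.append((port, share))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical counters: total packet counts look alike on every port,
# but two ports carry far more multicast than their baseline.
counters = {"port1": (1000, 20), "port2": (1000, 400), "port3": (1000, 150)}
baseline = {"port1": 0.02, "port2": 0.02, "port3": 0.02}

for port, share in suspicious_ports(counters, baseline):
    print(f"{port}: {share:.0%} multicast")
```

Note that the total counts in the example are identical on all three ports, which is exactly why looking at raw packet counts alone showed nothing wrong.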
Tracing that port to its other end, I found a very old Netgear wireless access point plugged into the port. Now, our wireless network is on a different physical network, and there should not be WAPs plugged into the normal wired network. This particular WAP should have been plugged into the socket next to it (marked with red tape).
I ran some tests on the WAP over the course of the day, and sure enough, it would go into a packet spewing frenzy for an indeterminate amount of time, then be normal for an indeterminate amount of time. It was dismantled with extreme prejudice and a hammer.
Lessons learned? Well, without knowing what was "normal" for the backbone switches I could have missed the elevated multicast packet count, so I guess more time checking traffic patterns on the backbone wouldn't go amiss.
And start with the backbone switches!
Oh yes, and bolt down the WAP ethernet ports so nobody can unplug them from the ports I put them in….
[postscript] Those switches that weren't passing packets on the uplink ports had "storm control" enabled, whereas others did not. Switch preparation procedures amended to consistently apply a known configuration.