
Saturday, 14 May 2011

The Cloud and you


Two things I can say for sure:

The cloud will fail.
The cloud will get better.

How much either of these things means to you depends on how willingly, and how totally, you embrace the cloud.

Obligatory history lesson starts here:
In 1979 I built my first computer. I soldered the chip-carriers (2, I think) to the tiny PCB, and I soldered all the discretes in their rightful places and I attached power and it had life. A little while later it died (faulty character generator, if you must know).
From then on I had various computers (PET, TRS-80, AMSTRAD 464/664/PCW9512, CoCo, Amiga, Atari, Jupiter Ace, Sinclair Spectrum, IBM XT, IBM AT, a multitude of clones, home-brews and lately Apple Macs.)
All have failed in some way or another.

Lessons learnt from the above history? Technology will fail you (usually when you need it most) and technology improves if it lasts long enough.
Obligatory history lesson ends here.

Therefore, the cloud will fail, but it will improve.

If you are going to use the cloud, decide the level of failure you can tolerate, then use the cloud up to that level, and no more.

One example I know of:
A small business was having server problems - capacity and hardware were below par. The owner, a fairly tech-savvy guy, crunched his numbers and came up with a solution using Amazon. Unfortunately, like a lot of small businesses, the risk analysis was pretty much non-existent, and Disaster Recovery and Business Continuity were phrases heard once in a management seminar. But, hey, it's Amazon - what could go wrong? Well, Google "Amazon Cloud outage" and you will get a good idea. His business is still going, but some of his customers now use the business less, and some are still in "negotiations" over goods supplied late.

Amazon are doing a lot to make sure this doesn't happen again, and so is the small business owner.

Between them, (with lots of hard work) they should be able to put this incident to rest. Out of it will come an improved cloud service, and a chastened, but wiser small business owner.


Monday, 2 May 2011

Why I use Open Source Software

Don't get me wrong - I use and advocate the use and support of closed source software any time I feel it is appropriate to do so. But it's not that often these days.

I don't think the following scenario would have been possible using closed source equivalents, and I am damn sure that the cost of using them would have blown my budget out of the water!

I run a network. It has over 400 users, many of whom are mobile, work abroad for extended periods of time, and work all the hours God sends. Planned downtime is a rarity. Unplanned downtime is happening more frequently, but due to outside problems (power outages, internet congestion etc.) rather than internal problems - although we have our fair share of those too!
I use Open Source software wherever possible and I do so because it is generally a better "fit" for the network tasks I have than some proprietary software. And I can usually bend it to fit what I want - I can't do that with closed source.

So when I get an OS solution that works, I generally leave it alone. Oh, I apply security patches, but rarely do I update anything that's working unless I need the new feature(s) or they come with a security update.

That's why you can find installations of Apache 1.3 still working on intranet machines, why I still have working Slackware 11 installations and why some un-maintained programs are still doing the business on the network - they work and they are on internal machines with no security implications.

So when a power outage along with a faulty UPS takes out a machine that has been working steadily for the last 5 years as a dhcp server, a nat box, a wireless sign-on web page, a transparent proxy and a router for several private IP ranges, I take the opportunity to upgrade the hardware and software with thanks. When it happens on the Friday of a long weekend ( Friday through to Tuesday ), I am even more thankful for the opportunity to work on it uninterrupted.

Here is the setup:
Hardware: 4-disk rack-mount 1U box with dual Athlon processors and 2 gigabytes of RAM (a bit light these days, but it should be enough), with only 2 disks installed.
Software: Slackware64 13.1, standard full install. Main packages are Squid, Apache, dhcpd, dnsmasq, and some custom start-up scripts for adding addresses to Ethernet cards and starting iptables with the NAT table entries and port redirects for the transparent proxy.
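The iptables side of that setup - NAT plus the port redirect that makes the proxy transparent - can be sketched as below. This is a minimal illustration, not my actual start-up scripts; the interface names, the proxy port (3128 is Squid's default) and everything else here are assumptions:

```python
def transparent_proxy_rules(wan_if="eth0", lan_if="eth1",
                            proxy_port=3128, web_port=80):
    """Return the iptables commands a transparent web proxy needs:
    masquerade outbound traffic, and redirect LAN HTTP traffic to the
    local proxy port. Interface names and ports are illustrative."""
    return [
        # Masquerade everything leaving via the upstream interface
        f"iptables -t nat -A POSTROUTING -o {wan_if} -j MASQUERADE",
        # Redirect LAN web traffic to the proxy listening on this box
        f"iptables -t nat -A PREROUTING -i {lan_if} -p tcp "
        f"--dport {web_port} -j REDIRECT --to-port {proxy_port}",
    ]

for rule in transparent_proxy_rules():
    print(rule)
```

The private IP ranges each get their own PREROUTING rule in practice; only the general shape is shown here.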

The process went something like this:

Install Slackware. (30 minutes)
Get dnsmasq working as a DNS server only.
Get dhcpd (the already-installed version) working. (15 minutes)
Get Apache working in default mode, then configure it for my defaults. (15 minutes)
Get Squid. Get the SlackBuild script for Squid. Compile Squid. Install Squid. (45 minutes)
Read the Squid documentation (BIG package, lots of changes since I last used Squid in anger!). (4 hours)
Implement the necessary changes to the Squid configuration, test, and repeat. (12 hours, including Internet searches, reading blogs, wikis etc.)
(Transparent proxying using Squid was a hack in Squid 2 that has been elevated to "built-in" in Squid 3 - but judging by the blog and wiki pages, it is problematic in Squid 3...)
Curse Squid. (5 minutes)
Get a copy of Squid 2, try to compile it in 64 bits on the target box. (2 hours, failed)
Curse Slackware. (30 seconds)
Find, install, configure and test an alternative to Squid for transparent proxying (TinyProxy). (1 hour)
Install, test, debug and eventually modify the PHP pages for the wireless page sign-on. (3 hours)

Test all functions from various areas of the building. (4 hours)

Total time taken: ~ 28 hours on the software, spread over 2 days.

All the software was available, free and easily downloadable - no feature-crippled demos, no limit on the number of connections/users/CPUs, nobody upselling, nobody bombarding me with phone calls/emails for stuff I don't want/don't need and am quite capable of finding for myself if or when I do, and no expiry date where they get a chance to do it all again in 12 months' time.

And that is why I will put up with the occasional failure (looking at you, Squid*) in the Open Source model - they don't market this stuff, they just make it useful!

(* By the way, I am quite happy for Squid users to prove me wrong - it is a BIG package with over 170 options, so there is every chance that I screwed up and not Squid - but TinyProxy went in, I did a minimal config and it just worked...)

Sunday, 23 January 2011

Web Makeover, Part 2

In the original Web Makeover article, I spoke about using Aperture 3 to produce "web journals" then incorporating them into my RapidWeaver site.
That worked, but any updates are a long-winded process.
So I investigated Rapid Album, a plug-in for RapidWeaver specifically for producing my photographic galleries. And it does the job.
But that doesn't mean I have stopped looking - and I am considering hand-coding my own solution, which is something I do a lot at work, but I really don't want to have to do it for this site - I could use the time for something much more productive :-)

I have checked out a new "plug-in" from the SymfoniP people - Gallery Box - which produces a gallery a lot like many "off-the-shelf" websites I have seen. They have the pictures in a box with a "carousel" of thumbnails along the bottom. There are lots of options and it seems to work well. I might use it at some time in the future.

But for now the quest continues for my version of the "perfect gallery" - I know it's out there, somewhere.

Sunday, 21 November 2010

Web Makeover

I was bored with the look of my website.
I was using RapidWeaver with specific plugins for my photo album and I was bored with that, too.
I couldn't see my photos on my iPhone or iPad from my website (no Flash), so I looked around for an HTML-only photography gallery generator.
Turns out I had one all the time - Aperture 3 does "web pages" and "web journals".

In Aperture 3, select some photos, select New from the menu, choose Web Page or Journal (the journal allows you to add text blocks to your pages), set up your options and then "export" it to disk (or MobileMe, of course!)
So, that's what I did.
Then, using RapidWeaver, I took my existing site, changed the theme and removed the unused pages plus my photography pages. I put in a "placeholder" page for the Aperture 3 web journal, then exported the site to disk.
I used cut'n'paste to move the content of the web journal's source to where the content of the photography page would be, and then moved that file into the exported web journal. I then moved the whole of the web journal site over the top of the placeholder photography page in the RapidWeaver site. The end result was a lovely themed photography journal, integrated into my site's new theme.
Some CSS tweaks were needed to get the navigation working correctly on the web journal pages, and I haven't, as yet, themed the individual photo pages, but the end result looks fine to me.

Time from start to finish? 4 hours.

And the moral of this story? Well, I would say: "Choose your tools well."

Thursday, 16 September 2010

Participation

I spend a lot of time on-line. I look at hundreds of blog sites and hundreds of photos each month. I read dozens and dozens of articles on everything under the Sun - from the best way to live in a van to the progress of the Large Hadron Collider. I read tweets (hundreds and hundreds) and occasionally tweet something myself. I email. I surf. I flipbook.
In short, I consume. I consume in vast quantities, and I want more - and more - and more AND I don't want to pay (well, alright, a little bit - but not that much!).

In a second or two when I was not slurping up the products of someone else's efforts, I thought: "What would happen if everybody just produced ONE thing?"

Well, the internet would be a vastly more interesting place, for one. It might take some of the pressure off the inveterate producers of digital goodies, as well. It could help convince those organisations that simply moved their "real world goods" to the Internet (and continued to charge "real world" prices) that they don't own the game anymore. And, if you did it, it might even make you feel as though you were part of something bigger than you had ever been a part of before.

So, I just produced something - a slideshow of my photography, set to music and themed with "a calming, tranquil" goal in mind. Each photograph is from Scotland, and while not technically perfect (is anything?), they are some of the ones that "stick in my mind" - for a variety of reasons.

Anyway, here it is, in a form suitable for an iPod/iPhone and one for the iPad/Mac/AppleTV and for all you HD fans, here's a 720HD version. The first is around 46M and the second is just under 160M, while the third weighs in at 272M. Large sizes, I know - but to make what I wanted I wasn't going to compromise (and spare a thought for me - I uploaded them from a 448k uplink :-( ). If you are a Windows user then either transcode them to your preferred format or install QuickTime for the PC.

If you like it, leave a comment. If you hate it, leave a comment.

Sunday, 18 July 2010

iPad - Magic, or meaningless?



When the iPad was announced, I waded through all the facts I could find (as opposed to all the anguished cries of "it's not a computer/replacement netbook/whatever" and all the "apple fanboy" cries of "brilliant/earth-shattering/world-changing" etc.). And factual stuff was hard to come by, as were unbiased reviews.

My decision was to leave it until I could get a "hands-on" and then see if it fitted my workflow.

I got the "hands-on" from the Apple Store and one early adopter who let me have a play for a couple of hours in return for setting up his email accounts.

My decision was that the iPad was well engineered, but it wasn't a laptop replacement, it wasn't a netbook, and it wasn't a tool I could use in my work regimen. Leave it until generation 2 or 3 and see what comes with those (like FaceTime (camera(s)), and the ability to move files on and off without a third-party app or emailing them!).

So why is there a first-gen iPad sitting next to me as I write this?

Simple - I got work to fund one to help me support all the iPads that were popping up all over the place.

So, has my opinion changed after a week with the iPad?

Well, yes and no.

I still think it is a well engineered piece of kit, it's still not a laptop replacement and it's still not a netbook replacement. I can use it in my work regimen though - to answer emails and to grab information from the Internet.

But the fundamental shift in my thinking is that the iPad is a device for consuming. Consuming films, TV shows, screencasts, podcasts, YouTube, music, audio books, books, blogs, photographs, games and any other thing that Apple can push through the iTunes/App Store interface. The battery life is brilliant (I played the movie Avatar back-to-back 4 times and still had 3% battery left), the screen is superb, the sound quality is up to Apple's usual standard, and when iOS 4 comes to the iPad, I should be able to "fast-app switch" while my Magnatune music stream keeps playing in the background. This combination makes it an excellent media consumption device.

I have used the iPad for SSH access to my servers, and that works, but the app is an iPhone app run in 2x mode (doubled pixels), so it's a bit clunky - but it does work. Next, I need a remote desktop protocol app for attaching to my Mac and PC - then I can use it as a replacement emergency netbook (I use an Asus Eee PC for that now (thanks, sis!)).

Overall, not magic, but certainly not meaningless, not by a long way!

(the iPad model I have is WiFi 32G - iPad picture Courtesy of Apple)

Tuesday, 15 June 2010

A tale of two ISPs

I switched to Namesco as my ISP nearly 3 years ago, because they offered a fixed 2M package (my line had always done 2M and no more) with 100G download per month. I paid a year in advance.

Mostly, I got what I paid for - 2M download speeds and virtually unlimited data downloads. Occasionally, the Internet went away, but mostly it was fine. At renewal time, the cost of the 2M fixed package had increased, and the data allowance had gone down to 1G per month, purchase extra as required. After a review of my download habits, I decided 10G extra was enough. I renewed for a year, paid in advance to take advantage of a discount, and expected the same level of service as I had the year before. Didn't happen. Mostly, I got Internet at fluctuating speeds.
After another year, I renewed again, but this time with an 8M ADSL+ package that was cheaper than my fixed 2M package (by about £100 per year). That, if anything, was worse. The connection speed was up and down, the Internet took frequent holidays from me, and my ISP always started the diagnostics with "Please reboot your router", then had me crawling under the desk to put the connection in the master socket (which was where it had been left the last time!), then said the line test was fine, etc. etc.

Eventually I found mention of an ISP who appeared to be a little more pro-active on its customers' behalf.

I rang their sales line and had a chat to a guy who listened to what I told him about my sorry tale and then said that they did not guarantee to get my speed back to 2M, but that they would at least get BT to run tests before expecting me to pay - and they agreed with me that a line capable of 2M just doesn't drop to .5M without there being a problem somewhere along the path from the exchange to my house.

Long story short. After just two months with these guys, I have had BT replace my line from the exchange to the pole at the back of my house and I now have an Internet connection that has been stable for 96+ hours at roughly 2.5M.

The name of these guys is "Andrews and Arnold" and you can find them at http://aaisp.net.uk. Read their Broadband page, then their Support page, and you will get an idea of where these guys are coming from. Their charging model won't suit everyone, but it sure suits the way I work.

Obviously two months is a very short time, and a long-term assessment of these guys will have to wait.

But 2.5M in two months from .5M? That's huge! Especially after the other shower did nothing for 7 months.

Saturday, 20 March 2010

Networks are wonderful, but.... (Take 2)


My previous blog post on this subject concluded that I should spend more time looking at backbone traffic patterns in order to recognise when something is “not quite right”.

I have been so doing, but it didn’t really help me in what happens next.....

I arrived back in my office at 4:15 pm to see a couple of my colleagues hovering around a monitor, saying things like “latency all over the place”, “ping spikes” and “dropped packets”.

A sense of dread started to threaten my calm.

Sure enough we were seeing the same sort of behaviour that occurred the fortnight before - traffic latency, ping spikes and lost packets. Looking closely at the data, we could see that the patterns were different within each symptom, but the end result was the same - poor network performance, a refusal to save documents on the first try, but subsequent saves OK, mail from the imap server being slow to open, dropped connections to servers etc.
Knowing that the wireless access point that was the cause of the last episode was in several pieces in a disposal bin in stores, we knew we had another problem....

To understand the problem, you will need to know about the network setup in our area:

We have four discontiguous “class C” or /24 networks running over the same wires, spread over a main building and a satellite site a few miles away. IPs are allocated on a first come, first served basis. These are routable IPs, reachable (in theory) from any other Internet-connected computer. We now also have four IP ranges in 172.20.xxx.xxx/16 which “shadow” the third octet of our “class C” addresses (e.g. 1.2.3.4 and 172.20.3.4). These are IANA private network space IPs that are not routable outside the organisation.
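The “shadowing” described above (1.2.3.4 pairing with 172.20.3.4) amounts to reusing the last two octets inside the private /16. A hypothetical helper, purely to illustrate the mapping - the real allocation scheme may differ:

```python
def shadow_address(public_ip: str) -> str:
    """Map a routable address to its private 'shadow' in 172.20.0.0/16
    by reusing the last two octets, so 1.2.3.4 shadows as 172.20.3.4.
    Illustrative only; not the actual allocation tooling."""
    octets = public_ip.split(".")
    if len(octets) != 4:
        raise ValueError("not an IPv4 dotted quad: " + public_ip)
    return "172.20.{}.{}".format(octets[2], octets[3])

print(shadow_address("1.2.3.4"))  # → 172.20.3.4
```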

Many groups within the network supply their own infrastructure (switches, cabinets, etc).

Many groups have their own private networks hidden behind NAT’ed gateways.

We have HPC clusters in the building and we have world facing servers supplying standard and non-standard services to other Institutions around the world.

We have a large mobile contingent who need to access local resources, and many collaborations that need controlled access to some services that can’t be world-facing.

We have a wireless network, and a DHCP server for known clients.

As we are part of a larger organisation, we need to allow that organisation to present their services to our users, and our users to present their own services to the larger organisation. We also need to allow the network security team from the centre access to all our networks for security scans etc.

We do not control the border router for these networks.

So, how do you control access to something like this?

We use a firewalling bridge machine. Every packet that comes into or goes out of the network goes through the “firebridge” - but it gets worse - even our local traffic (i.e. cross-subnet traffic) must go in and out of the firebridge (because we don’t control the border router).

Our users are not the most security minded of individuals, and any new service that is a collaboration will inevitably result in a request for an access rule for the collaborators in the other institutions. A request for specific IP addresses will always result in “just let them all through - they could be on any computer at the Institution”.

So the “firebridge” is more for making the users feel better as opposed to a real effort at security.

Nevertheless, it is an important machine in the current overall scheme of things for our network.

And it wasn’t working properly.

When you trace a problem like this to a particularly busy machine, and the hardware checks out OK, the problem is usually a resource starvation one. The quick test for this is to restart the service (or reboot the machine). If it comes back in perfect working order it is probably a load related issue, but over time, as the load grows, it will start exhibiting problems again. You can then fix the resource problem before that happens.
If it doesn’t come back in perfect order, it is usually best to assume a replacement machine is required.

Our firebridge came back exhibiting the same problems as before.

No problem, we have a machine prepared for just such an emergency.

We dug it out of storage, checked it, loaded our latest firewall rules, and deployed it.

Within seconds we had the same network problems as before!

OK, regroup, re-think, coffee.

If it is the firebridge that is the culprit, then the only way to prove it was to take it out of the loop.
Scary thought for all those machines on the inside.
(I read somewhere that it takes an average of seven minutes for a new Windows machine to be compromised when exposed to the Internet.)

We took it out of the loop.

All network problems disappeared instantly.

We plugged it back in. The network problems re-appeared almost instantly.

So we hit the books, got some low-level diagnostics on the firebridge (packet level) and watched what was happening in real time.

A rule (added 2 months or so ago) to allow the recently implemented 172.20.xxx.xxx IP range to enter the network and cross the router had been implemented as 172.20.xxx.xxx/16 with a “keep state” argument.
This turned out to be a big mistake - but only when that network range started to be deployed, which has happened only in the last few weeks.
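A toy model of why that /16 keep-state rule hurt: every flow the rule matches earns its own state-table entry, so a rule written with a handful of active hosts in mind behaves very differently once thousands of local hosts start matching it. The numbers and the naive prefix matching below are entirely illustrative:

```python
def state_entries(flows, stateful_prefixes):
    """Toy model of a firewall state table: every flow whose source
    matches a stateful ('keep state') rule gets its own entry.
    Prefix matching is naive string matching on dotted quads."""
    entries = set()
    for src, dst, dport in flows:
        if any(src.startswith(p) for p in stateful_prefixes):
            entries.add((src, dst, dport))
    return len(entries)

# A handful of external collaborators versus an entire local /16:
external = [("203.0.113.%d" % i, "10.0.0.1", 443) for i in range(5)]
internal = [("172.20.%d.%d" % (i, j), "10.0.0.1", 80)
            for i in range(20) for j in range(1, 200)]

print(state_entries(external + internal, ["203.0.113."]))            # 5
print(state_entries(external + internal, ["203.0.113.", "172.20."]))  # 3985
```

Once the 172.20 range started to be deployed, the state table had to absorb every one of those local flows, not just the external ones the rule was written for.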

We did some revisions of our firebridge rules, some increasing of the allowed state table and various caches on the firebridge, and rebooted.

Watching closely we saw the entries in the firebridge state table rise to just over the previous minimum (and this is with the revised rules!), and then slowly creep upwards.

The network problems did not reappear. We sent 30,000 packets all around the network, and to the outside world. We lost none.

Twenty-four hours on, and no return of the problems.

While the real test will be on Monday around 10am, when things really start to hum in this network, I anticipate (fingers crossed!) that the problems will not reappear.

So what have we learned, grasshopper?

That changes in configuration need to be thoroughly assessed.
We were used to putting rules dealing with large blocks of IP addresses through our firebridge because we knew very few of them would be active at any one time. Not so with our own Institution's large block of new IPs.

That network problems with similar symptoms do not necessarily have the same cause.
This looked like more of the previous problem, but we knew the offending hardware was “decommissioned”, and besides, the network traffic patterns were different this time.


Hope you enjoyed this. Leave a comment if you have any questions.




Sunday, 7 March 2010

Networks are wonderful, but....


The advent of the LAN (Local Area Network) has been a landmark for computing.
Originally used to share expensive peripherals like disks and printers, it is now also used to control access to data - you get access to only that data that you need to do your job - everything else is locked down by ACL (Access Control Lists), policy driven firewalls or share permissions.

And that's great - when it works.

[ NB In all that follows, I was ably assisted by other members of my team - I am not a one-man-band.]

When it doesn't, this is what happens:

I got a phone call about 2:45 on Monday. It was one of the secretaries on the 5th floor. She was having trouble saving a document that she had been working on for several hours, and needed help now! A quick check over the network showed her machine was up and reachable. A visit to her office and a click on the Save button and all was well. So what was she complaining about? And here is where knowing your users is of vital importance - this secretary was not prone to exaggeration; reported problems were usually real problems and have in the past been early warning signs that something is "not quite right".

I went back down to my office and started my usual "is the network all there" ping scan. (Basically, ping 100 packets over the longest paths and measure latency and return count.)
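The measuring step of that scan boils down to pulling packet loss and average round-trip time out of the `ping -q` summary. A sketch of that parsing, assuming Linux iputils-style output (this is not the script I actually use):

```python
import re

def parse_ping_summary(output: str):
    """Extract (packet loss %, average RTT in ms) from the summary
    block that `ping -q` prints. Assumes Linux iputils-style output."""
    loss = re.search(r"([\d.]+)% packet loss", output)
    rtt = re.search(r"= [\d.]+/([\d.]+)/", output)
    return float(loss.group(1)), float(rtt.group(1))

sample = """\
100 packets transmitted, 98 received, 2% packet loss, time 990ms
rtt min/avg/max/mdev = 0.381/0.523/2.901/0.370 ms"""
print(parse_ping_summary(sample))  # → (2.0, 0.523)
```

Run that over the longest paths in the network and compare against known-good baselines, and "way too long" round trips jump out immediately.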

The network was there, but the "round trip times" were way too long.

Time for a more focused approach. I took my trusty MacBook and went to the wiring cabinet on the 4th floor (it serves the front half of the building and all of the 4th, 5th and 6th floors) and plugged it into a switch serving the far corners of the 6th floor. When, after 30 seconds, I didn't get an IP address from our DHCP server, I started to get a little concerned. I physically checked the DHCP server and it was fine. I got back to my laptop and it had an address. So, from there, I repeated my "is the network all there" scan. And it was. And the results were within specification. Wash, rinse, repeat! Again, perfect. Back down to my office. Wash, rinse, repeat!

Perfect.

OK, an intermittent glitch. These strike fear into any network admin's heart. You never know when they are going to happen, and they don't last long enough to pinpoint.
And since we had a power cut on Sunday, I started to think of hardware problems in the areas where management would not spring for UPS support…

Next morning, logging in from home before travelling in, I got some random freezes on the SSH session I was using. Cut breakfast short and got in asap. Scan round trip times all over the place, latency up across the network, parts of the network not visible - trust me folks, this is not good!

Starting with the bits of the network that were not visible, I began checking the switches serving that part of the network. One switch had been replaced a few weeks ago, and this replacement appeared to be working normally but was not passing any packets through its uplink port. I replaced it, and it was fine. Meanwhile, the rest of the network was experiencing the intermittent connectivity seen the previous day.
The replacement switch went down. Same as before - it wasn't passing packets through its uplink port. WTF??
We have several of these switches in many different parts of the network. I checked them all. Three out of 12 were not passing packets through their uplinks. I pressed into service the emergency switches (you know, the ones you have replaced with newer, higher-spec models but never got around to throwing out...).
Things started to settle down. By now it is 12 hours after breakfast and I am knackered. I checked a few more times and all was well.

Home.

Next morning, freezes in the SSH session once more. Skipped breakfast, arrived to find the problems manifesting themselves in the 3rd floor comms room (serves the back half of the building and 1st, 2nd, and 3rd floors).
I immediately zeroed in on the switches I had checked yesterday. All bar one were passing packets. I replaced it, and things were well again - for a short while.
We checked servers (lots of NFS mounts in our network), and we checked switches. Some switches were not passing traffic on their uplink ports, although they were fine on inter-switch traffic. Lots of head-scratching, theorising about the power cut affecting the newer switches only, lots of side paths explored, and frenzied testing as the intermittent trouble flared and then subsided. Everything seemed fine by 7pm. Time for home.

Next morning, freezing SSH sessions again. I had already informed everybody that there was going to be downtime from 7am this morning, so I got into work, ran the basic "is the network there?" test and got loopy results.

Into the 3rd floor comms room, plugged into the backbone switch and started checking packet counts. Actual packet counts were normal, but multicast packets as a proportion of all packets were off on one port. That port led to a switch (of the kind that had been playing up the previous days) that aggregated several areas of the network. Checking that switch's packet counts showed two ports with larger-than-normal multicast counts. I shut down the worst offender and things started to settle down - latency didn't spike, round trip times were closer to normal - all in all, a much healthier network.
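The check that caught it can be sketched as: compute each port's multicast share and flag the outliers. The threshold, port names and counter format below are made up for illustration - knowing what counts as "normal" for your own backbone is the hard part:

```python
def multicast_outliers(port_counts, threshold=0.05):
    """Given {port: (total_packets, multicast_packets)}, flag ports
    whose multicast share exceeds the threshold. The 5% default is
    arbitrary; 'normal' comes from knowing your own backbone."""
    flagged = {}
    for port, (total, mcast) in port_counts.items():
        share = mcast / total if total else 0.0
        if share > threshold:
            flagged[port] = round(share, 3)
    return flagged

counts = {
    "gi0/1": (120000, 1800),    # ~1.5% multicast: unremarkable
    "gi0/2": (115000, 31000),   # ~27%: the port worth chasing down
}
print(multicast_outliers(counts))  # → {'gi0/2': 0.27}
```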
Tracing that port to its other end, I found a very old Netgear wireless access point plugged into the port. Now, our wireless network is on a different physical network, and there should not be WAPs plugged into the normal wired network. This particular WAP should have been plugged into the socket next to it (marked with red tape).

I ran some tests on the WAP over the course of the day, and sure enough, it would go into a packet spewing frenzy for an indeterminate amount of time, then be normal for an indeterminate amount of time. It was dismantled with extreme prejudice and a hammer.

Lessons learned? Well, without knowing what was "normal" for the backbone switches I could have missed the elevated multicast packet count, so I guess more time checking traffic patterns on the backbone wouldn't go amiss.
And start with the backbone switches!
Oh yes, and bolt down the WAP ethernet ports so nobody can unplug them from the ports I put them in….

[postscript] Those switches that weren't passing packets on the uplink ports had "storm control" enabled, whereas others did not. Switch preparation procedures amended to consistently apply a known configuration.