Before I start: Now We Are Talking have censored every post I have made in response to any other post, even when it's a simple, factual reply pointing out where someone else, or the editor, has got it wrong. For example, the claim that the G9 would see all our infrastructure foreign owned, which isn't the case: there are nine companies in the G9, and a lot of them are Australian companies, run by Aussies, for Aussies. They can't post the truth when it conflicts with their "great australian company" line. Telstra themselves aren't "great australian", they are "great rubbish".
Back to the title of this post: a dedicated server that doesn't like data centres.
That’s right.
I own a server that I host OzVoIPStatus on.
A few months ago, the server was constantly crashing, and wouldn’t hold up nicely.
We swapped the HDDs, thinking they were the issue, but it turned out they weren't.
We kept chasing crash after crash, logging as much data remotely as we could, with the server automatically rebooting to avoid any noticeable downtime; that was done with the help of netconsole (also posted here).
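For anyone who hasn't used it, netconsole just needs the module loaded with the right source and target details, and something on another box listening for the UDP stream. A rough sketch of the sort of setup involved (the IPs, interface and MAC address here are placeholders, not our actual details):

    # On the flaky server: send kernel messages over UDP to a logging box.
    # Format: netconsole=[src-port]@[src-ip]/[interface],[dst-port]@[dst-ip]/[dst-mac]
    modprobe netconsole netconsole=6665@10.0.0.5/eth0,6666@10.0.0.10/00:11:22:33:44:55

    # On the logging box: capture whatever turns up (netcat syntax varies by version).
    nc -l -u -p 6666 | tee netconsole.log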
To save the massive cost of shipping a server back and forth between us for work they are perfectly capable of performing, I got the folks at Servers Australia, who did, and still do, a great job of building and maintaining it for me, to handle it at their end.
So, they pulled the server out and took it back to their office, and we decided on some stress testing to see what hardware was failing, whether we could replicate the issue outside the data centre, and, obviously, what corrective measure to take, e.g. throw money at the problem so it goes away.
Unfortunately, all the testing, done under a different OS, came back clean, with not an error to be found. That covered complete CPU and RAM testing, to demonstrate they could handle the load. We had already concluded the HDD was fine, after it was swapped with other HDDs when we thought it was the cause of the problems to start with.
They tested using Bart's PE, they tested installing other OSes, and they did a complete memory test.
All came back fine.
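For what it's worth, the same sort of burn-in can be done from a Linux environment rather than Bart's PE; a rough sketch, assuming the stress and memtester tools are installed (the sizes and durations are just examples, not what was actually run):

    # Load up the CPUs and churn memory for 24 hours
    stress --cpu 4 --vm 2 --vm-bytes 512M --timeout 86400

    # Walk over 1GB of RAM looking for bit errors, 5 passes
    memtester 1024M 5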
So we narrowed it down to the possibility that the hardware was fine, and the problem was the operating system I was running, CentOS.
It was started up at their office and, frustratingly, did something it couldn't do in the data centre: rack up 14 days of uptime.
Whilst all the testing was happening, they helped out greatly by finding a loan server for me to sit on while we worked these issues out and got the beast running again.
So, at that point we came to the conclusion that it's either environmental, or a conflict between the software running on the machine, the machine type, and the OS type: for example, running 32-bit mysql on a 32-bit OS on 64-bit hardware.
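Checking for that sort of 32/64-bit mismatch is quick enough; something along these lines (a sketch only, and the package names assume a stock Fedora/CentOS install):

    # What the kernel/hardware is running as
    uname -m              # e.g. x86_64

    # What the userland is built for
    getconf LONG_BIT      # 32 or 64

    # What architecture the MySQL packages were built for
    rpm -q --qf '%{NAME} %{ARCH}\n' mysql mysql-server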
It all seemed logical enough to say, well, let's make a change: swap the OS on the server and build all the software applications from scratch.
Last night, the server was also put into a different data centre, so we could be a little more sure it's not something localised.
I woke up this morning to a nice bright MSN window. I logged in, and it was up and running.
I looked at where yum was pointing, didn't like that idea, so I edited the core repo for Fedora to point at Pacific Net's mirror, and installed nano.
I then used nano to edit the other files in yum.repos.d, and went to install httpd, mysql, and mysql-server.
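The repo change itself is nothing exciting; roughly this sort of thing in the core repo file under /etc/yum.repos.d/ (the baseurl below is a placeholder, not Pacific Net's actual mirror path):

    [core]
    name=Fedora Core
    # point at a local Australian mirror instead of the default list
    baseurl=http://mirror.example.net.au/fedora/core/$releasever/$basearch/os/
    enabled=1
    gpgcheck=1

followed by the usual:

    yum install httpd mysql mysql-server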
Pressed enter, and ..
… the not so good news was that the SSH window bombed out on me.
So I reconnected, logged in, and executed the command again.
yum started processing my arguments and began sourcing the files it needed. Then the SSH window stopped responding again, and I thought, bugger, I so have to turn keep-alives on for this server.
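For the record, turning keep-alives on is only a couple of lines in ~/.ssh/config on the client side; the host alias and interval below are just examples, not my actual config:

    Host my-dedicated-server
        # send a probe every 60 seconds, give up after 3 missed replies
        ServerAliveInterval 60
        ServerAliveCountMax 3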
Unfortunately, trying to connect failed.
I started pinging the server. No response.
I spoke to another Servers Australia staff member to determine if they block traffic at the router in the new data centre, and we came to the conclusion that the packets I wanted to let through should have got through, so we immediately learnt..
.. the server had crashed again.
So, the long-running instability issues continue. Why? We don't know. How come? We don't know. We've run numerous tests on the hardware, and it's seriously looking like software, but when I don't have anything but a clean Fedora x86_64 install, one begins to question what on earth is happening if it's not hardware. Is it the racks? No, they are different racks. Is it the ethernet cable? Nope, it's got a different cable. Is it the neighbours? Nope, it's in Equinix now, in a different rack. Does it just not like being in a data centre? Who knows.
Hopefully soon I will get a picture of the console kernel panic (I didn't even get to turn on panic reboots, so it's sitting there, all crashed and idle :(), and with that, well, let's just hope it says a lot more than the silence and inconsistent behaviour we are seeing now.
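Turning panic reboots on for next time is only a sysctl away; a quick sketch (the 30-second delay is an arbitrary choice):

    # reboot 30 seconds after a kernel panic instead of sitting there idle
    echo 30 > /proc/sys/kernel/panic

    # and make it stick across reboots
    echo "kernel.panic = 30" >> /etc/sysctl.conf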
Data centres are "colder" than an office, so temperature just doesn't sound right, especially when the server is right on top of the air con and it's getting 14°C. We can't determine CPU temperatures yet.
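Once it manages to stay up long enough, lm_sensors should at least tell us what the CPU is actually sitting at; roughly this (whether it finds a supported sensor chip depends entirely on the motherboard):

    yum install lm_sensors
    sensors-detect     # answer the prompts and let it probe for sensor chips
    sensors            # dump temperatures, fan speeds and voltages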
It's not kernel versions either: 2.6.9 is different to 2.6.18, which I believe is what it was running on Fedora. It's such a confusing issue. The previous dumps suggested a CPU issue, but that wasn't replicated in the office, as you'd expect with faulty hardware in an environment with an elevated temperature.
If we can't get it to like data centres, we'll have to go with one of two options: run fibre from Sydney to my house, or, more likely :(, make some hardware modifications, like adding a Zalman CPU fan and seeing if anything changes. But that defies the complete testing above, where it was stressed to the max, so much so that any high-temperature issues should have been exposed rather quickly over the several days of testing.
The current situation is that it can't seem to stay up by itself, and isn't really fit for the intended purpose until it can at least hold its own in the racks at any of the data centres.
Mind you, the current server, which isn't as great as mine, is holding up nicely, with in excess of 30 days uptime, and hasn't needed a reboot or given me any issues really. That's what leads me to believe it's not a software issue, so we are probably best concentrating on what's happening in the racks.