Ever find yourself with a server whose kernel panics you can’t watch, wondering how the hell you can find the cause with no access to the console and no way of logging the panic to disk?
It just so happens that this has happened to me, and I was keen to get to the bottom of these bloody kernel panics. They are annoying.
You see, a server I have has been crashing at random intervals, with no log messages giving a reason for the crashes.
After getting tired of pestering the folk who run the rack for reboots, I investigated a way of making the machine reboot by itself.
After a little bit of searching, I found a way to make it reboot by itself:
edit the boot menu list (the list of kernels it can boot – /boot/grub/menu.lst) and add panic=20 to the kernel line.
Each time a panic occurs, the server won’t sit there waiting for someone to reboot it; it’ll simply reboot itself after 20 seconds. Much better.
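If you'd rather not hand-edit the file, here is a rough sketch of how the change could be scripted. The function name add_kernel_param is made up for this example, it assumes a GRUB legacy style file where boot entries start with the word "kernel", and it assumes the parameter is in name=value form. It only prints the modified file, so you can review the result before copying it over the real menu.lst.

```shell
#!/bin/sh
# Hypothetical helper: print a menu.lst-style file with a kernel parameter
# appended to every "kernel" line that does not already set that parameter.
#   $1 = parameter in name=value form, e.g. panic=20
#   $2 = path to the file (work on a copy, not the live /boot/grub/menu.lst)
add_kernel_param() {
    awk -v p="$1" '
        BEGIN { k = p; sub(/=.*/, "", k) }   # k = parameter name, e.g. "panic"
        # Only touch kernel lines that have no " name=" on them yet.
        /^[ \t]*kernel[ \t]/ && index($0, " " k "=") == 0 { $0 = $0 " " p }
        { print }
    ' "$2"
}
```

Something like add_kernel_param panic=20 menu.lst.copy > menu.lst.new would then give you a file to diff against the original. As far as I know the same timeout can also be set on a running system through /proc/sys/kernel/panic, but that setting does not survive a reboot.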
Now, we go back to the original issue, investigating the cause of the panics, when you aren’t able to look at the system, and the system panics randomly. Remote logging of panics. Fantastic idea, but how does it work? No idea. Until now.
The feature is called “netconsole”, and here is what you do.
Ping the server you want the panic logs sent to: ping -c 1 123.456.789.012
After the ping executes, type arp -a, and find the MAC address (for example 00:12:34:56:78:89) of the server you want logging to go to.
You need two machines in the same network (or you must have the other’s MAC address).
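If the arp table is long, picking the right line out by eye gets old quickly. Below is a small sketch that pulls the MAC for one IP out of arp -a output. The function name mac_for_ip is made up, and it assumes the usual Linux net-tools layout of "host (ip) at mac [ether] on iface"; other systems may print the table differently.

```shell
#!/bin/sh
# Hypothetical helper: read "arp -a" output on stdin and print the MAC
# address recorded for the IP given as $1.
# Assumes lines shaped like: ? (192.168.0.3) at 00:01:02:03:04:05 [ether] on eth0
mac_for_ip() {
    # Field 2 is the IP in parentheses, field 4 is the MAC address.
    awk -v ip="($1)" '$2 == ip { print $4; exit }'
}
```

Usage would be something like arp -a | mac_for_ip 192.168.0.3. If the entry has not resolved yet, the fourth field will not be a MAC address, so check what comes back before using it.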
With that information at hand, we can start logging those bloody panics:
On the logging machine, ensure you have netcat installed (yum install nc). Once it is installed, execute:
nc -u -l -p 6969
That starts netcat listening for UDP on port 6969. cat is a program for text capture and output; netcat does the same thing, except over the network.
On the crashing machine, while it is up, enter:
modprobe netconsole netconsole=
An example of this is:
modprobe netconsole netconsole=6969@192.168.0.2/eth0,6969@192.168.0.3/00:01:02:03:04:05
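Reading that example against what the kernel reports when it starts up, the fields appear to be, in order: local port, local IP, interface, then remote port, remote IP and remote ethernet (MAC) address. Here is a minimal sketch that assembles the string from those six values so you don't mistype a separator. The function name netconsole_param is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: build the netconsole= value from its six parts.
#   $1 local port   $2 local IP    $3 local interface
#   $4 remote port  $5 remote IP   $6 remote MAC address
netconsole_param() {
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}
```

With the values from the example, modprobe netconsole netconsole="$(netconsole_param 6969 192.168.0.2 eth0 6969 192.168.0.3 00:01:02:03:04:05)" should be equivalent to typing the line out by hand.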
Once configured, you’ll see some fantastically useful output on the logging machine:
<5> […network console startup…]
That means it’s started up. Looking in the bad machine’s /var/log/messages, you’ll likely also see:
crash-a-lot kernel: netconsole: local port 6969
crash-a-lot kernel: netconsole: local IP 192.168.0.2
crash-a-lot kernel: netconsole: interface eth0
crash-a-lot kernel: netconsole: remote port 6969
crash-a-lot kernel: netconsole: remote IP 192.168.0.3
crash-a-lot kernel: netconsole: remote ethernet address 00:01:02:03:04:05
crash-a-lot kernel: […network console startup…]
Obviously, crash-a-lot will be your own hostname, and the parameters will be those you choose.
You also need to ensure the UDP ports involved are forwarded and allowed through any firewall.
There are no noticeable load issues.
You need to remain logged in to the SSH session on the logging server to see the output, or use screen, or send the output to a text file.
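For the text file option, it helps to know when each line arrived, since the panic may happen hours after you stop watching. A rough sketch follows; the function name stamp_lines and the log file name are made up, and the timestamps come from the logging machine's clock, not the crashing one.

```shell
#!/bin/sh
# Hypothetical helper: prefix every line read from stdin with the local
# date and time, so captured kernel messages can be placed in time.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}
```

You could then run nc -u -l -p 6969 | stamp_lines >> netconsole.log inside screen and come back to it later. Some netcat variants take the listen port without -p, so adjust the nc part to whatever your version accepts.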
To make it permanent (on every boot), add this boot parameter to the boot menu list:
netconsole=6969@192.168.0.2/eth0,6969@192.168.0.3/00:01:02:03:04:05
Note that there is no “modprobe netconsole” in front this time; only the netconsole= parameter itself goes on the kernel line.
The configuration parameter can go at the end of the kernel parameters line, which looks like this:
kernel /vmlinuz-2.6.9-42.0.10.plus.c4smp panic=5 ro netconsole=6969@192.168.0.2/eth0,6969@192.168.0.3/00:01:02:03:04:05
And that will get you logging to the remote server. The remote server obviously must be listening on the UDP port, otherwise nothing will be captured.
Trying this between two servers on remote networks without a MAC address didn’t work.
It’s not a lot, but it’s a great start to pinpointing errors without going to the data centre to find out why.
Note: Modifying system files or executing commands on this page is entirely at your own risk. The information offered is of an all care taken, but no responsibility basis. You should consult a professional if you have any doubts as to the information on offer.