KGDBoE - Debug Linux Kernel over Network
Linux kernel debugging can be painful. Finding a machine with a COM port on the motherboard to run KGDB can be tricky. Finding one with a JTAG port to do debugging directly can be near impossible (unless you're using an ARM-based development board). On the other hand, each computer these days has a network card that could be as good for debugging the kernel as a COM port. So why isn't it possible to debug your kernel through a network card? Well, now it is.
There was once a set of patches called KGDBoE that originated with the KGDB debugger before it was merged into the kernel itself, but they are not compatible with newer kernels and have problems on multi-core systems. So we took those patches, nailed down the issues, and made a new KGDBoE tool that works!
The new KGDBoE
The new tool is open-source and easy to use. Just build a kernel module without recompiling your kernel, load it on the target computer, and you can connect to the kernel with GDB. The source code is available under the GPL license, so you can even tweak it yourself.
A basic debug session
Once you've loaded the kgdboe module on your machine, it will print its load log along with instructions for connecting a debugger:
kgdboe: Trying to synchronize calls to eth0 between multiple CPU cores...
kgdboe: found owner module for eth0: pcnet32
kgdboe: IRQ 19 appears to be managed by pcnet32 and will be disabled while stopped in debugger.
kgdboe: hooking TX queue #0 of eth0...
kgdb: Registered I/O driver kgdboe.
kgdboe: Successfully initialized. Use the following gdb command to attach:
target remote udp:192.168.0.113:31337
target remote udp:192.168.0.113:31337
warning: The remote protocol may be unreliable over UDP.
Some events may be lost, rendering further debugging impossible.
Remote debugging using udp:192.168.0.113:31337
kgdb_breakpoint () at /build/buildd/linux-3.8.0/kernel/debug/debug_core.c:1013
Now you can debug your kernel just as if it were a normal user-mode application.
How it works
Linux provides a special netpoll API that allows polling a network card for incoming packets without enabling interrupts. The original KGDBoE used this mechanism successfully until multi-core systems became popular. The main reason the old KGDBoE patches do not work on modern multi-core systems is concurrency. Let's look at a common problem scenario:
- Core #0 is responding to a 'ping' request
- Core #1 hits a breakpoint in the kernel debugger
- Core #1 stops all other cores including core #0 so that the debugger can capture their state
- Core #1 tries to communicate with the debugger over Ethernet
- Core #1 cannot communicate with the debugger because core #0 is already using the network card
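The scenario above can be reproduced in a small user-space simulation. The Python sketch below is only an illustration, not KGDBoE code: threads stand in for CPU cores, `nic_lock` is a hypothetical stand-in for whatever lock the network card driver holds, and the timeout exists only so that the sketch terminates (a real kernel spinlock would simply spin forever).

```python
import threading


def debugger_can_send(core0_in_driver: bool, timeout: float = 0.2) -> bool:
    """Return True if core #1 can take the NIC lock after core #0 is frozen."""
    nic_lock = threading.Lock()         # hypothetical lock guarding the network card
    at_stop_point = threading.Event()   # core #0 reached the point where it is frozen
    resume = threading.Event()          # set when the debugger lets core #0 run again

    def core0() -> None:
        if core0_in_driver:
            with nic_lock:              # e.g. in the middle of answering a ping
                at_stop_point.set()
                resume.wait()           # frozen while still holding the lock
        else:
            at_stop_point.set()
            resume.wait()               # frozen outside the driver, lock is free

    worker = threading.Thread(target=core0)
    worker.start()
    at_stop_point.wait()

    # Core #1 hit a breakpoint, froze core #0, and now wants to talk to gdb.
    acquired = nic_lock.acquire(timeout=timeout)
    if acquired:
        nic_lock.release()

    resume.set()                        # end of the simulation: let core #0 finish
    worker.join()
    return acquired


if __name__ == "__main__":
    print(debugger_can_send(core0_in_driver=True))    # False
    print(debugger_can_send(core0_in_driver=False))   # True
```

When core #0 is frozen inside the driver, the lock is never released, so the debugger core cannot send anything. When core #0 is frozen anywhere else, communication works.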
This could look like a dead end, because it's hard to predict what resources the network card driver will require, but we actually found a solution that works surprisingly well with modern network card drivers. When attaching to a network card, the new KGDBoE scans the kernel to find the following information:
- The kernel module owning the network card
- The IRQ number registered by the owner module
- Any timers registered by the module
- The functions provided by the module to query hardware information from the network card
- Several device-wide and system-wide spinlocks used by the common network drivers
The relevant functions are patched on-the-fly so that KGDBoE knows if any of them are running on another core. If that is the case, KGDBoE will wait for them to complete before freezing the other cores. As a result it can guarantee that once all other cores are stopped, the network card driver invoked by the debugger will be able to use the network card without distractions.
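The "wait for hooked functions to complete" idea can be sketched in user space as an in-flight counter. The Python example below is an assumption-laden illustration rather than the actual KGDBoE mechanism: `DriverHooks`, `hook`, `wait_until_idle`, and `start_xmit` are hypothetical names, a decorator stands in for on-the-fly patching, and threads stand in for cores.

```python
import functools
import threading
import time


class DriverHooks:
    """Counts calls that are currently executing inside hooked functions."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0

    def hook(self, fn):
        """Wrap fn so that entry and exit are recorded (stand-in for patching)."""
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with self._cond:
                self._in_flight += 1
            try:
                return fn(*args, **kwargs)
            finally:
                with self._cond:
                    self._in_flight -= 1
                    self._cond.notify_all()
        return wrapper

    def wait_until_idle(self, timeout=None) -> bool:
        """Block until no hooked function is running; False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout)


def demo() -> list:
    hooks = DriverHooks()
    entered = threading.Event()
    events = []

    @hooks.hook
    def start_xmit() -> None:           # hypothetical driver entry point
        entered.set()
        time.sleep(0.05)                # pretend to talk to the hardware
        events.append("xmit done")

    other_core = threading.Thread(target=start_xmit)
    other_core.start()
    entered.wait()                      # the other core is now inside the driver

    hooks.wait_until_idle()             # breakpoint hit: wait before freezing
    events.append("cores frozen")
    other_core.join()
    return events


if __name__ == "__main__":
    print(demo())                       # ['xmit done', 'cores frozen']
```

This sketch does not prevent a new hooked call from starting between the idle check and the freeze. A real implementation has to close that window, and how KGDBoE does so is not covered here.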
We have tested the reliability of the new KGDBoE by setting breakpoints in some syscalls and letting it run the hit-continue-hit loop for more than 20000 iterations. Despite the hacky nature of the solution, it did not deadlock a single time, so you should be able to debug real problems pretty reliably.
In case your network card driver does something unexpected and still locks up, we have made a special safety mode that simply disables all cores except #0 until a reboot. Modern Linux kernels allow doing this programmatically, without restarting.
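For reference, the same effect can be achieved by hand from user space through the kernel's CPU hotplug interface in sysfs. The Python sketch below shows that interface only. It is not how the KGDBoE module itself disables cores (which is not described here), it needs root and a kernel built with CPU hotplug support, and `offline_all_but_cpu0` is a hypothetical helper name.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def offline_all_but_cpu0(root: Path = CPU_ROOT) -> list:
    """Write "0" to cpuN/online for every N > 0; return the CPUs taken offline."""
    taken_offline = []
    # Match cpu0, cpu1, ... but not cpufreq, cpuidle and similar directories.
    cpu_dirs = [d for d in root.glob("cpu[0-9]*") if d.name[3:].isdigit()]
    for cpu_dir in sorted(cpu_dirs, key=lambda d: int(d.name[3:])):
        cpu = int(cpu_dir.name[3:])
        online = cpu_dir / "online"
        # Keep CPU 0 running; skip CPUs that expose no hotplug control file.
        if cpu == 0 or not online.exists():
            continue
        online.write_text("0")
        taken_offline.append(cpu)
    return taken_offline


if __name__ == "__main__":
    print("Took offline:", offline_all_but_cpu0())
```

Writing `1` to the same files brings the cores back online, so unlike the safety mode described above, this manual approach does not require a reboot to undo.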
We want your feedback
We've tested the tool on all the hardware we could reach, but that's a tiny fraction of the devices supported by Linux. So if you sometimes do kernel debugging, go ahead, try it on your hardware, and let us know via our forums whether it worked for you. We'll update our compatibility list or modify the hooking mechanism if your driver turns out to need more resources.
If you want to read more about KGDBoE, check out the following pages: