I just spent the last day tracking down an annoying bug in a multi-threaded event-driven program. On rare occasions, after running for an hour or so, the program would get stuck. Examining the stuck process's state with GDB showed it was in a state I didn't think was possible. This pointed me in the right direction, but I didn't understand how it could possibly end up there. I first tried adding printf statements to trace the execution, but they generated tons of output (although in retrospect, tail probably could have solved that part of the problem), and they slowed the program down enough that the bug no longer happened. To solve this, I ended up logging data to a per-thread circular buffer. This was fast enough to still hit the bug, and I could then look at the last entries in the buffer to figure out how the program got stuck.
My program was written in C, and conveniently each thread already had a "state" structure to which I could add a buffer. The buffer ended up looking like this:
#define EVENT_LOG_SIZE 16

char event_log[EVENT_LOG_SIZE][100];
int next_event;
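
These two fields sit inside the per-thread "state" structure, which is what the con argument in the macro below points at. A rough sketch of that arrangement (the connection name and the placeholder fields are hypothetical; only event_log and next_event matter here):

typedef struct connection {
    /* ... the thread's existing per-connection state ... */

    /* circular buffer of recent log messages */
    char event_log[EVENT_LOG_SIZE][100];
    int next_event;   /* index of the next slot to overwrite */
} connection;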
I then used a macro to add an entry to the log:
#define LOG_EVENT(con, message, ...) \
    do { \
        snprintf(con->event_log[con->next_event], sizeof(*con->event_log), \
                 message, ##__VA_ARGS__); \
        con->next_event++; \
        if (con->next_event == EVENT_LOG_SIZE) con->next_event = 0; \
    } while (0)
Anywhere I wanted to log something, I could call this macro like a printf statement:
LOG_EVENT(con, "drizzle_state_row_read pkt: %d/%d v: 0x%02x",
(int) con->buffer_size, (int) con->packet_size, con->buffer_ptr[0]);
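
Reading the log back in order is just a matter of starting at next_event, which holds the oldest entry (it is the next slot to be overwritten), and walking forward with wrap-around. A small helper along these lines prints the entries oldest to newest; it assumes a per-thread connection structure like the sketch above, zero-initialized so never-written slots are empty strings:

#include <stdio.h>

static void dump_event_log(const connection *con)
{
    for (int i = 0; i < EVENT_LOG_SIZE; i++) {
        /* next_event is the oldest slot; walk forward and wrap around */
        int slot = (con->next_event + i) % EVENT_LOG_SIZE;
        if (con->event_log[slot][0] != '\0')
            fprintf(stderr, "%2d: %s\n", i, con->event_log[slot]);
    }
}

Because snprintf always NUL-terminates, every written slot is a valid C string, so the entries can be printed (or inspected from a debugger) without any extra bookkeeping.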
This was a very effective tool: the program still ran quickly and still triggered the bug, and it took me only a few passes to track it down and fix it. The best part about a trick like this is that the log data appears in a core dump, so if a crash occurs on some other system, you can still look at the last few events. I will definitely use this trick again.