The Four Month Bug: JVM statistics cause garbage collection pauses

about | archive


[ 2015-March-24 16:56 ]

TL;DR: The JVM by default exports statistics by mmap-ing a file in /tmp (hsperfdata). On Linux, modifying a memory mapped file can block until disk I/O completes, which can be hundreds of milliseconds. Since the JVM modifies these statistics during garbage collection and safepoints, this causes pauses that are hundreds of milliseconds long. To reduce worst-case pause latencies, add the -XX:+PerfDisableSharedMem JVM flag to disable this feature. This will break tools that read this file, like jstat. Update: how I found this problem

Long version: There are few things I find more satisfying that solving a challenging bug, and this is the hardest bug I can remember. I've spent four months figuring this one out. The root cause is that on rare occasions, writing to memory that is a memory mapped file on Linux will block, waiting for disk writes to complete. This is surprising, since in the code it doesn't look like modifying some variable is related to disk I/O. I wrote a test program to demonstrate this behaviour, and I was able to cause pauses on multiple file systems: ext2, ext3, ext4, btrfs (extremely common), and xfs (less common). The pauses occur even if the I/O is to a separate disk, or if you call mlock. The only workaround I've found is to place the mmap-ed file in tmpfs (a RAM disk), or disable it completely.

I don't know exactly why the Linux kernel does this, but the pauses do not seem to occur for reads, so the Linux kernel is marking these pages read-only. A friend suggested this is the kernel's way of reducing the write I/O rate under overloaded conditions. If you know exactly why the kernel is doing this, I would love to hear it.

Since I've discovered this, a number of services at Twitter are experimenting with adding the -XX:+PerfDisableSharedMem JVM flag and seeing a significant reduction in worst-case JVM pause latency. I highly recommend testing this out if you run a latency sensitive service.

Symptoms

I first noticed some JVM garbage collection pauses that were taking more real time than user plus system time. Here is an example garbage collection log entry:

2014-12-10T12:38:44.419+0000: 58758.830: [GC (Allocation Failure)[ParNew: 11868438K->103534K(13212096K), 0.7651580 secs]
  12506389K->741669K(17406400K), 0.7652510 secs] [Times: user=0.36 sys=0.01, real=0.77 secs]

The last section of this line shows that this garbage collection took 770 milliseconds of real time (real=0.77). However, the garbage collection threads only consumed 370 milliseconds of CPU time (user=0.36 sys=0.01). The JVM uses multiple threads for GC, so normally the user plus system time is multiple times higher than the real time. This suggests the kernel has blocked the GC threads for some reason.

The cause is that the JVM creates an "hsperfdata" file in /tmp, with the name hsperfdata_(username), which is a memory mapped file that records statistics about safepoints and garbage collections. When the JVM threads are reaching a safepoint, they increment some counters in this memory. Occasionally the Linux kernel blocks those threads until it finishes some amount of disk I/O.

Reproducing the pauses

You just need to cause a lot of writes to a single disk while measuring the time to write to an mmaped file:

  1. Start a machine with at least 4 CPUs to make sure there is plenty of CPU time available (I tested this on AWS and Google Cloud VMs with at least 8 CPUs and 7 GB of RAM). Use the latest Linux kernel and distribution you want. I used Ubuntu 14.10 with kernel 3.16.0-31-generic.
  2. Copy the files from my reproduction Github repository; See the README there for additional instructions.
  3. Compile and run mmapwritepause or MakeGarbage (with verbose safepoint times).
  4. While running, cause a lot of writes with diskload.sh.

Related Issues