A Pair of (somebody else’s) Concurrency Bugs

For this blog – a pair of (somebody else’s) concurrency bugs.  Well, more specifically spin-loop bugs…

Java Grande Forum – LUFact Tournament Barrier

The Java Grande Forum benchmark suite includes a parallel LU decomposition which uses a Tournament Barrier synchronization mechanism.  However, the JGF implementation is buggy in at least 2 ways.  In practice it works when there are fewer (or equal) worker threads than CPUs.  If there are more workers than CPUs, then it only works if the OS is polite (i.e. fair scheduling) and the JVM is dumb (no hoisting of constant array loads out of a spin-loop).  Azul’s JVM and Linux’s default CFS failed (succeeded?  smart JVM optimizations and aggressive throughput scheduling), and the combination can make LUFact hang indefinitely.  Really, I never saw it hang forever, but runtimes would be 2x to 10x worse than the best case.  Enough Gloom and Doom – so what’s the bug?

The core of LUFact’s Barrier mechanism is an evil little loop that waits until all the worker threads are done… by spin-waiting.  Spin-waiting works when you have more real CPUs than spinning threads, because the threads needing to do Real Work ALSO get on a CPU.  Spinning is typically very bad if you don’t have enough CPUs: if some threads are spinning on a condition, then they are making apparent work and using real CPU time to do that apparent work… preventing the Real Work from getting done.  Example time!  Suppose I run these 2 threads on a 1-CPU machine:

  Thread 1:  while (!flag ) /*nothing*/ ; return; // Spin, spin, spin until flag gets set...
  Thread 2:  for( i=0; i<LONG_TIME; i++ ) ...do Real Work...; flag = true; return; // work done!!

Now suppose the OS scheduler decides to give each thread 50% of the one CPU.  Then each time Thread 1 gets on the CPU and runs… it only spins.  On Thread 2’s turn it works its little butt off… but with only half of the CPU time.  Runtime for the total program is 2x the hypothetical best scheduling.  Instead suppose the OS decides to let a thread run until that thread blocks (e.g. on I/O) or exits the program, and starts with Thread 1… then Thread 2 might NEVER RUN and the program can hang.  However, on a 2-CPU machine both threads run full-tilt until Thread 2 gets done and flips the flag… so the program exits as soon as Thread 2 does all the Real Work.  Let me repeat that: ONE-CPU machine ==> program hangs;  TWO-CPU machine ==> program runs fine.
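Here is (roughly) the same two-thread example as compilable Java – a minimal sketch, with the flag made volatile (my addition, and a bit of foreshadowing for LUFact’s second bug below) so the spinning thread at least re-reads it on every trip:

  class SpinDemo {
    static volatile boolean flag = false;   // volatile: the JIT must re-load it every trip through the spin-loop
    public static void main( String[] args ) throws Exception {
      Thread t1 = new Thread() { public void run() {
        while( !flag ) /*nothing*/;         // Spin, spin, spin until flag gets set...
      }};
      Thread t2 = new Thread() { public void run() {
        double x = 0;
        for( long i=0; i<1000000000L; i++ ) // ...do Real Work...
          x += Math.sqrt(i);
        System.out.println("work done: "+x);// keep the Real Work from being optimized away
        flag = true;                        // work done!!
      }};
      t1.start();  t2.start();
      t1.join();   t2.join();
    }
  }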

Let’s make the example a little more realistic.  Suppose we want to divvy up a large pile of work amongst N threads and then signal when ALL the work is done.  Divvying up the work is a simple divide-by-N, but after each thread has done its 1/Nth share – it needs to signal & wait for all the other threads to be done.  SOMEBODY needs to notice when all threads are done and signal… and that’s the job of a Barrier synchronization.  So here’s a little program that has a barrier sync in the middle:

  volatile boolean barrier[];  // one 'reached the barrier' flag per thread
  void do_work( int tid/*thread id*/, int num_threads ) {
    int share = N/num_threads;
    for( int i=tid*share; i<(tid+1)*share; i++ )  // each thread gets a contiguous 1/Nth slice
      A[i] = fcn(A[i]);    // work on my share
    // Work Done!  Now comes the Barrier Synchronization
    if( tid != 0 ) {       // Thread zero is SPECIAL...
      barrier[tid] = true; // Reached the Barrier!  This thread (which is NOT thread zero) is done!
      while( !barrier[0] ) /* nothing */; // Spin until Thread zero reports done

    } else {               // Thread zero has more work to do than other threads...

      int done_threads=1;  // Thread zero spins until all other threads are done
      while( done_threads < num_threads ) { // Thread 0 spins until all threads are done
        done_threads = 1;
        for( int i=1; i<num_threads; i++ )
          if( barrier[i] == true )   // check each OTHER thread's flag
            done_threads++;
      }
      // At this point Thread 0 has noticed that all threads are reporting as
      // 'reached the barrier' and will now exit.
      barrier[0] = true;
    }
  }

This code is a simple version of what LUFact’s Barrier is trying to do, and it has TWO failure modes.  We’ve been talking about the first failure mode: threads spin when they are out of Real Work, and that spinning burns CPU time, taking it away from other threads that have Real Work.  Indeed, when I run this with 32 threads on a 16-way Intel Nehalem box I routinely see runtimes from 2x to 10x worse than the expected best case (achieved when there are exactly as many threads as CPUs).  The coders of the LUFact barrier were not so naive as this algorithm suggests; they added a Thread.yield() call in the middle of the spin-loops.  But…

Thread.Yield is Broken

But, alas, Thread.yield() is Broken.  In theory it is supposed to ‘yield’ the current running thread to some other runnable thread on the same CPU, letting somebody else do work instead of this thread spinning.  THIS DOES NOT WORK, for several reasons:

  1. There are zillions of spinning threads.  If the OS actually schedules another thread, there is only a small chance you’ll get one with Real Work to do instead of getting Yet Another Spinning Thread… which will simply call Yield again.
  2. All those threads calling Yield means Yield gets slammed… which means the OS thread scheduler gets slammed with constant context-switch requests.  You spend all of your time in the kernel context-switching.
  3. The OS likes to switch between a small number of threads per CPU (better cache locality, better throughput)… so if I have enough threads spinning then each CPU might end up switching between TWO spinning threads, each calling Yield… and never letting the last guy with Real Work onto any CPU.
  4. More realistically, if the box is doing large batch jobs and is otherwise busy, then the last LUFact thread trying to get Real Work done is throwing in his request for CPU time along with all the other LUFact threads… and all the other busy processes on this busy box.  If I’m running LUFact with 32 threads, then I have 31 spinning threads and 1 working thread… so the odds of the OS scheduler picking my one dude needing to work fall off to 1/32 (notice that the OS thinks I have 32 threads that are usefully doing work, because the OS has no clue that calling Yield means “this thread is waiting for somebody else to change something”).

All of these reasons have made Yield a sensitive, hot-button kind of call to JVM engineers… so there are lots of JVM switches you can throw to change how the JVM treats Yield.  It can be ignored (no context switch, -XX:+DontYieldALot), it can call the OS’s yield equivalent (a no-op on some OSes, a forced context switch on others… but with no guarantee that eventually all threads will get a CPU), or it can do a micro-second sleep (sometimes only after the N’th Yield call, or sometimes with the sleep-time ramping up across successive Yield calls).  In short, JVM engineers have fumbled around with the Yield call trying to figure out how to make it behave such that the common-case usage pattern – too many threads calling Yield in a spin-loop – does something useful.  Right now Azul’s JVM defaults to Yield being a micro-second sleep, but I’m leaning towards having it behave like Ethernet’s exponential-backoff protocol: successive Yield calls should sleep() longer and longer.  Since Yield has such a wide variety of behaviors… and flags to control those behaviors… and none of them seem to work very well in practice:

Please, Oh God, Please do not make your concurrency algorithm depend on Thread.yield!!!
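If you really must spin, here is a minimal sketch of the exponential-backoff idea I mentioned above, done in user code instead of relying on whatever Yield happens to mean today.  The class and method names (and the particular nanosecond bounds) are mine, not anything from LUFact or any JVM:

  import java.util.concurrent.atomic.AtomicBoolean;
  import java.util.concurrent.locks.LockSupport;

  class BackoffSpin {
    // Wait for 'flag' to become true.  Spin briefly (cheap if the wait is short),
    // then park for exponentially longer stretches so we stop burning a CPU that
    // somebody with Real Work could be using.
    static void awaitTrue( final AtomicBoolean flag ) {
      int  spins = 0;
      long park_nanos = 1000L;                // ~1 microsecond to start
      while( !flag.get() ) {                  // volatile-strength read; cannot be hoisted
        if( ++spins < 100 ) continue;         // pure spin for very short waits
        LockSupport.parkNanos(park_nanos);    // then sleep, doubling each time...
        if( park_nanos < 1000000L )           // ...capped at ~1 millisecond
          park_nanos <<= 1;
      }
    }
  }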

LUFact Barrier’s Other Bug:

They forgot to do any volatile operations in the spin-loops (or maybe the benchmark code predates the Java Memory Model).  All those threads spinning on when Thread zero completes?  Really they are spinning on ‘barrier[0]’ changing state… but this is a simple array load.  It’s obviously an invariant load in those spin-loops… so the JIT does the obvious thing and hoists the load out of the loop – making it an INFINITE spin-loop calling Yield.  BUT WAIT!!! I see the source code and there’s a volatile keyword on the barrier array!   Alas… this makes the load of the barrier array variable itself volatile… but not the load of any array element.  You cannot make arrays of volatile primitive elements in Java.  You can do volatile-like array loads using the Unsafe classes… but you cannot express the intent via plain bytecodes or the normal class-file format (never mind from Java), so no JVM-based language can do this directly.
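If you want per-element volatile semantics without dropping down to Unsafe, the java.util.concurrent.atomic array classes give you exactly that.  Here is a minimal sketch of the barrier’s spin-loops redone with an AtomicIntegerArray – the class and field names are mine, not LUFact’s, and like the toy example above it is single-use:

  import java.util.concurrent.atomic.AtomicIntegerArray;

  class VolatileElementBarrier {
    // One slot per thread: 0 = not arrived, 1 = arrived.  get() and set() on an
    // AtomicIntegerArray *element* have volatile semantics, so the JIT cannot
    // hoist the load out of the spin-loop.
    final AtomicIntegerArray barrier;
    VolatileElementBarrier( int num_threads ) { barrier = new AtomicIntegerArray(num_threads); }

    void await( int tid, int num_threads ) {
      if( tid != 0 ) {
        barrier.set(tid,1);                  // report "I reached the barrier"
        while( barrier.get(0) == 0 )         // fresh volatile read every trip
          ;                                  // (still a spin-loop -- see the Yield rant above)
      } else {
        for( int i=1; i<num_threads; i++ )   // Thread zero waits for everybody else...
          while( barrier.get(i) == 0 )
            ;
        barrier.set(0,1);                    // ...then releases them all
      }
    }
  }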

I smartened up the Azul JIT a little, and it removed the invariant load… and suddenly LUFact’s barrier spin-loops spin forever.  Bleah… stuck between dumbing-down my JVM vs failing to run a major public benchmark.

And now for something completely different:

Disruptor

Disruptor is this nifty idea: very low latency task handoff between threads in a FIFO queue (or sorta like StreamIt’s idea of a web of pipelines).  The core algorithm involves a ring buffer holding tasks and a version number to control when things are in (or out) of the buffer.  Threads spin to insert or remove items.  In theory it’s a fast elegant way to get multiple CPUs to parcel out high-volume small units of work (imagine reading network packets and making stock-market trading decisions in nano-seconds).
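To make the idea concrete, here is a toy single-producer / single-consumer version of the pattern: a pre-allocated ring plus a published sequence number that the consumer spins on.  This is my own stripped-down sketch of the technique, not the actual Disruptor API:

  import java.util.concurrent.atomic.AtomicLong;

  class ToyRing {
    static final int SIZE = 1024;                     // power of two, so we can mask instead of mod
    final long[] slots = new long[SIZE];              // pre-allocated "events" (here just longs)
    final AtomicLong published = new AtomicLong(-1);  // last sequence made visible to the consumer
    volatile long consumed = -1;                      // last sequence the consumer has finished with

    // Single producer: wait for room, fill the slot, then publish the sequence.
    void put( long value ) {
      long seq = published.get() + 1;
      while( seq - consumed > SIZE )                  // spin: don't lap the consumer
        ;
      slots[(int)(seq & (SIZE-1))] = value;           // write the data first...
      published.set(seq);                             // ...then publish (the volatile write orders the two)
    }

    // Single consumer: spin until the next sequence is published, then read it.
    long take() {
      long seq = consumed + 1;
      while( published.get() < seq )                  // spin until the producer publishes it
        ;
      long value = slots[(int)(seq & (SIZE-1))];
      consumed = seq;                                 // let the producer reuse this slot
      return value;
    }
  }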

What Goes Right

Disruptor is probably 10x faster than the equivalent code written with the JDK 5 BlockingQueue implementation.  That’s a lot faster.  It’s a little trickier to set up the boilerplate, but there’s clearly a substantial payoff.  Here’s another issue that didn’t happen for me:

(From the authors of Disruptor: ) “The Diamond example can, but not always, show up a serious issue with the Oracle JVM and how it causes false sharing to happen with its card marking algorithm for GC.”

(My reply) “Yeah, we filter cardmarks between generations.  Costs (very slightly) more in the no-contention case, but saves a ton when there are frequent cross-generation writes to the same cache-lines, i.e., when you are frequently writing new objects into an old Java array.  Happens to Doug Lea also, so he talked us into filtering card marks years ago.”

What Goes Wrong

  1. The Google Code testing harness mixes both Disruptor and BlockingQueue results together in a single timing number.  If you just run ‘ant throughput:test’ you get timings out… but they are the sum of the Disruptor times and the alternative BlockingQueue times.  Since the BQ time is often 10x slower than the Disruptor time, the reported results are 90% due to the BQ time.  In short, the “out of the box” experience with Disruptor performance is misleading.  Note that once the tests are pulled out of the ant script it is possible to run the Disruptor and BQ tests independently, and in any case the results are presented separately.  It’s only when run from the ant script in the default way that you get blended results.  Fix your ‘ant’ script, guys, so we can see the apples side-by-side instead of stacked together.
  2. The throughput tests are all done as classic bad MicroBenchmarks; I rant against these all the time.  See Caliper for a better answer.  In this case I saw Bad Code from the required OSR compiles, Bad Code from using ‘long’ types for the for-loop iterators, and the classic 1st-run-fast-2nd-run-slow syndrome.  Profiling ended up profiling the 90% case… which was the BlockingQueue implementation.  I hacked Azul’s JVM and managed to improve our BQ performance (yeah!) but I don’t think the goal of the Disruptor microbenchmarks was to help make BQ faster!  In any case I eventually figured out that I was running both benchmarks together, then figured out how to run the pieces independently of each other, and finally saw Disruptor running 10x faster than the BlockingQueue implementation.  I fixed many of the Bad Code issues in Azul’s JVM to be nice, but that same Bad Code will probably never occur in any real use-case of Disruptor.  I wish they had included realistic use-cases!
  3. Disruptor runs hard up against the limits of modern hardware… and its performance becomes extremely sensitive to the CPU placement of the working threads.

This last point will be harder for the Disruptor guys to address.  Basically, if the work producer and the work consumer land on the same server *socket* and share the same L3… then performance is Good.  If they land on different sockets and have to communicate by sending packets through the motherboard (Intel QPI links)… then performance is 1/3 of Good.  Let me repeat this: based on the vagaries of OS scheduling, performance varies by a factor of 3… and in quantum jumps at that.  (To put this in context: 3x slower than Good is still more than 3x faster than any alternative solution.)

I hacked their original benchmark to report performance (throughput) results once per second.  I ran Disruptor for a loooong time on a 16-way Nehalem box which was otherwise idle.  Fairly rapidly the JVM hits steady state – no more JIT’ing, GC is completely stable, etc.  By watching perfbar I can see 2 CPUs working furiously away… and I can tell which 2 CPUs they are.  Performance stabilizes at N ops/sec.  Then I perturb the system, and the OS schedules the working threads on 2 different CPUs… and the system gets stable again… but at a radically different performance level.  This happened repeatedly for me, across different JVMs and different Intel hardware platforms (different generations of CPUs, different numbers of sockets).  In short, the best performance requires the OS to schedule the worker threads on the same socket but not the same CPU.  This is a hard problem to solve in general, EVEN IF you own the OS “stack” and can modify the scheduler.  Basically the OS does not realize 2 threads are “talking” to each other via the cache hierarchy, and so it has no idea how to best schedule the threads on CPUs.  You get pot-luck on performance, with a variation of 3x.  The situation only gets worse when the pipeline has more CPUs or a more complex communication pattern.  Blah.

While we wait for OS’s to Do The Right Thing here, Java needs some kind of Processor & Socket Affinity calls.
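In the meantime, about the only knob reachable from pure Java is process-level pinning from the outside.  A hedged sketch, assuming Linux and that CPUs 0-7 happen to share one socket/L3 on your box (check lscpu first); note it confines the whole JVM and cannot place individual threads:

  import java.lang.management.ManagementFactory;

  class PinJvmToOneSocket {
    // Crude workaround until Java grows real affinity calls: shell out to Linux's
    // taskset and restrict this whole JVM process to one socket's CPUs.
    public static void main( String[] args ) throws Exception {
      String pid = ManagementFactory.getRuntimeMXBean().getName().split("@")[0]; // "pid@host" on most JVMs
      new ProcessBuilder("taskset", "-pc", "0-7", pid)  // assumed: CPUs 0-7 are one socket
          .inheritIO().start().waitFor();
      // ... now start the producer/consumer threads; they will stay on that socket ...
    }
  }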

Almost forgot – the connection to spin-loops?  I found lots of them when profiling BlockingQueue (including calls to Thread.yield), along with many hot calls to Park and Unpark (now a C2 JIT compiler intrinsic).  I also found one when profiling Disruptor… but only after I had first made some kind of JIT error.  The Disruptor microbenchmarks clean up at the end of a run by waiting in a spin-loop until the queues drain out… and if you make a JIT error the queues might never report as drained, so some CPU spins forever in a hot loop and becomes very visible to a profiler.  Spin-loops are OK for Disruptor, which pretty much by definition is busted if it runs without each thread having a dedicated CPU.

All my bugs are fixed now… and thus endeth my week of looking at spin-loops.

Cliff

18 thoughts on “A Pair of (somebody else’s) Concurrency Bugs”

    • Generally it doesn’t hurt because (as far as I can tell) people who need the Disruptor buy high-end fastest-clock machines… which have more cores than they are using. If they need ALL the cores AND end up spinning AND doing GC… then something will have to give. But from what I’ve seen so far, concurrent GC is a win and is working on otherwise-idle cores.

    • WRT GC, the Disruptor is designed to avoid medium-lived objects. There is some short-lived object creation that should never leave Eden, but most everything else should end up in the old generation. For example, the heart of the Disruptor is the ring buffer array, which is pre-allocated at startup and whose memory is constantly reused by producers and consumers (using the v1 Disruptor terminology). The idea is that the largest price of GC comes from stop-the-world compaction, so by minimizing the effect of that, the Disruptor can run for 3-4 days without seeing such a stoppage.
      LMAX actually manually restarts their production servers once a day to ensure they never approach the compaction threshold.

      • This exactly points out why you need to do card-mark filtering. The One Big Buffer for handing off objects is in old-gen, but all the objects are perfectly short-lived young-gen objects… so every time you write into the buffer you need to ALSO write a card-mark. This makes multiple CPUs endlessly write mark-bytes into the card-mark buffer… i.e., ping-ponging falsely shared cache lines. Filtering adds a perfectly predicted conditional branch, dodging an otherwise-guaranteed cache-missing write.

        Cliff

        • Are cache-missing writes really a concern? I’m probably showing my ignorance here, but I thought they were handled asynchronously without a performance penalty to the calling thread.

          BTW, Cliff, did you get an e-mail from me today? I tried using an Azul Systems address guessed from looking at others, and it hasn’t bounced back to me so I’m hoping it got through. If not, can you drop me a line at jallen at chariotsolutions.com? Thanks.

          • Now that I think about it, I suspect that behavior on a cache miss write is dictated by a policy, such as write-behind or write-through.

            • Generally there is no penalty for cache-miss writes IF there is time to settle out the write before another miss comes in. In this case, 2 threads will be endlessly writing to the same cache-line, producing cache-missing writes, which will eventually fill up the hardware buffers and start to stall the CPUs. This can become THE bottleneck for Doug Lea’s fork-join framework (for high-volume-enough task handoffs).

            Cliff

  1. I use “Disruptor-style” concurrency in C++, and you are absolutely right that affinity/numactl tweaks are essential. That’s usually not a problem for us, however, as it basically comes with the territory of highly latency-sensitive programming.

    As a general and related observation, it’s common in low-latency work to use concurrency in a somewhat unconventional way, purely for pipelining and keeping caches hot and not nearly as much for actually doing parallel work. (Core hops are cheap enough that even some sequential tasks can be made faster when pipelined across several cores.) Of course this relies completely on quick handoffs, hot caches and busy-waiting receivers, and it’s quite wasteful of computing resources from one perspective. On the other hand, it works very well. Disruptor-style concurrency is a natural fit for this model, but it’s hard to really go all the way in the JVM. The kernel scheduler can be convinced to cooperate with a scheme like this, but the JVM is a lot more stubborn.

  2. This reminds me of my Java Performance Report, Part V (published in 2003 but not available anymore): “LUFact, SOR, Matmult: (…) Unfortunately, the IBM JDK cannot run the SOR test at all (it loops with 100% CPU usage); the lock happens in busy-waiting synchronization code that apparently reveals that IBM’s optimizer is not respecting a volatile field tag. The busy-waiting mechanism is considered evil by many programmers (and it’s certainly a problem in J2EE containers), but sometimes it’s the best solution for the problem, except that performance is very dependent on the runtime’s and compiler’s behavior (for example, lightweight threads can lock busy-wait loops if scheduling is less than perfectly preemptive).”

    The barrier code, IIRC, is shared by these three tests. Back in 2003, SciMark2 was already old and creaky, but I was still using it because it was popular and the less-bad option for that kind of work. My analysis above was a bit amateurish, I didn’t know much about memory models before JSR-133 educated me about it… but I could notice the questionable concurrency code in JGF. Still, I “punished” the IBM JDK for entering an (apparently) infinite loop, when I probably should have commended that JVM for doing the most code optimization. Never too late to retract some injustice!

  3. Your blog is always educational for me. After reading your diagnosis of LUFact I finally understand C++’s seemingly insane definition of volatile, where

    some_type some_function()
    {
    volatile some_type x;
    ….
    x = barrier[0];
    ….
    }

    is defined as making barrier[0] volatile!

    As for Thread.yield(), maybe it should take an optional argument saying which thread you prefer to yield to. Of course it’s nearly impossible to change fundamental APIs like this any more. Adding affinity API is probably a better idea, but how can it be abstracted from the actual hardware configuration?

  4. Cliff, could you please expand on the 1st-run-fast-2nd-run-slow syndrome?  Usually the problem I see with bad microbenchmarks is 1st-run-slow-2nd-run-fast, after the JIT optimizes the benchmark.  What happens in the case you mentioned?

    • The provided benchmark runs two very large iteration counts around small loops doing producer/consumer with either BlockingQueues or Disruptor.  The first loop requires an “OSR” – On-Stack Replacement.  OSRs happen with no profiling except for loop trip counts (Azul profiles in the client compiler, which would start profiling the NEXT time the benchmark method is called… but there never is a next time), which is bad enough… but the 2nd loop is JIT’d with known-zero counts (because the first loop is still executing, you’ve never made it to the 2nd loop)… so the 2nd loop does not, e.g., do any inlining or other loop optimizations.  The 2nd loop’s code quality suffers relative to the first loop’s… and it doesn’t matter whether the 2nd loop is Disruptor or the BQ implementation; in all cases the 2nd loop runs with worse code.

      Cliff

  5. Hi Cliff,
    I am not sure if you have observed this during the benchmark tests of Disruptor, but IMO Thread.interrupted() is the greatest performance offender for the BlockingQueue ones.  I have code similar to Disruptor (a way simpler ring buffer, single reader, multiple writers) but w/o the busy-wait, just LockSupport.park/unpark.  Removing Thread.interrupted() improves the performance around 4(!) times on the Sun/Oracle JVM.  Even Thread.currentThread().isInterrupted() behaves better than Thread.interrupted(), which clears the flag.  The entire j.u.c is riddled w/ Thread.interrupted(), so that might degrade the performance of BlockingQueue while Disruptor doesn’t honor the interrupted flag.

  6. Heh… interesting… I of course profiled the bee-jeebers out of BlockingQueue (by accident).  Thread.interrupted did not show up at all, although Park and Unpark did.  Turns out Azul has Thread.interrupted (and friend isInterrupted) as intrinsics… but park and unpark were not.  They are now, and BQ performance went up another notch when I did that… but it’s still nowhere close to Disruptor.

    If you can send me your disruptor-like code in a handy-to-benchmark format, I’d be happy to run it on an Azul JVM and tell you where the time goes.

    Cliff

  7. Cliff, about Thread.interrupt: I was a victim of running w/ -client.  I realized it the next morning – I ran the code without realizing I was on a 32-bit OS… where -client is the default.  Actually I was genuinely shocked to see Thread.interrupt affecting it so badly; obviously -client HotSpot didn’t use the intrinsics.  HotSpot is supposed to use intrinsics there as well (reading the code), although the static Thread.interrupted() should have volatile-write semantics if it sets the flag.  I’ll prepare some benchmark you can run w/ just “java -jar”.
