The most common data-race…

Here’s the most common data-race I see:

    if( *p != null ) {
// other thread does 'p=null'


This is just my personal observation, both on code I’ve written and code other people have written but I’ve had to debug.  This is the most common race, by far.


Annoyingly, the race can be removed by an optimizing compiler – sometimes.  That is, the standard common-subexpression-elimination optimization will remove the redundant loads of ‘*p‘ and thus remove the chance of seeing a null after the 2nd load.  I say “sometimes” because optimizers are never required to optimize, it’s just nice when they do.  The problem is that with modern JVM’s you have no control over whether or not this optimization is done, and it’s likely done just when your code gets hot…. i.e., just after when a high load hits it.  Up to then, the code is likely running in an interpreted mode or in a ‘fast dumb JIT’ mode, with minimal optimizations.


The other problem with this code, is that you are normally solving a Large Complex Problem already – and concurrency is your tool to get the needed performance.  In order to solve the Large Complex Problem (LCP) at all you’ve added a few layers of abstraction to hide it’s complexity – which inadvertently hides the complexity of the concurrent algorithm!  Suppose ‘*p‘ hides some cached code and you have a few accessors to make what access means (in the context of LCP) more obvious:

    if( method.has_code() ) 
        // other thread does 'method.flush()'


Aha!  Now you (don’t) see the problem… good software engineering of the LCP has obscured the racey concurrent code.  Even worse: ‘method->flush()’ is a rare event, so this race is even rarer… meaning you’ll only crash on a heavy trading day with the VP of engineering breathing down your neck and never in the development cycle!  Here’s an even more obscure version of the same wrong code:


    if( !method.has_code() ) 


We just set the code variable, so it’s still set when we read it again to execute it – right?  Wrong!  Or maybe ‘Almost but not quite!’.


What’s the fix?  Acknowledging that the concurrent algorithm is just as complex as LCP (perhaps less code than the LCP but generally more subtle code).  Expose the shared variables and concurrent accesses explicitly.  Make them part of the good software engineering solution that’s going on for the LCP already.

    Code *c = method.get_code(); // Read racey shared variable ONCE
    if( !Method.has_code(c) )    // Pass in Code instead of re-reading
      c = method.set_code(method.compile());
    c.execute();                 // Execute what was read ONCE



Leave a Reply

Your email address will not be published. Required fields are marked *