Spot the DataRace

Looks like I need to take some more of my own medicine!  Here’s code I worked on over a year ago, and it took a year for the data race to finally come home to roost.  The following API is fundamentally flawed; no implementation of it can dodge the data race.  Here’s the deal: I’ve got a field which from time-to-time is reset to NULL, generally as part of some overall timeout-based cache-flushing policy.  It’s also in a time-critical function so I’m using lock-free accessing code with lots of comments on proper usage, Big Flashy Warnings, etc.  (The code to set the field uses locks, and isn’t shown here)

<span style="color: rgb(153, 51, 0);">&nbsp; &nbsp; private volatile Code _code; // Can be reset to NULL by cache-flushing thread at any time</span>

And the gratis accessor function:

<span style="color: rgb(153, 51, 0);">&nbsp; &nbsp; public Code code() { return _code; }</span>

And a some convenience functions to test various conditions.  Here’s one:

<span style="color: rgb(153, 51, 0);">&nbsp; &nbsp; public boolean is_c2_code() { return code() != null &amp;&amp; code().is_c2_method(); }</span>

And some code using this API:

<span style="color: rgb(153, 51, 0);">&nbsp; &nbsp; int nx = next_m.is_c2_code() ? ...next_m.code()... : ...blah...;</span>

Ok, everybody can spot the implementation race-condition – especially after the fact!  It’s easy in isolation although somewhat harder in a 0.5MLOC program.  The issue is in the implementation of “is_c2_code()” – I’ve called “code()” twice.  And very very rarely it changes between the loads of the underlying _code field flipping from not-null for the “code() != null” test and then to null for the “code().is_c2_method()” call.  Boom, instant NullPointerException.  Grumble, mumble, I go back and fix “is_c2_code()” to only read _code once:

<span style="color: rgb(153, 51, 0);">&nbsp; &nbsp; public boolean is_c2_code() { Code code = code(); // read _code ONCE<br>&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; return code != null &amp;&amp; code.is_c2_method(); }</span>

Crank up my test case, wait a few hours and … Lo!  The bug is NOT fixed!  Sure enough, under the right combination of pressure and temperature I blow up at the “…next_m.code()…” bit with another NPE.  And then the flaw with the API hits me:  if I call the “is_c2_code()” test, it tests one version of _code – and then when I call “code()” again in “…next_m.code()…” I get another load of _code which might hold a completely different value!  In fact, I need to do an entire conceptual operation from a single load of _code.  I can’t use any convenience functions unless they take the already-loaded _code field as an argument.  And with that epiphany I finally got it right (well, I hope anyways):

<span style="color: rgb(153, 51, 0);">&nbsp; &nbsp; private volatile Code _code; // Can be reset to NULL by cache-flushing thread at any time<br>&nbsp; &nbsp; public Code code() { return _code; }<br>&nbsp; &nbsp; // Convenience function: takes a pre-loaded Code value<br>&nbsp; &nbsp; static boolean is_c2_code( Code code ) { return code != null &amp;&amp; code.is_c2_method(); }</span>

And some code using this new API:

<span style="color: rgb(153, 51, 0);">&nbsp; &nbsp; Code code = next_m.code(); // Read it ONCE for the whole operation<br>&nbsp; &nbsp; int nx = Code.is_c2_code(code) ? ...code... : ...blah...;</span>

And that’s my data-race lesson of the week: API’s can have data races as well as plain old code.

And now my question for you, the gentle reader: how do I enforce this over time?  In the actual code base there are about a dozen convenience functions, all now carefully labeled.  All the use-points now do a single clearly labeled read, then pass the loaded value around in all the convenience functions.  But in 6 months, when Junior Engineer comes along and looks at this class and says “I need a new function that tests _code and returns a blah”…. he likely makes another data-race API mistake.  Is there something better I can do than just comment, comment, comment?

 

Thanks,
Cliff

Leave a Reply

Your email address will not be published. Required fields are marked *