A Quick Update on Non-Blocking Hash Map

I just got an email from the owners of JCTools which is where NonBlockingHashMap currently resides.  The issue is interesting and subtle, and highlights a ‘spec bug’ in the Java CAS spec – crucial information the hardware returns is thrown away by the spec, and cannot be recovered by any means – which leads to this oddball situation where “removes can fail” due to a racing PUT.  The key snippet of NBHM is included in the bug report along with some interesting discussion.  My answer is below the bug report.

On 1/18/2018 2:17 AM, Nitsan Wakart wrote:

Verified the issue. Thanks for reporting and providing a test case.  Essentially the issue seems to be about this case:
https://github.com/JCTools/JCTools/blob/fe6d72a0f0ff0bd2c6412378c1231eef73539eed/jctools-core/src/main/java/org/jctools/maps/NonBlockingHashMap.java#L770-L770

<code>// Actually change the Value in the Key,Value pair
    if( CAS_val(kvs, idx, V, putval ) ) {
      // CAS succeeded - we did the update!
      // Both normal put's and table-copy calls putIfMatch, but table-copy
      // does not (effectively) increase the number of live k/v pairs.
      if( expVal != null ) {
        // Adjust sizes - a striped counter
        if(  (V == null || V == TOMBSTONE) &amp;&amp; putval != TOMBSTONE ) chm._size.add( 1);
        if( !(V == null || V == TOMBSTONE) &amp;&amp; putval == TOMBSTONE ) chm._size.add(-1);
      }
    } else {                    // Else CAS failed
      V = val(kvs,idx);         // Get new value
      // If a Prime'd value got installed, we need to re-run the put on the
      // new table.  Otherwise we lost the CAS to another racing put.
      // Simply retry from the start.
      if( V instanceof Prime )
        return putIfMatch(topmap,chm.copy_slot_and_check(topmap,kvs,idx,expVal),key,putval,expVal);
    }
    // Win or lose the CAS, we are done.  If we won then we know the update
    // happened as expected.  If we lost, it means "we won but another thread
    // immediately stomped our update with no chance of a reader reading".
    return (V==null &amp;&amp; expVal!=null) ? TOMBSTONE : V;
</code>

The NBHM is taking the view that if we lost the CAS we can return V which is the new observed value for this key. This is problematic here as several competing writes and removes to the same key can lead to reporting of the new value back multiple times.
In the write case the correctness of this is justified by the comment: “we won but another thread immediately stomped our update with no chance of a reader reading”. This is fine, but I disagree with returning the new value. Surely we should fake success by returning the old value in this case?
In the remove case this leads to the issue you observe where potentially multiple removes have failed due to a concurrent write and all report back the same value as if they removed it. That is clearly incorrect and can detected fixed by checking if the putval is TOMBSTONE when we failed the CAS. I think for a remove the correct response however is less clear. If we accept your assumption in the test (which seems valid), returning the old value or the new value is incorrect here, and returning null indicates the key was missing which is also incorrect. We can retry the remove until we succeed in removing or the key is gone (as some other concurrent remove was successful).
@cliffclick please review the issue and chip in?


Yeah, I went round and round on this one a bunch (toe to toe with Doug Lea, bleah).  The core issue (for me anyways), is that the Java Unsafe CAS returns a pass/fail boolean, instead of a “witness” – the value on which the CAS failed.  Intel CAS hardware returns the witness.  Given that another thread can immediately change the value again, the “witness” is forever lost – a flaw in the Java CAS spec, as far as I am concerned – and this is the root cause of this issue.

The comments above are only partially correct.

After the CAS fails, there is “V = val(kvs,idx);” – which loads a “best guess witness as to why the CAS failed”, but the actual witness is gone, gone, gone.

Having seen the CAS fail, having seen that we’re not rolling to a new table, what do to next?

The writer suggests “Surely we should fake success by returning the old value in this case?”

This is wrong, since on success your value would be installed, and since the value changed, your value MUST have been returned to some other CAS – and it is not.  Hence the current (failing CAS) thread believes it “handed off” ‘V’ to the table and another thread will pick it up and take responsibility for it.  That didn’t happen, and leads to leaking ‘V’.  You cannot return success here.

Returning the “almost witness” is actually correct, and also cheap (and also fairly close to some kind of “perfect” semantics”) and understandable – but it leads to this funny situation as noted by the author: “… and all report back the same value as if they removed it“.
Italics are mine.
No!  They report back “this is the write we failed on”.  Nothing got removed, since the CAS failed.  The remove failed.

Basically the removing thread is busy fighting with an installing thread, and the outcome is guaranteed fuzzy.
And if some other thread is trying hard to install new values, do you really care that the remove failed?

Instead the sentence could read:

“In the remove case this leads to the issue you observe where potentially multiple removes have failed due to a concurrent write and all report back the same value that caused all the removes to fail.”

Another course of action here might be to spin on remove (not in this function, but at a higher level) until something gets removed… which means some other writer thread which is busy furiously writing will eventually see the ‘null’ as the old value (which is not true if we report success on remove on a failed CAS).

Who should spin-on-failed-remove?

I voted that the library user should, but you could reasonably vote that the library itself should.

Cliff

HotSpot and Safepoints

Here’s a podcast on HotSpot and Safepoints – a method for allowing fast execution, highly optimized code and also bytecode-by-bytecode debugging.

Also see this written blog:
http://cliffc.org/blog/2015/02/22/how-does-java-both-optimize-hot-loops-and-allow-debugging/