Why WeakHashMap Sucks

Came across this surprise while working on a customer app.


Symptom: with Concurrent GC, the app performance slowly degrades over time, with more and more time spent in WeakHashMap lookups – until the load is removed (typically because all transactions start timing out).  Then suddenly the app becomes performant again, the WeakHashMap lookups are fast, etc … until slowly, the app performance starts to degrade again.  With the standard Stop-The-World fast parallel collector, the app performance is always fast (except during GC pauses, which is why the customer wants to use a Concurrent GC).


Root cause #1: The customer’s HashMap hashcode did NOT include the value hashes, but its equals call did!
Root cause #2: Failed “equals” calls in a WeakHashMap take time, and will typically cause a Concurrent GC to believe the Key should be kept alive for another cycle.


Let me explain some more:


The App: ships objects over the wire between 2 JVMs, via RMI.  Each RMI structure carries a Java object essentially encoded as a HashMap with Keys for fields and Values for field values.  I.e., since String has 4 fields, an object like the String “abc” would be encoded as a HashMap with 4 key/value pairs: (“value”, char[3]={‘a’,’b’,’c’}), (“offset”, int=0), (“count”, int=3), and (“hash”, int=0xab123ab).  The same type of objects are getting shipped around, so all these HashMaps have the same 4 Keys… and the same hashcode! (’Cause the customer’s HashMap hashcode ignores values.)  But the ‘equals’ DID compare against values, so 2 maps with the same keys and different values will have the same hashcode but will not be equals.
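A minimal sketch of that broken hashCode/equals pairing (the class name and field layout are my invention, not the customer’s code): hashCode is computed over the key set only, while equals compares keys and values, so two encodings of the same class always collide.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reconstruction of the customer's bug: hashCode ignores
// the values, but equals compares them.
class FieldMap {
    final Map<String, Object> fields = new HashMap<>();

    @Override public int hashCode() {
        // Keys only: every encoding of the same class has the same hash.
        return fields.keySet().hashCode();
    }
    @Override public boolean equals(Object o) {
        // Keys AND values: different field values means not-equals.
        return o instanceof FieldMap && fields.equals(((FieldMap) o).fields);
    }
}
```

Two FieldMaps encoding two different Strings share a hashcode but fail equals — which is legal per the hashCode contract, just catastrophic for any hash table they land in.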


These HashMap structures (suitably wrapped in a few more layers) are then stuffed into a WeakHashMap.  Alas, since all of ’em have the same hashcode the WeakHashMap turns into a linked list. This is a major bummer.  Lookups are forced to scan the linked list doing a full HashMap compare at each step.
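You can watch the degeneration directly with a toy key (my sketch, not the customer’s classes): give every key the same hashcode, count equals calls, and a single WeakHashMap lookup walks the whole chain.

```java
// Key whose hashCode is constant, so every entry lands in one bucket and
// the WeakHashMap degenerates into a linked list.
class CollidingKey {
    static int equalsCalls = 0;          // counts how many compares a probe costs
    final int id;
    CollidingKey(int id) { this.id = id; }
    @Override public int hashCode() { return 42; }  // everybody collides
    @Override public boolean equals(Object o) {
        equalsCalls++;
        return o instanceof CollidingKey && ((CollidingKey) o).id == id;
    }
}
```

Stuff a thousand of these into a WeakHashMap (holding strong references so GC can’t strip them) and a lookup for the oldest key burns roughly a thousand equals calls — the O(1) hash table is now an O(n) list scan.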


Now comes the ConcurrentGC trick: CGC takes time.  During this time the application is probing the table.  As the app sweeps down the linked list it forces every Key it touches to be alive during this CGC cycle.  Assuming a single lookup can finish inside of a CGC cycle, all Keys end up being forced alive every time.  Nothing ever gets removed.  CGC starts running slower cycles because the live set is growing each time a new HashMap is inserted into the WeakHashMap.


If we stop the app even for a moment a CGC cycle completes with no probes causing Keys to remain alive and all the dead keys get stripped out at once.  Immediately the table is mostly empty and fast again (and our live set has hugely shrunk making CGC cycles fast again) – and we can start probing it with little problem since the linked list is short and the odds of CGC catching us in the middle of the linked-list walking is low.  But slowly over time we fail to remove some dead keys, the list grows, and the odds of getting caught in the middle of a walk grow until it’s a surety.


Suppose we look at a Stop-The-World collector.  When a GC cycle halts things, likely we’re in the middle of 1 key probe – but no more.  All the other dead keys get collected and our live set does not grow.  App performance remains good (except for the obvious GC pauses).


Suppose we look at a generational collector; young-gen is STW but old-gen is concurrent.  Same problem arises if we’ve got any keys in the old-gen.  The time to sweep the old-gen is so long that we surely touch every old-gen key in the table, keeping all of them alive.   As the young-gen periodically promotes a few keys they “get stuck” in the old-gen and can never be collected until we stop probing during a full old-gen cycle.  Old-gen cycles get slower and slower as the live set grows, unless we stop the app even for a moment.


Suppose we keep going with our ConcurrentGC (and crap hashcodes) and the linked list grows without bound.  Eventually the time to walk the linked list (and call equals()) could exceed a CGC cycle (I say “could” because as the list grows CGC has more work to do, so it depends on the relative cost to walk what’s being kept alive vs the cost of an equals() call).  In this case, those portions of the linked list not walked inside the CGC cycle will be found dead and removed – i.e., there’s an upper bound to the list length, but it depends on the ratio of CGC-walk-time vs time-to-equals for objects on the list.  Paradoxically, a slower equals call will keep less stuff alive!

Suppose we correct the hashcode problem; the WeakHashMap turns into a real *hashtable* instead of a linked list.  Suppose also I work for a company with a ConcurrentGC which can cycle in a matter of seconds for almost any size heap… and thus the Weak keys in the WeakHashMap get stripped out every few seconds.  Then the WeakHashMap fails its intended purpose of being a cache (with lazy old-key cleanup semantics).  Really, the customer wants a cache with an explicit timeout on the entries.
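A minimal sketch of what that looks like — entries expire after an explicit TTL rather than at the whim of the collector.  The class name and expire-on-get policy are my invention, just to show the shape:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a cache with an explicit timeout on entries, independent of
// GC behavior.  Expired entries are dropped lazily, on lookup.
class TimeoutCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAt;            // wall-clock deadline, millis
        Entry(V value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }
    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlMillis;
    TimeoutCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    void put(K key, V value) {
        map.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
    }
    V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() > e.expiresAt) {  // stale: evict and miss
            map.remove(key);
            return null;
        }
        return e.value;
    }
}
```

The point of the design: liveness of a cache entry is a policy decision (how long is this worth keeping?), and encoding it as a timeout keeps it out of the GC’s hands entirely.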

I read the javadocs for WeakHashMap and they pretty clearly explain the fallacies… but maybe WHM was made when Concurrent collectors were a far-off dream, or could only cycle very slowly.  I think technology has caught up with WeakHashMap.



Pausing Transactional Memory Hardware

One of the more exciting trends in multi-core memory is the idea of Transactional Memory: memory which commits all of its state-changes atomically or not at all!  This idea is very powerful for coding concurrent algorithms, as it means a whole set of updates happen atomically.  Azul Systems felt this idea had so much merit that we included some Hardware Transactional Memory (HTM) support (as opposed to Software Transactional Memory) in our first generation product.


One of the issues with any TM system is livelock: transactions that continuously abort and retry (contrast this to programming with locks, which suffers from deadlock).  At Azul, we use our HTM for accelerating Java locks, so we can avoid livelock by resorting back to the original Java locks.  However, we’d still like to profile what happens during HTM attempts!


Hardware event profiling is a common enough profiling technique; the hardware triggers an exception after every so many “events” and the exception handler records the event in memory.  “Events” can be things like every-1000-L2-misses or simply time-based or tick-based events.  The key issue is that the profiling data is recorded in memory.


Now mix profiling with HTM: the profile data is also stored in the transactional memory!  If the transaction commits then all is well… but if it aborts then the profiling data is lost.  The CPU spent time doing work but there’s no record of what work was done.  Where does the time go in a failed HTM attempt?  I don’t know!  The hardware has helpfully unwound all that “speculative” work!  I’m back to the start of the transaction with all state reset… except for the endless march of time.


SO… what I want is a way to ‘pause’ transactional memory hardware – to allow memory updates to actually happen and not be queued awaiting the success or failure of the transaction.  Visible side effects inside a transaction, if you will.  Hardware which allows writes to participate in the transaction or not on a per-instruction basis would also work. 


This change (allowing some writes to succeed despite a failed transaction) does not have to destroy the notion of a ‘transaction’ at the higher language level – it just demonstrates the need for flexibility in any hardware TM support.


This article and the idea of ‘pausing’ a transaction are specifically placed in the public domain. 



Spot the DataRace

Looks like I need to take some more of my own medicine!  Here’s code I worked on over a year ago, and it took a year for the data race to finally come home to roost.  The following API is fundamentally flawed; no implementation of it can dodge the data race.  Here’s the deal: I’ve got a field which from time-to-time is reset to NULL, generally as part of some overall timeout-based cache-flushing policy.  It’s also in a time-critical function so I’m using lock-free accessor code with lots of comments on proper usage, Big Flashy Warnings, etc.  (The code to set the field uses locks, and isn’t shown here.)

<span style="color: rgb(153, 51, 0);">&nbsp; &nbsp; private volatile Code _code; // Can be reset to NULL by cache-flushing thread at any time</span>

And the gratis accessor function:

<span style="color: rgb(153, 51, 0);">&nbsp; &nbsp; public Code code() { return _code; }</span>

And some convenience functions to test various conditions.  Here’s one:

<span style="color: rgb(153, 51, 0);">&nbsp; &nbsp; public boolean is_c2_code() { return code() != null &amp;&amp; code().is_c2_method(); }</span>

And some code using this API:

<span style="color: rgb(153, 51, 0);">&nbsp; &nbsp; int nx = next_m.is_c2_code() ? ...next_m.code()... : ...blah...;</span>

Ok, everybody can spot the implementation race-condition – especially after the fact!  It’s easy in isolation although somewhat harder in a 0.5MLOC program.  The issue is in the implementation of “is_c2_code()” – I’ve called “code()” twice.  And very very rarely the underlying _code field changes between those two loads: flipping from not-null for the “code() != null” test to null for the “code().is_c2_method()” call.  Boom, instant NullPointerException.  Grumble, mumble, I go back and fix “is_c2_code()” to only read _code once:

<span style="color: rgb(153, 51, 0);">&nbsp; &nbsp; public boolean is_c2_code() { Code code = code(); // read _code ONCE<br>&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; return code != null &amp;&amp; code.is_c2_method(); }</span>

Crank up my test case, wait a few hours and … Lo!  The bug is NOT fixed!  Sure enough, under the right combination of pressure and temperature I blow up at the “…next_m.code()…” bit with another NPE.  And then the flaw with the API hits me:  if I call the “is_c2_code()” test, it tests one version of _code – and then when I call “code()” again in “…next_m.code()…” I get another load of _code which might hold a completely different value!  In fact, I need to do an entire conceptual operation from a single load of _code.  I can’t use any convenience functions unless they take the already-loaded _code field as an argument.  And with that epiphany I finally got it right (well, I hope anyways):

<span style="color: rgb(153, 51, 0);">&nbsp; &nbsp; private volatile Code _code; // Can be reset to NULL by cache-flushing thread at any time<br>&nbsp; &nbsp; public Code code() { return _code; }<br>&nbsp; &nbsp; // Convenience function: takes a pre-loaded Code value<br>&nbsp; &nbsp; static boolean is_c2_code( Code code ) { return code != null &amp;&amp; code.is_c2_method(); }</span>

And some code using this new API:

<span style="color: rgb(153, 51, 0);">&nbsp; &nbsp; Code code = next_m.code(); // Read it ONCE for the whole operation<br>&nbsp; &nbsp; int nx = Code.is_c2_code(code) ? ...code... : ...blah...;</span>

And that’s my data-race lesson of the week: APIs can have data races as well as plain old code.

And now my question for you, the gentle reader: how do I enforce this over time?  In the actual code base there are about a dozen convenience functions, all now carefully labeled.  All the use-points now do a single clearly labeled read, then pass the loaded value around in all the convenience functions.  But in 6 months, when Junior Engineer comes along and looks at this class and says “I need a new function that tests _code and returns a blah”…. he likely makes another data-race API mistake.  Is there something better I can do than just comment, comment, comment?
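One answer I’ve toyed with (a sketch only — the class and method names here are hypothetical, not the actual code base): make the single load the *only* way to see the field, by having the accessor return a snapshot object and hanging every convenience test off the snapshot.  Junior Engineer then physically can’t write a two-load test, because the raw field is never in scope.

```java
// Stand-in for the real Code class, just so the sketch compiles.
class Code {
    boolean is_c2_method() { return true; }
}

class Method {
    private volatile Code _code; // reset to NULL by a cache-flushing thread at any time

    // The ONLY way to observe _code: one volatile load, captured in a Snapshot.
    Snapshot snapshot() { return new Snapshot(_code); }

    static final class Snapshot {
        final Code code;             // may be null; loaded exactly once
        Snapshot(Code code) { this.code = code; }
        // Convenience tests live HERE, where only the captured value is visible.
        boolean is_c2_code() { return code != null && code.is_c2_method(); }
    }
}
```

Usage is one snapshot per conceptual operation: `Method.Snapshot s = next_m.snapshot();` and then every test and every use inside the operation reads `s.code` — a new convenience function added to Snapshot can only see the single captured load.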