To Clone or Not To Clone…

There’s this evil bit of the JDK involving NIO, where deep in the inner loop of a per-byte copy there’s a virtual call:

ByteBuffer bb = ...;
for( int i=0; i<N; i++ )
  buf[i] = bb.get(i);

No problemo, the JIT will inline the single target, right?
Suppose the abstract class ByteBuffer has a single implementing class called DirectByteBuffer (via the abstract class MappedByteBuffer).
Then the JVM can prove that any ByteBuffer must be an instance of a DirectByteBuffer (please pardon my not-quite-exact translation to java-pseudo-code):

ByteBuffer bb = ...;
for( int i=0; i<N; i++ ) {
  DirectByteBuffer tmp = (DirectByteBuffer)bb; // Cast is free!
  buf[i] = tmp.get(i);
}

And then ‘tmp.get’ gets inlined (along with a few of its friends) like so:

ByteBuffer bb = ...;
for( int i=0; i<N; i++ ) {
  DirectByteBuffer tmp = (DirectByteBuffer)bb; // Cast is free!
  if( i<0 || i>tmp.limit ) throw ...;
  buf[i] = tmp.unsafe.getByte(tmp.address + i);
}

And then the compiler does its usual magic with range-check elimination, loop-invariant hoisting, unrolling, etc.:

ByteBuffer bb = ...;
DirectByteBuffer tmp = (DirectByteBuffer)bb; // Cast is free!
unsafe ptr = tmp.address + 0;
for( int i=0; i<min(N,tmp.limit,buf.length); i+=4 ) {
  prefetch ptr+256; // some cache lines ahead
  buf[i+0] = *(ptr+i+0); // pseudo-code for unrolled loop
  buf[i+1] = *(ptr+i+1);
  buf[i+2] = *(ptr+i+2);
  buf[i+3] = *(ptr+i+3);
}
if( N>tmp.limit || N>buf.length ) throw ...;

And suddenly the inner loop looks something like the guts of ‘memcpy’, and performance is quite good.  Yay for JITs!

This also works if the single implementing class of ByteBuffer is HeapByteBuffer, yielding code very similar (after suitable levels of inlining, of course):

ByteBuffer bb = ...;
HeapByteBuffer tmp = (HeapByteBuffer)bb; // Cast is free!
if( tmp.position >= tmp.limit ) throw ...;
int init = tmp.position;
unsafe ptr = tmp.hb + tmp.offset;
int reg = tmp.position; // hoist field into register
for( int i=0; i<min(N,tmp.limit-init,buf.length); i+=4 ) {
  prefetch ptr+256; // some cache lines ahead
  buf[i+0] = *(ptr+reg+0);
  buf[i+1] = *(ptr+reg+1);
  buf[i+2] = *(ptr+reg+2);
  buf[i+3] = *(ptr+reg+3);
  reg+=4;
}
tmp.position = reg; // write register back into field
if( N>tmp.limit || N>buf.length ) throw ...;

Again the JIT straightens out a nest of twisted inlining to yield something close to memcpy.


What Happens When I Call Them Both?


But suppose now this single loop is called with both HeapByteBuffers and DirectByteBuffers.  Then the compiler can’t prove a single target class exists and doesn’t inline the original ‘get’ call.  Performance of the same loop plummets (usually by a factor of 10x or so).  For exactly this case, HotSpot includes a special optimization called bimorphic inlining (think: polymorphic inlining for the special case where “poly=2”).  If there are exactly TWO possible targets, and both are hot, then HotSpot will insert a runtime type test and inline both targets, roughly like so:

ByteBuffer bb = ...;
unsafe ptr;
int reg = 0;
if( bb instanceof HeapByteBuffer ) {
  HeapByteBuffer tmp = (HeapByteBuffer)bb;
  ...range checks, compute limit, other setup
  ptr = tmp.hb + tmp.offset;
  reg = tmp.position; // hoist field into register
} else { // else must be a DirectByteBuffer
  DirectByteBuffer tmp = (DirectByteBuffer)bb;
  ...range checks, compute limit, other setup
  ptr = tmp.address + 0;
}
for( int i=0; i<min(N,limit,buf.length); i+=4 ) {
  prefetch ptr+256; // some cache lines ahead
  if( bb instanceof HeapByteBuffer ) {
    buf[i+0] = *(ptr+reg+0);
    buf[i+1] = *(ptr+reg+1);
    buf[i+2] = *(ptr+reg+2);
    buf[i+3] = *(ptr+reg+3);
    reg+=4;
  } else {  // must be DirectByteBuffer
    buf[i+0] = *(ptr+i+0);
    buf[i+1] = *(ptr+i+1);
    buf[i+2] = *(ptr+i+2);
    buf[i+3] = *(ptr+i+3);
  }
}
if( bb instanceof HeapByteBuffer ) ((HeapByteBuffer)bb).position = reg; // write register back
if( N>limit || N>buf.length ) throw ...;

Voilà! Performance is (almost) back to what it was before.  Of course, the loop body doubled in size (and the array math is subtly different), and only half the code is ever used on any invocation.  And we have to do a type-check in the loop on every iteration (made cheaper by unrolling: the type-check now only happens 1/4th as often).


More Than Two Targets


What happens if we have more than 2 hot targets?  HotSpot stops trying, and falls back to the generic virtual call (and 10x performance lossage).  In theory, you can use an optimization called loop-unswitching – really loop-cloning – to clone a new copy of the loop per possible target of ‘get’ (no production JVM that I know of does this).  But why do I care?
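If the JVM won’t unswitch the loop for you, you can do it by hand.  Here’s a minimal sketch of that idea (the `Source` hierarchy, `ArraySource`, `ConstSource`, and `Copier` are all hypothetical names, not real NIO classes): test the concrete type once, then run a cloned copy of the loop whose call site only ever sees one receiver class, so the JIT can inline it.

```java
// Hand-rolled loop-unswitching: one type test outside the loop, then a
// cloned copy of the loop per concrete target, so each cloned inner
// call site is monomorphic and inlinable.
abstract class Source {
  abstract byte get(int i);
}

final class ArraySource extends Source {
  final byte[] data;
  ArraySource(byte[] data) { this.data = data; }
  byte get(int i) { return data[i]; }
}

final class ConstSource extends Source {
  final byte val;
  ConstSource(byte val) { this.val = val; }
  byte get(int i) { return val; }
}

final class Copier {
  // The "generic" loop: megamorphic once many Source classes flow through.
  static void copyGeneric(byte[] buf, Source src, int n) {
    for (int i = 0; i < n; i++) buf[i] = src.get(i);
  }

  // Unswitched version: dispatch once, then run a monomorphic clone.
  static void copyUnswitched(byte[] buf, Source src, int n) {
    if (src instanceof ArraySource) {
      ArraySource s = (ArraySource) src;              // cast is free after the test
      for (int i = 0; i < n; i++) buf[i] = s.get(i);  // single target: inlines
    } else if (src instanceof ConstSource) {
      ConstSource s = (ConstSource) src;
      for (int i = 0; i < n; i++) buf[i] = s.get(i);
    } else {
      copyGeneric(buf, src, n);                       // cold fallback: virtual call
    }
  }
}
```

This is exactly what the bimorphic-inlined code above does, just written out in the source instead of generated by the JIT – and it scales past two targets, at the cost of a cloned loop body per target.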


Turns out this kind of coding style has a number of interesting use cases: any time the iteration logic is complex (hence wants to be shared) but the inner-loop work is tiny.  This is true for graphics BitBlit routines, where you have lots of interesting edge cases as you work around the borders of some rectangle – but the work-per-iteration is usually something like: load a word from each of the 2 arrays, XOR them, then write back – or some minor variation thereof.  I.e., you’ve got a modestly complex set of nested outer loops, and you want a polymorphic bit of work in the inner loop.  If you only wanted a single kind of inner-loop work, you’d get that single target inlined and the JIT would Do The Right Thing.
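The shape in question looks roughly like this – a toy sketch, not real 2-D library code (`PixelOp` and `Blit` are made-up names): one shared set of loops over a rectangle, with the per-pixel work hidden behind a tiny interface.  That `op.apply()` call is exactly the one that goes megamorphic once more than two hot ops show up.

```java
// Shared iteration logic with a tiny polymorphic inner-loop op.
// With one or two hot PixelOp classes the JIT inlines op.apply();
// with more, every pixel pays for a full virtual dispatch.
interface PixelOp {
  int apply(int src, int dst);
}

final class Blit {
  static void blit(int[] src, int[] dst, int w, int h, PixelOp op) {
    for (int y = 0; y < h; y++) {             // modestly complex outer loops
      int row = y * w;                        // (border/edge handling elided)
      for (int x = 0; x < w; x++) {
        int i = row + x;
        dst[i] = op.apply(src[i], dst[i]);    // tiny inner-loop work
      }
    }
  }
}
```

The XOR case from the text is then just `Blit.blit(src, dst, w, h, (s, d) -> s ^ d);` – and every new op you pass in adds another target class at that inner call site.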


But what do you do when you’ve got lots of inner-loop targets?  The graphics guys didn’t wait around for the compiler-writers of the world to catch up; they voted with their feet.  They cloned all those loops, by hand in the source code.  [Yeah, yeah, yeah, I’m being somewhat hyperbolic here; I don’t know what really goes on in the 2-D libraries for Windows or Mac under the hood; this situation was true for graphics when I wrote games 20 yrs ago and I keep hearing it pop up now and again].


To Clone Or Not To Clone, That is The Question…


And now I have a good friend busy writing more complex iterator code (for parallel & concurrent iteration over complex spaces) with unknown – but typically small – inner loops.  He wants to write the code Once and let the JIT clone whole loops on an as-needed basis.  Alas, this JIT optimization seems unlikely to happen soon.  He’s now debating generating copies of his loops specialized to a single target instance on demand.  I.e., an already complex parallel iterating library is about to pick up some even-more-complex bytecode rewriting stuff in an effort to get decent performance out of simple loops.
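At the source level, that on-demand specialization scheme might look something like the sketch below (all names here – `ByteOp`, `ByteLoop`, `LoopFactory`, `Negate`, `AddOne` – are hypothetical; the real library would mint the specialized loop classes with a bytecode rewriter rather than hand-writing them, which is the whole point): a cache keyed by the op’s class hands back a cloned loop whose inner call is bound to that one concrete type.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

interface ByteOp  { byte apply(byte b); }          // the tiny inner-loop work
interface ByteLoop { void run(byte[] a, ByteOp op); }

final class Negate implements ByteOp {
  public byte apply(byte b) { return (byte) -b; }
}
final class AddOne implements ByteOp {
  public byte apply(byte b) { return (byte)(b + 1); }
}

final class LoopFactory {
  // Generic loop: one shared copy; goes megamorphic once many ops flow through.
  static final ByteLoop GENERIC = (a, op) -> {
    for (int i = 0; i < a.length; i++) a[i] = op.apply(a[i]);
  };

  static final Map<Class<?>, ByteLoop> CACHE = new ConcurrentHashMap<>();

  // One specialized loop per op class, built on demand.  Hand-written here
  // for two known ops; a bytecode rewriter would generate such a class for
  // ANY op, so each cloned loop gets its own statically-bound op call.
  static ByteLoop loopFor(ByteOp op) {
    return CACHE.computeIfAbsent(op.getClass(), k -> {
      if (k == Negate.class)
        return (a, o) -> {
          Negate n = (Negate) o;                      // final class: call binds statically
          for (int i = 0; i < a.length; i++) a[i] = n.apply(a[i]);
        };
      if (k == AddOne.class)
        return (a, o) -> {
          AddOne n = (AddOne) o;
          for (int i = 0; i < a.length; i++) a[i] = n.apply(a[i]);
        };
      return GENERIC;                                 // unknown op: eat the virtual call
    });
  }
}
```

Each distinct lambda above compiles to its own class, so each specialized loop body really does get its own profile and inlining decisions – which is what the generated-bytecode version buys you automatically, for op classes you’ve never seen.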


So: should he Clone his code?  Cloned not even in the source code (where you can at least look for & fix cut-n-paste bugs, tweak for performance on a per-loop basis, etc.) – but cloned automatically (with all the evil performance problems THAT implies – excessive repeated JIT compilation, cloning when there’s no point, bytecode bloat, impossible-to-figure-out Exceptions coming from machine-generated code)?  Or suffer the slings and arrows of outrageously bad performance, for what’s supposed to be an easy-to-use high-performance Java library for parallel programming?




PS – I got the yellow-jackets from last post with a 2nd gopher bomb, thanks for all the suggestions!