JavaOne is fast upon us.  It’s a big year for Azul; we have 4 talks this year.  I’m giving 3 and Gil Tene, our CTO, is giving the other one.  I’m mentioning his talk specifically because it “sold out” and JavaOne kindly gave him a 2nd session in a new larger room at an earlier time – which will now be cavernously empty unless you sign up!  The talk is very interesting (to me) – because it’s essentially the GC version of “How NOT To Write A Microbenchmark” – it’s full of specific advice on how to benchmark GC for large applications, and what traps to avoid when trying to write your own GC benchmark or testsuite.  Some observations: SpecJBB2000 was originally intended as a Java middle-ware benchmark, yet was specifically designed so that GC never happened during the time portion of the run.  Do you never see a GC cycle during the “timed portion” of your Java workloads?  (SpecJBB2005 came about partially to correct this).  <link to slides coming soon>


Debugging Data Races is a re-run of a BOF I gave last year, but the material is more relevant that ever.  Slides here.  I had a number of comments from people who basically said “you shouldn’t write datarace bugs in the first place” implying that (1) writing such bugs is Bad because they are so hard to find & fix, and (2) you wanted to write such bugs in the first place!  My advice from last year still holds, and indeed I’ve got more evidence than ever that the bugs I’m targeting are the common bugs that bite you, and what I’m proposing is very accessible (no Phd required!).


Towards a Coding Style for Scalable Nonblocking Data Structures is … well, hopefully the title is fairly self-explanatory.  I’ve talked this on this blog before; slides here.  The basic idea is a coding style that involves a large array to hold the data (allowing scalable parallel access), atomic update on those array words, and a Finite State Machine built from the atomic update and logically replicated per array word.  The end result is a coding style allowing me to build several non-blocking (lock-free) data structures that also scale insanely well.  All the code is available on SourceForge.  At this point I’m sore tempted to add transactions to the non-blocking hash table and essentially create an uber-data-grid.


JVM Challenges and Directions in the Multi-Core Era is where Brian Goetz & I rant about the lack of any decent concurrent programming language.  Slides here.  Java’s the obvious best State-of-the-Practice concurrent programming language, and there are plenty of better “academic toy” concurrent programming languages Out There. 
But none seem to tackle head-on the main problem programmers have with large concurrent programs: lack of name-space control over inter-thread communication, i.e. shared variables.  By default in Java, all global variables are shared with all threads.  Instance fields and code can be made private or public but you can’t limit access e.g. to specific threads.  For example: all your DB threads in the DB thread pool can run all the code in your crypto-lib, should you accidentally let them get a crypto instance.  No language or compiler support to throw some kind of early access-error.  Erlang and CSP deny all global variables; Fortress, X10 & Scala provide convenient ways to express some kinds of parallelism without shared variables, but amongst the recent more well-known crop of academic-toy languages I don’t see anybody tackling the access-control issue head-on.


All right I’ll stop before I start really ranting, but it’s a topic I’m passionate about. 
JavaOne promises to be an exciting time, come and see the show!



Earth Day

Today is Earth Day, a celebration of all things Earth-friendly.  As a Foolish Earthling, my family & I have decided to “put our money where our mouth is”: we’ve invested both in solar panels and in a CSA.


Regrid Power installed the panels, which arrived two months early.   What was slated to happen in June is already installed on our roof!  As of this writing we are eagerly awaiting inspection by PG&E before they can be turned on. The installation went fairly smoothly and is basically invisible from street-level.  We suffered one day of roof-noise and three days of a work crew (very nice guys, very respectful of our privacy and property) wondering around our backyard and roof.  The economic theory goes something like this: during the hottest parts of the summer the panels will be generating at their peak power; we’ll sell the excess back to PG&E at peak power rates.  During the evenings & winter season the panel won’t generate enough, so we’ll buy power from PG&E at the off-peak rate.  PG&E won’t actually give us any money but the income from the excess power will be used to offset our normal power bill.  This also means we’ll be able to run our A/C during peak hours with a much reduced bill.


Community Supported Agriculture is basically a collection of people directly paying the farmers in advance for the food they are going to grow.  We’ve signed up with Live Earth Farm.  For $30/week, we get about 15-20lbs of fruit & vegetables every week for 8 months.  It means the farmer gets a steady income and we get veggies often within 24 hours of being pulled from the dirt.  We’ve been loving the change in diet, we’re eating a lot of new (to us) veggies we would never have tried otherwise.  Another good sign: our twin apple trees have had plenty of blossoms this spring, and lots of bees wondering around them.  No sign of the yellow jackets that plagued us last year.  The kids & I are anticipating apples galore this fall!


And now for the technical bits: Azul gear is very power efficient for server hardware.  We strive for high throughput, which means high core count (so tons of the very slow memory ops in flight at once – a fast out-of-order X86 might get 3 or 4 cache misses in flight at once while we routinely get ~150 memory ops per socket in flight at once).  High core count means small simple CPUs; these simple CPUs have short simple pipes and are hard to clock at high frequencies – and less productive to clock faster (you’ll just hit the next cache miss quicker).  So our clock rate is modest – and this has a huge impact on power.  We aim for high throughput and as a bonus we get low power: about 65W per die when going full blast (we’re also using DDR3 instead of FBDIMMs, which is another substantial power savings).  Here’s another way to do the math: 65W/socket and 48 core/socket means ~1.3watts/core.  Not embedded-systems terrain but definitely waaaay low power for high-end servers!   Azul also buys the CO2 emissions offsets for every box we sell, both the manufacturing emissions and the lifetime operating costs.


My kids are a daily reminder to me that
There’s Another Generation Following Us.


Happy Earth Day to You,

I Went to Boston And All You Got Was This Lousy Trip Report

CGO was in Boston this year.  I wanted to attend the workshops, so I needed to arrive by 1pm Sunday – there’s no way to do that from the West Coast without an extra overnight stay or taking a red-eye.  I was fighting off a cold, so I took a generic night-time-no-cough-do-all pill just before takeoff.  JetBlue has the most legroom in coach and I got the exit aisle to myself: within 10 minutes I was laid out flat and snoozing.  This had to be the comfiest 5 hour flight I’ve ever taken. The winds blew in our favor, and we arrived a full 1/2 hour early: 4:30am local time in Boston.  I yawned my way down to the ground-transport level.

I lived in Boston briefly maybe 12 years ago; I remembered that it’s a nightmare to drive around in but the local subway (“the T”) is fast & easy.  I bought a week-pass on the T.  There’s a bus from the airport to the nearest T stop; I hopped on the bus figuring I’d have some Real Boston Experiences while trying to re-figure out the T. First snag: the bus driver informs us that the T doesn’t start running till 6am on Sunday!  Oops – looks like I got a hour wait in the airport.  While hanging out, Manish Vachharajani (figured out how to compress trace files using BDD’s, both massive compression and you can do many trace-file analyses directly on the compressed format) jogs up: he too was stymied trying to take the T.  We end up splitting a cab. 

The Omni Parker House claims to be the oldest continuously operating hotel in America.  Certainly the building is very old, but with that old-hotel charm (gilded ceilings, massive old chandliers, dark wood paneling everywhere, helpful-not-intrusive staff).  It’s smack downtown Boston; right between the old North Church cemetery (buried: John Winthrop, first governor, and other loyalists) and Granary Burying Ground (buried: Sam Adams, John Hancock, Paul Revere & revolutionaries).

I paid for maybe one dinner, AMD paid for another, Google paid for another.  I eventually got to use my T pass to get to and from the Google reception (during rush hour: so the subway was a crush but we beat the bus) and also to the airport, so the pass wasn’t a waste.

As usual for these trip reports, I skip some sessions and brutally summarize what I do attend.  Apologies in advance if I gave your paper short-shrift, feel free to correct me here.  I actually attended 2 workshops and the CGO conference proper:

  • EPHAM homepage -Workshop on Exploiting Parallelism with Transactional Memory and other Hardware Assisted Methods
  • STMCS homepage – Third Workshop on Software Tools for MultiCore Systems
  • CGO homepage – 2008 International Symposium on Code Generation and Optimization

Here’s the talks I attended, sorted roughly by how important I feel they are (and not in any time order):

I attended at least 8 more talks, who’s notes are not worth repeating here.  I got to see snow, slush and drizzle; hiked around in 30 degree weather with cars splashing salt/sand/mud over the cracking stone curbs, dodged rusty metal poles, wandered through mildew-covered crumbling buildings and graffiti covered subways and generally remembered all the reasons why I left Boston so long ago.  It’s a fun place to visit, but I’ll never live there again.


Long talk with Maurice Herlihy about his current work

He’s building STM solutions at a higher level of abstraction, e.g. adding transaction support to a hashtable; the underlying structure (hash table in this case) can be anything with reversable operations. He basically builds an ‘abstract lock’ lazily as the transaction progresses, by taking locks covering all the ‘keys’ used in the hash table. A collision on the abstract lock means a (statistical likely) collision on the actual transactional data, and he can roll-back patially executed XTN’s by using the reversable operation.

The actual covering locks are just a pile of striped locks in some large hash table, so contention can be made arbitrarily low.  Also, the underlying datastructure, e.g. a hashtable, is treated as a black-box. We could, for instance, add transactions to my NonBlockingHashMap fairly easily with this idea.


Private conversation w/Manish

How to speed up worklist/task-queues.

His approach (low latency focus): keep producer & consumer seperated from each other by more than a cache-latency. i.e., producer stuffs into a ring buffer, consumer pulls from ring buffer. Ring buffer needs to be sized such that producer & consumer are not ping-ponging lines during worklist push/pop ops. Clearly the cache lines ping-pong – but you want them to ping-pong once per cycle around the whole ring buffer – not once per each time you insert or remove. Arrange to slow down consumer if it “catches up” to producer, because if the consumer gets to close, he slows down (the already slow) producer by making the producer take more cache-misses while inserting. Same in reverse for slow consumers (but if you have slow consumers then work accumulates ad-infinitem and really you need to throttle the producers).

My approach (scalable focus): stripe queues ab-absurdem until all queues are of length 1, hence really just an array word holding either null or task. Push/pop requires scanning the array for empty/full slots. Real trick for me: proper sizing of the array. Huge scanning is expensive, but some scanning is OK.

Combo: big array, with parallel insert/remove threads. Want an auto-resize heuristic, and an auto-“back-off” heuristic. Auto-resize: need a bigger array if getting too full (producers can’t find empty slot).  In the make-bigger logic, can throttle producers (sleep until some reasonable # of empty slots appear). Need to make smaller array if consumers can’t find producer’s work: if typically end up scanning a larger fraction of array before finding work (it’s OK to scan thru a large empty array IF we’re following just behind a producer, so really scanning very little). Auto-“back off”: if see CAS failures for insert/remove then we’ve run up against another threads’ (could be consumer or producer) cache. ie., we’re making the other guy slow (and not being fast ourselves) – so “back off”. Either pick a different spot in the array to work, or sleep or grow the array.

Keynote: Vivek Sarker (now at Rice)

Need more emphasis one auto-parallization.
Some recent successes have improved common use cases a bunch.
Lots of cores now.  Lots of libraries for parallel languages; java concurrency, Fortress, X10, intel tbb, Cilk, MPI, .NET parallel extensions

So… compilers are “under seige” – the “old model” of: parse-to-IR, then optimize, then reg alloc, then code emit… isn’t working. No room in the IR for all the new parallel paradygms.

Vivek suspects a coming paradygm shift (me too as well).

Shows what breaks if we include parallism in the low-level IR: e.g. parallel CFG; the usual “dominator” relationship doesn’t hold; no dominator tree. But this analogy is flawed (in my mind) because the DATAflow still happens “as-if” the parallel blocks executed sequentially, only the control-flow is parallel.

But probably want some form of DATA-parallel use; admit that the usage is racey in the IR, and compiler is allowed to take any value that could reach – or even emit code that makes that choice at runtime.

Goes on to (attempt to) derive a parallel IR onstage in a keynote.

Long discussion followed; he promises to look at using the open-source HotSpot along with Jikes. More-or-less invited me to come visit Rice sometime and check out the grad students.

Keynote, ATI, Norm Rubin

Pixar: 100,000 min of compute per min of image

Blinn’s law: time-per-frame is constant. It’s 1000 machines (all the movie company can manage) and 100 min/frame (all the schedule allows). Every new movie, toss out all old machines & buy new: they are faster.  Faster machines lets the graphics become more realistic.

8 PCs; 8 groups of 64 lock-step SIMD threads = 512 threads.
Each individual instruction is a 5-way VLIW.
(VLIW packing rate varies from 3.4 ops/VLIW to 4.2 ops/VLIW).
Pipeline is fairly long, but it’s 64×5 = 320 fp-ops / cycle.
Loads block a thread till they resolve.

Each SIMD pipe has 256 registers; each register is 128 bits.  If each thread needs 5 registers, then 256/5 = 51 wavefronts can get into run-queue. 51 wave-fronts = 3264 threads per simd or 13056 running or waiting threads.  Roughly 262,144 registers or 1meg of register space.

GPU programming model: producer/consumer relation. Hundreds of pending GPU instructions.  GPU runs a loop 30-60 frames/sec; build vertex buffer, set uniform inputs, draw.

User writes: vertex shader, pixel shader. Nothing about parallelism, or number of cores, or sync. That’s all added by the GPU JIT.  No user program has a data-race; the language doesn’t allow it.

Open interesting problems in register allocation.

Robert Greismer

Google server farm, map-reduce.
read/write using ‘protocal buffers’
WorkQueue – scheduling system for cluster of processors. Tries to co-locate tasks near data (no network i/o, just disk i/o).
MapReduce – abstracts the distributed system

Filter code is auto-parallelized, and fed into aggegrators.  Not sure what’s new here compared to StreamIt on a grand scale.
Sawzall – language of MapReduce. Domain specific. Like graphics, very parallel- but no concurrent programming.

Aggregators can be clever: cannot keep all the data coming in; must reduce the data. Compute sums, averages, std devs, top N.
Domain-specific language to write aggregators; strongly typed and concise. Mostly simple, regular syntax, close to C. Specific basic
types (strings, time, fingerprints, table types); value semantics, even for structs (no pointers). Lots of string regex support.
Explicit “undefined values”; want to abort sooner rather than later; willing to test for explicit “undefined value”; tests are not
auto-inserted everywhere, but must have explicit test. At runtime, can switch a flag to make undefined values silently ignored (better
to ignore e.g. network corruption rather than crash).

Quantifiers in the language: “when (i: some int; P(a[i])) {…}” will iterate over array ‘a’ for all indices ‘i’ and run some filter ‘P’
first. 3 forms: each, some, all. Nest nicely. Seems similar to Fortress. Generates bytecodes, which are then JIT’d. Ref-counting GC
(value semantics, no cycles possible); uniform object model (all things are ptr’s – even int’s).

Cliff Notes: Hey! We *could* do some kind of decent sample-based profiling in the HotSpot Tiered World. Periodically have a sample-thread wake up and hack the inline-caches of the top hot methods to call their C1 equivalents. Have the C1 code patch back to the C2 code, but then run on collecting info (for 1 method invocation).  IBM has already shown that if the frequency of profiling is >0.1% then you get good-enough profiling info; we’d only need to hack the inline-caches rarely to get decent continuous C1 profiling info.

Hardware Acceleration for Lock-free Data Structures and Software Transactional Memory – AMD

AMD – Advanced Sync Facilitiy – not in silicon yet.
ASF limited to 8 cache lines.  Issue a set of ‘locked’ loads. ‘acquire’ starts critical section. Checks for consistency. Stores rIP & rSP for rollback. No other registers.  Then allow mods to the locked locations.  Finally ‘commit’ (or abort).  Use ASF for the 1st 8 lines of a TinySTM transaction, then switch to software afterwards.

Cliff Notes: More or less, this is AMD announcing HTM support in their lab: free (good) simulation available, but no silicon.  The hardwired ‘8’ is a lower bound: you are guaranteed you can atomically update at least 8 lines.  Several (most?) lock-free algorithms require atomic update of 2-4 unrelated addresses.  This is an architectural guarantee that such algorithms would work on all future AMD processors with this support.  No actual hardware has been announced.

Energy Concerns using HTM
Maurice Herlihy

No space in a embedded to do a full STM or even a hybrid TM; so expect multiple cores in an embedded system with a simple HTM.

Basically, abort & retry costs power.

MP-ARM – has a 512b transaction cache (fully associative), plus an 8K L1. 128B scratchpad. On an abort, the CPU rolls-back the registers from the scratchpad & enters lower-power mode. It wakes up after a random interval & retries.

T-Cache costs them 20% more power – but make it up with shorter running time. Generally a win for apps with contention (less total power than locking & serializing).

Cliff Notes: Basically it’s ARM’s version of HTM

Synchronization Aware Conflict Resolution for Runtime Monitoring Using Transactional Memory

Use TM to detect dataraces. AUTO-insert TM begin/end around regions doing conflict detection. Tried it per basic-block in SpecInt – too expensive. Tried to make a TM per 8-12 basic blocks; then overhead is down to 10% or so. (Cliff Notes: What TM are they using? Since the TMs are small, they assume hardware TM which explains the low 10% overhead).

Turns out auto-inserted TM causes lots of live-lock issues.  He demos several live-lock scenarios with reasonable looking TM hardware & spin-locks.

So they spot spin-lock patterns and do not auto-insert TMs around spin-locks… no they make the hardware spot spin-locks and break the TM in a way that makes progress.

Automatic Array Inlining
Christian Wimmer

Uber object-inlining, for arrays. Great for short fixed-length arrays, ala what comes out of Reflection.

Really it’s object co-location, which the GC preserves OR will tell the JITs to recompile.

  1. phase 1 – decide what to co-locate (profile in C1); tell GC
  2. phase 2 – GC co-locates all such things; tells JIT
  3. phase 3 – JIT recompiles assuming co-location.

Also JIT must detect that objects are *co-allocated*, and not modified later (implicit final fields: detect changes with field guards).

This technique is easy for known fixed-length arrays: it’s just like inlining an object. Also works for unknown-length arrays that are not changing; the “tail” of the inlined-object-set isn’t known, but isn’t needed.

Can handle changing the inlined array field itself.  Can fold the array length check into the array field guard: set the old array length to 0 (the old array is ONLY pointed to by the original object, this is a precondition to inlining, so no other object/thread can see the length change).  If the range-check vs 0-length fails, 1st go to some slow code which re-checks after loading the proper field.  Eventually the GC resorts/realigns, and then the fast code works again.

Also can handle having remote /shared pointers to the inlined array, even with hacking the array length: have 2 array lengths. The “normal” length used by non-inlined code, and “inlined” length used only by inline-smart code.

Also can do inlining objects into arrays: first lay out the array, then have the GC place the objects following the array in order.  Alas, harder to detect changes & harder to detect setup conditions.  They gave up here – but seems to me you could do it if GC tells you the objects are unshared (only pointed-at by the array), and you catch all array-writes to that array (by tagging the array as read-only after GC does the layout, perhaps placing it in a read-only page).

Get 20% speedup on some SpecJVM98’s.
Alas no speedup on DaCapo. 🙁
Very surprising to me that no speedup on DaCapo; I would expect the String co-location to be worth alot – aha he’s stuck with the old JDK strings which share char arrays, so it doesn’t work for him.  I predict a ~10% DaCapo speedup on Azul based on the Strings optimization.


Azul looked at this earlier and declared it a good-idea for our next-gen. Basically, branch-on-random powers-of-2, i.e. either 50% or 25% or 12.5% or …, etc.  Use it to lower the costs of sampling.  Typical sampling cost uses a counter and every N (e.g. 1024) executions we take a heavy-weight sampling pass. But there is a cost to test&dec the counter, uses an extra register, extra control flow, extra cache, extra instructions.

‘brr’ is just a conditional branch, but the condition is a frequency. Only need 4 bits to cover all reasonable use cases. Low hardware requirements (a LSFR, total maybe 100 gates including mux’s); fits well into existing pipelines. Can be made tolerant of branch mis-predict (i.e., can get the info very early in the pipe).

Multi-core Thread-level Speculative Parallelism

Discovering lots of speculative parallism opportunities in sequential code. Lots more if you are willing to do large code transforms.
Ex: while-ptr-chaining-loop finding the min.
Ex: nest the above loop in a worklist which pulls the min element, then tranforms the list (ala a standard worklist algorithm).

Use value prediction to allow thread-level speculation. Basically, start a parallel loop at a random value-predicted pointer location.
After a few iterations either crash or line up with some linked-list body. Actually, capture some value on the 1st time through the inner
loop; on later iterations thru inner loop have some good choices for prediction. There’s some decent auto-load-balancing going on; each iteration of the inner loop records enuf info to allow a good (re)partition for the next iteration of the inner-loop.

Technique does not only do linked-lists; needs any double-nested loop. Inner loop gets parallized, outer loop allows the inner loop to be continously ‘profiled’ and ‘load balanced’. Really can parallize other ‘hard-to-parallize’ inner loops. Would work well in Java; I probably could do it in C2 with enuf effort.

Analyzing perf diffs in JVMs

Using “calling context tree” – similar to our 8-deep call-stack capture.

Doing ‘diffs’ on these trees. Diff’ing based both on structure, and on edge weight. Doing things like “adjusting” the weight of top-nodes in a tree so 2 trees match in weight, then looking at diff’s farther down the trees. Also looking at weight of normalized subtrees, etc.

Comes up with a fairly ‘diff’ like interface (cmd-line) to compare 2 forests of trees and highlight the diff’s.

Able to track down differences in 2 websphere/trade6 setups.
Discovered differences between timezone settings between AIX & ZLinux, and also DB setups.

PiPa – Pipeline Profiling
(best student paper)

Faster tool for doing e.g. cachegrind. Parallize instrumented application. In SuperPIN – idea: main app just forks off another thread periodically to do the instrumentation work. (how is this different from jamming ops/time into a buffer, and having background
threads pick up new buffers & gzip ’em).

Idea: process the background buffer in another thread; do full analysis on the buffer(s) in parallel – in real time with the app running.

Demo running a cache-simulator, in “real time”.  Using double-buffering, large shared buffer.  Pretty obviously equivalent to Azul’s tracing mechanism, except she’s doing a cache-sim instead of a gzip.  Can run the original program with 3x slowdown, but get a full cache sim.

Prediction & Trace Compression vs Nested Loop Recognition

Looks cool: find simple loops from traces; loops are simple linear algebra. Imperfect loops allowed. From loops, find nested loops.  Get massive compression ratios vs bzip & other trace compression techniques. Very simple to decompress; just “execute” the loops.

Fast Liveness Checking in SSA form

Answer based on demand: is var live here?
Faster precomp, slower queries.  Info depends on CFG & def-use chains
Invariant to adding & removing instrs; sounds useful for reg alloc adding & removing spill ops.

For HotSpot, liveness computation is probably 10% of total compile time.

Compute ‘highest point’ in dom tree where query point is dominated by def point. Compute all-paths dom-vs-each CFG node

– Compute transitive closure over reduced CFG (simple DAG). Simple post-order traversal
– For each block q, compute highest block (highest backedge target). Simple pre-order & post-order traversal.
– Arrange for highest node to be first set bit in trans closure

Query: goto highest block; see if def dominates highest block.
Looks fast & simple. Might speed up reg-alloc.

Nearly Optimal Instruction Selection on DAGs

Stupidly slick: tile the DAG once, just the way HotSpot does right now – except allow sharing always (right now HotSpot does “CSE” on sharing, forcing shared values in registers – except for small expressions that are common in address-arithmetic).

Then look at each place where a shared value was subsumed into larger instructions (hence replicated in the instructions). For just those places, see what the local cost would be if you did not share. If the cost is better to not-share, flag the node as getting it’s own register result. Then re-run the DAG tiling, but living with the shared results from the 1st pass.

Ex: ((a+8)*(b+8)). Should we share the ‘8’ or not?
Pass 1:
t1 = a+8   // 5 bytes
t2 = b+8   // 5 bytes
t3 = t1*t2 // 1 byte

But if ‘8’ goes into a register:
t0 = movlo 8 // 5 bytes
t1 = a+t1    // 1 bytes
t2 = b+t1    // 1 bytes
t3 = t1*t2   // 1 bytes

Reduces code size by 2-4% nearly always, and very close to optimal. What hotspot does is essentially what he calls ‘cse-leaf’ in paper. His NOTLIS does somewhat better.
Cliff Note: I did a bunch of Azul-specific hackery in our DAG share-vs-clone heuristic to try and get what he gets naturally. – grad’ing soon

Exception Check Motion for Java

  • Run aggressive PRE, ignoring exceptions checks
  • See if can move checks to allow the PRE
  • Undo any PRE motion where the checks can’t be moved

Got ~2% on production compiler – which confirms my suspicion that there’s not much more to be had there. Mostly moving null-checks around – and HotSpot is already very aggressive about moving & removing null checks.

His examples HS will all get another way; loop-peeling and Split-If transform will wipe out all his gains.

Another unsafe code-motion talk for Java.

He’s got better examples: loop-peeling won’t wipe out his first example (but versioning will).

Safety checks turned into data-dependence, just like HS

heh, another talk that’ almost getting as good as HS has been for the last 10 years.

Then convert data-dependency safety checks into IA64 predication bits and happily predicate speculative loads. Seeing 4% for just the PRE based motion (on SpecJVM98), another 2% for doing hardware predication & speculation bits.

Towards Fair, Scalable Locking
Barcelona Supercomputing & Microsoft

Scalable reader-writer locks (something I’ve thought about alot in the last year). Want fair (no starvation of either readers or writers). Want fine-grained striped locks (? specific to different data types)

Demo’s starvation in a HTM system similar to Azul – eager updates can mutually kill each other.  Demo’s cache-directory-reservation can lead to deadlock.

So far seems like a more complex version in hardware of something that’s easy enough in software. It’s only partially fair (not FIFO). Preserves proportions of readers & writers.

Using STM. Overhead looks very bad.

Runtime Tuning of SM Validation Techniques
Rice University

STM validation is expensive; no one technique wins all cases. Use runtime to dynamically change validation scheme. Periodically (every 5% of TM attempts) run the same TM with 1 of 3 different validation techniques; if the new technique is working better than the old one, then keep it – otherwise immediately revert back to what was working well.

Transactional Support for Dynamic Languages

Looking at Jython – get to use the fast JVM.  To make Python bytecodes need some huge bad wrappers around every little variable access. This makes Jython as slow as regular interpreted python.

What we’d like to do is skip the fancy python lookup and use plain Java variables – except that we want to keep the good python dynamic semantics. Would “lookup x in y” to be as fast as a local variable reference.

So use transactions to test the pre-conditions & use fast specialized code otherwise. i.e., execute inside a transaction & test preconditions.

Use best-effort HTM. Add a few specific speculation primitives to JVM (speculate, abort, commit). Jython generates both specialized-case fast-code and also full semantics. Use HTM to run the fast-case. If that fails, fall back to the full semantics bytecodes.

Cliff Notes: Sun Labs guy noticed that this technique requires abort info from the hardware (was it capacity vs remote thread) – does Rock TM lack the failure-mode data?

User-defined Priority Based Transaction Contention Management

Really, this is poster case for why TM’s will not save the world.  He’s proposing users can put some kind of priority on transactions – and higher priority transactions can “barge” over lower priority ones. What a mess!!!

Map-Reduce for GPU’s

More general term than Google’s map/reduce.  Really its stages of independent computation, followed by communication.

GPUs are great for doing ‘map’ – lots of GPUs, lots of bandwidth. But only local communication is cheap; global communication is very expensive. Also, reduction on GPUs is well known to be very difficult. Lots of PHDs going into deciding how much unrolling is best, or how best to split up computations. Sometimes see 60x slowdowns if the best-reduction for one dataset size is used for a different dataset size.

User writes a MAP function in CUDA (NVidia’s C+GPU operaters).  User’s map function outputs results in a per-thread local-store. User writes a binary associative reduction operator – takes 2 inputs and returns 1; and his framework will restructure the reduce step for best parallelism. (also needs an identity value, in case some MAP results are filtered out)

Analysis of CUDA Programs

auto-detect race conditions (detects potential dataraces!), detect bank conflicts (extreme poor memory access patterns)

CUDA abtractions: kernel functions, threads, scratchpad ram

Cores are SIMD; execute program in lock-step; program is in scratchpad ram.

2 global bookkeeping arrays, reads & writes to global memory.  two per-thread arrays: reads& writes of a single thread. After each access to a global, check for a race.

Scratchpad is 16-way banked per 16 threads; fast as a register if no bank collisions. Bank conflicts can reduce performance by a factor of 4; slowdown equal to not using scratchpad at all.

Streaming Applications on Multi-Core Systems

Boxes using different hardware for streaming: gpgpu, fpga, webcam, disk, network, etc. Very hard to simulate & debug when lots of different parts are moving.

Auto-Pipe is a set of tools to create, test, build, deploy, optimize distributed streaming apps. Lots of pipeline parallelism, and regular parallelism within a pipe stage.

Auto-Pipe – define a set of blocks and edges or computation & communication. Also describe the hardware layout: cpus, cores, fpus, etc.  Then provide a mapping from blocks to hardware chunks.  No need to worry about communication between the blocks.

Phase-Based Adaptive Recompilation in a VM

Using hardware perf counters to detect phase-changes, and trigger recompiles. There’s been a lot of work in this area already, I’m not sure what’s new here.

Finding Critical Instructions Without Hardware

“critical” – instructions on the hot path.  Create OFF-line synthetic instruction traces based on existing hardware profile info, including branch mis-predicts & aliasing info.  Then analyze those traces, and compile (schedule) to those.

Take a random-walk thru binary, throwing a weighted-die at each branch (starting at a random spot using weighted-die). Emulate a call stack for call/return sanity. Can “get stuck” in various oddball paths, so reseed the trace after N instructions.

More: seed abstract register state with random values, then start tracing. Loads & stores can use random addresses, or load/store addresses: aliases will appear “obvious” because the random addresses will collide.  Also simulate memory using the random addresses. Turns
out most aliasing is captured, but not all (structured linked lists will behave as random instead).

Finally: stuff the random traces thru the hardware simulator & come up with critical paths & schedules. Then use this info to compile better scheduling (he’s doing clustered VLIW’s). Seeing maybe 10% performance gain on his funny clustered VLIW’s.

Program Optimization Space Pruning for a Multithreaded GPU

NVIDIA CUDA talk. Need loads of threads (12000 hardware supported). Have 45G/sec bandwidth (Gbyte? GBit?) but still run out; must use local caches. Encouraging loop unrolling, hardware prefetch, traditional optimizations. Scheduling loops for thread-level and memory-level parallism. Have about 8000+ total registers.

Increasing regs-per-thread might increase per-thread performance, but run out of the shared register space and thus be able to launch fewer threads in parallel. In short, need to search large space of configs; often hand-tuned code is about 15% below max trapped in some local maxima.  A non-intuitive mix of other optimizations can move you to a better minimum.


We Don’t Know How To Program…

 the current (and future!) crop of machines.  “Yeah, yeah, yeah,” I hear you say, “tell me something new”.  Instead I’m going to tell you something olde.


10 Years Ago GC was widely understood and rarely used.  People were skeptical that is was ready for production use.  Now GC is well understood and widely used.  It’s less buggy than malloc/free (e.g. no dangling pointer bugs, but still leaks) and in general is faster to do allocation (hard to beat parallel bump-pointer allocation).  The ‘Collection’ part of GC has gotten substantially faster in the last decade: parallel collectors are available from all major Java Virtual Machines.  Rarely do I see GC consuming more than 10% of total CPU time; usually it’s closer to 5%.  Concurrent & Incremental GC’s are mostly available and mostly work (e.g. Azul can demo sustained 40+ Gigabyte/sec allocation & collection with no GC pauses); hard-real-time GC’s are starting to see real applications.


10 Years Ago – JIT’ing – Just-In-Time Compiling – was considered controversial.  People widely claimed “the code quality will never approach static compilation” and “its too slow”.  These days it’s fairly easy to find programs where Java beats “gcc -O2” (and vice versa; we’re clearly at some kind of performance plateau).  I personally mostly laid these claims to rest: HotSpot remains “The JVM to Beat”; with world-beating records in a variety of areas.  My name still appears in HotSpot source base, buried down in the src/share/vm/opto directory files.


10 Years Ago – JVMs and Managed Runtimes (no .Net 10 years ago; v1.0 appeared in 2002) were unstable on dual-CPU machines (that’s being charitable: the bug list from that era was shocking; e.g. at one point we (the HotSpot team) discovered that emitting all memory barriers had been accidentally disabled on multi-cpu X86’s).  These days JVMs are rock solid on 4-8 CPU machines; this count of CPUs has been Sun’s bread-and-butter for years.  Indeed, HotSpot routinely runs well on 64 – 128 CPU counts from a variety of hardware vendors; Azul’s version of HotSpot runs well on 768 cpus.


10 Years Ago – The JDK was in it’s infancy; the libraries generally were broken on multi-threaded programs.  Then the Abstract Data Types (widely called the Collection Classes for Java) got synchronized (slow, but correct), then in 1.4 unsynchronized versions appeared (fast, but incorrect for multi-threaded usage), then in 1.5 fast & correct concurrent library routines appeared.  These days you can drop in a hash table that will scale linearly to 768 concurrent threads doing inserts & lookups, or write parallel-for loops with a little bit of stylized boiler-plate.


In Summary: The JVM, Collection Classes & GC are Ready for large-core-count machines.  The OS has been ready for a long time (Cray, SGI, IBM et al showed you could beat down all the multi-CPU OS bugs years ago).  In short: every CPU cycle the hardware guys make comes up through the OS, JVM (& JDK & GC) and is usefully exposed to the application writer.




We Still Can’t Write the Million Line Concurrent Program.


I don’t mean the data-parallel (scientific) programming, where large datasets and large CPU counts have been the norm for years.  These guys generally have way fewer lines of code than the big complex Java apps; there’s some correlation there between bytes of dataset vs lines of code.  Both the scientific apps and big Java apps look at gigabytes of data; the problem is that the Java guys use 100x more lines of code – and it’s because they are doing something 100x more difficult to express in code.


I don’t mean non-main-stream programing languages; I mean programs as written by the millions of programmers today, using tools (both software and mental) that are widely available.  As for the specialty languages, Erlang is the only language I know of where million-line concurrent programs are written, maintained and used in a business-critical way on a daily basis.  Near as I can tell, all other specialty languages are used by a handful of academics or in a small-scale as experiments by desperate businesses.


Concurrent Programming is widely considered very hard; try googling “notoriously difficult concurrency” sometime.  Lots of folks are busy trying to figure it out – and it seems pretty clear to me: We Haven’t Figured It Out yet.  Here are some obvious differences between concurrent and serial programming that bear repeating:

  • “Parallelism” for Serial Programming is generally automatic and very fine-grained: e.g. out-of-order hardware or hit-under-miss cache.  Your normal code “just got faster” and the hardware auto-parallelized under the hood for you.  We mined out this kind of parallelism years ago, and just expect it (remember the speed-demon vs brainiac wars?)  “Parallelism” for Concurrent Programs generally means: task-level parallelism.  You gotta hack your code to expose it.
  • Testing provides Confidence for Serial Programming; you can get decent code coverage.  In practice, testing does little for concurrent programming.  Bugs that never appear in testing routinely appear in production – where the load issues are subtly different, or QA doesn’t have the same count of CPUs as production.  Indeed, in theory the situation is gruesome: there’s basically no chance Testing can cover the exponential explosion of possible code interleavings.  There’s a glimmer of hope here: there’s some evidence that the ‘n’ in the exponent can be reduced to 3 or 4 in many cases without missing too many real bugs.
  • Debugging serial programs can be made human-thought slow.  Time is NOT of the essence.  In concurrent programming, everything needs to be full-speed or the bug never appears – leading to the HeisenBug syndome.
  • Program reasoning is local for serial programs.  In concurrent programs, you have to reason about all the code in the system at once – because some other thread can be in every other piece of code concurrently with this thread.
  • Control flow is the main concern in serial programming; you ask questions like “how did I get here?” and “where do I go next?”.  Data flow is the main concern in concurrent programming; you ask questions like “who-the-flock changed this variable out from under me?”.
  • Timing is nothing in serial programming.  Nearly all languages have no way to express time, and program meaning is defined mathematically independent of time.  The same program can be faster or slower – which preserves it’s meaning but might change it’s usefulness.  Timing is everything in concurrent programming: correctness flows from well timed interleavings.  Changing that timing will often break programs that have functioned well for years.  Note that most languages (outside of Java) start out with a disclaimer like “in race-free programs, the meaning is this…”.  Most large concurrent programs have races, so most large concurrent (non-Java) programs have no theoretical meaning.
  • Nothing changes between statements in serial programs; it’s “as if” the Universe Stops and waits for you.  All of memory (could) change between “ticks” of the program’s clock in concurrent programming.  Just because you looked at some variables a nano-second ago, doesn’t mean those variables remain the same.
  • Serial programming has fundamental name-space control for code & data & their cross product.  Classes, Object-Oriented Programming, Functional Programming, lexical scopes, etc are all techniques to control access to either code, or data or both.  There is no equivalent notion for threads – any thread can touch any piece of data or run any piece of code.
  • The Tools are Mature, pretty much by definition.  It’s obvious that support for concurrent programming sucks right now, so therefore the existing (serial) tools must be mature.  Bleah.  Some examples: I don’t know of any widely used (and usable) tools for auto-extracting parallelism (“widely usable” excludes tools requiring a PhD, or a Cray).  I know of no tool which can do data-race detection on HotSpot: it’s too big, it’s too concurrent, it JIT’s code which has subtle race-invariants with the static code, it uses GC, it uses hardware exceptions, it uses lock-free idioms, etc.


What can we do, without changing our fundamental programming model, to use these extra cores to speed up our programs?  Some, but not much.

  • GC can become concurrent by default.  Other than the slowdown in the mutators to preserve the GC invariants, you’ll never see a GC pause.   Azul’s made a lot of hay here; “we solve GC problems”.  Stock JVMs on stock hardware are still “getting there” for production use – but clearly are heading that way.  This technique probably uses about 20% more CPUs to do the GC than you have mutators usefully producing garbage.
  • All codes are JIT’d in the best (most expensive) way.  JIT’ing parallelizes the same way as “pmake” parallelizes builds – you use a thread pool of background compiler threads to JIT furiously.  During a Big Application startup, Azul might see 50 compiler threads JIT’ing in the background, with many megabytes/sec of high quality code being produced.  Of course, post-startup this doesn’t speed up your programs anymore: there’s nothing more to compile.
  • Continuous profiling and introspection need only a fraction of a CPU.
  • More auto-parallelizing hardware tricks like the Run-Ahead Scout haven’t (yet) panned out: the “scout” is more like a blind idiot without warmed-up caches & branch-predictors.
  • Larger-scale speculation seems plausible for a bit, with combinations of hardware & runtime support.  While plain Speculative Lock Elision is a bit stiff on the hardware requirements, Azul’s version shows you can do this with a fairly small amount of hardware, plus some simple runtime support.  Azul’s SLE experience shows there’s a real upper bound to the amount of parallelism you can usefully expose here: the code really does communicate with data inside most locked regions.
  • There are also a bunch of tricks to auto-parallelize & speculate loop iterations, although these techniques probably only pay off for larger loop bodies running longer iteration counts than are normal for business Java.  i.e., this probably pays nicely for scientific apps but not XML-parsing or Web-Servers.


Moving outside of the “auto-parallelize” world, we see a bunch of coding styles to enable parallelism.  Transaction-level parallelism is common, with J2EE a large complex example of it.  This is really “chunk-o-data+work” kind of parallelism, where the “chunk-o-data” is much larger and more complex than in the standard scientific app.  Under this same vague notion of “chunk-o-data+work” we also see things like stream programming (GPGPU is the hardware-accelerated version of this).  Large Java uses a lot of pipeline-of-thread-pools style programming.


In general, performance analysis tools for complex concurrent programs remain very weak.  Tools to help programmers write in the pipeline-of-thread-pools style are totally lacking.  Classic tools tell me things like “your time goes in these routines; here’s the code being executed (alot)”.  What I want is things like “your threads can’t progress because they are blocked on this lock” (fairly easy: Azul shows this now), or “this thread pool is starved for work because that pool isn’t feeding it” (because that pools’ thread-count is capped too low, or it’s threads are being starved in turn from another pool, etc, etc). 


But how well do these techniques work as we move from dozens of cores to hundreds (Azul’s already there!)?  Right now, the Big 3 Application Servers can rarely use more than 50-100 cores before they become choked up on internal locks- and that includes the 20% for GC and using SLE.  Maybe we go to individual programs communicating via sockets (SOA?).  But isn’t this just a very complex version of CSP?  Might we be better off just switching to CSP in the first place (or some hopefully more modern version)?


In short, the easy pickin’s have long gone, and now we need complex tools and complex coding styles to get more performance from more cores using the existing languages.  While I think we can go farther with what we have, it’s time to get serious about exploring serious language changes.  Somewhere, Out There, There’s a Way to write large programs without me having to sweat out Every Bloody Little Detail about how my parallel program communicates internally.  Screw Java: I got a JVM with a super GC, fantastic JIT and decent concurrent libraries; it can do loads more stuff than just run Java.  I got reasonable OS’s and an ever-growing mountain of (parallel) CPU cycles.  Isn’t there Some Way to beat back the Concurrent-Programming Demon ?