Odds & Ends

Various pending goodies…

Load-Linked/Store-Conditional vs Compare-And-Swap

Pretty much all modern CPUs include some form of instruction for doing an atomic update – required for shared-memory multi-CPU machines (X86 has lots and lots!).  There was a long period of debate in the CompSci community on what constituted the “minimal” instruction needed to do useful multi-CPU work.  Eventually the community decided that the Compare-And-Swap (CAS) instruction and the Load-Linked/Store-Conditional (LL/SC) instruction combo are both (1) sufficient to do useful work (“infinite consensus number”) and (2) relatively easy to implement in real hardware.  X86, Sparc and Azul’s CPUs use CAS.  IBM’s Power, Alpha, MIPS & ARMv6 all use LL/SC.  ARMv5 only has an atomic-swap.


LL/SC and CAS are slightly different in how they work, leading to subtly different requirements on algorithms.  With LL/SC, you first Load-Linked a word.  The hardware marks the cache line as “linked”.  You then manipulate the word (e.g. add 1 to implement a simple shared atomic counter).  Finally you issue a Store-Conditional to the same address.  If the cache line is still “linked”, the store happens.  If not, then not.  The line remains linked as long as it does not leave this CPU’s cache; e.g. no other CPU requests the line.  Any attempt to take the line away causes SC failure (retry is up to the algorithm being implemented).  “Weak” LL/SC implementations can easily lead to live-lock – if Load-Linked requests the cache line in exclusive mode (required to do the Store-Conditional), then each LL causes all other CPUs to lose their “link” – and their SC’s will fail.  I suspect most modern implementations of LL do not request the line in exclusive mode – avoiding the obvious live-lock failure.  The downside is that a simple uncontended LL/SC on a word not in cache requires 2 cache-miss costs: the original miss on the load, and a second miss to upgrade the line to exclusive for the SC.


With CAS, you typically first load a word with a normal load, then manipulate it.  Finally you issue the CAS, which compares the memory contents with the original loaded value: only if they match does the swap happen, updating memory.  CAS can succeed if the original cache line leaves the CPU and returns holding the same value – this allows the classic ABA bug.  In some cases, Garbage Collection can nicely side-step the ABA bug; you can never find an aliased copy of an “A” pointer unless all copies of A die first – including the one loaded before the CAS.  Similar to LL/SC, there can be 2 cache-miss costs: one for the original load and again to upgrade the line for the CAS.  Azul has a load-exclusive instruction to avoid this – a plain load, but the line is loaded exclusively.  With CAS you can issue any other instructions between the original load and the CAS; typically with LL/SC there’s a small finite number of operations that can happen between the LL and the SC without losing the “link”.  E.g., guarding a one-shot init function by atomically moving a pointer from NULL to not-NULL: with LL/SC you must load the line before the SC; with CAS no separate load is needed (e.g. an “infinite distance” between the original load and the CAS).
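In Java the one-shot-init idiom falls straight out of AtomicReference.compareAndSet, which JITs down to the hardware CAS on X86/Sparc.  This is a sketch (the class and field names are made up, and the "init work" is just a placeholder allocation):

```java
import java.util.concurrent.atomic.AtomicReference;

public class OneShotInit {
    private static final AtomicReference<Object> CONFIG = new AtomicReference<>(null);

    // Many threads may race here; exactly one CAS moves the pointer
    // from null to non-null, so the init work "happens" exactly once.
    static Object getConfig() {
        Object c = CONFIG.get();                 // plain load; may be null
        if (c != null) return c;                 // fast path: already initialized
        Object fresh = new Object();             // candidate value (placeholder init work)
        if (CONFIG.compareAndSet(null, fresh))   // the CAS: expect null, install fresh
            return fresh;                        // we won the race
        return CONFIG.get();                     // we lost; use the winner's value
    }

    public static void main(String[] args) throws Exception {
        final Object[] seen = new Object[8];
        Thread[] ts = new Thread[8];
        for (int i = 0; i < 8; i++) {
            final int id = i;
            ts[i] = new Thread(() -> seen[id] = getConfig());
        }
        for (Thread t : ts) t.start();
        for (Thread t : ts) t.join();
        for (Object o : seen)
            if (o != seen[0]) throw new AssertionError("saw two different configs");
        System.out.println("all threads saw one config");
    }
}
```

Note there is no separate LL-style "linked load" anywhere: a losing thread simply re-reads the winner's value after its CAS fails.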


These atomic instructions only need to guarantee atomicity of update, not ordering w.r.t. other memory operations.  Nothing in the academic literature requires any sort of global ordering on these atomic instructions.  Instead the usual academic assumption is that all memory operations are strongly ordered – which is obviously not true on all modern hardware.  Practitioners are required to insert memory fences as needed to achieve the desired ordering.  Nonetheless most implementations of CAS include a strong ordering: X86 and Sparc certainly do.  Azul’s CAS does not, and this allows Azul to e.g. implement simple performance counters that do not drop updates and also do not force ordering w.r.t. unrelated memory operations.  (As an experiment, try writing a multi-threaded program to increment a simple shared counter in a tight loop without locking.  Report back the percentage of dropped counts and the throughput rate.  Then try it with an Atomic counter.  My simple tests show that with a handful of CPUs it’s easy to achieve a 99%+ drop-rate – which basically makes the counter utterly useless.)  I am less familiar with the fence properties of common LL/SC implementations.  Any of my gentle readers wish to report back on the situation for Power, MIPS, ARM, etc.?
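Here’s a quick version of that experiment in plain Java (AtomicLong’s getAndIncrement is a CAS loop under the hood).  The drop rate varies wildly by machine and thread count, so treat the printed numbers as illustrative only:

```java
import java.util.concurrent.atomic.AtomicLong;

public class DroppedCounts {
    static int racy;                                 // plain shared counter: updates can be lost
    static final AtomicLong atomic = new AtomicLong();

    public static void main(String[] args) throws Exception {
        final int THREADS = 4, ITERS = 1_000_000;
        Thread[] ts = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++)
            ts[i] = new Thread(() -> {
                for (int j = 0; j < ITERS; j++) {
                    racy++;                          // load-add-store; racing stores drop counts
                    atomic.getAndIncrement();        // atomic read-modify-write; never drops
                }
            });
        for (Thread t : ts) t.start();
        for (Thread t : ts) t.join();
        long expected = (long) THREADS * ITERS;
        System.out.println("atomic:  " + atomic.get());   // always == expected
        System.out.println("racy:    " + racy);           // usually far less
        System.out.println("dropped: " + 100.0 * (expected - racy) / expected + "%");
    }
}
```

The atomic counter never drops an update; whether it also orders unrelated memory operations is exactly the implementation-dependent fencing question above.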






Some time ago I reported on a build tool I’ve been using.  Currently ‘build’ is being used to build HotSpot internally at Azul.  We’ve got it building 400+ files, plus build rules for building portions of the JDK, compiling w/gcc & javac and also ‘javah’, tar, jar, strip, binary signing, etc.  In addition to the typical parallel-make functionality, it includes auto-discovery of C++ header files, intelligent load-balancing of big builds, etc.  It’s blazingly fast, and it’s easy to hack (being written in plain-olde Java).  I hunted around awhile and couldn’t find a good place to dump a medium-large blob of code, so here’s a sample build.java in 4 parts:

Part 1
Part 2
Part 3
Part 4




A “HotSpot -server” aka C2 compiler IR visualizer

No, I didn’t do it!  This deserving grad student did it.

Here is the reference to the visualizer for the ideal graph of the HotSpot server compiler: 


It is a JNLP application hosted on a university server (http://ssw.jku.at/General/Staff/TW/ also has a test file to download), which is the easiest way to get a first impression.  The source code of the tool is hosted on http://kenai.com/projects/igv.  The server compiler instrumentation is part of OpenJDK and is also included in Sun’s weekly builds of Java 7.



More C2 Goodies

John Cavazos wrote: 
… One of the things I am currently looking at is determining the right phase-ordering of optimizations applied to a particular program being optimized.  I have some nice (un-published) results for JikesRVM, 
but it would be nice to replicate the research for HotSpot… 

I have strong arguments for nearly all orderings of our passes, so I’m curious to know about your phase-ordering results. 
The only obvious transform we might be missing is PRE, but our peephole opts pretty nearly subsume PRE (they are not mathematically equivalent – I can write programs where either PRE or the peephole stuff makes progress against the other).  In practice, PRE will find essentially nothing once the peephole opts are done.  You’ll notice that we do the peephole pass a lot; it’s totally incremental and provably linear-bounded.  In other words, if there’s nothing to be gained then there’s no cost in trying.  The peephole pass includes amongst other things all pessimistic forms of copy propagation, value equivalence, constant propagation (especially the null/not-null property), constant test folding, repeated test folding, dead code elimination, load-after-store opts (aliasing analysis is included for free during building of SSA form), algebraic properties, and lots more.
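For a flavor of the “no progress, no cost” style of a pessimistic peephole pass, here is a toy over a made-up two-node expression IR – the node names and structure are invented for illustration and look nothing like the real C2 Ideal graph:

```java
// Toy IR: a node is either a constant or a binary Add.
abstract class Node { abstract Node peephole(); }

class Con extends Node {
    final int val;
    Con(int v) { val = v; }
    Node peephole() { return this; }              // constants are already optimal
    public String toString() { return "" + val; }
}

class Add extends Node {
    Node l, r;
    Add(Node l, Node r) { this.l = l; this.r = r; }
    Node peephole() {
        l = l.peephole();  r = r.peephole();      // incremental: clean up children first
        if (l instanceof Con && r instanceof Con)
            return new Con(((Con) l).val + ((Con) r).val); // constant folding
        if (l instanceof Con && ((Con) l).val == 0) return r; // 0 + x == x
        if (r instanceof Con && ((Con) r).val == 0) return l; // x + 0 == x
        return this;                               // no progress; no cost either
    }
    public String toString() { return "(" + l + "+" + r + ")"; }
}

public class Peep {
    public static void main(String[] args) {
        Node param = new Node() { Node peephole() { return this; }
                                  public String toString() { return "x"; } };
        // x + (2 + 3) + 0  ==>  x + 5
        Node ir = new Add(new Add(param, new Add(new Con(2), new Con(3))), new Con(0));
        System.out.println(ir.peephole());        // prints (x+5)
    }
}
```

Each rule either returns a strictly smaller/cheaper node or returns `this` unchanged, which is what makes fixed-pointing the pass cheap to retry.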

For HotSpot, the optimization ordering is: 

  • Build an inline tree, including class hierarchy analysis.  This is the one pass I’d be willing to move, as it happens too early.
  • (a single unified pass:) parse bytecodes, inline, build SSA form (the IR remains in SSA form always), do peephole opts over SSA form.  This pass typically costs 40% of compile-time budget.
  • Fixed-point the peephole opts over SSA form, once all backedges are known after parsing.
  • Loop opts round 1: “beautify loops” (force polite nesting of ill-structured loops), build a loop-tree & dominator tree, split-if (zipper-peel CFG diamonds with common tests, plus some local cloning where I can prove progress), peel loops (required to remove loop-invariant null checks)
  • Fixed-point the peephole opts over SSA form
  • Loop opts round 2: “beautify loops” (force polite nesting of ill-structured loops), build a loop-tree & dominator tree, lock coarsening, split-if & peel – but if these don’t trigger because there’s nothing to gain, then do iteration-splitting for range-check-elimination & a 1st loop unrolling.
  • Fixed-point the peephole opts over SSA form
  • Conditional Constant Propagation (the optimistic kind, instead of the pessimistic kind done by the peephole pass)
  • Iterate loop (split-if, peel, lock coarsen – but these typically never trigger again and take very little time to check), unrolling & peephole passes, until loops are unrolled “enough”.  On last pass, insert prefetches.  Typically this iterates once or twice, unless this is a microbenchmark and then unrolling might happen 8 or 16 times.
  • Remove tail-duplication, and a bunch of other minor code-shaping optimizations e.g. absorb constant inputs into deoptimization-info in calls, or commuting Add ops so that 2-address machines can do update-in-place.
  • Convert “ideal” IR into machine code IR.
  • Build a real CFG for the 1st time, including a dominator tree, loop tree.  Populate with frequencies from earlier profiling.
  • Global latency-aware (loop-structure-aware) scheduling.
  • Replace null-checks with memory references where appropriate.
  • Register allocate.  Internally this has many passes.  This pass typically costs 40% of compile-time budget.
  • Sort basic blocks to get good control flow ordering (forward branches predicted not-taken, backwards predicted taken, etc)
  • Some last-minute peephole opts.
  • Emit code into a buffer, including OOP-maps & deoptimization info.



Travel Plans

I got too much travel coming up!

Apr 30, May 1st – DaCapo in Boston.  Favorite small group; GC & Java focused. Website: http://www.cs.tufts.edu/research/dacapo/.  I had to turn down the invite to an SSA seminar in France because the dates conflicted.  Very sad.  I hope one of the attendees will post a trip report.

May 11-May 15.  IFIP WG2.4  (International Federation for Information Processing, Working Group 2.4 – a *really* old European group with a random mix of industry & academia.)  It’s a nice group to preview JavaOne talks, and the meetings are always a week-long rambling discussion in some quaint resort.

June 2-June 5.  JavaOne.  I owe slides for 3 talks by end of this week.  ‘enuf said.

July 6-10th – ECOOP, as a paid-for invited speaker.  A free vacation to Italy!   🙂




(yet another trip report to yet another conference)


‘Tis the Season to be Conferencing…

Blah, I’d rather be hacking, but here goes…  


I went to EclipseCon, to CGO and back to EclipseCon.  Here’s how it came down: I got invited out of the blue to talk at EclipseCon by Darin Wright, who had seen me speak earlier at the JVM Language Summit.  Like a dummy I accepted instantly instead of checking my calendar.  Naturally it conflicted with CGO – and I was on the CGO PC this year, so I felt obligated to attend CGO as well.  Turns out my talk at EclipseCon got scheduled for Wednesday, and I decided I could skip CGO on Monday (I think there was one talk I was interested in) – so I scheduled a 1-day-only trip to Seattle.  Monday I wandered EclipseCon at the Santa Clara Convention Center, Tuesday I took the 1st plane up to Seattle and the last plane back, and Wednesday it was back to Santa Clara to give my talk and surf the EclipseCon booths.  The flights on Alaska Air were smooth and easy.  No WiFi at CGO – the hotel wanted some totally outrageous price for conference WiFi (I think the local chair was saying $50/head prepaid?!?).  (Remember my grief over hotel WiFi in the last blog?)  So over the lunch break (which was Chinese food, and I don’t like Asian food of any kind) I headed over to Starbucks.  Turns out Starbucks will happily give you 2hrs of free WiFi a day from ATT – provided you register $5 on a Starbucks card with them.  Yes!!!  Suddenly I have a $5/month nation-wide WiFi plan!  I snacked on some salad’y thing, drank my triple-latte, and happily surfed the Internet until show-time at CGO came ’round again.


1st up: EclipseCon, mostly ’cause I got nothing to say.  These guys are in a totally different ball park from where I live; they mostly deal with complexity control (or at least 1/2 the talks essentially amounted to adding, changing, or removing a control layer).  Makes me wonder if the cure is really better than the disease.  I gave my talk, but it was too much talk in too little a time window.  Both the attendees and I would have been better off if I could have spent more time on the details.  C’est la vie.  Other thing of note: the web site for EclipseCon is vastly better than the usual academic conference I attend.  Also the staff was exceptionally friendly and easy to work with.

Next up: CGO.  Despite my 1 day fly-in-fly-out I got a lot more out of CGO. 


Keynote: Vikram Adve

Missed!  Arrived barely in time for the last 10 minutes.  That’s the penalty for attempting a 1-day fly-by.  I hear he said something good about Azul systems (he did ping me the day before for some details).  Basically it’s a Call To Arms for improving C compilers and C runtimes.  He does say that fully auto-parallelizing code is probably a bad idea, and that we need to change the language instead (something I very much agree with!).


Revisiting Out-Of-SSA Translation

<strong>Cliff Notes: </strong>HotSpot "comes out of" SSA form during register allocation, though it really preserves SSA form throughout the entire compile process, and just adds register annotations as part of reg alloc (and inserts live-range splits as part of spilling). So mostly this talk is about stuff HotSpot has been doing for years... <strong>but there's a key new piece</strong>.

Linear-time coalescing congruence classes; speed & memory footprints.
Why is coming out of SSA hard? (<strong>Cliff Notes:</strong> It's not!)

And now they show what HS has been using for years: insert copies as needed for a simple out-of-SSA translation, then aggressively coalesce away the extra copies. 
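The copy-insertion half of that translation is simple enough to show on a toy.  This sketch lowers one phi on a diamond CFG by appending a copy to each predecessor block; the block and variable names are made up, and the hard part – coalescing the copies away afterwards – is omitted entirely:

```java
import java.util.*;

// Toy out-of-SSA: for each phi  x = phi(B1:a, B2:b)  append a copy
// "x = a" / "x = b" to the matching predecessor block, then drop the phi.
public class OutOfSsa {
    static Map<String, List<String>> code = new LinkedHashMap<>();

    public static void main(String[] args) {
        // Diamond CFG: entry -> {then, else} -> merge, with one phi at the merge.
        code.put("then",  new ArrayList<>(List.of("a = 1")));
        code.put("else",  new ArrayList<>(List.of("b = 2")));
        List<String> merge = new ArrayList<>(List.of("x = phi(then:a, else:b)", "ret x"));
        code.put("merge", merge);

        // Lower the phi: one copy per predecessor edge.
        for (Iterator<String> it = merge.iterator(); it.hasNext(); ) {
            String ins = it.next();
            if (!ins.contains("phi(")) continue;
            String def = ins.split(" = ")[0];
            String[] inputs = ins.substring(ins.indexOf('(') + 1, ins.indexOf(')')).split(", ");
            for (String in : inputs) {             // "then:a" -> copy "x = a" in block "then"
                String[] p = in.split(":");
                code.get(p[0]).add(def + " = " + p[1]);
            }
            it.remove();                           // the phi itself goes away
        }
        System.out.println(code);
        // {then=[a = 1, x = a], else=[b = 2, x = b], merge=[ret x]}
    }
}
```

Naively inserting copies like this is always correct but copy-heavy; the aggressive coalescing pass (and the fast interference test above) is what makes it competitive.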

Nice diagrams for IFGs + copy-affinities.

Notices the value-equivalence property for defining IFGs (also done by HS).

<strong>Ok: here's what's new compared to HotSpot -server:</strong>

Fast interference test w/out building an IFG!
Only need to test against closest dominating var?
So scan the dom tree, with a stack-based DFS traverse of dom tree. 
Uses 2 sorted lists (?); more complexity for value-based IFG tests.
But no building (and rebuilding?) the quadratic IFG.
Claim 2x faster than Sreedhar for coming out of SSA.
Big memory footprint reduction, because no need for live-ness sets & IFG!!!

40% of HS's total JIT budget is in reg-alloc! A faster reg-alloc is definitely interesting. A large part of that 40% is building the liveness & IFG sets.


Wave Propagation and Deep Propagation for Pointer Analysis

Wave Prop is very fast but uses huge memory (they’re working on a parallel version).
Deep Prop is faster for small & precise memory sets – works better on a desktop (and in a JIT?)

There’s a zoo of ptr-analysis names:
– context-sensitive or not
– flow-sensitive or not
– field-sensitive or not
– inclusion-based (Andersen-based) or unification-based (Steensgaard-based)

Cliff Notes: HotSpot -server uses context-INsensitive, flow-INsensitive, field-Sensitive, unification-based

Build a graph of constraints, and do constraint-propagation.
Code: “a = &b;”  Constraint: “a superset {b}” – means b element-of P(a)
Code: “a = b”  Constraint: “a superset b” – means P(b) contained in P(a)
Code: “a = *b” or “*a = b” – complex constraint; must add edges to the graph as constraint propagation adds new sets to the nodes.

Now build a graph: nodes are P(a) – sets of things pointed-at by ‘a’ – and edges represent subsets.

Cycles are common; collapsing them is crucial for performance. All things in the cycle must be equal. Uses Nuutila to find strongly connected components in a single pass (so does HotSpot – Vick’s algorithm).  After collapsing cycles, you get a DAG – and Nuutila also produces a topo-sort as output.
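The cycle-collapsing step is just SCC-finding on the constraint graph, merging each cycle into one points-to set.  Here is a compact Tarjan-style sketch in Java (Nuutila’s algorithm is a refinement of this; the graph and node numbering are invented for the example):

```java
import java.util.*;

public class Scc {
    final List<List<Integer>> adj;   // constraint edges: subset relations between nodes
    final int[] comp, low, num;      // comp[v] = representative set after collapsing
    final Deque<Integer> stack = new ArrayDeque<>();
    int counter = 0, comps = 0;

    Scc(List<List<Integer>> adj) {
        int n = adj.size();
        this.adj = adj;
        comp = new int[n]; low = new int[n]; num = new int[n];
        Arrays.fill(num, -1); Arrays.fill(comp, -1);
        for (int v = 0; v < n; v++) if (num[v] < 0) dfs(v);
    }

    void dfs(int v) {
        num[v] = low[v] = counter++;
        stack.push(v);
        for (int w : adj.get(v)) {
            if (num[w] < 0) { dfs(w); low[v] = Math.min(low[v], low[w]); }
            else if (comp[w] < 0) low[v] = Math.min(low[v], num[w]);   // w still on stack
        }
        if (low[v] == num[v]) {          // v is the root of an SCC: pop the whole cycle
            int id = comps++;
            int w;
            do { w = stack.pop(); comp[w] = id; } while (w != v);
        }
    }

    public static void main(String[] args) {
        // 0 -> 1 -> 2 -> 0 is a cycle (one merged points-to set); node 3 hangs off it.
        List<List<Integer>> g = List.of(List.of(1), List.of(2), List.of(0, 3), List.of());
        Scc s = new Scc(g);
        System.out.println(s.comp[0] == s.comp[1] && s.comp[1] == s.comp[2]); // true
        System.out.println(s.comp[3] != s.comp[0]);                           // true
    }
}
```

After the merge, propagation runs over the resulting DAG in the topological order that the same pass produces for free.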

(Still looks like a complex diff/delta propagation algorithm – but makes me wonder if standard bit-vector stuff is faster).

Not a bad-looking ptr-analysis. Might work in a JIT.
Manesh suggests using BDD’s.

Cliff Notes: Real question of course, is if all the extra alias precision actually buys you any performance. A PLDI paper of a few years ago suggests otherwise: that HS’s style of alias-analysis gets you 95% of all the possible gain.


Fast Track: A Software System for Speculative Program Optimization

Chen Ding: another process-level fork/join speculative program optimization (contrast this to Emery Berger’s “Grace” system). Gives sequential program semantics. Add annotations to “suggest” fork/join points. Actually fork a process, use COW & mmap to guarantee sequential semantics in the face of races (end speculation, or force commit-ordering on speculative threads, etc).

New Idea: User can write an alternative implementation of some module; run both versions in parallel; keep the faster version & kill the slower version: “return do_whichever_is_faster( methodX(), fast_methodX() )”

New Idea: “fast_methodX” might be speculative and *wrong*. So run the slow version to completion and do error checking after the fact. But let the fast version run ahead (if it’s faster), and it can spawn off new speculative fast versions. When the slow version completes, use page-level protections to know what’s been changed, and verify the correct semantics of the fast version. Run all the correctness checks in parallel with all the normal fast-case stuff. Needs some careful throttling.

Example using gcc’s “mudflap“. Array under/overflow, leaks, un-init’d memory, etc. Run 2 versions of all code: the existing code is “fast track” and the mudflap version is the “slow correct track”. Goal: same speed as the “fast track” code with all the safety checks of the “slow track using mudflap”. In some cases can get real nice speedups.

Cliff Notes: can we turn on mudflap for Azul?

Scenario Based Optimization

Profile the hot methods. Tweak various parameters (e.g. loop unrolling, or cache-tiling parameters). Profile both old and new versions. Pick the best winner, and stick with it for awhile. Keep profiling continuously, and switch code as needed.  Some JVM JITs do this.  HotSpot only sorta does.

The programmer has to pick what the new & old versions are, and the runtime picks the fastest one (no support for correctness – what if one of the versions is busted? Or tickles a timing bug which breaks things?)

Seeing some nice gains, some of the time. Avoids losing when turning on some aggressive optimizations. But generally unrealistic.

Software Pipelined Execution of Stream Programs on GPUs

Basically StreamIt on a GPU

StreamIt is much easier to use than e.g. CUDA or OpenCL.

StreamIt model is a collection of connected serial parts, connected via pipes. Due to splits & joins, might need to exec some parts more often than others (e.g. part Y “eats” 2 outputs from part “X” before outputting a result itself). Get complex patterns of execution; difficult to map down to GPGPUs. Have bounded-buffer problems: need to map to a steady-state schedule using a fixed count of buffers. Then can arrange to do no allocation during execution.

– Profile each filter standalone, to figure out its work load.
– Configure selection
– Add constraints 
– Use CPLEX to solve
– Output CUDA code.

Looks like a classic software-pipeline, just done on a GPGPU. Compute the minimum initiation interval, compile resource constraints, etc. Have data-layout issues to solve, so that parallel data access is possible.

Impressive compile times (30sec to 178sec).
Can get pretty good utilization on some kernels, and is seeing 100x speedups.

Cliff Notes: My usual caveat for domain-specific languages applies here: if your domain fits StreamIt’s model – then StreamIt is the fastest language for you!  If what you got doesn’t look like StreamIt, then Too Bad.


Stream Compilation for Real-Time Embedded Systems

StreamIt again, but for Cell.
New constraint: out of memory per cell (distributed memory).
Different from a GPU – a GPU has shared memory, but is very slow if doing bank-conflicting access – and can pile on more memory (more buffers) for some GPUs and less for others. Not so for Cell – must have a uniform split of memory.


PS – Thanks Doug for inspiring this blog





I suspect you mean C’est la vie …

Posted by: Alex Blewitt | Apr 5, 2009 1:23:10 PM


Fixed, thanks. My French isn’t very good…. 🙂

Posted by: Cliff Click | Apr 5, 2009 1:25:42 PM


Cliff, thanks for speaking at EclipseCon. I admit that while I was able to follow along for a significant part of your talk, by the end I got lost. I agree that it was probably not a lot of time. Either way, I appreciated that you exposed us to a facet of the VM that we don’t often talk about at EclipseCon, so I hope you don’t think it was a waste.

Posted by: Robert Konigsberg | Apr 5, 2009 6:06:46 PM


Somewhere on the JVM Lang Summit web pages is a full video of my talk there – where I go slower & have more time for meaningful Q&A. Part of the problem was me feeling rushed – so I ran ahead, instead of going back over slides where I could feel the audience slipping – and maybe just dropping a few slides off the back end.


Posted by: Cliff Click | Apr 5, 2009 6:42:42 PM


Thanks, Cliff, for the many conference write-ups you have published. I always look forward to hearing what is going on in the research community summarized so concisely.


Posted by: William Louth | Apr 6, 2009 12:32:11 PM


authored by you and your colleagues from Sun cites a paper by the same authors called “Interference Graph Trimming” that is noted as having been submitted to the 2001 PLDI. That paper does not appear in the PLDI record, is it available anywhere?

Posted by: Carl | Apr 18, 2009 11:13:04 PM


It was never accepted to PLDI – definitely a bummer. The paper is all about how to write a fast graph-coloring allocator, and the reviewers all wanted to see a comparison to a linear-scan allocator – but there isn’t enough space in the 10pg PLDI format to do both. I’ll see if I can post a link here. I’ll send you a copy in any case.


Posted by: Cliff Click | Apr 19, 2009 5:04:43 AM


Appreciate if you can share the paper “Interference Graph Trimming” to me too. Thanks in advance.


Posted by: Leela Mohan Venati | Oct 25, 2009 10:50:50 AM