Final Fields, Part 2

I’ve been having waaaaaay too much fun this month, dealing with “final” fields – final is in quotes because I’ve been finding waaaaay too many Generic Popular Frameworks (TM) that in fact write to final fields long long after the constructor has flowed under the bridge.  Optimizing final fields is in theory possible, but in practice it’s busting a Lot of Popular Code.

From Doug Lea:

It might be worse than that. No one ever tried to reconcile JSR133 JMM JLS specs with the JVM specs. So I think that all the JVM spec says is:
http://java.sun.com/docs/books/jvms/second_edition/html/Concepts.doc.html#29882
Once a final field has been initialized, it always contains the same value.

Which is obviously false (System.in etc).

De-serialization plays nasty with final fields any time it has to re-create a serialized object with final fields.  It does so via Reflection (for a small count of objects), and eventually via generated bytecodes for popular de-serializations.  The verifier was tweaked to allow de-serialization generated bytecodes to write to final fields… so de-serialization has been playing nasty with final fields and getting away with it.  What’s different about de-serialization vs these other Generic Popular Frameworks?  I think it’s this:

De-serialization does an initial Write to the final field, after <init> but before ANY Read of the field.

These other frameworks are doing a Read (and if it is null), a Write, then further Reads.  It’s that initial Read that returns a NULL that’s tripping them up, because when it’s JIT’d it’s the value used for some of the later Reads.
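The Read, Write, Read sequence can be sketched with plain Reflection. This is a minimal illustration, not code from any particular framework: `Service`, `LateFinalWrite` and `inject` are hypothetical names, and the sketch assumes a JVM that still permits reflective writes to instance finals after `setAccessible(true)`.

```java
import java.lang.reflect.Field;

// Hypothetical stand-in for a framework-managed object.
class Service {
    private final Object dependency;   // declared final, but filled in later

    Service() {
        this.dependency = null;        // the constructor leaves it null
    }

    Object dependency() {
        return dependency;
    }
}

class LateFinalWrite {
    // Overwrite the final field reflectively, the way an injection-style
    // framework might (assumption: this mirrors the pattern described above).
    static void inject(Service target, Object value) {
        try {
            Field f = Service.class.getDeclaredField("dependency");
            f.setAccessible(true);     // needed before a final instance field can be written
            f.set(target, value);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Service s = new Service();
        Object before = s.dependency();   // Read #1: sees null
        if (before == null) {
            inject(s, "injected");        // Write, long after the constructor
        }
        Object after = s.dependency();    // Read #2: a JIT that trusts 'final' could reuse Read #1
        System.out.println(before + " -> " + after);
    }
}
```

A JVM that does not optimize instance finals reloads the field at Read #2 and sees the injected value. A JIT that treats the field as truly final is entitled to reuse the null from Read #1, which is the failure mode described above.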

Why bother?  What’s the potential upside to using final fields?

  • Expressing user intent – but final fields can be set via Reflection, JNI calls, & generated bytecodes (besides the “normal” constructor route), hence they are not *really* final.  It’s more like C’s “const”, just taking a little more syntax to “cast away const” and update the thing.
  • Static final field optimizations (à la Java asserts).  For these, Java asserts crucially rely on the JVM & JIT to load these values at JIT-time and constant fold away the turned-off assert logic.
  • Non-static final field optimizations.  This is basically limited to Common Subexpression Elimination (CSE) of repeated load, and then the chance to CSE any following chained expressions.
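For the static case, here is a hand-written equivalent of the guard that javac generates for an `assert` statement. The class `Checked` and the field name `ASSERTIONS_DISABLED` are illustrative; javac's real field is synthetic and named `$assertionsDisabled`.

```java
// Hand-written equivalent of the guard javac generates for 'assert'.
class Checked {
    // Initialized once, during class initialization, and never written again.
    static final boolean ASSERTIONS_DISABLED = !Checked.class.desiredAssertionStatus();

    static int half(int n) {
        // Because the flag is a static final that is already set by the time this
        // method is compiled, a JIT can read its value at compile time and, when
        // assertions are off, fold the whole branch away.
        if (!ASSERTIONS_DISABLED && (n % 2 != 0)) {
            throw new AssertionError("odd input: " + n);
        }
        return n / 2;
    }

    public static void main(String[] args) {
        System.out.println(half(10));
    }
}
```

That is the static case, where a single value exists per class and can become a compile-time constant for the JIT. The non-static case in the last bullet has no such constant to fold.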

I claim the gain from this last one is almost nil in normal Java code. Why?  Because different instances of the same class generally hold different values in the same field, hence there is no compile-time constant and no constant-folding.  Hence the field has to be loaded at least once.  Having loaded a field once, the cost to load it a 2nd time is really really low, because it surely hits in cache.  Your upside is mostly limited to removing a 1-cycle cache-hitting load.  For the non-static final field to represent a significant gain you’d need these properties:

  • Hot code. By definition, if the code is cold, there’s no gain in optimizing it.
  • Repeated loads of the field.  The only real gain for final-fields is CSE of repeated loads.
  • The first load must hit in cache.  The 2nd & later loads will surely hit in cache.   If the first load (which is unavoidable) misses in cache, then the cache miss will cost 100x the cost of the 2nd and later loads… limiting any gain in removing the 2nd load to 1% or so.
  • An intervening opaque operation between the loads, like a lock or a call.  Other operations, such as an inlined call, can be “seen through” by the compiler and normal non-final CSE will remove repeated loads without any special final semantics.
  • The call has to be really cheap, or else it dominates the gain of removing the 2nd load.
  • Cheap-but-not-inlined calls are hard to come by, requiring something like a mega-morphic v-call returning a trivial constant which will still cost maybe “only” 30 cycles… limiting the gain of removing a cache-hitting 1-cycle repeated final-field load to under 5%.
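The shape of code these bullets describe can be sketched as follows. All names (`Callback`, `Scaler`, `naive`, `hoisted`) are hypothetical. The arithmetic behind the last bullet, on one reading: a 1-cycle load removed next to a roughly 30-cycle call is about 1/31 of that stretch of code, or roughly 3%, hence "under 5%".

```java
// Hypothetical names throughout; the point is the shape of the code.
interface Callback {
    int apply(int v);   // imagine a megamorphic call site the JIT cannot inline
}

class Scaler {
    private final int scale;

    Scaler(int scale) {
        this.scale = scale;
    }

    // Two loads of this.scale with an opaque call between them.
    // Unless the JIT trusts 'final', it must reload the field after the call.
    int naive(Callback cb, int v) {
        int a = v * scale;      // load #1
        int b = cb.apply(a);    // opaque call: might it have changed scale?
        return b * scale;       // load #2
    }

    // Manual CSE: load once into a local, which no call can modify.
    int hoisted(Callback cb, int v) {
        final int s = scale;
        int a = v * s;
        int b = cb.apply(a);
        return b * s;
    }
}
```

Both methods compute the same result. The only difference is whether the second load is removed by hand, by final-field CSE, or not at all.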

So I’ve been claiming the gains for final fields in normal Java code are limited to expressing user intent.  This we can do with something as weak as a C++ “const”.  I floated this notion around Doug Lea and got this back:

Doug Lea:

{example of repeated final loads spanning a lock}

… And I say to them: I once (2005?) measured the performance of adding these locals and, in the aggregate, it was too big of a hit to ignore, so I just always do it. (I suppose enough other things could have changed for this not to hold, but I’m not curious enough to waste hours finding out.)

My offhand guess is that the cases where it matters are those in which the 2nd null check on reload causes more branch complexity that hurts further optimizations.
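For readers unfamiliar with "adding these locals": the idiom is visible throughout the java.util.concurrent sources, where a final lock field is copied into a local before use. The sketch below is not the elided example above; `Counter` is a hypothetical class written only to show the shape.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical class illustrating the "copy the final field to a local" idiom.
class Counter {
    private final ReentrantLock lock = new ReentrantLock();
    private int count;

    int increment() {
        final ReentrantLock lock = this.lock;  // one field load, kept in a local
        lock.lock();
        try {
            return ++count;
        } finally {
            lock.unlock();                     // uses the local: no reload, no second null check
        }
    }
}
```

Without the local, the field is loaded again for `unlock()` after the opaque `lock()` call, along with the null check that the offhand guess above refers to.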

Charles Nutter added:

I’ll describe the case in JRuby…
In order to maintain per-thread Ruby state without constantly hitting thread locals, we pass a ThreadContext object along the stack for almost all calls.  ThreadContext has final references to the JRuby runtime object it is associated with, as well as commonly used literal values like “nil”, “true”, and “false”.  The JRuby runtime object itself in turn has final references to other common literal values, JRuby subsystems, and so on.
Now, let’s assume I’m not a very good compiler writer, and as a result JRuby has a very naive compiler that’s doing repeated loads of those fields on ThreadContext to support other operations, and potentially repeatedly loading the JRuby runtime in order to load its finals too.  Because Hotspot does not consider those repeat final accesses that are *provably* constant (ignoring post-construction final modification), they enter into inlining budget calculations.  As you know, many of those budgets are pretty small…so essentially useless repeat accesses of final fields can end up killing optimizations that would fire if they weren’t eating up the budget.
If we’re in a situation where everything inlines no matter what, I’m sure you’re right… the difference between eliding and not eliding is probably negligible, even with a couple layers of dereferencing.  But we constantly butt up against inlining budgets, so anything I can possibly do to reduce code complexity can pay big dividends.  I’d just like Hotspot in this case to be smarter about those repeat accesses and not penalize me for what could essentially be folded away.

To summarize: JRuby makes lots of final fields that really ARE final, and they span not-inlined calls (so require the final moniker to be CSE’d), AND such things are heavily chained together so there’s lots of follow-on CSE to be had.  Charles adds:

JRuby is littered with this pattern more than I’d like to admit.  It definitely has an impact, especially in larger methods that might load that field many times.  Do the right thing for me, JVM!

So JRuby at least precisely hits the case where final field optimizations can pay off nicely, and Doug Lea locks are right behind him.
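A rough sketch of the chained pattern, under heavy assumptions: `RuntimeSketch`, `ContextSketch` and `ChainedFinals` are made-up stand-ins, not JRuby's actual classes, and `callee` stands in for a call that does not get inlined.

```java
// Made-up stand-ins for a runtime object and a per-thread context.
class RuntimeSketch {
    final Object nil = new Object();   // a commonly used literal value
}

class ContextSketch {
    final RuntimeSketch runtime;

    ContextSketch(RuntimeSketch runtime) {
        this.runtime = runtime;
    }
}

class ChainedFinals {
    // Stand-in for a call the JIT does not inline.
    static Object callee(ContextSketch ctx, Object arg) {
        return arg;
    }

    // Naive code generation: two dependent final loads after every call.
    static int naive(ContextSketch ctx, Object[] values) {
        int nils = 0;
        for (Object v : values) {
            Object r = callee(ctx, v);
            if (r == ctx.runtime.nil) {   // load ctx.runtime, then load .nil
                nils++;
            }
        }
        return nils;
    }

    // What final-field CSE would effectively produce: the chain is loaded once.
    static int hoisted(ContextSketch ctx, Object[] values) {
        final Object nil = ctx.runtime.nil;
        int nils = 0;
        for (Object v : values) {
            if (callee(ctx, v) == nil) {
                nils++;
            }
        }
        return nils;
    }
}
```

Because the second load depends on the first, trusting both finals removes the whole chain, which is the follow-on CSE mentioned in the summary above.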

Yuch.  Now I really AM stuck with doing something tricky.  If I turn off final field optimizations to save the Generic Popular Frameworks, I burn JRuby & probably other non-Java languages that emit non-traditional (but legal) bytecodes.  If I don’t turn them off, these frameworks take weird NULL exceptions under load (as the JIT kicks in).  So I need to implement some middle ground… of optimizing final fields for people who “play by the rules”, but Doing The Expected Thing for those that don’t.

Cliff

19 thoughts on “Final Fields, Part 2”

  1. So…what of where you left things originally…detecting sets outside the constructor and disabling optimizations for that field? Or if you wanted to be really generous to Generic Framework, toss the JIT code impacted by the field, and *don’t* disable optimizations on the theory that Generic Framework probably won’t be doing it more than 1x per field so the regenerated JIT code should probably be okay (and if it’s not, it would just get flushed again). And Generic Unpopular Framework that no one knows about might be hosed b/c they do it continually, but they are getting what’s coming to them.

    • I think I’m coming around to declaring final/not-final on a field by field basis. In fact, no reason to look at the bytecode’s “final” flag; I’ll simply end up detecting final-like behavior.

  2. This is one of those situations where an informal API should be converted into a formal API.

    We need a JSR to implement a proper VM-level API that de-serialisation code can use to create partially-init’ed objects, write to their final fields, and then convert them to “really-initialised” objects.

  3. I’ll continue my comment in the previous blog, voting in David Leppik’s suggestion of introducing some JVM switch to enable a non-default option for ‘final’ optimizations. Ideally this option could have three states: Off, On, and Debug which is also on, but additionally logs a warning for dangerous writes of finals. Notice that not all such writes are bad – serialization and others from the core could be whitelisted to not issue warnings, and application or library code may also have similarly non-malign usages. If the VM can be smart enough to detect benign patterns and not log warnings, it would be a nice bonus. The expectation is that performance hungry users would try to use this switch, but they would sometimes fail due to some evil Generic Popular Framework, and this would put pressure on the authors of said GPF to clean it up. Oracle could publish, in release notes or a VM guide, a list of all GPFs that you know to have this problem, making developers aware that there’s some juicy speedup they cannot get until certain libraries are fixed.

    CSE of finals is relatively trivial optimization to add, so even if the benefit is small for most Java code, it’s probably worth the effort because we may observe significant benefits for some special cases, including the increasingly-important alternative JVM languages that often produce very different code patterns than standard Java app code.

    CSE can sometimes produce asymptotic speedups (well that’s potentially true for most if not all compiler optimizations)… Charles worries about inlining budgets which are a good example… but also, suppose a method calls f(x) more than once, for a final x. If f() can be proven to be a pure function (already known for intrinsics like Math.* and possible to deduce for many trivial methods especially when inlined), you can fold the entire f(x). And this may deliver a speedup that’s orders of magnitude bigger than a simple memory load. Because we don’t have good enough CSE, highly optimized Java code is often littered with extra local variables that “cache” fields or just hold temp values that will be used more than once: xr = sqrt(x); a = xr*y; b = xr*z… this is standard practice in hi-perf Java code, including core API implementation like Doug notices too.
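The `xr = sqrt(x)` practice mentioned in this comment, written out as a small sketch. `Vec` and its method names are hypothetical.

```java
// Hypothetical class: a pure function applied twice to the same final field.
class Vec {
    private final double x;

    Vec(double x) {
        this.x = x;
    }

    // Calls Math.sqrt(x) twice; folding the second call requires knowing both
    // that sqrt is pure and that x cannot change between the calls.
    double naive(double y, double z) {
        return Math.sqrt(x) * y + Math.sqrt(x) * z;
    }

    // The manual version: compute once, keep the result in a local.
    double cached(double y, double z) {
        final double xr = Math.sqrt(x);
        return xr * y + xr * z;
    }
}
```

The two methods return the same value; the second simply does by hand what a compiler with stronger CSE could do automatically.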

    • Just to be clear, CSE’ing of final fields had been there for a long time. Charles reports it’s now turned off for Oracle (I haven’t looked at their recent code). I re-wrote the memory aliasing handling in the server compiler recently (doing what I should have done a decade ago!!!), and as a side-effect that beefed up final-field optimizations… which then started triggering these kinds of bugs. So the CSE is In There (at least for Azul), but I need to tone down the default behavior somehow, to rescue the Generic Frameworks.
      Cliff

  4. Hi Cliff,
    I am not sure why Doug says “generated bytecodes or reflection” during deserialization, Java6 (java7 looks the same) jumps on unsafe, so it cares not of the final flag.
    java.io.ObjectStreamClass$FieldReflector for reference.
    Detecting “final” behavior might be trickier for protected (or package private) fields and they would participate in CHA, i.e. you might need to de-optimize code if any class loads some mutation code of otherwise final looking fields.

    In that respect, I was thinking of forfeiting the field-by-field basis and just going for the entire package. Neither JRuby nor j.u.concurrent is going to suffer because of it. Basically, code that plays by the rules is very likely to be in the same package and otherwise just as likely.

    As for CSE, I am used to manually optimizing it (and don’t wait for the JVM) both for function calls and simple final loads, esp. if I expect the code to be anywhere hot. Yet, having the JVM do that would be great since most developers code w/ copy&paste. I suppose removal of the 2nd+ load can help remove extra branch checks as well.
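Whichever internal mechanism de-serialization uses (Reflection, generated bytecodes, or Unsafe as this comment suggests), the observable effect is the same and easy to demonstrate: a final field ends up set on an object whose own constructor never ran. A minimal sketch, with hypothetical class names `Point` and `SerializationWritesFinals`:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical class: one final field and a constructor that counts its calls.
class Point implements Serializable {
    private static final long serialVersionUID = 1L;
    static int constructorCalls = 0;   // static, so not part of the serialized form
    final int x;

    Point(int x) {
        constructorCalls++;
        this.x = x;
    }
}

class SerializationWritesFinals {
    // Serialize to a byte array and read a fresh copy back.
    static Point roundTrip(Point p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(p);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Point) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Point copy = roundTrip(new Point(42));
        // The copy's final field was written by the serialization machinery,
        // not by Point's constructor.
        System.out.println("x = " + copy.x + ", constructor calls = " + Point.constructorCalls);
    }
}
```

The copy carries `x == 42` while the constructor counter stays at 1, so the write to the final field happened outside any `Point` constructor but before user code could read the field.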

    • Unsafe/Reflection/JNI all look pretty much the same to me; they all run through a handful of choke-points in the JVM itself. The actual de-optimization event will probably happen not on class-load, but on the actual mutation… since it happens in a few easily controlled points.
      Cliff
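A toy model of per-field trust with invalidation at the mutation choke-point, written in plain Java. This is an assumption-laden sketch of the idea discussed in this thread, not a description of any JVM's internals; `FinalTrustTable`, `onLateWrite` and the string field keys are all invented for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model: every field starts out trusted as final; the first late write
// to a field revokes trust for that field only.
class FinalTrustTable {
    private final Set<String> untrusted = ConcurrentHashMap.newKeySet();
    private int invalidations = 0;

    // Would the (modelled) JIT be allowed to CSE loads of this field?
    boolean isTrusted(String field) {
        return !untrusted.contains(field);
    }

    // Called from the (modelled) choke-point that Reflection, Unsafe and JNI
    // writes all pass through.
    synchronized void onLateWrite(String field) {
        if (untrusted.add(field)) {
            // Stand-in for discarding compiled code that relied on this field.
            invalidations++;
        }
    }

    synchronized int invalidations() {
        return invalidations;
    }
}
```

In this model, trust is tracked per field rather than per instance, so repeated late writes to many instances of the same field cost one invalidation, and fields that are never written late keep their optimizations.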

  5. For that matter, typical Scala code also has plenty final fields.

    I think the JSR option is the best way to go, but, short term, maybe you could disable this optimization the instant a reflection setAccessible call is made? I suppose it might require dropping optimizations already made, though…

  6. Do any of them modify the same final field repeatedly? I’d suspect not. I’d bring my suggestion forward again, that you discard the JIT code touching the field/class (on modification) and leave it at that. Let the JIT happen again, with optimization on, and the new JIT code should be pretty stable, and, as a bonus, have the (now properly) optimized final value present. It saves you having to keep track of which fields can be optimized and which can’t.

    • They do modify multiple instances of the same final field repeatedly – which is why I can’t discard & re-JIT. That trick works for modifying static final fields but not instance fields.

      Thanks,
      Cliff

    • BTBs cache a mapping from PC to PC. In this case, from a call-register or jump-register to the target (i.e., expected register contents). Inline-caches work great with or without a BTB, and they mostly remove the usefulness of a BTB – because there are very few “virtual” calls that actually use the full v-call mechanism and bottom-out in a jmp-register. Mostly virtual calls use an inline-cache which only needs static prediction. For those few calls that use the full v-call mechanism, the JVM vendor has a crucial choice to make which dictates how useful a BTB is.

      If the full v-call mechanism is inlined in the code, then the jmp-register op is itself replicated at each v-call site. In this case the BTB has a chance to work… although (as noted in that blog), there is a very very high chance that this call-site is megamorphic and no prediction will be possible. i.e., the targets are either truly random and randomly selected, or at least have a very high variability. Note that we’re shaving the hair in half here already: dynamically full v-calls are very very rare (I have billion-instruction traces which show them appearing at about a 0.1%(?) instruction rate).

      The other JVM implementation of choice is to have the inline-cache site call to a *shared* v-table stub. In this case, the v-table is shared across all mega-morphic call sites… and in this case the BTB really has no chance of predicting anything. Why would a v-table stub exist at all? Because the full-on v-call semantics do not fit in the restricted space allotted to an inline-cache… so you need to jump to some out-of-line code in any case.

      For Azul, I replicated the v-table stub per-call-site that goes megamorphic, so the BTB has a chance to work… not that I expect much from it. But it was cheap enough to replicate the code (mostly related to engineering of the code lifetime).

      Cliff

  7. class LazyInit<IFactory> {
    private final IFactory factory;
    LazyInit(IFactory factory) { this.factory = factory; }
    public native T get();
    }

    …..
    class User {
    private final LazyInit a;
    …..
    public void use() { System.out.println(a.get().toString()); }
    }

    Any thoughts on whether LazyInit can be safely implemented internally using fences and added to JDK.

    • Well, you clearly didn’t give me enough code to decide what “safely” means… but also you only set final fields in constructors – the ONLY recommended use-case. So this looks “safe” to me. It looks like the classic use-case for finals and should work great. All this is totally unrelated to whether or not you can add anything to the JDK.

  8. It appears to me from the blog (perhaps mistakenly so) that using reflection to modify final fields is another way of doing lazy initialization. If this argument is moot, so is everything that follows below.

    The LazyInit.get here is supposed to initialize the T data object lazily using the passed-in factory if it’s not already done (DCL?).

    But instead of asking the user to write this code, the hope here is for compiler to place the right fences to do the get *efficiently*.

  9. You got no locks, so your example can’t be using DCL.
    You got no Reflection, so that’s not being done.
    You make no objects, so there’s nothing to init…
    …as I said, you don’t have enough code here to make an example.
    Maybe make a more complete example?
    Are you trying to get some field to be lazily init’d and efficiently accessed?
    Then make it a non-final field! They work great, they really do!

    Cliff
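One way to read the "make it a non-final field" advice is the standard double-checked locking idiom with a volatile field. The sketch below is an interpretation, not the commenter's intended LazyInit: `Lazy` is a hypothetical class, `java.util.function.Supplier` stands in for the commenter's factory type, and the factory is assumed never to return null.

```java
import java.util.function.Supplier;

// Hypothetical lazy holder: the factory is a genuine final, the value is not.
class Lazy<T> {
    private final Supplier<T> factory;  // set once, in the constructor
    private volatile T value;           // deliberately NOT final: written later

    Lazy(Supplier<T> factory) {
        this.factory = factory;
    }

    T get() {
        T v = value;                    // single volatile read on the fast path
        if (v == null) {
            synchronized (this) {
                v = value;              // re-check under the lock
                if (v == null) {
                    v = factory.get();  // assumption: never returns null
                    value = v;
                }
            }
        }
        return v;
    }
}
```

A possible usage: `new Lazy<>(() -> expensiveLookup())`, where `expensiveLookup` is whatever the caller wants deferred. The factory runs at most once, and no final field is ever written outside a constructor.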
