Starting with Ehcache 3.0

It’s been 4 years since we released Ehcache 2.0. While it was the second major release in almost 6 years, it didn’t break backward compatibility, but was “simply” the first release with a tight integration with the Terracotta stack, just 6 months after the acquisition. With JSR-107 close to being final, Java 8 lurking just around the corner and Ehcache having an API that is close to 10 years old, it does seem like the perfect time to revamp the API, break free from the past and start working on Ehcache 3.0!

Ehcache 3.0

Ehcache is one of the most feature rich caching APIs out there. Yet it’s been growing “organically”, all founded on the very early API as designed some 10 years ago. We’ve learned a lot in the meantime and there have been compromises made along the way. But “fixing” these isn’t an easy task, and is often impossible without breaking backward compatibility. In the meantime, while some caching solutions took very different approaches to it all, the expert group on JSR-107, Terracotta included, put great effort into trying to come up with a standard API. We feel the new major version of the API should be based on a standardized API. JSR-107 is the likely choice to serve as a basis for the new version, but Ehcache 3.0 will likely “extend” the specification in many aspects.

javax.caching

While JSR-107 lays a foundation for this new API, we also need to address some Ehcache specifics early on: one of these is the resource constraints a user may wish to put on a cache, and the other is the exception type system. Let me try and expand a bit on these here.

Resource constraints

Ehcache users have always had the possibility to constrain caches based on “element counts”, whether on-heap or on-disk. Over time we’ve introduced more tiers to cache data in (off-heap, Terracotta clustered) and more ways to constrain the caches in size (Automatic Resource Control, which lets you constrain the memory used by caches and cache managers in bytes). When capacity is reached, the cache will evict cached entries based on some eviction algorithm and fire events about these.
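To make the “element count” idea concrete, here is a minimal sketch of a cache that is capped at a number of entries, evicts the least recently used one when full, and fires an event about it. This is an illustration only: CountBoundedCache and its eviction listener are hypothetical names, not Ehcache’s API, and LRU is just one possible eviction algorithm.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical illustration only: this is not Ehcache's API.
final class CountBoundedCache<K, V> {
    private final LinkedHashMap<K, V> entries;

    CountBoundedCache(int maxEntries, BiConsumer<K, V> evictionListener) {
        // accessOrder = true keeps the least recently used entry first.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                boolean overCapacity = size() > maxEntries;
                if (overCapacity) {
                    // Fire the "eviction event" before the entry is dropped.
                    evictionListener.accept(eldest.getKey(), eldest.getValue());
                }
                return overCapacity;
            }
        };
    }

    synchronized V get(K key) {
        return entries.get(key);
    }

    synchronized void put(K key, V value) {
        entries.put(key, value);
    }

    synchronized int size() {
        return entries.size();
    }
}
```

Constraining by bytes rather than by count would need a way to size each entry, which is exactly the kind of concern the element-count model sidesteps.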

The javax.caching API doesn’t address this at all. Yet as this is a core concept in Ehcache and one that we want to keep, it seems only sensible that we extend the API to support it.

Exceptions

We at Terracotta actually were never really fond of the exception types and their usages in Ehcache. Mainly because of the way Ehcache evolved to support more and more advanced topologies, without wanting to break backward compatibility. When it moved to support distributed caches, an entirely new set of problems emerged and failure became an inherent part of the system. Yet nothing on the API really allowed for that. JSR-107 seems to come with the same limitation.

While for simple, memory-only use-cases the API and its exceptions seem good enough, distributed caches do indeed need more. This is something we want to address early on in the design of the new Ehcache 3.0 API.

Java 8

Ehcache 1.0 came out just a couple of days before Java 5 was released. While it was obvious that not everyone would jump on the boat right away, the enhancements that Java 5 brought were far from negligible, most notably generics. Yet Ehcache never managed to integrate these in its own API. The net result is that 10 years later, Ehcache still lacks generics and leaves type safety entirely to the user to deal with. Lesson learned… This time around, we want to make sure the API is ready for the immediate future that is Java 8.

Lambdas

Like 10 years ago with Java 5, not everyone will migrate to Java 8 when it comes out. Actually, probably only a few reckless ones will. But eventually more and more people will, and many of these will embrace its feature set. Lambdas are certainly a language feature that fits a caching API in many of its usages. As such, we want to make sure the Ehcache 3.0 API is ready for that. Foolishly enough, I believe it’s a task we can succeed in… hopefully without too much trouble!
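As one example of where lambdas fit, a cache lookup can take the loading logic as a function, so a miss is resolved inline by the caller. The sketch below is not the Ehcache 3.0 API (which is yet to be designed); LoadingCacheSketch is a made-up name and it simply delegates to the JDK’s ConcurrentHashMap.computeIfAbsent.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical sketch: a generic, type-safe cache whose misses are
// resolved by a caller-supplied function (typically a lambda).
final class LoadingCacheSketch<K, V> {
    private final ConcurrentMap<K, V> store = new ConcurrentHashMap<>();

    // Returns the cached value, invoking the loader only when the key is absent.
    V get(K key, Function<? super K, ? extends V> loader) {
        return store.computeIfAbsent(key, loader);
    }
}
```

A call then reads as `cache.get(id, key -> loadFromDatabase(key))`, where loadFromDatabase stands for whatever the application uses to fetch the value. Generics keep both key and value types checked by the compiler.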

Going about getting there

Having covered the two main “external” drivers (JSR-107 & Java 8) that are going to impact this Ehcache 3.0 API effort, it is also worthwhile to discuss the “internal” ones… Following are the very deep & ugly secrets of why we believe Ehcache 3.0 is a necessity!

Foundation work

As mentioned earlier, Ehcache grew its feature set organically. Most importantly it grew from a heap-only caching solution to the most feature-rich API in the caching landscape, certainly in the open-source one. But some of it hadn’t been accounted for, or simply turned out to be harder to fit in. There are mainly two things we believe need fundamental fixing in order to make the life of users much simpler…

Lifecycle

Interestingly enough this is something the JSR-107 expert group also left out. Not really surprising, as it probably would be really hard to have all vendors implement such a standardized lifecycle, let alone agree on one!

Ehcache has a much simpler task to address in that regard. The set of deployments and topologies Ehcache supports is pretty much nailed down now. So “all” there is to do is clearly define a lifecycle that encompasses all these and lets users easily move from one deployment to another while minimizing the set of new problems they have to think about. Whether we’re talking about moving a memory-only cache to a “restartable” one that outlives the lifecycle of the virtual machine, yet keeps it all type safe; or moving to a clustered deployment, over WAN, with all the implications that brings along… It’ll probably never be entirely transparent, but we can probably make it as smooth as possible…

Modular configuration

With the plethora of features came the ever growing set of configuration options to tune it all for the millions of deployment scenarios out there… The current ehcache.xsd is as scary as it gets! And that’s not even addressing any of the many conflicting configurations one can end up with.

Addressing this would require a much better isolation between these features and their configuration. As part of a prototype I did with Chris Dennis a while back now, we came up with a much more modular approach for it. An approach where “modules” don’t know about each other and can function without each other entirely. While also tying into the previous point on lifecycle, nailing this aspect right seems an absolute requirement for the user experience to be as smooth as possible.

Main API

Last but not least, the API. I’m talking about the main API here, the one that expands on the JSR-107 one. The one that users will use the most. Covering configuration, lifecycle and “everyday” cache operations. The code that makes all of the above concrete to engineers and developers.

Moar featurz!

And yet there is much more than that… There is this list of features Ehcache grew to have over the last 10 years! While they will not necessarily all be ported to this new version, most of them will be. Deciding how and when to port them is going to be the next step once we are done with the above. But there is also a caveat to this… We need to make sure that nothing done during the foundation work phase rules out or impedes the features that we know are going to be part of this new API. I will not even try to come up with an exhaustive list, but here are some of the main features to keep in mind:

  • BootstrapCacheLoader (sync/async)
  • Statistics
  • Search
  • Transaction (JTA-y, XA, Local, …)
  • Write Through (explicit)
  • Write Behind (explicit)
  • Read Through (explicit, with multiple Loaders)
  • Refresh Ahead (Inline & Scheduled)
  • … and more!

So what now?

So why this blog now? Well, I know there are many users out there. What’s great about users of a library such as Ehcache is that all these users are developers. Many of them actually seem to love to complain about the tools and libraries they use! Now is the time, more than ever: come and complain! All we ask is that you complain in a positive way. And if you feel like doing more than complaining, you might even participate! We’ve decided to start spec’ing this new API all in the open. The discussions will start happening on our Google Group ehcache-dev, and the API prototypes and code will all be on github.com/ehcache. So we need you. We need you to make your issues known. We need you to make your solutions known and we need you to start debating with everyone on how to make Ehcache 3.0 the caching API you always wanted. We already have the knowledge and expertise to implement it all, based on all we already have in the 2.0 line and all the cool stuff we’ve been hacking on lately… But now is the time to let us know what is to be done!

We’ll be starting this exercise as of today: both discussing the above, starting with foundation work on top of the javax.caching API, as well as pushing the experimental code we have been hacking on to github. API work will be happening both on github, through code reviews, as well as on the group for more “general” discussions. All I can do now, is hope to see you there!

Choosing Hibernate’s caching strategy

One of, if not the, most common use cases for Ehcache out there is using it with Hibernate. With only a little configuration, you can speed up your application drastically and reduce the load on your database significantly. Who wouldn’t want that? So off you go: you add the couple of lines of configuration to your application. Nothing too complicated… You’re quickly left with two questions to answer though:

  1. What entities and/or relations do I cache?
  2. What strategy do I use for these?

We’ll leave the former question as an exercise for the reader… but I’d like to spend some time here discussing the second question.

Choosing the strategy

So you decide to use caching on an entity or a relationship. You now have to tell Hibernate what caching strategy to use. Caching strategies come in 4 flavors: read-only, read/write, non strict read/write and transactional. What these entail is to be considered from two perspectives: your ORM’s (i.e. Hibernate’s) and the cache’s. The latter actually only applies to caches that loosen the consistency semantics, especially when you go distributed. Let’s start with the easy one first, Hibernate’s perspective.

From Hibernate’s perspective

The Hibernate documentation does give you some hints, but we’ll go into slightly more detail here about what’s going on.

Read-only

Easiest one to reason about. The data held in the cache is immutable and as such will never be updated. Solves all isolation problems. Data is. Hibernate will actually go as far as prohibiting any mutations to that dataset, throwing an exception at you if you try to update any such entity.

In a “strongly consistent” cache, as you’d have within a single VM from any cache vendor, there isn’t much more thought to put into this one. Straightforward… Gotta love this one!

Non strict Read/Write

Now if you have mutable entities, that’s one of the three options you have. As the name implies it’s read and write, but in a non strict sense… Non strict here means that Hibernate will not try to isolate any mutations to an entity from other concurrent transactions reading the same data. But what does this mean?

At this stage, I think it’s important to debunk a couple of misunderstandings about Hibernate. Firstly, and most importantly here, Hibernate does NOT store your entities in the second level cache. Instead it stores a “dehydrated” representation of the entity. Hibernate comes with an option to have it store Maps in the cache. You would never want to do that on a production system, but this can be useful for debugging an app. It makes it so a cache entry is a Map of property name to property value for a given entity type (e.g. entity “Company”: [“id”: 1, “name”: “Terracotta”, “doesCaching”: true]). The default storage is merely a more efficient way of storing that same data. And secondly, the cache is accessed/updated as the database is accessed/updated. So say a query is issued against companies, trying to select all companies that do caching, yet the session holds newly created (managed) company instances: Hibernate would flush all these to the database. Should these be reloaded from the database within this uncommitted transaction, they’d also make it into the second level cache. This is where non strict is important. Pending (i.e. uncommitted) changes can become visible through the cache to other transactions, breaking the I guarantee from ACID.

Also worth mentioning is that the behavior above is taken from Ehcache’s NonStrictReadWrite strategy. We’ve tried our best to make it hard for users to shoot themselves in the foot. This strategy will invalidate cached entries on updates and only populate the cache with data loaded from the database. Note that these invalidations happen after the transaction has successfully committed. As a result there is a race where “old” values can be seen in the cache, while updated in the database.
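The race described above can be illustrated with a minimal sketch. It is not Ehcache’s NonStrictReadWrite implementation: NonStrictSketch and its method names are invented for illustration, and two in-memory maps stand in for the database and the second level cache.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the "invalidate after commit" idea.
final class NonStrictSketch {
    private final Map<Long, String> database = new ConcurrentHashMap<>();
    private final Map<Long, String> cache = new ConcurrentHashMap<>();

    // Reads go to the cache first; misses populate it from the database only.
    String read(long id) {
        String cached = cache.get(id);
        if (cached != null) {
            return cached;
        }
        String loaded = database.get(id);
        if (loaded != null) {
            cache.put(id, loaded);
        }
        return loaded;
    }

    // Stands in for the transaction committing the new value to the database.
    void commitUpdate(long id, String value) {
        database.put(id, value);
    }

    // Runs after the commit: the cached entry is invalidated, never updated in place.
    void afterCommit(long id) {
        cache.remove(id);
    }
}
```

Any read that lands between commitUpdate and afterCommit still returns the previously cached value, which is the window in which “old” values remain visible.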

Read/Write

Read/Write (or strict read/write) tries to account for the shortcomings of the non strict approach. It does so by implementing a “soft-lock” mechanism, locking entries in the cache on a per entry granularity as they are mutated. Only after the transaction has successfully committed will these locks be released, installing the appropriate value in the cache. Using this caching strategy, Hibernate will lock entries on flushes to the database. Every other transaction accessing a locked entry will consider it a cache miss and hit the database instead. What’s nice about this strategy is that on contention, it lets the database handle the concurrent access, meaning that the isolation level provided is resolved by the database, as if there was no cache at all.

Now this seems perfectly reasonable… Yet, there is one shortcoming to this strategy as well: it stores these soft-locks in a cache. A cache that could evict (or expire) these. The absence of a lock for an entry can result in stale data being present in the cache or (as for non-strict) in uncommitted state being exposed to other transactions.
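Here is a heavily simplified sketch of the soft-lock idea and of what goes wrong when a lock is evicted. It is not how Hibernate or Ehcache implement it (SoftLockSketch and its methods are hypothetical, and a real soft-lock would carry more state than a single marker object); it only shows the lock living in the same store as the values.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: soft-locks are stored as cache entries themselves.
final class SoftLockSketch {
    private static final Object SOFT_LOCK = new Object();
    private final Map<Long, Object> cache = new ConcurrentHashMap<>();

    // A locked entry is reported as a miss, sending the reader to the database.
    Object get(long id) {
        Object entry = cache.get(id);
        return entry == SOFT_LOCK ? null : entry;
    }

    // A value loaded from the database is only installed when nothing
    // (neither a value nor a lock) is present for that key.
    boolean putFromLoad(long id, Object loaded) {
        return cache.putIfAbsent(id, loaded) == null;
    }

    // Called when the mutation is flushed to the database.
    void lock(long id) {
        cache.put(id, SOFT_LOCK);
    }

    // Called after a successful commit: the lock is replaced by the new value.
    void unlockAfterCommit(long id, Object committed) {
        cache.put(id, committed);
    }

    // Simulates the cache evicting or expiring an entry, lock included.
    void evict(long id) {
        cache.remove(id);
    }
}
```

While the lock is in place, putFromLoad is refused. Once the lock has been evicted, nothing stops a concurrent transaction from installing whatever it just loaded, which is the shortcoming described above.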

Transactional

Transactional deals with all these limitations. It basically expects your cache to be a full-blown JTA XAResource that can be modified alongside the database (meaning you do need to configure both to use the same isolation level). Yet you also pay the price of two-phase commit. Also, full XA compatibility does require recovery support, which for a cache might be overkill (as it only caches data held in the database, and your application could potentially work with a cold, i.e. empty, cache).

From a distributed perspective

When you move to a distributed environment (i.e. multiple application nodes hitting the same databases), your cache needs to be kept in sync across all these nodes. While enabling Terracotta clustering with Ehcache is only a couple lines of config again, how you configure your clustered cache becomes important.

Terracotta basically provides two consistency models for clustered caches: Strong and Eventual. In Strong, you basically get JMM guarantees across your cluster. Our Hibernate caching strategies will take care of everything needed to provide you with proper visibility semantics. In Eventual consistency though, things get slightly more complex for the user to understand (yet it provides better performance in terms of both read and write throughput).

Besides the consistency, Terracotta lets you configure the non-stop behavior of the clustered caches (what paragraph on distributed systems would be complete without mentioning CAP?!). If a certain operation can’t happen within a given configured time, you have multiple options. In the following paragraphs, when referencing non-stop, we will be talking about any behavior that’s not failure (i.e. favoring A over C).

Read-only

That’s again the easiest one to reason about: since the data never mutates, there is nothing to worry about in terms of isolation. Yet, in terms of visibility, multiple nodes could be putting the same value into the cache, resulting in more database hits than you would expect. Say you have a list of countries in such a cache, configured to hold all countries and never expire or evict them. This is a very common pattern for reference data that you want to keep close to your application. It could well be that some countries are loaded and put in the cache from multiple nodes, especially under very high initial load.

The same is true in the face of partitioning with non-stop configured: at worst you’ll see more database hits than you’d expect. But since data is immutable, there isn’t any stale state ever…

Non strict Read/Write

While this mode will also work fine with eventual consistency, you’re basically widening the race for inconsistent data making it into the cache. How much of a problem this may or may not be to your application is very use case dependent. Invalidations are propagated asynchronously to all nodes, which makes outdated values available slightly longer on the nodes that haven’t done the mutation. But since data is only ever populated from the database, when populated it’s always with the latest state.

Non-stop here can result in even larger races. Say an invalidation is ignored during a partition (noop configured non-stop caches): that data will remain in the cluster until the next mutation happens. Whether this situation is acceptable to your application is again up to you…

Read/Write

This is where things get more interesting! While it could sound like it would be as acceptable as the other two strategies so far, it actually isn’t, mainly because of the soft-locks. As reads remain mainly unlocked, we can’t provide Hibernate with the expectations it has on such a strategy (see Hibernate 3 vs. 4 below).

As explained for non-strict, with non-stop caches, you could end up with stale locks in the cache, basically rendering the whole strategy useless. Long story short, don’t use this strategy with a distributed cache that isn’t strongly consistent.

Transactional

In Terracotta land, that one is actually surprisingly easy. As you need to have your cache be a proper XAResource, Ehcache will not let you configure anything nonsensical here. That is, Ehcache will only let you use an XA transactional cache with strong consistency. And here the only sensible behavior in the face of partitioning is to fail, but as such also to remain consistent.

One last thing… or two

Optimistic locking to the rescue

In order to deal with stale state (as it can happen even without a second level cache, your session being the first level cache already), you can implement an optimistic locking strategy within your data-model. As a result though, some layer(s) in your application will either have to deal with OptimisticLockingException (e.g. you tried to update the salary of Alex version 12, but that’s not the version present in the database at flush time anymore) or have your user deal with it… As the latter probably doesn’t sound too good, it is worth understanding what corruption your data is exposed to with any given deployment (with or without second level cache, be it distributed or not).
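The version check behind optimistic locking can be sketched as follows. This is not Hibernate’s implementation: OptimisticStoreSketch and StaleVersionException are hypothetical names, and an in-memory map stands in for the database table and its version column.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical exception type, standing in for whatever the ORM would throw.
final class StaleVersionException extends RuntimeException {
    StaleVersionException(String message) {
        super(message);
    }
}

// Hypothetical sketch: every row carries a version that an update must match.
final class OptimisticStoreSketch {
    private static final class Row {
        final long salary;
        final int version;

        Row(long salary, int version) {
            this.salary = salary;
            this.version = version;
        }
    }

    private final Map<String, Row> rows = new HashMap<>();

    synchronized void insert(String name, long salary) {
        rows.put(name, new Row(salary, 0));
    }

    synchronized int versionOf(String name) {
        return rows.get(name).version;
    }

    synchronized long salaryOf(String name) {
        return rows.get(name).salary;
    }

    // The update succeeds only if the caller saw the current version;
    // on success the version is bumped so that older readers are rejected.
    synchronized void update(String name, long newSalary, int expectedVersion) {
        Row current = rows.get(name);
        if (current.version != expectedVersion) {
            throw new StaleVersionException(name + " is at version "
                    + current.version + ", not " + expectedVersion);
        }
        rows.put(name, new Row(newSalary, current.version + 1));
    }
}
```

Two callers that both read the same version can each attempt an update, but only the first one succeeds; the second is told its state was stale instead of silently overwriting the first.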

Hibernate 3 vs. 4

In Hibernate 4, the caching provider can actually implement some of the behavior of the strategy itself. In Hibernate 3, nothing like it was feasible. As a result, there are smarter things (to a still limited extent though) we could do about dealing with these different modes in a clustered environment. Yet one main problem remains: Read/Write is an FSM which is not implementable atop a weakened consistency model. Also, as of today, Hibernate doesn’t try to handle cache “failures”, which is probably a fair thing to do, but forces you to understand those quirks.

In conclusion

Hibernate tries as best as possible to hide the complexity of the caching layer from the user. Yet there is only so much it can do about it. Yes, it enables users to easily plug in a cache without much further thought. But as your application’s deployment complexity grows, you might be forced to revisit that initial strategy so that it deals with the oddities of distributed systems in a way acceptable both to the domain and to its users.
