The New York Times published an interesting article on the struggle of "Internet services" to provide 99.999% ("Five-Nines") reliability. From "99.999% Reliability: Don't Hold Your Breath":
MA BELL spoiled us.
AT&T's dial tone set the all-time standard for reliability. It was engineered so that 99.999 percent of the time, you could successfully make a phone call. Five 9s. That works out to being available all but 5.26 minutes a year.
Can we realistically expect that such availability will ever come to Internet services? Any given week, it seems, some well-known service suffers a shutdown. Recently, it was Hotmail and Skype. And Wikipedia, Facebook, Twitter, Foursquare and PayPal, among others, made the 2010 list of service interruptions compiled by Royal Pingdom, a company in Sweden that monitors the up time of Web services worldwide.
Internet computing, however, isn’t as unreliable as it may seem. After all, when was the last time you got to Google’s home page but couldn’t complete your search?
As more and more Web services companies acquire years of experience, we’ll see more consistent reliability — it’s just a matter of time and learning. Attaining Four-9s availability will become routine. That means available all but 52.56 minutes a year.
As for moving to 99.999, well, that may never come. “We don’t believe Five 9s is attainable in a commercial service, if measured correctly,” says Urs Hölzle, senior vice president for operations at Google. The company’s goal for its major services is Four 9s.
The article goes on to describe how Google provides nearly Five-9s search service by running mirrored live copies. I'll go further and say that in at least one respect, Google may have an easier task than some Internet services: search responses don't have to be consistent. If one mirror provides different answers than another, it doesn't matter (much). In other words, Google can live with some form of eventual consistency across its mirrored copies.
How do those providing Internet services (not to be confused with Internet service providers!) architect to improve availability?
One thing that Google and other companies offering Web services have learned to do is to keep software problems at their end out of the user’s view. John Ciancutti, vice president for personalization technology at Netflix, wrote on the company's blog in December about lessons learned in moving its systems from its own infrastructure to that of Amazon Web Services. He said Netflix had adopted a “Rambo architecture”: each part of its system is designed to fight its way through on its own, tolerating failure from other systems upon which it normally depends.
“If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond,” Mr. Ciancutti said. “We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.”
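The degrade-gracefully idea Mr. Ciancutti describes can be sketched in a few lines of Java. This is a hypothetical illustration of the pattern, not Netflix's actual code; the names (`RecommendationFacade`, `personalizedPicks`, and so on) are invented for the example:

```java
import java.util.List;

// A minimal sketch of graceful degradation: if the personalized
// recommendation service fails, fall back to a static list of
// popular titles rather than failing the whole response.
// All names here are hypothetical.
public class RecommendationFacade {

    interface RecommendationService {
        List<String> personalizedPicks(String userId) throws Exception;
    }

    private final RecommendationService recommendations;
    private final List<String> popularTitles; // precomputed fallback

    public RecommendationFacade(RecommendationService recommendations,
                                List<String> popularTitles) {
        this.recommendations = recommendations;
        this.popularTitles = popularTitles;
    }

    public List<String> titlesFor(String userId) {
        try {
            return recommendations.personalizedPicks(userId);
        } catch (Exception e) {
            // Degraded but still responsive: popular titles for everyone.
            return popularTitles;
        }
    }
}
```

The point is that the failure of a dependency is absorbed at the boundary, so the caller always gets an answer, just a less personalized one.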
What does this have to do with Coherence? On several occasions, I have heard customers remark that, almost inadvertently, putting Coherence between their data sources and the application had shielded (or could shield) end users from a back-end outage.
- One customer told me about a database outage that was not noticed by end users. The application buffered writes to the database using Coherence's write-behind capability. Coherence continued to buffer updates from the application, retrying until the database connection was re-established.
- Deb Ayers, Oracle Service Bus Product Manager, described how they cached web service call results in Coherence to improve performance. It was a small step for a customer to realize this optimization could do more than improve performance: by caching results from an external web service call (or, for that matter, from any other external resource), the application could shield itself from external resource failure. As the NY Times article notes, if the external resource is not available, some service may be degraded; for example, you may not have up-to-date results. But in many cases, this degraded service may be good enough.
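The first anecdote rests on write-behind: the application writes to the cache and moves on, while updates to the database are queued and retried until the connection comes back. Coherence implements this internally (asynchronously and batched, configured declaratively), but the core idea can be sketched in plain, single-threaded Java. This is an illustration of the pattern only, with hypothetical names:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// A simplified, single-threaded sketch of the write-behind idea:
// the application's write succeeds against the cache immediately,
// while the corresponding database update is queued and retried.
// Coherence's real write-behind is asynchronous and batched; this
// only illustrates why a database outage stays invisible to users.
public class WriteBehindSketch {

    interface Database {
        void store(String key, String value) throws Exception;
    }

    private final Database database;
    private final Map<String, String> cache = new HashMap<>();
    private final Deque<String> pendingKeys = new ArrayDeque<>();

    public WriteBehindSketch(Database database) {
        this.database = database;
    }

    // The application sees an immediate, successful write.
    public void put(String key, String value) {
        cache.put(key, value);
        pendingKeys.add(key);
    }

    // Called periodically (in Coherence, by the write-behind machinery):
    // flush queued writes, re-queueing any the database rejects.
    // Returns the number of writes still pending retry.
    public int flush() {
        int failed = 0;
        for (int i = pendingKeys.size(); i > 0; i--) {
            String key = pendingKeys.poll();
            try {
                database.store(key, cache.get(key));
            } catch (Exception e) {
                pendingKeys.add(key); // retry on the next flush
                failed++;
            }
        }
        return failed;
    }

    public String get(String key) { return cache.get(key); }
}
```

Reads and writes keep succeeding against the cache throughout the outage; only the flush to the database is deferred.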
Coherence provides several tools to help implement this. Exactly which ones suit you will depend on your application's requirements. Here is a list of possibilities:
- The cache-store, which interfaces directly with your Internet service. In an architecture in which Coherence provides the primary interface to data, the cache-store is what talks to the ultimate data source, in this case your Internet service.
- The Coherence refresh-ahead mechanism. Reading from our documentation: "In the Refresh-Ahead scenario, Coherence allows a developer to configure a cache to automatically and asynchronously reload (refresh) any recently accessed cache entry from the cache loader before its expiration. The result is that after a frequently accessed entry has entered the cache, the application will not feel the impact of a read against a potentially slow cache store when the entry is reloaded due to expiration. The asynchronous refresh is only triggered when an object that is sufficiently close to its expiration time is accessed—if the object is accessed after its expiration time, Coherence will perform a synchronous read from the cache store to refresh its value."
- Another, entirely different, approach is to introduce Coherence using a variation of the cache-aside pattern. With this approach, the application code would use the cache to store the results of Internet service calls, and would fall back to those cached results if the Internet service went down.
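The cache-aside variation in the last bullet might look roughly like this in application code. This is a hedged sketch with invented names; a plain `Map` stands in for the Coherence cache, and `InternetService` stands in for whatever external call you are making:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// A sketch of the cache-aside variation: the application calls the
// Internet service directly, stores successful results in the cache,
// and falls back to the last cached result if the service is down.
// Names are hypothetical; a Map stands in for a Coherence cache.
public class CacheAsideSketch {

    interface InternetService {
        String fetch(String key) throws Exception;
    }

    private final InternetService service;
    private final Map<String, String> cache = new HashMap<>();

    public CacheAsideSketch(InternetService service) {
        this.service = service;
    }

    public Optional<String> lookup(String key) {
        try {
            String fresh = service.fetch(key);
            cache.put(key, fresh);     // remember for later outages
            return Optional.of(fresh);
        } catch (Exception e) {
            // Degraded mode: possibly stale, but still a response.
            return Optional.ofNullable(cache.get(key));
        }
    }
}
```

Note the trade-off this makes explicit: during an outage you serve stale results rather than errors, which, as with the Netflix example above, is often good enough.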
I tend to prefer architectures that shield the application from knowledge of the details of the data sources. Sometimes customers, especially newcomers to Coherence, prefer to get their feet wet with some form of cache-aside pattern.