Interesting Academic Paper on “DRAM Errors in the Wild”

As oxymoronic as the phrase “interesting academic paper” may sound, especially to those with graduate training, who have every right to know how seldom an exception pops up to break the rule of passive, run-on, dry-as-the-Sahara prose so common in this genre, genuine exceptions do turn up. Back when I studied anthropology I sometimes marveled at how an otherwise astute professional, who could survive indefinitely in the most hostile climes and situations, couldn’t write his or her way out of a paper bag. Alas, the same is all too often true in computer science as well, if my many years of subscriptions to ACM and IEEE computer journals are any indication.

Well, here’s a genuine exception (and indeed, even the writing is at least halfway decent, if not better than that). It’s a paper from the ACM Sigmetrics 2009 conference, held in Seattle this summer from the 15th to the 19th of June. The paper in question is entitled DRAM Errors in the Wild: A Large-Scale Field Study, and was presented on Wednesday, June 17, in the time slot between 3:00 and 4:30 PM. Rightfully so, this presentation earned the “Best Presentation Award” at the conference. The field in which the large-scale study occurred was Google, where the researchers compiled DRAM error data from the tens of thousands of servers in Google’s many, many server farms around the world.

The study itself is cited, if you want to read it in its entirety (I have, and it’s worth it for those with enough curiosity to want to understand and know more after reading this synopsis). Basically, the researchers (Eduardo Pinheiro and Wolf-Dietrich Weber of Google, and lead author Bianca Schroeder of the University of Toronto) collected information about hard and soft memory errors from all of Google’s servers over a 30-month period and teased some very interesting observations out of that gargantuan collection of data. Let me also explain the difference between an uncorrectable (hard) and a correctable (soft) memory error: a soft memory error is transient (probably caused by some kind of interference with the memory chips themselves, or perhaps with communications between the CPU and the memory modules over the memory bus), whereas a hard error is persistent and usually means that the module manifesting the error needs to be replaced owing to faulty components or connections. Many systems will shut themselves down when they detect a hard memory error so as to avoid inadvertent damage to important files or system objects.
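
If you’re curious how these two error classes show up on an actual machine, here’s a minimal sketch that reads the corrected and uncorrected error counters the Linux kernel’s EDAC subsystem exposes through sysfs. This is my own illustration, not anything from the paper; it assumes a Linux server with ECC memory and an EDAC driver loaded, and the counters may simply not exist on other platforms.

    #!/usr/bin/env python3
    # Minimal sketch: report correctable ("soft", corrected by ECC) and
    # uncorrectable ("hard") memory error counts from the Linux EDAC
    # sysfs interface. Assumes ECC RAM and a loaded EDAC driver.
    import glob
    import os

    def read_count(path):
        """Return the integer in a sysfs counter file, or None if unreadable."""
        try:
            with open(path) as f:
                return int(f.read().strip())
        except (OSError, ValueError):
            return None

    controllers = sorted(glob.glob("/sys/devices/system/edac/mc/mc*"))
    if not controllers:
        print("No EDAC memory controllers found (no ECC RAM, or EDAC not loaded).")

    for mc in controllers:
        ce = read_count(os.path.join(mc, "ce_count"))  # correctable error count
        ue = read_count(os.path.join(mc, "ue_count"))  # uncorrectable error count
        print(f"{os.path.basename(mc)}: correctable={ce} uncorrectable={ue}")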

Here’s a brief summary of the conclusions from the paper, and how they compare with what had heretofore occupied a revered state of knowledge somewhere between “conventional wisdom” and “holy writ”. These are numbered Conclusions 1 through 7 exactly as they appear in the paper (I mostly paraphrase, but will quote anything lifted verbatim from the source):

  • Conclusion 1: The frequency of memory errors, and the range over which error rates vary, is much higher than has been reported in previous studies and in manufacturers’ own claims and specifications. The researchers observed that correctable error rates (which require ECC RAM to detect and correct) “translate to an average of 25,000 to 75,000…failures in time per billion hours of operation” per megabit (see the back-of-the-envelope conversion in the sketch after this list). Because of variability, some DIMMs experience huge numbers of errors while others experience none. Here’s another pair of gems: “…error correcting codes are crucial for reducing the large number of memory errors to a manageable number of uncorrectable errors” and “the remaining incidence of 0.22% per DIMM per year makes a crash-tolerant application layer indispensable for large-scale server farms.” And here I was thinking of application resiliency as a luxury rather than a necessity!
  • Conclusion 2: “Memory errors are strongly correlated.” That is, the more errors a DIMM has experienced in the past, the more errors it is likely to experience in the future. Even more interesting, the researchers observed “strong correlations between correctable errors and uncorrectable errors,” so that “in 70-80% of the cases an uncorrectable error is preceded by a correctable error in the same month or previous month, and the presence of a correctable error increases the probability of uncorrectable error by factors between 9 and 400.” Ouch!
  • Conclusion 3: The rate of correctable errors goes up over time, whereas that for uncorrectable ones goes down, owing primarily to replacement of modules that manifest uncorrectable errors. Aging starts to show its effects after 10-18 months of use. Moral: memory doesn’t last forever, nor even terribly long.
  • Conclusion 4: No evidence collected indicates that newer DIMMs show worse error behavior than older DIMMs. Despite increasing circuit density and bus speeds, error rates are more uniform than not across the DDR1, DDR2, and FBDIMM technologies and the 1 GB, 2 GB, and 4 GB modules that provided the sample population for this study. Conventional wisdom had held that newer devices should be more prone to errors than older ones because of density and speed increases. Apparently, it ain’t so.
  • Conclusion 5: “Temperature has a surprisingly low effect on memory errors.” Differences in temperature in the Google servers (“…around 20°C between the 1st and 9th temperature decile”) had only a marginal impact on memory error rates, when controlling for utilization. This, too, flies in the face of conventional wisdom, and runs counter to well-documented behavior for processing (not memory) chips of all kinds.
  • Conclusion 6: “Error rates are strongly correlated with utilization.” The more heavily the system is used and the busier the memory bus gets, the higher memory error rates climb. No big surprises here, as epitomized in the old saying “when you hurry, you make mistakes.”
  • Conclusion 7: “Error rates are unlikely to be dominated by soft errors.” Correctable error rates correlated strongly with system utilization, “…even when isolating utilization effects from the effects of temperature.” The researchers see hard errors in the DIMMs themselves, or errors in the datapath, as the more likely cause. This flies completely counter to previous academic work, which “…has assumed that soft errors are the dominating error mode in DRAM,” with estimates that hard errors are orders of magnitude less frequent and thought to comprise less than 2% of all errors. This study says otherwise, and backs it up nicely.
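
To put Conclusion 1’s numbers into more concrete terms, here’s the back-of-the-envelope conversion promised above: it turns the quoted 25,000 to 75,000 failures-in-time per billion hours per megabit into expected correctable errors per module per year. The 2 GB DIMM size is my own illustrative assumption, not a figure taken from the paper.

    # Convert the paper's quoted correctable-error rates (25,000 to 75,000
    # failures in time per billion device-hours, per megabit) into expected
    # correctable errors per DIMM per year.
    # The 2 GB DIMM size is an illustrative assumption, not a paper figure.
    HOURS_PER_YEAR = 24 * 365           # 8,760 hours
    MBIT_PER_DIMM = 2 * 1024 * 8        # hypothetical 2 GB DIMM = 16,384 megabits

    for fit_per_mbit in (25_000, 75_000):
        errors_per_year = fit_per_mbit * MBIT_PER_DIMM * HOURS_PER_YEAR / 1e9
        print(f"{fit_per_mbit:>6,} FIT/Mbit -> ~{errors_per_year:,.0f} "
              "correctable errors per DIMM per year")

Even at the low end that works out to thousands of corrected errors per module per year, which is exactly why the authors call error correcting codes crucial and a crash-tolerant application layer indispensable.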

I hope you’ll find this interesting enough to want to check out the original. For my part, it’s going to make me a whole lot more likely to keep extra memory modules around, and to replace them whenever they start showing unmistakable signs of hard errors. I’m also chewing on the idea that swapping out RAM every two to three years may be a good form of system maintenance, at least for systems I keep that long. You may want to consider doing likewise.
