Interesting Academic Paper on “DRAM Errors in the Wild”

As oxymoronic as the phrase “interesting academic paper” may sound, especially to those with graduate training who have every right to know how seldom an exception pops up to break the rule of passive, run-on, dry-as-the-Sahara prose so common in this genre, such papers do exist. Back when I studied anthropology I sometimes marveled at how an otherwise astute professional, who could survive indefinitely in the most hostile climes and situations, couldn’t write his or her way out of a paper bag. Alas, the same is all too often true in computer science as well, if my many years of subscriptions to ACM and IEEE computer journals are any indication.

Well, here’s a genuine exception (and indeed, even the writing is at least halfway decent, if not better than that). It’s a paper from the ACM SIGMETRICS 2009 conference, held in Seattle this summer from the 15th to the 19th of June. The paper in question is entitled DRAM Errors in the Wild: A Large-Scale Field Study, and was presented on Wednesday, June 17, in the time slot between 3:00 and 4:30 PM. Rightfully so, this presentation earned the “Best Presentation Award” at the conference. The “field” for this large-scale study was Google, where the researchers compiled DRAM error data from the tens of thousands of servers in Google’s many, many server farms around the world.

The study itself is cited, if you want to read it in its entirety (I have, and it’s worth it for those with enough curiosity to want to understand and know more after reading this synopsis). Basically, the researchers (Eduardo Pinheiro and Wolf-Dietrich Weber of Google, and lead author Bianca Schroeder of the University of Toronto) collected information about hard and soft memory errors from all of Google’s servers over a 30-month period and teased some very interesting observations and information out of that gargantuan collection of data. Let me also explain the difference between an uncorrectable or hard memory error and a correctable or soft memory error. A soft memory error is transient (probably caused by some kind of interference with the memory chips themselves, or perhaps with memory bus communications between the CPU or memory bus and the memory modules), whereas a hard error is persistent (and usually means that the module which manifests such an error needs to be replaced owing to faulty components or connections). Many systems will shut themselves down when they detect that a hard memory error is occurring, so as to avoid inadvertent damage to important files or system objects.

Here’s a brief summary of the conclusions from the paper, and how they compare with what had heretofore occupied a revered state of knowledge somewhere between “conventional wisdom” and “holy writ.” These are numbered Conclusions 1 through 7 exactly as they appear in the paper (I mostly paraphrase from same, but will quote anything lifted verbatim from the source):

  • Conclusion 1: The frequency of memory errors, and the range over which error rates vary, are much higher than has been reported in previous studies and in manufacturers’ own claims and specifications. The researchers observed that correctable error rates (which require use of ECC RAM) “translate to an average of 25,000 to 75,000…failures in time per billion hours of operation” per megabit. Because of variability, some DIMMs experience huge numbers of errors while others experience none. Here’s another pair of gems: “…error correcting codes are crucial for reducing the large number of memory errors to a manageable number of uncorrectable errors” and “the remaining incidence of 0.22% per DIMM per year makes a crash-tolerant application layer indispensable for large-scale server farms.” And here I was thinking of application resiliency as a luxury rather than a necessity!
  • Conclusion 2: “Memory errors are strongly correlated.” That is, the more errors a DIMM has experienced in the past, the more errors it is likely to experience in the future. Even more interesting, the researchers observed “strong correlations between correctable errors and uncorrectable errors,” so that “in 70-80% of the cases an uncorrectable error is preceded by a correctable error in the same month or previous month, and the presence of a correctable error increases the probability of uncorrectable error by factors between 9 and 400.” Ouch!
  • Conclusion 3: The rate of correctable errors goes up over time, whereas the rate of uncorrectable ones goes down, owing primarily to replacement of modules that manifest uncorrectable errors. Aging starts to show effects after 10 to 18 months of use. Moral: Memory doesn’t last forever, nor even terribly long.
  • Conclusion 4: No evidence collected indicates that newer DIMMs show worse error behavior than older DIMMs. Despite increasing circuit density and bus speeds, error rates are more uniform than not across DDR1, DDR2, and FB-DIMM technologies and across the 1 GB, 2 GB, and 4 GB modules that provided the sample population for this study. Conventional wisdom had stated that newer devices should be more prone to errors than older ones because of density and speed increases. Apparently, it ain’t so.
  • Conclusion 5: “Temperature has a surprisingly low effect on memory errors.” Differences in temperature in the Google servers (“…around 20°C between the 1st and 9th temperature decile”) had only a marginal impact on memory error rates, when controlling for utilization. This, too, flies in the face of conventional wisdom, and runs counter to well-documented behavior for processing (not memory) chips of all kinds.
  • Conclusion 6: “Error rates are strongly correlated with utilization.” The more heavily the system is used and the busier the memory bus gets, the higher memory error rates climb. No big surprises here, as epitomized in the old saying “when you hurry, you make mistakes.”
  • Conclusion 7: “Error rates are unlikely to be dominated by soft errors.” Correctable error rates correlated strongly with system utilization, “…even when isolating utilization effects from the effects of temperature.” The researchers consider hard errors in the DIMMs themselves, or errors in the datapath, to be the more likely cause of such errors. This flies completely counter to previous academic work which “…has assumed that soft errors are the dominating error mode in DRAM,” with estimates that hard errors are orders of magnitude less frequent and thought to comprise less than 2% of all errors. This study says otherwise, and backs it up nicely.
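
To put the FIT figures from Conclusion 1 into more familiar terms, here is a minimal Python sketch that converts “failures in time per billion hours of operation” per megabit into expected correctable errors per gigabyte per year. The function name is my own invention, and the sketch assumes 1 GB = 8,192 Mbit and a 8,760-hour year running around the clock; it also treats the rate as a flat average, which the paper itself warns is misleading because some DIMMs see huge numbers of errors while others see none.

```python
HOURS_PER_YEAR = 24 * 365   # 8,760 hours, ignoring leap years
MBIT_PER_GB = 8 * 1024      # assumption: 1 GB = 1,024 MB, 8 bits per byte


def fit_per_mbit_to_errors_per_gb_year(fit_per_mbit: float) -> float:
    """Convert a FIT rate per Mbit into expected errors per GB per year.

    One FIT is one failure per billion (1e9) device-hours of operation.
    """
    errors_per_mbit_hour = fit_per_mbit / 1e9
    return errors_per_mbit_hour * MBIT_PER_GB * HOURS_PER_YEAR


if __name__ == "__main__":
    # The two endpoints of the range quoted from the paper.
    for fit in (25_000, 75_000):
        per_year = fit_per_mbit_to_errors_per_gb_year(fit)
        print(f"{fit:,} FIT/Mbit is about {per_year:,.0f} correctable errors per GB per year")
```

Under those assumptions, the quoted range works out to roughly 1,800 to 5,400 correctable errors per gigabyte per year on average, which goes a long way toward explaining why the authors call error correcting codes crucial.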

I hope you’ll find this interesting enough to want to check out the original. For my part, it’s going to make it a whole lot more likely for me to keep extra memory modules around, and to replace them whenever they start showing unmistakable signs of hard errors. I’m also chewing on the idea that swapping out RAM every two to three years may be a good form of system maintenance, at least for systems I keep that long. You may want to consider doing likewise.

Pleasant Surprise with New System Build & Windows 7 Starter

About three weeks ago, my wife’s old PC started to give up the ghost. I built that system four or five years ago around a low-end DFI socket 939 motherboard with onboard VIA graphics, 100 Mbps Ethernet, two SATA 1 connectors, and DDR memory. The system included an AMD K8 Sempron 3200+ CPU (1.8 GHz), 2 GB of DDR-400 RAM, a Philips SATA DVD burner, and a 300 GB SATA 1 Maxtor hard disk in a cheapo no-name Taiwanese case. I added a four-way fan controller and three very quiet 80mm fans to the case to keep things cool, and it included a rock-solid Seasonic 400W PSU, but this system comes as close to bare bones as anything I’ve ever built. Except for occasional problems with the motherboard finding the SATA devices at boot-up (click the reset button and try again; repeat until it sees those devices) I never had a single problem with this machine over its entire productive life. I think it cost me about $200 to put it together, because many of its parts were leftovers from other projects or articles. This probably makes it one of my best builds ever, at least from a maintenance, upkeep, and reliability perspective.

But nothing lasts forever, even no-BS systems like that one. I’m not sure if it was the SATA controller starting to fail, or the drive itself starting to go, but my wife Dina reported that she was having trouble with file corruption and running programs. Soon thereafter, the system wouldn’t boot any more. I back all our systems up to an HP MediaSmart Server nightly, so I wasn’t worried about losing anything important, but a quick inspection of the system showed me it was time for a replacement. The hard disk was clearly corrupted, and not even my trusty old copy of SpinRite 6.0 could restore it to full working condition, and I was also concerned that the SATA controller was starting to fail (I had issues running a repair install from the optical drive as well).

As a temporary fix, I set her up with my Asus Eee PC 1000HE (Atom N280, 2 GB DDR2, 160 GB HD, Intel 950 Mobile graphics, GbE Ethernet interface) hoping that she might like it enough to make it her regular PC. After three or four days of use, we talked it over and she opined that the 1000HE — to which I had attached her Dell 2208WFP monitor, her Microsoft Comfort Curve 2000 keyboard, and a Logitech V550 Nano mouse to replace her older and no-longer-satisfactory Microsoft wired laser mouse — just wasn’t fast enough to meet her needs.

I decided to build a mini-ITX system for her, in part to keep noise and power consumption levels down, and also because I had lots of parts I could use to finish out such a machine. Visiting my old buddies at LogicSupply I settled on a bare-bones version of a complete system they offer there for sale, because I was able to furnish my own RAM and hard disk. A quick consultation with one of their technical sales guys convinced me to buy the system parts from them, and then to assemble the system myself, to save even more money. Here’s what I ended up buying from them:

  • MSI Industrial 945GME1 Core 2 Duo Mobile Mini-ITX Mainboard   $238.95
  • Morex T-3500-150W Mini-ITX Case, Black   $115.00
  • Panasonic UJ-875-A SATA Slimline Slot-Loading DVD Writer   $76.00
     - Cables: Slimline SATA CD/DVD Drive Converter Cable (+$7.00)
  • Intel Core Duo T2300 1.66 GHz Processor: 667 MHz Socket M   $106.00

So far, my total outlay was $542.95. I supplied my own Seagate Momentus 2.5″ 5,400 RPM 160 GB hard disk (approximate retail: $55) and a Patriot 2GB DDR2-800 memory module (approximate retail: $25).

The Morex case includes an external 150W PSU that looks just like (and probably is) the kind also used for nettop or desktop-replacement notebook PCs. The build went together pretty easily, except that I initially mounted the DVD drive upside down (it seemed more natural to hook up the SATA cable that way, though I soon realized my mistake once I tried to start using some DVDs). I’m ballparking total costs of the system at around $600 ($622.95 to be precise, not including shipping and tax). The only fan in the unit is on its itty-bitty CPU cooler, so it’s quite a bit quieter than an ordinary desktop case, most of which include a larger CPU cooler plus at least two 80mm (or larger) fans.

I installed Windows 7 Starter Edition on this box, because I knew Dina didn’t care about Aero (she uses the machine pretty much exclusively for Web surfing and e-mail) and she doesn’t really pay much attention to OS look, feel, and behavior anyway. I was relieved that she is happy with the machine and professes herself satisfied with its capabilities and performance. I probably spent no more than two hours putting everything together, and another two hours installing the OS, updating the drivers, and using the Windows Easy Transfer utility to move her files, preferences, and settings from the Asus Eee PC to her new mini-ITX machine.

My major learning event for this build was that the Intel system tray utility has to be set to “Single display” to take advantage of higher-resolution monitors like her Dell 2208WFP (native resolution: 1680 x 1050 pixels). By default, this utility was set to “Clone display” mode, which automatically limits maximum screen resolution to 1024 x 768. I had to download the latest set of Windows 7 utilities from the Intel site to gain access to the necessary system tray (or should I say “notification area”?) widget.

The Intel Mobile Graphics Accelerator Utility was the key to proper resolution

Once I got the screen working properly, I had to update drivers for some of the USB devices on the motherboard, at which point I also learned that Intel is releasing chipset drivers for Windows 7 slowly but surely (this motherboard uses an ICH7, so that’s the device for which I grabbed and installed drivers). Once again, Windows 7 scored well in terms of the drivers it supplied during the install. I only had to download three drivers (SetPoint 4.80, the Intel Chipset drivers, and a driver for the Dell 2208WFP monitor) to bring things completely up-to-date.

Once the machine was up and running, it proved to be something of a honey. My Seasonic Power Angel showed power consumption levels for the unit never exceeded 55W. During boot-up most values fluctuated between 30 and 40W; at idle the system consumed 33-38 W; running a full system scan with Norton Internet Security 2010, the highest value I observed was 51 W. Temperatures were likewise fairly balmy (though I could probably bring them down further by replacing the itty-bitty reference cooler that MSI supplies with the motherboard with something a bit more capable) as shown in this screencap from Franck Delattre’s excellent HW Monitor program.

HW Monitor reports for mini-ITX system

The unit generally runs cooler than a notebook PC (my Dell D620 with the same processor would typically run about 4-5 degrees Celsius hotter with a T2300 CPU; now with a T7200 it’s more like 10 degrees hotter on the same scale) but a little hotter than a well-ventilated desktop PC (even a quad core). Power consumption is extremely low, however — less than half that for the older DFI-based desktop it’s replacing, and less than a third of that for my two quad core desktops. To me, that makes the system a real winner, especially because it consumes 4-8W in sleep mode (and because Dina uses that machine less than 4 hours a day, sleep mode is basically where it lives). I’ll concede that for the same money you could buy a nice little notebook, and that you could buy a full-size desktop with the same or better specs for about $200 less. I’m not sure that the energy savings will make up that cost difference, but it’s a great-looking, compact machine and everybody whose opinion counts around here seems to like it. I’d include some photos of the build with this post, but it’s going to have to wait: Dina’s busy using that computer right now. Stay tuned!
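
For anyone who wants to chew on the question of whether energy savings could ever cover that roughly $200 price difference, here is a back-of-the-envelope Python sketch. The mini-ITX figures (about 36 W in use, about 6 W asleep, 4 hours of use a day) come from the measurements above; the 80 W figure for a comparable full-size desktop, the idea that it sleeps at the same 6 W, and the $0.12 per kWh electricity price are purely illustrative assumptions, so plug in your own numbers.

```python
def annual_kwh(active_watts: float, sleep_watts: float, active_hours_per_day: float) -> float:
    """Estimate yearly energy use for a PC that sleeps whenever it is not in use."""
    sleep_hours_per_day = 24 - active_hours_per_day
    wh_per_day = active_watts * active_hours_per_day + sleep_watts * sleep_hours_per_day
    return wh_per_day * 365 / 1000


def payback_years(price_premium: float, old_kwh: float, new_kwh: float, price_per_kwh: float) -> float:
    """Years of energy savings needed to recover a purchase price premium."""
    yearly_savings = (old_kwh - new_kwh) * price_per_kwh
    if yearly_savings <= 0:
        return float("inf")  # no savings means the premium is never recovered
    return price_premium / yearly_savings


if __name__ == "__main__":
    mini_itx = annual_kwh(active_watts=36, sleep_watts=6, active_hours_per_day=4)
    # Hypothetical desktop: 80 W in use and the same 6 W asleep (assumed, not measured).
    desktop = annual_kwh(active_watts=80, sleep_watts=6, active_hours_per_day=4)
    # Hypothetical electricity price of $0.12 per kWh.
    years = payback_years(price_premium=200, old_kwh=desktop, new_kwh=mini_itx, price_per_kwh=0.12)
    print(f"mini-ITX: {mini_itx:.1f} kWh/yr, desktop: {desktop:.1f} kWh/yr, payback: {years:.0f} years")
```

With those assumed inputs the payback period runs to a couple of decades or more, so at 4 hours of use a day the case for this build rests on size, noise, and looks rather than on the electric bill.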

In Memoriam: Cecilia Katherine Kociolek Tittel 1919-2009

I got back from a business trip Friday morning to learn that my Mom, aged 90, passed away peacefully in her sleep the previous night. She spent the last year and a half of her life in an assisted living facility in Fairfax County, VA, after living with me and my family for just over two years in the home (with “mother-in-law wing”) we had built to care for her in her declining years.

I’d like to take this opportunity to remember her to all of you. She was a ferociously intelligent woman who did her best to take care of her family, and I’ll always be grateful to her for ensuring that I got such a good education. She graduated first in her high school class, and also at the top of her class in nursing school. She served in WWII with distinction, and attained the rank of Major in a mobile army surgical hospital, following the Army through Northern Africa, into Sicily, and then on to France. When I was a boy, she took a job as the school nurse in the Heidelberg American School system in Germany, in part to keep a closer eye on me and my sister. She always encouraged my love of learning and language, and I owe much of what I am today to her care and attention. I will miss her terribly.

Mom was also a multiple cancer survivor: after being diagnosed with colon cancer in 1987 and learning to live with a colostomy (at which point she quit smoking), she was then diagnosed with lung cancer in 1989 (at which point she had the upper lobe of her left lung removed). She managed to survive for 20 years after those medical misadventures, and remained cancer free until her dying day. If anybody wants to remember her, I’d ask them to make a donation to the American Cancer Society in her name.

Cecilia K Tittel, 88th Birthday 2007

Rude Surprises: Asus P5K Mobo Doesn’t Do VT; Bungled BIOS Flash Hoses Same

In Windows 7, running Windows XP mode requires that the computer support Virtualization Technology (VT). Most modern Intel and AMD CPUs support VT, but I am learning to my woe and dismay that some motherboards — including some relatively new ones — do not. This includes the Asus P5K motherboard that has otherwise proven itself to be a capable and rock-solid Windows 7 test platform: I’d been running it with 12 GB of RAM installed and it was fast, agile, and let me run as many as half-a-dozen VMs with Virtual PC 2007 and XP, Vista, and various Windows 7 versions.

Upon learning this, I could suddenly understand why my test platform wouldn’t run Windows XP mode. I called my resident hardware guru, Toby, and asked him if any relief might be at hand. He said “Download the Asus BIOS Update utility, and grab the latest BIOS. It might fix this problem, if Asus has added VT support to a later release.” What he didn’t tell me, and I didn’t know, was that the P5K models are subject to total BIOS obliteration if the flash fails to complete or to validate properly. When I flashed the BIOS and saw the latter failure reported, I figured “No problem. I’ll reload the old BIOS on my next boot.” Not gonna happen, apparently: the BIOS never even started to POST so I had no way to get back into the system to make the change.

The BIOS is completely hosed, and I’ve ordered a new BIOS chip from an eBay supplier for a mere $20. My gut feel is that the chip may restore the motherboard to operational status, but it’s unlikely that I’ll get the VT support that I need from this motherboard. I’m planning to order a new, ultra-stable model from Asus or Gigabyte to replace it, probably with the P45 or a newer chipset, which is much more likely to support the virtualization technology I need.

In the meantime, my primary test machine is down for the count until the new BIOS chip arrives in the mail. Good thing I’ve got another backup PC to put in its place. That will have to wait until this weekend, however, when I should have time to run through yet another install and finish-out for Windows 7 on my currently unused Vista Media Center box. Wish me luck!

No Joy on In-place Upgrade; Clean Install Succeeds

I’d been hoping to try an upgrade install on my balky, problem-prone production PC to see if it could cure or at least help to address some of the issues that Vista has developed over time in that runtime environment. Alas, it was not to be. I’ll share the details in the next paragraphs, but for now I can only report that a strange and possibly spurious leftover from Trend Micro Internet Security 2008 stymied my in-place upgrade attempts. All contortions to remove its traces failed, and the upgrade utility wouldn’t let an upgrade proceed, so I performed a clean install instead. Overall results from that maneuver are 98% positive, as I will also report later in this post. On to the (failed) in-place upgrade attempt.

Attempting In-Place Upgrade

On my initial attempt to run an upgrade install from Vista to Windows 7 Ultimate on my production machine, the first run produced the following list of applications that had to be uninstalled for the process to proceed:

  1. Intellitype and/or Intellipoint: With an MS Comfort Curve 4000 keyboard installed I had the former, so it was removed without incident. Close examination showed that Intellipoint was present too, so I removed it as well. I used Revo Uninstaller throughout to clean up lingering files and registry traces after the built-in uninstall utility completed; both programs uninstalled themselves without leaving any lingering traces.
  2. Daemon Tools Lite: I used this to mount ISO images as virtual file systems on my PC, now that I’ve been downloading them regularly from MSDN (and having also grabbed some from BitTorrent during my work on a recent Windows 7 book). Interestingly, neither the Programs and Features item in Control Panel nor Revo Uninstaller sees this application. Fortunately, the built-in uninstaller worked to Microsoft’s satisfaction.
  3. Trend Micro Internet Security 2008 (TMIS08): Not installed on my PC, and I have no memory of ever having done so on this machine. Just to be safe, I uninstalled the two Trend Micro products I did have installed on this machine — namely, Hijack This! and Housecall, using Revo Uninstaller. No lingering traces for either item reported by that program. [Update on 8/14/09: On the phone with Rebekkah Hilgraves earlier this week she reminded me that I had indeed installed this software on my PC last year in connection with work for Digital Landing. It had been long since removed, with no obvious traces of its presence, but something must have been left behind.]

Alas, my next attempt to perform the in-place upgrade still failed, and reported that TMIS08 still needed to go. I searched my system drive for Trend Micro files and directories, and found none. I searched for and removed all Trend Micro references from my registry, ran CCleaner, rebooted, then tried again. No joy. Thinking it might be my current AV/anti-spyware package causing a false report, I uninstalled Spyware Doctor with Antivirus and tried again. Still no joy. I searched the Web for instructions on uninstalling TMIS08 and made sure I’d covered all the bases (I had, and even the MS Install Clean-up Tool reported no traces of this program on my system), then decided to give up and perform a clean install instead. Though I wasn’t able to satisfy my perverse curiosity, I have to believe this was the right thing to do anyway, given the numerous problems I’ve been fighting in Vista on this machine.
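
For readers who want to repeat the “search the system drive for leftover files and directories” step, here is a minimal Python sketch of that kind of search. It is not the tool I used, the function name is made up for illustration, and it covers only the file system: registry leftovers, which are the more likely culprit in a case like this, are out of its scope.

```python
import os


def find_paths_containing(root: str, needle: str):
    """Yield paths under root whose file or directory name contains needle.

    The match is case-insensitive. os.walk skips directories it cannot read
    by default, so protected system folders are silently ignored.
    """
    needle = needle.lower()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if needle in name.lower():
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    # Assumes the system drive is C:; adjust the root and search text as needed.
    for path in find_paths_containing("C:\\", "trend micro"):
        print(path)
```

An empty result means only that nothing on disk carries the vendor name in its file or folder name; it does not prove the installer’s detection logic has nothing left to find.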

Performing a Clean Install

After I spent about four hours trying to make the in-place upgrade work, the clean install took less than half an hour. After that, it took about an hour to get all of the Windows Update items installed, including a quick install and post-install cleanup to get MS Office 2007 Enterprise Edition up and running. The updates brought in new drivers for ACPI and my motherboard’s built-in RealTek GbE Ethernet adapter. Following that maneuver I installed the DriverAgent driver scanner to assess how Windows 7 did in supplying drivers for my motherboard. I had to install the latest Logitech SetPoint 4.80 version (out last Wednesday, 8/5) and a driver for my second monitor, a Dell 1905FP that showed up as a “Generic PnP Monitor” instead, and update the drivers for my Dell AIO 968 inkjet all-in-one unit. Not too shabby an experience, all in all: if anything, even better than what I experienced on half-a-dozen PCs (2 desktops, 4 notebooks) while working with the beta Windows 7 versions from Build 7000 through Build 7100 (the RC).

After that, I installed a pretty lengthy list of applications to re-create the everyday work environment on my production PC (but left out everything not absolutely necessary, trimming the total count from over 100 to 43, including system and driver related components listed in Revo Uninstaller):

Production PC Applications and Miscellany

| Freeware | Remarks | Commercial SW | Remarks |
|---|---|---|---|
| DriverAgent | Driver currency check | MS Office Enterprise 2007 | Standard productivity suite |
| FileZilla | FTP client | PC Doctor w/Antivirus | Favorite AV/antispyware pkg |
| HP USB Format Tool | Builds bootable UFDs | Acronis TrueImage Home 2009 | Use this for occasional image backups |
| Secunia PSI | Software update monitor | Corel PSP X2 | Budget image editor for pix and screencaps |
| WinDirStat | Visual disk space mapper | HP MediaSmart Tools | Client SW for HP MediaSmart Server |
| ISO Recorder | Excellent ISO burning tool | WinZip 12.1 | Still my favorite file compression toolkit |
| Logitech SetPoint 4.80 | Mouse driver and mgmt tool | WAIK for Windows 7 | For building minimal boot/repair images |
| Firefox | Alternative mainstay browser | More Freeware | More Remarks |
| Adobe Reader | PDF reader | Adobe Flash | Flash players for IE and Firefox |
| Piriform CCleaner | Registry and file clean-up tool | Revo Uninstaller | App uninstaller and clean-up tool |
| Skype | VoIP and IM program | Intel Matrix Storage Mgr 47 | Manages mirrored boot/system disks |
| MS Intellitype 7.0 | Keyboard mgmt app | Dell AIO 968 tools | AIO setup, mgmt, and misc tools |

Total time expended for everything, including the install itself, minor OS tweaks (set up ReadyBoost, tweak Folder Options, configure e-mail accounts, and so forth), and installation of all of the drivers and apps, was about 12 hours. This is at least four hours shorter than my last major Vista rebuild, and I attribute the time difference to Windows 7’s faster install time (1 hour for Vista versus half an hour for Windows 7) and an easier time with drivers and post-install set-up than with Vista (lots more updates to slipstream on an older operating system, to be sure).

What’s My Status?

My previous Vista issues have all but disappeared (see …Vista Mysteries for details): Sidebar and Event Viewer are working normally, there are no strange networking connectivity issues or spurious reports of same, and there are no dwm.exe or explorer.exe failures to report just yet. The HP MediaSmart connector and other software is functioning perfectly, and I’m once again able to interact with the MediaSmart Server as I should be. In short, all of my software mysteries have indeed been fixed. [Update on 8/14/09: I’m having WHS Connector problems on another Windows 7 machine, and thought I was having similar problems on the production machine as well, but they proved related to a failing D: drive that gave up the ghost yesterday morning. Though recovery took time, I was incredibly thankful to have a current backup.]

But all is not peaches and cream, either. I still have some issues with the memory card reader integrated into my Dell 2707 WFP monitor. Its USB hub works just fine now, and I can interact with SD cards, but the Compact Flash reader doesn’t appear to be working (and probably accounts for the Unknown Device warning that DriverAgent reports but that Device Manager does not). I do still have some USB issues on the system, but I’m increasingly inclined to suspect balky, damaged, or failing hardware (I bent the USB connector on the Corsair UFD that I now use for ReadyBoost — it’s my fastest flash drive —and I believe there’s an internal short or connection failure on the 2707’s CF memory card reader) for such problems as remain. But because I have a built-in card reader on the Dell AIO that works just fine, and even a plug-in CF-to-USB adapter, I’m not too concerned about the 2707 issue, particularly because my second monitor covers up those connectors anyway.

So far, I can live quite nicely with my current situation, and I see almost none of the disturbing signs of system instability under Windows 7 that I saw every day under Vista. My only current problem is that the video on my primary 2707 monitor goes black for a couple of seconds three or four times a day, with obvious signs of video driver issues (I’m running an Nvidia GeForce GTX 275 with driver version 8.15.11.8635). [Update on 8/14/09: yesterday MS provided a new, Windows 7 labeled Nvidia driver via Windows Update, which I installed immediately; now I’m down to one brief daily blackout.] I’ll wait for more usage history to be reported online and may roll back to an earlier version if that shows signs of easing my plight.

Time will tell, as it always does with Windows, including this latest version. All in all, I’m much happier with Windows 7 on this production unit than I was with Vista. So far, my intuition that this would be the case is holding out pretty well, but I’m not inclined to declare victory until I have more time in the Windows 7 harness and can see how things go on a day-in, day-out basis. Going forward, though, I will be limiting my experimental installs of new or test software to virtual machines, and trying to limit the amount of gunking up that I allow on this newly rebuilt Windows image. I have to see that as a potential and likely cause of my earlier Vista woes on this system.