Production Systems Settle Down At Long Last

Anybody who’s been following this blog for any length of time knows I’ve been battling incessant stability problems on my production PC for some time now. In fact, I have two PCs that alternate between test and production roles (a hardware configuration table follows later in this posting), both of which have settled down completely in the last three weeks or so. I upgraded both machines in the first half of August, 2009, but it’s taken me some time to work out all the kinks I’ve encountered along the way. I even wrote a story on this topic for my good friend and colleague Esther Schindler at ITExpertVoice.com; it’s entitled “Maximize the Stability Index on Your PCs.”

As of today, my primary production machine has held a Stability Index value of 10.0 since 11 PM on December 18 — that’s nearly 12.5 days as I write this blog, and at least 8 days longer than that machine has maintained that value at any time since I installed Windows 7 Ultimate x86 on August 9, 2009. Here’s the inflection point as recorded in Reliability Monitor’s graph log:

At the stroke of 11 PM, this machine hit a perfect "10"

And here’s what the stability graph has looked like recently as well:

The goal is what you see at right: flat-lining at the top!

How did I get to this much-sought-after position? You can read my many postings on this subject in my TechTarget Enterprise Windows blog, or the aforecited ITExpertVoice story, for more details, but the quick version goes like this:

  1. I quit installing everything and anything on my machines, and stuck only to proven tools I needed to do my job.
  2. I troubleshot and fixed some driver issues, particularly with my Dell All-in-One 968, which kept installing itself as an XPS device instead of a raw print device for some unfathomable reason.
  3. I found a complete set of current and working drivers, thanks to DriverAgent.com and persistent experimentation with various versions as problems presented themselves.
  4. I quit using a RAID 1 array for my system disk, and switched to a single drive system disk instead. I can’t find much evidence that the Intel RAID drivers and Storage Manager cause problems with Windows Vista and 7 system disks, but my experience has been that stability zoomed after making this change on both systems where I’d used that configuration to try to speed things up. Now, I’m using a single SSD for the system disk on each of those two systems with great results.

Just for the record, here’s the info on my two QX9650 systems, where the Test and Production labels tell you which is which:

QX9650 Test and Production Systems
Item             Production              Test
Motherboard      GA-X38-DQ6              Asus P5Q3
CPU              Intel QX9650 3.33 GHz   Intel QX9650 3.00 GHz
RAM              2x2GB DDR2-800          2x2GB DDR3-1066
Graphics         Nvidia GTX275           Nvidia 8800 GT
SysDisk          Intel X-25M 80GB SSD    SuperTalent 128GB SSD
Experience Rank  6.90                    6.70

The production system experienced its last reliability fault on 12/10/09 and has been climbing upward ever since; the test system beat that by one day and has been climbing since 12/09. The test system didn’t hit 10.0 until 12/18 (the same day as the production system) and both have been trouble-free for nearly a month now. I don’t know how long this can last, but I’m going to love every minute of it in the meantime.

Redmond Path: Great Widget for Editing Path Variable

Those of you who, like me, have been tinkering with Windows for any length of time know that Windows uses an environment variable named ‘path’ to search for files when you enter input at the command line. As long as any program or filename you type at the command line resides in a directory on this search path, Windows will find and run it without requiring a complete path specification (of the form C:\Program Files\CPU-Z\Redmond Path.exe, to use the executable behind this fine little utility as an example). Double-click that program name (or type it at the command line, provided it’s on your search path) and you’ll see a window like this one pop up.

Redmond Path lists all environment variables in a highly readable, easily accessible format

Each variable shows up on its own line in the display. You can highlight any item and click the red X to delete it, click the plus sign to insert a new directory spec (by default, additions show up at the end of the list), or use the up and down arrows to move selected items. The higher an item appears in the list, the sooner it gets searched, which gives it priority when the same filename appears in multiple directories on the list.

Editing the path variable in Windows XP, Vista, or 7 can be irksome. You can do it at the command line with an assignment like set path=%path%;C:\Example (which adds the C:\Example entry to the end of the path), or by using the Environment Variables/Edit System Variable or …/Edit User Variable controls in the System Properties window (click Start, Control Panel, System, then Advanced System Settings to get there through the menus, or press the Windows Logo and Break keys to jump right to the System window, and continue as before…). If you don’t do it at the command line, you get a textbox only about 40 characters wide, through which you have to peer at the value of the path variable, which can easily be 100 characters or more in length. This screencap should give you a good idea why that’s not the most convenient presentation for this type of data (Redmond Path lists the values in their order of appearance, one value per line, and is much easier to see, understand, and manipulate).

Lots of text, little display room

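To make the search-order point concrete, here’s a little Python sketch that mimics the first-match-wins lookup the path variable drives. This is illustrative only — not how Windows itself is implemented — and the directory names and tool.exe are made-up examples:

```python
import ntpath  # joins Windows-style paths portably


def find_on_path(filename, path_string, exists):
    """Return the first matching full path, searching directories in the
    order they appear -- first match wins, which is why moving an entry
    up in Redmond Path gives it priority over duplicates further down."""
    for directory in path_string.split(";"):
        if not directory:
            continue  # skip empty entries left behind by stray semicolons
        candidate = ntpath.join(directory, filename)
        if exists(candidate):
            return candidate
    return None


# Pretend both directories hold a copy of tool.exe:
fake_filesystem = {r"C:\Old\tool.exe", r"C:\New\tool.exe"}
print(find_on_path("tool.exe", r"C:\Old;C:\New", fake_filesystem.__contains__))
print(find_on_path("tool.exe", r"C:\New;C:\Old", fake_filesystem.__contains__))
```

Reordering the two entries flips which copy runs — exactly the effect of the up and down arrows in Redmond Path.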
Do yourself a big favor, and grab a copy of Redmond Path from the RedmondLab.net Website (part of GooglePages, actually). It’s free, it’s convenient, and it works nicely. It’s also a great little addition to any Windows user’s utility collection.

Interesting Academic Paper on “DRAM Errors in the Wild”

As oxymoronic as the phrase “interesting academic paper” may sound — especially to those with graduate training, who have every right to know how seldom an exception pops up to break the rule of passive, run-on, dry-as-the-Sahara prose so common in this genre — exceptions do exist. Back when I studied anthropology I sometimes marveled at how an otherwise astute professional, who could survive indefinitely in the most hostile climes and situations, couldn’t write his or her way out of a paper bag. Alas, the same is all too often true in computer science as well, if my many years of subscriptions to ACM and IEEE computer journals are any indication.

Well, here’s a genuine exception (and indeed, even the writing is at least halfway decent, if not better than that). It’s a paper from the ACM SIGMETRICS 2009 conference, held in Seattle this summer from the 15th to the 19th of June. The paper in question is entitled DRAM Errors in the Wild: A Large-Scale Field Study, and was presented on Wednesday, June 17, in the time slot between 3:00 and 4:30 PM. Rightfully so, this presentation earned the “Best Presentation Award” at the conference. The field in which the large-scale study occurred was Google, where the researchers compiled DRAM error data from the tens of thousands of servers in Google’s many, many server farms around the world.

The study itself is cited, if you want to read it in its entirety (I have, and it’s worth it for those with enough curiosity to want to understand and know more after reading this synopsis). Basically, the researchers — namely, Eduardo Pinheiro and Wolf-Dietrich Weber of Google, and lead author Bianca Schroeder of the University of Toronto — collected information about hard and soft memory errors from all of Google’s servers over a 30-month period and teased some very interesting observations and information out of that gargantuan collection of data. Let me also explain the difference between an uncorrectable (hard) and a correctable (soft) memory error: a soft memory error is transient, probably caused by some kind of interference with the memory chips themselves, or perhaps with communications between the CPU or memory bus and the memory modules; a hard error is persistent, and usually means that the module which manifests it needs to be replaced owing to faulty components or connections. Many systems will shut themselves down when they detect a hard memory error so as to avoid inadvertent damage to important files or system objects.

Here’s a brief summary of the conclusions from the paper, and how they compare with what had heretofore occupied a revered state of knowledge somewhere between “conventional wisdom” and “holy writ”. These are numbered Conclusions 1 through 7 exactly as they appear in the paper (I mostly paraphrase, but will quote anything lifted verbatim from the source):

  • Conclusion 1: The frequency of memory errors, and the range over which error rates vary, is much higher than has been reported in previous studies and in manufacturers’ own claims and specifications. The researchers observed that correctable error rates (which require the use of ECC RAM to detect) “translate to an average of 25,000 to 75,000…failures in time per billion hours of operation” per megabit. Because of variability, some DIMMs experience huge numbers of errors while others experience none. Here’s another pair of gems: “…error correcting codes are crucial for reducing the large number of memory errors to a manageable number of uncorrectable errors” and “the remaining incidence of 0.22% per DIMM per year makes a crash-tolerant application layer indispensable for large-scale server farms.” And here I was thinking of application resiliency as a luxury rather than a necessity!
  • Conclusion 2: “Memory errors are strongly correlated” — that is, the more errors a DIMM has experienced in the past, the more errors it is likely to experience in the future. Even more interesting, the researchers observed “strong correlations between correctable errors and uncorrectable errors,” so that “in 70-80% of the cases an uncorrectable error is preceded by a correctable error in the same month or previous month, and the presence of a correctable error increases the probability of uncorrectable error by factors between 9 and 400.” Ouch!
  • Conclusion 3: The rate of correctable errors goes up over time, whereas that for uncorrectable ones goes down, owing primarily to replacement of modules that manifest uncorrectable errors. Aging starts to show its effects in a period between 10 and 18 months of use. Moral: memory doesn’t last forever, nor even terribly long.
  • Conclusion 4: No evidence collected indicates that newer DIMMs show worse error behavior than older DIMMs. Despite increasing circuit density and bus speeds, error rates are more uniform than not across the DDR1, DDR2, and FB-DIMM platforms and the 1 GB, 2 GB, and 4 GB modules that provided the sample population for this study. Conventional wisdom had stated that newer devices should be more prone to errors than older ones because of density and speed increases. Apparently, it ain’t so.
  • Conclusion 5: “Temperature has a surprisingly low effect on memory errors.” Differences in temperature in the Google servers — “…around 20°C between the 1st and 9th temperature decile” — had only a marginal impact on memory error rates, when controlling for utilization. This, too, flies in the face of conventional wisdom, and runs counter to well-documented behavior for processing (not memory) chips of all kinds.
  • Conclusion 6: “Error rates are strongly correlated with utilization.” The more heavily the system is used and the busier the memory bus gets, the higher memory error rates climb. No big surprises here, as epitomized in the old saying “when you hurry, you make mistakes.”
  • Conclusion 7: “Error rates are unlikely to be dominated by soft errors.” Correctable error rates correlated strongly with system utilization, “…even when isolating utilization effects from the effects of temperature.” The researchers see hard errors in the DIMMs themselves, or errors in the datapath, as the more likely cause of such errors. This flies completely counter to previous academic work, which “…has assumed that soft errors are the dominating error mode in DRAM,” with estimates that hard errors are orders of magnitude less frequent and thought to comprise less than 2% of all errors. This study says otherwise, and backs it up nicely.
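To put Conclusion 1’s FIT figures in household terms, here’s a quick back-of-the-envelope calculation in Python. The conversion assumes the quoted rate applies uniformly per megabit; the per-DIMM numbers are my own arithmetic on the paper’s quoted range, not figures from the paper itself:

```python
def expected_errors_per_year(fit_per_mbit, capacity_mbit):
    """Convert a FIT rate (failures in time per billion device-hours),
    quoted per megabit, into expected errors per module per year."""
    hours_per_year = 24 * 365
    return fit_per_mbit * capacity_mbit * hours_per_year / 1e9

capacity = 8 * 1024  # a 1 GB DIMM holds 8192 megabits
low = expected_errors_per_year(25_000, capacity)   # paper's low end
high = expected_errors_per_year(75_000, capacity)  # paper's high end
print(f"Roughly {low:.0f} to {high:.0f} correctable errors per 1 GB DIMM per year")
```

That works out to thousands of correctable events per module per year, which is why the authors call error correcting codes “crucial.”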

I hope you’ll find this interesting enough to want to check out the original. For my part, it’s going to make it a whole lot more likely for me to keep extra memory modules around, and to replace them whenever they start showing unmistakable signs of hard errors. I’m also chewing on the idea that swapping out RAM every two to three years may be a good form of system maintenance — at least, for systems I keep that long. You may want to consider doing likewise.

Pleasant Surprise with New System Build & Windows 7 Starter

About three weeks ago, my wife’s old PC started to give up the ghost. I built that system four or five years ago around a low-end DFI socket 939 motherboard with onboard VIA graphics, 100 Mbps Ethernet, two SATA 1 connectors, and DDR memory. The system included an AMD K8 Sempron 3200+ CPU (1.8 GHz), 2 GB of DDR-400 RAM, a Philips SATA DVD burner, and a 300 GB SATA 1 Maxtor hard disk in a cheapo no-name Taiwanese case. I added a four-way fan controller and three very quiet 80mm fans to the case to keep things cool, and it included a rock-solid Seasonic 400W PSU, but this system comes as close to bare bones as anything I’ve ever built. Except for occasional problems with the motherboard finding the SATA devices at boot-up (click the reset button and try again; repeat until it sees those devices), I never had a single problem with this machine over its entire productive life. I think it cost me about $200 to put it together, because many of its parts were leftovers from other projects or articles. This probably makes it one of my best builds ever — at least from a maintenance, upkeep, and reliability perspective.

But nothing lasts forever, even no-BS systems like that one. I’m not sure if it was the SATA controller starting to fail, or the drive itself starting to go, but my wife Dina reported that she was having trouble with file corruption and running programs. Soon thereafter, the system wouldn’t boot any more. I back all our systems up to an HP MediaSmart Server nightly, so I wasn’t worried about losing anything important, but a quick inspection of the system showed me it was time for a replacement. The hard disk was clearly corrupted, and not even my trusty old copy of SpinRite 6.0 could restore it to full working condition; I was also concerned that the SATA controller was starting to fail (I had issues running a repair install from the optical drive as well).

As a temporary fix, I set her up with my Asus Eee PC 1000HE (Atom N280, 2 GB DDR2, 160 GB HD, Intel 950 Mobile graphics, GbE Ethernet interface) hoping that she might like it enough to make it her regular PC. After three or four days of use, we talked it over and she opined that the 1000HE — to which I had attached her Dell 2208WFP monitor, her Microsoft Comfort Curve 2000 keyboard, and a Logitech V550 Nano mouse to replace her older and no-longer-satisfactory Microsoft wired laser mouse — just wasn’t fast enough to meet her needs.

I decided to build a mini-ITX system for her, in part to keep noise and power consumption levels down, and also because I had lots of parts I could use to finish out such a machine. Visiting my old buddies at LogicSupply, I settled on a bare-bones version of a complete system they offer for sale, because I was able to furnish my own RAM and hard disk. A quick consultation with one of their technical sales guys convinced me to buy the system parts from them, and then to assemble the system myself, to save even more money. Here’s what I ended up buying from them:

  • MSI Industrial 945GME1 Core 2 Duo Mobile Mini-ITX Mainboard   $238.95
  • Morex T-3500-150W Mini-ITX Case, Black   $115.00
  • Panasonic UJ-875-A SATA Slimline Slot-Loading DVD Writer   $76.00
     – Cables: Slimline SATA CD/DVD Drive Converter Cable (+$7.00)
  • Intel Core Duo T2300 1.66 GHz Processor: 667 MHz Socket M   $106.00

So far, my total outlay was $542.95. I supplied my own Seagate Momentus 2.5″ 5,400 RPM 160 GB hard disk (approximate retail: $55) and a Patriot 2GB DDR2-800 memory module (approximate retail: $25).

The Morex case includes an external 150W PSU that looks just like (and probably is the same as) the units used for nettop or desktop-replacement notebook PCs. The build went together pretty easily, except that I initially mounted the DVD player upside down (it seemed more natural to hook up the SATA cable that way, though I soon realized my mistake once I tried to start using some DVDs). I’m ballparking total cost of the system at around $600 ($622.95 to be precise, not including shipping and tax). The only fan in the unit is on its itty-bitty CPU cooler, so it’s quite a bit quieter than an ordinary desktop case, most of which include a larger CPU cooler plus at least two 80mm fans (or larger).
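For anyone keeping score, the totals above check out; here’s the tally as a quick Python snippet (LogicSupply prices as listed, my own parts at approximate retail):

```python
# Parts ordered from LogicSupply, per the list above
logicsupply = {
    "MSI 945GME1 Mini-ITX mainboard":   238.95,
    "Morex T-3500-150W Mini-ITX case":  115.00,
    "Panasonic UJ-875-A DVD writer":     76.00,
    "Slimline SATA converter cable":      7.00,
    "Intel Core Duo T2300 processor":   106.00,
}
# Parts supplied from my own stock (approximate retail values)
own_stock = {
    "Seagate Momentus 160 GB hard disk": 55.00,
    "Patriot 2 GB DDR2-800 module":      25.00,
}
subtotal = sum(logicsupply.values())
total = subtotal + sum(own_stock.values())
print(f"LogicSupply subtotal: ${subtotal:.2f}; system total: ${total:.2f}")
```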

I installed Windows 7 Starter Edition on this box, because I knew Dina didn’t care about Aero (she uses the machine pretty much exclusively for Web surfing and e-mail), and she doesn’t really pay much attention to OS look, feel, and behavior anyway. I was relieved that she is happy with the machine and professes herself satisfied with its capabilities and performance. I probably spent no more than two hours putting everything together, and another two hours installing the OS, updating the drivers, and using the Windows Easy Transfer utility to move her files, preferences, and settings from the Asus Eee PC to her new mini-ITX machine.

My major learning event for this build was that the Intel system tray utility has to be set to “Single display” to take advantage of higher-resolution monitors like her Dell 2208WFP (native resolution: 1680 x 1050 pixels). By default, this utility was set to “Clone display” mode, which automatically limits maximum screen resolution to 1024 x 768. I had to download the latest set of Windows 7 utilities from the Intel site to gain access to the necessary system tray (or should I say “notification area”?) widget.

The Intel Mobile Graphics Accelerator Utility was the key to proper resolution

Once I got the screen working properly, I had to update drivers for some of the USB devices on the motherboard, at which point I also learned that Intel is releasing chipset drivers for Windows 7 slowly but surely (this motherboard uses an ICH7, so that’s the device for which I grabbed and installed drivers). Once again, Windows 7 scored well in terms of the drivers it supplied during the install. I only had to download three drivers (SetPoint 4.80, the Intel Chipset drivers, and a driver for the Dell 2208WFP monitor) to bring things completely up-to-date.

Once the machine was up and running, it proved to be something of a honey. My Seasonic Power Angel showed that power consumption for the unit never exceeded 55W. During boot-up most values fluctuated between 30 and 40W; at idle the system consumed 33-38W; running a full system scan with Norton Internet Security 2010, the highest value I observed was 51W. Temperatures were likewise fairly balmy (though I could probably bring them down further by replacing the itty-bitty reference cooler that MSI supplies with the motherboard with something a bit more capable), as shown in this screencap from Franck Delattre’s excellent HW Monitor program.

HW Monitor reports for mini-ITX system

The unit generally runs cooler than a notebook PC (my Dell D620 typically ran about 4-5 degrees Celsius hotter with the same T2300 CPU; now with a T7200 it’s more like 10 degrees hotter on the same scale) but a little hotter than a well-ventilated desktop PC (even a quad core). Power consumption is extremely low, however — less than half that of the older DFI-based desktop it’s replacing, and less than a third of that for my two quad core desktops. To me, that makes the system a real winner, especially because it consumes 4-8W in sleep mode (and because Dina uses that machine less than 4 hours a day, sleep mode is basically where it lives). I’ll concede that for the same money you could buy a nice little notebook, and that you could buy a full-size desktop with the same or better specs for about $200 less. I’m not sure that the energy savings will make up that cost difference, but it’s a great-looking, compact machine, and everybody whose opinion counts around here seems to like it. I’d include some photos of the build with this post, but that will have to wait: Dina’s busy using that computer right now. Stay tuned!
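Since I wondered aloud whether the energy savings could ever repay the price premium, here’s a rough Python estimate. The mini-ITX wattages come from my Power Angel readings; the old desktop’s draw and the electricity rate are assumptions I’ve plugged in purely for illustration:

```python
def annual_kwh(active_watts, sleep_watts, active_hours_per_day):
    """Yearly energy use for a machine that sleeps whenever it isn't in use."""
    sleep_hours = 24 - active_hours_per_day
    daily_wh = active_watts * active_hours_per_day + sleep_watts * sleep_hours
    return daily_wh * 365 / 1000  # watt-hours -> kilowatt-hours

RATE = 0.11  # assumed electricity cost in $/kWh -- adjust for your utility

new_box = annual_kwh(35, 6, 4)   # mini-ITX: ~35 W active, ~6 W asleep, ~4 h/day
old_box = annual_kwh(80, 12, 4)  # old DFI desktop: assumed at over twice the draw
print(f"mini-ITX: {new_box:.0f} kWh/yr (${new_box * RATE:.2f}); "
      f"old desktop: {old_box:.0f} kWh/yr (${old_box * RATE:.2f})")
```

At those assumed rates the difference is only a dozen dollars or so a year, which squares with my hunch that energy savings alone won’t close a $200 gap anytime soon.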

In Memoriam: Cecilia Katherine Kociolek Tittel 1919-2009

I got back from a business trip Friday morning to learn that my Mom, aged 90, passed away peacefully in her sleep the previous night. She spent the last year and a half of her life in an assisted living facility in Fairfax County, VA, after living with me and my family for just over two years in the home (with “mother-in-law wing”) we had built to care for her in her declining years.

I’d like to take this opportunity to remember her to all of you. She was a ferociously intelligent woman who did her best to take care of her family, and I’ll always be grateful to her for ensuring that I got such a good education. She graduated first in her high school class, and also at the top of her class in nursing school. She served in WWII with distinction, and attained the rank of Major in a mobile army surgical hospital, following the Army through Northern Africa, into Sicily, and then on to France. When I was a boy, she took a job as the school nurse in the Heidelberg American School system in Germany, in part to keep a closer eye on me and my sister. She always encouraged my love of learning and language, and I owe much of what I am today to her care and attention. I will miss her terribly.

Mom was also a multiple cancer survivor: after being diagnosed with colon cancer in 1987 and learning to live with a colostomy (at which point she quit smoking), she was then diagnosed with lung cancer in 1989 (at which point she had the upper lobe of her left lung removed). She managed to survive for 20 years after those medical misadventures, and remained cancer free until her dying day. If anybody wants to remember her, I’d ask them to make a donation to the American Cancer Society in her name.

Cecilia K Tittel, 88th Birthday 2007
