Page last updated:
2007 Apr 11 01:12

The following is an archive of a post made to our 'vox-tech mailing list' by one of its subscribers.

Re: [vox-tech] ECC memory --- is it worth it? (semi-OT)
Rick Moen wrote:
> A bad bit in memory, if indicative of a physical defect, will quickly
> manifest unmistakeably on Linux in the manner I described.  If not thus
> indicative, (from empirical observation over a long period of time:)
> it's extremely unlikely to have detectable long-term consequences.  

You speculate that it contributes to premature httpd deaths but is
undetectable long term?

> And if we all had unlimited funds, we'd all pay through the nose to buy
> it for all of our machines.  Sadly lacking the wealth of Midas, however,

$10 a DIMM requires you to "pay through the nose" and have the "wealth of
Midas"?  I guess we have different standards for differential costs
on a machine designed for multi-month uptimes.

> we're always obliged to decide in _which_ specific area of systems design 
> that extra dollar is best applied.  E.g., one might splurge on a disk
> redundancy, or a less cheap and cruddy HBA, or a less laughably
> inadequate PSU, all of which decisions are often (again, from empirical
> observation over a long period of time), in commodity PC purchases,
> likely to make a bigger difference to data integrity than does ECC.

True, but if you are stating from the beginning that you want a decent
design with multi-month uptimes, you are IMO above this level of "laughably
inadequate" PSUs and similar crud.

> Again, if this were not basically damned close to a fantasy-novel
> scenario, my data would have melted down into slag a decade ago.  So
> would nearly everyone else's.

I agree that a decent-quality machine is required before the improvement
from ECC is detectable.  But since it's incredibly cheap, adding $10-$20
to a small server or desktop seems reasonable to me.

> You might preach cluster design to someone who didn't build the largest 
> Linux HPC cluster in history (#3 on the Top 500 list, when deployed).  ;->

Cool, er, do I need to ask the obvious?  Did it use ECC?  If it did, how many
ECC errors did you see per GB per day?  I'll start collecting this, but
it will be months before I have any useful numbers.
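
A minimal sketch of how such per-GB-per-day numbers might be collected on
Linux, assuming the kernel's EDAC drivers are loaded and expose the usual
ce_count, ue_count and size_mb files under /sys/devices/system/edac/mc.
The function names are made up for illustration, and the counters only
cover the time since the EDAC driver was loaded.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Sum corrected/uncorrected error counts and memory size over all
    memory controllers found under edac_root (assumed EDAC sysfs layout)."""
    totals = {"ce": 0, "ue": 0, "size_mb": 0}
    fields = (("ce", "ce_count"), ("ue", "ue_count"), ("size_mb", "size_mb"))
    for mc in sorted(Path(edac_root).glob("mc[0-9]*")):
        for key, name in fields:
            counter = mc / name
            if counter.is_file():
                totals[key] += int(counter.read_text().strip())
    return totals


def errors_per_gb_per_day(corrected, size_mb, uptime_seconds):
    """Corrected errors normalised by memory size (GB) and elapsed days.
    Returns None when the size or the elapsed time is not positive."""
    gb = size_mb / 1024
    days = uptime_seconds / 86400
    if gb <= 0 or days <= 0:
        return None
    return corrected / gb / days
```

The elapsed time could be taken from the first field of /proc/uptime; for
example, 4 corrected errors on 2048 MB over two days works out to 1.0
error per GB per day.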

> I'd guesstimate about three orders of magnitude more likely to be a
> threat to data than is RAM corruption that is not based in outright
> defective RAM.  Where from?  From twenty years' experience, pretty much.

So what rate of disk errors have you seen that are not based on outright
defective disks?  Of course, file corruptions are similarly hard to detect
unless you run Tripwire or related checksum-based monitoring.
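
To illustrate what checksum-based monitoring amounts to, here is a rough
sketch (not how Tripwire itself is implemented): record a SHA-256 digest
per file, then compare a later snapshot against that baseline.  The
function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(root):
    """Map each regular file under root (by relative path) to its digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(baseline, current):
    """Names present in both snapshots whose digests differ."""
    return sorted(
        name for name in baseline
        if name in current and baseline[name] != current[name]
    )
```

A changed digest only says the contents differ; telling silent corruption
apart from a legitimate edit still needs other evidence, such as an
unchanged modification time.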

In general people seem more worried about file integrity than memory
integrity, even though file integrity depends on memory integrity.

>> So you are saying that HD defects are 10 or 100 times more likely than the
>> 1 bit per GB per month?  
> 
> If you assumed I was endorsing your figure, you assumed wrong.  If you 
> remain unclear on what I _was_ saying, you might want to re-read.

Which figures?  The 1 bit per GB per month that Wikipedia mentioned?  Or
the 1 sector per 10^14?  The latter is from one of the Seagate enterprise/
RAID edition drives.  Granted, real-world numbers tend to be worse
than reported values, and MTBFs are only very loosely correlated with
real-world annual return percentages.  Then again, a corrupt sector being
read from a disk that the system thinks is a valid sector seems very rare
indeed.
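
For a sense of scale, the two quoted figures can be turned into expected
counts.  This is back-of-the-envelope arithmetic only: it assumes the
drive figure means one unrecoverable sector per 10^14 bits read (the usual
form of that spec, but an assumption here), and it says nothing about
silent corruption, which is the rarer case.  The function names are
illustrative.

```python
def expected_ram_bit_errors(ram_gb, months, bits_per_gb_month=1.0):
    """Expected bit flips at the quoted rate of 1 bit per GB per month."""
    return ram_gb * months * bits_per_gb_month


def expected_disk_read_errors(bytes_read, bits_per_error=1e14):
    """Expected unrecoverable read errors, assuming one per 1e14 bits read."""
    return bytes_read * 8 / bits_per_error


# 4 GB of RAM over 6 months at the quoted rate
print(expected_ram_bit_errors(4, 6))
# Reading 12.5 TB (12.5e12 bytes) is 1e14 bits
print(expected_disk_read_errors(12.5e12))
```

Under those assumptions, a 4 GB machine would expect about 24 flipped
bits in six months, while a disk would have to read about 12.5 TB to
expect a single unrecoverable sector.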

What exactly do you mean by 3 orders of magnitude (base 2? base 10?)?
Undetected errors in healthy hardware?  Deaths? Detectable errors?
Loss of files?  Something else?

How often do you see this type of disk corruption?  It seems most fair to
equate clearly bad DIMMs with clearly dead disks, and to equate non-ECC
errors caused by random effects being reported as valid memory with
corrupted disk sectors being reported as valid data.

I'm genuinely interested in information related to this and have
significant practical experience with these issues as well.  I still
want to compare notes though.
_______________________________________________
vox-tech mailing list
vox-tech@lists.lugod.org
http://lists.lugod.org/mailman/listinfo/vox-tech


