D&C GLug - Home Page


Re: [LUG] Bah! Disks! Bah!

 

On Thu, 28 Jan 2010, tom wrote:

Gordon Henderson wrote:
Had a server creak and groan overnight with disk failures of both drives 
(mirrored). I'd like to say something like: "What's the chances of that 
happening", but unfortunately it's not the first time, as I had a server some 3 
years back which lost 3 out of its 6 drives in an 18-hour period.
All 5 drives are/were WDC drives too.

Oh well - looks like this one will limp along until I can replace them in a few days' time. (Well - badblocks scans of the partitions that are live were OK.)
It's only 8 months old too. Ho hum!

Anyone got any recommendations for good reliable SATA disks in the 500-700GB size range?
Gordon

I hope they're still under guarantee.
Yes they are, the whole rig is only 8-9 months old.

I'd recommend some investigation as to what went wrong - do you do thermal/voltage monitoring?
I monitor as much as I can - I think I've posted this before - I'm at the 
mercy of what the motherboard will give me via 'sensors' but everything is 
more or less constant, and I'm not sure anything is going to pick up a 
transient spike anyway ..
However - and I was going to post an update in a separate thread, but I'll 
just post it here...
After a day of massaging, badblocks scans, noodling, fiddling and so on, 
it's in a stable condition. I was able to deliberately fail the raid 
arrays on partitions where one side was OK and the other side faulty, and 
re-build the faulty sides - it's a "trick" I've done in the past to force 
a re-write of a bad sector, with good results (until such time as I've 
been able to change the drives), and now all partitions and mirrors read 
with no bad sectors. (Although all arrays are checked automatically every 
week anyway using the mdadm tools and this has never flagged an error)
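For anyone wanting to try the same fail-and-rebuild trick, here is a rough sketch of the command sequence it usually comes down to. It only builds the command lines and does not run anything. The array and partition names (/dev/md1, /dev/sdb2) are made up for illustration, and the exact commands used on this server are not stated above, so treat the sequence as an assumption. The "check" line shows the standard md sysfs interface for a scrub; whether the weekly job here uses exactly that is also an assumption.

```python
def rebuild_commands(array, member):
    """Return the mdadm commands that fail, remove and re-add one mirror member.

    Re-adding the member makes md resync it from the good side, which
    rewrites every sector on the re-added partition.
    """
    return [
        ["mdadm", array, "--fail", member],
        ["mdadm", array, "--remove", member],
        ["mdadm", array, "--add", member],
    ]


def check_command(array_name):
    """Return the shell line that asks md to scrub an array (e.g. 'md1')."""
    return "echo check > /sys/block/%s/md/sync_action" % array_name


if __name__ == "__main__":
    # Hypothetical device names; substitute the real array and partition.
    for cmd in rebuild_commands("/dev/md1", "/dev/sdb2"):
        print(" ".join(cmd))
    print(check_command("md1"))
```

Only fail the side that is actually faulty: while the rebuild runs the mirror has no redundancy, so the good side must read cleanly from end to end.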
So what changed... Well, yesterday afternoon, this popped up:

kernel: ata2: limiting SATA link speed to 1.5 Gbps
and after that, everything seemed to just settle down...
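A message like that is easy to miss in a busy log, so here is a small sketch that picks link-speed downgrades out of kernel log lines. The regular expression is based only on the one message quoted above; other kernel versions may word it differently, so the pattern is an assumption rather than a complete matcher.

```python
import re

# Pattern modelled on the single log line quoted above.
LINK_LIMIT = re.compile(
    r"(ata\d+(?:\.\d+)?): limiting SATA link speed to ([\d.]+) Gbps"
)


def find_link_downgrades(lines):
    """Return (port, speed_in_gbps) for every link-speed downgrade message."""
    hits = []
    for line in lines:
        match = LINK_LIMIT.search(line)
        if match:
            hits.append((match.group(1), float(match.group(2))))
    return hits


if __name__ == "__main__":
    sample = ["kernel: ata2: limiting SATA link speed to 1.5 Gbps"]
    print(find_link_downgrades(sample))
```

Feeding it the lines from the kernel log (or the output of dmesg) shows which port dropped its speed, which helps match the event to a particular drive and cable.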

So they're SATA II drives (3Gb/sec), it's a SATA II motherboard and I used the SATA cables that came with the mobo.
9 months uptime, no real change in usage patterns (it was trivially lightly 
loaded! waste of a server IMO, but client is paying for it), then this 
happens.
So stuff going through my mind right now is a lot of "what ifs"... What 
if the cables were cheap ones only good for 1.5Gb/sec? What if a cable has 
worked loose? (I drove that box to the co-lo myself and always check 
connectors before turning on the mains.) What if it's just been some 
cosmic ray? What if I'm just unlucky? What if it's a kernel bug in the 
SATA driver? (Kernel 2.6.29.3, and I'm now scanning the changelogs.) Who 
knows.
Checking the drives (via smartctl), I find that neither drive has any 
sectors remapped. One has a "pending bad sector" though, but they both 
pass smart tests OK, and badblocks scans/md checks of both disks are fine.
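As a sketch of how that check could be scripted, the snippet below pulls the remapped and pending sector counts out of the attribute table that "smartctl -A" prints. The sample text is invented to mirror the situation described (nothing remapped, one sector pending) and is not real output from these drives; the column layout is assumed to be the usual ten-column table.

```python
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")

# Made-up sample in the usual "smartctl -A" layout, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
"""


def sector_counts(smartctl_output):
    """Return {attribute_name: raw_value} for the watched attributes."""
    counts = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have ten columns; the raw value is the last one.
        if len(fields) >= 10 and fields[1] in WATCHED:
            counts[fields[1]] = int(fields[9])
    return counts


if __name__ == "__main__":
    print(sector_counts(SAMPLE))
```

A non-zero pending count with nothing remapped means the drive has a sector it could not read but has not yet had a chance to rewrite or reallocate.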
So I'm now revising my plans - I need to get a new server up there by the 
end of Feb, but I'm not going to make a mad dash up there now (Sheffield) 
- I'll build and test this new server, then when I go up, take new SATA 
cables, and new drives and give it a good testing while I'm up there 
installing the new server.
One thing I have noticed - there seems to be a lack of places openly 
advertising SATA II cables, but I admit I'm no real expert in that area - 
is there a big difference between SATA I and II (other than the speed, and 
I can see why you'd need better cables for 3Gb/sec)? Maplin seems to be 
the best place with a range of cables. Aria (who I usually buy from) only 
has 1 SATA II cable (in 2 lengths) and they have right-angled connectors, 
which I'm not a fan of when disks are stacked one above the other (even 
with a gap, the cables still sometimes interfere).
Cheers

Gordon

--
The Mailing List for the Devon & Cornwall LUG
http://mailman.dclug.org.uk/listinfo/list
FAQ: http://www.dcglug.org.uk/linux_adm/list-faq.html