D&C GLug - Home Page


Re: [LUG] Bah! Disks! Bah!

 

On Thu, 28 Jan 2010, tom wrote:

Gordon Henderson wrote:

Had a server creak and groan overnight with disk failures of both drives (mirrored). I'd like to say something like: "What are the chances of that happening?" Unfortunately, it's not the first time, as I had a server some 3 years back which lost 3 out of its 6 drives in an 18-hour period.

All 5 drives are/were WDC drives too.

Oh well - looks like this one will limp along until I can replace them in a few days' time. (Well - badblocks scans of the partitions that are live were OK.)

It's only 8 months old too. Ho hum!

Anyone got any recommendations for good reliable SATA disks in the 500-700GB size range?

Gordon

I hope they're still under guarantee.

Yes they are, the whole rig is only 8-9 months old.

I'd recommend some investigation as to what went wrong - do you do thermal/voltage monitoring?

I monitor as much as I can - I think I've posted this before - I'm at the mercy of what the motherboard will give me via 'sensors', but everything is more or less constant, and I'm not sure anything is going to pick up a transient spike anyway ..

However - and I was going to post an update in a separate thread, but I'll just post it here...

After a day of massaging, badblocks scans, noodling, fiddling and so on, it's in a stable condition. I was able to deliberately fail the faulty side of each raid array where one side was OK and the other faulty, and then rebuild it. It's a "trick" I've done in the past to force a re-write of a bad sector, with good results (until such time as I've been able to change the drives), and now all partitions and mirrors read with no bad sectors. (All arrays are checked automatically every week anyway using the mdadm tools, and this has never flagged an error.)
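For anyone wanting to script that fail-and-rebuild trick, here is a minimal sketch of the mdadm sequence involved. It only builds the command lines and does not run anything. The array and partition names in the example are made up, and it assumes a plain mirror with no write-intent bitmap, so that adding the member back triggers a full resync from the good side (which is what rewrites the bad sector).

```python
def rewrite_mirror_half_cmds(array, member):
    """Return the mdadm commands that fail one member of a mirror,
    remove it from the array, and add it back so md resyncs it
    from the healthy side."""
    return [
        ["mdadm", array, "--fail", member],    # mark the member faulty
        ["mdadm", array, "--remove", member],  # take it out of the array
        ["mdadm", array, "--add", member],     # add it back; md resyncs it
    ]


if __name__ == "__main__":
    # Hypothetical names: /dev/md1 is the mirror, /dev/sdb2 the faulty half.
    for cmd in rewrite_mirror_half_cmds("/dev/md1", "/dev/sdb2"):
        print(" ".join(cmd))
```

Resync progress shows up in /proc/mdstat. Worth double-checking which member is the faulty one before running anything like this for real, since failing the good side of a degraded mirror loses the data.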

So what changed... Well, yesterday afternoon, this popped up:

kernel: ata2: limiting SATA link speed to 1.5 Gbps

and after that, everything seemed to just settle down...
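A message like that is easy to miss in a busy log, so here is a rough sketch of a filter that picks out link-speed downgrades. The regular expression is based only on the one line quoted above; other kernel versions may word it differently, so treat the pattern as an assumption.

```python
import re

# Pattern modelled on the single log line quoted in this post.
LINK_LIMIT = re.compile(
    r"(ata\d+(?:\.\d+)?): limiting SATA link speed to ([\d.]+) Gbps"
)


def find_link_downgrades(lines):
    """Return (port, speed_in_gbps) for every downgrade message found."""
    hits = []
    for line in lines:
        match = LINK_LIMIT.search(line)
        if match:
            hits.append((match.group(1), float(match.group(2))))
    return hits


if __name__ == "__main__":
    sample = ["kernel: ata2: limiting SATA link speed to 1.5 Gbps"]
    print(find_link_downgrades(sample))  # [('ata2', 1.5)]
```

Feeding it the lines of the kernel log (or the output of dmesg) would show which ports have been throttled and to what speed.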

So they're SATA II drives (3Gb/sec), it's a SATA II motherboard and I used the SATA cables that came with the mobo.

9 months uptime, no real change in usage patterns (it was trivially lightly loaded! waste of a server IMO, but client is paying for it), then this happens.

So stuff going through my mind right now is a lot of "what ifs"... What if the cables were cheap ones only good for 1.5Gb/sec? What if a cable has worked loose? (I drove that box to the co-lo myself and always check connectors before turning on the mains.) What if it's just been some cosmic ray? What if I'm just unlucky? What if it's a kernel bug in the SATA driver? (Kernel 2.6.29.3, and I'm now scanning the changelogs.) Who knows.

Checking the drives (via smartctl), I find that neither drive has any sectors remapped. One has a "pending bad sector", though, but they both pass SMART tests OK, and badblocks scans and md checks of both disks are fine.
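The two numbers in question are the raw values of the Reallocated_Sector_Ct and Current_Pending_Sector attributes in the `smartctl -A` table. A small sketch of pulling them out of that output follows. The sample text is hypothetical, not output from these drives, and the parser assumes the usual ten-column attribute layout with a plain integer in the raw value column.

```python
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector"}

# Hypothetical excerpt in the usual `smartctl -A` layout (not real drive data).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
"""


def sector_counts(smartctl_output):
    """Map each watched attribute name to its raw value."""
    counts = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            counts[fields[1]] = int(fields[9])
    return counts


if __name__ == "__main__":
    print(sector_counts(SAMPLE))
```

A non-zero pending count with a zero reallocated count matches the situation described: a sector the drive could not read but has not yet had the chance to rewrite or remap.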

So I'm now revising my plans - I need to get a new server up there by the end of Feb, but I'm not going to make a mad dash up there now (Sheffield) - I'll build and test this new server, then when I go up, take new SATA cables, and new drives and give it a good testing while I'm up there installing the new server.

One thing I have noticed - there seems to be a lack of places openly advertising SATA II cables, but I admit I'm no real expert in that area. Is there a big difference between SATA I and II (other than the speed, and I can see why you'd need better cables for 3Gb/sec)? Maplin seems to be the best place, with a range of cables. Aria (who I usually buy from) only has 1 SATA II cable (in 2 lengths), and they have right-angled connectors, which I'm not a fan of when disks are stacked one above the other (even with a gap, the cables still sometimes interfere).

Cheers

Gordon

--
The Mailing List for the Devon & Cornwall LUG
http://mailman.dclug.org.uk/listinfo/list
FAQ: http://www.dcglug.org.uk/linux_adm/list-faq.html