
Re: [LUG] Linux Mint - Bad Kernel? [FIXED]

 

On 23/05/13 20:54, bad apple wrote:
My parents' PC has just gone mad - it's got a brand new OCZ Agility4 SSD
in it, dual booting Win8/Mint 15.

ghost@SILO ~ $ uname -a
Linux SILO 3.8.0-21-generic #32-Ubuntu SMP Tue May 14 22:16:46 UTC 2013
x86_64 x86_64 x86_64 GNU/Linux
ghost@SILO ~ $ lsb_release -rc
Release:    15
Codename:    olivia
ghost@SILO ~ $ dmesg | grep ata | grep error | wc -l
72

The logs are full of ATA errors - sense status warnings, bus resets, the
lot: this usually results in mountall failing during boot which
disconnects Plymouth and dumps you in a broken system. On the rare occasions
it does boot - with no dmesg errors - the system will usually panic 5-15
minutes in, remount / read-only and generally trash everything. The
really weird thing is rebooting to a live USB distro and running
fsck.ext4 on the linux partition returns "file system clean"! WTF?
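The quoted one-liner (`dmesg | grep ata | grep error | wc -l`) counts any line containing "ata" (so "data" matches too) that also contains lowercase "error". A slightly tighter count is sketched below; it assumes kernel messages carry the usual libata `ataN:` / `ataN.NN:` port prefix, and `count_ata_errors` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# count_ata_errors [FILE]: count kernel log lines that come from a libata
# port (e.g. "ata3:" or "ata3.00:") and mention "error" in any case.
# Reads FILE, or standard input when no file is given ("-" means stdin).
count_ata_errors() {
    grep -Eic '(^|[[:space:]])ata[0-9]+(\.[0-9]+)?: .*error' "${1:--}"
}

# Usage:
#   dmesg | count_ata_errors
```

Lines such as "EXT4-fs error ... data" are no longer counted, while "SError" lines from an ATA port are, because of the case-insensitive match.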



Ok, I'm finally in a position to follow up on this and provide the answer (I always get to the bottom of it eventually).

I can't pretend I understand exactly *why* this was happening, but it seems to be a current, unresolved bug - see:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1063354

Of note, responders on that issue report the same problem with spinning rust as well as SSDs (all makes), confirming my suspicion that the SSD itself (admittedly OCZ do have a bad reputation, but whatever) had absolutely nothing to do with it. Windows 8 continues to run flawlessly from the same disk in dual-boot mode. I went through everything, including checking partition alignment on 512 blocks, kernel parameters, firmware revisions, SATA cables, you name it. It also turns out that it had nothing to do with Mint - as much as I've come to despise it - as the replacement Debian Wheezy system soon started showing identical issues.
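For anyone repeating the alignment check, here is a minimal sketch. It relies on Linux sysfs reporting each partition's start in 512-byte units (`/sys/block/<disk>/<part>/start`). The post doesn't say exactly which boundary "512 blocks" refers to, so the 4096-byte default below is an assumption; pass a different boundary (e.g. 1048576 for 1 MiB) as needed. `is_aligned` and `check_all` are hypothetical helper names.

```shell
#!/bin/sh
# is_aligned START_SECTOR [BOUNDARY_BYTES]
# Succeed if a partition starting at START_SECTOR (512-byte units, as sysfs
# reports it) begins on a multiple of BOUNDARY_BYTES.
# The 4096-byte default is an assumption, not taken from the post.
is_aligned() {
    [ $(( $1 * 512 % ${2:-4096} )) -eq 0 ]
}

# check_all: report every sdX partition found in sysfs (Linux layout assumed).
check_all() {
    for start in /sys/block/sd*/sd*/start; do
        [ -r "$start" ] || continue   # skip if the glob matched nothing
        part=${start%/start}
        part=${part##*/}
        if is_aligned "$(cat "$start")"; then
            echo "$part: aligned"
        else
            echo "$part: NOT aligned"
        fi
    done
}
```

A partition starting at sector 2048 passes at both 4 KiB and 1 MiB boundaries, while the old DOS-style start at sector 63 fails.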

Wheezy was much more graceful about it, for what it's worth: the sudden spewing of ATA errors to syslog would only happen once every 5-10 boots (the rest of the time it would boot clean), and even on an error-laden startup it wouldn't, unlike Mint, panic and lock the filesystems into read-only mode almost instantly. Mint would show the errors on every single boot, remount / read-only immediately on the very rare occasions it even managed to start, and 90% of the time would fail mid-boot on calling "mountall" from Plymouth, somehow unable to even find the filesystems (despite having obviously already mounted / and printing error messages directly from it - but that's another story; I'll leave my Ubuntu/Mint hating for another day).

So, the actual "fix" eventually came from the link above. I modified /etc/default/grub as suggested, replacing line 1 with line 2:

1: GRUB_CMDLINE_LINUX=""
2: GRUB_CMDLINE_LINUX="libata.force=noncq"
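The same edit can be scripted. This is a minimal sketch: `set_grub_cmdline` is a hypothetical helper, it assumes the `GRUB_CMDLINE_LINUX=` line appears once, and it overwrites whatever value that line held (empty in this case). `update-grub` is the Debian/Ubuntu wrapper that regenerates grub.cfg from /etc/default/grub.

```shell
#!/bin/sh
# set_grub_cmdline FILE VALUE
# Replace the GRUB_CMDLINE_LINUX="..." line in FILE with one holding VALUE,
# keeping the original as FILE.bak. GRUB_CMDLINE_LINUX_DEFAULT is untouched
# because the pattern requires "=" straight after GRUB_CMDLINE_LINUX.
# Assumes VALUE contains no '|' or '&' (special in the sed replacement).
set_grub_cmdline() {
    file=$1
    value=$2
    cp -- "$file" "$file.bak" &&
    sed "s|^GRUB_CMDLINE_LINUX=.*|GRUB_CMDLINE_LINUX=\"$value\"|" "$file.bak" > "$file"
}

# Usage (as root), then reboot:
#   set_grub_cmdline /etc/default/grub 'libata.force=noncq'
#   update-grub
```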

Once I'd run "update-grub" and rebooted, the problem simply disappeared and dmesg showed the system switching to fallback SWNCQ mode instead. It's been fine ever since. For the sake of thoroughness, I reversed the changes and rebooted a few times, and sure enough, the problem reappeared within the first 5-10 boots. A couple of quick disk benchmarks showed that performance hadn't been negatively impacted at all, so I've left it at that and called it a day.
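After rebooting, it's worth confirming that the kernel actually received the parameter before judging whether the fix worked. A minimal sketch: `cmdline_has_noncq` is a hypothetical helper that only recognises the exact `libata.force=noncq` form used here, not port-prefixed or comma-separated variants of `libata.force`.

```shell
#!/bin/sh
# cmdline_has_noncq [FILE]: succeed if the kernel command line in FILE
# (default /proc/cmdline) contains libata.force=noncq as a separate word.
cmdline_has_noncq() {
    case " $(cat "${1:-/proc/cmdline}") " in
        *" libata.force=noncq "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   cmdline_has_noncq && echo "noncq active" || echo "noncq NOT set"
#   dmesg | grep -i ncq    # then eyeball what libata reports about NCQ
```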

This is the first time I've come across this problem on any of the many, many Linux boxes I look after - many with SSDs, and many of those with OCZ models. Hopefully Google will index this for posterity and make it a little easier for anyone else with the same issue to find the fix: it took me quite a while before I chanced across that particular thread.

What a pain in the arse...

Regards

-- 
The Mailing List for the Devon & Cornwall LUG
http://mailman.dclug.org.uk/listinfo/list
FAQ: http://www.dcglug.org.uk/listfaq