D&C GLug - Home Page


Re: [LUG] Failing drives? Backing up/upgrading server?

 

Grant Sewell wrote:
> I try not to top-post, but I figured this would save many of you from
> reading this whole lot just to find out that... I forgot to talk about
> the backing up/upgrading part of the question. /me is stupid.
> 
> Anyway, I'm asking if anyone knows what could be causing these errors
> (below) and how to fix them (if possible) to save me from resorting to
> potentially reinstalling with new drives.
> 
> If, however, it does come down to reinstalling using new drives, what's
> the best strategy for finding out what packages I have installed
> already (it's a Debian Stable server) so I can make a list of what
> needs to be installed on the 'new' machine (if it comes to that), and
> would I simply back up /etc (and data 'volumes' such as /home as well,
> obviously) and then reinstate it after I had performed a fresh install
> with all previously installed packages reinstalled?
> 
> Cheers.
> Grant.
> 
> On Sat, 26 Jan 2008 17:05:44 +0000
> Grant Sewell wrote:
> 
>> Hi all,
>>
>> I have started to have 'problems' with my file/web/mail server.  I am
>> getting the following message several times over in dmesg output:
>>
>> hdd: dma_intr: status=0x51 { DriveReady SeekComplete Error }
>> hdd: dma_intr: error=0x84 { DriveStatusError BadCRC }
>> ide: failed opcode was: unknown
>>
>> Occasionally I will also get:
>>
>> ide1: reset: master: error (0x7f?)
>>
>> fdisk shows hdd to be the following (which is correct):
>>
>> Disk /dev/hdd: 163.9 GB, 163928604672 bytes
>> 255 heads, 63 sectors/track, 19929 cylinders
>> Units = cylinders of 16065 * 512 = 8225280 bytes
>>
>>    Device Boot      Start         End      Blocks   Id  System
>> /dev/hdd1               1       19929   160079661   83  Linux
>>
>> and "smartctl -a /dev/hdd":
>>
>> Model Family:     Maxtor DiamondMax Plus 9 family
>> Device Model:     Maxtor 6Y160P0
>> Serial Number:    Y46CSYAE
>> Firmware Version: YAR41BW0
>> User Capacity:    163,928,604,672 bytes
>> Device is:        In smartctl database [for details use: -P show]
>> ATA Version is:   7
>> ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
>> Local Time is:    Sat Jan 26 16:42:22 2008 GMT
>> SMART support is: Available - device has SMART capability.
>> SMART support is: Enabled
>>
>> The above errors don't seem to affect general use of the machine, but
>> more worryingly I have recently also been getting these whenever I try
>> to access *some* parts of the filesystem on hdd (mounted as /home):
>>
>> end_request: I/O error, dev hdd, sector 202454167
>> end_request: I/O error, dev hdd, sector 202454663
>> end_request: I/O error, dev hdd, sector 202454671
>> end_request: I/O error, dev hdd, sector 202454167
>> end_request: I/O error, dev hdd, sector 63
>> Buffer I/O error on device hdd1, logical block 0
>> lost page write due to I/O error on hdd1
>> end_request: I/O error, dev hdd, sector 127
>> Buffer I/O error on device hdd1, logical block 8
>> lost page write due to I/O error on hdd1
>> end_request: I/O error, dev hdd, sector 86507599
>> Buffer I/O error on device hdd1, logical block 10813442
>> lost page write due to I/O error on hdd1
>> end_request: I/O error, dev hdd, sector 86507695
>> Buffer I/O error on device hdd1, logical block 10813454
>> lost page write due to I/O error on hdd1
>> end_request: I/O error, dev hdd, sector 86507703
>> Buffer I/O error on device hdd1, logical block 10813455
>> lost page write due to I/O error on hdd1
>> end_request: I/O error, dev hdd, sector 202454167
>> end_request: I/O error, dev hdd, sector 86507695
>> EXT3-fs error (device hdd1): ext3_get_inode_loc: unable to read inode block - inode=5407135, block=10813454
>> Aborting journal on device hdd1.
>> end_request: I/O error, dev hdd, sector 4303
>> Buffer I/O error on device hdd1, logical block 530
>> lost page write due to I/O error on hdd1
>> end_request: I/O error, dev hdd, sector 63
>> Buffer I/O error on device hdd1, logical block 0
>> lost page write due to I/O error on hdd1
>> EXT3-fs error (device hdd1) in ext3_reserve_inode_write: IO failure
>> end_request: I/O error, dev hdd, sector 63
>> Buffer I/O error on device hdd1, logical block 0
>> lost page write due to I/O error on hdd1
>> EXT3-fs error (device hdd1) in ext3_dirty_inode: IO failure
>> end_request: I/O error, dev hdd, sector 63
>> Buffer I/O error on device hdd1, logical block 0
>> lost page write due to I/O error on hdd1
>> ext3_abort called.
>> EXT3-fs error (device hdd1): ext3_journal_start_sb: Detected aborted journal
>> Remounting filesystem read-only
>>
>> Upon dropping to runlevel 1, then performing "umount /home" I
>> immediately get:
>>
>> end_request: I/O error, dev hdd, sector 4303
>> Buffer I/O error on device hdd1, logical block 530
>> lost page write due to I/O error on hdd1
>>
>> (or something like that)
>>
>> Then a fsck /dev/hdd1 returns with:
>> end_request: I/O error, dev hdd, sector 69
>> (repeated lots, different sectors)
>>
>> fsck.ext3: Attempt to read block from filesystem resulted in short read whilst trying to open /dev/hdd1
>> Could this be a zero-length partition?
>>
>> Indeed, now an "fdisk -l /dev/hdd" shows:
>> end_request: I/O error, dev hdd, sector 0
>> printk: 30 messages suppressed.
>> Buffer I/O error on device hdd, logical block 0
>> (blah blah)
>>
>> Reboot and all is fine again, until I try to do this again... then I
>> get errors again.
>>
>> I'm not really sure where to begin.  I've disabled DMA by adding a
>> kernel boot parameter of ide=nodma, but that doesn't seem to affect
>> this problem at all.  Booting from another medium and fscking both
>> hda1 and hdd1 comes back fine.  When the disks are removed and
>> attached to another machine via a USB-ATA adapter, all is OK, so I'm
>> inclined to think it might be the PATA controller on this motherboard
>> (don't ask me what it is, I have no idea), however this machine has
>> been working fine for ages... and more concerning I used to get these
>> sorts of errors on my "old" server before I retired it and performed a
>> hard-drive transplant to this "new" computer, and all was fine for a
>> while.
>>
>> Thanks for reading.  Any ideas?
>>
>> Cheers.
>> Grant. 
>>
> 
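
As a sanity check on the log above, the sector and block numbers it reports
are consistent with one another. The sketch below assumes hdd1 starts at
sector 63 (one 63-sector track, the usual layout for the geometry fdisk
prints) and that the ext3 block size is 4096 bytes, i.e. 8 sectors of 512
bytes. Neither value is stated in the thread; both are inferred from the
sector/block pairs in the log. The function name sector_to_block is made up
for illustration.

```shell
#!/bin/sh
# Map an absolute disk sector on hdd to a filesystem block on hdd1.
# ASSUMPTIONS (inferred from the log, not confirmed by the poster):
#   - the partition starts at sector 63
#   - 512-byte sectors and 4096-byte ext3 blocks (8 sectors per block)
PART_START=63
SECTORS_PER_BLOCK=8

sector_to_block() {
    # Integer arithmetic: offset into the partition, divided by block size.
    echo $(( ($1 - PART_START) / SECTORS_PER_BLOCK ))
}

# Sectors taken from the "end_request" lines in the quoted dmesg output.
for sector in 63 127 4303 86507599 86507695 86507703; do
    echo "sector $sector -> block $(sector_to_block "$sector")"
done
```

Under those assumptions the results match the "logical block" values in the
log (0, 8, 530, 10813442, 10813454, 10813455), and block 10813454 is the one
named in the ext3_get_inode_loc error.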

Have you run a "badblocks" scan?
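
In case it helps, a read-only scan might look something like the sketch
below. The function name scan_readonly and the output file name are made up
for illustration, and the default device is simply the partition named in
the thread. The -s, -v and -o options are standard badblocks flags; without
-n or -w, badblocks only reads from the device. It needs root, and is best
run with the filesystem unmounted.

```shell
#!/bin/sh
# Read-only surface scan sketch (hypothetical helper, not from the thread).
# Defining the function does nothing by itself; call it explicitly as root.

scan_readonly() {
    device="${1:-/dev/hdd1}"                        # partition from the thread
    outfile="${2:-badblocks-$(basename "$device").txt}"
    # -s  show progress
    # -v  verbose output
    # -o  write the list of bad blocks (if any) to a file
    badblocks -sv -o "$outfile" "$device"
}

# Example invocation (not run automatically):
#   umount /home
#   scan_readonly /dev/hdd1
```

Since the disks reportedly check out fine on another machine, comparing a
scan on this box with one run via the USB-ATA adapter may help tell a
failing drive apart from a controller or cabling problem.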

-- 
The Mailing List for the Devon & Cornwall LUG
http://mailman.dclug.org.uk/listinfo/list
FAQ: http://www.dcglug.org.uk/linux_adm/list-faq.html