SLUG Mailing List Archives

Re: [SLUG] detecting hard drive failure ?


Hi,
I'm not sure about NAS boxes, but HP RAID stores the array config on the disks themselves, so you could take the 4 disks of a RAID array out, put them in another server, and the array would come up OK, even on a different RAID controller.

So, if you have a backup of the data, have you tried taking the disks out and putting them back in the same NAS box in different slots? Perhaps a connector is faulty. See whether the problem follows the disk or stays with the slot the disk was in.
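
To make the swap test less error-prone, it helps to note which physical disk (by serial number) is misbehaving before and after the move. A minimal sketch, assuming smartmontools is available on the NAS; the function names `disk_serial` and `classify_fault` are made up for illustration:

```shell
#!/bin/sh
# disk_serial DEVICE
# Print the serial number reported by `smartctl -i` (needs smartmontools).
disk_serial() {
    smartctl -i "$1" | awk -F': *' '/^Serial Number:/ { print $2 }'
}

# classify_fault SERIAL_BEFORE SLOT_BEFORE SERIAL_AFTER SLOT_AFTER
# Arguments describe the failing disk before and after the swap.
# Prints "disk" if the same serial fails in a new slot, "slot" if a
# different serial fails in the same slot, otherwise "inconclusive".
classify_fault() {
    if [ "$1" = "$3" ] && [ "$2" != "$4" ]; then
        echo disk
    elif [ "$1" != "$3" ] && [ "$2" = "$4" ]; then
        echo slot
    else
        echo inconclusive
    fi
}
```

For example, `classify_fault "$(disk_serial /dev/sdb)" 2 "$serial_after" 4` would print `disk` if the same drive still throws errors after moving from slot 2 to slot 4 (here `$serial_after` is whatever you recorded after the swap).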

Thanks,
Ben Donohue
donohueb@xxxxxxxxxxxx


On 14/11/2010 12:57 PM, Voytek Eymont wrote:
On Sat, November 13, 2010 5:52 pm, David Balnaves wrote:

I'm not really sure what the best indicators of a failing hard drive are.
I've used SMART on a lot of hard drives; I've seen undocumented SMART
values, and even hard drives that functioned fine for a number of years
while SMART reported they were "FAILING NOW". I've also seen some drives
enter a state where they won't allow further SMART tests (on/offline) to
be run or aborted. This has led me to believe that SMART as an indicator
needs to be considered on a per-model basis and run carefully within the
capabilities of the drive. The whole process has given me more questions
than answers.

I try to detect a failure by monitoring for large changes in the SMART
attributes. I've configured munin to monitor the SMART attributes; it
wouldn't be too hard to change the plugin to monitor these values on your
NAS (I imagine you can ssh/telnet to it). You will notice some variance
in things like temperature and ECC, but unless they start behaving
erratically I wouldn't worry.
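
As a rough sketch of the "monitor for large changes" idea (this is not the munin plugin, just an illustration): assuming smartmontools is installed and you save `smartctl -A` output at intervals, something like the following could flag attributes whose raw value moved by at least a threshold between two snapshots. The function name `smart_diff`, the snapshot file names, and the default threshold of 10 are all made up for the example, and raw values mean different things on different drive models, so treat the output as a hint only.

```shell
#!/bin/sh
# smart_diff OLD_SNAPSHOT NEW_SNAPSHOT [THRESHOLD]
# Both snapshots are saved `smartctl -A /dev/sdX` output.
# Prints "ID NAME old -> new" for every attribute whose raw value
# changed by at least THRESHOLD (default 10, an arbitrary choice).
smart_diff() {
    awk -v limit="${3:-10}" '
        # Attribute rows start with a numeric ID; RAW_VALUE is column 10.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            raw = $10 + 0                  # leading number of the raw field
            if (FILENAME == ARGV[1]) {     # first file: remember old values
                prev[$1] = raw
                prev_text[$1] = $10
                next
            }
            if ($1 in prev) {              # second file: compare
                delta = raw - prev[$1]
                if (delta < 0) delta = -delta
                if (delta >= limit)
                    printf "%s %s %s -> %s\n", $1, $2, prev_text[$1], $10
            }
        }
    ' "$1" "$2"
}

# Example use (file names are hypothetical):
#   smartctl -A /dev/sda > sda.new
#   smart_diff sda.old sda.new 10
#   mv sda.new sda.old
```

Attributes that drift a little, such as temperature, stay below the threshold and are not reported; a sudden jump in something like the reallocated sector count is.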

Hope this helps in 'detecting and notifying' potential failures.

David, thanks

yes, I can ssh to it

I'm not very familiar with the RAID utilities (beyond knowing what the
acronym stands for...)

but I get:

# mdadm --detail /dev/md0
/dev/md0:
         Version : 00.90.03
   Creation Time : Sat Jun 19 04:35:02 2010
      Raid Level : raid0
      Array Size : 3900774400 (3720.07 GiB 3994.39 GB)
    Raid Devices : 4
   Total Devices : 4
Preferred Minor : 0
     Persistence : Superblock is persistent

     Update Time : Sat Jun 19 04:35:02 2010
           State : clean
  Active Devices : 4
Working Devices : 4
  Failed Devices : 0
   Spare Devices : 0

      Chunk Size : 64K

            UUID : 79e23cd2:b3f9618d:58a8936b:5e0d814b
          Events : 0.1

     Number   Major   Minor   RaidDevice State
        0       8        3        0      active sync   /dev/sda3
        1       8       19        1      active sync   /dev/sdb3
        2       8       35        2      active sync   /dev/sdc3
        3       8       51        3      active sync   /dev/sdd3


# mount
/proc on /proc type proc (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
sysfs on /sys type sysfs (rw)
tmpfs on /tmp type tmpfs (rw,size=32M)
none on /proc/bus/usb type usbfs (rw)
/dev/sda4 on /mnt/ext type ext3 (rw)
/dev/md9 on /mnt/HDA_ROOT type ext3 (rw)
/dev/md0 on /share/MD0_DATA type ext4 (rw,usrjquota=aquota.user,jqfmt=vfsv0,user_xattr,data=ordered,nodelalloc)

# ls  /share/MD0_DATA
ls: /share/MD0_DATA/Web: Input/output error
ls: /share/MD0_DATA/Network Recycle Bin: Input/output error
ls: /share/MD0_DATA/lost+found: Input/output error
ls: /share/MD0_DATA/Download: Input/output error
ls: /share/MD0_DATA/aquota.user: Input/output error
ls: /share/MD0_DATA/Multimedia: Input/output error
ls: /share/MD0_DATA/Usb: Input/output error
ls: /share/MD0_DATA/Recordings: Input/output error
ls: /share/MD0_DATA/Public: Input/output error
cameras/
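
For what it's worth, the fields of `mdadm --detail` that matter most here can be pulled out with a few lines of awk. This is only a sketch (the name `md_summary` is made up), and as the output above shows, an array can report "clean" with 0 failed devices while the filesystem on top still returns I/O errors, so a clean summary is not proof the disks are healthy.

```shell
#!/bin/sh
# md_summary: read `mdadm --detail` output on stdin and print a one-line
# summary. Exit status is 1 if any failed devices are reported, else 0.
md_summary() {
    awk -F' : ' '
        {
            key = $1; sub(/^[ \t]+/, "", key)     # strip leading indent
            val = $2; sub(/[ \t]+$/, "", val)     # strip trailing blanks
        }
        key == "Raid Level"     { level  = val }
        key == "State"          { state  = val }
        key == "Failed Devices" { failed = val }
        END {
            printf "level=%s state=%s failed=%s\n", level, state, failed
            # raid0 stripes data with no redundancy.
            if (level == "raid0")
                print "warning: raid0 has no redundancy; losing one disk loses the array"
            exit (failed + 0 > 0)
        }
    '
}

# Example use:
#   mdadm --detail /dev/md0 | md_summary
```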