Checking Hard Disk Sanity With Smartmontools (Debian & Ubuntu)

Submitted by falko on Mon, 2008-04-21 16:14.

Version 1.0
Author: Falko Timme <ft [at] falkotimme [dot] com>
Last edited 04/08/2008

This guide shows how to install and use the smartmontools package on Debian Etch and Ubuntu 7.10. The package provides utilities that check hard disks for signs of degradation and impending failure, using the Self-Monitoring, Analysis and Reporting Technology (SMART) system built into most modern ATA and SCSI hard disks.

I do not issue any guarantee that this will work for you!

 

1 Installing Smartmontools

In order to install smartmontools, all we have to do is run:

apt-get install smartmontools

The smartmontools package comes with two utilities: smartctl, which you can use to check your hard drives on the command line, and smartd, a daemon that checks your hard disks at a specified interval, logs warnings and errors to the syslog, and can also send them to a specified email address (usually that of the system's admin).

 

2 Using Smartctl

Before we can use smartctl, we must find out how our hard disks are named. You can do this, for example, by running:

df -h

or

fdisk -l

server1:~# fdisk -l

Disk /dev/hda: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hda1   *           1       19269   154778211   83  Linux
/dev/hda2           19270       19457     1510110    5  Extended
/dev/hda5           19270       19457     1510078+  82  Linux swap / Solaris
server1:~#

As you see, my hard disk is called /dev/hda.
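If you have several disks, or want to feed the names into a script, you can pull them out of the fdisk output. The following is only a small sketch: the function name list_disks is made up for this example, and it assumes the "Disk /dev/...:" header lines look like the ones shown above.

```shell
# Print the whole-disk device names found in `fdisk -l` output read from stdin.
# Assumption: each disk is announced by a line starting with "Disk /dev/...:".
list_disks() {
    awk '/^Disk \/dev\// { sub(/:$/, "", $2); print $2 }'
}

# Usage (as root):
#   fdisk -l | list_disks
```

On the system above this should print just /dev/hda; partition lines such as /dev/hda1 are ignored because they do not start with "Disk".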

Now that we know the name of our hard drive, we can run smartctl as follows:

smartctl -a /dev/hda

If you run it for the first time, you'll probably see something like this:

server1:~# smartctl -a /dev/hda
smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST3160022ACE
Serial Number:    5JS3XTZX
Firmware Version: 9.01
User Capacity:    160,041,885,696 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Tue Apr  8 18:58:44 2008 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Disabled

SMART Disabled. Use option -s with argument 'on' to enable it.
server1:~#

So SMART is disabled. To enable it, we need to run the command again with the -s on switch:

smartctl -s on -a /dev/hda

Now we get more output, including all errors that are in the SMART log (if any):

server1:~# smartctl -s on -a /dev/hda
smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST3160022ACE
Serial Number:    5JS3XTZX
Firmware Version: 9.01
User Capacity:    160,041,885,696 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Tue Apr  8 18:59:14 2008 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Disabled

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (15556) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 111) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   059   056   006    Pre-fail  Always       -       163692057
  3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       722959
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       55
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       37
194 Temperature_Celsius     0x0022   039   046   000    Old_age   Always       -       39
195 Hardware_ECC_Recovered  0x001a   059   056   000    Old_age   Always       -       163692057
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   199   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 1
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 28 hours (1 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 5d 4c 85 e0  Error: ICRC, ABRT at LBA = 0x00854c5d = 8735837

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 5d 4c 85 e0 00      05:05:31.855  READ DMA EXT
  25 00 00 5d 4b 85 e0 00      05:05:31.810  READ DMA EXT
  25 00 00 5d 4a 85 e0 00      05:05:31.773  READ DMA EXT
  25 00 00 5d 49 85 e0 00      05:05:31.737  READ DMA EXT
  25 00 00 5d 48 85 e0 00      05:05:31.651  READ DMA EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%        54         -
# 2  Short offline       Aborted by host               80%        54         -
# 3  Short offline       Completed without error       00%        54         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

server1:~#

Now that SMART is enabled, we no longer need the -s on switch, so you can call smartctl as in the first example.
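In a script you usually care only about the overall verdict, not the full report. Here is a minimal sketch that picks the "overall-health" line out of smartctl output. The function name smart_health is hypothetical, and it assumes the line is worded exactly as in the output above.

```shell
# Read `smartctl -a` (or `smartctl -H`) output on stdin and print the overall
# health verdict, e.g. PASSED. Prints UNKNOWN if no verdict line is found.
# Returns 0 only if the verdict is PASSED.
smart_health() {
    verdict=$(awk -F': *' \
        '/^SMART overall-health self-assessment test result/ { print $2; exit }')
    echo "${verdict:-UNKNOWN}"
    [ "$verdict" = "PASSED" ]
}

# Usage (as root):
#   smartctl -H /dev/hda | smart_health || echo "check /dev/hda!"
```

The -H switch asks smartctl for the health status only, which keeps the output short. Piping the full -a output through the function works just as well.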

To learn more about smartctl and how it can be used, take a look at the smartctl man page:

man smartctl
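The attribute table in the output above can also be checked automatically. An attribute counts as failed when its normalized VALUE is less than or equal to its THRESH. The sketch below lists such attributes. The function name failing_attributes is made up for this example, and it assumes the ten-column table layout shown above.

```shell
# Read `smartctl -a` output on stdin and print every attribute whose
# normalized VALUE has dropped to or below its THRESH.
# Attributes with a threshold of 000 are skipped, since they can never trip.
# Assumption: table rows have the ten columns shown in the output above.
failing_attributes() {
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^0x/ && NF >= 10 {
        value = $4 + 0; thresh = $6 + 0
        if (thresh > 0 && value <= thresh)
            printf "%s %s value=%d thresh=%d\n", $1, $2, value, thresh
    }'
}

# Usage (as root):
#   smartctl -a /dev/hda | failing_attributes
```

For the disk above this prints nothing, because every VALUE is still above its THRESH. It only looks at the current VALUE column, not at WORST.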

 

3 Using Smartd

Smartctl is a nice tool, but you have to run it manually. It would be better to have a daemon that monitors our hard disks at specified intervals and logs a message and/or emails us if something is wrong, so that we can react before a disk fails completely. Smartd is just what we need.

To use smartd, we first have to edit /etc/default/smartmontools and uncomment the start_smartd=yes and smartd_opts="--interval=1800" lines (set the monitoring interval to whatever value, in seconds, you prefer; 1800 means 30 minutes):

vi /etc/default/smartmontools

# Defaults for smartmontools initscript (/etc/init.d/smartmontools)
# This is a POSIX shell fragment

# List of devices you want to explicitly enable S.M.A.R.T. for
# Not needed (and not recommended) if the device is monitored by smartd
#enable_smart="/dev/hda /dev/hdb"

# uncomment to start smartd on system startup
start_smartd=yes

# uncomment to pass additional options to smartd on startup
smartd_opts="--interval=1800"
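If you prefer not to edit the file by hand (for example when setting up several machines), the two lines can be uncommented with sed. This is only a sketch: the function name enable_smartd is made up, and it assumes the commented lines in your file start exactly with "#start_smartd=yes" and "#smartd_opts=". Check the file first and keep a backup.

```shell
# Uncomment the start_smartd and smartd_opts lines in the defaults file.
# $1: path to the file (defaults to /etc/default/smartmontools).
# Assumption: the commented lines begin with "#start_smartd=yes" and
# "#smartd_opts=" with no space after the "#".
enable_smartd() {
    file=${1:-/etc/default/smartmontools}
    tmp=$(mktemp) || return 1
    # Write the result to a temp file first, then copy it back so the
    # original file keeps its owner and permissions.
    sed -e 's/^#start_smartd=yes/start_smartd=yes/' \
        -e 's/^#smartd_opts=/smartd_opts=/' "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Usage (as root):
#   cp /etc/default/smartmontools /etc/default/smartmontools.bak
#   enable_smartd
```

Other commented lines, such as #enable_smart=..., are left untouched.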

Next we must edit the smartd configuration file, /etc/smartd.conf. You should take a look at

man smartd

to learn more about the available configuration options and also check out the examples that are in /etc/smartd.conf.

vi /etc/smartd.conf

To begin with, the following configuration is fine:

DEVICESCAN -m root -M exec /usr/share/smartmontools/smartd-runner

DEVICESCAN means that smartd will monitor all hard drives it can find. The -m switch specifies the user or email address that smartd will send warnings/errors to. For example, to monitor only /dev/hda and send warnings/errors to admin@example.com, you'd use the following configuration instead:

/dev/hda  -m admin@example.com -M exec /usr/share/smartmontools/smartd-runner
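To double-check who will receive the warnings, you can list the active entries of the configuration together with their -m recipient. The following is a rough sketch. The function name smartd_recipients is made up, and it assumes simple one-line entries like the two above (no backslash line continuations).

```shell
# Read smartd.conf on stdin and print "device recipient" for every active
# entry. Comment lines and blank lines are skipped; entries without an -m
# directive are reported with "(none)".
# Assumption: each entry sits on a single line.
smartd_recipients() {
    awk '/^[ \t]*(#|$)/ { next }
         {
             rcpt = "(none)"
             for (i = 2; i < NF; i++)
                 if ($i == "-m") rcpt = $(i + 1)
             print $1, rcpt
         }'
}

# Usage:
#   smartd_recipients < /etc/smartd.conf
```

For the two example lines above, it should report root for DEVICESCAN and admin@example.com for /dev/hda.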

Afterwards we start smartd:

/etc/init.d/smartmontools start

Now if you take a look at /var/log/syslog, you should find the startup messages of smartd there:

tail -n50 /var/log/syslog

[...]
Apr  8 19:12:17 server1 smartd[3731]: smartd version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Apr  8 19:12:17 server1 smartd[3731]: Home page is http://smartmontools.sourceforge.net/
Apr  8 19:12:17 server1 smartd[3731]: Opened configuration file /etc/smartd.conf
Apr  8 19:12:17 server1 smartd[3731]: Drive: DEVICESCAN, implied '-a' Directive on line 22 of file /etc/smartd.conf
Apr  8 19:12:17 server1 smartd[3731]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
Apr  8 19:12:17 server1 smartd[3731]: Problem creating device name scan list
Apr  8 19:12:17 server1 smartd[3731]: Device: /dev/hda, opened
Apr  8 19:12:17 server1 smartd[3731]: Device: /dev/hda, not found in smartd database.
Apr  8 19:12:17 server1 smartd[3731]: Device: /dev/hda, is SMART capable. Adding to "monitor" list.
Apr  8 19:12:17 server1 smartd[3731]: Device: /dev/hdc, opened
Apr  8 19:12:17 server1 smartd[3731]: Device: /dev/hdc, packet devices [this device CD/DVD] not SMART capable
Apr  8 19:12:17 server1 smartd[3731]: Monitoring 1 ATA and 0 SCSI devices
Apr  8 19:12:17 server1 smartd[3733]: smartd has fork()ed into background mode. New PID=3733.
Apr  8 19:12:17 server1 smartd[3733]: file /var/run/smartd.pid written containing PID 3733
[...]

If smartd finds something noteworthy about your hard disk, or any errors/warnings, it will log these events as well, e.g.:

Apr  8 19:36:01 server2 smartd[13160]: Device: /dev/hda, SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 37

(This is of course not an error or warning, just something interesting.)

Errors and warnings will also be sent to a user/email address if you told smartd to do so.
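If the syslog is busy, the smartd attribute messages are easy to miss. Here is a small sketch that filters them out and reduces each one to device, attribute, old value and new value. The function name smartd_changes is made up, and the pattern assumes the message wording shown in the example line above.

```shell
# Read syslog lines on stdin and print "device attribute old new" for each
# smartd "changed from X to Y" message. All other lines are dropped.
# Assumption: messages are worded like the Temperature_Celsius example above.
smartd_changes() {
    sed -n 's/^.*smartd\[[0-9]*]: Device: \([^,]*\), SMART [A-Za-z]* Attribute: [0-9]* \([^ ]*\) changed from \([0-9]*\) to \([0-9]*\).*$/\1 \2 \3 \4/p'
}

# Usage:
#   smartd_changes < /var/log/syslog
```

For the example line above, this prints "/dev/hda Temperature_Celsius 36 37".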

 

4 Links


Submitted by Arnaud (not registered) on Mon, 2008-09-15 23:12.

Smart how-to for a kick start. Up to the user to explore the options.

Thank you, was up and running in no time.

Submitted by Tenzer (registered user) on Mon, 2008-04-21 19:26.
If your hard drive is a SATA drive and is recognized as /dev/sd*, you have to include "-d ata" in the smartctl command, or else it thinks the drive is a SCSI drive and will fail to read any of the useful information. There is more information on that subject here: http://smartmontools.sourceforge.net/smartmontools_scsi.html
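
Following the comment above, a wrapper script could add "-d ata" whenever the device name starts with /dev/sd. This is only a heuristic sketch with a made-up function name (smartctl_args): genuine SCSI disks also show up as /dev/sd*, and for those "-d ata" would be wrong, so check what kind of disk you have.

```shell
# Print the smartctl arguments for a device, adding "-d ata" for /dev/sd*
# names as suggested in the comment above.
# Assumption: every /dev/sd* device on this machine is really a SATA disk.
smartctl_args() {
    case "$1" in
        /dev/sd*) echo "-d ata -a $1" ;;
        *)        echo "-a $1" ;;
    esac
}

# Usage (as root; the unquoted $(...) is intentional so the arguments split):
#   smartctl $(smartctl_args /dev/sda)
```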