The FreeBSD Diary |
(TM) | Providing practical examples since 1998
Fixing bad sectors found by smartd
20 August 2011
It was nearly 18 months ago that I wrote about Monitoring your HDD using SMART and Nagios. Last night, I got my first serious alert regarding an HDD. Nagios is reporting: Raid Critical!. gmirror is reporting:

$ gmirror status
      Name    Status  Components
mirror/gm0  DEGRADED  ad0 (100%)
                      ad2

And these entries in /var/log/messages:

Aug 19 11:06:34 bast smartd[1575]: Device: /dev/ad2, 1 Currently unreadable (pending) sectors
Aug 19 11:36:33 bast smartd[1575]: Device: /dev/ad2, 1 Currently unreadable (pending) sectors

In this article, I will show you how I fixed the HDD using common tools and some interesting procedures, none of which I had used before. Do you dare to trust your HDD to my writing? Got backups?
Initial research
I first started searching for this error message:

Currently unreadable pending sectors

I found a link to Bad block HOWTO for smartmontools. I read it and decided I needed to grab more information from smartd.

[root@bast:/home/dan] # smartctl -l selftest /dev/ad2
smartctl 5.39 2009-12-09 r2995 [FreeBSD 8.2-STABLE i386] (local build)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     14921         -
# 2  Extended offline    Completed without error       00%     14920         -
# 3  Short offline       Completed without error       00%      6914         -
# 4  Short offline       Completed without error       00%      6914         -

That doesn't show much at all, just that previous tests have run fine. Let's go for the big picture.

[root@bast:/home/dan] # smartctl -a /dev/ad2
smartctl 5.39 2009-12-09 r2995 [FreeBSD 8.2-STABLE i386] (local build)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Maxtor DiamondMax Plus D740X family
Device Model:     MAXTOR 6L040J2
Serial Number:    362129580341
Firmware Version: A93.0500
User Capacity:    40,027,029,504 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   5
ATA Standard is:  ATA/ATAPI-5 T13 1321D revision 1
Local Time is:    Fri Aug 19 16:36:40 2011 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (  34) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  20) minutes.

SMART Attributes Data Structure revision number: 11
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0029   100   253   020    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   081   081   020    Pre-fail  Always       -       2437
  4 Start_Stop_Count        0x0032   100   100   008    Old_age   Always       -       154
  5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x000b   100   093   023    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   059   059   001    Old_age   Always       -       27416
 10 Spin_Retry_Count        0x0026   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   020    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   008    Old_age   Always       -       144
 13 Read_Soft_Error_Rate    0x000b   100   093   023    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   080   077   042    Old_age   Always       -       51
195 Hardware_ECC_Recovered  0x001a   100   001   000    Old_age   Always       -       89569576
196 Reallocated_Event_Count 0x0010   100   100   020    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0032   100   100   020    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x001a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 1
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 27400 hours (1141 days + 16 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 59 18 e8 ef 54 e0  Error: UNC 24 sectors at LBA = 0x0054efe8 = 5566440

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 00 ef 54 e0 54      00:03:49.769  READ DMA
  c8 00 00 00 ee 54 e0 54      00:03:49.765  READ DMA
  c8 00 00 00 ed 54 e0 54      00:03:49.755  READ DMA
  c8 00 00 00 ec 54 e0 54      00:03:49.751  READ DMA
  c8 00 00 00 eb 54 e0 54      00:03:49.747  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     14921         -
# 2  Extended offline    Completed without error       00%     14920         -
# 3  Short offline       Completed without error       00%      6914         -
# 4  Short offline       Completed without error       00%      6914         -

Device does not support Selective Self Tests/Logging

Next, I decided to run a short test.

[root@bast:/home/dan] # smartctl -t short /dev/ad2
smartctl 5.39 2009-12-09 r2995 [FreeBSD 8.2-STABLE i386] (local build)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Fri Aug 19 16:47:16 2011

Use smartctl -X to abort test.

I waited two minutes, and then I saw:

[root@bast:/home/dan] # smartctl -l selftest /dev/ad2
smartctl 5.39 2009-12-09 r2995 [FreeBSD 8.2-STABLE i386] (local build)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     27416         -
# 2  Extended offline    Completed without error       00%     14921         -
# 3  Extended offline    Completed without error       00%     14920         -
# 4  Short offline       Completed without error       00%      6914         -
# 5  Short offline       Completed without error       00%      6914         -

Hmmm, still no useful information. Let's try the long test.

[root@bast:/home/dan] # smartctl -t long /dev/ad2
smartctl 5.39 2009-12-09 r2995 [FreeBSD 8.2-STABLE i386] (local build)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 20 minutes for test to complete.
Test will complete after Fri Aug 19 17:08:04 2011

Use smartctl -X to abort test.
After 20 minutes, I found:

[root@bast:/home/dan] # smartctl -l selftest /dev/ad2
smartctl 5.39 2009-12-09 r2995 [FreeBSD 8.2-STABLE i386] (local build)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     27416         786767
# 2  Short offline       Completed without error       00%     27416         -
# 3  Extended offline    Completed without error       00%     14921         -
# 4  Extended offline    Completed without error       00%     14920         -
# 5  Short offline       Completed without error       00%      6914         -
# 6  Short offline       Completed without error       00%      6914         -

Now that's some information we can use in our fix. The LBA of the [first] bad sector is 786767. How can we use this information? Let's look at the output of gpart:

[dan@bast:~] $ gpart show
=>      63  78177708  mirror/gm0  MBR  (37G)
        63  78172227           1  freebsd  [active]  (37G)
  78172290      5481              - free -  (2.7M)

=>       0  78172227  mirror/gm0s1  BSD  (37G)
         0   1048576             1  freebsd-ufs  (512M)
   1048576   4141712             2  freebsd-swap  (2.0G)
   5190288   4167680             4  freebsd-ufs  (2.0G)
   9357968   1048576             5  freebsd-ufs  (512M)
  10406544  67765683             6  freebsd-ufs  (32G)

Please note that /dev/ad2 is part of the gmirror mirror/gm0. I'm told that gmirror LBAs are used unchanged. Let's proceed based on that perhaps dangerous assumption. It looks like this LBA is in the first 512M of the HDD, which I'm sure is part of /. Let's see:

$ df -h
Filesystem                    Size    Used   Avail Capacity  Mounted on
/dev/mirror/gm0s1a            496M    405M     51M    89%    /
devfs                         1.0K    1.0K      0B   100%    /dev
/dev/mirror/gm0s1e            496M     61M    395M    13%    /tmp
/dev/mirror/gm0s1f             31G    6.7G     22G    23%    /usr
/dev/mirror/gm0s1d            1.9G    513M    1.3G    28%    /var
ngaio:/usr/ports/distfiles    138G    100G     27G    79%    /usr/ports/distfiles
devfs                         1.0K    1.0K      0B   100%    /var/named/dev
devfs                         1.0K    1.0K      0B   100%    /var/db/dhcpd/dev

Yep, that looks like it's in / for sure.
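The lookup above can be sketched as a few lines of shell arithmetic. This is only an illustration, not a tool from the article: `lba_to_index` is a hypothetical helper, the layout values are copied by hand from the `gpart show` output, and it relies on the same assumption made in the text, namely that gmirror passes LBAs through unchanged, so the only offset is the slice start at sector 63.

```shell
#!/bin/sh
# Sketch: find which gpart index of mirror/gm0s1 contains an absolute disk LBA.
# Assumption (from the article): gmirror LBAs are used unchanged.

SLICE_START=63   # first sector of the freebsd slice, from "gpart show"

# index:start:size triples copied from the gm0s1 section of "gpart show"
LAYOUT="1:0:1048576 2:1048576:4141712 4:5190288:4167680 5:9357968:1048576 6:10406544:67765683"

# lba_to_index is a hypothetical helper, not part of any FreeBSD utility.
lba_to_index() {
    lba=$1
    rel=$((lba - SLICE_START))          # sector number relative to the slice
    for entry in $LAYOUT; do
        idx=${entry%%:*}
        rest=${entry#*:}
        start=${rest%%:*}
        size=${rest#*:}
        if [ "$rel" -ge "$start" ] && [ "$rel" -lt $((start + size)) ]; then
            echo "$idx"
            return 0
        fi
    done
    echo "none"
    return 1
}

lba_to_index 786767   # prints 1, the 512M freebsd-ufs partition
```

The same helper can be pointed at other addresses from the logs. For example, the 5566440 reported in the SMART error log lands in index 4 under these assumptions, so it is worth checking each reported address separately.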
Let's find out sector size:

[root@bast:/home/dan] # diskinfo -v ad2
ad2
        512             # sectorsize
        40027029504     # mediasize in bytes (37G)
        78177792        # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        19161           # Cylinders according to firmware.
        16              # Heads according to firmware.
        255             # Sectors according to firmware.
        362129580341    # Disk ident.

And now for some partition information:

# fdisk -t /dev/ad2
******* Working on device /dev/ad2 *******
parameters extracted from in-core disklabel are:
cylinders=19161 heads=16 sectors/track=255 (4080 blks/cyl)

Figures below won't work with BIOS for partitions not in cyl 1
parameters to be used for BIOS calculations are:
cylinders=19161 heads=16 sectors/track=255 (4080 blks/cyl)

Media sector size is 512
Warning: BIOS sector numbering starts with sector 1
Information from DOS bootblock is:
The data for partition 1 is:
sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD)
    start 63, size 78172227 (38170 Meg), flag 80 (active)
        beg: cyl 0/ head 1/ sector 1;
        end: cyl 1023/ head 254/ sector 63
The data for partition 2 is:

But wait, as Wyze pointed out on IRC, fixing ad2 doesn't make sense. Instead, let's be sure ad0 is solid with no errors.

[root@bast:/home/dan] # smartctl -t long /dev/ad0
smartctl 5.39 2009-12-09 r2995 [FreeBSD 8.2-STABLE i386] (local build)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 20 minutes for test to complete.
Test will complete after Fri Aug 19 20:14:56 2011

Use smartctl -X to abort test.
And some time later, I had my answer: all OK.

[root@bast:/home/dan] # smartctl -l selftest /dev/ad0
smartctl 5.39 2009-12-09 r2995 [FreeBSD 8.2-STABLE i386] (local build)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     51193         -
# 2  Extended offline    Completed without error       00%     38670         -
# 3  Short offline       Completed without error       00%     38670         -

But I figured I wanted to do more. I wanted to read all of ad0, the HDD which seems to be OK:

[root@bast:/home/dan] # dd of=/dev/null if=/dev/ad0 bs=1m
38172+1 records in
38172+1 records out
40027029504 bytes transferred in 1157.080130 secs (34593135 bytes/sec)
[root@bast:/home/dan] #

No errors appeared in /var/log/messages during this time. I'm pretty confident that ad0 is solid and reliable. Now, let's try ad2:

[root@bast:~] # dd of=/dev/null if=/dev/ad2 bs=1m
dd: /dev/ad2: Input/output error
2717+0 records in
2717+0 records out
2848980992 bytes transferred in 127.232046 secs (22392008 bytes/sec)
[root@bast:~] #

Oh. And in /var/log/messages, I see:

Aug 19 22:15:50 bast kernel: ad2: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=5566208

But compared to ad0, there is much less data... that's because dd stopped on the error.
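The record count that dd prints can be tied back to the LBA in the kernel message with a little arithmetic. A minimal sketch, assuming the 512-byte sector size reported by diskinfo and the 1 MiB block size from `bs=1m`; `dd_block_range` is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# Sketch: which sectors does the dd block after N complete records cover?
# Assumes 512-byte sectors (from diskinfo) and bs=1m (1048576 bytes).

SECTOR_SIZE=512
BLOCK_SIZE=1048576

# dd_block_range is a hypothetical helper: given the number of records dd
# completed, print the first and last LBA of the next (failed) block.
dd_block_range() {
    records=$1
    per_block=$((BLOCK_SIZE / SECTOR_SIZE))   # 2048 sectors per dd block
    first=$((records * per_block))
    last=$((first + per_block - 1))
    echo "$first $last"
}

dd_block_range 2717   # prints "5564416 5566463"
```

The LBA=5566208 from the kernel message falls inside that range, which is consistent with dd stopping on the block it could not read.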
Let us try this:

[root@bast:~] # dd of=/dev/null if=/dev/ad2 bs=1m conv=noerror
dd: /dev/ad2: Input/output error
2717+0 records in
2717+0 records out
2848980992 bytes transferred in 127.128503 secs (22410246 bytes/sec)
dd: /dev/ad2: Input/output error
38170+1 records in
38170+1 records out
40025063424 bytes transferred in 1544.671423 secs (25911701 bytes/sec)
[root@bast:~] #

But still, only one error in /var/log/messages:

Aug 19 22:26:39 bast kernel: ad2: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=5566208

But I also noticed this:

Aug 19 22:36:34 bast smartd[1575]: Device: /dev/ad2, 2 Currently unreadable (pending) sectors
Aug 19 22:36:34 bast smartd[1575]: Device: /dev/ad2, ATA error count increased from 1 to 3

At this point, I gave up for the day and sent an email off to the freebsd-stable mailing list.
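Since the pending-sector count changed between smartd reports, it can be handy to pull the latest value out of the log. The following is a rough sketch only: `pending_count` is a hypothetical helper, and the pattern is written against the exact smartd message format quoted in this article, which may differ between smartmontools versions.

```shell
#!/bin/sh
# Sketch: print the most recent "Currently unreadable (pending) sectors"
# count found in syslog lines read from stdin.
# pending_count is a hypothetical helper; the pattern assumes the smartd
# message format shown in this article.

pending_count() {
    sed -n 's/.*smartd\[[0-9]*]: Device: [^,]*, \([0-9][0-9]*\) Currently unreadable (pending) sectors.*/\1/p' |
        tail -n 1
}

# Sample input copied from the log entries above.
pending_count <<'EOF'
Aug 19 11:06:34 bast smartd[1575]: Device: /dev/ad2, 1 Currently unreadable (pending) sectors
Aug 19 22:36:34 bast smartd[1575]: Device: /dev/ad2, 2 Currently unreadable (pending) sectors
Aug 19 22:36:34 bast smartd[1575]: Device: /dev/ad2, ATA error count increased from 1 to 3
EOF
```

On a live system the input would come from /var/log/messages instead of the sample lines.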
Using the bad_block_scan script
NOTE: Where is that script from? I ran the script several times:

sh ./bad_block_scan /dev/ad2 5566400 5566500
sh ./bad_block_scan /dev/ad2 5566000 5566500
sh ./bad_block_scan /dev/ad2 5560000 5566000
sh ./bad_block_scan /dev/ad2 5560000 5566000

Then I ran smartctl again:

smartctl 5.41 2011-06-09 r3365 [FreeBSD 8.2-STABLE i386] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Maxtor DiamondMax Plus D740X
Device Model:     MAXTOR 6L040J2
Serial Number:    362129580341
Firmware Version: A93.0500
User Capacity:    40,027,029,504 bytes [40.0 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   5
ATA Standard is:  ATA/ATAPI-5 T13 1321D revision 1
Local Time is:    Sat Aug 20 17:03:42 2011 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 112) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                 (  34) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  20) minutes.

SMART Attributes Data Structure revision number: 11
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0029   100   253   020    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   081   081   020    Pre-fail  Always       -       2437
  4 Start_Stop_Count        0x0032   100   100   008    Old_age   Always       -       154
  5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  Always       -       2
  7 Seek_Error_Rate         0x000b   100   093   023    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   059   059   001    Old_age   Always       -       27440
 10 Spin_Retry_Count        0x0026   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   020    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   008    Old_age   Always       -       144
 13 Read_Soft_Error_Rate    0x000b   100   093   023    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   081   076   042    Old_age   Always       -       50
195 Hardware_ECC_Recovered  0x001a   100   001   000    Old_age   Always       -       89881212
196 Reallocated_Event_Count 0x0010   099   099   020    Old_age   Offline      -       1
197 Current_Pending_Sector  0x0032   100   100   020    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x001a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 3
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 occurred at disk power-on lifetime: 27422 hours (1142 days + 14 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 59 18 e8 ef 54 e0  Error: UNC 24 sectors at LBA = 0x0054efe8 = 5566440

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 00 ef 54 e0 54      00:03:34.065  READ DMA
  c8 00 00 00 ee 54 e0 54      00:03:34.061  READ DMA
  c8 00 00 00 ed 54 e0 54      00:03:34.057  READ DMA
  c8 00 00 00 ec 54 e0 54      00:03:34.053  READ DMA
  c8 00 00 00 eb 54 e0 54      00:03:34.049  READ DMA

Error 2 occurred at disk power-on lifetime: 27421 hours (1142 days + 13 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 59 18 e8 ef 54 e0  Error: UNC 24 sectors at LBA = 0x0054efe8 = 5566440

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 00 ef 54 e0 54      00:04:16.704  READ DMA
  c8 00 00 00 ee 54 e0 54      00:04:16.700  READ DMA
  c8 00 00 00 ed 54 e0 54      00:04:16.696  READ DMA
  c8 00 00 00 ec 54 e0 54      00:04:16.692  READ DMA
  c8 00 00 00 eb 54 e0 54      00:04:16.687  READ DMA

Error 1 occurred at disk power-on lifetime: 27400 hours (1141 days + 16 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 59 18 e8 ef 54 e0  Error: UNC 24 sectors at LBA = 0x0054efe8 = 5566440

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 00 ef 54 e0 54      00:03:49.769  READ DMA
  c8 00 00 00 ee 54 e0 54      00:03:49.765  READ DMA
  c8 00 00 00 ed 54 e0 54      00:03:49.755  READ DMA
  c8 00 00 00 ec 54 e0 54      00:03:49.751  READ DMA
  c8 00 00 00 eb 54 e0 54      00:03:49.747  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     27416         786767
# 2  Short offline       Completed without error       00%     27416         -
# 3  Extended offline    Completed without error       00%     14921         -
# 4  Extended offline    Completed without error       00%     14920         -
# 5  Short offline       Completed without error       00%      6914         -
# 6  Short offline       Completed without error       00%      6914         -

Device does not support Selective Self Tests/Logging
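The error log prints the failing address as "LBA = 0x0054efe8 = 5566440" next to the raw register values. A minimal sketch of how those registers combine, assuming 28-bit LBA addressing, where SN, CL and CH hold bits 0 to 23 and the low nibble of DH holds bits 24 to 27. `regs_to_lba` is a hypothetical helper written for this illustration, not a smartmontools command.

```shell
#!/bin/sh
# Sketch: rebuild a 28-bit LBA from the SN, CL, CH and DH register values
# (two hex digits each) as printed in the SMART error log.
# regs_to_lba is a hypothetical helper, assuming LBA28 addressing.

regs_to_lba() {
    sn=$1; cl=$2; ch=$3; dh=$4
    # Low nibble of DH is LBA bits 24-27; CH, CL, SN are bits 16-23, 8-15, 0-7.
    echo $(( ((0x$dh & 0x0f) << 24) | (0x$ch << 16) | (0x$cl << 8) | 0x$sn ))
}

regs_to_lba e8 ef 54 e0   # registers after the error: prints 5566440
regs_to_lba 00 ef 54 e0   # last READ DMA before the error: prints 5566208
```

The second value matches the LBA=5566208 in the kernel's READ_DMA failure message, which suggests the kernel logs the start of the failed request while the drive's error log records the sector that could not be read.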