
Tutorial: How to replace a failing drive in FreeNAS 11.2

I’ve replaced aging disks in my FreeNAS box a few times now, and I always document procedures to save time later on. With that in mind, I thought it might help others if I shared the workflow online. So here you go!

Initial Diagnosis

I received an automated email from my FreeNAS box with the following kernel log messages. Notice the MEDIUM ERROR:

freenas.lan kernel log messages:
[...]
 > (da2:mps0:0:4:0): Retrying command
 > (da2:mps0:0:4:0): READ(10). CDB: 28 00 00 e8 10 f0 00 01 00 00
 > (da2:mps0:0:4:0): CAM status: SCSI Status Error
 > (da2:mps0:0:4:0): SCSI status: Check Condition
 > (da2:mps0:0:4:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
 > (da2:mps0:0:4:0): Info: 0xe81110
 > (da2:mps0:0:4:0): Error 5, Unretryable error
 -- End of security output --
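
The same messages can also be checked directly on the box instead of waiting for the nightly email. Something along these lines (substituting your own daN device for da2) should work on FreeBSD/FreeNAS:

% dmesg | grep da2
% grep da2 /var/log/messages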
 

I also received an email containing more readable zpool status output:

[...]
  pool: REDPOOL_4X3TB
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Sun Nov 17 00:00:02 2019
        437G scanned at 41.2M/s, 373G issued at 35.2M/s, 3.75T total
        904K repaired, 9.71% done, 1 days 04:02:43 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        REDPOOL_4X3TB                                   ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/4bc78f89-774a-11e6-a507-1c98ec0ec444  ONLINE       0     0     0
            gptid/4c75dd9a-774a-11e6-a507-1c98ec0ec444  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/4d387bd8-774a-11e6-a507-1c98ec0ec444  ONLINE       0     0     1
            gptid/9fc44493-081f-11e9-8568-a01d48c71098  ONLINE       0     0     0

[...]

I took note of which drive had the checksum error:

mirror-1 ONLINE 0 0 0
gptid/4d387bd8-774a-11e6-a507-1c98ec0ec444 ONLINE 0 0 1
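
The zpool output identifies pool members only by their gptid labels, so the next step is to map the affected label back to a daN device. On FreeBSD/FreeNAS, glabel status lists those mappings; as a quick sketch, grep for the first few characters of the gptid (yours will differ):

% glabel status | grep 4d387bd8

The matching line should end in the partition carrying that label (da2p2 in my case), which tells you which /dev/daN to point smartctl at.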

I then SSHed into my FreeNAS machine (you can use the Shell web terminal in the FreeNAS web GUI if you prefer) and ran smartctl -a /dev/da2 to verify that /dev/da2 was indeed the failing drive. Here I've omitted the unrelated parts with [...]:

% sudo smartctl -a /dev/da2

smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N1FLTH0V
LU WWN Device Id: 5 0014ee 20d9cf75b
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Nov 17 23:46:58 2019 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
[...]
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 114) The previous self-test completed having
                                        the read element of the test failed.
[...]

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
[...]
  9 Power_On_Hours          0x0032   062   062   000    Old_age   Always       -       27833
[...]

The following line tells you how many hours the drive has been powered on:

9 Power_On_Hours 0x0032 062 062 000 Old_age Always - 27833

In this case, the failing drive had been powered on for 27833 hours (about 3.17 years), unfortunately just outside Western Digital's 3-year warranty!
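
If you only want that one figure, you can pull it straight out of the smartctl output and convert hours to years in a single line. A rough sketch, assuming the raw value sits in the last column of the attribute line as it does above:

% sudo smartctl -a /dev/da2 | awk '/Power_On_Hours/ {printf "%.2f years\n", $NF/8760}'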

Replacing The Failed Drive

Luckily I had a replacement drive ready to go. I grabbed the replacement drive and made a note of the key information:

Model: WD30EFRX-68EUZN0 3TB Western Digital Red NAS SATA HDD
Serial: WD-WCC4N1JUX6TN

So I made a plan to replace failing drive WD-WCC4N1FLTH0V with new drive WD-WCC4N1JUX6TN.

I then logged into the FreeNAS web interface and navigated to Storage > Pools > Status.

Storage > Pools > Status in the FreeNAS 11 web interface

I selected da2p2 and clicked Offline. When the drive was taken offline, I shut down the server via the web interface and physically replaced the drive, double-checking the serial of the removed drive and its replacement.
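
The GUI is the recommended way to do this on FreeNAS, since it keeps the middleware in sync, but for reference the equivalent ZFS command, using the failing member's gptid from the status output above, would be roughly:

% sudo zpool offline REDPOOL_4X3TB gptid/4d387bd8-774a-11e6-a507-1c98ec0ec444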

Remember that drive indexes are zero-based, so in my case da2 was the third drive.
When you remove the drive you think has failed, always double-check the serial!
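
A quick way to do that double-check from the shell before powering down is to print the serial that each daN device reports and match it against the sticker on the drive. A sketch for one device (repeat for each daN in your system):

% sudo smartctl -i /dev/da2 | grep -i serial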

I powered the server back on and waited for the FreeNAS web interface to come back up, logged in, then navigated to Storage > Pools > Status.

I selected the missing drive (/dev/gptid/4d387bd8-774a-11e6-a507-1c98ec0ec444) and clicked Replace.

The dialog that popped up offered to replace it with /dev/da3. Note that because the old /dev/da2 was no longer present when the server restarted, the old da3 had taken da2's place and the new drive had come up as da3. To verify this, I ran another smartctl check via the FreeNAS shell:

% sudo smartctl -a /dev/da3
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N1JUX6TN
LU WWN Device Id: 5 0014ee 20d4a2714
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Nov 18 00:25:15 2019 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

[...]

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
[...]
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always  
[...]

Notice that the Serial Number: WD-WCC4N1JUX6TN line matches the serial of the replacement drive, confirming that /dev/da3 is the correct drive to select as the replacement in the Replace dialog box.
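
Another quick way to see how the disks were re-enumerated after the reboot is camcontrol, which lists every attached device together with the daN (and passN) name it was assigned; just as a sketch:

% sudo camcontrol devlist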

I selected da3 from the dropdown and clicked the Replace button. After a short wait, the web interface confirmed that rebuilding (resilvering) the data pool had begun.

Keeping an Eye on Things

In my case, replacing the 3TB drive was projected to take approximately 3 hours. I was able to check the status of the rebuild by running the zpool status command via the shell:

% zpool status
  pool: REDPOOL_4X3TB
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Nov 18 00:26:42 2019
        1.04T scanned at 725M/s, 668G issued at 454M/s, 3.75T total
        135G resilvered, 17.37% done, 0 days 01:59:18 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        REDPOOL_4X3TB                                     DEGRADED     0     0     0
          mirror-0                                        ONLINE       0     0     0
            gptid/4bc78f89-774a-11e6-a507-1c98ec0ec444    ONLINE       0     0     0
            gptid/4c75dd9a-774a-11e6-a507-1c98ec0ec444    ONLINE       0     0     0
          mirror-1                                        DEGRADED     0     0     0
            replacing-0                                   OFFLINE      0     0     0
              16082153518450749111                        OFFLINE      0     0     0  was /dev/gptid/4d387bd8-774a-11e6-a507-1c98ec0ec444
              gptid/a6aff1f4-094e-11ea-8d34-a01d48c71098  ONLINE       0     0     0
            gptid/9fc44493-081f-11e9-8568-a01d48c71098    ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:02:12 with 0 errors on Mon Nov 11 03:47:12 2019
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da4p2     ONLINE       0     0     0

errors: No known data errors
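
If you'd rather not keep re-running that by hand, a simple loop that prints the progress line every five minutes will do. A rough sketch assuming a Bourne-style shell (Ctrl-C stops it):

% sh -c 'while true; do zpool status REDPOOL_4X3TB | grep done; sleep 300; done'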

Don’t forget to label the failed drive or box you’re storing it in so that it doesn’t get reused later on.

Label failed drives and always dispose of them responsibly.

Final Verification

It took 6 hours and 13 minutes to rebuild the pool, roughly twice the estimated time, so don't be surprised if yours takes a while too. Here is what the zpool status command returned when the operation completed:

% zpool status
  pool: REDPOOL_4X3TB
 state: ONLINE
  scan: resilvered 1.88T in 0 days 06:13:26 with 0 errors on Mon Nov 18 06:40:08 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        REDPOOL_4X3TB                                   ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/4bc78f89-774a-11e6-a507-1c98ec0ec444  ONLINE       0     0     0
            gptid/4c75dd9a-774a-11e6-a507-1c98ec0ec444  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/a6aff1f4-094e-11ea-8d34-a01d48c71098  ONLINE       0     0     0
            gptid/9fc44493-081f-11e9-8568-a01d48c71098  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:02:12 with 0 errors on Mon Nov 11 03:47:12 2019
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da4p2     ONLINE       0     0     0

errors: No known data errors
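
As one last sanity check, zpool status -x reports only pools that have problems, so once everything is back to normal it should simply print:

% zpool status -x
all pools are healthy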

Thanks for reading, and I hope this guide helps you out. Leave a comment if it does, or if you find any errors. Cheers!
