
Tutorial: How to replace a failing drive in FreeNAS 11.2

I’ve replaced aging disks in my FreeNAS box a few times now, and I always document procedures to save time later on. With that in mind, I thought it might help others if I shared the workflow online. So here you go!

Initial Diagnosis

I received an automated email from my FreeNAS box with the following kernel log messages. Notice the MEDIUM ERROR:

freenas.lan kernel log messages:
[...]
 > (da2:mps0:0:4:0): Retrying command
 > (da2:mps0:0:4:0): READ(10). CDB: 28 00 00 e8 10 f0 00 01 00 00
 > (da2:mps0:0:4:0): CAM status: SCSI Status Error
 > (da2:mps0:0:4:0): SCSI status: Check Condition
 > (da2:mps0:0:4:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
 > (da2:mps0:0:4:0): Info: 0xe81110
 > (da2:mps0:0:4:0): Error 5, Unretryable error
 -- End of security output --
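
The same messages can also be checked directly on the box instead of waiting for the nightly email. Something along these lines (substituting your own daN device for da2) should work on FreeBSD/FreeNAS:

% dmesg | grep da2
% grep da2 /var/log/messages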
 

I also received an email containing more readable zpool status output:

[...]
  pool: REDPOOL_4X3TB
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Sun Nov 17 00:00:02 2019
        437G scanned at 41.2M/s, 373G issued at 35.2M/s, 3.75T total
        904K repaired, 9.71% done, 1 days 04:02:43 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        REDPOOL_4X3TB                                   ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/4bc78f89-774a-11e6-a507-1c98ec0ec444  ONLINE       0     0     0
            gptid/4c75dd9a-774a-11e6-a507-1c98ec0ec444  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/4d387bd8-774a-11e6-a507-1c98ec0ec444  ONLINE       0     0     1
            gptid/9fc44493-081f-11e9-8568-a01d48c71098  ONLINE       0     0     0

[...]

I took note of which drive had the checksum error:

mirror-1 ONLINE 0 0 0
gptid/4d387bd8-774a-11e6-a507-1c98ec0ec444 ONLINE 0 0 1
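
The zpool output identifies pool members only by their gptid labels, so the next step is to map the affected label back to a daN device. On FreeBSD/FreeNAS, glabel status lists those mappings; as a quick sketch, grep for the first few characters of the gptid (yours will differ):

% glabel status | grep 4d387bd8

The matching line should end in the partition carrying that label (da2p2 in my case), which tells you which /dev/daN to point smartctl at.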

I then SSHed into my FreeNAS machine (you can use the Shell web terminal in the FreeNAS web GUI if you prefer) and ran smartctl -a /dev/da2 to verify that /dev/da2 was indeed the failing drive. Here I've omitted the unrelated parts with [...]:

% sudo smartctl -a /dev/da2

smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N1FLTH0V
LU WWN Device Id: 5 0014ee 20d9cf75b
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Nov 17 23:46:58 2019 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
[...]
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 114) The previous self-test completed having
                                        the read element of the test failed.
[...]

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
[...]
  9 Power_On_Hours          0x0032   062   062   000    Old_age   Always       -       27833
[...]

The following line tells you how many hours the drive has been powered on:

9 Power_On_Hours 0x0032 062 062 000 Old_age Always - 27833

In this case, the failing drive had been powered on for 27833 hours (about 3.17 years), unfortunately just outside Western Digital's 3-year warranty!
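
If you only want that one figure, you can pull it straight out of the smartctl output and convert hours to years in a single line. A rough sketch, assuming the raw value sits in the last column of the attribute line as it does above:

% sudo smartctl -a /dev/da2 | awk '/Power_On_Hours/ {printf "%.2f years\n", $NF/8760}'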

Replacing The Failed Drive

Luckily I had a replacement drive ready to go. I grabbed the replacement drive and made a note of the key information:

Model: WD30EFRX-68EUZN0 3TB Western Digital Red NAS SATA HDD
Serial: WD-WCC4N1JUX6TN

So I made a plan to replace failing drive WD-WCC4N1FLTH0V with new drive WD-WCC4N1JUX6TN.

I then logged into the FreeNAS web interface and navigated to Storage > Pools > Status.

Storage > Pools > Status in the FreeNAS 11 web interface

I selected da2p2 and clicked Offline. When the drive was taken offline, I shut down the server via the web interface and physically replaced the drive, double-checking the serial of the removed drive and its replacement.
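
The GUI is the recommended way to do this on FreeNAS, since it keeps the middleware in sync, but for reference the equivalent ZFS command, using the failing member's gptid from the status output above, would be roughly:

% sudo zpool offline REDPOOL_4X3TB gptid/4d387bd8-774a-11e6-a507-1c98ec0ec444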

Remember that drive indexes are zero-based, so in my case da2 was the third drive.
When you remove the drive you think has failed, always double-check the serial!
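
A quick way to do that double-check from the shell before powering down is to print the serial that each daN device reports and match it against the sticker on the drive. A sketch for one device (repeat for each daN in your system):

% sudo smartctl -i /dev/da2 | grep -i serial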

I powered the server back on and waited for the FreeNAS web interface to come back up, logged in, then navigated to Storage > Pools > Status.

I selected the missing drive (/dev/gptid/4d387bd8-774a-11e6-a507-1c98ec0ec444) and clicked Replace.

The dialog that popped up offered to replace it with /dev/da3. Note that because the old /dev/da2 was no longer present when the server restarted, the old da3 had taken da2's place and the new drive had come up as da3. To verify this, I ran another smartctl check via the FreeNAS shell:

% sudo smartctl -a /dev/da3
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N1JUX6TN
LU WWN Device Id: 5 0014ee 20d4a2714
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Nov 18 00:25:15 2019 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

[...]

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
[...]
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always  
[...]

Notice that the Serial Number: WD-WCC4N1JUX6TN line matches the serial of the replacement drive, confirming that /dev/da3 is the correct drive to select as the replacement in the Replace dialog box.
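
Another quick way to see how the disks were re-enumerated after the reboot is camcontrol, which lists every attached device together with the daN (and passN) name it was assigned; just as a sketch:

% sudo camcontrol devlist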

I selected da3 from the dropdown and clicked the Replace button. After a short wait, the web interface confirmed that rebuilding (resilvering) the data pool had begun.

Keeping an Eye on Things

In my case, replacing the 3TB drive was projected to take approximately 3 hours. I was able to check the status of the rebuild by running the zpool status command via the shell:

% zpool status
  pool: REDPOOL_4X3TB
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Nov 18 00:26:42 2019
        1.04T scanned at 725M/s, 668G issued at 454M/s, 3.75T total
        135G resilvered, 17.37% done, 0 days 01:59:18 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        REDPOOL_4X3TB                                     DEGRADED     0     0     0
          mirror-0                                        ONLINE       0     0     0
            gptid/4bc78f89-774a-11e6-a507-1c98ec0ec444    ONLINE       0     0     0
            gptid/4c75dd9a-774a-11e6-a507-1c98ec0ec444    ONLINE       0     0     0
          mirror-1                                        DEGRADED     0     0     0
            replacing-0                                   OFFLINE      0     0     0
              16082153518450749111                        OFFLINE      0     0     0  was /dev/gptid/4d387bd8-774a-11e6-a507-1c98ec0ec444
              gptid/a6aff1f4-094e-11ea-8d34-a01d48c71098  ONLINE       0     0     0
            gptid/9fc44493-081f-11e9-8568-a01d48c71098    ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:02:12 with 0 errors on Mon Nov 11 03:47:12 2019
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da4p2     ONLINE       0     0     0

errors: No known data errors
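
If you'd rather not keep re-running that by hand, a simple loop that prints the progress line every five minutes will do. A rough sketch assuming a Bourne-style shell (Ctrl-C stops it):

% sh -c 'while true; do zpool status REDPOOL_4X3TB | grep done; sleep 300; done'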

Don’t forget to label the failed drive or box you’re storing it in so that it doesn’t get reused later on.

Label failed drives and always dispose of them responsibly.

Final Verification

It took 6 hours and 13 minutes to rebuild the pool, roughly twice the estimated time, so don't be surprised if yours takes a while too. Here is what the zpool status command returned when the operation completed:

% zpool status
  pool: REDPOOL_4X3TB
 state: ONLINE
  scan: resilvered 1.88T in 0 days 06:13:26 with 0 errors on Mon Nov 18 06:40:08 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        REDPOOL_4X3TB                                   ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/4bc78f89-774a-11e6-a507-1c98ec0ec444  ONLINE       0     0     0
            gptid/4c75dd9a-774a-11e6-a507-1c98ec0ec444  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/a6aff1f4-094e-11ea-8d34-a01d48c71098  ONLINE       0     0     0
            gptid/9fc44493-081f-11e9-8568-a01d48c71098  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:02:12 with 0 errors on Mon Nov 11 03:47:12 2019
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da4p2     ONLINE       0     0     0

errors: No known data errors
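
As one last sanity check, zpool status -x reports only pools that have problems, so once everything is back to normal it should simply print:

% zpool status -x
all pools are healthy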

Thanks for reading, and I hope this guide helps you out. Leave a comment if it does, or if you find any errors. Cheers!
