I’ve replaced aging disks in my FreeNAS box a few times now, and I always document procedures to save time later on. With that in mind, I thought it might help others if I shared the workflow online. So here you go!
Initial Diagnosis
I received an automated email from my FreeNAS box with the following kernel log messages. Notice the MEDIUM ERROR:
freenas.lan kernel log messages:
[...]
> (da2:mps0:0:4:0): Retrying command
> (da2:mps0:0:4:0): READ(10). CDB: 28 00 00 e8 10 f0 00 01 00 00
> (da2:mps0:0:4:0): CAM status: SCSI Status Error
> (da2:mps0:0:4:0): SCSI status: Check Condition
> (da2:mps0:0:4:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> (da2:mps0:0:4:0): Info: 0xe81110
> (da2:mps0:0:4:0): Error 5, Unretryable error
-- End of security output --
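For the curious, the CDB bytes in that log can be decoded by hand. The sketch below assumes the standard READ(10) layout (opcode 0x28, a 4-byte big-endian LBA in bytes 2-5, a 2-byte block count in bytes 7-8) and assumes the Info value is the LBA of the block that failed. The function name decode_read10 is my own and is not part of any FreeNAS tool.

```python
def decode_read10(cdb_hex):
    """Return (starting LBA, block count) from a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, count


lba, count = decode_read10("28 00 00 e8 10 f0 00 01 00 00")
failed_lba = 0xE81110  # the "Info" value from the kernel log (assumed to be an LBA)
offset = failed_lba - lba
in_range = lba <= failed_lba < lba + count
print(lba, count, offset, in_range)  # 15208688 256 32 True
```

Under those assumptions, the drive was asked for 256 blocks and gave up 32 blocks into the request.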
I also received an email with a more readable output:
[...]
  pool: REDPOOL_4X3TB
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Sun Nov 17 00:00:02 2019
        437G scanned at 41.2M/s, 373G issued at 35.2M/s, 3.75T total
        904K repaired, 9.71% done, 1 days 04:02:43 to go
config:
        NAME                                            STATE     READ WRITE CKSUM
        REDPOOL_4X3TB                                   ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/4bc78f89-774a-11e6-a507-1c98ec0ec444  ONLINE       0     0     0
            gptid/4c75dd9a-774a-11e6-a507-1c98ec0ec444  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/4d387bd8-774a-11e6-a507-1c98ec0ec444  ONLINE       0     0     1
            gptid/9fc44493-081f-11e9-8568-a01d48c71098  ONLINE       0     0     0
[...]
I took note of which drive had the checksum error:
          mirror-1                                      ONLINE       0     0     0
            gptid/4d387bd8-774a-11e6-a507-1c98ec0ec444  ONLINE       0     0     1
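If you would rather not eyeball the counters, here is a minimal sketch that picks out any device row with a nonzero READ, WRITE or CKSUM value. It assumes you have already captured the zpool status text yourself; the function name devices_with_errors is hypothetical, and the sample is a shortened copy of the output above.

```python
import re

COUNTER = re.compile(r"\d+(\.\d+)?[KMGTP]?")  # zpool may abbreviate large counts, e.g. 1.2K


def devices_with_errors(status_text):
    """Return {device_name: (read, write, cksum)} for rows with a nonzero counter."""
    flagged = {}
    for line in status_text.splitlines():
        fields = line.split()
        # Device rows look like: NAME STATE READ WRITE CKSUM [optional notes]
        if len(fields) < 5:
            continue
        counters = tuple(fields[2:5])
        if not all(COUNTER.fullmatch(c) for c in counters):
            continue  # header, scan progress, or other non-device line
        if any(c != "0" for c in counters):
            flagged[fields[0]] = counters
    return flagged


sample = """\
NAME STATE READ WRITE CKSUM
REDPOOL_4X3TB ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
gptid/4d387bd8-774a-11e6-a507-1c98ec0ec444 ONLINE 0 0 1
gptid/9fc44493-081f-11e9-8568-a01d48c71098 ONLINE 0 0 0
"""
print(devices_with_errors(sample))
```

The counters are kept as strings because zpool can print abbreviated values for large counts.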
I then SSHed into my FreeNAS machine (you can use the Shell web terminal in the FreeNAS web GUI if you prefer) and ran smartctl -a /dev/da2 to verify that /dev/da2 was indeed the failing drive. Here I’ve omitted the unrelated parts with [...]:
% sudo smartctl -a /dev/da2
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD30EFRX-68EUZN0
Serial Number: WD-WCC4N1FLTH0V
LU WWN Device Id: 5 0014ee 20d9cf75b
Firmware Version: 82.00A82
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Nov 17 23:46:58 2019 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
[...]
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 114) The previous self-test completed having
the read element of the test failed.
[...]
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
[...]
9 Power_On_Hours 0x0032 062 062 000 Old_age Always - 27833
[...]
The following line tells you the number of hours the drive has been operational for:
9 Power_On_Hours 0x0032 062 062 000 Old_age Always - 27833
In this case, the failing drive has been running for about 3.18 years (27833 hours) so far. Unfortunately, that is just outside the Western Digital 3-year warranty!
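The conversion is simple division, sketched below. It assumes the raw value is a plain integer in the last column, as it is on this drive (some drives report it in other formats), and it uses a 365-day year. The helper name power_on_hours is my own.

```python
HOURS_PER_YEAR = 24 * 365  # ignores leap days


def power_on_hours(attribute_line):
    """Return the RAW_VALUE (last column) of a Power_On_Hours row as an int."""
    fields = attribute_line.split()
    if len(fields) < 10 or fields[1] != "Power_On_Hours":
        raise ValueError("not a complete Power_On_Hours row")
    return int(fields[-1])


line = "9 Power_On_Hours 0x0032 062 062 000 Old_age Always - 27833"
hours = power_on_hours(line)
years = hours / HOURS_PER_YEAR
print(f"{hours} hours = {years:.3f} years")  # 27833 hours = 3.177 years
```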
Replacing The Failed Drive
Luckily I had a replacement drive ready to go. I grabbed the replacement drive and made a note of the key information:
Model: WD30EFRX-68EUZN0 3TB Western Digital Red NAS SATA HDD
Serial: WD-WCC4N1JUX6TN
So I made a plan to replace failing drive WD-WCC4N1FLTH0V with new drive WD-WCC4N1JUX6TN.
I then logged into the FreeNAS web interface and navigated to Storage > Pools > Status.
I selected da2p2 and clicked Offline. When the drive was taken offline, I shut down the server via the web interface and physically replaced the drive, double-checking the serial of the removed drive and its replacement.
I powered the server back on, waited for the FreeNAS web interface to come back up, logged in, and navigated to Storage > Pools > Status.
I selected the missing drive (/dev/gptid/4d387bd8-774a-11e6-a507-1c98ec0ec444) and clicked Replace.
The dialog that popped up offered to replace it with /dev/da3. Notice that since the old /dev/da2 was not available when the server restarted, the old da3 had taken da2’s place, and the new drive came up as da3. To verify this, I ran another smartctl check via the FreeNAS shell:
% sudo smartctl -a /dev/da3
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD30EFRX-68EUZN0
Serial Number: WD-WCC4N1JUX6TN
LU WWN Device Id: 5 0014ee 20d4a2714
Firmware Version: 82.00A82
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Nov 18 00:25:15 2019 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
[...]
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
[...]
9 Power_On_Hours 0x0032 100 100 000 Old_age Always
[...]
Notice that the Serial Number: WD-WCC4N1JUX6TN line matches the serial of the replacement drive, confirming that /dev/da3 is the correct drive to select as the replacement in the Replace dialog box.
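Since device names can shift after a reboot, this serial check is worth getting right. Here is a minimal sketch of the same comparison, assuming you have saved the smartctl -a output as text; the function name smart_serial is hypothetical, and the sample is trimmed from the output above.

```python
def smart_serial(smartctl_output):
    """Return the value of the 'Serial Number:' line, or None if it is absent."""
    for line in smartctl_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Serial Number":
            return value.strip()
    return None


EXPECTED_NEW = "WD-WCC4N1JUX6TN"  # serial noted from the replacement drive's label
FAILING_OLD = "WD-WCC4N1FLTH0V"   # serial of the drive that was pulled

sample = """\
Device Model: WDC WD30EFRX-68EUZN0
Serial Number: WD-WCC4N1JUX6TN
Firmware Version: 82.00A82
"""
serial = smart_serial(sample)
is_new_drive = serial == EXPECTED_NEW
is_old_drive = serial == FAILING_OLD
print(serial, is_new_drive, is_old_drive)  # WD-WCC4N1JUX6TN True False
```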
I selected da3 from the dropdown and clicked the Replace button. After a short wait, the web interface confirmed that rebuilding (resilvering) the data pool had begun.
Keeping an Eye on Things
In my case, replacing the 3TB drive was projected to take approximately 3 hours. I was able to check the status of the rebuild by running the zpool status command via the shell:
% zpool status
  pool: REDPOOL_4X3TB
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Nov 18 00:26:42 2019
        1.04T scanned at 725M/s, 668G issued at 454M/s, 3.75T total
        135G resilvered, 17.37% done, 0 days 01:59:18 to go
config:
        NAME                                              STATE     READ WRITE CKSUM
        REDPOOL_4X3TB                                     DEGRADED     0     0     0
          mirror-0                                        ONLINE       0     0     0
            gptid/4bc78f89-774a-11e6-a507-1c98ec0ec444    ONLINE       0     0     0
            gptid/4c75dd9a-774a-11e6-a507-1c98ec0ec444    ONLINE       0     0     0
          mirror-1                                        DEGRADED     0     0     0
            replacing-0                                   OFFLINE      0     0     0
              16082153518450749111                        OFFLINE      0     0     0  was /dev/gptid/4d387bd8-774a-11e6-a507-1c98ec0ec444
              gptid/a6aff1f4-094e-11ea-8d34-a01d48c71098  ONLINE       0     0     0
            gptid/9fc44493-081f-11e9-8568-a01d48c71098    ONLINE       0     0     0
errors: No known data errors
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:02:12 with 0 errors on Mon Nov 11 03:47:12 2019
config:
        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da4p2       ONLINE       0     0     0
errors: No known data errors
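To track progress without rereading the whole report, the sketch below pulls the percentage and remaining time out of the resilver line. The regular expression is an assumption based on the wording in the output above, which may differ between ZFS versions, and the function name resilver_progress is my own.

```python
import re
from datetime import timedelta

PROGRESS = re.compile(
    r"(?P<done>[\d.]+[KMGTP]?) resilvered, (?P<pct>[\d.]+)% done, "
    r"(?P<days>\d+) days (?P<h>\d+):(?P<m>\d+):(?P<s>\d+) to go"
)


def resilver_progress(status_text):
    """Return (percent_done, time_remaining), or None if no progress line is found."""
    match = PROGRESS.search(status_text)
    if match is None:
        return None
    remaining = timedelta(
        days=int(match["days"]),
        hours=int(match["h"]),
        minutes=int(match["m"]),
        seconds=int(match["s"]),
    )
    return float(match["pct"]), remaining


pct, remaining = resilver_progress(
    "135G resilvered, 17.37% done, 0 days 01:59:18 to go"
)
print(pct, remaining)  # 17.37 1:59:18
```

Keep in mind that the remaining time is only ZFS's running estimate.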
Don’t forget to label the failed drive or box you’re storing it in so that it doesn’t get reused later on.
Final Verification
It took 6 hours and 13 minutes to rebuild the pool, slightly more than twice the 3-hour estimate, so don’t be surprised if yours is taking a while too. Here is what the zpool status command returned when the operation completed:
% zpool status
  pool: REDPOOL_4X3TB
 state: ONLINE
  scan: resilvered 1.88T in 0 days 06:13:26 with 0 errors on Mon Nov 18 06:40:08 2019
config:
        NAME                                            STATE     READ WRITE CKSUM
        REDPOOL_4X3TB                                   ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/4bc78f89-774a-11e6-a507-1c98ec0ec444  ONLINE       0     0     0
            gptid/4c75dd9a-774a-11e6-a507-1c98ec0ec444  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/a6aff1f4-094e-11ea-8d34-a01d48c71098  ONLINE       0     0     0
            gptid/9fc44493-081f-11e9-8568-a01d48c71098  ONLINE       0     0     0
errors: No known data errors
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:02:12 with 0 errors on Mon Nov 11 03:47:12 2019
config:
        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da4p2       ONLINE       0     0     0
errors: No known data errors
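To put a number on how far off the estimate was, here is a small sketch that reads the duration from the scan line and compares it with the roughly 3 hours projected at the start. The pattern is an assumption based on the wording shown above, and resilver_duration is a hypothetical helper name.

```python
import re
from datetime import timedelta


def resilver_duration(scan_line):
    """Return the elapsed time from a 'resilvered ... in D days HH:MM:SS' line."""
    match = re.search(r"in (\d+) days (\d+):(\d+):(\d+)", scan_line)
    if match is None:
        raise ValueError("no duration found in scan line")
    days, hours, minutes, seconds = map(int, match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


scan = "scan: resilvered 1.88T in 0 days 06:13:26 with 0 errors on Mon Nov 18 06:40:08 2019"
actual = resilver_duration(scan)
estimate = timedelta(hours=3)  # the approximate projection at the start
ratio = actual / estimate
print(actual, round(ratio, 2))  # 6:13:26 2.07
```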
Thanks for reading, and I hope this guide helps you out. Leave a comment if it helps you, or if you find any errors. Cheers!
Clear instructions for how to replace a failed drive in a FreeNAS pool, thank you.
You’re welcome! Glad to help 🙂
A second thank-you here. The new FreeNAS interface has been bumbling around finding everything again…
No worries – I know what it’s like. We’re on “TrueNAS” now!
This process does not require the system to be taken offline; shutting down should be avoided whenever possible and only used as a last resort. Depending on your hardware, you can try to identify the failed drive with commands such as sas2ircu or sesutil, which let you find the drive by its activity light. Then set the drive to offline, physically replace it, use the Replace option on the newly discovered drive, and select the slot ID you wish to place it into (da2, da4, etc.). There are other options as well, such as performing a disk read to /dev/null on the failed device to see activity that is more consistent than on the rest of the pool disks, then performing the same offline and replace steps.
True, and good points @zman. Unfortunately in my example, I’m using an HP Microserver Gen8. The backplane isn’t designed to safely handle the electrical load changes that happen when hot-swapping a drive. For more info there is a discussion at https://www.reddit.com/r/homelab/comments/7o5wyt/hotswap_on_gen8_hp_micro_server/.
Found elsewhere on the internet that there is a problem with WD firmware 82.00A82 when using with ZFS, so be aware!
Thanks for sharing Paul. I’ll be on the lookout for odd behavior! Do you have a link to info on the bug?
Thanks for this post Ben.
i have an issue that i cant see the new hdd i have inserted so when i click on replace, there is nothing to choose to replace it with.
Any ideas?
Hi Anthony. Sorry to hear you’re having trouble. A few questions to clarify:
1) Is the replacement HDD clean? Has it been previously partitioned or formatted?
2) Are you using a RAID card that needs to initialize the newly inserted disk? (usually done via RAID BIOS (option ROM))
3) Have you rebooted the server since installing the disk? (Not ideal, I know)
Hi Ben
yes its a brand new hdd
as far as i recall the raid controller didnt require anything on the initial set up to initialise new disks
yes i tried doing hot swap which didnt work so i have since gone through the full reboot
Hrmmm… so you’re still not seeing it after a reboot? May I ask what sizes the original and new disks are? Sometimes there are firmware issues with certain RAID/HBA that prevent disks larger than 2/6/10/12TB from being accessible.
If you have a spare machine (and no data on the replacement disk), I would try hooking it up to check what partition table and filesystem are currently set up on it. Usually a new disk doesn’t require that much babying though.
…Another thought – sometimes disk sizes vary slightly between brands – a 4TB WD and a 4TB Seagate, for example, might not have the same number of blocks available. If this is the case, ZFS can’t replace an existing volume member with a smaller capacity.
so after messing around with it i have got it to find the new drive. now it sees it as faulted though
does this seem strange for a brand new drive, i dont know that ive ever had a drive with a fault straight out of the box?
it is the same drive as all the rest a 4tb WD Red
Definitely try getting it replaced under warranty. WD is really good at shipping out replacements. I’ve never had a DOA at home but it does happen. There’s also a chance it was mishandled during transit. Are you able to get SMART readings from it?
Thank you, this was a great help. I’m lazy and impatient, so I executed a while() in the SSH command line:
while 1
? zpool status | grep resilvered
? sleep 10
? end
It goes like this….
status: One or more devices is currently being resilvered. The pool will
22.2G resilvered, 1.25% done, 0 days 10:15:41 to go
status: One or more devices is currently being resilvered. The pool will
22.9G resilvered, 1.29% done, 0 days 10:08:55 to go
status: One or more devices is currently being resilvered. The pool will
23.8G resilvered, 1.34% done, 0 days 09:59:07 to go
status: One or more devices is currently being resilvered. The pool will
24.6G resilvered, 1.38% done, 0 days 09:50:52 to go
Great tip, thanks!