
S.M.A.R.T support

Open geekifan opened this issue 1 year ago • 16 comments

I followed the approach used by GPUManager to add S.M.A.R.T. support to the agent. Since I am not a Go expert and do not have enough physical devices on hand for testing, I hope someone can do a basic review of my code and test it on their own devices. Once everything is ready, I will proceed with modifying the hub's code.

TODO:

  • [X] Add S.M.A.R.T. manager in agent
  • [X] Show disk and smart info in web ui
  • [ ] Add S.M.A.R.T. failing alert
  • [ ] Documentation & tests

geekifan avatar Feb 22 '25 09:02 geekifan

Hi Yifan, thank you very much for your work!

This looks like a great start. Let me get back to you later in the week as I have limited time right now and am trying to get the next release out as soon as possible.

On the hub side we should probably create a new table (PocketBase collection) for this data.

From my limited knowledge I think parsing smartctl output is a fine approach and should work on macOS also. But I may be wrong.

There's also this Go library which provides SMART information: https://github.com/anatol/smart.go

And a standalone application, Scrutiny, which is written in Go and may be a helpful reference: https://github.com/AnalogJ/scrutiny

As far as hardware, I'm in the same boat as you. I actually don't even own a HDD, but we should be able to find some output samples online and use them as test data (or people using Beszel can provide them).

Again, I appreciate your time and will get back to you as soon as I can.

Edit: If anyone reads this and wants to provide sample output, please change the serial numbers before sharing.

henrygd avatar Feb 23 '25 01:02 henrygd

Thank you very much for your detailed response.

First, I have considered using smart.go. If we use smart.go, we will be dependent on all its aspects (such as potential bugs and the possibility that its smart database may not be updated in a timely manner). If such issues arise and it is no longer maintained, all we can do is fork it, fix the bugs, or update the smart database. This would add a significant burden to the maintenance of beszel. In contrast, smartctl is a very widely used tool, with timely updates to the smart database and more prompt maintenance in case of bugs. Its support for JSON-formatted output is a great advantage for data parsing in Go.

Regarding the macOS issue, I currently also have macOS and will conduct tests later.

The hardware I currently have available for testing includes: NVMe/SATA/SCSI (only testable under Linux platform), and USB storage, which should cover mainstream hardware. What I really worry about are some corner cases.

Additionally, I have a few issues that I am unsure how to handle:

  1. The SMART data for SATA/SCSI uses the ATA format, while NVMe uses a different format, leading to inconsistencies in SMART key values. Other hardware might have more SMART formats, so I believe we need everyone's help to find the appropriate data structures to store and monitor them.
  2. Due to the hot-swappable nature of hard drives, if a hard drive is unplugged, the agent will drop the corresponding data entry when reporting to the hub. But how will the hub handle the missing data? Will it delete the corresponding hard drive data when displaying, or will it retain the state at the time of unplugging? (Sorry, I am not familiar with PocketBase and some database operations.)

EDIT: I checked the code of https://github.com/AnalogJ/scrutiny. Scrutiny parses the json output of smartctl to get the SMART info.

geekifan avatar Feb 23 '25 02:02 geekifan

Sounds good, I agree with the direct smartctl approach.

I don't think there's any reason to worry about corner cases in the first iteration. We'll get sample output and include the most important or common values.

If there's an issue parsing then we'll just log an error. We can add support for more formats as people request them.

Hopefully the JSON structure is consistent and it's just the properties that differ, because dealing with inconsistent JSON is not fun.

The regular non-JSON output looks easy to parse, so we could just use bufio to scan the output line by line for the values we need.
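
A minimal sketch of that line-by-line idea, assuming the values of interest all appear as "Key: Value" lines. The function name parseKeyValues is hypothetical and not part of the PR; table rows and headers without a colon are simply skipped.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseKeyValues scans plain smartctl output line by line and collects
// "Key: Value" pairs. Lines without a colon (headers, table rows) are skipped.
func parseKeyValues(output string) map[string]string {
	values := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		// Split on the first colon only, so values that contain colons
		// (such as timestamps) stay intact.
		key, value, found := strings.Cut(scanner.Text(), ":")
		if !found {
			continue
		}
		values[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return values
}

func main() {
	sample := "Temperature:                        34 Celsius\n" +
		"Power On Hours:                     1,392\n" +
		"No Errors Logged\n"
	values := parseKeyValues(sample)
	fmt.Println(values["Temperature"])    // 34 Celsius
	fmt.Println(values["Power On Hours"]) // 1,392
}
```

Values come back as raw strings, so units and thousands separators (for example "1,392") would still need to be handled before storing numbers.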

Here's output from my laptop with one nvme drive:

smartctl --scan
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
smartctl --scan -j
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      4
    ],
    "pre_release": false,
    "svn_revision": "5530",
    "platform_info": "x86_64-linux-6.13.2-arch1-1",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "--scan",
      "-j"
    ],
    "exit_status": 0
  },
  "devices": [
    {
      "name": "/dev/nvme0",
      "info_name": "/dev/nvme0",
      "type": "nvme",
      "protocol": "NVMe"
    }
  ]
}
sudo smartctl -a /dev/nvme0
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.13.2-arch1-1] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       WD PC SN810 SDCPNRY-1T00-1006
Serial Number:                      226223861317
Firmware Version:                   HPS2
PCI Vendor/Subsystem ID:            0x15b7
IEEE OUI Identifier:                0x001b44
Total NVM Capacity:                 1,024,209,543,168 [1.02 TB]
Unallocated NVM Capacity:           0
Controller ID:                      8224
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,024,209,543,168 [1.02 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            001c44 8b25c6eb61
Local Time is:                      Sun Feb 23 19:34:35 2025 EST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     84 Celsius
Critical Comp. Temp. Threshold:     88 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.25W    8.25W       -    0  0  0  0        0       0
 1 +     3.50W    3.50W       -    0  0  0  0        0       0
 2 +     2.60W    2.60W       -    0  0  0  0        0       0
 3 -   0.0250W       -        -    3  3  3  3     5000   10000
 4 -   0.0035W       -        -    4  4  4  4     3900   45700

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        34 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    0%
Data Units Read:                    20,427,281 [10.4 TB]
Data Units Written:                 27,523,884 [14.0 TB]
Host Read Commands:                 308,278,905
Host Write Commands:                722,398,619
Controller Busy Time:               2,230
Power Cycles:                       3,086
Power On Hours:                     1,392
Unsafe Shutdowns:                   173
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged
sudo smartctl -aj /dev/nvme0
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      4
    ],
    "pre_release": false,
    "svn_revision": "5530",
    "platform_info": "x86_64-linux-6.13.2-arch1-1",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "-aj",
      "/dev/nvme0"
    ],
    "exit_status": 0
  },
  "local_time": {
    "time_t": 1740357511,
    "asctime": "Sun Feb 23 19:38:31 2025 EST"
  },
  "device": {
    "name": "/dev/nvme0",
    "info_name": "/dev/nvme0",
    "type": "nvme",
    "protocol": "NVMe"
  },
  "model_name": "WD PC SN810 SDCPNRY-1T00-1006",
  "serial_number": "286223861317",
  "firmware_version": "HPS2",
  "nvme_pci_vendor": {
    "id": 5559,
    "subsystem_id": 5559
  },
  "nvme_ieee_oui_identifier": 5920,
  "nvme_total_capacity": 1024209543168,
  "nvme_unallocated_capacity": 0,
  "nvme_controller_id": 8224,
  "nvme_version": {
    "string": "1.4",
    "value": 66560
  },
  "nvme_number_of_namespaces": 1,
  "nvme_namespaces": [
    {
      "id": 1,
      "size": {
        "blocks": 2000409264,
        "bytes": 1024209543168
      },
      "capacity": {
        "blocks": 2000409264,
        "bytes": 1024209543168
      },
      "utilization": {
        "blocks": 2000409264,
        "bytes": 1024209543168
      },
      "formatted_lba_size": 512,
      "eui64": {
        "oui": 5930,
        "ext_id": 592171146913
      }
    }
  ],
  "user_capacity": {
    "blocks": 2000409264,
    "bytes": 1024209543168
  },
  "logical_block_size": 512,
  "smart_support": {
    "available": true,
    "enabled": true
  },
  "smart_status": {
    "passed": true,
    "nvme": {
      "value": 0
    }
  },
  "nvme_smart_health_information_log": {
    "critical_warning": 0,
    "temperature": 34,
    "available_spare": 100,
    "available_spare_threshold": 5,
    "percentage_used": 0,
    "data_units_read": 20427312,
    "data_units_written": 27524011,
    "host_reads": 308279032,
    "host_writes": 722405653,
    "controller_busy_time": 2230,
    "power_cycles": 3086,
    "power_on_hours": 1392,
    "unsafe_shutdowns": 173,
    "media_errors": 0,
    "num_err_log_entries": 0,
    "warning_temp_time": 0,
    "critical_comp_time": 0
  },
  "temperature": {
    "current": 34
  },
  "power_cycle_count": 3086,
  "power_on_time": {
    "hours": 1392
  },
  "nvme_error_information_log": {
    "size": 256,
    "read": 16,
    "unread": 0
  },
  "nvme_self_test_log": {
    "current_self_test_operation": {
      "value": 0,
      "string": "No self-test in progress"
    }
  }
}
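
For comparison, a rough sketch of decoding the JSON form with encoding/json. The struct covers only a handful of the keys visible in the output above; the type and function names (nvmeReport, parseNvmeReport) are made up for illustration and are not the names used in the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nvmeReport mirrors a small subset of the `smartctl -aj` output shown above.
// Keys that are not listed here are ignored by encoding/json.
type nvmeReport struct {
	ModelName    string `json:"model_name"`
	SerialNumber string `json:"serial_number"`
	SmartStatus  struct {
		Passed bool `json:"passed"`
	} `json:"smart_status"`
	Health struct {
		Temperature    int `json:"temperature"`
		PercentageUsed int `json:"percentage_used"`
		PowerOnHours   int `json:"power_on_hours"`
		MediaErrors    int `json:"media_errors"`
	} `json:"nvme_smart_health_information_log"`
}

// parseNvmeReport decodes raw smartctl JSON into the subset defined above.
func parseNvmeReport(raw []byte) (nvmeReport, error) {
	var report nvmeReport
	err := json.Unmarshal(raw, &report)
	return report, err
}

func main() {
	raw := []byte(`{
		"model_name": "WD PC SN810 SDCPNRY-1T00-1006",
		"smart_status": {"passed": true},
		"nvme_smart_health_information_log": {"temperature": 34, "power_on_hours": 1392}
	}`)
	report, err := parseNvmeReport(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(report.ModelName, report.SmartStatus.Passed, report.Health.Temperature)
}
```

ATA and SCSI drives report their attributes under different keys, so they would need their own structs or optional fields.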

If a drive is unplugged and not in current updates, we'll just keep the record for some predefined time, like a week.

So the data would remain the same as when the drive was unplugged. We could show a 'last updated' time or up/down indicator.

I'll use a scheduled job to delete records that haven't had an update in a week. We could also give users an option to delete the drive themselves.

You can keep the scope of this PR as narrow as you'd like. Just having something working on the agent side is a huge help! I can handle the rest of it no problem.

There's also no rush as I have two other big PRs in the queue as well.

henrygd avatar Feb 24 '25 00:02 henrygd

As far as hardware, I'm in the same boat as you. I actually don't even own a HDD, but we should be able to find some output samples online and use them as test data (or people using Beszel can provide them).

Edit: If anyone reads this and wants to provide sample output, please change the serial numbers before sharing.

Let me know what you need (and more so how to pull it) and I'll happily provide from across my drives.

sym0nd0 avatar Apr 13 '25 10:04 sym0nd0

Recently, I've been occupied with other projects and haven't been able to dedicate much time to the SMART feature development. However, I may now be able to allocate some time to work on this, particularly on the front-end and database aspects (though I can't guarantee significant progress at this stage).

Regarding the front-end implementation, I'd like to get your thoughts @henrygd : Do you think we should display the SMART data in a separate page, tab, or pop-up window? If so, where would you recommend placing it for optimal user experience?

I don't have much expertise in UI/UX design, so I'd be happy to hear any suggestions or ideas from anyone ;).

geekifan avatar Apr 18 '25 12:04 geekifan

No worries Yifan! Please only work on it if you want to. Don't feel any obligation. What you've already done will be helpful even if you don't do anything more.

We don't need to commit to a specific design right now, but my first thought is to put the SMART data on its own page.

Here's how Scrutiny does it for reference: https://imgur.com/a/5k8qMzS

Maybe on /system/system-name/smart we can have a table similar to the 'All Systems' table that lists all the system's drives with the most useful info. Then clicking on a row will bring you to system/system-name/smart/drive-name with details.

Alternatively, we can just stick the table under the other graphs on the system details page instead of making a standalone page for the SMART data table.

In the future maybe we can include a table on the home page that lists all drives from all systems as well.

IMO the most important part is getting the data where we need it. The layout can always be improved later.

Edit: We use shadcn so you might find something here that fits well: https://ui.shadcn.com

henrygd avatar Apr 19 '25 00:04 henrygd

Alternatively, we can just stick the table under the other graphs on the system details page instead of making a standalone page for the SMART data table.

I think this would be perfect, at least for a start: one panel with a table, one drive per row. Temperature sensor data could be added to the Temperature panel.

evrial avatar Apr 19 '25 12:04 evrial

Thought it might be useful to provide some output from a system with a large number of drives.

My output is from the same commands as above, just filtered with grep -v serial. For smartctl I included only the JSON version; for the scan, both the tabular and JSON versions.

smartscan.txt smartscanjson.txt

This setup is a total of 10 drives in the following configuration:

So the megaraid_disk_0n output in the scan is duplicate, and in the strictest sense, not all of the devices listed are actually SCSI devices. Probably doesn't matter if you're just using it to pull your list of devices for the output of smartctl, but I know that my /etc/smartd.conf (where the tests are configured) definitely cares that you specify the right type of disk (sat vs scsi) when invoking tests.
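
One way to carry the device type through, sketched under the assumption that the agent reads the "devices" array from `smartctl --scan -j` and passes each entry's type back via -d. The names scanResult and smartctlArgs are hypothetical, and the duplicate megaraid entries described above are not filtered here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// scanResult mirrors the "devices" array of `smartctl --scan -j`.
type scanResult struct {
	Devices []struct {
		Name string `json:"name"`
		Type string `json:"type"`
	} `json:"devices"`
}

// smartctlArgs builds one argument list per scanned device, passing the
// reported type through -d so SAT and SCSI disks are queried explicitly.
func smartctlArgs(scanJSON []byte) ([][]string, error) {
	var scan scanResult
	if err := json.Unmarshal(scanJSON, &scan); err != nil {
		return nil, err
	}
	args := make([][]string, 0, len(scan.Devices))
	for _, dev := range scan.Devices {
		args = append(args, []string{"-aj", "-d", dev.Type, dev.Name})
	}
	return args, nil
}

func main() {
	raw := []byte(`{"devices":[{"name":"/dev/nvme0","type":"nvme"}]}`)
	args, err := smartctlArgs(raw)
	if err != nil {
		fmt.Println("scan parse failed:", err)
		return
	}
	fmt.Println(args) // [[-aj -d nvme /dev/nvme0]]
}
```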

Also, I suggest that you key your data on the SN of the disk (or /dev/disk/by-id) rather than the device ID, because sometimes disks can change drive letters at boot when you have this many spread across multiple devices.
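
A small sketch of keying on the serial number, with hypothetical names (driveInfo, upsert). When a disk comes back under a different device path, the existing entry is updated instead of a second one being created.

```go
package main

import "fmt"

// driveInfo is a hypothetical per-drive record kept by the agent.
type driveInfo struct {
	Serial      string
	Device      string // current device path; may change between boots
	Temperature int
}

// upsert stores the drive under its serial number, so a change of device
// path overwrites the existing entry rather than adding a new one.
func upsert(store map[string]driveInfo, d driveInfo) {
	store[d.Serial] = d
}

func main() {
	store := make(map[string]driveInfo)
	upsert(store, driveInfo{Serial: "SN123", Device: "/dev/sda", Temperature: 30})
	// After a reboot the same disk shows up under a different device path.
	upsert(store, driveInfo{Serial: "SN123", Device: "/dev/sdc", Temperature: 31})
	fmt.Println(len(store), store["SN123"].Device) // 1 /dev/sdc
}
```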

wesgeorge avatar Jun 12 '25 19:06 wesgeorge

@wesgeorge Thank you very much for providing the data and suggestions. I will modify the code for the agent part to make it more robust.

Besides, I finished a front-end demo using some hard drive data (with fake serial numbers) I have on hand. Does anyone have any suggestions? Personally, I prefer displaying all disks in a list format and showing more detailed SMART information by clicking on the corresponding row (just like Proxmox VE).

image

EDIT1: I finished the SMART detail dialog.

image

geekifan avatar Jun 16 '25 10:06 geekifan

@geekifan This looks really good, but could you add some indication of warnings? Maybe an extra column showing the number and type of warnings, or something similar.

zero77 avatar Jun 19 '25 10:06 zero77

@zero77 Thank you for your suggestion! I will display the "When Failed" attribute in the SMART information table and highlight the failed attributes in red (or add an error icon) based on this property.
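
Roughly, the selection could work like the sketch below. It assumes each ATA attribute carries a when_failed string that is empty when the attribute has never failed; the exact non-empty values are an assumption, so anything other than "" or "-" is treated as a failure. The names ataAttribute and failedAttributes are hypothetical.

```go
package main

import "fmt"

// ataAttribute holds the fields needed to flag a failed attribute.
// The JSON tags are assumed to match smartctl's ATA attribute table.
type ataAttribute struct {
	ID         int    `json:"id"`
	Name       string `json:"name"`
	WhenFailed string `json:"when_failed"`
}

// failedAttributes returns the attributes that should be highlighted.
// Any value other than "" or "-" is treated as "has failed".
func failedAttributes(attrs []ataAttribute) []ataAttribute {
	var failed []ataAttribute
	for _, a := range attrs {
		if a.WhenFailed != "" && a.WhenFailed != "-" {
			failed = append(failed, a)
		}
	}
	return failed
}

func main() {
	attrs := []ataAttribute{
		{ID: 5, Name: "Reallocated_Sector_Ct", WhenFailed: "now"},
		{ID: 9, Name: "Power_On_Hours", WhenFailed: ""},
	}
	for _, a := range failedAttributes(attrs) {
		fmt.Println(a.ID, a.Name) // 5 Reallocated_Sector_Ct
	}
}
```

The length of the returned slice could also feed a per-drive warning count column in the drive list.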

geekifan avatar Jun 19 '25 10:06 geekifan

It's encouraging to see that the feature I was going to suggest is already in the works. However, I was wondering if the temperature of the HDD could be integrated into the existing temperature tab to track its history?

muro-dot avatar Jun 27 '25 08:06 muro-dot

@muro-dot On the agent side, the hard drive temperatures are read via SMART data and then incorporated into the system temperature readings. Therefore, you can find the temperature curves of different hard drives in the temperature sensor charts.

geekifan avatar Jun 27 '25 13:06 geekifan

Is there any chance to try it out before henrygd officially releases it? I'm really excited about the new features. Sorry for the comment not directly related to the development.

muro-dot avatar Jul 02 '25 01:07 muro-dot

Well, I'm currently facing an issue. Should I set up alerts based on the smartd output from the monitoring target or implement alerts using the metrics obtained from Beszel? (I think the former might be better? After all, a solution implemented at the beszel hub side probably wouldn't be as comprehensive as the alerts in smartd.)

geekifan avatar Jul 20 '25 04:07 geekifan

Well, I'm currently facing an issue. Should I set up alerts based on the smartd output from the monitoring target or implement alerts using the metrics obtained from Beszel? (I think the former might be better? After all, a solution implemented at the beszel hub side probably wouldn't be as comprehensive as the alerts in smartd.)

I agree with starting with alerts based on the smartd output. Alerts based on other metrics can always be added later if there is a need for them.

svenvg93 avatar Jul 24 '25 09:07 svenvg93

Is there any ETA for when this is added? Looks awesome. I would just change the "Type SAT" to "Type SATA" though. SAT sounds weird lol. "SAS" is a different type though.

RikudouGoku avatar Oct 01 '25 14:10 RikudouGoku

I'll try to get this in soon. Hopefully this month.

henrygd avatar Oct 01 '25 16:10 henrygd

I'll try to get this in soon. Hopefully this month.

Awesome!

RikudouGoku avatar Oct 01 '25 16:10 RikudouGoku

@henrygd Thanks Hank! This PR is almost done except for the SMART monitor alerts. I'm busy with my academic work, so I have no time to finish the SMART alerts. I would appreciate it if you could finish the rest. This PR is now ready for review.

geekifan avatar Oct 02 '25 01:10 geekifan

No worries Yifan, I'll finish it off. Thanks again for your work :+1:

henrygd avatar Oct 02 '25 16:10 henrygd

Hi, would it be possible to add smartctl's "-n standby" parameter to avoid waking up "sleepy" disks? Maybe this could be configurable.

M3rcur-x avatar Oct 02 '25 18:10 M3rcur-x

This has finally been added. I need to finish the documentation and clean up a few things, but I'll try to have a release out this weekend.

Thanks again for your efforts, Yifan! Sincerely appreciated.

@M3rcur-x I added standby handling so it should only wake disks once. Then if the disk is sleeping again it will use the previous data.
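
The behaviour described here can be sketched as a small cache. This is only an illustration, not the code that was merged: probeFunc is a hypothetical hook that, when allowWake is false, would run smartctl with "-n standby" and report asleep=true instead of spinning the disk up.

```go
package main

import "fmt"

// reading is a hypothetical, trimmed-down SMART result.
type reading struct {
	Temperature int
}

// probeFunc queries one device. When allowWake is false it is expected to
// leave a sleeping disk alone and report asleep=true.
type probeFunc func(device string, allowWake bool) (r reading, asleep bool, err error)

// standbyCache remembers the last successful reading per device.
type standbyCache struct {
	last map[string]reading
}

func newStandbyCache() *standbyCache {
	return &standbyCache{last: make(map[string]reading)}
}

// read allows waking the disk only while no earlier reading exists.
// Afterwards a sleeping disk yields the previous reading instead.
func (c *standbyCache) read(device string, probe probeFunc) (reading, error) {
	prev, seen := c.last[device]
	r, asleep, err := probe(device, !seen)
	if err != nil {
		return reading{}, err
	}
	if asleep {
		return prev, nil
	}
	c.last[device] = r
	return r, nil
}

func main() {
	cache := newStandbyCache()
	awake := func(string, bool) (reading, bool, error) { return reading{Temperature: 31}, false, nil }
	asleep := func(string, bool) (reading, bool, error) { return reading{}, true, nil }

	first, _ := cache.read("/dev/sda", awake)
	second, _ := cache.read("/dev/sda", asleep)
	fmt.Println(first.Temperature, second.Temperature) // 31 31
}
```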

henrygd avatar Oct 24 '25 23:10 henrygd