
monitor commands causing POR

Open ziegi opened this issue 2 years ago • 2 comments

I am running diskscan 0.19 (I also tried master and 0.20) on Debian 10 with kernel 5.8 and Debian 12 with kernel 6.5, accessing SATA disks (6-16 TB Seagate, WD, Toshiba) attached to an LSI SAS adapter through the Linux mpt3sas driver.

Each time one of the following functions in lib/diskscan.c (and maybe others)

static void disk_ata_monitor_start(disk_t *disk)
static void disk_ata_monitor(disk_t *disk)

is executed, the drive does a power-on reset (POR) because a command times out:

kernel: sd 0:0:1:0: attempting task abort!scmd(0x00000000bfee609e), outstanding for 62048 ms & timeout 60000 ms
kernel: sd 0:0:1:0: [sdb] tag#3615 CDB: ATA command pass through(12)/Blank a1 0c 0e d0 01 00 4f c2 00 b0 00 00
kernel: scsi target0:0:1: handle(0x001a), sas_address(0x300605b012dd2901), phy(1)
kernel: scsi target0:0:1: enclosure logical id(0x300605b012112900), slot(8) 
kernel: scsi target0:0:1: enclosure level(0x0000), connector name( C2  )
kernel: sd 0:0:1:0: task abort: SUCCESS scmd(0x00000000bfee609e)
kernel: mpt3sas_cm0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
kernel: sd 0:0:1:0: Power-on or device reset occurred
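
To make the aborted CDB in the log easier to read, here is a minimal sketch that splits a 12-byte ATA PASS-THROUGH CDB into its fields. The field positions follow the SAT layout for ATA PASS-THROUGH (12) as I understand it. The struct and function names are made up for this example and do not come from diskscan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded fields of an ATA PASS-THROUGH (12) CDB (opcode 0xA1).
 * Hypothetical helper type, not part of diskscan. */
struct ata_pt12 {
    uint8_t protocol;   /* byte 1, bits 4..1 */
    uint8_t t_dir;      /* byte 2, bit 3: 1 = data flows from the device */
    uint8_t byte_block; /* byte 2, bit 2: 1 = length is counted in blocks */
    uint8_t t_length;   /* byte 2, bits 1..0: which field holds the length */
    uint8_t feature;    /* byte 3 */
    uint8_t count;      /* byte 4 */
    uint8_t lba_low;    /* byte 5 */
    uint8_t lba_mid;    /* byte 6 */
    uint8_t lba_high;   /* byte 7 */
    uint8_t device;     /* byte 8 */
    uint8_t command;    /* byte 9 */
};

/* Returns 0 on success, -1 if the buffer is not a 12-byte 0xA1 CDB. */
int ata_pt12_decode(const uint8_t *cdb, size_t len, struct ata_pt12 *out)
{
    assert(cdb != NULL && out != NULL);
    if (len != 12 || cdb[0] != 0xA1)
        return -1;
    out->protocol   = (cdb[1] >> 1) & 0x0F;
    out->t_dir      = (cdb[2] >> 3) & 0x01;
    out->byte_block = (cdb[2] >> 2) & 0x01;
    out->t_length   = cdb[2] & 0x03;
    out->feature    = cdb[3];
    out->count      = cdb[4];
    out->lba_low    = cdb[5];
    out->lba_mid    = cdb[6];
    out->lba_high   = cdb[7];
    out->device     = cdb[8];
    out->command    = cdb[9];
    return 0;
}

/* Names for the protocol values relevant here; everything else is "other". */
const char *ata_protocol_name(uint8_t protocol)
{
    switch (protocol) {
    case 3:  return "Non-data";
    case 4:  return "PIO Data-In";
    case 5:  return "PIO Data-Out";
    case 6:  return "DMA";
    default: return "other";
    }
}
```

Fed the bytes from the log (a1 0c 0e d0 01 00 4f c2 00 b0 00 00), this yields command 0xB0 with feature 0xD0 and LBA mid/high 0x4F/0xC2, which is SMART READ DATA, transferred from the device with a count of one block and protocol value 6 (DMA).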

I tried increasing the timeouts, but with no success. So I am using the following workaround to exclude a drive POR from the errors:

--- diskscan-0.20/lib/diskscan.c	2017-08-25 21:24:14.000000000 +0200
+++ ../diskscan-0.20/lib/diskscan.c	2024-01-10 11:30:21.933342563 +0100
@@ -498,7 +498,8 @@
 	data_log(&disk->data_log, offset/disk->sector_size, data_size/disk->sector_size, &io_res, t);
 
 	// Handle error or incomplete data
-	if (io_res.data != DATA_FULL || io_res.error != ERROR_NONE) {
+	if ((io_res.data != DATA_FULL || io_res.error != ERROR_NONE) 
+	    && !(errno == 0 && io_res.info.sense_key == 0x06 && io_res.info.asc == 0x29 && io_res.info.ascq == 0x00) /* ignore POR */) {
 		int s_errno = errno;
 		ERROR("Error when reading at offset %" PRIu64 " size %d read %zd, errno=%d: %s", offset, data_size, ret, errno, strerror(errno));
 		ERROR("Details: error=%s data=%s %02X/%02X/%02X", error_to_str(io_res.error), data_to_str(io_res.data),

I guess there is a better solution for this by changing the ata_monitor commands, but unfortunately I do not know how.
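
One thing that might be worth experimenting with: byte 1 of the logged CDB (0x0c) encodes protocol value 6 (DMA), while the ATA command set describes SMART READ DATA as a PIO Data-In command (protocol value 4). Below is a sketch of building the same CDB with the protocol as a parameter. The function and enum names are hypothetical, not from diskscan or its SCSI library, and whether the protocol value has anything to do with the abort on mpt3sas is untested.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Protocol values for byte 1 of ATA PASS-THROUGH (12), bits 4..1. */
enum ata_pt_protocol {
    ATA_PT_PIO_DATA_IN = 4,
    ATA_PT_DMA         = 6
};

/* Fills a 12-byte ATA PASS-THROUGH CDB for SMART READ DATA
 * (command 0xB0, feature 0xD0) with the given transfer protocol. */
void build_smart_read_data_cdb(uint8_t cdb[12], enum ata_pt_protocol protocol)
{
    assert(cdb != NULL);
    memset(cdb, 0, 12);
    cdb[0] = 0xA1;                      /* ATA PASS-THROUGH (12) opcode */
    cdb[1] = (uint8_t)(protocol << 1);  /* protocol sits in bits 4..1 */
    cdb[2] = 0x0E;                      /* T_DIR=1, BYTE_BLOCK=1, T_LENGTH=2 */
    cdb[3] = 0xD0;                      /* feature: SMART READ DATA */
    cdb[4] = 0x01;                      /* count: one 512-byte block */
    cdb[6] = 0x4F;                      /* LBA mid: SMART signature */
    cdb[7] = 0xC2;                      /* LBA high: SMART signature */
    cdb[9] = 0xB0;                      /* command: SMART */
}
```

With ATA_PT_DMA this reproduces the bytes from the log exactly; with ATA_PT_PIO_DATA_IN only byte 1 changes, from 0x0c to 0x08.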

ziegi avatar Jan 10 '24 10:01 ziegi

If the drives hit a timeout and do a reset, that's not something that should be skipped and ignored.

baruch avatar Feb 19 '24 16:02 baruch

Hi, I figured I'd try this software. I'm running an LSI SAS HBA and had been testing drives by doing reads and writes on them. diskscan immediately doesn't like the disk, but I can read/write fine from it using direct IO. It looks like diskscan is doing something "special" that the HBA doesn't like:

[410983.728031] sd 4:0:1:0: attempting task abort!scmd(0x0000000013012edb), outstanding for 61716 ms & timeout 60000 ms
[410983.728038] sd 4:0:1:0: [sdc] tag#9150 CDB: ATA command pass through(12)/Blank a1 0c 0e d0 01 00 4f c2 00 b0 00 00
[410983.728040] scsi target4:0:1: handle(0x000a), sas_address(0x4433221101000000), phy(1)
[410983.728043] scsi target4:0:1: enclosure logical id(0x54cd98f05e438500), slot(6) 
[410983.728045] scsi target4:0:1: enclosure level(0x0001), connector name(     )
[410983.783323] sd 4:0:1:0: task abort: SUCCESS scmd(0x0000000013012edb)
[410984.150460] mpt3sas_cm0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
[410984.411106] mpt3sas_cm0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
[410984.973536] sd 4:0:1:0: Power-on or device reset occurred

I don't think this is a disk problem in this particular case.

zougloub avatar Nov 04 '24 01:11 zougloub