ltfs LTO-9 READ_ATTRIBUTE and LOG_SENSE timeout values are too small

Describe the bug

Hello! We use LTFS 2.4.7.1 from GitHub (with Phobos) and IBM LTO-9 SAS drives. According to IBM, when a Read Attribute command is issued, the drive might generate an EOD and in our case, the drive was not able to set the write head position to the track (apparently a thing if the tape was calibrated using v21). The drive enters a recovery process of up to 10 x 12 seconds. It looks like in LTFS, the timeout for READ_ATTRIBUTE is hardcoded to 60s which is then too low in this case. IBM told us that "If the command was not aborted, the drive may be able to write the EOD to the tape."
This is why these timeout values should probably be set to either 120s or 132s (IBM did not exactly specify this yet, we're waiting to hear from them).

BTW, I tried to increase the value define here to 120 and recompiled LTFS but unfortunately I am still seeing occasional Read Attribute timeout occurrences still at 60s. I wonder why 60s is still used and what I missed... Any ideas?

[456431.184458] st 1:0:1:0: attempting task abort!scmd(0x000000008b10e4e8), outstanding for 62233 ms & timeout 60000 ms
[456431.184486] st 1:0:1:0: tag#4695 CDB: Read attributes 8c 00 00 00 00 00 00 00 00 09 00 00 00 0d 00 00

To Reproduce

Write files using LTFS 2.4 for example with Phobos (medium synchronization every 500 files / 16 GiB).

Expected behavior Longer READ_ATTRIBUTE timeout to allow the drive doing its special recovery when an EOD is needed.

Additional context OS is Rocky Linux 9.5

Apr 09 '25 18:04 thiell

This is a big surprise to me. Because I think LTFS never issues READ_ATTRIBUTE SCSI command while it is writing data.

Could you confirm the things below?

Try to access some data like tape UUID on LTFS via virtual extended attribute while writing a file
Another program accesses to LTFS via a st device while LTFS is mounting a tape

Finally, what is the log lines that you put on the issue description? I believe LTFS never use any st device at all. Why log system prints such kind of st device error?

[456431.184458] st 1:0:1:0: attempting task abort!scmd(0x000000008b10e4e8), outstanding for 62233 ms & timeout 60000 ms`
[456431.184486] st 1:0:1:0: tag#4695 CDB: Read attributes 8c 00 00 00 00 00 00 00 00 09 00 00 00 0d 00 00

Apr 10 '25 02:04 piste-jp

BTW, I tried to increase the value define here to 120 and recompiled LTFS but unfortunately I am still seeing occasional Read Attribute timeout occurrences still at 60s. I wonder why 60s is still used and what I missed... Any ideas?

Oops, I think because it is not a SCSI command which is issued from LTFS.

In my guess, an upper layer S/W, Phobos, might issue a READ_ATTRIBUTE SCSI command via the st device which is connected to the same tape drive. LTFS uses it via sg device.

LTFS never expects this kind of side channel access at all. So LTFS never guarantee the behavior at all.

Apr 10 '25 02:04 piste-jp

Hi @piste-jp and thanks for your response!

Sorry, here are more logs, we can see that LTFS does actually complain about READ_ATTR and LOGSENSE timing out, so it must issue those commands somehow. I think Phobos just uses statfs() while LTFS is writing to get statistics, which triggers those via LTFS. It's a good guess thought but I don't think Phobos issue the READ_ATTRIBUTE SCSI command directly.

Apr 02 22:11:29 elm-ent-dm01 kernel: st 1:0:1:0: attempting task abort!scmd(0x000000008b10e4e8), outstanding for 62233 ms & timeout 60000 ms
Apr 02 22:11:29 elm-ent-dm01 kernel: st 1:0:1:0: tag#4695 CDB: Read attributes 8c 00 00 00 00 00 00 00 00 09 00 00 00 0d 00 00
Apr 02 22:11:29 elm-ent-dm01 kernel: scsi target1:0:1: handle(0x0019), sas_address(0x52110090a50057fb), phy(5)
Apr 02 22:11:29 elm-ent-dm01 kernel: scsi target1:0:1: enclosure logical id(0x50012be000090dff), slot(42)
Apr 02 22:11:29 elm-ent-dm01 kernel: scsi target1:0:1: enclosure level(0x0001), connector name( C0  )
Apr 02 22:11:29 elm-ent-dm01 kernel: st 1:0:1:0: task abort: SUCCESS scmd(0x000000008b10e4e8)
Apr 02 22:11:29 elm-ent-dm01 ltfs[4997]: 139b LTFS30263I READ_ATTR returns Command TIMEOUT (-21102) /dev/sg2.
Apr 02 22:11:29 elm-ent-dm01 ltfs[4997]: 139b LTFS30262I Forcing drive dump.
Apr 02 22:11:29 elm-ent-dm01 ltfs[4997]: 139b LTFS30253I Saving drive dump to /tmp/ltfs_10110057FB_2025_0402_221129_f.dmp.
Apr 02 22:11:41 elm-ent-dm01 ltfs[4997]: 139b LTFS30233I Cannot read attribute (-21102).
Apr 02 22:11:41 elm-ent-dm01 ltfs[4997]: 139b LTFS12056W Cannot get Volume Change Reference parameter: read attribute failed (-21102).
Apr 02 22:11:41 elm-ent-dm01 ltfs[4997]: 139b LTFS17236I Wrote index of 060203 (Gen = 4070, Part = b, Pos = 35952070, 10110057FB).
Apr 02 22:11:41 elm-ent-dm01 ltfs[4997]: 139b LTFS11337I Update index-dirty flag (0) - 060203 (0x0x55a8526a50a0).
Apr 02 22:11:41 elm-ent-dm01 ltfs[4997]: 139b LTFS17070I Synced index of 060203 (0) 10110057FB.
...
Apr 02 22:12:42 elm-ent-dm01 kernel: st 1:0:1:0: attempting task abort!scmd(0x00000000437e2f2e), outstanding for 61195 ms & timeout 60000 ms
Apr 02 22:12:42 elm-ent-dm01 kernel: st 1:0:1:0: tag#3613 CDB: Log Sense 4d 00 57 00 00 00 00 ff ff 00
Apr 02 22:12:42 elm-ent-dm01 kernel: scsi target1:0:1: handle(0x0019), sas_address(0x52110090a50057fb), phy(5)
Apr 02 22:12:42 elm-ent-dm01 kernel: scsi target1:0:1: enclosure logical id(0x50012be000090dff), slot(42)
Apr 02 22:12:42 elm-ent-dm01 kernel: scsi target1:0:1: enclosure level(0x0001), connector name( C0  )
Apr 02 22:12:42 elm-ent-dm01 kernel: st 1:0:1:0: task abort: SUCCESS scmd(0x00000000437e2f2e)
Apr 02 22:12:42 elm-ent-dm01 ltfs[4997]: 1387 LTFS30263I LOGSENSE returns Command TIMEOUT (-21102) /dev/sg2.
Apr 02 22:12:42 elm-ent-dm01 ltfs[4997]: 1387 LTFS30262I Forcing drive dump.
Apr 02 22:12:43 elm-ent-dm01 ltfs[4997]: 1387 LTFS30253I Saving drive dump to /tmp/ltfs_10110057FB_2025_0402_221242_f.dmp.
Apr 02 22:12:54 elm-ent-dm01 ltfs[4997]: 1387 LTFS30229I Cannot get remaining capacity: get log page 0x17 failed (-21102).
Apr 02 22:12:54 elm-ent-dm01 ltfs[4997]: 1387 LTFS12030E Cannot get capacity data: backend call failed (-21102).
Apr 02 22:12:54 elm-ent-dm01 phobosd[3333]: 2025-04-02 22:12:54.967501000 <ERROR> statfs('/mnt/phobos-sg2') failed: Connection timed out (110)

Re: st I think it's just the tape driver that is shown here asst, at least this is like that on Rocky Linux 9.5. We use sg devices everywhere in Phobos, I'm pretty sure of that, for example, this is how ltfs is mounted:

ltfs:/dev/sg2 on /mnt/phobos-sg2 type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)

Apr 10 '25 03:04 thiell

And BTW, this is a rare occurrence, we see that maybe only once or twice per week when writing non-stop using 4 LTO-9 drives. Our tape vendor support analyzed the drive dumps with IBM and said the drive received the READ ATTRIBUTE command but did not have the time to respond within 60s due to issue finding the track on time, but with more time it could have worked.

I was personally surprised that a read attribute command could trigger some kind of write buffer flush (EOD), but apparently IBM says so.

Apr 10 '25 03:04 thiell

Could you attach entire log file here with compression? I need to see the context of errors.

Apr 10 '25 04:04 piste-jp

See this one for example that I already prepared for our support (file too big for github so I put it on Google Drive): elm-ent-dm02.fulllogs.txt.gz It was a previous occurrence with another drive/tape. Here the tape involved is 060029L9. Note that at that time we used a previous version of LTFS (2.4.6.2) and drive FW (R3G0 instead of S2S0 now) but it is the same behavior.

Key events:

Tape mounted:

Mar 17 12:03:29 elm-ent-dm02 ltfs[2200514]: 2193c2 LTFS11031I Volume mounted successfully. 060029 : Gen = 356 / (a, 5) -> (b, 11110319) / 10230057FB.

Last Writing index before timeout:

Mar 24 01:51:55 elm-ent-dm02 ltfs[2200524]: 2193ce LTFS17235I Writing index of 060029 to b (Reason: Sync by EA, 1057689 files) 10230057FB.

READ_ATTR timeout at 60s (I don't see "Wrote index" line before so the drive was writing I guess):

Mar 24 01:53:22 elm-ent-dm02 kernel: st 9:0:3:0: attempting task abort!scmd(0x000000002439203c), outstanding for 62035 ms & timeout 60000 ms
...
Mar 24 01:53:22 elm-ent-dm02 ltfs[2200524]: 2193ce LTFS30263I READ_ATTR returns Command TIMEOUT (-21102) /dev/sg5.

Often followed by LOGSENSE timeout at 60s (this doesn't always happen, but often):

Mar 24 01:54:35 elm-ent-dm02 kernel: st 9:0:3:0: attempting task abort!scmd(0x0000000082822b4f), outstanding for 61167 ms & timeout 60000 ms
...
Mar 24 01:54:35 elm-ent-dm02 ltfs[2200524]: 2193d9 LTFS30263I LOGSENSE returns Command TIMEOUT (-21102) /dev/sg5.

Really appreciate you looking into this, thank you.

Apr 10 '25 05:04 thiell

BTW, I tried to increase the value define here to 120 and recompiled LTFS but unfortunately I am still seeing occasional Read Attribute timeout occurrences still at 60s. I wonder why 60s is still used and what I missed... Any ideas?

Oops, I think because it is not a SCSI command which is issued from LTFS.

In my guess, an upper layer S/W, Phobos, might issue a READ_ATTRIBUTE SCSI command via the st device which is connected to the same tape drive. LTFS uses it via sg device.

LTFS never expects this kind of side channel access at all. So LTFS never guarantee the behavior at all.

Sorry, I totally forgot the REPORT SUPPORTED OPERATION CODES (RSOC) command support.

Now LTFS automatically fetches the timeout values from the report from the tape drive itself. The timeout tables are used only when LTFS failed to fetch the timeout value from the tape drive.

This logic is implemented into the call chain of sg_open() -> init_timeout_rsoc(), and LTFS fetches the recommend timeout values from the tape drive.

Please take a look the SCSI spec published from IBM if you have an interest.

https://www.ibm.com/support/pages/node/6490249

Apr 11 '25 00:04 piste-jp

@piste-jp super interesting!! Thank you very much! I guess the LTO-9 drive should likely report a higher timeout value then... I will let our tape support folks know about that!

Apr 11 '25 01:04 thiell

For timeout issue analysis.

The context might be

phobos issued sync operation to LTFS via VEA at Mar 24 01:51:55
LTFS returned good to phobos even if getting Volume Change Reference (VCR) parameter is failed because of READ_ATTR timeout
- VCR is optional for LTFS format spec, so LTFS ignore this error as designed
I guess, phobos issued statfs() call against LTFS, but LTFS returned an error because of another timeout against LOGSENSE
- LOGSENSE might be failed because the drive is in the middle of some thing recovery process that IBM said...
phobos treated statfs() error as a critical error

So the issue from user side is not READ_ATTR timeout, but LOGSENSE timeout.

READATTR timeout log

Mar 24 01:51:55 elm-ent-dm02 ltfs[2200524]: 2193ce LTFS11338I Syncing index of 060029 10230057FB.
Mar 24 01:51:55 elm-ent-dm02 ltfs[2200524]: 2193ce LTFS17068I Syncing index of 060029 (Reason: Sync by EA) 10230057FB.
Mar 24 01:51:55 elm-ent-dm02 ltfs[2200524]: 2193ce LTFS17235I Writing index of 060029 to b (Reason: Sync by EA, 1057689 files) 10230057FB.
Mar 24 01:53:22 elm-ent-dm02 ltfs[2200524]: 2193ce LTFS30263I READ_ATTR returns Command TIMEOUT (-21102) /dev/sg5.
Mar 24 01:53:22 elm-ent-dm02 ltfs[2200524]: 2193ce LTFS30262I Forcing drive dump.
Mar 24 01:53:22 elm-ent-dm02 ltfs[2200524]: 2193ce LTFS30253I Saving drive dump to /tmp/ltfs_10230057FB_2025_0324_015322_f.dmp.
Mar 24 01:53:34 elm-ent-dm02 ltfs[2200524]: 2193ce LTFS30233I Cannot read attribute (-21102).
Mar 24 01:53:34 elm-ent-dm02 ltfs[2200524]: 2193ce LTFS12056W Cannot get Volume Change Reference parameter: read attribute failed (-21102).
Mar 24 01:53:34 elm-ent-dm02 ltfs[2200524]: 2193ce LTFS17236I Wrote index of 060029 (Gen = 3217, Part = b, Pos = 23238752, 10230057FB).
Mar 24 01:53:34 elm-ent-dm02 ltfs[2200524]: 2193ce LTFS11337I Update index-dirty flag (0) - 060029 (0x0x55f0fc1d50d0).
Mar 24 01:53:34 elm-ent-dm02 ltfs[2200524]: 2193ce LTFS17070I Synced index of 060029 (0) 10230057FB.

It is little bit funny to me, because LTFS writes down the index in the sequence below.

Write a leading filemark (FM) with no buffer flush
Write an index (an index could be composed of 1 or more blocks)
Write a following FM with buffer flush
- I think the tape drive flushed all buffered data to the tape, so it is reasonable that the drive write down a EOD at the end of tape stream here!!
Fetch the VCR value with a READ_ATTR command
Write down the coherency information into auxiliary memory on tape cartridge with WRITE_ATTR command

Points are

You said 'IBM said READ_ATTR is timed out because of seeking for EOD write is too long'. But In my sense, EOD shall be written on step3
The SCSI spec saids READ_ATTR timeout value is 60 secs (both nominal and recommended), LTFS just follow the information provided from the tape drive itself

From the points above, I need to say this is an issue of the tape drive side. Please contact to IBM again with my analysis. (You can say this analysis was done by Atsushi Abe to IBM. I was a member of tape dev until 2024/06.)

LOGSENSE timeout log

Mar 24 01:54:35 elm-ent-dm02 ltfs[2200524]: 2193d9 LTFS30263I LOGSENSE returns Command TIMEOUT (-21102) /dev/sg5.
Mar 24 01:54:35 elm-ent-dm02 ltfs[2200524]: 2193d9 LTFS30262I Forcing drive dump.
Mar 24 01:54:36 elm-ent-dm02 ltfs[2200524]: 2193d9 LTFS30253I Saving drive dump to /tmp/ltfs_10230057FB_2025_0324_015435_f.dmp.
Mar 24 01:54:48 elm-ent-dm02 ltfs[2200524]: 2193d9 LTFS30229I Cannot get remaining capacity: get log page 0x17 failed (-21102).
Mar 24 01:54:48 elm-ent-dm02 ltfs[2200524]: 2193d9 LTFS12030E Cannot get capacity data: backend call failed (-21102).
Mar 24 01:54:48 elm-ent-dm02 ltfs[2200524]: 2193d9 LTFS11003E Cannot retrieve device capacity data (-21102).
Mar 24 01:54:48 elm-ent-dm02 ltfs[2200524]: 2193d9 LTFS11337I Update index-dirty flag (1) - 060029 (0x0x55f0fc1d50d0).

This might be a kind of side effect of previous READ_ATTR error. But you need to contact IBM to investigate the LOGSENSE timeout with /tmp/ltfs_10230057FB_2025_0324_015435_f.dmp.

To be more realistic, phobos can issue fsstat() a few times when it is failed. It might work as workaround.

Apr 11 '25 01:04 piste-jp

Hi @piste-jp

I got a response from IBM via our support channel. Looks like they are realizing something is wrong...

…Read Attribute command generally does not require tape motion. That's why the command timeout is 60 seconds because the drive can respond to the host almost immediately. Only exception is this "Volume Change Reference" attribute which responds to the host after writing EOD to the tape. This means that the command timeout of this attribute must be 60 seconds plus write filemarks command's timeout (1680 seconds). But there is no SCSI interface to allocate different timeout to a specific attribute.

In this case, the 60 seconds timeout expired while the drive was executing a couple of write retries due to unstable servo condition. But the time was still within the write filemarks command timeout so that the tape drive and tape media have no problem. That's why I mentioned that the customer does not need to replace the tape drive and the media.

The development team realized the issue and is now considering the solution…

They did not say anything about LOGSENSE though.

We have no ETA for a fix...

Also:

I've reached out to the Phobos developers to see if they could retry the statfs() call, at least in case of ETIMEDOUT error.
I have to ask: Would you consider retrying READ_ATTR at the LTFS level in that case? Or perhaps I could apply a local patch to do that? That could also temporarily solve our issue. If you have any suggestion let me know and I will work on it. Thank you.

Apr 16 '25 15:04 thiell

Hmm, I have a small concern about this response because it looks they think the point is how to specify a different timeout value on getting "Volume Change Reference" attribute" (VCR) via READ_ATTRIBUTE command...

…Read Attribute command generally does not require tape motion. That's why the command timeout is 60 seconds because the drive can respond to the host almost immediately. Only exception is this "Volume Change Reference" attribute which responds to the host after writing EOD to the tape. This means that the command timeout of this attribute must be 60 seconds plus write filemarks command's timeout (1680 seconds). But there is no SCSI interface to allocate different timeout to a specific attribute.

The real key to me is here. Because LTFS issued WRITE_FILEMARK before issuing the READ_ATTRIBUTE command to get VCR. The drive doesn't need to make any tape motion for getting VCR in this case. I would like to know why the drive tried to write down something on READ_ATTRIBUTE even if the previous command is WRITE_FILEMARK.

In this case, the 60 seconds timeout expired while the drive was executing a couple of write retries due to unstable servo condition. But the time was still within the write filemarks command timeout so that the tape drive and tape media have no problem. That's why I mentioned that the customer does not need to replace the tape drive and the media. The development team realized the issue and is now considering the solution…

Apr 17 '25 00:04 piste-jp

I have to ask: Would you consider retrying READ_ATTR at the LTFS level in that case? Or perhaps I could apply a local patch to do that? That could also temporarily solve our issue. If you have any suggestion let me know and I will work on it. Thank you.

Sure, but I would like to create a quick patch to override READ_ATTRIBUTE timeout to 1,680. I believe it is enough for you I think. (Th patch is only available for the Linux sg tape backend...)

Does my plan work for you?

Apr 17 '25 00:04 piste-jp

Here is a quick patch, the code base is the head of v2.4-stable branch, the commit hash is d8c223b.

This patch is not confirmed by any builds and tests (sorry I have only Mac env at this time...).

READATTR.patch

Apr 17 '25 01:04 piste-jp

@piste-jp Thank you, you are so helpful! I really appreciate it. I think it's a great solution for now: either READ_ATTR will return almost immediately (like most of the time) or we will hit this rare problem and the additional timeout should let the drive enough time to figure it out, hopefully. I like it. I will test the patch and keep you posted.

Apr 17 '25 01:04 thiell

@piste-jp just a quick update, after more than a week running with your patch (slightly adapted as we have FH drives) and archiving literally millions of files, I've not seen a single timeout error!

diff --git a/src/tape_drivers/linux/sg/sg_tape.c b/src/tape_drivers/linux/sg/sg_tape.c
index 19f2beb..828431d 100644
--- a/src/tape_drivers/linux/sg/sg_tape.c
+++ b/src/tape_drivers/linux/sg/sg_tape.c
@@ -3611,6 +3611,19 @@ int sg_read_attribute(void *device, const tape_partition_t part,
        if (timeout < 0)
                return -EDEV_UNSUPPORETD_COMMAND;
 
+       /*
+        * QUICK PATCH:
+        * Override the timeout value to the WRITE_FILEMARK's timeout
+        * when VCR attribute is requested and the drive is LTO9.
+        */
+       if (id == TC_MAM_PAGE_VCR
+               && IS_LTO(priv->drive_type)
+               && DRIVE_GEN(priv->drive_type) == 9) {
+               timeout = get_timeout(priv->timeouts, WRITE_FILEMARKS6);
+               if (timeout < 0)
+                       return -EDEV_UNSUPPORETD_COMMAND;
+       }
+
        /* Build request */
        req.dxfer_direction = SCSI_FROM_TARGET_TO_INITIATOR;
        req.cmd_len         = sizeof(cdb);

Apparently IBM is working on a fix, but I have no ETA. Until then, I will be using your workaround. Thanks again very much.

May 16 '25 22:05 thiell

ltfs ltfs copied to clipboard

LTO-9 READ_ATTRIBUTE and LOG_SENSE timeout values are too small

ltfs
ltfs copied to clipboard