hyperion Maximum load on qeth card crashes Linux qdio driver

Hi, I am pretty sure this is a Linux kernel bug and not a Hercules issue, but I wanted to run it by here first just to make sure that there weren't any mismatched assumptions that might be made between the emulated hardware Hercules is presenting and what the kernel is assuming.

Here is how I have my networking set up:

On the host system, physical interface enp1s0f1 (on an Intel x520-DA2 10G card). It's physically connected using a copper DAC.
Virtual L2 interface tap0 is pre-created (hercifc only attaches to it and is not responsible for creating or destroying it).
These two interfaces are bridged together into br0. None of the three interfaces have an IP address assigned. All of them have jumbo frames enabled (MTU 9000).

Hercules config:

ARCHLVL     z/Arch
CCKD RA=2,RAQ=4,RAT=2,WR=2,GCINT=5,GCPARM=0,NOSTRESS=0,TRACE=0,FREEPEND=-1
CNSLPORT 3270
CONKPALV (3,1,10)
CPUMODEL 3090
CPUSERIAL 012345
DIAG8CMD ENABLE
ECPSVM YES
LOADPARM 0A95DB..
LPARNAME HERCULES
MAINSIZE 10240
MOUNTED_TAPE_REINIT DISALLOW
MAXCPU 64
NUMCPU 32
OSTAILOR LINUX
SHCMDOPT NODIAG8
 
# .-----------------------Device number
# |     .-----------------Device type
# |     |       .---------File name and parameters
# |     |       |
# V     V       V
#---    ----    --------------------
 
# console
001F    3270
 
# terminal
0009    3215
 
# dasd (one disk) 
0120    3390    ./dasd/matoro-s390dev.dasd
 
# qeth (nic)
0A00.3    QETH    chpid F0  ifname tap0 mtu 9000

Inside the guest system, I have persistent interface names enabled, so this becomes enca00 when I bring it online. MTU also set to 9000 inside the guest.

The problem: When maxing out bandwidth (this is reproducible using iperf3 to a LAN host, or downloading something very fast) the kernel hits a WARN_ONCE and then the driver breaks (packets stop flowing). This is fixable by taking the interface down, destroying and recreating the channel grouping, then bringing the interface back up, so a reboot is not strictly required.

What the iperf3 output looks like:

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  30.5 MBytes   256 Mbits/sec    0    418 KBytes       
[  5]   1.00-2.03   sec  33.2 MBytes   271 Mbits/sec    0    837 KBytes       
[  5]   2.03-3.03   sec  37.4 MBytes   312 Mbits/sec    3   1.35 MBytes       
[  5]   3.03-4.00   sec  7.50 MBytes  64.9 Mbits/sec    2   8.72 KBytes       
[  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    1   8.72 KBytes       
[  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   8.72 KBytes       
[  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    1   8.72 KBytes       
[  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   8.72 KBytes       
[  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   8.72 KBytes

And the kernel WARN in question (plus the setup lines when bringing the interface up):

   169.077336! qeth: register layer 2 discipline
   169.108208! qeth 0.0.0a00: CHID: 0 CHPID: f0
   169.144656! qeth 0.0.0a02: qdio: OSA on SC 5 using AI:1 QEBSM:1 PRI:1 TDD:0 SIGA:RW
   169.176580! qeth 0.0.0a00: Device is a OSD Express card (level: HRC1)
               with link type OSD_100.
   169.178097! qeth 0.0.0a00: MAC address 06:ef:f6:29:29:0a successfully registered
   169.236750! qeth 0.0.0a00: MAC address e6:a9:40:9a:5c:70 successfully registered
   169.253112! qeth 0.0.0a00 enca00: renamed from eth0
   260.716674! ------------  cut here !------------
   260.716787! WARNING: CPU: 0 PID: 12 at drivers/s390/cio/qdio_main.c:184 qdio_do_sqbs+0x30c/0x318  qdio!
   260.717212! Modules linked in: qeth_l2 qeth qdio ccwgroup
   260.717748! CPU: 0 PID: 12 Comm: ksoftirqd/0 Not tainted 6.0.0-gentoo-s390x #1
   260.717966! Hardware name: HRC 3090 EMULATOR EMULATOR (LPAR)
   260.718071! Krnl PSW : 0704e00180000000 000003ff8001ce88 (qdio_do_sqbs+0x310/0x318  qdio!)
   260.718658!            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
   260.720563! Krnl GPRS: 0000000000000044 fffffffffffefffa 0000000000000000 fffffffffffefffa
   260.720942!            0000000000000000 0000000000000020 0000000000000041 0000000000000005
   260.723472!            0000000000000000 0000000000000020 0000000088053800 000000000000001b
   260.723670!            000000008035c000 0000000000000000 0000038000077a48 00000380000779c0
   260.724232! Krnl Code: 000003ff8001ce7a: e310a6180004        lg      %r1,1560(%r10)
                          000003ff8001ce80: a7f4ffad            brc     15,000003ff8001cdda
                         #000003ff8001ce84: af000000            mc      0,0
                         >000003ff8001ce88: b9f97029            srk     %r2,%r9,%r7
                          000003ff8001ce8c: a7f4ffbc            brc     15,000003ff8001ce04
                          000003ff8001ce90: eb6ff0480024        stmg    %r6,%r15,72(%r15)
                          000003ff8001ce96: b90400ef            lgr     %r14,%r15
                          000003ff8001ce9a: e3f0ff70ff71        lay     %r15,-144(%r15)
   260.727909! Call Trace:
   260.732367!   ! qdio_do_sqbs+0x310/0x318  qdio!
   260.734002!   ! qdio_add_bufs_to_input_queue+0x294/0x328  qdio!
   260.735289!   ! qeth_rx_refill_queue+0x19c/0x228  qeth!
   260.735683!   ! qeth_poll+0x11a/0xe80  qeth!
   260.736794!   ! __napi_poll+0x3c/0x1d0
   260.737047!   ! net_rx_action+0x19c/0x380
   260.737246!   ! __do_softirq+0x114/0x2b8
   260.737474!   ! run_ksoftirqd+0x36/0x48
   260.737663!   ! smpboot_thread_fn+0xd8/0x198
   260.738522!   ! kthread+0x108/0x118
   260.741187!   ! __ret_from_fork+0x36/0x50
   260.742989!   ! ret_from_fork+0xa/0x40
   260.745480! Last Breaking-Event-Address:
   260.745618!   ! qdio_do_sqbs+0x284/0x318  qdio!
   260.745883! ---  end trace 0000000000000000 !---

The kernel code in question is here: https://github.com/torvalds/linux/blob/master/drivers/s390/cio/qdio_main.c#L184

/**
 * qdio_do_sqbs - set buffer states for QEBSM
 * @q: queue to manipulate
 * @state: new state of the buffers
 * @start: first buffer number to change
 * @count: how many buffers to change
 *
 * Returns the number of successfully changed buffers.
 * Does retrying until the specified count of buffer states is set or an
 * error occurs.
 */
static int qdio_do_sqbs(struct qdio_q *q, unsigned char state, int start,
			int count)
{
	unsigned int ccq = 0;
	int tmp_count = count, tmp_start = start;
	int nr = q->nr;

	qperf_inc(q, sqbs);

	if (!q->is_input_q)
		nr += q->irq_ptr->nr_input_qs;
again:
	ccq = do_sqbs(q->irq_ptr->sch_token, state, nr, &tmp_start, &tmp_count);

	switch (ccq) {
	case 0:
	case 32:
		/* all done, or active buffer adapter-owned */
		WARN_ON_ONCE(tmp_count);
		return count - tmp_count;
	case 96:
		/* not all buffers processed */
		DBF_DEV_EVENT(DBF_INFO, q->irq_ptr, "SQBS again:%2d", ccq);
		qperf_inc(q, sqbs_partial);
		goto again;
	default:
		DBF_ERROR("%4x ccq:%3d", SCH_NO(q), ccq);
		DBF_ERROR("%4x SQBS ERROR", SCH_NO(q));
		DBF_ERROR("%3d%3d%2d", count, tmp_count, nr);
		q->handler(q->irq_ptr->cdev, QDIO_ERROR_SET_BUF_STATE, q->nr,
			   q->first_to_check, count, q->irq_ptr->int_parm);
		return 0;
	}
}

The only thing that stood out to me is that Hercules is presenting this device as only a 100 Mbit/s card (link type OSD_100), yet iperf3 is able to push three times that speed. I don't know if this might cause some sort of discrepancy.

I did attempt to troubleshoot this by enabling the debug option for the qeth adapter in my Hercules config. However, this caused the max speed iperf3 was able to achieve to drop down to about 20-30Mbps, which seemed to be slow enough that the problem did not occur. I wasn't sure if there was any way to keep the debug option without dumping every single network packet to the console.

If this needs to go upstream to Linux kernel maintainers, that's totally fine. Just thought I'd collect all the information I had so far in case you had any ideas. Thanks for your amazing work on this great tool.

Oct 12 '22 17:10 matoro

I suppose the question to ask is does this problem occur when you use standard frames rather than jumbo frames, i.e. using an MTU of 1500? I have no experience of jumbo frames with Hercules, and I have never heard of anyone using them before.

Unless you can demonstrate this problem occurs when Linux is driving a real hardware OSD I suspect the kernel maintainers, who in this case would be IBM, will have not the slightest interest.

It was the dumping of the packets that caused the speed to drop that avoided the problem. All debug does is dump packets, or other information, so you can't have one without the other.

Oct 12 '22 23:10 mcisho

You are right - I should have tested that ahead of time. I went ahead and tested it now, and setting a standard MTU of 1500 on everything does NOT fix the problem, but it does make it somewhat harder to hit. With jumbo frames I can immediately reproduce the issue in a single iperf3 run after about 4-5 seconds, but with normal frames it takes 5-6 runs (50-60 seconds of continuous transmission) in order to trigger. The maximum bitrate is also reduced to around 125Mbps. Here's a log from a failed iperf3 run with standard frames:

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.04   sec  10.6 MBytes  85.3 Mbits/sec    2    128 KBytes       
[  5]   1.04-2.05   sec  13.8 MBytes   115 Mbits/sec    6    160 KBytes       
[  5]   2.05-3.02   sec  15.0 MBytes   130 Mbits/sec    5    184 KBytes       
[  5]   3.02-4.00   sec  12.5 MBytes   107 Mbits/sec    0    205 KBytes       
[  5]   4.00-5.00   sec  10.0 MBytes  83.9 Mbits/sec    0    230 KBytes       
[  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    2   1.39 KBytes       
[  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    1   1.39 KBytes       
[  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   1.39 KBytes       
[  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    1   1.39 KBytes

It seems to trigger more reliably the higher the bandwidth being pushed.

Oct 13 '22 00:10 matoro

...and setting a standard MTU of 1500 on everything does NOT fix the problem, but it does make it somewhat harder to hit.

Did you also disable all "jumbo frame like" offloads (e.g. LRO, GRO, LSO, TSO, etc) on your host's physical adapter as well? (*)

Hercules cannot support frames larger than 1500 bytes (approximately). So even though your guest might be configured to use an MTU of only 1500 bytes, if your host isn't (i.e. if your host's physical/real adapter is still configured to use large (i.e. jumbo) frames and/or Large Send/Receive Offload), what oftentimes occurs during periods of high load is that the host packages up multiple 1500-byte frames together into one much larger frame and sends that instead, with the end result being that Hercules networking comes to a screeching halt.

Try disabling all offloads on your host's physical adapter and then try again. If that works, carefully enable each offload one by one and re-test, until you find the one that causes the problem. (I think it's usually LSO, but it's been a while, and as I said, I'm not a Linux person.)

(*) Normally I would suggest disabling it on your virtual interface (i.e. tuntap) adapter but I'm not sure if you can do that on Linux so it's easier to just suggest doing it on the physical adapter instead. (I'm not a Linux person and trying to locate any helpful detailed documentation on Linux tuntap functionality/use is an exercise in futility.)

Oct 13 '22 02:10 Fish-Git

@Fish-Git Thank you very much for the tip. Unfortunately it did not seem to work. Totally took all the jumbo frames out of the picture (host, guest, and Hercules config) and disabled all of the following offloads on the physical NIC:

rx off tx off sg off tso off ufo off gso off gro off lro off

After which here is what ethtool has to say:

Features for enp1s0f1:
rx-checksumming: off
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: off
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: on [fixed]
        tx-checksum-sctp: off
scatter-gather: off
        tx-scatter-gather: off
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
        tx-tcp-segmentation: off
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: off
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: on [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: on
tx-udp-segmentation: on
tx-gso-list: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off
hw-tc-offload: off
esp-hw-offload: on
esp-tx-csum-hw-offload: on
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]
rx-udp-gro-forwarding: off
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]

But this did not work: I got ~19 seconds' worth of max transmit before crashing again.

Linux doesn't represent hardware features in virtual interfaces, I checked that also. Thank you very much for looking into this.

Oct 13 '22 03:10 matoro

Actually I am finding now that reducing the MTU and turning off these features is making it EASIER to trigger the problem with normal network traffic. Now just accessing files from my NFS server is triggering it, whereas with 9000 MTU it only triggered with large downloads or synthetic loads.

Oct 13 '22 03:10 matoro

Hmm... remarkably easy to reproduce when the client and the server are both running on the same host (i.e. there is no physical network between the client and the server): my Hercules Linux guests usually abend within a second of the iperf3 client being started. The host system and the host system's network organisation seem to be irrelevant: the abend occurs in the Hercules Linux guest when Hercules is running on Linux (Fedora 36 x64) or Windows (11, virtual machine). The layer also seems to be irrelevant: the abend occurs when the Hercules Linux guest is using layer 2 or layer 3. Investigation continues!

Oct 13 '22 14:10 mcisho

Could it be an issue with the Linux tuntap driver software? Maybe it's not dealing with task offloads properly? Your ethtool display is still showing a lot of offloads enabled!

Oct 13 '22 18:10 Fish-Git

Hi Fish. No, the problem isn't on the tuntap side of Hercules, the problem is between Hercules and the qeth/qdio driver of the Linux guest. Whether the problem is caused by Hercules, or by the Linux qeth/qdio driver, or a combination of the two is unknown, and I currently have absolutely no idea how to go about producing diagnostics to determine what the problem is.

The major problem I have is that I don't understand how qdio buffers work. I was never involved in that side of things; it resulted from Harold's investigative work, and Jan's and then your coding.

Oct 13 '22 20:10 mcisho

I forgot to mention that, as I believed it was a buffer problem, I have a Hercules log with the output produced by a qeth debug on all queues sbale siga command. Someone who understands the output might spot a clue.

Oct 13 '22 20:10 mcisho

This issue is a QDIO problem with "QEBSM" instructions: SQBS (Set Queue Buffer State) and EQBS (Extract Queue Buffer State).

Based on the current zLinux source at:

https://github.com/torvalds/linux/blob/master/drivers/s390/cio/qdio.h
https://github.com/torvalds/linux/blob/master/drivers/s390/cio/qdio_main.c

both instructions should set a return code in bits 24-31 of the instruction's r3 register. The current implementation sets CC, but not the return code (which defaults to 0).

Under low workload, a return code of 0 is likely to be correct. Under high workload, however, SQBS (Set Queue Buffer State) tries to change the buffer state from OS owned to adapter owned for a number (count) of buffers, and it is likely that not all of the 'count' buffers are in the same state. In that case the SQBS instruction needs to set the return code to 96, indicating an incomplete operation, which causes the zLinux driver to retry the SQBS instruction until all the buffer states are updated.
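The retry mechanism described above can be sketched in plain C. This is only a model, not Hercules or kernel code: `mock_sqbs`, `set_states_with_retry`, the `slsb` array, and the state values are hypothetical stand-ins, and the mock assumes the instruction stops at the first buffer whose state differs from the first one it examined. The return codes 0 and 96 and the retry-on-96 shape are taken from the `qdio_do_sqbs()` switch quoted earlier in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define NBUFS          128  /* buffers per queue; matches the 0x7F index wrap */
#define RC_DONE          0  /* all requested buffers changed */
#define RC_INCOMPLETE   96  /* stopped early; caller is expected to retry */

/* Hypothetical mock of the emulated instruction. It changes buffers to
 * new_state starting at *start, but stops at the first buffer whose
 * current state differs from the first buffer it looked at. *start and
 * *count are advanced so that a retry resumes where this call stopped. */
static int mock_sqbs(uint8_t *slsb, uint8_t new_state, int *start, int *count)
{
    uint8_t first;

    assert(*start >= 0 && *start < NBUFS);
    if (*count == 0)
        return RC_DONE;

    first = slsb[*start];
    while (*count > 0 && slsb[*start] == first) {
        slsb[*start] = new_state;
        *start = (*start + 1) & (NBUFS - 1);   /* advance and wrap */
        (*count)--;
    }
    return *count ? RC_INCOMPLETE : RC_DONE;
}

/* Driver-shaped loop: reissue the instruction for as long as 96 comes
 * back. Returns the number of buffers changed; *calls reports how many
 * times the mock instruction was issued. */
int set_states_with_retry(uint8_t *slsb, uint8_t new_state,
                          int start, int count, int *calls)
{
    int tmp_start = start, tmp_count = count;
    int rc;

    *calls = 0;
    do {
        rc = mock_sqbs(slsb, new_state, &tmp_start, &tmp_count);
        (*calls)++;
    } while (rc == RC_INCOMPLETE);

    return count - tmp_count;
}
```

With five buffers in states 1, 1, 2, 2, 3, the model needs three calls to change all five. If the mock returned 0 instead of 96 after the first call, the caller would see a non-zero remaining count together with a "done" return code, which is the condition the `WARN_ON_ONCE(tmp_count)` in the driver checks for.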

The following QDIO patch is my proposed correction:

qdio.patch

I've tested this patch with zLinux Ubuntu 22.04.4 LTS:

doing an Ubuntu 22.04.4 live server installation with an FTP copy of the installation ISO (1.2GB)

iperf3 after installing Ubuntu 22.04.4 LTS. With a layer 2 tap interface, I measured iperf3 performance from Ubuntu to an iperf3 server on the same box. The results were:

 tn529@ubuntu2204:~$ iperf3 -c xxx.xx.xx.175
 Connecting to host xxx.xx.xx.175, port 5201
 [  5] local xxx.xx.xx.178 port 51694 connected to xxx.xx.xx.175 port 5201
 [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
 [  5]   0.00-1.01   sec  54.6 MBytes   454 Mbits/sec    0    325 KBytes
 [  5]   1.01-2.02   sec  60.0 MBytes   496 Mbits/sec    0    342 KBytes
 [  5]   2.02-3.01   sec  50.0 MBytes   424 Mbits/sec    0    342 KBytes
 [  5]   3.01-4.01   sec  47.5 MBytes   399 Mbits/sec    0    342 KBytes
 [  5]   4.01-5.01   sec  55.0 MBytes   459 Mbits/sec    0    342 KBytes
 [  5]   5.01-6.02   sec  40.8 MBytes   341 Mbits/sec    0    395 KBytes
 [  5]   6.02-7.00   sec  43.8 MBytes   372 Mbits/sec    0    417 KBytes
 [  5]   7.00-8.01   sec  43.8 MBytes   365 Mbits/sec    0    417 KBytes
 [  5]   8.01-9.03   sec  52.4 MBytes   431 Mbits/sec    0    720 KBytes
 [  5]   9.03-10.01  sec  43.8 MBytes   374 Mbits/sec    0    720 KBytes
 - - - - - - - - - - - - - - - - - - - - - - - - -
 [ ID] Interval           Transfer     Bitrate         Retr
 [  5]   0.00-10.01  sec   491 MBytes   412 Mbits/sec    0             sender
 [  5]   0.00-20.11  sec   491 MBytes   205 Mbits/sec                  receiver

 iperf Done.

As these instruction definitions are inferred from zLinux source, the patch requires broader testing across a number of environments including both qeth and zfcp usage.

I’m also a bit confused with qdio.c as both SQBS and EQBS have a comment:

   /* Condition code 3 if subchannel does not exist,
      is not valid, or is not enabled or is not a QDIO subchannel */

but then set:

regs->psw.cc = 2;

With an instruction return code, why would there be a second mechanism? With the CC comment/code discrepancy, I’ve reached the conclusion that these instructions don’t set the CC.

But, that's a discussion for a later date.

Jim

Aug 14 '24 23:08 JamesWekel

The following QDIO patch is my proposed correction:

Your change seems reasonable to me. I'll give it a try as soon as I can.

Aug 15 '24 18:08 Fish-Git

I’m also a bit confused with qdio.c as both SQBS and EQBS have a comment:
   /* Condition code 3 if subchannel does not exist,
      is not valid, or is not enabled or is not a QDIO subchannel */
but then set:

regs->psw.cc = 2;

That does look odd!

But after some digging it appears only the comment is bad. It should say "Condition code 2 if...", (not 3). Jan simply forgot to fix/update his original comment when he committed his fix.

With an instruction return code, why would there be a second mechanism?

Because it's pretty standard that most instructions set a condition code, especially I/O type instructions (which I would classify both SQBS (Set Queue Buffer State) and EQBS (Extract Queue Buffer State) to be; they are "QDIO" (Queued Direct I/O) instructions after all). Why is that so strange? It's faster to just check an instruction's condition code (which only takes a single branch on condition instruction after the I/O instruction) than it is to check its return code (which would require two instructions: a separate instruction to test the return code register followed by a separate branch on condition). Besides that, I would expect that the return code value might not even be valid if the instruction itself failed in some way, so relying solely on the return code seems riskier IMO.

With the CC comment/code discrepancy, I’ve reached the conclusion that these instructions don’t set the CC.

I seriously doubt that. I highly suspect they both set a condition code. Why else would Jan Jaeger make his change to correct his previous guess as to what the condition codes should be?

As evidence, refer to qdio.c commit 439bdbca929e2b0fc81b1777df82b0e696f45cdf followed by commit 5977e20ab257909db90601fc9fa53c22baa0480a, both made by Jan Jaeger (near the end of March 2012): in the first mentioned commit, he admitted (via his comment) that he was GUESSING as to the condition code (CC) that should be set in each situation, but in the second "fix" commit, it is clear that he was finally able to determine that his original guess was wrong, and so fixed them to be correct (i.e. his second commit to fix the wrong CCs was NOT a guess).

Refer to the attached ExamDif Pro visual source file comparison report:

diff_qdio-5977e20a.002_qdio-1648238d.002_Aug-15-2024 11-01.zip

And I trust Jan Jaeger as someone who had the knowhow/skills/wherewithal to be able to determine such things! It was Jan, after all, that coded our existing undocumented IBM instructions, such as the B220 "SERVC" (Service Processor Call) instruction in source file 'service.c': (*)

Instructions not listed in Principles of Operation or your yellow/pink/blue/white reference card

(*) I suspect he either had access to some internal IBM documentation, or else had friends at IBM or else was very skilled at reading/understanding open source zLinux source code or else was very skilled at doing hardware tracing on real hardware and deducing how such undocumented instruction(s) actually behaved or else ALL OF THE ABOVE. The guy was a freaking genius!

Aug 15 '24 19:08 Fish-Git

The following QDIO patch is my proposed correction:

Your change seems reasonable to me! I'll try to test it on z/OS just as soon as I can.

Aug 15 '24 20:08 Fish-Git

Fish,

Based on your historical review, I've updated the patch to fix the comment discrepancy, remove my FIXME comments, fix the CC when rc=97 for EQBS, and ensure a default rc of zero.

Updated patch: qdio_patch2.txt

I keep forgetting that there is a 13 year commit history. After reviewing a few commits, I was surprised to see commit https://github.com/SDL-Hercules-390/hyperion/commit/5977e20ab257909db90601fc9fa53c22baa0480a from Jan Jaeger which removed setting return code and just set the cc.

Thanks for testing the patch with z/OS.

Jim

Aug 16 '24 17:08 JamesWekel

Based on your historical review, I've updated the patch to fix the comment discrepancy, remove my FIXME comments, fix the CC when rc=97 for EQBS, and ensure a default rc of zero.

Updated patch: qdio_patch2.txt

Thanks. I'll remove your previous patch and apply your new one and retest. (My original testing of your original patch didn't work as well as I had hoped. It produced inconsistent/confusing results.) I'll let you know how it goes once I complete my new testing.

I keep forgetting that there is a 13 year commit history. After reviewing a few commits, I was surprised to see commit 5977e20 from Jan Jaeger which removed setting return code and just set the cc.

git blame is your friend. (*) :)

(*) I have no idea how useful it is as a command-line git command, but it has proven to be invaluable when used with a visual (GUI) Git tool such as TortoiseGit:

https://tortoisegit.org/about/screenshots/#TortoiseGitBlame

I love TortoiseGit !!

Aug 16 '24 22:08 Fish-Git

I keep forgetting to ask you about this too:

I question the following code sequences:

SQBS - Set Queue Buffer State

    /* set return code in bits 24-31 of r3 */
    if (count == 0 )                      regs->GR_HHLCL(r3) = 0;   /* no error                */
    else if (nextstate & SLSBE_OWNER_CU ) regs->GR_HHLCL(r3) = 32;  /* buffer owned by adapter */
    else if (count > 0)                   regs->GR_HHLCL(r3) = 96;  /* incomplete, try again   */
    else                                  regs->GR_HHLCL(r3) = 0;

EQBS - Extract Queue Buffer State

    /* set return code in bits 24-31 of r3 */
    if ( count == 0 )                    regs->GR_HHLCL(r3) = 0;   /* no error                    */
    else if ( count > 0 )                regs->GR_HHLCL(r3) = 96;  /* incomplete, try again       */
    else if ( state != nextstate )       regs->GR_HHLCL(r3) = 32;  /* next buffer state different */
    else                                 regs->GR_HHLCL(r3) = 0;

"count" is a U32 (i.e. unsigned), so it will ALWAYS be either "0" or "> 0" (i.e. zero or non-zero). So anything after the "else if (count > 0)" statement will NEVER BE REACHED!

This is SIGNIFICANT in the EQBS case: due to the sequence of tests, rc=32 will NEVER BE SET!

And in both cases, the final "else rc=0" is totally redundant.
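The unreachable branch can be demonstrated with a small standalone sketch. The function below copies the order of the EQBS tests quoted above but returns the code instead of writing it to r3; `eqbs_rc_as_posted` and `check_rc32_unreachable` are hypothetical names used only for this illustration, and the state values passed in are arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same order of tests as the posted EQBS sequence. Because count is
 * unsigned, "count == 0" and "count > 0" together cover every value,
 * so the last two branches can never be taken. */
int eqbs_rc_as_posted(uint32_t count, uint8_t state, uint8_t nextstate)
{
    if (count == 0)               return 0;   /* no error                    */
    else if (count > 0)           return 96;  /* incomplete, try again       */
    else if (state != nextstate)  return 32;  /* never reached               */
    else                          return 0;   /* never reached               */
}

/* Try a few counts with equal and unequal states. The assert documents
 * the claim that 32 is never produced. Returns how many combinations
 * were checked. */
int check_rc32_unreachable(void)
{
    const uint32_t counts[] = { 0u, 1u, 127u, 0xFFFFFFFFu };
    int checked = 0;

    for (size_t i = 0; i < sizeof counts / sizeof counts[0]; i++) {
        for (int differ = 0; differ <= 1; differ++) {
            uint8_t next = differ ? 0x42 : 0x41;
            int rc = eqbs_rc_as_posted(counts[i], 0x41, next);
            assert(rc != 32);
            checked++;
        }
    }
    return checked;
}
```

Calling `check_rc32_unreachable()` runs eight combinations without tripping the assert, which matches the analysis above: only 0 and 96 can come out of this ordering.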

Aug 16 '24 23:08 Fish-Git

Fish,

I completely agree with your analysis. The zLinux code (switch statement) only shows what return codes or a 'default' are checked. All we have is the executed code and very brief comments to infer the reason for the return code.

So for EQBS, the best case is count == 0, no error. But, how do you choose between rc = 96 (count > 0) or rc=32 ( state != nextstate) as in our code, both are true when 'count > 0' because state != nextstate? So, I've included both tests to show the order of how our instruction implementation selected the rc.

    /* set return code in bits 24-31 of r3 */
    if ( count == 0 )                    regs->GR_HHLCL(r3) = 0;   /* no error                    */
    else if ( count > 0 )                regs->GR_HHLCL(r3) = 96;  /* incomplete, try again       */
    else if ( state != nextstate )       regs->GR_HHLCL(r3) = 32;  /* next buffer state different */
    else                                 regs->GR_HHLCL(r3) = 0;

Yes, the else regs->GR_HHLCL(r3) = 0; is not required. Just part of my style to ensure that a default rc is set, if any of the if conditions/statements are changed. (It wasn't part of the first patch).

Jim

Aug 17 '24 01:08 JamesWekel

But, how do you choose between rc = 96 (count > 0) or rc=32 ( state != nextstate) as in our code, both are true when 'count > 0' because 'state != nextstate'?

Oh come on! You're joking, right?

    /* set return code in bits 24-31 of r3 */
    if (count == 0)
        regs->GR_HHLCL(r3) = 0;         /* complete; no error */
    else
    {
        if (nextstate == state)
            regs->GR_HHLCL(r3) = 96;    /* incomplete, try again */
        else
            regs->GR_HHLCL(r3) = 32;    /* next buffer state different */
    }

And I think your opening comments need to be updated too:

/*    rc = 32:      buffer error. A buffer owned by the adapter was encounterd      */
/*    rc = 96:      incomplete error. A buffer owned by the OS was encounterd       */
/*                  with a different state. Try the operation again.                */

but for EQBS (Extract Queue Buffer State), you have:

/*    rc = 32:      buffer error. Next buffer state was different                   */
/*    rc = 96:      incomplete error. A buffer was encounterd                       */
/*                  with a different state. Try the operation again.                */

p.s. "encounterd" is misspelled too.

Aug 17 '24 02:08 Fish-Git

Fish,

Oh come on! You're joking, right?

No.. the state check loop is

    state = nextstate = ARCH_DEP(wfetchb)((VADR)(slsba+bidx), USE_REAL_ADDR, regs);

    while(count && state == nextstate)
    {
        if (autoack)
            switch(nextstate) {
....
            }

        bidx++; bidx &= 0x7F;              /* Advance and wrap index */
        count--;

        if(count)
            nextstate = ARCH_DEP(wfetchb)((VADR)(slsba+bidx), USE_REAL_ADDR, regs);
    }

so to exit the loop with 'count > 0', don't the states have to be unequal (state != nextstate)? Maybe I'm just not seeing clearly today!
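The exit condition of that loop can be checked with a simplified model. This sketch is an assumption-laden stand-in, not the Hercules source: a plain `slsb` array replaces the `wfetchb` storage fetches, the autoack switch is left out, and `scan_states` and `struct scan_result` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

struct scan_result {
    uint32_t remaining;   /* value of count when the loop exits */
    uint8_t  state;       /* state of the first buffer examined */
    uint8_t  nextstate;   /* last state fetched by the loop     */
};

/* Simplified model of the state check loop: walk the ring while the
 * fetched state matches the first one and count is not exhausted. */
struct scan_result scan_states(const uint8_t slsb[128],
                               uint32_t bidx, uint32_t count)
{
    struct scan_result r;

    assert(bidx < 128);
    r.remaining = count;
    r.state = r.nextstate = slsb[bidx];

    while (r.remaining && r.state == r.nextstate) {
        bidx++; bidx &= 0x7F;          /* advance and wrap index */
        r.remaining--;
        if (r.remaining)
            r.nextstate = slsb[bidx];
    }
    return r;
}
```

In this model the loop can only stop for one of two reasons: `remaining` reached zero, or `state != nextstate`. So whenever it exits with `remaining > 0`, the two states differ, and a test for `nextstate == state` placed inside a `count != 0` branch would not be satisfied. For example, scanning ten buffers from index 0 over states 1, 1, 1, 2 stops with seven remaining, `state` 1 and `nextstate` 2.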

The EQBS and SQBS comments for rc=32 are based on the zLinux code. For the EQBS (qdio_do_eqbs) routine: next buffer state different

	case 0:
	case 32:
		/* all done, or next buffer state different */
		return count - tmp_count;
	case 96:
		/* not all buffers processed */
		qperf_inc(q, eqbs_partial);
		DBF_DEV_EVENT(DBF_INFO, q->irq_ptr, "EQBS part:%02x",
			tmp_count);
		return count - tmp_count;
	case 97:
		/* no buffer processed */
		DBF_DEV_EVENT(DBF_WARN, q->irq_ptr, "EQBS again:%2d", ccq);
		goto again;
	default:
		DBF_ERROR("%4x ccq:%3d", SCH_NO(q), ccq);
		DBF_ERROR("%4x EQBS ERROR", SCH_NO(q));
		DBF_ERROR("%3d%3d%2d", count, tmp_count, nr);
		q->handler(q->irq_ptr->cdev, QDIO_ERROR_GET_BUF_STATE, q->nr,
			   q->first_to_check, count, q->irq_ptr->int_parm);
		return 0;
	}

For SQBS (qdio_do_sqbs) routine: active buffer adapter-owned

	case 0:
	case 32:
		/* all done, or active buffer adapter-owned */
		WARN_ON_ONCE(tmp_count);
		return count - tmp_count;
	case 96:
		/* not all buffers processed */
		DBF_DEV_EVENT(DBF_INFO, q->irq_ptr, "SQBS again:%2d", ccq);
		qperf_inc(q, sqbs_partial);
		goto again;
	default:
		DBF_ERROR("%4x ccq:%3d", SCH_NO(q), ccq);
		DBF_ERROR("%4x SQBS ERROR", SCH_NO(q));
		DBF_ERROR("%3d%3d%2d", count, tmp_count, nr);
		q->handler(q->irq_ptr->cdev, QDIO_ERROR_SET_BUF_STATE, q->nr,
			   q->first_to_check, count, q->irq_ptr->int_parm);
		return 0;

I keep going back to the zLinux code, trying to determine what the QDIO specs are. RC 32 seems to be an error when exiting with 'count > 0'. Because the SQBS error is more specific to 'adapter owned', I gave rc 32 priority over rc 96 (incomplete, not all buffers processed). With our implementation of EQBS, I gave rc 96 (incomplete, not all buffers processed) priority, as "next buffer state different" is how we determine that it is incomplete (count > 0).

But this discussion has caused me to realize that I need to remove rc 97 from EQBS. The current rc 97 check is 'count==0' on entry, so we will never process any buffers. But the zLinux driver just retries ... resulting in an infinite loop in the driver!
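A minimal sketch of why that combination cannot terminate, under stated assumptions: `mock_eqbs` is a hypothetical stand-in that returns 97 whenever it is entered with a count of zero (the check described above), and the caller retries on 97 the way the quoted `qdio_do_eqbs()` switch does with its `goto again`. The `max_retries` limit is not part of the quoted driver code; it exists only so the demonstration ends.

```c
#include <assert.h>

#define RC_OK               0
#define RC_NONE_PROCESSED  97   /* "no buffer processed" in qdio_do_eqbs() */

/* Hypothetical mock: reports 97 when entered with count == 0, otherwise
 * pretends every requested buffer was extracted. */
static int mock_eqbs(int *count)
{
    if (*count == 0)
        return RC_NONE_PROCESSED;
    *count = 0;
    return RC_OK;
}

/* Driver-shaped loop: reissue the instruction for as long as 97 comes
 * back. Returns the number of calls made, or -1 if the artificial
 * retry limit was reached first. */
int eqbs_calls_until_done(int count, int max_retries)
{
    int calls = 0;
    int rc;

    assert(max_retries > 0);
    do {
        rc = mock_eqbs(&count);
        calls++;
    } while (rc == RC_NONE_PROCESSED && calls < max_retries);

    return rc == RC_NONE_PROCESSED ? -1 : calls;
}
```

Since nothing changes between retries when the count is already zero, every retry sees the same input and gets the same 97. `eqbs_calls_until_done(0, 1000)` hits the limit and returns -1, while a non-zero count finishes in one call.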

What's the next step?

Jim

Aug 17 '24 04:08 JamesWekel

Fish,

After some more review of the zLinux code for the EQBS usage and next buffer state different comment, the routine 'get_buf_states' (which calls 'qdio_do_eqbs' ( EQBS)) has non-EQBS code:

	if (is_qebsm(q))
		return qdio_do_eqbs(q, state, bufnr, count, auto_ack);

	/* get initial state: */
	__state = q->slsb.val[bufnr];

	/* Bail out early if there is no work on the queue: */
	if (__state & SLSB_OWNER_CU)
		goto out;

So, EQBS rc 32 is likely related to active buffer adapter-owned state, but just for the first buffer state. You are correct, SQBS and EQBS rc 32 are control unit buffer owned states.

Jim

Aug 17 '24 17:08 JamesWekel

Help!

Just for the record, the following documents were created by one of our developers (Harold Grovesteen @s390guy) many years ago during our initial development effort of providing QETH (OSA) device support (sometime around June 2010 from the looks of it):

I tried cloning torvalds's linux.git repository (to examine not only the current version of qdio_main.c but to also review all of the commits (changes) that have been made to it since June 2010), but unfortunately I got an error during my clone attempt:

git.exe clone --progress -v "https://github.com/torvalds/linux.git" "C:\Users\Fish\Documents\Visual Studio 2008\Projects\Hercules\_GIT\linux"
Cloning into 'C:\Users\Fish\Documents\Visual Studio 2008\Projects\Hercules\_GIT\linux'...
POST git-upload-pack (185 bytes)
POST git-upload-pack (gzip 42912 to 21609 bytes)
remote: Enumerating objects: 10345755, done.
remote: Counting objects: 100% (652/652), done.
remote: Compressing objects: 100% (318/318), done.
remote: Total 10345755 (delta 457), reused 406 (delta 334), pack-reused 10345103 (from 1)
Receiving objects: 100% (10345755/10345755), 4.91 GiB | 26.67 MiB/s, done.
Resolving deltas: 100% (8440205/8440205), done.
error: invalid path 'drivers/gpu/drm/nouveau/nvkm/subdev/i2c/aux.c'
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'

git did not exit cleanly (exit code 128) (503275 ms @ 8/17/2024 10:26:17 AM)

and my cloned repository directory is completely empty! (except for the hidden .git git control directory).

My goal was to try and see what has changed with respect to how Harold's document says they (SQBS and EQBS) behave. For example, on page 39 of Harold's QDIO document, he documents:

Return Code Values (bits 0-31 of operand three's even register):

* 0x00, 0 - all buffer states successfully processed.
* 0x20, 32 - all buffer states successfully processed and next buffer state different
* 0x60, 96 - not all buffers processed
* 0x61, 97 - not all buffers processed (The meaning of the low order bit being set is not able to be determined from the Linux code.)
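For reference while reading the code, the documented values can be written down as a small C table. This is only a sketch of the list quoted above: the enum and function names are made up for illustration, and `qbs_buffers_done` just mirrors the `count - tmp_count` arithmetic from the Linux snippet earlier in the thread.

```c
#include <assert.h>

/* Return codes as listed in Harold's document (names are hypothetical). */
enum qbs_rc {
    QBS_RC_ALL_DONE        = 0x00, /* all buffer states processed          */
    QBS_RC_NEXT_DIFFERENT  = 0x20, /* all processed, next state different  */
    QBS_RC_PARTIAL         = 0x60, /* not all buffers processed            */
    QBS_RC_PARTIAL_LOW_BIT = 0x61  /* not all processed, low bit unclear   */
};

/* True for the two "not all buffers processed" codes. */
static int qbs_rc_is_incomplete(int rc)
{
    return rc == QBS_RC_PARTIAL || rc == QBS_RC_PARTIAL_LOW_BIT;
}

/* Human-readable text for a return code. */
static const char *qbs_rc_text(int rc)
{
    switch (rc) {
    case QBS_RC_ALL_DONE:        return "all buffer states processed";
    case QBS_RC_NEXT_DIFFERENT:  return "all buffer states processed, next buffer state different";
    case QBS_RC_PARTIAL:         return "not all buffers processed";
    case QBS_RC_PARTIAL_LOW_BIT: return "not all buffers processed (low order bit set)";
    default:                     return "undocumented return code";
    }
}

/* Buffers actually handled, given the requested and the remaining count. */
static unsigned qbs_buffers_done(unsigned requested, unsigned remaining)
{
    assert(remaining <= requested);
    return requested - remaining;
}
```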

And I wanted to try and verify that. But as I said, my git clone failed. :(

~Does anyone know why it failed~ or how to actually clone his repository? I'm assuming it's the official Linux repository. Yes?

What bothers me about our current code (implementation) is how we purposely skip fetching the state of the next buffer when our count value is decremented to zero. IMO we should not be doing this. We should always fetch the state of the next buffer, regardless of the count value. That way we can distinguish between return code 0 and 32.

I'd also like to verify rc=97 too of course.

Here's what I'm thinking maybe we should be doing:

SQBS:

    oldstate = nextstate = ARCH_DEP(wfetchb)((VADR)(slsba+bidx), USE_REAL_ADDR, regs);

    while (count && nextstate == oldstate)
    {
        ARCH_DEP(wstoreb)(state,(VADR)(slsba+bidx), USE_REAL_ADDR, regs);
        bidx++; bidx &= 0x7F;           /* Advance and wrap index */
        nextstate = ARCH_DEP(wfetchb)((VADR)(slsba+bidx), USE_REAL_ADDR, regs);
        count--;
    }

    ...

    /* set return code in bits 24-31 of r3 */
    if (count > 0)                      /* incomplete */
    {
        regs->GR_HHLCL(r3) = 96;        /* incomplete; try again */
    }
    else /* (count == 0) */             /* all buffers processed */
    {
        if (nextstate == oldstate)
            regs->GR_HHLCL(r3) = 0;     /* all buffers processed */
        else
            regs->GR_HHLCL(r3) = 32;    /* all processed but next
                                           buffer state different */
    }
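To experiment with the SQBS logic above outside of Hercules, here is a stand-alone sketch that runs the same loop against a plain 128-byte array instead of guest storage. The names `sqbs_proposed` and `slsb_fill` are hypothetical, the `wfetchb`/`wstoreb` calls are replaced by array accesses, and the return value stands in for what would be stored in r3. It is a model of the proposal, not the Hercules implementation.

```c
#include <assert.h>
#include <stdint.h>

#define SLSB_ENTRIES 128            /* index wraps with & 0x7F, as above */

/* Helper: set every entry of the model SLSB to one state value. */
static void slsb_fill(uint8_t slsb[SLSB_ENTRIES], uint8_t value)
{
    for (unsigned i = 0; i < SLSB_ENTRIES; i++)
        slsb[i] = value;
}

/* Model of the proposed SQBS loop. Updates *bidx and *count in place
 * and returns the modelled return code (0, 32 or 96). */
static int sqbs_proposed(uint8_t slsb[SLSB_ENTRIES], unsigned *bidx,
                         unsigned *count, uint8_t state)
{
    uint8_t oldstate, nextstate;

    assert(*bidx < SLSB_ENTRIES);
    oldstate = nextstate = slsb[*bidx];

    while (*count && nextstate == oldstate) {
        slsb[*bidx] = state;                /* set this buffer's state   */
        *bidx = (*bidx + 1) & 0x7F;         /* advance and wrap index    */
        nextstate = slsb[*bidx];            /* always fetch next state   */
        (*count)--;
    }

    if (*count > 0)
        return 96;                          /* incomplete; try again     */
    return nextstate == oldstate ? 0 : 32;  /* done / next state differs */
}
```

For example, with entries 0 to 3 in one state and entry 4 in another, a request for 3 buffers returns 0, a request for 4 returns 32, and a request for 6 stops at index 4 with a count of 2 remaining and returns 96.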

EQBS:

    state = nextstate = ARCH_DEP(wfetchb)((VADR)(slsba+bidx), USE_REAL_ADDR, regs);

    while (count && state == nextstate)
    {
        if (autoack)
        {
            switch (nextstate)
            {
#if 0 // FIXME: why is this disabled?
                case SLSBE_INPUT_COMPLETED:
                    ARCH_DEP(wstoreb)
                        (SLSBE_INPUT_ACKED, (VADR)(slsba+bidx), USE_REAL_ADDR, regs);
                    break;
                case SLSBE_OUTPUT_COMPLETED:
                    ARCH_DEP(wstoreb)
                        (SLSBE_OUTPUT_PRIMED, (VADR)(slsba+bidx), USE_REAL_ADDR, regs);
                    break;
#endif
            }
        }

        bidx++; bidx &= 0x7F;              /* Advance and wrap index */
        nextstate = ARCH_DEP(wfetchb)((VADR)(slsba+bidx), USE_REAL_ADDR, regs);
        count--;
    }

    ...

    /* set return code in bits 24-31 of r3 */
    if (count > 0)                      /* incomplete; try again */
    {
        regs->GR_HHLCL(r3) = 96;        /* incomplete; try again */
    }
    else /* (count == 0) */             /* all buffers processed */
    {
        if (nextstate == state)
            regs->GR_HHLCL(r3) = 0;     /* all buffers processed */
        else
            regs->GR_HHLCL(r3) = 32;    /* all processed but next
                                           buffer state different */
    }
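The EQBS variant can be modelled the same way. This sketch assumes the auto-ack stores stay disabled (as in the `#if 0` block above), so the scan is read-only and just reports the extracted state. `eqbs_proposed` is a hypothetical name, and a plain array again stands in for guest storage.

```c
#include <assert.h>
#include <stdint.h>

#define SLSB_ENTRIES 128            /* index wraps with & 0x7F, as above */

/* Model of the proposed EQBS loop with auto-ack disabled (read-only).
 * Stores the extracted state in *state, updates *bidx and *count in
 * place, and returns the modelled return code (0, 32 or 96). */
static int eqbs_proposed(const uint8_t slsb[SLSB_ENTRIES], unsigned *bidx,
                         unsigned *count, uint8_t *state)
{
    uint8_t nextstate;

    assert(*bidx < SLSB_ENTRIES);
    *state = nextstate = slsb[*bidx];

    while (*count && nextstate == *state) {
        *bidx = (*bidx + 1) & 0x7F;         /* advance and wrap index    */
        nextstate = slsb[*bidx];            /* always fetch next state   */
        (*count)--;
    }

    if (*count > 0)
        return 96;                          /* incomplete; try again     */
    return nextstate == *state ? 0 : 32;    /* done / next state differs */
}
```

Because the next state is fetched even when the count reaches zero, a caller of this model can tell "all requested buffers share this state and so does the next one" (0) from "the run of equal states ends exactly here" (32).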

But again, I'd like to examine the current Linux code first. It looks like James got his qdio.h and qdio_main.c from torvalds's linux repository, but when was the last time it was refreshed? (updated?) And how did you manage to clone it? On Linux? ~Maybe my Windows version of git has a bug that's preventing me from cloning it on Windows?~

Help! :(

Aug 17 '24 18:08 Fish-Git

Does anyone know why it failed...

...
error: invalid path 'drivers/gpu/drm/nouveau/nvkm/subdev/i2c/aux.c'
...

(FUCK!) I keep forgetting that Windows has some "reserved names" that you aren't allowed to use:

https://www.google.com/search?q=Windows+reserved+file+names

"AUX" being one of them. :(

How the fuck am I supposed to clone this damn repository?! :(

Aug 17 '24 18:08 Fish-Git

Does anyone know why it failed...

... error: invalid path 'drivers/gpu/drm/nouveau/nvkm/subdev/i2c/aux.c' ...

(FUCK!) I keep forgetting that Windows has some "reserved names" that you aren't allowed to use:
* https://www.google.com/search?q=Windows+reserved+file+names
"AUX" being one of them. :(

How the fuck am I supposed to clone this damn repository?! :(

Hi Fish, there are a couple options for Windows:

* git under WSL2
* try git config --global core.protectNTFS false from here
* do a sparse checkout as described here

Aug 17 '24 18:08 matoro

Fish,

Are we going down a rat hole... To fix the immediate 'Maximum load on qeth card crashes Linux qdio driver' problem, only SQBS needs to be modified to return rc 0 when count==0, and rc 96 when count>0.

With the immediate problem resolved, further investigation into when/why rc=32 is returned can continue for both EQBS and SQBS.
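As a sketch of that minimal rule, and of why it avoids the warning, consider the following. The function names are hypothetical; `would_trip_warn` only restates the `case 0: case 32: WARN_ON_ONCE(tmp_count)` branch from the Linux snippet quoted earlier in this thread, where `tmp_count` is the count left over after the instruction.

```c
#include <assert.h>

/* Minimal SQBS rule: rc 0 when nothing remains, rc 96 otherwise. */
static int sqbs_rc_minimal(unsigned requested, unsigned remaining)
{
    assert(remaining <= requested);
    return remaining == 0 ? 0 : 96;
}

/* Model of the driver-side check: rc 0 or 32 combined with a
 * non-zero remaining count is what triggers the WARN_ON_ONCE. */
static int would_trip_warn(int rc, unsigned remaining)
{
    return (rc == 0 || rc == 32) && remaining != 0;
}
```

Under this rule a partially completed request always reports 96, so the driver takes its "try again" branch instead of the branch that warns on a non-zero remaining count.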

Jim

Aug 17 '24 18:08 JamesWekel

Thank you all for an interesting trip down memory lane including 9 months of $#$# I went through reverse engineering the Linux driver.

EQBS and SQBS were always somewhat of a mystery. The final version I think was developed by Jan.

I would say that Fish's suggestion has some merit.

Aug 17 '24 19:08 s390guy

Fish wrote:

How the fuck am I supposed to clone this damn repository?!

matoro wrote:

Hi Fish, there are a couple options for Windows:

* git under WSL2
* try git config --global core.protectNTFS false from here
* do a sparse checkout as described here

Thanks, matoro. Unfortunately however, I did something wrong (didn't do something correctly) because it didn't work for me, and I'm too frustrated by all of this to waste any more of my time on it. Besides, it's neither here nor there anyway, since the only files I was really interested in were qdio.h and qdio_main.c, so I just downloaded those two specific files directly from the GitHub repository.

And for the record, they do indeed exactly match the ones that James provided.

The only reason I was wanting to clone the repository was so I could more easily review the commit history (commit log entries) for those two files to see what, if any, significant changes were made to them to hopefully:

* Get a better understanding of how the two instructions are supposed to work, and:
* To verify Hercules's code isn't missing an important fix. (i.e. to try and determine whether our current implementation might be flawed (contain a bug), which I suspect it probably does.)

I did notice that GitHub does have a commit log history web user interface, but it's not exactly user friendly, whereas the one provided by TortoiseGit is (which is one of the primary reasons I was trying to clone it in the first place: not so much for the code, but mostly for easy access to the git commit log history).

Oh well. It's no big deal in the grand scheme of things. I'll just have to manually study the Linux source code a little closer to see if I can eventually figure it out, and then more than likely do some experimenting.

Thanks anyway, matoro. Much appreciated.

Aug 18 '24 06:08 Fish-Git

Are we going down a rat hole...

"Rat hole"? Or "rabbit hole"?

To fix the immediate 'Maximum load on qeth card crashes Linux qdio driver' problem, only SQBS needs to be modified to return rc 0 when count==0, and rc 96 when count>0.

Yes, you're right. To be honest, I haven't tried to reproduce your problem and/or verify your fix on Linux. So far, all I've been concerned with was verifying it didn't cause any problems on Windows (i.e. didn't make matters worse for Hercules on Windows).

With the immediate problem resolved, further investigation into when/why rc=32 is returned can continue for both EQBS and SQBS.

Yes, you're right again. Sorry. I'll get back to my testing. I apologize for the distraction.

But as I said, it honestly looks like Hercules's code (implementation) might be wrong, which led me to wonder "What important changes/fixes have been made to those two files over the past years since Hercules's original QETH/OSA device support was first introduced?", which of course led me down the clone Linux repository path so I could review their git commit log history. You can usually learn a lot by reviewing how a given piece of code evolves (changes) over time!

Anyway... back to my testing.

Aug 18 '24 06:08 Fish-Git

"I haven't tried to reproduce your problem and/or verify your fix on Linux."

Am I to understand a change to Linux is part of getting this working? Is the change in the QDIO driver modules or in the Hercules host Linux (not QDIO)? Either is a "temporary work around" in my view. Neither is a "fix". To expect a user to apply any change to Linux is not the desirable end state and we are not going to get any change to the QDIO drivers.

Hercules has complete control over what the guest sees as QDIO activity. So ultimately we are looking at a potential incompatibility between the real OSA adapter (for which Linux is written and not necessarily bug free) and the Hercules OSA adapter seemingly introduced when traffic is "high" (whatever that means). This puts the real problem somewhere between the TUNTAP interface from which Hercules receives the traffic from the host and the Hercules implementation of QDIO buffer handling.

Buffers are expected to be filled in sequence by either the adapter or the host depending upon the direction of the traffic and the buffer ownership.

To me it appears that something is causing this process to be "corrupted". Empty buffers for dropped data? The buffer handling skipping buffers?

I realize this is not going to be readily fixed.

But, just trying to better understand the situation from my own experience a decade ago.

Fish is right in trying to understand how the modules have changed over time. Code outside of the actual use of EQBS and SQBS themselves may be the cause. Understanding those changes may be helpful in understanding why the problem is occurring. It is of course always possible that this "bug" has always been present but past timing caused it to be hidden.

Just my thoughts, sorry to be a critic, and somewhat late to this "party". Harold Grovesteen

Aug 18 '24 13:08 s390guy

Am I to understand a change to Linux is part of getting this working?

No, there is no change to Linux. The change is to Hercules' qdio.c only.

Aug 18 '24 13:08 mcisho

...when traffic is "high" (whatever that means).

Linux has a tool named iperf, which is used to measure the throughput of network adaptors. When iperf was used from the Hercules host to the Hercules guest Linux, the QDIO code in the guest Linux would abend. Jim investigated and created a patch for Hercules' qdio.c which prevented the abend. iperf reports a throughput greater than 400 Mbits/sec over a Hercules QETH interface, and greater than 100 Mbits/sec over a Hercules CTCI interface. 400 Mbits/sec is what "high" means in this case.

Aug 18 '24 14:08 mcisho