Slow networking

Open RaymiiOrg opened this issue 3 years ago • 44 comments

I saw the error below, together with unstable / slow networking, while testing DECwindows via X forwarding.

On a debian machine in the same network (no firewalls in between), started X server:

Xephyr -screen 1024x786 -ac -query 0.0.0.0 :1

Set up the remote display in OpenVMS:

set display/create/node=10.0.2.15/transport=tcpip/server=1

(10.0.2.15 is the Debian VM, 1 is the X display, i.e. :1)

Started an application on OpenVMS:

RUN DECW$EXAMPLES:ICO.EXE 

Or multiple, mail & file manager:

SPAWN/NOWAIT/INPUT=NL: RUN SYS$SYSTEM:DECW$MAIL.EXE
SPAWN/NOWAIT/INPUT=NL: RUN SYS$SYSTEM:VUE$MASTER

[image]

Most of the time it works quickly:

[image]

The ICO program moves smoothly, but I cannot show that in a screenshot:

[image]

But after a few minutes it became quite slow, even crashing:

[image]

[image]

This was the X server's output:

[image]

On the AXPbox command line window:


CPacketQueue(rx_queue):add() packet lost! Size = 4314.. dst: 08-00-de-ad-be-ef .. src: 52-54-00-12-35-02
CPacketQueue(rx_queue):add() packet lost! Size = 2894.. dst: 08-00-de-ad-be-ef .. src: 52-54-00-12-35-02
CPacketQueue(rx_queue):add() packet lost! Size = 2894.. dst: 08-00-de-ad-be-ef .. src: 52-54-00-12-35-02
CPacketQueue(rx_queue):add() packet lost! Size = 2894.. dst: 08-00-de-ad-be-ef .. src: 52-54-00-12-35-02
CPacketQueue(rx_queue):add() packet lost! Size = 2894.. dst: 08-00-de-ad-be-ef .. src: 52-54-00-12-35-02
CPacketQueue(rx_queue):add() packet lost! Size = 2894.. dst: 08-00-de-ad-be-ef .. src: 52-54-00-12-35-02

RaymiiOrg avatar Nov 09 '20 15:11 RaymiiOrg

Wow, you got quite far with DECwindows - my attempts always ended up crashing OpenVMS.

About the issue - I don't see much of a chance of progressing with component bugs before they can be isolated in unit tests, which requires refactoring the code (especially getting rid of global variable use in classes), so I'll try to get that started.

lenticularis39 avatar Nov 09 '20 15:11 lenticularis39

One semi-random thought: As a workaround for #24, which could be the root cause of this issue, you can try setting the delay (sleep) in CDEC21143::run() to a lower value.

lenticularis39 avatar Nov 09 '20 20:11 lenticularis39

Yesterday evening I tried a few things:

  • A lower delay (every value from 10 down to 1) does not make the network faster, nor does the error go away.
  • The packet size is checked in Ethernet.cpp::add_tail (it can't be more than 1514), and it seems the packet is simply too large. I tried to change a few OpenVMS parameters related to MTU, without success. That Ethernet code also checks a config parameter (queue); I set it to 1024 in es40.cfg, which didn't help (I didn't expect it to, since the size is too large).
  • What made the CPacketQueue packet lost error go away was changing the network adapter in VirtualBox from a gigabit model to a 100 Mbit model.
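
The size check from the second bullet can be pictured roughly as follows. This is a minimal sketch and not the actual AXPbox code: `SketchPacketQueue`, `kMaxFrameBytes` and this `add_tail` signature are hypothetical names chosen for illustration. It shows why raising the queue depth cannot help when the frame itself is over the limit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical limit, mirroring the 1514-byte check described above.
constexpr std::size_t kMaxFrameBytes = 1514;

// Illustrative bounded receive queue; not the real CPacketQueue.
class SketchPacketQueue {
public:
  explicit SketchPacketQueue(std::size_t max_packets)
      : max_packets_(max_packets) {
    assert(max_packets > 0);
  }

  // Returns false (packet lost) if the frame is oversized or the queue is full.
  bool add_tail(const std::vector<std::uint8_t>& frame) {
    if (frame.size() > kMaxFrameBytes) {
      ++lost_;  // too large: a deeper queue does not change this outcome
      return false;
    }
    if (packets_.size() >= max_packets_) {
      ++lost_;  // queue is full
      return false;
    }
    packets_.push_back(frame);
    return true;
  }

  std::size_t size() const { return packets_.size(); }
  std::size_t lost() const { return lost_; }

private:
  std::size_t max_packets_;
  std::size_t lost_ = 0;
  std::deque<std::vector<std::uint8_t>> packets_;
};
```

In this sketch a 4314-byte or 2894-byte frame is rejected by the first check regardless of whether the queue holds 1 or 1024 packets.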

Without the lost packet error, there is still slowness and crashing with X11 forwarding; most often now the programs just hang and crash with an error:

XIO:  fatal IO error 65535 (network partner disconnected logical link) on X server "_WSA1:"
      after 685 requests (614 known processed) with 108 events remaining. 
%XLIB-F-IOERROR, xlib io error

It does seem that both Xephyr and Xnest get slower over time. If OpenVMS has just booted, it goes well for a few minutes. But the longer it runs, the slower it responds, hangs, etc.

I'm going to see if I can get netbsd running and try x forwarding there, maybe it makes a difference or help narrow down issues.

RaymiiOrg avatar Nov 10 '20 11:11 RaymiiOrg

Without the sleep (commented out) in the thread, the SRM console gave other errors:

Testing the EW* Network*** Error (ewa0),
 Mop loop message timed out from: 08-00-2b-3b-42-fd*** 
List index: 7 received count: 0 expected count 2

Networking did work inside OpenVMS, however. I'm also going to see if this issue (slowness and crashes) happens on actual hardware (outside of VirtualBox) with 2 interfaces.

RaymiiOrg avatar Nov 10 '20 11:11 RaymiiOrg

I'm going to see if I can get netbsd running and try x forwarding there, maybe it makes a difference or help narrow down issues.

Saw a report on Twitter stating that X and SSH work with NetBSD but are "slow", depending on the hardware used: [image]

I still have to test with actual hardware, will report back in later.

RaymiiOrg avatar Nov 11 '20 08:11 RaymiiOrg

I have the same problem, just when copying files (with DECnet via TCPIP) to the axpbox machine. I'm running axpbox on a Fedora 33 machine with more than one NIC. What I noticed is that the packet lost message shows a MAC address as src that I do not know and that cannot be traced on our network.

Probably related: when I leave axpbox running overnight with OpenVMS booted, somewhere in the night it starts giving the packet loss message every second(?) with some MAC addresses, which I do not know, as src and FF-FF-FF-FF-FF-FF as dst.

joukj avatar Nov 11 '20 08:11 joukj

Networking did work inside OpenVMS, however. I'm also going to see if this issue (slowness and crashes) happens on actual hardware (outside of VirtualBox) with 2 interfaces.

Can confirm this issue also happens without VirtualBox. Two (gigabit) NICs, one for AXPbox (OpenVMS) and one for the PC; networking does work, but X11 has the same slowness. Lost packet messages also appear (but I suspect that is due to gigabit).

RaymiiOrg avatar Nov 12 '20 05:11 RaymiiOrg

Looks like DECnet is much more stable than TCPIP. I have already been running 4 X11 applications (ICO, DecW$clock, Decw$mail and vue$master) for more than 1.5 hours, displaying them on a "real" Alpha running OpenVMS.

I see the same instability when copying files to axpbox (by "decnet" or "decnet via TCPIP"):

copy *.c 19.10"user passw"::[]

works OK, while

copy *.com 10.9.9.9"user passw"::[]

hangs after a few files.

joukj avatar Nov 12 '20 13:11 joukj

Very interesting. Once I have a while to do work on AXPbox I'll try to look into this - all this information will definitely help.

lenticularis39 avatar Nov 12 '20 15:11 lenticularis39

the packet loss message with some MAC addresses, which I do not know, as src and FF-FF-FF-FF-FF-FF as dst.

This looks suspiciously like ARP broadcast messages (who-has xxx tell yyy). It might be another device openvms is trying to communicate with. Are they bigger than the 1514 size? That seems large for ARP requests....

RaymiiOrg avatar Nov 12 '20 18:11 RaymiiOrg

Looks like DECnet is much more stable than TCPIP. I have already been running 4 X11 applications (ICO, DecW$clock, Decw$mail and vue$master) for more than 1.5 hours, displaying them on a "real" Alpha running OpenVMS.

Out of personal interest, could you maybe share screenshots of vue, clock and mail? I only get those halfway rendering...

RaymiiOrg avatar Nov 12 '20 18:11 RaymiiOrg

The experiments I reported yesterday were with a modified version of axpbox: I raised the 1514 to 9000 in both DEC21143.cpp and Ethernet.cpp. I have to do more tests with this.

I'm wondering why ETH_MAX_PACKET_RAW is defined in Ethernet.h but is never used. I think the hard coded 1514 in the .cpp files should be replaced by this one.

joukj avatar Nov 13 '20 08:11 joukj

I'm wondering why ETH_MAX_PACKET_RAW is defined in Ethernet.h but is never used. I think the hard coded 1514 in the .cpp files should be replaced by this one.

1514 is roughly the maximum Ethernet frame length (the exact length depends on the protocol type). As you can see in Ethernet.h, lengths of 1514 and 1518 are used (the first without the CRC, the second with it), corresponding to this format:

[image]

#define ETH_MAX_PACKET_RAW 1514
#define ETH_MAX_PACKET_CRC 1518

struct eth_frame { // ethernet (wire) frame
  u8 src[6];       // source address
  u8 dst[6];       // destination address
  u8 protocol[2];  // protocol
  u8 data[1500];   // data: variable 46-1500 bytes
  u8 crc_fill[4];  // space for max packet crc
};

struct eth_packet {             // ethernet packet
  int len;                      // size of packet
  int used;                     // bytes used (consumed)
  u8 frame[ETH_MAX_PACKET_CRC]; // ethernet frame
};

I'll check both the Ethernet and DEC21143 implementation and try to find any bugs, also doing some small refactoring in the process (like replacing the constants with macros as you mentioned).
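
For reference, the two constants follow directly from the field sizes in the struct above. The sketch below only restates that arithmetic, assuming a plain Ethernet II frame without a VLAN tag; the `k...` names and `fits_in_one_frame` are hypothetical and do not exist in Ethernet.h.

```cpp
#include <cassert>
#include <cstddef>

// Field sizes of a classic Ethernet II frame (no VLAN tag assumed).
constexpr std::size_t kAddrBytes = 6;          // one MAC address
constexpr std::size_t kProtocolBytes = 2;      // protocol / EtherType
constexpr std::size_t kMaxPayloadBytes = 1500; // largest data field
constexpr std::size_t kCrcBytes = 4;           // frame check sequence

constexpr std::size_t kHeaderBytes = 2 * kAddrBytes + kProtocolBytes;
constexpr std::size_t kMaxRaw = kHeaderBytes + kMaxPayloadBytes;  // without CRC
constexpr std::size_t kMaxWithCrc = kMaxRaw + kCrcBytes;          // with CRC

static_assert(kHeaderBytes == 14, "two addresses plus protocol field");
static_assert(kMaxRaw == 1514, "matches ETH_MAX_PACKET_RAW");
static_assert(kMaxWithCrc == 1518, "matches ETH_MAX_PACKET_CRC");

// True if a captured length (without CRC) fits in one standard frame.
constexpr bool fits_in_one_frame(std::size_t captured_len) {
  return captured_len <= kMaxRaw;
}
```

Neither of the sizes from the log (4314 and 2894) passes `fits_in_one_frame`, which matches the packet lost warnings.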

lenticularis39 avatar Nov 14 '20 08:11 lenticularis39

So the large packets causing the warning are read from pcap. Setting pcap's snaplen to ETH_MAX_PACKET_CRC removes the warning, but the network instability after some time persists. This makes sense, because it truncates the packets that are too long instead of fragmenting them.

lenticularis39 avatar Nov 14 '20 12:11 lenticularis39

Is this related: https://github.com/the-tcpdump-group/tcpdump/issues/389 - or is there an option in pcap to (re-)assemble packets for us? Back when I worked at an ISP we often had "issues" relating to https://en.wikipedia.org/wiki/Large_send_offload - nowadays there is even OpenStack documentation on it: https://docs.openstack.org/developer/performance-docs/test_plans/hardware_features/hardware_offloads/plan.html

RaymiiOrg avatar Nov 14 '20 12:11 RaymiiOrg

The tcpdump issue is a different one - it concerns packets over 64 kB. Here the problem is that packets larger than ETH_MAX_PACKET_CRC (1518 B) are captured by pcap, likely due to large send offload as you say (libpcap doesn't fragment the packets, see https://packetbomb.com/how-can-the-packet-size-be-greater-than-the-mtu/).
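
As a rough plausibility check of the offload explanation, the two sizes from the packet lost log can be compared with each other. The sketch below assumes Ethernet II, IPv4 and TCP headers without options (14 + 20 + 20 bytes); that assumption is not confirmed anywhere in this thread, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumed headers: Ethernet II + IPv4 without options + TCP without options.
constexpr std::size_t kAssumedHeaders = 14 + 20 + 20;  // 54 bytes

// TCP payload carried by a captured frame, or 0 if the frame is too short.
constexpr std::size_t tcp_payload(std::size_t frame_len) {
  return frame_len > kAssumedHeaders ? frame_len - kAssumedHeaders : 0;
}

// Greatest common divisor (Euclid), written as a single return for C++11.
constexpr std::size_t gcd_of(std::size_t a, std::size_t b) {
  return b == 0 ? a : gcd_of(b, a % b);
}

// Largest segment size that evenly divides both payloads.
constexpr std::size_t common_segment_size(std::size_t len_a,
                                          std::size_t len_b) {
  return gcd_of(tcp_payload(len_a), tcp_payload(len_b));
}
```

With the logged sizes 2894 and 4314, the payloads come out as 2840 and 4260 bytes, which is two and three segments of 1420 bytes each. That is consistent with several equal-sized TCP segments having been merged into one captured packet, but it is only a plausibility check under the stated assumptions.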

lenticularis39 avatar Nov 14 '20 12:11 lenticularis39

I'm not however sure whether this is related to the networking slowing down, which could be a problem in the emulated NIC itself.

lenticularis39 avatar Nov 14 '20 12:11 lenticularis39

The tcpdump issue is a different one - it concerns packets over 64 kB. Here the problem is that packets larger than ETH_MAX_PACKET_CRC (1518 B) are captured by pcap, likely due to large send offload as you say (libpcap doesn't fragment the packets, see https://packetbomb.com/how-can-the-packet-size-be-greater-than-the-mtu/).

I did find a patch for libpcap and fragmentation: https://seclists.org/tcpdump/2007/q2/112

Can I help you in any way with testing specific things?

RaymiiOrg avatar Nov 14 '20 13:11 RaymiiOrg

I did find a patch for libpcap and fragmentation: https://seclists.org/tcpdump/2007/q2/112

Interesting. This does the opposite to what we want here, though. A fragmentation function will have to be added to Ethernet.cpp to support the large packets generated by the Linux networking stack.
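
To make the idea concrete, here is a minimal sketch of only the splitting step such a function would need. It is not a patch for Ethernet.cpp: `Segment` and `split_payload` are hypothetical names, and the sketch deliberately leaves out the harder part, which is rewriting the IP and TCP headers (lengths, sequence numbers, checksums) for every piece.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One piece of an oversized TCP payload (hypothetical type for this sketch).
struct Segment {
  std::size_t seq_offset;             // to be added to the original TCP sequence number
  std::vector<std::uint8_t> payload;  // at most max_segment bytes
};

// Splits a payload into consecutive pieces of at most max_segment bytes.
// Header rewriting (IP length/checksum, TCP sequence/checksum) is NOT done here.
std::vector<Segment> split_payload(const std::vector<std::uint8_t>& payload,
                                   std::size_t max_segment) {
  assert(max_segment > 0);
  std::vector<Segment> out;
  for (std::size_t off = 0; off < payload.size(); off += max_segment) {
    const std::size_t n = std::min(max_segment, payload.size() - off);
    const auto first = payload.begin() + static_cast<std::ptrdiff_t>(off);
    const auto last = first + static_cast<std::ptrdiff_t>(n);
    out.push_back(Segment{off, std::vector<std::uint8_t>(first, last)});
  }
  return out;
}
```

Each resulting piece would then be wrapped in its own headers so that every frame handed to the emulated NIC stays within the 1514-byte limit.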

Can I help you in any way with testing specific things?

Currently no patch exists, so there's nothing to test. I'll let you know once I get to something.

lenticularis39 avatar Nov 14 '20 13:11 lenticularis39

Based on a look at the simh pcap networking implementation, LSO is solved there. Maybe porting the entire network emulation from simh would be a reasonable choice.

lenticularis39 avatar Nov 15 '20 10:11 lenticularis39

Sure, 1500 is the normal frame size, but not when some interfaces are set to "jumbo frames"; then the limit is, I think, just under 9000.

joukj avatar Nov 16 '20 09:11 joukj

The machine, with the packet size set to 9000, survived the weekend. However:

  • the VMS clock stopped ticking on Friday night at 18.45h (sh tim always gives the same time)
  • the console (PuTTY session) hangs; the last message is from Friday 18.37h

joukj avatar Nov 16 '20 09:11 joukj

I did some debugging and came up with a patch (attached), which makes the network more stable. I was able to start the CDE environment (see screenshot). [image: Screenshot 2021-01-22 at 14 52 49] [attachment: DEC21143.patch.txt]

dmzettl avatar Jan 22 '21 15:01 dmzettl

For personal interest, could you maybe explain a bit what the patch does?

RaymiiOrg avatar Jan 22 '21 15:01 RaymiiOrg

For personal interest, could you maybe explain a bit what the patch does?

What happened (and what the patch fixes) is that only partial frames were written to the pcap filter, because the second buffer was not considered when collecting the Ethernet frames in dec21143_tx. It can happen that both buffers, to which tdes2 and tdes3 point, contain data. When this happens, the data of both buffers has to be combined to get a valid Ethernet frame and hence IP packet. The patch simply checks whether buf2_size is greater than 0 and, if so, appends the data from the buffer pointed to by tdes3 to the current frame.
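
The idea can be sketched as follows. This is not the attached patch, which is not reproduced in this thread. The `TxDescriptor` struct, the `GuestReader` callback and the function names are hypothetical. The bit positions (buffer 1 size in tdes1 bits 0-10, buffer 2 size in bits 11-21, chained flag in bit 24) are an assumption taken from the 21143 descriptor layout as I understand it, and the chained-mode check goes beyond what is described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Transmit descriptor as four 32-bit words (layout assumed, see lead-in).
struct TxDescriptor {
  std::uint32_t tdes0;  // status / ownership
  std::uint32_t tdes1;  // control bits and both buffer sizes
  std::uint32_t tdes2;  // guest address of buffer 1
  std::uint32_t tdes3;  // guest address of buffer 2 (or next descriptor if chained)
};

// Hypothetical callback that copies `len` bytes of guest memory at `addr`.
using GuestReader =
    std::function<std::vector<std::uint8_t>(std::uint32_t addr, std::size_t len)>;

constexpr std::uint32_t kChainedFlag = 1u << 24;  // assumed position of the chain bit

std::size_t buf1_size(const TxDescriptor& d) { return d.tdes1 & 0x7FFu; }
std::size_t buf2_size(const TxDescriptor& d) { return (d.tdes1 >> 11) & 0x7FFu; }

// Appends the data of buffer 1 and, if present, buffer 2 to the current frame.
void append_descriptor_data(const TxDescriptor& d, const GuestReader& read,
                            std::vector<std::uint8_t>& frame) {
  if (buf1_size(d) > 0) {
    const std::vector<std::uint8_t> b1 = read(d.tdes2, buf1_size(d));
    frame.insert(frame.end(), b1.begin(), b1.end());
  }
  // In chained mode tdes3 is assumed to hold the next descriptor, not data.
  const bool chained = (d.tdes1 & kChainedFlag) != 0;
  if (!chained && buf2_size(d) > 0) {
    const std::vector<std::uint8_t> b2 = read(d.tdes3, buf2_size(d));
    frame.insert(frame.end(), b2.begin(), b2.end());
  }
}
```

Without the second `if`, a frame whose header sits in buffer 1 and whose body sits in buffer 2 would be sent with only the first part, which matches the partial frames described above.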

dmzettl avatar Jan 22 '21 16:01 dmzettl

Thank you for explaining! I'm going to try it as well.

What is your networking setup? The screenshot looks like OS X; do you use a virtual machine?

RaymiiOrg avatar Jan 22 '21 16:01 RaymiiOrg

I do use a virtual machine running on ESXi and yes, I do connect from an OS X machine.

What is your networking setup? The screenshot looks like OS X; do you use a virtual machine?

Yes, I'm using a FreeBSD virtual machine on ESXi. On this virtual machine I run AXPbox. And yes, I connect from OS X to AXPbox.

dmzettl avatar Jan 22 '21 19:01 dmzettl

With the patch enabled I get new (error) messages in the SRM prompt:

Testing the System
Testing the Network

*** Error (ewa0), Mop loop message timed out from: 08-00-de-ad-be-ef

*** List index: 0 received count: 3 expected count 4


*** Error (ewa0), Mop loop message timed out from: 08-00-de-ad-be-ef

*** List index: 1 received count: 3 expected count 4


*** Error (ewa0), Mop loop message timed out from: 08-00-de-ad-be-ef

*** List index: 2 received count: 2 expected count 4


*** Error (ewa0), Mop loop message timed out from: 08-00-de-ad-be-ef

*** List index: 3 received count: 2 expected count 4


*** Error (ewa0), Mop loop message timed out from: 08-00-de-ad-be-ef

*** List index: 4 received count: 2 expected count 4


*** Error (ewa0), Mop loop message timed out from: 08-00-de-ad-be-ef

*** List index: 5 received count: 2 expected count 4

Loop Reply from: 08-00-de-ad-be-ef
Loop Reply from: 08-00-de-ad-be-ef
Loop Reply from: 08-00-de-ad-be-ef
Loop Reply from: 08-00-de-ad-be-ef

It took me a while to get up and running because I forgot the SYSTEM password. Fixed that: https://gist.github.com/RaymiiOrg/d70258c698857659f4fadfa282556ae8 - now able to test the patch in OpenVMS.

This is the branch I'm testing with: https://github.com/RaymiiOrg/axpbox/tree/combine_tdes2_tdes3_buffer_for_valid_ethernet_frame - If you don't want to create a pull request I could do that for you as well, for Tomáš to review.

I can confirm that most of my tests from the first post now run much better (mail, vue, clock):

mcr decw$clock

[image]

EDIT/TPU/DISPLAY=DECWINDOWS

[image]

Trying a CDE session (run sys$system:decw$startlogin.exe) does take a while to load, but it loads!

[image]

[image]

[image]

Lots of looking at the hourglass cursor. The CPacketQueue(rx_queue):add() messages are gone though.

Looks promising! Doesn't get any further than the blue screen, but still, specific applications do work quite well:

For my own reference:

  • Calculator: RUN SYS$SYSTEM:DECW$CALC
  • Calendar: RUN SYS$SYSTEM:DECW$CALENDAR
  • Cardfiler: RUN SYS$SYSTEM:DECW$CARDFILER
  • Clock: RUN SYS$SYSTEM:DECW$CLOCK
  • CDA Viewer: VIEW/INTERFACE=DECWINDOWS filename
  • DECsound: RUN SYS$SYSTEM:DECSOUND
  • DECterm: CREATE/TERMINAL=DECTERM
  • EVE: EDIT/TPU/DISPLAY=DECWINDOWS
  • FileView: RUN SYS$SYSTEM:VUE$MASTER
  • Mail: RUN SYS$SYSTEM:DECW$MAIL
  • Message Panel: RUN SYS$SYSTEM:DECW$MESSAGEPANEL
  • Notepad: RUN SYS$SYSTEM:DECW$NOTEPAD
  • Print Screen: RUN SYS$SYSTEM:DECW$PRINTSCREEN
  • Paint: RUN SYS$SYSTEM:DECW$PAINT
  • Puzzle: RUN SYS$SYSTEM:DECW$PUZZLE
  • Bookreader: RUN SYS$SYSTEM:DECW$BOOKREADER

Via: https://vmssoftware.com/products/decwindows-motif/ - Using DECwindows Motif for OpenVMS

RaymiiOrg avatar Jan 22 '21 20:01 RaymiiOrg

I'm glad that it works for you as well. The new error messages you're seeing happen sometimes - and from what I've observed have nothing to do with the patch. I just started AXPbox and I didn't see the errors. The CPacketQueue(rx_queue):add() error isn't entirely fixed. When there's heavy network use it can happen again. The patch improves the overall network stability because fewer retransmits are sent to the network. I'll try to find a way to improve the CPacketQueue(rx_queue):add() error situation, though.

dmzettl avatar Jan 23 '21 07:01 dmzettl

If you don't mind, could you please do the pull request for me - Thanks a million

dmzettl avatar Jan 23 '21 08:01 dmzettl