
Discussion - full duplex by splitting HID in / out into separate composite functions (no issue)

Open mame82 opened this issue 7 years ago • 64 comments

Continuation of the ongoing discussion from here

mame82 avatar Mar 30 '17 17:03 mame82

@RoganDawes I thought opening a new issue would be a good idea before we go too far off topic. I don't have much time for coding right now, but I wanted to test the idea discussed.

So here are the new descriptors for the two Composite HID functions:

Input to host:

0x06, 0x00, 0xFF,  // Usage Page (Vendor Defined 0xFF00)
0x09, 0x01,        // Usage (0x01)
0xA1, 0x01,        // Collection (Application)
0x09, 0x01,        //   Usage (0x01)
0x15, 0x00,        //   Logical Minimum (0)
0x26, 0xFF, 0x00,  //   Logical Maximum (255)
0x75, 0x08,        //   Report Size (8)
0x95, 0x40,        //   Report Count (64)
0x81, 0x02,        //   Input (Data,Var,Abs,No Wrap,Linear,Preferred State,No Null Position)
0xC0,              // End Collection

Output from host:

0x06, 0x00, 0xFF,  // Usage Page (Vendor Defined 0xFF00)
0x09, 0x01,        // Usage (0x01)
0xA1, 0x01,        // Collection (Application)
0x09, 0x02,        //   Usage (0x02)
0x15, 0x00,        //   Logical Minimum (0)
0x26, 0xFF, 0x00,  //   Logical Maximum (255)
0x75, 0x08,        //   Report Size (8)
0x95, 0x40,        //   Report Count (64)
0x91, 0x02,        //   Output (Data,Var,Abs,No Wrap,Linear,Preferred State,No Null Position,Non-volatile)
0xC0,              // End Collection

I'm not sure whether the collection is still necessary, or whether the usage on the second descriptor really needs to be changed to two, but they should work.
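For what it's worth, the per-report payload follows directly from the Report Size and Report Count items in the descriptors above. A tiny Python sketch of that arithmetic (illustration only, not a HID descriptor parser):

```python
# Report Size (0x75, 0x08) and Report Count (0x95, 0x40) from the
# descriptors above determine the payload carried by each report.
REPORT_SIZE_BITS = 0x08   # 8 bits per field
REPORT_COUNT = 0x40       # 64 fields per report

payload_bytes = REPORT_SIZE_BITS * REPORT_COUNT // 8
print(payload_bytes)  # 64 vendor-defined bytes per report
```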

And here comes the first problem, thanks to PowerShell. I need a way to distinguish between the in and out interfaces, and I'm not sure this is possible with WMI. Test code:

function GetDevicePath($USB_VID, $USB_PID)
{
    $HIDGuid="{4d1e55b2-f16f-11cf-88cb-001111000030}"
    foreach ($wmidev in gwmi Win32_USBControllerDevice | %{ [wmi]($_.Dependent) }) {
        #[System.Console]::WriteLine($wmidev.PNPClass)
        if ($wmidev.DeviceID -match ("$USB_VID" + '&PID_' + "$USB_PID") -and $wmidev.DeviceID -match ('HID') -and -not $wmidev.Service) {
            $devpath = "\\?\" + $wmidev.PNPDeviceID.Replace('\','#') + "#" + $HIDGuid
            "Matching device found $wmidev"
        }
    }
    #$devpath
}

$USB_VID="1D6B"
$USB_PID="fdde" # full duplex device ;-)

GetDevicePath $USB_VID $USB_PID

Result:

Matching device found \\WUNDERLAND-PC\root\cimv2:Win32_PnPEntity.DeviceID="HID\\VID_1D6B&PID_FDDE&MI_03\\8&B609427&0&0000"
Matching device found \\WUNDERLAND-PC\root\cimv2:Win32_PnPEntity.DeviceID="HID\\VID_1D6B&PID_FDDE&MI_04\\8&2F37D1E9&0&0000"

Please ignore my nice hostname ;-)

If this were Linux, I guess I would be able to check which device file is readable and which is writable. As this unfortunately is Windows, I have to provide this information to CreateFile. I'm going to check the Win32_USBControllerDevice attributes for useful information on this tomorrow. Worst case: the HidD_* Win32 functions would be needed for enumeration.
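The device path that the PowerShell above builds from the PNPDeviceID is a plain string transformation, so it can be sketched outside WMI. This Python mirror of the `Replace('\','#')` logic uses one of the device IDs from the output above (Windows treats the resulting path case-insensitively):

```python
# Sketch of the device-path construction done in the PowerShell above:
# "\\?\" prefix + PNPDeviceID with backslashes replaced by '#' + "#" + HID class GUID.
HID_GUID = "{4d1e55b2-f16f-11cf-88cb-001111000030}"

def device_path(pnp_device_id: str) -> str:
    return "\\\\?\\" + pnp_device_id.replace("\\", "#") + "#" + HID_GUID

example = device_path("HID\\VID_1D6B&PID_FDDE&MI_03\\8&B609427&0&0000")
print(example)
# \\?\HID#VID_1D6B&PID_FDDE&MI_03#8&B609427&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
```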

Let me know if you have any ideas on this

mame82 avatar Mar 30 '17 17:03 mame82

I honestly don't think it is necessary to have two raw HID interfaces, although technically it may be possible to double your throughput as a result. I think the real problems are shitty PowerShell and the lack of "streaming". To get 1000 packets per second, each packet has to go within 1 ms of the previous one. However, I measured latencies of 10-20 ms just switching from writing to reading in PowerShell, which kills your throughput right there.

Making the protocol less chatty, i.e. having the sender continue to send until the receiver indicates to slow down, seems like the way to go!
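A back-of-the-envelope calculation with the numbers above (1000 packets per second, 10-20 ms read/write turnaround) shows why the chattiness dominates; a sketch:

```python
# Why turnaround latency kills raw HID throughput: the bus allows one
# 64-byte report per 1 ms interrupt slot, but a chatty protocol pays
# the read/write switch latency on every packet. 15 ms is the midpoint
# of the 10-20 ms measured above.
REPORT_BYTES = 64

bus_limit = REPORT_BYTES * 1000        # 64000 bytes/s at 1 report per ms
chatty_limit = REPORT_BYTES / 0.015    # one report per 15 ms turnaround

print(bus_limit, round(chatty_limit))  # 64000 vs 4267 bytes/s
```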

RoganDawes avatar Mar 30 '17 19:03 RoganDawes

Hi @RoganDawes

As promised, I've done some tests on simultaneous transfers using two separate interfaces.

technically it may be possible to double your throughput as a result

You're absolutely right on this.

I've done 4 tests from Powershell:

  1. Writing out 1000 64-byte reports on a dedicated HID out interface. Result: about 8 seconds (= 8 KByte/s)

  2. Writing out 1000 64-byte reports on a dedicated HID out interface, echoing them back, and reading them from a dedicated HID in interface in a separate thread. Result: again about 8 seconds (= 8 KByte/s); reading back input data while writing data out has no speed impact

Tests 3 and 4 were the same, but I was hoping that the FileStream could have up to 8 concurrent writes in flight, as it is created with FILE_FLAG_OVERLAPPED. The results are still disappointing: transferring 64 KByte takes only about 100 ms less. Here's the output of tests 3 and 4:

Path: \\?\hid#vid_1d6b&pid_fdde&mi_02#8&2324206c&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Invalid handle
Path: \\?\hid#vid_1d6b&pid_fdde&mi_03#8&b609427&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Input: 65, Output: 0
Path: \\?\hid#vid_1d6b&pid_fdde&mi_04#8&2f37d1e9&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Input: 0, Output: 65
Writing 1000 reports with synchronous 'Write'
 Hello World                                                     
 Hello World                                                     
 .. snip ... (1000 Hello World from output thread, echoed back by bash via cat /dev/hidg2 > /dev/hidg1)
 Hello World                                                     
HID out thread finfished, time taken 8,1890945 seconds
Writing 1000 reports with async 'BeginWrite', 8 concurrent writes
 Hello World                                                     
 .. snip ... (1000 Hello World from output thread, echoed back by bash via cat /dev/hidg2 > /dev/hidg1)
 Hello World                                                     
HID concurrent output thread finfished, time taken 7,9576403 seconds
Killing remaining threads
 Hello World                                                     
Godbye

To sum up: it seems the FileStream methods of .NET can't reach the maximum transfer rate (1000 64-byte reports per second on USB 2.0), no matter how hard I try. So I'm giving up on simultaneous transfer, as the benefit is low while the effort is high (given that both of us have working implementations with multiple protocol layers).

i.e. having the sender continue to send until the receiver indicates to slow down

Considering my tests, I doubt there would be a speed increase from this (at least not for P4wnP1, as HID communication always runs at maximum speed, while the upper thread-based layers work on demand). Here's the output code, which has no console IO or array-creation overhead, but still reaches only 8 KB/s:


    $outbytes = New-Object Byte[] (65)
 
    $msg=[system.Text.Encoding]::ASCII.GetBytes("Hello World")
    for ($i=0; $i -lt $msg.Length; $i++) { $outbytes[$i + 1] = $msg[$i] }

    for ($i=0; $i -lt 1000; $i++)
    {
        $HIDout.Write($outbytes,0,65)
    }
    

And here's my test script, use it as you like. I solved the problem of enumerating device interfaces based on HID report descriptors, which took a ton of shitty C# code. This is another reason to leave this path. The only useful thing about this code is that I'm able to spot my composite device based on serial + manufacturer string, which isn't possible with WMI enumeration (the strings for the interface drivers are different). That's nice because, as said, I often change VID/PID, but again, creating a temporary C# file for inline compilation renders this useless.

So I guess I'm stuck at ~4 KByte/s maximum simultaneous transfer, or could achieve ~8 KByte/s at the cost of shitty .NET code (while consuming additional USB EPs). Maybe I'll use dedicated input/output reports for faster file transfer later on, and I'll still be slower than my first modem ;-).

Best regards, and thanks for the exchange on this.

P.S. Excuse the typos, my spellchecker is fighting against the English language

mame82 avatar Mar 31 '17 13:03 mame82

Hi,

It is certainly NOT the case that Windows/PowerShell cannot achieve more than what you are currently getting. I have been able to reach up to 48 KByte/s using PowerShell code (however, it was unreliable/inconsistent, and didn't include all the "refinements" required for multiple concurrent connections).

It did maintain that rate for at least several seconds, so I don't think that this is an unreachable goal.

Rogan

RoganDawes avatar Mar 31 '17 13:03 RoganDawes

Could you provide a snippet of code reaching this rate? The snippet I provided above has no (obvious) room for improvement (it writes data out in a minimal loop).

mame82 avatar Mar 31 '17 13:03 mame82

Here is some real sample code, and some numbers to go with it:

On Windows:

speedtest.ps1

$M = 64
$cs = '
using System;
using System.IO;
using Microsoft.Win32.SafeHandles;
using System.Runtime.InteropServices;
namespace n {
	public class w {
		[DllImport(%kernel32.dll%, CharSet = CharSet.Auto, SetLastError = true)]
		public static extern SafeFileHandle CreateFile(String fn, UInt32 da, Int32 sm, IntPtr sa, Int32 cd, uint fa, IntPtr tf);

		public static FileStream o(string fn) {
			return new FileStream(CreateFile(fn, 0XC0000000U, 3, IntPtr.Zero, 3, 0x40000000, IntPtr.Zero), FileAccess.ReadWrite, 9, true);
		}
	}
}
'.Replace('%',[char]34)
Add-Type -TypeDefinition $cs

& {
	$devs = gwmi Win32_USBControllerDevice
	foreach ($dev in $devs) {
		$wmidev = [wmi]$dev.Dependent
		if ($wmidev.GetPropertyValue('DeviceID') -match ('1209&PID_6667') -and ($wmidev.GetPropertyValue('Service') -eq $null)) {
			$fn = ([char]92+[char]92+'?'+[char]92 + $wmidev.GetPropertyValue('DeviceID').ToString().Replace([char]92,[char]35) + [char]35+'{4d1e55b2-f16f-11cf-88cb-001111000030}')
		}
	}
	try {
		$f = [n.w]::o($fn)
		$d = New-Object IO.MemoryStream
		$c = 0
		$b = New-Object Byte[]($M+1)
		$sw = $null
		while($c -lt 1024 * 1024) {
			$r = $f.Read($b, 0, $M+1)
			if ($sw -eq $null) { $sw = [Diagnostics.Stopwatch]::StartNew() }
			$d.Write($b,1, $M)
			$c += $M
		}
		$sw.Stop()
		$sw.Elapsed
		$d.Length
		([Text.Encoding]::ASCII).GetString($d.ToArray())
	} catch {
		echo $_.Exception|format-list -force
	}
	exit
}

This waits for the first successful read from the raw HID interface before starting the stopwatch, then exits after 16384 iterations (1024*1024/64 == 16384).

I run it like so:

> powershell -exec bypass -file speedtest.ps1 > log.txt

On the Pi:

# time seq -f "%063g" 1 17000 > /dev/hidg1

This writes the numbers 1-17000, each formatted as a 63-character zero-padded string (with a newline added to make 64 bytes), to the HID interface. I run for longer than 16384 lines to account for any packets getting lost.
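A quick sanity check of that framing (a sketch; Python's %-formatting behaves like seq's here):

```python
# Each seq line is a 63-character zero-padded number plus a newline,
# i.e. exactly one 64-byte report.
line = "%063g" % 17000 + "\n"
print(len(line))  # 64
```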

The results (with ^M's removed):

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 16
Milliseconds      : 402
Ticks             : 164026490
TotalDays         : 0.000189845474537037
TotalHours        : 0.00455629138888889
TotalMinutes      : 0.273377483333333
TotalSeconds      : 16.402649
TotalMilliseconds : 16402.649

1048576
000000000000000000000000000000000000000000000000000000000000001
000000000000000000000000000000000000000000000000000000000000002
...
000000000000000000000000000000000000000000000000000000000016383
000000000000000000000000000000000000000000000000000000000016384

In other words, 1MB successfully transferred (with no lost packets, since the last number is indeed 16384) in 16.4 seconds, a total rate of 63 937 bytes/second.

RoganDawes avatar Mar 31 '17 14:03 RoganDawes

What gets interesting is adding a write into the PowerShell after $d.Write() that sends the received bytes back to the device, then updating the command on the Pi to:

time seq -f "%063g" 1 16384 | socat - /dev/hidg1 > t

In theory, seq would generate 16384 lines, the powershell would echo them one at a time, and "t" would contain those 16384 lines.

In practice, I ended up with only around 2000 lines in "t", and the powershell waiting to receive its full 1MB. I even upped the number of lines to 65536, and the powershell still didn't terminate, indicating that even though I sent 4 times more data than required, it still didn't successfully receive a full 1MB :-(

So this seems to be a fairly fundamental limitation of USB raw HID (perhaps only on Windows, and perhaps only with PowerShell; more testing required!).

In fact, this might justify using two endpoints, one for reading, and one for writing, which I otherwise thought was a bad idea. ;-)

RoganDawes avatar Mar 31 '17 14:03 RoganDawes

And this code, while not beautiful by any means, gets reasonable performance, while not losing any data!

$M = 64
$cs = '
using System;
using System.IO;
using Microsoft.Win32.SafeHandles;
using System.Runtime.InteropServices;
namespace n {
	public class w {
		[DllImport(%kernel32.dll%, CharSet = CharSet.Auto, SetLastError = true)]
		public static extern SafeFileHandle CreateFile(String fn, UInt32 da, Int32 sm, IntPtr sa, Int32 cd, uint fa, IntPtr tf);

		public static FileStream o(string fn) {
			return new FileStream(CreateFile(fn, 0XC0000000U, 3, IntPtr.Zero, 3, 0x40000000, IntPtr.Zero), FileAccess.ReadWrite, 9, true);
		}
	}
}
'.Replace('%',[char]34)
Add-Type -TypeDefinition $cs

$readloop = {
	Param($M, $f, $q)

	try {
		$d = New-Object IO.MemoryStream
		$c = 0
		$sw = $null
		while($c -lt 1024*1024) {
			$b = New-Object Byte[]($M+1)
			$r = $f.Read($b, 0, $M+1)
			if ($sw -eq $null) { $sw = [Diagnostics.Stopwatch]::StartNew() }
			$d.Write($b,1, $M)
			$c += $M
#			[System.Threading.Monitor]::Enter($q)
			$q.Enqueue($b)
#			[System.Threading.Monitor]::Pulse($q)
#			[System.Threading.Monitor]::Exit($q)
		}
		$sw.Stop()
		$sw.Elapsed
		$d.Length
		([Text.Encoding]::ASCII).GetString($d.ToArray())
	} catch {
		$_.Exception|format-list -force
	}
	exit
}

$writeloop = {
	Param($M, $f, $q)

	try {
		while ($true) {
#			[System.Threading.Monitor]::Enter($q)
#			[System.Threading.Monitor]::Wait($q)
			[System.Console]::Write("!")
			if ($q.Count -gt 0) {
				[System.Console]::WriteLine($q.Count)
				while ($q.Count -gt 0) {
					$b = $q.Dequeue()
					$f.Write($b, 0, $M+1)
				}
			}
			Start-Sleep -m 10
#			[System.Threading.Monitor]::Exit($q)
		}
	} catch {
		[System.Console]::WriteLine("Write Thread Done!")
		$_.Exception
	}
	exit
}

$Q = New-Object System.Collections.Queue
$Q = [System.Collections.Queue]::Synchronized($Q)

$devs = gwmi Win32_USBControllerDevice
foreach ($dev in $devs) {
	$wmidev = [wmi]$dev.Dependent
	if ($wmidev.GetPropertyValue('DeviceID') -match ('1209&PID_6667') -and ($wmidev.GetPropertyValue('Service') -eq $null)) {
		$fn = ([char]92+[char]92+'?'+[char]92 + $wmidev.GetPropertyValue('DeviceID').ToString().Replace([char]92,[char]35) + [char]35+'{4d1e55b2-f16f-11cf-88cb-001111000030}')
	}
}
$f = [n.w]::o($fn)

$readThread = [PowerShell]::Create()
[void] $readThread.AddScript($readloop)
[void] $readThread.AddParameter("M", $M)
[void] $readThread.AddParameter("f", $f)
[void] $readThread.AddParameter("q", $q)

$writeThread = [PowerShell]::Create()
[void] $writeThread.AddScript($writeloop)
[void] $writeThread.AddParameter("M", $M)
[void] $writeThread.AddParameter("f", $f)
[void] $writeThread.AddParameter("q", $q)

[System.IAsyncResult]$AsyncReadJobResult = $null
[System.IAsyncResult]$AsyncWriteJobResult = $null

try {
	$AsyncWriteJobResult = $writeThread.BeginInvoke()

	Sleep 1 # Wait 1 second to give some time for the write thread to be ready
	$AsyncReadJobResult = $readThread.BeginInvoke()
	Write-Host "Ready"
} catch {
	$ErrorMessage = $_.Exception.Message
	Write-Host $ErrorMessage
} finally {
	if ($readThread -ne $null -and $AsyncReadJobResult -ne $null) {
		$readThread.EndInvoke($AsyncReadJobResult)
		$readThread.Dispose()
	}

	if ($writeThread -ne $null -and $AsyncWriteJobResult -ne $null) {
		$writeThread.EndInvoke($AsyncWriteJobResult)
		$writeThread.Dispose()
	}
	exit
}

On Linux:

pi@raspberrypi:~ $ time seq -f "%063g" 1 16384 | socat - /dev/hidg1 > t

real	0m40.080s
user	0m1.030s
sys	0m3.360s
pi@raspberrypi:~ $ wc -l t
16384 t
pi@raspberrypi:~ $

So, 40 seconds elapsed, to send 2MB back and forth (1MB in each direction), with no errors or lost packets. That's pretty good, I think!

I was hoping to use the Monitor class to let one thread notify the other, but I ended up with a deadlock: the reader had already added an item and pulsed $q while the writer had just finished its while loop and was getting around to calling Wait($q). Since the reader had read the last packet, no more pulses were sent, and the writer waited forever, even though there was actually one last packet in the queue.
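For reference, the classic fix for this lost-wakeup pattern is to re-check the queue inside the lock before waiting, so a pulse that fires before the Wait can never strand the last item. A minimal sketch in Python (threading.Condition standing in for the .NET Monitor; all names are made up):

```python
import threading
from collections import deque

# Lost-wakeup fix for the Monitor deadlock described above: the consumer
# re-checks the queue *inside* the lock before waiting, so a notify that
# fired before the wait can never strand a queued item.
q = deque()
cv = threading.Condition()
done = False
received = []

def writer():
    while True:
        with cv:
            # wait only while there is verifiably nothing to do
            while not q and not done:
                cv.wait()
            if not q and done:
                return
            item = q.popleft()
        received.append(item)

def reader(items):
    global done
    for it in items:
        with cv:
            q.append(it)
            cv.notify()
    with cv:
        done = True
        cv.notify()

t = threading.Thread(target=writer)
t.start()
reader(list(range(100)))
t.join()
print(len(received))  # 100: all items drained, none stranded
```

The key line is the `while not q and not done` re-check: the consumer only sleeps when the queue is genuinely empty, so the ordering problem described above cannot occur.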

RoganDawes avatar Mar 31 '17 16:03 RoganDawes

I'll test this with the two-endpoint script provided above. It's likely that I missed the transfer interruption that occurred in your test, because I terminated the loops after 1000 reports. Additionally, I have to test the code you provided and work out what causes the transfer-rate difference. I'm not able to work on this during the weekend, but will report back on both next week.

mame82 avatar Mar 31 '17 16:03 mame82

Although I haven't had time to fully dive into your code, it seems you could change it. The $q object is already thread-safe, I think (I use it in my stage 2 without manual synchronization, without issues). Since you added a 10 ms sleep to the write loop, you could move this sleep into an else branch of the if ($q.Count -gt 0) condition. That way you'd react to enqueued data with a 10 ms delay only when $q.Count has fallen to 0, not while there's continuous data in the queue. After doing this, it should work without the Monitor lock (triggering based on the synchronized $q.Count).

I've done this here (lines 233 and 310), but with a 50 ms delay. CPU consumption drops below 1 percent with the sleep. The sleep has no impact on the throughput of continuous data, but again, this is on an upper layer.

So I'm going to report back next week...promise

mame82 avatar Mar 31 '17 17:03 mame82

Ignore the last comment; I missed that the inner while loop empties the queue before the sleep is called. You were looking for a way to avoid polling the queue count, by using Monitor instead, as far as I understand.

mame82 avatar Mar 31 '17 19:03 mame82

Although I should be doing other things right now, I couldn't stop thinking about your examples. So I tested the following code (I only changed PID/VID and added an early out to the WMI device enumeration):

$M = 64
$cs = '
using System;
using System.IO;
using Microsoft.Win32.SafeHandles;
using System.Runtime.InteropServices;
namespace n {
	public class w {
		[DllImport(%kernel32.dll%, CharSet = CharSet.Auto, SetLastError = true)]
		public static extern SafeFileHandle CreateFile(String fn, UInt32 da, Int32 sm, IntPtr sa, Int32 cd, uint fa, IntPtr tf);

		public static FileStream o(string fn) {
			return new FileStream(CreateFile(fn, 0XC0000000U, 3, IntPtr.Zero, 3, 0x40000000, IntPtr.Zero), FileAccess.ReadWrite, 9, true);
		}
	}
}
'.Replace('%',[char]34)
Add-Type -TypeDefinition $cs

& {
	$devs = gwmi Win32_USBControllerDevice
	foreach ($dev in $devs) {
		$wmidev = [wmi]$dev.Dependent
		if ($wmidev.GetPropertyValue('DeviceID') -match ('1D6B&PID_0137') -and ($wmidev.GetPropertyValue('Service') -eq $null)) {
			$fn = ([char]92+[char]92+'?'+[char]92 + $wmidev.GetPropertyValue('DeviceID').ToString().Replace([char]92,[char]35) + [char]35+'{4d1e55b2-f16f-11cf-88cb-001111000030}')
            break # second dev string invalid handle
		}
	}
	try {
		$f = [n.w]::o($fn)
		$d = New-Object IO.MemoryStream
		$c = 0
		$b = New-Object Byte[]($M+1)
		$sw = $null
		while($c -lt 1024 * 1024) {
            #[Console]::WriteLine("$c")
			$r = $f.Read($b, 0, $M+1)
			if ($sw -eq $null) { $sw = [Diagnostics.Stopwatch]::StartNew() }
			$d.Write($b,1, $M)
			$c += $M
		}
		$sw.Stop()
		$sw.Elapsed
		$d.Length
		([Text.Encoding]::ASCII).GetString($d.ToArray())
	} catch {
		echo $_.Exception|format-list -force
	}
	exit
}

On my first attempt I thought it wasn't working (that's why I added the printout of $c). Leaving the code running for more than a minute, I realized that my issue isn't PowerShell:

PS D:\P4wnP1> D:\P4wnP1\powershell\tests\fastread_usabuse.ps1


Days              : 0
Hours             : 0
Minutes           : 2
Seconds           : 11
Milliseconds      : 64
Ticks             : 1310641048
TotalDays         : 0,00151694565740741
TotalHours        : 0,0364066957777778
TotalMinutes      : 2,18440174666667
TotalSeconds      : 131,0641048
TotalMilliseconds : 131064,1048

1048576
000000000000000000000000000000000000000000000000000000000000001
000000000000000000000000000000000000000000000000000000000000002
... snip ...
000000000000000000000000000000000000000000000000000000000016381
000000000000000000000000000000000000000000000000000000000016382
000000000000000000000000000000000000000000000000000000000016383

I'm still stuck at ~7.8 KByte/s

I'm using usb_f_hid.ko with libcomposite.ko, which is needed for the P4wnP1 features. Could it be that you're using g_hid.ko?

I don't get it - why is my USB communication so slow :-(

mame82 avatar Apr 01 '17 09:04 mame82

Okay, you're using the same.

So I'm not sure where to go now; I have to think about what causes the speed drop.

mame82 avatar Apr 01 '17 09:04 mame82

I'm at exactly 1/8th of your speed (8 ms per report read/write). This explains my 3.2 KByte/s upper bound: I use alternating read/write, which consumes 16 ms per 64-byte fragment, leaving me at an effective maximum of 4 KByte/s in one direction. This drops to 3.2 KByte/s because of my naive implementation of fragment reassembly.
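The arithmetic behind those numbers, as a sketch:

```python
# 8 ms per report operation, alternating one read with one write:
# each 64-byte fragment costs 16 ms, capping one direction at 4 KByte/s.
report_bytes = 64
op_ms = 8                    # measured time per report read or write
fragment_ms = 2 * op_ms      # one read + one write per 64-byte fragment

one_way_rate = report_bytes / (fragment_ms / 1000)
print(one_way_rate)  # 4000.0 bytes/s effective maximum per direction
```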

Could we compare Raspberry Pi specs:

root@p4wnp1:~# ls /sys/class/udc
20980000.usb
root@p4wnp1:~# uname -r
4.4.50+
root@p4wnp1:~# lsmod | grep hid
usb_f_hid              10837  6
libcomposite           49479  15 usb_f_ecm,usb_f_hid,usb_f_rndis

I hope the UDC is always the same ... I still don't get it

mame82 avatar Apr 01 '17 09:04 mame82

I've found the root cause (I wouldn't be able to sleep otherwise). New results:

PS C:\Users\XMG-U705> D:\P4wnP1\powershell\tests\fastread_usabuse.ps1


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 16
Milliseconds      : 400
Ticks             : 164008304
TotalDays         : 0,000189824425925926
TotalHours        : 0,00455578622222222
TotalMinutes      : 0,273347173333333
TotalSeconds      : 16,4008304
TotalMilliseconds : 16400,8304

1048576
000000000000000000000000000000000000000000000000000000000000001
... snip ...
000000000000000000000000000000000000000000000000000000000016383
000000000000000000000000000000000000000000000000000000000016384

It was a really silly mistake. I used this code for my HID device:

# create RAW HID function
# =======================================================
if $USE_RAWHID; then
mkdir -p functions/hid.g2
echo 1 > functions/hid.g2/protocol
echo 1 > functions/hid.g2/subclass
echo 8 > functions/hid.g2/report_length   # <-- the silly mistake: should be 64
cat $wdir/conf/raw_report_desc > functions/hid.g2/report_desc
fi

The report length I used was exactly 1/8th of the size it should be (copy-paste from the HID keyboard code). So I'm at the same speed as you now. I still have a minor issue in my code, because I reassemble the reports by manually creating new static arrays. This has a huge performance impact on large transfers (a static array with size = old_size + 64 is created before concatenating new data) and keeps me from doing meaningful transfer tests. I'll do some new speed tests on file transfer after changing the report assembler to work with a MemoryStream instead.
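The reassembly pattern described here (reallocating a fixed array per 64-byte report) is quadratic in the transfer size, while a growable buffer appends in amortized constant time. A hypothetical sketch, with io.BytesIO standing in for .NET's MemoryStream:

```python
import io

# Growable-buffer reassembly (the MemoryStream approach mentioned above):
# each 64-byte report is appended in amortized O(1).
def reassemble(reports):
    buf = io.BytesIO()
    for r in reports:
        buf.write(r)
    return buf.getvalue()

# Naive version for contrast: copies the whole payload on every report,
# O(n) per append and O(n^2) overall.
def reassemble_naive(reports):
    data = b""
    for r in reports:
        data = data + r
    return data

reports = [bytes([i % 256]) * 64 for i in range(100)]
print(len(reassemble(reports)))  # 6400
```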

I've already implemented an application-layer function which loads remote files into a dynamically generated PowerShell variable in the host process, which could be used for file-transfer testing. I'm going to revisit the full-duplex approach with two EPs only if this function works too slowly (less than half of the maximum transfer rate). Anything else should be doable with more code optimization. Continuous alternating read/write seems to be okay for me.

@RoganDawes I want to thank you very much for discussing these points. If you'd like me to do additional tests, ask at any time (but please avoid forcing me to use an AVR ;-)).

Additionally, I want to mention a new problem which could affect both of our projects. See here

mame82 avatar Apr 01 '17 16:04 mame82

Nice work! As you can see, I am implementing the same idea both on AVR/ESP and Linux, so we can continue to collaborate ;-)

For the moment, I'm happy with the default Raspbian kernel, not having run into the problems that you have yet. I'll keep it in mind should I encounter them, though! Thanks!

RoganDawes avatar Apr 01 '17 18:04 RoganDawes

The continuous alternating read/write effectively halves your throughput, which you can recover by doubling the number of endpoints. An alternative is to simply reduce the number of acknowledgements required. My thinking is to eliminate the ACKs from my code entirely, and assume that a packet was received unless a NAK packet arrives. This would be triggered by an out-of-sequence sequence number being received (1, 2, 4, for example). If that happens, the sender can retransmit the missing packet and continue from that point.

Of course, this means that the sender needs to keep a number of past packets available for retransmission if necessary. And, given that my sequence numbers are only 4 bits, there is a chance that the wrong packet gets retransmitted. Hmmm!
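The 4-bit concern can be made concrete: sequence numbers wrap every 16 packets, so any retransmission window reaching 16 packets back makes two different packets indistinguishable. A sketch:

```python
# With 4-bit sequence numbers, packet n and packet n+16 carry the same
# sequence value, so a NAK for "3" becomes ambiguous once the sender's
# retransmission buffer spans a full wrap.
SEQ_BITS = 4
SEQ_SPACE = 2 ** SEQ_BITS  # 16

def seq(n):
    return n % SEQ_SPACE

print(seq(3), seq(19))  # both 3
```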

I wonder how well "packing" a number of ACK's into a single packet would work. i.e. Set a "P-ACK" flag, then pack a whole lot of ACK's into the data portion of the packet. At 4 bits, and 60 data bytes, one could pack up to 120 ACKs into a single packet, significantly reducing the number of ACK packets required, and boosting the one-way data rate. Instead of 1:1 (data:ack), you could get 120:1, and the one way traffic would essentially approach the maximum. Sending actual data in the other direction would by necessity require flushing the pending ACK queue first, before sending the actual data.
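The packing numbers check out; a sketch of the capacity math (treating the report as 60 data bytes plus a 4-byte header, which is an assumption based on the "60 data bytes" mentioned above):

```python
# 4-bit ACKs packed two per byte into the 60-byte data portion of a
# 64-byte report (the other 4 bytes assumed to be protocol header).
DATA_BYTES = 60
ACK_BITS = 4

acks_per_packet = DATA_BYTES * 8 // ACK_BITS
print(acks_per_packet)  # 120 ACKs in a single packet
```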

Regardless, implementing the blocking read thread, and updating the rest of the code in Proxy.ps1 should result in significant improvements! Let's see if I can actually achieve 32000 bytes per second?!

RoganDawes avatar Apr 01 '17 18:04 RoganDawes

I'm not sure why, but I'm at 48 KByte/s with alternating read/write (should be <32 KByte/s).

https://www.youtube.com/watch?v=MI8DFlKLHBk&yt:cc=on

Used code is here (stage1_mini.ps1, stage2.ps1 and hidsrv9.py): https://github.com/mame82/P4wnP1/tree/devel/hidtools

mame82 avatar Apr 02 '17 11:04 mame82

Let me recap your suggestions and the remaining possibilities:

  1. P-ACK solution

  • pro: nearly maximum rate in one direction
  • con: keeping track of ACKs
  • con: complicated if communication changes direction

  2. SYN/ACK (alternating read/write)

  • con: half speed
  • pro: SYN/ACK wouldn't be needed at all. I use a 4-byte header on every report. The SND and RCV fields fulfill the need for SYN/ACK but aren't considered. In fact, the only time I effectively use the SND field is when the payload in a report is smaller than 60 bytes (to trim it down, but this could be done in higher layers, too).
  • pro: I was able to achieve more than half the data rate (48 KByte/s, see the video link above) with this approach, once I changed stream reassembly to use a MemoryStream instead of Byte[]. The performance improvement seems to be influenced by disabling the other USB functions (RNDIS/ECM), and more speed was gained by using async writes on the MemoryStreams.

  3. Dedicated OUT/IN endpoints

  • pro: maximum rate should be achieved (and could be increased even more by using additional endpoints, e.g. 2 input devices and 2 output devices, as 1000 reports/s seems to be the per-device limit)
  • con: SYN/ACK has to be reintroduced, because blocking write calls seem to be impossible; data written to the out report could be overwritten by the next report if the receiver hasn't read it yet. This could be circumvented with an ACK on every packet received (again alternating read/write) or multiple ACKs in a later packet (the P-ACK approach)
  • con: consumes additional USB EPs

So for me, the alternating read/write approach seems the way to go, for the following reasons:

  • large transfers in both directions (data exfiltration from the target, file injection from device to target)
  • USB EPs are needed for other USB gadget functions (the RPi UDC seems to provide only 8 EPs)
  • the transfer rate is okay, while using a simple protocol without much overhead
  • read/write packets are used as a heartbeat when empty, and deliver the STAGE2 payload as a continuous loop (on an otherwise unused src/dst, which is a channel in your terminology)

mame82 avatar Apr 02 '17 13:04 mame82

Maybe the report ID could be used to improve things further (the destination port, or in your case the channel, could be moved there). I don't know how much the report descriptor grows if one defines 255 possible report IDs (0 seems to be reserved for "no ID in use").

mame82 avatar Apr 02 '17 13:04 mame82

Sending actual data in the other direction would by necessity require flushing the pending ACK queue first, before sending the actual data.

Why? As far as I understand your idea, payload data would be decoupled from logical header data (SYN, ACK, P-ACK, SEQ). The scenario of changing send direction is like PUSH: instead of sending the P-ACK packet the old sender is waiting for, the new sender sends its data (with an urgent or push flag), and the old sender (now the receiver) knows that it has to reassemble an incoming stream before it continues to send its own data (and receives the P-ACK). A peer which wants to send data while still receiving an incoming stream (in the middle of the 120 packets) could decide on its own whether to send its data with the URGENT flag (PUSH) or cache the pending outbound data in a send queue until the incoming transmission is fully received. If the outbound buffer grows too large due to permanent incoming transmissions, the data could be forced out with the PUSH flag set.

But from my understanding, no matter how far you optimize, it ends up as a more optimized HALF-DUPLEX mode.

The alternating read/write I'm using ensures that a FULL-DUPLEX channel is available at any time (but at half rate ... I'm still not sure why I achieved more than 32 KByte/s), as every packet can carry payload in the direction needed (on demand).

mame82 avatar Apr 03 '17 08:04 mame82

Yes, indeed, you are absolutely right, that there is no need to flush the P-ACKs.

As you have observed, I am modelling my protocol on TCP, so I'm not sure where you get half-duplex from. Using the P-ACK approach, each side can get up to 120/121 of the available bandwidth for its own transmissions, if the other side is not transmitting any actual data. If both sides need to transmit, it should balance appropriately, depending on the ratio each side requires, up to a 50:50 split.

What I haven't established yet is whether having a dedicated thread for reading the HID interface in Powershell ends up effectively prioritising reads at the expense of writes, making the Linux side dominant when writing.

Of course, this is all very specific to using RAW HID as a transport, using something else like a text-based printer, or CDC or ACM would have less of an issue, I suspect, simply because the available data rate would be that much higher.

RoganDawes avatar Apr 03 '17 08:04 RoganDawes

up to a 50:50 split

That's why I said one ends up with half-duplex (if both sides are sending; your idea scales very well if only one side is sending)

What I haven't established yet is whether having a dedicated thread for reading the HID interface in Powershell ends up effectively prioritising reads at the expense of writes, making the Linux side dominant when writing.

I'm still trying to figure this out. I was planning to extend your read-loop example into a test with concurrent but independent read and write on both ends (separate threads). As I still don't get why I could reach a transfer rate >32 KByte/s, I think it's possible that the device file allows writing and reading at the same time (SHARED ACCESS), which would essentially mean FULL DUPLEX is possible with a single HID device interface.

I have to delay this test, as it turns out that the WriteAsync method of my MemoryStream objects is only available from .NET 4.5 onwards, which cost me backward compatibility with Windows 7; I want to fix that first.

Of course, this is all very specific to using RAW HID as a transport, using something else like a text-based printer, or CDC or ACM would have less of an issue, I suspect, simply because the available data rate would be that much higher.

Sure, but moving away from HID would mean getting louder (more likely to trigger endpoint protection on the target). I like the HID approach very much, and using ACM or other communication classes would be too easy ;-) .
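The concurrent, fully decoupled read/write test planned above could be sketched like this. The two `os.pipe()` pairs are only stand-ins for the HID device file so the sketch runs anywhere; on the Pi both directions would go through a single `os.open("/dev/hidg0", os.O_RDWR)` descriptor, and the thread/report counts here are illustrative:

```python
import os
import threading

REPORT_SIZE = 64
N = 200  # reports per direction (small enough to fit the pipe buffers)

# Stand-ins for the two directions of the HID device file.
host_to_dev_r, host_to_dev_w = os.pipe()
dev_to_host_r, dev_to_host_w = os.pipe()

counts = {}

def reader(name, fd):
    """Independent reader thread: consume N fixed-size reports."""
    for _ in range(N):
        os.read(fd, REPORT_SIZE)
    counts[name] = N

def writer(name, fd):
    """Independent writer thread: emit N fixed-size reports."""
    report = bytes(REPORT_SIZE)
    for _ in range(N):
        os.write(fd, report)
    counts[name] = N

# Four fully decoupled threads: each side reads and writes concurrently.
threads = [
    threading.Thread(target=reader, args=("host_read", dev_to_host_r)),
    threading.Thread(target=writer, args=("host_write", host_to_dev_w)),
    threading.Thread(target=reader, args=("dev_read", host_to_dev_r)),
    threading.Thread(target=writer, args=("dev_write", dev_to_host_w)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counts)
```

With no echo round-trip, neither direction waits on the other, which is the behaviour the full-duplex measurements below confirm on the real device file.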

mame82 avatar Apr 03 '17 09:04 mame82

News:

Replacing WriteAsync on my background MemoryStreams with BeginWrite/EndWrite dropped my transfer rate into the expected range:

End receiving /tmp/test received 1.048.517 Byte in 41,2182 seconds (24,84 KB/s)

Every call to BeginWrite assures that EndWrite is called for the previous write, which I hadn't done like this while using WriteAsync. Inspecting the source of the WriteAsync implementation in MemoryStream.cs, it seems every WriteAsync runs as a separate async task. I'm still not sure how my transfer rate could grow above 32 KB/s with WriteAsync, as I'm waiting for an answer to every packet sent before sending more data. Anyway, I updated my code and moved on to prepare a concurrent read/write test.

Going to ping back with the results

mame82 avatar Apr 03 '17 11:04 mame82

I struggle to believe that BeginWrite/EndWrite on a MemoryStream is necessary, or even efficient. Unless you are doing it to be compatible with Async operations on other types of streams, I'd just do a Write, and be done with it.

RoganDawes avatar Apr 03 '17 13:04 RoganDawes

I struggle to believe that BeginWrite/EndWrite on a MemoryStream is necessary, or even efficient. Unless you are doing it to be compatible with Async operations on other types of streams, I'd just do a Write, and be done with it.

You're right again; the benefit couldn't be measured, but I was wondering why I achieved this high rate with WriteAsync ?!?!

Anyway, meanwhile I've found the answer. Good news... believe it or not: the device file is FULL DUPLEX. You can send data in both directions at about 64 KB/s in parallel.

Test output, PowerShell side (reading and writing at the same time):

PS D:\del\P4wnP1\powershell> D:\del\P4wnP1\powershell\concurrent_rw2.ps1
Path: \\?\hid#vid_1d6b&pid_0137&mi_02#8&1f80c44c&1&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Invalid handle
Path: \\?\hid#vid_1d6b&pid_0137&mi_03#8&4567976&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Input: 65, Output: 65
Reading up to 16384 reports, with blocking read
Writing 16385 reports with synchronous 'Write'
16384 reports have been read in 16.5860813 seconds (61.7385132436316 KB/s)
16385 reports have been written in 16.61491 seconds (61.635151800401 KB/s)
Killing remaining threads
Godbye

And the other end (Python; the first read is the trigger to start the threads):

Count 1 reports read in 0.000141143798828 seconds (0.0 KB/s)
Count 16384 reports written in 16.5881521702 seconds (61.730805788 KB/s)
Count 16383 reports read in 16.6130959988 seconds (61.5779262382 KB/s)

Although I didn't measure overall time, reading and writing were done concurrently. The difference is that I fully decoupled inbound and outbound data (no echo server).

I haven't implemented any tests for packet loss, but the sending and receiving threads have to have matching report counts in order for the threads to terminate. So it is very likely that there is no packet loss. Anyway, packet content has to be checked on both ends (I'm sure that I ran into an issue where writing a report before reading the pending input report cleared the input report).

Testing this is easy, as I included routines to print out inbound data on both sides (disabled to reduce influence on the time measurement).

Test code will be up in some minutes

mame82 avatar Apr 03 '17 14:04 mame82

Next interesting observation: starting only the PowerShell side of the communication (no Python endpoint on the Pi Zero), it turns out that write() is blocking too. If no reports are read on the RPi's end, a maximum of 4 reports can be written before write() blocks.

This means there should never be any packet (=report) loss, which in turn means there's no need to use ACKs on a per-report basis. One could simply reassemble reports into a bigger stream. No problem on Linux, but PS is still missing something like a FIFO MemoryStream (which in fact could be implemented easily on your side, as you use inline C#).
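For reference, a minimal thread-safe FIFO byte stream of the kind the PowerShell side is missing might look like this (sketched in Python rather than inline C#; the class name and API are my own): writers append whole reports, readers consume an arbitrary number of bytes, which is exactly the report-to-stream reassembly described above.

```python
import threading
from collections import deque

class FifoStream:
    """Thread-safe byte FIFO: writers append whole reports,
    readers consume an arbitrary number of bytes."""

    def __init__(self):
        self._buf = deque()                 # queue of byte chunks (reports)
        self._data = threading.Condition()  # signals "buffer non-empty"

    def write(self, data):
        with self._data:
            self._buf.append(bytes(data))
            self._data.notify()

    def read(self, n):
        """Return 1..n bytes; blocks while the FIFO is empty."""
        out = bytearray()
        with self._data:
            while not self._buf:
                self._data.wait()           # block until a report arrives
            while self._buf and len(out) < n:
                chunk = self._buf.popleft()
                take = n - len(out)
                out += chunk[:take]
                if len(chunk) > take:       # keep the unread remainder queued
                    self._buf.appendleft(chunk[take:])
        return bytes(out)
```

A reader thread can then pull the payload stream at whatever granularity the upper protocol layer wants, independent of the 64-byte report boundaries.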

Test from PS, printing out the report count on write, without a listener running on the RPi:

PS D:\del\P4wnP1\powershell> D:\del\P4wnP1\powershell\concurrent_rw2.ps1
Path: \\?\hid#vid_1d6b&pid_0137&mi_02#8&1f80c44c&1&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Invalid handle
Path: \\?\hid#vid_1d6b&pid_0137&mi_03#8&4567976&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Input: 65, Output: 65
Reading up to 16384 reports, with blocking read
Writing 16385 reports with synchronous 'Write'
reports written 0
reports written 1
reports written 2
reports written 3

Note: none of the pending reports is lost if the Python side is started some minutes later.
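The blocking-write behaviour can be illustrated with a non-blocking probe: switch the write end to non-blocking mode and count how many reports fit before a blocking write() would stall. This sketch uses an ordinary pipe as a stand-in for the HID interface, so its buffer holds far more than the 4 reports the HID gadget queues; only the mechanism carries over, not the number:

```python
import os

REPORT_SIZE = 64
r, w = os.pipe()
os.set_blocking(w, False)  # a full buffer now raises instead of blocking

report = bytes(REPORT_SIZE)
queued = 0
try:
    while True:
        os.write(w, report)
        queued += 1
except BlockingIOError:
    pass  # buffer full: a blocking write() would stall at this point

print("reports buffered before write() would block:", queued)
```

Since the writer simply stalls (rather than failing) once the in-flight queue is full, back-pressure is built in and no report is ever dropped, matching the observation above.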

mame82 avatar Apr 03 '17 14:04 mame82

@RoganDawes here're the test files https://github.com/mame82/tests/tree/master/fullduplex

I guess I have to reimplement everything, as alternating read/write turns out to be the worst approach.

Have you found a replacement for named pipes to interface with the upper protocol layers (some sort of FIFO stream available in .NET 3.5)?

mame82 avatar Apr 03 '17 14:04 mame82

Very interesting results! I guess my implementation was naive; as an echo server, it introduced delays!

I never bother reassembling the stream, I simply write the data portion of the packets to their destination as I receive them. So I have no need for a FIFO memory stream at all.

The main issue then is failure to read packets on the PowerShell side, resulting in lost data. This is easily seen by introducing a console write in the read loop: I ended up losing about 500 packets each time! If you keep that loop clean and tight, then hopefully there should be no packet loss in that direction either!

RoganDawes avatar Apr 03 '17 14:04 RoganDawes

The main issue then is failure to read packets on the Powershell side, resulting in lost data. This is easily seen by introducing a console write in the read loop, I ended up losing about 500 packets each time! If you keep that loop clean and tight, then hopefully there should be no packet loss in that direction either!

I haven't had packet loss at any time. This seems clear now, as write() blocks if data isn't read on the other end. read() was blocking, too.

mame82 avatar Apr 03 '17 14:04 mame82