userland
Support arm64 compilation
I've got a RPi3 with a test 64-bit kernel + userland setup going, and tried to compile the VideoCore userland, without success. First obstacle was:
interface/vmcs_host/linux/vcfilesys.c:286:19: error: format ‘%lld’ expects argument of type ‘long long int’, but argument 4 has type ‘int64_t {aka long int}’ [-Werror=format=]
DEBUG_MINOR("vc_hostfs_lseek returning %lld)", read_offset);
^
Stripping out -Werror allowed it to continue, leading it to:
interface/khronos/common/khrn_int_hash_asm.s: Assembler messages:
interface/khronos/common/khrn_int_hash_asm.s:36: Error: unknown architecture `armv6'
interface/khronos/common/khrn_int_hash_asm.s:37: Error: unknown pseudo-op: `.object_arch'
interface/khronos/common/khrn_int_hash_asm.s:38: Error: unknown pseudo-op: `.arm'
interface/khronos/common/khrn_int_hash_asm.s:104: Warning: unknown register 'a1' -- .req ignored
interface/khronos/common/khrn_int_hash_asm.s:105: Warning: unknown register 'a2' -- .req ignored
interface/khronos/common/khrn_int_hash_asm.s:106: Warning: unknown register 'a3' -- .req ignored
interface/khronos/common/khrn_int_hash_asm.s:107: Warning: unknown register 'a4' -- .req ignored
interface/khronos/common/khrn_int_hash_asm.s:110: Warning: unknown register 'ip' -- .req ignored
interface/khronos/common/khrn_int_hash_asm.s:111: Warning: unknown register 'lr' -- .req ignored
interface/khronos/common/khrn_int_hash_asm.s:113: Error: operand 1 should be an integer register -- `ldr BB,=0xDEADBEEF'
followed by many more errors for khrn_int_hash_asm.s. Might be more problems after that's cleared.
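As an aside on the first error: a common way to avoid this class of warning is to print an int64_t with the PRId64 macro from <inttypes.h> instead of a hard-coded %lld. The sketch below is only an illustration of that idea; format_offset is a hypothetical helper, not a function in the userland tree, and the message text is modelled on the compiler output above.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not part of userland): formats a 64-bit offset
 * identically on 32-bit and 64-bit targets. */
static int format_offset(char *buf, size_t len, int64_t offset)
{
    assert(buf != NULL && len > 0);
    /* PRId64 expands to the correct length modifier for int64_t on the
     * target being compiled for, so -Wformat has nothing to complain about. */
    return snprintf(buf, len, "vc_hostfs_lseek returning %" PRId64, offset);
}
```

Because the macro is resolved per target, the same source should build with -Werror on both armhf and aarch64.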
The GPU is a 32 bit processor. I haven't checked, but I'm expecting that there's a heck of a lot more work to do to get Khronos or other multimedia extension stuff up and running against a 64bit kernel than just getting userland to build.
It looks like the build scripts have been merged, so perhaps the issue needs to be closed?
I forgot I had filed this bug actually. The build scripts @Electron752 are referring to were part of PR #347 which adds -DARM64=ON to only compile known-working 64-bit code. But the fact remains that a lot of 64-bit broken code still exists. Maybe this bug should remain open and be used to refer to work on fixing the 64-bit broken code? I'll leave that decision to the repo maintainers.
I wouldn't be surprised if it's from the use of thumb, as it's deprecated in aarch64
Since there have been no updates to this in a year, I'm inclined to close it. Any objections?
@JamesH65 : I'm closely tracking this issue, as the reporter expanded with the following question :
But the fact remains that a lot of 64-bit broken code still exists. Maybe this bug should remain open and be used to refer to work on fixing the 64-bit broken code? I'll leave that decision to the repo maintainers.
What is your take on this?
The RPF are not putting any dev effort into a 64 bit userland, it's enough work supporting 32bit! I've no idea if that will change - it is a LOT of work I believe. So any updates will be coming from third parties, and there haven't been any posts here for a year, so presumably either no-one is actually working on it, or it's being documented elsewhere.
I'm sure there are certain applications where having a 64-bit kernel (let alone userland) may be beneficial, but I suspect the hoped-for performance improvements didn't materialise, otherwise people would be waving benchmark results at us demanding an RPi-supported aarch64 kernel.
What is the best way to do benchmarks to post? I have a full 64-bit compile with march=armv8-a+crc and neon set in the compile, so it's pretty much optimized to the max of RPi hardware.
You would think that Neon benchmarks would be the best ones to look at - Aarch64 Neon has double the number of Neon registers.
No, the aim is not to find something that a 64-bit kernel will excel at, but rather a benchmark or two that reflect performance for the (mythical) typical user by including a bit of everything.
Is there such a benchmark I could use?
Hi everyone, what's the status here?
We have not been working on this, so no change.
I'm an experienced developer; if I wanted to hack on this in my spare time, where would be a good place to start? I understand if even figuring that out is more work than you guys want to put into this heh, but I figured it wouldn't hurt to ask.
On 30 July 2018 at 23:06, Robert Thompson wrote:
I'm an experienced developer; if I wanted to hack on this in my spare time, where would be a good place to start? I understand if even figuring that out is more work than you guys want to put into this heh, but I figured it wouldn't hurt to ask.
I've wondered about doing this; I think it's actually relatively straightforward. There's a good chance I only think that because of a combination of ignorance and hubris though.
But anyhow, the basic problem is that there are various ARM/VideoCore interfaces around which are all designed around a 32 bit architecture on both sides. The tricky part is that there are places where the ARM side passes in a context, VC does some stuff, and then sends back a message with that context. The ARM side then does whatever it needs to do. That context is a pointer to some memory.
If you're on a 64 bit architecture, then that isn't going to work - your pointers are obviously too large.
So, what to do?
Well, I think one way is to allocate a virtual region (vma) with a suitably large size (e.g. 128MB virtual should be plenty, whatever, it's virtual so it doesn't matter). And then allocate memory in there. In the APIs, just pass the offset into this region, and on the way back, convert back to a pointer by adding back the offset.
Note that VideoCore only ever actually reads or writes at physical addresses as there is no IOMMU, so the virtual address can be anywhere, and it won't matter. However, there are certainly some code paths that will want a contiguous region (e.g. the VCHIQ circular buffers).
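To make the offset idea above concrete, here is a rough sketch of allocating contexts from a single region and passing only a 32-bit offset across the interface. Everything in it is an assumption for illustration: ctx_arena and the arena_* functions are hypothetical names, malloc stands in for whatever would really reserve the virtual region, and none of this is existing vchiq code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical arena: every context object is allocated from one region. */
typedef struct {
    uint8_t *base;   /* start of the region (a 64-bit pointer on aarch64) */
    uint32_t size;   /* region size in bytes; always fits in 32 bits */
    uint32_t used;   /* bump-allocator cursor */
} ctx_arena;

static int arena_init(ctx_arena *a, uint32_t size)
{
    if (size < 8)
        return -1;
    a->base = malloc(size);   /* stand-in for a real reserved VMA */
    a->size = size;
    a->used = 8;              /* skip offset 0 so that 0 can mean "no context" */
    return a->base ? 0 : -1;
}

static void *arena_alloc(ctx_arena *a, uint32_t bytes)
{
    uint32_t aligned = (bytes + 7u) & ~7u;   /* keep 8-byte alignment */
    if (aligned < bytes || aligned > a->size - a->used)
        return NULL;                         /* overflow or out of space */
    void *p = a->base + a->used;
    a->used += aligned;
    return p;
}

/* Pointer -> 32-bit value that fits in a 32-bit context field. */
static uint32_t arena_to_offset(const ctx_arena *a, const void *p)
{
    ptrdiff_t off = (const uint8_t *)p - a->base;
    assert(off > 0 && (uint64_t)off < a->size);
    return (uint32_t)off;
}

/* 32-bit value coming back -> pointer, rejecting out-of-range offsets. */
static void *arena_from_offset(const ctx_arena *a, uint32_t off)
{
    return (off == 0 || off >= a->used) ? NULL : a->base + off;
}
```

The bounds check on the way back is the main design point: the value returned by the other side is validated against the region instead of being trusted as a pointer.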
The place to start looking at this is in the vchiq driver - fix that and everything else will be easy (famous last words). There's a vchiq test program, so once that works you are home and dry.
For example - vchiq_service_params_struct has a void* userdata - that's an example of the problem. In vchiq_add_service_internal() it stuffs that pointer into some shared memory - that's probably where you would want to patch it up.
I think that should take care of vchiq.
There's also a shared memory driver where VC gets actual ARM-side addresses; I don't know how that can possibly work directly, probably it will require a special allocator from this same region, but I think other architectures have similar problems, so it might not be that hard to overcome.
I can't help feeling though that I've overlooked something important!
Luke
Thanks for the description!
I got most of it compiled in 64bit removing some assertions and changing some types from int to int32_t for example or removing void pointers completely. There are still tons of pointer to integer casts warnings that are probably critical, but I could not investigate further.
The main problem here is that mmal and vcos code base (in interfaces) are definitely not 64bit compatible because of above reasons. Another issue is that mmal has a dependency (khronos) where 32bit assembly was used /interface/khronos/common/khrn_int_hash_asm.s. I am not that familiar with 32bit nor 64bit arm assembly to convert this. But maybe this file could be excluded??
Well I am glad that there are more people interested in doing this!!
Greetings!
If you can make your code available somewhere then I might be able to have a look.
Don't worry about mmal for now, as it requires vchiq. The vchiq kernel driver is the place to start.
On Tue, 31 Jul 2018, 09:03 Konstantin Wachendorff wrote:
Thanks for the description!
I got most of it compiled in 64bit removing some assertions and changing some types from int to int32_t for example or removing void pointers completely. There are still tons of pointer to integer casts warnings that are probably critical, but I could not investigate further.
The main problem here is that mmal and vcos code base (in interfaces) are definitely not 64bit compatible because of above reasons. Another issue is that mmal has a dependency (khronos) where 32bit assembly was used /interface/khronos/common/khrn_int_hash_asm.s. I am not that familiar with 32bit nor 64bit arm assembly to convert this. But maybe this file could be excluded??
Well I am glad that there are more people interested in doing this!!
Greetings!
I have deleted all of it because I was stuck on the assembly file. But it wasn't that much work, I just changed the CMakeLists so it would compile everything with an aarch64 compiler (just remove the if not arm64), downloaded the newest linaro toolchain from here and fixed the errors on the go.
I hope you have more time, patience and skill than I do! :)
The assembly issue is trivial - there is a C implementation. Just remove:
#ifndef __arm__ // Use the version in khrn_int_hash_asm.s instead
(along with its matching #endif) from interface/khronos/common/khrn_int_hash.c, and the reference to common/khrn_int_hash_asm.s in interface/khronos/CMakeLists.txt
So as far as I can tell, you can currently use a 32-bit userland with a 64-bit kernel, and everything that I would expect to work does work (omxplayer, glmark2-es2-dispmanx, etc). Is this just a coincidence of the fact that the userland is always going to be putting a 32-bit pointer into the ostensibly 64-bit void* in that structure? Or is there more going on here?
@popcornmix I guess I missed that... @rothomp3 I have never come across that kind of setup, but I haven't looked into Gentoo or Arch... For me it is really important that mmal works because I need the camera working.
To your question, I don't know but I guess that the original authors maybe didn't plan to make it 64 bit in the first place ... because they might have expected to change the VC or so anyway so it was not necessary to take precautions for 64 bit
Sorry my question was really directed @luked99 heh, should have made that explicit…
On 1 August 2018 at 20:37, Robert Thompson wrote:
Sorry my question was really directed @luked99 heh, should have made that explicit…
Well, I'm a bit surprised at your finding, but an ounce of experience is worth a pound of theory(*)!
If you do "cat /dev/vchiq" it should give you a list of the services (I know this should be in debugfs...). If that has something sensible then it means that vchiq is working 64bit, which makes life much easier.
In that case, fixing mmal might be just a matter of patching up the structure definitions, and perhaps doing something as crude as a lookup table to map from 64 bit address to 32 bit context (ugly, but I suspect performance might well be fine). Otherwise we have to make vchiq work 64 bit but I think that should still be OK.
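For what the "crude lookup table" could look like, here is a minimal sketch that maps a 64-bit pointer to a small 32-bit handle and back. The names ctx_table, ctx_table_put, ctx_table_get and ctx_table_remove are hypothetical and the fixed capacity is arbitrary; this is not how mmal is implemented, and a real version would need locking around the table.

```c
#include <stddef.h>
#include <stdint.h>

#define CTX_TABLE_SIZE 64u   /* hypothetical capacity, chosen arbitrarily */

/* Slot i holds the pointer for handle i + 1; handle 0 means "invalid". */
typedef struct {
    void *slots[CTX_TABLE_SIZE];
} ctx_table;

/* Store a pointer and return a 32-bit handle for it (0 on failure). */
static uint32_t ctx_table_put(ctx_table *t, void *ptr)
{
    if (ptr == NULL)
        return 0;
    for (uint32_t i = 0; i < CTX_TABLE_SIZE; i++) {
        if (t->slots[i] == NULL) {
            t->slots[i] = ptr;
            return i + 1;
        }
    }
    return 0;   /* table full */
}

/* Translate a handle received from the 32-bit side back into a pointer. */
static void *ctx_table_get(const ctx_table *t, uint32_t handle)
{
    if (handle == 0 || handle > CTX_TABLE_SIZE)
        return NULL;
    return t->slots[handle - 1];
}

/* Release a handle once the context is no longer in flight. */
static void ctx_table_remove(ctx_table *t, uint32_t handle)
{
    if (handle != 0 && handle <= CTX_TABLE_SIZE)
        t->slots[handle - 1] = NULL;
}
```

The linear scan on insert is deliberately simple; with only a handful of contexts in flight it is probably cheap enough, which matches the guess above that performance might well be fine.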
The place where it will start getting tricky is if we ever have a 64 bit Raspberry Pi with more than 1GB of physical memory - at that point I think the 32 bit VideoCore (combined with its various cache aliases) won't be able to address all of the available memory. But we're not at that point yet.
I'm on vacation right now so I can't really do anything other than theorize, sorry!
Luke
(*) I should use SI units, I know, sorry.
@luked99 indeed:
pi@raspberrypi3 ~> uname -a
Linux raspberrypi3 4.17.10-v8+ #4 SMP PREEMPT Wed Jul 25 20:35:40 EDT 2018 aarch64 GNU/Linux
and
pi@raspberrypi3 ~> cat /dev/vchiq
State 0: CONNECTED
tx_pos=7b3b20(@000000004d94bef2), rx_pos=4ae20(@0000000005bf166a)
Version: 8 (min 3)
Stats: ctrl_tx_count=3142, ctrl_rx_count=3158, error_count=0
Slots: 30 available (29 data), 0 recyclable, 0 stalls (0 data)
Platform: 2835 (VC master)
Local: slots 34-64 tx_pos=7b3b20 recycle=7d2
Slots claimed:
DEBUG: SLOT_HANDLER_COUNT = 19837(4d7d)
DEBUG: SLOT_HANDLER_LINE = 2100(834)
DEBUG: PARSE_LINE = 2074(81a)
DEBUG: PARSE_HEADER = 142130712(878be18)
DEBUG: PARSE_MSGID = 67219474(401b012)
DEBUG: AWAIT_COMPLETION_LINE = 1369(559)
DEBUG: DEQUEUE_MESSAGE_LINE = 1452(5ac)
DEBUG: SERVICE_CALLBACK_LINE = 633(279)
DEBUG: MSG_QUEUE_FULL_COUNT = 0(0)
DEBUG: COMPLETION_QUEUE_FULL_COUNT = 0(0)
Remote: slots 2-32 tx_pos=4ae20 recycle=69
Slots claimed:
14: 222/221
DEBUG: SLOT_HANDLER_COUNT = 18864(49b0)
DEBUG: SLOT_HANDLER_LINE = 1851(73b)
DEBUG: PARSE_LINE = 1827(723)
DEBUG: PARSE_HEADER = -141866216(f78b4b18)
DEBUG: PARSE_MSGID = 67182619(401201b)
DEBUG: AWAIT_COMPLETION_LINE = 0(0)
DEBUG: DEQUEUE_MESSAGE_LINE = 0(0)
DEBUG: SERVICE_CALLBACK_LINE = 0(0)
DEBUG: MSG_QUEUE_FULL_COUNT = 0(0)
DEBUG: COMPLETION_QUEUE_FULL_COUNT = 0(0)
Instance 0000000098cabd6b: pid 396, connected, completions 0/128
Service 0: LISTENING (ref 1) 'KEEP' remote n/a (msg use 0/3840, slot use 0/15)
Bulk: tx_pending=0 (size 0), rx_pending=0 (size 0)
Ctrl: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
Bulk: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
0 quota stalls, 0 slot stalls, 0 bulk stalls, 0 aborted, 0 errors
instance 0000000066bd5562
Service 1: OPEN (ref 1) 'GCMD' remote 0 (msg use 0/3840, slot use 0/15)
Bulk: tx_pending=0 (size 0), rx_pending=0 (size 0)
Ctrl: tx_count=1, tx_bytes=21, rx_count=1, rx_bytes=13
Bulk: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
0 quota stalls, 0 slot stalls, 0 bulk stalls, 0 aborted, 0 errors
instance 0000000098cabd6b, 0/128 messages
Service 2: OPEN (ref 1) 'DISP' remote 10 (msg use 0/3840, slot use 0/15)
Bulk: tx_pending=0 (size 0), rx_pending=0 (size 0)
Ctrl: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
Bulk: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
0 quota stalls, 0 slot stalls, 0 bulk stalls, 0 aborted, 0 errors
instance 0000000098cabd6b, 0/128 messages
Service 3: OPEN (ref 1) 'UPDH' remote 18 (msg use 0/3840, slot use 0/15)
Bulk: tx_pending=0 (size 0), rx_pending=0 (size 0)
Ctrl: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
Bulk: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
0 quota stalls, 0 slot stalls, 0 bulk stalls, 0 aborted, 0 errors
instance 0000000098cabd6b, 0/128 messages
Service 4: OPEN (ref 1) 'TVSV' remote 35 (msg use 0/3840, slot use 0/15)
Bulk: tx_pending=0 (size 0), rx_pending=0 (size 0)
Ctrl: tx_count=1, tx_bytes=4, rx_count=1, rx_bytes=52
Bulk: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
0 quota stalls, 0 slot stalls, 0 bulk stalls, 0 aborted, 0 errors
instance 0000000098cabd6b, 0/128 messages
Service 5: OPEN (ref 1) 'TVNT' remote 43 (msg use 0/3840, slot use 0/15)
Bulk: tx_pending=0 (size 0), rx_pending=0 (size 0)
Ctrl: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
Bulk: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
0 quota stalls, 0 slot stalls, 0 bulk stalls, 0 aborted, 0 errors
instance 0000000098cabd6b, 0/128 messages
Service 6: OPEN (ref 1) 'CECS' remote 51 (msg use 0/3840, slot use 0/15)
Bulk: tx_pending=0 (size 0), rx_pending=0 (size 0)
Ctrl: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
Bulk: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
0 quota stalls, 0 slot stalls, 0 bulk stalls, 0 aborted, 0 errors
instance 0000000098cabd6b, 0/128 messages
Service 7: OPEN (ref 1) 'CECN' remote 59 (msg use 0/3840, slot use 0/15)
Bulk: tx_pending=0 (size 0), rx_pending=0 (size 0)
Ctrl: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
Bulk: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
0 quota stalls, 0 slot stalls, 0 bulk stalls, 0 aborted, 0 errors
instance 0000000098cabd6b, 0/128 messages
Service 8: OPEN (ref 1) 'ILCS' remote 9 (msg use 0/3840, slot use 0/15)
Bulk: tx_pending=0 (size 0), rx_pending=0 (size 0)
Ctrl: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
Bulk: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
0 quota stalls, 0 slot stalls, 0 bulk stalls, 0 aborted, 0 errors
instance 0000000098cabd6b
So it looks to me like the kernel side of this is already taken care of?
@luked99 Don't worry about vcsm at the moment. There's a new version in the pipeline that replaces the reloc heap with CMA allocations made on behalf of the VPU and mem_wrapped into a MEM_HANDLE_T.
There's also a V4L2 codec driver in progress, so that reduces MMAL to only being required for a couple of tasks.
Hi, is there any update on this? I managed to compile it all without errors. However, I believe mmal does not work. Is there a way to make it work?
I also found GLES/EGL have issues... the first time I call eglSwapBuffers it works, but after a few frames I get a segfault. This does not happen on a 32-bit build. Any idea?
We are not currently doing any dev work on 64bit builds, and don't use them in house, so I'm afraid I have no idea about the EGL issue. Without any sort of details on the fault it will also be very difficult to determine the cause of the issue.
The firmware GLES / EGL drivers will never be updated for 64 bit systems - please use the vc4 KMS drivers instead (those should already support 64 bit).
OpenMax IL is very unlikely to get any 64bit love - it's a hideous API to work with, and MMAL offers better functionality.
MMAL still needs some work, and that is the one bit that may be tackled. The camera can already be accessed via V4L2 which should be supported on 64bit systems. The codecs can now be accessed via V4L2 using the 4.19 branch. With two further patches that I have I think that too should be able to support 64bit systems. That covers the main use cases for MMAL, but using it directly does allow some more efficient pipelines to be created.
vcsm is being rewritten.
That should cover the majority of the userland code.