Support arm64 compilation

Open rfinnie opened this issue 8 years ago • 57 comments

I've got a RPi3 with a test 64-bit kernel + userland setup going, and tried to compile the VideoCore userland, without success. First obstacle was:

interface/vmcs_host/linux/vcfilesys.c:286:19: error: format ‘%lld’ expects argument of type ‘long long int’, but argument 4 has type ‘int64_t {aka long int}’ [-Werror=format=]
       DEBUG_MINOR("vc_hostfs_lseek returning %lld)", read_offset);
                   ^

Stripping out -Werror allowed it to continue, leading it to:

interface/khronos/common/khrn_int_hash_asm.s: Assembler messages:
interface/khronos/common/khrn_int_hash_asm.s:36: Error: unknown architecture `armv6'
interface/khronos/common/khrn_int_hash_asm.s:37: Error: unknown pseudo-op: `.object_arch'
interface/khronos/common/khrn_int_hash_asm.s:38: Error: unknown pseudo-op: `.arm'
interface/khronos/common/khrn_int_hash_asm.s:104: Warning: unknown register 'a1' -- .req ignored
interface/khronos/common/khrn_int_hash_asm.s:105: Warning: unknown register 'a2' -- .req ignored
interface/khronos/common/khrn_int_hash_asm.s:106: Warning: unknown register 'a3' -- .req ignored
interface/khronos/common/khrn_int_hash_asm.s:107: Warning: unknown register 'a4' -- .req ignored
interface/khronos/common/khrn_int_hash_asm.s:110: Warning: unknown register 'ip' -- .req ignored
interface/khronos/common/khrn_int_hash_asm.s:111: Warning: unknown register 'lr' -- .req ignored
interface/khronos/common/khrn_int_hash_asm.s:113: Error: operand 1 should be an integer register -- `ldr BB,=0xDEADBEEF'

followed by many more errors for khrn_int_hash_asm.s. Might be more problems after that's cleared.
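
The first error above is a format-specifier portability problem: on aarch64 Linux, int64_t is long, so %lld no longer matches. A minimal sketch of the usual fix, using the PRId64 macro from <inttypes.h>, might look like the following. The helper name format_offset is hypothetical and is not part of userland; only the message text is taken from the error output.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of userland): formats a 64-bit offset
 * portably. PRId64 expands to the correct conversion for int64_t on
 * both 32-bit ARM and aarch64, so -Werror=format stays quiet. */
static int format_offset(char *buf, size_t len, int64_t read_offset)
{
    return snprintf(buf, len, "vc_hostfs_lseek returning %" PRId64,
                    read_offset);
}
```

Casting the argument to (long long) and keeping %lld would also work; PRId64 simply avoids the cast.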

rfinnie avatar May 13 '16 21:05 rfinnie

The GPU is a 32 bit processor. I haven't checked, but I'm expecting that there's a heck of a lot more work to do to get Khronos or other multimedia extension stuff up and running against a 64bit kernel than just getting userland to build.

6by9 avatar May 14 '16 07:05 6by9

It looks like the build scripts have been merged, so perhaps the issue needs to be closed?

ghost avatar Oct 30 '16 15:10 ghost

I forgot I had filed this bug actually. The build scripts @Electron752 is referring to were part of PR #347, which adds -DARM64=ON to only compile known-working 64-bit code. But the fact remains that a lot of 64-bit broken code still exists. Maybe this bug should remain open and be used to refer to work on fixing the 64-bit broken code? I'll leave that decision to the repo maintainers.

rfinnie avatar Oct 30 '16 20:10 rfinnie

I wouldn't be surprised if it's from the use of Thumb, as it's deprecated in AArch64.

IComplainInComments avatar May 26 '17 04:05 IComplainInComments

Since there have been no updates to this in a year, I'm inclined to close it. Any objections?

JamesH65 avatar Dec 05 '17 13:12 JamesH65

@JamesH65: I'm closely tracking this issue, as the reporter followed up with this question:

But the fact remains that a lot of 64-bit broken code still exists. Maybe this bug should remain open and be used to refer to work on fixing the 64-bit broken code? I'll leave that decision to the repo maintainers.

What is your take on this?

bamarni avatar Dec 05 '17 13:12 bamarni

The RPF are not putting any dev effort into a 64 bit userland, it's enough work supporting 32bit! I've no idea if that will change - it is a LOT of work I believe. So any updates will be coming from third parties, and there haven't been any posts here for a year, so presumably either no-one is actually working on it, or it's being documented elsewhere.

JamesH65 avatar Dec 05 '17 13:12 JamesH65

I'm sure there are certain applications where having a 64-bit kernel (let alone userland) may be beneficial, but I suspect the hoped-for performance improvements didn't materialise, otherwise people would be waving benchmark results at us demanding an RPi-supported aarch64 kernel.

pelwell avatar Dec 05 '17 13:12 pelwell

What is the best way to do benchmarks to post? I have a full 64-bit compile with march=armv8-a+crc and neon set in the compile, so it's pretty much optimized to the max of RPi hardware.

dracwyrm avatar Dec 16 '17 11:12 dracwyrm

You would think that Neon benchmarks would be the best ones to look at - Aarch64 Neon has double the number of Neon registers.

luked99 avatar Dec 16 '17 16:12 luked99

No, the aim is not to find something that a 64-bit kernel will excel at, but rather a benchmark or two that reflect performance for the (mythical) typical user by including a bit of everything.

pelwell avatar Dec 16 '17 17:12 pelwell

Is there such a benchmark I could use?

dracwyrm avatar Dec 17 '17 08:12 dracwyrm

Hi everyone, what's the status here?

krjw avatar Jul 12 '18 09:07 krjw

We have not been working on this, so no change.

JamesH65 avatar Jul 12 '18 09:07 JamesH65

I'm an experienced developer; if I wanted to hack on this in my spare time, where would be a good place to start? I understand if even figuring that out is more work than you guys want to put into this heh, but I figured it wouldn't hurt to ask.

rothomp3 avatar Jul 30 '18 21:07 rothomp3

I've wondered about doing this; I think it's actually relatively straightforward. There's a good chance I only think that because of a combination of ignorance and hubris though.

But anyhow, the basic problem is that there are various ARM/VideoCore interfaces around which are all designed around a 32 bit architecture on both sides. The tricky part is that there are places where the ARM side passes in a context, VC does some stuff, and then sends back a message with that context. The ARM side then does whatever it needs to do. That context is a pointer to some memory.

If you're on a 64 bit architecture, then that isn't going to work - your pointers are obviously too large.

So, what to do?

Well, I think one way is to allocate a virtual region (vma) with a suitably large size (e.g. 128MB virtual should be plenty, whatever, it's virtual so it doesn't matter). And then allocate memory in there. In the APIs, just pass the offset into this region, and on the way back, convert back to a pointer by adding the offset to the region's base address.
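
A minimal sketch of that offset scheme, assuming a single reserved region smaller than 4 GiB. All names here (region_t, region_ptr_to_offset, region_offset_to_ptr) are hypothetical and are not part of the vchiq API; the sketch only shows the pointer/offset conversion, not the allocator that would hand out memory inside the region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for the reserved virtual region. */
typedef struct {
    unsigned char *base; /* start of the region */
    size_t size;         /* size in bytes, assumed to be below 4 GiB */
} region_t;

/* Convert a pointer inside the region to a 32-bit offset.
 * Returns 0 on success, -1 if the pointer lies outside the region. */
static int region_ptr_to_offset(const region_t *r, const void *p,
                                uint32_t *out)
{
    uintptr_t addr = (uintptr_t)p;
    uintptr_t base = (uintptr_t)r->base;

    if (addr < base || addr - base >= r->size)
        return -1;
    *out = (uint32_t)(addr - base);
    return 0;
}

/* Convert an offset received back from the other side into a pointer.
 * Returns NULL if the offset does not fall inside the region. */
static void *region_offset_to_ptr(const region_t *r, uint32_t off)
{
    if (off >= r->size)
        return NULL;
    return r->base + off;
}
```

Only the 32-bit offset would cross the interface; the base address stays on the ARM side, so it can be anywhere in the 64-bit address space.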

Note that VideoCore only actually ever reads or writes at physical addresses as there is no IOMMU, so the virtual address can be anywhere, and it won't matter. However, there are certainly some code paths that will want a contiguous region (e.g. the VCHIQ circular buffers).

The place to start looking at this is in the vchiq driver - fix that and everything else will be easy (famous last words). There's a vchiq test program, so once that works you are home and dry.

For example - vchiq_service_params_struct has a void* userdata - that's an example of the problem. In vchiq_add_service_internal() it stuffs that pointer into some shared memory - that's probably where you would want to patch it up.
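
To make the userdata problem concrete, here is a small sketch of what happens when a pointer is squeezed through a 32-bit field. The struct shared_slot_t and the function userdata_round_trips are hypothetical stand-ins for illustration; they are not the real vchiq definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a slot in memory shared with a 32-bit peer. */
typedef struct {
    uint32_t userdata; /* only 32 bits travel across the interface */
} shared_slot_t;

/* Returns 1 if the pointer survives a round trip through the 32-bit
 * field, 0 if the upper bits were lost on the way. */
static int userdata_round_trips(const void *p)
{
    shared_slot_t slot;
    const void *back;

    slot.userdata = (uint32_t)(uintptr_t)p;          /* store, truncating */
    back = (const void *)(uintptr_t)slot.userdata;   /* read it back */
    return back == p;
}
```

On a 32-bit build every pointer round-trips; on a 64-bit build only pointers that happen to sit below 4 GiB do, which is why the context needs to be translated rather than stored directly.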

I think that should take care of vchiq.

There's also a shared memory driver where VC gets actual ARM-side addresses; I don't know how that can possibly work directly, probably it will require a special allocator from this same region, but I think other architectures have similar problems, so it might not be that hard to overcome.

I can't help feeling though that I've overlooked something important!

Luke

luked99 avatar Jul 30 '18 22:07 luked99

Thanks for the description!

I got most of it compiled in 64-bit by removing some assertions and changing some types from int to int32_t, for example, or removing void pointers completely. There are still tons of pointer-to-integer cast warnings that are probably critical, but I could not investigate further.

The main problem here is that the mmal and vcos code bases (in interfaces) are definitely not 64-bit compatible, for the reasons above. Another issue is that mmal has a dependency (khronos) where 32-bit assembly was used: /interface/khronos/common/khrn_int_hash_asm.s. I am not familiar enough with either 32-bit or 64-bit ARM assembly to convert this. But maybe this file could be excluded??

Well I am glad that there are more people interested in doing this!!

Greetings!

krjw avatar Jul 31 '18 08:07 krjw

If you can make your code available somewhere then I might be able to have a look.

Don't worry about mmal for now, as it requires vchiq. The vchiq kernel driver is the place to start.

luked99 avatar Jul 31 '18 10:07 luked99

I have deleted all of it because I was stuck on the assembly file. But it wasn't that much work, I just changed the CMakeLists so it would compile everything with an aarch64 compiler (just remove the if not arm64), downloaded the newest linaro toolchain from here and fixed the errors on the go.

I hope you have more time, patience and skill than I do! :)

krjw avatar Jul 31 '18 12:07 krjw

The assembly issue is trivial - there is a C implementation. Just remove:

#ifndef __arm__  // Use the version in khrn_int_hash_asm.s instead

from interface/khronos/common/khrn_int_hash.c and the reference to common/khrn_int_hash_asm.s in interface/khronos/CMakeLists.txt

popcornmix avatar Jul 31 '18 15:07 popcornmix

So as far as I can tell, you can currently use a 32-bit userland with a 64-bit kernel, and everything that I would expect to work does work (omxplayer, glmark2-es2-dispmanx, etc). Is this just a coincidence of the fact that the userland is always going to be putting a 32-bit pointer into the ostensibly 64-bit void* in that structure? Or is there more going on here?

rothomp3 avatar Jul 31 '18 16:07 rothomp3

@popcornmix I guess I missed that... @rothomp3 I never came across that kind of setup, but I haven't looked into Gentoo or Arch... For me it is really important that mmal works because I need the camera working.

To your question, I don't know but I guess that the original authors maybe didn't plan to make it 64 bit in the first place ... because they might have expected to change the VC or so anyway so it was not necessary to take precautions for 64 bit

krjw avatar Jul 31 '18 17:07 krjw

Sorry my question was really directed @luked99 heh, should have made that explicit…

rothomp3 avatar Aug 01 '18 18:08 rothomp3

Well, I'm a bit surprised at your finding, but an ounce of experience is worth a pound of theory(*)!

If you do "cat /dev/vchiq" it should give you a list of the services (I know this should be in debugfs...). If that has something sensible then it means that vchiq is working 64bit, which makes life much easier.

In that case, fixing mmal might be just a matter of patching up the structure definitions, and perhaps doing something as crude as a lookup table to map from 64 bit address to 32 bit context (ugly, but I suspect performance might well be fine). Otherwise we have to make vchiq work 64 bit but I think that should still be OK.
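
As a rough illustration of the "crude lookup table" idea, the sketch below hands out small 32-bit handles for 64-bit context pointers and resolves them again when the reply comes back. Everything here (ctx_table_t, ctx_put, ctx_take, the slot count) is hypothetical and not taken from MMAL or vchiq. It is also not thread-safe; real code would need locking around the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CTX_SLOTS 64 /* arbitrary size for this sketch */

/* Hypothetical table mapping 32-bit handles to full-width pointers.
 * Handle 0 is reserved to mean "invalid". */
typedef struct {
    void *slot[CTX_SLOTS];
} ctx_table_t;

/* Store a non-NULL context pointer and return a 32-bit handle that can
 * be sent in its place. Returns 0 if p is NULL or the table is full. */
static uint32_t ctx_put(ctx_table_t *t, void *p)
{
    uint32_t i;

    if (p == NULL)
        return 0;
    for (i = 0; i < CTX_SLOTS; i++) {
        if (t->slot[i] == NULL) {
            t->slot[i] = p;
            return i + 1;
        }
    }
    return 0;
}

/* Resolve a handle back to its pointer and free the slot.
 * Returns NULL for an invalid or already released handle. */
static void *ctx_take(ctx_table_t *t, uint32_t handle)
{
    void *p;

    if (handle == 0 || handle > CTX_SLOTS)
        return NULL;
    p = t->slot[handle - 1];
    t->slot[handle - 1] = NULL;
    return p;
}
```

The linear scan is deliberately simple; with a small, fixed number of outstanding requests it is unlikely to be the bottleneck, which matches the guess above that performance might well be fine.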

The place where it will start getting tricky is if we ever have a 64 bit Raspberry Pi with more than 1GB of physical memory - at that point I think the 32 bit VideoCore (combined with its various cache aliases) won't be able to address all of the available memory. But we're not at that point yet.
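
The 1GB figure can be worked out with a little arithmetic. The sketch below assumes the top two bits of a 32-bit VideoCore bus address select one of four cache aliases, which leaves 30 bits for the address itself. That layout is an assumption made for this illustration, and the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: bits 31:30 of a bus address pick the
 * cache alias, bits 29:0 are the actual address. */
#define VC_ALIAS_SHIFT 30u
#define VC_ADDR_MASK   ((UINT32_C(1) << VC_ALIAS_SHIFT) - 1u)

/* Which of the four aliases a bus address refers to (0..3). */
static uint32_t vc_alias(uint32_t bus)
{
    return bus >> VC_ALIAS_SHIFT;
}

/* The address with the alias bits stripped off. */
static uint32_t vc_phys(uint32_t bus)
{
    return bus & VC_ADDR_MASK;
}

/* How many bytes remain addressable once the alias bits are spent. */
static uint64_t vc_addressable_bytes(void)
{
    return (uint64_t)VC_ADDR_MASK + 1u;
}
```

Under that assumption, 2^30 bytes is exactly 1 GiB, so any physical memory above that would be out of reach of a plain 32-bit bus address.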

I'm on vacation right now so I can't really do anything other than theorize, sorry!

Luke

(*) I should use SI units, I know, sorry.

luked99 avatar Aug 01 '18 19:08 luked99

@luked99 indeed:

pi@raspberrypi3 ~> uname -a                           
Linux raspberrypi3 4.17.10-v8+ #4 SMP PREEMPT Wed Jul 25 20:35:40 EDT 2018 aarch64 GNU/Linux

and

pi@raspberrypi3 ~> cat /dev/vchiq
State 0: CONNECTED
  tx_pos=7b3b20(@000000004d94bef2), rx_pos=4ae20(@0000000005bf166a)
  Version: 8 (min 3)
  Stats: ctrl_tx_count=3142, ctrl_rx_count=3158, error_count=0
  Slots: 30 available (29 data), 0 recyclable, 0 stalls (0 data)
  Platform: 2835 (VC master)
  Local: slots 34-64 tx_pos=7b3b20 recycle=7d2
    Slots claimed:
    DEBUG: SLOT_HANDLER_COUNT = 19837(4d7d)
    DEBUG: SLOT_HANDLER_LINE = 2100(834)
    DEBUG: PARSE_LINE = 2074(81a)
    DEBUG: PARSE_HEADER = 142130712(878be18)
    DEBUG: PARSE_MSGID = 67219474(401b012)
    DEBUG: AWAIT_COMPLETION_LINE = 1369(559)
    DEBUG: DEQUEUE_MESSAGE_LINE = 1452(5ac)
    DEBUG: SERVICE_CALLBACK_LINE = 633(279)
    DEBUG: MSG_QUEUE_FULL_COUNT = 0(0)
    DEBUG: COMPLETION_QUEUE_FULL_COUNT = 0(0)
  Remote: slots 2-32 tx_pos=4ae20 recycle=69
    Slots claimed:
      14: 222/221
    DEBUG: SLOT_HANDLER_COUNT = 18864(49b0)
    DEBUG: SLOT_HANDLER_LINE = 1851(73b)
    DEBUG: PARSE_LINE = 1827(723)
    DEBUG: PARSE_HEADER = -141866216(f78b4b18)
    DEBUG: PARSE_MSGID = 67182619(401201b)
    DEBUG: AWAIT_COMPLETION_LINE = 0(0)
    DEBUG: DEQUEUE_MESSAGE_LINE = 0(0)
    DEBUG: SERVICE_CALLBACK_LINE = 0(0)
    DEBUG: MSG_QUEUE_FULL_COUNT = 0(0)
    DEBUG: COMPLETION_QUEUE_FULL_COUNT = 0(0)
Instance 0000000098cabd6b: pid 396, connected,  completions 0/128
Service 0: LISTENING (ref 1) 'KEEP' remote n/a (msg use 0/3840, slot use 0/15)
  Bulk: tx_pending=0 (size 0), rx_pending=0 (size 0)
  Ctrl: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
  Bulk: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
  0 quota stalls, 0 slot stalls, 0 bulk stalls, 0 aborted, 0 errors
  instance 0000000066bd5562
Service 1: OPEN (ref 1) 'GCMD' remote 0 (msg use 0/3840, slot use 0/15)
  Bulk: tx_pending=0 (size 0), rx_pending=0 (size 0)
  Ctrl: tx_count=1, tx_bytes=21, rx_count=1, rx_bytes=13
  Bulk: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
  0 quota stalls, 0 slot stalls, 0 bulk stalls, 0 aborted, 0 errors
  instance 0000000098cabd6b, 0/128 messages
Service 2: OPEN (ref 1) 'DISP' remote 10 (msg use 0/3840, slot use 0/15)
  Bulk: tx_pending=0 (size 0), rx_pending=0 (size 0)
  Ctrl: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
  Bulk: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
  0 quota stalls, 0 slot stalls, 0 bulk stalls, 0 aborted, 0 errors
  instance 0000000098cabd6b, 0/128 messages
Service 3: OPEN (ref 1) 'UPDH' remote 18 (msg use 0/3840, slot use 0/15)
  Bulk: tx_pending=0 (size 0), rx_pending=0 (size 0)
  Ctrl: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
  Bulk: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
  0 quota stalls, 0 slot stalls, 0 bulk stalls, 0 aborted, 0 errors
  instance 0000000098cabd6b, 0/128 messages
Service 4: OPEN (ref 1) 'TVSV' remote 35 (msg use 0/3840, slot use 0/15)
  Bulk: tx_pending=0 (size 0), rx_pending=0 (size 0)
  Ctrl: tx_count=1, tx_bytes=4, rx_count=1, rx_bytes=52
  Bulk: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
  0 quota stalls, 0 slot stalls, 0 bulk stalls, 0 aborted, 0 errors
  instance 0000000098cabd6b, 0/128 messages
Service 5: OPEN (ref 1) 'TVNT' remote 43 (msg use 0/3840, slot use 0/15)
  Bulk: tx_pending=0 (size 0), rx_pending=0 (size 0)
  Ctrl: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
  Bulk: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
  0 quota stalls, 0 slot stalls, 0 bulk stalls, 0 aborted, 0 errors
  instance 0000000098cabd6b, 0/128 messages
Service 6: OPEN (ref 1) 'CECS' remote 51 (msg use 0/3840, slot use 0/15)
  Bulk: tx_pending=0 (size 0), rx_pending=0 (size 0)
  Ctrl: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
  Bulk: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
  0 quota stalls, 0 slot stalls, 0 bulk stalls, 0 aborted, 0 errors
  instance 0000000098cabd6b, 0/128 messages
Service 7: OPEN (ref 1) 'CECN' remote 59 (msg use 0/3840, slot use 0/15)
  Bulk: tx_pending=0 (size 0), rx_pending=0 (size 0)
  Ctrl: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
  Bulk: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
  0 quota stalls, 0 slot stalls, 0 bulk stalls, 0 aborted, 0 errors
  instance 0000000098cabd6b, 0/128 messages
Service 8: OPEN (ref 1) 'ILCS' remote 9 (msg use 0/3840, slot use 0/15)
  Bulk: tx_pending=0 (size 0), rx_pending=0 (size 0)
  Ctrl: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
  Bulk: tx_count=0, tx_bytes=0, rx_count=0, rx_bytes=0
  0 quota stalls, 0 slot stalls, 0 bulk stalls, 0 aborted, 0 errors
  instance 0000000098cabd6b

So it looks to me like the kernel side of this is already taken care of?

rothomp3 avatar Aug 01 '18 19:08 rothomp3

@luked99 Don't worry about vcsm at the moment. There's a new version in the pipeline that replaces the reloc heap with CMA allocations made on behalf of the VPU and mem_wrapped into a MEM_HANDLE_T.

There's also a V4L2 codec driver in progress, so that reduces MMAL to only being required for a couple of tasks.

6by9 avatar Aug 03 '18 08:08 6by9

Hi, is there any update on this? I managed to compile it all without errors. However, I believe mmal does not work. Is there a way to make it work?

alepi avatar Jan 11 '19 11:01 alepi

I also found GLES/EGL have issues... the first time I call eglSwapBuffers it works, but after a few frames I get a segfault. This does not happen on a 32-bit build. Any idea?

alepi avatar Jan 22 '19 20:01 alepi

We are not currently doing any dev work on 64bit builds, and don't use them in house, so I'm afraid I have no idea about the EGL issue. Without any sort of details on the fault it will also be very difficult to determine the cause of the issue.

JamesH65 avatar Jan 23 '19 09:01 JamesH65

The firmware GLES / EGL drivers will never be updated for 64 bit systems - please use the vc4 KMS drivers instead (those should already support 64 bit).

OpenMax IL is very unlikely to get any 64bit love - it's a hideous API to work with, and MMAL offers better functionality.

MMAL still needs some work, and that is the one bit that may be tackled. The camera can already be accessed via V4L2 which should be supported on 64bit systems. The codecs can now be accessed via V4L2 using the 4.19 branch. With two further patches that I have I think that too should be able to support 64bit systems. That covers the main use cases for MMAL, but using it directly does allow some more efficient pipelines to be created.

vcsm is being rewritten.

That should cover the majority of the userland code.

6by9 avatar Jan 23 '19 09:01 6by9