IPC issues
- Right now the IPC `allow()` call can succeed, and a notify callback can be triggered, even if the `expose_to()` call fails. This may be fixed in the 2.0 upgrades.
- `client_notify()` can be used to determine how many other processes are on the board. By iterating through service IDs, the error return values leak which apps are present, even if those apps never set up any IPC services.
- Calling `client_notify()` on a different process causes the callee's process to allocate the grant region, consuming application memory that app never intended to spend.
- Currently, scheduling a notify callback may fail for a variety of reasons (the callback queue is full, an allocation fails, the calling process has died, etc.), and neither the caller nor the callee is made aware of the failure (see the TODO in kernel/src/sched.rs).
- Buffers whose length is not a power of two will fail silently on ARM.
- It seems to be the case that App A faulting does not reset the MPU configuration of app B, even if app A shared a buffer with app B. (I haven't verified this experimentally, just tried to read through the fault code -- Hudson)
If we re-implement IPC to use message passing rather than shared memory, we can revisit the MPU implementation (re: #1532).
I was playing with the IPC API and noticed that applications fault due to memory protection if the shared buffers are not a multiple of 64 bytes in size. The buffer's memory region does not seem to be mapped.
I cannot figure out the reason for this; all of the API calls return success.
If this is a requirement, maybe the kernel should check for it and refuse non-conforming buffers?
While debugging tock/tock#2906 I noticed a more intricate issue involving app load order: what appears to be a racy relationship between the IPC service registering its callback and the client notifying the service. This is in the context of a failing rot13_{service,client} CI test. While debugging, I changed the apps' order in the TBF list and inserted printf statements. With `rot13_service` and `rot13_client` installed in this order, the normal (good) case seems to be:
1. `rot13_client` performs initialization and all of its IPC calls, including `ipc_notify_service`, in one go, which enqueues a task in the target process' queue.
2. `rot13_service` performs initialization and the call to `ipc_register_service_callback`, which registers its callback.
3. The kernel processes the task enqueued in (1), which calls into `IPC::schedule_upcall` and ultimately schedules the proper upcall on the `rot13_service` process.
4. The IPC service callback scheduled in (3) is invoked by the kernel scheduler.
Now, if the kernel were to execute (3) before (2), the IPC task is converted into a callback (`FunctionCall`) before the service has had a chance to subscribe, and this function call is dropped because it is a null callback. Even if it weren't scheduled, this pending upcall would be removed from the service's queue as soon as the service subscribes. One can trigger this behavior through a combination of load order, app behavior, and scheduler behavior, but it is most obvious when inserting a simple printf statement at the beginning of the service's main. With the printf in place, notifying the service silently fails if the client is started first, whereas if the service is started first it will likely succeed, depending on scheduler behavior. Here is an example system call trace of this happening (the service prints "Rot13 service!"):
System call trace:
- Process 0: rot13_service
- Process 1: rot13_client
Verilated LiteX+VexRiscv: initialization complete, entering main loop.
[1] function_call @0x88088(0x88060, 0x40010000, 0x173c, 0x40010000)
[1] memop(0, 0x40010a40) = Success
[1] memop(10, 0x40010800) = Success
[1] memop(11, 0x40010a40) = Success
[1] read-only allow(0x10000, 0, @0x8ae5c, 0x19) = AllowReadOnlySuccess(0x0, 0)
[1] cmd(0x10000, 1, 0x0, 0x0) = SuccessU32(0)
[1] read-only allow(0x10000, 0, @0x0, 0x0) = AllowReadOnlySuccess(0x8ae5c, 25)
[1] remove_pending_upcalls[0x10000:0] = 0 upcall(s) removed
[1] subscribe(0x10000, 0, @0x88108, 0x40010a00) = SubscribeSuccess(0x0, 0)
[1] read-write allow(0x10000, 0, @0x40010a00, 0x40) = AllowReadWriteSuccess(0x0, 0)
[1] cmd(0x10000, 2, 0x0, 0x0) = Success
[1] yield. which: 1
[0] function_call @0x80088(0x80060, 0x40008000, 0x16dc, 0x40008000)
[0] memop(0, 0x400089cc) = Success
[0] memop(10, 0x40008800) = Success
[0] memop(11, 0x400089cc) = Success
[0] memop(1, 0x0) = SuccessU32(1073777100)
[0] memop(1, 0x1b4) = SuccessU32(1073777100)
[0] memop(1, 0x408) = SuccessU32(1073777536)
[0] memop(1, 0x18) = SuccessU32(1073778568)
[0] memop(1, 0x18) = SuccessU32(1073778592)
[0] read-only allow(0x1, 1, @0x40008fa8, 0xf) = AllowReadOnlySuccess(0x0, 0)
[0] remove_pending_upcalls[0x1:1] = 0 upcall(s) removed
[0] subscribe(0x1, 1, @0x81d14, 0x0) = SubscribeSuccess(0x0, 0)
[0] cmd(0x1, 1, 0xf, 0x0) = Success
Rot13 service!
[0] schedule[0x1:1] @0x81d14(0xf, 0x0, 0x0, 0x0) = Ok(())
[0] yield. which: 1
[0] schedule[0x10000:0] @0x0(0x1, 0x40, 0x40010a00, 0x0) = Ok(())
[0] function_call @0x81d14(0xf, 0x0, 0x0, 0x0)
[0] read-only allow(0x10000, 0, @0x81e08, 0x19) = AllowReadOnlySuccess(0x0, 0)
[0] cmd(0x10000, 1, 0x0, 0x0) = SuccessU32(0)
[0] read-only allow(0x10000, 0, @0x0, 0x0) = AllowReadOnlySuccess(0x81e08, 25)
[0] remove_pending_upcalls[0x10000:0] = 0 upcall(s) removed
[0] subscribe(0x10000, 0, @0x80108, 0x0) = SubscribeSuccess(0x0, 0)
[0] yield. which: 1
We currently have no way to tell whether notifying an IPC service (i.e., scheduling a callback in the service's process) actually worked. While the source code used to suggest that we might have a way of identifying a null callback (`EOFF`), `Process::enqueue_task` does not actually tell us whether the enqueued IPC task will result in a callback being scheduled or not. This usually makes sense, and is a logical consequence of an architecture in which a null callback is treated exactly the same as a proper "do nothing" callback, but here it is not good.
To add something to point (6) of the list of current IPC issues: restarting an application resets `app_break`, `kernel_memory_break`, and `allow_high_water_mark`. The result is that an application (A) can:
- Make a very large `malloc()`, causing a `brk` system call to claim all remaining grant/heap memory for itself.
- Allow this buffer to another process (call it B).
- Restart itself in place (e.g., by faulting), which resets the watermarks.
- Make syscalls to capsules, causing grant allocations that exceed the total allocated in its previous life.
This will result in application B having direct access to kernel data structures. Spicy. This can be leveraged even from safe userspace Rust.