Ripes icon indicating copy to clipboard operation
Ripes copied to clipboard

[External Peripherals] Define memory read/write protocol

Open mortbopet opened this issue 2 years ago • 23 comments

A protocol should be defined for how memory read/writes from the Ripes memory subsystem sent to/from the connected peripheral server.

Memory requests from the processor arrive from in the following function: https://github.com/mortbopet/Ripes/blob/daad80f7de5c82a8662ecefb02150770619fc311/src/io/ioexternalbus.h#L36-L37

mortbopet avatar Oct 26 '21 20:10 mortbopet

I usually prefer to work with binary protocols because of the ease of decoding a datagram using only structs and using less processing (in C or C++). On the other hand protocols in text format (JSON for example) are easier to visualize/debug and more intuitive, to create programs in python for example. The first thing to define is what kind of protocol will be used. Both types can be supported at the same time, but we need to start with one. As the number of commands are small, a data packet starting with the command number (8bits) followed by the packet size and ending with payload can be the initial path.

Initial table of commands: 
1 - Client query for list of available peripherals
2 - Client Write to server  (Requested by processor Write to peripheral )
3 - Client Read from server (Requested by processor Read from peripheral )
4 - Client query server about available DMA operation (it has to be checked periodically)
5 - Client Write to server DMA (Requested by peripheral Read from memory  ) 
6 - Client Read form server DMA (Requested by peripheral Write to memory ) 

It also has to be defined whether all commands need to have an answer and wait for it. Is it necessary to synchronize the simulation between Ripes and the server (send a packet at each simulation step)?

lcgamboa avatar Oct 30 '21 02:10 lcgamboa

That set of commands looks good to me. With regards to blocking, i'd say that we can assume that writes are non-blocking and reads are blocking. For the initial support for this, I don't think we need to consider synchronization of accesses to the simulator from i.e. multiple different peripherals and the simulator itself. There might be issues here if we're considering large systems with many peripherals, but off the top of my mind i think synchronization support needs to happen in the simulator memory and probably won't have an effect on how we define the communication protocol.

Do you have experience with capnproto or protobuf's? I think our use case fits exactly into the definition of what those projects are good at, so might aswell piggy back off of their work!

mortbopet avatar Oct 31 '21 13:10 mortbopet

I know but have never used protobuf's. I usually use more bare-metal solutions because I'm used to developing for embedded systems. I would use something like:

typedef struct{
uint32_t msg_type;
uint32_tpayload_size;
} cmd_header;

typedef struct{
uint32_t addr;
uint32_t nbytes;
} data_header; 

1 - Client query for list of available peripherals

send cmd_header read cmd_header payload

2 - Client Write to server (Requested by processor Write to peripheral )

send cmd_header data_header+payload read cmd_header

3 - Client Read from server (Requested by processor Read from peripheral )

send cmd_header read cmd_header data_header+payload

4 - Client query server about available DMA operation (it has to be checked periodically)

send cmd_header read cmd_header payload

5 - Client Write to server DMA (Requested by peripheral Read from memory )

send cmd_header data_header+payload read cmd_header

6 - Client Read form server DMA (Requested by peripheral Write to memory )

send cmd_header read cmd_header data_header+payload

This way, any language can be used to develop the server without the need to use a specific library (capnproto or protobuf's). Choosing a library makes development easier, but it forces the server to use it too and with the same version.

lcgamboa avatar Oct 31 '21 18:10 lcgamboa

I have implemented a protocol for testing using QTcpSocket (because Ripes already uses it). The problem is that QTcpSocket is asynchronous, I haven't found a way to make it work synchronously (to implement ioRead). Actually using the waitForReadyRead method works but with problems, the QT documentation also advises against using the method. For me the simplest way out would be to use BSD and winsocket2 sockets that I'm already familiar with. @mortbopet any suggestion?

lcgamboa avatar Nov 07 '21 02:11 lcgamboa

I think the best way to solve this is to encapsulate the communication with the server via a future that is generated by a class that handles communication with the server. Haven't checked whether the below compiles but something like it:

// Packet datatypes; could (should) be defined by an external library such as 
// protobufs.

struct ReadResult {
  // ...
}

struct ReadRequest {
  // ...
}

struct QueryResult {
  // ...
}

struct QueryRequest {
  .// ...
}

// ...

class ExternalPeripheralClient {
  ExternalPeripheralClient() {
    tcpSocket = ...
    connect(&tcpSocket, QTcpSocket::readyRead, this, &ExternalPerihperalClient::onReadyRead);
  }


  // The above could (should) be generalized through a template that handles each
  // of the types of packets we expect to receive.
  template<typename TRequest, typename TResult>
  QFuture<TResult> readFromServer(TRequest req) {
    // Todo: might be able to simplify the locks since each Receiver object is
    // a singleshot receiver.
    struct Receiver {
      Q_OBJECT
      void onReceive(TResult result) {
        bufferLock.lock();
        buffer = result;
        waitCond.wakeAll();
        bufferLock.unlock();
      }
      // Since we may have a race condition here between sending the request and
      // getting the result, there needs to be a periodic timeout to check whether
      // the buffer has a value.
      TResult getResult() {
        waitLock.lock();
        while(true) {}
          bufferLock.lock();
          if(!buffer.hasValue()) {
            bufferLock.unlock();
            waitCond.wait(&waitLock, 1000);
          } else {
            // Buffer has value; at this point we still have the buffer lock
            // which we need to safely read the buffer.
            break;
          }
        }
        TResult res = buffer.value();
        // The unlocks aren't really needed since the receiver will get destroyed
        // after getResult returns.
        bufferLock.unlock();
        waitLock.unlock();
        return res;
      }
      std::optional<TResult> buffer;
      QMutex waitLock, bufferLock;
      QWaitCondition waitCond;
    };
    return QtConcurrent::run(this, [] {
      Receiver receiver;
      // Ensure only one read request is sent at a time.
      readLock.lock();
      // Connect to the handler of the result. The connection will get cleaned
      // up when the receiver is destroyed.
      connect(this, &ExternalPeripheralClient::handleResult<TResult>, &receiver, &Receiver::onReceive);
      // Send the request.
      sendRequest(request);
      // Wait for the result.
      return receiver.getResult();
    });
  }


signals:
  // Signal handlers for each possible result data type. These will be connected to whenever
  // a request goes out for the given packet.
  void handleResult(ReadResult);
  void handleResult(QueryResult);

private:
  template<typename TRequest>
  void sendRequest(const TRequest& request) {
    // encode the request into a byte array and send it to the server.
  }

  void onReadyRead() {
    lock.lock();
    // read from socket into buffer
    // try decoding the buffer into one of the data structures of the protocol.

    lock.unlock();
  }


  QMutex readLock;
  QByteArray buffer;
  QTcpSocket tcpSocket;
}


class ExternalPeripheralBus {

  ExternalPeripheralBus() {
    client = ...
  }


  VInt ioRead(addr, width) {
    auto readResFuture = client->readFromServer(...);
    readResFuture.wait();
    return readResFuture.value();
  }

  std::unique_ptr<ExternalPeripheralClient> client;
}

Nevertheless, I'm still not quite convinced that we should be sending around raw structs by ourselves, both for maintainability and compatibility reasons; i'd really like to see us leverage some of the many good libraries out there for this kinds of stuff. Using protobuffers or CapNProto again should fit easily into working with a QByteArray https://stackoverflow.com/a/57286003/6714645 .

mortbopet avatar Nov 07 '21 16:11 mortbopet

I've come across some solutions like yours using QFuture, but I found it too cumbersome to use thread synchronization just to replace what a simple blocking call to the standard recv function does. I've seen a solution that uses QTcpSocket socket descriptor directly in standard socket operation functions like recv, I'll try to implement. As for protobuffer libraries, the problem would not be for Ripes, but for the server. For example, it is simple to implement a server that receives C structures in an esp32 that is connected, for example, to a real LCD and use it as a peripheral in Ripes. To compile a library of protobuffers for use on an embedded platform increases complexity, increases resource utilization and decreases performance just to make coding easier.

lcgamboa avatar Nov 07 '21 16:11 lcgamboa

but I found it too cumbersome to use thread synchronization just to replace what a simple blocking call to the standard recv function does

Hmm, my take on this is that network programming is inherently asynchronous, and that complexity is (in a modern, OOP context) conquered best when using asynchronous building blocks, such as futures. Especially when we need to ensure compatibility across all platforms.

For example, it is simple to implement a server that receives C structures in an esp32 that is connected

So what you are suggesting here is to run a peripheral server on an actual hardware device with real peripherals? I was just thinking about the PICSimLab usecase, but that indeed sounds very interesting...! If so, i'd agree, using something like protobuffers might be too heavy for a microcontroller, and flat structs is a better way forward. My main concern here is just to nail the implementation of the protocol header definition, endianness, struct padding etc. so we ensure compatibility not only across clients (linux, windows, mac) but also with our targets (PICSimLab, microcontrollers, ...).

I think that once we've gotten a bit of prototyping done and things are up and running, it would be due diligence to move whatever code we have for defining the protocol to a separate repository, so it can be pulled in as a dependency (instead of keeping the protocol here in Ripes).

mortbopet avatar Nov 07 '21 22:11 mortbopet

I created an XTcpSocket class (using normal sockets with blocking) to replace the QTcpSocket class, in the first tests it is working without problems on Linux. After testing it on windows I'll put the code in github.
The only thing that didn't work was updating the Exports information in the external bus tab. (in the figure )

remote_ripes

This example just writes to an address of the server IO, reads the value again and increments the value with each interaction (ping-pong).

The protocol is as simple as possible, basically a field of message ID and payload size.

As I don't think there is any interrupt support in Ripes yet, it doesn't make much sense to me implement DMA before the interrupts support.

As for the idea of using real hardware, I find it interesting didactically. Lab classes can use Ripes to simulate the CPU (which can be upgraded to new models) and use the didactic kits running a server (TCP or UART). Even the processors of the kits being outdated, normally the peripherals are not (Displays, keyboards, LEDs...) . In my school, didactic kits are almost never updated, and from what I see of the amount of PICSimLab users who use the PIC16F programming in assembly, I believe it is a global problem for schools. Schools use what is available because hardware is usually expensive .

lcgamboa avatar Nov 09 '21 01:11 lcgamboa

Great progress! Feel free to submit PRs during development whenever atomic features are locked in... makes it easier to review compared to a big monolithic PR... (and i'm also interested in seeing the code behind this! :wink:).

mortbopet avatar Nov 09 '21 17:11 mortbopet

The first integration tests of Ripes with PICSimLab were very satisfactory. I have implemented two 16-bit GPIO ports, so it is possible to access almost all PICSimLab devices. But going forward I will implement SPI and I2C controllers to be able to control all types of devices through Ripes. There are some adjustments left, as soon as I finish the test version of PICSimLab I will let you know.

@mortbopet one question, the mmio accesses in Ripes are always 32 bits (address increment +4)?

Ripes_PICSimLab

lcgamboa avatar Nov 20 '21 02:11 lcgamboa

That looks very cool! I'm really excited about the progress here.

the mmio accesses in Ripes are always 32 bits (address increment +4)?

memory address instructions should be byte-addressable. I.e. the following code:

li a0 LED_MATRIX_0_BASE
li a1 LED_MATRIX_0_WIDTH
li a2 LED_MATRIX_0_HEIGHT

loop:
     lb x0 0 a0
     lb x0 1 a0
     lb x0 2 a0

should trigger VInt IOLedMatrix::ioRead(AInt offset, unsigned size) with arguments

0, 1
1, 1
2, 1

If that is not the case, then i suspect something is going wrong.

mortbopet avatar Nov 20 '21 10:11 mortbopet

I asked initially because all peripherals already implemented in Ripes have addresses incremented from 4 to 4. As in the case of the D-Pad which has 1 bit and each pad is addressed with a multiple offset of 4. But the size parameter of the ioRead method always has size 4 for 32 bits and 8 for 64 bits. For each reading independent of size, 4 addresses for 64 bits and 6 addresses for 32 bits are read. Maybe because of the cache?

test program:

li a0 EXTERNAL_BUS_0_DIRA
loop:
         lui a5, 0x12345
         addi a5, a5 , 0x678
         sb a5, 0, a0
         sh a5, 0, a0
         sw a5, 0, a0
         #sd a5, 0, a0
         lb t0 0 a0
         lb t1 1 a0
         lb t2 2 a0
         lb t3 3 a0
         lh t4 0 a0
         lh t5 2 a0 
         lw t6 0 a0
         #ld x17 0 a0
        j       loop

ioRead and ioWrite calls in program single cycle 32bits:

	 sb a5, 0, a0
ioRead  offset=0x4 size=0x4
ioWrite offset=0x4 size=0x1  value=0x12345678
ioRead  offset=0x4 size=0x4
ioRead  offset=0x0 size=0x4
ioRead  offset=0x4 size=0x4
ioRead  offset=0x8 size=0x4
ioRead  offset=0xc size=0x4

         sh a5, 0, a0
ioRead  offset=0x4 size=0x4
ioWrite offset=0x4 size=0x2  value=0x12345678
ioRead  offset=0x4 size=0x4
ioRead  offset=0x0 size=0x4
ioRead  offset=0x4 size=0x4
ioRead  offset=0x8 size=0x4
ioRead  offset=0xc size=0x4

         sw a5, 0, a0
ioRead  offset=0x4 size=0x4
ioWrite offset=0x4 size=0x4  value=0x12345678
ioRead  offset=0x4 size=0x4
ioRead  offset=0x0 size=0x4
ioRead  offset=0x4 size=0x4
ioRead  offset=0x8 size=0x4
ioRead  offset=0xc size=0x4

         lb t0 0 a0
ioRead  offset=0x4 size=0x4
ioRead  offset=0x5 size=0x4
ioRead  offset=0x0 size=0x4
ioRead  offset=0x4 size=0x4
ioRead  offset=0x8 size=0x4
ioRead  offset=0xc size=0x4

         lb t1 1 a0
ioRead  offset=0x5 size=0x4
ioRead  offset=0x6 size=0x4
ioRead  offset=0x0 size=0x4
ioRead  offset=0x4 size=0x4
ioRead  offset=0x8 size=0x4
ioRead  offset=0xc size=0x4

         lb t2 2 a0
ioRead  offset=0x6 size=0x4
ioRead  offset=0x7 size=0x4
ioRead  offset=0x0 size=0x4
ioRead  offset=0x4 size=0x4
ioRead  offset=0x8 size=0x4
ioRead  offset=0xc size=0x4

         lb t3 3 a0
ioRead  offset=0x7 size=0x4
ioRead  offset=0x4 size=0x4
ioRead  offset=0x0 size=0x4
ioRead  offset=0x4 size=0x4
ioRead  offset=0x8 size=0x4
ioRead  offset=0xc size=0x4

         lh t4 0 a0
ioRead  offset=0x4 size=0x4
ioRead  offset=0x6 size=0x4
ioRead  offset=0x0 size=0x4
ioRead  offset=0x4 size=0x4
ioRead  offset=0x8 size=0x4
ioRead  offset=0xc size=0x4

         lh t5 2 a0 
ioRead  offset=0x6 size=0x4
ioRead  offset=0x4 size=0x4
ioRead  offset=0x0 size=0x4
ioRead  offset=0x4 size=0x4
ioRead  offset=0x8 size=0x4
ioRead  offset=0xc size=0x4

         lw t6 0 a0
ioRead  offset=0x4 size=0x4

ioRead and ioWrite calls in program single cycle 64bits:

 	 sb a5, 0, a0
ioRead  offset=0x4 size=0x8
ioWrite offset=0x4 size=0x1  value=0x12345678
ioRead  offset=0x4 size=0x8
ioRead  offset=0x0 size=0x8
ioRead  offset=0x8 size=0x8

         sh a5, 0, a0
ioRead  offset=0x4 size=0x8
ioWrite offset=0x4 size=0x2  value=0x12345678
ioRead  offset=0x4 size=0x8
ioRead  offset=0x0 size=0x8
ioRead  offset=0x8 size=0x8

         sw a5, 0, a0
ioRead  offset=0x4 size=0x8
ioWrite offset=0x4 size=0x4  value=0x12345678
ioRead  offset=0x4 size=0x8
ioRead  offset=0x0 size=0x8
ioRead  offset=0x8 size=0x8

         sd a5, 0, a0
ioRead  offset=0x4 size=0x8
ioWrite offset=0x4 size=0x8  value=0x12345678
ioRead  offset=0x4 size=0x8
ioRead  offset=0x0 size=0x8
ioRead  offset=0x8 size=0x8


         lb t0 0 a0
ioRead  offset=0x4 size=0x8
ioRead  offset=0x5 size=0x8
ioRead  offset=0x0 size=0x8
ioRead  offset=0x8 size=0x8

         lb t1 1 a0
ioRead  offset=0x5 size=0x8
ioRead  offset=0x6 size=0x8
ioRead  offset=0x0 size=0x8
ioRead  offset=0x8 size=0x8

         lb t2 2 a0
ioRead  offset=0x6 size=0x8
ioRead  offset=0x7 size=0x8
ioRead  offset=0x0 size=0x8
ioRead  offset=0x8 size=0x8

         lb t3 3 a0
ioRead  offset=0x7 size=0x8
ioRead  offset=0x4 size=0x8
ioRead  offset=0x0 size=0x8
ioRead  offset=0x8 size=0x8


         lh t4 0 a0
ioRead  offset=0x4 size=0x8
ioRead  offset=0x6 size=0x8
ioRead  offset=0x0 size=0x8
ioRead  offset=0x8 size=0x8

         lh t5 2 a0
ioRead  offset=0x6 size=0x8
ioRead  offset=0x4 size=0x8
ioRead  offset=0x0 size=0x8
ioRead  offset=0x8 size=0x8 

         lw t6 0 a0
ioRead  offset=0x4 size=0x8
ioRead  offset=0x4 size=0x8
ioRead  offset=0x0 size=0x8
ioRead  offset=0x8 size=0x8

         ld x17 0 a0
ioRead  offset=0x4 size=0x8

lcgamboa avatar Nov 20 '21 20:11 lcgamboa

Despite the problem with the size of the reading generated by Ripes (explained in the previous message), it is possible to test with PICSimLab. I published the changes for integration with Ripes in the experimental version which can be downloaded here.

Integration can be easily tested using the code below in Ripes and adding LEDs on port A and switches on port B of PICSimLab. This code only reads the switches and shows the result on the LEDs, it's not very useful but it's a good test.

li a0 EXTERNAL_BUS_0_PORTA
li a1 EXTERNAL_BUS_0_DIRA
li a2 EXTERNAL_BUS_0_PORTB
li a3 EXTERNAL_BUS_0_DIRB

        sh x0 0 a1
        li t0 0xFFFF
        sh t0 0 a3
loop:        
        lh t0 0 a2
        sh t0 0 a0 
        j       loop

example

Obs: Apparently the Appimage format version doesn't work directly on newer Linux versions, it is necessary to specify the libgmodule library path. LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libgmodule-2.0.so ./PICSimLab-0.8.9_211121_experimetal-x86_64.AppImage

lcgamboa avatar Nov 22 '21 12:11 lcgamboa

Regarding the multiple memory reads, i think this is partially due to the cache graphics, and partially due to the fact that the memory is read here in the data memory component on each clock cycle when saving memory transactions. This is done regardless of whether there is a read/write enable signal activated (bad design on behalf of VSRTL). I'll go ahead and implement a fix which should hopefully remove some of these redundant memory reads.

mortbopet avatar Nov 22 '21 12:11 mortbopet

I did a test using an LCD diplay and it worked without problems. It would be nice if there was a cursor in the C source code to follow the execution. I know it is possible to get the relationship between the PC and source code line number from the debug information but I have no idea how difficult it is to implement this. An automatic code formatting option would also be helpful. These are just ideas to be implemented in the future.

ripes_lcd

lcgamboa avatar Nov 25 '21 02:11 lcgamboa

That looks very good - what's the 'clock rate' you achieve when running the processor in the above program? Or more specifically, how bad is the overhead of the external peripheral (network I/O, ...)?

It would be nice if there was a cursor in the C source code to follow the execution. I know it is possible to get the relationship between the PC and source code line number from the debug information but I have no idea how difficult it is to implement this.

What we'd be looking at here is to use the DWARF information containing within an ELF file compiled in debug format, and use this as a source mapping. I remember looking into it when i did the compiler support, but put it on hold at the time due to not being able to find a good open-source library that we could depend on for parsing the DWARF information (... and i wasn't in the mood to create one myself!).

An automatic code formatting option would also be helpful.

Indeed! In general, these ideas are very relevant, so don't hesitate to create new issues for them - then, once time permits, me, you or new people can go ahead and pick them up.

mortbopet avatar Nov 25 '21 08:11 mortbopet

That looks very good - what's the 'clock rate' you achieve when running the processor in the above program? Or more specifically, how bad is the overhead of the external peripheral (network I/O, ...)?

PICSimLab currently updates every 100ms and is configured to work synchronously. I need to make some adjustments to work better asynchronously. Initially it is reaching a frequency of 1kHz of IO. With the adjustments I believe it will improve. But it's already reasonable for use.

lcgamboa avatar Nov 26 '21 17:11 lcgamboa

I made the optimizations in PICSimLab. I tested it with a program that only writes to the peripheral (one writing and 6 readings actually) Without connection the program runs in Ripes at 24Khz on my computer, connected to PICSimLab the program runs at 12Khz. I found it very satisfactory, I had to implement a Tick Timer peripheral to be able to mark the time of the LCD example delays due to the speed increase.

lcgamboa avatar Dec 01 '21 02:12 lcgamboa

That sounds like a perfectly acceptable rate if we're doing some IPC as well.

Let me know when you'd like to have the PR reviewed, and i'll start going through it :+1:.

mortbopet avatar Dec 01 '21 08:12 mortbopet

There are still some crash issues when one of the programs finishes without disconnecting. As soon as I solve this I'll let you know.

lcgamboa avatar Dec 01 '21 17:12 lcgamboa

@mortbopet I did several tests and got some random crashes, maybe your review will help.

lcgamboa avatar Dec 11 '21 00:12 lcgamboa

Hi @mortbopet , I added in the PICSimLab examples two code examples to use the Ripes integration with PICSimLab (here) An example of LCD in C and a simple to read keys and drive LEDs in assembly.

LED_SW This same simple example I ran in the POC using real hardware (ESP32) :

https://user-images.githubusercontent.com/13749258/156271028-d6536b52-cf5c-44db-b6eb-6e8ebeca0528.mp4

Although I was far from the WIFI router it worked satisfactorily with some hookups. I believe that a version with a serial connection instead of TCP is more suitable to work with a wired real harware like an arduino Uno (using the same protocol ).

ESP32 code: RemoteServer.ino.txt

As for the integration with PICSimLab, in your opinion is the actual operation satisfactory or is there any point that needs to be improved for this initial version?

lcgamboa avatar Mar 02 '22 00:03 lcgamboa

It's very cool to see this working on the real hardware! Before merging these things, what's your thinking on extracting the protocol-related code to a separate repository? This would hopefully improve how to describe code in either project (Ripes/PICSim/ESP).

mortbopet avatar Mar 04 '22 08:03 mortbopet