kaitai_struct Any port for plain C?

Open Zorgatone opened this issue 6 years ago • 103 comments

Hi, I would like to know if you would consider (or have any plans already) to port the project for use with "plain" C (other than C++ and C#). I would use it, and not all the systems (even embedded maybe?) support C++ and/or C#. Having a C version would enable portability on any system and even more languages with C bindings

Sep 25 '17 09:09 Zorgatone

You're completely correct, C port has been in heavy discussion since almost the very beginning of the project, yet nobody ever created an issue about it (and that's bad, because it's hard to collect all these discussions in one place).

There are/were several major issues with a C target, though. This became a somewhat lengthy review of what's been discussed over the years, but I believe I've remembered most of the points and tried to order them from most serious to least serious.

Completely different workflow in mind

It turns out that most people who need C support in KS have a completely different workflow in mind than what KS provides now. Right now, KS does a very simple thing: it takes a binary format serialization spec and generates an API around it. It usually does zero transformations, except for very simple and technical ones (i.e. endianness and that kind of stuff): whatever's in the format is reflected exactly as is in memory. C people usually strive for performance and memory efficiency and would prefer not to store stuff that can be used right away and then just thrown out.

A very simple example:

seq:
  - id: len_foo
    type: u2
  - id: foo
    size: len_foo
    type: str

This is usually ok for many modern languages, but a lot of people who wanted C target automatically suggest that:

len_foo must not be stored in the structures that KS generates in memory at all — it must be used once during the parsing and then just thrown away
Given that we're talking about "string" data type, why not convert it into "pure C string", as most C stdlib functions expect it to be — i.e. no length information, just a zero byte termination. Of course, this also implies that (1) we'll need to allocate one more byte than len_foo for that zero byte, (2) we need to actually put that zero byte into a string (although it's clearly not existing in the stream), (3) we're silently agreeing on dealing with zero-terminated strings, i.e. foo could never contain a zero inside it.
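To make the three consequences above concrete, here is a minimal sketch of what such a "C people" reader might look like. It is not KS output: the function name read_len_prefixed_cstr is hypothetical, and since the spec above does not state an endianness, the sketch assumes len_foo is little-endian.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a KS API): parse a 2-byte length prefix
 * (assumed little-endian) followed by that many bytes, and return a
 * freshly allocated zero-terminated copy of the payload.
 * Returns NULL on short input, on allocation failure, or when the
 * payload contains an embedded zero byte (consequence (3) above). */
char *read_len_prefixed_cstr(const uint8_t *buf, size_t buf_len)
{
    assert(buf != NULL);
    if (buf_len < 2)
        return NULL;

    /* len_foo lives only in this local variable; it is never stored. */
    size_t len_foo = (size_t)buf[0] | ((size_t)buf[1] << 8);
    if (buf_len - 2 < len_foo)
        return NULL;

    const uint8_t *payload = buf + 2;
    if (memchr(payload, 0, len_foo) != NULL)
        return NULL; /* a C string would be silently truncated here */

    char *foo = malloc(len_foo + 1);   /* (1) one extra byte */
    if (foo == NULL)
        return NULL;
    memcpy(foo, payload, len_foo);
    foo[len_foo] = '\0';               /* (2) terminator absent from the stream */
    return foo;
}
```

The caller owns the returned string and releases it with free(). Rejecting embedded zero bytes is only one possible policy; a real target would have to decide between rejecting, truncating, or falling back to a length-aware type.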

A more complex (and real-life) example is typical parsing of a network packet, for example a udp_datagram. The typical current vision of what KS might generate is something like this:

typedef struct udp_datagram_t {
  uint16_t src_port;
  uint16_t dst_port;
  uint16_t length;
  uint16_t checksum;
  char* body;
} udp_datagram_t;

udp_datagram_t* read_udp_datagram(kaitai_stream* io) {
  udp_datagram_t* r = (udp_datagram_t*) malloc(sizeof(udp_datagram_t));

  r->src_port = read_u2be(io);
  r->dst_port = read_u2be(io);
  r->length = read_u2be(io);
  r->checksum = read_u2be(io);
  r->body = read_byte_eos(io);

  return r;
}

It turns out that many users would be more comfortable with a completely different mechanism than "read function just fills in some structures in memory and returns a pointer to them":

Some would want certain callbacks to be called every time an attribute is "parsed", and do not need it to be stored in single memory structure at all, i.e.:

void read_udp_datagram(kaitai_stream* io, udp_datagram_callbacks* callbacks) {
  uint16_t src_port = read_u2be(io);
  if (io->status != OK) {
    // report the failure through the callbacks object, not the type name
    callbacks->on_error(io->status);
    return;
  }
  callbacks->on_read_src_port(src_port);

  // ...
}

Some suggested more complex pubsub-like models, with some intermediate machinery where the user "subscribes" to only certain events like "this part of structure is finally completely read".
Some users pointed out that such low-level packet parsing usually happens on incomplete/fragmented packets/structures. Typically, in such a situation KS would just stop reading and throw an exception. In C, however, they would prefer to be able to continuously resupply additional stream buffer contents into a single "reader" procedure, which would keep track of what has already been "parsed" on previous iterations (and not invoke relevant callbacks twice), and even to be able to resume parsing from certain points.
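The "resupply buffers into a single reader" idea can be sketched as a small state machine. Everything below (udp_parser, udp_parser_feed, the callback shape) is hypothetical and only illustrates the workflow for the four 2-byte UDP header fields; it is not a proposed KS API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback set: one callback, told which field was read. */
typedef struct {
    void (*on_u2)(void *user, int field_index, uint16_t value);
    void *user;
} udp_callbacks;

/* Parser state that survives between chunks. */
typedef struct {
    int field_index;     /* which of the 4 header fields comes next */
    int have_high_byte;  /* 1 if the first byte of that field is buffered */
    uint8_t high_byte;
} udp_parser;

void udp_parser_init(udp_parser *p)
{
    p->field_index = 0;
    p->have_high_byte = 0;
    p->high_byte = 0;
}

/* Feed the next chunk; returns the number of bytes consumed.
 * Each field's callback fires exactly once, as soon as both of its
 * bytes have arrived, even if they arrive in different chunks. */
size_t udp_parser_feed(udp_parser *p, const uint8_t *data, size_t len,
                       const udp_callbacks *cb)
{
    assert(p != NULL && cb != NULL && cb->on_u2 != NULL);
    size_t i = 0;
    while (i < len && p->field_index < 4) {
        if (!p->have_high_byte) {
            p->high_byte = data[i++];
            p->have_high_byte = 1;
        } else {
            uint16_t v = (uint16_t)((p->high_byte << 8) | data[i++]);
            p->have_high_byte = 0;
            cb->on_u2(cb->user, p->field_index++, v);
        }
    }
    return i;
}

int udp_parser_done(const udp_parser *p) { return p->field_index == 4; }

/* Example callback: store each field into a caller-provided array of 4. */
void store_field(void *user, int field_index, uint16_t value)
{
    ((uint16_t *)user)[field_index] = value;
}
```

A caller would run udp_parser_init once, then call udp_parser_feed for every fragment that arrives until udp_parser_done reports completion. Generating this kind of state machine for arbitrary formats is a very different job from generating the straightforward read function shown earlier.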

Not an "everything is an expression" language

Simply put, almost everything we had before supported "every KS expression translates into target language expression" idiom. That is, if you need to do string concatenation, i.e.

seq:
  - id: a
    type: strz
  - id: b
    type: strz
instances:
  c:
    value: a + b

... you do that a + b in one single-line expression everywhere. Even C++ allowed us to get away with a + b using std::string. In C, however, it traditionally boils down to many lines and temporary variables:

// Real-life code would be even more complex, probably with more checks, etc.
size_t len_a = strlen(a);
size_t len_b = strlen(b);
char *tmp = (char *) malloc(len_a + len_b + 1);
memcpy(tmp, a, len_a);
memcpy(tmp + len_a, b, len_b);
tmp[len_a + len_b] = 0;

This issue, however, was more or less solved with advent of #146.

Complex memory management

What's not solved, however, is that such arbitrary allocations of temporary variables sometimes result in more complex memory management and a need for additional manual cleanup. In the example above, tmp would likely be used directly as the value of c, so there's no need to store it separately. However, if multiple operations occur, we'll either need to store these intermediate values, or use some clever logic to reuse these temporary buffers (and/or avoid extra copying), or clean them up right after they're no longer needed (i.e. earlier than in the object's destructor).

Actually, even "allocate everything on the heap" is not universally agreed upon in many C apps. So, typical parsing of user-defined type like that:

udp_datagram_t* r = (udp_datagram_t*) malloc(sizeof(udp_datagram_t));

might be replaced with passing a pointer to an already allocated structure into the read_* functions to fill in, with the udp_datagram_t itself created on the caller's stack instead.
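A minimal sketch of that caller-allocates style, assuming an in-memory buffer instead of a kaitai_stream. The names udp_datagram_view and read_udp_datagram_into are made up for illustration, and the body is exposed as a pointer into the input rather than copied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: owned by the caller, no heap involved. */
typedef struct {
    uint16_t src_port;
    uint16_t dst_port;
    uint16_t length;
    uint16_t checksum;
    const uint8_t *body;  /* points into the input buffer, not copied */
    size_t body_len;
} udp_datagram_view;

/* Decode a big-endian 16-bit value from two bytes. */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Fills *out, which may live on the caller's stack.
 * Returns 0 on success, -1 if buf is shorter than the 8-byte header. */
int read_udp_datagram_into(const uint8_t *buf, size_t len,
                           udp_datagram_view *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 8)
        return -1;
    out->src_port = be16(buf);
    out->dst_port = be16(buf + 2);
    out->length   = be16(buf + 4);
    out->checksum = be16(buf + 6);
    out->body     = buf + 8;
    out->body_len = len - 8;
    return 0;
}
```

The caller declares a udp_datagram_view as a local variable and passes its address, so nothing needs to be freed afterwards. The catch is that body is only valid for as long as the input buffer is.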

No single standard library

For KS, we need some basic stuff like:

Byte arrays, which could report the length of the contents that they store. There is no standard structure like that in C:

typedef struct byte_array {
    int len;
    void* data;
} byte_array;

Strings (again, knowing their length and, ideally, encoding-aware). If we stick to traditional char* strings, then we get hit with the "no zero bytes inside" requirement, which might hurt some formats.
True element arrays which (1) know their size, (2) allow growth.

There are tons of "enhanced standard" libraries that do that, but there's no universal agreement on any of them. Probably roughly 80% of C applications roll something homebrew like that inside them. Among the "standard" implementations there are glib, klib, libmowgli, libulz, tons of lesser-known libraries, and a huge assortment of string-related libs, array-related libs, etc. Of these, glib is probably the most well-known and well-maintained, but even a suggestion to use it frequently encounters huge resistance from many C developers.

Another possible way (albeit not way too well-received by many developers) is to roll our own (yet another) implementation of all that stuff, and deal with ks_string*, ks_bytes*, ks_array*, etc, instead of char*, whatever_t[], etc.
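For a sense of scale, a "roll our own" byte array does not need to be large. The sketch below is one possible shape for a ks_bytes type; only the type name comes from the text above, while the field layout and function names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical runtime type: a growable byte array that knows its length
 * and may contain zero bytes. */
typedef struct {
    uint8_t *data;
    size_t len;
    size_t cap;
} ks_bytes;

void ks_bytes_init(ks_bytes *b)
{
    b->data = NULL;
    b->len = 0;
    b->cap = 0;
}

void ks_bytes_free(ks_bytes *b)
{
    free(b->data);
    ks_bytes_init(b);
}

/* Append n bytes; returns 0 on success, -1 on overflow or allocation
 * failure (in which case the existing contents are left untouched). */
int ks_bytes_append(ks_bytes *b, const void *src, size_t n)
{
    assert(b != NULL && (src != NULL || n == 0));
    if (n > SIZE_MAX - b->len)
        return -1;
    size_t need = b->len + n;
    if (need > b->cap) {
        size_t new_cap = b->cap ? b->cap : 16;
        while (new_cap < need) {
            if (new_cap > SIZE_MAX / 2) { new_cap = need; break; }
            new_cap *= 2;  /* geometric growth keeps appends amortized O(1) */
        }
        uint8_t *p = realloc(b->data, new_cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = new_cap;
    }
    if (n > 0)
        memcpy(b->data + b->len, src, n);
    b->len = need;
    return 0;
}
```

The same pattern would have to be repeated (or macro-generated) for typed element arrays, which is where the amount of runtime code starts to add up.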

No simple solution here, and whatever we choose probably won't be accepted by many C developers. Probably, if we implement support for the top 3 (or top 5) popular libs, that would cover at least some popular options.

Exception support

As we all know, C does not have any standard exception support, and typical KS-generated code relies on them a lot, i.e.:

  r->src_port = read_u2be(io);
  r->dst_port = read_u2be(io);
  r->length = read_u2be(io);
  // ...

On every step, read_u2be might encounter end of stream (or an IO error) and fail to parse another 2 bytes. The typical solution for that in C is using return codes and passing the value to fill by reference, i.e.:

int err;

err = read_u2be(io, &(r->src_port));
if (err != 0)
  return err;

err = read_u2be(io, &(r->dst_port));
if (err != 0)
  return err;

// ...

Since the introduction of Go support (#146), that became possible, although it would probably still be a pain-in-the-ass to use in C :(
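One common way to make the return-code style less painful is a small propagation macro. The sketch below is an assumption about how generated code could look, not something KS emits: mem_stream, KS_TRY and the error constants are hypothetical, and only the read_u2be(io, &value) shape is taken from the example above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory stream used only for this sketch. */
typedef struct {
    const uint8_t *data;
    size_t len;
    size_t pos;
} mem_stream;

enum { KS_OK = 0, KS_ERR_EOS = 1 };

/* Reads a big-endian u2 into *out; returns KS_ERR_EOS if fewer than
 * 2 bytes remain, leaving the stream position unchanged. */
int read_u2be(mem_stream *io, uint16_t *out)
{
    assert(io != NULL && out != NULL);
    if (io->len - io->pos < 2)
        return KS_ERR_EOS;
    *out = (uint16_t)((io->data[io->pos] << 8) | io->data[io->pos + 1]);
    io->pos += 2;
    return KS_OK;
}

/* Return the first non-zero error code to the caller. */
#define KS_TRY(expr)              \
    do {                          \
        int ks_err_ = (expr);     \
        if (ks_err_ != KS_OK)     \
            return ks_err_;       \
    } while (0)

typedef struct {
    uint16_t src_port, dst_port, length, checksum;
} udp_header;

int read_udp_header(mem_stream *io, udp_header *r)
{
    KS_TRY(read_u2be(io, &r->src_port));
    KS_TRY(read_u2be(io, &r->dst_port));
    KS_TRY(read_u2be(io, &r->length));
    KS_TRY(read_u2be(io, &r->checksum));
    return KS_OK;
}
```

This keeps each read on one line, at the price of a macro with a hidden return, which some C style guides discourage. It also does nothing for cleanup of partially built structures, so it only covers the simplest cases.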

Another quick "solution" for C is to use signals/aborts to handle these erroneous situations. In fact, it would even be ok in many use cases like embedded stuff, because things are not usually supposed to blow up there and if they do, then everything is lost already: there are no graceful exits, user interactions, "Send error report to the vendor" dialogs, etc.

Stream abstraction

Relatively minor and solvable issue, but still an issue: what would the concept of a "KS stream" be in C? Two popular options:

FILE*: unless buffering is in effect, many sequential "read_u2be" calls would translate into literal "read 2 bytes" syscalls, which is terribly inefficient. Besides, standard C offers no way to read from an in-memory array using FILE*
char*: just use an in-memory array and screw everything else. On-disk file parsing can be done using mmap, but this is (1) very platform-dependent, (2) pretty inefficient for lots of smaller files. And the question of handling IO errors (or at least end-of-stream) still remains, so we'll need a wrapper for that to store the mapped length.

Probably C runtime would need to implement all these options and allow for end-user to choose. Nothing too scary, but still an issue to be solved.
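A rough sketch of how a runtime could offer both backends behind one interface, assuming a single "read exactly n bytes" function pointer is enough. The ks_stream layout and all function names here are hypothetical; only fread and memcpy are real library calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stream interface: a backend supplies one read function. */
typedef struct ks_stream {
    /* Reads exactly n bytes into dst; returns 0 on success, -1 on
     * end of stream or IO error. */
    int (*read)(struct ks_stream *s, void *dst, size_t n);
    FILE *fp;            /* used by the FILE* backend */
    const uint8_t *mem;  /* used by the memory backend */
    size_t mem_len;
    size_t mem_pos;
} ks_stream;

static int file_read(ks_stream *s, void *dst, size_t n)
{
    return fread(dst, 1, n, s->fp) == n ? 0 : -1;
}

static int mem_read(ks_stream *s, void *dst, size_t n)
{
    if (s->mem_len - s->mem_pos < n)
        return -1;  /* the wrapper knows the length, so EOS is detectable */
    memcpy(dst, s->mem + s->mem_pos, n);
    s->mem_pos += n;
    return 0;
}

void ks_stream_from_file(ks_stream *s, FILE *fp)
{
    assert(s != NULL && fp != NULL);
    memset(s, 0, sizeof *s);
    s->read = file_read;
    s->fp = fp;
}

void ks_stream_from_memory(ks_stream *s, const uint8_t *mem, size_t len)
{
    assert(s != NULL && (mem != NULL || len == 0));
    memset(s, 0, sizeof *s);
    s->read = mem_read;
    s->mem = mem;
    s->mem_len = len;
}

/* Backend-independent primitive built on top of the interface. */
int ks_read_u2be(ks_stream *s, uint16_t *out)
{
    uint8_t b[2];
    if (s->read(s, b, 2) != 0)
        return -1;
    *out = (uint16_t)((b[0] << 8) | b[1]);
    return 0;
}
```

The indirect call on every primitive read has a cost, so a performance-minded runtime might instead pick the backend at compile time. An mmap-based backend could reuse the memory path once the platform-specific mapping is done.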

Sep 25 '17 12:09 GreyCat

And, to answer these:

Having a C version would enable portability on any system

Well, I won't be that optimistic. Given all the stuff above, chances are tons of C people would still opt to roll things manually because of all these compromises and "does not exactly fit my workflow" argument.

and even more languages with C bindings

Probably it won't be that easy :( KS C runtime is likely to be easier to rewrite in another language than go through all that binding hassle, and then you'll have to do that "binding" glue code for every particular type ported.

Sep 25 '17 12:09 GreyCat

Hi, thanks for the lengthy and detailed answer. I'm glad to hear that C has already been discussed and considered. For the "string" argument I would go for "standard C" zero-terminated strings. Other "strings" that contain zeros I would treat as binary data of a given length. For the libraries to use (many of which would encounter resistance) I'd go for a custom implementation. That could take long to make but shouldn't be too hard to do (let me know if you want some help, I would be happy to do so).

For Exception support what about CException? See link. Otherwise we could do something like C11's bounds-checked string functions and return errno_t.
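For reference, exception-like control flow in plain C is usually built on setjmp/longjmp from the standard library. The sketch below uses only those two standard calls and does not use or imitate CException's own API; jmp_stream and the function names are hypothetical.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stream that carries its own "catch" target. */
typedef struct {
    const uint8_t *data;
    size_t len;
    size_t pos;
    jmp_buf on_error;  /* where to jump when a read fails */
} jmp_stream;

/* Returns the value directly, like the exception-based KS runtimes do;
 * on end of stream it never returns and jumps to on_error instead. */
static uint16_t read_u2be_or_jump(jmp_stream *io)
{
    if (io->len - io->pos < 2)
        longjmp(io->on_error, 1);
    uint16_t v = (uint16_t)((io->data[io->pos] << 8) | io->data[io->pos + 1]);
    io->pos += 2;
    return v;
}

/* Reads two u2be fields and adds them.
 * Returns 0 on success, 1 if the stream ended early. */
int sum_two_fields(jmp_stream *io, uint32_t *sum)
{
    assert(io != NULL && sum != NULL);
    if (setjmp(io->on_error) != 0)
        return 1;  /* the "catch" branch */
    uint16_t a = read_u2be_or_jump(io);
    uint16_t b = read_u2be_or_jump(io);
    *sum = (uint32_t)a + b;
    return 0;
}
```

This brings back the one-read-per-line style, but anything allocated between setjmp and longjmp leaks unless it is tracked somewhere, which ties straight back into the memory management concerns.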

For the KS stream any of the two solutions would be ok. If I remember correctly you can set/enable the default buffering/buffer of FILE *. Otherwise allocate everything manually on memory and release it later.

About the "workflow" argument, everyone will always decide on their own what library to use or what to do with their own code (even doing all custom handling), so I wouldn't think too much about that.

For the "C bindings", it would be good for languages not yet implemented that can use the C bindings easily.

I think a good solution would be to have a kslib_init() and kslib_free() or something similar if the library needs to initialize and allocate/release its own resources. Even if it looks ugly or you have to save and pass around extra arguments to the library's functions. Still better than nothing.

I believe it would be "uglier" to just have to make C functions "wrapped around" C++ API calls, or even worse not being able to compile on some systems, or having to implement everything (without this library) manually every time.

I like the project (even if I haven't had the chance to play around with it yet) and, if I have some extra time, I'd really like to give a hand and help to make a C port (even if it would be a side-project with some differences)

Sep 25 '17 13:09 Zorgatone

@Zorgatone Ok, for a start, I would suggest really playing around with KS to see what it does and what it does not. Maybe you'll decide that it won't meet your expectations anyway?..

For Exception support what about CException? See link

The link just says "Non-Image content-type returned" for me :( If you mean something like https://github.com/ThrowTheSwitch/CException, then at the very least that's +1 extra library of dependencies, and in the C world every library is usually a major hassle. But maybe that could be done too.

I'd really like to give a hand and help to make a C port

You've probably seen http://doc.kaitai.io/new_language.html — right now we're somewhere in between stages (2) and (3). From all the issues that I've outlined, this "totally different workflow expected" is definitely the most serious one. I'm not too keen on doing lots of work that almost nobody would want to use.

Sep 25 '17 13:09 GreyCat

Understandable, thanks for the reply. I was planning to do some testing with KS in the near future, maybe I will try and make my own library in C if I think I'll need it :)

PS: thanks for the link, it's a good starting point

Sep 25 '17 15:09 Zorgatone

len_foo must not be stored in the structures that KS generates in memory at all — it must be used once during the parsing and then just thrown away

I don't use C, I use C++ and IMHO the preferred approach is not to store the info in a standalone structure, but to decompose the thing into a set of fixed (or variable size, if language supports it) dumb structures and put them upon raw virtual memory. #65

Given that we're talking about "string" data type, why not convert it into "pure C string", as most C stdlib functions expect it to be — i.e. no length information, just a zero byte termination.

For the strz type, just pass a pointer to that memory. There is an issue with non-zero-byte terminators, though.

Complex memory management

IMHO we should just use C++ for that. C coders can write in C++ in C-style if they want.

Sep 25 '17 16:09 KOLANICH

I'll just leave it here, just in case: https://matt.sh/howto-c

This link was heavily suggested by several modern C proponents with whom I've discussed KS support for C. Suggestions of modern C style guides are also most welcome. The only one that I know is the Linux kernel coding style guide (this is my personal preference for C as well), but chances are that there are other popular style guides in other areas?

Sep 25 '17 16:09 GreyCat

@GreyCat nice link! Useful to know that. But still not all compilers support all the C11 features unfortunately. At least it should be good to use C99, especially for the stdint.h int types (I really didn't know about the fast and least ints! I knew about the fixed-size ones, though).

Sep 25 '17 17:09 Zorgatone

Most of things from that are also valid for C++.

Sep 25 '17 17:09 KOLANICH

I'm also linking another article that criticizes Matt's "how to c in 2016" article, to consider other opinions as well: https://github.com/Keith-S-Thompson/how-to-c-response

Sep 26 '17 07:09 Zorgatone

For C strings, I would recommend that one field end up adding a few fields to the resulting struct, with similar names and different types. For example:

  r->text_array = read_array(io, 10);
  r->text_str = r->text_array.to_str();

This does not consume more memory (only const amount), as the char* pointer points to same data as the array. End user might want some glib arrays, or char*, why not give them both?

Jan 18 '18 12:01 arekbulski

@arekbulski Giving them both is probably a bad idea: it will require dependency on glib, and would add extra unneeded bloat for both parties. Besides, char* strings are just not enough anyway: you need to be able to do .length on that, and you just can't do that with char* string.

Jan 19 '18 00:01 GreyCat

Another possible way (albeit not way too well-received by many developers) is to roll our own (yet another) implementation of all that stuff, and deal with ks_string*, ks_bytes*, ks_array*, etc, instead of char*, whatever_t[], etc.

You suggested using our own types, and the runtime could provide convenience functions for transforming ks_arrays to glib byte arrays and other types. Hm? glib would be supported, not required.

Jan 19 '18 06:01 arekbulski

@GreyCat I would be willing to start implementing the C runtime. If you approve, I would outline the runtime file first (the types and methods for byte arrays etc.), and if that meets your standards, we (you) would update the compiler/translator to support the runtime, and I would implement the meat in the runtime. What do you think?

Feb 08 '18 02:02 arekbulski

Besides, char* strings are just not enough anyway: you need to be able to do .length on that, and you just can't do that with char* string

@GreyCat, you may consider rolling your own string implementation which uses the same technique as sds. This will make Kaitai strings compatible with most functions accepting char* (unless a Kaitai string contains an extra zero byte in addition to the terminating NUL).
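The technique in question is to keep a length header immediately before the character data and hand out a pointer to the data itself. The sketch below illustrates that idea with made-up names (ks_str_hdr, ks_str_new, ks_str_len, ks_str_free); it is not the sds API and skips everything sds does beyond the basic layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header stored immediately before the character data. */
typedef struct {
    size_t len;
    char buf[];  /* C99 flexible array member */
} ks_str_hdr;

/* Creates a string from len bytes. The returned pointer addresses the
 * data, which is always followed by a terminating zero byte, so it can
 * be passed to functions expecting char*. Returns NULL on failure. */
char *ks_str_new(const void *src, size_t len)
{
    assert(src != NULL || len == 0);
    if (len > SIZE_MAX - sizeof(ks_str_hdr) - 1)
        return NULL;
    ks_str_hdr *h = malloc(sizeof(ks_str_hdr) + len + 1);
    if (h == NULL)
        return NULL;
    h->len = len;
    if (len > 0)
        memcpy(h->buf, src, len);
    h->buf[len] = '\0';
    return h->buf;
}

/* O(1) length lookup: step back from the data to the header. */
size_t ks_str_len(const char *s)
{
    assert(s != NULL);
    const ks_str_hdr *h =
        (const ks_str_hdr *)(const void *)(s - offsetof(ks_str_hdr, buf));
    return h->len;
}

void ks_str_free(char *s)
{
    if (s != NULL)
        free(s - offsetof(ks_str_hdr, buf));
}
```

With this layout ks_str_len still reports the full length for a string with an embedded zero byte, while strlen on the same pointer stops at the first zero, which is exactly the caveat mentioned above. Such a pointer must only ever be released with ks_str_free, never with plain free.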

Feb 08 '18 09:02 smaximov

@arekbulski Sure, go ahead :) I'm not sure you've seen it, we also have this article with an overall new language support plan and an implementation graph like this one.

Feb 08 '18 09:02 GreyCat

@smaximov Yeah, that's probably how it should be done for "roll your own" implementation.

Feb 08 '18 09:02 GreyCat

I have sweet-and-sour feelings about SDS. I really like the idea, I really do, but the implementation is horrible. The repo you linked has bug reports and bugfixes going back 4 years and still hanging. They also implemented a variable-length prefix (the count field), which makes it bananas. We can implement our own SDS; I do not recommend using theirs.

Big thanks for sharing this with us, @smaximov !

Feb 11 '18 03:02 arekbulski

Is anyone working on this, even as a prototype?

Apr 04 '18 20:04 jonahharris

Not really. Personally, I would probably return to this one after #146, as the experience with Go is very much the same as with C (except for the fact that Go has relatively ok strings and slices).

Apr 04 '18 20:04 GreyCat

I promised to implement the C runtime, but that was a few months ago. Since then I have had much work on Construct, and now I am working on a few things in KS. I am still willing to implement this, but I can't work on everything at once. If you wish, I will get on top of C, but other work items would need to be shelved instead.

Apr 04 '18 21:04 arekbulski

Any updates on this? I'd like to help, but I'm not familiar with Scala...

Dec 20 '18 22:12 DarkShadow44

No updates. Unfortunately, most of https://github.com/kaitai-io/kaitai_struct/issues/263#issuecomment-331869391 still stands. It's probably still a good idea to complete the Go port first, as it shares many common concepts (except for the hassle with memory management).

Dec 20 '18 22:12 GreyCat

FWIW, I have a (for me) working C version at https://github.com/DarkShadow44/UIRibbon-Reversing/blob/master/tests/UIRibbon/parser_generic.c and https://github.com/DarkShadow44/UIRibbon-Reversing/blob/master/tests/UIRibbon/parser_uiribbon.c
It's pretty simple, copying data from the file into an in-memory struct. It also supports writing data. What do you think about that approach? It might not fulfill all use cases, but to me it does the job.

Feb 02 '21 19:02 DarkShadow44

BTW, I have a half-finished (but not yet published; development stalled because I got other tasks) proposal of how it should look for C and C++ for one damn simple spec.

In general:

C structs made of pointers are the public interface for access. They are headers of private structures.
Private structures are pieces of memory + a header of pointers to them. Private structures are used to insert items.
When serializing, private structures are memcpy'd into the map, then their sources are truncated to their headers using realloc and the pointers are changed to point into the memory map.
When parsing, raw structures are laid over memory. No streamed IO at all, only memory-mapped IO. Compatible with larger-than-RAM files as long as one doesn't need random access to more than the mapped pages and the index + the pages fit into memory. Also compatible with driving hardware.
Serialization of simple structs already in their place is almost zero-cost.

Feb 02 '21 19:02 KOLANICH

Would you have an example of how that C code would look like? I don't quite understand the "private structures" bit. In my example, all structs are public. I don't really do streams either, it's an in-memory stream abstraction. How do you do memory mapping in standard C?

Feb 02 '21 21:02 DarkShadow44

Would you have an example of how that C code would look like?

I have said that it is unfinished. But I'd create a small example just now illustrating what I mean, but without any guarantees of correctness.

I don't quite understand the "private structures" bit. In my example, all structs are public.

Very easy

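#include <stdint.h>
#include <stdlib.h>

struct a{
  uint64_t *c;
};
struct a_priv{
  uint64_t c;
};
struct a_full{
  struct a pub;
  struct a_priv priv;
};

struct a * construct_a(void){
  struct a_full *a = (struct a_full *) malloc(sizeof(struct a_full));
  if (a == NULL) return NULL;
  a->pub.c = & a->priv.c;
  return (struct a *) a;  // valid: pub is the first member of a_full
}

void process(struct a * c){
  *(c->c) = 42;  // writes through the public pointer into the private field
}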

This way we access the data only via pointers, so we access it uniformly no matter where they are. This comes at a cost: there is overhead, one pointer per variable. It is possible to make it more efficient by keeping only pointers to structs, not to every field, but in C it will make the API terrible and significantly different from that in other langs. In C++ it can be fixed with operator overloading and constexprs.

I don't really do streams either, it's an in-memory stream abstraction.

I guess some libc can implement fread fseek fwrite API over mmaps.

How do you do memory mapping in standard C?

Standard C doesn't even have any sane functions to work with strings. It is an extremely bad, too -fpermissive, stagnating language (once I was debugging a memory-safety issue for quite a long time .... because the C compiler almost silently (with a warning, but who looks at warnings in a project that is already filled with warnings?) allowed passing an incompatible type as an arg (or maybe I missed an arg, I don't remember exactly)). IMHO there is no sense in using C where C++ can be used. Usually when I see C fans, I see unacceptable shitcode. The only real way to fix that shitcode ... is to implement a kind of OOP myself on top of plain C. I prefer to just use C++, but there are some projects created by C fanatics (in the sense I described above, the projects are full of shitcode) that I had to contribute to.

Feb 02 '21 21:02 KOLANICH

This way we access the data only via pointers, so we access it uniformly no matter where they are.

I don't really see the point behind that, tbh. What's the disadvantage of my approach? I don't need everything as pointers.

but in C it will make the API terrible and significantly different from that in other langs.

Sure, the API will be different, but that's because C is not OOP. That doesn't necessarily make it terrible. As you see, my implementation uses an OOP abstraction as well, where's the problem with that?

Standard C doesn't even have any sane functions to work with strings

Yea, that's why I just keep strings as-is.

Feb 02 '21 22:02 DarkShadow44

What's the disadvantage of my approach?

It is just a different approach designed with different things in mind. When I was designing my approach I was thinking about making serialization cheaper and easier and about reducing memory footprint by not copying data at all, and about volatile structures in memory to be used for IPC and to control devices mapped to memory.

Feb 02 '21 22:02 KOLANICH

For memory footprint it would be enough to keep only big data blobs memory-mapped; everything else is smaller than the size of a pointer. Anyway, for memory mapping we need platform-specific code, right? I propose some kind of stream abstraction (similar to what I made), which can be in-memory (like mine) or memory-mapped.

I was thinking about making serialization cheaper

How does that make serialization cheaper? Sure, you can edit files as-is, but when writing new files it makes things harder.

about volatile structures in memory to be used for IPC and to control devices mapped to memory.

That's an interesting point, I thought we only cared about file formats. Can the other languages (especially C++) handle something like this?

Feb 03 '21 12:02 DarkShadow44
