libdill
changes to get this to compile on Win with mingw64
Work in progress; please join in and help if you'd like. I have this compiling on Win now, but have not tried running a single thing yet (I am cross-compiling, so it's at least a little work to switch over to a Win box...).
Attached is the diff file (had to rename to txt to be able to upload).
In addition, one Make link command was missing "-lWs2_32", so rather than change the Makefile I just ran that compilation step by hand:
/bin/sh ./libtool --silent --tag=CC --mode=link x86_64-w64-mingw32-gcc -std=gnu99 -fvisibility=hidden -DDILL_EXPORTS -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=0 -g -O2 -no-undefined -version-info 3:0:0 -o libdill.la -rpath /opt/mingw64/lib libdill_la-chan.lo libdill_la-cr.lo libdill_la-fd.lo libdill_la-handle.lo libdill_la-libdill.lo libdill_la-list.lo libdill_la-pollset.lo libdill_la-proc.lo libdill_la-slist.lo libdill_la-stack.lo -lWs2_32
Will get around to testing at some point. May move to libmill first. Comments welcome
I suggest you send a GitHub pull request instead. It'll make it easier for people to look at the patch and test it. I, personally, don't own a Win box, so I can't really comment on the code.
On Windows, this is difficult. proc has fork semantics, which are difficult to emulate using Windows' spawn-like APIs. It is possible, however: cygwin and Windows 10 do manage this. I'll investigate.
For reference, this might be feasible: https://github.com/kaniini/win32-fork/blob/master/fork.c
@mewalig What is your progress on this? I am currently working on this as well. I did an experimental cygwin build but ran into issues with forked processes not having access to the same network listeners and AF_UNIX having different close semantics.
I've come to the conclusion that replicating fork behaviour will only end in tears. The linked fork example I provided earlier breaks other Win32 API calls and does not work on Windows 10.
@sustrik I'm looking at implementing an optional thread API (--enable-threads) which stores all the libdill global state per-thread. This should provide per-thread coroutines for Windows, which does not support multiprocess applications for networking etc.
I'm trying to mirror the proc API, but it can't be a drop-in replacement because fork has copy-on-write semantics that are hard to replicate. My next-best-thing approach is that the documentation for thread will emphasise that process data is shared and that the user needs to copy data or use thread-local storage as necessary.
My use case is mainly client libraries that want to use the functionality of dsock on Windows platforms, and potentially applications that need thread safety because multi-process access is prohibited or flaky in support (e.g. OpenGL).
@sustrik I was wondering whether --enable-threads should compile a separate libdill-threads.so, because it'll contain functionality that is unused in a majority of use-cases and possibly hinder raw performance (thread-local storage) if threading behaviour is not used.
P.S. The hierarchy between thread and proc would be:
P1         |P2          proc (Unsupported on Windows)
-----+-----+-----+-----
T1   |T2   |T3   |T4    thread (Windows & *nix)
--+--+--+--+--+--+--+--
C1|C2|C3|C4|C5|C6|C7|C8 go(routines) (libdill)
On *nix systems, documentation will need to note that creating threads prior to forking will cause bad behaviour. The proper procedure would be to fork all processes first and then create threads in each forked process afterwards, if the user intends to mix proc and thread.
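A rough sketch of that ordering on *nix, using plain fork and pthreads rather than any libdill API. The process and thread counts and all function names here are made up for illustration:

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROCS 2   /* illustrative values, not taken from libdill */
#define NTHREADS 2

static void *thread_main(void *arg) {
    /* Each thread would run its own set of coroutines here. */
    assert(arg != NULL);
    *(int *)arg = 1;
    return NULL;
}

/* Runs in every process once all forking is finished. Returns 0 on success. */
static int run_threads(void) {
    pthread_t th[NTHREADS];
    int ran[NTHREADS] = {0};
    int started = 0;
    for (int i = 0; i < NTHREADS; i++) {
        if (pthread_create(&th[i], NULL, thread_main, &ran[i]) != 0) break;
        started++;
    }
    int ok = (started == NTHREADS);
    for (int i = 0; i < started; i++) {
        pthread_join(th[i], NULL);
        ok = ok && ran[i];
    }
    return ok ? 0 : -1;
}

/* Fork all children first, then start threads in each process.
   Returns 0 if every process ran its threads successfully. */
int fork_then_thread(void) {
    pid_t pids[NPROCS];
    int nforked = 0;
    for (int i = 0; i < NPROCS; i++) {
        pid_t pid = fork();
        if (pid < 0) break;
        if (pid == 0) {
            /* Child: still single-threaded at this point, so it is
               safe to start threads now. */
            _exit(run_threads() == 0 ? 0 : 1);
        }
        pids[nforked++] = pid;
    }
    /* Parent: no more forks from here on, so it may create threads too. */
    int rc = (nforked == NPROCS) ? run_threads() : -1;
    for (int i = 0; i < nforked; i++) {
        int status = 0;
        if (waitpid(pids[i], &status, 0) < 0 ||
            !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            rc = -1;
    }
    return rc;
}
```

The point is only the ordering: no process calls fork while it has more than one thread running.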
Have you considered simply not supporting proc on Windows?
To be frank, the construct doesn't work that well on UNIX either. If you proc() later on in the parent's life cycle all the junk allocated by the parent is copied to the child even though the child doesn't need it.
Maybe the correct solution is just to start N processes by hand or using a shell script and be done with it.
As for thread(), yes, that would work better than proc(), but then, isn't that just asking for problems when libdill's stack magic clashes with pthreads' stack magic?
Yeah, supporting proc is pretty much impossible for networking purposes on Windows. I think the stack magic should be fine if the coroutines are confined per thread and the user doesn't try to make them interact without proper IPC.
> To be frank, the construct doesn't work that well on UNIX either. If you proc() later on in the parent's life cycle all the junk allocated by the parent is copied to the child even though the child doesn't need it.
> Maybe the correct solution is just to start N processes by hand or using a shell script and be done with it.
True, but wouldn't that make sharing a listener difficult?
Having said that, this actually makes multi-processing consistent on Windows as well, as I just discovered that you can share sockets there too, albeit in a cumbersome way, via WSADuplicateSocket.
So since proc causes more harm than good, I think we should consider spawning separate processes and using IPC to share the listener fd, or the WSAPROTOCOL_INFO in the case of Windows.
The details of what needs to be transferred are left to the user, to prevent libdill from becoming bloated.
In summary, this approach would entail:
- Dropping proc.
- Updating tutorial/step6.c to use posix_spawn on *nix and CreateProcess on Windows.
- Figuring out the best method of thread support.
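For the *nix half, a minimal sketch of what spawning a worker with a shared listener could look like. This is not what tutorial/step6.c does today; the function names and the convention of passing the fd number as argv[1] are assumptions for illustration, and the worker program is hypothetical (it would parse argv[1] and call accept on that descriptor). The Windows side (CreateProcess plus WSADuplicateSocket) is not shown.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <spawn.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Create a TCP listener on 127.0.0.1 with an OS-assigned port.
   Returns the fd, or -1 on error. */
int make_listener(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Spawn `path` with the listener's fd number as argv[1] (a made-up
   convention). Returns 0 on success, -1 on error. */
int spawn_worker(const char *path, int listener_fd, pid_t *pid) {
    assert(path != NULL && pid != NULL);
    /* Clear close-on-exec so the descriptor survives into the child. */
    int flags = fcntl(listener_fd, F_GETFD);
    if (flags < 0) return -1;
    if (fcntl(listener_fd, F_SETFD, flags & ~FD_CLOEXEC) < 0) return -1;
    char fdarg[16];
    snprintf(fdarg, sizeof fdarg, "%d", listener_fd);
    char *argv[] = {(char *)path, fdarg, NULL};
    return posix_spawn(pid, path, NULL, NULL, argv, environ) == 0 ? 0 : -1;
}

/* Wait for a worker; returns its exit status, or -1 on error. */
int wait_worker(pid_t pid) {
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) return -1;
    return WEXITSTATUS(status);
}
```

The parent would call make_listener once, then spawn_worker N times with the same fd, so every worker accepts on the same socket without any fork-style memory copying.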
Thread support
Thread creation
Mimicking the go and proc interface is difficult for threading and, even if possible, potentially flaky. I suggest producing an interface similar to C11's thread interface, to safely create a new thread handle using the native thread interface (POSIX or Windows).
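To make the idea concrete, here is a small sketch of a C11-style wrapper over the native thread APIs. None of these names exist in libdill; xthread_create, xthread_join and struct xthread are placeholders, and only the POSIX branch has been exercised.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <pthread.h>
#endif

/* Same shape as C11's thrd_start_t. */
typedef int (*xthread_fn)(void *);

/* Hypothetical thread handle; must stay alive until xthread_join returns. */
struct xthread {
    xthread_fn fn;
    void *arg;
    int result;
#ifdef _WIN32
    HANDLE handle;
#else
    pthread_t handle;
#endif
};

/* Trampoline adapting the native entry-point signature to xthread_fn. */
#ifdef _WIN32
static DWORD WINAPI xthread_tramp(LPVOID p) {
    struct xthread *t = p;
    t->result = t->fn(t->arg);
    return 0;
}
#else
static void *xthread_tramp(void *p) {
    struct xthread *t = p;
    t->result = t->fn(t->arg);
    return NULL;
}
#endif

/* Start a native thread running fn(arg). Returns 0 on success, -1 on error. */
int xthread_create(struct xthread *t, xthread_fn fn, void *arg) {
    assert(t != NULL && fn != NULL);
    t->fn = fn;
    t->arg = arg;
    t->result = 0;
#ifdef _WIN32
    t->handle = CreateThread(NULL, 0, xthread_tramp, t, 0, NULL);
    return t->handle != NULL ? 0 : -1;
#else
    return pthread_create(&t->handle, NULL, xthread_tramp, t) == 0 ? 0 : -1;
#endif
}

/* Wait for the thread and optionally fetch fn's return value. */
int xthread_join(struct xthread *t, int *result) {
#ifdef _WIN32
    if (WaitForSingleObject(t->handle, INFINITE) != WAIT_OBJECT_0) return -1;
    CloseHandle(t->handle);
#else
    if (pthread_join(t->handle, NULL) != 0) return -1;
#endif
    if (result != NULL) *result = t->result;
    return 0;
}

/* Example thread body: doubles the int it is given. */
int xthread_double(void *arg) {
    return *(int *)arg * 2;
}
```

Each thread started this way would then own its own libdill state and run its own coroutines, with nothing shared implicitly.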
Thread context storage
However, my initial approach of wrapping everything in __thread storage specifiers may not be the best for performance:
https://software.intel.com/en-us/blogs/2011/05/02/the-hidden-performance-cost-of-accessing-thread-local-variables
The extreme solution would be to pass a dill_context explicitly to each function.
The less extreme solution would be to bear the cost of accessing __thread once and then caching the location for the coroutine information.
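Roughly what I mean by the less extreme option, as a toy sketch. struct ctx and the function names are stand-ins, not the actual libdill internals; __thread is the GCC/Clang storage specifier.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-thread scheduler state. */
struct ctx {
    long switches;
};

static __thread struct ctx ctx_tls;

/* The only place that touches thread-local storage. */
struct ctx *ctx_get(void) {
    return &ctx_tls;
}

/* Internal helpers take the context explicitly instead of re-reading TLS. */
static void ctx_switch(struct ctx *c) {
    assert(c != NULL);
    c->switches++;
}

/* Public entry point: pay for the TLS lookup once, then reuse the pointer.
   Returns the calling thread's running total. */
long run_switches(long n) {
    struct ctx *c = ctx_get();
    for (long i = 0; i < n; i++) ctx_switch(c);
    return c->switches;
}

static void *ctx_thread_main(void *arg) {
    *(long *)arg = run_switches(1000);
    return NULL;
}

/* Runs the loop on a second thread and returns that thread's counter,
   or -1 if the thread could not be started. */
long run_switches_in_thread(void) {
    pthread_t th;
    long out = -1;
    if (pthread_create(&th, NULL, ctx_thread_main, &out) != 0) return -1;
    pthread_join(th, NULL);
    return out;
}
```

Public functions look the context up once on entry, and everything below them receives the pointer, so the hot paths never touch __thread directly and each thread keeps its own counter.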
While preparing the code-base for thread support I accidentally found some flags to make context switching an order of magnitude faster if you compile statically. It gives a tiny speed-up with libdill.so as well.
I'm not entirely sure if it's something to do with my re-ordering of the code-base or something else but the effect does not occur on HEAD.
For reference, HEAD does 14ns ctxswitch and 26ns go.
Small speed up on *.so, ~1-3ns:
$ ./configure CFLAGS="-O3 -flto -fuse-ld=gold -fvisibility=hidden"; make clean; make -j
$ ./perf/ctxswitch 10
performed 10M context switches in 0.139000 seconds
duration of one context switch: 13 ns
context switches per second: 76.923073M
$ ./perf/go 10
executed 10M coroutines in 0.239000 seconds
duration of one coroutine creation+termination: 23 ns
coroutine creations+terminations per second: 43.478260M
Compiled statically:
$ gcc -DDILL_THREADS -march=native -O3 -flto -fvisibility=hidden *.c perf/ctxswitch.c
$ ./a.out 10
performed 10M context switches in 0.062000 seconds
duration of one context switch: 6 ns
context switches per second: 166.666672M
$ gcc -DDILL_THREADS -march=native -O3 -flto -fvisibility=hidden *.c perf/go.c
$ ./a.out 10
executed 10M coroutines in 0.210000 seconds
duration of one coroutine creation+termination: 21 ns
coroutine creations+terminations per second: 47.619049M
I investigated the assembly listing, and gcc is able to optimise across file boundaries easily when you provide -flto -fvisibility=hidden. Take the numbers with a pinch of salt, however, because this does not take into account any more complex code, where the saving of 7ns might actually be pointless.
Microbenchmarks aside, the nice thing, however, is these flags make the thread-safe version of libdill run at the same speed as the non thread-safe version. That's probably the best outcome of this!
WIP threads and some other changes. I'll need to split them out: https://github.com/raedwulf/libdill/tree/threads
Doesn't actually have threading support yet, but it does isolate libdill state per thread.
Interesting, one would say that 6ns wasn't even possible. Maybe it's because of additional inlining that can be done when linking statically?
The generated assembly seems to show inlining everywhere, reducing the benchmark to simply context switching from the main thread to the worker. It does do the scheduling as well but removes most of the function calls in between. As far as I could see it was still behaving correctly, in part due to the fact that worker was marked no inline. It might be that the simple benchmark took advantage of Intel's ability to execute small hot loops very quickly (I forget the term but it's something to do with loops of around 100 instr).
Pretty awesome work you're doing! Unfortunately I haven't had time to do much since my last posting and though I hope I'll be able to revisit as soon as possible, it's not looking so good for the near term.
Interestingly libdill works perfectly fine on Ubuntu for Windows! Shame it's still in beta.
It looks like it has been almost a year since there was much work on Windows. Is there anything still going on? I would love to use libdill in a project. But, I need to have Linux, Windows and Mac support.
Due to significant differences between the Windows I/O system and Unix polling mechanisms, it's not easy to port. Your best bet is to use Ubuntu for Windows, but of course that will have its own limitations.
One thing to check (a project that builds epoll on top of IOCP): https://github.com/piscisaureus/wepoll