
asyncio: Added async i/o APIs.

icculus opened this issue 1 year ago • 72 comments

This is still in progress, but I've been heads-down on this for a while, so I'm putting it in a draft PR.

This implements an async i/o API, and a "generic" backend that implements it with a pool of SDL threads that can block on synchronous SDL_IOStreams. This is likely "good enough" for most things, and it's usable on everything but single-threaded Emscripten builds.

The downsides to the generic backend:

  • While i/o happens in a background thread, you can't do two accesses to the same open file in parallel, because the SDL_IOStream needs to seek for each, so the file position state would race. There's a mutex protecting it per-opened-file. Separate files (even two opens of the same file) can work in parallel, though.
  • It can't cancel in-flight i/o; if a read/write is happening, it can't be stopped before it completes. In practice, this isn't likely a big deal.
  • There are limited threads for i/o (currently (SDL_GetCPUCount * 2) +1 at most), ~~and it uses the same thread pool that is used for notifying apps that i/o is complete, which is to say if the app uses that callback for heavy processing, until the callback returns, it won't be available for doing more i/o. In practice, you'd have to have lots of read/writes that also have very heavy callbacks before you'd see bottlenecks, I think, but system-specific APIs can just get all the i/o cooking separately regardless of callback weight.~~ (EDIT: this isn't true now. The generic backend uses up to 8 threads for i/o, but they are only limited by disk bandwidth, and no longer have to worry about the app hogging them in a callback.)
  • ~~It can't ever work for single-threaded Emscripten, because it needs threads.~~ (EDIT: it works in single-threaded Emscripten now...it's just synchronous, because Emscripten's filesystem API is synchronous and almost always just a memcpy from MEMFS anyhow).

This would benefit from a Windows backend, that uses win32 IO Completion Ports, to solve these limitations. Linux can use io_uring if that's generally available at this point. Emscripten ~~could use actual async file APIs and run stuff between mainloop iterations if single-threaded~~ (EDIT: see above), or hand off from there to the usual thread pool if not. I'm sure Mac/iOS/Android have a thing, too, but the ABI wouldn't change, so we can deal with those later.

Fixes #1374.

icculus avatar Aug 28 '24 02:08 icculus

Any specific reason why a callback-based system was chosen?

An alternative would be SDL_ReadIOAsync, etc. returning a handle on which the application could poll to ask if the operation was finished. This is the interface I have in my engine, and I feel it is more natural for loading screens, etc. (store handles for all in-flight operations in an array, iterate and poll every frame, keep showing the loading screen unless all return DONE).

With a callback-based system, this gets complex (maintain a multithread-safe hash table for all operations, the callback sets the value DONE for the asset ID as key, the application iterates over the hash table); especially if the platform layer and game layer are reasonably decoupled.
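
For illustration, the poll-style loop looks something like this (the Engine_* names and types are placeholders for a handle-returning API, not SDL functions):

```c
/* Hypothetical handle/poll pattern for a loading screen; the Engine_* names
   and types are placeholders for a handle-returning API, not SDL functions. */
typedef struct EngineIOHandle EngineIOHandle;
typedef enum { ENGINE_IO_PENDING, ENGINE_IO_DONE, ENGINE_IO_FAILED } EngineIOStatus;
extern EngineIOHandle *Engine_ReadFileAsync(const char *path);
extern EngineIOStatus Engine_PollIO(EngineIOHandle *handle);

EngineIOHandle *handles[64];
int num_requests = 0;

handles[num_requests++] = Engine_ReadFileAsync("textures/hero.png");
handles[num_requests++] = Engine_ReadFileAsync("audio/theme.ogg");

/* once per frame while the loading screen is up */
int all_done = 1;
for (int i = 0; i < num_requests; i++) {
    if (Engine_PollIO(handles[i]) != ENGINE_IO_DONE) {
        all_done = 0;   /* something is still in flight; keep showing the loading screen */
    }
}
```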

namandixit avatar Aug 28 '24 02:08 namandixit

This is one of those "you can't please everyone" things. I started with a poll interface and was asked to change it to callbacks.

You could just have an array of bools and have the callback set each one to true as pieces finish loading, and each frame check for true instead of DONE...or something like that.
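
Roughly like this, for illustration (the callback prototype here is an assumption, since the draft's actual signature isn't shown in this thread):

```c
/* Sketch of the bool-array idea; the callback prototype is an assumption,
   since the draft PR's actual signature isn't shown in this thread. */
#include <stdint.h>

#define NUM_ASSETS 3
static int loaded[NUM_ASSETS];   /* one flag per asset, indexed by asset id */

static void io_complete_callback(void *userdata)   /* runs on an SDL i/o thread */
{
    const int asset_index = (int) (intptr_t) userdata;
    loaded[asset_index] = 1;   /* a real app may want an atomic store or a lock here */
}

/* each frame on the main thread: check for "true" instead of DONE */
int all_done = 1;
for (int i = 0; i < NUM_ASSETS; i++) {
    if (!loaded[i]) {
        all_done = 0;
    }
}
```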

icculus avatar Aug 28 '24 03:08 icculus

This generally looks good. Is there going to be WriteFileAsync()? What's the interaction with storage?

slouken avatar Aug 28 '24 03:08 slouken

Looks pretty good. I do have some input:

  • There are limited threads for i/o (currently (SDL_GetCPUCount * 2) +1 at most)

While this is probably fine heuristically, I would probably add a hard maximum of 64. There are a lot of people gaming on old workstations who would exceed that cap and be unlikely to benefit from more threads in the typical case of I/O on a single drive.
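
In code, the suggested clamp would be something like the line below (illustrative only, not from the PR):

```c
/* the PR's heuristic, clamped to the suggested hard maximum of 64 threads */
const int max_io_threads = SDL_min((SDL_GetCPUCount() * 2) + 1, 64);
```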

and it uses the same thread pool that is used for notifying apps that i/o is complete... in practice, you'd have to have lots of read/writes that also have very heavy callbacks before you'd see bottlenecks, I think

This is fine. A note in the documentation that callbacks should try to be brief and forward heavy processing to other threads may be warranted, because many assets do need reasonably heavy processing.

This would benefit from ... IO Completion Ports ... io_uring ... Emscripten could use actual async file APIs

I don't disagree, but it might be more straightforward to start with an implementation that can portably share logic. A relatively thin shim over pread/pwrite on Unix-likes and ReadFile/WriteFile on Win32 can enable a single implementation with better characteristics than the generic implementation.
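
For example, the Unix half of such a shim could service each queued request with pread(), which takes an explicit offset and never touches the shared file position, so requests on the same fd don't have to serialize around a seek (a sketch, not code from the PR):

```c
/* Unix side of a thin positional-I/O shim: one worker services one request.
   pread() never moves the shared file position, so no per-file seek lock. */
#include <unistd.h>
#include <errno.h>

static ssize_t service_read_request(int fd, void *buf, size_t len, off_t offset)
{
    size_t total = 0;
    while (total < len) {
        ssize_t n = pread(fd, (char *) buf + total, len - total, (off_t) (offset + (off_t) total));
        if (n > 0) {
            total += (size_t) n;
        } else if (n == 0) {
            break;              /* EOF: short read */
        } else if (errno != EINTR) {
            return -1;          /* real error */
        }
    }
    return (ssize_t) total;
}
```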

Emscripten would benefit most from its own implementation; IIRC all the libc synchronous I/O actually gets proxied to the main runtime thread in order to simplify handling the promises from the underlying JS interfaces.

I'm sure Mac/iOS/Android have a thing, too, but the ABI wouldn't change, so we can deal with those later.

Android doesn't have a thing for this.

Mac and iOS have dispatch_io, which just doesn't mesh well with this interface (in particular, there is no way to supply a read buffer), and at least when it was new it wasn't any better for regular files than a simple thread pool anyway.

nfries88 avatar Aug 28 '24 05:08 nfries88

Crazy thought... do you actually want that many I/O threads? Maybe we should start with one and add a hint that lets applications tune that?

slouken avatar Aug 28 '24 06:08 slouken

Crazy thought... do you actually want that many I/O threads? Maybe we should start with one and add a hint that lets applications tune that?

Having a fixed number of not-enough threads makes the system unusable for many real projects, and having a few too many idle threads is always cheaper than creating them on demand. This is what POSIX AIO implementations generally got wrong and why nobody really uses them for heavy I/O (except for FreeBSD's implementation, I guess).

Tunability as a range with a heuristic selection within that range might be appropriate. All of the factors that would determine the optimal number of threads for an I/O-busy program are not really statically knowable, but a user could profile their particular I/O pattern across a good variety of machines and come up with a range that might be better than just relying on the heuristic alone.

nfries88 avatar Aug 28 '24 08:08 nfries88

This generally looks good. Is there going to be WriteFileAsync()? What's the interaction with storage?

LoadFileAsync is useful because it has to open the file, figure out how large the file is, allocate memory for it, and read into it. WriteFileAsync would only need to open and write an existing buffer, so it's only saving one step. But we could definitely add it!
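
For reference, the synchronous SDL_LoadFile already performs these steps; spelled out with the existing SDL3 IOStream calls, the work each LoadFileAsync task has to do looks roughly like this (the wrapper function itself is made up for comparison):

```c
/* The steps a LoadFileAsync task performs, written with the synchronous SDL3
   calls for comparison; this is not the PR's implementation. */
void *LoadFileSyncEquivalent(const char *path, size_t *outlen)
{
    SDL_IOStream *io = SDL_IOFromFile(path, "rb");            /* 1. open the file */
    if (!io) {
        return NULL;
    }
    const Sint64 size = SDL_GetIOSize(io);                    /* 2. figure out how large it is */
    if (size < 0) {
        SDL_CloseIO(io);
        return NULL;
    }
    Uint8 *data = (Uint8 *) SDL_malloc((size_t) size + 1);    /* 3. allocate (+1 for null terminator) */
    if (data && SDL_ReadIO(io, data, (size_t) size) != (size_t) size) {  /* 4. read into it */
        SDL_free(data);                                        /* treat a short read as failure here */
        data = NULL;
    }
    SDL_CloseIO(io);
    if (data) {
        data[size] = 0;
        if (outlen) { *outlen = (size_t) size; }
    }
    return data;
}
```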

Storage just needs an equivalent of LoadFileAsync. On the built-in implementation, it literally will just call SDL_LoadFileAsync. Current thinking was not to implement positional reads/writes at the storage level, but I'm still trying to understand what an SDL_Storage interface does when you have a file like Spider-Man 2's New York map, which is presumably dozens of gigabytes.

Crazy thought... do you actually want that many I/O threads? Maybe we should start with one and add a hint that lets applications tune that?

Right now it spins up threads on demand if there aren't any idle ones, up to a maximum count, and ones that are idle for more than 30 seconds will start terminating until only one is left (we always keep one around so there isn't a failure case when no threads can start). Until you use async i/o, it won't initialize anything, so there's no resource cost at all until you touch it. The idea being that you likely have a flood of startup loading and then there's not a lot of value in keeping idle threads around.
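
Roughly, each worker's lifecycle looks something like the sketch below (not the PR's actual code; the pool struct and helpers are made up, only the "30 seconds idle, always keep one" policy comes from the description above):

```c
/* Sketch of the worker lifecycle described above; pool fields and the
   dequeue_task/run_task helpers are hypothetical. */
typedef struct IOThreadPool {
    SDL_Mutex *lock;
    SDL_Condition *cond;
    int num_threads;
    struct IOTask *pending_tasks;                    /* hypothetical task list */
} IOThreadPool;

static IOThreadPool pool;
extern struct IOTask *dequeue_task(IOThreadPool *p); /* hypothetical */
extern void run_task(struct IOTask *t);              /* hypothetical */

static int SDLCALL io_worker(void *arg)
{
    for (;;) {
        SDL_LockMutex(pool.lock);
        while (!pool.pending_tasks) {
            if (!SDL_WaitConditionTimeout(pool.cond, pool.lock, 30000)) {
                /* idle for 30 seconds: retire, unless we're the last thread */
                if (pool.num_threads > 1) {
                    pool.num_threads--;
                    SDL_UnlockMutex(pool.lock);
                    return 0;
                }
            }
        }
        struct IOTask *task = dequeue_task(&pool);
        SDL_UnlockMutex(pool.lock);
        run_task(task);   /* blocks on the synchronous SDL_IOStream */
    }
}
```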

There's a FIXME to add a hint, and we definitely should tune the defaults in any case.

icculus avatar Aug 28 '24 14:08 icculus

Okay, single-threaded Emscripten support is in, so this is Good Enough to ship and do improvements with system-specific APIs later...but do we really want this callback API?

It forces people to deal with locking and to worry that they'll clog the thread pool if they don't move quickly enough.

Maybe it would be better if we give them something like:

SDL_GetAsyncResults(&details)

Internally we keep a queue of completed tasks, and each call to this returns the next completed one with all the details. You don't poll specific tasks, you just get the next one in the completed list. Call doesn't block, returns false if nothing is ready.

This dramatically simplifies SDL's code and reduces the mental load an app has to manage. The app can still toss work to a background thread to process the loaded data if they want...but if they just didn't want to block on i/o, they can otherwise still benefit in a single-threaded app.
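
Something like this, roughly (only the name SDL_GetAsyncResults is proposed above; the struct fields and app helpers are invented for illustration):

```c
/* Hypothetical polling loop: only the name SDL_GetAsyncResults comes from the
   proposal above; the "details" struct and its fields are invented here. */
struct async_details {           /* stand-in for whatever SDL would define */
    void  *userdata;             /* app pointer passed when the task was started */
    void  *buffer;
    size_t bytes_transferred;
    int    succeeded;
} details;

/* once per frame, on whichever thread the app prefers */
while (SDL_GetAsyncResults(&details)) {  /* non-blocking; returns false when nothing is ready */
    if (details.succeeded) {
        HandleLoadedAsset(details.userdata, details.buffer, details.bytes_transferred);  /* app code */
    } else {
        HandleLoadError(details.userdata);                                               /* app code */
    }
}
```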

icculus avatar Aug 29 '24 13:08 icculus

Sure, I buy it, let’s do that instead.

slouken avatar Aug 29 '24 14:08 slouken

The one thing that would be nice is a way to wait for multiple I/O operations to complete. WaitForCompletion(array of operations)

slouken avatar Aug 29 '24 14:08 slouken

You probably want a separate handle for each request. For example if a library does async loading it wants to get its own data separate from the app.

slouken avatar Aug 29 '24 14:08 slouken

Internally we keep a queue of completed tasks, and each call to this returns the next completed one with all the details. You don't poll specific tasks, you just get the next one in the completed list. Call doesn't block, returns false if nothing is ready.

Allowing for separate completion queues (probably specified when creating an async file object, with NULL referring to some default completion queue) would probably be the most useful form of this approach, since different threads would have different interests. That also conveniently translates very directly to how IOCP and io_uring both deal with completions. Should also be convenient for tying I/O ops directly to different asset pipelines.

nfries88 avatar Aug 30 '24 04:08 nfries88

I think allowing separate queues of I/O operations and then enumerating within each queue with SDL_GetAsyncResults(queue, &details) will be the best option. If someone doesn't want to store all handles, they can just use a single queue. If someone wants to deal with each I/O operation separately, they can have one queue per operation. And it allows future features and optimisations (queue priority, separate queue for local vs remote resources, different queues for different LOD levels, decompression on particular queues, etc.).

namandixit avatar Aug 30 '24 10:08 namandixit

Okay, this now has queryable tasks and "task groups" and the background thread callbacks are gone.

I'm super happy with how this came out. Here's the category documentation from the new version, which gives the basic idea:

CategoryIOAsync

SDL offers a way to perform i/o asynchronously. This allows an app to read or write files without waiting for data to actually transfer; the functions that request i/o never block while the request is fulfilled.

Instead, the data moves in the background and the app can check for results at their leisure.

This is more complicated than just reading and writing files in a synchronous way, but it can allow for more efficiency, and avoids framerate drops while the hard drive catches up, etc.

The general usage pattern for async i/o is:

  • Open a file with SDL_IOAsyncFromFile.
  • Start i/o tasks to that file with SDL_ReadIOAsync or SDL_WriteIOAsync.
  • Later on, use SDL_GetIOAsyncTaskResult to see if the task is finished.
  • When your tasks are done, close the file with SDL_CloseIOAsync.

If you want to track several tasks at once, regardless of the order in which they complete:

  • Create a task group with SDL_CreateIOAsyncGroup.
  • Specify this group when calling SDL_ReadIOAsync or SDL_WriteIOAsync.
  • Use SDL_GetIOAsyncGroupResult to see if anything has finished yet.

This all works, without blocking, in a single thread, but one can also use a task group in a background thread and sleep until new results have arrived:

  • Call SDL_WaitIOAsyncGroup from one or more threads to efficiently block until new tasks complete.
  • When shutting down, call SDL_SignalIOAsyncGroup to unblock sleeping threads despite there being no new tasks completed.

And, of course, to match the synchronous SDL_LoadFile, we offer SDL_LoadFileAsync as a convenience function. This will handle allocating a buffer, slurping in the file data, and null-terminating it; you still get a task handle to check later or add to a task group.
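
For a concrete picture, the single-file pattern from the list above might look like this; the function names are the draft names quoted here, but the exact parameter lists aren't shown in this thread, so the signatures below are assumptions:

```c
/* Sketch of the single-file pattern using the draft names from this revision
   of the PR; the types and parameter lists are assumptions for illustration. */
SDL_IOAsync *file = SDL_IOAsyncFromFile("level1.dat", "rb");
static char buffer[65536];
SDL_IOAsyncTask *task = SDL_ReadIOAsync(file, buffer, 0, sizeof(buffer), NULL);  /* NULL: no task group */

/* ...later, from the main loop... */
SDL_IOAsyncResult result;                       /* hypothetical result struct */
if (SDL_GetIOAsyncTaskResult(task, &result)) {  /* true once the task has finished */
    /* the read completed (or failed); the buffer is safe to use or recycle */
    SDL_CloseIOAsync(file);                     /* close once all tasks on this file are done */
}
```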

icculus avatar Sep 01 '24 22:09 icculus

I'm super happy with how this came out. Here's the category documentation from the new version, which gives the basic idea:

Nice!

slouken avatar Sep 01 '24 22:09 slouken

Hi - I was sent this direction because I've been peripherally involved in SDL_Gpu (doing reviews focused on performance and answering questions about PCI and DMA). I happen to build storage APIs professionally though, so looking at this pull request is far more up my alley. I had a really long review I was working on, but the recent changes you've made addressed about 80% of what I was going to say so this will now be much shorter (but still probably long).

I think for now I'll stick to just public API feedback, and maybe I'll post in a few separate posts.

First, I do not think it is possible to implement the API calls that do not require a task group efficiently on Linux. On Linux, your asynchronous I/O choices are either Linux AIO (not to be confused with POSIX AIO) or io_uring. Both of these APIs require you to first create some per-thread context (e.g. a task group) in order to submit I/O. Creating these on demand per-file is expensive and may run into OS-imposed limits on the number of io_urings or aio_contexts available. I strongly suggest that you only expose the APIs that require a task group - it's the only way to get true asynchronous I/O. I've personally made a similar mistake with APIs released 10+ years ago and it was really painful to find my way out of the corner I painted myself into.

Second, opening and closing the file should be asynchronous operations too (and they should be submitted to the task group). These can actually secretly block a long time in some cases. Of course, on many operating systems there isn't an asynchronous open/close option at all, and the implementations there will be blocking. But at least on Linux with io_uring you can asynchronously open and close files, so the API should allow for the best behavior rather than forcing the worst.

More review to follow.

jaedan avatar Sep 02 '24 04:09 jaedan

(Sorry, I just squashed this down to a single commit right before you posted that; it's all the same stuff, just in one commit.)

I'm skeptical that async opens are worth the API complexity, but I can totally see a close operation blocking while cached writes make their way to disk. I'm nervous about adding the complexity, but let me think on this a little tonight.

As for tasks vs task groups: presumably with io_uring (and Win32 i/o completion ports), we'll have a single internal thread that lives just to deal with moving data from the OS in a single context/completion port to our own task groups that mostly sleeps. If that's not a terrible idea, then it doesn't matter if the app wants to query a single task or a task group, it's just a question of whether they look at a struct directly or shift the head of a linked list of structs before looking at it.

icculus avatar Sep 02 '24 04:09 icculus

(Also, more review is totally welcome at any length you're willing to write, because I'm 100% winging it over here like I know what I'm doing.)

icculus avatar Sep 02 '24 04:09 icculus

The shift to support "task groups" is an excellent and necessary change. However, I do still have a few pieces of feedback on the task group design.

  1. One major source of overhead in storage APIs is the system calls themselves. The only way to work around this is to batch operations and submit them in a single system call. Some operating systems now have facilities for this (e.g. io_uring, and I think there's something on Windows called ioRing now). But in your API it isn't really clear whether this sort of batching would be allowed. I guess you could make an SDL_ReadIOAsync call simply queue onto the group and then only really submit the operation to the OS in SDL_GetIOAsyncGroupResult, but as the functions are named now, I think it isn't clear enough to the user that they need to call SDL_GetIOAsyncGroupResult once to even start the operations. I think you need a separate SDL_QueueReadIOAsync and then an SDL_SubmitIO type of setup (see the sketch after this list).

  2. Task groups should NOT be thread safe. The whole purpose of io_uring, or Linux aio, or Windows ioRing (and to some extent iocp), or even NVMe multiple submission/completion queues or Linux blk-mq is that you don't need locks. Instead, instruct users to create one task group per thread and use it for every operation on that thread for its entire duration. This needs to work essentially the same way as SDL_Gpu's command buffers because the underlying storage hardware happens to work exactly the same way as the underlying GPU hardware in terms of command submission and completion.

  3. The name task group indicates to me something lighter weight than what I think this really needs to represent. A task group needs to represent a Linux io_uring, or a Windows iocp, etc. These are not cheap things to create and destroy. In fact, we only want to do it once per thread and at thread start up time, then re-use it. Most of the modern OS APIs call it a "ring" (because it's literally a circular submission ring of commands) or a "queue". I've historically called it an "io_channel".
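
A rough illustration of the queue-then-submit split from point 1 (the two function names are the suggestions above; the parameters and the group handle are invented for the sketch):

```c
/* record several reads without touching the OS... */
SDL_QueueReadIOAsync(group, file, buf_a, 0,      65536);
SDL_QueueReadIOAsync(group, file, buf_b, 65536,  65536);
SDL_QueueReadIOAsync(group, file, buf_c, 131072, 65536);

/* ...then push the whole batch to the kernel in one system call
   (io_uring_enter on Linux, for example) */
SDL_SubmitIO(group);
```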

(More coming)

jaedan avatar Sep 02 '24 04:09 jaedan

As for tasks vs task groups: presumably with io_uring (and Win32 i/o completion ports), we'll have a single internal thread that lives just to deal with moving data from the OS in a single context/completion port to our own task groups that mostly sleeps. If that's not a terrible idea, then it doesn't matter if the app wants to query a single task or a task group, it's just a question of whether they look at a struct directly or shift the head of a linked list of structs before looking at it.

Using a thread is, unequivocally, a terrible idea (sorry, that sounds really harsh and I promise I don't mean it to be, but I've been at this a long time and everyone wants to use a thread). The hardware (and I've helped design NVMe from close to the beginning) wants you to create a separate "queue" per thread. The hardware accepts command submissions in parallel (and they're all asynchronous too!). And the operating systems have spent the last decade re-designing themselves to do parallel, asynchronous I/O internally to match the hardware.

You are supposed to create a per-thread "context" (io_uring, aio_context, ioRing, iocp, whatever) at thread start up time, and use that exclusively for the I/O operations you perform on that thread. This will allow for parallel, lockless I/O submission and provide the lowest latency.

jaedan avatar Sep 02 '24 04:09 jaedan

I'm skeptical that async opens are worth the API complexity

I have noticed file open latency before. I've never seen it be significant for a single file, and the many-file use case is probably covered adequately by SDL_LoadFileAsync and perhaps similar composite operations. I'd agree it's probably not worth the complexity.

but I can totally see a close operation blocking while cached writes make their way to disk. I'm nervous about adding the complexity, but let me think on this a little tonight.

An async sync-and-close would be useful for save files. System APIs typically will not wait for cached writes to hit the disk before closing a file; having this behavior be explicit as a composite operation is useful.

As for tasks vs task groups: presumably with io_uring (and Win32 i/o completion ports), we'll have a single internal thread that lives just to deal with moving data from the OS in a single context/completion port to our own task groups that mostly sleeps. If that's not a terrible idea, then it doesn't matter if the app wants to query a single task or a task group, it's just a question of whether they look at a struct directly or shift the head of a linked list of structs before looking at it.

Windows will block inside an Overlapped ReadFile when the underlying disk driver's queue is full. It's not that hard to hit this with random reads, because Windows disk drivers have shallow queues compared to e.g. Linux. This is unfortunate because other pending reads to the same device might still succeed without blocking by hitting the filesystem cache, and obviously when combined with other uses of Overlapped I/O (e.g. sockets) that should never block, this is even worse. It will also block on Overlapped WriteFile for writes that extend a file, which is a pretty normal thing to happen. Because of this behavior the typical procedure is to still use a threadpool to issue overlapped I/O. Probably a lesser concern, but Wine also doesn't attempt to provide asynchronous I/O to regular files at all.

As far as io_uring goes, I definitely agree with jaedan that this would be a suboptimal use of it, but also think that the typical use of a per-thread ring is both needlessly resource-expensive (they are not lightweight objects) and a poor fit for games' logic. A ring wrapped in a mutex for each task group is probably the least bad approach here, although I definitely lack jaedan's level of experience with io_uring. Extending the API with batching operations would probably bring that kind of implementation closer to optimal (also an improvement for other implementations) but that can be a later enhancement IMO.

nfries88 avatar Sep 02 '24 06:09 nfries88

We could create a heavy weight async I/O context when a thread first uses the async API. This could be stored in TLS and be automatically cleaned up, so the users get the benefit of that without having to explicitly track things themselves.
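
A generic sketch of that idea (using POSIX TLS here only to show the shape; SDL would use its own TLS machinery internally, and the AsyncIOContext type and Create/Destroy helpers are made up):

```c
/* Lazily-created, TLS-stored per-thread context with automatic cleanup. */
#include <pthread.h>

typedef struct AsyncIOContext AsyncIOContext;    /* wraps an io_uring / IoRing / etc. */
extern AsyncIOContext *CreateAsyncIOContext(void);   /* hypothetical */
extern void DestroyAsyncIOContext(void *ctx);        /* hypothetical */

static pthread_key_t io_ctx_key;
static pthread_once_t io_ctx_once = PTHREAD_ONCE_INIT;

static void make_key(void)
{
    /* the destructor runs when the thread exits, so cleanup is automatic */
    pthread_key_create(&io_ctx_key, DestroyAsyncIOContext);
}

static AsyncIOContext *GetThreadIOContext(void)
{
    pthread_once(&io_ctx_once, make_key);
    AsyncIOContext *ctx = pthread_getspecific(io_ctx_key);
    if (!ctx) {   /* first async i/o call on this thread: create the heavy context once */
        ctx = CreateAsyncIOContext();
        pthread_setspecific(io_ctx_key, ctx);
    }
    return ctx;
}
```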

It might be worth adding an io_uring and I/O ring implementation to validate that this API works efficiently with those interfaces.

The only thing that affects the ABI is the question of how asynchronous I/O interacts with the storage API, and it's becoming clear that it doesn't. This seems like it's intended as a low-level, high-performance interface to disk files. If that's true, then let's bump this out of the ABI milestone.

slouken avatar Sep 02 '24 14:09 slouken

Because of this behavior the typical procedure is to still use a threadpool to issue overlapped I/O.

On Linux for many many years (via Linux aio) and on Windows for the last 3, this is no longer the standard procedure. We shouldn't make the API inefficient on modern operating systems by design, but rather accept the inefficiency only for operating systems with poor asynchronous IO support.

Let's imagine we have a system with a background thread that does the IO processing. This thread may spin while there's IO outstanding for efficiency, but it certainly can't just spin all the time. IO in games is sporadic, so most commonly this thread will be asleep when we next start a burst of IO. So we queue up the operation into the shared queue or ring and then signal the background thread to wake. On most operating systems, thread scheduling latency is only at about 1ms granularity, and even if it reacts instantly just the mechanics of restoring thread state to wake it takes a significant chunk of time. So all of this added time is added to our IO request. But then the underlying SSD will complete the request in 20-100 microseconds most likely. Why would we 10x our latency because of software design? It makes no sense.

but also think that the typical use of a per-thread ring is both needlessly resource-expensive (they are not lightweight objects) and a poor fit for games' logic

I really can't disagree more on both points. These objects are not cheap to create because they usually do memory allocations and system calls, but actual resource use is very minimal. Usually it's 64 bytes * number of queue entries. And you make one of these per thread that wants to perform IO. If you have true asynchronous IO, then you only need a very limited number of threads in your game that are doing asset loading. We're talking kilobytes of total resource use here.

And regarding a poor fit for game logic, a ring buffer for command submission is a perfect fit for game logic. On each tick you enqueue all your IO operations. At end of tick you submit and check for completions. It cannot be more ideal. It also happens to be exactly how the GPU API works too, and it would be how an optimal networking API would function as well.

jaedan avatar Sep 02 '24 14:09 jaedan

We could create a heavy weight async I/O context when a thread first uses the async API. This could be stored in TLS and be automatically cleaned up, so the users get the benefit of that without having to explicitly track things themselves.

The issue with this approach is the user needs to have some kind of handle to poll once per game tick. If you have a secret async context per thread and many task groups mapped to it, you don't have enough information about whether you've already polled that tick to make the most efficient use of the backing OS APIs. You end up polling too often or not often enough. The only solution I've ever found that really works is to make the async context a visible object in the API.

jaedan avatar Sep 02 '24 14:09 jaedan

The only thing that affects the ABI is the question of how asynchronous I/O interacts with the storage API, and it's becoming clear that it doesn't.

My primary motivation is having an API that can efficiently load data into the new SDL_GpuTransferBuffers directly, to be shipped over to the GPU. Ideally, this process will perform no CPU-copies of the data at all, and be entirely asynchronous.

But I'd use it for save files and stuff too.

jaedan avatar Sep 02 '24 14:09 jaedan

Because of this behavior the typical procedure is to still use a threadpool to issue overlapped I/O.

On Linux for many many years (via Linux aio) and on Windows for the last 3, this is no longer the standard procedure.

I agree, but SDL supports versions of Windows without IoRing and I was describing the behavior of the specific pre-existing API.

On most operating systems, thread scheduling latency is only at about 1ms granularity, and even if it reacts instantly just the mechanics of restoring thread state to wake it takes a significant chunk of time.

This is sometimes true of scheduler interrupts themselves but not typical of latency following a wake event, which is usually in the order of microseconds as long as a CPU is available to run on.

Why would we 10x our latency because of software design? It makes no sense.

Because the threads that know when we need to start I/O tend to be latency-sensitive (trying to hit those framerates), we generally do not want to do processing on them that could be done on another thread, especially when that processing is often more expensive than the I/O itself. Decompression on the CPU is typically about 10x slower than a read from an SSD; I know that the cutting edge is GPU decompression, which is also done asynchronously, but not everyone is using that.

but also think that the typical use of a per-thread ring is both needlessly resource-expensive (they are not lightweight objects) and a poor fit for games' logic

I really can't disagree more on both points. These objects are not cheap to create because they usually do memory allocations and system calls, but actual resource use is very minimal. Usually it's 64 bytes * number of queue entries.

Maybe I did overestimate the resource use, then.

And regarding a poor fit for game logic, a ring buffer for command submission is a perfect fit for game logic. On each tick you enqueue all your IO operations. At end of tick you submit and check for completions. It cannot be more ideal. It also happens to be exactly how the GPU API works too, and it would be how an optimal networking API would function as well.

Ring buffers are great for inter-thread communication in games; it's the pattern of a single thread both issuing I/O and dealing with completions that I think is a bad fit, and that's what you seem to be advocating for (see above).

nfries88 avatar Sep 02 '24 18:09 nfries88

Okay, I'm going to think on this a bit and make some decisions.

To be clear, the goal here was to make something that doesn't have to block on i/o, and is easy to use, not to make something that moves data through DMA right from the disk to the GPU. If we have to give up some amount of efficiency to make the API pleasant, I'm willing to do that.

Also, I'm deeply worried by the Microsoft document that says overlapped i/o might turn out to be synchronous if the file is using NTFS compression, because that more or less makes overlapped i/o no better than just using synchronous i/o from a background thread for my purposes...if that's the state of things, we can just call this thing ready to merge.

icculus avatar Sep 02 '24 18:09 icculus

overlapped i/o might turn out to be synchronous if the file is using NTFS compression, because that more or less makes overlapped i/o no better than just using synchronous i/o from a background thread for my purposes...

Yeah, AFAIK the Overlapped I/O system just won't use the internal threadpool for anything other than waiting on the I/O itself, so it blocks and completes synchronously when it needs to do anything else. So there's really no point with files using the transparent encryption or compression functionality. It's still potentially useful to use Overlapped I/O for concurrent position-independent reads, though, because without the overlapped flag when opening the file, ReadFile and WriteFile will serialize concurrent accesses since they update the file pointer even when an offset is supplied (more unfortunate behavior), but I kinda doubt that applies to encrypted or compressed files either.

nfries88 avatar Sep 02 '24 19:09 nfries88

To be clear, the goal here was to make something that doesn't have to block on i/o, and is easy to use, not to make something that moves data through DMA right from the disk to the GPU. If we have to give up some amount of efficiency to make the API pleasant, I'm willing to do that.

The difference is one additional function call to create the asynchronous I/O context at start up time. Without this, you simply cannot do efficient, lockless, parallel I/O and you'll force all the implementations to spawn background threads and take locks. Is one extra function call such a burden on the user that they'd rather sacrifice performance and efficiency? I would certainly be happy to add the extra 3 lines of code or whatever it'll be to my application.

Also, I'm deeply worried by the Microsoft document that says overlapped i/o might turn out to be synchronous if the file is using NTFS compression, because that more or less makes overlapped i/o no better than just using synchronous i/o from a background thread for my purposes...if that's the state of things, we can just call this thing ready to merge.

I'm in the camp that overlapped I/O on Windows is, in practice, mostly useless. If you're running on a Windows system without ioRing, I think it should just use the default thread pool implementation. But this doesn't let us off the hook on good API design, because ioRing and io_uring and Linux aio do exist and are effective.

Because the threads that know when we need to start I/O tend to be latency-sensitive (trying to hit those framerates), we generally do not want to do processing on them that could be done on another thread, especially when that processing is often more expensive than the I/O itself. Decompression on the CPU is typically about 10x slower than a read from an SSD; I know that the cutting edge is GPU decompression, which is also done asynchronously, but not everyone is using that.

I'm not advocating that everyone must do their I/O from the game thread. I'm advocating that the API design should not force the back-end implementation to use a thread. Any application remains free to submit I/O and process completions on whatever thread they want. So let's not design an API that forces the inefficient case and instead create one that allows for the efficient case where it is available (Linux, Windows with ioRing support).

jaedan avatar Sep 02 '24 20:09 jaedan

I'm not advocating that everyone must do their I/O from the game thread. I'm advocating that the API design should not force the back-end implementation to use a thread. Any application remains free to submit I/O and process completions on whatever thread they want. So let's not design an API that forces the inefficient case and instead create one that allows for the efficient case where it is available (Linux, Windows with ioRing support).

Providing a thread-safe API across all implementations should be a priority IMO.

With a mutex-guarded ring per task queue there should be no need for background threads in the implementation.

And if you were to only use this task queue on a single thread, the overhead should be only a few ns on both ends, because Linux's pthread mutexes and Windows' SRWLocks both have an atomic fast path. It is certainly cheap relative to the generally necessary io_uring_enter syscall. So, at least in my head, an implementation can be made thread-safe without a significant sacrifice to the performance of the single-threaded usecase.

IIRC the submission side of the ring and the completion side of the ring could also be protected by separate mutexes.
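
A sketch of that arrangement (the io_ring type and ring_* helpers stand in for an io_uring/IoRing wrapper; none of this is code from the PR):

```c
/* One mutex-guarded ring per task group, with submission and completion
   locked independently as suggested above. */
struct io_ring;                                   /* hypothetical OS-ring wrapper */
extern void ring_queue_read(struct io_ring *r, int fd, void *buf, Uint64 off, size_t len);
extern void ring_submit(struct io_ring *r);       /* e.g. io_uring_enter() */

typedef struct TaskGroup {
    struct io_ring *ring;
    SDL_Mutex *submit_lock;     /* taken only while queueing and submitting entries */
    SDL_Mutex *complete_lock;   /* taken only while reaping completions */
} TaskGroup;

static void SubmitRead(TaskGroup *g, int fd, void *buf, Uint64 off, size_t len)
{
    SDL_LockMutex(g->submit_lock);   /* uncontended lock is just an atomic fast path */
    ring_queue_read(g->ring, fd, buf, off, len);
    ring_submit(g->ring);
    SDL_UnlockMutex(g->submit_lock);
}
```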

nfries88 avatar Sep 02 '24 21:09 nfries88