[Needs thorough testing] async model file listing
This is a messy proof of concept: it uses aiofiles to handle model listing and cache validation with async parallelization, so that high-latency drives can list models fast.
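Conceptually, the original PoC approach is shaped something like this (a minimal sketch, not the exact code in this PR; `list_model_folders` and the error handling are illustrative):

```python
import asyncio
import aiofiles.os

async def list_one_folder(folder: str) -> tuple[str, list[str]]:
    # aiofiles.os.listdir offloads the blocking os.listdir call to a
    # thread pool, so many high-latency folder reads can overlap.
    try:
        names = await aiofiles.os.listdir(folder)
    except OSError:
        names = []
    return folder, names

async def list_model_folders(folders: list[str]) -> dict[str, list[str]]:
    # One task per folder, gathered together: total wall time is close
    # to the slowest single folder rather than the sum of all of them.
    results = await asyncio.gather(*(list_one_folder(f) for f in folders))
    return dict(results)
```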
I tested by running ComfyUI in Tokyo accessing a model folder in Los Angeles via SMB.
Key timings:
- before anything: ~70s to read `object_info`
- cache code disabled, no further edits: ~25s
- asyncfiles listing, with cache validation disabled: ~8s
- asyncfiles listing, asyncfiles cache validation enabled: ~28s
- async cache validation only: ~23s
All times above have a pretty wide margin of error, as the latency was not constant, but the variance was an order of magnitude smaller than most of those jumps.
This code undoubtedly has side effects (I haven't tested symlinks, for example), not to mention it requires a new dependency and adds significant complexity. So the main question is: is the case of high-latency drive reads worth the trouble of merging and maintaining this messier model listing code?
I will be building and submitting a separate, more easily agreeable PR to buff up the caching code soon.
It turns out aiofiles wasn't doing as much as I thought it was (and it's kinda wacky that it works at all), so the new commit replaces the external dependency with a single helper class.
ThreadPoolExecutor ends up a lot cleaner than trying to jank asyncio into, y'know, helping asynchronously process IO :')
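Roughly, the replacement has this shape (a sketch with illustrative names, not the PR's actual identifiers; the persistent-executor detail is explained under Improvement notes below):

```python
import os
from concurrent.futures import ThreadPoolExecutor

class ParallelFileLister:
    """Run blocking directory listings on a shared thread pool so
    high-latency folders are read concurrently instead of one by one."""

    def __init__(self, max_workers: int = 16):
        # One persistent executor: creating a fresh pool on every call
        # adds measurable overhead on fast local drives (see notes below).
        self._executor = ThreadPoolExecutor(max_workers=max_workers)

    def list_folders(self, folders: list[str]) -> dict[str, list[str]]:
        # executor.map preserves input order and blocks until all
        # listings finish; each os.listdir runs on its own worker thread.
        return dict(zip(folders, self._executor.map(self._safe_listdir, folders)))

    @staticmethod
    def _safe_listdir(folder: str) -> list[str]:
        try:
            return os.listdir(folder)
        except OSError:
            return []
```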
Testing notes:
- On a very fast local SSD, this actually increases the load time a touch, from `0.02` to `0.2` seconds (`0.08` after the first opti below). Not really a huge difference here (literally added all of 60 milliseconds), and I think it would inherently fix itself with scale: the bigger you get, the more the async helps, whereas with fast native listing the async is just adding a chunk of overhead. But arguably a case to make the async behavior optional?
- On a very fast SSD over a very fast LAN, it goes from `0.15` to `0.27` seconds (`0.17` after the bonus preload opti below), so same problem. When you're that fast anyway, you're wasting more time dealing with a thread pool in Python code than you're saving by using it.
- In both these cases, I have model lists with a few tens of folders and a few thousand files. There are a few very dense folders (note the async opti is folder-to-folder here, and cannot benefit cases of folders with very large file counts).
- For a cached load (i.e. refreshing `object_info` after it's already been called before, so it's only checking mtimes), timings are all very fast and similar; the async PR is very, very slightly faster (talking e.g. `0.018` vs `0.016` seconds). Note the async cache check (sketched after this list) is a bit more efficient than the first listing, because the cache check has a pre-existing dense list of folders, while the first load has to take time discovering folders before emitting the async tasks.
- Just having something else running in the background produced a bigger difference in these high-speed cases than any code change does anyway.
- I tried other folders, symlinks, etc., and everything works as intended on both Windows and Linux. No bugs introduced by this PR as far as my own testing can find.
- Note that Python's threading support is limited and is a blocker to some potential further improvements. A "true" async executor would have significantly less overhead, but we got the tools we got.
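For reference, the parallel mtime check mentioned above is roughly this shape (again a sketch with illustrative names; `cached_mtimes` and the executor argument are assumptions, not the PR's real structures):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def validate_cache(cached_mtimes: dict[str, float],
                   executor: ThreadPoolExecutor) -> bool:
    # The cache already holds a dense folder -> mtime map, so all stat
    # calls can fan out at once, with no discovery pass needed first.
    def unchanged(item: tuple[str, float]) -> bool:
        folder, cached = item
        try:
            return os.path.getmtime(folder) == cached
        except OSError:
            return False  # folder vanished: cache is stale

    return all(executor.map(unchanged, cached_mtimes.items()))
```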
Improvement notes:
- I swapped from creating a new ThreadPoolExecutor every time to retaining one persistent instance (reflected in the sketch above), and that recovered a decent chunk of the time difference. So yeah, the time difference here is literally just the extra CPU processing to handle the threading in very fast cases.
- My final commit, a bit of an experimental bonus (might be preferred to remove, or gate behind an arg?), pre-loads the model lists into the cache during server startup. I perpetually have margin-of-error issues in my local test env, but it does appear to be a touch faster, as it runs all folders in parallel, whereas normally `object_info` runs the main folder list sequentially and only the sublists in parallel. Naturally, because this runs at startup, the actual first `object_info` call is now cached and completes in a few milliseconds. I'd love to see what this does in a very slow env, but as I am no longer in Tokyo I do not have my painfully slow setup available. (Note: this part intentionally uses its own temporary ThreadPoolExecutor to prevent thread lockup issues; since ThreadPoolExecutor is not a proper async implementation, i.e. a worker can't release while it waits, it will lock up if one queue is used across multiple layers. Sketched after this list.)
- Arguably one could even make the model list preload non-blocking, i.e. start it early and start running the server while it's still loading, so that it happens in the background, but that would need more research to ensure there wouldn't be unwanted side effects. It'd also require adding a lock to prevent overlap if `object_info` is called before the preload is done (also shown in the sketch below).
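Putting those last two notes together, the preload could look roughly like this (a sketch; `get_model_list`, `preload_model_cache`, and the background variant are illustrative stand-ins, not the PR's actual code):

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor

_preload_lock = threading.Lock()

def get_model_list(folder: str) -> list[str]:
    # Stand-in for the real cached listing call; listing a folder once
    # is what warms the cache for later object_info requests.
    try:
        return os.listdir(folder)
    except OSError:
        return []

def preload_model_cache(base_folders: list[str]) -> None:
    # Deliberately a separate, short-lived executor: reusing the shared
    # listing pool here can deadlock, because ThreadPoolExecutor workers
    # can't yield while waiting on sub-listings queued to the same pool.
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(get_model_list, base_folders))

def preload_in_background(base_folders: list[str]) -> threading.Thread:
    # Hypothetical non-blocking variant: kick off the preload and return
    # immediately. A real implementation would take _preload_lock in the
    # object_info path too, so an early request waits for the preload.
    def run():
        with _preload_lock:
            preload_model_cache(base_folders)
    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread
```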