Concurrent linking/post-install of packages
Currently Mamba/Micromamba seem to perform the symlink/post-install steps serially. It can be a pretty long process when seconds count.
Is it possible to have it happen concurrently?
I suspect this is being done serially because of concerns around clobbering, but it should be possible to detect ahead of time which files will be clobbered and handle them deterministically, instead of letting a semi-random package 'win' by being linked last.
The post-install steps possibly present another challenge, but packages that have them are rare enough that they could be installed serially first, with all other packages installed concurrently afterwards.
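To make the idea concrete, here is a rough sketch of that scheduling (names are made up, not mamba's actual API: `link_package`, `has_post_link`, and the `pkg.paths` attribute are all hypothetical): pre-compute which destination paths appear in more than one package, link those packages and the ones with post-install steps serially in a fixed order, and fan the rest out to a thread pool.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def install_all(packages, link_package, has_post_link):
    # Count how many packages want each destination path; any path
    # requested by more than one package would be clobbered.
    counts = Counter(path for pkg in packages for path in pkg.paths)
    clobbered = {path for path, n in counts.items() if n > 1}

    # Packages that run post-install steps or touch a clobbered path
    # are linked serially in a fixed order, so the "winner" for any
    # clobbered file is deterministic rather than semi-random.
    serial = [pkg for pkg in packages
              if has_post_link(pkg) or clobbered.intersection(pkg.paths)]
    concurrent = [pkg for pkg in packages if pkg not in serial]

    for pkg in serial:
        link_package(pkg)
    with ThreadPoolExecutor() as executor:
        # Consume the iterator so any linking errors propagate here.
        list(executor.map(link_package, concurrent))
```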
A good example of where this might be faster is ncurses on an Apple Silicon Mac. It has around 2,000 files which take a hot minute to link! Even adding multithreading within individual packages might help with that.
Actually I'm not sure if you can get much faster because the file system will be the limit. Maybe 2x. You might want to benchmark this outside of Mamba.
Right, I think in most cases the file system will be very limiting. 2x is still pretty awesome though, and I'll create some benchmarks to check things out.
I'm actually operating on a RAM disk for my use case, so I think the speedup might be very significant there.
Interesting, with 100,000 files:
```python
with ThreadPoolExecutor(max_workers=5) as executor:
    for i in range(0, LINKS):
        future = executor.submit(hardlink, i)
    executor.shutdown(wait=True)
# Time (mean ± σ):   2.419 s ± 0.116 s  [User: 2.115 s, System: 1.075 s]
# Range (min … max): 2.264 s … 2.598 s  10 runs

for i in range(0, LINKS):
    hardlink(i)
# Time (mean ± σ):   247.2 ms ± 2.2 ms  [User: 76.1 ms, System: 171.0 ms]
# Range (min … max): 244.1 ms … 251.1 ms  10 runs
```
I'm guessing there's either some interpreter optimization happening, or os.link is so fast that the thread pool overhead is longer than the operation itself.
I haven't written C++ in years, so I'm not sure about the equivalent.
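If the per-task overhead really is the problem, one mitigation (a rough sketch; `link_chunk` and `parallel_link` are made-up names) would be to submit links in chunks, so each task amortizes the pool overhead over many `os.link` calls:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def link_chunk(ops):
    # One task performs many links, so the submit/dispatch overhead is
    # paid once per chunk instead of once per file.
    for src, dst in ops:
        os.link(src, dst)

def parallel_link(ops, workers=5, chunk_size=1000):
    # ops is a list of (src, dst) pairs
    chunks = [ops[i:i + chunk_size] for i in range(0, len(ops), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # Consume the iterator so exceptions propagate.
        list(executor.map(link_chunk, chunks))
```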
I checked my 'average' mamba environment and it was only 68,000 files, but the linking step definitely takes drastically longer than 250 ms; the same is true of conda as well. Perhaps the real overhead is parsing the file locations from the metadata?
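One way to test that hypothesis (a quick sketch; the package directory name below is hypothetical, and I'm assuming the standard conda package layout where `info/paths.json` lists the files) would be to time the metadata parsing on its own:

```python
import json
import time
from pathlib import Path

# Time only the metadata parsing: read the file list from an unpacked
# package's info/paths.json (the directory name below is made up).
start = time.perf_counter()
data = json.loads(Path("ncurses-6.3-h07bb92c_1/info/paths.json").read_text())
files = [entry["_path"] for entry in data["paths"]]
print(f"{len(files)} entries parsed in {time.perf_counter() - start:.4f} s")
```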
Here's some benchmarking that I've done. It expects you to have the ncurses package unpacked at `ncurses*`.
```python
import time
from pathlib import Path
import tempfile
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from multiprocessing import freeze_support


def hardlink(op):
    # op is a (link, target) pair; Path.hardlink_to requires Python 3.10+
    link, target = op
    link.hardlink_to(target)


if __name__ == "__main__":
    freeze_support()
    files = [p for p in Path(".").glob("ncurses*/**/*") if p.is_file()]
    folders = [p for p in Path(".").glob("ncurses*/**/*") if p.is_dir()]
    with tempfile.TemporaryDirectory() as tmpdir:
        # Recreate the package's directory tree in the temp dir
        for folder in folders:
            (Path(tmpdir) / folder).mkdir(exist_ok=True, parents=True)
        hardlink_ops = [(Path(tmpdir) / f, f) for f in files]
        start_time = time.perf_counter()
        if 1:  # single-threaded baseline
            for op in hardlink_ops:
                hardlink(op)
        if 0:  # 4-worker thread pool
            with ThreadPoolExecutor(max_workers=4) as executor:
                executor.map(hardlink, hardlink_ops)
                executor.shutdown(wait=True)
        if 0:  # 4-worker process pool
            with ProcessPoolExecutor(max_workers=4) as executor:
                executor.map(hardlink, hardlink_ops, chunksize=100)
                executor.shutdown(wait=True)
        print(time.perf_counter() - start_time)
```
Fastest on my M1 Mac is the 4-worker thread pool at 0.2 s; the single-threaded baseline is 0.55 s.
Using map definitely makes things more equal; I guess that submit call was pricey!
On my M1 Mac, I can roughly replicate your results. However, on Ubuntu 20.04 it's flipped: the 4-worker pool takes 0.07 s vs 0.02 s single-threaded.
Given that this is already much faster than what I'm observing, I did some more digging.
It seems to be a specific performance issue affecting the micromamba Apple Silicon build.
| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| `mamba[mac] create -n test555 conda-forge::ncurses -y` | 3.414 ± 0.149 | 3.262 | 3.714 | 1.00 |
| `mamba[linux] create -n test555 conda-forge::ncurses -y` | 1.304 ± 0.032 | 1.283 | 1.394 | 1.00 |
| `micromamba[mac] create -n test555 conda-forge::ncurses -y` | 27.395 ± 2.567 | 26.357 | 34.692 | 1.00 |
| `micromamba[linux] create -n test555 conda-forge::ncurses -y` | 0.999 ± 0.032 | 0.994 | 1.004 | 1.00 |
I'm happy to either open a new issue for this or continue here. Improving general linking performance would still be nice, but it's not as urgent if the mac build is fixed.
micromamba 0.25.1 in all cases
@shughes-uk the macOS timings are very curious! I don't have that problem on the M1 where I am running micromamba. Strange! Do you think you could trace the process to find which functions are taking up most of the time? E.g. using Xcode Instruments: https://www.avanderlee.com/debugging/xcode-instruments-time-profiler/
I know that @baszalmstra experimented with async / parallel linking of files and he might have some comments.
@wolfv are you using the macOS arm64 build or the x86 build via Rosetta emulation? I'll try to find some time to profile it otherwise.
`micromamba info` output:

```
           environment : /Users/samanthahughes/miniconda3 (active)
          env location : /Users/samanthahughes/miniconda3
     user config files : /Users/samanthahughes/.mambarc
populated config files : /Users/samanthahughes/.condarc
      libmamba version : 0.25.0
    micromamba version : 0.25.1
          curl version : libcurl/7.76.1 SecureTransport (OpenSSL/1.1.1q) zlib/1.2.12 libssh2/1.9.0 nghttp2/1.47.0
    libarchive version : libarchive 3.3.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.5
      virtual packages : __unix=0=0
                         __osx=12.5.1=0
                         __archspec=1=arm64
              channels :
      base environment : /Users/samanthahughes/micromamba
              platform : osx-arm64
```
For profiling, I guess you could also just run with the most verbose log level and use some tool that adds timestamps to output lines, to see whether any particular step takes a lot of time.
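For example (a quick sketch; `stamp.py` is a made-up name, and I'm assuming micromamba's repeated `-v` flags raise the log level), a tiny stdin filter can prefix each log line with the elapsed time:

```python
# stamp.py: prefix each stdin line with seconds elapsed since start, e.g.
#   micromamba create -n test555 conda-forge::ncurses -y -vvv 2>&1 | python stamp.py
import sys
import time

start = time.perf_counter()
for line in sys.stdin:
    sys.stdout.write(f"{time.perf_counter() - start:8.3f}  {line}")
```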
Just commenting that we're running into this for networked storage, where parallel IO requests do indeed speed things up, especially when the files are backed by high-latency db requests.