Concurrent linking/post-install of packages
Currently Mamba/Micromamba seem to perform the symlink/post-install steps serially. It can be a pretty long process when seconds count.
Is it possible to have it happen concurrently?
I suspect this is being done serially because of concerns around clobbering, but it should be possible to detect ahead of time which files will be clobbered and handle them deterministically, instead of letting a semi-random package 'win' by being linked last.
The post-install steps possibly present another challenge, but packages that have them are rare enough that they could be installed serially first, with all other packages installed concurrently afterwards.
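To make the idea concrete, here is a rough sketch of that scheduling (names are made up, not mamba's actual API: `link_package`, `has_post_link`, and the `pkg.paths` attribute are all hypothetical): pre-compute which destination paths appear in more than one package, link those packages and the ones with post-install steps serially in a fixed order, and fan the rest out to a thread pool.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def install_all(packages, link_package, has_post_link):
    # Count how many packages want each destination path; any path
    # requested by more than one package would be clobbered.
    counts = Counter(path for pkg in packages for path in pkg.paths)
    clobbered = {path for path, n in counts.items() if n > 1}

    # Packages that run post-install steps or touch a clobbered path
    # are linked serially in a fixed order, so the "winner" for any
    # clobbered file is deterministic rather than semi-random.
    serial = [pkg for pkg in packages
              if has_post_link(pkg) or clobbered.intersection(pkg.paths)]
    concurrent = [pkg for pkg in packages if pkg not in serial]

    for pkg in serial:
        link_package(pkg)
    with ThreadPoolExecutor() as executor:
        # Consume the iterator so any linking errors propagate here.
        list(executor.map(link_package, concurrent))
```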
A good example of where this might be faster is ncurses on an Apple Silicon Mac. It has around 2,000 files which take a hot minute to link! Even adding multithreading within individual packages might help with that.
Actually I'm not sure if you can get much faster because the file system will be the limit. Maybe 2x. You might want to benchmark this outside of Mamba.
Right, I think in most cases the file system will be very limiting. 2x is still pretty awesome though, and I'll create some benchmarks to check things out.
I'm actually operating on a RAM disk for my use case, so I think the speedup might be very significant there.
Interesting, with 100,000 files:
```python
with ThreadPoolExecutor(max_workers=5) as executor:
    for i in range(0, LINKS):
        future = executor.submit(hardlink, i)
    executor.shutdown(wait=True)
# Time (mean ± σ):   2.419 s ± 0.116 s  [User: 2.115 s, System: 1.075 s]
# Range (min … max): 2.264 s … 2.598 s  10 runs

for i in range(0, LINKS):
    hardlink(i)
# Time (mean ± σ):   247.2 ms ± 2.2 ms  [User: 76.1 ms, System: 171.0 ms]
# Range (min … max): 244.1 ms … 251.1 ms  10 runs
```
I'm guessing there's either some interpreter optimization happening, or os.link is so fast that the thread pool overhead is longer than the operation itself.
I haven't written C++ in years, so I'm not sure about the equivalent.
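If the per-task overhead really is the problem, one mitigation (a rough sketch; `link_chunk` and `parallel_link` are made-up names) would be to submit links in chunks, so each task amortizes the pool overhead over many `os.link` calls:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def link_chunk(ops):
    # One task performs many links, so the submit/dispatch overhead is
    # paid once per chunk instead of once per file.
    for src, dst in ops:
        os.link(src, dst)

def parallel_link(ops, workers=5, chunk_size=1000):
    # ops is a list of (src, dst) pairs
    chunks = [ops[i:i + chunk_size] for i in range(0, len(ops), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # Consume the iterator so exceptions propagate.
        list(executor.map(link_chunk, chunks))
```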
I checked my 'average' mamba environment and it was only 68,000 files, but the linking step definitely takes drastically longer than 250 ms; the same is true of conda as well. Perhaps the real overhead is parsing the file locations from the metadata?
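One way to test that hypothesis (a quick sketch; the package directory name below is hypothetical, and I'm assuming the standard conda package layout where `info/paths.json` lists the files) would be to time the metadata parsing on its own:

```python
import json
import time
from pathlib import Path

# Time only the metadata parsing: read the file list from an unpacked
# package's info/paths.json (the directory name below is made up).
start = time.perf_counter()
data = json.loads(Path("ncurses-6.3-h07bb92c_1/info/paths.json").read_text())
files = [entry["_path"] for entry in data["paths"]]
print(f"{len(files)} entries parsed in {time.perf_counter() - start:.4f} s")
```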
Here's some benchmarking that I've done. It expects you to have the ncurses package unpacked at `ncurses*`.
```python
import time
from pathlib import Path
import tempfile
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from multiprocessing import freeze_support


def hardlink(op):
    # op is a (link, target) pair; Path.hardlink_to requires Python 3.10+
    link, target = op
    link.hardlink_to(target)


if __name__ == "__main__":
    freeze_support()
    files = [p for p in Path(".").glob("ncurses*/**/*") if p.is_file()]
    folders = [p for p in Path(".").glob("ncurses*/**/*") if p.is_dir()]
    with tempfile.TemporaryDirectory() as tmpdir:
        # Recreate the package's directory tree in the temp dir
        for folder in folders:
            (Path(tmpdir) / folder).mkdir(exist_ok=True, parents=True)
        hardlink_ops = [(Path(tmpdir) / f, f) for f in files]
        start_time = time.perf_counter()
        if 1:  # single-threaded baseline
            for op in hardlink_ops:
                hardlink(op)
        if 0:  # 4-worker thread pool
            with ThreadPoolExecutor(max_workers=4) as executor:
                executor.map(hardlink, hardlink_ops)
                executor.shutdown(wait=True)
        if 0:  # 4-worker process pool
            with ProcessPoolExecutor(max_workers=4) as executor:
                executor.map(hardlink, hardlink_ops, chunksize=100)
                executor.shutdown(wait=True)
        print(time.perf_counter() - start_time)
```
Fastest on my M1 Mac is the 4-worker thread pool at 0.2 s; the single-threaded baseline is 0.55 s.
Using map definitely makes things more equal; I guess that submit call was pricey!
On my M1 Mac, I can roughly replicate your results. However, on Ubuntu 20.04 it's flipped: the 4-worker pool takes 0.07 s vs 0.02 s single-threaded.
Given that this is already much faster than what I'm observing, I did some more digging.
It seems to be a specific performance issue affecting the micromamba Apple Silicon build.
| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| `mamba[mac] create -n test555 conda-forge::ncurses -y` | 3.414 ± 0.149 | 3.262 | 3.714 | 1.00 |
| `mamba[linux] create -n test555 conda-forge::ncurses -y` | 1.304 ± 0.032 | 1.283 | 1.394 | 1.00 |
| `micromamba[mac] create -n test555 conda-forge::ncurses -y` | 27.395 ± 2.567 | 26.357 | 34.692 | 1.00 |
| `micromamba[linux] create -n test555 conda-forge::ncurses -y` | 0.999 ± 0.032 | 0.994 | 1.004 | 1.00 |
I'm happy to either open a new issue for this or continue here. Improving general linking performance would still be nice, but it's not as urgent if the mac build is fixed.
micromamba 0.25.1 in all cases
@shughes-uk the macOS timings are very curious! I don't have that problem on the M1 where I am running micromamba. Strange! Do you think you could trace the process to find which functions are taking up most of the time? E.g. using Xcode Instruments: https://www.avanderlee.com/debugging/xcode-instruments-time-profiler/
I know that @baszalmstra experimented with async / parallel linking of files and he might have some comments.
@wolfv are you using the macOS arm64 build or the x86 build via Rosetta emulation? I'll try to find some time to profile it otherwise.
`micromamba info` output:

```
           environment : /Users/samanthahughes/miniconda3 (active)
          env location : /Users/samanthahughes/miniconda3
     user config files : /Users/samanthahughes/.mambarc
populated config files : /Users/samanthahughes/.condarc
      libmamba version : 0.25.0
    micromamba version : 0.25.1
          curl version : libcurl/7.76.1 SecureTransport (OpenSSL/1.1.1q) zlib/1.2.12 libssh2/1.9.0 nghttp2/1.47.0
    libarchive version : libarchive 3.3.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.5
      virtual packages : __unix=0=0
                         __osx=12.5.1=0
                         __archspec=1=arm64
              channels :
      base environment : /Users/samanthahughes/micromamba
              platform : osx-arm64
```
For profiling, I guess you could also just run with the most verbose log level and use some tool that adds timestamps to output lines, to see whether any particular step takes a lot of time.
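For example (a quick sketch; `stamp.py` is a made-up name, and I'm assuming micromamba's repeated `-v` flags raise the log level), a tiny stdin filter can prefix each log line with the elapsed time:

```python
# stamp.py: prefix each stdin line with seconds elapsed since start, e.g.
#   micromamba create -n test555 conda-forge::ncurses -y -vvv 2>&1 | python stamp.py
import sys
import time

start = time.perf_counter()
for line in sys.stdin:
    sys.stdout.write(f"{time.perf_counter() - start:8.3f}  {line}")
```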
Just commenting that we're running into this for networked storage, where parallel IO requests do indeed speed things up, especially when the files are backed by high-latency db requests.