Reducing the size of the end user environment + avoid long installation paths

Open ocaisa opened this issue 2 years ago • 19 comments

Currently we have very long paths to get to our installations. We could dramatically reduce the size of these paths by removing the need to make them human readable. We can maintain human readable paths for MODULEPATH, but the software installations the module files reference do not need to retain human readable paths.

We can even maintain the human readable path via symlinks, but use the short path when doing the installations. This should help save us from maxing out the size of people's environments.

ocaisa avatar Jul 05 '23 11:07 ocaisa

Can you give a concrete example of what you have in mind for non-human readable paths?

There's definitely an impact of having a large environment, but is it significant enough to justify making things more cryptic? Even if you maintain a human-readable symlink farm on top, end users will still be exposed to the "cryptic" paths; that's impossible to avoid.

One thing that would definitely help here is to make pure libraries (no binaries) actual link-only dependencies, rather than keeping them as runtime dependencies. Since we're using RPATH, we don't need to load those modules at all to get a working environment, since the binaries know where the libraries are already.

boegel avatar Jul 08 '23 07:07 boegel

Something like 11a: combining 3 alphanumerics already gives over 45k possibilities; use 4 and we never have to address it again. Reducing each software dir to that is a big saving (multiplied per package). I don't know how "exposed" the users will be to the path: the MODULEPATH would be unchanged, but it is true that the output of which gcc would now be more cryptic.
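The size of the identifier space is easy to check. The sketch below assumes a 36-character alphabet (digits plus one case of letters), which is consistent with the "over 45k" figure; count_ids is a made-up helper name used only for illustration.

```shell
#!/bin/bash
# Count the identifiers available for a given alphabet size and length.
# count_ids is a hypothetical helper, not part of EESSI or EasyBuild.
count_ids() {
    local alphabet_size=$1 length=$2
    echo $(( alphabet_size ** length ))
}

count_ids 36 3   # three characters
count_ids 36 4   # four characters
```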

ocaisa avatar Jul 08 '23 11:07 ocaisa

To be clear here, what I mean is maintaining a mapping between the EESSI software installation path and the other directory, so

versions/2021.12/software/Linux/x86_64/intel/haswell/software -> .installs/11a

with that path being used as the install path for EasyBuild. The path mapping could be maintained in the installation script for that release.
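As a rough illustration of that mapping, the sketch below recreates it in a throwaway directory rather than on /cvmfs. The directory layout mirrors the example above; the GCC subdirectory is only a stand-in for a real installation.

```shell
#!/bin/bash
# Recreate the long-path -> short-path mapping in a temporary directory.
# Nothing here touches /cvmfs; "11a" is just the example identifier.
demo_root=$(mktemp -d)
long_parent="$demo_root/versions/2021.12/software/Linux/x86_64/intel/haswell"
short_dir="$demo_root/.installs/11a"

mkdir -p "$long_parent" "$short_dir"

# The human-readable path is a symlink to the short, real install directory
ln -s "$short_dir" "$long_parent/software"

# Install something via the short path (stand-in for an EasyBuild install)
mkdir -p "$short_dir/GCC/13.2.0"

# The installation is reachable through the long path as well
ls "$long_parent/software/GCC"

# The symlink itself records the short target
readlink "$long_parent/software"

# Remove "$demo_root" by hand when finished experimenting.
```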

ocaisa avatar Jul 08 '23 11:07 ocaisa

This actually happens in real life! @jpecar raised in the EasyBuild Slack that users are exceeding MAX_PATH. The solution proposed here would solve that problem and could even be implemented via an EasyBuild hook, which just creates the installation symlink and replaces it in the final module file.

ocaisa avatar Jan 24 '24 14:01 ocaisa

Seems like we're now hitting this issue as well: https://github.com/EESSI/software-layer/pull/563#issuecomment-2180005389 (only on skylake, which has the longest paths)

bedroge avatar Jun 21 '24 08:06 bedroge

And this will get significantly worse when we start adding accelerators

ocaisa avatar Jun 21 '24 08:06 ocaisa

Maybe something we can explore with dev.eessi.io? ... and if it works there, rebuild all the software with shortened paths.

trz42 avatar Jun 21 '24 09:06 trz42

I think that's a good idea, but I would probably just implement from the next version release (and maybe a couple of cases that are currently "broken")

ocaisa avatar Jun 21 '24 09:06 ocaisa

Given how we use versions -> host_injections replacement a lot these days, it probably makes sense to stick to something like

versions/2021.12/software/Linux/x86_64/intel/haswell/software -> versions/.installs/11a

ocaisa avatar Jun 21 '24 09:06 ocaisa

Here's a script that can map between the two, and also checks that a symlink exists:

#!/bin/bash

# Function to generate incrementing alphanumeric strings of 3 characters
generate_alphanumeric_string() {
    local index=$1
    local base_characters='0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    local base_length=${#base_characters}
    local result=''

    # Emit exactly 3 base-36 digits, least significant first, so small indices
    # come out zero-padded on the left (index 0 -> 000, index 36 -> 010).
    # No separate padding step is needed: the loop always runs 3 times.
    while [ ${#result} -lt 3 ]; do
        local remainder=$((index % base_length))
        result="${base_characters:$remainder:1}$result"
        index=$((index / base_length))
    done

    echo "$result"
}

# List of allowed directories with their corresponding alphanumeric values
declare -A allowed_dirs_map

# Populate allowed directories and their alphanumeric values
allowed_directories=(
    # New directories must be added to the end of the list!
    "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software"
    "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software"
    "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_v1/software"
    "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/generic/software"
    "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/haswell/software"
    "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software"
    "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software"
    "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software"
    "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software"
)

# Generate alphanumeric strings for each directory
for (( i=0; i<${#allowed_directories[@]}; i++ )); do
    dir="${allowed_directories[$i]}"
    allowed_dirs_map["$dir"]=$(generate_alphanumeric_string $i)
done

# Check if an argument is provided
if [ -z "$1" ]; then
    echo "Usage: $0 <directory>"
    exit 1
fi

# Argument to check
input_dir="$1"

# Validate that the provided directory is in the allowed list
# (a direct key lookup avoids accidental substring matches on the joined array)
if [[ -z "${allowed_dirs_map[$input_dir]+set}" ]]; then
    echo "Directory '$input_dir' is not in the list of allowed directories."
    exit 1
fi

# Get the corresponding alphanumeric value
alphanumeric_value="${allowed_dirs_map["$input_dir"]}"

# Declare variables for symlink
versions_dir="/cvmfs/software.eessi.io/versions"
install_dir="$versions_dir/.installs"
symlink_target="$install_dir/$alphanumeric_value"

# Check if the input directory exists
if [ ! -d "$input_dir" ]; then
    echo "Directory '$input_dir' does not exist."
    exit 1
fi

# Check if there is already a symlink from input_dir to the expected target
if [ -L "$input_dir" ]; then
    current_target=$(readlink -f "$input_dir")
    if [ "$current_target" = "$symlink_target" ]; then
        echo "$symlink_target"
    else
        echo "Error: There is a symlink at '$input_dir', but it points to a different target:"
        echo "Current symlink target: $current_target"
        echo "Expected symlink target: $symlink_target"
        exit 1
    fi
else
    # If no symlink exists, provide instructions to create it
    echo "No symlink found at '$input_dir'."
    echo "To create a symlink, run the following command:"
    echo "ln -s \"$symlink_target\" \"$input_dir\""
    exit 1
fi

ocaisa avatar Jun 21 '24 10:06 ocaisa

Ok, if we use three characters (alphanumerics with upper/lower case is 62 options per character) we can actually encode information:

  • First character for the version (that should cover us for the next 50 years or so)
  • Second character for the cpu
  • Third character for the accelerator (optional?)

We can also add a hook that injects EESSIROOTXXXAPPXXX corresponding to the human readable location

ocaisa avatar Jun 27 '24 20:06 ocaisa

I'd rather see us use a slightly longer string (say ~10 chars) that allows us to make it semi-readable. Like using 2306 instead of versions/2023.06, lxhsw for linux/x86_64/intel/haswell, lxskl for linux/x86_64/intel/skylake_avx512, lanvv1 for linux/aarch64/neoverse_v1, etc.

That would already greatly reduce a long path like /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software/... to /cvmfs/software.eessi.io/2306/software/lxskl/software/....
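A minimal sketch of such a shortening, assuming only the abbreviations given as examples above (anything not in the table is left unchanged). shorten_path and the arch_abbrev table are hypothetical names, not existing EESSI tooling.

```shell
#!/bin/bash
# Map a long EESSI-style path to the semi-readable short form.
# The table holds only the example abbreviations from the discussion.
declare -A arch_abbrev=(
    ["linux/x86_64/intel/haswell"]="lxhsw"
    ["linux/x86_64/intel/skylake_avx512"]="lxskl"
    ["linux/aarch64/neoverse_v1"]="lanvv1"
)

shorten_path() {
    local path=$1 arch from to
    # versions/2023.06 -> 2306
    if [[ $path =~ /versions/20([0-9]{2})\.([0-9]{2})/ ]]; then
        from="/versions/20${BASH_REMATCH[1]}.${BASH_REMATCH[2]}/"
        to="/${BASH_REMATCH[1]}${BASH_REMATCH[2]}/"
        path=${path/"$from"/$to}
    fi
    # linux/x86_64/intel/skylake_avx512 -> lxskl, and so on
    for arch in "${!arch_abbrev[@]}"; do
        from="/$arch/"
        to="/${arch_abbrev[$arch]}/"
        path=${path/"$from"/$to}
    done
    echo "$path"
}

shorten_path "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software/BLIS"
```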

We also want to avoid ending up with a mix of long and short paths in the installations, since that would lead to confusion. Letting /cvmfs/software.eessi.io/2306/software/lxskl symlink to /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512 will result in resolved symlinks in some cases. Can we use hardlinks instead?

Do we want to use the short path for the actual installation, but then use the long path in module files since that's more user-facing?

boegel avatar Jun 28 '24 06:06 boegel

Yes, indeed the idea here is to use the short path in the installation since that is the one that will actually solve the problem. The only reason for the symlink is to retain a working human readable path.

It's hard for me to see the point in retaining a more cryptic human readable path at the expense of quite a few unnecessary characters; that may benefit us, perhaps, but at a cost to our end users. What exactly is the concern if the path is not semi-human readable? Note that the module file itself would still have a fully human readable path with my proposal:

{EESSI 2023.06} ocaisa@LAPTOP-O6HF2IKC:~$ module show BLIS
-------------------------------------------------------------------------------------------------------------------------------
   /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/modules/all/BLIS/0.9.0-GCC-13.2.0.lua:
-------------------------------------------------------------------------------------------------------------------------------
...

You would lose this with your proposed approach.

You can add CI for the module files if your concern is some kind of architecture mixing. Let's look at a concrete example, where the true installation path is

/cvmfs/software.eessi.io/versions/.installs/a01

and (for convenience only) a symlink exists

/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software -> /cvmfs/software.eessi.io/versions/.installs/a01

The location of our final module file will be /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/modules/all/BLIS/0.9.0-GCC-13.2.0.lua and we end up with

setenv("EBROOTBLIS","/cvmfs/software.eessi.io/versions/.installs/a01/BLIS/0.9.0-GCC-13.2.0")

inside. We can set a single EESSI_SOFTWARE_PATH_MAPPING variable (where realpath $EESSI_SOFTWARE_PATH/software is $EESSI_SOFTWARE_PATH_MAPPING) as part of our init script and have CI that checks that all EBROOT* variables start with this path (or EESSI_SOFTWARE_PATH/software for existing installations). The advantage of that approach is that it is trivial to make it part of the EasyBuild configuration (we just need to set an additional EASYBUILD_INSTALLPATH_SOFTWARE variable to the EESSI_SOFTWARE_PATH_MAPPING value), and this could be implemented today on top of what we already have.
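The CI check could look roughly like the sketch below. It assumes the check runs in a shell where the modules under test are already loaded; check_ebroot_prefixes is a hypothetical helper, and only the two variable names in the commented call come from the proposal above.

```shell
#!/bin/bash
# Verify that every EBROOT* variable in the environment starts with one of
# the two allowed prefixes. Returns non-zero if any variable does not.
check_ebroot_prefixes() {
    local mapping=$1 legacy=$2 var value status=0
    # Refuse empty prefixes, which would match any absolute path
    [[ -n $mapping && -n $legacy ]] || { echo "both prefixes are required" >&2; return 2; }
    for var in $(compgen -v EBROOT); do
        value=${!var}
        case "$value" in
            "$mapping"/*|"$legacy"/*) ;;   # installed under an expected prefix
            *) echo "unexpected prefix: $var=$value" >&2; status=1 ;;
        esac
    done
    return $status
}

# Typical call after loading modules (assumes both variables are set):
# check_ebroot_prefixes "$EESSI_SOFTWARE_PATH_MAPPING" "$EESSI_SOFTWARE_PATH/software"
```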

This would lead to a 48 character saving per environment variable, per loaded module on our existing approach. Comparing to the other proposed approach (but retaining a versions subdirectory to make using host_injections easy)

/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software
/cvmfs/software.eessi.io/versions/.installs/a01
/cvmfs/software.eessi.io/versions/2306/software/lxskl/software

the difference is only 15 characters. However, with our agreed approach to GPU installation the savings are larger, as we would need to add the additional directory structure accel/nvidia/cc80 (or the shorter acc/nv/cc80). Either way that is at least another 10 characters (again, per variable, per module), which are not necessary with my proposal.
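The prefix lengths being compared can be measured directly. The sketch below uses only the three paths listed above; exact savings depend on what is counted (for instance, whether a trailing separator is included), so treat the output as a sanity check rather than a definitive figure.

```shell
#!/bin/bash
# Print the length of each candidate prefix so they can be compared.
prefixes=(
    "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software"
    "/cvmfs/software.eessi.io/versions/.installs/a01"
    "/cvmfs/software.eessi.io/versions/2306/software/lxskl/software"
)
for p in "${prefixes[@]}"; do
    printf '%3d  %s\n' "${#p}" "$p"
done
```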

ocaisa avatar Jun 28 '24 09:06 ocaisa

To give some context to this, for having R-bundle-Bioconductor alone loaded, there are a total of 1638 occurrences of /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software in the environment so we would save 48*1638 characters.
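A count like this can be reproduced with standard tools. The sketch below is one way to do it, not necessarily how the figure above was obtained; count_prefix_occurrences is a hypothetical helper.

```shell
#!/bin/bash
# Count how often a prefix occurs in the current environment and estimate
# the characters saved by replacing it with a shorter prefix.
count_prefix_occurrences() {
    local prefix=$1
    # -o prints each match on its own line; -F treats the prefix literally
    env | grep -oF -- "$prefix" | wc -l | tr -d ' '
}

long_prefix="/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software"
short_prefix="/cvmfs/software.eessi.io/versions/.installs/a01"

n=$(count_prefix_occurrences "$long_prefix")
echo "occurrences: $n"
echo "characters saved: $(( n * (${#long_prefix} - ${#short_prefix}) ))"
```

Run it after module load to see the effect of a particular module on the environment.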

ocaisa avatar Jun 28 '24 09:06 ocaisa

To me, the main issue is that some installations break because the installation path as we have it now is too long.

Retaining some level of readability for humans as opposed to using slightly shorter but cryptic paths is important imho, it's going to save lots of brain cycles going forward.

I understand that this would imply a slightly larger environment, but why exactly is it a problem that the environment is, say, 100k characters larger than it could be with shorter cryptic path names? What's the big win there? There may be a tiny performance impact, but isn't that mostly an academic argument?

boegel avatar Jul 03 '24 09:07 boegel

To be clear, that was a 100k saving for one module package the user needed (which of course I picked because it has a lot of dependencies, but that is irrelevant from the end user perspective).

It's not just academic, the maximum size of an environment variable is PAGE_SIZE*32 (https://askubuntu.com/a/1385554) which with stock Ubuntu would be 128k characters. There is a limit there that we are at risk of reaching with poor design. The length of PATH is increased by 10k characters with "just" R-bundle-Bioconductor loaded.

{EESSI 2023.06} ocaisa@LAPTOP-O6HF2IKC:~$ module purge
{EESSI 2023.06} ocaisa@LAPTOP-O6HF2IKC:~$ echo $PATH | wc -c
620
{EESSI 2023.06} ocaisa@LAPTOP-O6HF2IKC:~$ module load R-bundle-Bioconductor
{EESSI 2023.06} ocaisa@LAPTOP-O6HF2IKC:~$ echo $PATH | wc -c
10913

(for LIBRARY_PATH it is more than 14k characters)
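To see how close a given environment is to that limit, something like the following can be used. It is a sketch that takes the PAGE_SIZE*32 figure above at face value; the list of variables is only an example, and the limit strictly applies to the whole NAME=value string, so the percentages are approximate.

```shell
#!/bin/bash
# Compare the length of a few path-like variables against the per-string
# limit quoted above (PAGE_SIZE * 32).
page_size=$(getconf PAGE_SIZE)
limit=$(( page_size * 32 ))
echo "per-string limit: $limit bytes"

for var in PATH LIBRARY_PATH LD_LIBRARY_PATH CPATH; do
    value=${!var:-}   # empty if the variable is unset
    echo "$var: ${#value} characters ($(( 100 * ${#value} / limit ))% of limit)"
done
```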

ocaisa avatar Jul 03 '24 10:07 ocaisa

I was just looking into the OpenFOAM issue again. The failing command (which compiles libOpenFOAM.so) has 602 paths to something in /cvmfs/software.eessi.io/... and a total length of almost 2600 characters. The latter didn't seem that extreme, and I couldn't reproduce it by just running that command itself.

While googling a bit more, I found that it's not only about the command itself, but also all environment variables are taken into account. By setting a dummy environment variable with a lot of long paths in it (I tried export SOMETHING=$LIBRARY_PATH:$LIBRARY_PATH:....etc...), I was ultimately (had to make that dummy variable a bit longer a few times) able to reproduce the error.
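The effect described here can be observed without OpenFOAM. The sketch below estimates how much of the combined argument-plus-environment budget (ARG_MAX) the environment already uses, then pads it with a dummy variable; the variable name and padding size are arbitrary, and the byte count is approximate.

```shell
#!/bin/bash
# Estimate how much of the exec budget the environment already consumes.
# ARG_MAX bounds the combined size of command-line arguments and environment.
env_bytes() {
    env | wc -c | tr -d ' '
}

arg_max=$(getconf ARG_MAX)
echo "environment: $(env_bytes) bytes; ARG_MAX: $arg_max bytes"

# Growing the environment shrinks the room left for the command line
export DUMMY_PADDING=$(printf 'x%.0s' {1..10000})
echo "after padding: $(env_bytes) bytes"
unset DUMMY_PADDING
```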

So in that sense this does prove @ocaisa's point that every character seems to count to keep the entire environment small enough.

With this insight we may also be able to install OpenFOAM by filtering/unsetting some environment variables during the build; I'll give that a try.

bedroge avatar Jul 05 '24 08:07 bedroge

Actually found a workaround for OpenFOAM (but it's still a good example of how some extra characters in the installation path can greatly increase the size of the environment): https://github.com/EESSI/software-layer/pull/563#issuecomment-2211367644

bedroge avatar Jul 05 '24 20:07 bedroge

Perhaps there is a middle ground here? We can come up with a mapping scheme that is readable if you are familiar with the scheme:

  • don't include the OS family for now (we are not looking at MacOS support really any longer given we have an approach)
  • two characters for date (one character for year (0 is 2023, a is 2033, we are all retired by 2059), one character for month (January is 1, a is October, c is December))
  • 3 characters for cpu (one for family, two for type, lowercase)
  • 4 characters for gpu (one for family, 3 for type, lowercase) - optional

To give an example then:

D06Cxa5An085

D = date
0 = 2023
6 = June
C = CPU
x = x86
a5 = avx512
A = accelerator
n = nvidia
085 = compute capability 8.5

I am very keen to drop the subdirectory structure so that at the very least we get consistent path lengths regardless of architecture (and this helps to avoid arch-specific surprises like in the case of OpenFOAM)

If you felt strongly about the family, it could be Fl. The good thing about this scheme is that it is parseable (as long as we are consistent)
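To illustrate that the scheme is parseable, here is a minimal sketch of a parser for identifiers of the form shown above. parse_install_id is a hypothetical helper, and the regular expression encodes one reading of the scheme (no family field, accelerator part optional).

```shell
#!/bin/bash
# Parse an identifier such as D06Cxa5An085 into its fields.
parse_install_id() {
    local id=$1 digits='0123456789abcdefghijklmnopqrstuvwxyz'
    local re='^D([0-9a-z])([1-9a-c])C([a-z])([a-z0-9]{2})(A([a-z])([a-z0-9]{3}))?$'
    [[ $id =~ $re ]] || { echo "not a valid identifier: $id" >&2; return 1; }

    # The year is the position of the character in the alphabet, from 2023
    local before=${digits%%"${BASH_REMATCH[1]}"*}
    local year=$(( 2023 + ${#before} ))
    # The month is the position itself (1 = January, a = October, c = December)
    before=${digits%%"${BASH_REMATCH[2]}"*}
    local month=${#before}

    echo "year=$year month=$month cpu_family=${BASH_REMATCH[3]} cpu_type=${BASH_REMATCH[4]}"
    if [[ -n ${BASH_REMATCH[5]} ]]; then
        echo "accel_family=${BASH_REMATCH[6]} accel_type=${BASH_REMATCH[7]}"
    fi
}

parse_install_id D06Cxa5An085
```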

ocaisa avatar Jul 19 '24 08:07 ocaisa