split worker for reduced concurrency

Open vondele opened this issue 2 years ago • 10 comments

It seems that the time loss problem is getting more acute with multiple large-core workers. See e.g. https://tests.stockfishchess.org/tests/view/62e523e2b383a712b1386193 This is probably due to cutechess not being able to deal with a large concurrency, and our best workaround is probably to split the worker internally (so not visible from the user side) into multiple cutechess processes, each with reduced concurrency.

vondele avatar Jul 30 '22 20:07 vondele

No matter what the prevention solution is -- fixing cutechess or making workarounds inside the worker -- fishtest should be able to manage tasks with time losses separately from high-residual tasks (e.g. by rejecting or purging them).

dubslow avatar Jul 30 '22 21:07 dubslow

technologov-28cores-r345 has lots of issues. See e.g. #1360.

vdbergh avatar Jul 31 '22 09:07 vdbergh

I've pinged him on Discord. But there are at least two other workers with a lot of losses.

vondele avatar Jul 31 '22 09:07 vondele

Script for linux:

  • creates the "fishtest" user
  • creates 5 copies of the "worker" directory to be able to run 5 workers
  • creates a systemd unit file (named "fishtest-worker") that takes the worker ID as a parameter
  • start, stop, or check the status of any single worker with, e.g., sudo systemctl start fishtest-worker@3
  • start or stop all the workers with, e.g., sudo systemctl start fishtest-worker@{0..4}
  • set the auto start for the workers with sudo systemctl enable fishtest-worker@{0..4}.service

To revert:

  • sudo systemctl disable fishtest-worker@{0..4}.service
  • sudo rm /etc/systemd/system/fishtest-worker@.service
  • sudo deluser --remove-home fishtest (this deletes the fishtest user and all its files and folders)

#!/bin/bash
# setup_workers.sh
# to set up fishtest workers on Ubuntu 20.04, simply run:
# sudo bash setup_workers.sh 2>&1 | tee setup_workers.sh.log

# print CPU information
cpu_model=$(grep "^model name" /proc/cpuinfo | sort | uniq | cut -d ':' -f 2)
n_cpus=$( grep "^physical id" /proc/cpuinfo | sort | uniq | wc -l)
online_cores=$(grep "^bogo" /proc/cpuinfo | wc -l)
n_siblings=$(grep "^siblings" /proc/cpuinfo | sort | uniq | cut -d ':' -f 2)
n_cpu_cores=$(grep "^cpu cores" /proc/cpuinfo | sort | uniq | cut -d ':' -f 2)
total_siblings=$((${n_cpus} * ${n_siblings}))
total_cpu_cores=$((${n_cpus} * ${n_cpu_cores}))
printf "CPU model : %s\n" "${cpu_model}"
printf "CPU       : %3d  -  Online cores    : %3d\n" ${n_cpus} ${online_cores}
printf "Siblings  : %3d  -  Total siblings  : %3d\n" ${n_siblings} ${total_siblings}
printf "CPU cores : %3d  -  Total CPU cores : %3d\n" ${n_cpu_cores} ${total_cpu_cores}

# read the fishtest credentials and the number of cores to be contributed
echo
echo "Write your fishtest username:"
read -r usr_name
echo "Write your fishtest password:"
# -s keeps the password out of the terminal and out of the tee log
read -r -s usr_pwd
echo
echo "Write the number of cores to be contributed to fishtest:"
echo "(max suggested 'Total CPU cores - 1')"
read -r n_cores

# install required packages
apt update && apt full-upgrade -y && apt autoremove -y && apt clean
apt install -y python3 python3-venv git build-essential 

# new linux account used to run the worker
worker_user='fishtest'
# create user for fishtest
useradd -m -s /bin/bash ${worker_user}

# add the bash variable for the python virtual env
sudo -i -u ${worker_user} << 'EOF'
echo export VENV=${HOME}/fishtest/worker/env >> .profile
EOF

# download fishtest
sudo -i -u ${worker_user} << EOF
git clone --single-branch --branch master https://github.com/glinscott/fishtest.git
cd fishtest
git config user.email "[email protected]"
git config user.name "your_name"
EOF

# fishtest worker setup and first start only to write the "fishtest.cfg" configuration file
sudo -i -u ${worker_user} << EOF
python3 -m venv \${VENV}
\${VENV}/bin/python3 -m pip install --upgrade pip setuptools wheel
\${VENV}/bin/python3 -m pip install requests

\${VENV}/bin/python3 \${HOME}/fishtest/worker/worker.py --concurrency ${n_cores} ${usr_name} ${usr_pwd} --only_config True && echo "concurrency successfully set" || echo "Restart the script using a proper concurrency value"
EOF

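# copy the worker directory N=5 times (change according to your needs)
# note: each copy inherits the same fishtest.cfg, so the concurrency
# entered above applies to every copy, not to the 5 workers in total
sudo -i -u ${worker_user} << 'EOF'
cd fishtest
for ((k=0; k<=4; k++)); do
  cp -r worker worker${k}
done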
EOF

echo
echo "Setup fishtest-worker as a service"
read -p "Press <Enter> to continue or <CTRL+C> to exit ..."

# install fishtest-worker as systemd service
# start/stop the worker with:
# sudo systemctl start fishtest-worker@{0..4}
# sudo systemctl stop fishtest-worker@{0..4}
# check the log with:
# sudo journalctl -u fishtest-worker@0.service
# the service uses the worker configuration file "fishtest.cfg"

# get the worker_user $HOME
worker_user_home=$(sudo -i -u ${worker_user} << 'EOF'
echo ${HOME}
EOF
)

cat << EOF > /etc/systemd/system/fishtest-worker@.service
[Unit]
Description=Fishtest worker %i
After=multi-user.target

[Service]
Type=simple
StandardOutput=file:${worker_user_home}/fishtest/worker%i/worker.log
StandardError=inherit
ExecStart=${worker_user_home}/fishtest/worker%i/env/bin/python3 ${worker_user_home}/fishtest/worker%i/worker.py
User=${worker_user}
WorkingDirectory=${worker_user_home}/fishtest/worker%i

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload

echo
echo "Start fishtest-worker service"
read -p "Press <Enter> to continue or <CTRL+C> to exit ..."
systemctl start fishtest-worker@{0..4}.service

echo
echo "Enable fishtest-worker service auto start"
read -p "Press <Enter> to continue or <CTRL+C> to exit ..."
systemctl enable fishtest-worker@{0..4}.service

ppigazzini avatar Aug 08 '22 23:08 ppigazzini

I think the cleanest way to achieve splitting is to create a --pool N argument for the worker. The default, N=1, does nothing (current behaviour). N=0 means that the worker is a clone worker, and N>=2 means that the worker is a master (see below).

If N>=2, the worker would quietly create N clone copies of itself (in subdirectories), each with a slightly adapted fishtest.cfg (the memory, concurrency and uuid_prefix options), and then start these clones using popen (with --pool 0).

For each clone there should probably be a controlling thread in the master to manage its life cycle... I am a bit worried about Ctrl-C handling though (we want the clones to quit if the master worker receives Ctrl-C).

Clone workers do not upgrade. If the master upgrades then the clone workers are stopped and deleted. They will be recreated when the master restarts.

The main reason for doing it this way would be to keep the error handling manageable.

If we instead started multiple copies of cutechess within a single worker, the error handling would be a nightmare, I think.

vdbergh avatar Aug 09 '22 05:08 vdbergh
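
A minimal sketch of the master side of this proposal, assuming the clones are plain child processes started with subprocess.Popen. The function name run_clones is hypothetical, and the demo commands are stand-ins rather than real worker invocations. The sketch only illustrates the Ctrl-C concern; per-clone controlling threads, the adapted fishtest.cfg files and upgrades are left out.

```python
import subprocess
import sys


def run_clones(commands):
    """Start one child process per command and wait for all of them.

    Returns the list of exit codes. If the master is interrupted with
    Ctrl-C, every clone that is still running is terminated and reaped
    before the interrupt propagates.
    """
    clones = [subprocess.Popen(cmd) for cmd in commands]
    try:
        return [clone.wait() for clone in clones]
    except KeyboardInterrupt:
        for clone in clones:
            if clone.poll() is None:  # still running
                clone.terminate()
        for clone in clones:
            clone.wait()
        raise


if __name__ == "__main__":
    # Stand-in commands (assumption): a real master would start the
    # worker script inside each clone directory instead.
    demo = [[sys.executable, "-c", "print('clone %d')" % i] for i in range(2)]
    print(run_clones(demo))
```

On POSIX systems, a Ctrl-C typed in a terminal is normally delivered to the whole foreground process group, so the clones may see the signal directly as well; the explicit terminate() is there for clones that ignore it or no longer share the terminal.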

Yes, I agree that this could be done at a higher level, as you describe.

vondele avatar Aug 09 '22 06:08 vondele

Allowing the clone workers to update would lead to pretty bad race conditions. So I adapted the proposal accordingly.

vdbergh avatar Aug 09 '22 06:08 vdbergh

Problems:

  • the users' willingness to start N workers, with either --pool or systemd. The big workers are contributed by people with a high skill set, surely able to set up a systemd unit, but they never did it (even when provided with the script above)
  • how to stop or restart only 1-2 workers without stopping or restarting all the workers and thereby losing the games played

ppigazzini avatar Aug 09 '22 09:08 ppigazzini

to stop or restart only 1-2 workers without stopping or restarting all the workers, so losing the games played

I was thinking that, from the point of view of the user, the result of --pool would be a single worker. So the clones live and die together. With a single 95 core worker, one cannot restart just 20 cores either.

The two solutions (--pool and the user manually splitting the worker) are not mutually exclusive.

vdbergh avatar Aug 09 '22 09:08 vdbergh

We can set the default value of pool to ceil(concurrency/32). In that way nothing would change for workers with <= 32 cores.

A 33 core worker would split up as a 16 core worker and a 17 core worker.

vdbergh avatar Aug 09 '22 20:08 vdbergh
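
The proposed default can be sketched in a few lines of Python. The function name split_concurrency and the max_per_worker parameter are hypothetical, and the even split is an assumption chosen to reproduce the 33 core example above (16 + 17); an actual implementation might distribute the cores differently.

```python
import math


def split_concurrency(concurrency, max_per_worker=32):
    """Split `concurrency` cores over ceil(concurrency / max_per_worker) workers."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    pool = math.ceil(concurrency / max_per_worker)
    base, extra = divmod(concurrency, pool)
    # `extra` workers get one additional core so the shares sum to `concurrency`.
    return [base] * (pool - extra) + [base + 1] * extra


if __name__ == "__main__":
    for cores in (8, 32, 33, 95):
        print(cores, split_concurrency(cores))
```

With these assumptions, any worker with at most 32 cores stays a single worker, matching the "nothing would change" case.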