suite2p
solved - no multi-threading on AMD EPYC 7702P 64-Core during registration -- how to speed up AMD processing?
Hi!
First: Thanks for your great work!
I am currently trying to optimize suite2p on our new AMD 64-core server (fresh conda/suite2p install on Ubuntu 5.4, with the standard environment dependencies).
During registration, however, it spends most of the time using only a single thread. On an Intel Xeon with only 8 cores and comparable clock speed, registration runs almost twice as fast and appears to fill up the cores nicely. Is this expected behavior?
I played with the usual suspects (MKL versions, the infamous MKL_DEBUG_CPU_TYPE flag, etc.) - no single-thread speedup and no change in multithreading behavior.
Thanks!
(settings and suite2p output: https://www.evernote.com/shard/s4/sh/363400de-eef8-2c75-44f0-1ec14b9b64a1/c94284efba1aab302a985bb93d0ee70c)
Edit: I consider this solved for us. https://github.com/MouseLand/suite2p/issues/725#issuecomment-1435949465
Are you using the latest version with the torch dependency for the FFT? I switched to torch thinking it would handle the multithreading better.
Thx - but yes. All running with pytorch.
...I will check a fresh environment and try installing pytorch with OpenBLAS. I'll post here then.
Thanks, let us know what you find. I'm not sure there is a fast way to do FFTs on an AMD CPU with python. You could try timing scipy's fft functions - if they're faster we can add an option to force their use, but I'm guessing they are sped up by MKL too. They unfortunately removed the MKL_DEBUG_CPU_TYPE flag, I think.
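A rough way to run that timing (a sketch only, not suite2p code; the array size, dtype, and thread counts are arbitrary choices):

# Compare scipy's pocketfft (with and without workers) against torch on the same stack.
import numpy as np
import scipy.fft
import torch
from time import perf_counter

frames = np.random.rand(500, 512, 512).astype(np.float32)

t0 = perf_counter()
scipy.fft.fft2(frames)                      # default: single worker
print("scipy.fft.fft2, 1 worker:   %.2f s" % (perf_counter() - t0))

t0 = perf_counter()
scipy.fft.fft2(frames, workers=-1)          # negative value = use all available cores
print("scipy.fft.fft2, all cores:  %.2f s" % (perf_counter() - t0))

t0 = perf_counter()
torch.fft.fft2(torch.from_numpy(frames))    # thread count set via torch.set_num_threads()
print("torch.fft.fft2:             %.2f s" % (perf_counter() - t0))

If scipy with workers=-1 wins on the AMD machine, that would support adding an option to force the scipy path.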
Hello! Just checking whether there is any news about this issue, @trose-neuro? I was about to purchase a workstation with an AMD processor... but then read about MKL... Thank you in advance!
Sorry - no, nothing yet. Frankly, I'd recommend going Intel.
we will add to the documentation that we recommend intel processors
I've messaged a few HPC engineers at AMD over twitter about this and they said they will get back to me/seemed interested in helping since our lab just bought a bunch of AMD compute.
EDIT:
One did! A helpful fellow named Nicholas who is a principal scientist at AMD. Here are some initial things he mentioned that we can try checking out:
I agree with you that there is no reason to think we can't run a multithreaded FFT.
If it is using fftw as the backend, you will want to be sure it was compiled with the multithreaded flags,
such as: -lfftw3_threads -lfftw3 -lm on Unix, or -lfftw3_omp -lfftw3 -lm
(see: https://fftw.org/fftw3_doc/Usage-of-Multi_002dthreaded-FFTW.html)
Probably worthwhile to run pyFFTW then (AFAIK, torch.fft does not use the FFTW library): https://github.com/pyFFTW/pyFFTW
But this makes me not too optimistic regarding its performance (the benchmark is pretty old, though): https://thomasaarholt.github.io/fftspeedtest/fftspeedtest.html
Maybe wrapping the new Threadripper-optimized AMD FFTW would be an option: https://developer.amd.com/amd-aocl/fftw/
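If someone wants to give pyFFTW a quick try before wiring it into suite2p, here is a minimal multithreaded sketch (assumes a pip-installed pyfftw; the array shape and thread count are placeholders):

# pyFFTW's numpy-like interface accepts a `threads` argument; enabling the cache keeps
# the FFTW plans between calls so the planning cost is only paid once.
import numpy as np
import pyfftw

pyfftw.interfaces.cache.enable()

frames = np.random.rand(500, 512, 512).astype(np.float32)
out = pyfftw.interfaces.numpy_fft.fftn(frames, axes=(-2, -1), threads=16)

pyFFTW can also be registered as a scipy.fft backend, which might make it easier to slot in behind an existing scipy code path.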
I did not check for a while if there is news regarding MKL/AMD torch etc...
Currently also considering adding GPUs to our big AMD servers to at least use them a little bit more efficiently - our workflows are all piped through these machines and currently, it is a bit of a sad story...
I like @trose-neuro's idea to try AOCL's FFTW implementation. This is certainly a software problem; AMD CPU hardware supports rich multithreading/hyperthreading.
I'll poke around on my side to see how to get multithreading on AMD hardware in pytorch's implementation of FFT.
Thank you @nicholasmalaya! On suite2p's side, I am happy to have an alternate FFT import for AMD users, particularly if it is easy to install with conda or pip and works on both Windows and Ubuntu. But of course, if pytorch can be sped up like you suggest, that would be best.
Hello everyone!
I did a small test today using this tiny script just to see what torch.fft.fft is doing on this machine:
CPU Count: 256 : "AuthenticAMD AMD EPYC 7742 64-Core Processor 1497.825 MHz (2 chips x 64 cores : 128 hyperthread cores)"
EDIT:
With this environment (very simple):
(amd_torch) jdelahanty@aori:~$ conda list
# packages in environment at /home/jdelahanty/miniconda3/envs/amd_torch:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
ca-certificates 2022.4.26 h06a4308_0
certifi 2022.5.18.1 py38h06a4308_0
ld_impl_linux-64 2.38 h1181459_1
libffi 3.3 he6710b0_2
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libstdcxx-ng 11.2.0 h1234567_1
ncurses 6.3 h7f8727e_2
numpy 1.22.4 pypi_0 pypi
openssl 1.1.1o h7f8727e_0
pip 21.2.4 py38h06a4308_0
python 3.8.13 h12debd9_0
readline 8.1.2 h7f8727e_1
setuptools 61.2.0 py38h06a4308_0
sqlite 3.38.3 hc218d9a_0
tk 8.6.12 h1ccaba5_0
torch 1.11.0 pypi_0 pypi
typing-extensions 4.2.0 pypi_0 pypi
wheel 0.37.1 pyhd3eb1b0_0
xz 5.2.5 h7f8727e_1
zlib 1.2.12 h7f8727e_2
With this small script:
# AMD Torch Testing
# FFTs appear to struggle while running on AMD hardware via pytorch in suite2p
# pytorch FFT alone seems to work okay!
import torch
import numpy as np
from time import perf_counter
# Set number of threads to be greater than 1
torch.set_num_threads(6)
print("Making Dummy Data...")
# Make Dummy Data with np.random
dummy = np.random.rand(12500, 512, 512)
print("Dummy Data Complete!")
print("Transform numpy array into Tensor")
# Make Dummy Data into a Tensor
dummy_tensor = torch.from_numpy(dummy)
print("Tensor creation complete!")
# Start timer
start = perf_counter()
print("Starting FFT")
# Save the result of the dummy data's fft to a variable
result = torch.fft.fft(dummy_tensor)
print("FFT Complete!")
# Stop the timer
end = perf_counter()
# Calculate total time
total_time = end - start
# Print total time
print("Total Time:", total_time)
From the small test I did here, it looks like it actually does end up doing something multithreaded maybe? See below:
https://user-images.githubusercontent.com/28411193/173710978-27799bff-3f82-4fbe-af1b-da0a3b481a6d.mp4
What do you think of this? Did I test it incorrectly?
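One extra check that might be worth running alongside the script above (a suggestion, not part of the original test): ask torch what threading setup it actually sees on this machine:

# Report torch's intra-op thread count and the parallel/OpenMP/MKL backends it was built with.
import torch

print(torch.get_num_threads())            # intra-op threads torch will use
print(torch.__config__.parallel_info())   # ATen parallel backend, OpenMP / MKL thread info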
EDIT2:
After running Suite2p on our EPYC configured AMD hardware, the run looks like this:
>>>>> started run at 10/06/2022 17:29:00
Warning: cellpose did not import
No module named 'cellpose'
cannot use anatomical mode, but otherwise suite2p will run normally
{'data_path': ['/scratch/snlkt2p'], 'subfolders': [], 'h5py_key': 'data', 'save_path0': '/scratch/snlkt2p', 'fast_disk': '/scratch/snlkt2p', 'input_format': 'h5'}
h5
** Found 1 h5 files - converting to binary **
NOTE: using a list of h5 files:
['/scratch/snlkt2p/20211105_CSE20_subtraction.h5']
time 62.56 sec. Wrote 45510 frames per binary for 1 planes
>>>>>>>>>>>>>>>>>>>>> PLANE 0 <<<<<<<<<<<<<<<<<<<<<<
NOTE: not registered / registration forced with ops['do_registration']>1
(no previous offsets to delete)
----------- REGISTRATION
registering 45510 frames
Reference frame, 22.15 sec.
Registered 8000/45510 in 1123.35s
Registered 16000/45510 in 2234.96s
Registered 24000/45510 in 3337.56s
Registered 32000/45510 in 4441.69s
Registered 40000/45510 in 5662.43s
added enhanced mean image
----------- Total 6604.40 sec
Registration metrics, 37.10 sec.
NOTE: applying default /home/jdelahanty/.suite2p/classifiers/classifier_user.npy
----------- ROI DETECTION
Binning movie in chunks of length 22
Binned movie [2002,474,502] in 26.39 sec.
NOTE: estimated spatial scale ~12 pixels, time epochs 1.67, threshold 16.68
0 ROIs, score=209.17
Detected 255 ROIs, 19.12 sec
After removing overlaps, 232 ROIs remain
----------- Total 46.21 sec.
----------- EXTRACTION
Masks created, 1.46 sec.
Extracted fluorescence from 232 ROIs in 45510 frames, 24.46 sec.
----------- Total 26.28 sec.
----------- CLASSIFICATION
['compact', 'skew', 'npix_norm']
----------- Total 0.07 sec.
----------- SPIKE DECONVOLUTION
----------- Total 0.46 sec.
Plane 0 processed in 6714.75 sec (can open in GUI).
total = 6777.54 sec.
TOTAL RUNTIME 6777.54 sec
However, when running on an AMD Ryzen 9 5950X (16 cores), suite2p uses up the cores and processes quite quickly!
>>>>> started run at 14/06/2022 18:12:52
Warning: cellpose did not import
No module named 'cellpose'
cannot use anatomical mode, but otherwise suite2p will run normally
{'data_path': ['C:/Users/jdelahanty/Desktop'], 'subfolders': [], 'h5py_key': 'data', 'save_path0': 'C:/Users/jdelahanty/Desktop', 'fast_disk': 'C:/Users/jdelahanty/Desktop', 'input_format': 'h5'}
h5
** Found 1 h5 files - converting to binary **
NOTE: using a list of h5 files:
['C:/Users/jdelahanty/Desktop\\20211105_CSE20_subtraction.h5']
time 83.60 sec. Wrote 45510 frames per binary for 1 planes
>>>>>>>>>>>>>>>>>>>>> PLANE 0 <<<<<<<<<<<<<<<<<<<<<<
NOTE: not registered / registration forced with ops['do_registration']>1
(no previous offsets to delete)
----------- REGISTRATION
registering 45510 frames
Reference frame, 9.64 sec.
Registered 8000/45510 in 229.04s
Registered 16000/45510 in 437.17s
Registered 24000/45510 in 660.37s
Registered 32000/45510 in 611.23s
Registered 40000/45510 in 817.70s
added enhanced mean image
----------- Total 976.36 sec
Registration metrics, 30.77 sec.
NOTE: applying default C:\Users\jdelahanty\.suite2p\classifiers\classifier_user.npy
----------- ROI DETECTION
Binning movie in chunks of length 22
Binned movie [2002,474,502] in 33.21 sec.
Binned movie denoised (for cell detection only) in 26.55 sec.
NOTE: estimated spatial scale ~12 pixels, time epochs 1.67, threshold 16.68
0 ROIs, score=545.66
Detected 933 ROIs, 42.18 sec
After removing overlaps, 671 ROIs remain
----------- Total 104.75 sec.
----------- EXTRACTION
Masks created, 2.60 sec.
Extracted fluorescence from 671 ROIs in 45510 frames, 43.53 sec.
----------- Total 46.79 sec.
----------- CLASSIFICATION
['npix_norm', 'compact', 'skew']
----------- Total 0.02 sec.
----------- SPIKE DECONVOLUTION
----------- Total 0.97 sec.
Plane 0 processed in 1159.84 sec (can open in GUI).
total = 1243.51 sec.
TOTAL RUNTIME 1243.51 sec
Maybe this is all more related to #771 somehow? It looks like it's specific to the EPYC configuration and not to AMD in general, which is great news!
Another update:
It looks like Pytorch doesn't care about there being two chips in this case. I updated the script above to just make the array it does the computation on really big and increased the number of threads to 200 for fun.
Here's the different script:
# AMD Torch Testing
# FFTs appear to struggle while running on AMD hardware via pytorch
# while running suite2p
import torch
import numpy as np
from time import perf_counter
# Set number of threads to be greater than 1
torch.set_num_threads(200)
print("Making Dummy Data...")
# Make Dummy Data with np.random
dummy = np.random.rand(50000, 512, 512)
print("Dummy Data Complete!")
print("Transform numpy array into Tensor")
# Make Dummy Data into a Tensor
dummy_tensor = torch.from_numpy(dummy)
print("Tensor creation complete!")
# Start timer
start = perf_counter()
print("Starting FFT")
# Save the result of the dummy data's fft to a variable
result = torch.fft.fft(dummy_tensor)
print("FFT Complete!")
# Stop the timer
end = perf_counter()
# Calculate total time
total_time = end - start
# Print total time
print("Total Time:", total_time)
And here's a clip of it while operating in the suite2p environment:
https://user-images.githubusercontent.com/28411193/173722620-50b45a95-5f94-418a-ac03-5ee403898a56.mp4
Our EPYC servers are single-CPU, 64-core - threading is horrible. If you dump your dataset and settings somewhere, we can give it a try and benchmark it. BTW: the suite2p runs above yield different ROIs. Something seems odd there.
I didn't even notice that about the ROIs! I think I should use a different dataset as well, because I accidentally did the background subtraction on this one incorrectly. I'll run a dataset without subtraction on each system today and report back.
For sharing the dataset: the H5 file, losslessly compressed with blosc, is something like 15GB, but I don't have the space on my Drive to put it there! So I'll make a subset of this recording that's easier to share and faster for us to test with. I'll update this comment with a link when I have it.
I'll also try running each function of the registration steps and save the outputs on small subsets of data, which I'll share in that drive as well.
We'll run a few tests / benchmarks on a fresh install now as well again. Will post here.
These are the most recent MKL/BLAS Intel-spoofing benchmarks I found: https://scrp.econ.cuhk.edu.hk/blog/analysis/2022/02/07/mkl-optimization and https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html
We will try.
(Just linking the MATLAB response on a similar subject for completeness' sake: https://de.mathworks.com/matlabcentral/answers/1672304-how-can-i-use-the-blas-and-lapack-implementations-included-in-amd-optimizing-cpu-libraries-aocl-wi)
...and a guide on properly choosing OpenBLAS over MKL: https://www.capitalone.com/tech/machine-learning/accelerating-python-with-hardware-optimized-computing/
We would then probably only need to figure out how to use AOCL's BLIS instead... https://developer.amd.com/amd-aocl/blas-library/
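Before swapping BLAS libraries around, it may help to confirm which BLAS/OpenMP libraries numpy and scipy actually load at runtime; a minimal check, assuming threadpoolctl is installed:

# List the BLAS/OpenMP libraries loaded in this process (mkl / openblas / blis) and their thread counts.
import numpy as np
from threadpoolctl import threadpool_info

np.show_config()                  # build-time BLAS/LAPACK information
for lib in threadpool_info():     # runtime libraries and thread settings
    print(lib["internal_api"], lib["num_threads"], lib["filepath"])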
Hi @jmdelahanty!
Did you follow up on this? I did not have time to dig in further...
Hello! I have yet to. For a while our dual-EPYC machine was out of commission, and no other machines we have seem to encounter this problem (as of yet, anyway).
It is available again but I've been in the throes of running lots of mice and applying to graduate school. I may have a little time next weekend to try checking it out. Definitely need to solve this problem!
This would be awesome. Our two epyc machines are crunching data quite nicely otherwise. Will get back on it as well when I have time.
Some interim updates:
I did some profiling. It is not the FFT after all, but scipy's gaussian_filter1d in utils.temporal_smooth:
Scipy's underlying correlate1d runs on a single core only.
Next steps:
- multithreaded gaussian_filter1d? (torch implementation? python threading?)
Other interesting findings:
- I tested OpenBLAS against MKL / scipy / numpy, with spoofing, default setup, etc.: torch's FFT is twice as fast as MKL on AMD and almost 10 times faster than the scipy/numpy FFTs - regardless of the BLAS backend. All cores are used by the torch FFT.
Edit 3: There are some weird clashes between different BLAS / torch installations. After much benchmarking and different install combinations, this is what works for us - versions, install routines, and benchmarks are under this link:
https://www.evernote.com/shard/s4/sh/5b256de6-3809-f4f9-5a57-66e793917ecd/R7Js0zFUBeFljkpcf7iI9AQvY62OtdV35qjyEvsxGZytjhL-yshPiiI1wg
Multithreading gaussian_filter1d sped things up by a factor of 7, running on 64 cores now - so far so good! https://www.evernote.com/shard/s4/sh/eefddb8e-0924-531a-6938-54be5ff3febc/DtmNIZCmkYeq4cqiomtFfexUxuyZUyu07mbmtoSir2CHnBC3tRuEtdF1aQ
Source extraction results are not fully equivalent. Need to check what is going on. (edit: likely due to different batch-edge issues with thread-chunked data in temporal smoothing, on top of the s2p batch chunks: https://github.com/MouseLand/suite2p/issues/798 - seems minor, though.)
@jmdelahanty - if you would like to test:
Code changes added to registration/utils.py:
# TR23: multithreaded gaussian_filter1d on AMD EPYC
# (imports repeated here so the snippet runs standalone; utils.py already has np and gaussian_filter1d)
import numpy as np
from scipy.ndimage import gaussian_filter1d
from concurrent.futures import ThreadPoolExecutor


def gaussian_filter1d_mt(input_array, sigma, axis=-1, mode='reflect', cval=0.0, n_threads=64):
    """
    Apply Gaussian smoothing along a single axis of the input array using multiple threads.

    Parameters:
        input_array (numpy.ndarray): The input array to smooth.
        sigma (float): The standard deviation of the Gaussian kernel.
        axis (int, optional): The axis along which to apply the filter. Default is -1 (last axis).
        mode (str, optional): The mode parameter to pass to scipy.ndimage.gaussian_filter1d. Default is 'reflect'.
        cval (float, optional): The cval parameter to pass to scipy.ndimage.gaussian_filter1d. Default is 0.0.
        n_threads (int, optional): The number of threads to use. Default is 64.

    Returns:
        numpy.ndarray: The smoothed array.
    """
    # Split the input array into chunks along the axis we want to filter
    chunks = np.array_split(input_array, n_threads, axis=axis)
    # Create a thread pool and submit a filtering task for each chunk
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        tasks = [executor.submit(gaussian_filter1d, chunk, sigma, axis=axis, mode=mode, cval=cval)
                 for chunk in chunks]
        # Wait for all tasks to complete and collect the filtered chunks
        filtered_chunks = [task.result() for task in tasks]
    # Concatenate the filtered chunks back into a single array along the filtering axis
    return np.concatenate(filtered_chunks, axis=axis)
Then use the following for temporal_smooth (at line 243):
def temporal_smooth(data: np.ndarray, sigma: float) -> np.ndarray:
    """
    Returns Gaussian filtered 'frames' ndarray over first dimension

    Parameters
    ----------
    data: nImg x Ly x Lx
    sigma: float
        windowing parameter

    Returns
    -------
    smoothed_data: nImg x Ly x Lx
        Smoothed data
    """
    return gaussian_filter1d_mt(data, sigma=sigma, axis=0)
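A quick, hypothetical sanity check (not part of the patch) comparing the threaded filter against scipy's single-threaded gaussian_filter1d; nonzero differences near chunk boundaries are the batch-edge effect mentioned above:

# Compare the threaded filter with scipy's reference implementation on a small stack.
import numpy as np
from scipy.ndimage import gaussian_filter1d

data = np.random.rand(512, 64, 64).astype(np.float32)
ref = gaussian_filter1d(data, sigma=2.0, axis=0)                  # single-threaded reference
mt = gaussian_filter1d_mt(data, sigma=2.0, axis=0, n_threads=8)   # chunked, threaded version
print("max |difference|:", np.abs(ref - mt).max())                # nonzero near chunk edges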
Wow you've accomplished some pretty sweet things these past couple days! Amazing speed-ups and work!
This upcoming weekend I'll see if I can alter one of my environments to do this on our hardware. I'm sure that since it worked for you it'll work for us too, but it'd be good to test it out and see for sure.
Also what did you use to profile it that way? I'm super curious about how to do that properly...
Pretty happy with the speed now. Would be curious to hear your results with the dual-EPYC. I did not fully optimize and cross-check results (e.g., the number of workers may not be optimal, I guess other functions could be parallelized as well, and I did not cross-check results against the s2p defaults yet - but it looks ok so far).
To profile: the IPython magics %prun and %lprun (https://jakevdp.github.io/PythonDataScienceHandbook/01.07-timing-and-profiling.html)
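For anyone wanting to reproduce that kind of line-level profile outside a notebook, here is a small sketch using line_profiler's Python API (the import path follows the file referenced in this thread; the array size is arbitrary):

# Line-by-line profile of temporal_smooth (pip install line_profiler).
import numpy as np
from line_profiler import LineProfiler
from suite2p.registration.utils import temporal_smooth

frames = np.random.rand(500, 256, 256).astype(np.float32)

lp = LineProfiler()
profiled = lp(temporal_smooth)   # wrap the function of interest
profiled(frames, 1.0)            # sigma (in frames) as the second argument
lp.print_stats()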
This also means the whole issue probably only matters when smoothing in time (ops['smooth_sigma_time'] set to something other than 0).
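For context, a minimal illustration of the option involved (using suite2p's standard default_ops; the value 2 is only an example):

# Temporal smoothing (and therefore the gaussian_filter1d path) is only used
# when this option is nonzero; the suite2p default is 0.
import suite2p

ops = suite2p.default_ops()
ops['smooth_sigma_time'] = 2   # gaussian width in frames; 0 disables temporal smoothing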
I'll have some time today in between imaging mice to try running suite2p on the dual-EPYC if it's available. I'll see if I can reproduce the performance increases on it. It would be fun to try to make an installation/docs PR out of this, but I don't have any experience in software packaging.
Great! Curious to hear the results. I made this for the lab - see the link below. The install sequence is not intellectually gratifying (read: it just ended up being the right combination of versions):
One thing: make sure to use the github version of the multithreading snippet (in the Evernote). The one posted above runs the risk of scrambling processing chunks.
https://www.evernote.com/shard/s4/sh/5b256de6-3809-f4f9-5a57-66e793917ecd/R7Js0zFUBeFljkpcf7iI9AQvY62OtdV35qjyEvsxGZytjhL-yshPiiI1wg
I got things running on our dual-EPYC today on a short recording and as far as I can tell things were working well! Didn't notice any slowdowns, looked like the threads were going gangbusters. Just haven't had a chance to validate whether or not things are the same on different hardware/different implementation. I think it'll take me a bit to have time to do that.