
solved - no multi-threading on AMD EPYC 7702P 64-Core during registration -- how to speed up AMD processing?

Open trose-neuro opened this issue 3 years ago • 27 comments

Hi!

First: Thanks for your great work!

I am currently trying to optimize suite2p on our new 64-core AMD server (fresh conda/suite2p install on Ubuntu, kernel 5.4, with the standard environment dependencies).

During registration, however, it spends most of the time using only a single thread. On an Intel Xeon with only 8 cores and comparable processor speed, registration runs almost twice as fast and appears to fill up the cores nicely. Is this expected behavior?

I played with the usual suspects (MKL versions... the infamous MKL_DEBUG_CPU_TYPE flag, etc.): no single-thread speedup and no change in multithreading behavior.
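
(For context, the workaround referred to above: before MKL 2020, setting MKL_DEBUG_CPU_TYPE=5 forced MKL onto its fast AVX2 code path on AMD CPUs. A minimal sketch of how it was typically applied from Python; the variable must be set before the first MKL-backed import:)

# Sketch of the now-removed MKL workaround: pre-2020 MKL builds honored this
# variable and used the AVX2 code path on AMD CPUs. It must be set before the
# first import that loads MKL (e.g., numpy).
import os
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"
import numpy as np  # MKL reads the variable when it is loaded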

Thanks!

(settings and suite2p output: https://www.evernote.com/shard/s4/sh/363400de-eef8-2c75-44f0-1ec14b9b64a1/c94284efba1aab302a985bb93d0ee70c)

Edit: I consider this solved for us. https://github.com/MouseLand/suite2p/issues/725#issuecomment-1435949465

trose-neuro avatar Sep 08 '21 14:09 trose-neuro

Are you using the latest version with the torch dependency for the FFT? I switched to torch thinking it would handle the multithreading better.

carsen-stringer avatar Sep 12 '21 19:09 carsen-stringer

Thx - but yes. All running with pytorch.

trose-neuro avatar Sep 14 '21 04:09 trose-neuro

...I will check a fresh environment and try installing pytorch with OpenBlas. I'll post here then.

trose-neuro avatar Sep 19 '21 08:09 trose-neuro

Thanks, let us know what you find. I'm not sure if there is a fast way to do FFTs on an AMD CPU with Python. You could try timing scipy's FFT functions, and if they're faster we can add an option to force their use, but I'm guessing they are sped up by MKL too. They unfortunately removed the MKL_DEBUG_CPU_TYPE flag, I think.
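
A minimal timing sketch for that comparison (array sizes and the 2-D FFT choice here are arbitrary; assumes scipy and torch are installed):

# Sketch: time scipy's pocketfft-based FFT against torch's FFT on the same data.
import numpy as np
import scipy.fft
import torch
from time import perf_counter

x = np.random.rand(500, 512, 512)
xt = torch.from_numpy(x)

t0 = perf_counter()
scipy.fft.fft2(x, workers=-1)  # workers=-1: use all available threads
print("scipy.fft:", perf_counter() - t0)

t0 = perf_counter()
torch.fft.fft2(xt)
print("torch.fft:", perf_counter() - t0)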

carsen-stringer avatar Oct 18 '21 17:10 carsen-stringer

Hello! Is there any news about this issue, @trose-neuro? I was about to purchase a workstation with an AMD processor... but then read about MKL. Thank you in advance!

MariangelaP avatar Nov 16 '21 15:11 MariangelaP

Sorry - no, nothing yet. Frankly, I'd recommend going Intel.

trose-neuro avatar Nov 24 '21 16:11 trose-neuro

we will add to the documentation that we recommend intel processors

carsen-stringer avatar Jun 01 '22 10:06 carsen-stringer

I've messaged a few HPC engineers at AMD over Twitter about this. They said they will get back to me and seemed interested in helping, since our lab just bought a bunch of AMD compute.

EDIT:

One did! A helpful fellow named Nicholas who is a principal scientist at AMD. Here are some initial things he mentioned and something we can try checking out:

I agree with you that there is no reason to think we can't run a multithreaded FFT. 
If it is using fftw as the backend, you will want to be sure it was compiled with the multithreaded flags,
such as: -lfftw3_threads -lfftw3 -lm on Unix, or -lfftw3_omp -lfftw3 -lm

(see: https://fftw.org/fftw3_doc/Usage-of-Multi_002dthreaded-FFTW.html)

jmdelahanty avatar Jun 13 '22 18:06 jmdelahanty

Probably worthwhile to run pyFFTW then (AFAIK, torch.fft does not use the FFTW library): https://github.com/pyFFTW/pyFFTW
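
A minimal pyFFTW sketch, in case someone wants to time it (the thread count is arbitrary; pyFFTW's numpy-style interface is a drop-in for np.fft):

# Sketch: pyFFTW's numpy-style interface with FFTW threading enabled.
import numpy as np
import pyfftw

pyfftw.config.NUM_THREADS = 64    # let FFTW plan multithreaded transforms
pyfftw.interfaces.cache.enable()  # reuse FFTW plans between calls

x = np.random.rand(500, 512, 512)
y = pyfftw.interfaces.numpy_fft.fft(x)  # same call signature as np.fft.fft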

But this makes me not too optimistic regarding its performance (benchmark is pretty old, though): https://thomasaarholt.github.io/fftspeedtest/fftspeedtest.html

Maybe wrapping the new Threadripper-optimized AMD FFTW would be an option: https://developer.amd.com/amd-aocl/fftw/

I did not check for a while if there is news regarding MKL/AMD torch etc...

Currently also considering adding GPUs to our big AMD servers to at least use them a little bit more efficiently - our workflows are all piped through these machines and currently, it is a bit of a sad story...

trose-neuro avatar Jun 14 '22 07:06 trose-neuro

I like @trose-neuro's idea to try AOCL's FFTW implementation. This is certainly a software problem; AMD CPU hardware supports rich multithreading/hyperthreading.

I'll poke around on my side to see how to get multithreading on AMD hardware in pytorch's implementation of FFT.

nicholasmalaya avatar Jun 14 '22 15:06 nicholasmalaya

Thank you @nicholasmalaya! On suite2p's side, I am happy to have an alternate FFT import for AMD users, particularly if it is easy to install with conda or pip and works on both Windows and Ubuntu. But of course, if pytorch can be sped up like you suggest, that would be best.

carsen-stringer avatar Jun 14 '22 15:06 carsen-stringer

Hello everyone!

I did a small test today using this tiny script just to see what torch.fft.fft is doing on this machine:

CPU Count: 256 : "AuthenticAMD AMD EPYC 7742 64-Core Processor 1497.825 MHz (2 chips x 64 cores : 128 hyperthread cores)"

EDIT:

With this environment (very simple):

(amd_torch) jdelahanty@aori:~$ conda list
# packages in environment at /home/jdelahanty/miniconda3/envs/amd_torch:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
_openmp_mutex             5.1                       1_gnu
ca-certificates           2022.4.26            h06a4308_0
certifi                   2022.5.18.1      py38h06a4308_0
ld_impl_linux-64          2.38                 h1181459_1
libffi                    3.3                  he6710b0_2
libgcc-ng                 11.2.0               h1234567_1
libgomp                   11.2.0               h1234567_1
libstdcxx-ng              11.2.0               h1234567_1
ncurses                   6.3                  h7f8727e_2
numpy                     1.22.4                   pypi_0    pypi
openssl                   1.1.1o               h7f8727e_0
pip                       21.2.4           py38h06a4308_0
python                    3.8.13               h12debd9_0
readline                  8.1.2                h7f8727e_1
setuptools                61.2.0           py38h06a4308_0
sqlite                    3.38.3               hc218d9a_0
tk                        8.6.12               h1ccaba5_0
torch                     1.11.0                   pypi_0    pypi
typing-extensions         4.2.0                    pypi_0    pypi
wheel                     0.37.1             pyhd3eb1b0_0
xz                        5.2.5                h7f8727e_1
zlib                      1.2.12               h7f8727e_2

With this small script:

# AMD Torch Testing
# FFTs appear to struggle while running on AMD hardware via pytorch in suite2p
# pytorch FFT alone seems to work okay!

import torch
import numpy as np
from time import perf_counter

# Set number of threads to be greater than 1
torch.set_num_threads(6)

print("Making Dummy Data...")

# Make Dummy Data with np.random
dummy = np.random.rand(12500, 512, 512)

print("Dummy Data Complete!")

print("Transform numpy array into Tensor")
# Make Dummy Data into a Tensor
dummy_tensor = torch.from_numpy(dummy)

print("Tensor creation complete!")

# Start timer
start = perf_counter()

print("Starting FFT")

# Save the result of the dummy data's fft to a variable
result = torch.fft.fft(dummy_tensor)

print("FFT Complete!")
# Stop the timer
end = perf_counter()

# Calculate total time
total_time = end - start

# Print total time
print("Total Time:", total_time)

From the small test I did here, it looks like it actually does end up doing something multithreaded maybe? See below:

https://user-images.githubusercontent.com/28411193/173710978-27799bff-3f82-4fbe-af1b-da0a3b481a6d.mp4

What do you think of this? Did I test it incorrectly?

EDIT2:

After running Suite2p on our EPYC configured AMD hardware, the run looks like this:

 >>>>> started run at 10/06/2022 17:29:00Warning: cellpose did not import
No module named 'cellpose'
cannot use anatomical mode, but otherwise suite2p will run normally
{'data_path': ['/scratch/snlkt2p'], 'subfolders': [], 'h5py_key': 'data', 'save_path0': '/scratch/snlkt2p', 'fast_disk': '/scratch/snlkt2p', 'input_format': 'h5'}
h5
** Found 1 h5 files - converting to binary **
NOTE: using a list of h5 files:
['/scratch/snlkt2p/20211105_CSE20_subtraction.h5']
time 62.56 sec. Wrote 45510 frames per binary for 1 planes
>>>>>>>>>>>>>>>>>>>>> PLANE 0 <<<<<<<<<<<<<<<<<<<<<<
NOTE: not registered / registration forced with ops['do_registration']>1
      (no previous offsets to delete)
----------- REGISTRATION
registering 45510 frames
Reference frame, 22.15 sec.
Registered 8000/45510 in 1123.35s
Registered 16000/45510 in 2234.96s
Registered 24000/45510 in 3337.56s
Registered 32000/45510 in 4441.69s
Registered 40000/45510 in 5662.43s
added enhanced mean image
----------- Total 6604.40 sec
Registration metrics, 37.10 sec.
NOTE: applying default /home/jdelahanty/.suite2p/classifiers/classifier_user.npy
----------- ROI DETECTION
Binning movie in chunks of length 22
Binned movie [2002,474,502] in 26.39 sec.
NOTE: estimated spatial scale ~12 pixels, time epochs 1.67, threshold 16.68
0 ROIs, score=209.17
Detected 255 ROIs, 19.12 sec
After removing overlaps, 232 ROIs remain
----------- Total 46.21 sec.
----------- EXTRACTION
Masks created, 1.46 sec.
Extracted fluorescence from 232 ROIs in 45510 frames, 24.46 sec.
----------- Total 26.28 sec.
----------- CLASSIFICATION
['compact', 'skew', 'npix_norm']
----------- Total 0.07 sec.
----------- SPIKE DECONVOLUTION
----------- Total 0.46 sec.
Plane 0 processed in 6714.75 sec (can open in GUI).
total = 6777.54 sec.
TOTAL RUNTIME 6777.54 sec

However, when running on an AMD Ryzen 9 5950X (16 cores), suite2p uses all the cores and processes quite quickly!

>>>>> started run at 14/06/2022 18:12:52Warning: cellpose did not import

No module named 'cellpose'

cannot use anatomical mode, but otherwise suite2p will run normally

{'data_path': ['C:/Users/jdelahanty/Desktop'], 'subfolders': [], 'h5py_key': 'data', 'save_path0': 'C:/Users/jdelahanty/Desktop', 'fast_disk': 'C:/Users/jdelahanty/Desktop', 'input_format': 'h5'}

h5

** Found 1 h5 files - converting to binary **

NOTE: using a list of h5 files:

['C:/Users/jdelahanty/Desktop\\20211105_CSE20_subtraction.h5']

time 83.60 sec. Wrote 45510 frames per binary for 1 planes

>>>>>>>>>>>>>>>>>>>>> PLANE 0 <<<<<<<<<<<<<<<<<<<<<<

NOTE: not registered / registration forced with ops['do_registration']>1

      (no previous offsets to delete)

----------- REGISTRATION

registering 45510 frames

Reference frame, 9.64 sec.

Registered 8000/45510 in 229.04s

Registered 16000/45510 in 437.17s

Registered 24000/45510 in 660.37s

Registered 32000/45510 in 611.23s

Registered 40000/45510 in 817.70s

added enhanced mean image

----------- Total 976.36 sec

Registration metrics, 30.77 sec.

NOTE: applying default C:\Users\jdelahanty\.suite2p\classifiers\classifier_user.npy

----------- ROI DETECTION

Binning movie in chunks of length 22

Binned movie [2002,474,502] in 33.21 sec.

Binned movie denoised (for cell detection only) in 26.55 sec.

NOTE: estimated spatial scale ~12 pixels, time epochs 1.67, threshold 16.68 

0 ROIs, score=545.66

Detected 933 ROIs, 42.18 sec

After removing overlaps, 671 ROIs remain

----------- Total 104.75 sec.

----------- EXTRACTION

Masks created, 2.60 sec.

Extracted fluorescence from 671 ROIs in 45510 frames, 43.53 sec.

----------- Total 46.79 sec.

----------- CLASSIFICATION

['npix_norm', 'compact', 'skew']

----------- Total 0.02 sec.

----------- SPIKE DECONVOLUTION

----------- Total 0.97 sec.

Plane 0 processed in 1159.84 sec (can open in GUI).

total = 1243.51 sec.

TOTAL RUNTIME 1243.51 sec

Maybe this is all more related to #771 somehow? It looks like it's specific to the EPYC configuration and not AMD in general, which is great news!

jmdelahanty avatar Jun 15 '22 00:06 jmdelahanty

Another update:

It looks like PyTorch doesn't care about there being two chips in this case. I updated the script above to make the array it computes on really big and increased the number of threads to 200 for fun.
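
(One way to probe whether the dual-socket layout matters would be to pin the process to a single socket before timing. A sketch - Linux-only, and the assumption that cores 0-63 map to socket 0 should be checked with lscpu:)

# Sketch: restrict this process to one socket's cores before benchmarking.
# Linux-only; the 0-63 range is an assumption about this machine's topology.
import os
os.sched_setaffinity(0, range(64))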

Here's the different script:

# AMD Torch Testing
# FFTs appear to struggle while running on AMD hardware via pytorch
# while running suite2p

import torch
import numpy as np
from time import perf_counter

# Set number of threads to be greater than 1
torch.set_num_threads(200)

print("Making Dummy Data...")

# Make Dummy Data with np.random
dummy = np.random.rand(50000, 512, 512)

print("Dummy Data Complete!")

print("Transform numpy array into Tensor")
# Make Dummy Data into a Tensor
dummy_tensor = torch.from_numpy(dummy)

print("Tensor creation complete!")

# Start timer
start = perf_counter()

print("Starting FFT")

# Save the result of the dummy data's fft to a variable
result = torch.fft.fft(dummy_tensor)

print("FFT Complete!")
# Stop the timer
end = perf_counter()

# Calculate total time
total_time = end - start

# Print total time
print("Total Time:", total_time)

And here's a clip of it while operating in the suite2p environment:

https://user-images.githubusercontent.com/28411193/173722620-50b45a95-5f94-418a-ac03-5ee403898a56.mp4

jmdelahanty avatar Jun 15 '22 02:06 jmdelahanty

Our EPYC servers are single-CPU, 64-core; threading is horrible. If you dump your dataset and settings somewhere, we can give it a try to benchmark. BTW: the suite2p runs above yield different ROIs. Something seems odd there.

trose-neuro avatar Jun 15 '22 09:06 trose-neuro

I didn't even notice that about the ROIs! I think I should use a different dataset as well, because I did some background subtraction on this one incorrectly by accident. I'll run a dataset without subtraction on each system today and report back.

For sharing the dataset: the H5 file, when losslessly compressed with blosc, is something like 15 GB, but I don't have the space on my Drive to put it there! So I'll make a subset of this recording that's easier to share and faster for us to test with. I'll update the comment with a link when I have it.

I will also try running each function of the registration steps and save the outputs on small subsets of data, which I'll share in that drive as well.

jmdelahanty avatar Jun 15 '22 14:06 jmdelahanty

We'll run a few tests / benchmarks on a fresh install now as well again. Will post here.

These are the most recent MKL/BLAS Intel-spoofing benchmarks I found:
https://scrp.econ.cuhk.edu.hk/blog/analysis/2022/02/07/mkl-optimization
https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html

We will try.

(just linking the matlab response on similar subject for completeness' sake here: https://de.mathworks.com/matlabcentral/answers/1672304-how-can-i-use-the-blas-and-lapack-implementations-included-in-amd-optimizing-cpu-libraries-aocl-wi)

...and a guide to properly choosing OpenBLAS over MKL: https://www.capitalone.com/tech/machine-learning/accelerating-python-with-hardware-optimized-computing/

We then probably only need to figure out how to use AOCL BLIS instead... https://developer.amd.com/amd-aocl/blas-library/

trose-neuro avatar Jun 17 '22 07:06 trose-neuro

Hi @jmdelahanty!

Did you follow up on this? I did not have time to dig in further...

trose-neuro avatar Feb 05 '23 13:02 trose-neuro

Hello! I have yet to. For a while our dual EPYC machine was out of commission and no other machines we have seem to encounter this problem (as of yet anyways).

It is available again but I've been in the throes of running lots of mice and applying to graduate school. I may have a little time next weekend to try checking it out. Definitely need to solve this problem!

jmdelahanty avatar Feb 17 '23 05:02 jmdelahanty

This would be awesome. Our two epyc machines are crunching data quite nicely otherwise. Will get back on it as well when I have time.

trose-neuro avatar Feb 17 '23 11:02 trose-neuro

Some interim updates:

I did some profiling. It is not the FFT after all but scipy's gaussian_filter1d in utils.temporal_smooth:

[profiler screenshots: runtime dominated by gaussian_filter1d / correlate1d]

Scipy's correlate1d (which gaussian_filter1d calls internally) runs on a single core only.

Next steps:

  • multithreaded gaussian_filter1d? (torch implementation? python threading?)

Interesting other facts:

  • I tested OpenBLAS against MKL / scipy / numpy, with spoofing, default setup, etc.: Torch FFT is twice as fast as MKL on AMD and almost 10 times faster than scipy/numpy FFTs, regardless of BLAS. All cores are used in Torch FFT.

Edit 3: There are some weird clashes between different BLAS / Torch installations. After much benchmarking and different install combinations, this is what works for us; versions, install routines, and benchmarks are under this link:

https://www.evernote.com/shard/s4/sh/5b256de6-3809-f4f9-5a57-66e793917ecd/R7Js0zFUBeFljkpcf7iI9AQvY62OtdV35qjyEvsxGZytjhL-yshPiiI1wg

trose-neuro avatar Feb 19 '23 10:02 trose-neuro

Multithreading gaussian_filter1d sped things up by a factor of 7, running on 64 cores now - so far so good! https://www.evernote.com/shard/s4/sh/eefddb8e-0924-531a-6938-54be5ff3febc/DtmNIZCmkYeq4cqiomtFfexUxuyZUyu07mbmtoSir2CHnBC3tRuEtdF1aQ

Source extraction results are not fully equivalent; need to check what is going on. (Edit: likely due to batch-edge issues with thread-chunked data in temporal smoothing, on top of s2p batch chunks: https://github.com/MouseLand/suite2p/issues/798 - seems minor, though.)

@jmdelahanty - if you would like to test:

Code changes - added to registration/utils.py:

# TR23: multithreading gaussian_filter1d on AMD EPYC
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from scipy.ndimage import gaussian_filter1d  # both already imported in registration/utils.py

def gaussian_filter1d_mt(input_array, sigma, axis=-1, mode='reflect', cval=0.0, n_threads=64):
    """
    Apply Gaussian smoothing along a single axis of the input array using multiple threads.

    Note: the array is split along the filtering axis, so each chunk is filtered
    with its own boundary condition; results can differ slightly from the
    single-threaded filter at chunk edges (see #798).

    Parameters:
        input_array (numpy.ndarray): The input array to smooth.
        sigma (float): The standard deviation of the Gaussian kernel.
        axis (int, optional): The axis along which to apply the filter. Default is -1 (last axis).
        mode (str, optional): The mode parameter to pass to scipy.ndimage.gaussian_filter1d. Default is 'reflect'.
        cval (float, optional): The cval parameter to pass to scipy.ndimage.gaussian_filter1d. Default is 0.0.
        n_threads (int, optional): The number of threads to use. Default is 64.

    Returns:
        numpy.ndarray: The smoothed array.
    """
    # Split the input array into chunks along the axis we want to filter
    chunks = np.array_split(input_array, n_threads, axis=axis)

    # Create a thread pool and submit a filtering task for each chunk
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        tasks = [executor.submit(gaussian_filter1d, chunk, sigma, axis=axis, mode=mode, cval=cval) for chunk in chunks]

    # Wait for all tasks to complete and collect the filtered chunks
    filtered_chunks = [task.result() for task in tasks]

    # Concatenate the filtered chunks back into a single array along the filtering axis
    return np.concatenate(filtered_chunks, axis=axis)

Then use the following at line 243 (replacing the existing temporal_smooth):

def temporal_smooth(data: np.ndarray, sigma: float) -> np.ndarray:
    """
    Returns Gaussian filtered 'frames' ndarray over first dimension

    Parameters
    ----------
    data: nImg x Ly x Lx
    sigma: float
        windowing parameter

    Returns
    -------
    smoothed_data: nImg x Ly x Lx
        Smoothed data

    """
    return gaussian_filter1d_mt(data, sigma=sigma, axis=0)
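
To quantify the batch-edge differences mentioned above, a quick check against the single-threaded filter could look like this (a sketch; assumes gaussian_filter1d_mt from above is in scope):

# Sketch: compare the multithreaded filter against scipy's single-threaded
# reference; any differences concentrate at the chunk boundaries along axis 0.
import numpy as np
from scipy.ndimage import gaussian_filter1d

data = np.random.rand(1024, 64, 64).astype(np.float32)
ref = gaussian_filter1d(data, sigma=1.0, axis=0)
mt = gaussian_filter1d_mt(data, sigma=1.0, axis=0)
print("max abs difference:", np.abs(ref - mt).max())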

trose-neuro avatar Feb 20 '23 08:02 trose-neuro

Wow you've accomplished some pretty sweet things these past couple days! Amazing speed-ups and work!

This upcoming weekend I'll see if I can alter things in one of my environments to do this on our hardware. I'm sure that since it worked for you it'll work for us too, but it'd be good to test it out and see for sure.

Also what did you use to profile it that way? I'm super curious about how to do that properly...

jmdelahanty avatar Feb 20 '23 21:02 jmdelahanty

Pretty happy with the speed now. Would be curious to hear your results with dual-EPYC. I did not fully optimize and cross-check results (e.g., the number of workers may not be optimal, I guess other functions could be parallelized as well, and I have not cross-checked results against s2p defaults yet - but it looks OK so far).

To profile: prun and lprun https://jakevdp.github.io/PythonDataScienceHandbook/01.07-timing-and-profiling.html
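
(A sketch of how those magics are invoked in IPython/Jupyter; the profiled function here is just a placeholder:)

# Run inside IPython/Jupyter; %lprun needs `pip install line_profiler`.
%load_ext line_profiler
%prun my_function(args)                  # function-level profile (cProfile)
%lprun -f my_function my_function(args)  # line-by-line profile of my_function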

trose-neuro avatar Feb 21 '23 06:02 trose-neuro

This also means the whole issue probably only matters when smoothing in time (ops['smooth_sigma_time'] set to something other than 0).
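
As a concrete sketch (using suite2p's default_ops; the key name is as above):

# Sketch: temporal smoothing is controlled by this ops entry; the
# single-threaded gaussian_filter1d path is only hit when it is > 0.
import suite2p
ops = suite2p.default_ops()
ops['smooth_sigma_time'] = 0  # default: no smoothing in time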

trose-neuro avatar Feb 21 '23 10:02 trose-neuro

I'll have some time today in between imaging mice to try running suite2p on the dual-EPYC if it's available. I'll see if I can reproduce the performance increases on it. Would be fun to try and make an installation/docs PR out of this but don't have experience in software packaging at all.

jmdelahanty avatar Feb 25 '23 21:02 jmdelahanty

Great! Curious to hear the results. I made this for the lab - see the link below. The install sequence is not intellectually gratifying (read: it's just what ended up being the right combination of versions):

One thing: make sure to use the GitHub version of the multithreading snippet (in the Evernote). The one posted above runs the risk of scrambling processing chunks.

https://www.evernote.com/shard/s4/sh/5b256de6-3809-f4f9-5a57-66e793917ecd/R7Js0zFUBeFljkpcf7iI9AQvY62OtdV35qjyEvsxGZytjhL-yshPiiI1wg

trose-neuro avatar Feb 26 '23 06:02 trose-neuro

I got things running on our dual-EPYC today on a short recording, and as far as I can tell things were working well! I didn't notice any slowdowns; it looked like the threads were going gangbusters. I just haven't had a chance to validate whether or not things are the same on different hardware / a different implementation. I think it'll take me a bit to find time to do that.

jmdelahanty avatar Mar 21 '23 01:03 jmdelahanty