
Parallelize/Optimize Device Simulation Logic

coreylammie opened this issue · 5 comments

Currently, when performing inference or programming devices in passive arrays (0T1R arrangements), devices are simulated in a sequential manner. CUDA kernels and other optimization methods could be used to drastically improve performance, as some of the required operations are not easily parallelizable using the Python API.
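For context, the bottleneck looks roughly like the following (a minimal sketch; crossbar.devices and .program are illustrative names, not MemTorch's actual API):

```python
import torch

def program_crossbar_sequential(crossbar, target_conductances: torch.Tensor):
    # Illustrative only: program each device in a passive (0T1R) array one
    # at a time. The nested Python loop is the bottleneck this issue
    # proposes to replace with CUDA kernels.
    rows, cols = target_conductances.shape
    for i in range(rows):
        for j in range(cols):
            # Each device is simulated/programmed independently, so Python
            # interpreter overhead dominates for large arrays.
            crossbar.devices[i][j].program(target_conductances[i][j])
```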

coreylammie avatar Jun 14 '21 04:06 coreylammie

Hello, I was wondering if any progress had been made on this issue? Otherwise, I would like to get started on it. Thank you!

Philippe-Drolet avatar Oct 29 '21 14:10 Philippe-Drolet

Hi @Philippe-Drolet,

I have prioritized the implementation of torch.nn.RNN, torch.nn.RNNCell, torch.nn.LSTM, torch.nn.LSTMCell, torch.nn.GRU, and torch.nn.GRUCell modules, so I will likely be unable to work on this issue in the near future.

You are welcome to contribute yourself! I'm happy to answer any questions you may have.

coreylammie avatar Nov 01 '21 00:11 coreylammie

Hello,

So I have started work on this. I was curious what you would recommend for debugging the CUDA files when using them with the Python interface. So far, I have created a new pytest with the debug networks that you have defined, but when I get to debugging my new .cu files, I cannot step through them line by line as I would a regular Python file. I am trying Visual Studio Code right now to run the tests; what IDE are you using? (I suppose it's impossible to debug the C++ files with PyCharm.) Any guidance would help, and I am also simply curious as to how you do it.

Also, do you have any documentation on the purpose of the ABCD_E matrices used by simulate_passive? Thanks!

Philippe-Drolet avatar Nov 15 '21 03:11 Philippe-Drolet

Hi @Philippe-Drolet,

Sure! My preferred method of debugging is to use cuda-memcheck. The cuda-memcheck tool can pinpoint the exact line/kernel and the respective error message, as long as the -lineinfo flag is added during compilation. This has already been done here: https://github.com/coreylammie/MemTorch/blob/master/setup.py#L46.
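For reference, enabling this in a PyTorch CUDA extension build looks roughly as follows (a sketch; the extension name and source paths are illustrative, see the linked setup.py for the exact configuration):

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="memtorch_bindings",  # illustrative name
    ext_modules=[
        CUDAExtension(
            "memtorch_cuda_bindings",  # illustrative name
            sources=["bindings.cpp", "kernels.cu"],  # illustrative paths
            # -lineinfo embeds source file/line information in the compiled
            # kernels, so cuda-memcheck can report exact locations.
            extra_compile_args={"cxx": [], "nvcc": ["-lineinfo"]},
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```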

This tool can be used when executing a Python script that calls a C++/CUDA binding, which launches one or more CUDA kernels. It can be invoked as follows: cuda-memcheck python test.py. When debugging an especially problematic kernel, I would suggest setting the environment variable CUDA_LAUNCH_BLOCKING=1, so that only one kernel is executed at a time, i.e., CUDA_LAUNCH_BLOCKING=1 cuda-memcheck python test.py.
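If it is more convenient, the environment variable can also be set from within the test script itself, provided it is set before the CUDA context is created (a minimal sketch):

```python
import os

# Must be set before the first CUDA call (in practice, before importing
# torch), otherwise it has no effect on kernel launch behavior.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
```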

In addition, cudaSafeCall can be used, which is defined in memtorch/cu/utils.cuh. Technically, breakpoints can be added using NVIDIA Nsight; however, in my experience, this is cumbersome, and printf statements can easily be used alongside cuda-memcheck, which lets you keep working in your preferred IDE.

The ABCD_E matrices were originally proposed and defined in [1]. They are used to solve for the crossbar node voltages using linear matrix algebra, while accounting for source and line resistances. Solving these systems efficiently is rather nuanced: the ABCD matrix is sparse, and sparse linear systems are difficult to solve in a parallelized manner.
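At a high level, the solve step looks like the sketch below. This is a schematic illustration using a CPU sparse solver, not MemTorch's actual implementation: ABCD couples each node only to its neighbors (hence its sparsity), and E encodes the applied source voltages.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import spsolve

def solve_node_voltages(ABCD: np.ndarray, E: np.ndarray) -> np.ndarray:
    # ABCD is the (sparse) coefficient matrix of the crossbar's nodal
    # equations and E the excitation vector; solving ABCD @ V = E yields
    # the word-line and bit-line node voltages. A sparse direct solve is
    # efficient on the CPU, but hard to express as a parallel CUDA kernel,
    # which is what makes this step nuanced to accelerate.
    return spsolve(csr_matrix(ABCD), E)
```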

Hopefully, this helps! I'm happy to answer any further questions you may have.

[1] A. Chen, “A Comprehensive Crossbar Array Model With Solutions for Line Resistance and Nonlinear Device Characteristics,” IEEE Transactions on Electron Devices, vol. 60, no. 4, pp. 1318–1326, Apr. 2013, doi: 10.1109/TED.2013.2246791.

coreylammie avatar Nov 16 '21 00:11 coreylammie

Thank you very much for this response. I will go with the good old printf approach; it seems to work so far!

Philippe-Drolet avatar Nov 23 '21 14:11 Philippe-Drolet