
Segfault libxc.so

Open markperri opened this issue 1 year ago • 9 comments

I installed PySCF into my environment in a Jupyter Notebook Docker container running Ubuntu 22.04 and Python 3.11:

pip install pyscf gpu4pyscf-cuda12x cutensor-cu12

When I run the provided example, I get a segfault:

import pyscf
from gpu4pyscf.dft import rks

atom ='''
O       0.0000000000    -0.0000000000     0.1174000000
H      -0.7570000000    -0.0000000000    -0.4696000000
H       0.7570000000     0.0000000000    -0.4696000000
'''

mol = pyscf.M(atom=atom, basis='def2-tzvpp')
mf = rks.RKS(mol, xc='LDA').density_fit()

e_dft = mf.kernel()  # compute total energy

Segmentation fault

kernel: python[394296]: segfault at 0 ip 00007f506ab6fcff sp 00007ffdbc440778 error 6 in libxc.so[7f506ab65000+1e0000]
kernel: Code: a9 1a 00 48 8b 44 24 08 48 83 c4 18 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 8b 05 a9 73 1d 00 66 0f ef c9 66 0f ef c0 <48> c7 07 00 00 00 00 c7 47 20 00 00 00 00 48 89 47 08 48 c7 47 38
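The kernel line itself already narrows things down: on x86 the page-fault error code is a small bitmask, so "error 6" can be decoded mechanically. A generic debugging sketch, not part of pyscf or gpu4pyscf (the bit meanings follow the Linux/x86 fault handler; the helper name is illustrative):

```python
# Decode an x86 page-fault error code as printed in kernel segfault lines.
def decode_pf_error(code: int) -> dict:
    return {
        "page_present": bool(code & 1),   # 0 = fault on a non-present page
        "write":        bool(code & 2),   # 1 = the access was a write
        "user_mode":    bool(code & 4),   # 1 = fault happened in user mode
        "instr_fetch":  bool(code & 16),  # 1 = fault on an instruction fetch
    }

print(decode_pf_error(6))
```

For error 6 this decodes to a user-mode write to a non-present page; combined with "segfault at 0" it points to a NULL-pointer write inside libxc.so.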

It looks like I have two copies of libxc.so:

/opt/conda/lib/python3.11/site-packages/gpu4pyscf/lib/deps/lib/libxc.so
/opt/conda/lib/python3.11/site-packages/pyscf/lib/deps/lib/libxc.so

pip freeze | grep scf
gpu4pyscf-cuda12x==1.0
gpu4pyscf-libxc-cuda12x==0.4
pyscf==2.6.2
pyscf-dispersion==1.0.2

Do you have any thoughts on how to fix the segfault?

markperri avatar Jul 25 '24 21:07 markperri

The following things can be helpful to identify the issue:

  1. Run the following code to see whether the issue is with libxc.so in PySCF or with the CUDA build of libxc.so in gpu4pyscf:
import pyscf
from pyscf.dft import rks

atom ='''
O       0.0000000000    -0.0000000000     0.1174000000
H      -0.7570000000    -0.0000000000    -0.4696000000
H       0.7570000000     0.0000000000    -0.4696000000
'''

mol = pyscf.M(atom=atom, basis='def2-tzvpp')
mf = rks.RKS(mol, xc='LDA').density_fit()

e_dft = mf.kernel()  # compute total energy
  2. What is your GPU type?
  3. What is the message before Segmentation fault?

wxj6000 avatar Jul 25 '24 22:07 wxj6000

Thanks for the quick response.

  1. That code runs fine:

converged SCF energy = -75.2427927513195

  2. I am using an A100-40 on Jetstream2. It is sliced to 1/5 of a GPU by the hypervisor on this VM size. I also tried a g3.xl VM size, which uses the entire GPU, and got the same error.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GRID A100X-8C                  On  | 00000000:04:00.0 Off |                    0 |
| N/A   N/A    P0              N/A /  N/A |      0MiB /  8192MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
  3. There are no messages before that line; it's just Segmentation fault. I have to look in /var/log/messages to see the details. I'm not sure if that's due to running it in a Docker container.
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyscf
>>> from gpu4pyscf.dft import rks
>>>
>>> atom ='''
... O       0.0000000000    -0.0000000000     0.1174000000
... H      -0.7570000000    -0.0000000000    -0.4696000000
... H       0.7570000000     0.0000000000    -0.4696000000
... '''
>>>
>>> mol = pyscf.M(atom=atom, basis='def2-tzvpp')
>>> mf = rks.RKS(mol, xc='LDA').density_fit()
>>> e_dft = mf.kernel()  # compute total energy
Segmentation fault

/var/log/messages:

kernel: python[445393]: segfault at 0 ip 00007f8cb2b6fcff sp 00007ffc13e36838 error 6 in libxc.so[7f8cb2b65000+1e0000]
kernel: Code: a9 1a 00 48 8b 44 24 08 48 83 c4 18 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 8b 05 a9 73 1d 00 66 0f ef c9 66 0f ef c0 <48> c7 07 00 00 00 00 c7 47 20 00 00 00 00 48 89 47 08 48 c7 47 38

Thanks, Mark

markperri avatar Jul 25 '24 23:07 markperri

@markperri Thanks for the info. I tried to create a similar environment, but I was not able to reproduce the issue. If possible, could you please share your Dockerfile?

And you have probably tried this already, but sometimes it helps to reinstall or create a fresh conda environment to avoid possible conflicts.

wxj6000 avatar Jul 26 '24 00:07 wxj6000

@wxj6000 Here is a minimal Dockerfile that gives the same error. I wonder if there's something about the way this system is set up. I'll see if I can find another CUDA application tomorrow to test the installation in general. Thanks, Mark

FROM nvidia/cuda:12.2.0-devel-ubuntu22.04

RUN apt-get update -y && \
    apt-get install -y --no-install-recommends \
    python3-dev \
    python3-pip \
    python3-wheel \
    python3-setuptools && \
    rm -rf /var/lib/apt/lists/* /var/cache/apt/archives/*


ENV CUDA_HOME="/usr/local/cuda" LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}"
RUN echo "export PATH=${CUDA_HOME}/bin:\$PATH" >> /etc/bash.bashrc
RUN echo "export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:\$LD_LIBRARY_PATH" >> /etc/bash.bashrc

RUN pip3 install pyscf gpu4pyscf-cuda12x cutensor-cu12

root@23aed08bf45d:/# python3
Python 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyscf
>>> from gpu4pyscf.dft import rks
>>>
>>> atom ='''
... O       0.0000000000    -0.0000000000     0.1174000000
... H      -0.7570000000    -0.0000000000    -0.4696000000
... H       0.7570000000     0.0000000000    -0.4696000000
... '''
>>>
>>> mol = pyscf.M(atom=atom, basis='def2-tzvpp')
>>> mf = rks.RKS(mol, xc='LDA').density_fit()
>>>
>>> e_dft = mf.kernel()  # compute total energy
Segmentation fault (core dumped)

/var/log/messages:

python3[506069]: segfault at 0 ip 00007fa842b6fcff sp 00007ffc591d6028 error 6 in libxc.so[7fa842b65000+1e0000]
kernel: Code: a9 1a 00 48 8b 44 24 08 48 83 c4 18 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 8b 05 a9 73 1d 00 66 0f ef c9 66 0f ef c0 <48> c7 07 00 00 00 00 c7 47 20 00 00 00 00 48 89 47 08 48 c7 47 38

markperri avatar Jul 26 '24 00:07 markperri

@wxj6000 I ran a NAMD container from NVIDIA NGC and it runs fine on the GPU, so at least we know the Docker/GPU setup is working. I'm not sure what else to test.

Fri Jul 26 14:40:34 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GRID A100X-40C                 On  | 00000000:04:00.0 Off |                    0 |
| N/A   N/A    P0              N/A /  N/A |    672MiB / 40960MiB |     61%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     11413      C   namd2                                       671MiB |
+---------------------------------------------------------------------------------------+

markperri avatar Jul 26 '24 14:07 markperri

@markperri I tried the Dockerfile you provided. The container works fine on my side. Let me check if there is a memory leak in the modules.

Python 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyscf
>>> from gpu4pyscf.dft import rks
>>> 
>>> atom ='''
... O       0.0000000000    -0.0000000000     0.1174000000
... H      -0.7570000000    -0.0000000000    -0.4696000000
... H       0.7570000000     0.0000000000    -0.4696000000
... '''
>>> 
>>> mol = pyscf.M(atom=atom, basis='def2-tzvpp')
>>> mf = rks.RKS(mol, xc='LDA').density_fit()
>>> 
>>> e_dft = mf.kernel()  # compute total energy
/usr/local/lib/python3.10/dist-packages/cupy/cuda/compiler.py:233: PerformanceWarning: Jitify is performing a one-time only warm-up to populate the persistent cache, this may take a few seconds and will be improved in a future release...
  jitify._init_module()
converged SCF energy = -75.2427927513248
>>> print(f"total energy = {e_dft}")
total energy = -75.24279275132476
>>> 

wxj6000 avatar Jul 27 '24 23:07 wxj6000

@wxj6000 I compiled gpu4pyscf from source and it still gives the same error. I'll contact the Jetstream2 staff and see if they have any ideas.

Thanks, Mark

markperri avatar Jul 28 '24 15:07 markperri

@markperri I went through the libxc-related code and improved the memory-allocation interface in the libxc wrapper. I am not sure whether it helps on your side, though. https://github.com/pyscf/gpu4pyscf/actions/runs/10133763490/job/28019283314?pr=189

wxj6000 avatar Jul 28 '24 19:07 wxj6000

Thanks for trying. I compiled from source with 8fdfaa8, but I get the same segfault:

kernel: python[43743]: segfault at 0 ip 00007f3fad76ddf3 sp 00007ffeda1ba2c8 error 6 in libxc.so.15[7f3fad763000+224000]
kernel: Code: 00 00 00 75 05 48 83 c4 18 c3 e8 58 68 ff ff 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 8b 05 b5 b2 21 00 66 0f ef c9 66 0f ef c0 <48> c7 07 00 00 00 00 c7 47 20 00 00 00 00 48 89 47 08 48 c7 47 38

markperri avatar Jul 28 '24 23:07 markperri

@markperri Can you check if this PR resolves the issue please? https://github.com/pyscf/gpu4pyscf/pull/180

wxj6000 avatar Aug 21 '24 04:08 wxj6000

Thanks, is that the libxc_overhead branch? I installed it, but it doesn't seem to help:

pip install git+https://github.com/pyscf/gpu4pyscf.git@libxc_overhead
pip install cutensor-cu12

(base) jovyan@d67ddf22943d:/tmp$ python
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyscf
>>> from gpu4pyscf.dft import rks
>>>
>>> atom ='''
... O       0.0000000000    -0.0000000000     0.1174000000
... H      -0.7570000000    -0.0000000000    -0.4696000000
... H       0.7570000000     0.0000000000    -0.4696000000
... '''
>>>
>>> mol = pyscf.M(atom=atom, basis='def2-tzvpp')
>>> mf = rks.RKS(mol, xc='LDA').density_fit()
>>>
>>> e_dft = mf.kernel()  # compute total energy
Segmentation fault

markperri avatar Aug 21 '24 05:08 markperri

Right, it is the libxc_overhead branch. Just to confirm, have you removed any previously installed gpu4pyscf package?

I also registered an account on ChemCompute, but I don't have access to JupyterHub since I no longer have an academic email. Is there any chance of getting a development environment for debugging?

wxj6000 avatar Aug 21 '24 07:08 wxj6000

Yes, this is without any gpu4pyscf installed.

markperri avatar Aug 21 '24 14:08 markperri

Oh and @wxj6000 you should have Jupyter Notebook access now. Thanks, Mark

markperri avatar Aug 21 '24 14:08 markperri

@markperri Thank you for giving me permission to debug. It seems that unified memory, which libxc.so requires, is disabled on this device. Please check the managedMemory entry in the device-properties dict below, and the CUDA documentation for the details: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements

We can switch to the CPU build of libxc when unified memory is not supported on the device. We will let you know the progress.

{'name': b'GRID A100X-8C', 'totalGlobalMem': 8585609216, 'sharedMemPerBlock': 49152, 'regsPerBlock': 65536, 'warpSize': 32, 'maxThreadsPerBlock': 1024, 'maxThreadsDim': (1024, 1024, 64), 'maxGridSize': (2147483647, 65535, 65535), 'clockRate': 1410000, 'totalConstMem': 65536, 'major': 8, 'minor': 0, 'textureAlignment': 512, 'texturePitchAlignment': 32, 'multiProcessorCount': 108, 'kernelExecTimeoutEnabled': 0, 'integrated': 0, 'canMapHostMemory': 1, 'computeMode': 0, 'maxTexture1D': 131072, 'maxTexture2D': (131072, 65536), 'maxTexture3D': (16384, 16384, 16384), 'concurrentKernels': 1, 'ECCEnabled': 1, 'pciBusID': 4, 'pciDeviceID': 0, 'pciDomainID': 0, 'tccDriver': 0, 'memoryClockRate': 1215000, 'memoryBusWidth': 5120, 'l2CacheSize': 41943040, 'maxThreadsPerMultiProcessor': 2048, 'isMultiGpuBoard': 0, 'cooperativeLaunch': 1, 'cooperativeMultiDeviceLaunch': 1, 'deviceOverlap': 1, 'maxTexture1DMipmap': 32768, 'maxTexture1DLinear': 268435456, 'maxTexture1DLayered': (32768, 2048), 'maxTexture2DMipmap': (32768, 32768), 'maxTexture2DLinear': (131072, 65000, 2097120), 'maxTexture2DLayered': (32768, 32768, 2048), 'maxTexture2DGather': (32768, 32768), 'maxTexture3DAlt': (8192, 8192, 32768), 'maxTextureCubemap': 32768, 'maxTextureCubemapLayered': (32768, 2046), 'maxSurface1D': 32768, 'maxSurface1DLayered': (32768, 2048), 'maxSurface2D': (131072, 65536), 'maxSurface2DLayered': (32768, 32768, 2048), 'maxSurface3D': (16384, 16384, 16384), 'maxSurfaceCubemap': 32768, 'maxSurfaceCubemapLayered': (32768, 2046), 'surfaceAlignment': 512, 'asyncEngineCount': 5, 'unifiedAddressing': 1, 'streamPrioritiesSupported': 1, 'globalL1CacheSupported': 1, 'localL1CacheSupported': 1, 'sharedMemPerMultiprocessor': 167936, 'regsPerMultiprocessor': 65536, 'managedMemory': 0, 'multiGpuBoardGroupID': 0, 'hostNativeAtomicSupported': 0, 'singleToDoublePrecisionPerfRatio': 2, 'pageableMemoryAccess': 0, 'concurrentManagedAccess': 0, 'computePreemptionSupported': 1, 'canUseHostPointerForRegisteredMem': 0, 'sharedMemPerBlockOptin': 166912, 'pageableMemoryAccessUsesHostPageTables': 0, 'directManagedMemAccessFromHost': 0, 'uuid': b'_:\x16\x9f_\xd6\x11\xef\xbex\x9d\x11\x11\x8e+\xa9', 'luid': b'', 'luidDeviceNodeMask': 0, 'persistingL2CacheMaxSize': 26214400, 'maxBlocksPerMultiProcessor': 32, 'accessPolicyMaxWindowSize': 134213632, 'reservedSharedMemPerBlock': 1024}
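A quick way to run this check on any machine is to look at the managedMemory field of the device-properties dict. A minimal sketch with only the relevant keys (on a live system the dict would come from cupy.cuda.runtime.getDeviceProperties(0); the values below reproduce the fields reported above for the GRID A100X-8C vGPU):

```python
# Check unified-memory support from a CUDA device-properties dict.
props = {"name": b"GRID A100X-8C", "managedMemory": 0, "concurrentManagedAccess": 0}

def supports_unified_memory(props: dict) -> bool:
    # managedMemory == 1 means cudaMallocManaged allocations work on this device
    return bool(props.get("managedMemory", 0))

if not supports_unified_memory(props):
    print(f"{props['name'].decode()}: unified memory not supported")
```

On this vGPU the check fails, which is why code paths relying on unified memory crash instead of allocating.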

wxj6000 avatar Aug 21 '24 18:08 wxj6000

Oh I see. The way their hypervisor works with vGPUs doesn't allow unified memory. Looks like this package won't be compatible with their system then. Thanks, Mark

markperri avatar Aug 21 '24 19:08 markperri

@markperri The issue has been fixed in v1.0.1. Most tasks can now be executed on ChemCompute. However, due to the limited memory of a GPU slice, some tasks such as Hessian calculations may raise an out-of-memory error.

Thank you for your feedback and your cooperation!

wxj6000 avatar Aug 25 '24 06:08 wxj6000

Thanks! It works great now. I increased the instance size to use the entire GPU and the out-of-memory problems are fixed. However, I had to install it from GitHub; there is something wrong with the package on PyPI. It just downloads every version in turn and then gives up.

(base) jovyan@7db95487cf10:/tmp$ pip install gpu4pyscf
Collecting gpu4pyscf
  Downloading gpu4pyscf-1.0.1.tar.gz (206 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 206.8/206.8 kB 6.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  WARNING: Generating metadata for package gpu4pyscf produced metadata for project name gpu4pyscf-cuda12x. Fix your #egg=gpu4pyscf fragments.
Discarding https://files.pythonhosted.org/packages/3d/68/07452d97f874c77d622e42969fb54c265a734d4f7be86f18944400625bb2/gpu4pyscf-1.0.1.tar.gz (from https://pypi.org/simple/gpu4pyscf/): Requested gpu4pyscf-cuda12x from https://files.pythonhosted.org/packages/3d/68/07452d97f874c77d622e42969fb54c265a734d4f7be86f18944400625bb2/gpu4pyscf-1.0.1.tar.gz has inconsistent name: expected 'gpu4pyscf', but metadata has 'gpu4pyscf-cuda12x'
  Downloading gpu4pyscf-1.0.tar.gz (204 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 205.0/205.0 kB 19.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  WARNING: Generating metadata for package gpu4pyscf produced metadata for project name gpu4pyscf-cuda12x. Fix your #egg=gpu4pyscf fragments.
Discarding https://files.pythonhosted.org/packages/17/00/a9bfefd38206230cd4542106b13cc1d08dcdc6b76f0be112bb4be5fb23f4/gpu4pyscf-1.0.tar.gz (from https://pypi.org/simple/gpu4pyscf/): Requested gpu4pyscf-cuda12x from https://files.pythonhosted.org/packages/17/00/a9bfefd38206230cd4542106b13cc1d08dcdc6b76f0be112bb4be5fb23f4/gpu4pyscf-1.0.tar.gz has inconsistent name: expected 'gpu4pyscf', but metadata has 'gpu4pyscf-cuda12x'
  Downloading gpu4pyscf-0.8.2.tar.gz (204 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 204.9/204.9 kB 13.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  WARNING: Generating metadata for package gpu4pyscf produced metadata for project name gpu4pyscf-cuda12x. Fix your #egg=gpu4pyscf fragments.
Discarding https://files.pythonhosted.org/packages/bb/dc/b33d96a33a406758cf9cd0ea14e5654c3d1310ee9ae7ff466ed6567816ae/gpu4pyscf-0.8.2.tar.gz (from https://pypi.org/simple/gpu4pyscf/): Requested gpu4pyscf-cuda12x from https://files.pythonhosted.org/packages/bb/dc/b33d96a33a406758cf9cd0ea14e5654c3d1310ee9ae7ff466ed6567816ae/gpu4pyscf-0.8.2.tar.gz has inconsistent name: expected 'gpu4pyscf', but metadata has 'gpu4pyscf-cuda12x'

It continues to download older versions of gpu4pyscf and then errors out.
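The "inconsistent name" loop comes from pip's name-consistency check: pip normalizes the requested project name and the name in the sdist's generated metadata, and discards the file when they differ. A simplified sketch of that comparison (the normalization follows PEP 503; the helper names are illustrative, not pip's internals):

```python
import re

def canonicalize(name: str) -> str:
    # PEP 503 normalization: runs of "-", "_", "." collapse to one "-", lowercased
    return re.sub(r"[-_.]+", "-", name).lower()

def metadata_consistent(requested: str, metadata_name: str) -> bool:
    return canonicalize(requested) == canonicalize(metadata_name)

# The gpu4pyscf sdists on PyPI generate metadata for "gpu4pyscf-cuda12x",
# so a request for "gpu4pyscf" is discarded and pip falls back to the next
# (older) version, repeating until it runs out of candidates.
print(metadata_consistent("gpu4pyscf", "gpu4pyscf-cuda12x"))
```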

markperri avatar Aug 25 '24 15:08 markperri

@markperri pip3 install gpu4pyscf-cuda12x will resolve the issue.

wxj6000 avatar Aug 25 '24 16:08 wxj6000

Oh yes, sorry. Forgot that part!

markperri avatar Aug 25 '24 16:08 markperri