iqsharp icon indicating copy to clipboard operation
iqsharp copied to clipboard

Rare race condition can cause Python host programs to hang

Open cgranade opened this issue 4 years ago • 4 comments

Describe the bug When running many small cells (e.g.: %simulate DoFoo for a short operation DoFoo), messages sent by the IQ# kernel can be silently dropped by ZeroMQ polling internal to jupyter_client, causing the client to block forever waiting for messages.

To Reproduce

  • Use %simulate to repeatedly run a small operation; generally hangs ~1 / 1000 calls.

Screenshots Example IQ# debug log showing messages are correctly sent:

2021-09-01T13:59:11.2453868-07:00  [DBG] Received new message:
	{"msg_id":"c9167d57-c1e02da681857e1a7eec6f93_102640_2403","username":"username","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"execute_request","version":"5.3"}
	{"msg_id":"7be84757-0e63-4143-96c2-bd0e34875a1f","username":"cgran","session":null,"msg_type":null,"version":"5.2.0"}
	{}
	{"code":"%simulate DrawTomographyMeasurement {\"prep\": 1, \"meas\": 3}","silent":false,"store_history":true,"user_expressions":{},"allow_stdin":true,"stop_on_error":true} (2e905567)
2021-09-01T13:59:11.2454057-07:00  [DBG] Handling message of type "execute_request". (cbb20ba6)
2021-09-01T13:59:11.2454125-07:00  [DBG] Handing "execute_request" with ordered shell handler. (e6ef4f49)
2021-09-01T13:59:11.2454342-07:00  [DBG] Asked to execute code:
%simulate DrawTomographyMeasurement {"prep": 1, "meas": 3} (31d9ba3a)
2021-09-01T13:59:11.2455601-07:00  [DBG] Sending iopub message:
	{"ZmqIdentities":["YzkxNjdkNTctYzFlMDJkYTY4MTg1N2UxYTdlZWM2Zjkz"],"Signature":null,"Header":{"msg_id":"b76ad47a-e59e-475d-9e4c-bc3a7b829489","username":"cgran","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"status","version":"5.2.0"},"ParentHeader":{"msg_id":"c9167d57-c1e02da681857e1a7eec6f93_102640_2403","username":"username","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"execute_request","version":"5.3"},"Metadata":null,"Content":{"execution_state":"busy"}} (a9a26c78)
2021-09-01T13:59:11.2457356-07:00  [DBG] Sending iopub message:
	{"ZmqIdentities":["YzkxNjdkNTctYzFlMDJkYTY4MTg1N2UxYTdlZWM2Zjkz"],"Signature":null,"Header":{"msg_id":"fba80871-a769-4099-9e01-e88ba3f40de4","username":"cgran","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"execute_input","version":"5.2.0"},"ParentHeader":{"msg_id":"c9167d57-c1e02da681857e1a7eec6f93_102640_2403","username":"username","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"execute_request","version":"5.3"},"Metadata":null,"Content":{"code":"%simulate DrawTomographyMeasurement {\"prep\": 1, \"meas\": 3}","execution_count":2403} (32db5d1d)
2021-09-01T13:59:11.2460310-07:00  [DBG] Looking for magic %simulate (51b66c1e)
2021-09-01T13:59:11.2463072-07:00  [DBG] Using magic %simulate (9429eb87)
2021-09-01T13:59:11.2469593-07:00  [DBG] Sending iopub message:
	{"ZmqIdentities":["YzkxNjdkNTctYzFlMDJkYTY4MTg1N2UxYTdlZWM2Zjkz"],"Signature":null,"Header":{"msg_id":"2d46ae11-dbb4-45f1-99ee-11dd19f6f5df","username":"cgran","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"execute_result","version":"5.2.0"},"ParentHeader":{"msg_id":"c9167d57-c1e02da681857e1a7eec6f93_102640_2403","username":"username","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"execute_request","version":"5.3"},"Metadata":null,"Content":{"execution_count":2403,"data":{"text/plain":"Zero","application/x-qsharp-data":"0"},"metadata":{}} (f0ec6a5d)
2021-09-01T13:59:11.2471142-07:00  [DBG] Sending shell message:
	{"ZmqIdentities":["YzkxNjdkNTctYzFlMDJkYTY4MTg1N2UxYTdlZWM2Zjkz"],"Signature":null,"Header":{"msg_id":"b82ce983-ec05-4313-aafb-232401d9c117","username":"cgran","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"execute_reply","version":"5.2.0"},"ParentHeader":{"msg_id":"c9167d57-c1e02da681857e1a7eec6f93_102640_2403","username":"username","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"execute_request","version":"5.3"},"Metadata":null,"Content":{"status":"ok","execution_count":2403,"payload":[],"user_expressions":null}} (0b76f6d9)
2021-09-01T13:59:11.2472695-07:00  [DBG] Sending iopub message:
	{"ZmqIdentities":["YzkxNjdkNTctYzFlMDJkYTY4MTg1N2UxYTdlZWM2Zjkz"],"Signature":null,"Header":{"msg_id":"4356e3c9-4680-4fc5-97f0-0dd8a7eb005e","username":"cgran","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"status","version":"5.2.0"},"ParentHeader":{"msg_id":"c9167d57-c1e02da681857e1a7eec6f93_102640_2403","username":"username","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"execute_request","version":"5.3"},"Metadata":null,"Content":{"execution_state":"idle"}} (b41328dc)

System information Running against local build of IQ# using main, jupyter_client v7.0.2, pyzmq 22.2.1.

Additional context Add any other context about the problem here.

cgranade avatar Sep 01 '21 21:09 cgranade

Confirmed that the same issue occurs for the following environment.yml:

channels:
- conda-forge
- quantum-engineering
dependencies:
- python>=3.6
- numpy
- scipy
- future
- qutip
- matplotlib<=2.1.2
- ipyparallel
- py
- notebook
- qsharp==0.18.2108.160999
- pyzmq<22.2.1
- pip
- pip:
    - qinfer
    - duecredit
    - mpltools

cgranade avatar Sep 01 '21 21:09 cgranade

Narrowing down a bit further, I was unable to reproduce for the following environment.yml (downgrading IQ# conda package to June):

##
# environment.yml: Specification of a conda environment for use with
#      Q# interoperability samples.
##
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
##

name: process-tomography
channels:
- conda-forge
- quantum-engineering
dependencies:
- python>=3.6
- numpy
- scipy
- future
- qutip
- matplotlib<=2.1.2
- ipyparallel
- py
- notebook
- qsharp==0.18.2106.148911
- jupyter_client==6.1.7
- pip
- pip:
    - qinfer
    - duecredit
    - mpltools

To check if this was due to a regression in IQ#, I manually overwrote the IQ# kernel in this new environment with a dotnet-installed version of the kernel from the August release:

➜  dotnet tool install -g Microsoft.Quantum.IQSharp  
You can invoke the tool using the following command: dotnet-iqsharp
Tool 'microsoft.quantum.iqsharp' (version '0.18.2108160999') was successfully installed.

via 🅒 python-qsharp  took 3s 
➜  etenv process-tomography

via 🅒 process-tomography  
➜  dotnet iqsharp install --sys-prefix
[InstallKernelSpec] Removing existing kernelspec in C:\...\Anaconda3\envs\process-tomography\share\jupyter\kernels\iqsharp
[InstallKernelSpec] Installed kernelspec iqsharp in C:\...\Anaconda3\envs\process-tomography\share\jupyter\kernels\iqsharp

Using the August version of IQ#, I was again unable to reproduce the issue. Checking the full versions of all packages installed in the working environment:

name: process-tomography
channels:
  - quantum-engineering
  - conda-forge
  - defaults
dependencies:
  - argcomplete=1.12.3=pyhd8ed1ab_2
  - argon2-cffi=20.1.0=py37hcc03f2d_2
  - async_generator=1.10=py_0
  - attrs=21.2.0=pyhd8ed1ab_0
  - backcall=0.2.0=pyh9f0ad1d_0
  - backports=1.0=py_2
  - backports.functools_lru_cache=1.6.4=pyhd8ed1ab_0
  - backports.weakref=1.0.post1=py37hc8dfbb8_1002
  - bleach=4.1.0=pyhd8ed1ab_0
  - ca-certificates=2021.5.30=h5b45459_0
  - cffi=1.14.6=py37hd8e9650_0
  - colorama=0.4.4=pyh9f0ad1d_0
  - cycler=0.10.0=py_2
  - debugpy=1.4.1=py37hf2a7229_0
  - decorator=5.0.9=pyhd8ed1ab_0
  - defusedxml=0.7.1=pyhd8ed1ab_0
  - entrypoints=0.3=pyhd8ed1ab_1003
  - freetype=2.10.4=h546665d_1
  - future=0.18.2=py37h03978a9_3
  - icu=64.2=he025d50_1
  - importlib-metadata=4.8.1=py37h03978a9_0
  - importlib_metadata=4.8.1=hd8ed1ab_0
  - intel-openmp=2021.3.0=h57928b3_3372
  - ipykernel=6.3.1=py37h4038f58_0
  - ipyparallel=6.3.0=pyhd8ed1ab_3
  - ipython=7.27.0=py37h4038f58_0
  - ipython_genutils=0.2.0=py_1
  - iqsharp=0.18.2106.148911=py37_0
  - jedi=0.18.0=py37h03978a9_2
  - jinja2=3.0.1=pyhd8ed1ab_0
  - jpeg=9d=h8ffe710_0
  - jsonschema=3.2.0=pyhd8ed1ab_3
  - jupyter_client=6.1.7=py_0
  - jupyter_core=4.7.1=py37h03978a9_0
  - jupyterlab_pygments=0.1.2=pyh9f0ad1d_0
  - kiwisolver=1.3.2=py37h8c56517_0
  - libblas=3.9.0=11_win64_mkl
  - libcblas=3.9.0=11_win64_mkl
  - liblapack=3.9.0=11_win64_mkl
  - libpng=1.6.37=h1d00b33_2
  - libsodium=1.0.18=h8d14728_1
  - m2w64-gcc-libgfortran=5.3.0=6
  - m2w64-gcc-libs=5.3.0=7
  - m2w64-gcc-libs-core=5.3.0=7
  - m2w64-gmp=6.1.0=2
  - m2w64-libwinpthread-git=5.0.0.4634.697f757=2
  - markupsafe=2.0.1=py37hcc03f2d_0
  - matplotlib=2.1.2=py37_1
  - matplotlib-base=2.1.2=py37h2981e6d_1
  - matplotlib-inline=0.1.2=pyhd8ed1ab_2
  - mistune=0.8.4=py37hcc03f2d_1004
  - mkl=2021.3.0=hb70f87d_564
  - msys2-conda-epoch=20160418=1
  - nbclient=0.5.4=pyhd8ed1ab_0
  - nbconvert=6.1.0=py37h03978a9_0
  - nbformat=5.1.3=pyhd8ed1ab_0
  - nest-asyncio=1.5.1=pyhd8ed1ab_0
  - notebook=6.4.3=pyha770c72_0
  - numpy=1.21.2=py37h940b05c_0
  - openssl=1.1.1k=h8ffe710_1
  - packaging=21.0=pyhd8ed1ab_0
  - pandoc=2.14.2=h8ffe710_0
  - pandocfilters=1.4.2=py_1
  - parso=0.8.2=pyhd8ed1ab_0
  - pickleshare=0.7.5=py_1003
  - pip=21.2.4=pyhd8ed1ab_0
  - prometheus_client=0.11.0=pyhd8ed1ab_0
  - prompt-toolkit=3.0.20=pyha770c72_0
  - py=1.10.0=pyhd3deb0d_0
  - pycparser=2.20=pyh9f0ad1d_2
  - pygments=2.10.0=pyhd8ed1ab_0
  - pyparsing=2.4.7=pyh9f0ad1d_0
  - pyqt=5.9.2=py37h6538335_4
  - pyrsistent=0.17.3=py37hcc03f2d_2
  - python=3.7.10=h7840368_100_cpython
  - python-dateutil=2.8.2=pyhd8ed1ab_0
  - python_abi=3.7=2_cp37m
  - pytz=2021.1=pyhd8ed1ab_0
  - pywin32=301=py37hcc03f2d_0
  - pywinpty=1.1.3=py37h7f67f24_0
  - pyzmq=19.0.2=py37h453f00a_2
  - qt=5.9.7=h506e8af_3
  - qutip=4.6.2=py37h0a3528d_0
  - scipy=1.7.1=py37hb6553fb_0
  - send2trash=1.8.0=pyhd8ed1ab_0
  - setuptools=57.4.0=py37h03978a9_0
  - sip=4.19.8=py37h6538335_1000
  - six=1.16.0=pyh6c4a22f_0
  - sqlite=3.36.0=h8ffe710_0
  - tbb=2021.3.0=h2d74725_0
  - terminado=0.11.1=py37h03978a9_0
  - testpath=0.5.0=pyhd8ed1ab_0
  - tornado=6.1=py37hcc03f2d_1
  - traitlets=5.1.0=pyhd8ed1ab_0
  - typing_extensions=3.10.0.0=pyha770c72_0
  - ucrt=10.0.20348.0=h57928b3_0
  - vc=14.2=hb210afc_5
  - vs2015_runtime=14.29.30037=h902a5da_5
  - wcwidth=0.2.5=pyh9f0ad1d_2
  - webencodings=0.5.1=py_1
  - wheel=0.37.0=pyhd8ed1ab_1
  - winpty=0.4.3=4
  - zeromq=4.3.2=h0e60522_5
  - zipp=3.5.0=pyhd8ed1ab_0
  - zlib=1.2.11=h62dcd97_1010
  - pip:
    - azure-quantum==0.18.2106.148911
    - certifi==2021.5.30
    - charset-normalizer==2.0.4
    - citeproc-py==0.6.0
    - configobj==5.0.6
    - duecredit==0.9.1
    - idna==3.2
    - lxml==4.6.3
    - mpltools==0.2.0
    - qinfer==1.0
    - qsharp==0.18.2106.148911
    - qsharp-chemistry==0.18.2106.148911
    - qsharp-core==0.18.2106.148911
    - requests==2.26.0
    - urllib3==1.26.6
prefix: C:\...\Anaconda3\envs\process-tomography

Using the latest IQ# conda package, but with pyzmq pinned to 19.0.2 seems to avoid the issue as well:

name: process-tomography
channels:
  - quantum-engineering
  - conda-forge
  - defaults
dependencies:
  - argcomplete=1.12.3=pyhd8ed1ab_2
  - argon2-cffi=20.1.0=py37hcc03f2d_2
  - async_generator=1.10=py_0
  - attrs=21.2.0=pyhd8ed1ab_0
  - backcall=0.2.0=pyh9f0ad1d_0
  - backports=1.0=py_2
  - backports.functools_lru_cache=1.6.4=pyhd8ed1ab_0
  - bleach=4.1.0=pyhd8ed1ab_0
  - ca-certificates=2021.5.30=h5b45459_0
  - cffi=1.14.6=py37hd8e9650_0
  - colorama=0.4.4=pyh9f0ad1d_0
  - cycler=0.10.0=py_2
  - debugpy=1.4.1=py37hf2a7229_0
  - decorator=5.0.9=pyhd8ed1ab_0
  - defusedxml=0.7.1=pyhd8ed1ab_0
  - entrypoints=0.3=pyhd8ed1ab_1003
  - freetype=2.10.4=h546665d_1
  - future=0.18.2=py37h03978a9_3
  - icu=64.2=he025d50_1
  - importlib-metadata=4.8.1=py37h03978a9_0
  - importlib_metadata=4.8.1=hd8ed1ab_0
  - intel-openmp=2021.3.0=h57928b3_3372
  - ipykernel=6.3.1=py37h4038f58_0
  - ipyparallel=6.3.0=pyhd8ed1ab_3
  - ipython=7.27.0=py37h4038f58_0
  - ipython_genutils=0.2.0=py_1
  - iqsharp=0.18.2108.160999=py37_0
  - jedi=0.18.0=py37h03978a9_2
  - jinja2=3.0.1=pyhd8ed1ab_0
  - jpeg=9d=h8ffe710_0
  - jsonschema=3.2.0=pyhd8ed1ab_3
  - jupyter_client=6.1.7=py_0
  - jupyter_core=4.7.1=py37h03978a9_0
  - jupyterlab_pygments=0.1.2=pyh9f0ad1d_0
  - kiwisolver=1.3.2=py37h8c56517_0
  - libblas=3.9.0=11_win64_mkl
  - libcblas=3.9.0=11_win64_mkl
  - liblapack=3.9.0=11_win64_mkl
  - libpng=1.6.37=h1d00b33_2
  - libsodium=1.0.18=h8d14728_1
  - m2w64-gcc-libgfortran=5.3.0=6
  - m2w64-gcc-libs=5.3.0=7
  - m2w64-gcc-libs-core=5.3.0=7
  - m2w64-gmp=6.1.0=2
  - m2w64-libwinpthread-git=5.0.0.4634.697f757=2
  - markupsafe=2.0.1=py37hcc03f2d_0
  - matplotlib=2.1.2=py37_1
  - matplotlib-base=2.1.2=py37h2981e6d_1
  - matplotlib-inline=0.1.2=pyhd8ed1ab_2
  - mistune=0.8.4=py37hcc03f2d_1004
  - mkl=2021.3.0=hb70f87d_564
  - msys2-conda-epoch=20160418=1
  - nbclient=0.5.4=pyhd8ed1ab_0
  - nbconvert=6.1.0=py37h03978a9_0
  - nbformat=5.1.3=pyhd8ed1ab_0
  - nest-asyncio=1.5.1=pyhd8ed1ab_0
  - notebook=6.4.3=pyha770c72_0
  - numpy=1.21.2=py37h940b05c_0
  - openssl=1.1.1k=h8ffe710_1
  - packaging=21.0=pyhd8ed1ab_0
  - pandoc=2.14.2=h8ffe710_0
  - pandocfilters=1.4.2=py_1
  - parso=0.8.2=pyhd8ed1ab_0
  - pickleshare=0.7.5=py_1003
  - pip=21.2.4=pyhd8ed1ab_0
  - prometheus_client=0.11.0=pyhd8ed1ab_0
  - prompt-toolkit=3.0.20=pyha770c72_0
  - py=1.10.0=pyhd3deb0d_0
  - pycparser=2.20=pyh9f0ad1d_2
  - pygments=2.10.0=pyhd8ed1ab_0
  - pyparsing=2.4.7=pyh9f0ad1d_0
  - pyqt=5.9.2=py37h6538335_4
  - pyrsistent=0.17.3=py37hcc03f2d_2
  - python=3.7.10=h7840368_100_cpython
  - python-dateutil=2.8.2=pyhd8ed1ab_0
  - python_abi=3.7=2_cp37m
  - pytz=2021.1=pyhd8ed1ab_0
  - pywin32=301=py37hcc03f2d_0
  - pywinpty=1.1.3=py37h7f67f24_0
  - pyzmq=19.0.2=py37h453f00a_2
  - qt=5.9.7=h506e8af_3
  - qutip=4.6.2=py37h0a3528d_0
  - scipy=1.7.1=py37hb6553fb_0
  - send2trash=1.8.0=pyhd8ed1ab_0
  - setuptools=57.4.0=py37h03978a9_0
  - sip=4.19.8=py37h6538335_1000
  - six=1.16.0=pyh6c4a22f_0
  - sqlite=3.36.0=h8ffe710_0
  - tbb=2021.3.0=h2d74725_0
  - terminado=0.11.1=py37h03978a9_0
  - testpath=0.5.0=pyhd8ed1ab_0
  - tornado=6.1=py37hcc03f2d_1
  - traitlets=5.1.0=pyhd8ed1ab_0
  - typing_extensions=3.10.0.0=pyha770c72_0
  - ucrt=10.0.20348.0=h57928b3_0
  - vc=14.2=hb210afc_5
  - vs2015_runtime=14.29.30037=h902a5da_5
  - wcwidth=0.2.5=pyh9f0ad1d_2
  - webencodings=0.5.1=py_1
  - wheel=0.37.0=pyhd8ed1ab_1
  - winpty=0.4.3=4
  - zeromq=4.3.2=h0e60522_5
  - zipp=3.5.0=pyhd8ed1ab_0
  - zlib=1.2.11=h62dcd97_1010
  - pip:
    - azure-quantum==0.18.2108.160999
    - certifi==2021.5.30
    - charset-normalizer==2.0.4
    - citeproc-py==0.6.0
    - configobj==5.0.6
    - duecredit==0.9.1
    - idna==3.2
    - lxml==4.6.3
    - mpltools==0.2.0
    - qinfer==1.0
    - qsharp==0.18.2108.160999
    - qsharp-chemistry==0.18.2108.160999
    - qsharp-core==0.18.2108.160999
    - requests==2.26.0
    - urllib3==1.26.6
prefix: C:\...\Anaconda3\envs\process-tomography

cgranade avatar Sep 01 '21 22:09 cgranade

I am experiencing the python/iqsharp stall issue that is likely to be caused by this bug. The attached contentsList.txt is a concatenation of pip list and conda list in the referenced VM/environment. The demoLogDump and the demoLogDumpStall files are the IQSHARP_LOG_LEVEL=Debug logs of two back-to back runs. The fist one terminated success fully after execution_count:210 and output as expected. The second one stalled after execution_count:3 componentsList.txt componentsList.txt demo4logDump.txt demo4logDumpStall.txt

0

alexeib2 avatar Nov 16 '21 20:11 alexeib2

This issue is now happening on our Docker image as well. Looks like the latest version of nbconvert (0.6.4) and ipython (7.34.0) are causing timeouts on IOPub messages.

anpaz avatar Jun 02 '22 21:06 anpaz

This should be fixed with #708, which is now released as part of the QDK 0.25.228311+

anpaz avatar Sep 02 '22 21:09 anpaz