qiskit icon indicating copy to clipboard operation
qiskit copied to clipboard

Some unit tests crash when using symengine

Open iyanmv opened this issue 1 year ago • 10 comments

Environment

  • Qiskit version: 1.1.0rc1 (but also in previous stable versions)
  • Python version: 3.12.3
  • Operating system: Arch Linux

What is happening?

Some unit tests fail due to symengine crashes. For example, the following five tests from test_circuit_load_from_qpy.py crash on my system:

  • test_parameter_expression
  • test_evolutiongate_param_expr_time
  • test_parameter_expression_global_phase
  • test_parameter_vector_element_in_expression
  • test_symengine_full_path

From the first test, the exact line when the the test crashes is this: https://github.com/Qiskit/qiskit/blob/6ff0eb54f88cdbe98275872b8ec9d65f76948bb4/qiskit/qpy/binary_io/value.py#L58

Which produces a src/tcmalloc.cc:304] Attempt to free invalid pointer 0x6146ecca5b60.

How can we reproduce the issue?

Here is a small self-contained script that can replicate the crash described above.

from qiskit.circuit.parameter import Parameter
from qiskit.circuit.parameterexpression import ParameterExpression


theta = Parameter("theta")
phi = Parameter("phi")
sum_param = theta + phi

sum_param._symbol_expr.__reduce__()
# src/tcmalloc.cc:304] Attempt to free invalid pointer 0x645c8e3eaa30 
# [1]    987079 IOT instruction (core dumped)  python

The following running on the same system works:

import symengine

theta = symengine.Symbol("theta")
phi = symengine.Symbol("phi")
sum_param = theta + phi

sum_param.__reduce__()

What should happen?

Ideally, all unit tests would pass in a clean chroot env on Arch Linux.

Any suggestions?

No response

iyanmv avatar May 07 '24 15:05 iyanmv

For some reason, if symengine is imported, there is no crash, i.e.,

import symengine
from qiskit.circuit.parameter import Parameter
from qiskit.circuit.parameterexpression import ParameterExpression


theta = Parameter("theta")
phi = Parameter("phi")
sum_param = theta + phi

sum_param._symbol_expr.__reduce__()

iyanmv avatar May 07 '24 16:05 iyanmv

Thanks for the report. Do you know what version of symengine you're using?

Fwiw, it would be really pretty weird if this is caused by Qiskit and not some allocator problem within Symengine - our use of it is very much inline with their public interface, and we don't go messing around in their internals or anything.

jakelishman avatar May 07 '24 17:05 jakelishman

I'm using symengine 0.11.2, and Arch Linux builds the library with the following options:

cmake -B build -S $pkgname-$pkgver \
  -DCMAKE_INSTALL_PREFIX=/usr \
  -DBUILD_SHARED_LIBS=ON \
  -DWITH_TCMALLOC=ON \
  -DWITH_PTHREAD=ON \
  -DWITH_SYMENGINE_THREAD_SAFE=ON \
  -DWITH_ARB=ON \
  -DWITH_ECM=ON \
  -DWITH_LLVM=ON \
  -DWITH_MPFR=ON \
  -DWITH_MPC=ON \
  -DWITH_PRIMESIEVE=ON \
  -DWITH_BOOST=ON \
  -DWITH_COTIRE=OFF \
  -DWITH_SYSTEM_CEREAL=ON
cmake --build build

iyanmv avatar May 07 '24 20:05 iyanmv

If you installed symengine from pip or conda, you'd probably have got their pre-packaged wheels, which is the same situation we're usually expecting. If you're using it from Arch's repos in the system Python installation, then this could very well be a difference: I don't think symengine's published wheels use tcmalloc, and it could be some fault there (or perhaps some questionable behaviour that happens to work with glibc's malloc, but tcmalloc is rejecting).

The difference between pre-importing symengine and not probably isn't indicative of everything being super fixed (since qiskit does just import symengine internally) - it's more likely a spooky-action-at-a-distance memory bug.

If you have the means to try, would you be able to see if the error repeats if you drop the -DWITH_TCMALLOC=ON (or manually set it to off) while you're building the symengine your Python installation is linking against?

jakelishman avatar May 07 '24 20:05 jakelishman

I was going to do that experiment next, yes. I will rebuild symengine and python-symengine without tcmalloc and see if that solves the issue.

I just installed qiskit quickly in a venv and when I use the symengine from PyPI I also can't replicate the issue.

iyanmv avatar May 07 '24 21:05 iyanmv

I can confirm those tests don't fail when using symengine without tcmalloc. I will create an issue on Arch Linux.

iyanmv avatar May 10 '24 08:05 iyanmv

Thanks so much for looking into it in depth! I guess if the issue is somewhere in Arch / tcmalloc / symengine, we can close this issue on Qiskit? I'll close it, but feel free to re-open if there's more to discuss.

jakelishman avatar May 10 '24 10:05 jakelishman

After solving the crashes due to tcmalloc, I found another error related to symengine (this time a numerical rounding error).

In particular is this test: https://github.com/Qiskit/qiskit/blob/235e581b1f76f29add0399989ed47f47a4e98bb8/test/python/qobj/test_pulse_converter.py#L343

It works running on a virtual env, but if fails using system shared libraries. It's a numerical error happening in this line: https://github.com/Qiskit/qiskit/blob/235e581b1f76f29add0399989ed47f47a4e98bb8/qiskit/circuit/parameterexpression.py#L534

Running on my system real_expr has type symengine.lib.symengine_wrapper.RealDouble and value 3140000.0. However, running float(real_expr) returns 3139999.9999999995. I think this possibility was considered here but it doesn't work as expected: https://github.com/Qiskit/qiskit/blob/235e581b1f76f29add0399989ed47f47a4e98bb8/qiskit/pulse/utils.py#L71-L74

The reason is that decimal has a value 10 so this error is not removed with the rounding. Passing decimals=9 fixes the issue for me.

Note also this relevant comment from NumPy (see Notes):

np.round uses a fast but sometimes inexact algorithm to round floating-point datatypes. For positive decimals it is equivalent to np.true_divide(np.rint(a * 10decimals), 10decimals), which has error due to the inexact representation of decimal fractions in the IEEE floating point standard [1] and errors introduced when scaling by powers of ten. (...) If your goal is to print such values with a fixed number of decimals, it is preferable to use numpy’s float printing routines to limit the number of printed decimals: (...) The float printing routines use an accurate but much more computationally demanding algorithm to compute the number of digits after the decimal point. Alternatively, Python’s builtin round function uses a more accurate but slower algorithm for 64-bit floating point values:

Perhaps it's better to use round(operand, ndigits=decimal) and change the default value of decimal from 10 to 9? What do you think? I can make a PR if you think it's a good idea (I will check first that there are no other side effects after changing that).

iyanmv avatar May 10 '24 11:05 iyanmv

Sorry, I guess round() is out of the table because of complex numbers.

iyanmv avatar May 10 '24 12:05 iyanmv

Also, after thinking a bit I don't understand why there should be such a rounding error in the first place. Do you know why is that error expected from that comment in the code and why decimal has a default value of 10 instead of 14 for example (which should be fine with double precision I guess)? I also can't replicate the error using pure symengine code. I can try to compile symengine without MPFR as a wild guess... but I don't understand what is going on.

iyanmv avatar May 10 '24 12:05 iyanmv

No difference compiling symengine with -DWITH_MPFR=OFF

iyanmv avatar May 12 '24 12:05 iyanmv

I noticed that the ParameterExpression(f/1000) is multiplied by a float here: https://github.com/Qiskit/qiskit/blob/f20dad0d03281bf3e628a7c32c117f41140365d4/qiskit/qobj/converters/pulse_instruction.py#L780

This causes frequency to have the value ParameterExpression(1000000.0*f) in this particular unit test instead of ParameterExpression(1000000*f) which could have be obtained by multiplying by an integer instead, i.e.,

frequency = self.disassemble_value(instruction.frequency) * 10**9

This is what ultimately leads to the numerical error that could have been avoided and has nothing to do with the double precision of the float type.

iyanmv avatar May 12 '24 14:05 iyanmv

And here is a small self-contained example replicating the issue with pure symengine:

import symengine

f = symengine.Symbol("f")
expr = f / 1000

expr_wrong = expr * 1e9
expr_good = expr * 10**9

wrong_value = float(expr_wrong.subs("f", 3.14))
good_value = float(expr_good.subs("f", 3.14))

print(wrong_value)
# 3139999.9999999995
print(good_value)
# 3140000.0

iyanmv avatar May 12 '24 14:05 iyanmv

Sorry for the lack of response here - I reopened the issue, started typing a comment, then got distracted and forgot I hadn't posted.

Thanks so much for tracking this down and providing the fix - that's a very tricksy bug to find, and thanks for filing it on symengine.

jakelishman avatar May 15 '24 17:05 jakelishman

No problem ;)

iyanmv avatar May 16 '24 05:05 iyanmv