
Wrong estimation for scratch memory needed?

Open aromanro opened this issue 3 weeks ago • 10 comments

Context of the problem:

I implemented an MPS quantum computing simulator, similar to what's implemented in my open-source simulator qcsim, but with cuQuantum 25.11.1 and CUDA 13. I used an implementation inspired by the examples, like mps_example.cu... the main difference is that it keeps the singular values separately instead of partitioning them into the site tensors, as described in Vidal's papers. I believe Qiskit Aer uses the same method for its MPS implementation as well.

Since I need the singular values to be kept separately, SVD is done with CUTENSORNET_TENSOR_SVD_PARTITION_NONE and the 'gate split' is done with CUTENSORNET_GATE_SPLIT_ALGO_DIRECT.
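
Roughly, the SVD configuration is set up like this (a minimal sketch, assuming a valid handle, with error checks omitted; the attribute and enum names are the standard cuTensorNet ones):

// A minimal sketch of the SVD configuration used here (handle assumed valid).
cutensornetTensorSVDConfig_t svdConfig;
cutensornetCreateTensorSVDConfig(handle, &svdConfig);

// keep the singular values as a separate output instead of folding them into U or V
cutensornetTensorSVDPartition_t partition = CUTENSORNET_TENSOR_SVD_PARTITION_NONE;
cutensornetTensorSVDConfigSetAttribute(handle, svdConfig,
    CUTENSORNET_TENSOR_SVD_CONFIG_S_PARTITION, &partition, sizeof(partition));

// optionally select the SVD algorithm (GESVD is the default; GESVDJ was also tried)
cutensornetTensorSVDAlgo_t algo = CUTENSORNET_TENSOR_SVD_ALGO_GESVD;
cutensornetTensorSVDConfigSetAttribute(handle, svdConfig,
    CUTENSORNET_TENSOR_SVD_CONFIG_ALGO, &algo, sizeof(algo));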

It works well, but I needed some hacks to make it work, because of issues when obtaining the scratch memory size needed for various operations (two of them are the source of the problem: SVD and cutensornetGateSplit, which internally uses SVD, so it's probably related to SVD).

Problem:

In both cases, cutensornetWorkspaceGetMemorySize appears to return an insufficient size for the operation (despite being called with CUTENSORNET_WORKSIZE_PREF_MAX).

For example, for 4 qubits with no bond dimension limit set, the max virtual extent needed is 4 (so SVD is done on a matrix of max size 8x8 containing complex values - floats or doubles - as the physical extent is 2).

Displayed from the code (each value is the max of the previous and the current estimation):

Workspace size requested after estimating one qubit gate: 512 bytes.
Workspace size requested after estimating two qubit contraction: 768 bytes.
Workspace size requested after estimating two qubit gate: 549120 bytes, host workspace: 0 bytes.
SVD device workspace size required: 548608 bytes.

The adjusted value - to avoid the issues - becomes 8785920, and with this it works well... if I allocate only 549120 bytes, then when trying to apply a two-qubit gate, CUTENSORNET_STATUS_INSUFFICIENT_WORKSPACE occurs when cutensornetTensorSVD is called.

As a note, I tried some algorithms other than GESVD; most of the available ones don't seem to have the accuracy we need, but GESVDJ might... the problem gets much worse there, and I had to multiply the scratch memory value by 64 to avoid getting errors.

'Hack' solution:

I multiplied the scratch size by 2 * 'complex data size' (8 for float, 16 for double), and this seems to provide a size big enough. Removing the factor of 2 can still produce errors sometimes.
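
In code, the workaround amounts to something like this (a sketch; useDouble is an illustrative flag, not the actual variable name):

// Sketch of the workaround: pad the queried scratch size by 2 * the complex element size.
// 'useDouble' is illustrative; the simulator defaults to single precision.
int64_t workspaceSize = 0;
cutensornetWorkspaceGetMemorySize(handle, workDesc, CUTENSORNET_WORKSIZE_PREF_MAX,
    CUTENSORNET_MEMSPACE_DEVICE, CUTENSORNET_WORKSPACE_SCRATCH, &workspaceSize);

const int64_t complexSize = useDouble ? 16 : 8;               // bytes per complex value
const int64_t paddedSize  = workspaceSize * 2 * complexSize;  // the 'hack' factor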

What I also tried:

I tried to compute the memory size needed at runtime, reallocating if the reported size was bigger, but it didn't work... and worse, in some cases, if I recall correctly, a gigantic value was returned, larger than the available memory.

Concern:

Multiplying by 16 (or 32 for double, but we default to float since it needs less memory and is faster) might mean a lot of memory for many qubits and no, or a big, max bond dimension limit, with quite a bit of it allocated unnecessarily.

aromanro avatar Dec 16 '25 12:12 aromanro

Here is the log (generated with CUTENSORNET_LOG_LEVEL=5) for applying a two-qubit gate when the issue occurs:

[2025-12-16 14:23:50:604:715][cuTensorNet][16472][Api][cutensornetCreateTensorDescriptor] handle=0X56A0B096B890 numModes=4 extents=[2,2,2,2] strides=[] modes=[2,37,42,6] dataType=4 tensorDesc=0X7FFDBCC7DAD8
[2025-12-16 14:23:50:604:726][cuTensorNet][16472][Api][cutensornetCreateNetwork] handle=0X56A0B096B890 networkDesc=0X7FFDBCC7DAE0
[2025-12-16 14:23:50:604:729][cuTensorNet][16472][Api][cutensornetNetworkAppendTensor] handle=0X56A0B096B890 networkDesc=0X56A0C22F2620 numModes=3 extents=[2,2,2] modeLabels=[2,37,4] qualifiers= dataType=4 tensorId=0X7FFDBCC7DC50
[2025-12-16 14:23:50:604:742][cuTensorNet][16472][Api][cutensornetNetworkAppendTensor] handle=0X56A0B096B890 networkDesc=0X56A0C22F2620 numModes=3 extents=[2,2,2] modeLabels=[4,42,6] qualifiers= dataType=4 tensorId=0X7FFDBCC7DC58
[2025-12-16 14:23:50:604:745][cuTensorNet][16472][Api][cutensornetNetworkSetOutputTensor] handle=0X56A0B096B890 networkDesc=0X56A0C22F2620 numModes=4 modeLabels=[2,37,42,6] dataType=4
[2025-12-16 14:23:50:604:748][cuTensorNet][16472][Api][cutensornetNetworkSetAttribute] handle=0X56A0B096B890 networkDesc=0X56A0C22F2620 attr=30 buf=0X56A0B0961A54 sizeInBytes=4
[2025-12-16 14:23:50:604:750][cuTensorNet][16472][Api][cutensornetCreateContractionOptimizerConfig] handle=0X56A0B096B890 optimizerConfig=0X7FFDBCC7DAE8
[2025-12-16 14:23:50:604:756][cuTensorNet][16472][Api][cutensornetCreateContractionOptimizerInfo] handle=0X56A0B096B890 networkDesc=0X56A0C22F2620 optimizerInfo=0X7FFDBCC7DAF0
[2025-12-16 14:23:50:604:759][cuTensorNet][16472][Api][cutensornetContractionOptimize] handle=0X56A0B096B890 networkDesc=0X56A0C22F2620 optimizerConfig=0X56A0C30BAB60 workspaceSizeConstraint=549120 optimizerInfo=0X56A0C30BB150
[2025-12-16 14:23:50:604:764][cuTensorNet][16472][Info][cutensornetContractionOptimize] INFO about architecture requested 8   data_type 4   compute_type 4.
[2025-12-16 14:23:50:604:850][cuTensorNet][16472][Api][cutensornetNetworkPrepareContraction] handle=0X56A0B096B890 networkDesc=0X56A0C22F2620 workDesc=0X56A0BC65D000
[2025-12-16 14:23:50:604:860][cuTensorNet][16472][Trace][cutensornetNetworkPrepareContraction] workspace=0X1317C42200 workspaceSizeProvided=549120
[2025-12-16 14:23:50:605:099][cuTensorNet][16472][Api][cutensornetNetworkSetInputTensorMemory] handle=0X56A0B096B890 networkDesc=0X56A0C22F2620 tensorId=0 buffer=0X1317C21800 strides=0X0
[2025-12-16 14:23:50:605:121][cuTensorNet][16472][Trace][cutensornetNetworkSetInputTensorMemory] Tensor(0) strides=[]
[2025-12-16 14:23:50:605:133][cuTensorNet][16472][Api][cutensornetNetworkSetInputTensorMemory] handle=0X56A0B096B890 networkDesc=0X56A0C22F2620 tensorId=1 buffer=0X1317C21A00 strides=0X0
[2025-12-16 14:23:50:605:142][cuTensorNet][16472][Trace][cutensornetNetworkSetInputTensorMemory] Tensor(1) strides=[]
[2025-12-16 14:23:50:605:145][cuTensorNet][16472][Api][cutensornetNetworkSetOutputTensorMemory] handle=0X56A0B096B890 networkDesc=0X56A0C22F2620 buffer=0X1317C21C00 strides=0X0
[2025-12-16 14:23:50:605:155][cuTensorNet][16472][Trace][cutensornetNetworkSetOutputTensorMemory] Output tensor strides=[]
[2025-12-16 14:23:50:605:165][cuTensorNet][16472][Api][cutensornetNetworkContract] handle=0X56A0B096B890 networkDesc=0X56A0C22F2620 accumulateOutput=0 workDesc=0X56A0BC65D000 sliceGroup=0X0 stream=0X56A0AFF388B0
[2025-12-16 14:23:50:605:169][cuTensorNet][16472][Trace][cutensornetNetworkContract] Provided device scratchWorkspace=0X1317C42200 scratchWorkspaceSize=549120 cacheWorkspace=0X0 cacheWorkspaceSize=0
[2025-12-16 14:23:50:605:221][cuTensorNet][16472][Api][cutensornetDestroyContractionOptimizerInfo] optimizerInfo=0X56A0C30BB150
[2025-12-16 14:23:50:605:232][cuTensorNet][16472][Api][cutensornetDestroyContractionOptimizerConfig] optimizerConfig=0X56A0C30BAB60
[2025-12-16 14:23:50:605:234][cuTensorNet][16472][Api][cutensornetDestroyNetwork] desc=0X56A0C22F2620
[2025-12-16 14:23:50:605:278][cuTensorNet][16472][Api][cutensornetCreateTensorDescriptor] handle=0X56A0B096B890 numModes=3 extents=[2,2,4] strides=[] modes=[2,37,4] dataType=4 tensorDesc=0X7FFDBCC7DAF8
[2025-12-16 14:23:50:605:289][cuTensorNet][16472][Api][cutensornetCreateTensorDescriptor] handle=0X56A0B096B890 numModes=3 extents=[4,2,2] strides=[] modes=[4,42,6] dataType=4 tensorDesc=0X7FFDBCC7DB00
[2025-12-16 14:23:50:605:293][cuTensorNet][16472][Api][cutensornetTensorSVD] handle=0X56A0B096B890 descTensorIn=0X56A0C29E44E0 rawDataIn=0X1317C21C00 descTensorU=0X56A0C279E770 u=0X1317C20C00 s=0X1317C21400 descTensorV=0X56A0C27A3650 v=0X1317C20E00 svdConfig=0X56A0BC659B60 svdInfo=0X56A0BC658C00 workDesc=0X56A0BC65D000 stream=0X56A0AFF388B0
[2025-12-16 14:23:50:605:304][cuTensorNet][16472][Trace][cutensornetTensorSVD] deviceWorkspacePtr=0X1317C42200 deviceWorkspaceSize=549120 hostWorkspacePtr=0X0 hostWorkspaceSize=0
[2025-12-16 14:23:50:605:318][cuTensorNet][16472][Trace][cutensornetTensorSVD] cusolverDnXgesvd_bufferSize(handle=0X56A0BC65BDF0 params=0X0 jobu=79 jobvt=83 m=4 n=4 dataTypeA=4 A=0X0 lda=4 dataTypeS=0 S=0X0 dataTypeU=4 U=0X0 ldu=4 dataTypeVT=4 VT=0X0 ldvt=4 computeType=4 workspaceInBytesOnDevice=0X7FFDBCC7C028 workspaceInBytesOnHost=0X7FFDBCC7C030)
[2025-12-16 14:23:50:605:354][cuTensorNet][16472][Trace][cutensornetTensorSVD] cusolverDnXgesvd(handle=0X56A0BC65BDF0 params=0X0 jobu=79 jobvt=83 m=4 n=4 dataTypeA=4 A=0X1317C42200 lda=4 dataTypeS=0 S=0X1317C42300 dataTypeU=4 U=0X1317C42200 ldu=4 dataTypeVT=4 VT=0X1317C42400 ldvt=4 computeType=4 bufferOnDevice=0X1317C42500 workspaceInBytesOnDevice=546560 bufferOnHost=0X0 workspaceInBytesOnHost=0 info=0X1317CC7C00)
[2025-12-16 14:23:50:605:822][cuTensorNet][16472][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X56A0C29E44E0
[2025-12-16 14:23:50:605:836][cuTensorNet][16472][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X56A0C279EE70
[2025-12-16 14:23:50:605:839][cuTensorNet][16472][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X56A0C27A09D0
[2025-12-16 14:23:50:605:850][cuTensorNet][16472][Api][cutensornetGetTensorDetails] handle=0X56A0B096B890 tensorDesc=0X56A0C279E770 numModes=0X7FFDBCC7DA5C dataSize=0X0 modeLabels=0X0 extents=0X7FFDBCC7DCE0 strides=0X0
[2025-12-16 14:23:50:605:912][cuTensorNet][16472][Api][cutensornetCreateTensorDescriptor] handle=0X56A0B096B890 numModes=4 extents=[2,2,2,2] strides=[] modes=[42,41,44,45] dataType=4 tensorDesc=0X7FFDBCC7DBD0
[2025-12-16 14:23:50:605:923][cuTensorNet][16472][Api][cutensornetCreateTensorDescriptor] handle=0X56A0B096B890 numModes=3 extents=[4,2,2] strides=[] modes=[4,44,6] dataType=4 tensorDesc=0X7FFDBCC7DBE0
[2025-12-16 14:23:50:605:926][cuTensorNet][16472][Api][cutensornetCreateTensorDescriptor] handle=0X56A0B096B890 numModes=3 extents=[2,2,1] strides=[] modes=[6,45,8] dataType=4 tensorDesc=0X7FFDBCC7DBE8
[2025-12-16 14:23:50:605:937][cuTensorNet][16472][Api][cutensornetGateSplit] handle=0X56A0B096B890 descTensorInA=0X56A0C27A3650 rawDataInA=0X1317C20E00 descTensorInB=0X56A0C27A0540 rawDataInB=0X1317C21000 descTensorInG=0X56A0C27A09D0 rawDataInG=0X1317A01170 descTensorU=0X56A0C279EE70 u=0X1317C20E00 s=0X1317C21600 descTensorV=0X56A0C29E44E0 v=0X1317C21000 gateAlgo=0 svdConfig=0X56A0BC659B60 computeType=4 svdInfo=0x56a0bc658c00 workDesc=0X56A0BC65D000 stream=95248146729136
[2025-12-16 14:23:50:605:961][cuTensorNet][16472][Info][cutensornetGateSplit] INFO about architecture requested 8   data_type 4   compute_type 4.
[2025-12-16 14:23:50:606:317][cuTensorNet][16472][Trace][cutensornetGateSplit] cusolverDnXgesvd_bufferSize(handle=0X56A0BC65BDF0 params=0X0 jobu=79 jobvt=83 m=8 n=2 dataTypeA=4 A=0X0 lda=8 dataTypeS=0 S=0X0 dataTypeU=4 U=0X0 ldu=8 dataTypeVT=4 VT=0X0 ldvt=2 computeType=4 workspaceInBytesOnDevice=0X7FFDBCC7B648 workspaceInBytesOnHost=0X7FFDBCC7B650)
[2025-12-16 14:23:50:606:567][cuTensorNet][16472][Error][cutensornetGateSplit] Insufficient device workspace (549120 bytes) provided to executeGateSplit(...), need 6560256 bytes (2).
[2025-12-16 14:23:50:606:581][cuTensorNet][16472][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X56A0C27A3F20
[2025-12-16 14:23:50:606:593][cuTensorNet][16472][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X56A0C279E770
[2025-12-16 14:23:50:606:596][cuTensorNet][16472][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X56A0C27A3650
[2025-12-16 14:23:50:606:606][cuTensorNet][16472][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X56A0C27A0540
[2025-12-16 14:23:50:606:609][cuTensorNet][16472][Api][cutensornetDestroy] handle=0X56A0B096B890
[2025-12-16 14:23:50:607:133][cuTensorNet][16472][Api][cutensornetDestroyWorkspaceDescriptor] workDesc=0X56A0BC65D000
[2025-12-16 14:23:50:607:148][cuTensorNet][16472][Api][cutensornetDestroyTensorSVDConfig] svdConfig=0X56A0BC659B60
[2025-12-16 14:23:50:607:151][cuTensorNet][16472][Api][cutensornetDestroyTensorSVDInfo] svdInfo=0X56A0BC658C00
Error: [2025-12-16 14:23:50:607:201][cuTensorNet][16472][Api][cutensornetGetErrorString] error=19: CUTENSORNET_STATUS_INSUFFICIENT_WORKSPACE

As a note, when a two-qubit gate is applied on non-adjacent qubits, swaps are applied to bring them together. Swapping is done by explicitly contracting the site tensors, applying the swap with a CUDA kernel, then doing SVD. This might be more efficient than the generic gate split.
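
For illustration, a sketch of what such a swap kernel could look like, assuming the contracted two-site tensor T(left, p1, p2, right) is stored row-major with single-precision complex data (not the actual implementation):

#include <cuComplex.h>

// Sketch of a swap kernel: exchange the two physical indices of the contracted
// two-site tensor, T(left, p1, p2, right) -> T(left, p2, p1, right).
// Row-major layout and single-precision complex data are assumed here.
__global__ void swapPhysicalIndices(const cuFloatComplex* in, cuFloatComplex* out,
                                    int64_t left, int64_t right)
{
    const int64_t n = left * 2 * 2 * right;
    for (int64_t idx = blockIdx.x * (int64_t)blockDim.x + threadIdx.x; idx < n;
         idx += (int64_t)gridDim.x * blockDim.x)
    {
        const int64_t r  = idx % right;
        const int64_t p2 = (idx / right) % 2;
        const int64_t p1 = (idx / (right * 2)) % 2;
        const int64_t l  = idx / (right * 2 * 2);
        const int64_t dst = ((l * 2 + p2) * 2 + p1) * right + r;  // p1 and p2 exchanged
        out[dst] = in[idx];
    }
}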

aromanro avatar Dec 16 '25 12:12 aromanro

Hello @aromanro, are you able to extract an isolated example that runs a single gate split and share the snippet as well as the logging here? From your log I can't see where the workspace descriptor is used for the workspace query, and I also see that you used a single workDesc across the network contraction as well as the tensor SVD + gate split.

Meanwhile, are you following the same practice as mps_example.cu of querying only the largest gate-split problem?

yangcal avatar Dec 16 '25 16:12 yangcal

Yes, a single workspace is used (with the allocated memory being the max needed for any of the operations involved) in almost all cases.

Yes, the largest problem is computed like in mps_example.cu.

I will come back tomorrow with logs separated by operation.

aromanro avatar Dec 16 '25 18:12 aromanro

Here it is, along with some operations done before applying the two-qubit gate where the issue occurs - swapping, to bring the sites together, then another two-qubit gate on the adjacent qubits:

********************************************************************************
Applying swap on sites 1, 2
********************************************************************************
********************************************************************************
Contract two sites then SVD
********************************************************************************
[2025-12-17 11:15:44:741:765][cuTensorNet][3482][Api][cutensornetCreateTensorDescriptor] handle=0X62CB03D9E990 numModes=4 extents=[1,2,2,1] strides=[] modes=[2,36,39,6] dataType=4 tensorDesc=0X7FFD565CA508
[2025-12-17 11:15:44:741:769][cuTensorNet][3482][Api][cutensornetCreateNetwork] handle=0X62CB03D9E990 networkDesc=0X7FFD565CA510
[2025-12-17 11:15:44:741:771][cuTensorNet][3482][Api][cutensornetNetworkAppendTensor] handle=0X62CB03D9E990 networkDesc=0X62CB15433DC0 numModes=3 extents=[1,2,1] modeLabels=[2,36,4] qualifiers= dataType=4 tensorId=0X7FFD565CA680
[2025-12-17 11:15:44:741:775][cuTensorNet][3482][Api][cutensornetNetworkAppendTensor] handle=0X62CB03D9E990 networkDesc=0X62CB15433DC0 numModes=3 extents=[1,2,1] modeLabels=[4,39,6] qualifiers= dataType=4 tensorId=0X7FFD565CA688
[2025-12-17 11:15:44:741:778][cuTensorNet][3482][Api][cutensornetNetworkSetOutputTensor] handle=0X62CB03D9E990 networkDesc=0X62CB15433DC0 numModes=4 modeLabels=[2,36,39,6] dataType=4
[2025-12-17 11:15:44:741:780][cuTensorNet][3482][Api][cutensornetNetworkSetAttribute] handle=0X62CB03D9E990 networkDesc=0X62CB15433DC0 attr=30 buf=0X62CB03D94B54 sizeInBytes=4
[2025-12-17 11:15:44:741:782][cuTensorNet][3482][Api][cutensornetCreateContractionOptimizerConfig] handle=0X62CB03D9E990 optimizerConfig=0X7FFD565CA518
[2025-12-17 11:15:44:741:788][cuTensorNet][3482][Api][cutensornetCreateContractionOptimizerInfo] handle=0X62CB03D9E990 networkDesc=0X62CB15433DC0 optimizerInfo=0X7FFD565CA520
[2025-12-17 11:15:44:741:791][cuTensorNet][3482][Api][cutensornetContractionOptimize] handle=0X62CB03D9E990 networkDesc=0X62CB15433DC0 optimizerConfig=0X62CB15BD72B0 workspaceSizeConstraint=549120 optimizerInfo=0X62CB15E1ABD0
[2025-12-17 11:15:44:741:795][cuTensorNet][3482][Info][cutensornetContractionOptimize] INFO about architecture requested 8   data_type 4   compute_type 4.
[2025-12-17 11:15:44:741:877][cuTensorNet][3482][Api][cutensornetNetworkPrepareContraction] handle=0X62CB03D9E990 networkDesc=0X62CB15433DC0 workDesc=0X62CB0FA90100
[2025-12-17 11:15:44:741:888][cuTensorNet][3482][Trace][cutensornetNetworkPrepareContraction] workspace=0X1317C42200 workspaceSizeProvided=549120
[2025-12-17 11:15:44:742:044][cuTensorNet][3482][Api][cutensornetNetworkSetInputTensorMemory] handle=0X62CB03D9E990 networkDesc=0X62CB15433DC0 tensorId=0 buffer=0X1317C21800 strides=0X0
[2025-12-17 11:15:44:742:055][cuTensorNet][3482][Trace][cutensornetNetworkSetInputTensorMemory] Tensor(0) strides=[]
[2025-12-17 11:15:44:742:057][cuTensorNet][3482][Api][cutensornetNetworkSetInputTensorMemory] handle=0X62CB03D9E990 networkDesc=0X62CB15433DC0 tensorId=1 buffer=0X1317C21A00 strides=0X0
[2025-12-17 11:15:44:742:068][cuTensorNet][3482][Trace][cutensornetNetworkSetInputTensorMemory] Tensor(1) strides=[]
[2025-12-17 11:15:44:742:078][cuTensorNet][3482][Api][cutensornetNetworkSetOutputTensorMemory] handle=0X62CB03D9E990 networkDesc=0X62CB15433DC0 buffer=0X1317C21C00 strides=0X0
[2025-12-17 11:15:44:742:081][cuTensorNet][3482][Trace][cutensornetNetworkSetOutputTensorMemory] Output tensor strides=[]
[2025-12-17 11:15:44:742:091][cuTensorNet][3482][Api][cutensornetNetworkContract] handle=0X62CB03D9E990 networkDesc=0X62CB15433DC0 accumulateOutput=0 workDesc=0X62CB0FA90100 sliceGroup=0X0 stream=0X62CB03347E10
[2025-12-17 11:15:44:742:096][cuTensorNet][3482][Trace][cutensornetNetworkContract] Provided device scratchWorkspace=0X1317C42200 scratchWorkspaceSize=549120 cacheWorkspace=0X0 cacheWorkspaceSize=0
[2025-12-17 11:15:44:742:130][cuTensorNet][3482][Api][cutensornetDestroyContractionOptimizerInfo] optimizerInfo=0X62CB15E1ABD0
[2025-12-17 11:15:44:742:148][cuTensorNet][3482][Api][cutensornetDestroyContractionOptimizerConfig] optimizerConfig=0X62CB15BD72B0
[2025-12-17 11:15:44:742:150][cuTensorNet][3482][Api][cutensornetDestroyNetwork] desc=0X62CB15433DC0
********************************************************************************
Contraction done
********************************************************************************
********************************************************************************
Swapping
********************************************************************************
********************************************************************************
Swapping done
********************************************************************************
********************************************************************************
Applying SVD
********************************************************************************
[2025-12-17 11:15:44:742:204][cuTensorNet][3482][Api][cutensornetCreateTensorDescriptor] handle=0X62CB03D9E990 numModes=3 extents=[1,2,2] strides=[] modes=[2,36,4] dataType=4 tensorDesc=0X7FFD565CA528
[2025-12-17 11:15:44:742:207][cuTensorNet][3482][Api][cutensornetCreateTensorDescriptor] handle=0X62CB03D9E990 numModes=3 extents=[2,2,1] strides=[] modes=[4,39,6] dataType=4 tensorDesc=0X7FFD565CA530
[2025-12-17 11:15:44:742:219][cuTensorNet][3482][Api][cutensornetTensorSVD] handle=0X62CB03D9E990 descTensorIn=0X62CB15BD4950 rawDataIn=0X1317C21C00 descTensorU=0X62CB14C54380 u=0X1317C20C00 s=0X1317C21400 descTensorV=0X62CB15BD2EC0 v=0X1317C20E00 svdConfig=0X62CB0FA8CC60 svdInfo=0X62CB0FA8BD00 workDesc=0X62CB0FA90100 stream=0X62CB03347E10
[2025-12-17 11:15:44:742:222][cuTensorNet][3482][Trace][cutensornetTensorSVD] deviceWorkspacePtr=0X1317C42200 deviceWorkspaceSize=549120 hostWorkspacePtr=0X0 hostWorkspaceSize=0
[2025-12-17 11:15:44:742:227][cuTensorNet][3482][Trace][cutensornetTensorSVD] cusolverDnXgesvd_bufferSize(handle=0X62CB0FA8EEF0 params=0X0 jobu=79 jobvt=83 m=2 n=2 dataTypeA=4 A=0X0 lda=2 dataTypeS=0 S=0X0 dataTypeU=4 U=0X0 ldu=2 dataTypeVT=4 VT=0X0 ldvt=2 computeType=4 workspaceInBytesOnDevice=0X7FFD565C8A58 workspaceInBytesOnHost=0X7FFD565C8A60)
[2025-12-17 11:15:44:742:248][cuTensorNet][3482][Trace][cutensornetTensorSVD] cusolverDnXgesvd(handle=0X62CB0FA8EEF0 params=0X0 jobu=79 jobvt=83 m=2 n=2 dataTypeA=4 A=0X1317C42200 lda=2 dataTypeS=0 S=0X1317C42300 dataTypeU=4 U=0X1317C42200 ldu=2 dataTypeVT=4 VT=0X1317C42400 ldvt=2 computeType=4 bufferOnDevice=0X1317C42500 workspaceInBytesOnDevice=546560 bufferOnHost=0X0 workspaceInBytesOnHost=0 info=0X1317CC7C00)
********************************************************************************
Finished applying SVD
********************************************************************************
[2025-12-17 11:15:44:742:927][cuTensorNet][3482][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X62CB15BD4950
[2025-12-17 11:15:44:742:937][cuTensorNet][3482][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X62CB158F72F0
[2025-12-17 11:15:44:742:948][cuTensorNet][3482][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X62CB15E1AB40
[2025-12-17 11:15:44:742:958][cuTensorNet][3482][Api][cutensornetGetTensorDetails] handle=0X62CB03D9E990 tensorDesc=0X62CB14C54380 numModes=0X7FFD565CA48C dataSize=0X0 modeLabels=0X0 extents=0X7FFD565CA710 strides=0X0
********************************************************************************
Contraction and SVD done
********************************************************************************
********************************************************************************
Applying two-qubit gate on sites 2, 3
********************************************************************************
[2025-12-17 11:15:44:743:038][cuTensorNet][3482][Api][cutensornetCreateTensorDescriptor] handle=0X62CB03D9E990 numModes=4 extents=[2,2,2,2] strides=[] modes=[39,40,41,42] dataType=4 tensorDesc=0X7FFD565CA340
[2025-12-17 11:15:44:743:042][cuTensorNet][3482][Api][cutensornetCreateTensorDescriptor] handle=0X62CB03D9E990 numModes=3 extents=[1,2,2] strides=[] modes=[4,41,6] dataType=4 tensorDesc=0X7FFD565CA350
[2025-12-17 11:15:44:743:045][cuTensorNet][3482][Api][cutensornetCreateTensorDescriptor] handle=0X62CB03D9E990 numModes=3 extents=[2,2,1] strides=[] modes=[6,42,8] dataType=4 tensorDesc=0X7FFD565CA358
[2025-12-17 11:15:44:743:048][cuTensorNet][3482][Api][cutensornetGateSplit] handle=0X62CB03D9E990 descTensorInA=0X62CB15BD2EC0 rawDataInA=0X1317C20E00 descTensorInB=0X62CB15BD6720 rawDataInB=0X1317C21000 descTensorInG=0X62CB15E1AB40 rawDataInG=0X1317C20800 descTensorU=0X62CB158F72F0 u=0X1317C20E00 s=0X1317C21600 descTensorV=0X62CB15BD4950 v=0X1317C21000 gateAlgo=0 svdConfig=0X62CB0FA8CC60 computeType=4 svdInfo=0x62cb0fa8bd00 workDesc=0X62CB0FA90100 stream=108624071654928
[2025-12-17 11:15:44:743:067][cuTensorNet][3482][Info][cutensornetGateSplit] INFO about architecture requested 8   data_type 4   compute_type 4.
[2025-12-17 11:15:44:743:434][cuTensorNet][3482][Trace][cutensornetGateSplit] cusolverDnXgesvd_bufferSize(handle=0X62CB0FA8EEF0 params=0X0 jobu=79 jobvt=83 m=2 n=2 dataTypeA=4 A=0X0 lda=2 dataTypeS=0 S=0X0 dataTypeU=4 U=0X0 ldu=2 dataTypeVT=4 VT=0X0 ldvt=2 computeType=4 workspaceInBytesOnDevice=0X7FFD565C7DB8 workspaceInBytesOnHost=0X7FFD565C7DC0)
[2025-12-17 11:15:44:743:454][cuTensorNet][3482][Info][cutensornetGateSplit] INFO about architecture requested 8   data_type 4   compute_type 4.
[2025-12-17 11:15:44:743:815][cuTensorNet][3482][Trace][cutensornetGateSplit] cusolverDnXgesvd_bufferSize(handle=0X62CB0FA8EEF0 params=0X0 jobu=79 jobvt=83 m=2 n=2 dataTypeA=4 A=0X0 lda=2 dataTypeS=0 S=0X0 dataTypeU=4 U=0X0 ldu=2 dataTypeVT=4 VT=0X0 ldvt=2 computeType=4 workspaceInBytesOnDevice=0X7FFD565C86B8 workspaceInBytesOnHost=0X7FFD565C86C0)
[2025-12-17 11:15:44:743:846][cuTensorNet][3482][Trace][cutensornetGateSplit] cusolverDnXgesvd(handle=0X62CB0FA8EEF0 params=0X0 jobu=79 jobvt=83 m=2 n=2 dataTypeA=4 A=0X1317C42300 lda=2 dataTypeS=0 S=0X1317C42400 dataTypeU=4 U=0X1317C42300 ldu=2 dataTypeVT=4 VT=0X1317C42500 ldvt=2 computeType=4 bufferOnDevice=0X1317C42600 workspaceInBytesOnDevice=546304 bufferOnHost=0X0 workspaceInBytesOnHost=0 info=0X1317CC7C00)
[2025-12-17 11:15:44:744:344][cuTensorNet][3482][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X62CB15E1AB40
[2025-12-17 11:15:44:744:357][cuTensorNet][3482][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X62CB15BD2EC0
[2025-12-17 11:15:44:744:360][cuTensorNet][3482][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X62CB15BD6720
[2025-12-17 11:15:44:744:371][cuTensorNet][3482][Api][cutensornetGetTensorDetails] handle=0X62CB03D9E990 tensorDesc=0X62CB158F72F0 numModes=0X7FFD565CA360 dataSize=0X0 modeLabels=0X0 extents=0X7FFD565CA450 strides=0X0
********************************************************************************
Finished applying two-qubit gate on sites 2, 3
********************************************************************************
Executing gate: 24 on qubits 2, 1
********************************************************************************
Applying two-qubit gate on sites 1, 2
********************************************************************************
[2025-12-17 11:15:44:744:509][cuTensorNet][3482][Api][cutensornetCreateTensorDescriptor] handle=0X62CB03D9E990 numModes=4 extents=[2,2,2,2] strides=[] modes=[36,41,43,44] dataType=4 tensorDesc=0X7FFD565CA380
[2025-12-17 11:15:44:744:521][cuTensorNet][3482][Api][cutensornetCreateTensorDescriptor] handle=0X62CB03D9E990 numModes=3 extents=[1,2,2] strides=[] modes=[2,43,4] dataType=4 tensorDesc=0X7FFD565CA390
[2025-12-17 11:15:44:744:533][cuTensorNet][3482][Api][cutensornetCreateTensorDescriptor] handle=0X62CB03D9E990 numModes=3 extents=[2,2,2] strides=[] modes=[4,44,6] dataType=4 tensorDesc=0X7FFD565CA398
[2025-12-17 11:15:44:744:545][cuTensorNet][3482][Api][cutensornetGateSplit] handle=0X62CB03D9E990 descTensorInA=0X62CB14C54380 rawDataInA=0X1317C20C00 descTensorInB=0X62CB158F72F0 rawDataInB=0X1317C20E00 descTensorInG=0X62CB15E1AB40 rawDataInG=0X1317C20800 descTensorU=0X62CB15BD2810 u=0X1317C20C00 s=0X1317C21400 descTensorV=0X62CB15BD3970 v=0X1317C20E00 gateAlgo=0 svdConfig=0X62CB0FA8CC60 computeType=4 svdInfo=0x62cb0fa8bd00 workDesc=0X62CB0FA90100 stream=108624071654928
[2025-12-17 11:15:44:744:565][cuTensorNet][3482][Info][cutensornetGateSplit] INFO about architecture requested 8   data_type 4   compute_type 4.
[2025-12-17 11:15:44:744:948][cuTensorNet][3482][Trace][cutensornetGateSplit] cusolverDnXgesvd_bufferSize(handle=0X62CB0FA8EEF0 params=0X0 jobu=79 jobvt=83 m=4 n=2 dataTypeA=4 A=0X0 lda=4 dataTypeS=0 S=0X0 dataTypeU=4 U=0X0 ldu=4 dataTypeVT=4 VT=0X0 ldvt=2 computeType=4 workspaceInBytesOnDevice=0X7FFD565C7DF8 workspaceInBytesOnHost=0X7FFD565C7E00)
[2025-12-17 11:15:44:745:229][cuTensorNet][3482][Error][cutensornetGateSplit] Insufficient device workspace (549120 bytes) provided to executeGateSplit(...), need 6560256 bytes (2).
[2025-12-17 11:15:44:745:245][cuTensorNet][3482][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X62CB15E198A0
[2025-12-17 11:15:44:745:248][cuTensorNet][3482][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X62CB14C54380
[2025-12-17 11:15:44:745:250][cuTensorNet][3482][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X62CB158F72F0
[2025-12-17 11:15:44:745:261][cuTensorNet][3482][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X62CB15BD4950
[2025-12-17 11:15:44:745:271][cuTensorNet][3482][Api][cutensornetDestroy] handle=0X62CB03D9E990
[2025-12-17 11:15:44:745:770][cuTensorNet][3482][Api][cutensornetDestroyWorkspaceDescriptor] workDesc=0X62CB0FA90100
[2025-12-17 11:15:44:745:785][cuTensorNet][3482][Api][cutensornetDestroyTensorSVDConfig] svdConfig=0X62CB0FA8CC60
[2025-12-17 11:15:44:745:787][cuTensorNet][3482][Api][cutensornetDestroyTensorSVDInfo] svdInfo=0X62CB0FA8BD00
Error: [2025-12-17 11:15:44:745:838][cuTensorNet][3482][Api][cutensornetGetErrorString] error=19: CUTENSORNET_STATUS_INSUFFICIENT_WORKSPACE
CUTENSORNET_STATUS_INSUFFICIENT_WORKSPACE in line 378, file: /home/adrian/maestro-gpu-simlators/lib/mpsimplgates.cu

aromanro avatar Dec 17 '25 09:12 aromanro

Thanks for sharing the log. I would further ask for the log of the part where the workspace size is queried on the maximal problem, for verification.

That being said, I would like to share one observation from our dev team: the workspace requirement for the GateSplit operation does not always increase monotonically with the problem size (the bond dimension). So in principle one would need to query the workspace size for each problem size that may occur during the simulation and take the maximum. The monotonicity assumption was the basis of the mps_example.cu example, but it may not always hold, and that led to your confusion. Can you check whether performing the workspace query on all problem sizes solves your problem? [We acknowledge that this can incur some small overhead for small simulations.]

A side note: we'll be entering the holiday season, so responses from our dev team may be delayed.

yangcal avatar Dec 18 '25 02:12 yangcal

Here is the log of the MPS creation, including finding the max workspace needed:

[2025-12-18 13:38:30:294:676][cuTensorNet][5736][Api][cutensornetCreate] handle=0X565183A56CA8
[2025-12-18 13:38:30:294:917][cuTensorNet][5736][Info][cutensornetCreate] cuTensorNet version: 21000, cuTENSOR version: 20401
[2025-12-18 13:38:30:662:896][cuTensorNet][5736][Api][cutensornetCreateWorkspaceDescriptor] handle=0X565183A60A90 workDesc=0X565183A56CC8
[2025-12-18 13:38:30:662:924][cuTensorNet][5736][Api][cutensornetCreateTensorDescriptor] handle=0X565183A60A90 numModes=3 extents=[1,2,1] strides=[] modes=[0,1,2] dataType=4 tensorDesc=0X7FFE4FEFD630
[2025-12-18 13:38:30:663:360][cuTensorNet][5736][Api][cutensornetCreateTensorDescriptor] handle=0X565183A60A90 numModes=3 extents=[1,2,1] strides=[] modes=[2,3,4] dataType=4 tensorDesc=0X7FFE4FEFD630
[2025-12-18 13:38:30:663:377][cuTensorNet][5736][Api][cutensornetCreateTensorDescriptor] handle=0X565183A60A90 numModes=3 extents=[1,2,1] strides=[] modes=[4,5,6] dataType=4 tensorDesc=0X7FFE4FEFD630
[2025-12-18 13:38:30:663:381][cuTensorNet][5736][Api][cutensornetCreateTensorDescriptor] handle=0X565183A60A90 numModes=3 extents=[1,2,1] strides=[] modes=[6,7,8] dataType=4 tensorDesc=0X7FFE4FEFD630
[2025-12-18 13:38:30:663:545][cuTensorNet][5736][Api][cutensornetCreateTensorSVDConfig] handle=0X565183A60A90, svdConfig=0X565183A56CD0
[2025-12-18 13:38:30:663:568][cuTensorNet][5736][Api][cutensornetCreateTensorSVDInfo] handle=0X565183A60A90, svdInfo=0X565183A56CD8
[2025-12-18 13:38:30:663:584][cuTensorNet][5736][Api][cutensornetTensorSVDConfigSetAttribute] handle=0X565183A60A90 svdConfig=0X56518F74ED60 attr=0 buf=0X7FFE4FEFD500 sizeInBytes=8
[2025-12-18 13:38:30:663:592][cuTensorNet][5736][Api][cutensornetTensorSVDConfigSetAttribute] handle=0X565183A60A90 svdConfig=0X56518F74ED60 attr=2 buf=0X7FFE4FEFD4FC sizeInBytes=4
[2025-12-18 13:38:30:663:594][cuTensorNet][5736][Api][cutensornetTensorSVDConfigSetAttribute] handle=0X565183A60A90 svdConfig=0X56518F74ED60 attr=3 buf=0X7FFE4FEFD4F8 sizeInBytes=4
[2025-12-18 13:38:30:664:394][cuTensorNet][5736][Api][cutensornetCreateWorkspaceDescriptor] handle=0X565183A60A90 workDesc=0X565183A56CC8
********************************************************************************
Computing max workspace size for one qubit gate...
********************************************************************************
[2025-12-18 13:38:30:664:422][cuTensorNet][5736][Api][cutensornetCreateTensorDescriptor] handle=0X565183A60A90 numModes=2 extents=[2,2] strides=[] modes=[112,113] dataType=4 tensorDesc=0X7FFE4FEFD4A8
[2025-12-18 13:38:30:664:436][cuTensorNet][5736][Api][cutensornetCreateTensorDescriptor] handle=0X565183A60A90 numModes=3 extents=[4,2,4] strides=[] modes=[105,112,106] dataType=4 tensorDesc=0X7FFE4FEFD4A0
[2025-12-18 13:38:30:664:439][cuTensorNet][5736][Api][cutensornetCreateTensorDescriptor] handle=0X565183A60A90 numModes=3 extents=[4,2,4] strides=[] modes=[105,113,106] dataType=4 tensorDesc=0X7FFE4FEFD4B0
[2025-12-18 13:38:30:664:442][cuTensorNet][5736][Api][cutensornetCreateNetwork] handle=0X565183A60A90 networkDesc=0X7FFE4FEFD4B8
[2025-12-18 13:38:30:664:453][cuTensorNet][5736][Api][cutensornetNetworkAppendTensor] handle=0X565183A60A90 networkDesc=0X56518F752FF0 numModes=3 extents=[4,2,4] modeLabels=[105,112,106] qualifiers= dataType=4 tensorId=0X7FFE4FEFD520
[2025-12-18 13:38:30:664:469][cuTensorNet][5736][Api][cutensornetNetworkAppendTensor] handle=0X565183A60A90 networkDesc=0X56518F752FF0 numModes=2 extents=[2,2] modeLabels=[112,113] qualifiers= dataType=4 tensorId=0X7FFE4FEFD528
[2025-12-18 13:38:30:664:477][cuTensorNet][5736][Api][cutensornetNetworkSetOutputTensor] handle=0X565183A60A90 networkDesc=0X56518F752FF0 numModes=3 modeLabels=[105,113,106] dataType=4
[2025-12-18 13:38:30:664:480][cuTensorNet][5736][Api][cutensornetNetworkSetAttribute] handle=0X565183A60A90 networkDesc=0X56518F752FF0 attr=30 buf=0X565183A56C54 sizeInBytes=4
[2025-12-18 13:38:30:664:580][cuTensorNet][5736][Api][cutensornetCreateContractionOptimizerConfig] handle=0X565183A60A90 optimizerConfig=0X7FFE4FEFD4D0
[2025-12-18 13:38:30:664:921][cuTensorNet][5736][Api][cutensornetCreateContractionOptimizerInfo] handle=0X565183A60A90 networkDesc=0X56518F752FF0 optimizerInfo=0X7FFE4FEFD4D8
[2025-12-18 13:38:30:664:936][cuTensorNet][5736][Api][cutensornetContractionOptimize] handle=0X565183A60A90 networkDesc=0X56518F752FF0 optimizerConfig=0X56518F753480 workspaceSizeConstraint=21584726016 optimizerInfo=0X56518F753A70
[2025-12-18 13:38:30:664:947][cuTensorNet][5736][Info][cutensornetContractionOptimize] INFO about architecture requested 8   data_type 4   compute_type 4.
[2025-12-18 13:38:30:665:853][cuTensorNet][5736][Api][cutensornetWorkspaceComputeContractionSizes] handle=0X565183A60A90 networkDesc=0X56518F752FF0 optimizerInfo=0X56518F753A70 workDesc=0X56518F752200
[2025-12-18 13:38:30:666:093][cuTensorNet][5736][Api][cutensornetWorkspaceGetMemorySize] handle=0X565183A60A90 workDesc=0X56518F752200 workPref=2 memSpace=0 workKind=0 memorySize=0X7FFE4FEFD4E0
[2025-12-18 13:38:30:666:105][cuTensorNet][5736][Api][cutensornetDestroyContractionOptimizerInfo] optimizerInfo=0X56518F753A70
[2025-12-18 13:38:30:666:108][cuTensorNet][5736][Api][cutensornetDestroyContractionOptimizerConfig] optimizerConfig=0X56518F753480
[2025-12-18 13:38:30:666:119][cuTensorNet][5736][Api][cutensornetDestroyNetwork] desc=0X56518F752FF0
[2025-12-18 13:38:30:666:123][cuTensorNet][5736][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X56518F7527D0
[2025-12-18 13:38:30:666:133][cuTensorNet][5736][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X56518F7523E0
[2025-12-18 13:38:30:666:135][cuTensorNet][5736][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X56518F752C20
********************************************************************************
Finished computing max workspace size for one qubit gate.
********************************************************************************
********************************************************************************
Computing max workspace size two qubits contraction...
********************************************************************************
[2025-12-18 13:38:30:666:144][cuTensorNet][5736][Api][cutensornetCreateTensorDescriptor] handle=0X565183A60A90 numModes=3 extents=[4,2,4] strides=[] modes=[105,112,106] dataType=4 tensorDesc=0X7FFE4FEFD478
[2025-12-18 13:38:30:666:147][cuTensorNet][5736][Api][cutensornetCreateTensorDescriptor] handle=0X565183A60A90 numModes=3 extents=[4,2,4] strides=[] modes=[106,113,107] dataType=4 tensorDesc=0X7FFE4FEFD480
[2025-12-18 13:38:30:666:150][cuTensorNet][5736][Api][cutensornetCreateTensorDescriptor] handle=0X565183A60A90 numModes=4 extents=[4,2,2,4] strides=[] modes=[105,112,113,106] dataType=4 tensorDesc=0X7FFE4FEFD488
[2025-12-18 13:38:30:666:152][cuTensorNet][5736][Api][cutensornetCreateNetwork] handle=0X565183A60A90 networkDesc=0X7FFE4FEFD490
[2025-12-18 13:38:30:666:155][cuTensorNet][5736][Api][cutensornetNetworkAppendTensor] handle=0X565183A60A90 networkDesc=0X56518F752FF0 numModes=3 extents=[4,2,4] modeLabels=[105,112,106] qualifiers= dataType=4 tensorId=0X7FFE4FEFD4F0
[2025-12-18 13:38:30:666:159][cuTensorNet][5736][Api][cutensornetNetworkAppendTensor] handle=0X565183A60A90 networkDesc=0X56518F752FF0 numModes=3 extents=[4,2,4] modeLabels=[106,113,107] qualifiers= dataType=4 tensorId=0X7FFE4FEFD4F8
[2025-12-18 13:38:30:666:162][cuTensorNet][5736][Api][cutensornetNetworkSetOutputTensor] handle=0X565183A60A90 networkDesc=0X56518F752FF0 numModes=4 modeLabels=[105,112,113,106] dataType=4
[2025-12-18 13:38:30:666:171][cuTensorNet][5736][Api][cutensornetNetworkSetAttribute] handle=0X565183A60A90 networkDesc=0X56518F752FF0 attr=30 buf=0X565183A56C54 sizeInBytes=4
[2025-12-18 13:38:30:666:253][cuTensorNet][5736][Api][cutensornetCreateContractionOptimizerConfig] handle=0X565183A60A90 optimizerConfig=0X7FFE4FEFD4A8
[2025-12-18 13:38:30:666:269][cuTensorNet][5736][Api][cutensornetCreateContractionOptimizerInfo] handle=0X565183A60A90 networkDesc=0X56518F752FF0 optimizerInfo=0X7FFE4FEFD4B0
[2025-12-18 13:38:30:666:279][cuTensorNet][5736][Api][cutensornetContractionOptimize] handle=0X565183A60A90 networkDesc=0X56518F752FF0 optimizerConfig=0X565191B80380 workspaceSizeConstraint=21584726016 optimizerInfo=0X56518F753450
[2025-12-18 13:38:30:666:288][cuTensorNet][5736][Info][cutensornetContractionOptimize] INFO about architecture requested 8   data_type 4   compute_type 4.
[2025-12-18 13:38:30:666:394][cuTensorNet][5736][Api][cutensornetWorkspaceComputeContractionSizes] handle=0X565183A60A90 networkDesc=0X56518F752FF0 optimizerInfo=0X56518F753450 workDesc=0X56518F752200
[2025-12-18 13:38:30:666:569][cuTensorNet][5736][Api][cutensornetWorkspaceGetMemorySize] handle=0X565183A60A90 workDesc=0X56518F752200 workPref=2 memSpace=0 workKind=0 memorySize=0X7FFE4FEFD4B8
[2025-12-18 13:38:30:666:580][cuTensorNet][5736][Api][cutensornetDestroyContractionOptimizerInfo] optimizerInfo=0X56518F753450
[2025-12-18 13:38:30:666:582][cuTensorNet][5736][Api][cutensornetDestroyContractionOptimizerConfig] optimizerConfig=0X565191B80380
[2025-12-18 13:38:30:666:591][cuTensorNet][5736][Api][cutensornetDestroyNetwork] desc=0X56518F752FF0
[2025-12-18 13:38:30:666:601][cuTensorNet][5736][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X56518F752C20
[2025-12-18 13:38:30:666:603][cuTensorNet][5736][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X56518F7523E0
[2025-12-18 13:38:30:666:605][cuTensorNet][5736][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X56518F7527D0
********************************************************************************
Finished computing max workspace size two qubits contraction.
********************************************************************************
********************************************************************************
Computing max workspace size two qubits gate...
********************************************************************************
[2025-12-18 13:38:30:666:612][cuTensorNet][5736][Api][cutensornetCreateTensorDescriptor] handle=0X565183A60A90 numModes=3 extents=[4,2,4] strides=[] modes=[105,112,106] dataType=4 tensorDesc=0X7FFE4FEFD488
[2025-12-18 13:38:30:666:615][cuTensorNet][5736][Api][cutensornetCreateTensorDescriptor] handle=0X565183A60A90 numModes=3 extents=[4,2,4] strides=[] modes=[106,113,107] dataType=4 tensorDesc=0X7FFE4FEFD490
[2025-12-18 13:38:30:666:618][cuTensorNet][5736][Api][cutensornetCreateTensorDescriptor] handle=0X565183A60A90 numModes=4 extents=[2,2,2,2] strides=[] modes=[112,113,114,115] dataType=4 tensorDesc=0X7FFE4FEFD498
[2025-12-18 13:38:30:666:620][cuTensorNet][5736][Api][cutensornetCreateTensorDescriptor] handle=0X565183A60A90 numModes=3 extents=[4,2,4] strides=[] modes=[105,114,106] dataType=4 tensorDesc=0X7FFE4FEFD4A0
[2025-12-18 13:38:30:666:623][cuTensorNet][5736][Api][cutensornetCreateTensorDescriptor] handle=0X565183A60A90 numModes=3 extents=[4,2,4] strides=[] modes=[106,115,107] dataType=4 tensorDesc=0X7FFE4FEFD4A8
[2025-12-18 13:38:30:666:865][cuTensorNet][5736][Api][cutensornetWorkspaceComputeGateSplitSizes] handle=0X565183A60A90 descTensorInA=0X56518F7527D0 descTensorInB=0X56518F7523E0 descTensorInG=0X56518F752C20 descTensorU=0X56518F7532F0 descTensorV=0X565191B7EE30 gateAlgo=0 svdConfig=0X56518F74ED60 computeType=4 workDesc=0X56518F752200
[2025-12-18 13:38:30:666:910][cuTensorNet][5736][Info][cutensornetWorkspaceComputeGateSplitSizes] INFO about architecture requested 8   data_type 4   compute_type 4.
[2025-12-18 13:38:30:671:850][cuTensorNet][5736][Trace][cutensornetWorkspaceComputeGateSplitSizes] cusolverDnXgesvd_bufferSize(handle=0X56518F750FF0 params=0X0 jobu=79 jobvt=83 m=8 n=8 dataTypeA=4 A=0X0 lda=8 dataTypeS=0 S=0X0 dataTypeU=4 U=0X0 ldu=8 dataTypeVT=4 VT=0X0 ldvt=8 computeType=4 workspaceInBytesOnDevice=0X7FFE4FEFBBD8 workspaceInBytesOnHost=0X7FFE4FEFBBE0)
[2025-12-18 13:38:30:673:427][cuTensorNet][5736][Api][cutensornetWorkspaceGetMemorySize] handle=0X565183A60A90 workDesc=0X56518F752200 workPref=2 memSpace=0 workKind=0 memorySize=0X7FFE4FEFD4B0
[2025-12-18 13:38:30:673:443][cuTensorNet][5736][Api][cutensornetWorkspaceGetMemorySize] handle=0X565183A60A90 workDesc=0X56518F752200 workPref=2 memSpace=1 workKind=0 memorySize=0X7FFE4FEFD4B8
[2025-12-18 13:38:30:673:446][cuTensorNet][5736][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X56518F7527D0
[2025-12-18 13:38:30:673:448][cuTensorNet][5736][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X56518F7523E0
[2025-12-18 13:38:30:673:458][cuTensorNet][5736][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X56518F752C20
[2025-12-18 13:38:30:673:461][cuTensorNet][5736][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X56518F7532F0
[2025-12-18 13:38:30:673:464][cuTensorNet][5736][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X565191B7EE30
********************************************************************************
Finished computing max workspace size two qubits gate.
********************************************************************************
********************************************************************************
Computing max workspace size for SVD...
********************************************************************************
[2025-12-18 13:38:30:673:481][cuTensorNet][5736][Api][cutensornetCreateTensorDescriptor] handle=0X565183A60A90 numModes=3 extents=[4,2,4] strides=[] modes=[105,114,106] dataType=4 tensorDesc=0X7FFE4FEFD4B0
[2025-12-18 13:38:30:673:496][cuTensorNet][5736][Api][cutensornetCreateTensorDescriptor] handle=0X565183A60A90 numModes=3 extents=[4,2,4] strides=[] modes=[106,115,107] dataType=4 tensorDesc=0X7FFE4FEFD4B8
[2025-12-18 13:38:30:673:499][cuTensorNet][5736][Api][cutensornetCreateTensorDescriptor] handle=0X565183A60A90 numModes=4 extents=[4,4,2,2] strides=[] modes=[105,107,114,115] dataType=4 tensorDesc=0X7FFE4FEFD4C0
[2025-12-18 13:38:30:673:502][cuTensorNet][5736][Api][cutensornetWorkspaceComputeSVDSizes] handle=0X565183A60A90 descTensorIn=0X56518F752C20 descTensorU=0X565191B7EE30 descTensorV=0X56518F7532F0 svdConfig=0X56518F74ED60 workDesc=0X56518F752200
[2025-12-18 13:38:30:673:509][cuTensorNet][5736][Trace][cutensornetWorkspaceComputeSVDSizes] cusolverDnXgesvd_bufferSize(handle=0X56518F750FF0 params=0X0 jobu=79 jobvt=83 m=8 n=8 dataTypeA=4 A=0X0 lda=8 dataTypeS=0 S=0X0 dataTypeU=4 U=0X0 ldu=8 dataTypeVT=4 VT=0X0 ldvt=8 computeType=4 workspaceInBytesOnDevice=0X7FFE4FEFD248 workspaceInBytesOnHost=0X7FFE4FEFD250)
[2025-12-18 13:38:30:673:524][cuTensorNet][5736][Api][cutensornetWorkspaceGetMemorySize] handle=0X565183A60A90 workDesc=0X56518F752200 workPref=2 memSpace=0 workKind=0 memorySize=0X7FFE4FEFD4C8
[2025-12-18 13:38:30:673:534][cuTensorNet][5736][Api][cutensornetWorkspaceGetMemorySize] handle=0X565183A60A90 workDesc=0X56518F752200 workPref=2 memSpace=1 workKind=0 memorySize=0X7FFE4FEFD4D0
[2025-12-18 13:38:30:673:544][cuTensorNet][5736][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X56518F752C20
[2025-12-18 13:38:30:673:554][cuTensorNet][5736][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X565191B7EE30
[2025-12-18 13:38:30:673:564][cuTensorNet][5736][Api][cutensornetDestroyTensorDescriptor] tensorDesc=0X56518F7532F0
********************************************************************************
Finished computing max workspace size for SVD.
********************************************************************************
[2025-12-18 13:38:30:673:634][cuTensorNet][5736][Api][cutensornetWorkspaceSetMemory] handle=0X565183A60A90 workDesc=0X56518F752200 memSpace=0 workKind=0 memoryPtr=0X1317C42200 memorySize=549120
[2025-12-18 13:38:30:673:656][cuTensorNet][5736][Trace][cutensornetWorkspaceSetMemory] workDesc([workSizes=[
[(CUTENSORNET_WORKSPACE_SCRATCH,CUTENSORNET_MEMSPACE_DEVICE)=[size=0, sizeNeeded=(548352,548352,548352,)], enabled, not owned],
[(CUTENSORNET_WORKSPACE_SCRATCH,CUTENSORNET_MEMSPACE_HOST)=[size=0, sizeNeeded=(0,0,0,)], disabled, not owned],
[(CUTENSORNET_WORKSPACE_CACHE,CUTENSORNET_MEMSPACE_DEVICE)=[size=0, sizeNeeded=(0,0,0,)], disabled, not owned],
[(CUTENSORNET_WORKSPACE_CACHE,CUTENSORNET_MEMSPACE_HOST)=[size=0, sizeNeeded=(0,0,0,)], disabled, not owned],
] cacheMap=[CUTENSORNET_MEMSPACE_DEVICE=],[CUTENSORNET_MEMSPACE_HOST=],
workPool=[(CUTENSORNET_WORKSPACE_SCRATCH,CUTENSORNET_MEMSPACE_DEVICE)=[cudaMemoryTypeDevice, SCRATCH, 0X1317C42200 : 0X1317CC8300 , 549120, 549120], blocks=[[0X1317C42200  : 0X1317CC8300 , [2145, 0, 0X0 , 0X0 ]],],
]
workPoolCircular=[]
[deviceMempoolStream(0X0 ), dataStream(0X0 ), ]
])
MPS created!
MPS, valid: 1, created: 1, Nr qubits: 4

As for querying each possible problem size, I doubt it can be done at initialization for a situation with many qubits: the singular value vectors and the site tensors grow depending on the particular gates applied, and there can be many possible configurations (extents from 1 up to the max extent on each side, so querying max_extent^2 configurations would be needed). But I did try querying at runtime before each operation is applied (resetting the workspace if the required memory was higher than the existing one)... if I recall correctly, in some situations the workspace needed was reported to be much higher than the available memory, so that solution did not work.

aromanro avatar Dec 18 '25 11:12 aromanro

querying at runtime before each operation is applied

This is the recommended workflow.

in some situations the workspace needed was reported to be much higher than the available memory

This is odd. Have you been releasing the unused buffers with cudaFree to make sure that memory is available when you want to allocate the extra workspace? And what's the problem size that caused the OOM error? If your GPU does not have enough memory to perform that single operation, there wouldn't be any way around it.

yangcal avatar Dec 18 '25 20:12 yangcal

I have not released the workspace memory yet (I was planning to release it only if needed - that is, if the needed memory was higher than what was already allocated).

The problem size was not big; I think I reproduced it with 4 qubits, where the site tensors cannot grow very large.

I'll try today to change the code to loop over all possible configurations, just to see if the computed value improves (although this seems overkill for a large bond dimension limit, like 1000).

aromanro avatar Dec 19 '25 07:12 aromanro

I tried something like this for computing the size needed for SVD:

for (int64_t leftVirtualExtent = 2; leftVirtualExtent <= maxVirtualExtent; ++leftVirtualExtent)
    for (int64_t middleVirtualExtent = 2; middleVirtualExtent <= maxVirtualExtent; ++middleVirtualExtent)
        for (int64_t rightVirtualExtent = 2; rightVirtualExtent <= maxVirtualExtent; ++rightVirtualExtent)
        {
            // here compute as before, but replacing what was maxVirtualExtent with the values from the loops
        }
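
For completeness, the body looks roughly like this (a sketch; it mirrors the SVD size query visible in the log above, with handle, svdConfig, workDesc, and maxDeviceSize assumed to exist and error checks omitted):

// Sketch of the loop body: build descriptors for this (left, middle, right)
// configuration, query the SVD workspace size, and keep the maximum.
// (Note: some combinations may turn out to be unphysical and rejected - see below.)
int64_t extIn[4] = {leftVirtualExtent, rightVirtualExtent, 2, 2};  // (left, right, p1, p2)
int64_t extU[3]  = {leftVirtualExtent, 2, middleVirtualExtent};    // (left, p1, middle)
int64_t extV[3]  = {middleVirtualExtent, 2, rightVirtualExtent};   // (middle, p2, right)
int32_t modesIn[4] = {'l', 'r', 'p', 'q'};
int32_t modesU[3]  = {'l', 'p', 'm'};
int32_t modesV[3]  = {'m', 'q', 'r'};

cutensornetTensorDescriptor_t descIn, descU, descV;
cutensornetCreateTensorDescriptor(handle, 4, extIn, nullptr, modesIn, CUDA_C_32F, &descIn);
cutensornetCreateTensorDescriptor(handle, 3, extU, nullptr, modesU, CUDA_C_32F, &descU);
cutensornetCreateTensorDescriptor(handle, 3, extV, nullptr, modesV, CUDA_C_32F, &descV);

cutensornetWorkspaceComputeSVDSizes(handle, descIn, descU, descV, svdConfig, workDesc);

int64_t deviceSize = 0;
cutensornetWorkspaceGetMemorySize(handle, workDesc, CUTENSORNET_WORKSIZE_PREF_MAX,
    CUTENSORNET_MEMSPACE_DEVICE, CUTENSORNET_WORKSPACE_SCRATCH, &deviceSize);
if (deviceSize > maxDeviceSize) maxDeviceSize = deviceSize;

cutensornetDestroyTensorDescriptor(descIn);
cutensornetDestroyTensorDescriptor(descU);
cutensornetDestroyTensorDescriptor(descV);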

For 4 qubits it seems to work; it will often supply a value large enough. Here is what I displayed from the loops:

SVD device workspace size required: 538880 bytes.
SVD device workspace size required: 6560256 bytes.
SVD device workspace size required: 6560256 bytes.
SVD device workspace size required: 538880 bytes.
SVD device workspace size required: 6560256 bytes.
SVD device workspace size required: 6560256 bytes.
SVD device workspace size required: 538880 bytes.
SVD device workspace size required: 6560256 bytes.
SVD device workspace size required: 6560256 bytes.
SVD device workspace size required: 6560256 bytes.
SVD device workspace size required: 544256 bytes.
SVD device workspace size required: 6561792 bytes.
SVD device workspace size required: 6560256 bytes.
SVD device workspace size required: 544256 bytes.
SVD device workspace size required: 6561792 bytes.
SVD device workspace size required: 6560256 bytes.
SVD device workspace size required: 544256 bytes.
SVD device workspace size required: 6561792 bytes.
SVD device workspace size required: 6560256 bytes.
SVD device workspace size required: 6561792 bytes.
SVD device workspace size required: 548608 bytes.
SVD device workspace size required: 6560256 bytes.
SVD device workspace size required: 6561792 bytes.
SVD device workspace size required: 548608 bytes.
SVD device workspace size required: 6560256 bytes.
SVD device workspace size required: 6561792 bytes.
SVD device workspace size required: 548608 bytes.

Changing the loops above to start from 1 (starting from 1 for leftVirtualExtent alone is enough to trigger the issue) produces the CUTENSORNET_STATUS_INVALID_VALUE error. An extent of 1 is possible (for example when starting the simulation, or for the qubits at the ends of the chain, where the left or right extent will always be 1).

Trying the same code with other numbers of qubits does not work so well: I tried to create a simulator with 128 qubits and the max extent limited to 1024 and got the CUTENSORNET_STATUS_INVALID_VALUE error.

For the two-qubit gate the issue gets worse, as the middle extent can be different for the 'in' tensors and the 'out' tensors. Doing the memory estimation max_extent^4 times seems excessive.

aromanro avatar Dec 19 '25 11:12 aromanro

A solution that might or might not work for all cases was to limit the number of configurations checked, like this:

For SVD:

const int64_t startLimit = std::max<int64_t>(2, maxVirtualExtent - 4);

for (int64_t leftVirtualExtent = startLimit; leftVirtualExtent <= maxVirtualExtent; ++leftVirtualExtent)
	for (int64_t middleVirtualExtent = startLimit; middleVirtualExtent <= maxVirtualExtent; ++middleVirtualExtent)
		for (int64_t rightVirtualExtent = startLimit; rightVirtualExtent <= maxVirtualExtent; ++rightVirtualExtent)

For the two-qubit gate:

const int64_t startLimit = std::max<int64_t>(2, maxVirtualExtent - 4);

for (int64_t leftVirtualExtent = startLimit; leftVirtualExtent <= maxVirtualExtent; ++leftVirtualExtent)
	for (int64_t middleVirtualExtentIn = startLimit; middleVirtualExtentIn <= maxVirtualExtent; ++middleVirtualExtentIn)
		for (int64_t middleVirtualExtentOut = startLimit; middleVirtualExtentOut <= maxVirtualExtent; ++middleVirtualExtentOut)
			for (int64_t rightVirtualExtent = startLimit; rightVirtualExtent <= maxVirtualExtent; ++rightVirtualExtent)

I have no idea whether this works in general, though... but on the other hand, the same is true for multiplying by 16.

I tried it in several situations and it seems to work... and it uses far less memory than the multiply-by-16 workaround.

aromanro avatar Dec 19 '25 11:12 aromanro

Changing the loops above to start from 1 (starting from 1 for leftVirtualExtent alone is enough to trigger the issue) produces the CUTENSORNET_STATUS_INVALID_VALUE error.

cuTensorNet does support the extent = 1 case. I think you may have created an unphysical gate split / SVD problem in your workflow that triggered the CUTENSORNET_STATUS_INVALID_VALUE error, e.g., MPS tensors (1, 2, 1), (1, 2, 1) with gate (2, 2, 2, 2) -> (1, 2, 4), (4, 2, 1). Note that the maximal bond dimension that could possibly come out of the gate split problem above would be 2 in this case; if you have specified the output extent to be 4, this would be considered invalid. Please check the log and error message. The same applies to the other case where you observed the INVALID_VALUE error. Let us know if my assumption is incorrect, and kindly share the failing log.

A hacky way that may temporarily help you get an upper bound for the maximal required workspace without querying all potential configurations: imagine that the maximal bond dimension for your entire MPS simulation is capped at d; your maximal gate split problem would then be:

1st site, 2nd site, gate -> 1st output, 2nd output: (d, 2, d), (d, 2, d), (2, 2, 2, 2) -> (d, 2, d), (d, 2, d)

In addition to querying the maximal gate split problem above, please also query the following problem and take the maximum of the two as your estimate: (d+1, 2, d), (d, 2, d), (2, 2, 2, 2) -> (d+1, 2, d), (d, 2, d)

The gate split problem above results in a (2d+2, 2d) matrix SVD step in the middle, which may use a different SVD kernel with different workspace scaling than the symmetric (2d, 2d) case. That may be why the largest problem was not sufficient to cover the smaller problems.
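
A sketch of that double query (illustrative only; d is the bond dimension cap, handle, svdConfig, and workDesc are assumed to exist, and error checks are omitted):

// Sketch: query the gate-split workspace for the symmetric (d, 2, d) case and
// the (d+1, 2, d) case, and take the larger of the two sizes as the estimate.
int64_t maxScratch = 0;
const int64_t leftExtents[2] = {d, d + 1};
for (int64_t leftExtent : leftExtents)
{
    int64_t extA[3] = {leftExtent, 2, d};
    int64_t extB[3] = {d, 2, d};
    int64_t extG[4] = {2, 2, 2, 2};
    int32_t mA[3] = {'i', 'p', 'j'}, mB[3] = {'j', 'q', 'k'}, mG[4] = {'p', 'q', 'r', 's'};
    int32_t mU[3] = {'i', 'r', 'j'}, mV[3] = {'j', 's', 'k'};

    cutensornetTensorDescriptor_t dA, dB, dG, dU, dV;
    cutensornetCreateTensorDescriptor(handle, 3, extA, nullptr, mA, CUDA_C_32F, &dA);
    cutensornetCreateTensorDescriptor(handle, 3, extB, nullptr, mB, CUDA_C_32F, &dB);
    cutensornetCreateTensorDescriptor(handle, 4, extG, nullptr, mG, CUDA_C_32F, &dG);
    cutensornetCreateTensorDescriptor(handle, 3, extA, nullptr, mU, CUDA_C_32F, &dU);
    cutensornetCreateTensorDescriptor(handle, 3, extB, nullptr, mV, CUDA_C_32F, &dV);

    cutensornetWorkspaceComputeGateSplitSizes(handle, dA, dB, dG, dU, dV,
        CUTENSORNET_GATE_SPLIT_ALGO_DIRECT, svdConfig, CUTENSORNET_COMPUTE_32F, workDesc);

    int64_t scratch = 0;
    cutensornetWorkspaceGetMemorySize(handle, workDesc, CUTENSORNET_WORKSIZE_PREF_MAX,
        CUTENSORNET_MEMSPACE_DEVICE, CUTENSORNET_WORKSPACE_SCRATCH, &scratch);
    if (scratch > maxScratch) maxScratch = scratch;

    cutensornetTensorDescriptor_t descs[5] = {dA, dB, dG, dU, dV};
    for (auto desc : descs) cutensornetDestroyTensorDescriptor(desc);
}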

NOTE that the suggestion above is a hack and may or may not work for the current or future releases. The most robust way is always to query and execute on the fly. You may combine it with some tricks to reduce the overhead of the repeated cudaMalloc/cudaFree and cutensornetWorkspaceComputeGateSplitSizes calls, e.g.:

buffer_size = 10000  # some initial guess
max_size = 1000000   # the largest memory you want to spare
buffer = cuda_malloc(buffer_size)

for gate in gates:
    workspace_size = compute_gate_split_sizes(mps_0, mps_1, gate)
    # note: you may build a cache dict to reduce the actual queries, e.g. keyed on
    # (left_virtual, middle_virtual, right_virtual, output_middle)
    if workspace_size > max_size:
        raise RuntimeError("insufficient gpu mem")
    if workspace_size > buffer_size:
        cuda_free(buffer)
        buffer_size = workspace_size
        buffer = cuda_malloc(buffer_size)
    workdesc_set_memory(buffer, buffer_size)
    gate_split(mps_0, mps_1, gate)
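
In C++ terms, the same pattern might look roughly like this (a sketch; Gate, gates, queryGateSplitWorkspace, and applyGateSplit are hypothetical helpers wrapping the descriptor setup, cutensornetWorkspaceComputeGateSplitSizes / cutensornetWorkspaceGetMemorySize, and cutensornetGateSplit):

// Rough C++ rendering of the pseudocode above; error checks omitted.
void* buffer = nullptr;
int64_t bufferSize = 1 << 20;            // some initial guess
const int64_t maxSize = 1LL << 30;       // the largest memory you want to spare
cudaMalloc(&buffer, bufferSize);

for (const Gate& gate : gates)
{
    // hypothetical helper: builds descriptors for this gate-split problem and
    // returns the required device scratch size
    const int64_t required = queryGateSplitWorkspace(handle, workDesc, gate);
    // a cache keyed on (left, middle, right, output middle) extents can avoid repeated queries
    if (required > maxSize)
        throw std::runtime_error("insufficient GPU memory");
    if (required > bufferSize)
    {
        cudaFree(buffer);
        bufferSize = required;
        cudaMalloc(&buffer, bufferSize);
    }
    cutensornetWorkspaceSetMemory(handle, workDesc, CUTENSORNET_MEMSPACE_DEVICE,
        CUTENSORNET_WORKSPACE_SCRATCH, buffer, bufferSize);
    applyGateSplit(handle, workDesc, gate);  // hypothetical wrapper over cutensornetGateSplit
}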

yangcal avatar Dec 20 '25 03:12 yangcal

Thank you, you are right: among the configurations generated there were unphysical ones.

I made the changes to check as in your description and it seems to work OK now. I'll do more tests on Monday.

aromanro avatar Dec 20 '25 10:12 aromanro

This is an example of failure when trying to get the needed memory at runtime, for a simple case: a one-qubit gate. It's a contraction between the site tensor and the gate tensor:

Executing gate: 11 on qubit 1
[2025-12-22 10:27:55:536:340][cuTensorNet][3063][Api][cutensornetCreateTensorDescriptor] handle=0X5639F3DDD9F0 numModes=2 extents=[2,2] strides=[] modes=[3,9] dataType=4 tensorDesc=0X7FFFBDA8F3E8
[2025-12-22 10:27:55:536:361][cuTensorNet][3063][Api][cutensornetCreateTensorDescriptor] handle=0X5639F3DDD9F0 numModes=3 extents=[1,2,1] strides=[] modes=[2,9,4] dataType=4 tensorDesc=0X7FFFBDA8F3F0
[2025-12-22 10:27:55:536:365][cuTensorNet][3063][Api][cutensornetCreateNetwork] handle=0X5639F3DDD9F0 networkDesc=0X7FFFBDA8F3F8
[2025-12-22 10:27:55:536:368][cuTensorNet][3063][Api][cutensornetNetworkAppendTensor] handle=0X5639F3DDD9F0 networkDesc=0X563A02BFF740 numModes=3 extents=[1,2,1] modeLabels=[2,3,4] qualifiers= dataType=4 tensorId=0X7FFFBDA8F480
[2025-12-22 10:27:55:536:373][cuTensorNet][3063][Api][cutensornetNetworkAppendTensor] handle=0X5639F3DDD9F0 networkDesc=0X563A02BFF740 numModes=2 extents=[2,2] modeLabels=[3,9] qualifiers= dataType=4 tensorId=0X7FFFBDA8F488
[2025-12-22 10:27:55:536:377][cuTensorNet][3063][Api][cutensornetNetworkSetOutputTensor] handle=0X5639F3DDD9F0 networkDesc=0X563A02BFF740 numModes=3 modeLabels=[2,9,4] dataType=4
[2025-12-22 10:27:55:536:389][cuTensorNet][3063][Api][cutensornetNetworkSetAttribute] handle=0X5639F3DDD9F0 networkDesc=0X563A02BFF740 attr=30 buf=0X5639F3DD3BB4 sizeInBytes=4
[2025-12-22 10:27:55:536:554][cuTensorNet][3063][Api][cutensornetCreateContractionOptimizerConfig] handle=0X5639F3DDD9F0 optimizerConfig=0X7FFFBDA8F400
[2025-12-22 10:27:55:536:578][cuTensorNet][3063][Api][cutensornetCreateContractionOptimizerInfo] handle=0X5639F3DDD9F0 networkDesc=0X563A02BFF740 optimizerInfo=0X7FFFBDA8F408
[2025-12-22 10:27:55:536:589][cuTensorNet][3063][Api][cutensornetContractionOptimize] handle=0X5639F3DDD9F0 networkDesc=0X563A02BFF740 optimizerConfig=0X563A02BFFA80 workspaceSizeConstraint=21572415488 optimizerInfo=0X563A02C000D0
[2025-12-22 10:27:55:536:596][cuTensorNet][3063][Info][cutensornetContractionOptimize] INFO about architecture requested 8   data_type 4   compute_type 4.
[2025-12-22 10:27:55:536:703][cuTensorNet][3063][Api][cutensornetWorkspaceComputeContractionSizes] handle=0X5639F3DDD9F0 networkDesc=0X563A02BFF740 optimizerInfo=0X563A02C000D0 workDesc=0X5639FFACF160
[2025-12-22 10:27:55:536:860][cuTensorNet][3063][Api][cutensornetWorkspaceGetMemorySize] handle=0X5639F3DDD9F0 workDesc=0X563A02BFF740 workPref=1 memSpace=0 workKind=0 memorySize=0X7FFFBDA8F410
More workspace was requested than the computed maximum, reallocating...
Current workspace size: 6563328 bytes, requested size: 94807102234448 bytes.
Not enough free GPU memory to allocate the requested workspace size.

Doing a similar thing for SVD and gate split works; the issues occur only for contractions.

aromanro avatar Dec 22 '25 08:12 aromanro

94807102234448 bytes for a minimal contraction feels insane; are you sure you're using the correct data type and pointer address?

yangcal avatar Dec 22 '25 13:12 yangcal

I'm not sure what you mean by 'pointer address'... the same code - but without the cutensornetWorkspaceComputeContractionSizes & cutensornetWorkspaceGetMemorySize calls - works fine, so I suppose the rest of it is fine.

If you mean the address of the int64 value that holds the memory size, yes, it's set correctly (and the value is initialized to 0 before the call).

aromanro avatar Dec 22 '25 13:12 aromanro

Acknowledged. For verification purposes, are you able to put up a standalone script that only computes the workspace size for the minimal problem above? We will follow up after the holiday break.

yangcal avatar Dec 22 '25 23:12 yangcal

Mystery solved! It was me: I was passing the wrong descriptor to cutensornetWorkspaceGetMemorySize. Now it works well (although I'm not sure those memory computations during execution are needed; memory reallocations do not happen, so it seems the preallocated space is enough).
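
For anyone hitting the same thing: in the failing log above, the value passed as workDesc to cutensornetWorkspaceGetMemorySize is actually the network descriptor's address. The query must receive the workspace descriptor; a minimal sketch of the correct sequence (error checks omitted):

// Sketch of the correct contraction-size query: the network descriptor goes to
// cutensornetWorkspaceComputeContractionSizes, while the *workspace* descriptor
// goes to cutensornetWorkspaceGetMemorySize.
cutensornetWorkspaceComputeContractionSizes(handle, networkDesc, optimizerInfo, workDesc);

int64_t scratchSize = 0;
cutensornetWorkspaceGetMemorySize(handle, workDesc /* not networkDesc */,
    CUTENSORNET_WORKSIZE_PREF_MAX, CUTENSORNET_MEMSPACE_DEVICE,
    CUTENSORNET_WORKSPACE_SCRATCH, &scratchSize);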

Thank you and happy holidays!

aromanro avatar Dec 23 '25 07:12 aromanro