bug: Icicle integration is out of date
The Icicle integration for the Groth16/BN254 backend is out of date: it no longer compiles, and the version of icicle that gnark depends on is old.
Description
Trying to build gnark with -tags=icicle fails due to accumulated changes in the rest of the gnark internal API that were not reflected in the Icicle backend. See the specific compile errors mentioned in "Actual Behavior" below.
Additionally, the icicle dependency used by gnark is currently commit https://github.com/ingonyama-zk/icicle/commit/97f0079e5c71d0bd4b8b97fababcfdb9940f12e1 (specified in go.mod here, indirectly via the iciclegnark dependency). As a result, the current build instructions specified here do not work for gnark: the current version of the icicle Go bindings creates static libraries such as libingo_curve_bn254.a and libingo_field_bn254.a, while the older version that gnark relies on creates a shared library libbn254.so, whose build process is documented here instead. It would be very nice if the gnark integration could also be updated to support the latest icicle version, beyond simply fixing the compile errors mentioned in this issue, though that probably requires Ingonyama to make the corresponding updates to the iciclegnark package.
Expected Behavior
The code should compile without errors and generated proofs should verify correctly.
Actual Behavior
Building with -tags=icicle gives the following compile errors:
# github.com/consensys/gnark/backend/groth16/bn254/icicle [github.com/consensys/gnark.test]
backend/groth16/bn254/icicle/icicle.go:44:30: cannot use pk.Domain.CosetTableInv (value of type func() ([]"github.com/consensys/gnark-crypto/ecc/bn254/fr".Element, error)) as []"github.com/consensys/gnark-crypto/ecc/bn254/fr".Element value in argument to iciclegnark.CopyToDevice
backend/groth16/bn254/icicle/icicle.go:47:30: cannot use pk.Domain.CosetTable (value of type func() ([]"github.com/consensys/gnark-crypto/ecc/bn254/fr".Element, error)) as []"github.com/consensys/gnark-crypto/ecc/bn254/fr".Element value in argument to iciclegnark.CopyToDevice
backend/groth16/bn254/icicle/icicle.go:160:73: commitmentInfo[i].HintID undefined (type constraint.Groth16Commitment has no field or method HintID)
Possible Fix
I tried fixing this locally in this commit https://github.com/wwared/gnark/commit/186e9de8321f7933e42f4d1b01c40a25f33fd4a5
However, even though this enabled me to compile gnark with Icicle support (linking against the older icicle version at commit https://github.com/ingonyama-zk/icicle/commit/97f0079e5c71d0bd4b8b97fababcfdb9940f12e1), proofs generated this way fail to verify with panic: pairing doesn't match, so the attempted fix is either wrong or incomplete.
Steps to Reproduce
- Run go test -tags=icicle at the repository root of current master, or attempt to build a gnark-using program by passing -tags=icicle to the build/run command line.
Context
I am trying to generate a proof using GPU acceleration for a circuit defined on the latest gnark. It's not easy for me to revert to an older gnark version.
Your Environment
- gnark version used: HEAD@master (https://github.com/Consensys/gnark/commit/db299cef6c78dc5acff8453b66c910c15ea88123 as of writing)
- gnark-crypto version used: whichever one is used by default (v0.12.2-0.20240504013751-564b6f724c3b according to go.mod)
- go version: 1.22.3
- Operating System and version: Linux 6.6.30
Acknowledged. I cannot give a deadline for when we fix it, though; it is a bit of work.
It would be nice to have icicle support back.
@ivokub @wwared Are there plans to support other than bn254 curve on ICICLE?
Hmm, I think ICICLE in principle supports other curves. I don't recall why I didn't add the support back then. Either ICICLE at that time didn't support other curves, it was buggy, or I wanted to refactor. Let me see if there is some easy refactoring to do.
I can test it when ready.
Thanks for the offer - I did a quick run to see if the interfaces make sense: https://github.com/Consensys/gnark/tree/feat/allcurve-accelerated
First of all, I moved the ICICLE integration out of backend/groth16 into backend/accelerated/icicle/groth16. Now, instead of calling backend/groth16.Prove etc., you need to call the backend/accelerated/icicle/groth16.Prove etc. methods. This is mainly for two reasons:
- it makes maintaining different accelerations a bit easier; we don't need to contaminate the vanilla backends we have right now. This is particularly keeping in mind that we may also want to implement ICICLE-Plonk (still planned).
- it also prepares for extracting all the acceleration code into a separate repository and Go package. Right now one issue is that when we clone/import gnark, we also have to import the ICICLE backends, which are quite big, and I'd like to avoid that.
Keep in mind that I don't have any GPUs available to test on right now, so it could still be completely broken. But maybe it works; I haven't seen any linter errors yet.
Could you have a look at whether it seems to work, particularly for the different curves BLS12-377, BLS12-381, BN254 and BW6-761? I had to rewrite a bit for BW6, but maybe it works. If it seems to work more or less, I will then produce the per-curve implementations via code generation to avoid manual copy-paste.
10:36:40 DBG constraint system solver done nbConstraints=3991968 took=9601.012258
10:36:41 DBG prover done acceleration=icicle backend=groth16 curve=bls12_377 nbConstraints=3991968 took=1005.11475
2025-09-15T10:36:41.503Z INF sequencer/ballot.go:213 > ballot verified address=691527994399008376970758624997482085815811072871 pid=00aa36a7bf7b6386ecb6b8bfcc548d2c51f142a513deb7520000000000000002 took=10.60949643s voteID=0000000000000000000000002275951a8ff8ec3d50a642d20f2923e0fed63820
First test on bls12-377 seems to work!
The GPU is an RTX 4090. Regarding the times for a 4M-constraint circuit: as far as I understand, it took 1s for the prover but 9.6s for solving the constraint system? The total time is around 10s, which seems too much. Is that correct? Am I missing something?
Thanks for testing it out. 9 seconds for solving seems like quite a lot. When I solve directly with the constraint system without proving, a 4M-ish circuit takes me 1.9 seconds. But I'll see what happens when I use a dummy proving key. Maybe computing the witness for the commitment is very slow. I'll benchmark it.
// GPUProver is an implementation that uses GPU acceleration for proving.
func GPUProver(
curve ecc.ID,
ccs constraint.ConstraintSystem,
pk groth16.ProvingKey,
assignment frontend.Circuit,
opts ...backend.ProverOption,
) (groth16.Proof, error) {
// Create a witness from the circuit
witness, err := frontend.NewWitness(assignment, curve.ScalarField())
if err != nil {
return nil, fmt.Errorf("failed to create witness: %w", err)
}
log.Debugw("using GPU prover", "curve", curve.String())
// Generate the proof using GPU acceleration
return gpugroth16.Prove(ccs, pk, witness, opts...)
}
This is my Go code for proving. The witness is computed as usual (no GPU acceleration), because I can't find any way in the new icicle/groth16 package to build the witness. Maybe I'm missing some option or function?
Yup, witness computation is not GPU accelerated right now, and I'm not sure it would be worth implementing. We have optimized quite a lot for CPU proving and have different code shapes for witness computation depending on the shape of the instruction. And GPUs, afaik, are not that good when different instructions require different computations (data parallelism vs instruction parallelism).
But 9s seems excessive. One thing is that witness solving should usually parallelize well, so more CPUs should make it faster. I have also just discovered an optimization which should make witness computation even more parallelizable by reducing the dependency levels significantly (in PLONK 1000x, in R1CS 3-4x).
Another thing that could make solving slower is heavy use of hints, as hints take *big.Int values as inputs and those are usually slow to work with.
You could profile the witness solving speed by compiling the circuit and running solver directly on it (without proving) a la:
cs, err := frontend.Compile(ecc.BLS12_377.ScalarField(), r1cs.NewBuilder, circuit)
sol, err := cs.Solve(witness)
and then running the CPU profile. Maybe something obvious stands out.
Is this something to report to icicle?
16:03:56 DBG constraint system solver done nbConstraints=3991968 took=3080.983765
!!!Unrecoverable!!! : an illegal memory access was encountered : detected by: cudaStreamSynchronize(cuda_stream) at: /home/vocdoni/go/pkg/mod/github.com/ingonyama-zk/icicle-gnark/[email protected]/icicle/backend/cuda/src/ntt/ntt.cuh:743
The error is reported there and may be caused by prior calls.
!!!Unrecoverable!!! : an illegal memory access was encountered : detected by: cudaMemcpyAsync( &smallest_bucket_index, d_single_bucket_indices, sizeof(unsigned), cudaMemcpyDeviceToHost, stream) at: /home/vocdoni/go/pkg/mod/github.com/ingonyama-zk/icicle-gnark/[email protected]/icicle/backend/cuda/src/msm/cuda_msm.cuh:591
The error is reported there and may be caused by prior calls.
!!!Unrecoverable!!! : an illegal memory access was encountered : detected by: cudaEventRecord(event_finished_reduction, stream_reduction) at: /home/vocdoni/go/pkg/mod/github.com/ingonyama-zk/icicle-gnark/[email protected]/icicle/backend/cuda/src/msm/cuda_msm.cuh:906
The error is reported there and may be caused by prior calls.
terminate called after throwing an instance of 'IcicleError'
terminate called recursively
terminate called recursively
what(): CUDA Error: an illegal memory access was encountered !!!Unrecoverable!!! : an illegal memory access was encountered : detected by: cudaStreamSynchronize(cuda_stream) at: /home/vocdoni/go/pkg/mod/github.com/ingonyama-zk/icicle-gnark/[email protected]/icicle/backend/cuda/src/ntt/ntt.cuh:743
The error is reported there and may be caused by prior calls.
SIGABRT: abort
PC=0x7d409dc9eb2c m=3 sigcode=18446744073709551610
signal arrived during cgo execution
goroutine 23083 gp=0xc000e3ae00 m=3 mp=0xc000099008 [syscall, locked to thread]:
runtime.cgocall(0x17f3670, 0xcf0fecda60)
/snap/go/10938/src/runtime/cgocall.go:167 +0x4b fp=0xcf0fecda38 sp=0xcf0fecda00 pc=0x4851ab
github.com/ingonyama-zk/icicle-gnark/v3/wrappers/golang/curves/bls12377/g2._Cfunc_bls12_377_g2_msm(0x7d3dde000000, 0x7d3ec0000000, 0x5237ee, 0xcf0fecdea0, 0xca06d36480)
_cgo_gotypes.go:177 +0x46 fp=0xcf0fecda60 sp=0xcf0fecda38 pc=0x14c6826
github.com/ingonyama-zk/icicle-gnark/v3/wrappers/golang/curves/bls12377/g2.G2Msm.func1(...)
/home/vocdoni/go/pkg/mod/github.com/ingonyama-zk/icicle-gnark/[email protected]/wrappers/golang/curves/bls12377/g2/msm.go:27
github.com/ingonyama-zk/icicle-gnark/v3/wrappers/golang/curves/bls12377/g2.G2Msm({0x24a2d10?, 0xd173ebe4b0?}, {0x24a2d10?, 0xd173ebe4c8?}, 0xcf0fecdea0, {0x24a2de8?, 0xd173ebe4e0?})
/home/vocdoni/go/pkg/mod/github.com/ingonyama-zk/icicle-gnark/[email protected]/wrappers/golang/curves/bls12377/g2/msm.go:27 +0x116 fp=0xcf0fecdae8 sp=0xcf0fecda60 pc=0x14c7b16
github.com/consensys/gnark/backend/accelerated/icicle/groth16/bls12-377.Prove.func9()
/home/vocdoni/go/pkg/mod/github.com/vocdoni/[email protected]/backend/accelerated/icicle/groth16/bls12-377/icicle.go:534 +0x1f6 fp=0xcf0fecdf20 sp=0xcf0fecdae8 pc=0x14cf036
github.com/consensys/gnark/backend/accelerated/icicle/groth16/bls12-377.Prove.func12({0xd10163d720?, 0xc0?, 0xc00178e450?})
/home/vocdoni/go/pkg/mod/github.com/vocdoni/[email protected]/backend/accelerated/icicle/groth16/bls12-377/icicle.go:565 +0x17 fp=0xcf0fecdf68 sp=0xcf0fecdf20 pc=0x14cecd7
github.com/ingonyama-zk/icicle-gnark/v3/wrappers/golang/runtime.RunOnDevice.func1(0xd10163d720)
/home/vocdoni/go/pkg/mod/github.com/ingonyama-zk/icicle-gnark/[email protected]/wrappers/golang/runtime/runtime.go:99 +0x90 fp=0xcf0fecdfc8 sp=0xcf0fecdf68 pc=0x14c1210
github.com/ingonyama-zk/icicle-gnark/v3/wrappers/golang/runtime.RunOnDevice.gowrap1()
/home/vocdoni/go/pkg/mod/github.com/ingonyama-zk/icicle-gnark/[email protected]/wrappers/golang/runtime/runtime.go:101 +0x24 fp=0xcf0fecdfe0 sp=0xcf0fecdfc8 pc=0x14c1144
runtime.goexit({})
/snap/go/10938/src/runtime/asm_amd64.s:1700 +0x1 fp=0xcf0fecdfe8 sp=0xcf0fecdfe0 pc=0x490901
created by github.com/ingonyama-zk/icicle-gnark/v3/wrappers/golang/runtime.RunOnDevice in goroutine 9587
/home/vocdoni/go/pkg/mod/github.com/ingonyama-zk/icicle-gnark/[email protected]/wrappers/golang/runtime/runtime.go:94 +0xd9
goroutine 1 gp=0xc000002380 m=nil [chan receive, 144 minutes]:
runtime.gopark(0x85e305?, 0x1ae4ae0?, 0x1?, 0x49?, 0xc070703e50?)
/snap/go/10938/src/runtime/proc.go:435 +0xce fp=0xc0012e3dc0 sp=0xc0012e3da0 pc=0x48882e
runtime.chanrecv(0xc0004159d0, 0xc070703ed0, 0x1)
/snap/go/10938/src/runtime/chan.go:664 +0x445 fp=0xc0012e3e38 sp=0xc0012e3dc0 pc=0x420925
runtime.chanrecv1(0xc0004159d0?, 0xc070703ee0?)
/snap/go/10938/src/runtime/chan.go:506 +0x12 fp=0xc0012e3e60 sp=0xc0012e3e38 pc=0x4204b2
main.main()
/home/vocdoni/davinci-node/cmd/davinci-sequencer/main.go:77 +0x325 fp=0xc0012e3f50 sp=0xc0012e3e60 pc=0x17741c5
runtime.main()
/snap/go/10938/src/runtime/proc.go:283 +0x28b fp=0xc0012e3fe0 sp=0xc0012e3f50 pc=0x4534eb
runtime.goexit({})
/snap/go/10938/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0012e3fe8 sp=0xc0012e3fe0 pc=0x490901
goroutine 2 gp=0xc000002e00 m=nil [force gc (idle), 3 minutes]:
runtime.gopark(0xc2fe2f613bb?, 0x0?, 0x0?, 0x0?, 0x0?)
/snap/go/10938/src/runtime/proc.go:435 +0xce fp=0xc000092fa8 sp=0xc000092f88 pc=0x48882e
runtime.goparkunlock(...)
/snap/go/10938/src/runtime/proc.go:441
runtime.forcegchelper()
/snap/go/10938/src/runtime/proc.go:348 +0xb3 fp=0xc000092fe0 sp=0xc000092fa8 pc=0x453833
runtime.goexit({})
/snap/go/10938/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000092fe8 sp=0xc000092fe0 pc=0x490901
created by runtime.init.7 in goroutine 1
/snap/go/10938/src/runtime/proc.go:336 +0x1a
Could it actually be an "out of memory"? Maybe 20,475 MiB is not enough for a 4M-constraint circuit?
It could be; I also recall there used to be strange errors when the GPU ran out of memory. Sometimes I got non-verifying proofs, sometimes OOM later, and sometimes a machine crash. You could try a machine with more GPU memory on AWS.
But, I wouldn't exclude a bug in my implementation. Does the circuit work with BN254?
I have upgraded to a L40S with 48 GiB of memory. And no more crashes (for now).
Here is a BW6-761 circuit with ~4M constraints.
07:28:48 DBG constraint system solver done nbConstraints=3991968 took=4021.875792
07:28:49 DBG prover done acceleration=icicle backend=groth16 curve=bls12_377 nbConstraints=3991968 took=1189.84605
07:28:49 DBG precomputing proving key in GPU acceleration=icicle backend=groth16 curve=bw6_761 nbConstraints=3722465
07:28:54 DBG constraint system solver done nbConstraints=3722465 took=4477.9352
07:28:56 DBG prover done acceleration=icicle backend=groth16 curve=bw6_761 nbConstraints=3722465 took=1877.914421
2025-09-16T07:28:56.146Z INF sequencer/aggregate.go:230 > aggregate proof generated ballots=57 processID=00aa36a7bf7b6386ecb6b8bfcc548d2c51f142a513deb7520000000000000006 took=6.892442902s
07:29:00 DBG constraint system solver done nbConstraints=3991968 took=3883.938822
Did you also verify the proofs? I recall that at some point during my tests, proving completed correctly but the resulting proofs were incorrect.
Umm, you are right. I'm not able to verify the proofs.
I wonder if there is a way to use the new Prove() ICICLE function within a test, for debugging purposes. I believe I'd need something like test.WithProverFunction(proverFn).
assert := test.NewAssert(t)
assert.SolvingSucceeded(placeholder, assignments,
test.WithCurves(circuits.AggregatorCurve), test.WithBackends(backend.GROTH16),
test.WithProverOpts(stdgroth16.GetNativeProverOptions(
circuits.StateTransitionCurve.ScalarField(),
circuits.AggregatorCurve.ScalarField())))
c.Logf("proving tooks %s", time.Since(now).String())
Hmm, indeed. The option would have to be specific to the groth16 and plonk interfaces though; afaik they have different key and circuit interfaces. But I think it could work. Let me see.
Did you also try smaller circuits over BLS12-377? I'm trying to understand whether it is an OOM issue or a porting issue on my side.