celo-blockchain
celo-blockchain copied to clipboard
Seg fault / panic syncing mainnet on v1.7.4
Expected Behavior
Sync from genesis without panic
Actual Behavior
Panic and geth terminated. Resumes syncing normally after restart.
Steps to reproduce the behavior
Build celo-blockchain from source tag v1.7.4
Go: go1.19.2.linux-arm64, dependency in custom Dockfile
Docker: 24.0.2
Running on: Ubuntu 22.04.2 LTS ARM64
Dockerfile:
FROM arm64v8/ubuntu:jammy as builder
RUN apt update && apt install gcc make musl-dev wget -y
RUN wget https://go.dev/dl/go1.19.2.linux-arm64.tar.gz
RUN rm -rf /usr/local/go && tar -C /usr/local -xzf go1.19.2.linux-arm64.tar.gz
ENV PATH $PATH:/usr/local/go/bin
ADD . /go-ethereum
WORKDIR /go-ethereum
RUN make geth-musl
FROM arm64v8/ubuntu:jammy
ARG COMMIT_SHA
COPY --from=builder /go-ethereum/build/bin/geth /usr/local/bin/
RUN echo $COMMIT_SHA > /version.txt
ADD scripts/run_geth_in_docker.sh /
EXPOSE 8545 8546 30303 30303/udp
ENTRYPOINT ["sh", "/run_geth_in_docker.sh"]
# Add some metadata labels to help programatic image consumption
ARG COMMIT=$COMMIT_SHA
ARG VERSION=""
ARG BUILDNUM=""
LABEL commit="$COMMIT" version="$VERSION" buildnum="$BUILDNUM"
Backtrace
INFO [06-19\|13:17:04.423] Imported new chain segment blocks=190 txs=1079 mgas=243.352 elapsed=8.336s mgasps=29.192 number=18,830,021 hash=234037..f78e82 age=2mo2d1h dirty=74.86MiB
--
INFO [06-19\|13:17:07.629] Sending val enode share msg to proxy func=sendValEnodeShareMsgs proxy peer="Peer 16331f68e399c8af 10.192.11.7:30503" valAddresses length=119
INFO [06-19\|13:17:07.630] Skipping sending ValEnodesShareMsg b/c not validating func=sendValEnodesShareMsg
INFO [06-19\|13:17:12.431] Imported new chain segment blocks=169 txs=889 mgas=206.133 elapsed=8.008s mgasps=25.740 number=18,830,190 hash=3cd7bb..1c429a age=2mo2d1h dirty=75.76MiB
INFO [06-19\|13:17:20.441] Imported new chain segment blocks=221 txs=1171 mgas=267.872 elapsed=8.010s mgasps=33.439 number=18,830,411 hash=570d7c..095d89 age=2mo2d59m dirty=76.94MiB
INFO [06-19\|13:17:28.969] Imported new chain segment blocks=199 txs=820 mgas=223.748 elapsed=8.527s mgasps=26.240 number=18,830,610 hash=63997b..6b01d5 age=2mo2d43m dirty=76.46MiB
INFO [06-19\|13:17:36.971] Imported new chain segment blocks=145 txs=573 mgas=211.915 elapsed=8.002s mgasps=26.481 number=18,830,755 hash=e204e2..de3f67 age=2mo2d31m dirty=77.38MiB
INFO [06-19\|13:17:37.634] Sending val enode share msg to proxy func=sendValEnodeShareMsgs proxy peer="Peer 16331f68e399c8af 10.192.11.7:30503" valAddresses length=119
INFO [06-19\|13:17:37.634] Skipping sending ValEnodesShareMsg b/c not validating func=sendValEnodesShareMsg
INFO [06-19\|13:17:42.725] Imported new chain segment blocks=245 txs=1128 mgas=244.124 elapsed=5.753s mgasps=42.428 number=18,831,000 hash=6af2e0..cfc589 age=2mo2d11m dirty=79.62MiB
INFO [06-19\|13:17:42.726] Downloader queue stats receiptTasks=0 blockTasks=33151 itemSize=1.83KiB throttle=8192
INFO [06-19\|13:17:43.728] Unindexing transactions blocks=1727 txs=39667 total=2048 elapsed=1.002s
INFO [06-19\|13:17:44.066] Unindexed transactions blocks=2048 txs=47208 tail=16,481,001 elapsed=1.340s
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xc44908]
goroutine 574387 [running]:
github.com/celo-org/celo-blockchain/contracts/currency.(*Currency).ToCELO(...)
github.com/celo-org/celo-blockchain/contracts/currency/currency.go:62
github.com/celo-org/celo-blockchain/miner.createConversionFunctions.func2(0x3c831c0a?, 0x4011eba400?)
github.com/celo-org/celo-blockchain/miner/block.go:380 +0x28
github.com/celo-org/celo-blockchain/core/types.NewTxWithMinerFee(0x401328e120, 0x4091e8fbd0, 0x4091e8fbe0)
github.com/celo-org/celo-blockchain/core/types/transaction.go:555 +0x88
github.com/celo-org/celo-blockchain/core/types.NewTransactionsByPriceAndNonce({0x1914680?, 0x405c819bb0}, 0x409b8fe1e0, 0x4091e8fbd0, 0x4091e8fbe0)
github.com/celo-org/celo-blockchain/core/types/transaction.go:609 +0x168
github.com/celo-org/celo-blockchain/miner.(*worker).constructPendingStateBlock(0x40029794a0, {0x1911e58, 0x403d4b1d40}, 0x40034d0a80)
github.com/celo-org/celo-blockchain/miner/worker.go:366 +0x408
github.com/celo-org/celo-blockchain/miner.(*worker).mainLoop.func1.2()
github.com/celo-org/celo-blockchain/miner/worker.go:415 +0x38
created by github.com/celo-org/celo-blockchain/miner.(*worker).mainLoop.func1
github.com/celo-org/celo-blockchain/miner/worker.go:414 +0x1f4
Chain/Network: Mainnet
Possibly related to #1982
Maybe we should not silently discard the error in https://github.com/celo-org/celo-blockchain/blob/85c175776055fc739fb179d9426d7b158e0c3644/miner/block.go#L386
This involves this function: https://github.com/celo-org/celo-blockchain/blob/de3f84b741d3472846a05754d49910e1bceb54dd/core/types/transaction.go#L604-L628
From reading the stack trace, I see only one way to cause the error and that is
- Have a
curr
ofnil
intoCeloFn
- because
GetExchangeRate
failed to deliver an exchange rate for the given currency - because
NewExchangeRate
gets called with either numerator or denominator being zero
However, I don't see this being the case for any of the fee currencies around that time:
for curr in 0x765DE816845861e75A25fCA122bb6898B8B1282a 0xD8763CBa276a3738E6DE85b4b3bF5FDed6D6cA73 0xe8537a3d056DA446677B9E9d6c5dB704EaAb4787
cast call -r https://celo-mainnet.infura.io/v3/$INFURA_API_KEY 0xefB849352
39dAcdecF7c5bA76d8dE40b077B7b33 'function medianRate(address token) external view returns (uint256, uint256)' $curr -b 18831000
end
713715994804191731790000
1000000000000000000000000
653368356238021788240000
1000000000000000000000000
3553065332252728052300000
1000000000000000000000000
Have I missed anything?
I was unable to reproduce the panic with v1.7.4, but I just ran it locally and didn't use the docker container as described in the issue.
We could easily handle the case and avoid a panic (e.g. by not processing a tx when we can't convert the miner fee to CELO). But since it does not happen for all validators, we would get a consensus failure instead of a panic, which would be even harder to debug.
Alternatively, we could keep panicking and add some additional debugging output to show the specific tx that fails when it happens the next time.