go icon indicating copy to clipboard operation
go copied to clipboard

Horizon node goes down after a few hours with invalid memory address or nil pointer.

Open robatipoor opened this issue 1 year ago • 1 comments

I installed Stellar-Core and Stellar-Horizon using the Debian binary package format on Ubuntu LTS. I set up a private network with 5 Stellar-Core nodes. After ensuring that all Stellar-Core nodes were synchronized, I proceeded to configure and launch a Stellar-Horizon node. However, I encountered an error message stating 'Still Ingesting.' Upon investigation, I found that the Stellar-Captive-Core was not syncing and remained in a 'catching up' state. Furthermore, I am experiencing an issue where the server frequently goes down after a few hours with a invalid memory address or nil pointer

The output of the command stellar-horizon version is:

2.31.0-d1d911f6ed7396fd705d8f2068f70f468012102e
go1.22.4

The output of the command stellar-core version is:

Warning: running non-release version stellar-core 21.1.0 (b3aeb14cc798f6d11deb2be913041be916f3b0cc) of stellar-core
stellar-core 21.1.0 (b3aeb14cc798f6d11deb2be913041be916f3b0cc)
ledger protocol version: 21
rust version: rustc 1.74.1 (a28077b28 2023-12-04)

Horizon node standard output:

ERRO[2024-07-02T17:27:56.696+03:30] History: Could not download file: archive node03 maybe missing file history/00/00/00/history-00000007.json  pid=131852 service=ingest subservice=stellar-core
ERRO[2024-07-02T17:27:56.696+03:30] History: Missing HAS for ledger 7: maybe stale archive node03  pid=131852 service=ingest subservice=stellar-core
INFO[2024-07-02T17:27:58.696+03:30] History: Downloading history archive state: history/00/00/00/history-00000007.json  pid=131852 service=ingest subservice=stellar-core
WARN[2024-07-02T17:27:58.721+03:30] Process: process 131953 exited 22: curl -sf http://NODE01_IP/history/00/00/00/history-00000007.json -o buckets/tmp/history-a708a75f7c0058c1/a78c94a842ed0588-stellar-history.json  pid=131852 service=ingest subservice=stellar-core
ERRO[2024-07-02T17:27:58.722+03:30] History: Could not download file: archive node02 maybe missing file history/00/00/00/history-00000007.json  pid=131852 service=ingest subservice=stellar-core
ERRO[2024-07-02T17:27:58.722+03:30] History: Missing HAS for ledger 7: maybe stale archive node02  pid=131852 service=ingest subservice=stellar-core
INFO[2024-07-02T17:28:02.722+03:30] History: Downloading history archive state: history/00/00/00/history-00000007.json  pid=131852 service=ingest subservice=stellar-core
WARN[2024-07-02T17:28:02.745+03:30] Process: process 131954 exited 22: curl -sf http://NODE05_IP/history/00/00/00/history-00000007.json -o buckets/tmp/history-a708a75f7c0058c1/a78c94a842ed0588-stellar-history.json  pid=131852 service=ingest subservice=stellar-core
ERRO[2024-07-02T17:28:02.746+03:30] History: Could not download file: archive node05 maybe missing file history/00/00/00/history-00000007.json  pid=131852 service=ingest subservice=stellar-core
ERRO[2024-07-02T17:28:02.746+03:30] History: Missing HAS for ledger 7: maybe stale archive node05  pid=131852 service=ingest subservice=stellar-core
INFO[2024-07-02T17:28:08.746+03:30] History: Downloading history archive state: history/00/00/00/history-00000007.json  pid=131852 service=ingest subservice=stellar-core
WARN[2024-07-02T17:28:08.769+03:30] Process: process 131956 exited 22: curl -sf http://NODE05_IP/history/00/00/00/history-00000007.json -o buckets/tmp/history-a708a75f7c0058c1/a78c94a842ed0588-stellar-history.json  pid=131852 service=ingest subservice=stellar-core
ERRO[2024-07-02T17:28:08.770+03:30] History: Could not download file: archive node05 maybe missing file history/00/00/00/history-00000007.json  pid=131852 service=ingest subservice=stellar-core
ERRO[2024-07-02T17:28:08.770+03:30] History: Missing HAS for ledger 7: maybe stale archive node05  pid=131852 service=ingest subservice=stellar-core
INFO[2024-07-02T17:28:20.770+03:30] History: Downloading history archive state: history/00/00/00/history-00000007.json  pid=131852 service=ingest subservice=stellar-core
WARN[2024-07-02T17:28:20.794+03:30] Process: process 131958 exited 22: curl -sf http://localhost/history/00/00/00/history-00000007.json -o buckets/tmp/history-a708a75f7c0058c1/a78c94a842ed0588-stellar-history.json  pid=131852 service=ingest subservice=stellar-core
ERRO[2024-07-02T17:28:20.794+03:30] History: Could not download file: archive node01 maybe missing file history/00/00/00/history-00000007.json  pid=131852 service=ingest subservice=stellar-core
ERRO[2024-07-02T17:28:20.794+03:30] History: Missing HAS for ledger 7: maybe stale archive node01  pid=131852 service=ingest subservice=stellar-core
INFO[2024-07-02T17:28:25.593+03:30] default: Performing maintenance               pid=131852 service=ingest subservice=stellar-core
INFO[2024-07-02T17:28:25.593+03:30] History: Trimming history <= ledger 0 (rmin=4294967295, qmin=1, lmin=0)  pid=131852 service=ingest subservice=stellar-core
INFO[2024-07-02T17:28:40.794+03:30] History: Downloading history archive state: history/00/00/00/history-00000007.json  pid=131852 service=ingest subservice=stellar-core
WARN[2024-07-02T17:28:40.818+03:30] Process: process 131960 exited 22: curl -sf http://localhost/history/00/00/00/history-00000007.json -o buckets/tmp/history-a708a75f7c0058c1/a78c94a842ed0588-stellar-history.json  pid=131852 service=ingest subservice=stellar-core
ERRO[2024-07-02T17:28:40.818+03:30] History: Could not download file: archive node01 maybe missing file history/00/00/00/history-00000007.json  pid=131852 service=ingest subservice=stellar-core
ERRO[2024-07-02T17:28:40.818+03:30] History: Missing HAS for ledger 7: maybe stale archive node01  pid=131852 service=ingest subservice=stellar-core
 Close of ledger 28493 buffered.
 panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x10f8024]

goroutine 117 [running]:
github.com/stellar/go/ingest/ledgerbackend.(*bufferedLedgerMetaReader).getChannel(...)
        github.com/stellar/go/ingest/ledgerbackend/buffered_meta_pipe_reader.go:108
github.com/stellar/go/ingest/ledgerbackend.(*stellarCoreRunner).getMetaPipe(0xc0000c2500?)
        github.com/stellar/go/ingest/ledgerbackend/stellar_core_runner.go:554 +0x4
github.com/stellar/go/ingest/ledgerbackend.(*CaptiveStellarCore).GetLedger(0xc000227200, {0x1c7b810, 0xc00023aaf0}, 0x3e3f)
        github.com/stellar/go/ingest/ledgerbackend/captive_core_backend.go:655 +0x691
github.com/stellar/go/ingest/ledgerbackend.metricsLedgerBackend.GetLedger({{0x1c7e260?, 0xc000227200?}, {0x1c7e460?, 0xc0004937a0?}}, {0x1c7b810, 0xc00023aaf0}, 0x3e3f)
        github.com/stellar/go/ingest/ledgerbackend/metrics.go:38 +0x9b
github.com/stellar/go/services/horizon/internal/ingest.buildState.run({0x117c824?, 0x0?, 0x0?}, 0xc0005d2c88)
        github.com/stellar/go/services/horizon/internal/ingest/fsm.go:319 +0x337
github.com/stellar/go/services/horizon/internal/ingest.(*system).runStateMachine(0xc0005d2c88, {0x1c792c0, 0x26ed5a0})
        github.com/stellar/go/services/horizon/internal/ingest/main.go:665 +0x3c5
github.com/stellar/go/services/horizon/internal/ingest.(*system).Run(0xc0005d2c88)
        github.com/stellar/go/services/horizon/internal/ingest/main.go:539 +0x3a
github.com/stellar/go/services/horizon/internal.(*App).Serve.func1()
        github.com/stellar/go/services/horizon/internal/app.go:103 +0x2f
created by github.com/stellar/go/services/horizon/internal.(*App).Serve in goroutine 1
        github.com/stellar/go/services/horizon/internal/app.go:102 +0x285

The output of the command cat /etc/stellar/stellar-core.cfg is:

HTTP_PORT=11626
PEER_PORT=11625
PUBLIC_HTTP_PORT=false
RUN_STANDALONE=false
NETWORK_PASSPHRASE="My Private Network ; Feb 2024"
NODE_SEED="PRIVATE_KEY self"
NODE_IS_VALIDATOR=true
DATABASE="postgresql://dbname=core_database_name user=username password=password host=localhost"
DEPRECATED_SQL_LEDGER_STATE=false
LOG_FILE_PATH="/var/log/stellar/stellar-core-{datetime:%Y-%m-%d}.log"
BUCKET_DIR_PATH="/var/lib/stellar/buckets"
COMMANDS=["ll?level=debug"]
FAILURE_SAFETY=1
UNSAFE_QUORUM=false
CATCHUP_COMPLETE=false

NODE_HOME_DOMAIN="my.domain"
[[HOME_DOMAINS]]
HOME_DOMAIN="my.domain"
QUALITY="HIGH"

# [[VALIDATORS]]
# NAME="node01"
# HOME_DOMAIN="my.domain"
# PUBLIC_KEY="GB..."
# ADDRESS="NODE01_IP:11625"
# HISTORY="curl -sf http://NODE01_IP/{0} -o {1}"

[[VALIDATORS]]
NAME="node02"
HOME_DOMAIN="my.domain"
PUBLIC_KEY="GB..."
ADDRESS="NODE02_IP:11625"
HISTORY="curl -sf http://NODE02_IP/{0} -o {1}"

[[VALIDATORS]]
NAME="node03"
HOME_DOMAIN="my.domain"
PUBLIC_KEY="GB..."
ADDRESS="NODE03_IP:11625"
HISTORY="curl -sf http://NODE03_IP/{0} -o {1}"

[[VALIDATORS]]
NAME="node04"
HOME_DOMAIN="my.domain"
PUBLIC_KEY="GB..."
ADDRESS="NODE04_IP:11625"
HISTORY="curl -sf http://NODE04_IP/{0} -o {1}"

[[VALIDATORS]]
NAME="node05"
HOME_DOMAIN="my.domain"
PUBLIC_KEY="GD..."
ADDRESS="NODE05_IP:11625"
HISTORY="curl -sf http://NODE05_IP/{0} -o {1}"

[HISTORY.local]
get="cp /var/lib/stellar/history/vs/{0} {1}"
put="cp {0} /var/lib/stellar/history/vs/{1}"
mkdir="mkdir -p /var/lib/stellar/history/vs/{0}"

The output of the command stellar-core --conf /etc/stellar/stellar-core.cfg http-command 'info' is:

Warning: running non-release version stellar-core 21.1.0 (b3aeb14cc798f6d11deb2be913041be916f3b0cc) of stellar-core
2024-07-01T20:59:17.170 [default INFO] Config from /etc/stellar/stellar-core.cfg
2024-07-01T20:59:17.174 [default INFO] Generated QUORUM_SET: {
   "t" : 3,
   "v" : [ "self", "node03", "node02", "node04", "node05" ]
}
{
   "info" : {
      "build" : "stellar-core 21.1.0 (b3aeb14cc798f6d11deb2be913041be916f3b0cc)",
      "ledger" : {
         "age" : 3,
         "baseFee" : 100,
         "baseReserve" : 100000000,
         "closeTime" : 1719854954,
         "hash" : "6c8da53d06814966acd472d421e87f77bff9b96d4a46fc56114848b8aab03e80",
         "maxTxSetSize" : 100,
         "num" : 3360,
         "version" : 0
      },
      "network" : "My Private Network ; Feb 2024",
      "peers" : {
         "authenticated_count" : 4,
         "pending_count" : 0
      },
      "protocol_version" : 21,
      "quorum" : {
         "node" : "self",
         "qset" : {
            "agree" : 5,
            "cost" : 17776,
            "delayed" : 0,
            "disagree" : 0,
            "fail_at" : 3,
            "hash" : "04828e",
            "lag_ms" : 30,
            "ledger" : 3359,
            "missing" : 0,
            "phase" : "EXTERNALIZE",
            "validated" : true
         },
         "transitive" : {
            "critical" : null,
            "intersection" : true,
            "last_check_ledger" : 1521,
            "node_count" : 5
         }
      },
      "startedOn" : "2024-07-01T14:51:13Z",
      "state" : "Synced!"
   }
}

The output of the command stellar-core --conf /etc/stellar/stellar-captive-core.cfg http-command 'info' is:

Warning: running non-release version stellar-core 21.1.0 (b3aeb14cc798f6d11deb2be913041be916f3b0cc) of stellar-core
2024-07-07T06:07:43.934 [default INFO] Config from /etc/stellar/stellar-captive-core.cfg
2024-07-07T06:07:43.935 [default INFO] Generated QUORUM_SET: {
   "t" : 3,
   "v" : [ "node01", "node03", "node02", "node04", "node05" ]
}

Content-Length: 1287
Content-Type: application/json

{
   "info" : {
      "build" : "stellar-core 21.1.0 (b3aeb14cc798f6d11deb2be913041be916f3b0cc)",
      "ledger" : {
         "age" : 0,
         "baseFee" : 100,
         "baseReserve" : 100000000,
         "closeTime" : 0,
         "hash" : "8401376de9f86caa0934e10ce5fbc8706d1a72f4db6b0243601e640fed00dd44",
         "maxTxSetSize" : 100,
         "num" : 1,
         "version" : 0
      },
      "network" : "My Private Network ; Feb 2024",
      "peers" : {
         "authenticated_count" : 3,
         "pending_count" : 0
      },
      "protocol_version" : 21,
      "quorum" : {
         "node" : "GCAPG",
         "qset" : {
            "agree" : 5,
            "cost" : 21692,
            "delayed" : 0,
            "disagree" : 0,
            "fail_at" : 3,
            "hash" : "04828e",
            "lag_ms" : 29,
            "ledger" : 60022,
            "missing" : 0,
            "phase" : "EXTERNALIZE"
         },
         "transitive" : {
            "critical" : null,
            "intersection" : true,
            "last_check_ledger" : 59955,
            "node_count" : 6
         }
      },
      "startedOn" : "2024-07-07T02:32:45Z",
      "state" : "Catching up",
      "status" : [ "Catching up to ledger 59903: downloading ledger files 1/1 (100%)" ]
   }

The output of the command cat /etc/default/stellar-horizon is:

NETWORK_PASSPHRASE="My Private Network ; Feb 2024"
LOG_FILE_PATH="/var/log/stellar/stellar-horizon-{datetime:%Y-%m-%d}.log"
DATABASE_URL="postgres://username:password@localhost/horizon_database_name"
CAPTIVE_CORE_CONFIG_PATH="/etc/stellar/stellar-captive-core.cfg"
STELLAR_CORE_URL="http://127.0.0.1:11626"
STELLAR_CORE_BINARY_PATH=/usr/bin/stellar-core
HISTORY_ARCHIVE_URLS="http://NODE01_IP/,http://NODE02_IP/,http://NODE03_IP/,http://NODE04_IP/,http://NODE05_IP/"
HISTORY_RETENTION_COUNT=518
LOG_LEVEL="info"
CAPTIVE_CORE_STORAGE_PATH="/var/lib/stellar"
ENABLE_CAPTIVE_CORE_INGESTION="true"
CAPTIVE_CORE_USE_DB=true
STELLAR_CAPTIVE_CORE_HTTP_PORT=0
PER_HOUR_RATE_LIMIT="72000"
DISABLE_ASSET_STATS="true"
CHECKPOINT_FREQUENCY=8
ADMIN_PORT=6060
PORT=8000
PER_HOUR_RATE_LIMIT=9000
INGEST=true
FRIENDBOT_URL=http://localhost:8004/

The output of the command cat /etc/stellar/stellar-captive-core.cfg is:

HTTP_PORT=11726
PEER_PORT=11725
PUBLIC_HTTP_PORT=false
RUN_STANDALONE=true
NETWORK_PASSPHRASE="My Private Network ; Feb 2024"
DATABASE="sqlite3://stellar_core_captive.db"
LOG_FILE_PATH="/var/log/stellar/stellar-captive-core-{datetime:%Y-%m-%d}.log"
ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING=true
FAILURE_SAFETY=0
UNSAFE_QUORUM=true

[[HOME_DOMAINS]]
HOME_DOMAIN="my.domain"
QUALITY="HIGH"

[[VALIDATORS]]
NAME="node01"
HOME_DOMAIN="my.domain"
PUBLIC_KEY="GB..."
ADDRESS="localhost:11625"
HISTORY="curl -sf http://localhost/{0} -o {1}"

[[VALIDATORS]]
NAME="node02"
HOME_DOMAIN="my.domain"
PUBLIC_KEY="GB..."
ADDRESS="NODE02_IP:11625"
HISTORY="curl -sf http://NODE02_IP/{0} -o {1}"

[[VALIDATORS]]
NAME="node03"
HOME_DOMAIN="my.domain"
PUBLIC_KEY="GB..."
ADDRESS="NODE03_IP:11625"
HISTORY="curl -sf http://NODE03_IP/{0} -o {1}"

[[VALIDATORS]]
NAME="node04"
HOME_DOMAIN="my.domain"
PUBLIC_KEY="GB..."
ADDRESS="NODE04_IP:11625"
HISTORY="curl -sf http://NODE04_IP/{0} -o {1}"

[[VALIDATORS]]
NAME="node05"
HOME_DOMAIN="my.domain"
PUBLIC_KEY="GD..."
ADDRESS="NODE05_IP:11625"
HISTORY="curl -sf http://NODE05_IP/{0} -o {1}"

The output of the command free -h is:

               total        used        free      shared  buff/cache   available
Mem:           1.9Gi       476Mi        63Mi        89Mi       1.4Gi       1.1Gi
Swap:           11Gi        41Mi        11Gi

robatipoor avatar Jul 07 '24 02:07 robatipoor