
Kubo OOMs and locks up our servers when pinning a large number of CIDs

Mayeu opened this issue 1 year ago • 9 comments


Installation method

built from source

Version

Kubo version: 0.21.0
Repo version: 14
System version: amd64/linux
Golang version: go1.20.5

Config

{
  "API": {
    "HTTPHeaders": {}
  },
  "Addresses": {
    "API": "/ip4/127.0.0.1/tcp/5001",
    "Announce": [],
    "AppendAnnounce": [],
    "Gateway": "/ip4/127.0.0.1/tcp/8080",
    "NoAnnounce": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "Swarm": [
      "/ip4/0.0.0.0/tcp/4001",
      "/ip6/::/tcp/4001",
      "/ip4/0.0.0.0/udp/4001/quic",
      "/ip4/0.0.0.0/udp/4001/quic-v1",
      "/ip4/0.0.0.0/udp/4001/quic-v1/webtransport",
      "/ip6/::/udp/4001/quic",
      "/ip6/::/udp/4001/quic-v1",
      "/ip6/::/udp/4001/quic-v1/webtransport"
    ]
  },
  "AutoNAT": {},
  "Bootstrap": [
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
    "/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/ip4/104.131.131.82/udp/4001/quic/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb"
  ],
  "DNS": {
    "Resolvers": {}
  },
  "Datastore": {
    "BloomFilterSize": 0,
    "GCPeriod": "1h",
    "HashOnRead": false,
    "Spec": {
      "mounts": [
        {
          "child": {
            "path": "blocks",
            "shardFunc": "/repo/flatfs/shard/v1/next-to-last/3",
            "sync": false,
            "type": "flatfs"
          },
          "mountpoint": "/blocks",
          "prefix": "flatfs.datastore",
          "type": "measure"
        },
        {
          "child": {
            "compression": "none",
            "path": "datastore",
            "type": "levelds"
          },
          "mountpoint": "/",
          "prefix": "leveldb.datastore",
          "type": "measure"
        }
      ],
      "type": "mount"
    },
    "StorageGCWatermark": 90,
    "StorageMax": "20TB"
  },
  "Discovery": {
    "MDNS": {
      "Enabled": false
    }
  },
  "Experimental": {
    "FilestoreEnabled": false,
    "GraphsyncEnabled": false,
    "Libp2pStreamMounting": false,
    "OptimisticProvide": false,
    "OptimisticProvideJobsPoolSize": 0,
    "P2pHttpProxy": false,
    "StrategicProviding": false,
    "UrlstoreEnabled": false
  },
  "Gateway": {
    "APICommands": [],
    "DeserializedResponses": null,
    "HTTPHeaders": {
      "Access-Control-Allow-Headers": [
        "X-Requested-With",
        "Range",
        "User-Agent"
      ],
      "Access-Control-Allow-Methods": [
        "GET"
      ],
      "Access-Control-Allow-Origin": [
        "*"
      ]
    },
    "NoDNSLink": false,
    "NoFetch": false,
    "PathPrefixes": [],
    "PublicGateways": null,
    "RootRedirect": ""
  },
  "Identity": {
    "PeerID": "xxxxxx"
  },
  "Internal": {
    "Bitswap": {
      "EngineBlockstoreWorkerCount": 128,
      "EngineTaskWorkerCount": 8,
      "MaxOutstandingBytesPerPeer": null,
      "ProviderSearchDelay": null,
      "TaskWorkerCount": 8
    }
  },
  "Ipns": {
    "RecordLifetime": "",
    "RepublishPeriod": "",
    "ResolveCacheSize": 128
  },
  "Migration": {
    "DownloadSources": [],
    "Keep": ""
  },
  "Mounts": {
    "FuseAllowOther": false,
    "IPFS": "/ipfs",
    "IPNS": "/ipns"
  },
  "Peering": {
    "Peers": [
      {
        "Addrs": [
          "/dnsaddr/node-1.ingress.cloudflare-ipfs.com"
        ],
        "ID": "QmcFf2FH3CEgTNHeMRGhN7HNHU1EXAxoEk6EFuSyXCsvRE"
      },
      {
        "Addrs": [
          "/dnsaddr/node-2.ingress.cloudflare-ipfs.com"
        ],
        "ID": "QmcFmLd5ySfk2WZuJ1mfSWLDjdmHZq7rSAua4GoeSQfs1z"
      },
      {
        "Addrs": [
          "/dnsaddr/node-3.ingress.cloudflare-ipfs.com"
        ],
        "ID": "QmcfFmzSDVbwexQ9Au2pt5YEXHK5xajwgaU6PpkbLWerMa"
      },
      {
        "Addrs": [
          "/dnsaddr/node-4.ingress.cloudflare-ipfs.com"
        ],
        "ID": "QmcfJeB3Js1FG7T8YaZATEiaHqNKVdQfybYYkbT1knUswx"
      },
      {
        "Addrs": [
          "/dnsaddr/node-5.ingress.cloudflare-ipfs.com"
        ],
        "ID": "QmcfVvzK4tMdFmpJjEKDUoqRgP4W9FnmJoziYX5GXJJ8eZ"
      },
      {
        "Addrs": [
          "/dnsaddr/node-6.ingress.cloudflare-ipfs.com"
        ],
        "ID": "QmcfZD3VKrUxyP9BbyUnZDpbqDnT7cQ4WjPP8TRLXaoE7G"
      },
      {
        "Addrs": [
          "/dnsaddr/node-7.ingress.cloudflare-ipfs.com"
        ],
        "ID": "QmcfZP2LuW4jxviTeG8fi28qjnZScACb8PEgHAc17ZEri3"
      },
      {
        "Addrs": [
          "/dnsaddr/node-8.ingress.cloudflare-ipfs.com"
        ],
        "ID": "QmcfgsJsMtx6qJb74akCw1M24X1zFwgGo11h1cuhwQjtJP"
      },
      {
        "Addrs": [
          "/dnsaddr/node-9.ingress.cloudflare-ipfs.com"
        ],
        "ID": "Qmcfr2FC7pFzJbTSDfYaSy1J8Uuy8ccGLeLyqJCKJvTHMi"
      },
      {
        "Addrs": [
          "/dnsaddr/node-10.ingress.cloudflare-ipfs.com"
        ],
        "ID": "QmcfR3V5YAtHBzxVACWCzXTt26SyEkxdwhGJ6875A8BuWx"
      },
      {
        "Addrs": [
          "/dnsaddr/node-11.ingress.cloudflare-ipfs.com"
        ],
        "ID": "Qmcfuo1TM9uUiJp6dTbm915Rf1aTqm3a3dnmCdDQLHgvL5"
      },
      {
        "Addrs": [
          "/dnsaddr/node-12.ingress.cloudflare-ipfs.com"
        ],
        "ID": "QmcfV2sg9zaq7UUHVCGuSvT2M2rnLBAPsiE79vVyK3Cuev"
      },
      {
        "Addrs": [
          "/dnsaddr/ipfs.ssi.eecc.de"
        ],
        "ID": "12D3KooWGaHbxpDWn4JVYud899Wcpa4iHPa3AMYydfxQDb3MhDME"
      },
      {
        "Addrs": [
          "/ip4/139.178.68.217/tcp/6744"
        ],
        "ID": "12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw"
      },
      {
        "Addrs": [
          "/ip4/147.75.49.71/tcp/6745"
        ],
        "ID": "12D3KooWGBWx9gyUFTVQcKMTenQMSyE2ad9m7c9fpjS4NMjoDien"
      },
      {
        "Addrs": [
          "/ip4/147.75.86.255/tcp/6745"
        ],
        "ID": "12D3KooWFrnuj5o3tx4fGD2ZVJRyDqTdzGnU3XYXmBbWbc8Hs8Nd"
      },
      {
        "Addrs": [
          "/ip4/3.134.223.177/tcp/6745"
        ],
        "ID": "12D3KooWN8vAoGd6eurUSidcpLYguQiGZwt4eVgDvbgaS7kiGTup"
      },
      {
        "Addrs": [
          "/ip4/35.74.45.12/udp/6746/quic"
        ],
        "ID": "12D3KooWLV128pddyvoG6NBvoZw7sSrgpMTPtjnpu3mSmENqhtL7"
      },
      {
        "Addrs": [
          "/dns4/elastic.dag.house/tcp/443/wss/p2p/QmQzqxhK82kAmKvARFZSkUVS6fo9sySaiogAnx5EnZ6ZmC"
        ],
        "ID": "QmQzqxhK82kAmKvARFZSkUVS6fo9sySaiogAnx5EnZ6ZmC"
      }
    ]
  },
  "Pinning": {
    "RemoteServices": {}
  },
  "Plugins": {
    "Plugins": null
  },
  "Provider": {
    "Strategy": ""
  },
  "Pubsub": {
    "DisableSigning": false,
    "Router": ""
  },
  "Reprovider": {
    "Interval": "0s",
    "Strategy": "roots"
  },
  "Routing": {
    "AcceleratedDHTClient": true,
    "Methods": null,
    "Routers": null,
    "Type": "autoclient"
  },
  "Swarm": {
    "AddrFilters": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "ConnMgr": {
      "GracePeriod": "20s",
      "HighWater": 2048,
      "LowWater": 128
    },
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": true,
    "RelayClient": {
      "Enabled": false
    },
    "RelayService": {
      "Enabled": false
    },
    "ResourceMgr": {
      "Limits": {},
      "MaxMemory": "16GB"
    },
    "Transports": {
      "Multiplexers": {},
      "Network": {},
      "Security": {}
    }
  }
}

Description

Hello,

In the past month, we have been slowly pinning millions of CIDs with a two-server IPFS cluster. Currently we are at around 9M pinned CIDs out of a total of 13.5M. Kubo has regularly been killed by the system for consuming all the memory, and from time to time it even completely locks up our servers, requiring a hard reboot.

We were waiting for the 0.21.0 release to open this ticket, since we thought that release would reduce RAM consumption, but in the past 24h both our servers have locked up again.

Both servers have the following spec:

  • AMD Ryzen 5 Pro 3600, 6c/12t, 3.6 GHz/4.2 GHz
  • 128 GB ECC 2666 MHz RAM
  • 2×512 GB NVMe SSDs: one holds the system (NixOS), the other is currently unused
  • one ZFS ZRAID-0 pool of 4×6 TB SATA HDDs

We have tried a lot of different configurations, including disabling bandwidth metrics, disabling being a DHT server, and toggling the Accelerated DHT client, but whatever configuration we tried, Kubo always ends up consuming all available memory.

We are currently running Kubo with GOGC=50 and GOMEMLIMIT=80GiB.
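For reference, those runtime knobs are plain environment variables passed to the daemon. A minimal launch sketch (the `ipfs daemon` call is commented out so the snippet is safe to run anywhere):

```shell
# GOGC=50 makes the Go GC trigger at 50% heap growth instead of the default
# 100%, trading CPU for a lower peak heap; GOMEMLIMIT=80GiB sets the Go
# runtime's soft memory limit so the GC works harder as usage nears 80 GiB.
export GOGC=50
export GOMEMLIMIT=80GiB
# ipfs daemon   # launch commented out; run this on the affected node
echo "GOGC=$GOGC GOMEMLIMIT=$GOMEMLIMIT"
```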

Here are two ipfs diag profile dumps taken today:
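For anyone wanting to reproduce these dumps: they come from Kubo's built-in profiler, which writes a zip containing heap, goroutine, and CPU profiles. A guarded sketch (does nothing on machines without the ipfs binary):

```shell
# Capture a diagnostic profile; the command writes an ipfs-profile-*.zip in
# the current directory. Guarded so the snippet is harmless elsewhere.
if command -v ipfs >/dev/null 2>&1; then
  ipfs diag profile --profile-time 30s && status="captured" || status="capture-failed"
else
  status="ipfs-missing"
fi
echo "$status"
```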

In case it's relevant: for the ipfs-cluster we followed the setup guide in the documentation, and we keep around 50k pins in the queue.

ipfs-cluster configuration
{
  "cluster": {
    "peername": "ipfs-cluster-1",
    "secret": "…",
    "leave_on_shutdown": false,
    "listen_multiaddress": [
      "/ip4/0.0.0.0/tcp/9096",
      "/ip4/0.0.0.0/udp/9096/quic"
    ],
    "enable_relay_hop": true,
    "connection_manager": {
      "high_water": 400,
      "low_water": 100,
      "grace_period": "2m0s"
    },
    "dial_peer_timeout": "3s",
    "state_sync_interval": "6h",
    "pin_recover_interval": "6h",
    "replication_factor_min": -1,
    "replication_factor_max": -1,
    "monitor_ping_interval": "15s",
    "peer_watch_interval": "5s",
    "mdns_interval": "10s",
    "pin_only_on_trusted_peers": false,
    "disable_repinning": true,
    "peer_addresses": []
  },
  "consensus": {
    "crdt": {
      "cluster_name": "ipfs-cluster",
      "trusted_peers": [
        "*"
      ],
      "batching": {
        "max_batch_size": 500,
        "max_batch_age": "15s",
        "max_queue_size": 50000
      },
      "repair_interval": "1h0m0s",
      "rebroadcast_interval": "10s"
    }
  },
  "api": {
    "ipfsproxy": {
      "listen_multiaddress": "/ip4/127.0.0.1/tcp/9095",
      "node_multiaddress": "/ip4/127.0.0.1/tcp/5001",
      "log_file": "",
      "read_timeout": "0s",
      "read_header_timeout": "5s",
      "write_timeout": "0s",
      "idle_timeout": "1m0s",
      "max_header_bytes": 4096
    },
    "pinsvcapi": {
      "http_listen_multiaddress": "/ip4/127.0.0.1/tcp/9097",
      "read_timeout": "0s",
      "read_header_timeout": "5s",
      "write_timeout": "0s",
      "idle_timeout": "2m0s",
      "max_header_bytes": 4096,
      "basic_auth_credentials": null,
      "http_log_file": "",
      "headers": {},
      "cors_allowed_origins": [
        "*"
      ],
      "cors_allowed_methods": [
        "GET"
      ],
      "cors_allowed_headers": [],
      "cors_exposed_headers": [
        "Content-Type",
        "X-Stream-Output",
        "X-Chunked-Output",
        "X-Content-Length"
      ],
      "cors_allow_credentials": true,
      "cors_max_age": "0s"
    },
    "restapi": {
      "http_listen_multiaddress": "/ip4/127.0.0.1/tcp/9094",
      "read_timeout": "0s",
      "read_header_timeout": "5s",
      "write_timeout": "0s",
      "idle_timeout": "2m0s",
      "max_header_bytes": 4096,
      "basic_auth_credentials": null,
      "http_log_file": "",
      "headers": {},
      "cors_allowed_origins": [
        "*"
      ],
      "cors_allowed_methods": [
        "GET"
      ],
      "cors_allowed_headers": [],
      "cors_exposed_headers": [
        "Content-Type",
        "X-Stream-Output",
        "X-Chunked-Output",
        "X-Content-Length"
      ],
      "cors_allow_credentials": true,
      "cors_max_age": "0s"
    }
  },
  "ipfs_connector": {
    "ipfshttp": {
      "node_multiaddress": "/ip4/127.0.0.1/tcp/5001",
      "connect_swarms_delay": "30s",
      "ipfs_request_timeout": "10m",
      "pin_timeout": "20s",
      "unpin_timeout": "3h0m0s",
      "repogc_timeout": "24h0m0s",
      "informer_trigger_interval": 0
    }
  },
  "pin_tracker": {
    "stateless": {
      "concurrent_pins": 20,
      "priority_pin_max_age": "24h0m0s",
      "priority_pin_max_retries": 5
    },
    "concurrent_pins": 20
  },
  "monitor": {
    "pubsubmon": {
      "check_interval": "15s"
    }
  },
  "allocator": {
    "balanced": {
      "allocate_by": [
        "tag:group",
        "freespace"
      ]
    }
  },
  "informer": {
    "disk": {
      "metric_ttl": "30s",
      "metric_type": "freespace"
    },
    "pinqueue": {
      "metric_ttl": "30s",
      "weight_bucket_size": 100000
    },
    "tags": {
      "metric_ttl": "30s",
      "tags": {
        "group": "default"
      }
    }
  },
  "observations": {
    "metrics": {
      "enable_stats": true,
      "prometheus_endpoint": "/ip4/127.0.0.1/tcp/8888",
      "reporting_interval": "5s"
    },
    "tracing": {
      "enable_tracing": false,
      "jaeger_agent_endpoint": "/ip4/0.0.0.0/udp/6831",
      "sampling_prob": 0.3,
      "service_name": "cluster-daemon"
    }
  },
  "datastore": {
    "pebble": {
      "pebble_options": {
        "cache_size_bytes": 1073741824,
        "bytes_per_sync": 1048576,
        "disable_wal": false,
        "flush_delay_delete_range": 0,
        "flush_delay_range_key": 0,
        "flush_split_bytes": 4194304,
        "format_major_version": 1,
        "l0_compaction_file_threshold": 750,
        "l0_compaction_threshold": 4,
        "l0_stop_writes_threshold": 12,
        "l_base_max_bytes": 134217728,
        "max_open_files": 1000,
        "mem_table_size": 67108864,
        "mem_table_stop_writes_threshold": 20,
        "read_only": false,
        "wal_bytes_per_sync": 0,
        "levels": [
          {
            "block_restart_interval": 16,
            "block_size": 4096,
            "block_size_threshold": 90,
            "compression": 2,
            "filter_type": 0,
            "filter_policy": 10,
            "index_block_size": 4096,
            "target_file_size": 4194304
          },
          {
            "block_restart_interval": 16,
            "block_size": 4096,
            "block_size_threshold": 90,
            "compression": 2,
            "filter_type": 0,
            "filter_policy": 10,
            "index_block_size": 4096,
            "target_file_size": 8388608
          },
          {
            "block_restart_interval": 16,
            "block_size": 4096,
            "block_size_threshold": 90,
            "compression": 2,
            "filter_type": 0,
            "filter_policy": 10,
            "index_block_size": 4096,
            "target_file_size": 16777216
          },
          {
            "block_restart_interval": 16,
            "block_size": 4096,
            "block_size_threshold": 90,
            "compression": 2,
            "filter_type": 0,
            "filter_policy": 10,
            "index_block_size": 4096,
            "target_file_size": 33554432
          },
          {
            "block_restart_interval": 16,
            "block_size": 4096,
            "block_size_threshold": 90,
            "compression": 2,
            "filter_type": 0,
            "filter_policy": 10,
            "index_block_size": 4096,
            "target_file_size": 67108864
          },
          {
            "block_restart_interval": 16,
            "block_size": 4096,
            "block_size_threshold": 90,
            "compression": 2,
            "filter_type": 0,
            "filter_policy": 10,
            "index_block_size": 4096,
            "target_file_size": 134217728
          },
          {
            "block_restart_interval": 16,
            "block_size": 4096,
            "block_size_threshold": 90,
            "compression": 2,
            "filter_type": 0,
            "filter_policy": 10,
            "index_block_size": 4096,
            "target_file_size": 268435456
          }
        ]
      }
    }
  },
  "metrics": {
    "enable_stats": true,
    "prometheus_endpoint": "/ip4/127.0.0.1/tcp/8888",
    "reporting_interval": "5s"
  }
}

Mayeu avatar Jul 05 '23 13:07 Mayeu

Cc @marten-seemann, could you please take a look at those profiles?

Jorropo avatar Jul 05 '23 16:07 Jorropo

Maybe related: https://github.com/quic-go/quic-go/issues/3883.

marten-seemann avatar Jul 05 '23 17:07 marten-seemann

Thank you for the feedback; we have relaunched both nodes with nothing using QUIC for now.

Mayeu avatar Jul 06 '23 07:07 Mayeu

@Mayeu how did you do that exactly? Via Swarm.Transports.Network.QUIC? Did it work?

Jorropo avatar Jul 06 '23 10:07 Jorropo

@Jorropo I deactivated both Swarm.Transports.Network.QUIC and Swarm.Transports.Network.WebTransport (since the docs say it uses QUIC).
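For reference, turning those off corresponds to the documented config switches below (a daemon restart is needed for them to take effect); guarded so the snippet is harmless where ipfs is not installed:

```shell
# Disable the QUIC transport and WebTransport (which runs over QUIC) via the
# standard config switches, then restart the daemon for the change to apply.
if command -v ipfs >/dev/null 2>&1; then
  ipfs config --json Swarm.Transports.Network.QUIC false
  ipfs config --json Swarm.Transports.Network.WebTransport false
  status="disabled"
else
  status="ipfs-missing"
fi
echo "$status"
```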

Did it work?

Hard to tell for now; RAM growth seems pretty similar to the previous configuration. Previously, Kubo was killed after 9-11h of uptime.

(screenshot: memory-usage graph covering both runs)

The left part is our previous run, up until this morning when the server stopped answering. The right part is since we deactivated QUIC.

Here is a profile taken right now.

We definitely have seen a drop in pin/s since this morning, maybe some of our peers were only using QUIC.

Mayeu avatar Jul 06 '23 12:07 Mayeu

A note on the profile in my last comment: there is apparently still memory allocated to QUIC; this may relate to #9895.

Mayeu avatar Jul 06 '23 13:07 Mayeu

I deactivated both Swarm.Transports.Network.QUIC & Swarm.Transports.Network.WebTransport (since the doc says it uses QUIC).

Yes, thanks, that is good. I wanted to be sure you didn't just remove the QUIC multiaddresses.

Jorropo avatar Jul 06 '23 14:07 Jorropo

I can confirm. I'm syncing 64 GB RAM nodes with a few million pins, and I have to restart Kubo every 12 hours to keep the OOM killer from taking it down: (screenshot: memory-usage graph)

And I use only 4 concurrent pins on ipfs-cluster and conservative Kubo settings:

  "Internal": {
    "Bitswap": {
      "EngineBlockstoreWorkerCount": 16,
      "EngineTaskWorkerCount": 8,
      "MaxOutstandingBytesPerPeer": 1048576,
      "ProviderSearchDelay": null,
      "TaskWorkerCount": 8
    }
  },
  "Swarm": {
    "ConnMgr": {
      "GracePeriod": "20s",
      "HighWater": 128,
      "LowWater": 64,
      "Type": "basic"
    },
    "ResourceMgr": {
      "Enabled": true,
      "MaxMemory": "8 GB"
    }
  }

I tried disabling QUIC, but then I lose 99% of connections, so everything is way slower and it is therefore hard to tell whether there is still a memory leak; memory did seem to keep slowly increasing.

Memory stopped increasing as soon as the pin queue became empty: (screenshot: memory-usage graph)

SmaugPool avatar Dec 13 '23 13:12 SmaugPool

Triage notes:

  1. @SmaugPool are you still running with QUIC disabled?
  2. are you able to retry with the latest Kubo, and produce the same two ipfs diag profile dumps as before?
    • around 2h before the server starts to lock up
    • while the server is becoming unresponsive

lidel avatar Jan 30 '24 14:01 lidel