
Proposal: introducing sqlite as a datastore

Open raulk opened this issue 7 years ago • 8 comments

Some of our use cases become excruciating to implement efficiently with KV stores:

  1. address book: attaching TTLs to each address (now) or confidence/quality scores (future), and efficiently querying by those criteria. GC.
  2. peer metadata / peer exchange: find peers supporting protocol X.
  3. pruning expired DHT entries or provider records.
  4. rendezvous.

The lack of indices hurts us badly. We are penalised by protobuf serde on every query and iteration. Ad-hoc GC logic has sprouted across various modules. I'm not confident KV stores afford us the flexibility to support future use cases.


Should we start an experiment to explore adopting sqlite as our main datastore? We can define together what our evaluation, scoring and success criteria should be.


I doubt we'll need an abstraction layer analogous to go-datastore, because the other embedded SQL alternatives (Apache Derby and H2) are JVM-based, so there's no chance of using those as a backend anyway.

The Go stdlib already supports SQL, and if we want to get fancy, we can evaluate squirrel for SQL generation. I discourage ORMs.

raulk avatar Apr 19 '19 13:04 raulk

cc @yusefnapora, as discussed today during the call with @warchant and @kamilsa (cpp-libp2p).

raulk avatar Apr 19 '19 13:04 raulk

@Stebalien pretty sure you have thoughts about this?

raulk avatar Apr 19 '19 13:04 raulk

+1 I am very fond of sqlite.

vyzo avatar Apr 19 '19 13:04 vyzo

A drawback is the logistics around Cgo compilation :-(

raulk avatar Apr 19 '19 13:04 raulk

My concerns with sqlite are:

  • Cgo build
  • dynamic libraries
  • binary size (if we go with static bindings)
  • go<->C call latency

Ideally we'd have some kind of pure-go solution available.


My concern with anything other than "datastore" is flexibility:

  • We have a postgres datastore and even an s3 datastore. The simple interface makes this possible.
  • We can switch between backends at any time.

In terms of performance:

  • Decoding protobufs is actually pretty fast; we probably just need to reuse intermediate buffers better.
  • The current Get(key) -> []byte interface is problematic from an allocations standpoint and should probably be Get(key, func([]byte)). However, this would be a massively breaking change.
  • We can probably mandate TTL support and/or write an auto-TTL wrapper that writes some TTL metadata to the database.
    • Specifically, we'd need to write:
      • /ttl/expirations/DATE -> RECORD_KEY
      • /ttl/records/RECORD_KEY -> /ttl/expirations/DATE
      • RECORD_KEY -> RECORD_DATA
    • However, updating a TTL would require deleting one key and writing two. Deleting any record would require reading one key and deleting 1 to 3 keys.

All that aside, we feel the pain in go-ipfs and an actual database would make quite a few things quite a bit simpler. It also looks like there's a pretty reasonable SQL abstraction for golang (https://golang.org/pkg/database/sql/).

TL;DR: Let's do it as long as we abstract over any SQL database. In go-ipfs, my primary motivation is that it makes pinning, GC, graph traversal, reverse graph traversal, etc. much easier. Really, we need a graph database but SQL is probably the best general-purpose solution.

Stebalien avatar Apr 19 '19 16:04 Stebalien

I'm definitely in favor, for all the reasons described above.

@vyzo - just like old times! sqlite for metadata indices + rocksdb for blob storage worked really well for mediachain.

yusefnapora avatar Apr 24 '19 18:04 yusefnapora

To store clearly defined entities I would recommend the repository pattern:

(below is pseudocode):

class User {
  int id;
  string name;
  date birthdate;
}

// you don't need a single parent "Repository" interface; instead, create
// one per entity
interface UserRepository {
  User[] findAll();
  User findById(int id);
  User findByName(string name); 
  ...
}

Implementation(s) of this interface may use SQLite, postgres, or even multiple KV stores.

  • implementation can leverage indexing, so there is no performance penalty
  • easy to extend
  • easy to mock
  • easy to swap implementations

From my experience, it is very hard to define a clear and generic interface for SQL/KV storage, whereas Repository works well every time.

Warchant avatar Apr 24 '19 20:04 Warchant

@Warchant definitely something to consider! There's a difference between abstracting how we store entities and abstracting generic access to a database. The former allows us to use database-specific optimisations and indexing – something the latter struggles to enable.

raulk avatar Apr 24 '19 20:04 raulk

We're planning a big redesign of the peerstore, see #2355. We're also planning to deprecate the database-backed peerstore, see #2329.

marten-seemann avatar Jun 28 '23 02:06 marten-seemann