
Implement search index that works with multiple API server instances

Open nscuro opened this issue 2 years ago • 4 comments

One of the challenges with making the API server horizontally scalable (#375), is the question of what to do with the local Lucene indexes.

Lucene indexes use write locks, such that concurrent write operations by multiple processes are not possible. As a consequence, it is not possible to share an index across multiple application instances.
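The locking behavior can be illustrated with plain JDK file locks, which is also roughly how Lucene's `NativeFSLockFactory` guards `write.lock` under the hood. This is a stdlib analogy, not actual Lucene code:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Demonstrates exclusive write-lock semantics: once one "writer" holds the
// lock file, a second attempt to lock it fails.
public class WriteLockDemo {

    // Returns true only if a second lock on the same file can be acquired
    // while the first one is still held.
    static boolean secondWriterCanLock() throws IOException {
        Path lockFile = Files.createTempFile("index-write", ".lock"); // stand-in for Lucene's write.lock
        try (FileChannel first = FileChannel.open(lockFile, StandardOpenOption.WRITE);
             FileChannel second = FileChannel.open(lockFile, StandardOpenOption.WRITE);
             FileLock held = first.lock()) { // first "writer" acquires the lock
            try {
                FileLock attempt = second.tryLock(); // second "writer" tries
                return attempt != null;
            } catch (OverlappingFileLockException e) {
                return false; // an overlapping lock is already held
            }
        } finally {
            Files.deleteIfExists(lockFile);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("second writer acquired lock: " + secondWriterCanLock());
    }
}
```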

Additionally, index modifications are "requested" through DT's internal event system. For example, this is roughly what happens when a new component is created via REST API call:

sequenceDiagram
    Client ->> ComponentResource: PUT /api/v1/component
    ComponentResource ->> QueryManager: Create Component
    QueryManager ->> Database: INSERT INTO "COMPONENT" ...
    QueryManager ->> SingleThreadedEventService: IndexEvent(Action.CREATE, component)
    SingleThreadedEventService ->> IndexTask: inform(indexEvent)
    IndexTask ->> IndexManager: add(indexEvent.component)
    QueryManager ->> SingleThreadedEventService: IndexEvent(Action.COMMIT, Component.class)
    SingleThreadedEventService ->> IndexTask: inform(indexEvent)
    IndexTask ->> IndexManager: commit()

The procedure is similar when components are updated or deleted. The usage of internal events means that an API server instance can only ever update its indexes with changes it itself has made.

If we were to refactor index access such that only one instance could perform writes, and all others only reads, components created or modified by the read-only instances would never be reflected in the index.

There are a few options I see for dealing with this:

  1. Drop usage of local Lucene indexes entirely
    • This involves finding an (ideally better) replacement: using the full-text search capabilities of Postgres, or utilizing a centralized search server like Elasticsearch
  2. Have each instance of the API server maintain its own Lucene index
    • Instead of publishing index changes to the internal event system, publish them to a Kafka topic
    • There are no repercussions for consistency guarantees, as even the current implementation is only eventually consistent

Option (2) would look roughly like this:

sequenceDiagram
    par
        loop continuously
            IndexKafkaConsumer ->> Kafka: Consume index events
            loop for each event
                IndexKafkaConsumer ->> IndexManager: add / delete / commit
            end
        end
    and
        Client ->> ComponentResource: PUT /api/v1/component
        ComponentResource ->> QueryManager: Create Component
        QueryManager ->> Database: INSERT INTO "COMPONENT" ...
        QueryManager ->> Kafka: IndexEvent(Action.CREATE, component)
        QueryManager ->> Kafka: IndexEvent(Action.COMMIT, Component.class)
    end

Note
For this to work, each API server instance must use a dedicated consumer group ID for its IndexKafkaConsumer. All instances must receive all events; if they shared a single consumer group, each event would be delivered to only one instance.
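The difference between shared and dedicated consumer groups can be sketched with a toy in-memory model of Kafka's delivery semantics (names and round-robin assignment are illustrative, not actual Kafka client code):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of Kafka delivery semantics: within a consumer group each record
// goes to exactly one member, while separate groups each read the full topic.
public class GroupFanoutDemo {

    // Deliver every record to every group; within a group, records are split
    // round-robin across members (a stand-in for partition assignment).
    static Map<String, List<String>> deliver(List<String> topic, Map<String, Integer> groupSizes) {
        Map<String, List<String>> received = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> group : groupSizes.entrySet()) {
            int members = group.getValue();
            for (int m = 0; m < members; m++) {
                received.put(group.getKey() + "-member" + m, new ArrayList<>());
            }
            for (int i = 0; i < topic.size(); i++) {
                received.get(group.getKey() + "-member" + (i % members)).add(topic.get(i));
            }
        }
        return received;
    }

    public static void main(String[] args) {
        List<String> topic = List.of("CREATE c1", "COMMIT", "DELETE c1", "COMMIT");
        // One shared group across two API server instances: events are split up.
        System.out.println(deliver(topic, Map.of("shared", 2)));
        // A dedicated group per instance: every instance sees every event.
        System.out.println(deliver(topic, Map.of("instance-a", 1, "instance-b", 1)));
    }
}
```

With a shared group, each instance would index only a subset of changes, which is exactly the failure mode the note warns about.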

Because the order of write operations on the index matters (CREATE must be processed before COMMIT), the Kafka consumer must be single-threaded. This also means that the only reason to have more than one partition for the Kafka topic would be availability, not parallelism.
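A minimal sketch of such a single-threaded consumer, with a simplified string event model (real events would carry component payloads, and the queue would be fed by a Kafka poll loop):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of a single-threaded index consumer: events are applied strictly in
// arrival order, and buffered writes only become visible on COMMIT.
public class OrderedIndexConsumer {
    final BlockingQueue<String> events = new ArrayBlockingQueue<>(64);
    final List<String> pending = new ArrayList<>();   // writes since last commit
    final List<String> committed = new ArrayList<>(); // searchable index state

    // Drain and apply all currently queued events, on a single thread.
    void drain() {
        String event;
        while ((event = events.poll()) != null) {
            if (event.equals("COMMIT")) {
                committed.addAll(pending); // make buffered writes searchable
                pending.clear();
            } else {
                pending.add(event); // e.g. "CREATE component-123"
            }
        }
    }

    public static void main(String[] args) {
        OrderedIndexConsumer consumer = new OrderedIndexConsumer();
        consumer.events.addAll(List.of("CREATE component-1", "COMMIT"));
        consumer.drain();
        System.out.println("committed: " + consumer.committed);
    }
}
```

Processing events on multiple threads could reorder a COMMIT ahead of the CREATE it is supposed to flush, which is why parallelism buys nothing here.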

Warning
A notable implication is that the index state can be different across API server instances, depending on consumer lag, event processing failures, etc.

nscuro avatar Jul 06 '23 14:07 nscuro

A few complications:

  • If new instances of the API server are started later, previous index events may have already been deleted from the Kafka topic; as a consequence, the instance will not be able to build a complete index
  • While an instance is catching up on index events, it may return search results that are long outdated, depending on the retention policy of the topic and for how long the instance was inactive
  • Indexes will eventually diverge between instances, so each instance needs a mechanism to detect and repair drift in its own index; we have such a consistency check currently

nscuro avatar Jul 07 '23 09:07 nscuro

Idea from @VithikaS: If tables like COMPONENT and PROJECT had CREATED_AT, UPDATED_AT etc. columns, it would be possible to build indexes incrementally. This would make (re-)indexing solely via database queries a lot more viable. The big benefit being that we could avoid a lot of consistency issues we'd run into if we rely on messaging to propagate index updates.
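The checkpoint-based approach could be sketched as follows, with an in-memory list standing in for a query like `SELECT * FROM "COMPONENT" WHERE "UPDATED_AT" > ?` (column and type names are assumptions, since those columns don't exist yet):

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of incremental indexing: with an UPDATED_AT column, an instance only
// needs to remember a checkpoint timestamp to find rows it has not indexed yet.
public class IncrementalIndexDemo {

    // Hypothetical projection of the COMPONENT table.
    record Component(String name, Instant updatedAt) {}

    // Rows changed since the last indexing run; in production this would be a
    // database query rather than an in-memory filter.
    static List<Component> changedSince(List<Component> table, Instant checkpoint) {
        return table.stream()
                .filter(c -> c.updatedAt().isAfter(checkpoint))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant checkpoint = Instant.parse("2023-07-01T00:00:00Z");
        List<Component> table = List.of(
                new Component("already-indexed", Instant.parse("2023-06-30T00:00:00Z")),
                new Component("needs-indexing", Instant.parse("2023-07-02T00:00:00Z")));
        System.out.println(changedSince(table, checkpoint));
    }
}
```

After each run the instance advances its checkpoint, so repeated (re-)indexing touches only the delta instead of the whole table.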

nscuro avatar Jul 07 '23 11:07 nscuro

Leaning on the full-text search capabilities of the database we're already using might be preferable for multiple reasons:

  • Search results are always consistent with what's in the database
  • Search capability is available to all services connected to the database
  • Reduced overhead by not having to maintain and operate yet another technology

What needs to be checked is how well FTS is supported in RDBMSes other than PostgreSQL, as we will eventually have to tackle #642 and can't rely on Postgres-exclusive features.
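For reference, the Postgres-specific syntax that portability concern refers to looks roughly like the query below (the table and column names are assumptions for illustration). `to_tsvector`, `plainto_tsquery` and the `@@` match operator are Postgres built-ins with no standard-SQL equivalent:

```java
// Builds a hypothetical Postgres full-text search query for components.
public class FtsQueryDemo {

    static String componentSearchSql() {
        return "SELECT \"ID\", \"NAME\" FROM \"COMPONENT\" "
             + "WHERE to_tsvector('simple', \"NAME\") @@ plainto_tsquery('simple', ?)";
    }

    public static void main(String[] args) {
        System.out.println(componentSearchSql());
    }
}
```

Other RDBMSes expose full-text search through entirely different syntax (e.g. `MATCH ... AGAINST` in MySQL), so any FTS-based implementation would need a dialect abstraction.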

nscuro avatar Jul 11 '23 08:07 nscuro

For a short-term solution, we decided to drop Lucene entirely: #661

nscuro avatar Jul 11 '23 08:07 nscuro