spring-modulith icon indicating copy to clipboard operation
spring-modulith copied to clipboard

EVENT_PUBLICATION insert fails due to big SERIALIZED_EVENT

Open razubuddy opened this issue 1 year ago • 11 comments
trafficstars

Spring Modulith: 1.2.0-M2

After introducing indexes on ENVENT_PUBLICATION table (https://github.com/spring-projects/spring-modulith/issues/403) events with bigger payload are not working due to index size limit.

INSERT INTO EVENT_PUBLICATION (ID, EVENT_TYPE, LISTENER_ID, PUBLICATION_DATE, SERIALIZED_EVENT)
VALUES (?, ?, ?, ?, ?)

raises error

ERROR: index row requires 36896 bytes, maximum size is 8191

razubuddy avatar Feb 26 '24 13:02 razubuddy

If you want to use bigger event payloads, you need to adapt the schema definitions declared here to scale up the size of the serialized event column.

odrotbohm avatar Feb 27 '24 18:02 odrotbohm

Yes, I use default JDBC PostgreSQL schema Default provided schema for PostgreSQL in 1.2.0-M2 looks like this:

CREATE TABLE IF NOT EXISTS event_publication
(
  id               UUID NOT NULL,
  listener_id      TEXT NOT NULL,
  event_type       TEXT NOT NULL,
  serialized_event TEXT NOT NULL,
  publication_date TIMESTAMP WITH TIME ZONE NOT NULL,
  completion_date  TIMESTAMP WITH TIME ZONE,
  PRIMARY KEY (id)
);
CREATE INDEX IF NOT EXISTS event_publication_by_listener_id_and_serialized_event_idx ON event_publication (listener_id, serialized_event);
CREATE INDEX IF NOT EXISTS event_publication_by_completion_date_idx ON event_publication (completion_date);

The index event_publication_by_listener_id_and_serialized_event_idx is justified by query in JdbcEventPublicationRepository:

UPDATE EVENT_PUBLICATION
SET COMPLETION_DATE = ?
WHERE
    LISTENER_ID = ?
    AND SERIALIZED_EVENT = ?

This make me think I should care SERIALIZED_EVENT to be relatively small if it is indexed and Spring Modulith is designed in this direction, so I think using TEXT datatype is misleading. On the other hand if you plan to support bigger event payloads, then it shouldn't be indexed and updating COMPLETION_DATE query could be redesigned, am I wright?

razubuddy avatar Feb 28 '24 08:02 razubuddy

I am not a Postgres expert, but it looks like we might need to optimize our index declaration. Would you mind adding USING GIN to the declaration and try again, as recommended here?

odrotbohm avatar Feb 28 '24 11:02 odrotbohm

GIN, GIST index types are dedicated for column types like tsvector, it is won't create for text column type

razubuddy avatar Mar 11 '24 09:03 razubuddy

Also running into this issue, in my case I can make events smaller but not sure how long this trick will suffice.

matiwinnetou avatar Mar 26 '24 00:03 matiwinnetou

@odrotbohm what if one introduces another field, a hash of serialised_event and then put index on it?

matiwinnetou avatar Mar 27 '24 21:03 matiwinnetou

It seems like GIN indices are supported by using a trick:

image

matiwinnetou avatar Mar 29 '24 14:03 matiwinnetou

image

matiwinnetou avatar Mar 29 '24 14:03 matiwinnetou

Here is my solution, I had to split index into 2:

-- Create a GIN index on just the tsvector column
CREATE INDEX IF NOT EXISTS serialized_event_tsvector_idx 
ON event_publication 
USING gin (to_tsvector('simple', serialized_event));

-- Create a separate index for listener_id if needed
CREATE INDEX IF NOT EXISTS listener_id_idx 
ON event_publication 
(listener_id);

Unfortunately I could not use JSONB as it is recommended because this is a serialized_event is a TEXT field on JPA in Modulith.

matiwinnetou avatar Mar 29 '24 23:03 matiwinnetou

I'm not sure about other databases, but for PostgreSQL we found that the hash index works well given that Modulith is doing a full equality comparison on the serialized event instead of searching for a substring. And it's not restricted by the size limit that the b-tree index has. We saw a significant performance improvement in the setting of the completion timestamp after adding this index.

CREATE INDEX IF NOT EXISTS event_publication_serialized_event_hash_idx ON event_publication USING hash(serialized_event);

JakeTiritilli avatar Apr 30 '24 15:04 JakeTiritilli

Hi @odrotbohm , in case you missed the post from @JakeTiritilli about the hash index for PostgreSQL, I wanted to bring it to your attention because it dramatically reduced the load on PostgreSQL when the event_publication table had a large volume of records (1+ million in our case during performance testing) for SELECT and UPDATE. The SQL commands are now doing an "Index Scan using event_publication_serialized_event_hash_idx on event_publication" with O(1) !

dustinsand avatar May 06 '24 12:05 dustinsand

I'll make sure the optimization gets applied in the 1.2 GA release in two weeks and the corresponding service releases for 1.0.x and 1.1.x.

odrotbohm avatar May 07 '24 14:05 odrotbohm

Are you folks using this index in addition to the event_publication_by_listener_id_and_serialized_event_idx one we already declare or as a replacement for that?

odrotbohm avatar May 20 '24 18:05 odrotbohm

Hi @odrotbohm, we replaced event_publication_by_listener_id_and_serialized_event_idx with the hash index.

FYI, we also added an index specifically for queries by the listener_id and completion_date for use cases we had.

CREATE INDEX IF NOT EXISTS event_publication_listener_id_idx ON event_publication(listener_id); CREATE INDEX IF NOT EXISTS event_publication_completion_date_idx ON event_publication(completion_date);

dustinsand avatar May 20 '24 18:05 dustinsand

Okay, I'll replace the original index for Postgres then. I think I'll refrain from adding additional indexes to make sure we only support the interactions our code triggers. Totally fine if you add more indexes to support interactions you need on top.

odrotbohm avatar May 21 '24 07:05 odrotbohm