
(WIP) JPA support for blocking event store

Open nicklundin08 opened this issue 1 year ago • 6 comments

Disclaimer

:warning: This PR is a WIP (I'm still struggling with some test stuff) :warning: I haven't given much thought to how other modules (besides the blocking event store) might be implemented using JPA (e.g. something like subscriptions is probably more provider-specific).

JPA support for blocking event store

I'm curious whether you'd be open to supporting JPA implementations of some of the Occurrent modules.

Why

I'll be the first to admit that I'm not a big fan of JPA. Especially for a project like Occurrent, where the data access patterns are very clearly defined, I think it's probably better to simply use a native driver directly.

That being said, supporting JPA might be an easy way to very quickly gain support for many different underlying stores (Postgres, MySQL, DynamoDB, Redis, etc.).

Additionally, it would not preclude the project from also supporting native drivers for those stores in the future.

Design choices

I mostly copied the Mongo implementation, tweaking some things for readability/organization. The query DSL is implemented using Spring Data JPA's Specification support (see the first link at the bottom).

Lombok

I mostly just added this because I wanted to iterate quickly. If you'd rather the project not use Lombok, I'd be happy to remove it!

Mixins

A lot of the heavy lifting that maps the query DSL functionality to JPA's query functionality is done via default interface methods. I called these mixins (I don't know if that's the most accurate technical term, but :shrug:).

This pattern allows the library to specify a lot of default functionality but gives consumers the ability to override things if required.
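
A stripped-down sketch of the idea (the interface, entity, and attribute names here are illustrative, not the exact code in the PR):

```java
import org.springframework.data.jpa.domain.Specification;

// Illustrative "mixin": an interface whose default methods translate
// query-DSL-style conditions into Spring Data JPA Specifications.
// CloudEventEntity and the attribute names are placeholders.
public interface CloudEventSpecificationMixin {

    default Specification<CloudEventEntity> streamIdEquals(String streamId) {
        return (root, query, cb) -> cb.equal(root.get("streamId"), streamId);
    }

    default Specification<CloudEventEntity> typeEquals(String type) {
        return (root, query, cb) -> cb.equal(root.get("type"), type);
    }

    // Consumers get composition for free, but can override any method
    // if their entity mapping differs.
    default Specification<CloudEventEntity> streamIdAndType(String streamId, String type) {
        return streamIdEquals(streamId).and(typeEquals(type));
    }
}
```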

Batteries not included

With this implementation, some assembly is required from the consumer of the JPA library. To get an idea of what consuming this library might look like, check out the batteries package in the test source set (and the rough sketch below).

This pattern gives the consumer more control over the underlying concrete type that will be used:

  • If using a relational database, it gives the consumer control over how the schemas get applied (e.g. if they want to split out schema management into Liquibase etc. they can, because they own the types)
  • JPA has a lot of performance pitfalls when using related data. This approach completely offloads the problem of "make sure your JPA models are set up correctly" onto the consumer (see second linked resource)
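
To make the "some assembly required" part concrete, the kind of entity a consumer might write themselves could look roughly like this (a hypothetical sketch assuming Jakarta Persistence annotations, not the actual code in the batteries package):

```java
import jakarta.persistence.*;
import java.time.OffsetDateTime;

// Hypothetical consumer-owned entity. Because the consumer owns this class,
// they control table/column names, indexes, and whether the schema is created
// by Hibernate or managed externally (Liquibase, Flyway, etc.).
@Entity
@Table(name = "cloud_events",
       uniqueConstraints = @UniqueConstraint(columnNames = {"stream_id", "stream_version"}))
public class CloudEventEntity {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(name = "stream_id", nullable = false)
    private String streamId;

    @Column(name = "stream_version", nullable = false)
    private long streamVersion;

    @Column(nullable = false)
    private String type;

    @Column(nullable = false)
    private OffsetDateTime time;

    // The cloud event payload; stored as text here, but see the JSON discussion below.
    @Column(columnDefinition = "text")
    private String data;

    // getters/setters omitted for brevity
}
```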

Resources

  • https://vladmihalcea.com/spring-data-jpa-specification/
  • https://medium.com/@majbahbuet08/performance-pitfalls-while-using-spring-data-jpa-and-solutions-to-avoid-them-5eb4ee3fe4ea

WIP

I left this as a WIP because I wanted to get some feedback + see if you're even interested in merging it before I go any further. Let me know!

nicklundin08 · Feb 05 '24 01:02

What a great initiative @nicklundin08! I've wanted to have a SQL implementation for a long time, but I've never had time to think about it or implement it myself. My initial thoughts were to use JDBC or something (just as you hinted at), but maybe JPA would work.

Could you maybe describe the thinking behind your design decisions? For example, if I understand it correctly, you serialize the data part of the CloudEvent to JSON as string. My idea was perhaps to use the native JSON type that I think exists in both MySQL and Postgres (and maybe others)? Not sure if JPA supports this though.

Also, do you have any experience with change streams? That's more or less required if you want a working application :)

johanhaleby · Feb 05 '24 07:02

Thanks for the reply!

Could you maybe describe the thinking behind your design decisions? For example, if I understand it correctly, you serialize the data part of the CloudEvent to JSON as string. My idea was perhaps to use the native JSON type that I think exists in both MySQL and Postgres (and maybe others)? Not sure if JPA supports this though.

Yeah, wrt the data structure that you see in the init tables function, I was simply going fast. I think a better approach would be to either use the JSONB column type that Postgres supports OR do something like converting the cloud event to 3NF. I'm not sure if we could use the JPA Specifications to grab nested objects out of a JSONB column. I'm sure once I implement some more tests it will become clear what the appropriate table structure would be.
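
For the JSONB route, one option might be Hibernate 6's JSON mapping. An untested sketch (entity/field names made up):

```java
import jakarta.persistence.*;
import org.hibernate.annotations.JdbcTypeCode;
import org.hibernate.type.SqlTypes;
import java.util.Map;

// Sketch of mapping the cloud event's data attribute to a Postgres jsonb column
// using Hibernate 6's JSON support. Column/field names are placeholders.
@Entity
@Table(name = "cloud_events")
public class CloudEventEntity {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    // Hibernate 6 maps this to the native json/jsonb type where the dialect supports it.
    @JdbcTypeCode(SqlTypes.JSON)
    @Column(name = "data", columnDefinition = "jsonb")
    private Map<String, Object> data;
}
```

Whether the Specification-based queries can then reach into nested fields of that column is exactly the part I'd want tests around.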

Disclaimer: I have no production experience with event sourcing

Also, do you have any experience with change streams? That's more or less required if you want a working application :)

Is this in the context of implementing subscriptions? I took a peek at the Mongo implementation and I think I see what's going on. Correct me if I'm wrong, but under the hood you are using the MongoDB change stream functionality to trigger various functions that "subscribe" to certain events (or types of events).

If so, I can think of a few ways you could do that in RDBMS land:

Option 1: Use JPA with polling

This would require/assume a few things (rough sketch below):

  • A new table that stores position-aware subscription information
  • Competing consumer model
  • Some sort of timer/polling/clock trigger
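
A rough sketch of the polling loop (all type and method names here are made up; a real version would also need locking so competing consumers don't process the same batch twice, e.g. SELECT ... FOR UPDATE SKIP LOCKED or a lease column):

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Consumer;

// Hypothetical repositories; in practice these could be Spring Data repositories.
interface SubscriptionPositionRepository {
    Optional<Long> findPosition(String subscriptionId);
    void savePosition(String subscriptionId, long position);
}

interface CloudEventRepository {
    List<CloudEventEntity> findByIdGreaterThanOrderByIdAsc(long position);
}

// Position-aware polling subscription (Option 1), driven by some scheduler,
// e.g. Spring's @Scheduled(fixedDelay = 500).
public class PollingSubscription {

    private final SubscriptionPositionRepository positions;
    private final CloudEventRepository events;
    private final Consumer<CloudEventEntity> handler;
    private final String subscriptionId;

    public PollingSubscription(SubscriptionPositionRepository positions,
                               CloudEventRepository events,
                               Consumer<CloudEventEntity> handler,
                               String subscriptionId) {
        this.positions = positions;
        this.events = events;
        this.handler = handler;
        this.subscriptionId = subscriptionId;
    }

    public void poll() {
        long lastPosition = positions.findPosition(subscriptionId).orElse(0L);
        for (CloudEventEntity event : events.findByIdGreaterThanOrderByIdAsc(lastPosition)) {
            handler.accept(event);
            // Saving the position after handling gives at-least-once delivery:
            // a crash between these two lines means the event is re-delivered.
            positions.savePosition(subscriptionId, event.getId());
        }
    }
}
```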

Option 2: Use JPA entity lifecycle listeners

https://www.baeldung.com/jpa-entity-lifecycle-events

  • Use the JPA lifecycle hooks to fire subscription handlers when new events are inserted
  • The events would be guaranteed/forced to be processed on the same "box" (process) where the event originated

:warning: Unsure what kind of delivery guarantees/data loss would be present in this solution. Needs more analysis
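
A rough sketch of the wiring (SubscriptionRegistry is a made-up in-process dispatcher, and the caveat above is exactly the weak point):

```java
import jakarta.persistence.PostPersist;

// Option 2 sketch: registered on the (hypothetical) event entity via
// @EntityListeners(CloudEventInsertListener.class).
public class CloudEventInsertListener {

    @PostPersist
    public void onInsert(CloudEventEntity event) {
        // Runs only in the JVM that performed the insert, after the entity is persisted.
        // If a handler throws or the process dies here, the event is effectively skipped,
        // because no durable subscription position is stored.
        SubscriptionRegistry.dispatch(event);
    }
}

// Made-up in-process registry of subscription handlers (not part of the PR).
class SubscriptionRegistry {
    static void dispatch(CloudEventEntity event) {
        // look up registered handlers for the event type and invoke them
    }
}
```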

Option 3: Use a provider specific equivalent of mongodb's changestream

Postgres has the listen/notify concept

  • https://www.postgresql.org/docs/current/sql-notify.html
  • https://bitbucket.org/neilmcg/postgresql-websocket-example/src/master/src/main/java/PGNotifyToWebSocket.java

MySQL might have something similar
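
For Postgres specifically, the raw JDBC mechanics would look roughly like this (sketch; the channel name and payload are made up, and a trigger or the writing transaction would need to issue the NOTIFY on insert):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import org.postgresql.PGConnection;
import org.postgresql.PGNotification;

// Sketch of listening for Postgres NOTIFY messages with the pgjdbc driver.
// Assumes something issues NOTIFY cloud_events, '<event id>' after each insert.
public class PgListenExample {
    public static void main(String[] args) throws Exception {
        try (Connection connection = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/occurrent", "user", "pass")) {
            try (Statement statement = connection.createStatement()) {
                statement.execute("LISTEN cloud_events");
            }
            PGConnection pgConnection = connection.unwrap(PGConnection.class);
            while (true) {
                // Blocks for up to 10 seconds waiting for notifications.
                PGNotification[] notifications = pgConnection.getNotifications(10_000);
                if (notifications != null) {
                    for (PGNotification notification : notifications) {
                        System.out.println("New event id: " + notification.getParameter());
                    }
                }
            }
        }
    }
}
```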

Option 4: CDC using integration tools

Use a solution like Debezium to move events from your relational database into a Kafka stream, then subscribe to the Kafka stream.


Option #1 would be the most generic to implement, as it doesn't require any provider-specific functionality or an additional piece of infrastructure, but I'm not sure whether polling would cause other issues or turn out not to be an acceptable solution for the change stream problem.

Do any of those jump out at you as the right place to start?

Does Occurrent have any particular delivery/ordering guarantees for subscriptions?

  • e.g. I think the competing consumer model gives you at-least-once delivery but no order guarantees

nicklundin08 · Feb 06 '24 01:02

On another note, is there anything you'd like to see code-wise before considering merging? Here are a few things I'd like to do before marking this as "ready":

  • Port adequate tests into the BaseTest class
    • Mostly just copying from the MongoDB section
  • Add concrete JPA tests for the following providers
    • Postgres
    • [Stretch] Dynamo
    • [Stretch] Redis
  • Add javadocs to all public apis
  • Scrutinize dependencies
    • Currently using spring-data-starter-jpa
    • This is a very heavy package and I'm only bringing it in to get tests to work
    • I think a better solution is to use the spring-data-jpa package instead
  • Update the package structure to reflect the blocking/reactor structure in the other packages
  • [Stretch] Implement reactor event store using JPA
  • [Stretch?] Try to implement a subscription implementation using JPA

Let me know if there's anything you'd like me to add/remove!

nicklundin08 · Feb 06 '24 01:02

Thanks for all your efforts. However, I think I would like to see better JSON support, i.e. not storing the data as a string :/ The reason is that I'd like to support query capabilities (https://occurrent.org/documentation#eventstore-queries), also inside the data property of the cloud event (I'm using this in production atm and I think it can be quite nice). If this is not possible with JPA, I think it might be better to implement support using other means (JDBC if that works). I don't have any experience with this myself though, but I guess it should work in Postgres, MySQL, and probably others.

Another thing I'm thinking of is subscription support (https://occurrent.org/documentation#subscriptions). Without subscriptions, it's hard to do anything useful. I think that we need subscription support from the get-go. I don't want to bring in Debezium unless it's really needed (to keep dependencies down), but maybe it would be a good starting point if it's too difficult to achieve it ourselves. I don't know if JDBC supports change streams or if one would need to write different change streams for different implementations. If so, Debezium might be a more attractive option, given that it supports everything we want to do.

WDYT?

johanhaleby · Feb 09 '24 13:02

Hey sorry for the late reply. I was on vacation last week.


Re: JSON

Gotcha. Yeah, I think I have JSON support in a pretty good spot ATM (see the above comment).
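
One way of querying into the data property could be to call the Postgres JSON functions through the Criteria API. A simplified, untested sketch (names are placeholders):

```java
import org.springframework.data.jpa.domain.Specification;

// Untested sketch: reach into a jsonb "data" column from a Specification by
// calling a Postgres JSON function through the Criteria API.
// The entity, the "data" attribute, and the JSON field names are placeholders.
public final class CloudEventDataSpecifications {

    private CloudEventDataSpecifications() {
    }

    // Matches events whose data payload contains e.g. {"orderId": "<value>"}.
    public static Specification<CloudEventEntity> dataFieldEquals(String jsonField, String value) {
        return (root, query, cb) -> cb.equal(
                cb.function("jsonb_extract_path_text", String.class,
                        root.get("data"), cb.literal(jsonField)),
                value);
    }
}
```

With a repository that extends JpaSpecificationExecutor, that could then be used like repository.findAll(dataFieldEquals("orderId", "123")).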


Re: Subscriptions

Makes sense not going down a route that makes use of Debezium or something like that.

I think the two options that could be implemented then are:

  • Some sort of table structure that stores subscription information + polling
    • :white_check_mark: Not provider specific
    • :x: Polling introduces latency (may/may not be an issue - looking for guidance here)
  • Using provider-specific functionality to "listen" for changes
    • E.g. Postgres supports the LISTEN + NOTIFY keywords
    • :x: Provider specific (different impls for postgres, mysql, etc)
    • :white_check_mark: Real time/no polling

Do you prefer either of those options?


I've got 14 failing tests that I need to wrap up before I take a stab at subscriptions.

nicklundin08 · Feb 22 '24 18:02

You should check out https://github.com/eugene-khyst/postgresql-event-sourcing, in which the author has detailed a robust approach for implementing asynchronous subscribers using both polling and listen/notify on PostgreSQL. The readme goes into nice detail regarding the care that must be taken due to parallel transactions.

The discussion on Hacker News regarding that reference implementation gives further insight and mentions some alternative approaches: https://news.ycombinator.com/item?id=38084098

Perhaps you can ask @eugene-khyst whether or not he is interested in collaboration.

frederikb · Mar 16 '24 11:03