
Multiple requests in a transaction

Open timbod7 opened this issue 10 years ago • 59 comments

I have a schema capturing events and participants:

CREATE TABLE demo.participant(
  id BIGINT PRIMARY KEY,
  name VARCHAR(255) NOT NULL
  );

CREATE TABLE demo.event(
  id BIGINT PRIMARY KEY,
  title VARCHAR(255) NOT NULL
  );

CREATE TABLE demo.event_participant(
  event_id BIGINT NOT NULL REFERENCES demo.event(id),
  participant_id BIGINT NOT NULL REFERENCES demo.participant(id),
  PRIMARY KEY(event_id,participant_id)
  );

I can expose these tables directly with 3 postgrest views. However, I require that inserting/updating an event and the participants in that event be atomic.

I think I need a single view that somehow combines data from event and event_participant, and is insertable/updatable.

Is there any way to do this with postgrest?

timbod7 avatar Sep 13 '15 21:09 timbod7

I'm curious that there hasn't been a response to this. How are people handling many-to-many relationships with postgrest?

timbod7 avatar Sep 21 '15 04:09 timbod7

Hey sorry for the slow response. I've been away on vacation this week and not on the computer much.

There is currently no general way to batch operations in a single transaction. As a workaround you can create a stored procedure to do complicated updates and call it with the rpc interface.
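For the schema in the original post, a minimal sketch of what such a function could look like (the name set_event, its signature, and the upsert logic are only illustrative):

CREATE OR REPLACE FUNCTION demo.set_event(_id bigint, _title text, _participant_ids bigint[])
RETURNS void AS $$
BEGIN
  -- plpgsql functions run inside the calling transaction,
  -- so these statements commit or roll back together
  UPDATE demo.event SET title = _title WHERE id = _id;
  IF NOT FOUND THEN
    INSERT INTO demo.event(id, title) VALUES (_id, _title);
  END IF;

  -- replace the participant links
  DELETE FROM demo.event_participant WHERE event_id = _id;
  INSERT INTO demo.event_participant(event_id, participant_id)
  SELECT _id, unnest(_participant_ids);
END;
$$ LANGUAGE plpgsql;

The client would then call it with something like POST /rpc/set_event and a JSON body such as {"_id": 1, "_title": "Meetup", "_participant_ids": [2, 3]}.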

It would be good to invent a nice interface for batch operations. We could take advantage of HTTP/2 multiplexing and share a single transaction across the requests on a connection. Then the client could send several requests in the same connection to execute atomically.

begriffs avatar Sep 21 '15 04:09 begriffs

Thanks for the reply. An interface for batch operations would be awesome.

But apart from the "how can I do this atomically" question, I guess I was wondering if there is a better way to model the data relationship given the current capabilities of postgrest. Given my naive model above, if I want to update an event along with its associated participants, I need to (roughly sketched below):

  - PUT the new event value
  - DELETE the existing values from event_participant
  - POST a csv to bulk insert the new values in event_participant
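Roughly, assuming the default PostgREST routes for those tables, that sequence is (just a sketch; the exact filters depend on your keys):

PUT /event?id=eq.1
{"id": 1, "title": "New title"}

DELETE /event_participant?event_id=eq.1

POST /event_participant
Content-Type: text/csv

event_id,participant_id
1,10
1,11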

Perhaps, as you say the RPC approach is the best one.

timbod7 avatar Sep 21 '15 05:09 timbod7

The RPC is your best bet at the moment, but we really need a batch interface. I can see that being high priority after the upcoming v0.3 release.

begriffs avatar Oct 26 '15 22:10 begriffs

Is this still a priority? When this is implemented, I will be able to use PostgREST in many of my projects.

dcominottim avatar Oct 07 '16 15:10 dcominottim

Not in the immediate future but there is a possibility of manually implementing this by having a trigger on insert/update in the parent table take the values that are meant for the child table and send them there.
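For the event/participant schema above, a minimal sketch of that idea, here done as a view with an INSTEAD OF trigger rather than a trigger on the table itself (all names are illustrative):

CREATE VIEW demo.event_with_participants AS
SELECT e.id, e.title,
       ARRAY(SELECT ep.participant_id
             FROM demo.event_participant ep
             WHERE ep.event_id = e.id) AS participant_ids
FROM demo.event e;

CREATE FUNCTION demo.event_with_participants_upsert() RETURNS trigger AS $$
BEGIN
  -- upsert the parent row
  UPDATE demo.event SET title = NEW.title WHERE id = NEW.id;
  IF NOT FOUND THEN
    INSERT INTO demo.event(id, title) VALUES (NEW.id, NEW.title);
  END IF;
  -- route the values meant for the child table
  DELETE FROM demo.event_participant WHERE event_id = NEW.id;
  INSERT INTO demo.event_participant(event_id, participant_id)
  SELECT NEW.id, unnest(NEW.participant_ids);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER event_with_participants_write
INSTEAD OF INSERT OR UPDATE ON demo.event_with_participants
FOR EACH ROW EXECUTE PROCEDURE demo.event_with_participants_upsert();

Exposing the view lets a single POST or PATCH carry both the event and its participant ids, and the whole trigger body runs in the request's transaction.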

ruslantalpa avatar Oct 07 '16 17:10 ruslantalpa

Our web server supports HTTP/2, so we can do this bulk operation stuff at some point but it'll take a significant amount of work and there are a number of more pressing issues to fix before attempting it.

begriffs avatar Oct 07 '16 18:10 begriffs

I'm not sure about the relative stupidity of this suggestion, but couldn't you use custom headers (or reuse semantically appropriate standard ones) to track requests that should be wrapped in a single transaction?

X-TRANSACTION-START: f058ebd6-02f7-4d3f-942e-904344e8cde5
...
X-TRANSACTION-ID: f058ebd6-02f7-4d3f-942e-904344e8cde5
...
X-TRANSACTION-COMMIT: f058ebd6-02f7-4d3f-942e-904344e8cde5

ephetic avatar May 09 '17 19:05 ephetic

Your suggestion makes sense. I've seen a related scheme recommended in various places on the web: expose an HTTP resource that represents transactions. Let me outline how it could work, then talk about some problems.

How it could work

The server can expose /transactions which accepts only POST. When you post to it you can include a json object with options like the desired isolation level. It then responds with a transaction id (txid). Using this id you can access another endpoint, /transactions/:txid. DELETE would roll it back, POST (or maybe PUT) would commit it, GET would give statistics about it.

With the txid you could then send regular postgrest requests and either include it in a query param like &txid=foo or as an HTTP header if we find one that is appropriate. The web server thread would then find the existing db connection and resume sending commands on it. After doing regular postgrest requests the client would commit or rollback the transaction with HTTP calls on its resource.
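A hypothetical exchange, just to make the flow concrete (the option names and response bodies are invented):

POST /transactions
{"isolation_level": "serializable"}

201 Created
{"txid": "f058ebd6-02f7-4d3f-942e-904344e8cde5"}

PATCH /event?id=eq.1&txid=f058ebd6-02f7-4d3f-942e-904344e8cde5
{"title": "New title"}

POST /transactions/f058ebd6-02f7-4d3f-942e-904344e8cde5

The final POST commits; a DELETE on the same URL would roll the transaction back instead.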

Challenges

OK that's the general idea, and here are challenges we would face:

  • Making sure that users couldn't spoof the txid to get their requests executed under another user's transaction with escalated permissions. PostgreSQL already has the notion of txid, accessible with txid_current(), but it is a simple number counting upward, easy to guess. We would have to obfuscate it with a big guid alias at the very least.
  • We would want an aggressive inactivity timeout on the open transactions so that a user couldn't easily lock up all our db pool connections.
  • The URLs I proposed don't match the postgrest url conventions or typical verbs. Could adjust it a little, but it makes the interface inconsistent.
  • Have to find a way to maintain affinity between client and postgrest server under load balancing. Sending the updates in an HTTP/2 request would get around this problem.
  • Holding transactions open is always expensive because postgresql is an MVCC system. Memory gets used, rows can't be vacuumed, etc. Waiting for all those round-trip HTTP requests to finish holds the transaction open longer than sending the requests in a single HTTP/2 request.

So I still think HTTP/2 is the most elegant way, but it may have problems of its own like client compatibility. What do you think? Do you have solutions for these problems, or perhaps problems I didn't imagine?

begriffs avatar May 10 '17 04:05 begriffs

  • guid alias sounds sufficient
  • reasonable, if configurable
  • should be out-of-band. perhaps rpc endpoint
  • umm...pass
  • yeah, pass again

How would HTTP/2 handle arbitrary SELECT/UPDATE type transactions?

ephetic avatar May 10 '17 05:05 ephetic

What about - until a good way of utilizing HTTP2 connections can be established - just supporting a simple /transaction endpoint which handles RFC6902 JSON PATCH-esque payloads with a very specific content-type? This way we'd have a good stable baseline for what I feel is an absolutely essential part of a REST API today, and perhaps especially in this project, considering the transactional nature of PostgreSQL.

I'm thinking the format could look like this:

POST /transaction
Content-Type: application/vnd.pgrst.patch+json
[
  {
    "verb": "DELETE",
    "endpoint": "/dogs?alive=is.false"
  },
  {
    "verb": "POST",
    "endpoint": "/dogs",
    "data": "{\"name\": \"Foo\"}"
  }
]

jacobsvante avatar Jun 28 '17 22:06 jacobsvante

Aside from the http2 interface to implement this feature, we also need a way to restrict the cost of what could be many expensive queries/mutations in a single transaction, and that way avoid the risk of clients DDoSing the database.

I think this would be better enforced at the database level with a statement_timeout or perhaps an explain cost maximum, so maybe it would be better to address #249 before this is implemented.
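For example, a hard cap could be set at the database or role level (the names here are illustrative; note that settings attached via ALTER ROLE apply when that role logs in, not on SET ROLE):

ALTER DATABASE app SET statement_timeout = '2s';
-- or, for the role PostgREST logs in as:
ALTER ROLE authenticator SET statement_timeout = '2s';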

steve-chavez avatar Aug 16 '17 00:08 steve-chavez

Comment from the peanut gallery about request batching: there's a standard application/http content type defined in the HTTP/1.1 spec that represents a sequence of requests or a sequence of responses. Posting that to some sort of "batch operations" resource could serve as an alternative to clients creating, manipulating, and committing transaction resources with a series of requests.
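For illustration, a request to such a (hypothetical) batch resource would carry a pipeline of HTTP request messages as its body:

POST /batch HTTP/1.1
Content-Type: application/http

DELETE /dogs?alive=is.false HTTP/1.1

POST /dogs HTTP/1.1
Content-Type: application/json
Content-Length: 15

{"name": "Foo"}

The server would replay the enclosed requests inside one transaction and could answer with an application/http body holding the corresponding responses.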

jackfirth avatar Oct 06 '17 01:10 jackfirth

there is a possibility of manually implementing this by having a trigger on insert/update in the parent table take the values that are meant for the child table and send them there.

This approach seems to be the best for current versions. However, it would be great to see something more transaction-friendly in the future.

wildsurfer avatar Jan 12 '18 16:01 wildsurfer

Not in the immediate future but there is a possibility of manually implementing this by having a trigger on insert/update in the parent table take the values that are meant for the child table and send them there.

The approach seems similar to https://github.com/tobyhede/postgresql-event-sourcing, but it would need a complete rewrite of the application logic, something that is not always possible.

ilkosta avatar Jan 13 '18 16:01 ilkosta

Even if the http2 interface is implemented, that still wouldn't solve the problem of INSERTing a row dependent on the id of another row (child/parent table).

I made a proposal for solving that part in https://github.com/PostgREST/postgrest/issues/818#issuecomment-409981816.

steve-chavez avatar Aug 02 '18 16:08 steve-chavez

Just throwing out an idea here:

Implement a /batch or /transaction endpoint. You can POST your array of commands:

[ { "endpoint": "/authors", "payload": { "author_id" : 1, "first": "Robert", "last": Frost" } }, { "endpoint": "/works", "payload": { "work_id": 1, "author_id": 1, "title": "Mending Wall" } }, { "endpoint": "/works", "payload": { "work_id": 2, "author_id": 1, "title": "The Road Not Taken"} } ]

These would all be performed inside of a transaction. Should work for other HTTP requests as well.

Thoughts?

dmulfinger avatar Aug 29 '18 15:08 dmulfinger

What's the benefit of a transaction interface over exposing an rpc function that takes json objects, arrays and what not? I know it's been said before, but I'm saying it again; you can do it all in your function.

Sure, creating such a function means putting some effort into handling json and arrays (of json objects) as arguments. But implementing an http request to construct a transaction (as proposed above) requires effort too.

I would put my effort in building the Postgres function instead of such an http request.

There's an argument to be made for keeping PostgREST lean, because that's less likely to introduce extra maintenance and bugs. Offload your requirements to Postgres, not PostgREST, when possible.

christiaanwesterbeek avatar Aug 29 '18 19:08 christiaanwesterbeek

The benefit is that while a json-parsing /rpc function would need to be individually coded for every parent/child relationship of every project that uses PostgREST, a transaction interface as proposed above would be coded once into PostgREST. I agree that a json-parsing /rpc function can be written, but it's not necessarily easy to write one well that handles error cases correctly and reports meaningful error messages. And these /rpc functions have a big maintenance problem in that any change to the schema would require changes to the /rpc function. There's probably more to consider regarding how to declare the arguments meaningfully in the OpenAPI spec, but I haven't thought that through.

From the client's perspective, I think it's a minimal effort to batch the calls.

If handling multiple-table relationships transactionally can somehow be accommodated in PostgREST, I believe it would appeal to a significantly larger user base. As you can see above, others like me would find PostgREST the perfect tool if it could handle this very common use case.

dmulfinger avatar Aug 29 '18 20:08 dmulfinger

  • We would want an aggressive inactivity timeout on the open transactions so that a user couldn't easily lock up all our db pool connections.
  • The URLs I proposed don't match the postgrest url conventions or typical verbs. Could adjust it a little, but it makes the interface inconsistent.
  • Holding transactions open is always expensive because postgresql is an MVCC system. Memory gets used, rows can't be vacuumed, etc.

To avoid the above drawbacks and have client-safe transactions, one option could be to maintain state for our regular requests. Maintaining a hashmap of the transaction guid plus a list of the generated queries would allow us to send them all at once using a single pool connection + transaction.

We would take the hit in memory consumption instead of the db, since we'd store the state until the transaction is sent. An inactivity timeout for freeing up memory would be needed; it would account for HTTP/1 latency and it'd be configurable. HTTP/2 would improve this as it reduces latency.

Interface

Reusing our Prefer: tx=commit/rollback header, the interface could look like this:

PATCH /tbl?id=eq.1
Prefer: tx=begin; guid=550e8400-e29b-41d4-a716-446655440000

{"body": "value"}

200 OK
Preference-Applied: tx=begin; guid=550e8400-e29b-41d4-a716-446655440000

POST /other
Prefer: tx=continue; guid=550e8400-e29b-41d4-a716-446655440000

{"body": "value"}

200 OK
Preference-Applied: tx=continue; guid=550e8400-e29b-41d4-a716-446655440000

DELETE /another?id=eq.3
Prefer: tx=commit; guid=550e8400-e29b-41d4-a716-446655440000

200 OK
Preference-Applied: tx=commit; guid=550e8400-e29b-41d4-a716-446655440000
{"..."}

Limitations

  • Since there's no response from the database yet, the tx=begin|continue requests won't return a body (Prefer: return=representation won't be applied). The last query in the transaction would return its result or an error message.
  • Because of the above, GET/HEAD requests would have to fail in a tx context (or better, they will succeed and ignore the Prefer: tx=...)
  • No ability to obtain a previous query result to make a dependent mutation (comment above)
  • Could be a slow operation if latency is high. HTTP2 would help with this.
  • The memory consumption per tx would be the sum of the request bodies (needs solving https://github.com/PostgREST/postgrest/issues/2261)
  • Not sure about this part. If HTTP2 can parallelize requests, we might have to add a position to the Prefer: tx?

Further possibilities

  • No ability to obtain a previous query result to make a dependent mutation

It could be possible to lift the above restriction if we refine the interface to construct a single query instead of multiple by using Data-Modifying Statements in WITH. The basic idea is to add the previous request result as a body of the next and refer to it with underscore operators.
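In raw SQL, the kind of single statement this would boil down to looks roughly like this (reusing the event/participant schema from the top of the thread; the literal values are only illustrative):

WITH new_event AS (
  INSERT INTO demo.event(id, title)
  VALUES (1, 'Meetup')
  RETURNING id
)
INSERT INTO demo.event_participant(event_id, participant_id)
SELECT new_event.id, p.participant_id
FROM new_event,
     (VALUES (10::bigint), (11::bigint)) AS p(participant_id);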

steve-chavez avatar Aug 24 '22 00:08 steve-chavez

It could be possible to lift the above restriction if we refine the interface to construct a single query instead of multiple by using Data-Modifying Statements in WITH. The basic idea is to add the previous request result as a body of the next and refer to it with underscore operators.

The above idea would be unnecessary if we allow our POST to obtain previous related data (SELECT .. WHERE ..) before inserting. PATCH and DELETE can already query before performing the action. This would keep the transaction API simpler and preserve the "only mutations allowed" condition - which fits really well with Prefer: return=representation being ignored, I must say.

if we allow our POST to obtain previous related data (SELECT .. WHERE ..) before inserting

Maybe also unnecessary with https://github.com/PostgREST/postgrest/issues/818

steve-chavez avatar Aug 26 '22 16:08 steve-chavez

fwiw, my use case around this is strictly for testing purposes - so the locking impacts aren't as big of a concern, as I'd only be shooting myself in the foot in CI/CD. I agree with the comments above around functions - I haven't seen any issues that couldn't be resolved through one in production. But I'd love to be able to write integration tests that use a shared transaction ID and then roll back when the test is complete.

mazondo avatar Sep 02 '22 03:09 mazondo

+1

m-farahmand avatar Sep 11 '22 12:09 m-farahmand

Another option would be extending the syntax proposed on https://github.com/PostgREST/postgrest/issues/818#issuecomment-409981816 to allow different types of statements.

POST /projects?columns=id,name,clients!update(name),tasks!delete()&clients.id=eq.1&tasks.id=eq.3

{".."}

Which would roughly translate to

WITH 
payload AS (
  select '{".."}'
),
update_clients AS (
  update clients set name = .. where id = 1  
  returning ..
),
delete_tasks AS (
  delete from tasks where id = 3
  returning ..
)
insert into projects(id, name)
select 
  ...
from json_populate_recordset(...) _;

This would avoid the need to maintain state for multiple requests as in the option above.

(clients.id=eq.1 used as above would be a breaking change since currently it's used for filtering the insert results, which is not that useful IMO)

steve-chavez avatar Oct 19 '22 16:10 steve-chavez

POST /projects?columns=id,name,clients!update(name),tasks!delete()&clients.id=eq.1&tasks.id=eq.3

(clients.id=eq.1 used as above would be a breaking change since currently it's used for filtering the insert results, which is not that useful IMO)

To avoid a breaking change as above, we could do

POST /projects
Prefer: columns=id,name,clients!update(name),tasks!delete();clients.id=eq.1;tasks.id=eq.3

(Related to the idea on https://github.com/PostgREST/postgrest/issues/2602#issuecomment-1382753901)

steve-chavez avatar Jan 14 '23 15:01 steve-chavez

I'm working on migrating from Firebase to Supabase and I was surprised to come across this issue! Lack of transaction support from the client will mean extra work for us in porting our existing firebase code. @steve-chavez this issue has been open for 7+ years; does it look like this is outside the scope of postgrest, or is it something that may be implemented in the near term?

evelant avatar Mar 10 '23 17:03 evelant

It is in scope, but to get it right there are other issues that must be solved first. Off the top of my head:

  • https://github.com/PostgREST/postgrest/issues/818
  • https://github.com/PostgREST/postgrest/issues/465

In fact solving those might alleviate the current need for client-side transactions.

steve-chavez avatar Mar 15 '23 01:03 steve-chavez

Thanks! Ah I also didn't know about the limitation of updating jsonb columns. That will also make my transition from firebase more challenging as I relied upon being able to update deeply nested data (which became jsonb columns for an easy transition)

evelant avatar Mar 15 '23 12:03 evelant

Implement a /batch or /transaction endpoint. You can POST your array of commands:

[ { "endpoint": "/authors", "payload": { "author_id" : 1, "first": "Robert", "last": Frost" } }, { "endpoint": "/works", "payload": { "work_id": 1, "author_id": 1, "title": "Mending Wall" } }, { "endpoint": "/works", "payload": { "work_id": 2, "author_id": 1, "title": "The Road Not Taken"} } ]

That only works when the request data is all known beforehand, but dynamic requests are harder. The issue here is more about "transactionalizing" requests so the client can atomically get the results from one request and use them to modify or send another request, all in the same transaction.

What are the main challenges in delivering this feature? Would it be helpful to see how ORMs typically handle transactions or are these libraries too different from the REST API-based approach of PostgREST? I know they have direct DB connections, but they must have solved some common issues of async connection transaction handling.

bombillazo avatar Apr 10 '23 01:04 bombillazo

That only works when the request data is all known beforehand, but dynamic requests are harder. The issue here is more about "transactionalizing" requests so the client can atomically get the results from one request and use them to modify or send another request, all in the same transaction.

Agreed.. Kinda. In isolation this would be more akin to Firebase's "batched writes", which execute multiple statements as part of the same transaction on the backend and wouldn't be a bad thing. It would solve a lot of common issues, such as inserting relational data.

Looking at Firebase's docs, however, their client libraries seem to use optimistic concurrency control: reads are done before the transaction starts (and they require you to put reads first when using transactions in the client-side library), and then writes are done as part of the transaction following that.

For their server-side client libraries, they use pessimistic concurrency control, which, as you'd expect, starts the transaction before the reads and locks the tables at that point.

With that in mind, their approach would essentially mean only requiring the above "batched writes" functionality. The question then, I guess, is: is that good enough? Leaving pessimistic concurrency control to RPC function calls? Or would something along the lines of Steve's earlier proposal be more acceptable?

gitbugr avatar Apr 10 '23 10:04 gitbugr