Add maxRequestSize env var to merlin-server to set Javalin HTTP config
Users are running into cases where they need to upload data larger than Javalin's current default accepted size. This ticket is to expose an environment variable on merlin-server so users can set the Javalin HttpConfig `maxRequestSize` based on their use case. We should make the default larger than 1 MB, and we should verify that a large payload (> 10 MB) works after this update.
For example, when using the `addExternalDataset` action, Hasura makes a POST request to the merlin-server `/addExternalDataset` endpoint. The default value for the `maxRequestSize` config is 1 MB, and external datasets will often be much larger than that. Payloads larger than 1 MB return a 413 Payload Too Large.
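A minimal sketch of how merlin-server could wire this up, assuming Javalin 5's `config.http.maxRequestSize` field and a hypothetical `MERLIN_MAX_REQUEST_SIZE` variable name (the real name and default value are up for discussion):

```java
import io.javalin.Javalin;

import java.util.Optional;

public final class MerlinAppFactory {
  // Hypothetical env var fallback; Javalin's built-in default is 1 MB (1_000_000 bytes).
  private static final long FALLBACK_MAX_REQUEST_SIZE = 10_000_000L; // ~10 MB

  public static Javalin createApp() {
    final long maxRequestSize = Optional
        .ofNullable(System.getenv("MERLIN_MAX_REQUEST_SIZE"))
        .map(Long::parseLong)
        .orElse(FALLBACK_MAX_REQUEST_SIZE);

    return Javalin.create(config -> {
      // Requests with a body larger than this many bytes are rejected with 413 Payload Too Large.
      config.http.maxRequestSize = maxRequestSize;
    });
  }
}
```

Note the value is in bytes, so `10_000_000` lines up roughly with the 10 MB test payload mentioned above.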
Additionally we should investigate if there are env vars on the Hasura side to support larger payloads. What even is the upper limit of payload size for Hasura actions? I can't find this documented anywhere.
More questions to think about when working on this ticket:
- Can we support streaming very large payloads into a Hasura action and into merlin-server?
- If we cannot support streaming, can we support submitting data in batches? This would involve either updating the `addExternalDataset` action to accept batches, or adding an `updateExternalDataset` action with batch semantics (see the sketch after this list).
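To make the batching idea a bit more concrete, here is a purely illustrative client-side sketch; the endpoint, payload shape, and batch size are all hypothetical and would depend on how the action is actually designed:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

// Sketch: split a large list of already-serialized profile segments into fixed-size batches
// and POST each batch separately, instead of sending one giant request.
final class BatchedDatasetUpload {
  static void upload(final HttpClient client, final String endpoint, final List<String> segmentJson)
      throws Exception {
    final int batchSize = 500; // tuning knob: keep each request comfortably under the size limit
    for (int i = 0; i < segmentJson.size(); i += batchSize) {
      final var batch = segmentJson.subList(i, Math.min(i + batchSize, segmentJson.size()));
      final var request = HttpRequest.newBuilder()
          .uri(URI.create(endpoint)) // e.g. a hypothetical /updateExternalDataset endpoint
          .header("Content-Type", "application/json")
          .POST(HttpRequest.BodyPublishers.ofString("[" + String.join(",", batch) + "]"))
          .build();
      final var response = client.send(request, HttpResponse.BodyHandlers.ofString());
      if (response.statusCode() >= 400) {
        throw new RuntimeException("Batch starting at index " + i + " failed: " + response.statusCode());
      }
    }
  }
}
```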
See the internal Slack thread here.
As a thought, would it be difficult to set up a sidecar API for managing file uploads? I think this alternative was raised quite some time ago (over a year ago, certainly), but it didn't seem to be necessary at the time. I remember reading articles on this back then, like this one.
The proposed scheme was something like:
- Create a small service that manages uploads (creation, deletion, download) on the filesystem.
- Add an `uploads` table to Postgres with `id` and `path` columns (at minimum).
- Change Hasura queries that reference paths directly to reference `uploads` IDs instead.
Mostly, the rationale here is, if finagling upload files through Hasura is painful, maybe it's the wrong abstraction for the use-case.
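For illustration only, a bare-bones version of that sidecar might look like the sketch below (Javalin 5 style; the routes, port, and storage location are all placeholders, not a concrete proposal):

```java
import io.javalin.Javalin;

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.UUID;

// Sketch of the proposed sidecar: files live on the filesystem, and each upload would also
// get an (id, path) row in the proposed `uploads` table so Hasura can reference it by id.
public final class UploadSidecar {
  private static final Path UPLOAD_DIR = Path.of("/var/lib/aerie/uploads"); // placeholder location

  public static void main(final String[] args) throws Exception {
    Files.createDirectories(UPLOAD_DIR);
    final var app = Javalin.create().start(9000); // placeholder port

    // Creation: store the uploaded file under a generated id and return that id.
    app.post("/uploads", ctx -> {
      final var uploaded = ctx.uploadedFile("file");
      if (uploaded == null) { ctx.status(400); return; }
      final var id = UUID.randomUUID().toString();
      try (final var in = uploaded.content()) {
        Files.copy(in, UPLOAD_DIR.resolve(id));
      }
      // A real version would also insert the (id, path) row into Postgres here.
      ctx.json(Map.of("id", id));
    });

    // Download: stream the stored file back by id.
    app.get("/uploads/{id}", ctx ->
        ctx.result(Files.newInputStream(UPLOAD_DIR.resolve(ctx.pathParam("id")))));

    // Deletion: remove the stored file (and, in a real version, its `uploads` row).
    app.delete("/uploads/{id}", ctx -> {
      Files.deleteIfExists(UPLOAD_DIR.resolve(ctx.pathParam("id")));
      ctx.status(204);
    });
  }
}
```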
@Twisol This is not for file uploads actually. The gateway still handles file uploads using the scheme you describe. The `addExternalDataset` action does not accept files, but instead just a (potentially) very large JSON blob, which is then written to the table defined in `plan_dataset.sql` (among others).
Ahh, I see. 🤔 Would it be improper to use the file upload infrastructure here? Users could upload external datasets as files (CSV?), then batch-load them through a custom action. In fact, Postgres has support for reading tables from CSV via COPY FROM, so maybe there are also some performance and simplicity gains in this direction.
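As a rough illustration of the `COPY` route (not a concrete design), assuming the Postgres JDBC driver's `CopyManager` API and a hypothetical staging table:

```java
import org.postgresql.PGConnection;

import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;

// Sketch: bulk-load an uploaded CSV into a staging table with COPY, then fold it into the
// external-dataset tables with ordinary SQL. The staging table and its columns are hypothetical.
final class CsvDatasetLoader {
  static long load(final Connection connection, final Path csvFile) throws Exception {
    final var copyApi = connection.unwrap(PGConnection.class).getCopyAPI();
    try (final Reader reader = Files.newBufferedReader(csvFile)) {
      // COPY parses the CSV server-side and streams rows in, instead of shipping a giant GraphQL payload.
      return copyApi.copyIn(
          "COPY staging_profile_segments (profile_name, start_offset, value) "
              + "FROM STDIN WITH (FORMAT csv, HEADER true)",
          reader);
    }
  }
}
```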
@Twisol Yeah it's definitely a possibility! In fact I think users would love being able to just add a raw CSV as an external dataset since that's the main format they are using. It's a little out of scope for this ticket tho since it would require an API change first (refactoring addExternalDataset). It would also slightly complicate the API since two requests would be required to create an external dataset instead of one. I'll bring this up when we bring this ticket into a sprint tho and we can decide to make an additional ticket then.
For sure :+1: This issue definitely seems to be focusing on the proximal pain point, I get that. Just trying to think about what an ideal fix might be, aside from any immediate patch 😃
Yeah I appreciate the additional perspective!
Being able to upload CSV directly would be nice. Turning CSV into GraphQL greatly increases the payload size.
Long term I think we will need a more robust solution than just a control on the request size limit (especially if we continue using GraphQL). The files we tested with were for a one-day plan, with samples every 100 seconds. Even increasing the limit to 10 MB would only get us to 10-day plans at this sample rate, and we need to be able to support much longer timelines.