central-backend Form attachment content type should be determined server-side

In the course of starting work on support for geojson in Enketo, I was surprised to see that the Content-Type header for .geojson form attachments was coming back null from Central. After a bit of time trying to track this down in central-backend, I finally found that the value is currently passed through from central-frontend. The value is determined by the File#type DOM API, which (currently) returns an empty string for .geojson files. This isn't entirely surprising, it's more or less what I'd expect for less common file extensions.

My concern is that the value is passed through from the frontend and persisted as-is. I decided not to implement a fix there, and instead to file an issue to add server-side validation and determination of content type for blobs. This would be a clearer place for the implementation, and it would provide an added layer of security.

May 24 '22 23:05 eyelidlessness

Hey @eyelidlessness! This is very interesting. In terms of the current implementation, it looks like something similar is true for submission attachments as well: Central Backend will use the content type specified by Enketo or Collect.

Are you thinking that Central Backend should read the content of the form attachment or submission attachment in order to determine its actual content type? Would Central Backend ever reject an attachment based on its content type? Or would the main benefit be that when the attachment is requested later, the Content-Type header of the response can be trusted as accurate?

May 25 '22 03:05 matthew-white

Hey @eyelidlessness! This is very interesting. In terms of the current implementation, it looks like something similar is true for submission attachments as well: Central Backend will use the content type specified by Enketo or Collect.

Yep! I was filing this quickly before it fell off my attention span but meant to mention this is why I identified blobs as the likely appropriate place/concern around which to handle this (good data model once I found my way to it!).

Are you thinking that Central Backend should read the content of the form attachment or submission attachment in order to determine its actual content type? Would Central Backend ever reject an attachment based on its content type? Or would the main benefit be that when the attachment is requested later, the Content-Type header of the response can be trusted as accurate?

My thinking is the current behavior is a good hint, but should only be treated as a hint. It's clearly wrong, at least in Chrome, for .geojson and doesn't provide value there. But trusting client values isn't a good idea in general, and in particular for media types[^1]. I think the server side should take these hints as guidance to validate, and provide a source of truth regardless of whether the client provides the hint.

Naively, for the geojson case, that would involve checking for an empty content type and/or the .geojson filename suffix, checking that it's well-formed JSON, and probably a loose schema validation.

The validation scenario for binary types would probably scale to the complexity of the binary format in question. There are almost certainly libraries which would do this better than we could.

Central should definitely reject invalid content types, even if rejection is friendly and assumes user error rather than hostile. But yes the primary benefit would be a reliable source of truth that attachments are properly served with a Content-Type that's reliable and (hopefully) correct.

[^1]: Amusingly, just in the last couple days I've seen a few fresh posts about exploiting media type ambiguity to embed unsuspected content in otherwise innocuous payloads, including on host platforms. Today's was embedding arbitrary data in YouTube uploads as a "backup" 😬🙃

May 25 '22 06:05 eyelidlessness

yes, look up steganography at some point.

i hear you about doing All The Right Things but i don't think we really have the bandwidth to build or maintain that for what's sort of a corner of the whole thing. i wonder if we find a way for .geojson to ring up correctly and not worry too much about eg is this mp4 really an mp4.

May 25 '22 18:05 issa-tseng

I really should have clarified that I would’ve assigned myself to the issue but I don’t have GH permissions to do so. I have the bandwidth for this, or at least a reasonable first pass.

May 26 '22 01:05 eyelidlessness

As I grumbled about in other channels, file type determination is full of redundant info and other irritants (filename suffix, leading bytes signature, mime type). We favor different kinds of information in different parts of the system, for better or worse.

Checking for an empty content type and a .geojson suffix and setting a content type accordingly seems ok. The way we've written the spec, clients are also responsible for looking for a .geojson suffix but I don't see a problem with also setting the content type. @eyelidlessness is it actually set for CSV or XML external secondary instances coming from Central at the moment? If so, is that because of the way that the Enketo request is made?

I'm not so sure about the schema validation. Currently clients are supporting a subset of .geojson. I think the most useful thing for users if to be allowed to upload whatever geoJSON they have and for the clients to do all of the schema validation according to what they currently support.

In general, I think it's a strength of Central that it's open about content types. For example, it was valuable to be able to support geojson in clients without making any Central changes.

In practice, we don't actually know what file types users use and expect. That is, users could specify a file field in their form with a default of whatever file type they want. I've seen gpx, ogg, amr and many other less-common file types. If we start rejecting certain file types we're going to be playing whack-a-mole in this area for a long time.

I also don't see a user benefit though I could very well be missing something. It makes sense on publicly-accessible services to try to guard against various exploits. With Central, it feels like if someone with malicious intent gets administrator access you have way bigger problems than bad attachments getting in. In the case of malformed files, it feels ok for clients to be responsible for identifying them and alerting the user. They're better positioned to have the ultimate say in what is an acceptable file.

May 26 '22 05:05 lognaturel

@eyelidlessness is it actually set for CSV or XML external secondary instances coming from Central at the moment? If so, is that because of the way that the Enketo request is made?

It's sent as a Content-Type request header by central-frontend when those CSV/XML files are uploaded for the form draft, which is then stored by central-backend in blobs, which in turn it populates in the response Content-Type header when Enketo requests them. When blank, as in the case of .geojson, it's populated in blobs as NULL, and sent as "null" by central-backend in the Content-Type response header.

I'm not so sure about the schema validation. Currently clients are supporting a subset of .geojson. I think the most useful thing for users if to be allowed to upload whatever geoJSON they have and for the clients to do all of the schema validation according to what they currently support.

Maybe schema validation is phrased too strongly, I was thinking more a heuristic to say "yeah, this is probably what it claims to be". In the case of .geojson, that could be as simple as:

Is the file valid JSON?
Does it have a valid type property?

I'd probably stop there, unless there's a compelling reason to validate further. Though I suppose clients will inevitably be doing both to support it, so maybe none of that's necessary.

In general, I think it's a strength of Central that it's open about content types. For example, it was valuable to be able to support geojson in clients without making any Central changes.

In practice, we don't actually know what file types users use and expect. That is, users could specify a file field in their form with a default of whatever file type they want. I've seen gpx, ogg, amr and many other less-common file types. If we start rejecting certain file types we're going to be playing whack-a-mole in this area for a long time.

Oh. I wasn't aware that support for file types was this open-ended. I had been referencing the spec, which reads to me as more limited. Is the intent to broadcast a more limited set of file types but implicitly allow people to use whatever type they like if they try?

Maybe a better approach would be to have a more dynamic solution to determining the content type, e.g. using a library like mime?

I also don't see a user benefit though I could very well be missing something.

Not serving the correct Content-Type header can cause all sorts of unexpected behavior in a browser, such as failing to render certain media or opening a download prompt instead of navigating to a resource. It might also cause different encoding behavior in web servers like nginx.

With Central, it feels like if someone with malicious intent gets administrator access you have way bigger problems than bad attachments getting in.

Unless I'm mistaken, this would also affect submission attachments.

Anyway, here's my thinking at this point in the discussion/after some time to sleep on it. I'm on board with deferring any kind of validation, but I do still think central-backend should be responsible for populating blobs.contentType correctly. I think it makes sense to choose a library like mime[^1] linked above, as that allows Central to continue to be open to supporting new content types. The backend should be the source of truth in conflicting cases where the client sends a type which isn't among the known types on the backend, but should defer to the client-sent type when there are no known types on the backend.

[^1]: The underlying library mime-db looks less ergonomic to use for our purposes, but I wanted to make sure to link to it because that page documents its sources: Apache, IANA and nginx.

May 26 '22 17:05 eyelidlessness

central-backend central-backend copied to clipboard

Form attachment content type should be determined server-side

central-backend
central-backend copied to clipboard