pact-reference icon indicating copy to clipboard operation
pact-reference copied to clipboard

Matching binary payloads does not work consistently across OS/Platform

Open mefellows opened this issue 3 years ago • 16 comments
trafficstars

Given a test that looks like this:

      const interaction = pact.newInteraction('some description');

      interaction.uponReceiving('a request to create a dog with binary data');
      interaction.withRequest('POST', '/dogs/1234');
      interaction.withRequestBinaryBody(bytes, 'application/gzip');
      interaction.withStatus(200);

      port = pact.createMockServer(HOST);

      // and http request to that mock server:

      return axios
        .request({
          baseURL: `http://${HOST}:${port}`,
          headers: {
            'content-type': 'application/gzip',
          },
          data: bytes, 
          method: 'POST',
          url: '/dogs/1234',
        })

We get the following error:

Expected header 'Content-Type' to have value 'application/octet-stream' but was 'application/gzip'

This is confusing and unexpected, because nowhere in my test setup was I expecting that type, and am sending the correct content-type header from the HTTP client.

If I change the test to instead use both application/octet-stream I get a different error:

Expected binary contents to have content type 'application/octet-stream' but detected contents was 'application/gzip'.

The assertions for the header are correct, but it's failing to match on the body.

Problem

We can make this work by modifying it so that the setup is this way:

interaction.withRequestBinaryBody(bytes, 'application/gzip');
...
 return axios
        .request({
          baseURL: `http://${HOST}:${port}`,
          headers: {
            'content-type': 'application/octet-stream',
          },
          data: bytes,
          method: 'POST',
          url: '/dogs/1234',
        })

However, aside from that being a little confusing, it's not reliable on Windows, which expects the content type to be as per the initial setup.

The current consequence is that I can't write a small feature spec in Pact JS that works x-platform, but more broadly it seems like a bug.

Proposed solution

1. Disable sys_fdo_magic in tree_magic_mini crate

From: https://crates.io/crates/tree_magic_mini/3.0.3

All mime information and relation information is loaded from the Shared MIME-info Database as described at https://specifications.freedesktop.org/shared-mime-info-spec/shared-mime-info-spec-latest.html. If you beleive that this is not present on your system, turn off the sys_fdo_magic feature flag.

If I understand that correctly, this would disable the reliance on behaviour of a third party library, and result in reliable behaviour in this crate.

2. Use content types

In the presence of a content-type header, do not use the MIME type sniffing to determine the data type.

There was some other options proposed in this slack thread: https://pact-foundation.slack.com/archives/CA2S7E6KC/p1642474173003300

Example builds

  1. https://github.com/pact-foundation/pact-js/runs/4848121252?check_suite_focus=true#step:4:7668 (linux passing, windows failing)
  2. https://github.com/pact-foundation/pact-js-core/runs/4812794228?check_suite_focus=true#step:4:361 (note the ubuntu build passes, but osx fails)

mefellows avatar Jan 18 '22 11:01 mefellows

I'm running into another variation of this issue today.

Previously, setting the content-type on pactffiWithBinaryFile as application/gzip and application/octet-stream with the actual request (on OSX/Linux) worked, but now I get this:

2022-08-14T00:13:56.457164Z DEBUG tokio-runtime-worker pact_matching: --> Mismatches: [HeaderMismatch { key: "Content-Type", expected: "application/gzip", actual: "application/octet-stream", mismatch: "Mismatch with header 'Content-Type': Expected header 'Content-Type' to have value 'application/gzip' but was 'application/octet-stream'" }, BodyTypeMismatch { expected: "application/gzip", actual: "application/octet-stream", mismatch: "Expected body with content type application/gzip but was application/octet-stream", expected_body: Some(b"\x1f\x8b\x08\0\0\0\0\0\0\x13+\xc9\xc8,V\0\xa2\xc4<\x85\xd4\xbc\xe4\xfc\x94\xd4\x14\x85\xe2\x92\xa2\xcc\xbct\0\xea\x15\"\xbc\x19\0\0\0"), actual_body: Some(b"\x1f\x8b\x08\0\0\0\0\0\0\x13+\xc9\xc8,V\0\xa2\xc4<\x85\xd4\xbc\xe4\xfc\x94\xd4\x14\x85\xe2\x92\xa2\xcc\xbct\0\xea\x15\"\xbc\x19\0\0\0") }]
2022-08-14T00:13:56.457401Z DEBUG tokio-runtime-worker pact_mock_server::hyper_server: Request did not match: Request did not match - HTTP Request ( method: POST, path: /dogs/1234, query: Some({"someParam": ["someValue"]}), headers: Some({"x-special-header": ["header"], "Content-Type": ["application/gzip"]}), body: Present(43 bytes) )    0) Mismatch with header 'Content-Type': Expected header 'Content-Type' to have value 'application/gzip' but was 'application/octet-stream'    1) expected 'application/gzip' body but was 'application/octet-stream'

If I change the expectation though to octet-stream and octet-stream...

2022-08-14T00:15:45.029187Z DEBUG tokio-runtime-worker pact_matching: --> Mismatches: [BodyMismatch { path: "$", expected: Some(b"\x1f\x8b\x08\0\0\0\0\0\0\x13+\xc9\xc8,V\0\xa2\xc4<\x85\xd4\xbc\xe4\xfc\x94\xd4\x14\x85\xe2\x92\xa2\xcc\xbct\0\xea\x15\"\xbc\x19\0\0\0"), actual: Some(b"\x1f\x8b\x08\0\0\0\0\0\0\x13+\xc9\xc8,V\0\xa2\xc4<\x85\xd4\xbc\xe4\xfc\x94\xd4\x14\x85\xe2\x92\xa2\xcc\xbct\0\xea\x15\"\xbc\x19\0\0\0"), mismatch: "Expected binary contents to have content type 'application/octet-stream' but detected contents was 'application/gzip'" }]
2022-08-14T00:15:45.029468Z DEBUG tokio-runtime-worker pact_mock_server::hyper_server: Request did not match: Request did not match - HTTP Request ( method: POST, path: /dogs/1234, query: Some({"someParam": ["someValue"]}), headers: Some({"x-special-header": ["header"], "Content-Type": ["application/octet-stream"]}), body: Present(43 bytes) )    0) $ -> Expected binary contents to have content type 'application/octet-stream' but detected contents was 'application/gzip'

If I change to gzip and gzip:

2022-08-14T00:28:05.169885Z DEBUG tokio-runtime-worker pact_matching: --> Mismatches: [BodyMismatch { path: "$", expected: Some(b"\x1f\x8b\x08\0\0\0\0\0\0\x13+\xc9\xc8,V\0\xa2\xc4<\x85\xd4\xbc\xe4\xfc\x94\xd4\x14\x85\xe2\x92\xa2\xcc\xbct\0\xea\x15\"\xbc\x19\0\0\0"), actual: Some(b"\x1f\x8b\x08\0\0\0\0\0\0\x13+\xc9\xc8,V\0\xa2\xc4<\x85\xd4\xbc\xe4\xfc\x94\xd4\x14\x85\xe2\x92\xa2\xcc\xbct\0\xea\x15\"\xbc\x19\0\0\0"), mismatch: "Could not parse expected value as UTF-8 text: invalid utf-8 sequence of 1 bytes from index 1" }, BodyMismatch { path: "$", expected: Some(b"\x1f\x8b\x08\0\0\0\0\0\0\x13+\xc9\xc8,V\0\xa2\xc4<\x85\xd4\xbc\xe4\xfc\x94\xd4\x14\x85\xe2\x92\xa2\xcc\xbct\0\xea\x15\"\xbc\x19\0\0\0"), actual: Some(b"\x1f\x8b\x08\0\0\0\0\0\0\x13+\xc9\xc8,V\0\xa2\xc4<\x85\xd4\xbc\xe4\xfc\x94\xd4\x14\x85\xe2\x92\xa2\xcc\xbct\0\xea\x15\"\xbc\x19\0\0\0"), mismatch: "Could not parse actual value as UTF-8 text: invalid utf-8 sequence of 1 bytes from index 1" }, BodyMismatch { path: "$", expected: Some(b"\x1f\x8b\x08\0\0\0\0\0\0\x13+\xc9\xc8,V\0\xa2\xc4<\x85\xd4\xbc\xe4\xfc\x94\xd4\x14\x85\xe2\x92\xa2\xcc\xbct\0\xea\x15\"\xbc\x19\0\0\0"), actual: Some(b"\x1f\x8b\x08\0\0\0\0\0\0\x13+\xc9\xc8,V\0\xa2\xc4<\x85\xd4\xbc\xe4\xfc\x94\xd4\x14\x85\xe2\x92\xa2\xcc\xbct\0\xea\x15\"\xbc\x19\0\0\0"), mismatch: "Unable to match '' using ContentType(\"application/gzip\")" }]
2022-08-14T00:28:05.170299Z DEBUG tokio-runtime-worker pact_mock_server::hyper_server: Request did not match: Request did not match - HTTP Request ( method: POST, path: /dogs/1234, query: Some({"someParam": ["someValue"]}), headers: Some({"Content-Type": ["application/gzip"], "x-special-header": ["header"]}), body: Present(43 bytes) )    0) $ -> Could not parse expected value as UTF-8 text: invalid utf-8 sequence of 1 bytes from index 1    1) $ -> Could not parse actual value as UTF-8 text: invalid utf-8 sequence of 1 bytes from index 1    2) $ -> Unable to match '' using ContentType("application/gzip")

If I change to octet-scream and gzip:

2022-08-14T00:28:43.797683Z DEBUG tokio-runtime-worker pact_matching: --> Mismatches: [HeaderMismatch { key: "Content-Type", expected: "application/octet-stream", actual: "application/gzip", mismatch: "Mismatch with header 'Content-Type': Expected header 'Content-Type' to have value 'application/octet-stream' but was 'application/gzip'" }, BodyTypeMismatch { expected: "application/octet-stream", actual: "application/gzip", mismatch: "Expected body with content type application/octet-stream but was application/gzip", expected_body: Some(b"\x1f\x8b\x08\0\0\0\0\0\0\x13+\xc9\xc8,V\0\xa2\xc4<\x85\xd4\xbc\xe4\xfc\x94\xd4\x14\x85\xe2\x92\xa2\xcc\xbct\0\xea\x15\"\xbc\x19\0\0\0"), actual_body: Some(b"\x1f\x8b\x08\0\0\0\0\0\0\x13+\xc9\xc8,V\0\xa2\xc4<\x85\xd4\xbc\xe4\xfc\x94\xd4\x14\x85\xe2\x92\xa2\xcc\xbct\0\xea\x15\"\xbc\x19\0\0\0") }]
2022-08-14T00:28:43.797917Z DEBUG tokio-runtime-worker pact_mock_server::hyper_server: Request did not match: Request did not match - HTTP Request ( method: POST, path: /dogs/1234, query: Some({"someParam": ["someValue"]}), headers: Some({"x-special-header": ["header"], "Content-Type": ["application/octet-stream"]}), body: Present(43 bytes) )    0) Mismatch with header 'Content-Type': Expected header 'Content-Type' to have value 'application/octet-stream' but was 'application/gzip'    1) expected 'application/octet-stream' body but was 'application/gzip'

So it seems I can't get this to work at all now.

It's worth noting that the Windows build still works, where application/octet-stream is sent as both the expected content type and the request header.

(For me, this was the failing commit/build where I noticed the change: https://github.com/pact-foundation/pact-js-core/runs/7822608710?check_suite_focus=true and the downstream impact to Pact JS: https://github.com/pact-foundation/pact-js/runs/7823153394?check_suite_focus=true#step:4:12345)

mefellows avatar Aug 14 '22 00:08 mefellows

Just a note on:

  1. Use content types

In the presence of a content-type header, do not use the MIME type sniffing to determine the data type.

This assumes the content type header is always correct, which may not be the case (i.e content type is application/jpeg but the body contains a application/pdf). There are people who want to test for this case.

rholshausen avatar Aug 17 '22 05:08 rholshausen

Fair point!

mefellows avatar Aug 17 '22 05:08 mefellows

Just revisiting this when looking at https://github.com/pact-foundation/pact-net/pull/447.

Would moving to https://docs.rs/tree_magic/latest/tree_magic/ be an option?

Uses system FreeDesktop.org magic files on Linux systems, and built-in magic file on Windows and macOS.

If I understand the current implementation, if the MIME database mentioned is present on the OS it will use it, otherwise it just falls back to application/octet-stream. If I further understand the claim of the tree_magic library, it would still have a DB to query, just not the same across OS. That still opens the potential for minor differences across OS I suppose, but I think it would be better than the current situation.

mefellows avatar Mar 28 '23 12:03 mefellows

Any option we choose I'd like the escape hatch of always allowing the user to override it.

There'll be times when the detection just gets it wrong. There'll be times when the detection just doesn't know (e.g. proprietary formats). There'll be times when it's almost, but not quite, correct due to things like extensions (e.g. identifying application/json when the user wants application/json+patch or json+schema).

For me, I'd even make the auto detection opt in rather than opt out. If it's on by default and it fails then you get a complete mystery of a test failure (as has been documented above). At least if you choose to opt in and it goes wrong then you've got a hint as to why.

adamrodger avatar Mar 28 '23 17:03 adamrodger

This was discussed in the maintainers channel, the only one who responded was Tim Jones: https://github.com/pact-foundation/pact-reference/pull/138#issuecomment-911127948

We are using tree_magic_mini (https://crates.io/crates/tree_magic_mini), which is a fork of tree_magic.

Licensing and the MIME database

By default, tree_magic_mini will attempt to load the shared MIME info database from the standard locations at runtime.

If you won't have the database files available, or would like to include them in your binary for simplicity, you can optionally embed the database information if you enable the tree_magic_db feature.

As the magic database files themselves are licensed under the GPL, you must make sure your project uses a compatible license if you enable this behaviour.

The decision was not to include any GPL licensed files in the Pact frameworks.

rholshausen avatar Mar 28 '23 22:03 rholshausen

Ah ok. So both libraries revert to the standard OS-specific locations? Interesting, that tells me Windows by default doesn't have a useful mime library because it can't detect application/gzip (the test I used in Pact JS Core) or image/jpeg (this PR).

The decision was not to include any GPL licensed files in the Pact frameworks.

I think that's the correct decision, unless we decide to change the license format which would in effect restrict the distribution and adoption of the tool.

mefellows avatar Mar 29 '23 01:03 mefellows

Windows (historically) uses the file extension to determine the file type.

rholshausen avatar Mar 29 '23 01:03 rholshausen

I've been mulling this over after coming across #305 and other issues and I think I am fond of @adamrodger's suggestions of

I think the default should be that the user has to specify a content type, with an opt-in for auto-detection. That way we can document that it may work differently on different OSs and the user can conciously make that choice. If it doesn't work, e.g. because they use different OS or the OS fails auto-detection, then they've always got the option to specify manually.

especially with regards of this code smell

Also, the round trip integration test that I've asked for will fail in CI because we run it on all 3 major OSs and verify the file contents afterwards. A test library that produces different results on different OSs isn't really desirable. We shouldn't be building OS specific switches into the tests either, that's an obvious code smell.

I like the fact that user has to be overt about the content-type they provided, and then state whether they'd like to use auto detection. Maybe it would warn if it couldn't detect and fall back to the user provided.

Q - Would we ask the user to set the name and value (so they could pick an Content-Type or content-type), or just have a content_type option where they provide the specified content_type value.

AFAIK HTTP headers are case insensitive and should be considered to be lowercased

YOU54F avatar Aug 03 '23 13:08 YOU54F

Had a bit of a scour and found https://crates.io/crates/infer/0.3.2 which is MIT licensed.

Looks like a nice drop in replacement.

Will push up an example on my fork and test out across the examples where I identified failure points. I've got pretty consistent repros and have been documenting the results so can add some test cases and fall backs

https://crates.io/crates/infer/0.3.2

YOU54F avatar Aug 07 '23 18:08 YOU54F

Nice find! Where it gets tricky is if the detection is incorrect, or if it doesn't know the narrow type (see the discussion about doc, ppt, xls all having the same magic bytes for example in the infer package), it will mismatch on the expected type - even if correct.

That's where breaking out of the auto-detection could be helpful. If it's at least the same x-platform, that's a good start though.

mefellows avatar Aug 07 '23 23:08 mefellows

Yeah, container types are a problem. The only library I've seen that deals with them is Apache Tika (which is used by Pact-JVM).

rholshausen avatar Aug 08 '23 00:08 rholshausen

Yeah I much prefer the default to be that the user specifies the content type, and an opt-in option for auto detection.

At least then when auto detection gets it wrong:

  1. It's pretty obvious to you why you've got a dodgy content type, since you explicitly decided to opt in to auto detection so it must be that

  2. It's obvious what to do next - disable auto detection and specify manually instead

If the default is auto detection then it's not obvious at all what to do when it goes wrong, and we know it'll go wrong often even if we find the most state of the art auto detection.

adamrodger avatar Aug 10 '23 19:08 adamrodger

My thinking is that this becomes a documentation issue for maintainers if we can solve the cross-platform auto detection.

The content-type matcher explicitly uses byte detection to check the actual content type. As per Ron's comment above, if we rely on purely the headers, the actual body may not be of the correct type.

The escape hatch is to not use the content-type matcher, and simply use all of the other primitives - look at the content-type itself, match content-length etc.

mefellows avatar Aug 10 '23 22:08 mefellows

How does that work in terms of the spec though? Unless the spec can define what will happen during that auto detection then the spec becomes ambiguous.

If the behaviour relies on the implementation details of a particular Rust crate and/or OS then that's not really a spec unless we can fully define it to maintain compatibility. I don't think we can make that guarantee.

On Thu, 10 Aug 2023, 18:27 Matt Fellows, @.***> wrote:

My thinking is that this becomes a documentation issue for maintainers if we can solve the cross-platform auto detection.

The content-type matcher explicitly uses byte detection to check the actual content type. As per Ron's comment above, if we rely on purely the headers, the actual body may not be of the correct type.

The escape hatch is to not use the content-type matcher, and simply use all of the other primitives - look at the content-type itself, match content-length etc.

— Reply to this email directly, view it on GitHub https://github.com/pact-foundation/pact-reference/issues/171#issuecomment-1674005625, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAD4FKWGBS2QVWJ6LF4JF7DXUVN6JANCNFSM5MG667RA . You are receiving this because you were mentioned.Message ID: @.***>

adamrodger avatar Aug 14 '23 15:08 adamrodger

All the spec has to say about it (currently) is this:

Screenshot 2023-08-15 at 11 15 00 am

That is, there is an explicit matcher that you can use to sniff the actual payload and confirm if it matches the content type expected in the test.

I think what I am saying above is at least consistent with the spec, and with what (I think) you're asking:

Options:

  1. Only compare the metadata around the request (content-type header, content-length if necessary, and the usual HTTP bits).
  2. 1 + additionally the magic byte check

(1) is already supported by the usual implementation bits - you don't need to do anything special (2) is the "opt-in" bit, where you say "I'd also like the magic byte check please"

The problem at the moment is that (2) is inconsistent across platforms and perhaps the spec could be made to be clearer.

If the behaviour relies on the implementation details of a particular Rust crate and/or OS then that's not really a spec unless we can fully define it to maintain compatibility.

No, it shouldn't, to the degree we can control this. It's a non-trivial problem, but it sounds like we may have a path forward:

  1. The crate mentioned by Yousaf
  2. Installation instructions to communicate how to get the right setup for consistent matching i.e. documentation (this is less than ideal, obviously)

mefellows avatar Aug 15 '23 01:08 mefellows