
Encoding a value with a union schema produces surprising results

Open myndzi opened this issue 1 month ago • 5 comments

Hi there. Congrats on 1.0 release, I'm excited to migrate all my code 😂 (actually though!)

The codebase I encountered this situation in is still on an older version of typebox, but the typebox code that I tracked it back to seems to still be present in master, so I'm filing this anyway.

--

I encountered something unexpected while working on some code performing OAuth2 duties. Since the spec allows for auth via either headers or body parameters, I created a schema that is essentially "maybe a given type, or maybe a given type plus the two body parameters":

import { Type as T, type TProperties } from '@sinclair/typebox';

// BodyAuth (defined elsewhere) holds the client_id / client_secret body parameters
const TMaybeBodyAuth = <T extends TProperties>(properties: T) =>
  T.Union([
    // Basic Auth is preferred
    // https://datatracker.ietf.org/doc/html/rfc6749#section-2.3.1
    T.Object({
      ...BodyAuth,
      ...properties,
    }),
    T.Object(properties),
  ]);

I've also created a type for holding secrets in a weakmap, to avoid accidentally exposing them, and used a transform type to handle wrapping/unwrapping secrets as early/late as possible:

import { Type as T, type TSchema, type StaticDecode, type StaticEncode } from '@sinclair/typebox';

export const TSecret = <T extends TSchema>(type: T, label?: string) => {
  return T.Transform(type)
    .Decode(
      (value: StaticEncode<T>): Secret<StaticDecode<T>> =>
        new Secret(value, label)
    )
    .Encode(
      (secret: Secret<StaticDecode<T>>): StaticEncode<T> => secret.unwrap()
    );
};
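
For reference, a minimal sketch of what such a weakmap-backed Secret could look like (illustrative only, not my actual class):

const vault = new WeakMap<object, unknown>();

class Secret<T> {
  constructor(value: T, private readonly label?: string) {
    vault.set(this, value);
  }
  // The value lives only in the WeakMap, never on the object itself, so
  // logging or JSON-serializing a Secret can't accidentally expose it.
  unwrap(): T {
    return vault.get(this) as T;
  }
  toString(): string {
    return `Secret(${this.label ?? 'unlabeled'})`;
  }
  toJSON(): string {
    return this.toString();
  }
}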

I finally got to a point where I'm doing end-to-end testing against an OAuth server, and was surprised to be getting 4xx errors saying "invalid client secret". After some debugging, I found that the body did not include the unwrapped secret, for the following combination of reasons:

  • The data I wanted it to encode satisfies both members of the union
  • Typebox (presumably for performance reasons) checks for a valid schema from all union members with untransformed data, and only then re-attempts with the transformed versions: https://github.com/sinclairzx81/typebox/blob/5130ee57a721d58a43dfa180f8603a4530231302/src/value/transform/encode.ts#L200-L212
  • I had ordered the union members backwards (the "shorter" one first), but this turned out not to matter
  • I had not specified { additionalProperties: false }

Setting additionalProperties to false on the object does produce the expected result:

(with additionalProperties: false)

  value: {
    client_id: 'foo',
    client_secret: 'bar',
    grant_type: 'client_credentials'
  }

(without additionalProperties: false)

  value: {
    client_id: 'foo',
    client_secret: Secret(client_secret),
    grant_type: 'client_credentials'
  }
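
For reference, a condensed sketch of the whole setup (assuming the TMaybeBodyAuth and TSecret helpers above, with an illustrative stand-in for the real BodyAuth):

import { Type } from '@sinclair/typebox';
import { Value } from '@sinclair/typebox/value';

// Illustrative stand-in for the real BodyAuth properties
const BodyAuth = {
  client_id: Type.String(),
  client_secret: TSecret(Type.String(), 'client_secret'),
};

const TTokenRequest = TMaybeBodyAuth({
  grant_type: Type.Literal('client_credentials'),
});

// Without additionalProperties: false, the shorter (transform-free) variant
// also matches the input, so the Secret is never unwrapped on Encode.
const body = Value.Encode(TTokenRequest, {
  client_id: 'foo',
  client_secret: new Secret('bar', 'client_secret'),
  grant_type: 'client_credentials',
});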

... so this isn't exactly a bug, just a fairly sharp corner that I banged my knee on.

While the behavior is explicable and makes sense in light of my discoveries, it might be worth considering different logic here. If I'm explicitly calling "Encode" on some data, I am not expecting it to return unencoded data -- so it's surprising to me that the first loop exists at all. The Encode function also appears to operate on the data in place, so my output included the client_id and client_secret properties because they were present on the input, but their content was unexpected because the schema being applied was not the one I expected. (This may have changed in the new version, but I have a lot of work to do before I can verify experimentally - perhaps Clone keeps properties when additionalProperties is true?)

This mismatch didn't really surface while coding because TypeScript doesn't enforce exact property shapes - so it wasn't immediately obvious at the type level. Things typechecked fine, but the implementation wound up behaving unexpectedly.
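
To illustrate the type-level side (a minimal sketch with illustrative names):

// Structural typing: a value carrying extra properties still satisfies the
// narrower union member, so both variants typecheck.
type Narrow = { grant_type: string };
type Wide = { grant_type: string; client_id: string; client_secret: string };

const value = { grant_type: 'client_credentials', client_id: 'foo', client_secret: 'bar' };
const asNarrow: Narrow = value; // ok - excess property checks only apply to fresh object literals
const asWide: Wide = value;     // also ok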

Thanks again for the awesome library, I've been using it for quite some time now and it's still my go-to for good reason!

myndzi · Dec 05 '25 22:12

@myndzi Hi, nice to hear from you. Thanks for the kind words :)

... so this isn't exactly a bug, just a fairly sharp corner that I banged my knee on.

Yeah, Codecs / Transforms haven't been working out very well in TypeBox unfortunately and I am considering alternative designs for upcoming versions (which will likely be breaking).

The work in Version 1.0 was focused specifically on two main aspects of TypeBox:

  • TypeScript Emulation and Script Engine (TypeScript)
  • JSON Schema 2020-12 Compliant Validation Engine (AJV)

However, the Value.* APIs were ported from 0.34.x verbatim (there hasn't been much work done on them since 0.34.x). The long-term plan is to overhaul Value.* in line with the JSON Schema view of schematics (currently these functions operate on TypeBox-specific views of the schematics). I couldn't achieve this by the end of 2025 (it would have postponed 1.0), so it will be deferred to 2026 and released under 1.x or possibly 2.0.

Codec Issue

The problems with Codec relate to ambiguous encode/decode for logical types (Union / Intersect), but especially Union, where TB's internal implementations need to execute multiple decode/encode paths (one per variant in the Union) to try to determine which decoded form is correct. The linked logic looks like it was trying to accommodate multiple possible resolved values (with early exit for matching values). It's very complex, and this ambiguous resolution logic exists in other parts of the codebase too (it's one of those academic problems I'm not convinced has a solution).
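
For concreteness, a minimal illustration of the ambiguity (sketched with 0.34.x Transform syntax, not the internal implementation):

import { Type } from '@sinclair/typebox';

// Both variants of the union can claim a decoded value, so Encode has to
// guess which variant's decode path produced it.
const T = Type.Transform(Type.Union([Type.Number(), Type.String()]))
  .Decode(value => String(value)) // 42 -> '42', '42' -> '42'
  .Encode(value => value)         // '42' -> 42? or '42'? Ambiguous.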


As of writing, there are 3 options on the table, some more immediate, others more long term.

Option 1: Disallow Codec Usage with Logical Types (Union | Intersect)

It might be possible to detect ambiguous resolution early and throw (simplifying internal Codec handling based on predicate assumptions); this would be most meaningful for Union. This could go out under a minor revision.

// Throw: Codec(A | B) - ambiguous
const T = Type.Codec(
  Type.Union([Type.Number(), Type.String()])
)

// Ok: Codec(A) | Codec(B)
const U = Type.Union([
  Type.Codec(Type.Number()),
  Type.Codec(Type.String()),
])

Option 2: Application of Codec Finalizes Type

Another approach I have been thinking about is disallowing sub-schema Codec assignment entirely: when a Codec finalizes a Type, no further composition of that type is possible. This lifts Codec application to a high level, where the Codec is written against the fully materialized type (Vector3 in this case) as opposed to sub-schema types (i.e. String to Number and back per property).

This would be fundamentally breaking (as well as reducing the utility of sub-schema auto encode/decode).

const T = Type.Object({
  x: Type.String(),
  y: Type.String(),
  z: Type.String(),
})

const C = Type.Codec(T)
  .Decode(value => ({
    x: parseFloat(value.x),
    y: parseFloat(value.y),
    z: parseFloat(value.z),
  }))
  .Encode(value => ({
    x: value.x.toString(),
    y: value.y.toString(),
    z: value.z.toString(),
  }))

const X = Type.Object({ x: C }) // throw!!

Option 3: Drop Codec Support

I have been weighing this option also. Codecs have been in the library for a number of years now, but many users struggle with them (particularly with logical types - and application to Union seems to be the first thing people try!). I've had a really hard time trying to support them.

I have considered pulling Codecs out of TypeBox entirely and providing support via an external library. This would allow Codecs to evolve separately from TypeBox; the Codecs package would be rewritten over JSON Schema schematics (not TypeBox specifically), so other JSON Schema builders could get some use out of it.

$ npm install @typebox/codecs # external package

So, this is where things are at. The Value.* functions should operate mostly as they did in 0.34.x, but there is work to do to bring these functions in line with 1.0 and its internal changes (if you compare the master and main branches, you'll see 1.0 is a near-complete re-architecture of TB!)

I am open to PRs to fix this issue, so if you want to submit a PR to fix the union logic, that would be ok. Also, I could use some community help in general, as TypeBox has become so large these days. Community help establishing organizational packages (like @typebox/codecs) would be a good way to distribute some aspects of TB and foster experimentation in the ecosystem.

I have some ideas to consolidate several libraries under the @typebox/ organizational package.

npm install @typebox/codecs    # input/output encoding, support for optimized JSON / CBOR serialization
npm install @typebox/codegen   # generate typescript code from json schematics (and vice versa)
npm install @typebox/map       # transform between JSON Schema, JSON Type Definition, ProtoBuf, XSD
npm install @typebox/react     # generate dynamic / reactive forms from TypeBox types
npm install @typebox/driver    # integration middleware for Standard Schema, JSON Schema and TypeBox

Related: https://github.com/sinclairzx81/typebox/issues/306#issuecomment-1375555234 Serialization/deserialization considerations

Some of these are being worked on currently:

  • https://github.com/sinclairzx81/typedriver
  • https://github.com/sinclairzx81/typebox-codegen

So, big response today :) but I did want to provide some insight into post-1.0 planning. If you do want to submit a PR to resolve the Codec issue (an immediate fix), that would be fine.

Again, nice to hear from you! Cheers S

sinclairzx81 · Dec 06 '25 05:12

Thanks for the thorough response :)

Addressing the current codebase

I'm willing to attempt a PR, but I am not confident in my global understanding of correctness here.

The problems with Codec relate to ambiguous encode/decode for logical types (Union / Intersect), but especially Union, where TB's internal implementations need to execute multiple decode/encode paths (one per variant in the Union) to try to determine which decoded form is correct. The linked logic looks like it was trying to accommodate multiple possible resolved values

I get the gist of what you're saying, but I'm not sure I understand it concretely enough to know what change to make. I can dig into the code a bit. I think it would be reasonable to test the union members in the order they are provided, at the very least? This is still a bit undiscoverable, but allows the user full control of priority.

Do you have any cautions or preferred solutions, beyond "tests pass"? That is, if we're just trying to avoid the footgun but not change the architecture, how should I go about it?
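
To make the "test in provided order" idea concrete, a hypothetical sketch over the public Value functions (not the actual internals):

import { type TSchema } from '@sinclair/typebox';
import { Value } from '@sinclair/typebox/value';

// Try each variant's Encode in declaration order; the first variant whose
// encoded output validates against that variant wins.
function encodeUnionInOrder(variants: TSchema[], value: unknown): unknown {
  for (const variant of variants) {
    try {
      const encoded = Value.Encode(variant, value); // runs this variant's transforms
      if (Value.Check(variant, encoded)) return encoded;
    } catch {
      // this variant didn't apply; try the next one
    }
  }
  throw new Error('value did not match any union variant');
}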

Some general thoughts on data transformation / codecs

These days I'm thinking about data transformation in stages.

Stage 1: I have unknown data, which I expect to conform to a certain shape, and I want to verify that (validation)
Stage 2: I have data which I've verified conforms to my expectation, which I now want to transform into the form I want to work with (transformation)

"Stage 1" data benefits heavily from composition. I may know at a document level what format I expect timestamps, currency values, or large numeric values to be, for a specific data source I'm interacting with. I want to define a schema by composing small reusable pieces (this response has an object with a property named "date" and a value type that's a fractional epoch timestamp in seconds). Typebox shines here, and this is what I tend to use the Transform type for. This aligns with the kind of usage you demonstrate in "Option 1"

"Stage 2" data benefits less from composition; I am generally applying this on whole objects, or arrays of objects, and sometimes on objects composed of other discrete "chunks" composed together. Here I want to declare functions that are broader in scope: "convert this whole row of the database response" or "convert this property value of this HTTP response". The transformations I'm performing are things like "normalize the object keys", "turn an array into a set", or "convert relative time into absolute time" (e.g. OAuth's "expires_in" property to a Temporal.Instant). This aligns with the kind of usage you demonstrate in "Option 2"

I think some of the trouble is that there are blurry boundaries between these two stages, and they can benefit from tight integration with each other. While I don't usually want to perform "stage 2" transformations at the level of a single scalar, I still want to compose these transformations much like I would compose "stage 1" validations: "This API returns me an object where the data property is an array of User" -> I want to declare a function to transform "users" and one to transform "responses" in exactly the same way I would declare validations. If there's a failure in a transformation I want to know which specific piece of data had the problem and receive a clear explanation, in exactly the same way as I would get for a validation failure.

So, while the two stages are conceptually distinct, for ergonomic reasons I want to treat them the same, and this is why the idea of the Transform/Codec type is appealing. As a user, I just want to "insert data, receive data", check whether there was a problem exactly once, and be able to clearly identify the cause of the problem if there was one. I want to declare "what I want to happen" in one place, clearly, to provide this experience.

Option 4?

One direction that might support a clearer separation of these two intents while providing tight integration is something like user-defined types + hooks.

By user-defined types, I mean that it would be possible to implement e.g. Type.Date completely via Typebox's public API. Right now, the tools I have are:

  • Unsafe: JSON schema spec can be changed, data stays the same; yolo
  • Codec: JSON schema spec stays the same, data can be changed; some error leakage
  • Built-in types: convenient; clear error messages

There's nothing(?) for "specify the JSON schema [or build one from pre-existing types]" and "write code to perform the validation and/or scalar-level coercion [or build it from pre-existing types]" and "provide data to the validator to explain a problem". If this existed, it would open the door for additional logic that could currently only be implemented with Transforms, removing one reason for them to exist.

By hooks, I mean that it could be possible to associate some code with a node in the Typebox schema. From my brief glance earlier, Typebox schemas now seem to have a ~codec key with encode/decode functions, so the underpinnings of this may already be in place. It's presumably created/defined by Codec, but it could more directly be a set of functions provided by the user instead. The Typebox end of things could be as simple as "give me a function to run when I get here / after I've validated the data" and providing that function the ability to integrate with error collection.

As an example:

// the type of `value` would have to be provided by the schema that the codec was added to,
// but here i'm only showing an example of what this property might contain
schema[`~codec`].encode = (value, next /* ?? */) => {
  // node-style continuation passing is a bit old school, but it could
  // provide explicit differentiation between success, failure, and code defects?
  // return-or-throw could work, or some hybrid... 
  return camelToSnakeCase(value);
};
Value.Encode(schema, value, {case: 'snake_case'});

The useful properties of this approach are that a codec needn't be defined on every schema, only the ones where it makes sense; the implementation of the codec can provide error messages; Typebox handles traversal, association of data values with codec functions, and type safety / correctness, which means you can really just write:

const encode = (value: WhatIExpect): WhatIProvide => ...

Being able to define hooks like this would also decouple the underlying JSON schema association from the Transform type's current implementation. There is no longer a separate Transform responsible for supplying the JSON schema or the implementation of "how to validate" its input; instead, Typebox just knows how to turn already-validated input into another type.

Something like this could remove the other use-case for Transform existing. Users would use custom schema types to represent expectations (data that means the same thing) and hooks to provide transformations (alter the shape/content/meaning of data to align with the application's internal needs)

myndzi · Dec 06 '25 08:12

I was writing a post to follow up a little after reading #1333 but I wound up nerd-sniping myself:

https://tsplay.dev/mLnpAw

This demonstrates a technique I recently sorta worked out for tracking and transforming types as unions, where I may want to select or operate on some subset of those types.

In context here, it's able to know the encoded and decoded types of a schema that may or may not contain codecs, and also perform unions on codecs or create codecs on unions, e.g.

// const TNumberAsString: Schema<Variant<number, "encoded"> | Variant<string, "decoded">>
const TNumberAsString = createCodec(TNumber, {
  decode: (v: number): string => String(v),
  encode: (v: string): number => parseInt(v),
});

// const TNumberOrString: Schema<Variant<string | number, "encoded"> | Variant<string | number, "decoded">>
const TNumberOrString = union([TNumber, TString]);

// type encodedNumberOrString = string | number
type encodedNumberOrString = StaticEncode<typeof TNumberOrString>;

// type decodedNumberOrString = string | number
type decodedNumberOrString = StaticDecode<typeof TNumberOrString>;

// const TNumberOrStringAsString: Schema<Variant<string, "decoded"> | Variant<string | number, "encoded">>
const TNumberOrStringAsString = createCodec(TNumberOrString, {
  decode: v => String(v),
  encode: v => v,
});

// type foo = string | number
type foo = StaticEncode<typeof TNumberOrStringAsString>;

// type bar = string
type bar = StaticDecode<typeof TNumberOrStringAsString>;

// const TComplex: Schema<Variant<string | number | boolean, "encoded"> | Variant<string | boolean, "decoded">>
const TComplex = union([TNumberOrStringAsString, TBoolean]);

One caveat that directly relates to the OP is that, for simplicity of the code, I implemented the codec "lifting" behavior in the union function in such a way that it merely tries the union members in the order they were provided. This can be seen in TBadOrder and TGoodOrder in the output:

{
  name: 'TBadOrder',
  decoded: [
    [ 123, 123 ], // doesn't convert to string
    [ '456', '456' ],
  ],
}
{
  name: 'TGoodOrder',
  decoded: [
    [ 123, '123' ], // does convert to string
    [ '456', '456' ],
  ],
}

The strategy here could perhaps be different; for example, it could check all schemas that have codecs and only fall back to schemas without codecs if none are valid. This is easy to accomplish in the union() function by simply cloning the members array and reordering the clone. Then you'd be guaranteed never to skip a codec you've specified for a property, so long as the data is a valid instance of that codec's type.
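
For example, the reordering could look something like this (hypothetical; the codec property name follows the playground sketch):

// Prefer codec-bearing members; Array.prototype.sort is stable, so
// declaration order is preserved within each group.
const codecFirst = <S extends { codec?: unknown }>(members: readonly S[]): S[] =>
  [...members].sort(
    (a, b) => Number(b.codec !== undefined) - Number(a.codec !== undefined)
  );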

The general idea is that the type of a schema exists in a kind of superposition; the "encoded type" and the "decoded type". When we want to know "what this schema would decode to", we extract the decoded type from the union, and when we want to know "what this schema would encode to", we extract the encoded type. Both types get passed around and can be modified - union demonstrates this by turning Schema<T>|Schema<U> into Schema<T|U>.

myndzi · Dec 06 '25 18:12

So.... I goofed.

When I was looking to see if this code had changed, I looked up the old version and tried to find where it went in the new code. In the process, I edited the url to change the commit SHA to "main", which didn't work, and then "master", which did. This is my shortcut for "show me the current code", but I was guessing at the main branch name.

However, since I was viewing a file that didn't exist on main, I mistakenly believed that "main" didn't exist. I later (while reading some issues) learned that it does, and is in fact the current release.

I spent a little time familiarizing myself with the codebase in preparation for maybe creating a PR, and learned that the code I originally linked very much isn't there anymore :) It's still a little unexpected, but Typebox does respect the ordering of the union members passed to Union.

The rest of my comments are probably still relevant, but the OP is inaccurate with respect to Typebox 1.0. I'm not sure that the approach I described for dealing with "multiple types per schema" would readily slot into Typebox's architecture, so I'm not sure that there's anything actionable here that doesn't involve a large refactor, unless it's to design a different decoding strategy for unions?

Sorry for taking your time with all my yapping, but perhaps it's interesting or gives some ideas. Happy to talk over some stuff and see if I can contribute something if you want to.

myndzi · Dec 07 '25 02:12

Unrelated: I looked to see if there's a way I can contact you non-publicly, but don't see anything. Are you on Discord, or is there an e-mail address I can send to?

myndzi · Dec 07 '25 02:12