typescript-go icon indicating copy to clipboard operation
typescript-go copied to clipboard

tsgo panics with invalid UTF-8 when marshalling build info

Open tnowad opened this issue 4 months ago • 9 comments

Stack trace

panic: Failed to marshal build info: json: cannot marshal from Go incremental.BuildInfoDiagnosticsOfFile within "/semanticDiagnosticsPerFile/0": invalid UTF-8 within "/1/1/message" after offset 319

goroutine 1 [running]:
github.com/microsoft/typescript-go/internal/incremental.(*Program).emitBuildInfo(0xc002a0c180, {0xdb5728, 0x14ad4e0}, {0x0?, 0xd8?, 0x0?})
        github.com/microsoft/typescript-go/internal/incremental/program.go:264 +0x588
github.com/microsoft/typescript-go/internal/incremental.(*emitFilesHandler).emitAllAffectedFiles(0xc000b35590, {0x0?, 0x1e?, 0x0?})
        github.com/microsoft/typescript-go/internal/incremental/emitfileshandler.go:132 +0x696
github.com/microsoft/typescript-go/internal/incremental.emitFiles({0xdb5728, 0x14ad4e0}, 0xc002a0c180, {0x0?, 0x0?, 0x0?}, 0x0)
        github.com/microsoft/typescript-go/internal/incremental/emitfileshandler.go:273 +0x1a5
github.com/microsoft/typescript-go/internal/incremental.(*Program).Emit(0xc002a0c180, {0xdb5728, 0x14ad4e0}, {0x0?, 0x0?, 0x0?})
        github.com/microsoft/typescript-go/internal/incremental/program.go:193 +0x307
github.com/microsoft/typescript-go/internal/execute.emitFilesAndReportErrors({0xdb8450, 0xc000144000}, {0xdbb000, 0xc002a0c180}, 0xc0001285e0)
        github.com/microsoft/typescript-go/internal/execute/tsc.go:384 +0x1cf
github.com/microsoft/typescript-go/internal/execute.emitAndReportStatistics({0xdb8450, 0xc000144000}, {0xdbb000?, 0xc002a0c180?}, 0xc000002000, 0xc0001560f0, 0x0?, 0x6325c, 0x299e9ef, 0x7ffd6e, ...)
        github.com/microsoft/typescript-go/internal/execute/tsc.go:308 +0x6d
github.com/microsoft/typescript-go/internal/execute.performIncrementalCompilation({0xdb8450, 0xc000144000}, 0xc0001560f0, 0xc0001285e0, 0xc00011e240, 0x6325c, 0x0)
        github.com/microsoft/typescript-go/internal/execute/tsc.go:250 +0x40c
github.com/microsoft/typescript-go/internal/execute.tscCompilation({0xdb8450, 0xc000144000}, 0xc000156000, 0x0)
        github.com/microsoft/typescript-go/internal/execute/tsc.go:194 +0xb53
github.com/microsoft/typescript-go/internal/execute.CommandLine({0xdb8450, 0xc000144000}, {0xc0000200a0, 0x0, 0x0}, 0x0)
        github.com/microsoft/typescript-go/internal/execute/tsc.go:63 +0x17e
main.runMain()
        github.com/microsoft/typescript-go/cmd/tsgo/main.go:23 +0x109
main.main()
        github.com/microsoft/typescript-go/cmd/tsgo/main.go:10 +0x13

Steps to reproduce

  1. Create a .ts file with this content:
const createFileListFromFiles = (files: File[]): FileList => {
  const fileList: FileList = {
    length: files.length,
    item: (index: number): File | null => files[index] || null,
    [Symbol.iterator]: function* (): IterableIterator<File> {
      for (const file of files) yield file;
    },
    ...files,
  } as unknown as FileList;

  return fileList;
};
  1. Run:
tsgo

tnowad avatar Aug 07 '25 09:08 tnowad

This is a side effect of us switching to json/v2, which will error when trying to send strings that are not UTF-8 safe.

In our case, our internal symbol names begin with an illegal UTF-8 sequence to discriminate them from other names, and it sounds like a diagnostic is attempting to encode that, which we then serialize.

Throwing this in the playground, and yeah:

Playground Link

'length' is specified more than once, so this usage will be overwritten.
'__@iterator@17' is specified more than once, so this usage will be overwritten.

The __@ in tsgo is handled differently. I think we need to sanitize the strings that go into diagnostics, but this is going to be a generic problem for the API as well.

jakebailey avatar Aug 07 '25 18:08 jakebailey

I noticed that ts-go uses the following code to begin internal symbol names: https://github.com/microsoft/typescript-go/blob/72d2ee651371b71ed798ffbb721055b32d926d2a/internal/ast/symbol.go#L30

So we can use strings.ReplaceAll(s, InternalSymbolNamePrefix, "") to remove illegal characters. Specifically, this applies to both buildInfo.SemanticDiagnosticsPerFile and buildInfo. EmitDiagnosticsPerFile .

https://github.com/microsoft/typescript-go/blob/a57f4e0479b9af0ff7d7a89ded898139977f3227/internal/incremental/buildInfo.go#L430-L431

@jakebailey WDYT? If that sounds okay, I will make a PR for this later.

mayooot avatar Aug 08 '25 06:08 mayooot

That's not okay, because they will look like real names. We should likely instead convert them to something more like the old code with double underscores or similar.

But simply replacing them is a band aid for a larger problem, unfortunately.

jakebailey avatar Aug 08 '25 06:08 jakebailey

That's not okay, because they will look like real names. We should likely instead convert them to something more like the old code with double underscores or similar.

But simply replacing them is a band aid for a larger problem, unfortunately.

Ok, do you have any thoughts on how to fix this? I'd be happy to assist.

mayooot avatar Aug 08 '25 07:08 mayooot

I'm thinking about it. Maybe for now we just do the hotfix and make sure BuildInfoDiagnostic, buildInfoDiagnosticWithFileName, buildInfoDiagnosticWithFileName, replace ast.InternalSymbolNamePrefix with __.

jakebailey avatar Aug 11 '25 20:08 jakebailey

That or, we put JSON marshalling in compat mode for this one difference in buildinfo, but that won't work for other stuff in the LSP.

jakebailey avatar Aug 11 '25 20:08 jakebailey

we need to handle this part for mashalling as well as unmarshalling otherwise errors wont be consistent.

sheetalkamat avatar Aug 11 '25 23:08 sheetalkamat

Yeah, I forgot that we reproduce the errors too ☹️

I have a branch where I've changed the internal prefix to something we can roundtrip, but I'm not 100% certain it's correct because we sometimes turn internal symbol names into string names for types.

jakebailey avatar Aug 11 '25 23:08 jakebailey

Of course, internal symbols ever appearing in public anything are bugs. (In the API, that's okay, but not so much elsewhere.)

We also have some internal symbols that use a = prefix, which might imply we could get away with any non-identifier string, which could make my PR correct enough.

jakebailey avatar Aug 11 '25 23:08 jakebailey