`core:os` and `core:os/os2`: `read` on Windows loses/corrupts code points

Open Neirokan opened this issue 1 month ago • 2 comments

Context

    Odin:    dev-2025-11-nightly:1fb60c4
    OS:      Windows 11 Professional (version: 24H2), build 26100.3476
    CPU:     AMD Ryzen 5 5600X 6-Core Processor
    RAM:     32693 MiB
    Backend: LLVM 20.1.0

Testing this properly requires UTF-8 encoding support in the terminal. Technically these are multiple bugs, but they are all related to a single procedure in the implementation, read_console: https://github.com/odin-lang/Odin/blob/1fb60c43481506da4f47f75ca4e0de1c70446cf0/core/os/os2/file_windows.odin#L317

Expected Behavior

Behavior

Basically, I expect read_console not to corrupt data when the buffer is too small to fit a full line in one call, by preserving a little state between calls (at least 1 code point). The full list is:

  1. read_console properly copies input into buffer
  2. read_console properly reads UTF-16 surrogate pairs even if a pair is split across 2 win32.ReadConsoleW calls.
  3. if the last code point doesn't fully fit into the buffer, read_console emits the rest of it on the next call OR holds the whole code point until the next call (if the current call has written at least something? I'm not sure what the proper behavior should be if len(buffer) < 4). Both ways work for me, ~~but I would prefer the second one.~~ UPD: actually, it should probably match other platforms, so the first one?
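
To illustrate point 2, here is a minimal sketch of carrying a high surrogate over from one UTF-16 chunk to the next. It is not the actual `os2` implementation: `Console_State`, `feed` and `append_rune` are hypothetical names invented for this example, and the two chunks stand in for the results of two consecutive win32.ReadConsoleW calls. Point 3 would need a similar carry-over for already encoded UTF-8 bytes, which this sketch does not cover.

```odin
package main

import "core:fmt"
import "core:unicode/utf8"

// Hypothetical state preserved between reads (an assumption for this sketch).
Console_State :: struct {
	pending_high: u16, // high surrogate left over from the previous chunk, 0 if none
}

REPLACEMENT :: rune(0xFFFD)

// Appends the UTF-8 encoding of r to out.
append_rune :: proc(out: ^[dynamic]u8, r: rune) {
	bytes, n := utf8.encode_rune(r)
	append(out, ..bytes[:n])
}

// Converts one UTF-16 chunk to UTF-8. A trailing high surrogate is held back
// in state instead of being replaced, so the next chunk can complete the pair.
feed :: proc(state: ^Console_State, chunk: []u16, out: ^[dynamic]u8) {
	for unit in chunk {
		is_high := unit >= 0xD800 && unit <= 0xDBFF
		is_low  := unit >= 0xDC00 && unit <= 0xDFFF

		if state.pending_high != 0 {
			hi := state.pending_high
			state.pending_high = 0
			if is_low {
				r := rune(0x10000) + (rune(hi - 0xD800) << 10) + rune(unit - 0xDC00)
				append_rune(out, r)
				continue
			}
			// The held high surrogate was never completed.
			append_rune(out, REPLACEMENT)
		}

		if is_high {
			state.pending_high = unit
		} else if is_low {
			// Low surrogate without a preceding high surrogate.
			append_rune(out, REPLACEMENT)
		} else {
			append_rune(out, rune(unit))
		}
	}
}

main :: proc() {
	state: Console_State
	out: [dynamic]u8
	defer delete(out)

	// U+1F642 is the surrogate pair D83D DE42; split it across two chunks.
	feed(&state, []u16{0xD83D}, &out)
	feed(&state, []u16{0xDE42, 0x000D, 0x000A}, &out)
	fmt.printf("%d\n", out[:]) // [240, 159, 153, 130, 13, 10]
}
```

With the pair split across the two `feed` calls, the result matches the second line of the expected output below rather than two replacement characters.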

Output for provided example

[240, 159, 153, 130, 240, 159, 153, 130, 13, 10]
[240, 159, 153, 130, 13, 10]

Current Behavior

Behavior

  1. read_console loses part of the input because the copying loop compares n+i < len(b), but both variables are incremented (n at the end of the loop body), so it should compare only n < len(b). (This is actually #5086, but because of the other 2 points I decided to create this issue.) https://github.com/odin-lang/Odin/blob/1fb60c43481506da4f47f75ca4e0de1c70446cf0/core/os/os2/file_windows.odin#L344
  2. read_console replaces a surrogate pair with 2 REPLACEMENT_CHARs if the pair gets split across 2 win32.ReadConsoleW calls.
  3. read_console copies only part of a code point if it doesn't fit into the buffer and loses the rest of it.
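
The loop condition from point 1 can be modelled in isolation. The sketch below is a simplified stand-in, not the code from file_windows.odin: `copy_buggy` and `copy_fixed` are hypothetical helpers that differ only in the bound check, and both start with n = 0.

```odin
package main

import "core:fmt"

// Bound check as described in point 1: n and i both advance each iteration,
// so n+i reaches len(dst) after only half of dst has been filled.
copy_buggy :: proc(dst, src: []u8) -> (n: int) {
	for i := 0; i < len(src) && n+i < len(dst); i += 1 {
		dst[n] = src[i]
		n += 1
	}
	return
}

// Proposed check: only the write position n is compared against len(dst).
copy_fixed :: proc(dst, src: []u8) -> (n: int) {
	for i := 0; i < len(src) && n < len(dst); i += 1 {
		dst[n] = src[i]
		n += 1
	}
	return
}

main :: proc() {
	src := [8]u8{1, 2, 3, 4, 5, 6, 7, 8}
	dst: [12]u8

	fmt.println(copy_buggy(dst[:], src[:])) // 6: two source bytes are dropped
	fmt.println(copy_fixed(dst[:], src[:])) // 8: all source bytes fit
}
```

In this model, 8 source bytes fit comfortably into a 12-byte destination, yet the n+i check stops after 6 of them.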

Output for provided example

[240, 159, 153, 130, 239, 191, 239, 191, 189, 240, 159, 10]

This output combines all three bugs. It consists of: [240, 159, 153, 130] (full emoji), [239, 191] (truncated replacement character), [239, 191, 189] (replacement character), [240, 159] (truncated emoji) and [10] ('\n').
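
The byte sequences named in this breakdown can be checked with a short program. This is only a verification aid, assuming `utf8.encode_rune` and `utf8.RUNE_ERROR` from `core:unicode/utf8`; it does not touch the console code path.

```odin
package main

import "core:fmt"
import "core:unicode/utf8"

main :: proc() {
	// U+1F642 is the emoji pasted in the reproduction steps.
	emoji, emoji_len := utf8.encode_rune(rune(0x1F642))
	// U+FFFD is the Unicode replacement character.
	repl, repl_len := utf8.encode_rune(utf8.RUNE_ERROR)

	fmt.printf("%d\n", emoji[:emoji_len]) // [240, 159, 153, 130]
	fmt.printf("%d\n", repl[:repl_len])   // [239, 191, 189]
}
```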

Steps to Reproduce

Code:

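    package main

    import "core:fmt"
    import "core:os/os2"

    main :: proc() {
    	// Small fixed-size buffer passed to each read.
    	buf: [12]u8
    	for {
    		n, err := os2.read(os2.stdin, buf[:])
    		assert(err == nil)
    		// Print the bytes that were read as decimal values.
    		fmt.printf("%d\n", buf[:n])
    	}
    }
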
  1. Open cmd in Windows Terminal
  2. Run chcp 65001 to switch the console to the UTF-8 code page
  3. Compile and run example code
  4. Paste into terminal 🙂🙂 and press Enter
  5. Paste into terminal 🙂 and press Enter

Neirokan, Nov 08 '25 02:11

I have a working fix for this case, but it needs some further testing before I merge it. And I also need to address what happens if len(buf) < 3.

Kelimion, Nov 08 '25 09:11

Windows handles its console reads really strangely compared to other operating systems. To address all of the edge cases, this may take a bit of a rewrite, so this patch may not necessarily land today. That depends on whether a hunch I have pays off.

Kelimion, Nov 08 '25 12:11