`core:os` and `core:os/os2`: `read` on Windows loses/corrupts code points
Context
Odin: dev-2025-11-nightly:1fb60c4
OS: Windows 11 Professional (version: 24H2), build 26100.3476
CPU: AMD Ryzen 5 5600X 6-Core Processor
RAM: 32693 MiB
Backend: LLVM 20.1.0
Requires UTF-8 encoding support in terminal to properly test this.
Also, technically it's multiple bugs, but they are all related to a single procedure in the implementation, `read_console`:
https://github.com/odin-lang/Odin/blob/1fb60c43481506da4f47f75ca4e0de1c70446cf0/core/os/os2/file_windows.odin#L317
Expected Behavior
Behavior
Basically, I expect `read_console` not to corrupt information when the buffer is too small to fit the full line in one call, by preserving a little info between calls (at least 1 code point), but the full list is:
- `read_console` properly copies input into the buffer.
- `read_console` properly reads UTF-16 surrogate pairs even if a pair is separated across 2 `win32.ReadConsoleW` calls.
- If the last code point doesn't fully fit into the buffer, `read_console` spits out the rest of it on the next call OR holds the whole code point until the next call (if the current call has written at least something? I'm not sure what the proper behavior should be if `len(buffer) < 4`). Both ways work for me, ~~but I would prefer the second one.~~ UPD: actually, it should probably match other platforms, so the first one?
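To illustrate the second point, here is a minimal sketch of holding a high surrogate between two chunks instead of replacing it. `Pending`, `decode_chunk` and `REPLACEMENT_CHAR` are hypothetical names made up for this example; this is not the code in `file_windows.odin`, only one way the state could be carried.

```odin
package main

import "core:fmt"

REPLACEMENT_CHAR :: rune(0xFFFD)

// Hypothetical state kept between calls (illustrative only).
Pending :: struct {
	high: u16, // held high surrogate, 0 means nothing is held
}

// decode_chunk converts one chunk of UTF-16 code units into runes.
// A high surrogate at the end of a chunk is held in `p` so that the
// next chunk can complete the pair.
decode_chunk :: proc(p: ^Pending, units: []u16, out: ^[dynamic]rune) {
	for u in units {
		is_high := u >= 0xD800 && u <= 0xDBFF
		is_low  := u >= 0xDC00 && u <= 0xDFFF

		if p.high != 0 {
			held := p.high
			p.high = 0
			if is_low {
				hi := rune(held) - 0xD800
				lo := rune(u) - 0xDC00
				append(out, 0x10000 + ((hi << 10) | lo))
				continue
			}
			// The held unit was never completed by a low surrogate.
			append(out, REPLACEMENT_CHAR)
		}

		if is_high {
			p.high = u
		} else if is_low {
			append(out, REPLACEMENT_CHAR)
		} else {
			append(out, rune(u))
		}
	}
}

main :: proc() {
	// U+1F642 is the pair 0xD83D 0xDE42, split here across two chunks.
	first  := [?]u16{0xD83D}
	second := [?]u16{0xDE42, 0x000D, 0x000A}

	state: Pending
	out: [dynamic]rune
	defer delete(out)

	decode_chunk(&state, first[:], &out)
	decode_chunk(&state, second[:], &out)

	fmt.println(out[:])
}
```

With the pair completed on the second call, the three decoded runes are U+1F642, CR and LF, and no replacement character is produced.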
Output for provided example
[240, 159, 153, 130, 240, 159, 153, 130, 13, 10]
[240, 159, 153, 130, 13, 10]
Current Behavior
Behavior
- `read_console` loses part of the input because the copying loop compares `n+i < len(b)`, but both variables are incremented (`n` at the end of the loop), so it should compare only `n < len(b)` (actually it's #5086, but because of the other 2 points I decided to create this issue): https://github.com/odin-lang/Odin/blob/1fb60c43481506da4f47f75ca4e0de1c70446cf0/core/os/os2/file_windows.odin#L344
- `read_console` replaces a surrogate pair with 2 REPLACEMENT_CHAR's if the pair gets split across 2 `win32.ReadConsoleW` calls.
- `read_console` copies only part of a code point if it doesn't fit into the buffer and loses the rest of it.
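The loop-bound problem from the first item can be reproduced in isolation. The two procedures below are a simplified stand-in written for this example (`copy_buggy` and `copy_fixed` are hypothetical names, and the real loop does more than this); they differ only in the condition being discussed.

```odin
package main

import "core:fmt"

// Same shape as the described condition: both `n` and `i` grow each
// iteration, so `n+i` reaches len(dst) after only half of it is filled.
copy_buggy :: proc(dst, src: []u8) -> (n: int) {
	for i := 0; i < len(src) && n+i < len(dst); i += 1 {
		dst[n] = src[i]
		n += 1
	}
	return
}

// Only `n` is compared against the destination length.
copy_fixed :: proc(dst, src: []u8) -> (n: int) {
	for i := 0; i < len(src) && n < len(dst); i += 1 {
		dst[n] = src[i]
		n += 1
	}
	return
}

main :: proc() {
	src := [10]u8{240, 159, 153, 130, 240, 159, 153, 130, 13, 10}
	dst: [12]u8

	// 10 bytes fit into a 12-byte buffer, but the first loop stops after 6.
	fmt.println(copy_buggy(dst[:], src[:]))
	fmt.println(copy_fixed(dst[:], src[:]))
}
```

Starting from `n = 0` with a 12-byte destination, the first version stops once `2*i` reaches 12, so only 6 of the 10 source bytes are copied.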
Output for provided example
[240, 159, 153, 130, 239, 191, 239, 191, 189, 240, 159, 10]
It actually combines all bugs, since it consists of: [240, 159, 153, 130] (full emoji), [239, 191] (truncated replacement character), [239, 191, 189] (replacement character), [240, 159] (truncated emoji) and [10] ('\n').
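For anyone who wants to double-check that breakdown, the byte sequences can be confirmed with `core:unicode/utf8`. This is only a small verification sketch, not part of the reproduction.

```odin
package main

import "core:fmt"
import "core:unicode/utf8"

main :: proc() {
	// U+1F642 (the emoji used above) encodes to 240 159 153 130.
	emoji, emoji_len := utf8.encode_rune(rune(0x1F642))
	// U+FFFD (replacement character) encodes to 239 191 189.
	repl, repl_len := utf8.encode_rune(utf8.RUNE_ERROR)

	fmt.println(emoji[:emoji_len])
	fmt.println(repl[:repl_len])
}
```

So `[239, 191]` and `[240, 159]` in the output are the first two bytes of a replacement character and of the emoji respectively, each cut off before its last bytes.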
Steps to Reproduce
Code:
package main

import "core:fmt"
import "core:os/os2"

main :: proc() {
	buf: [12]u8
	for {
		n, err := os2.read(os2.stdin, buf[:])
		assert(err == nil)
		fmt.printf("%d\n", buf[:n])
	}
}
- Open cmd in Windows Terminal
- Write `chcp 65001`
- Compile and run example code
- Paste 🙂🙂 into the terminal and press Enter
- Paste 🙂 into the terminal and press Enter
I have a working fix for this case, but it needs some further testing before I merge it.
And I also need to address what happens if len(buf) < 3.
Windows handles its console reads really strangely compared to other operating systems. To address all of the edge cases, this may take a bit of a rewrite, so this patch may not necessarily land today. That depends on whether a hunch I have pays off.
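For the small-buffer case, one possible shape (purely a sketch, not the pending fix) is to stash the bytes of a code point that did not fit and hand them out first on the next call. `Carry`, `write_encoded` and `drain` are hypothetical names invented for this example.

```odin
package main

import "core:fmt"

// Hypothetical carry-over state (illustrative only).
Carry :: struct {
	bytes: [4]u8, // UTF-8 bytes that did not fit in the caller's buffer
	start: int,   // next held byte to hand out
	count: int,   // number of valid bytes in `bytes`
}

// write_encoded copies one encoded code point into dst and stashes
// whatever did not fit. Returns the number of bytes written to dst.
write_encoded :: proc(c: ^Carry, dst: []u8, encoded: []u8) -> int {
	n := copy(dst, encoded)
	c.count = copy(c.bytes[:], encoded[n:])
	c.start = 0
	return n
}

// drain hands out held bytes at the start of the next call.
drain :: proc(c: ^Carry, dst: []u8) -> int {
	n := copy(dst, c.bytes[c.start:c.count])
	c.start += n
	if c.start == c.count {
		c.start = 0
		c.count = 0
	}
	return n
}

main :: proc() {
	emoji := [4]u8{240, 159, 153, 130}
	small: [3]u8 // smaller than one 4-byte code point
	carry: Carry

	n1 := write_encoded(&carry, small[:], emoji[:])
	fmt.println(small[:n1]) // first 3 bytes of the emoji

	n2 := drain(&carry, small[:])
	fmt.println(small[:n2]) // the remaining byte
}
```

This matches the "spit out the rest on the next call" option: no byte is dropped, but a caller with a tiny buffer may see a code point split across two reads, which is presumably what other platforms do as well.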