Find and Replace (regex) behavior changes when text is long (negated character class starts including newline)
Type: Bug
TL;DR: negated character class (e.g. [^/]+) starts matching newline when files are long.
- Use the following Python snippet to create two test text files.
string0 = 'KEEP/001'
string1 = 'REPLACE/ME/23'
for repeat in (10, 10000):
    s = []
    for i in range(repeat):
        s.append(string0)
    for i in range(3*repeat):
        s.append(string1)
    s = '\n'.join(s)
    with open(f'repeat{repeat}.txt', 'w') as f:
        f.write(s)
On both files, do the following find and replace with Use Regular Expression enabled:
Find: ^[^/]+?/[^/]+?/(\d+).*$
Replace: $1
Expected behavior:
The file content should become
KEEP/001
...
KEEP/001
23
...
23
Here KEEP/001 repeats repeat times, and then 23 repeats 3*repeat times. This is indeed the case for repeat=10.
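Since a wrong replacement is easy to miss in a 40k-line file, the expected result can be checked mechanically after saving. The following is a minimal sketch; the helper names `summarize` and `is_expected` are hypothetical and not part of the original report.

```python
from collections import Counter


def summarize(text: str) -> Counter:
    # Count how many times each distinct line occurs in the text.
    return Counter(text.split('\n'))


def is_expected(text: str, repeat: int) -> bool:
    # The expected result is `repeat` lines of KEEP/001 followed by
    # 3*repeat lines of 23; only the per-line counts are compared here.
    return summarize(text) == Counter({'KEEP/001': repeat, '23': 3 * repeat})


good = '\n'.join(['KEEP/001'] * 10 + ['23'] * 30)
bad = '\n'.join(['001'] * 5 + ['23'] * 30)
print(is_expected(good, 10))  # True
print(is_expected(bad, 10))   # False
```

To check a real file, read it first (for example `open('repeat10000.txt').read()`) and pass the content together with the matching `repeat` value.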
For the longer file, it's expected to be the same but with 10k/30k lines, respectively. However, it actually gives:
001
{repeated 4998 times}
001
23
{repeated 29998 times}
23
Every KEEP/001\nKEEP/001 got replaced by 001.
In other words, the negated character class [^/]+ suddenly starts matching newlines in the long file, while it does not in the short one.
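The two outcomes can be modelled outside the editor. The sketch below uses Python's `re` module, not VS Code's search code, and it is not a confirmed root cause. It only shows that applying the pattern line by line gives the expected output, while applying it to the whole text at once (where `[^/]` is free to consume a newline) gives the same shape as the output reported for the long file. The function names are hypothetical.

```python
import re

PATTERN = r'^[^/]+?/[^/]+?/(\d+).*$'


def build_text(repeat: int) -> str:
    # Same layout as the test files: KEEP lines first, then REPLACE lines.
    return '\n'.join(['KEEP/001'] * repeat + ['REPLACE/ME/23'] * (3 * repeat))


def replace_per_line(text: str) -> str:
    # Each line is matched on its own, so no match can span a newline.
    return '\n'.join(re.sub(PATTERN, r'\1', line) for line in text.split('\n'))


def replace_whole_buffer(text: str) -> str:
    # The whole text is matched at once; a negated class such as [^/]
    # also matches the newline character, so a match can span two lines.
    return re.sub(PATTERN, r'\1', text, flags=re.MULTILINE)


sample = build_text(2)
print(replace_per_line(sample).split('\n'))
# ['KEEP/001', 'KEEP/001', '23', '23', '23', '23', '23', '23']
print(replace_whole_buffer(sample).split('\n'))
# ['001', '23', '23', '23', '23', '23', '23']
```

In the whole-buffer variant, each pair KEEP/001\nKEEP/001 collapses to a single 001, which is consistent with roughly 5,000 lines of 001 for repeat=10000.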
VS Code version: Code - Insiders 1.97.0-insider (1db1071148a1efa3b7ad7592d64507ef52536a3e, 2025-01-15T05:04:00.340Z)
OS version: Windows_NT x64 10.0.19045
Modes:
System Info
| Item | Value |
|---|---|
| CPUs | Intel(R) Core(TM) i5-14600K (20 x 3494) |
| GPU Status | 2d_canvas: enabled canvas_oop_rasterization: enabled_on direct_rendering_display_compositor: disabled_off_ok gpu_compositing: enabled multiple_raster_threads: enabled_on opengl: enabled_on rasterization: enabled raw_draw: disabled_off_ok skia_graphite: disabled_off video_decode: enabled video_encode: enabled vulkan: disabled_off webgl: enabled webgl2: enabled webgpu: enabled webnn: disabled_off |
| Load (avg) | undefined |
| Memory (System) | 31.77GB (9.35GB free) |
| Process Argv | --crash-reporter-id ec59ae88-55f6-4589-8133-a873b1b9d699 |
| Screen Reader | no |
| VM | 0% |
Extensions (14)
| Extension | Author (truncated) | Version |
|---|---|---|
| copilot | Git | 1.257.0 |
| copilot-chat | Git | 0.23.2 |
| black-formatter | ms- | 2024.5.13171011 |
| debugpy | ms- | 2024.14.0 |
| python | ms- | 2024.22.2 |
| vscode-pylance | ms- | 2024.12.1 |
| jupyter | ms- | 2024.11.0 |
| jupyter-keymap | ms- | 1.1.2 |
| jupyter-renderers | ms- | 1.0.21 |
| vscode-jupyter-cell-tags | ms- | 0.1.9 |
| vscode-jupyter-slideshow | ms- | 0.1.6 |
| cpptools | ms- | 1.22.11 |
| rust-analyzer | rus | 0.4.2266 |
| volar | Vue | 2.2.0 |
A/B Experiments
vsliv368:30146709
vspor879:30202332
vspor708:30202333
vspor363:30204092
vscod805:30301674
vsaa593:30376534
py29gd2263:31024238
vscaac:30438845
c4g48928:30535728
a9j8j154:30646983
962ge761:30841072
pythonnoceb:30776497
dsvsc014:30777825
dsvsc015:30821418
pythonmypyd1:30859725
h48ei257:31000450
pythontbext0:30879054
cppperfnew:30980852
pythonait:30973460
dvdeprecation:31040973
dwnewjupyter:31046869
nativerepl1:31134653
pythonrstrctxt:31093868
nativeloc1:31118317
cf971741:31144450
e80f6927:31120813
iacca1:31150324
notype1:31143044
dwcopilot:31158714
h409b430:31177054
c3hdf307:31184662
6074i472:31201624
dwoutputs:31217127
This is the corresponding JS file (generated with Copilot) that creates the two files repeat10.txt and repeat10000.txt.
I have no Python installed here.
const fs = require('fs');
const string0 = 'KEEP/001';
const string1 = 'REPLACE/ME/23';
const repeats = [10, 10000];
repeats.forEach(repeat => {
    let s = [];
    // First loop for string0
    for (let i = 0; i < repeat; i++) {
        s.push(string0);
    }
    // Second loop for string1
    for (let i = 0; i < 3 * repeat; i++) {
        s.push(string1);
    }
    // Join array with newlines
    const fileContent = s.join('\n');
    // Write to file
    fs.writeFileSync(`repeat${repeat}.txt`, fileContent);
});
I can reproduce it with 1.96.3.
Before generating the files with the snippet above, I manually created a file with 57k rows but a different distribution of the two strings, and the issue did not occur there.
I confirm that the count matters: with 10k of string1 and 10k of string2, the regex works fine. The limit seems to be 10k of string1 and 19,998 of string2: 29,999 lines in total, including a final empty line.
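To probe where the threshold lies, it helps to generate files with arbitrary counts of each string. The following is a minimal sketch; the function `write_test_file` and its parameters are hypothetical, and the `trailing_newline` option is only there to mimic the final empty line mentioned above.

```python
import os
import tempfile


def write_test_file(path, n_keep, n_replace, trailing_newline=False):
    # Write n_keep lines of KEEP/001 followed by n_replace lines of
    # REPLACE/ME/23, and return the number of non-empty lines written.
    lines = ['KEEP/001'] * n_keep + ['REPLACE/ME/23'] * n_replace
    content = '\n'.join(lines)
    if trailing_newline:
        # Ends the file with a newline, which editors show as a final empty line.
        content += '\n'
    with open(path, 'w') as f:
        f.write(content)
    return len(lines)


# Demo in a temporary directory; use a real path to keep the file.
with tempfile.TemporaryDirectory() as tmp:
    probe = os.path.join(tmp, 'probe.txt')
    written = write_test_file(probe, 10000, 19998, trailing_newline=True)
    print(written)  # 29998
```

Varying `n_keep` and `n_replace` one at a time and rerunning the same find and replace should narrow down which count triggers the change in behavior.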
After replacing:
I tried changing the Search: Max Results setting, but it made no difference.
I vaguely remember a similar issue from years ago.
Found the old issue: https://github.com/microsoft/vscode/issues/496 And the follow up: https://github.com/microsoft/vscode/issues/169017
Thanks. After some tests, I still have no idea what that setting actually does.
Even if I set it to 10, in "normal" cases I can still find/replace all (30000+) matches fine, regardless of whether I'm replacing with plain text or a regex. (But of course, it does not fix this bug either.) Setting it to a larger value or leaving it empty (unlimited) did not fix the issue either.
Edit: also, when you hover over it:
The tooltip seems to suggest find should work on entire text with no mention of search.maxResults?
Edit2: after reading https://github.com/microsoft/vscode/pull/126762, it looks to me that setting is only related to global file search, not in-current-file search.
Any update or at least triage on this?
IMHO this is a relatively serious bug, since it can lead to data loss if users are not aware of it when doing batch text replacement with a regex, which is what happened to me when I first discovered this bug.
^ @rebornix Sorry to ping you directly