
Find and Replace (regex) behavior changes when text is long (negated character class starts including newline)

Open fireattack opened this issue 11 months ago • 4 comments

Type: Bug

TL;DR: negated character class (e.g. [^/]+) starts matching newline when files are long.

  1. Use the following Python snippet to create two test text files.
string0 = 'KEEP/001'
string1 = 'REPLACE/ME/23'

for repeat in (10, 10000):
    s = []
    for i in range(repeat):
        s.append(string0)
    for i in range(3*repeat):
        s.append(string1)

    s = '\n'.join(s)
    with open(f'repeat{repeat}.txt', 'w') as f:
        f.write(s)

On both files, do the following find and replace with Use Regular Expression enabled:

Find: ^[^/]+?/[^/]+?/(\d+).*$
Replace: $1

Expected behavior:

The file content should become

KEEP/001
...
KEEP/001
23
...
23

Where KEEP/001 repeats repeat times, and then 23 repeats 3*repeat times. It is indeed the case for the repeat=10.
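The expected result can also be checked mechanically instead of by eye. A minimal sketch in Python, like the generator above; the helper names `expected_text` and `matches_expected` are made up for illustration, and the file is assumed to be saved without a trailing newline, as the generator writes it:

```python
from pathlib import Path


def expected_text(repeat: int) -> str:
    """Expected content after the replace: `repeat` untouched KEEP/001 lines,
    followed by 3 * repeat lines containing only the captured number."""
    return '\n'.join(['KEEP/001'] * repeat + ['23'] * (3 * repeat))


def matches_expected(path: str, repeat: int) -> bool:
    # read_text() opens the file in text mode, so CRLF line endings
    # written on Windows are read back as plain '\n'.
    return Path(path).read_text() == expected_text(repeat)
```

For example, `matches_expected('repeat10000.txt', 10000)` should be true after a correct replace and false for the output reported here.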

For the longer file, it's expected to be the same but with 10k/30k lines, respectively. However, it actually gives:

001
{repeated 4998 times}
001
23
{repeated 29998 times}
23

Every KEEP/001\nKEEP/001 got replaced by 001.

In other words, the negated character class [^/]+ suddenly starts matching newlines in the long file, which it does not do in the short one.
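The two outputs are what you would get if the same pattern were applied in two different ways: line by line for the short file, and against the whole text at once for the long one. Whether VS Code actually switches strategies like this is an assumption, not something established in this report. The Python sketch below only shows that the two strategies reproduce the two results described above; the function names are made up for illustration.

```python
import re

PATTERN = r'^[^/]+?/[^/]+?/(\d+).*$'


def replace_per_line(text: str) -> str:
    # Each line is matched on its own, so [^/]+? never sees a newline.
    return '\n'.join(re.sub(PATTERN, r'\1', line) for line in text.split('\n'))


def replace_whole_text(text: str) -> str:
    # The pattern runs over the full text with ^ and $ anchored per line.
    # By definition [^/] matches any character except '/', including '\n',
    # so the second [^/]+? can run across a line break to reach the next '/'.
    return re.sub(PATTERN, r'\1', text, flags=re.MULTILINE)


sample = '\n'.join(['KEEP/001'] * 4 + ['REPLACE/ME/23'] * 2)

print(replace_per_line(sample).split('\n'))
# ['KEEP/001', 'KEEP/001', 'KEEP/001', 'KEEP/001', '23', '23']
print(replace_whole_text(sample).split('\n'))
# ['001', '001', '23', '23']
```

In the whole-text variant every pair of KEEP/001 lines collapses into a single 001, which matches the 10,000 lines shrinking to 5,000 in the report.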

VS Code version: Code - Insiders 1.97.0-insider (1db1071148a1efa3b7ad7592d64507ef52536a3e, 2025-01-15T05:04:00.340Z)
OS version: Windows_NT x64 10.0.19045
Modes:

System Info
Item Value
CPUs Intel(R) Core(TM) i5-14600K (20 x 3494)
GPU Status 2d_canvas: enabled
canvas_oop_rasterization: enabled_on
direct_rendering_display_compositor: disabled_off_ok
gpu_compositing: enabled
multiple_raster_threads: enabled_on
opengl: enabled_on
rasterization: enabled
raw_draw: disabled_off_ok
skia_graphite: disabled_off
video_decode: enabled
video_encode: enabled
vulkan: disabled_off
webgl: enabled
webgl2: enabled
webgpu: enabled
webnn: disabled_off
Load (avg) undefined
Memory (System) 31.77GB (9.35GB free)
Process Argv --crash-reporter-id ec59ae88-55f6-4589-8133-a873b1b9d699
Screen Reader no
VM 0%
Extensions (14)
Extension Author (truncated) Version
copilot Git 1.257.0
copilot-chat Git 0.23.2
black-formatter ms- 2024.5.13171011
debugpy ms- 2024.14.0
python ms- 2024.22.2
vscode-pylance ms- 2024.12.1
jupyter ms- 2024.11.0
jupyter-keymap ms- 1.1.2
jupyter-renderers ms- 1.0.21
vscode-jupyter-cell-tags ms- 0.1.9
vscode-jupyter-slideshow ms- 0.1.6
cpptools ms- 1.22.11
rust-analyzer rus 0.4.2266
volar Vue 2.2.0
A/B Experiments
vsliv368:30146709
vspor879:30202332
vspor708:30202333
vspor363:30204092
vscod805:30301674
vsaa593:30376534
py29gd2263:31024238
vscaac:30438845
c4g48928:30535728
a9j8j154:30646983
962ge761:30841072
pythonnoceb:30776497
dsvsc014:30777825
dsvsc015:30821418
pythonmypyd1:30859725
h48ei257:31000450
pythontbext0:30879054
cppperfnew:30980852
pythonait:30973460
dvdeprecation:31040973
dwnewjupyter:31046869
nativerepl1:31134653
pythonrstrctxt:31093868
nativeloc1:31118317
cf971741:31144450
e80f6927:31120813
iacca1:31150324
notype1:31143044
dwcopilot:31158714
h409b430:31177054
c3hdf307:31184662
6074i472:31201624
dwoutputs:31217127

fireattack avatar Jan 15 '25 16:01 fireattack

This is the corresponding JS file (generated with Copilot) to create the two files repeat10.txt and repeat10000.txt.

I don't have Python installed here.

const fs = require('fs');

const string0 = 'KEEP/001';
const string1 = 'REPLACE/ME/23';

const repeats = [10, 10000];

repeats.forEach(repeat => {
    let s = [];

    // First loop for string0
    for (let i = 0; i < repeat; i++) {
        s.push(string0);
    }

    // Second loop for string1
    for (let i = 0; i < 3 * repeat; i++) {
        s.push(string1);
    }

    // Join array with newlines
    const fileContent = s.join('\n');

    // Write to file
    fs.writeFileSync(`repeat${repeat}.txt`, fileContent);
});

I can reproduce it with 1.96.3.

Anyway, before generating the files with the snippet above, I manually created a file with 57k rows but a different distribution of the two strings, and the issue did not occur there.

albertosantini avatar Jan 15 '25 17:01 albertosantini

I confirm that the count matters: with 10k of string1 and 10k of string2, the regex works fine. The limit seems to be 10k of string1 and 19,998 of string2: 29,999 total lines including the final empty line.
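To probe that limit, the generator can be parameterized by the two counts. This is a rough sketch that assumes "string1" and "string2" in this comment refer to KEEP/001 and REPLACE/ME/23 in that order, and that the file ends with a newline, which an editor displays as a final empty line. The function name `make_text` is hypothetical.

```python
def make_text(first_count: int, second_count: int, trailing_newline: bool = True) -> str:
    """Build test content with independent counts for the two strings."""
    lines = ['KEEP/001'] * first_count + ['REPLACE/ME/23'] * second_count
    text = '\n'.join(lines)
    return text + '\n' if trailing_newline else text


text = make_text(10_000, 19_998)

# 29,998 content lines; the empty string after the last newline is what
# an editor shows as one more (empty) line.
editor_line_count = text.count('\n') + 1
print(editor_line_count)  # 29999
```

Writing `make_text(10_000, n)` to a file for values of `n` around 19,998 makes it possible to find the exact count at which the behavior changes.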

After replacing: Image Image

I tried changing the Search: Max Results setting, just in case, but it made no difference.

I vaguely remember a similar issue from years ago.

albertosantini avatar Jan 15 '25 17:01 albertosantini

Found the old issue: https://github.com/microsoft/vscode/issues/496 And the follow up: https://github.com/microsoft/vscode/issues/169017

albertosantini avatar Jan 15 '25 17:01 albertosantini

Thanks. After some tests, I still have no idea what that setting actually does.

Even if I set it to 10, in "normal" cases I can still find/replace all (30000+) of the matches fine, regardless of whether I'm replacing with plain text or regex. (But of course, it does not fix this bug either.) Setting it to a larger value or leaving it empty (unlimited) did not fix the issue, either.

Edit: also, when you hover:

Image

The tooltip seems to suggest find should work on entire text with no mention of search.maxResults?

Edit2: after reading https://github.com/microsoft/vscode/pull/126762, it looks to me that setting is only related to global file search, not in-current-file search.

fireattack avatar Jan 15 '25 18:01 fireattack

Any update or at least triage on this?

IMHO this is a relatively serious bug, since it can lead to data loss if users aren't aware of it when doing batch text replacement using regex, which was the case for me when I first discovered this bug.
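Until this is triaged, one defensive option for batch replacements is to exclude newlines from the negated classes explicitly, so the pattern means the same thing whether it is applied per line or to the whole text. The sketch below uses Python's `re` module to show the idea; it has not been verified against VS Code's Find widget, and the variable names are made up.

```python
import re

# Same pattern as in the report, but the negated classes also exclude '\n'.
SAFE_PATTERN = r'^[^/\n]+?/[^/\n]+?/(\d+).*$'

text = '\n'.join(['KEEP/001'] * 4 + ['REPLACE/ME/23'] * 2)

# Applied to the whole text at once, the KEEP/001 lines are now left alone,
# because the second class can no longer cross a line break to find a '/'.
result = re.sub(SAFE_PATTERN, r'\1', text, flags=re.MULTILINE)
print(result.split('\n'))
# ['KEEP/001', 'KEEP/001', 'KEEP/001', 'KEEP/001', '23', '23']
```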

fireattack avatar Apr 02 '25 09:04 fireattack

^ @rebornix Sorry to ping you directly

albertosantini avatar Apr 02 '25 09:04 albertosantini