zoekt Always include trailing newline

The goal of this PR is to fix our "line model" to fix the edge cases that led to https://github.com/sourcegraph/sourcegraph/issues/60605. In short, this changes the definition of a "line" to include its terminating newline (if it exists).

Before this PR, we had defined a "line" as starting at the byte after a newline (or the beginning of a file) and ending at the byte before a newline (or the end of the file).

The problem with that definition is that a newline that is the last byte in the file can never successfully be matched because we would trim that from the returned content, so any ranges that would match that trailing newline would be out of bounds in the result returned to the client. That's the reason behind the panics caused by https://github.com/sourcegraph/zoekt/pull/709, which was an attempt to formalize the "line does not include a trailing newline" definition.

So, instead, this PR proposes that we redefine a line as ending at the byte after a newline (or the end of the file). This means that a regex can successfully and safely match a terminating newline.

The downside is it does complicate the contract for the client a bit. In practice, it means to get the set of lines, you need to do something like chunk.content.replace(/\r?\n$/, '').split(/\r?\n/) instead of just chunk.content.split(/\r?\n/) because there may or may not be a trailing newline at the end of the file, but if there is, it does not indicate there is an empty line at the end of the file.

This PR makes significant client-observable changes, and would be accompanied by some associated updates in sourcegraph/sourcegraph drafted here.

Mar 18 '24 22:03 camdencheek

@sourcegraph/search-platform mind taking a peek at this to see if y'all agree with the direction?

Mar 18 '24 23:03 camdencheek

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206 History is on your side 🌞.

Mar 25 '24 23:03 isker

In short, this changes the definition of a "line" in a chunk to include its terminating newline (if it exists).

This sounds good and correct to me.

So, instead, this PR proposes that we redefine a line as ending at the byte after a newline (or the end of the file). This means that a regex can successfully and safely match a terminating newline.

This makes a lot of sense to me. If your regex asks for \n and we say sure we found one but then it isn't there... that feels broken. So I 100% agree we need to go in this direction.

The downside is it does complicate the ChunkMatch contract for the client a bit. In practice, it means to get the set of lines, you need to do something like chunk.content.replace(/\r?\n$/, '').split(/\r?\n/) instead of just chunk.content.split(/\r?\n/) because there may or may not be a trailing newline at the end of the file, but if there is, it does not indicate there is an empty line at the end of the file.

It is already a bit tricky for the client right? It feels like we should document corner cases for content for clients, to ensure they test. Additionally what you propose I think will match what users want to see.

Some fun corner cases while I am here. Imagine your corpus is two files: `` (empty) and \n (single newline):

/^\n$/
/^\n?$/

This PR makes client-observable changes, and would be accompanied by some associated updates in sourcegraph/sourcegraph drafted here.

I'm happy with the description. Ready for code review?

Apr 03 '24 10:04 keegancsmith

@keegancsmith, this is ready for review now. It doesn't look we do any "release" for Zoekt -- any thoughts about how to publicize this since it is a breaking change?

Apr 10 '24 18:04 camdencheek

@keegancsmith, this is ready for review now. It doesn't look we do any "release" for Zoekt -- any thoughts about how to publicize this since it is a breaking change?

Yeah we don't have a changelog. So I am just going to CC people who I know use our API. cc @isker for neogrok and @binarymason from GitLab.

Alternatively is it possible to only do this trimming behaviour for ChunkMatches? Or is that asking for trouble.

Apr 14 '24 19:04 keegancsmith

is it possible to only do this trimming behaviour for ChunkMatches?

I did try that in my first attempt. Not impossible, but definitely messy. Also, the ambiguity described exists for LineMatch too, so I consider this a bugfix. If you'd like, I can make a sibling PR that attempts this and we can decide whether it's worth further splitting the code paths.

Apr 15 '24 23:04 camdencheek

Thanks for the ping @keegancsmith! The logic behind this PR makes sense to me, but I'll need to dig in a bit to see if there would be any breaking changes on our end. We pin to a specific commit, so I don't see any blockers that should prevent this from moving forward.

If I see any breaking changes, will report back ASAP. 🤝

Apr 17 '24 01:04 binarymason

This was a breaking change for neogrok, but one that I'm happy to handle.

Apr 19 '24 05:04 isker

Sorry and thank you @isker!

Apr 19 '24 07:04 keegancsmith