nix-search-cli

Duplicated search results.

Open vic opened this issue 7 months ago • 6 comments

Hi,

Looks like something has changed (I'm guessing on the side of the search.nixos.org Elasticsearch backend), and now nix-search-cli is displaying each result twice:

> nix run nixpkgs#nix-search-cli -- hello
hello @ 2.12.1 : hello
hello @ 2.12.1 : hello
hello-go : hello-go
hello-go : hello-go
hello-cpp : hello-cpp
hello-cpp : hello-cpp
hello-unfree @ 1.0
hello-unfree @ 1.0
....

vic avatar May 07 '25 22:05 vic

Darn, I'm seeing this as well. Thanks for reporting the issue. Want to send a PR that does the deduplication?

peterldowns avatar May 08 '25 19:05 peterldowns

I've been digging into the source to find the root cause, and I believe I have found the problem. However, I'm a bit unsure how to properly fix it.

Here's how I investigated the issue: I added a couple of fmt.Println calls in esclient.go to print the Elasticsearch URL, the query JSON, and the response JSON. Then I searched for the same package, hello, at search.nixos.org and compared the URLs, requests, and responses of both. Here's what I found:

The query JSON looks fine. The one from the search.nixos.org frontend is the same except for some aggs used to group results, which we don't need in the CLI.

One difference I noticed was in the request URLs:

CLI = https://nixos-search-7-1733963800.us-east-1.bonsaisearch.net:443/latest-*-nixos-unstable/_search
WEB = https://search.nixos.org/backend/latest-43-nixos-24.11/_search

Then, to query the very same endpoint, I changed the URL template to

	ElasticSearchURLTemplate = `https://search.nixos.org/backend/%s/_search`

but still got duplicated results. I noticed that our URL contains latest-*-nixos-unstable, so I tried using --channel 24.11 on the CLI to hit the same URL as the web frontend.

The actual problem is the -*- wildcard in the index prefix here:

	ElasticSearchIndexPrefix = "latest-*-"

I'm guessing there are two indexes that return results, and that's why we are getting duplicates.

So, for one thing, I believe we should be hitting https://search.nixos.org/backend/. However, I'm not sure how to proceed with the latest-*- index prefix, since the -*- wildcard now matches two indexes and both return results.

Any suggestions on how to work around it?
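
To illustrate the guess above, here is a minimal Go sketch of how a wildcard pattern can match more than one alias. It uses path.Match from the standard library purely as a stand-in for Elasticsearch's own wildcard expansion, which is an assumption made for illustration. The function name matchingAliases is hypothetical, and the alias names are the ones listed later in this thread.

```go
package main

import (
	"fmt"
	"path"
)

// matchingAliases returns the alias names that match an index pattern.
// path.Match only approximates Elasticsearch's wildcard expansion; it is
// used here for illustration, not as the server's actual resolver.
func matchingAliases(pattern string, aliases []string) []string {
	var matched []string
	for _, alias := range aliases {
		ok, err := path.Match(pattern, alias)
		if err == nil && ok {
			matched = append(matched, alias)
		}
	}
	return matched
}

func main() {
	// Alias names as reported by the _aliases listing in this thread.
	aliases := []string{
		"latest-42-nixos-unstable",
		"latest-43-nixos-unstable",
		"latest-42-nixos-24.11",
		"latest-43-nixos-24.11",
	}
	// Both unstable aliases match, so every hit would be returned twice.
	fmt.Println(matchingAliases("latest-*-nixos-unstable", aliases))
}
```

With two schema versions live at the same time, the pattern matches two aliases, which would be consistent with each package showing up twice.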

vic avatar May 08 '25 23:05 vic

The value 43 (as in latest-43-) used by the search.nixos.org frontend is read from the environment as elasticsearchMappingSchemaVersion:

https://github.com/search?q=repo%3ANixOS%2Fnixos-search%20elasticsearchMappingSchemaVersion&type=code

So I'm guessing they bump that version number whenever the schema changes, for example when they add new indexed fields.

I believe we should also use a fixed integer schemaVersion in our index prefix instead of the -*- wildcard. Or maybe we could read it from a file in our repo that contains the schemaVersion and gets updated from time to time?
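
A minimal Go sketch of the fixed-version idea, assuming the index name has the shape latest-<schemaVersion>-nixos-<channel> seen in the frontend URL above. The names schemaVersion, indexName, and searchURL are hypothetical and do not come from the nix-search-cli source; 43 is simply the value quoted in this thread.

```go
package main

import "fmt"

// schemaVersion is a hypothetical constant. 43 is the value the frontend
// used at the time of this thread; it would need updating on each bump.
const schemaVersion = 43

// urlTemplate mirrors the backend URL format quoted earlier in the thread.
const urlTemplate = "https://search.nixos.org/backend/%s/_search"

// indexName builds an index name pinned to one schema version, so that
// no wildcard is needed.
func indexName(version int, channel string) string {
	return fmt.Sprintf("latest-%d-nixos-%s", version, channel)
}

// searchURL returns the full _search URL for a schema version and channel.
func searchURL(version int, channel string) string {
	return fmt.Sprintf(urlTemplate, indexName(version, channel))
}

func main() {
	fmt.Println(searchURL(schemaVersion, "unstable"))
}
```

The trade-off is that a pinned version targets exactly one index but goes stale as soon as upstream bumps the schema version.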

vic avatar May 08 '25 23:05 vic

This is the file where 43 is defined for their frontend:

https://github.com/NixOS/nixos-search/blob/main/VERSION (changed to 43 two days ago, precisely when we started getting dups)

I guess we could download that file's contents as part of our nix build. What do you think?

vic avatar May 08 '25 23:05 vic

Pushed a minimal PR that fixes this issue: https://github.com/peterldowns/nix-search-cli/pull/20

It uses a fixed schemaVersion, and now results are back to normal.

If you'd prefer the schemaVersion not to be hardcoded, let me know (though we already have the user/pwd hardcoded in there anyway).

vic avatar May 09 '25 00:05 vic

Grepping the aliases for latest-*-nixos-unstable gives two matches: one for schemaVersion 42 and one for schemaVersion 43. So I guess that whenever a new schema version is introduced, another latest- alias is created for it.

curl https://search.nixos.org/backend/_aliases -u "$esUser:$esPass" | jq | rg latest-
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 28119    0 28119    0     0  89414      0 --:--:-- --:--:-- --:--:-- 89550
      "latest-43-nixos-unstable": {}
      "latest-42-group-manual": {}
      "latest-42-nixos-unstable": {}
      "latest-42-nixos-24.11": {}
      "latest-43-nixos-24.11": {}
      "latest-43-group-manual": {}
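
Given a listing like this, one possible client-side workaround is to pick only the alias with the highest schema version for a channel. The Go sketch below assumes alias names follow the latest-<version>-<suffix> shape shown above; newestAlias is a hypothetical helper, not part of nix-search-cli.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// newestAlias returns the alias of the form latest-<version>-<suffix> with
// the highest numeric version, or false if no alias has that shape.
func newestAlias(aliases []string, suffix string) (string, bool) {
	best, bestVersion := "", -1
	for _, alias := range aliases {
		if !strings.HasPrefix(alias, "latest-") || !strings.HasSuffix(alias, "-"+suffix) {
			continue
		}
		// Strip the fixed prefix and suffix, leaving only the version text.
		versionText := strings.TrimSuffix(strings.TrimPrefix(alias, "latest-"), "-"+suffix)
		version, err := strconv.Atoi(versionText)
		if err != nil {
			continue // not a plain integer version, skip it
		}
		if version > bestVersion {
			best, bestVersion = alias, version
		}
	}
	return best, bestVersion >= 0
}

func main() {
	// Alias names copied from the listing above.
	aliases := []string{
		"latest-43-nixos-unstable",
		"latest-42-group-manual",
		"latest-42-nixos-unstable",
		"latest-42-nixos-24.11",
		"latest-43-nixos-24.11",
		"latest-43-group-manual",
	}
	alias, ok := newestAlias(aliases, "nixos-unstable")
	fmt.Println(alias, ok)
}
```

This costs an extra request to fetch the alias list before each search, so it is only a sketch of one option.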

vic avatar May 09 '25 03:05 vic

ping @peterldowns

vic avatar May 17 '25 14:05 vic

Update: Today (June 4, 2025) things started working again. There's a single -*-unstable alias now: latest-43-nixos-unstable. It looks like old aliases get removed after some time.

vic avatar Jun 05 '25 01:06 vic

@vic hey, yup, you figured things out: the latest-* alias prefix is used to always search over the latest available index, but upstream sometimes has multiple latest- indices, and then we get duplicate search results. This is documented in the code here https://github.com/peterldowns/nix-search-cli/blob/7d6b4c501ee448dc2e5c123aa4c6d9db44a6dd12/pkg/nixsearch/esclient.go#L21 but not anywhere else. Sorry that I didn't have time to point you to it.

Quoting vic: "If you prefer the schemaVersion not to be hardcoded (anyways we have user/pwd hardcoded in there), tell me so."

Hardcoding the schemaVersion (or the search index) is not a viable option because it would require users of nix-search-cli to rebuild or download a new binary every time the upstream index updates.

The two best options:

  • Update the nix-search elasticsearch indexer script to consistently maintain a single versionless alias, like latest-unstable, that always points to whatever the actual latest index is.
  • Update this repo's code to deduplicate results by package name.
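
The second option could look roughly like the Go sketch below. It is an illustration only and not necessarily how the eventual fix was implemented: the Package type and its fields are hypothetical simplifications, and the real result type in nix-search-cli may differ.

```go
package main

import "fmt"

// Package is a hypothetical, simplified search result.
type Package struct {
	AttrName string
	Version  string
}

// dedupeByName keeps the first occurrence of each attribute name and
// preserves the original result order.
func dedupeByName(results []Package) []Package {
	seen := make(map[string]struct{}, len(results))
	unique := make([]Package, 0, len(results))
	for _, p := range results {
		if _, dup := seen[p.AttrName]; dup {
			continue
		}
		seen[p.AttrName] = struct{}{}
		unique = append(unique, p)
	}
	return unique
}

func main() {
	// Duplicated results shaped like the output in the original report.
	results := []Package{
		{AttrName: "hello", Version: "2.12.1"},
		{AttrName: "hello", Version: "2.12.1"},
		{AttrName: "hello-go"},
		{AttrName: "hello-go"},
	}
	for _, p := range dedupeByName(results) {
		fmt.Println(p.AttrName, p.Version)
	}
}
```

Keeping the first occurrence preserves the relevance ordering returned by the backend, and the approach keeps working no matter how many latest- aliases exist upstream.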

peterldowns avatar Jul 14 '25 17:07 peterldowns

Fixed by https://github.com/peterldowns/nix-search-cli/pull/21

peterldowns avatar Jul 14 '25 19:07 peterldowns