Bug report: our implementation of searching SharePoint returns duplicated rows on large result sets
Today I was searching SharePoint using PnP PowerShell (Invoke-PnPSearchQuery) and ran into duplicated results on large result sets.
I got around 100,000 result rows back, but only about a third of them were unique (unique based on the URL of the file). Which is odd: it appeared I was finding all files, but what I actually got back was duplicated rows. (Not due to TrimDuplicates: false.)
In this scenario I had implemented paging myself (for various reasons). I discovered that when I used the -All parameter on the cmdlet (instead of paging myself), I would get just as many result rows, but now they were all unique files. A very odd discovery.
Looking into the PnP.PowerShell codebase (and searching online), it turns out that paging with the StartRow property is unreliable. PnP.PowerShell therefore implements a different approach, which is also documented by Microsoft. Source: https://learn.microsoft.com/en-us/sharepoint/dev/general-development/pagination-for-large-result-sets
Although this source doesn't mention the duplicate results I was struggling with, it does show that for paging over large result sets you should sort by DocId and page on that value. And that's precisely what PnP.PowerShell does if you look into their codebase.
I looked into ours and it seems we are paging by StartRow when --allResults is used.
I propose that we update the way we page in this command to the DocId approach, to make our command reliable on large result sets as well.
Implementation
In short, this is what we need to do:
- Query everything where `IndexDocId>${lastDocId}`
- Sort by DocId: `[DocId]:ascending`
I wrote down a slightly more extensive explanation in my blog: https://www.blimped.nl/correctly-paging-when-searching-sharepoint/
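The loop above could be sketched roughly like this. This is an illustrative sketch, not the actual CLI code: the `SearchRow` shape, the `getAllResults` name, and the `search` callback (which would append `IndexDocId>${lastDocId}` to the query and sort by `[DocId]:ascending`) are all assumptions for the example.

```typescript
// Illustrative sketch of DocId-based paging; SearchRow and getAllResults
// are hypothetical names, not the actual CLI implementation.
interface SearchRow {
  DocId: number; // value of the IndexDocId managed property
  Path: string;  // file URL, used to check uniqueness
}

// search: executes one page of the query, assumed to filter on
// IndexDocId > lastDocId and sort by [DocId]:ascending.
async function getAllResults(
  search: (lastDocId: number) => Promise<SearchRow[]>
): Promise<SearchRow[]> {
  const all: SearchRow[] = [];
  let lastDocId = 0;
  while (true) {
    // Each page only contains rows with a DocId greater than the last one
    // seen, so rows are never returned twice, unlike StartRow-based paging.
    const rows = await search(lastDocId);
    if (rows.length === 0) {
      break;
    }
    all.push(...rows);
    lastDocId = rows[rows.length - 1].DocId;
  }
  return all;
}
```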
Additional
While we're at it, let's add an extra verbose message while looping over all results.
Something like:
Processing search query, retrieved 1000 of 1000000 items
Processing search query, retrieved 1500 of 1000000 items
Processing search query, retrieved 2000 of 1000000 items
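For testability, the message could be built by a small helper and written to stderr only in verbose mode (the `formatProgress`/`logProgress` names are hypothetical, just one possible shape):

```typescript
// Hypothetical helper: builds the verbose progress message.
function formatProgress(retrieved: number, total: number): string {
  return `Processing search query, retrieved ${retrieved} of ${total} items`;
}

// In verbose mode the message would go to stderr, so it doesn't
// pollute the command's stdout output.
function logProgress(verbose: boolean, retrieved: number, total: number): void {
  if (verbose) {
    process.stderr.write(formatProgress(retrieved, total) + '\n');
  }
}
```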
Awesome find. Let's do it. Just to double check: there's no @odata.nextLink that we could use so that we don't need to fiddle with paging links ourselves?
Checked to be sure, there is no nextlink available...
Also, if I search a large result set using the m365 CLI with an app-only app, it crashes somewhere with an ECONNRESET. I think we need to implement some failure handling and a retry mechanism. Search can be buggy.
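One possible shape for the retry, sketched under the assumption that transient network errors surface with a `code` such as `ECONNRESET` (the `withRetry` helper and its defaults are illustrative, not the actual CLI request layer):

```typescript
// Illustrative retry helper with exponential backoff for transient
// network errors such as ECONNRESET.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: unknown) {
      const code = (err as { code?: string })?.code;
      const transient = code === 'ECONNRESET' || code === 'ETIMEDOUT';
      if (!transient || attempt >= maxRetries) {
        throw err; // non-transient error, or retries exhausted
      }
      // Back off exponentially: 500 ms, 1 s, 2 s, ...
      await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
}
```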
And also: while looping over all results, it would be helpful to write the current item count/total count to stderr in verbose mode.
Wait, metadata is set to none, let me try again with minimal/verbose
No next link...
Sorry @martinlingstuyl for the late reply. Awesome find and I am a huge fan of this one. Let's get this shipped 🚀