Subproject search sharing sibling search index

Open agjohnson opened this issue 2 years ago • 15 comments

It's not quite clear where we are with this feature. At some point this was a feature of our subproject search, though I recall it originally requiring some hacks to make this work. We discussed this internally and it seems this is indeed not a feature of our search anymore. However, currently our docs do mention that sibling projects share an index, and we have issues like https://github.com/readthedocs/readthedocs.org/issues/4623 that found sibling sharing is possible.

Did we refactor this feature out and need to add it back?

We should:

  • [x] Clarify the docs for now
  • [ ] Consider this as a regression of sorts and add this feature back in. What are the changes required here? Perhaps the hack that I'm thinking of is the fact that we never gave users control over this feature and had to hack in the configuration to make this all possible.

cc Wizard of Search, @stsewd

agjohnson avatar Nov 12 '21 20:11 agjohnson

I'm not aware that this feature existed. We have always returned results from subprojects when searching on the main project.

However, currently our docs do mention that sibling projects share an index

I think that refers to returning search results of subprojects from the main project. And that's done at the query level, not at the index level.

stsewd avatar Nov 15 '21 15:11 stsewd

Yeah, I think we might have implemented it in a proof of concept state for a customer or two, I don't think we ever made it an actual feature of our search.

I think that refers to returning search results of subprojects from the main project.

It seems our docs do pretty clearly describe sibling subprojects sharing an index:

image

"Sharing index" here maybe isn't technically correct, but yeah the effect to the user is a shared index even if we're just altering the queries used.

agjohnson avatar Nov 15 '21 19:11 agjohnson

Just noting: I think this would have been implemented in the old search/lib.py file, but I am not finding anything obvious in the history. Here are the commits before the file was refactored:

https://github.com/readthedocs/readthedocs.org/commits/cbf76fad89c924c248a96824d4d80a9d2af60d3e/readthedocs/search/lib.py

agjohnson avatar Nov 15 '21 19:11 agjohnson

BTW, there are some ideas about how to make this configurable in

https://github.com/readthedocs/readthedocs.org/issues/7217

It's about subprojects, but could be made more general, like:

Don't include results from subprojects

version: 2

search:
  projects: []

Include results from some sub/projects

version: 2

search:
  projects:
    - one
    - two

stsewd avatar Nov 16 '21 00:11 stsewd

Yeah, that's a great example.

So, to make this configuration option make sense as a readthedocs.yaml option (that is, an option that only affects a single version), would the search index be configurable per version? We've come across this for several project-level options and the UX of configuring project-level options via per-version file still seems odd -- though I'm probably not convinced per-version search index configuration is great UX either.

Per-version configuration of the search index could be the most explicit version of this configuration. For example, the ability to target a specific version of a project:

search:
  projects:
    - project: one
      version: latest
    - project: two
      version: 1.0
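A minimal sketch of how such a `search.projects` list might be normalized into (project, version) pairs. The helper name `normalize_search_projects`, the `default_version` fallback, and the slug `zero` are assumptions for illustration, not an existing Read the Docs API. The config is shown as an already-parsed Python dict.

```python
def normalize_search_projects(config, default_version="latest"):
    """Return (project_slug, version_slug) pairs from a parsed config dict.

    Hypothetical helper: accepts both the short form (a plain slug) and the
    explicit form (a mapping with "project" and an optional "version").
    """
    pairs = []
    for entry in config.get("search", {}).get("projects", []):
        if isinstance(entry, str):
            pairs.append((entry, default_version))
        else:
            version = entry.get("version", default_version)
            # str() guards against YAML loading an unquoted 1.0 as a float.
            pairs.append((entry["project"], str(version)))
    return pairs


config = {
    "version": 2,
    "search": {
        "projects": [
            "zero",
            {"project": "one", "version": "latest"},
            {"project": "two", "version": 1.0},
        ]
    },
}
print(normalize_search_projects(config))
```

An empty `projects` list yields no pairs, which matches the "don't include results from subprojects" case above.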

Perhaps, to step a bit back, another option that we haven't explored much yet is a separate configuration file for project level settings. This file is only valid on the main branch, so there don't need to be conflicts with readthedocs.yaml on each branch or historical branch.

agjohnson avatar Nov 26 '21 18:11 agjohnson

Perhaps, to step a bit back, another option that we haven't explored much yet is a separate configuration file for project level settings. This file is only valid on the main branch, so there don't need to be conflicts with readthedocs.yaml on each branch or historical branch.

Codecov has this option https://docs.codecov.com/docs/codecov-yaml#team-yaml

We can use something like that, this isn't a default setting for all projects, but for all versions (this is also helpful for projects migrating already released versions to rtd I guess).

This will require rebuilding the versions to apply some settings.

stsewd avatar Nov 29 '21 17:11 stsewd

This will require rebuilding the versions to apply some settings.

This is another good point yeah. There are some options, like maybe search configuration, that seem fine to apply on build like that. Redirects seem like another case for a project-level configuration file.

Codecov has this option https://docs.codecov.com/docs/codecov-yaml#team-yaml

I like how they merge the files, that could be something else to explore. Options like redirects would benefit from both project and version level configuration files.

agjohnson avatar Nov 30 '21 02:11 agjohnson

Had another user request this. It would be great if there was a simple way to optionally do this, but I agree that #7217 is the better & more flexible solution.

ericholscher avatar Apr 26 '22 23:04 ericholscher

Another option that we could explore is to do this at the API level, that is, pass an array of projects as an option instead of just one. That won't need users to re-build a project, and we could make it an option in our search-as-you-type extension (also I think we could do the default search override as an extension).

I'm thinking something like:

api/v2/search?project=docs&project=dev&q=test

The only thing missing would be the version, so what about using project:version? If the version isn't provided we could default to the default version or to the one passed on the query string &version=latest.

We already check for permissions on each version at the api level

https://github.com/readthedocs/readthedocs.org/blob/ee18557b913beb2976bd93f5a9c77579a4e83e7e/readthedocs/search/api.py#L237

This could allow searching across multiple versions as well!
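As a rough sketch of what a client could do with a repeated `project` parameter: the helper `build_search_url` is hypothetical, and the `project:version` form is only the proposal from this comment, not a confirmed API.

```python
from urllib.parse import parse_qs, urlencode


def build_search_url(query, projects, base="/api/v2/search"):
    """Build a search URL that repeats `project` once per requested project."""
    params = [("q", query)] + [("project", p) for p in projects]
    # safe=":" keeps the proposed project:version form readable in the URL.
    return f"{base}?{urlencode(params, safe=':')}"


url = build_search_url("test", ["docs", "dev:latest"])
print(url)
# A server would read the repeated parameter back as a list of values.
print(parse_qs(url.split("?", 1)[1])["project"])
```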

stsewd avatar Jun 16 '22 21:06 stsewd

Another option that we could explore is to do this at the API level, that is, pass an array of projects as an option instead of just one. That won't need users to re-build a project, and we could make it an option in our search-as-you-type extension (also I think we could do the default search override as an extension).

I'm not sure I understand this. Won't this require knowing all the subprojects of a project at build time? How will the hardcoded search URL know the exact list of projects it has to pass? What happens if a new subproject is added after the build of the other subprojects is done?

humitos avatar Jun 20 '22 09:06 humitos

I'm not sure I understand this. Won't this require knowing all the subprojects of a project at build time?

This is a more general solution that allows sharing results across any given set of projects.

But to include all sibling projects, we could do something similar at the API level instead of the build level, like &include-siblings=true and &include-children=true/&include-subprojects=true

Of course, those options will probably be hardcoded at build time to generate the final URL in our extension, but it leaves the door open for adding more options to the search UI, like a checkbox to change those options at runtime.

stsewd avatar Jun 20 '22 17:06 stsewd

I definitely think the API should be more flexible. If we allow people to pass multiple projects, users could also build some kind of collection search concept completely without us having to be involved. I'm definitely 👍 on expanding the API as a first step, and then adding new features on our first-party implementations that have a nice UX around the API extensions.

ericholscher avatar Jun 20 '22 18:06 ericholscher

So, completing my idea, here is what it will look like:

  • Accept a list of projects via the "project" parameter, each project can be in the form of {project_slug} or {project_slug:version_slug}, if the version isn't present, we use the default version of the project.

  • It will include only the results from the versions that were explicitly requested (no subprojects)

  • We will check that the user has permission over each version, like we already do. If the user doesn't have permission over one, we don't include results from that version, but we still return results from the other versions. Why not fail? So users can use one endpoint for all their users, and don't have to worry about what permissions the user has.

    Similarly, if the version doesn't exist, we would skip it instead of returning an error, so search doesn't break just because one project/version was deleted.

    So, how will users know exactly which projects they are seeing results from? The API will return the project/version pairs that we used to do the search.

  • If we are using the same parameter, how do we make it backwards compatible? For the old behavior of returning the project and its subprojects, the version parameter will be required, and one of the values from the project parameter will be taken into consideration (like we already do if you pass several projects).

  • Now, for the cache tags, we will return the full list of projects instead of just one project, that is, project1, project1:version, project2, project2:version instead of project1, project1:version.

  • CORS: Since the request is no longer attached to a single project/version, we can't make the decision if we should enable CORS or not on a given request from the middleware easily, so we won't allow cross site requests when using the new syntax for now (we need to refactor our CORS code, so every view can decide if CORS should be allowed or not).

  • Search analytics: since the request isn't attached to a single project, I think we could just record the same query for each project.

The response from our current API already supports including results from multiple projects, so the only new field that will be added is the field that says what were the projects/versions that we used in the final query.

https://docs.readthedocs.io/en/stable/server-side-search.html#api
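The cache tag rule from the list above (project1, project1:version, project2, project2:version) can be sketched as follows. `cache_tags` is a hypothetical helper, and dropping repeated tags is an assumption of this sketch rather than something the proposal specifies.

```python
def cache_tags(pairs):
    """Return cache tags for every (project, version) pair used in a search.

    Each pair contributes the project slug and the project:version form.
    Repeated tags are dropped (an assumption of this sketch).
    """
    tags = []
    for project, version in pairs:
        for tag in (project, f"{project}:{version}"):
            if tag not in tags:
                tags.append(tag)
    return tags


print(cache_tags([("docs", "latest"), ("dev", "stable")]))
```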

Some examples of a request would be:

  • /api/v2/search?q=search&project=docs&project=dev: Search on the default version of our user and dev docs
  • /api/v2/search?q=search&project=docs:latest&project=dev:latest: Search on the latest version of our user and dev docs
  • /api/v2/search?q=search&project=docs:stable: Search only in our user docs (subprojects won't be included)
  • /api/v2/search?q=search&project=docs&version=latest: This is the current/old form, search only in our user docs (subprojects will be included).
  • /api/v2/search?q=search&project=docs:latest&version=latest: This will 404, since we are using the old form (version is included) and there isn't a project named docs:latest.
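To make the rules behind these examples concrete, here is a minimal sketch of how the `project` values might be resolved. Everything here is an assumption for illustration: `resolve_search_targets` is a hypothetical name, and `known_projects` (slug to default version) stands in for the database lookup and permission check a real view would perform.

```python
def resolve_search_targets(projects, known_projects, version=None):
    """Turn `project` query values into (project, version) pairs.

    `known_projects` maps each existing slug to its default version.
    """
    if not projects:
        return []

    if version is not None:
        # Old form: the value is a literal slug, so "docs:latest" matches no
        # project (the 404 case). Which value wins when several are passed is
        # an arbitrary choice here. Subprojects are not modeled.
        slug = projects[-1]
        return [(slug, version)] if slug in known_projects else []

    targets = []
    for value in projects:
        slug, _, requested = value.partition(":")
        if slug not in known_projects:
            continue  # skip unknown projects instead of failing
        targets.append((slug, requested or known_projects[slug]))
    return targets


known = {"docs": "stable", "dev": "latest"}
print(resolve_search_targets(["docs", "dev"], known))
print(resolve_search_targets(["docs:latest", "gone:latest"], known))
print(resolve_search_targets(["docs:latest"], known, version="latest"))
```

The returned pairs double as the "projects/versions used in the final query" field that the response would expose.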

To decide

  • What about having the old behavior (including all subprojects) using the new syntax? We could add support for this later if needed, or we can introduce an include-subprojects parameter.
  • What about searching on several versions of the same project? This can be supported, but there are some small changes that we need to make in our search code.

I'm putting this in a comment, but let me know if you prefer a design doc.

stsewd avatar Jun 23 '22 00:06 stsewd

I don't have a great understanding of all the usage of our search API, but this description looks great to me 💯. It looks flexible and promising for the UX we can build on top of it. I'm 👍🏼 on it.


@stsewd

Accept a list of projects via the "project" parameter, each project can be in the form of {project_slug} or {project_slug:version_slug}, if the version isn't present, we use the default version of the project.

Is this : separator a common practice on APIs? I didn't find too much information about this, but I see other people using , as well for the same purpose. I think it's fine, I'm just asking in case there is a nicer way to express this.

/api/v2/search?q=search&project=docs:latest&version=latest: This will 404, since we are using the old form (version is included) and there isn't a project named docs:latest.

We could consider returning 400 here as well and explaining the error: "When using project=<slug>:<version> the attribute version cannot be used" or similar.
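A small sketch of that alternative, assuming a hypothetical `validate_search_params` helper that returns a status code and message instead of a real HTTP response:

```python
def validate_search_params(projects, version=None):
    """Return (status_code, message) for a combination of search parameters."""
    if version is not None and any(":" in p for p in projects):
        message = (
            "When using project=<slug>:<version> "
            "the attribute version cannot be used"
        )
        return 400, message
    return 200, "ok"


print(validate_search_params(["docs:latest"], version="latest"))
print(validate_search_params(["docs"], version="latest"))
```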

What about having the old behavior (including all subprojects) using the new syntax? We could add support for this later if needed, or we can introduce a include-subprojects parameter.

I think this parameter would require more thinking. Where would the API perform the search if we pass ?project=superproject1&project=superproject2&include-subprojects=true? Would it search in both and also on the subprojects of each of them?

I'd start implementing what you have described: "search on all the projects received by parameters; no matter if they are subprojects or they are not related at all, just search on them all". This looks simpler and more explicit for API users to me. Also, combined with an API query of /api/v3/projects/<slug>/subprojects/ users can get all the subprojects before creating the final URL.

What about searching on several versions of the same project? This can be supported, but there are some small changes that we need to make in our search code.

I assume this would be something like project=docs:latest&project=docs:stable, right? If that's the case, I think it could be useful at some point. However, I'm not seeing explicit use cases at the moment. Compared with the other features of the proposal, this one does not look so important.

humitos avatar Jun 23 '22 08:06 humitos

Is this : separator a common practice on APIs? I didn't find too much information about this, but I see other people using , as well for the same purpose. I think it's fine, I'm just asking in case there is a nicer way to express this.

Not sure, since this is kind of specific to our platform, the most similar comparison could be a platform where there are namespaces/hierarchies, like GitHub/GitLab user/project. On rtd we have been using the project:version syntax in our logging/repl, and then we used it for our cache tags.

I think you mean , as a separator for several items. For that, we pass an array in the GET parameter.

I think this parameter would require more thinking. Where would the API perform the search if we pass ?project=superproject1&project=superproject2&include-subprojects=true? Would it search in both and also on the subprojects of each of them?

I was thinking this will look for subprojects on all given projects, yes.

stsewd avatar Jun 23 '22 16:06 stsewd

I'm reopening this issue. This feature is still not yet implemented, only the API changes are implemented. We should update the description here with the remaining parts of this implementation. It seems the configuration pieces are still required for this functionality?

agjohnson avatar Dec 05 '22 08:12 agjohnson

@agjohnson no configuration is needed, users just need to search with the project:... syntax, like project:docs project:dev search, that will search in both projects, but we currently aren't using the API v3 as our default indoc-search, we could make it an option for our search extension (like append_query = "project:dev project:docs" to always add that before searching).
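A minimal sketch of the `append_query` idea, assuming the extension simply prefixes the configured string to whatever the reader typed. Both function names are hypothetical, and the parsing shown is a simplification, not how the search backend actually tokenizes queries.

```python
def apply_append_query(user_query, append_query=""):
    """Prefix the author-configured filters to the reader's query."""
    return f"{append_query} {user_query}".strip()


def split_query(raw_query):
    """Separate `project:` filters from the free-text search terms."""
    projects, terms = [], []
    for token in raw_query.split():
        key, sep, value = token.partition(":")
        if key == "project" and sep and value:
            projects.append(value)
        else:
            terms.append(token)
    return projects, " ".join(terms)


full_query = apply_append_query("search", "project:dev project:docs")
print(full_query)
print(split_query(full_query))
```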

stsewd avatar Dec 19 '22 22:12 stsewd

It's possible to do that, but I think what we're talking about with this feature is automatic sibling project search. This was the feature that we lost from our in-doc search. Right now, readers won't know how to configure our search at all, the advanced queries are more a dashboard search UX.

Configuring the query to append search terms seems like a neat idea. This would need to happen outside our Sphinx search extension too, in the standard search override.

agjohnson avatar Dec 19 '22 23:12 agjohnson

IMHO, documentation authors should not define whether or not the reader's query is performed in subprojects/siblings. This should be a reader's decision. The UI has to be clear about "where your query is going to be performed" and also "results should be clear about where they come from".

Imagine that GH search always performed the search over the whole organization because the owner of that organization decided that in advance. I'd say that's bad UX. Clicking in our own in-doc search should show the modal with something similar to this:

Screenshot_2022-12-21_12-44-35

  • In this project
  • In this project and subprojects
  • In this project and siblings
  • In this organization
  • All Read the Docs

@stsewd

users just need to search with the project:... syntax, like project:docs project:dev search, that will search in both projects

Even if users are currently able to search on siblings by themselves, they have to know the slugs of the other projects as well, which is almost impossible, and even if it's possible, pretty hard to type 😄

like append_query = "project:dev project:docs" to always add that before searching

This reduces the "platform integrations" we have been talking about and requires the author to update this setting each time they add a new subproject. Besides, it still assumes a Sphinx extension instead of being tool agnostic.

humitos avatar Dec 21 '22 11:12 humitos

We have the subprojects:{project} query as well.

stsewd avatar Dec 21 '22 14:12 stsewd

Right, but query strings are undiscoverable for normal users. We need a much nicer UX to make those concepts discoverable, which I think the GH example @humitos showed is 💯

ericholscher avatar Dec 21 '22 22:12 ericholscher

This should be a reader's decision. The UI has to be clear about "where your query is going to be performed" and also "results should be clear about where they come from".

This is more specific to GitHub, where the users of GitHub are probably familiar with GitHub's concept of repository/organization/etc.

For our use case, most reader users are not familiar with RTD at all, and probably won't know whether they mean to search subprojects/organization projects/etc. Also, it's hard/impossible for a reader user to guess where content might live in a superproject/subproject relationship.

Imagine that GH search always performed the search over the whole organization because the owner of that organization decided that in advance. I'd say that's bad UX.

Yeah true, it could go either way though. The documentation author probably knows more than the reader about project topology, so can most likely decide this for the user with mostly positive outcomes.

Clicking in our own in-doc search should show the modal with something similar to this

This is a great idea :+1: Perhaps we should apply more UI like this, but also make this UI default configurable. That is, a project author can decide the default mode of search, but users can always override it.

I'm thinking mostly of commercial projects that use nesting. Community likely would indeed benefit from consistency in the search method, as the reader users are generally more technical already.

agjohnson avatar Dec 21 '22 22:12 agjohnson

@agjohnson

This is more specific to GitHub, where the users of GitHub are probably familiar with GitHub's concept of repository/organization/etc. For our use case, most reader users are not familiar with RTD at all, and probably won't know whether they mean to search subprojects/organization projects/etc.

Well, I think communicating "whether they mean to search subprojects/organization projects/etc" is part of the work the UI has to do. Showing these options to the user will make them ask themselves this question and think about it. Definitely, not giving them these options is a worse UX.

That is, a project author can decide the default mode of search, but users can always override it.

👍🏼 on giving authors the ability to override defaults. This is part of the "platform integrations" we are talking about:

  • have sane defaults for the most common use case
  • allow authors/admin/owners to override those defaults
  • allow readers to change them in-place
  • remember readers' decisions (via cookies, profile settings, or similar)

My point here is that I don't want to leave the readers outside the equation. "Authors know more about their project than readers" is not 100% accurate. It may be true, but not accurate in multiple contexts. The experience from the author's perspective is completely different than from the reader's perspective.

humitos avatar Dec 22 '22 12:12 humitos

Well, I think communicating "whether they mean to search subprojects/organization projects/etc" is part of the work the UI has to do.

Yeah a bit, though I'm describing a problem deeper than this too. How is the reader supposed to know that the documentation they are looking for is in a subproject, or why they even need to search in subprojects or sibling projects at all? The reader won't know this, except through trial and error, which doesn't fare well with less/non technical readers.

I'm not sure how to solve this with UI, other than complicating the UX a good deal. But for now, giving authors control of the project search defaults will be the most we can do to help reader users. The documentation authors know more about how to find content in their documentation than we could, and definitely know more than readers do.

Definitely, not giving them these options is a worse UX.

To clarify, I'm not advocating for this option. Surfacing UI to override search defaults is still a good option.

My point here is that I don't want to leave the readers outside the equation.

Yeah, still agreed here :100:

And to keep this UI approachable, I'd probably say that we shouldn't be talking about subprojects/sibling projects at all. These are maintainer terms that readers will be confused by.

What about reducing this to Search related projects, or something similar -- would that be enough? Could we give the project maintainers/authors control over what is considered "related projects"? I think @stsewd suggested this somewhere

agjohnson avatar Jan 03 '23 18:01 agjohnson

What about reducing this to Search related projects, or something similar -- would that be enough? Could we give the project maintainers/authors control over what is considered "related projects"? I think @stsewd suggested this somewhere

I like this, or perhaps search all projects on this domain if we don't want to explain subprojects, but even that is oddly specific. I think related projects is probably fine to start, and would be a nice checkbox above the search.

ericholscher avatar Jan 03 '23 22:01 ericholscher

Making sure we are all on the same page, we want to implement this in our search extension, right? Not in our indoc-search override or dashboard (this could be done with the new templates)

stsewd avatar Jan 23 '23 23:01 stsewd

@stsewd

we want to implement this in our search extension, right? Not in our indoc-search override or dashboard

If I understand correctly what we've discussed, we want one and only one search interface. This means that the UX for indoc-search and dashboard will be exactly the same, with small differences in the UI items presented due to the context (ie. "search related projects" may not appear in dashboard search). There is a long conversation at https://github.com/readthedocs/reports/pull/20#discussion_r1046872899 with more context about why I'm saying this.

That said, I'd say this work should be done in a JavaScript library that we can embed in all the places where we need search 😄

humitos avatar Jan 24 '23 09:01 humitos

Yeah, I would probably implement the scoping options in a larger overhaul of the front end pieces. I'm not yet certain how this will look, given some technical hurdles with joining documentation and dashboard UI and/or reusing the code for this UI.

What exactly would the end solution look like, technically speaking? Are we going to automate appending all of the project:dev project:docs subprojects:docs etc combinations to search API queries?

If so, and if we want to expose this configuration to users in the final solution, perhaps our first step is to add these configuration options and automation on our side.

If our end goal is to eventually have a search dropdown like search this project, search related projects, and search all organization projects, perhaps we're talking about surfacing configuration options for the search related projects scope, and the upgrade path when we do have new UI will be seamless.

Just a thought, I haven't thought too deeply here.

agjohnson avatar Jan 24 '23 18:01 agjohnson

What exactly would the end solution look like, technically speaking? Are we going to automate appending all of the project:dev project:docs subprojects:docs etc combinations to search API queries?

I was thinking of having users define a default query, but this query would be shown in the text box, so it's explicit to users what they are searching for, and they can change that default query if needed.

For example, something like this, the subprojects:docs part would be something the owner of the docs added, and this query is from the user.

[Screenshot: Read the Docs user documentation, 2023-01-25]

And we could have "shortcuts" below the search box, like search this project, search subprojects, search all my projects; when clicking those we would add the proper query parameters to the search box.

We could even expose a @this shortcut, so users don't have to write the slug, that is, subprojects:@this.

perhaps our first step is to add these configuration options and automation on our side.

Are we thinking of an option in our js extension/library, or something more high level like in our config file?

stsewd avatar Jan 25 '23 16:01 stsewd

And we could have "shortcuts" below the search box

+1 on this eventually, but what I'm trying to describe would be a step in the middle. We wouldn't add this UI yet, but we would start by adding the default query configuration option at least.

The goal would be to give RTD users, especially users not using our search extension, the ability to search subprojects by default.

The UX on the search scope options is one I'd like to put off, because we are talking about changing the search UI for all projects, not just projects using the search extension. I'd rather visit this particular change when we're rethinking the flyout search UI instead of changing this in multiple places.

The query addition as text that you're describing really goes a long way here though, so I think the UI additions will just be later polish.

I was thinking of having users define a default query

I think I agree, though might extend this further. The end goal for configuration options is probably something like:

# Control the default operation when user types a query and hits enter
search_scope_default = 'related'
# Control for what is prepended
search_scope_queries = {
    'project': '',
    'related': 'subprojects:@this project:foo-bar',
    'organization': 'organization:foo organization:bar',
}

So, if we add a default query option, we should make sure it aligns with this plan, or we're building in backwards incompatibilities already.

I think that search_scope_default = 'related' is the most important addition here. It unlocks the feature we're describing more immediately, and query customization isn't needed yet if all we need to do is automate prepending subprojects:@this.
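To illustrate how these two options could combine with the @this shortcut, here is a rough sketch. The option names come from the snippet above; `build_scoped_query` and the way @this is substituted are assumptions, not an implemented feature.

```python
search_scope_default = "related"
search_scope_queries = {
    "project": "",
    "related": "subprojects:@this project:foo-bar",
    "organization": "organization:foo organization:bar",
}


def build_scoped_query(user_query, current_project, scope=None):
    """Prepend the query configured for a scope, expanding @this."""
    prefix = search_scope_queries[scope or search_scope_default]
    # @this stands for the slug of the project being browsed.
    prefix = prefix.replace("@this", current_project)
    return f"{prefix} {user_query}".strip()


print(build_scoped_query("install", "docs"))
print(build_scoped_query("install", "docs", scope="project"))
```

With only `search_scope_default = 'related'` exposed at first, the 'related' entry could be a built-in `subprojects:@this`, leaving the queries mapping as a later extension.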

I was thinking of having users define a default query, but this query would be shown in the text box

What does this look like for users not using our extension? I'm mostly concerned about that use case right now as we have a few customers waiting on this change.

Prepending the search works less well in that UI element (we don't want to pollute the default search input box), but we could still sneak the term into the search query on the fly, and show it on the search results page. That is, the in-doc search input does not have subprojects:project, but the search results page would after searching.

We could even expose a @this shortcut, so users don't have to write the slug, that is, subprojects:@this

That's a neat addition :+1: This could come later of course, but I sort of like this pattern from GitHub. We probably have some more too, like version:@latest

Are we thinking of an option in our js extension/library, or something more high level like in our config file?

No idea, I was going to ask you this :laughing:

This needs to be usable from our JS, so it needs to be in the JSON context var block (or API return, but I don't think that use aligns well). So it would either need to be a sphinx configuration (for the short term), or would need to be a .readthedocs.yaml configuration option (?)

Long term, .readthedocs.yaml probably makes the most sense, so we can do this for any build backend type.

agjohnson avatar Jan 25 '23 19:01 agjohnson