openverse
openverse copied to clipboard
Change search query approach to include only available providers
Fixes
Fixes #4076 by @obulat
Description
This PR proposes an alternative solution to the one proposed in the issue: The search controller should limit the results queried to those associated with existing and not hidden providers. IMO, this is a simpler approach that takes advantage of the existing code structure. It fits the "filtered provider" concept in the sense that it queries for valid providers instead of skipping the "excluded." What I don't know is if we want to update the value of the FILTERED_PROVIDERS_CACHE_VERSION variable in this case.
It also has the benefit of making the default EMPTY_QUERY unnecessary for the must condition since we always want to have at least one provider always available (if not, no search is possible at all). The same changes are applied to the related/ endpoint search.
Testing Instructions
- Spin up the API,
just a, and go to http://localhost:50280/v1/images/ to confirm the search is working with the usual providers - Hide one of the image providers via Django admin, e.g. StockSnap.
- Confirm the results for the hidden provider are not returned http://localhost:50280/v1/images/
- Show the results of StockSnap again and delete the
ContentProviderentry for Flickr. - Confirm the results for the deleted provider are not returned http://localhost:50280/v1/images/
Checklist
- [x] My pull request has a descriptive title (not a vague title like Update index.md`).
- [x] My pull request targets the default branch of the repository (
main) or a parent feature branch. - [x] My commit messages follow best practices.
- [x] My code follows the established code style of the repository.
- [x] I added or updated tests for the changes I made (if applicable).
- [ ] I added or updated documentation (if applicable).
- [x] I tried running the project locally and verified that there are no visible errors.
- [ ] I ran the DAG documentation generator (only applicable for catalog PRs).
Developer Certificate of Origin
Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1
Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129
Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.
Developer's Certificate of Origin 1.1
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or
(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or
(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.
(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.
I'm happy to change the terminology and, of course, the clause for the query. I had similar thoughts regarding the names, but it seemed to me that they could fit with the existing environment variables, and I didn't want to give it much thought since it could also be decided in the review 😄
This is back to draft in the meantime.
@sarayourfriend ContentProvider contains sources, and I don't see any issue with the functionalities. I tried hiding Freesound locally, for example, and I requested audio exclusively from it, http://localhost:50280/v1/audio/?source=freesound, and got zero rows as expected (since the source is disabled). Let me know if you or @obulat find a conflict.
The problem with the implementation, if there is one (if sources are in ContentProvider in addition to providers) is we only use the entries in ContentProvider to filter on the provider field of the ES documents, not source, whereas the query parameters refer to source.
To be clear, if this is a problem now, then it was a problem before with the exclusions, but will be worse now because we wouldn't be including all the sources, only those which happen to also be providers.
If ContentProvider really does have sources, not just providers, then my hunch is that we actually have a complex issue here when it comes to providers like smithsonian with many sources. Do we have an individual content provider entry for all smithsonian sources? I didn't think we did (I can check later), but if we don't, and also need to filter in source based on ContentProvider, then there's some kind of strange intersection issue. We need an or in there. "Include documents if provider is in the list of enabled ContentProvider OR if source is in the list of enabled ContentProvider".
Maybe this isn't an issue and I am just confusing myself. But if sources are also in ContentProvider, then there is a big risk to the current implementation that only filters in based on provider.
I had to rebase it with main because the API image wasn't being built correctly. Other than that, the variable in the test is updated ✔️
Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR:
@dhruvkb @stacimc @sarayourfriend This reminder is being automatically generated due to the urgency configuration.
Excluding weekend[^1] days, this PR was ready for review 4 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s)[^2].
@krysal, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
[^1]: Specifically, Saturday and Sunday. [^2]: For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.
I'll re-review this today :+1: