openverse-api
openverse-api copied to clipboard
Proof of concept: Sort all search results by created_on date
This draft PR should not be merged. It's a simple proof of concept to showcase:
- Adding
created_onto the API responses - Sorting all searches by
created_on
It's meant to show how ElasticSearch's built-in sorting functionality works. I would love to see someone expand on this (my python abilities are really limited 😅) in the following ways:
- Make this sorting conditional on a
sort_by=newquery param. - Optional: Add a
sort_dir=asc/descoption to configure the direction of sort - Make sure these query params are undocumented
- Make this feature only accessible to authenticated users
- Bonus: make access to this feature accessible only by a special user permission with a checkbox in the Django admin
- Write tests
Once those items are completed we could test this against production data! We should also research the integrity of the created_on field, and make sure it is a true representation of when a creative work is added to the Openverse catalog and that we never accidentally rewrite that value in production.
API Developer Docs Preview: Ready
https://wordpress.github.io/openverse-api/_preview/916
Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.
You can check the GitHub pages deployment action list to see the current status of the deployments.
When @panchovm returns we can revisit turning this into a formal project with some frontend UI and additional testing.
Marking as blocked until @panchovm is back and is able to work on the UI for this.
@panchovm Just wanted to confirm whether you've seen this PR. I've marked it as "design" to hopefully surface it better for you. I suspect this would need a frontend issue for the design of the feature? @zackkrida do you intend for this PR to get merged eventually or would this be picked up as part of a larger project down the road?
I tried to replace the aspect: design label with needs design but could not find it. I think it's only available on the front-end repo.
I suspect this would need a frontend issue for the design of the feature?
Yes. And I envision a few design iterations before starting its implementation. I am thinking of different solutions that might force the current layout to tweak other areas.
Since this is marked with low priority, I was planning to work on the homepage redesign before arriving at this task. But now that it is blocked, should we revisit the priority?
It's still low priority @panchovm, no need to rush on this. I think the API feature can be completed before the design, but I see it as part of a larger project.
@zackkrida Should this be closed in favour of a more comprehensive planning? This change would be blocked until we have analytics, is my understanding at least, and it comes around for the MSR to review weekly to check this PR until then.
@sarayourfriend, there was a recent sync session where folks discussed this, and there was actually consensus on having someone open and finish this PR on the sooner side. The idea being that it would be useful for two things:
- As a default sort order for provider landing pages on the frontend
- The Openverse Gutenberg integration has some existing code scaffolded for allowing for sort by new which could leverage this as well
The only actual "work" this PR needs is to wrap this in a conditional which checks for a query string:
https://github.com/WordPress/openverse-api/blob/f5b20801bc01a2122a68ec5798ef567c40cff22d/api/catalog/api/controllers/search_controller.py#L387-L389
Sure we'll probably want test(s) as well, and to decide whether to leave this feature documented or undocumented for now.
Folks felt we could defer adding new frontend UI to allow users to sort by new. What do you think about this?
Maybe we can follow the Gutenberg convention of adding unstable_ to the front of query params we aren't yet committed to. If we did that, then I don't have an opinion about whether this PR is finished, I just don't see a strong reason to do it now, especially if the frontend implementation wouldn't be worked on until some future unknown date (presumably when the change could be measured, though maybe people's sentiments on that too has changed).
I verified that in both the catalog and the API, the created_on field is not the date of creation of a particular media. However, the catalog can contain date_taken inside the meta_data column for some providers as seen for some Flickr images. This field isn't always present and being JSON can be named other things.
SELECT meta_data FROM image WHERE identifier = '9dd74c70-6069-4658-a843-01c1b072c6fc';
{"views": "21", "pub_date": "1284481786", "date_taken": "2010-09-11 08:25:09", "license_url": "https://creativecommons.org/licenses/by-nd/2.0/", "raw_license_url": null}
This makes me hesitant to proceed with this PR because created_on (in its current form) is not a useful basis for sorting.
Maybe the framing of "new in Openverse" would be most appropriate for this.
Maybe the framing of "new in Openverse" would be most appropriate for this.
I just want to point out that that is a very different definition than "newer content" which was the original hypothesis. I'm not objecting, just wanted to point out that the basis of this was that newer stuff might be more relevant or higher quality.
If we sort by "new to Openverse" then we'll probably end up serving a lot more from big new providers, like iNaturalist, right?
Again, not objecting or trying to block, just wanting to think through what the implications actually are and make sure it's not being glossed over.
I just want to point out that that is a very different definition than "newer content" which was the original hypothesis. I'm not objecting, just wanted to point out that the basis of this was that newer stuff might be more relevant or higher quality.
If we sort by "new to Openverse" then we'll probably end up serving a lot more from big new providers, like iNaturalist, right?
Again, not objecting or trying to block, just wanting to think through what the implications actually are and make sure it's not being glossed over.
Totally. I think part of the desire for this was use on the following (future) pages:
- Provider view
- Tag view
- Creator view
The first view wouldn't matter so much, but the point you make about potentially weighing towards a particular provider still holds for the latter two views.
Oh wow, I had no idea that was the intention behind this work, apologies. I'd never heard of these other than the tags view.
In any case, the feature is marked as unstable__ so there's no risk in merging it, just if we use it should be aware of the implications of the way we sort and what the intention is (another reason why I don't think using this on the frontend until we have the ability to test it's effect is wise, but just a personal opinion).
Update:
This PR is undrafted and open for review. The additional features have have been implemented:
- This sorting behaviour is conditional, controlled by the
unstable__sort_byparam. unstable__sort_dir(can beasc/desc) controls the direction of the sort.- These params are automatically hidden from the docs as long as they contain the
unstable__prefix. - This feature requires authentication. If an anonymous user sets these flags, it has no effect.
- Adding user-level permission would require a migration, which is overkill at the PoC and testing stage.
- Tests have been added.
This PR also adds documentation and fixes an ordering bug that occurs on main.
Just noting here: I always interpreted this to mean "new to Openverse". If a site adds hundreds of cool new photos of a collection of Helenistic pottery from 30 BCE, I would want that to appear in this filter. "Recently added" is probably the most common name for this type of sorting.
That said, as long as we mark this "unstable" as you've done, @dhruvkb, I'm totally cool with merging this and playing with the results. We should formally make a decision about what this filter means and what value it offers users before we use it anywhere in production.
If 'newly indexed in Openverse' is what this sort was supposed to mean, then this works 100% according to your expectation. Personally I thought new meant 'newly created' or 'newly uploaded at the source' so I was surprised that the sorted results were totally different.
Again, 'recently added' is not the same as 'recently indexed'. If a site added photos of "Helenistic pottery from 30 BCE" in early 1990s and we just added it as a new provider, we'll still get those photos in this category (which is neither "recently created", nor "recently added", just "recently indexed").
Considering they're not already present, creation_date and uploaded_date fields might be something to consider? (cc: @WordPress/openverse-catalog)
Considering they're not already present,
creation_dateanduploaded_datefields might be something to consider? (cc: @WordPress/openverse-catalog)
Definitely, though it seems like it might not be something that's present in all providers and so I wonder if a full-on column makes sense in those cases. Perhaps a way to add it in the MediaStore class that goes into meta_data 🤔
@dhruvkb Do you mind updating the description and testing instructions?
@stacimc done!
Since I'm technically the author, I can't officially approve this, but it looks excellent, @dhruvkb.