openverse-api icon indicating copy to clipboard operation
openverse-api copied to clipboard

Proof of concept: Sort all search results by created_on date

Open zackkrida opened this issue 3 years ago • 2 comments
trafficstars

This draft PR should not be merged. It's a simple proof of concept to showcase:

  • Adding created_on to the API responses
  • Sorting all searches by created_on

It's meant to show how ElasticSearch's built-in sorting functionality works. I would love to see someone expand on this (my python abilities are really limited 😅) in the following ways:

  • Make this sorting conditional on a sort_by=new query param.
  • Optional: Add a sort_dir=asc/desc option to configure the direction of sort
  • Make sure these query params are undocumented
  • Make this feature only accessible to authenticated users
    • Bonus: make access to this feature accessible only by a special user permission with a checkbox in the Django admin
  • Write tests

Once those items are completed we could test this against production data! We should also research the integrity of the created_on field, and make sure it is a true representation of when a creative work is added to the Openverse catalog and that we never accidentally rewrite that value in production.

zackkrida avatar Sep 07 '22 17:09 zackkrida

API Developer Docs Preview: Ready

https://wordpress.github.io/openverse-api/_preview/916

Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

github-actions[bot] avatar Sep 07 '22 17:09 github-actions[bot]

When @panchovm returns we can revisit turning this into a formal project with some frontend UI and additional testing.

zackkrida avatar Oct 06 '22 17:10 zackkrida

Marking as blocked until @panchovm is back and is able to work on the UI for this.

krysal avatar Oct 27 '22 21:10 krysal

@panchovm Just wanted to confirm whether you've seen this PR. I've marked it as "design" to hopefully surface it better for you. I suspect this would need a frontend issue for the design of the feature? @zackkrida do you intend for this PR to get merged eventually or would this be picked up as part of a larger project down the road?

sarayourfriend avatar Nov 10 '22 00:11 sarayourfriend

I tried to replace the aspect: design label with needs design but could not find it. I think it's only available on the front-end repo.

I suspect this would need a frontend issue for the design of the feature?

Yes. And I envision a few design iterations before starting its implementation. I am thinking of different solutions that might force the current layout to tweak other areas.

Since this is marked with low priority, I was planning to work on the homepage redesign before arriving at this task. But now that it is blocked, should we revisit the priority?

fcoveram avatar Nov 10 '22 08:11 fcoveram

It's still low priority @panchovm, no need to rush on this. I think the API feature can be completed before the design, but I see it as part of a larger project.

krysal avatar Nov 10 '22 20:11 krysal

@zackkrida Should this be closed in favour of a more comprehensive planning? This change would be blocked until we have analytics, is my understanding at least, and it comes around for the MSR to review weekly to check this PR until then.

sarayourfriend avatar Jan 26 '23 00:01 sarayourfriend

@sarayourfriend, there was a recent sync session where folks discussed this, and there was actually consensus on having someone open and finish this PR on the sooner side. The idea being that it would be useful for two things:

  • As a default sort order for provider landing pages on the frontend
  • The Openverse Gutenberg integration has some existing code scaffolded for allowing for sort by new which could leverage this as well

The only actual "work" this PR needs is to wrap this in a conditional which checks for a query string:

https://github.com/WordPress/openverse-api/blob/f5b20801bc01a2122a68ec5798ef567c40cff22d/api/catalog/api/controllers/search_controller.py#L387-L389

Sure we'll probably want test(s) as well, and to decide whether to leave this feature documented or undocumented for now.

Folks felt we could defer adding new frontend UI to allow users to sort by new. What do you think about this?

zackkrida avatar Feb 02 '23 18:02 zackkrida

Maybe we can follow the Gutenberg convention of adding unstable_ to the front of query params we aren't yet committed to. If we did that, then I don't have an opinion about whether this PR is finished, I just don't see a strong reason to do it now, especially if the frontend implementation wouldn't be worked on until some future unknown date (presumably when the change could be measured, though maybe people's sentiments on that too has changed).

sarayourfriend avatar Feb 03 '23 01:02 sarayourfriend

I verified that in both the catalog and the API, the created_on field is not the date of creation of a particular media. However, the catalog can contain date_taken inside the meta_data column for some providers as seen for some Flickr images. This field isn't always present and being JSON can be named other things.

SELECT meta_data FROM image WHERE identifier = '9dd74c70-6069-4658-a843-01c1b072c6fc';
{"views": "21", "pub_date": "1284481786", "date_taken": "2010-09-11 08:25:09", "license_url": "https://creativecommons.org/licenses/by-nd/2.0/", "raw_license_url": null}

This makes me hesitant to proceed with this PR because created_on (in its current form) is not a useful basis for sorting.

dhruvkb avatar Feb 09 '23 11:02 dhruvkb

Maybe the framing of "new in Openverse" would be most appropriate for this.

AetherUnbound avatar Feb 09 '23 19:02 AetherUnbound

Maybe the framing of "new in Openverse" would be most appropriate for this.

I just want to point out that that is a very different definition than "newer content" which was the original hypothesis. I'm not objecting, just wanted to point out that the basis of this was that newer stuff might be more relevant or higher quality.

If we sort by "new to Openverse" then we'll probably end up serving a lot more from big new providers, like iNaturalist, right?

Again, not objecting or trying to block, just wanting to think through what the implications actually are and make sure it's not being glossed over.

sarayourfriend avatar Feb 09 '23 21:02 sarayourfriend

I just want to point out that that is a very different definition than "newer content" which was the original hypothesis. I'm not objecting, just wanted to point out that the basis of this was that newer stuff might be more relevant or higher quality.

If we sort by "new to Openverse" then we'll probably end up serving a lot more from big new providers, like iNaturalist, right?

Again, not objecting or trying to block, just wanting to think through what the implications actually are and make sure it's not being glossed over.

Totally. I think part of the desire for this was use on the following (future) pages:

  • Provider view
  • Tag view
  • Creator view

The first view wouldn't matter so much, but the point you make about potentially weighing towards a particular provider still holds for the latter two views.

AetherUnbound avatar Feb 10 '23 00:02 AetherUnbound

Oh wow, I had no idea that was the intention behind this work, apologies. I'd never heard of these other than the tags view.

In any case, the feature is marked as unstable__ so there's no risk in merging it, just if we use it should be aware of the implications of the way we sort and what the intention is (another reason why I don't think using this on the frontend until we have the ability to test it's effect is wise, but just a personal opinion).

sarayourfriend avatar Feb 10 '23 06:02 sarayourfriend

Update:

This PR is undrafted and open for review. The additional features have have been implemented:

  • This sorting behaviour is conditional, controlled by the unstable__sort_by param.
  • unstable__sort_dir (can be asc/desc) controls the direction of the sort.
  • These params are automatically hidden from the docs as long as they contain the unstable__ prefix.
  • This feature requires authentication. If an anonymous user sets these flags, it has no effect.
    • Adding user-level permission would require a migration, which is overkill at the PoC and testing stage.
  • Tests have been added.

This PR also adds documentation and fixes an ordering bug that occurs on main.

dhruvkb avatar Feb 10 '23 10:02 dhruvkb

Just noting here: I always interpreted this to mean "new to Openverse". If a site adds hundreds of cool new photos of a collection of Helenistic pottery from 30 BCE, I would want that to appear in this filter. "Recently added" is probably the most common name for this type of sorting.

That said, as long as we mark this "unstable" as you've done, @dhruvkb, I'm totally cool with merging this and playing with the results. We should formally make a decision about what this filter means and what value it offers users before we use it anywhere in production.

zackkrida avatar Feb 12 '23 16:02 zackkrida

If 'newly indexed in Openverse' is what this sort was supposed to mean, then this works 100% according to your expectation. Personally I thought new meant 'newly created' or 'newly uploaded at the source' so I was surprised that the sorted results were totally different.

Again, 'recently added' is not the same as 'recently indexed'. If a site added photos of "Helenistic pottery from 30 BCE" in early 1990s and we just added it as a new provider, we'll still get those photos in this category (which is neither "recently created", nor "recently added", just "recently indexed").

Considering they're not already present, creation_date and uploaded_date fields might be something to consider? (cc: @WordPress/openverse-catalog)

dhruvkb avatar Feb 13 '23 04:02 dhruvkb

Considering they're not already present, creation_date and uploaded_date fields might be something to consider? (cc: @WordPress/openverse-catalog)

Definitely, though it seems like it might not be something that's present in all providers and so I wonder if a full-on column makes sense in those cases. Perhaps a way to add it in the MediaStore class that goes into meta_data 🤔

AetherUnbound avatar Feb 13 '23 19:02 AetherUnbound

@dhruvkb Do you mind updating the description and testing instructions?

stacimc avatar Feb 13 '23 21:02 stacimc

@stacimc done!

dhruvkb avatar Feb 13 '23 22:02 dhruvkb

Since I'm technically the author, I can't officially approve this, but it looks excellent, @dhruvkb.

zackkrida avatar Feb 15 '23 18:02 zackkrida