
AlgoliaAPI release vs modified date

Open Splash4K opened this issue 9 months ago • 8 comments

Are you scraping a scene, gallery, movie, or performer?

  • Movie

Scrape with URL? If so, what URLs have you tried?

  • https://www.genderxfilms.com/en/movie/Transsexual-Hitchhikers-4/123876
  • https://www.genderxfilms.com/en/movie/Family-Transformation/77977
  • https://www.genderxfilms.com/en/movie/Genderx-Initiations/121227 (Can provide more if needed)

Scraper is scraping incorrect dates.

Splash4K avatar Apr 11 '25 03:04 Splash4K

This scraper uses the Algolia API, which carries different information from what is shown on the page. For dates, the API provides three values: "Created", "Upcoming" and "Last Modified". The Algolia scraper selects the "created" date, which might be the most appropriate, as this is hopefully when the movie was first published or released. The site is probably showing the last modified date.

Discussion may be required to determine the desired path to take.

2025-04-13 13:34:36 Debug   
[Scrape / GenderX Films] Dates available: upcoming 2025-03-13 - created 2024-08-05 - last modified 2025-03-13
2025-04-13 13:34:36 Debug   
[Scrape / GenderX Films] Scraping movie
2025-04-13 13:34:36 Debug   
[Scrape / GenderX Films] URL Scraping: https://www.genderxfilms.com/en/movie/Transsexual-Hitchhikers-4/123876

MortonBridges avatar Apr 13 '25 17:04 MortonBridges

Not a bug; will wait for admin decision and discussion.

feederbox826 avatar Apr 14 '25 02:04 feederbox826

The created date may be when the movie (an empty collection, or perhaps with the first scene) is first published. Then, after the scenes are individually published (often on a weekly basis), the last modified date might be updated to match the latest scene. Not sure about upcoming; possibly that's a forecast date for the final scene's publication.

It would probably be useful to look at a new movie release that doesn't yet have all its scenes published. The scenes are normally listed on the movie page with future dates. Then perhaps the three date values could be observed throughout the scene publishing lifecycle of the movie.

nrg101 avatar Apr 18 '25 00:04 nrg101

I don't think Last Modified can be our answer. The question kind of lies in how they distribute it, i.e. is it considered distributed when the first scene hits the web or the last one, or is it completely separate, with web vs physical releases?

feederbox826 avatar Apr 18 '25 04:04 feederbox826

Ok, let's look at some examples in Algolia

Evil Angel

upcoming movies

https://www.evilangel.com/en/movie/Kimber-James--Chris-Epic/129147

            "date_created": "2025-04-14",
            "nb_of_scenes": 3,
            "last_modified": null,
            "upcoming": "2025-04-24",
  • all 3 scenes have future date 2025-04-24

https://www.evilangel.com/en/movie/Pleasure-Vixens-03/129007

            "date_created": "2025-04-07",
            "nb_of_scenes": 4,
            "last_modified": null,
            "upcoming": "2025-04-24",
  • scene 1: 2025-04-24
  • scene 2: 2025-04-26
  • scene 3: 2025-04-28
  • scene 4: 2025-04-30

so this upcoming movie currently has an upcoming date of the first scene

latest movies

https://www.evilangel.com/en/movie/Crossing-Borders-02/128218

            "date_created": "2025-02-25",
            "nb_of_scenes": 12,
            "last_modified": "2025-04-22",
            "upcoming": "2025-04-22",
  • scene 1: 2025-03-25
  • scene 12: 2025-04-22

here we can see the upcoming date is the date of the last scene

theory

for one of the above upcoming examples, https://www.evilangel.com/en/movie/Pleasure-Vixens-03/129007, it currently has:

            "date_created": "2025-04-07",
            "nb_of_scenes": 4,
            "last_modified": null,
            "upcoming": "2025-04-24",

as the scenes are released on the dates:

  • scene 1: 2025-04-24
  • scene 2: 2025-04-26
  • scene 3: 2025-04-28
  • scene 4: 2025-04-30

the last_modified and upcoming values can be observed and noted.

I would guess that the upcoming is initially the date of the first scene, and then is updated on either:

  • each scene release, or
  • the final scene release to end up being the same value as the date of the final release

Also, I would guess the last_modified to behave as it sounds, with it being initially null when the movie is created (but with all scenes yet to be published), and then updated when each scene is published.

GenderX Films

latest movies

https://www.genderxfilms.com/en/movie/Couples-Loving-Trans-2/126333

            "date_created": "2024-11-18",
            "nb_of_scenes": 4,
            "last_modified": "2025-04-17",
            "upcoming": "2025-04-24",

date shown on page: 2025-04-17

scenes:

  • 1: 2025-04-17
  • 2: 2025-04-24
  • 3: 2025-05-01
  • 4: 2025-05-08

Here we can see that as of today (2025-04-22), only scene 1 has been published (on 2025-04-17), and the last_modified date is, as you would expect, the date of the most recently published scene, 2025-04-17.

The upcoming date value is the date of the next scene to be published, scene 2, which makes sense in a way, in that the movie's next upcoming date is the date of the upcoming publishing of the next scene.

This shows that the upcoming date will:

  • have a starting value of the first scene publishing date
  • be updated to the next scene publishing date
  • end up as the final scene publishing date

https://www.genderxfilms.com/en/movie/Trans-Campers/118787

            "date_created": "2024-01-10",
            "nb_of_scenes": 4,
            "last_modified": "2024-06-13",
            "upcoming": "2024-06-13",

date shown on page: 2024-06-13

  • scene 1: 2024-05-16
  • scene 4: 2024-06-13

Conclusion

Algolia appears to use the date fields date_created, last_modified, and upcoming in a fairly logical way.

API date field usage

date_created: this appears to be when the movie is first added to the API

last_modified: this is initially null, until the first scene is published, at which point the value matches the most recently published scene

upcoming: this is initially the date of the first scene, until the first scene is published, at which point it matches the next scene (unless, obviously, there is no next scene, so it would then just remain matching the final scene)
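
The field behaviour described above can be sketched in Python as a small predictor. This is only an illustration of the theory, not anything taken from the scraper or the API: the function name expected_movie_dates is hypothetical, and it assumes that a scene counts as published on its release date and that the movie has at least one scene.

```python
from datetime import date
from typing import Optional


def expected_movie_dates(scene_dates: list[date], today: date) -> dict[str, Optional[date]]:
    """Predict a movie's last_modified and upcoming values under the theory above.

    Hypothetical helper: assumes a scene is published on its release date
    and that scene_dates is not empty.
    """
    ordered = sorted(scene_dates)
    published = [d for d in ordered if d <= today]
    pending = [d for d in ordered if d > today]
    return {
        # null until the first scene is published, then the latest published scene
        "last_modified": published[-1] if published else None,
        # the next scene to publish, or the final scene once nothing is pending
        "upcoming": pending[0] if pending else ordered[-1],
    }


# Couples Loving Trans 2, as observed on 2025-04-22
scenes = [date(2025, 4, 17), date(2025, 4, 24), date(2025, 5, 1), date(2025, 5, 8)]
predicted = expected_movie_dates(scenes, today=date(2025, 4, 22))
# last_modified 2025-04-17, upcoming 2025-04-24, matching the API values above
```

Running the same function against a movie over its publishing schedule would be one way to check whether the theory holds for other studios.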

page date usage

This appears to use the upcoming date.

When the movie does not yet have any scenes published, this will be the date of the first scene. When the movie is part way published, it will be the date of the next (upcoming, future publish date) scene. When the movie has all scenes published, it will be the date of the final scene

scraping date implications

Personally, I would agree with the page's usage of the upcoming date, as I would consider a movie truly published only when all of its scenes have been published.

This means that when a movie has some scenes in the future, the date will be "in flux" and tracking the next scene, so you would have to be mindful of this and rescrape the movie after the final scene has been published.
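
A rough way to flag movies that need a later rescrape, assuming the field behaviour observed above holds (last_modified is null before the first scene, and only equals upcoming once the final scene is out). The function name movie_date_in_flux is made up for illustration, and treating a missing upcoming value as "in flux" is an assumption, since no such case appears in the examples.

```python
from typing import Optional


def movie_date_in_flux(last_modified: Optional[str], upcoming: Optional[str]) -> bool:
    """Return True if the movie probably still has unpublished scenes.

    Hypothetical helper based on the examples in this thread, not on
    documented API behaviour.
    """
    if last_modified is None:
        # no scene has been published yet
        return True
    # once the final scene is published, both fields hold the same date
    return last_modified != upcoming


crossing_borders_02 = movie_date_in_flux("2025-04-22", "2025-04-22")    # False
couples_loving_trans_2 = movie_date_in_flux("2025-04-17", "2025-04-24")  # True
pleasure_vixens_03 = movie_date_in_flux(None, "2025-04-24")              # True
```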

A possible (rather convoluted) solution would be to determine the scenes of a movie, and look at the final scene's release_date, and use that as the movie's scraped date.

Even though a scene has, e.g.

            "release_date": "2025-04-24",
            "upcoming": 1,
            "movie_id": 126333,
            "movie_title": "Couples Loving Trans 2",
            "movie_desc": "",
            "movie_date_created": "2024-11-18",

The movie was "created" on 2024-11-18, but scene 2 is not even available yet. I would say the movie_date_created (in an API scene, which is the same as an API movie's date_created) is just the date that the movie was added to the API, almost like a placeholder, until the scenes are all published.

You can see in the examples above that a movie is often created in the API (with all scenes with future publishing dates) several months before the scene release schedule begins.

I would say that the Algolia scraper(s) (I currently have one in a branch) should be updated to use upcoming for the movie date. In my opinion it is more closely related to publishing, as it ends up being the date when all the scenes are published and the movie is therefore fully available to watch.

nrg101 avatar Apr 22 '25 14:04 nrg101

Regarding the "convoluted" solution for determining a movie's final publishing date mentioned above, the scenes of a movie can be looked up in the Algolia API like this:

e.g. for https://www.genderxfilms.com/en/movie/Couples-Loving-Trans-2/126333

scenes for movie id 126333:

curl --location 'https://TSMKFA364Q.algolia.net/1/indexes/all_scenes/query' \
--header 'Content-Type: application/json' \
--header 'Accept: application/json' \
--header 'x-algolia-api-key: ****' \
--header 'x-algolia-application-id: ****' \
--data '{
  "params": "hitsPerPage=20&page=0&query=",
  "facetFilters": ["movie_id:126333"]
}'

response (edited for brevity):

{
	"hits": [
		{
			"clip_id": 255847,
			"title": "Couples Loving Trans 2 - Scene 4",
			"release_date": "2025-05-08",
			"upcoming": 1
		},
		{
			"clip_id": 255846,
			"title": "Couples Loving Trans 2 - Scene 3",
			"release_date": "2025-05-01",
			"upcoming": 1
		},
		{
			"clip_id": 255845,
			"title": "Couples Loving Trans 2 - Scene 2",
			"release_date": "2025-04-24",
			"upcoming": 1
		},
		{
			"clip_id": 255844,
			"title": "Couples Loving Trans 2 - Scene 1",
			"release_date": "2025-04-17",
			"upcoming": 0
		}
	],
	"nbHits": 4
}

This list of scenes could easily be processed to find the final scene, e.g. in python:

# ISO 8601 date strings sort chronologically, so max() finds the latest
final_scene_date = max(scene["release_date"] for scene in api_response["hits"])
# 2025-05-08

Because the final scene date is known throughout the movie lifecycle (pre-release, during scene publishing, and after the final scene is published), the final_scene_date value extracted in the example code above could be used as the movie's date. It is the date that will show on the movie's web page once the final scene has been published, and it is available even before any of the scenes have been published.
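
Extending that one-liner a little, here is a sketch that summarises the whole schedule from the hits, since the open question is whether the first or the final scene date should be used. The function name scene_schedule and the returned keys are hypothetical; the sample data is the response shown above, and the code assumes every hit has a release_date in YYYY-MM-DD form.

```python
def scene_schedule(hits: list[dict]) -> dict:
    """Summarise a movie's scene release schedule from all_scenes hits.

    Hypothetical helper: assumes each hit has a "release_date" string in
    YYYY-MM-DD form, which sorts chronologically as plain text.
    """
    dates = sorted(hit["release_date"] for hit in hits)
    pending = [hit for hit in hits if hit.get("upcoming") == 1]
    return {
        "first_scene_date": dates[0],
        "final_scene_date": dates[-1],
        "fully_published": not pending,
    }


# hits from the example response above (fields trimmed)
hits = [
    {"clip_id": 255847, "release_date": "2025-05-08", "upcoming": 1},
    {"clip_id": 255846, "release_date": "2025-05-01", "upcoming": 1},
    {"clip_id": 255845, "release_date": "2025-04-24", "upcoming": 1},
    {"clip_id": 255844, "release_date": "2025-04-17", "upcoming": 0},
]
summary = scene_schedule(hits)
# first_scene_date 2025-04-17, final_scene_date 2025-05-08, fully_published False
```

Note that the query above asks for hitsPerPage=20, so a movie with more than 20 scenes would presumably need further pages before this summary is complete.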

nrg101 avatar Apr 22 '25 14:04 nrg101

my personal opinion:

We should take the earliest date, just for ease. It's the easiest to pull, it will be the most consistent, and it makes the most sense logically when grouping up scenes: instead of going back to front with releases, you can go front to back.

feederbox826 avatar Apr 23 '25 04:04 feederbox826

Moving discussion to Discourse: https://discourse.stashapp.cc/t/algolia-release-modified-dates/1951

feederbox826 avatar Jun 05 '25 04:06 feederbox826