
[Bug Report] Scrapers that build a queryURL

Open Maista6969 opened this issue 2 years ago • 0 comments

There are several niggling issues with scrapers that build a queryURL to provide sceneByFragment functionality, and in aggregate they create a poor user experience.

For this example we can use the following scraper:

name: StudioX
sceneByURL:
  - action: scrapeXPath
    url:
      - studiox.com/update
    scraper: sceneScraper
sceneByFragment:
  action: scrapeXPath
  queryURL: "{url}"
  scraper: sceneScraper
xPathScrapers:
  sceneScraper:
    scene:
      Title:
        fixed: Example scraper

The implicit requirements for this scraper mean that a scene needs to:

  • Have a URL saved to the scene
  • That URL needs to be the first URL
  • The URL needs to match the pattern of the sceneByURL action

If the user is not aware of these requirements and fails to meet them, Stash will either show the cryptic error scraper StudioX: Get "%7Burl%7D": unsupported protocol scheme (if the URL is missing) or give a false positive with a green notification that says No scenes found (if the URL does not match, or the matching URL is not the first one).

To reproduce this we can create an empty scene and:

  • select StudioX from the "Scrape with..." dropdown: first confusing error message
  • add (and save) the URL https://example.com to this scene and select StudioX from the "Scrape with..." dropdown: No scenes found
  • add another URL like https://example.com/update/2024 (which matches the pattern, but is not the first URL) and scrape again: No scenes found

Since the queryURL can be built from several fields (checksum, oshash, filename, title and url in queryURLParametersFromScene) the first two error cases would apply to most of these fields.
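As a rough sketch of the templating involved (buildQueryURL and the scene struct below are illustrative, not Stash's actual implementation): placeholders whose fields are empty are left in the string as-is, which is what later surfaces as the unsupported-protocol error.

```go
package main

import (
	"fmt"
	"strings"
)

// A hypothetical scene with only some fields populated.
type scene struct {
	Title string
	URL   string
}

// buildQueryURL is a simplified stand-in for queryURL templating:
// only placeholders for non-empty fields are replaced, so a missing
// field leaves its placeholder behind in the result.
func buildQueryURL(template string, s scene) string {
	var pairs []string
	if s.Title != "" {
		pairs = append(pairs, "{title}", s.Title)
	}
	if s.URL != "" {
		pairs = append(pairs, "{url}", s.URL)
	}
	return strings.NewReplacer(pairs...).Replace(template)
}

func main() {
	s := scene{Title: "Example"} // no URL saved on the scene
	fmt.Println(buildQueryURL("{url}", s)) // prints {url}
}
```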

Expected behavior

If constructing an appropriate queryURL fails, I would expect a more specific error message to help the user solve the problem: "Scraping this requires that X, Y, Z fields be filled" or something to this effect.

If a scene has multiple values for a field (most importantly URLs), then I'd expect the scraper to try all of them until one works (or simply matches a pattern in the sceneScraper), or to return an error message like "No matching URLs found for this scraper".
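One way that fallback could look (matchesScraperURL here is a stand-in for checking against the scraper's configured sceneByURL patterns, not Stash's real matcher):

```go
package main

import (
	"fmt"
	"strings"
)

// matchesScraperURL is a hypothetical matcher standing in for the
// scraper's sceneByURL patterns (here: any URL containing
// "studiox.com/update").
func matchesScraperURL(u string) bool {
	return strings.Contains(u, "studiox.com/update")
}

// firstMatchingURL tries every URL saved on the scene instead of only
// the first one, and returns a clear error when none matches.
func firstMatchingURL(urls []string) (string, error) {
	for _, u := range urls {
		if matchesScraperURL(u) {
			return u, nil
		}
	}
	return "", fmt.Errorf("no matching URLs found for this scraper")
}

func main() {
	urls := []string{"https://example.com", "https://studiox.com/update/2024"}
	u, err := firstMatchingURL(urls)
	fmt.Println(u, err) // the second URL matches even though it is not first
}
```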

Additional context

I looked at the code for scraping scenes in scraper/xpath.go, and while I'm not familiar enough with the scraper codebase to know where a fix should be applied (this would affect JSON scrapers as well; should it be pulled up a level to avoid duplication?), something like this might be a start:

// Needs "regexp", "strings", and "fmt" imported.
if !s.config.matchesURL(url, ScrapeContentTypeScene) {
	// Look for any placeholders like {url} that were never replaced.
	re := regexp.MustCompile(`\{([^}]+)\}`)
	remainingPlaceholders := re.FindAllString(url, -1)
	if len(remainingPlaceholders) == 0 {
		// Everything was replaced; the URL just doesn't match the scraper.
		return nil, fmt.Errorf("url doesn't match scraper: %s", url)
	}
	// Strip the surrounding braces so the error names the missing fields.
	missingReplacements := make([]string, len(remainingPlaceholders))
	for i, v := range remainingPlaceholders {
		missingReplacements[i] = v[1 : len(v)-1]
	}
	errMsg := strings.Join(missingReplacements, ", ")
	return nil, fmt.Errorf("missing fields: %s", errMsg)
}
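For reference, the same diagnostic as a self-contained, runnable function (the name missingFields is illustrative; this uses the capture group instead of slicing off the braces, which behaves identically for this pattern):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// missingFields reports which queryURL placeholders were never
// replaced, e.g. "{url}/{title}" -> ["url", "title"].
func missingFields(queryURL string) []string {
	re := regexp.MustCompile(`\{([^}]+)\}`)
	matches := re.FindAllStringSubmatch(queryURL, -1)
	fields := make([]string, len(matches))
	for i, m := range matches {
		fields[i] = m[1] // capture group without the braces
	}
	return fields
}

func main() {
	fmt.Println(strings.Join(missingFields("{url}/{title}"), ", ")) // url, title
}
```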

Maista6969 · Jan 02 '24 06:01