
HtmlCollectionScraper gives 500 error

Open OmnipotentEntity opened this issue 1 year ago • 5 comments

I get the following error:

{"message":"Uncaught PHP Exception ArgumentCountError: \"Too few arguments to function App\\Service\\Scraper\\HtmlScraper::extract(), 3 passed in /var/www/koillection/src/Service/Scraper/HtmlCollectionScraper.php on line 18 and exactly 4 expected\" at HtmlScraper.php line 48","context":{"exception":{"class":"ArgumentCountError","message":"Too few arguments to function App\\Service\\Scraper\\HtmlScraper::extract(), 3 passed in /var/www/koillection/src/Service/Scraper/HtmlCollectionScraper.php on line 18 and exactly 4 expected","code":0,"file":"/var/www/koillection/src/Service/Scraper/HtmlScraper.php:48"}},"level":500,"level_name":"CRITICAL","channel":"request","datetime":"2024-12-27T03:27:24.772326-06:00","extra":{}}

I was able to track it down to commit 432f476ea, which seems to have added image scraping. That required an API change, but the change wasn't applied to the calls in HtmlCollectionScraper.php on line 18 (the line named in the error) and also line 22.

It seems like the $scraping variable is available in this context, so the fix might be as simple as passing it as the 4th argument in both locations. However, I'm not familiar enough with the project to feel confident creating a PR.
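To illustrate the failure mode reported in the log, here is a minimal standalone sketch of what happens in PHP when a method gains a required parameter and an older call site still passes the previous number of arguments. `DemoScraper` and its parameter types are hypothetical stand-ins; the real signature of `HtmlScraper::extract()` is not shown in this thread, so this only mirrors the "3 passed ... exactly 4 expected" situation.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the scraper service; not Koillection's real class.
final class DemoScraper
{
    // A fourth required parameter exists, as the error message implies.
    public function extract(string $path, string $type, array $crawler, object $scraper): string
    {
        return sprintf('%s (%s)', $path, $type);
    }
}

$demo = new DemoScraper();

try {
    // Old-style call site: only three arguments are passed.
    $demo->extract('//h1', 'text', []);
    $outcome = 'no error';
} catch (ArgumentCountError $e) {
    // PHP throws ArgumentCountError when a required argument is missing.
    $outcome = 'ArgumentCountError';
}

// Updated call site: the extra argument is supplied and the call succeeds.
$fixed = $demo->extract('//h1', 'text', [], new stdClass());
```

Which object belongs in the fourth position in the real code depends on the actual signature introduced by the commit, so the call sites should be checked against it rather than against this sketch.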

Thank you for your hard work!

OmnipotentEntity avatar Dec 27 '24 09:12 OmnipotentEntity

I modified these files in place and restarted the service, and I now get the following new error, which seems to be related to the image not being scraped properly. This is probably because I only barely attempted to understand what's going on here, and there are probably a few other changes needed to match the referenced commit.

The new error is:

{"message":"Warning: file_get_contents(): SSL operation failed with code 1. OpenSSL Error messages:\nerror:0A000086:SSL routines::certificate verify failed","context":{"exception":{"class":"ErrorException","message":"Warning: file_get_contents(): SSL operation failed with code 1. OpenSSL Error messages:\nerror:0A000086:SSL routines::certificate verify failed","code":0,"file":"/var/www/koillection/src/Service/Scraper/HtmlCollectionScraper.php:23"}},"level":400,"level_name":"ERROR","channel":"php","datetime":"2024-12-27T03:51:51.994482-06:00","extra":{}}
{"message":"Warning: file_get_contents(): Failed to enable crypto","context":{"exception":{"class":"ErrorException","message":"Warning: file_get_contents(): Failed to enable crypto","code":0,"file":"/var/www/koillection/src/Service/Scraper/HtmlCollectionScraper.php:23"}},"level":400,"level_name":"ERROR","channel":"php","datetime":"2024-12-27T03:51:51.994630-06:00","extra":{}}
{"message":"Warning: file_get_contents(https://s4.anilist.co/file/anilistcdn/media/manga/cover/large/bx30703-iRLjKRnSwCFP.jpg): Failed to open stream: operation failed","context":{"exception":{"class":"ErrorException","message":"Warning: file_get_contents(https://s4.anilist.co/file/anilistcdn/media/manga/cover/large/bx30703-iRLjKRnSwCFP.jpg): Failed to open stream: operation failed","code":0,"file":"/var/www/koillection/src/Service/Scraper/HtmlCollectionScraper.php:23"}},"level":400,"level_name":"ERROR","channel":"php","datetime":"2024-12-27T03:51:51.994706-06:00","extra":{}}
{"message":"Uncaught PHP Exception TypeError: \"base64_encode(): Argument #1 ($string) must be of type string, false given\" at HtmlCollectionScraper.php line 23","context":{"exception":{"class":"TypeError","message":"base64_encode(): Argument #1 ($string) must be of type string, false given","code":0,"file":"/var/www/koillection/src/Service/Scraper/HtmlCollectionScraper.php:23"}},"level":500,"level_name":"CRITICAL","channel":"request","datetime":"2024-12-27T03:51:51.994961-06:00","extra":{}}
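The last entry in this log follows from the earlier ones: `file_get_contents()` returns `false` when the download fails, and `base64_encode()` then rejects that value with a `TypeError`. A minimal sketch of guarding against this is shown below. The function name `imageToDataUri` and the injectable `$fetch` callable are assumptions made so the example runs without network access; this is not Koillection's code, and it does not address the underlying certificate verification failure.

```php
<?php
declare(strict_types=1);

/**
 * Builds a data URI from an image URL, or returns null when the download fails.
 * $fetch is a hypothetical injection point so the sketch can run offline;
 * by default it wraps file_get_contents().
 */
function imageToDataUri(?string $url, ?callable $fetch = null): ?string
{
    if ($url === null || $url === '') {
        return null;
    }

    // @ keeps the sketch quiet; the false return value is checked instead.
    $fetch ??= static fn (string $u) => @file_get_contents($u);
    $bytes = $fetch($url);

    if ($bytes === false) {
        return null; // download failed (for example, a TLS verification error)
    }

    return 'data:image/png;base64,' . base64_encode($bytes);
}

// Simulated failure and success, no network access needed.
$failed = imageToDataUri('https://example.invalid/cover.jpg', static fn (string $u) => false);
$ok = imageToDataUri('https://example.invalid/cover.jpg', static fn (string $u) => 'abc');
```

With a guard like this, a failed download would leave the image empty instead of turning the whole request into a 500 error.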

For completeness' sake, here is my scraper:

Name: Anilist - Manga Series
Url Pattern: https://anilist.co/manga/
Name Path: #//div[@class="type"][text()="English"]/following-sibling::div/text()#
Image Path: #//img[@class="cover"]/@src#
Volume Count: (Text) #//div[@class="type"][text()="Volumes"]/following-sibling::div/text()#
Status: (Text) #//div[@class="type"][text()="Status"]/following-sibling::div/text()#
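As a rough way to check paths like these outside the application, the XPath expressions can be evaluated with PHP's built-in DOM extension. The sketch below assumes the surrounding `#` characters are Koillection's template delimiters and drops them; the sample markup is invented for illustration and may not match the real page, which might also build its content with JavaScript.

```php
<?php
declare(strict_types=1);

// Invented sample markup; the real page structure may differ.
$html = <<<'HTML'
<html><body>
  <img class="cover" src="https://example.invalid/cover.jpg">
  <div class="data-set"><div class="type">Volumes</div><div class="value">12</div></div>
  <div class="data-set"><div class="type">Status</div><div class="value">Finished</div></div>
</body></html>
HTML;

$doc = new DOMDocument();
// Suppress warnings libxml may raise for imperfect HTML.
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// string() returns the string value of the first node the expression matches.
$volumes = $xpath->evaluate('string(//div[@class="type"][text()="Volumes"]/following-sibling::div/text())');
$status = $xpath->evaluate('string(//div[@class="type"][text()="Status"]/following-sibling::div/text())');
$cover = $xpath->evaluate('string(//img[@class="cover"]/@src)');
```

If an expression returns an empty string against the HTML the server actually sends, the element is probably missing from the raw response, which would point to a rendering issue rather than a wrong path.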

OmnipotentEntity avatar Dec 27 '24 09:12 OmnipotentEntity

With this patch the scrape finishes successfully, but the thumbnail isn't scraped properly, so it's not a full solution yet.

--- HtmlCollectionScraper.php.old       2024-12-27 09:49:20.107123727 +0000
+++ HtmlCollectionScraper.php.new       2024-12-27 19:36:08.045680868 +0000
@@ -15,12 +15,12 @@
         $crawler = $this->getCrawler($scraping);
         $scraper = $scraping->getScraper();
 
-        $image = $scraping->getScrapImage() ? $this->extract($scraper->getImagePath(), DatumTypeEnum::TYPE_TEXT, $crawler) : null;
+        $image = $scraping->getScrapImage() ? $this->extract($scraper->getImagePath(), DatumTypeEnum::TYPE_TEXT, $crawler, $scraper) : null;
         $image = $this->guessHost($image, $scraping);
 
         return [
-            'name' => $scraping->getScrapName() ? $this->extract($scraper->getNamePath(), DatumTypeEnum::TYPE_TEXT, $crawler) : null,
-            'base64Image' => 'data:image/png;base64,' . base64_encode(file_get_contents($image)),
+            'name' => $scraping->getScrapName() ? $this->extract($scraper->getNamePath(), DatumTypeEnum::TYPE_TEXT, $crawler, $scraper) : null,
+            'image' => $image,
             'data' => $this->scrapData($scraping, $crawler, ScraperTypeEnum::TYPE_COLLECTION),
             'scrapedUrl' => $scraping->getUrl()
         ];

OmnipotentEntity avatar Dec 27 '24 19:12 OmnipotentEntity

I had a quick look today and did a quick fix, but as you noticed, the image can't be properly scraped.

I'm looking into new ways to scrape URLs, like the method suggested here: https://github.com/benjaminjonard/koillection/discussions/1263. While it works better than the current implementation, I still can't make it work with your example. The website returns a blank page saying JavaScript is required.

I may have another solution (https://github.com/symfony/panther), but I'm having a hard time making it work with Docker.

It's going to take some time, but I hope I can push a better implementation of the scraper in the next release.

benjaminjonard avatar Dec 27 '24 20:12 benjaminjonard

That's interesting, because the same scraper seems to work as an Item scraper rather than a collection scraper. Unless something changed with the website overnight (which is possible.)

OmnipotentEntity avatar Dec 27 '24 21:12 OmnipotentEntity

I've only tried the Wish scraper, and it gives the same error. I tried the patch, but it didn't solve my problem.

TaylanTatli avatar Feb 25 '25 16:02 TaylanTatli