JavLibrary scrapers not working
**Scraper name** JavLibrary_python
**Scraper method** Getting the following error when trying to scrape with JavLibrary_python:
error while fragment scraping with scraper JavLibrary_python: could not unmarshal json from script output: EOF
Thank you for reporting this. I suspect this is failing for you because the Python scrapers need a little more setup than other scrapers, as described in the README for this repo (we have not yet found a way to display this in the Stash UI).
But even after you've got that set up, I'm afraid the JavLibrary scraper will still fail because they recently adopted a more aggressive Cloudflare strategy. We have reports from users who used a VPN to get a Japanese IP address, and that seems to work for scraping JavLibrary, but for right now there's no easy way to scrape their site.
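On the EOF error itself: as far as I understand it, Stash feeds a script scraper a JSON fragment on stdin and expects JSON back on stdout, so an EOF usually means the script exited without printing anything (for example, it crashed on a missing Python dependency). Here is a minimal sketch of that contract. `scrape_fragment` is a hypothetical placeholder, not the real JavLibrary logic.

```python
import json
import sys


def scrape_fragment(fragment: dict) -> dict:
    # Hypothetical placeholder: a real scraper would look the scene up here.
    return {"title": fragment.get("title", "")}


def main() -> None:
    raw = sys.stdin.read()
    fragment = json.loads(raw) if raw.strip() else {}
    try:
        result = scrape_fragment(fragment)
    except Exception as exc:
        # Log to stderr so stdout only ever carries the JSON result.
        print(f"scrape failed: {exc}", file=sys.stderr)
        result = {}
    # Always print a JSON document, even when nothing was found.
    print(json.dumps(result))


if __name__ == "__main__":
    main()
```

If you run the real scraper by hand in the same way (pipe a small JSON fragment into it) and see a Python traceback instead of JSON, that traceback is the setup problem to fix first.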
If you set the FlareSolverr URL in the JavLibrary Python file, you can get it going again even with a non-JP VPN. It's a bit more work and you need a Windows machine running FlareSolverr, but I'm getting consistently good results with Stash.
@Net005 did you have to do anything special apart from the standard Flaresolverr setup to get it to work? When I try to use Flaresolverr, I get the following errors:
The Cloudflare 'Verify you are human' button not found on the page. Waiting for title (attempt X): Just a moment... Timeout waiting for selector
It keeps retrying but it can't find the checkbox
> @Net005 did you have to do anything special apart from the standard Flaresolverr setup to get it to work?
I was able to get the scraper working with FlareSolverr in a separate Docker container:
-> The failure to find the checkbox seems to be an issue in FlareSolverr itself. There is a fork that doesn't have this issue, as I found out here: https://github.com/FlareSolverr/FlareSolverr/issues/1380#issuecomment-2390708414 (21hsmw/flaresolverr:nodriver)
-> I had to change line 445 in the .py from `response_html.status_code = json_input['solution']['status']` to `response_html.status_code = responseJson.status_code`, because FlareSolverr always returned "None" in the JSON while the code expected a status code (like 200). Not sure whether this is an issue with the FlareSolverr fork, as it's my first time using FlareSolverr.
https://github.com/stashapp/CommunityScrapers/blob/0f56cb5c43baed3ea353e98217a166eacac2354d/scrapers/JavLibrary_python/JavLibrary_python.py#L445
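A slightly more defensive variation on that one-line change, as a sketch only: prefer the status FlareSolverr reports under `solution.status` when it is a real integer, and fall back to the HTTP status of the FlareSolverr call otherwise. `resolve_status` is a hypothetical helper name, not something in the scraper.

```python
def resolve_status(flaresolverr_json: dict, http_status: int) -> int:
    """Pick a usable status code from a FlareSolverr reply.

    flaresolverr_json: the parsed JSON body returned by FlareSolverr.
    http_status: the HTTP status of the request made to FlareSolverr itself.
    """
    solution = flaresolverr_json.get("solution") or {}
    status = solution.get("status")
    if isinstance(status, int):
        return status
    # Some builds apparently report None here; fall back to the outer status.
    return http_status
```

In the scraper this would be used roughly as `response_html.status_code = resolve_status(json_input, responseJson.status_code)`, assuming those two names mean what they appear to mean at line 445.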
Maybe this will help someone.
(Edited: Fixed wrong information)
Line 445 in which file? docker-compose.yml has like 20 lines.
Sorry, had a brainfart I guess. I meant the .py of the scraper and forgot the link: https://github.com/stashapp/CommunityScrapers/blob/0f56cb5c43baed3ea353e98217a166eacac2354d/scrapers/JavLibrary_python/JavLibrary_python.py#L445
Reproduction steps for my problem: I am running in a Linux container on a Windows 10 host.
docker-compose.yml
# APPNICENAME=Stash
# APPDESCRIPTION=An organizer for your porn, written in Go
services:
  stash:
    image: stashapp/stash:latest
    container_name: stash
    hostname: stash
    restart: unless-stopped
    ## the container's port must be the same with the STASH_PORT in the environment section
    ports:
      - "3003:9999"
    ## If you intend to use stash's DLNA functionality uncomment the below network mode and comment out the above ports section
    # network_mode: host
    logging:
      driver: "json-file"
      options:
        max-file: "10"
        max-size: "2m"
    environment:
      - TZ="Europe/Rome"
      - STASH_STASH=/data/
      - STASH_GENERATED=/generated/
      - STASH_METADATA=/metadata/
      - STASH_CACHE=/cache/
      ## Adjust below to change default port (9999)
      - STASH_PORT=9999
    volumes:
      ## Adjust below paths (the left part) to your liking.
      ## E.g. you can change ./config:/root/.stash to ./stash:/root/.stash
      ## Keep configs, scrapers, and plugins here.
      - ./config:/root/.stash
      ## Point this at your collection.
      - D:/private/media:/data:ro
      ## This is where your stash's metadata lives
      - ./metadata:/metadata
      ## Any other cache content.
      - ./cache:/cache
      ## Where to store binary blob data (scene covers, images)
      - ./blobs:/blobs
      ## Where to store generated content (screenshots,previews,transcodes,sprites)
      - ./generated:/generated
  # ===== FlareSolverr =====
  flaresolverr:
    image: 21hsmw/flaresolverr:fixlooping
    container_name: flaresolverr-stash
    environment:
      - LOG_LEVEL=${LOG_LEVEL:-info}
      - LOG_HTML=${LOG_HTML:-false}
      - CAPTCHA_SOLVER=${CAPTCHA_SOLVER:-none}
      - TZ=Europe/Rome
      - LANG=fr-FR
    restart: unless-stopped
networks:
  default:
    name: caddy_net
    external: true
I installed the JavLibrary_python scraper from the settings, then edited JavLibrary_python.py:
FLARESOLVERR_ENABLED = True
FLARESOLVERR_URL = "http://flaresolverr-stash:8191/v1"
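To check that this URL is actually reachable from inside the Stash container, a small standalone sketch like the following can help. It sends the same kind of body that FlareSolverr logs as an incoming request (`cmd`, `url`, `maxTimeout`). The function names are hypothetical and this is not the scraper's own code.

```python
import json
import urllib.request

# Assumption: the service name from the compose file resolves on the shared network.
FLARESOLVERR_URL = "http://flaresolverr-stash:8191/v1"


def build_payload(url: str, max_timeout_ms: int = 60000) -> dict:
    # Request body for FlareSolverr's "request.get" command.
    return {"cmd": "request.get", "url": url, "maxTimeout": max_timeout_ms}


def fetch_via_flaresolverr(url: str, endpoint: str = FLARESOLVERR_URL) -> dict:
    body = json.dumps(build_payload(url)).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Wait a little longer than maxTimeout so FlareSolverr can answer first.
    # urlopen raises urllib.error.HTTPError on a 500 (e.g. an unsolved challenge).
    with urllib.request.urlopen(request, timeout=90) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_via_flaresolverr("https://www.javlibrary.com/en/vl_searchbyid.php?keyword=BBAN-414")` from a Python shell inside the Stash container should either return a JSON document or raise an error that tells you whether the problem is networking or the challenge itself.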
In the Tagger page I have a file named BBAN-414.mp4. When I query BBAN-414 and press search, I get "The search doesn't return any result."
I tried debugging the script on my own but couldn't get it working. It looked like the script might not be following redirects: when I entered the value of JAV_SEARCH_HTML.url in my browser, it redirected me to the correct entry.
For example:
JAV_SEARCH_HTML.url = https://www.javlibrary.com/en/vl_searchbyid.php?keyword=BBAN-414 redirects to https://www.javlibrary.com/en/?v=javmeebk4e in my browser.
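One way to test the redirect theory, sketched under the assumption that FlareSolverr reports the browser's final address in `solution.url`: compare that value with the URL you asked for. `was_redirected` is a hypothetical helper for debugging, not part of the scraper.

```python
def was_redirected(requested_url: str, flaresolverr_json: dict) -> bool:
    """Return True when the final URL differs from the requested one."""
    solution = flaresolverr_json.get("solution") or {}
    # Fall back to the requested URL when no final URL was reported.
    final_url = solution.get("url") or requested_url
    return final_url.rstrip("/") != requested_url.rstrip("/")


search_url = "https://www.javlibrary.com/en/vl_searchbyid.php?keyword=BBAN-414"
reply = {"solution": {"url": "https://www.javlibrary.com/en/?v=javmeebk4e"}}
print(was_redirected(search_url, reply))  # True: the search landed on an entry page
```

If this prints False for a real reply while the browser does redirect, the page FlareSolverr returned is probably still the challenge or the search page rather than the entry.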
I solved my problem. I had to use 21hsmw/flaresolverr:nodriver as the image instead of fixlooping. No other changes (other than enabling flaresolverr and configuring the url).
As an extra to this, I tried to get it running as well.
With FlareSolverr in Docker I couldn't get it to run at all. I installed FlareSolverr on my Windows PC and routed through that. Now I at least get logs, and FlareSolverr seems to detect the Cloudflare challenge, but it just times out.
2024-10-25 15:11:22 INFO Challenge detected. Title found: Just a moment...
2024-10-25 15:12:24 ERROR Error: Error solving the challenge. Timeout after 60.0 seconds.
2024-10-25 15:12:24 INFO Response in 62.984 s
2024-10-25 15:12:24 INFO 192.168.1.54 POST http://192.168.1.10:8191/v1 500 Internal Server Error
2024-10-25 15:12:29 INFO Incoming request => POST /v1 body: {'cmd': 'request.get', 'url': 'https://www.javlibrary.com/en/vl_searchbyid.php?keyword=SHYN-213', 'maxTimeout': 60000}
2024-10-25 15:12:30 INFO Challenge detected. Title found: Just a moment...
2024-10-25 15:13:32 ERROR Error: Error solving the challenge. Timeout after 60.0 seconds.
2024-10-25 15:13:32 INFO Response in 63.026 s
2024-10-25 15:13:32 INFO 192.168.1.54 POST http://192.168.1.10:8191/v1 500 Internal Server Error
Using this pr:
alexfozor/flaresolverr:pr-1300-experimental
I was able to get around the CF issue.
2024-10-26 00:30:07 INFO Serving on http://0.0.0.0:8191
2024-10-26 00:34:09 INFO Incoming request => POST /v1 body: {'cmd': 'request.get', 'url': 'https://www.javlibrary.com/en/vl_searchbyid.php?keyword=mdvr00324 1', 'maxTimeout': 60000}
2024-10-26 00:34:17 INFO Challenge detected.
2024-10-26 00:34:24 INFO Challenge solved!
2024-10-26 00:34:25 INFO Response in 15.336 s
2024-10-26 00:34:25 INFO 172.17.0.1 POST http://192.168.1.54:8191/v1 200 OK
2024-10-26 00:34:43 INFO Incoming request => POST /v1 body: {'cmd': 'request.get', 'url': 'https://www.javlibrary.com/en/vl_searchbyid.php?keyword=SHYN-213', 'maxTimeout': 60000}
2024-10-26 00:34:46 INFO Challenge detected.
2024-10-26 00:34:49 INFO Challenge solved!
2024-10-26 00:34:50 INFO Response in 6.463 s
2024-10-26 00:34:50 INFO 172.17.0.1 POST http://192.168.1.54:8191/v1 200 OK
It still returns no data for some reason but at least it isn't a CF block anymore. Maybe @Maista6969 can look into why that's happening
I got the same issue. I run Stash in Docker on my NAS. The NAS can reach missav.com, but the MissAV(jp) scraper is not working.
Partially resolved by #2218; will reopen if there are JavLibrary-specific blocks.