Error On Run
WebComix Version: 3.11.1
OS: Windows 10 Enterprise Edition (10.0.19044.2006) (x64)
Python: 3.9.5
I'm trying to download a custom comic. The XPaths are correct and work in scrapy shell.
The error message is very unhelpful.
The comic in question is NSFW so I'm not comfortable putting the command line argument here. I will try some other comics and see if I get the same result though.
Update: Unfortunately, it worked fine on the non-NSFW site.
Traceback (most recent call last):
  File "C:\Python\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Python\Scripts\webcomix.exe\__main__.py", line 7, in <module>
  File "C:\Python\lib\site-packages\click\core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "C:\Python\lib\site-packages\click\core.py", line 1049, in main
    args = _expand_args(args)
  File "C:\Python\lib\site-packages\click\utils.py", line 572, in _expand_args
    matches = glob(arg, recursive=glob_recursive)
  File "C:\Python\lib\glob.py", line 21, in glob
    return list(iglob(pathname, recursive=recursive))
  File "C:\Python\lib\glob.py", line 73, in _iglob
    for dirname in dirs:
  File "C:\Python\lib\glob.py", line 74, in _iglob
    for name in glob_in_dir(dirname, basename, dironly):
  File "C:\Python\lib\glob.py", line 85, in _glob1
    return fnmatch.filter(names, pattern)
  File "C:\Python\lib\fnmatch.py", line 58, in filter
    match = _compile_pattern(pat)
  File "C:\Python\lib\fnmatch.py", line 52, in _compile_pattern
    return re.compile(res).match
  File "C:\Python\lib\re.py", line 252, in compile
    return _compile(pattern, flags)
  File "C:\Python\lib\re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Python\lib\sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Python\lib\sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Python\lib\sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "C:\Python\lib\sre_parse.py", line 834, in _parse
    p = _parse_sub(source, state, sub_verbose, nested + 1)
  File "C:\Python\lib\sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "C:\Python\lib\sre_parse.py", line 598, in _parse
    raise source.error(msg, len(this) + 1 + len(that))
re.error: bad character range v-b at position 22
If I look at the beginning of the traceback, I can see that the error comes from click, which is the library used to create the CLI. I think the issue comes from this library not being able to parse the command as a whole into its arguments properly.
Make sure you enclose URLs and XPaths in double quotes and that there aren't any double quotes in the XPath itself (you can either escape them or use single quotes, which should also work).
Hmm that's odd, no double quotes.
Here are the XPATHs:
--next-page-xpath="//a[@class='comic-nav-base comic-nav-next']/@href" --image-xpath="//div[@id='comic']//img/@src"
They work in scrapy shell.
I'll try quoting the --start-url. You didn't quote it in the examples, so it didn't even occur to me. If that fixes it, it will then be a documentation issue ;)
Update: Nope, same error.
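For what it's worth, the position in the traceback seems to point at the bracketed predicate in the --next-page-xpath value rather than at the quoting. Below is a minimal sketch of that reading. It assumes, based on the glob and fnmatch frames in the traceback, that the path segment a[@class='comic-nav-base comic-nav-next'] ends up wrapped as (?s:...)\Z before it reaches re.compile. The names segment, regex_text and outcome are made up for this illustration, and the regex string is built by hand, not captured from the failing run.

```python
import re

# Hand-built stand-in for the regex that fnmatch appears to pass to
# re.compile: the glob segment wrapped as (?s:...)\Z. This is an
# assumption inferred from the traceback, not output from webcomix.
segment = "a[@class='comic-nav-base comic-nav-next']"
regex_text = "(?s:" + segment + r")\Z"

# Inside [...] the regex parser reads "v-b" as a character range, and a
# range whose start sorts after its end is rejected.
try:
    re.compile(regex_text)
    outcome = "compiled"
except re.error as exc:
    outcome = str(exc)

print(outcome)  # bad character range v-b at position 22
```

If this is what is happening, quoting the arguments differently would not be expected to change anything, since the brackets and hyphens are part of the XPath itself.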
I have a second case of this. This time it's a (relatively) SFW comic.
Command Line
webcomix custom zoophobia --start-url="https://zoophobia-comic.tumblr.com/post/127351123949" --next-page-xpath="//a[@class='next-button']/@href" --image-xpath="//figure[@class='photo-hires-item correct']//img/@src" --cbz
Notes
- In this case there are two "next" buttons but they are duplicates of each other.
- The error isn't 100% the same but close.
Scrapy Shell
scrapy shell https://zoophobia-comic.tumblr.com/post/127351123949
>>> response.xpath("//a[@class='next-button']/@href").get()
'https://zoophobia-comic.tumblr.com/post/127351131639'
>>> response.xpath("//figure[@class='photo-hires-item correct']//img/@src").get()
'https://64.media.tumblr.com/3a343a2dd3226b62b5ba286702b8949e/tumblr_ntidtra2XG1udrxz7o1_1280.png'
Error
Traceback (most recent call last):
  File "C:\Python\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Python\Scripts\webcomix.exe\__main__.py", line 7, in <module>
  File "C:\Python\lib\site-packages\click\core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "C:\Python\lib\site-packages\click\core.py", line 1049, in main
    args = _expand_args(args)
  File "C:\Python\lib\site-packages\click\utils.py", line 572, in _expand_args
    matches = glob(arg, recursive=glob_recursive)
  File "C:\Python\lib\glob.py", line 21, in glob
    return list(iglob(pathname, recursive=recursive))
  File "C:\Python\lib\glob.py", line 73, in _iglob
    for dirname in dirs:
  File "C:\Python\lib\glob.py", line 74, in _iglob
    for name in glob_in_dir(dirname, basename, dironly):
  File "C:\Python\lib\glob.py", line 85, in _glob1
    return fnmatch.filter(names, pattern)
  File "C:\Python\lib\fnmatch.py", line 58, in filter
    match = _compile_pattern(pat)
  File "C:\Python\lib\fnmatch.py", line 52, in _compile_pattern
    return re.compile(res).match
  File "C:\Python\lib\re.py", line 252, in compile
    return _compile(pattern, flags)
  File "C:\Python\lib\re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Python\lib\sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Python\lib\sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Python\lib\sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "C:\Python\lib\sre_parse.py", line 834, in _parse
    p = _parse_sub(source, state, sub_verbose, nested + 1)
  File "C:\Python\lib\sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "C:\Python\lib\sre_parse.py", line 598, in _parse
    raise source.error(msg, len(this) + 1 + len(that))
re.error: bad character range t-b at position 17
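The different range and position here (t-b at 17 versus v-b at 22 in the first report) look consistent with a single cause: each message seems to name the first "x-y" pair inside a [...] predicate where x sorts after y. A rough sketch of a checker for that follows. Note that descending_ranges is a hypothetical helper written for this thread, and it only approximates how a regex character class reads ranges (it ignores escapes and negation).

```python
import re

def descending_ranges(arg):
    """Return 'x-y' pairs inside [...] groups where x sorts after y.

    Hypothetical helper: a simplified approximation of how a regex
    character class reads ranges, not code taken from click or webcomix.
    """
    found = []
    for body in re.findall(r"\[([^\]]+)\]", arg):
        i = 0
        while i + 2 < len(body):
            lo, dash, hi = body[i], body[i + 1], body[i + 2]
            if dash == "-":
                if lo > hi:
                    found.append(f"{lo}-{hi}")
                i += 3  # a range consumes all three characters
            else:
                i += 1
    return found

print(descending_ranges("--next-page-xpath=//a[@class='next-button']/@href"))
# ['t-b']
print(descending_ranges("--image-xpath=//figure[@class='photo-hires-item correct']//img/@src"))
# ['o-h', 's-i']
print(descending_ranges("--image-xpath=//div[@id='comic']//img/@src"))
# []
```

If this reading is right, the --image-xpath value in this command would trip the same check, and the error presumably surfaces for --next-page-xpath first only because it comes earlier on the command line. It would also fit the earlier observation that a comic whose XPaths have no hyphenated class names inside brackets downloads fine.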
I tried to reproduce this issue on Linux but couldn't. I assume it is related to click and its handling of regular expressions on Windows, since I've found similar reports in its issue tracker. One way to work around it is to use a Unix system (through dual-boot or Docker) to download the images.
While testing for this, I found another issue which is a bit puzzling: The tumblr example you gave me doesn't give me the same view in my spider vs. in scrapy shell. I'll investigate this issue further when I have some time.
After exploring the second issue a bit more, this issue seems related to the usage of a fake useragent, since not having it solves the issue. I'll do a PR to test both settings at some point.
UPDATE: I was able to rip a comic using WSL (Windows Subsystem for Linux), so there do seem to be some issues with the Windows port of the XPath parser (and possibly other bits).
For some reason, the comic doesn't work anymore on my end, whether with or without the user agent. With that said, I'll see what I can do to fix the issue related to Windows.
The newest release, 3.11.3, should help solve the issue you were having on Windows. If not, I'll try to investigate it a bit further.
@LeeThompson I've also been testing it on my own Windows installation and I haven't been able to reproduce your issue using the same Python and webcomix version 🤔