w3lib
[MRG+1] Added: Removing comments before extracting base URLs. Not a solution to #70, but does help in some cases.
This helps when a page contains a commented-out <base> tag, which happens on several websites. Without this change, the commented-out tag is still picked up as the base URL:
>>> from w3lib import html
>>> html.get_base_url("""<!-- <base href="http://example.com/" /> -->""")
'http://example.com/'
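To make the idea concrete, here is a minimal standalone sketch of "strip comments first, then look for the base tag". It is not w3lib's actual code: the function name `sketch_get_base_url` and both regular expressions are hypothetical, written only for illustration, and the sketch assumes that falling back to the caller-supplied `baseurl` is the desired behavior when no live <base> tag remains.

```python
import re
from urllib.parse import urljoin

# Hypothetical patterns for illustration only; not w3lib's actual regexes.
_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
_BASE_RE = re.compile(
    r"<base\s[^>]*href\s*=\s*[\"']\s*([^\"'\s]+)\s*[\"']",
    re.IGNORECASE,
)


def sketch_get_base_url(text, baseurl=""):
    """Return the <base> URL found in text, ignoring commented-out markup."""
    # Drop HTML comments first so a commented-out <base> tag cannot match.
    text = _COMMENT_RE.sub("", text)
    match = _BASE_RE.search(text)
    if match:
        # Resolve the href against the page URL, in case it is relative.
        return urljoin(baseurl, match.group(1))
    # No live <base> tag: fall back to the URL the caller passed in.
    return baseurl
```

With this sketch, a document containing only a commented-out <base> tag yields the fallback `baseurl` instead of the URL inside the comment, while a live <base> tag elsewhere in the document is still honored.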
Fixes #70 (since the original #70 report is about this scenario; for other scenarios, we should have separate issues)
Current coverage is 94.10% (diff: 100%)
@@             master    #77   diff @@
==========================================
  Files           7        7
  Lines         406      407     +1
  Methods         0        0
  Messages        0        0
  Branches       84       84
==========================================
+ Hits          382      383     +1
  Misses         16       16
  Partials        8        8
Powered by Codecov. Last update 03c28d2...11b5d26
Can you add tests for this? Can you provide example websites showing this issue?
Thanks for the notice, @redapple. A test has been added.
Here's a sample site which triggers this issue: http://planweb01.rother.gov.uk/OcellaWeb/planningSearch
@kmike Could you have a look?
This is related to #70
Bumping to close outdated PR.