Parsing sitemaps not working
Since v0.19, parsing XML no longer works.
Just try to crawl a sitemap:
lychee https://example.com/sitemap.xml
Have you tried the latest version, 0.20.1? It was just released today and was aiming to fix this very issue.
References:
- https://github.com/lycheeverse/lychee/pull/1816
- https://github.com/lycheeverse/lychee-action/issues/305
Still not working ;-(
FYI, I've installed lychee with Homebrew (macOS 15.6.1).
Huh, no idea yet.
What's strange is that this should be your exact setup: https://github.com/lycheeverse/lychee-action/issues/305#issuecomment-3217163544
@tooomm, do you maybe see the difference?
The difference is that I dump the links extracted from the XML file to a Markdown file, and then run the check against that Markdown file.
I also provide a local xml file, not a remote one.
@Matb85 Can you check whether dumping the links (instead of checking them) works when you add --dump? And maybe try downloading the sitemap and providing the file locally.
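For reference, here is a rough Python sketch of that "extract the links first, then check the resulting Markdown file" workaround. It is not how lychee does it internally; `extract_locs` and `to_markdown` are hypothetical helper names, and it assumes the sitemap uses the standard sitemaps.org namespace and is not gzip-compressed.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol (assumed for the input file).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def extract_locs(xml_bytes: bytes) -> list[str]:
    """Return the text of every <loc> element found in a sitemap document."""
    root = ET.fromstring(xml_bytes)
    locs = []
    for el in root.iter(f"{{{SITEMAP_NS}}}loc"):
        text = (el.text or "").strip()
        if text:
            locs.append(text)
    return locs


def to_markdown(urls: list[str]) -> str:
    """Render the URLs as a Markdown bullet list, one autolink per line."""
    return "".join(f"- <{url}>\n" for url in urls)
```

You would read the downloaded sitemap in binary mode, write the output of `to_markdown(extract_locs(data))` to something like `links.md`, and then run `lychee links.md`. Note that for a sitemap index this lists the sub-sitemap URLs, not the pages themselves.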
Still nothing, try for yourself
Also dumping is not working
Am I doing something wrong?
This is weird. I don't know why https://github.com/lycheeverse/lychee/pull/1824 passes.
> Am I doing something wrong?
Yeah, --dump is just a flag. It does not take a file input. So file.xml gets treated as another input, which is not what you want.
Try it without the file.xml.
(That is an unrelated issue, though.)
Same thing unfortunately:
> This is weird. I don't know why #1824 passes.
It works only because example.com/sitemap.xml isn't a real sitemap.xml; it returns an HTML page instead.
lychee --verbose --dump https://example.com/sitemap.xml
https://www.iana.org/domains/example (https://example.com/sitemap.xml)
curl https://example.com/sitemap.xml
<!doctype html>
<html>
<head>
<title>Example Domain</title>
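To tell these cases apart before blaming the link checker, one could classify the response body by its root element. The following is only a sketch under that assumption; `classify_sitemap` is a hypothetical helper, not part of lychee, and it simply treats anything that fails to parse as XML (such as the HTML page above) as not being a sitemap.

```python
import xml.etree.ElementTree as ET


def classify_sitemap(body: bytes) -> str:
    """Return "urlset", "sitemapindex", or "not-a-sitemap" for a response body."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Typical HTML is not well-formed XML, so it ends up here.
        return "not-a-sitemap"
    # ElementTree tags look like "{namespace}name"; keep only the local name.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name in ("urlset", "sitemapindex"):
        return local_name
    return "not-a-sitemap"
```

A test that uses a URL whose body classifies as "not-a-sitemap" exercises HTML link extraction, not sitemap parsing.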
Sites like https://www.samsung.com/sitemap.xml and https://www.youtube.com/sitemaps/sitemap.xml return a sitemap index that lists sub-sitemap.xml files, not individual URLs.
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>https://www.samsung.com/ae/sitemap.xml</loc><lastmod>2018-06-15</lastmod></sitemap>
<sitemap><loc>https://www.samsung.com/ae_ar/sitemap.xml</loc><lastmod>2018-06-15</lastmod></sitemap>
<sitemap><loc>https://www.samsung.com/africa_en/sitemap.xml</loc><lastmod>2018-06-15</lastmod></sitemap>
<sitemap><loc>https://www.samsung.com/africa_fr/sitemap.xml</loc><lastmod>2018-06-15</lastmod></sitemap>
<sitemap><loc>https://www.samsung.com/africa_pt/sitemap.xml</loc><lastmod>2018-06-15</lastmod></sitemap>
So the gist is that lychee doesn't support recursive lookup, and since a sitemap index contains only links to sub-sitemap.xml files rather than page URLs, it may return no results.
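To illustrate what a recursive lookup over a sitemap index would involve, here is a minimal Python sketch. It is an assumption about one possible approach, not lychee's implementation: `collect_page_urls` and `http_get` are hypothetical names, the depth limit is arbitrary, and gzip-compressed sitemaps are not handled.

```python
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def http_get(url: str) -> bytes:
    """Default fetcher: download the raw body of a URL."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def collect_page_urls(url, fetch=http_get, max_depth=3, _seen=None):
    """Return page URLs from a sitemap, following nested sitemap indexes.

    max_depth is the number of index levels to follow below `url`;
    already visited sitemaps are skipped so cycles cannot loop forever.
    """
    seen = set() if _seen is None else _seen
    if url in seen or max_depth < 0:
        return []
    seen.add(url)

    root = ET.fromstring(fetch(url))
    if root.tag.endswith("}sitemapindex"):
        # A sitemap index lists sub-sitemaps, so recurse into each one.
        pages = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            child = (loc.text or "").strip()
            if child:
                pages.extend(collect_page_urls(child, fetch, max_depth - 1, seen))
        return pages

    # A regular sitemap (<urlset>) lists the page URLs directly.
    return [
        (loc.text or "").strip()
        for loc in root.findall("sm:url/sm:loc", NS)
        if (loc.text or "").strip()
    ]
```

The `fetch` parameter is injectable so the function can be tried against local data without network access, for example by passing `some_dict.__getitem__` where the dictionary maps URLs to XML bytes.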
Yeah, so my test in https://github.com/lycheeverse/lychee/pull/1824 is not testing the right thing. Recursion support (#78) would fix this then?