
Parsing sitemaps not working

Open Matb85 opened this issue 4 months ago • 12 comments

Since v0.19, parsing XML no longer works.

Just try to crawl a sitemap:

 lychee https://example.com/sitemap.xml
Image

Matb85 avatar Aug 25 '25 14:08 Matb85

Have you tried the latest version, 0.20.1? It was just released today and was aiming to fix this very issue.

References:

  • https://github.com/lycheeverse/lychee/pull/1816
  • https://github.com/lycheeverse/lychee-action/issues/305

mre avatar Aug 25 '25 14:08 mre

Image

Still not working ;-(

FYI, I've installed lychee with Homebrew (macOS 15.6.1).

Matb85 avatar Aug 25 '25 14:08 Matb85

Huh, no idea yet.

mre avatar Aug 25 '25 19:08 mre

What's strange is that this should be your exact setup: https://github.com/lycheeverse/lychee-action/issues/305#issuecomment-3217163544

@tooomm, do you maybe see the difference?

mre avatar Aug 26 '25 12:08 mre

The difference is that I dump the links extracted from the XML file to a Markdown file and then run the check against that file.

I also provide a local XML file, not a remote one.

@Matb85 Can you check whether dumping the links (instead of checking them) works if you add --dump? And maybe try downloading the sitemap file and providing it locally.

tooomm avatar Aug 26 '25 16:08 tooomm

Image

Still nothing. Try it for yourself.

Matb85 avatar Aug 26 '25 16:08 Matb85

Also, dumping is not working:

Image

Image

Am I doing something wrong?

Matb85 avatar Aug 28 '25 09:08 Matb85

This is weird. I don't know why https://github.com/lycheeverse/lychee/pull/1824 passes.

mre avatar Aug 28 '25 10:08 mre

> Am I doing something wrong?

Yeah, --dump is just a flag. It does not take a file input. So file.xml gets treated as another input, which is not what you want.

Try it without the file.xml.

(That is an unrelated issue, though.)

mre avatar Aug 28 '25 10:08 mre

Same thing unfortunately:

Image

Matb85 avatar Aug 28 '25 20:08 Matb85

> This is weird. I don't know why #1824 passes.

It works only because example.com/sitemap.xml isn't a real sitemap.xml; it returns an HTML page instead.

lychee --verbose --dump https://example.com/sitemap.xml
https://www.iana.org/domains/example (https://example.com/sitemap.xml)

curl https://example.com/sitemap.xml
<!doctype html>
<html>
<head>
    <title>Example Domain</title>

Sites like https://www.samsung.com/sitemap.xml and https://www.youtube.com/sitemaps/sitemap.xml return a sitemap index that lists sub-sitemap.xml files, not individual URLs.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
     <sitemap><loc>https://www.samsung.com/ae/sitemap.xml</loc><lastmod>2018-06-15</lastmod></sitemap>
     <sitemap><loc>https://www.samsung.com/ae_ar/sitemap.xml</loc><lastmod>2018-06-15</lastmod></sitemap>
     <sitemap><loc>https://www.samsung.com/africa_en/sitemap.xml</loc><lastmod>2018-06-15</lastmod></sitemap>
     <sitemap><loc>https://www.samsung.com/africa_fr/sitemap.xml</loc><lastmod>2018-06-15</lastmod></sitemap>
     <sitemap><loc>https://www.samsung.com/africa_pt/sitemap.xml</loc><lastmod>2018-06-15</lastmod></sitemap>

So the core problem is that lychee doesn't follow sitemaps recursively: a sitemap index contains only links to sub-sitemap.xml files, not page URLs, so checking it may return no results.
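To make the distinction concrete, here is a minimal Python sketch of what a recursive lookup could look like. It is not how lychee works internally and is only an illustration: `expand_sitemap`, `local_name`, `direct_locs`, the `fetch` callback, and the `max_depth` limit are all hypothetical names chosen for this example. It follows `<sitemapindex>` entries down to `<urlset>` documents and skips responses that are not well-formed XML, such as the HTML page that example.com returns.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, List


def local_name(tag: str) -> str:
    """Drop the namespace part of a tag such as '{http://...}urlset'."""
    return tag.rsplit("}", 1)[-1]


def direct_locs(root: ET.Element) -> List[str]:
    """Collect <loc> values that sit directly inside each <url>/<sitemap> entry."""
    return [
        child.text.strip()
        for entry in root
        for child in entry
        if local_name(child.tag) == "loc" and child.text
    ]


def expand_sitemap(
    url: str, fetch: Callable[[str], str], max_depth: int = 3
) -> Iterator[str]:
    """Yield page URLs, following sitemap index entries up to max_depth levels.

    `fetch` is a caller-supplied function (an assumption of this sketch)
    that returns the document body for a URL.
    """
    try:
        root = ET.fromstring(fetch(url))
    except ET.ParseError:
        # Not well-formed XML (for example an HTML page): nothing to expand.
        return

    kind = local_name(root.tag)
    if kind == "urlset":
        # A regular sitemap: its <loc> entries are the pages to check.
        yield from direct_locs(root)
    elif kind == "sitemapindex" and max_depth > 0:
        # A sitemap index: its <loc> entries point at further sitemaps.
        for child_url in direct_locs(root):
            yield from expand_sitemap(child_url, fetch, max_depth - 1)
```

The URLs this yields could be written to a Markdown file and checked from there, similar to the workaround described earlier in the thread. The depth limit is there only to keep a badly formed or self-referencing index from recursing forever.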

XmchxUp avatar Sep 11 '25 01:09 XmchxUp

Yeah, so my test in https://github.com/lycheeverse/lychee/pull/1824 is not testing the right thing. Recursion support (#78) would fix this then?

mre avatar Sep 11 '25 09:09 mre