
Trouble with sitemap.xml scans

Open mgifford opened this issue 2 years ago • 2 comments

I tried 3 sitemap scans on Drupal sites. I copied the URLs over from the browser, so I knew they would work. I got 2 different responses. The first one, a single XML file, ran but never stopped:

% npx evaluatory https://www.example1.gov/sitemap.xml --sitemap
ℹ Evaluating 1 URL
ℹ Clearing output folder: /Users/mgifford/Downloads/evaluatory-results
ℹ Running [base] for https://www.example1.gov/sitemap.xml
ℹ Running [axe-core] for https://www.example1.gov/sitemap.xml
...

I left it overnight and then killed it in the morning. Should I see signs of progress? I would have thought I'd see a progression of URLs as it marched through the sitemap.

The other two were multi-page XML files.

mgifford@Mikes-Mac-Studio Downloads % npx evaluatory --sitemap https://www.example2.gov/sitemap.xml
ℹ Adding 0 URLs from the sitemap.
✖ Error: Specify URLs to evaluate.
    at main (/Users/mgifford/.nvm/versions/node/v15.2.0/lib/node_modules/evaluatory/bin/evaluatory.js:39:11)
    at processTicksAndRejections (node:internal/process/task_queues:93:5)

mgifford@Mikes-Mac-Studio Downloads % https://beta.example3.gov/sitemap.xml
zsh: no such file or directory: https://beta.example3.gov/sitemap.xml
mgifford@Mikes-Mac-Studio Downloads % npx evaluatory --sitemap https://beta.example3.gov/sitemap.xml
ℹ Adding 0 URLs from the sitemap.
✖ Error: Specify URLs to evaluate.
    at main (/Users/mgifford/.nvm/versions/node/v15.2.0/lib/node_modules/evaluatory/bin/evaluatory.js:39:11)
    at processTicksAndRejections (node:internal/process/task_queues:93:5)

Are multi-page scans supported?

mgifford, Nov 24 '22 14:11

Issue 1 - Wrong command

npx evaluatory https://www.example1.gov/sitemap.xml --sitemap

This command tells Evaluatory to check the accessibility of the "https://www.example1.gov/sitemap.xml" page (so the XML page itself). That's why it states "Evaluating 1 URL". You need to swap --sitemap and the URL so that the URL is passed as the value of --sitemap. I think it would be a small improvement if Evaluatory caught such errors (i.e. a missing --sitemap value).

Issue 2 - Evaluatory freezes

I can reproduce it with the first GOV sitemap you sent me. In short, it is a consequence of issue 1: the page itself is evaluated, which doesn't make much sense for an XML file. The freeze is probably caused by the size of the page: 2 MB with 16k+ elements. One of the modules and/or Playwright doesn't handle this well.

When called correctly, the sitemap entries are evaluated instead:

$ npx evaluatory --sitemap https://www.example1.gov/sitemap.xml
ℹ Adding 16385 URLs from the sitemap.
ℹ Evaluating 16385 URLs

Keep in mind that this will take a lot of time. I once analyzed a sitemap a tenth of this size, and it took several hours.

Issue 3

Are multi-page scans supported?

In short: no. You would have to extract the sitemaps yourself and call Evaluatory for each of them:

npx evaluatory --sitemap "https://beta.example3.gov/sitemap.xml?page=1" -o sitemap1
npx evaluatory --sitemap "https://beta.example3.gov/sitemap.xml?page=2" -o sitemap2
...

(Note that if you don't pass an explicit output folder (-o or --output), subsequent evaluatory calls will overwrite previous results. The URLs are quoted because zsh otherwise treats the ? as a glob character and aborts with "no matches found".)
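
The per-page calls can be wrapped in a small loop. This is a minimal sketch, not part of Evaluatory: the function names `run` and `evaluate_sitemap_pages`, the `DRY_RUN` switch, and the page count are assumptions made for illustration, and it presumes the site paginates its sitemap with `?page=N` as in the commands above.

```shell
# Print the command instead of running it when DRY_RUN is set (hypothetical switch).
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Hypothetical helper: evaluate pages 1..N of a paginated sitemap,
# writing each page's results to its own output folder.
evaluate_sitemap_pages() {
  base_url=$1
  page_count=$2
  page=1
  while [ "$page" -le "$page_count" ]; do
    # Quoting keeps the shell from treating "?" as a glob character.
    run npx evaluatory --sitemap "${base_url}?page=${page}" -o "sitemap${page}"
    page=$((page + 1))
  done
}
```

Setting `DRY_RUN=1` and calling `evaluate_sitemap_pages https://beta.example3.gov/sitemap.xml 2` prints the commands it would run; unset `DRY_RUN` to actually start the evaluations.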

Ideas

One of the goals of this project is being able to analyze multiple web pages at once. Sitemaps are a great way to automate this and evaluate all pages at once. So I think it's worth investing some time into improving the handling. Some ideas:

  1. Catch empty --sitemap value and provide a helpful error message.
  2. Check if a URL that is being evaluated contains sitemap.xml and output a warning message. There might be pages rendering sitemap.xml as HTML, so it should not be forbidden.
  3. Support pausing/resuming the evaluation of a big URL list.
  4. Support multi-page sitemaps. This should be doable due to the <sitemap> tag within a sitemap.
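
For idea 4, the sitemaps.org protocol defines a sitemap index as a `<sitemapindex>` root whose `<sitemap>` entries each carry a `<loc>` pointing to a child sitemap. Until Evaluatory supports this itself, the child URLs can be pulled out in the shell. What follows is a rough sketch, assuming well-formed XML without CDATA sections or namespace prefixes. `extract_locs` and `list_child_sitemaps` are hypothetical names, and an implementation inside Evaluatory would more likely use a proper XML parser.

```shell
# Print the text of every <loc> element read from stdin, one per line.
extract_locs() {
  tr -d '\r\n' |
    grep -o '<loc>[^<]*</loc>' |
    sed -e 's/<loc>[[:space:]]*//' \
        -e 's/[[:space:]]*<\/loc>//' \
        -e 's/&amp;/\&/g'   # sitemap URLs are entity-escaped; undo the most common case
}

# Print the child sitemap URLs if stdin is a sitemap index; return 1 otherwise.
list_child_sitemaps() {
  xml=$(cat)
  case $xml in
    *"<sitemapindex"*) printf '%s\n' "$xml" | extract_locs ;;
    *) return 1 ;;
  esac
}
```

With curl available, `curl -s "$index_url" | list_child_sitemaps` yields one child sitemap URL per line, and each of them can be passed to `npx evaluatory --sitemap` with its own `-o` folder.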

darekkay, Nov 29 '22 14:11

It's running fine for me now. Thanks @darekkay

Thanks for the additional context and clarity that --sitemap needs to precede the URL. I should have tried that.

Would be great to be able to exclude paths or extensions. The first sitemap I ran was for 2000 pages, of which 1999 were PDFs. PDFs aren't supported at this point, so these should probably just be skipped.
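
Until Evaluatory can exclude paths or extensions itself, one workaround is to filter the URL list before it reaches the tool. This is a minimal sketch, assuming the URLs are available one per line. `skip_pdfs` is a hypothetical helper, and it only looks at the URL, so a PDF served from a path without a .pdf extension would slip through.

```shell
# Drop URLs whose path ends in .pdf (case-insensitive),
# optionally followed by a query string or fragment.
skip_pdfs() {
  grep -Eiv '\.pdf([?#].*)?$'
}
```

The filtered list could then be handed to Evaluatory as positional URL arguments, assuming it accepts several URLs in one call. The "Specify URLs to evaluate" message suggests this, but the thread does not confirm it.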

I've been playing with these types of tools for a while, see https://github.com/civicactions/purple-hats

I'm much more of a hack than a software developer, but I was able to generate some more valuable reports by hacking the work of another tool.

It would also be useful to see whether the output could be aggregated in a way similar to:

  • https://github.com/cloudfour/lighthouse-parade - in a Google Sheet. Not as graceful, but also pretty powerful.

mgifford, Dec 19 '22 13:12