crawler
Improving #754
From @brotkrueml's comments in review: https://github.com/tomasnorre/crawler/pull/754#issuecomment-864586872
Okay, it is very chatty ;-).
Now a URL with MP is used:
Processing
https://website.ddev.site:8443/en/?MP= () =>
OK:
This shouldn't be an issue, as the correct canonical is used (without MP), just a little bit unaesthetic.
I use the following configuration (3 is the start page):
tx_crawler.crawlerCfg.paramSets {
  deployment = &L=[0-3]
  deployment {
    pidsOnly = 3
  }
}
With this configuration, I run:
crawler:buildQueue 3 deployment --depth 99 --mode exec
So, it should only generate the start pages for four languages. But I get information from all other pages (Page-x is my placeholder for the real names):
Page-1: (Because page is hidden)
Page-2: (Because page is hidden)
Page-3: (Because doktype is not allowed)
This is not very helpful as only one page in different languages should be generated.
Yes, I use --depth, but pidsOnly should override this, IMHO.
Running the command without --depth:
ddev t3cmd crawler:buildQueue 3 deployment --mode exec
Executing 4 requests right away:
[20.06.21 17:19] https://website.ddev.site:8443/en/?MP= (URL already existed)<br>[20.06.21 17:19] https://website.ddev.site:8443/de/?MP= (URL already existed)<br>[20.06.21 17:19] https://website.ddev.site:8443/pl/?MP= (URL already existed)<br>[20.06.21 17:19] https://website.ddev.site:8443/tr/?MP= (URL already existed)
omits the detailed information from above for other pages. The <br> tag should be converted to a new line on the console.
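The <br> conversion mentioned above could look roughly like the following. This is only a minimal sketch in Python to illustrate the idea, not the extension's actual code; the function name br_to_newlines is hypothetical.

```python
import re

# Matches <br>, <br/> and <br /> in any letter case.
_BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def br_to_newlines(text: str) -> str:
    """Replace HTML line-break tags with newlines for console output.

    Hypothetical helper for illustration only; not part of the crawler.
    """
    return _BR_TAG.sub("\n", text)


if __name__ == "__main__":
    # Sample taken from the command output quoted in this comment.
    sample = (
        "[20.06.21 17:19] https://website.ddev.site:8443/en/?MP= (URL already existed)<br>"
        "[20.06.21 17:19] https://website.ddev.site:8443/de/?MP= (URL already existed)"
    )
    print(br_to_newlines(sample))
```

With this, each queued URL would appear on its own console line instead of being joined by literal <br> tags.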
When I run the command for a non-existing page:
crawler:buildQueue 9999999 deployment --depth 99 --mode exec
The following output is given:
Executing 0 requests right away:
Processing
0 [->--------------------------]
Perhaps it would be better to report an error stating that the page does not exist.
I am getting many empty lines when calling a buildQueue command with --depth. Perhaps these empty lines come from "successful" pages without any output. I think they should be avoided.
What does the message Doktype was excluded by "0" mean? It is shown for a page with doktype "Backend user section".
Because it is very chatty, I think one possibility would be to hide the output behind an option, perhaps --verbose or --debug. You could show the total number of omitted pages (when the option is not given) with a hint to display them with the option.
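The suggestion above could be sketched as follows. This is a rough illustration in Python, not how the command is implemented; the function render_skipped and the wording of the summary line are assumptions made for this example.

```python
from typing import List, Tuple


def render_skipped(skipped: List[Tuple[str, str]], verbose: bool) -> List[str]:
    """Return the console lines for pages that were not queued.

    `skipped` holds (page title, reason) pairs. Hypothetical helper for
    illustration only; the summary wording is an assumption.
    """
    if not skipped:
        # Nothing was omitted, so print nothing (no empty lines either).
        return []
    if verbose:
        # With the option: one detailed line per omitted page.
        return [f"{title}: ({reason})" for title, reason in skipped]
    # Without the option: a single summary line with a hint.
    return [f"{len(skipped)} page(s) omitted; run again with --verbose to list them."]


if __name__ == "__main__":
    pages = [
        ("Page-1", "Because page is hidden"),
        ("Page-3", "Because doktype is not allowed"),
    ]
    for line in render_skipped(pages, verbose=False):
        print(line)
```

The default output then stays short, and the per-page reasons remain available on request.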
Finally: I like it. It is now possible to get the information needed to see why a given page is not crawled. Well done :-)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.