PFERD
Contents of type "Inhaltsseite" won't get crawled
My analysis course uses the "Inhaltsseite" structure (content page; the icon looks like a laptop showing a diagram) to provide the lecture notes (which get updated regularly) as well as the exercise sheets and their solutions.
Unfortunately I can't download any of that with PFERD. I tried the command line, a config file downloading the whole course, and an explicit URL, but nothing works.
When executing
pferd kit-ilias-web [url] .
it just says:
Loading auth:ilias
Loading crawl:ilias
Running crawl:ilias
Crawled '.'
Report for crawl:ilias
Nothing changed
And the folder stays empty.
Is this a misconfiguration on my end or is this type of structure not implemented yet?
I'd guess the latter. Could you pass the --explain switch as the first parameter to pferd (before the kit-ilias-web)? Then PFERD should try to explain itself; maybe it will tell you that it has no idea what's happening.
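For example, with [url] again standing in for your course URL:
pferd --explain kit-ilias-web [url] .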
Here is the output with the --explain flag:
Loading config
CLI command specified, loading config from its arguments
Creating config for command 'kit-ilias-web'
Deciding which crawlers to run
No crawlers specified on CLI
Running crawlers specified in config
Loading auth:ilias
Loading crawl:ilias
Running crawl:ilias
Loading cookies
Sharing cookies
'/home/me/documents/uni/ilias/ana_blatt/.cookies' has newest mtime so far
Loading cookies from '/home/me/documents/uni/ilias/ana_blatt/.cookies'
Creating base directory at '/home/me/documents/uni/ilias/ana_blatt'
Loading previous report from '/home/me/documents/uni/ilias/ana_blatt/.report'
Loaded report successfully
Inferred crawl target: URL https://ilias.studium.kit.edu/goto.php?target=copa_1649818&client_id=produktiv
Decision: Crawl '.'
Final result: '.'
Answer: Yes
Parsing root HTML page
URL: https://ilias.studium.kit.edu/goto.php?target=copa_1649818&client_id=produktiv
Page is a normal folder, searching for elements
Crawled '.'
Decision: Clean up files
No warnings or errors occurred during this run
Answer: Yes
Storing report to '/home/me/documents/uni/ilias/ana_blatt/.report'
Stored report successfully
Total amount of HTTP requests: 1
Saving cookies
Saving cookies to '/home/me/documents/uni/ilias/ana_blatt/.cookies'
Report for crawl:ilias
Nothing changed
Yeah, so it apparently did not recognize anything useful. I will have a look at it, but not before the ILIAS 7 migration in a few days, if that's alright with you. That migration will probably absolutely slaughter the HTML parser anyway :P
Could you have a look at what https://github.com/Garmelon/PFERD/releases/tag/v3.3.0 produces @Geronymos?
Even though PFERD 3.3 can download all regular content again (thank you for that!), it unfortunately still downloads nothing for these kinds of links. It does recognize that the page is a content page, though (see the explain log below).
As I see it, an "Inhaltsseite" might be an option for the lecturer to write plain HTML. So maybe it could be handled like an "external link": the page downloaded as plain text, and the links within the page downloaded as well.
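Just to sketch what I mean (this is only an illustration, not anything from PFERD's codebase; the function name is made up): the content page's HTML could be parsed and every link target collected, and those targets could then be handed to the normal download logic.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_content_page_links(page_html: str, base_url: str) -> list[str]:
    """Collect absolute URLs of everything the content page links to."""
    soup = BeautifulSoup(page_html, "html.parser")
    # Every <a href="..."> in the page is a candidate for the download queue.
    return [
        urljoin(base_url, anchor["href"])
        for anchor in soup.find_all("a", href=True)
    ]
```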
Explain log:
Loading config
CLI command specified, loading config from its arguments
Creating config for command 'kit-ilias-web'
Deciding which crawlers to run
No crawlers specified on CLI
Running crawlers specified in config
Loading auth:ilias
Loading crawl:ilias
Running crawl:ilias
Loading cookies
Sharing cookies
'/home/me/documents/uni/ilias/ana_blatt/.cookies' has newest mtime so far
Loading cookies from '/home/me/documents/uni/ilias/ana_blatt/.cookies'
Creating base directory at '/home/me/documents/uni/ilias/ana_blatt'
Loading previous report from '/home/me/documents/uni/ilias/ana_blatt/.report'
Loaded report successfully
Inferred crawl target: URL https://ilias.studium.kit.edu/goto.php?target=copa_1649818&client_id=produktiv
Decision: Crawl '.'
Final result: '.'
Answer: Yes
Parsing root HTML page
URL: https://ilias.studium.kit.edu/goto.php?target=copa_1649818&client_id=produktiv
Page is a content page, searching for elements
Crawled '.'
Decision: Clean up files
No warnings or errors occurred during this run
Answer: Yes
Storing report to '/home/me/documents/uni/ilias/ana_blatt/.report'
Stored report successfully
Total amount of HTTP requests: 1
Saving cookies
Saving cookies to '/home/me/documents/uni/ilias/ana_blatt/.cookies'
Report for crawl:ilias
Nothing changed
The "content page" has a "file" feature which I added support for. I thought they were nice enough to use it but they are not...
I don't really want to crawl random pages linked by the content page - that could lead to weird network requests, errors when the remote file is behind authentication and so on. I was about to suggest writing a dedicated crawler type for the math page but they don't even link them there... So I guess I will have to find a compromise here.
- I could do a HEAD request to find out the content type from the remote server, store the target as an "external link" file if it is text/html, and otherwise download it (see the sketch after this list). But that would cause an additional network request for each item - even if it is already present locally.
- Slightly less fancy, I could just use the name of the link and perform the same check. That would allow me to do this in one request and not do anything if the file is present locally, but the file extension will be off.
- As a third option I could just download them as-is, and you might end up with downloaded HTML files if a link points to something which cannot be downloaded directly.
All of these will lead to errors if there are links to files behind authentication.
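For clarity, the first option would roughly look like the sketch below. This is only an illustration of the idea, not the code that would end up in PFERD; the function name is invented and it uses requests instead of PFERD's actual HTTP client.

```python
import requests


def fetch_or_link(session: requests.Session, url: str):
    """Decide per link whether to store an 'external link' stub or the file itself."""
    # First request: only the headers, to learn the content type. This is the
    # extra request per item mentioned above, and it happens even if the file
    # is already present locally.
    head = session.head(url, allow_redirects=True)
    content_type = head.headers.get("Content-Type", "")

    if content_type.startswith("text/html"):
        # Treat it like an external link: remember the URL instead of crawling it.
        return "link", url

    # Second request: actually download the file.
    response = session.get(url)
    response.raise_for_status()
    return "file", response.content
```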