
Provide a means in the UI for a user to see the last time a crawl was executed and other info

Open machawk1 opened this issue 4 years ago • 12 comments

  • Last time crawled
  • Number of crawls
  • URI crawled

A comprehensive view might require a separate window, but an at-a-glance display of basic info such as the last crawl time would be useful.
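As a rough sketch of the first two items, the following assumes each run of a job can be identified by a 14-digit `YYYYMMDDHHMMSS` timestamp string (for example, taken from a per-run directory name). That input format and the `summarize_runs` name are assumptions for illustration, not something WAIL currently provides.

```python
from datetime import datetime


def summarize_runs(launch_stamps):
    """Return (number_of_crawls, last_crawl_time) for one job.

    launch_stamps: iterable of 14-digit 'YYYYMMDDHHMMSS' strings, one per run.
    This input format is an assumption made for illustration.
    """
    times = [datetime.strptime(stamp, "%Y%m%d%H%M%S") for stamp in launch_stamps]
    # A job that has never been run has a count of 0 and no last crawl time.
    return len(times), (max(times) if times else None)
```

For example, `summarize_runs(["20200514140500", "20201030153000"])` gives a count of 2 and a last crawl time of 30 October 2020, 15:30.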

machawk1 avatar May 14 '20 14:05 machawk1

The number of unique domains crawled would also be a good metric to report, as would the number of resources crawled from each domain. To go a level further, you could also report a breakdown of downloads by media type (at least grouped as "HTML", "Text", "CSS", "JS", "Image", "PDF", and "Others").
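A minimal sketch of the media type grouping, assuming the MIME type of each crawled resource is already available as a string. The function names and the exact mapping of MIME types to groups are illustrative choices, not an existing WAIL feature.

```python
from collections import Counter


def media_group(mime_type):
    """Map a MIME type string to one of the coarse groups named above."""
    # Drop parameters such as "; charset=UTF-8" and normalize case.
    mime = mime_type.split(";")[0].strip().lower()
    if mime in ("text/html", "application/xhtml+xml"):
        return "HTML"
    if mime == "text/css":
        return "CSS"
    if mime in ("application/javascript", "text/javascript",
                "application/x-javascript"):
        return "JS"
    if mime == "application/pdf":
        return "PDF"
    if mime.startswith("image/"):
        return "Image"
    if mime.startswith("text/"):
        return "Text"
    return "Others"


def media_breakdown(mime_types):
    """Count crawled resources per media group."""
    return Counter(media_group(m) for m in mime_types)
```

The specific checks come before the generic `text/` prefix check so that HTML and CSS are not counted as plain "Text".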

ibnesayeed avatar May 14 '20 14:05 ibnesayeed

@ibnesayeed Thanks for the suggestion. I agree. This sort of information will need to be calculated, which other packages seem to have done already. Are you familiar with any (e.g., Monitrix)?

machawk1 avatar May 14 '20 14:05 machawk1

Monitrix is great, but I wonder whether it would be too big a tool to include for rather small crawls. Since WAIL is a desktop application, I do not expect people to use it for long crawl sessions.

ibnesayeed avatar May 14 '20 14:05 ibnesayeed

@ibnesayeed They might want to run longer jobs. My concern with Monitrix is getting it packaged into a desktop app (#46), even if it requires a bundled runtime. Per that GH issue, I have yet to investigate it further.

Some rudimentary metrics can be calculated, like those above, but doing so might be reinventing the wheel.

machawk1 avatar May 14 '20 14:05 machawk1

> They might want to run longer jobs.

If a feature is not used by the majority, then it is better to avoid adding it unless it is critical for those who do want to use it.

If you have the list of URIs crawled, at least domain counting and domain grouping will be rather easy without any additional tools. If you have a way to read crawl logs from the WAIL process, then see what attributes are reported in them and think about what you can do without bundling yet another tool and making the package heavy.
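For instance, domain counting over a list of crawled URIs needs only the standard library. This is a sketch that treats "domain" as the URI's hostname (so `www.example.org` and `example.org` count separately); the `domain_counts` name is made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_counts(uris):
    """Count crawled resources per hostname."""
    counts = Counter()
    for uri in uris:
        # urlsplit lowercases the hostname; URIs without one
        # (e.g. "dns:example.com") yield None and are skipped.
        host = urlsplit(uri).hostname
        if host:
            counts[host] += 1
    return counts
```

`len(domain_counts(uris))` then gives the number of unique domains, and `domain_counts(uris).most_common()` gives the per-domain resource counts in descending order.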

ibnesayeed avatar May 14 '20 14:05 ibnesayeed

The list of domains can be mined out of the crawl jobs' instances. Some jobs might be run more than once and the configuration modified between runs.

I can code up a solution, but if there is a pre-existing approach or tool, it might be more powerful to leverage functionality already implemented elsewhere.

machawk1 avatar May 14 '20 14:05 machawk1

I think we should assess our options (such as external tools vs. in-house solutions) and their pros and cons to evaluate which would be the more practical route forward.

ibnesayeed avatar May 14 '20 15:05 ibnesayeed

This would be an awesome feature that I would definitely use! Thanks for suggesting it!

zahnz avatar Oct 29 '20 19:10 zahnz

@machawk1 you may want to have a look at logtrix. I have used it recently for quick inspection.

ibnesayeed avatar Oct 30 '20 12:10 ibnesayeed

@ibnesayeed Thanks for the pointer to logtrix, I was not aware of that package. The examples are a little disheartening, as there does not appear to be an API to use its features beyond programming in Java, which I would like to avoid in this project.

machawk1 avatar Oct 30 '20 15:10 machawk1

Yes, it was meant to serve as a library for parsing Heritrix crawl logs, but I agree that a built-in CLI would have been more immediately useful. It does provide some CLI capabilities (that's how I used it), but it can be improved. It was built during one of those IIPC hackathons.

ibnesayeed avatar Oct 30 '20 15:10 ibnesayeed

Interesting to know, @ibnesayeed. I can relate to building software at Hackathons to disseminate afterward. ;-)

I was also looking to bundle UKWA's Monitrix in #46, but its dependency (at the time) on a specific version of the Play! framework and the inability to package the application as a native executable were barriers. The latter is a primary requirement for bundling additional software into WAIL for access via the GUI.

machawk1 avatar Oct 30 '20 15:10 machawk1