urlchecker-action
Action fails, but no error? Appears incomplete - no summary.
Whenever the urlchecker-action runs, it fails. The end of the log file appears to be incomplete, as it doesn't provide a summary or any error messages.
This is what my workflow file looks like:
```yaml
name: URLChecker
on: [push]
jobs:
  check-urls:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: urlchecker-action
        uses: urlstechie/[email protected]
        with:
          # A subfolder or path to navigate to in the present or cloned repository
          subfolder: posts
          file_types: .qmd
          # Choose whether to print a more verbose end summary with files and broken URLs
          verbose: true
          # The timeout seconds to provide to requests, defaults to 5 seconds
          timeout: 5
          # How many times to retry a failed request (each is logged, defaults to 1)
          retry_count: 3
          # Choose whether to force a passing result even if broken URLs are found
          force_pass: true
```
I'm not sure how to troubleshoot. Are there log files that get generated somewhere that I can look through?
I am not sure what the issue is exactly, but you seem to have had it for a while. The way I would debug this is to test the same workflow using the Python module locally and see if that works. If so, then I would try lowering the number of workers, or setting it to 1, and see if that solves the issue.
Thanks so much for the quick response and suggestions.
Admittedly, we're not sure how to run actions locally (we're a group of biologists, so not well-versed in software development stuff), but we'll poke around the web and report back.
Hey @kubu4! You shouldn't need to poke around the web - urlchecker is a command-line Python tool, and there are instructions for installation and usage here:
https://github.com/urlstechie/urlchecker-python
The action is simply running that under the hood. Let us know if you have any questions! I work with a lot of biologists. :)
Ha! Thanks!
Brief update:
I think it's a memory issue. I ran `urlchecker check --file-types ".qmd" .`
in my repo on a high-memory computer (256 GB RAM) and memory usage was pegged all the way to the top!
I didn't let it finish because other people were trying to use the computer for other tasks and I had, essentially, locked it up.
Possible solution? Reducing the number of workers, per @SuperKogito's suggestion?
That seems strange - how many files are you checking (and what is a `qmd` extension)? Try adding `--serial`.
Thousands of files. `.qmd` is Quarto markdown (it's still just markdown, but with a YAML header that can be parsed by Quarto).
Many links are to large files (multi-GB in size). Would that have an impact on how this action runs?
Yes, likely - maybe try testing a smaller subset of the files first and see at what size it stops working?
This is actually consistent with a memory overflow; the files are loaded and scanned for URLs, which will require a lot of RAM if your files are too big or too numerous. Using multiple workers will only make this worse, hence my suggestion to use one worker. Using `--serial`, per @vsoch's recommendation, is also a possible solution, but if your files are too big it will be hard to escape this, especially if the memory is not flushed as soon as the links are extracted.
Likely we need a fix that processes them in batches (and doesn't try to load everything into memory at once).
You could also just target runs on separate subdirectories (one at a time or in a matrix), depending on how large your repository is.
I suspect that this has something to do with memory management and garbage collection.
Memory Allocation for File Reading: When you read a file in Python, the data is loaded into RAM. If you read the entire file at once (e.g., using `read()` or `readlines()`), the entire file content is loaded into memory. This can be problematic for large files, as it can consume a significant amount of RAM. See:
https://github.com/urlstechie/urlchecker-python/blob/7dbd7ac171cf85788728b4cf5576c191f13c8399/urlchecker/core/fileproc.py#L135
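For illustration only, here is a minimal sketch of that "whole file at once" pattern; it is not the actual urlchecker code, and the regex and function name are made up for the example:

```python
import re

# Hypothetical sketch of reading an entire file before scanning it for URLs.
# With multi-GB files, `content` alone can exhaust RAM before any URL is checked.
URL_REGEX = re.compile(r"https?://[^\s\"'<>)]+")

def find_urls_whole_file(path):
    with open(path, "r", errors="ignore") as f:
        content = f.read()  # the entire file is held in memory here
    return URL_REGEX.findall(content)
```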
Garbage Collection: Python uses a garbage collector to reclaim memory from objects that are no longer in use. The primary garbage collection mechanism in Python is reference counting. An object's memory is deallocated when its reference count drops to zero (i.e., when there are no more references to it in your program).
When Memory is Freed:
- Automatic De-allocation: Memory for file data is automatically freed once the file object is no longer referenced. This can happen when the variable holding the file data goes out of scope, or if you explicitly `del` the variable.
- Context Managers (`with` Statement): Using a `with` statement to handle file operations is good practice. It ensures that the file is properly closed after its suite finishes, even if an error occurs. However, closing a file does not immediately free the memory used for its content stored in a variable.
- Manual Intervention: If you're dealing with very large files and want to ensure memory is freed promptly, you might need to manually delete large objects or use more granular read operations.
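As a small, self-contained illustration of the points above (plain Python, nothing urlchecker-specific; the function is hypothetical):

```python
import gc

def read_and_release(path):
    # The `with` block guarantees the file handle is closed when it exits...
    with open(path, "r", errors="ignore") as f:
        data = f.read()  # the file's content now lives in `data`
    # ...but `data` still holds the full content even though the file is closed.
    n_chars = len(data)
    del data       # drop the last reference; reference counting frees it right away
    gc.collect()   # optional: only needed for reference cycles, shown for completeness
    return n_chars
```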
Strategies for Large Files:
- Read in Chunks: Instead of reading the whole file at once, you can read it in smaller chunks (e.g., line by line, or a fixed number of bytes at a time). This way, you only keep a small part of the file in memory at any given time.
- Use Generators: Generators can be very effective for reading large files as they yield data on-the-fly and do not store it in memory.
- External Libraries: Some Python libraries are optimized for handling large datasets and can be more efficient than standard file reading methods.
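Combining the first two strategies, a generator that scans line by line keeps memory roughly constant regardless of file size. This is only a sketch, assuming URLs do not wrap across lines (rare in markdown), and it is not urlchecker's implementation:

```python
import re

URL_REGEX = re.compile(r"https?://[^\s\"'<>)]+")

def iter_urls(path):
    """Yield URLs one at a time; only the current line is held in memory."""
    with open(path, "r", errors="ignore") as f:
        for line in f:  # file objects iterate lazily, line by line
            for match in URL_REGEX.finditer(line):
                yield match.group(0)

# Usage: the URLs never need to be materialized as one giant list.
# for url in iter_urls("posts/example.qmd"):
#     print(url)
```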
@vsoch generators could be a good fix here, what do you think? @kubu4, as @vsoch mentioned, processing in batches seems to be the best option at the moment: just make multiple workflows, each processing a different subset.
@SuperKogito my first suggestion to @kubu4 is to try processing in batches (e.g., multiple runs on different roots, and that can be put into an action matrix). If that doesn't work, then I think we should add some kind of support to handle that internally.
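In the meantime, here is a rough sketch of the "multiple runs on different roots" idea as a plain driver script: it calls the same `urlchecker check` command @kubu4 used above, one subdirectory at a time, so each invocation only loads a subset of the files. The directory names are placeholders; adapt them to your repository (and the same split can be expressed as a GitHub Actions matrix instead).

```python
#!/usr/bin/env python3
"""Sketch: run urlchecker on one subdirectory at a time so only a subset of
files is in memory per invocation. Directory names are placeholders."""

import subprocess
import sys
from pathlib import Path

# Hypothetical roots to check one at a time; replace with your own layout.
ROOTS = ["posts/2022", "posts/2023", "posts/2024"]

failed = False
for root in ROOTS:
    if not Path(root).is_dir():
        print(f"Skipping missing directory: {root}")
        continue
    print(f"=== Checking {root} ===", flush=True)
    # Same invocation used earlier in this thread, scoped to a single root.
    result = subprocess.run(["urlchecker", "check", "--file-types", ".qmd", root])
    failed = failed or result.returncode != 0

sys.exit(1 if failed else 0)
```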