
Action fails, but no error? Appears incomplete - no summary.

Open kubu4 opened this issue 1 year ago • 13 comments

Whenever the urlchecker-action runs, it fails. The end of the log file appears to be incomplete, as it doesn't include a summary or any error messages.

[screencap of log file]

This is what my workflow file looks like:

name: URLChecker
on: [push]

jobs:
  check-urls:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: urlchecker-action
      uses: urlstechie/[email protected]
      with:
        # A subfolder or path to navigate to in the present or cloned repository
        subfolder: posts
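        # File extensions to include in the check (e.g., .qmd)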
        file_types: .qmd
        # Choose whether to print a more verbose end summary with files and broken URLs.
        verbose: true
        # The timeout seconds to provide to requests, defaults to 5 seconds
        timeout: 5
        # How many times to retry a failed request (each is logged, defaults to 1)
        retry_count: 3
        # Choose whether to force a passing exit code even if broken URLs are found
        force_pass: true

I'm not sure how to troubleshoot. Are there log files that get generated somewhere that I can look through?

kubu4 avatar Dec 12 '23 14:12 kubu4

I am not sure what the issue is exactly, but you seem to have had this problem for a while. The way I would debug it is to run the same check locally with the Python module and see if that works. If it does, I would lower the number of workers, or set it to 1, and see if that solves the issue.

SuperKogito avatar Dec 12 '23 16:12 SuperKogito

Thanks so much for the quick response and suggestions.

Admittedly, we're not sure how to run actions locally (we're a group of biologists, so not well-versed in software development stuff), but we'll poke around the web and report back.

kubu4 avatar Dec 12 '23 16:12 kubu4

Hey @kubu4 ! You shouldn't need to poke around the web - urlchecker is a command-line Python tool, and there are instructions for installation and usage here:

https://github.com/urlstechie/urlchecker-python

The action is simply running that under the hood. Let us know if you have any questions! I work with a lot of biologists. :)

vsoch avatar Dec 12 '23 17:12 vsoch

Ha! Thanks!

kubu4 avatar Dec 12 '23 17:12 kubu4

Brief update:

I think it's a memory issue. I ran urlchecker check --file-types ".qmd" . in my repo on a high-memory computer (256 GB of RAM) and the memory was pegged all the way to the top!

I didn't let it finish because other people were trying to use the computer for their own tasks and I had essentially locked it up.

Possible solution? Reducing the number of workers, per @SuperKogito's suggestion?

kubu4 avatar Dec 12 '23 20:12 kubu4

That seems strange - how many files are you checking (and what is a .qmd extension)? Try adding --serial.

vsoch avatar Dec 12 '23 20:12 vsoch

Thousands of files. .qmd is Quarto markdown (it's still just markdown, but with a YAML header that Quarto can parse).

Many links are to large files (multi-GB in size). Would that have an impact on how this action runs?

kubu4 avatar Dec 12 '23 20:12 kubu4

Yes, likely - maybe try testing a smaller subset of the files first and see at what size it stops working?

vsoch avatar Dec 12 '23 20:12 vsoch

This is actually consistent with a memory overflow: the files are loaded into memory and scanned for URLs, which definitely requires a lot of RAM if your files are too big or too numerous. Using multiple workers only makes this worse, hence my suggestion to use one worker. Using --serial, per @vsoch's recommendation, is also a possible solution, but if your files are too big it will be hard to escape this, especially if the memory is not freed as soon as the links are extracted.

SuperKogito avatar Dec 12 '23 23:12 SuperKogito

Likely we need a fix that processes them in batches (and doesn't try to load everything into memory at once).

vsoch avatar Dec 12 '23 23:12 vsoch

You could also just target runs on separate subdirectories (one at a time or in a matrix), depending on how large your repository is.

vsoch avatar Dec 12 '23 23:12 vsoch

I suspect that this has something to do with memory management and garbage collection.

Memory Allocation for File Reading: When you read a file in Python, the data is loaded into RAM. If you read the entire file at once (e.g., using read() or readlines()), the whole file content ends up in memory, which can consume a significant amount of RAM for large files.

https://github.com/urlstechie/urlchecker-python/blob/7dbd7ac171cf85788728b4cf5576c191f13c8399/urlchecker/core/fileproc.py#L135

Garbage Collection: Python uses a garbage collector to reclaim memory from objects that are no longer in use. The primary garbage collection mechanism in Python is reference counting. An object's memory is deallocated when its reference count drops to zero (i.e., when there are no more references to it in your program).

When Memory is Freed:

  • Automatic De-allocation: Memory for file data is automatically freed once the file object is no longer referenced. This can happen when the variable holding the file data goes out of scope, or if you explicitly del the variable.
  • Context Managers (with Statement): Using a with statement to handle file operations is a good practice. It ensures that the file is properly closed after its suite finishes, even if an error occurs. However, closing a file does not immediately free the memory used for its content stored in a variable.
  • Manual Intervention: If you're dealing with very large files and want to ensure memory is freed promptly, you might need to manually delete large objects or use more granular read operations.

Strategies for Large Files:

  • Read in Chunks: Instead of reading the whole file at once, you can read it in smaller chunks (e.g., line by line, or a fixed number of bytes at a time). This way, you only keep a small part of the file in memory at any given time.
  • Use Generators: Generators can be very effective for reading large files, as they yield data on the fly instead of storing it all in memory (see the sketch after this list).
  • External Libraries: Some Python libraries are optimized for handling large datasets and can be more efficient than standard file reading methods.
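
As a rough, hypothetical sketch of the chunk/generator idea (this is not the library's actual code - the regex and function names are invented for illustration), URL extraction could yield matches line by line so a file is never held in memory in full:

import re
from pathlib import Path

# Deliberately naive URL pattern, for illustration only;
# urlchecker has its own extraction logic.
URL_PATTERN = re.compile(r"https?://\S+")

def iter_urls(path):
    # Read the file one line at a time and yield URLs as they are found,
    # so the whole file never has to sit in memory at once.
    with open(path, "r", errors="ignore") as handle:
        for line in handle:
            yield from URL_PATTERN.findall(line)

def iter_all_urls(root, extension=".qmd"):
    # Walk the tree lazily and yield (file, url) pairs one by one.
    for path in Path(root).rglob(f"*{extension}"):
        for url in iter_urls(path):
            yield str(path), url

if __name__ == "__main__":
    for filename, url in iter_all_urls("posts"):
        print(filename, url)

A checker consuming this generator in batches would keep memory bounded by the batch size rather than by the total number or size of the files.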

@vsoch generators could be a good fix here - what do you think? @kubu4, as @vsoch mentioned, processing in batches seems to be the best option at the moment. Just make multiple workflows, each processing a different subset of the files.

SuperKogito avatar Dec 12 '23 23:12 SuperKogito

@SuperKogito my first suggestion to @kubu4 is to try processing in batches (e.g., multiple runs on different roots, and that can be put into an action matrix). If that doesn't work, then I think we should add some kind of support to handle that internally.
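
For reference, a minimal sketch of that matrix idea, assuming the posts are split across a few subdirectories (the subfolder names below are placeholders, and the action should be pinned to the same version as in the original workflow):

name: URLChecker
on: [push]

jobs:
  check-urls:
    runs-on: ubuntu-latest
    strategy:
      # Let the other matrix jobs finish even if one subfolder fails
      fail-fast: false
      matrix:
        # Placeholder paths - replace with the repository's actual subdirectories
        subfolder: [posts/2021, posts/2022, posts/2023]
    steps:
    - uses: actions/checkout@v2
    - name: urlchecker-action (${{ matrix.subfolder }})
      uses: urlstechie/urlchecker-action@<pinned version>
      with:
        subfolder: ${{ matrix.subfolder }}
        file_types: .qmd
        verbose: true
        timeout: 5
        retry_count: 3
        force_pass: true

Each matrix job then only loads and checks the files under its own subfolder, which keeps the per-job memory footprint small.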

vsoch avatar Dec 13 '23 02:12 vsoch