scripts/check-more-info-urls.py: add script
- [x] The PR title conforms to the recommended templates.
TODO:
- [x] docs: Add documentation in the script itself
- [ ] fix: Implement domain fetching rotation (ref. https://github.com/tldr-pages/tldr/issues/12289#issuecomment-1951514833)
- [ ] feat: Add optional regex filter for links
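The two open TODO items could look roughly like the sketch below. It is only an illustration, not the PR's implementation: it assumes "domain fetching rotation" means interleaving URLs so that consecutive requests go to different hosts, and the names `filter_urls` and `rotate_by_domain` are hypothetical.

```python
import re
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlsplit


def filter_urls(urls, pattern=None):
    """Keep only the URLs matching an optional regex (hypothetical option)."""
    if pattern is None:
        return list(urls)
    regex = re.compile(pattern)
    return [url for url in urls if regex.search(url)]


def rotate_by_domain(urls):
    """Interleave URLs round-robin by host (assumed meaning of 'rotation')."""
    by_domain = defaultdict(list)
    for url in urls:
        # Group URLs by their network location, e.g. "github.com".
        by_domain[urlsplit(url).netloc].append(url)

    rotated = []
    # Take one URL per domain on each pass; shorter groups are padded with None.
    for group in zip_longest(*by_domain.values()):
        rotated.extend(url for url in group if url is not None)
    return rotated
```

With this ordering, a domain that has many pages (such as GitHub) is spread across the run instead of being requested in one burst, although URLs from a dominant domain still end up adjacent once the other domains are exhausted.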
The build for this PR failed with the following error(s):
would reformat /home/runner/work/tldr/tldr/scripts/detect-broken-more-info-links.py
Oh no! 💥 💔 💥
1 file would be reformatted, 5 files would be left unchanged.
scripts/detect-broken-more-info-links.py:4:1: F401 'random' imported but unused
scripts/detect-broken-more-info-links.py:7:1: F401 'sys' imported but unused
scripts/detect-broken-more-info-links.py:8:1: F401 'aiofile.Reader' imported but unused
scripts/detect-broken-more-info-links.py:11:1: F401 'aiofile.async_open' imported but unused
scripts/detect-broken-more-info-links.py:18:1: E302 expected 2 blank lines, found 1
scripts/detect-broken-more-info-links.py:54:13: F841 local variable 'e' is assigned to but never used
scripts/detect-broken-more-info-links.py:128:1: W391 blank line at end of file
Please fix the error(s) and push again.
The build for this PR failed with the following error(s):
would reformat /home/runner/work/tldr/tldr/scripts/detect-broken-more-info-links.py
Oh no! 💥 💔 💥
1 file would be reformatted, 6 files would be left unchanged.
scripts/check-more-info-links.py:4:1: F401 'random' imported but unused
scripts/check-more-info-links.py:7:1: F401 'sys' imported but unused
scripts/check-more-info-links.py:8:1: F401 'aiofile.Reader' imported but unused
scripts/check-more-info-links.py:11:1: F401 'aiofile.async_open' imported but unused
scripts/check-more-info-links.py:55:13: F841 local variable 'e' is assigned to but never used
scripts/detect-broken-more-info-links.py:4:1: F401 'random' imported but unused
scripts/detect-broken-more-info-links.py:7:1: F401 'sys' imported but unused
scripts/detect-broken-more-info-links.py:8:1: F401 'aiofile.Reader' imported but unused
scripts/detect-broken-more-info-links.py:11:1: F401 'aiofile.async_open' imported but unused
scripts/detect-broken-more-info-links.py:18:1: E302 expected 2 blank lines, found 1
scripts/detect-broken-more-info-links.py:54:13: F841 local variable 'e' is assigned to but never used
scripts/detect-broken-more-info-links.py:128:1: W391 blank line at end of file
Please fix the error(s) and push again.
The build for this PR failed with the following error(s):
scripts/check-more-info-links.py:4:1: F401 'random' imported but unused
scripts/check-more-info-links.py:7:1: F401 'sys' imported but unused
scripts/check-more-info-links.py:8:1: F401 'aiofile.Reader' imported but unused
scripts/check-more-info-links.py:11:1: F401 'aiofile.async_open' imported but unused
scripts/check-more-info-links.py:55:13: F841 local variable 'e' is assigned to but never used
Please fix the error(s) and push again.
Hi @vitorhcl, Any updates on this?
Hi @kbdharun, thanks for pinging me.
I'm going to try to finish the pending fixes and documentation by Monday, but I'll leave the regex filter for another PR.
The build for this PR failed with the following error(s):
scripts/check-more-info-links.py:4:1: F401 'random' imported but unused
scripts/check-more-info-links.py:7:1: F401 'sys' imported but unused
scripts/check-more-info-links.py:8:1: F401 'aiofile.Reader' imported but unused
scripts/check-more-info-links.py:11:1: F401 'aiofile.async_open' imported but unused
scripts/check-more-info-links.py:55:13: F841 local variable 'e' is assigned to but never used
Please fix the error(s) and push again.
Does anyone know why this is returning an error? Is it because of the asynchronous functions?
The build for this PR failed with the following error(s):
scripts/check-more-info-urls.py:13:1: F401 'random' imported but unused
scripts/check-more-info-urls.py:16:1: F401 'sys' imported but unused
scripts/check-more-info-urls.py:17:1: F401 'aiofile.Reader' imported but unused
scripts/check-more-info-urls.py:20:1: F401 'aiofile.async_open' imported but unused
scripts/check-more-info-urls.py:64:13: F841 local variable 'e' is assigned to but never used
Please fix the error(s) and push again.
> Does anyone know why this is returning an error? Is it because of the asynchronous functions?
Probably yeah, will check the script locally and maybe fix this issue.
Edit: That didn't take long. I fixed the issue and also updated the README file. Some names were imported but never used, so I removed them; for the unused exception variable `e`, I added it to the `aprint` output.
i.e.
diff --git a/scripts/check-more-info-urls.py b/scripts/check-more-info-urls.py
index 5d055e9a5bd3f..847232bdef3ab 100644
--- a/scripts/check-more-info-urls.py
+++ b/scripts/check-more-info-urls.py
@@ -2,22 +2,19 @@
# SPDX-License-Identifier: MIT
"""
-A Python script to check for bad (HTTP status code different than 200) "More information" URLs accross all pages.
+A Python script to check for bad (HTTP status code different than 200) "More information" URLs across all pages.
-These bad codes tipically indicate a not found page or a redirection. They are written to bad-urls.txt with their respective URLs.
+These bad codes typically indicate a page not found or a redirection. They are written to bad-urls.txt with their respective URLs.
Usage:
python3 scripts/check-more-info-urls.py
"""
-import random
import re
import asyncio
-import sys
-from aiofile import AIOFile, Reader, Writer
import aiohttp.client_exceptions
from aioconsole import aprint
-from aiofile import async_open
+from aiofile import AIOFile, Writer
from aiopath import AsyncPath
MAX_CONCURRENCY = 500
@@ -62,7 +59,7 @@ async def process_file(
try:
content = await f.read()
except Exception as e:
- await aprint(file.parts[-3:])
+ await aprint(f"Error: {e}, File: {file.parts[-3:]}")
return
url = extract_url(content)
Feel free to check it out and modify my changes @vitorhcl.
@kbdharun your change LGTM, thank you for the fixes.
Are you going to implement the domain rotation or do you want me to do that?
> Are you going to implement the domain rotation or do you want me to do that?
Feel free to do it, I assigned myself for the previous change (and to sort this PR separately under my notifications 😅 ).
@vitorhcl is this PR ready for review? Or should it become draft until it is ready for review?
> @vitorhcl is this PR ready for review? Or should it become draft until it is ready for review?
Hmm it should become draft until it's ready for merge.
PS: My 3 previous commits have bodies that explain each change.
Fixed the merge conflicts in the README file. We still need to implement the remaining todo tasks.
@vitorhcl any update on this PR?
Whilst running the script, I found that you will eventually get a 429 on the GitHub links, and sometimes a redirect, resulting in a 30X. To reduce the 429s, I guess we should just check fewer URLs at the same time. A 30X is not wrong either, but it currently gets marked as a bad URL.
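A rough sketch of both ideas, in case it helps: cap the number of in-flight requests with an `asyncio.Semaphore`, and classify 30X and 429 separately instead of lumping them in with bad URLs. Everything here is an assumption for illustration: the limit of 50, the names `classify_status` and `check_all`, and `fake_fetch_status`, which stands in for the real HTTP request so that the example runs without network access.

```python
import asyncio

MAX_CONCURRENCY = 50  # assumed value, well below the script's current 500


def classify_status(status):
    """Map an HTTP status code to a coarse category."""
    if status == 200:
        return "ok"
    if 300 <= status < 400:
        return "redirect"  # not necessarily broken, so reported separately
    if status == 429:
        return "rate-limited"  # the server throttled us; worth retrying later
    return "bad"


async def check_all(urls, fetch_status, limit=MAX_CONCURRENCY):
    """Check every URL with at most `limit` requests in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def check(url):
        async with semaphore:
            status = await fetch_status(url)
        return url, classify_status(status)

    # gather() returns the results in the same order as the input URLs.
    return await asyncio.gather(*(check(url) for url in urls))


async def fake_fetch_status(url):
    """Stand-in for a real request; returns canned status codes."""
    await asyncio.sleep(0)
    canned = {
        "https://example.com/ok": 200,
        "https://example.com/moved": 301,
        "https://example.com/busy": 429,
    }
    return canned.get(url, 404)
```

In the real script, `fetch_status` would wrap the existing HTTP call. Only URLs classified as "bad" would then need to go into bad-urls.txt, while "rate-limited" ones could be retried after a delay.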