scripts/check-more-info-urls.py: add script

Open vitorhcl opened this issue 11 months ago • 17 comments

  • [x] The PR title conforms to the recommended templates.

TODO:

  • [x] docs: Add documentation in the script itself
  • [ ] fix: Implement domain fetching rotation (ref. https://github.com/tldr-pages/tldr/issues/12289#issuecomment-1951514833)
  • [ ] feat: Add optional regex filter for links

vitorhcl avatar Mar 14 '24 17:03 vitorhcl

The build for this PR failed with the following error(s):

would reformat /home/runner/work/tldr/tldr/scripts/detect-broken-more-info-links.py

Oh no! 💥 💔 💥
1 file would be reformatted, 5 files would be left unchanged.
scripts/detect-broken-more-info-links.py:4:1: F401 'random' imported but unused
scripts/detect-broken-more-info-links.py:7:1: F401 'sys' imported but unused
scripts/detect-broken-more-info-links.py:8:1: F401 'aiofile.Reader' imported but unused
scripts/detect-broken-more-info-links.py:11:1: F401 'aiofile.async_open' imported but unused
scripts/detect-broken-more-info-links.py:18:1: E302 expected 2 blank lines, found 1
scripts/detect-broken-more-info-links.py:54:13: F841 local variable 'e' is assigned to but never used
scripts/detect-broken-more-info-links.py:128:1: W391 blank line at end of file

Please fix the error(s) and push again.

tldr-bot avatar Mar 14 '24 17:03 tldr-bot

The build for this PR failed with the following error(s):

would reformat /home/runner/work/tldr/tldr/scripts/detect-broken-more-info-links.py

Oh no! 💥 💔 💥
1 file would be reformatted, 6 files would be left unchanged.
scripts/check-more-info-links.py:4:1: F401 'random' imported but unused
scripts/check-more-info-links.py:7:1: F401 'sys' imported but unused
scripts/check-more-info-links.py:8:1: F401 'aiofile.Reader' imported but unused
scripts/check-more-info-links.py:11:1: F401 'aiofile.async_open' imported but unused
scripts/check-more-info-links.py:55:13: F841 local variable 'e' is assigned to but never used
scripts/detect-broken-more-info-links.py:4:1: F401 'random' imported but unused
scripts/detect-broken-more-info-links.py:7:1: F401 'sys' imported but unused
scripts/detect-broken-more-info-links.py:8:1: F401 'aiofile.Reader' imported but unused
scripts/detect-broken-more-info-links.py:11:1: F401 'aiofile.async_open' imported but unused
scripts/detect-broken-more-info-links.py:18:1: E302 expected 2 blank lines, found 1
scripts/detect-broken-more-info-links.py:54:13: F841 local variable 'e' is assigned to but never used
scripts/detect-broken-more-info-links.py:128:1: W391 blank line at end of file

Please fix the error(s) and push again.

tldr-bot avatar Mar 15 '24 11:03 tldr-bot

The build for this PR failed with the following error(s):

scripts/check-more-info-links.py:4:1: F401 'random' imported but unused
scripts/check-more-info-links.py:7:1: F401 'sys' imported but unused
scripts/check-more-info-links.py:8:1: F401 'aiofile.Reader' imported but unused
scripts/check-more-info-links.py:11:1: F401 'aiofile.async_open' imported but unused
scripts/check-more-info-links.py:55:13: F841 local variable 'e' is assigned to but never used

Please fix the error(s) and push again.

tldr-bot avatar Mar 15 '24 11:03 tldr-bot

The build for this PR failed with the following error(s):

scripts/check-more-info-links.py:4:1: F401 'random' imported but unused
scripts/check-more-info-links.py:7:1: F401 'sys' imported but unused
scripts/check-more-info-links.py:8:1: F401 'aiofile.Reader' imported but unused
scripts/check-more-info-links.py:11:1: F401 'aiofile.async_open' imported but unused
scripts/check-more-info-links.py:55:13: F841 local variable 'e' is assigned to but never used

Please fix the error(s) and push again.

tldr-bot avatar Apr 03 '24 16:04 tldr-bot

Hi @vitorhcl, any updates on this?

kbdharun avatar Apr 18 '24 04:04 kbdharun

Hi @kbdharun, thanks for pinging me.

I'm going to try to finish the pending fixes and the documentation by Monday, but I'll leave the regex filter for another PR.
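For reference, a minimal sketch of what an optional regex filter for links could look like. The name `filter_urls` and its signature are assumptions for illustration only; nothing like it exists in the script yet.

```python
import re


def filter_urls(urls, pattern=None):
    """Keep only URLs matching `pattern`; keep everything if no pattern is given."""
    if pattern is None:
        return list(urls)
    regex = re.compile(pattern)
    # search() matches anywhere in the URL, not only at the start.
    return [url for url in urls if regex.search(url)]
```

Passing something like `r"github\.com"` would restrict the check to GitHub links.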

vitorhcl avatar Apr 18 '24 19:04 vitorhcl

> The build for this PR failed with the following error(s):
>
> scripts/check-more-info-links.py:4:1: F401 'random' imported but unused
> scripts/check-more-info-links.py:7:1: F401 'sys' imported but unused
> scripts/check-more-info-links.py:8:1: F401 'aiofile.Reader' imported but unused
> scripts/check-more-info-links.py:11:1: F401 'aiofile.async_open' imported but unused
> scripts/check-more-info-links.py:55:13: F841 local variable 'e' is assigned to but never used
>
> Please fix the error(s) and push again.

Does anyone know why this is returning an error? Is it because of the asynchronous functions?

vitorhcl avatar Apr 18 '24 21:04 vitorhcl

The build for this PR failed with the following error(s):

scripts/check-more-info-urls.py:13:1: F401 'random' imported but unused
scripts/check-more-info-urls.py:16:1: F401 'sys' imported but unused
scripts/check-more-info-urls.py:17:1: F401 'aiofile.Reader' imported but unused
scripts/check-more-info-urls.py:20:1: F401 'aiofile.async_open' imported but unused
scripts/check-more-info-urls.py:64:13: F841 local variable 'e' is assigned to but never used

Please fix the error(s) and push again.

tldr-bot avatar Apr 27 '24 15:04 tldr-bot

The build for this PR failed with the following error(s):

scripts/check-more-info-urls.py:13:1: F401 'random' imported but unused
scripts/check-more-info-urls.py:16:1: F401 'sys' imported but unused
scripts/check-more-info-urls.py:17:1: F401 'aiofile.Reader' imported but unused
scripts/check-more-info-urls.py:20:1: F401 'aiofile.async_open' imported but unused
scripts/check-more-info-urls.py:64:13: F841 local variable 'e' is assigned to but never used

Please fix the error(s) and push again.

tldr-bot avatar Apr 27 '24 15:04 tldr-bot

> Does anyone know why this is returning an error? Is it because of the asynchronous functions?

Probably yeah, will check the script locally and maybe fix this issue.


Edit: That didn't take long. I fixed the issue and also updated the README file. Some names were imported but never used, so I removed those imports. As for the unused exception variable e, I added it to the aprint output.

i.e.

diff --git a/scripts/check-more-info-urls.py b/scripts/check-more-info-urls.py
index 5d055e9a5bd3f..847232bdef3ab 100644
--- a/scripts/check-more-info-urls.py
+++ b/scripts/check-more-info-urls.py
@@ -2,22 +2,19 @@
 # SPDX-License-Identifier: MIT
 
 """
-A Python script to check for bad (HTTP status code different than 200) "More information" URLs accross all pages.
+A Python script to check for bad (HTTP status code different than 200) "More information" URLs across all pages.
 
-These bad codes tipically indicate a not found page or a redirection. They are written to bad-urls.txt with their respective URLs.
+These bad codes typically indicate a page not found or a redirection. They are written to bad-urls.txt with their respective URLs.
 
 Usage:
     python3 scripts/check-more-info-urls.py
 """
 
-import random
 import re
 import asyncio
-import sys
-from aiofile import AIOFile, Reader, Writer
 import aiohttp.client_exceptions
 from aioconsole import aprint
-from aiofile import async_open
+from aiofile import AIOFile, Writer
 from aiopath import AsyncPath
 
 MAX_CONCURRENCY = 500
@@ -62,7 +59,7 @@ async def process_file(
             try:
                 content = await f.read()
             except Exception as e:
-                await aprint(file.parts[-3:])
+                await aprint(f"Error: {e}, File: {file.parts[-3:]}")
                 return
 
     url = extract_url(content)

Feel free to check it out and modify my changes @vitorhcl.

kbdharun avatar Apr 28 '24 12:04 kbdharun

@kbdharun your change LGTM, thank you for the fixes.

vitorhcl avatar Apr 28 '24 15:04 vitorhcl

Are you going to implement the domain rotation or do you want me to do that?

vitorhcl avatar Apr 28 '24 15:04 vitorhcl

> Are you going to implement the domain rotation or do you want me to do that?

Feel free to do it, I assigned myself for the previous change (and to sort this PR separately under my notifications 😅 ).

kbdharun avatar Apr 28 '24 15:04 kbdharun

@vitorhcl is this PR ready for review? Or should it become draft until it is ready for review?

sebastiaanspeck avatar May 11 '24 05:05 sebastiaanspeck

> @vitorhcl is this PR ready for review? Or should it become draft until it is ready for review?

Hmm, it should become a draft until it's ready to merge.

PS: My 3 previous commits have bodies that explain each change.

vitorhcl avatar May 11 '24 11:05 vitorhcl

Fixed the merge conflicts in the README file. We still need to implement the remaining todo tasks.

kbdharun avatar May 18 '24 04:05 kbdharun

@vitorhcl any update on this PR?

sebastiaanspeck avatar Aug 19 '24 19:08 sebastiaanspeck

While running the script, I found that you eventually get a 429 (Too Many Requests) on the GitHub links, and sometimes you get a redirect, resulting in a 30X. To reduce the 429s, I guess we should just check fewer URLs at the same time. A 30X isn't wrong either, but right now it gets marked as a bad URL.
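A rough sketch of both ideas, under assumptions: cap the number of in-flight requests with `asyncio.Semaphore` (well below the script's current `MAX_CONCURRENCY = 500`), and classify 30X and 429 separately instead of lumping them in with bad URLs. `classify_status`, `check_all`, and the `fetch_status` callable are hypothetical names; `fetch_status` stands in for whatever coroutine performs the real HTTP request.

```python
import asyncio


def classify_status(status):
    """Sort an HTTP status code into 'ok', 'redirect', 'rate_limited' or 'bad'."""
    if status == 200:
        return "ok"
    if 300 <= status < 400:
        return "redirect"  # not broken, the URL just moved
    if status == 429:
        return "rate_limited"  # worth retrying later, not a dead link
    return "bad"


async def check_all(urls, fetch_status, max_concurrency=20):
    """Check all URLs with at most `max_concurrency` requests in flight.

    `fetch_status` is an assumed coroutine function: url -> HTTP status code.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def check(url):
        async with semaphore:  # wait here once the cap is reached
            return url, classify_status(await fetch_status(url))

    return dict(await asyncio.gather(*(check(url) for url in urls)))
```

With this split, only the "bad" entries would need to go to bad-urls.txt, while "rate_limited" ones could be retried after a pause. The default of 20 is an arbitrary placeholder, not a measured value.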

sebastiaanspeck avatar Sep 18 '24 03:09 sebastiaanspeck