Allow specifying HTTP request parameters

Open diegorondini opened this issue 3 years ago • 10 comments

Is your feature request related to a problem? Please describe.

Some URLs require specific HTTP request parameters. One example is the GitHub Docs pages; for instance, checking this .md file fails:

$ cat mdtest.md 
= Test =

[Github docs link](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/licensing-a-repository)

$ mlc

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+                                                          +
+            markup link checker - mlc v0.15.2             +
+                                                          +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

[Err ] ./mdtest.md (3, 1) => https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/licensing-a-repository - 403 - Forbidden

Result (1 links):

OK       0
Skipped  0
Warnings 0
Errors   1


The following links could not be resolved:

./mdtest.md (3, 1) => https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/licensing-a-repository.

The reason is that the page requires specific HTTP headers: https://github.com/github-community/community/discussions/14773

Describe the solution you'd like

It would be nice to have a way to specify HTTP request parameters, possibly per-URL.

diegorondini avatar Jul 15 '22 08:07 diegorondini

I like this idea. I just don't know how exactly one would pass all the possible header fields to mlc. Via a command-line argument?

becheran avatar Jul 18 '22 06:07 becheran

Probably the best option would be a config file, otherwise it would be impractical to specify different headers for different URLs.

See for example:
https://github.com/orgs/github-community/discussions/14773#discussioncomment-2679987
https://github.com/tcort/markdown-link-check#config-file-format

diegorondini avatar Jul 18 '22 06:07 diegorondini

I think your pipeline has been hit by this bug: https://github.com/becheran/mlc/actions/runs/3559864946/jobs/5979511630

[Err ] ./README.md (62, 22) => https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions - 403 - Forbidden
Error: https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions. 403 - Forbidden

diegorondini avatar Nov 28 '22 07:11 diegorondini

@diegorondini fun fact: it does not fail when I run it locally. Does GitHub somehow block requests to GitHub.com from its own runners? You mentioned missing request parameters. Which ones would those be in this case?

becheran avatar Nov 28 '22 09:11 becheran

@becheran I think the first question is why the pipeline checks that link even though there's no such link in README.md:

$ grep 'docs\.github' README.md

Returning to this bug, docs.github.com requires the Accept-Encoding: zstd, br, gzip, deflate header:

$ curl -i -X GET https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions
HTTP/2 403 
x-azure-ref: 0wn2EYwAAAACr4P2HgpUzTatC1/nj5XnyTU5aMjIxMDYwNjEzMDIxADU5NmQ3OGEyLWNhNWYtNDc5ZC1iY2RjLTA4MzU4MzMxNzRiMg==
accept-ranges: bytes
via: 1.1 varnish, 1.1 varnish
date: Mon, 28 Nov 2022 09:22:10 GMT
x-served-by: cache-iad-kiad7000135-IAD, cache-mrs10563-MRS
x-cache: MISS, MISS
x-cache-hits: 0, 0
x-timer: S1669627330.213655,VS0,VE92
strict-transport-security: max-age=31557600

$ curl -i -H "Accept-Encoding: zstd, br, gzip, deflate" -X GET https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions
HTTP/2 200 
cache-control: public, max-age=60
content-type: text/html; charset=utf-8
access-control-allow-origin: *
content-security-policy: default-src 'none';prefetch-src 'self';connect-src 'self';font-src 'self' data: githubdocs.azureedge.net;img-src 'self' github.com *.github.com *.githubusercontent.com *.githubassets.com data: githubdocs.azureedge.net placehold.it;object-src 'self';script-src 'self' data: githubdocs.azureedge.net;frame-src 'self' github.com *.github.com *.githubusercontent.com *.githubassets.com https://www.youtube-nocookie.com;frame-ancestors 'self' github.com *.github.com *.githubusercontent.com *.githubassets.com;style-src 'self' 'unsafe-inline' data: githubdocs.azureedge.net;child-src 'self';upgrade-insecure-requests;base-uri 'self';form-action 'self';script-src-attr 'none'
cross-origin-opener-policy: same-origin
cross-origin-resource-policy: same-origin
x-dns-prefetch-control: off
x-frame-options: SAMEORIGIN
x-download-options: noopen
x-content-type-options: nosniff
origin-agent-cluster: ?1
x-permitted-cross-domain-policies: none
referrer-policy: strict-origin-when-cross-origin
x-xss-protection: 0
x-powered-by: Next.js
x-azure-ref: 0hXyEYwAAAADMF8jkAx/XToTRxIg5u1m/UEhMMzBFREdFMDMxOQA1OTZkNzhhMi1jYTVmLTQ3OWQtYmNkYy0wODM1ODMzMTc0YjI=
content-encoding: br
via: 1.1 varnish, 1.1 varnish
accept-ranges: bytes
date: Mon, 28 Nov 2022 09:22:29 GMT
age: 335
x-served-by: cache-iad-kiad7000135-IAD, cache-mrs10583-MRS
x-cache: CONFIG_NOCACHE, HIT, HIT
x-cache-hits: 3, 1
x-timer: S1669627349.305248,VS0,VE1
vary: Accept-Encoding
strict-transport-security: max-age=31557600
content-length: 38324

Warning: Binary output can mess up your terminal. Use "--output -" to tell 
Warning: curl to output it to your terminal anyway, or consider "--output 
Warning: <FILE>" to save to a file.

diegorondini avatar Nov 28 '22 09:11 diegorondini

Sorry, I just realized I should have checked out the github-action-output branch. Now it fails for me as well with 0.15.4:

$ mlc ./README.md

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+                                                          +
+            markup link checker - mlc v0.15.4             +
+                                                          +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

09:31:29 [WARN] Broken reference link: Borrowed("possible values: md, html")
09:31:29 [WARN] Strip everything after #. The chapter part '#ci-pipeline-integration' is not checked.
[ OK ] ./README.md (19, 8) => #ci-pipeline-integration - 
[ OK ] ./README.md (64, 1) => ./docs/FailingAnnotation.PNG - 
[ OK ] ./README.md (32, 28) => https://doc.rust-lang.org/cargo/ - 
[ OK ] ./README.md (4, 2) => https://badgen.net/crates/d/mlc?color=blue - 
[ OK ] ./README.md (46, 56) => https://github.com/marketplace/actions/markup-link-checker-mlc - 
[ OK ] ./README.md (20, 29) => https://rust-lang.github.io/async-book/ - 
[ OK ] ./README.md (3, 2) => https://img.shields.io/crates/v/mlc.svg?color=orange - 
[ OK ] ./README.md (9, 1) => https://asciinema.org/a/299100 - 
[ OK ] ./README.md (9, 2) => https://asciinema.org/a/299100.svg - 
[ OK ] ./README.md (6, 2) => https://img.shields.io/badge/License-MIT-yellow.svg - 
[ OK ] ./README.md (5, 2) => https://github.com/becheran/mlc/actions/workflows/rust.yml/badge.svg - 
[ OK ] ./README.md (7, 2) => https://img.shields.io/badge/PRs-welcome-brightgreen.svg - 
[ OK ] ./README.md (3, 1) => https://crates.io/crates/mlc - 
[ OK ] ./README.md (4, 1) => https://crates.io/crates/mlc - 
[ OK ] ./README.md (32, 92) => https://crates.io/crates/mlc - 
[Err ] ./README.md (62, 22) => https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions - 403 - Forbidden
[ OK ] ./README.md (144, 60) => https://github.com/becheran/mlc/blob/master/LICENSE - 
[ OK ] ./README.md (75, 32) => https://github.com/becheran/ntest/blob/master/.github/workflows/ci.yml - 
[ OK ] ./README.md (79, 37) => https://hub.docker.com/repository/docker/becheran/mlc - 
[ OK ] ./README.md (140, 14) => https://github.com/becheran/mlc/blob/master/CHANGELOG.md - 
[ OK ] ./README.md (6, 1) => https://opensource.org/licenses/MIT - 
[ OK ] ./README.md (112, 221) => https://github.com/becheran/wildmatch - 
[ OK ] ./README.md (40, 54) => https://github.com/becheran/mlc/releases - 
[ OK ] ./README.md (5, 1) => https://github.com/becheran/mlc/actions/workflows/rust.yml - 
[ OK ] ./README.md (7, 1) => https://github.com/becheran/mlc/blob/master/CONTRIBUTING.md - 

Result (25 links):

OK       24
Skipped  0
Warnings 0
Errors   1


The following links could not be resolved:

./README.md (62, 22) => https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions.

diegorondini avatar Nov 28 '22 09:11 diegorondini

Ah, right. I made the same mistake and ran it on the wrong branch locally 🤦‍♂️

becheran avatar Nov 28 '22 10:11 becheran

@diegorondini would 'Accept-Encoding: *' help in this case? Might be a sane default since we don't care about the content anyways right now.

To make it configurable, I think a map of link patterns (with wildcards) to their associated headers would make sense as a config parameter. Will think about it.
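The map of wildcard links to headers could look roughly like the sketch below. This is only an illustration under stated assumptions, not mlc's actual configuration or API: `HeaderRule`, `example_rules`, `headers_for`, and `wildcard_match` are hypothetical names, the matcher is hand-rolled so the example needs only the standard library, and the `example.com` rule with its token placeholder is made up.

```rust
use std::collections::BTreeMap;

/// Returns true if `text` matches `pattern`, where `*` in the pattern
/// matches any run of characters (including an empty one).
fn wildcard_match(pattern: &str, text: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    let (mut pi, mut ti) = (0, 0);
    let mut star: Option<usize> = None; // index of the last `*` seen in the pattern
    let mut mark = 0; // text index where that `*` started matching
    while ti < t.len() {
        if pi < p.len() && p[pi] == '*' {
            star = Some(pi);
            mark = ti;
            pi += 1;
        } else if pi < p.len() && p[pi] == t[ti] {
            pi += 1;
            ti += 1;
        } else if let Some(s) = star {
            // Backtrack: let the last `*` swallow one more character.
            pi = s + 1;
            mark += 1;
            ti = mark;
        } else {
            return false;
        }
    }
    // Only trailing `*` may remain in the pattern.
    while pi < p.len() && p[pi] == '*' {
        pi += 1;
    }
    pi == p.len()
}

/// Hypothetical config entry: a URL pattern plus the headers to send for it.
struct HeaderRule {
    pattern: &'static str,
    headers: Vec<(&'static str, &'static str)>,
}

/// Example rules; the second one (host and token placeholder) is invented.
fn example_rules() -> Vec<HeaderRule> {
    vec![
        HeaderRule {
            pattern: "https://docs.github.com/*",
            headers: vec![("Accept-Encoding", "zstd, br, gzip, deflate")],
        },
        HeaderRule {
            pattern: "https://example.com/private/*",
            headers: vec![("Authorization", "Bearer <token>")],
        },
    ]
}

/// Collects the headers of every rule whose pattern matches `url`.
/// Later rules override earlier ones for the same header name.
fn headers_for(rules: &[HeaderRule], url: &str) -> BTreeMap<String, String> {
    let mut out = BTreeMap::new();
    for rule in rules {
        if wildcard_match(rule.pattern, url) {
            for (name, value) in &rule.headers {
                out.insert(name.to_string(), value.to_string());
            }
        }
    }
    out
}

fn main() {
    let rules = example_rules();
    let url = "https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions";
    for (name, value) in headers_for(&rules, url) {
        println!("{name}: {value}");
    }
}
```

Keeping the rules in an ordered list rather than a hash map means a later, more specific pattern can override a broad one, which seems useful if a default such as Accept-Encoding is combined with per-link overrides.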

becheran avatar Nov 28 '22 10:11 becheran

@becheran well, not literally:

$ curl -i -H "Accept-Encoding: *" -X GET https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions
HTTP/2 403
[...]

The official way to mean any encoding should be Accept-Encoding: */*, but I don't know how well it works in practice. https://stackoverflow.com/questions/25182888/does-in-an-http-accepts-encoding-header-mean-gzip-is-supported

The library you're using (reqwest?) may support accepting all encodings. Libcurl does that: https://curl.se/libcurl/c/CURLOPT_ACCEPT_ENCODING.html

I'm not sure, though, whether servers that don't support compression / encoding gracefully ignore the "Accept-Encoding" header.

diegorondini avatar Nov 28 '22 12:11 diegorondini

Yes, I am using reqwest. I turned on all supported encodings (brotli, gzip, deflate) and that did the trick for now. But I guess there are other cases where a custom request is still required, for example if an authentication token is required for a specific link.

becheran avatar Nov 28 '22 19:11 becheran