unfound links
Hi, I had trouble with links not being found, so I built a minimal example to reproduce the bug:
ancors.ml
module S = Soup
module M = Mechaml

let url = "https://fr.wikipedia.org/wiki/OCaml"

module Debug = struct
  let pretty_print page css_selector =
    let soup = M.Page.soup page in
    let node = S.R.select_one css_selector soup in
    print_endline (S.pretty_print node)
end

let action () =
  let open M.Agent.Monad.Infix in
  M.Agent.get url
  >|= (fun http_res ->
    let page = M.Agent.HttpResponse.page http_res in
    let infobox_selector = ".infobox_v2 > tbody" in
    (* select all descendants of the infobox *)
    let link_selector = Printf.sprintf "%s *" infobox_selector in
    (* check that the selector finds the first child element of the infobox *)
    Debug.pretty_print page link_selector;
    (* is supposed to collect all links in the infobox *)
    let link_seq = M.Page.links_with link_selector page in
    let l = M.Page.to_list link_seq in
    (* prints 0, while there are ~25 links in the infobox *)
    print_int (List.length l);
    print_newline ())

let () = snd (M.Agent.Monad.run (M.Agent.init ()) (action ()))
dune
(executable
 (name ancors)
 (libraries lambdasoup mechaml))
My version of Mechaml is 1.2.1. I do not know what the problem is; maybe relative links? I hope I did not make a trivial mistake in using the library, though. As additional information, if I try to select all links of this page, I get a list of about 500 elements.
Best,
Hey again, I investigated a bit and found the source of the problem, inside page.ml:
let tag_selector tag = function
  | "" -> tag
  | s when s.[0]='*' -> s
  | s when is_identifier_char s.[0] -> s
  | s -> tag^s

(* ... *)

let links_with selector p =
  p.soup
  |> Soup.select (tag_selector "a" selector)
  |> Soup.filter (tag_filter "a")
  |> seq_from_nodes (fun elt -> Link.( {resolver = resolver p; elt} ))
It looks like the use case in mind when this was written was passing links_with a CSS attribute selector (i.e. one surrounded by "[]"). Unfortunately, it does not work at all when we want the links to be descendants of elements with class .x or id #y: such a selector reaches the last case and gets the tag prepended, so ".x *" becomes "a.x *".
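To make the rewriting concrete, here is a small standalone sketch that copies tag_selector from the excerpt above and prints what it does to a few selectors. The body of is_identifier_char is not shown in this thread, so the definition below is an assumption (letters, digits, '_' and '-'); the sketch does not call Mechaml or Lambda Soup at all.

```ocaml
(* ASSUMPTION: the real is_identifier_char in page.ml is not shown in this
   thread; this stand-in accepts letters, digits, '_' and '-'. *)
let is_identifier_char = function
  | 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '_' | '-' -> true
  | _ -> false

(* Copied from the page.ml excerpt quoted above. *)
let tag_selector tag = function
  | "" -> tag
  | s when s.[0] = '*' -> s
  | s when is_identifier_char s.[0] -> s
  | s -> tag ^ s

(* Print each input selector next to the selector that would be
   handed to Soup.select by links_with. *)
let () =
  List.iter
    (fun s -> Printf.printf "%S -> %S\n" s (tag_selector "a" s))
    [ "[href$=.html]";               (* attribute selector: tag is prepended *)
      ".infobox_v2 > tbody *";       (* class selector: tag is prepended too *)
      "#toc *";                      (* id selector: same problem *)
      "html .infobox_v2 > tbody *" ] (* starts with a tag name: unchanged *)
```

Under that assumption, the class selector from the minimal example turns into "a.infobox_v2 > tbody *", which asks for descendants of an `a` element carrying the class infobox_v2, whereas a selector that starts with a tag name is passed through untouched.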
A workaround is to prefix the CSS selector given to links_with (and the other _with functions) with "html ", for example, so that it starts with an identifier character and is passed through unchanged.
To fix this issue, I suggest either:
- adding these two lines to the tag_selector function:
  | s when s.[0]='#' -> s
  | s when s.[0]='.' -> s
- or simply removing the tag_selector function and letting people write their own complete CSS selector (and possibly documenting that it is up to the user to provide the full selector for the elements they want to match).
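As a rough sketch of the first option, the function could look like the block below once the two cases are added. Again, is_identifier_char is an assumed stand-in, since its real definition is not shown in this thread, and the exact position of the new cases is a choice made here for illustration (any position before the final fallback behaves the same).

```ocaml
(* ASSUMPTION: stand-in for the real is_identifier_char in page.ml. *)
let is_identifier_char = function
  | 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '_' | '-' -> true
  | _ -> false

(* tag_selector with the two proposed cases: selectors starting with
   '#' or '.' are now passed through instead of getting the tag prepended. *)
let tag_selector tag = function
  | "" -> tag
  | s when s.[0] = '*' -> s
  | s when s.[0] = '#' -> s
  | s when s.[0] = '.' -> s
  | s when is_identifier_char s.[0] -> s
  | s -> tag ^ s

let () =
  (* The selector from the minimal example is left as-is... *)
  assert (tag_selector "a" ".infobox_v2 > tbody *" = ".infobox_v2 > tbody *");
  (* ...while attribute selectors still get the tag prepended. *)
  assert (tag_selector "a" "[href]" = "a[href]")
```

Since links_with pipes the selected nodes through Soup.filter (tag_filter "a") afterwards, a plain class selector such as ".x" should presumably still end up returning only links, which is why this option looks compatible with existing callers.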
I have a preference for the second solution because it is more predictable for users, and easy to fix. The first is even simpler, but maybe not exhaustive (now or in the future). Please tell me what you prefer so that I can make a PR accordingly ;)
Hi @RuyBlast, thanks for the investigation! I believe the second solution is unfortunately backward incompatible. So, for now, I would go for the first one, which is simple and doesn't change current behavior. This gives us time to think about the second one :slightly_smiling_face: (maybe we can just add an alternative function to do so, links_with_no_tag_selector_but_obviously_a_better_name). I will happily review your PR and make a new minor release.