mechaml unfound links

Hi, I had troubles with unfound links so I built a minimal example to reproduce the bug:

ancors.ml

module S = Soup
module M = Mechaml

let url = "https://fr.wikipedia.org/wiki/OCaml"

module Debug = struct
  let pretty_print page css_selector =
    let soup = M.Page.soup page in
    let node = S.R.select_one css_selector soup in
    print_endline (S.pretty_print node)
end

let action () =
  let open M.Agent.Monad.Infix in
  M.Agent.get url
  >|= (fun http_res ->
    let page = M.Agent.HttpResponse.page http_res in
    let infobox_selector = ".infobox_v2 > tbody" in

    (* select all descendants of the infobox *)
    let link_selector = Printf.sprintf "%s *" infobox_selector in

    (* checks that selector find first child element of infobox *)
    Debug.pretty_print page link_selector;

    (* is supposed to collect all links in the infobox *)
    let link_seq = M.Page.links_with link_selector page in
    let l = M.Page.to_list link_seq in

    (* prints 0, while there is ~25 links in infobox *)
    print_int (List.length l);
    print_newline ())

let () = snd (M.Agent.Monad.run (M.Agent.init ()) (action ()))

dune

(executable (name ancors)
 (libraries lambdasoup mechaml))

My version of Mechamel is 1.2.1. I do not know what is the problem, maybe relative links ? I hope I did not make a trivial usage mistake of the library though. As a supplementary information, if I try to select all links of this page, I get about a list of length ~500.

Best,

Aug 19 '23 15:08 RuyBlast

Hey again, I have investigated a bit and found the source of the problem, inside page.ml :

let tag_selector tag = function
  | "" -> tag
  | s when s.[0]='*' -> s
  | s when is_identifier_char s.[0] -> s
  | s -> tag^s

(* ... *)

let links_with selector p =
  p.soup
  |> Soup.select (tag_selector "a" selector)
  |> Soup.filter (tag_filter "a")
  |> seq_from_nodes (fun elt -> Link.( {resolver = resolver p; elt} ))

I see the use-case that was in mind when developing this was feeding links_with with a css attribute selector (aka surrounded by "[]"). Unfortunately, it does not work at all when we want to specify that the links we are looking for are descendants of elements with class .x or id #y.

A workaround is prefixing the css selector we give to links_with (and other _with functions) with "html ", as an example.

To fix this issue, I suggest either:

to add these two lines to tag_selector function:

  | s when s.[0]='#' -> s
  | s when s.[0]='.' -> s

or to simply suppress the tag_selector function and just let people write right their css selector (and eventually document it is on the user to provide the complete css selector for elements to match) to match elements they want.

I have a preference for the second solution because it is more predictable for users, and easy to fix. The first is even simpler, but maybe not exhaustive (now or in the future). Please tell me what you prefer so that I can make a PR accordingly ;)

Aug 26 '23 15:08 RuyBlast

Hi @RuyBlast, thanks for the investigation! I believe the second solution is unfortunately backward incompatible. So, for now, I would vouch for the first one, which is simple and doesn't change current behavior. This lets us time to think about the second one :slightly_smiling_face: (maybe we can just add an alternative function to do so, links_with_no_tag_selector_but_obviously_a_better_name). Will happily review your PR and make a new minor release.

Sep 06 '23 09:09 yannham