Descendants and CssSelect do not take all elements correctly

Open RemSoftDev opened this issue 7 years ago • 8 comments

https://www.amazon.com/s/ref=lp_1_nr_n_0?fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508&bbn=1&ie=UTF8&qid=1519043803&rnid=1

Try to select all "ul" elements. It always finds a random number of them, somewhere between 12 and 19, but the page always contains exactly 38 ULs.

Or simply run this code:

open System.Net
open System
open System.IO
open FSharp.Data
// Fetch the contents of a web page
let fetchUrl callback url =        
    let req = WebRequest.Create(Uri(url)) :?> HttpWebRequest 
    req.UserAgent <- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2";

    use resp = req.GetResponse() 
    use stream = resp.GetResponseStream() 
    use reader = new IO.StreamReader(stream)
    callback reader url

// Build the full URL from a partial one, then pass it to the request function
let GetStreamUrl pRequest pUrlBuilder pUrlPart =
    let fullUrl:string = pUrlBuilder pUrlPart
    pRequest fullUrl


let myCallback (reader:IO.StreamReader) url = 
    let html = reader.ReadToEnd()
    let html1000 = html.Substring(0,1000)
    printfn "Downloaded %s. First 1000 is %s" url html1000
    html      // return all the html

let GetLinkNameUrl pGetStreamUrl pUrlPart name listT = 
    let html:string = pGetStreamUrl pUrlPart
    let doc = HtmlDocument.Parse html
    // Both selections are expected to return 38 elements
    let desc = doc.Descendants "ul"
    let css = doc.CssSelect "ul"
    ""

let FullUrl l r = 
    sprintf "%s%s" l r
[<EntryPoint>]
let main argv = 
    let baseUrl = "https://www.amazon.com"
    let url =  GetLinkNameUrl 
                (GetStreamUrl (fetchUrl myCallback) (FullUrl baseUrl))  
                "/s/ref=lp_1_nr_n_0?fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508&bbn=1&ie=UTF8&qid=1519043803&rnid=1"
                "123" 
                List.empty
               
    printfn "%A" argv
    0 // return an integer exit code

let desc = doc.Descendants "ul"
let css = doc.CssSelect "ul"

Both should contain 38 elements.

RemSoftDev, Feb 20 '18 12:02

@rflechner, can you have a look?

ovatsus, Feb 20 '18 14:02

Hi,

I will try to have a look this week.

Regards.

rflechner, Feb 21 '18 12:02

Hi,

I confirm that div#mainResults seems to be removed by the HTML parser.

I will try to understand why.

Regards.

rflechner, Feb 22 '18 12:02

@rflechner, did you get a chance to look at this?

ovatsus, Apr 09 '18 15:04

/cc @colinbull as well

ovatsus, Apr 09 '18 17:04

The issue is that the parser, while in script mode, cannot distinguish JavaScript division from a regular expression literal. Currently, whenever it encounters a '/' (outside a string literal, and not followed by another '/' or '*', which would start a comment), it assumes a regular expression is beginning. It consumes all characters up to the next '/', treats them as the regular expression, and then continues looking for the closing script tag. Hence a division operator can cause some HTML to be swallowed by a script tag. For example, <script>var a = 5 / 2</script><div></div><script>var b = 3 / 2</script> will be parsed as a single script block.

Unit test:

[<TestCase("""var a = 5 / 2""")>]
let ``Can distinguish division from regular expression`` content =
    let result = HtmlDocument.Parse (sprintf "<script>%s</script><div></div>" content)
    let expected = 
        HtmlDocument.New
            [ HtmlNode.NewElement("script",
                [],
                [ HtmlNode.NewText content ]) 
              HtmlNode.NewElement("div", [], [])]
    result |> should equal expected
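
The failure mode described above can be reproduced with a toy scanner. This is only a sketch under stated assumptions: naiveScriptEnd is a hypothetical helper, not FSharp.Data's actual parser, and it ignores string literals and comments for brevity. It only mimics the rule "every '/' opens a regular expression that runs to the next '/'".

```fsharp
open System

// Toy model of the behaviour described above; NOT FSharp.Data's real parser.
// Looks for "</script>" starting at `start`, but treats every other '/' as the
// opening of a regex literal and skips ahead to the next '/'.
let naiveScriptEnd (html: string) (start: int) : int option =
    let closing = "</script>"
    let rec scan i =
        if i >= html.Length then None
        elif String.CompareOrdinal(html, i, closing, 0, closing.Length) = 0 then Some i
        elif html.[i] = '/' then
            // Assumed rule: '/' starts a regex; jump past the next '/'.
            match html.IndexOf('/', i + 1) with
            | -1 -> None
            | j -> scan (j + 1)
        else scan (i + 1)
    scan start

let sample =
    "<script>var a = 5 / 2</script><div></div><script>var b = 3 / 2</script>"

// The first script body starts right after "<script>", at index 8.
let naiveEnd = naiveScriptEnd sample 8
let actualEnd = sample.IndexOf "</script>"
printfn "naive end: %A, actual end: %d" naiveEnd actualEnd
```

With this input the toy scanner skips the first closing tag, because the division operator "opens" a regex that ends at the '/' inside that tag. It only stops at the second closing tag, so the div in between is swallowed.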

serhiiz, Jun 30 '18 07:06

@serhiiz will it be fixed somehow?

a-a-k, Aug 04 '18 17:08

@a-a-k sorry, so far I haven't had a chance to come up with a fix. Just in case, there's some info available at https://stackoverflow.com/questions/5519596/when-parsing-javascript-what-determines-the-meaning-of-a-slash
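
One common heuristic for this problem is to look at what precedes the slash. If the previous significant character can end an expression (an identifier character, a digit, ')' or ']'), the slash is probably division. Otherwise it probably starts a regular expression. The function below is a minimal sketch of that idea. slashStartsRegex is a hypothetical name, and this is not a proposed patch for FSharp.Data.

```fsharp
open System

// Hypothetical helper: guesses whether the '/' at index `pos` starts a regex
// literal, based only on the previous non-whitespace character.
let slashStartsRegex (js: string) (pos: int) : bool =
    let rec prev i =
        if i < 0 then None
        elif Char.IsWhiteSpace js.[i] then prev (i - 1)
        else Some js.[i]
    match prev (pos - 1) with
    | None -> true // nothing before the slash: cannot be a division
    | Some c when Char.IsLetterOrDigit c || c = '_' || c = '$' || c = ')' || c = ']' ->
        false // previous token ends an expression, so treat '/' as division
    | Some _ -> true // operator, '(', ',', '=' and so on: treat '/' as a regex

printfn "%b" (slashStartsRegex "var a = 5 / 2" 10)
printfn "%b" (slashStartsRegex "var r = /ab+c/" 8)
```

The heuristic is deliberately simple and has known gaps. A regex after a keyword such as return is misclassified as division, and a slash after '}' is ambiguous without more context.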

serhiiz, Aug 09 '18 07:08