FSharp.Data
Descendants and CssSelect do not take all elements correctly
https://www.amazon.com/s/ref=lp_1_nr_n_0?fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508&bbn=1&ie=UTF8&qid=1519043803&rnid=1
Try to select all "ul" elements. The parser finds a seemingly random number of them on each run, somewhere between 12 and 19, but the page always contains exactly 38 ULs.
Or simply run this code:
open System.Net
open System
open System.IO
open FSharp.Data

// Fetch the contents of a web page
let fetchUrl callback url =
    let req = WebRequest.Create(Uri(url)) :?> HttpWebRequest
    req.UserAgent <- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2";
    use resp = req.GetResponse()
    use stream = resp.GetResponseStream()
    use reader = new IO.StreamReader(stream)
    callback reader url

let GetStreamUrl pReauest pUrlBuilder pUrlPart =
    let fullUrl:string = pUrlBuilder pUrlPart
    pReauest fullUrl

let myCallback (reader:IO.StreamReader) url =
    let html = reader.ReadToEnd()
    let html1000 = html.Substring(0,1000)
    printfn "Downloaded %s. First 1000 is %s" url html1000
    html // return all the html

let GetLinkNameUrl pGetStreamUrl pUrlPart name listT =
    let pUrlSteam:string = pGetStreamUrl pUrlPart
    let doc = HtmlDocument.Parse pUrlSteam
    let desc = doc.Descendants "ul"
    let css = doc.CssSelect "ul"
    ""

let FullUrl l r =
    sprintf "%s%s" l r

[<EntryPoint>]
let main argv =
    let baseUrl = "https://www.amazon.com"
    let url =
        GetLinkNameUrl
            (GetStreamUrl (fetchUrl myCallback) (FullUrl baseUrl))
            "/s/ref=lp_1_nr_n_0?fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508&bbn=1&ie=UTF8&qid=1519043803&rnid=1"
            "123"
            List.empty
    printfn "%A" argv
    0 // return an integer exit code
let desc = doc.Descendants "ul"
let css = doc.CssSelect "ul"
should each contain 38 elements.
@rflechner, can you have a look?
Hi,
I will try to have a look this week.
Regards.
Hi,
I confirm that div#mainResults seems to be removed by the HTML parser.
I will try to understand why.
Regards.
@rflechner, did you get a chance to look at this?
/cc @colinbull as well
The issue is that the parser, when in script mode, cannot distinguish JavaScript division from a regular expression literal. Currently, whenever it encounters a '/' (outside a string literal, and not followed by another '/' or '*', which would start a comment), it assumes a regular expression is beginning. It consumes all characters up to the next '/', treats them as a regular expression, and then continues looking for the closing script tag.
Hence, a division operator may cause some HTML to be swallowed by a script tag.
For example:
<script>var a = 5 / 2</script><div></div><script>var b = 3 / 2</script>
will be parsed as a single script block.
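To make the effect easier to see, here is a toy model of the scanning rule described above. It is only a sketch, not FSharp.Data's actual code: the function name naiveScriptEnd is made up for illustration, and string literals and comments are ignored to keep it short.

```fsharp
open System

// Toy model of the rule described above; NOT the library's real implementation.
// Assumption: string literals and comments are ignored for brevity.

/// Returns the index at which the scanner believes the script body ends
/// (the position of "</script"), or None if it runs off the end of the input.
let naiveScriptEnd (html: string) (start: int) : int option =
    let rec scan i =
        if i >= html.Length then None
        elif String.CompareOrdinal(html, i, "</script", 0, 8) = 0 then Some i
        elif html.[i] = '/' then
            // Every '/' is assumed to open a regex literal:
            // skip everything up to and including the next '/'.
            match html.IndexOf('/', i + 1) with
            | -1 -> None
            | close -> scan (close + 1)
        else scan (i + 1)
    scan start

let sample = "<script>var a = 5 / 2</script><div></div><script>var b = 3 / 2</script>"

// The first script body really ends at index 21, but the model only stops at
// the second closing tag, so the <div> ends up inside the script block.
printfn "%A" (naiveScriptEnd sample "<script>".Length)
```

In this model the division's '/' is paired with the '/' of the first closing tag, so that tag is never seen as a tag.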
Unit test:
[<TestCase("""var a = 5 / 2""")>]
let ``Can distinguish division from regular expression`` content =
    let result = HtmlDocument.Parse (sprintf "<script>%s</script><div></div>" content)
    let expected =
        HtmlDocument.New
            [ HtmlNode.NewElement("script",
                                  [],
                                  [ HtmlNode.NewText content ])
              HtmlNode.NewElement("div", [], [])]
    result |> should equal expected
@serhiiz will it be fixed somehow?
@a-a-k sorry, so far I haven't had a chance to come up with a fix. Just in case, there's some info available at https://stackoverflow.com/questions/5519596/when-parsing-javascript-what-determines-the-meaning-of-a-slash
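For anyone picking this up: one widely used heuristic for this ambiguity is to look at the last non-whitespace character before the slash. Below is a rough sketch of that idea. The function name slashStartsRegex is hypothetical, this is not a proposed patch to the parser, and the heuristic is known to be incomplete (for example it misclassifies a regex after a keyword such as return, and it does not try to resolve a slash after '}').

```fsharp
open System

/// Guess whether the '/' at index i in js starts a regex literal,
/// judging only by the last non-whitespace character before it.
/// Hypothetical helper for illustration; incomplete by design.
let slashStartsRegex (js: string) (i: int) : bool =
    let rec prev j =
        if j < 0 then None
        elif Char.IsWhiteSpace js.[j] then prev (j - 1)
        else Some js.[j]
    match prev (i - 1) with
    | None -> true                                   // start of script: nothing to divide
    | Some c when Char.IsLetterOrDigit c -> false    // identifier or number: division
    | Some (')' | ']' | '_' | '$') -> false          // end of an expression: division
    | Some _ -> true                                 // operator or punctuation: regex

let division = "var a = 5 / 2"
let regex = "var r = /ab+c/"
printfn "%b" (slashStartsRegex division (division.IndexOf '/'))
printfn "%b" (slashStartsRegex regex (regex.IndexOf '/'))
```

A check like this would only decide how to treat the slash; the scanner would still need to handle strings, comments and the closing script tag as it does today.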