colly
Extract JS Code (Not execute)
I'm trying to locate JavaScript code within an HTML page. Colly is not a headless browser, so it doesn't execute JS, but I don't need to execute the code: I only need to find a subset of it (or even a set of strings) based on names and similar features.
Any chance anyone stumbled upon this?
Hey, maybe this part here helps you. I am using this in a tiny project to find an array:
localCollector.OnHTML("body", func(e *colly.HTMLElement) {
    s := e.DOM.Find("script").Text()
    r := regexp.MustCompile(`something\.array\s*=\s*(.+\}])\s*`)
    res := r.FindString(s)
    res = strings.ReplaceAll(res, "something.array = ", "")
    ...
})
With that you can freely search the script content. In my case I can parse the result into a struct and use the given array. Not sure if that is exactly what you want.
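For reference, here is a minimal standard-library sketch of the same idea that reads the capture group directly instead of stripping the prefix afterwards. The `item` struct, the `extractItems` helper, and the sample script text are assumptions for illustration, and it only works when the assigned array happens to be valid JSON. In a Colly callback, the `script` argument would be the text collected from the script elements.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// item is a hypothetical shape for the array elements.
type item struct {
	ID   int    `json:"id"`
	Name string `json:"name"`
}

// arrayAssignment captures the right-hand side of `something.array = [...]`.
// (?s) lets `.` span newlines; the lazy `.*?` stops at the first `}]`,
// so arrays of objects that themselves contain `}]` are not handled.
var arrayAssignment = regexp.MustCompile(`(?s)something\.array\s*=\s*(\[.*?\}\])`)

// extractItems finds the assignment in raw script text and decodes the array.
func extractItems(script string) ([]item, error) {
	m := arrayAssignment.FindStringSubmatch(script)
	if m == nil {
		return nil, fmt.Errorf("assignment not found")
	}
	var items []item
	if err := json.Unmarshal([]byte(m[1]), &items); err != nil {
		return nil, fmt.Errorf("captured text is not valid JSON: %w", err)
	}
	return items, nil
}

func main() {
	script := `var something = {}; something.array = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]; init();`
	items, err := extractItems(script)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(items), items[1].Name)
}
```

A regex is only a rough match for JavaScript source, so this is best suited to pages whose inline scripts have a stable, known layout.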
My example, which finds the script element with id __NEXT_DATA__:
c.OnHTML("script#__NEXT_DATA__", func(h *colly.HTMLElement) {
    var js map[string]interface{}
    err := json.Unmarshal([]byte(h.Text), &js)
    if err != nil {
        panic(errors.New("can't parse script#__NEXT_DATA__"))
    }
})
I tested this with a script element inside body, and it works.
Since there is a functional solution, I think the issue should be closed.
JS != JSON 🚨
I was working with a specific site, and its __NEXT_DATA__ was formatted as a JSON object. That gave me the structured data I was interested in extracting.
Yes, the generic case is JS, but this specific case was about JSON data inside a script element.
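To make the JS versus JSON distinction concrete, here is a small sketch using `json.Valid` from the standard library to decide whether script text can go straight to `json.Unmarshal` or needs a text-based approach such as a regex. The `isJSON` helper and the sample strings are illustrative assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// isJSON reports whether the script text is a complete JSON document.
func isJSON(scriptText string) bool {
	return json.Valid([]byte(scriptText))
}

func main() {
	jsonPayload := `{"items": [1, 2, 3]}`          // strict JSON: quoted keys
	jsSource := `var data = {items: [1, 2, 3]};`   // JavaScript: a statement with unquoted keys

	fmt.Println(isJSON(jsonPayload)) // true
	fmt.Println(isJSON(jsSource))    // false
}
```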
I figured something like that, but isn't it a bit strange to have a pure JSON object inside a <script> tag? 😛 Anyway, thanks for the info.