
Extract JS Code (Not execute)

Open pdavis156879 opened this issue 3 years ago • 5 comments

I'm attempting to extract/locate JavaScript code within an HTML page. Colly is not a headless browser, so JS execution is not a feature, but I don't actually need to execute the code. I only need to locate a subset of it (or even a set of strings) based on names and other similar features.

Any chance anyone stumbled upon this?

pdavis156879, Aug 26 '21 08:08

Hey, maybe this part here helps you. I am using this in a tiny project to find an array:

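localCollector.OnHTML("body", func(e *colly.HTMLElement) {
	// Text() concatenates the contents of every <script> element in the body.
	s := e.DOM.Find("script").Text()

	// Capture the array literal assigned to something.array, up to the closing "}]".
	r := regexp.MustCompile(`something\.array\s*=\s*(.+\}])\s*`)
	m := r.FindStringSubmatch(s)
	if m == nil {
		return // assignment not found on this page
	}
	res := m[1] // the array literal, without the "something.array = " prefix
...
})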

With that you can freely search in the script context. In my case I can parse it into a struct and use the given array. Not sure if that is exactly what you want.
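To make the "parse it into a struct" step concrete, here is a minimal sketch using only the standard library. It takes the concatenated script text (for example the `s` from the callback above) and decodes the captured array. The `Item` type, its fields, and the assumption that the array literal ends at the first `]` followed by `;` are hypothetical; it also only works when the array literal happens to be valid JSON (double-quoted keys, no trailing commas).

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"regexp"
)

// Item is a hypothetical shape for the array elements; adjust it to the real data.
type Item struct {
	ID   int    `json:"id"`
	Name string `json:"name"`
}

// arrayRe captures the array literal assigned to something.array.
// Assumption: the literal ends at the first "]" that is followed by ";".
var arrayRe = regexp.MustCompile(`(?s)something\.array\s*=\s*(\[.*?\])\s*;`)

// extractArray finds the assignment in the script text and decodes the array.
func extractArray(script string) ([]Item, error) {
	m := arrayRe.FindStringSubmatch(script)
	if m == nil {
		return nil, errors.New("something.array assignment not found")
	}
	var items []Item
	if err := json.Unmarshal([]byte(m[1]), &items); err != nil {
		return nil, fmt.Errorf("array literal is not valid JSON: %w", err)
	}
	return items, nil
}

func main() {
	script := `var something = {}; something.array = [{"id":1,"name":"a"},{"id":2,"name":"b"}]; init();`
	items, err := extractArray(script)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(items), items[0].Name)
}
```

If the array uses unquoted keys or single quotes, `json.Unmarshal` will return an error and a different approach is needed.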

Mxrk, Oct 07 '21 10:10

My example for finding the script element with id __NEXT_DATA__:

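c.OnHTML("script#__NEXT_DATA__", func(h *colly.HTMLElement) {
	// h.Text is the raw content of the script element; here it is plain JSON.
	var js map[string]interface{}
	err := json.Unmarshal([]byte(h.Text), &js)
	if err != nil {
		// Wrap the original error so the parse failure is not lost.
		panic(fmt.Errorf("can't parse script#__NEXT_DATA__: %w", err))
	}
})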

It is tested to work on a script element inside the body.
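Once the script text is known to be JSON, decoding into a small struct can be easier to work with than a `map[string]interface{}`. The following is only a sketch: the field names `page`, `props`, and `pageProps` are assumptions about what the target site's payload contains, and `parseNextData` is a hypothetical helper that would be called with `h.Text` from the callback above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nextData mirrors only the top-level fields used here; everything else is ignored.
// The field names are assumptions about the payload of the site being scraped.
type nextData struct {
	Page  string `json:"page"`
	Props struct {
		// PageProps is kept raw so it can be decoded later into a site-specific type.
		PageProps json.RawMessage `json:"pageProps"`
	} `json:"props"`
}

// parseNextData decodes the text of the script element and returns an error
// instead of panicking when the content is not valid JSON.
func parseNextData(text string) (nextData, error) {
	var nd nextData
	if err := json.Unmarshal([]byte(text), &nd); err != nil {
		return nextData{}, fmt.Errorf("can't parse script#__NEXT_DATA__: %w", err)
	}
	return nd, nil
}

func main() {
	nd, err := parseNextData(`{"page":"/items","props":{"pageProps":{"title":"x"}}}`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(nd.Page, string(nd.Props.PageProps))
}
```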

Since there is a functional solution, I think the issue should be closed.

kulak, Oct 23 '21 01:10

JS != JSON 🚨

RensTillmann, Nov 23 '22 20:11

I was working with a specific site whose __NEXT_DATA__ was formatted as a JSON object, so I could get to the structured data I was interested in extracting.

Yes, the generic case is JS, but this specific case was about JSON data in JS.
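Since a script element may hold either real JavaScript or plain JSON, one way to tell the two apart before decoding is `json.Valid` from the standard library. This is a rough sketch, not something from the thread: `decodeIfJSON` is a hypothetical helper, and it only treats a top-level JSON object as a match.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// decodeIfJSON reports whether the script body is a JSON object and, if so,
// returns it decoded. Ordinary JavaScript source yields (nil, false).
func decodeIfJSON(scriptText string) (map[string]interface{}, bool) {
	trimmed := strings.TrimSpace(scriptText)
	if !json.Valid([]byte(trimmed)) {
		return nil, false // regular JS, or otherwise not JSON
	}
	var data map[string]interface{}
	// Valid JSON that is not an object (an array, a number, null) is rejected too.
	if err := json.Unmarshal([]byte(trimmed), &data); err != nil || data == nil {
		return nil, false
	}
	return data, true
}

func main() {
	if data, ok := decodeIfJSON(`{"a": 1}`); ok {
		fmt.Println("JSON object with", len(data), "key(s)")
	}
	if _, ok := decodeIfJSON(`var a = {b: 1};`); !ok {
		fmt.Println("not JSON, treat as JavaScript text")
	}
}
```

For scripts that fail this check, the regex approach from earlier in the thread is still the fallback.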

kulak, Nov 23 '22 21:11

I figured something like that, but isn't it a bit strange to have a pure JSON object inside a <script> tag? 😛 Anyway, thanks for the info.

RensTillmann, Nov 23 '22 22:11