
scrape digitalData object

Open geeuy opened this issue 8 years ago • 9 comments

Hi,

I want to make use of the artoo ajaxSpider and scraper in order to fetch the contents of a specific JavaScript object, often known as digitalData, dataLayer, or universal_variable. Info stored in this object is usually in JSON notation.

I tried the following: `artoo.ajaxSpider(['url1', 'url2'], function() { console.log('Retrieved data:', JSON.stringify(digitalData)); });`

For url1 and url2 I used two real URLs, of course. It shows me the right output, but it won't 'go' to the next URL in the list.
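For reference, a minimal runnable sketch of stringifying such an object; the sample `digitalData` below is hypothetical, just the kind of shape these analytics objects tend to have:

```javascript
// Hypothetical sample of the kind of object GTM/analytics setups expose.
var digitalData = {
  page: { pageInfo: { pageName: 'home', destinationURL: 'http://example.com/' } },
  user: [{ segment: { visitorType: 'new' } }]
};

// JSON.stringify must be *called* with the object as its argument
// (not accessed as a property); the third parameter pretty-prints
// the output with 2-space indentation.
console.log(JSON.stringify(digitalData, null, 2));
```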

It would be really great to end up with output I can use in Excel, like the attachment in this issue: example_output_artoo_github

Hope everything is clear. Thanks in advance!

geeuy avatar May 10 '16 15:05 geeuy

Hello @geeuy. I am not sure I understand what you are doing here. From which sites are you trying to scrape data?

Yomguithereal avatar May 10 '16 17:05 Yomguithereal

Hi, yeah, I already thought my question might be unclear. Here's an example: this demo website has a JavaScript object called 'dataLayer', which exists because the site uses Google Tag Manager. http://emailpanel.nl/deelnemer007/

If I use the console and return this dataLayer object, I get a few objects with some values in them: datalayer_scrape_artoo

What I want is basically to scrape all the dataLayer contents ('stringified') for each URL (or a list of URLs).

So for this website I would end up with something like attachment 2: example_output_datalayer

Hope this clarifies it a bit.

thanks already

geeuy avatar May 10 '16 18:05 geeuy

Ok, I see now. The thing is that you cannot achieve this with an ajax spider: you are trying to access the value of a JavaScript variable on each page, and AJAX does not execute JavaScript. It just retrieves the page's HTML. You'll either have to find another way to get the data from the HTML itself, or deploy something more complex such as a browser extension if you want to automate it.
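The "get the data from the HTML itself" route can work here: when the dataLayer is written inline in a script tag (as with most Google Tag Manager setups), it can be pulled out of the raw HTML string with a regex, without executing any JavaScript. A hedged sketch, where the sample HTML and the regex are assumptions for illustration:

```javascript
// Sample raw HTML as a plain HTTP request would return it (assumed shape).
var html =
  '<script>dataLayer = [{"pageCategory":"home","visitorType":"new"}];</script>';

// Grab the array literal assigned to dataLayer and parse it as JSON.
// This only works if the inline assignment is valid JSON, which is not
// guaranteed on every site.
var match = html.match(/dataLayer\s*=\s*(\[[\s\S]*?\]);/);
var dataLayer = match ? JSON.parse(match[1]) : null;

console.log(dataLayer);
```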

Yomguithereal avatar May 23 '16 21:05 Yomguithereal

@geeuy Is there a reason you want stringified JSON in your output? Wouldn't you rather have a table with actual columns as CSV? If so, you can use artoo's helpers for this, such as `artoo.writers.csv(jsondata)`: https://medialab.github.io/artoo/helpers/#to-csv-string
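To make the suggestion concrete, here is a rough plain-JS sketch of what converting an array of flat objects (one row per page) into a CSV string looks like; it is similar in spirit to what `artoo.writers.csv` does, but the sample rows and the helper below are illustrative, not artoo's actual implementation:

```javascript
// Hypothetical per-page rows collected from each URL's dataLayer.
var rows = [
  { url: 'http://example.com/a', pageCategory: 'home', visitorType: 'new' },
  { url: 'http://example.com/b', pageCategory: 'blog', visitorType: 'returning' }
];

// Build a CSV string: header row from the keys of the first record,
// then one quoted, escaped line per record.
function toCsv(records) {
  var headers = Object.keys(records[0]);
  var escape = function (v) { return '"' + String(v).replace(/"/g, '""') + '"'; };
  var lines = records.map(function (r) {
    return headers.map(function (h) { return escape(r[h]); }).join(',');
  });
  return headers.map(escape).join(',') + '\n' + lines.join('\n');
}

console.log(toCsv(rows));
```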

boogheta avatar May 24 '16 08:05 boogheta

@boogheta actually this is exactly what I want: a table with columns as CSV. The stringified output was more of an idea to at least have some info on the dataLayer per page. But reading @Yomguithereal's response, it is not possible to access the JavaScript variable (dataLayer) on each page. I didn't know that, so I guess there is no solution then for scraping dataLayer contents for each page (or several pages).

geeuy avatar May 24 '16 11:05 geeuy

Did you check in the network tab of your browser's developer console whether this data is fetched directly as JSON via individual requests? If so, you could probably query all of those and just collect the JSON files before converting them.

boogheta avatar May 24 '16 11:05 boogheta

@boogheta not sure if I follow your last comment.

I tried this command: `artoo.ajaxSpider(['http://emailpanel.nl/deelnemer007/'], function() { console.log(artoo.writers.csv(dataLayer)); });`

This seems to work for one page, but if I put a list of pages in the array, it only returns the dataLayer contents of the current page. How can I automate this process for several pages and save the result as CSV? I would end up with something like (attached): example_output_artoo_github2
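The overall flow being asked for (one dataLayer row per URL) can be sketched without artoo: fetch each page's raw HTML, pull the inline dataLayer out of it, and collect one result per page. The pages are mocked as strings below; in a real run the HTML would come from the spider's response instead, and the regex-based extraction is an assumption that only works when the dataLayer is written inline as valid JSON:

```javascript
// Mocked page bodies keyed by URL (assumptions for illustration).
var pages = {
  'http://example.com/a': '<script>dataLayer = [{"pageCategory":"home"}];</script>',
  'http://example.com/b': '<script>dataLayer = [{"pageCategory":"blog"}];</script>'
};

// One result object per URL: extract the inline dataLayer from the raw
// HTML and keep it stringified alongside the URL.
var results = Object.keys(pages).map(function (url) {
  var m = pages[url].match(/dataLayer\s*=\s*(\[[\s\S]*?\]);/);
  var layer = m ? JSON.parse(m[1]) : [];
  return { url: url, data: JSON.stringify(layer) };
});

console.log(results);
```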

geeuy avatar May 24 '16 14:05 geeuy

I was thinking that maybe each page was querying the dataLayer from a separate JSON URL that you could fetch directly, but apparently it is written directly in the raw HTML of each page. Artoo's ajax spiders only do simple HTTP requests; they don't load the pages in a browser, so the JavaScript within each page is never executed and you won't be able to do this with them. You can try sandcrawler with PhantomJS instead, which will, though the documentation is still missing some details: http://medialab.github.io/sandcrawler/

boogheta avatar May 24 '16 14:05 boogheta

@boogheta Okay. I will have a look. Thanks for your answers and suggestions!

geeuy avatar May 24 '16 15:05 geeuy