artoo
scrape digitalData object
Hi,
I want to make use of artoo's ajaxSpider and scraper in order to fetch the content of a specific JavaScript object, often known as digitalData, dataLayer, or Universal_variable. The info stored in this object is usually in JSON notation.
I tried the following: artoo.ajaxSpider(['url1', 'url2'], function() { console.log('Retrieved data:', JSON.stringify(digitalData)); });
For url1 and url2 I used two real URLs, of course. It shows me the right output for the first page, but it won't 'go' to the next URL in the list.
It would be really great to have in the end an output and use this in excel like attached in this issue.
hope everything is clear. thanks already
Hello @geeuy. I am not sure I understand what you are doing here. From which sites are you trying to scrape data?
Hi, yeah, I already thought my question might be unclear. Here's an example. This demo website has a JavaScript object called 'dataLayer' (it's there because the site uses Google Tag Manager): http://emailpanel.nl/deelnemer007/
If I open the console and evaluate this dataLayer object, I get a few objects with some values in them.
What I want is basically to scrape all the dataLayer contents (stringified) for each URL in a list of URLs.
So for this website I would have something like attachment 2.
Hope this clarifies it a bit.
thanks already
Ok, I see now. The thing is that you cannot achieve this with an ajax spider, because you are trying to access the value of a JavaScript variable on each page, and an AJAX request does not execute JavaScript; it just retrieves the page's HTML. You'll either have to find another way to get the data from the HTML itself, or deploy something more complex such as a browser extension if you want to automate it.
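One way to "get the data from the HTML itself" on pages like this one, where dataLayer is assigned as a literal inside an inline script tag, is to pull it out of the raw HTML with a regular expression and JSON.parse it. A minimal sketch (the regex and the assumption that the literal is strict JSON are both illustrative; many Google Tag Manager setups satisfy this, but not all):

```javascript
// Extract a `dataLayer = [...]` literal from raw HTML without executing any JS.
// Assumes the assigned literal is strict JSON (no function calls, no variables
// inside) — an assumption that must be checked against the target pages.
function extractDataLayer(html) {
  var match = html.match(/dataLayer\s*=\s*(\[[\s\S]*?\])\s*;/);
  if (!match) return null;
  try {
    return JSON.parse(match[1]);
  } catch (e) {
    return null; // the literal was not strict JSON
  }
}

// Example with an inline-script snippet similar to the GTM boilerplate:
var html = '<script>var dataLayer = [{"pageType": "home", "userId": 7}];</script>';
console.log(extractDataLayer(html));
```

This would make the plain-HTML responses that an ajax spider retrieves usable after all, at the price of a fairly brittle regex.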
@geeuy Is there a reason you want stringified JSON in your output? Wouldn't you rather have a table with actual columns as CSV? If so, you can use artoo's helpers for this, such as artoo.writers.csv(jsondata):
https://medialab.github.io/artoo/helpers/#to-csv-string
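The helper runs in the browser, but roughly speaking it turns an array of flat objects into a delimited string with a header row. A plain-JS stand-in sketching that shape (the field names here are made up for illustration):

```javascript
// Rough stand-in for the kind of output artoo.writers.csv produces from an
// array of flat objects: a header row built from the keys of the first object,
// then one quoted, escaped row per object.
function toCsv(rows) {
  var headers = Object.keys(rows[0]);
  var escape = function (v) {
    return '"' + String(v).replace(/"/g, '""') + '"';
  };
  var lines = [headers.map(escape).join(',')];
  rows.forEach(function (row) {
    lines.push(headers.map(function (h) { return escape(row[h]); }).join(','));
  });
  return lines.join('\n');
}

console.log(toCsv([
  {url: 'http://emailpanel.nl/deelnemer007/', pageType: 'home'},
  {url: 'url2', pageType: 'product'}
]));
```

In practice you would just call artoo.writers.csv on the collected data rather than reimplementing this; the sketch only shows what shape of input (an array of flat objects) makes for a clean table.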
@boogheta actually this is exactly what I want: a table with columns as CSV. The stringified output was more of an idea to at least have some info on the dataLayer per page. But reading @Yomguithereal's response, it is not possible to access the JavaScript variable (dataLayer) on each page. I didn't know that, so I think there is no solution then for scraping dataLayer contents for each page (or several pages).
Did you check in the network tab of your browser's developer console whether this data is fetched directly as JSON via separate requests? If so, you could probably query all of those and just collect the JSON files before converting them.
@boogheta not sure if I follow your last comment.
I tried this command: artoo.ajaxSpider(['http://emailpanel.nl/deelnemer007/'], function() { console.log(artoo.writers.csv(dataLayer)); });
This seems to work for one page, but if I put a list of pages in the array above, it only returns the dataLayer contents of the current page. How can I automate this process for several pages and save the result as CSV? I would end up with something like the attached:
I was thinking that maybe each page was querying dataLayer from a separate JSON URL that you could fetch directly, but apparently it is hard-coded in the raw HTML of each page. Artoo's ajax spiders only do simple HTTP requests; they don't load the pages in the browser, so the JavaScript within each page is never executed and you won't be able to do this with them. You can try sandcrawler with PhantomJS instead, which will execute it, but the documentation is still lacking some details: http://medialab.github.io/sandcrawler/
@boogheta Okay. I will have a look. Thanks for your answers and suggestions!