Did you know you can build your own simple scrapers with google spreadsheets?
I didn't until a couple of weeks ago.
Now that I've played around with this functionality, it's time for show & tell.
oDigger Offer Domain Scraper
Before I start, here is the end product of this tutorial.
What does it do?
It scrapes the domains of all the offers listed on oDigger.
For example, say you wanted to scrape the domains of all dating offers in the US.
The related oDigger page would be:
You just paste the URL of the offer page you want to scrape into cell A4, and the spreadsheet goes out to oDigger and grabs each of the "sample" URLs.
It then removes the path/pages and trims each one down to the main domain.
"Wow Tijn – How do I do this myself?"
You need to understand web page/HTML structure, code and CSS.
And it's not easy.
Read on if you want to learn.
Or…just reply to this thread with the data you would love to scrape and I will try to build a scraper for it.
I already have working scrapers for:
* Google SERPs
* Google Suggest
* Keyword Spy
Now why would you want to scrape oDigger offer domains? There are some hints throughout the forum, but I'll try and do a post on this in the future.
What do you need?
1. Google Chrome Browser
You will use the Chrome Web Developer Tools as well as an important extension.
2. XPath Helper Extension
Install this extension:
What is XPath?
XPath is a way of identifying sections and elements in XML/HTML documents. It's pretty advanced stuff and can get very complicated.
But by using this extension you can quickly find the xpath you need to extract the content you want.
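To give you a feel for what XPath expressions look like before we dive in, here are a few simple ones (the "offer" class in the last line is just a made-up example, not something from oDigger):

```
//a                     selects every <a> (link) tag in the document
//a/@href               selects the href attribute (the URL) of every link
//div[@class='offer']   selects every <div> whose class attribute is "offer"
```

The leading double slash means "anywhere in the document", which is what makes short expressions like these so handy.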
3. Google Docs
This scraper is going to be built in Google Spreadsheets, so you will need a Google Docs account.
Step 1: Create a clean spreadsheet
First off, create a new spreadsheet in Google Docs.
Then, paste the URL of the page you want to scrape into cell A1.
Then in cell A3, enter the following formula:

=importXML(A1, "//a/@href")

The importXML formula is the key to scraping with Google Docs. In this case it contains a reference to the URL you copied into cell A1, and an XPath statement.
This particular statement extracts the URLs of all the links in the document. It basically says: grab all <a> tags ("//a") and return the value of their href attribute ("/@href").
You can read more about this function here. There are also a couple of other functions to grab different types of data. For example:
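A few of the other import functions worth knowing are importHTML, importFeed and importData. For instance, to pull the first table off a page whose URL is in A1:

```
=importHTML(A1, "table", 1)
```

importHTML can grab "table" or "list" elements by index, importFeed pulls in RSS/Atom feeds, and importData fetches plain CSV/TSV files.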
Step 2: Determine the Correct XPath Statement
OK, go back to the browser with the page loaded that you want to scrape.
Open up XPath Helper by pressing Shift-Ctrl-X (make sure you haven't got the address bar selected, because it won't work then; just click somewhere in the main document first).
In the case of oDigger, your page should look something like this:
The area I highlighted in green is the XPath Helper window, consisting of the Query and the Result.
Next, hover the mouse cursor over the part of the page you want to scrape and press the Shift key:
Pressing the shift key will populate the window with the xpath expression for that specific element:
Here is the xpath expression:
/html/body[@class='search search-results affiliate_offer']/div[@id='wrapper']/div[@id='main']/div[@id='container']/div[@id='content']/div[@class='main-panel rounded-panel']/div[@class='content']/div[@class='content-left']/div[@class='content-right']/div[@class='content-content']/div[@class='affiliate_offer-results-table']/table/tbody/tr/td[@class='result-preview-link']/a/img/@src
The problem with this is twofold: it doesn't return all the URLs on the page, and the statement itself is pretty complex.
Next step is to open the Developer Tools and inspect the relevant HTML code. You do this by right clicking on the element from just now, and selecting Inspect Element.
Then step through the HTML code until you have found the element you need:
I've underlined the important elements that distinguish the data we need.
Next you just need to play around with the XPath statement until you have the result you're looking for.
First, I always try to remove as many specific XPath references as possible.
I just look at the element I want to grab, and work backwards to the first unique HTML element that covers all the data.
In this case that is the table cell with the class "result-preview-link", and the related XPath element from before is:

//td[@class='result-preview-link']
Adding in the double slash (//) means that you want to grab the element wherever it occurs in the document.
Next, widen the query at the end so that it returns all the links, not just the first one. I do this by removing the img and src references and adding href.
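Putting those two changes together, and assuming the class name from the inspection above, the final query ends up looking something like this:

```
//td[@class='result-preview-link']/a/@href
```

Much shorter than the auto-generated path, and it matches every preview link on the page instead of just one.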
In the XPath Helper Results window you will be able to see that we got the right query.
Step 3: Update your spreadsheet
In your spreadsheet, add the Xpath query you just created into the formula:
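With the query from step 2 plugged in, the formula would look something like this (same A1 cell reference as before; the class name is the one we found while inspecting oDigger):

```
=importXML(A1, "//td[@class='result-preview-link']/a/@href")
```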
You only need to add the formula once; if the query returns multiple results, the spreadsheet automatically fills them into the rows underneath.
In some cases you might extract a table with multiple columns. In that case the data will expand to the right automatically as well.
Step 4: Extract the domain name
In this particular case column A now contains the full URLs from the preview links on oDigger.
Here we are not interested in the full URL, just the domain.
So you will need to do some cleaning up.
In column B, next to each URL, enter the following formula:
This uses a regular expression to remove the http://www. and anything after the domain extension.
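One way to write that formula is with Google Spreadsheets' REGEXEXTRACT function; the RE2 pattern below is just one possible way to pull out the bare domain (adjust the A4 reference to wherever your first URL actually sits):

```
=REGEXEXTRACT(A4, "^https?://(?:www\.)?([^/]+)")
```

It strips the protocol and an optional www. prefix, then captures everything up to the first slash, which leaves you with just the domain.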
Step 5: Remove duplicates
Because oDigger lists many copies of the same offer, the same domain will appear in your list several times.
To remove duplicates you can use the UNIQUE function:
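For example, assuming your cleaned-up domains start in B4, you could put this in another column:

```
=UNIQUE(B4:B)
```

UNIQUE returns the same list with every duplicate removed.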
Hope you enjoyed this.
Make sure you share the scrapers you build here. Just add &newcopy to the Google Docs sharing URL, so whenever people load it, it creates a copy for themselves.
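The sharing link would then look something like this, where YOUR_SPREADSHEET_KEY is a placeholder for your own document's key:

```
https://docs.google.com/spreadsheet/ccc?key=YOUR_SPREADSHEET_KEY&newcopy
```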
PS: Remember – post your ideas below and I will build one or two more.
PPS: You probably will already have thanked this post, but what I would really appreciate is if you could actually rate it as well. That way I know whether or not to produce more stuff like this.
PPPS: Those wondering what happened to the HVT tutorial conclusion – it's on the to-do list, but I got so excited about this new scraping technique that I just had to share it first.
Holy . . . Freakin’ . . . Awesome.
That shit is advanced! Many thanks for this great tut man
prity, pritty, PRETTY good!
You can also use this to scrape content and output it as RSS -> re-import into WordPress. Love it.
Yeah, great suggestion movingshadow. And with the right Google gadget you can also export to JSON.
Hope you're liking this, guys.
what information would you love to be able to scrape?
Tijn, as usual you are shit hot. I still worship at the sole of your Clarks for the POF campaign day-parter, as it has saved me many hours of sleep. I thank you for that.
May your every click be a conversion.
what information would you love to be able to scrape?
How about the conversion pixels from an offer’s conversion page so I can spy on their traffic sources?
Here is a tool that will give you the code for a few more SERPs:
A scraper for Google+ Sparks would be awesome.
For google suggest, this is the best scraper I’ve used (Windows, IE8 only)
great suggestions inversion!
Here's my favourite Google Suggest one:
By clicking on the keywords you can expand the suggest search further.
Dude, you kick ass….. Can I suggest a scraper – one an acquaintance of mine was telling me he used to great effect the other day. You set it to scrape high-authority websites of your choice, and it finds all the links and searches for 404 errors, i.e. domains that have possibly been left to lapse. You then do some quick research, and if the site has some good authority and PageRank, try to buy the domain. This guy then uses Google's cached pages for the about and privacy pages etc. and reinstates them on the blog. He always finds they quickly regain their PageRank, and he can then use them either for back-linking or for building out a niche.
I actually listened to an interview the other day with Kenny Goodman, who'd purchased a lapsed domain for $10…. turns out the owner has a fleet of 400+ waste trucks, and ended up paying $25,000 to buy his domain back…. Now I wouldn't say no to a bit of luck like that.
^ I find that hard to believe… anything that has any decent PR, backlinks, etc. will 99.99% of the time end up in a private drop on SnapNames or NameJet… then the bidding begins at $69, unless there's competition, in which case the sky is the limit.
That's the idea of the scraping tool – to find domains that people have let lapse but that clearly had a link from an authority site (the one you're scraping), so they may well have some PR and domain age associated with them.
Domains that people are selling have shot up in price these last couple of years; the idea here was simply finding domains that people have let lapse, then reinstating them. Even picking up domains with a couple of years of age on them is useful; after all, age is something you can't control in any shape or form.
I've let countless domains lapse after my early buying frenzy when I got started on the internet… looking back, if nothing else, they all had 2-3 yrs age going for them. Some even had PR2-3. I even discovered my original domain name, which I first bought and still own, is showing a domain age of 9 yrs. Even though I have owned it only 3 yrs, I must have bought it shortly after it expired, and it is still showing 9 yrs of activity. So now I'm working on adding some content to get some PR and putting it aside for future use.
^ And my point was, and still is: there's no such thing as a lapsed domain with any "value" that doesn't get picked up by SnapNames, Pool, NameJet or an existing registrar drop. I have over 10K domains and have been in the game for a long run. There are specific tools and lists, including Verisign zone files, that get scanned and indexed daily for the exact reason of picking up those "gems".
In the past, registrars would let domains drop; Snap and Pool were the first to aggregate these drops, collect the domain lists and offer them in a private marketplace. Enom later launched their own drop and took over Tucows, NetSol, etc…
/ not arguing, simply stating facts and bringing in a domainer perspective
I registered a lapsed domain for $7.50 and sold it 8 months later for $4000, to a company who ran a site that supported data integration between PayPal and QuickBooks …
Forgot all about it until they sent an email wanting to buy it.
OK, so I looked at the oDigger scraper. It's cool, but how do I put in my own URLs and get the domains? It just says it's not editable, and if I try to save it and open it in Excel, it doesn't have any of the code that's required…
Here’s my contribution – ezinearticle.com scraper
And the code: =importxml($A$4, "//div[@class='article']/a/@href")
@vidivo – you have to make a copy of the spreadsheet by clicking on the link with the &newcopy addition in it. It seems Google is playing up, as it's not always working.
Try clearing your cookies and logging back into Google Docs. Then click the link below and it should ask whether you want to make a copy:
@inversion – great stuff!!
Is there a way to scrape sites behind logins?
I’m trying to scrape related sites in Google Adplanner
Here’s a related sites scraper using xmarks.com
Just enter the domain in D1.
@shanktank – yes, there is a way, but you would have to delve into Google Apps Script (a lot like JS, but not exactly). I haven't tried yet!
Looks like oDigger didn't like us scraping offer LPs with this. They've built in a preview system now, so you'd have to go three levels in to get a usable domain.
Article published 08-05-2011, 02:39 AM.