Tuesday, 27 May 2014

Scraping websites with Google spreadsheets

Frequently when working on data collection or visualizations it's useful to get data from an external website that doesn't have an API. There are plenty of tools to help you with this, but one of the easiest is certainly Google spreadsheets. You could of course write your own web scraping bot with Python and the Beautiful Soup library instead (which I will outline in a future tutorial). One thing to note - web scraping is not always legal, so make sure to check the website's data use policy before you go scraping it and sharing the results with everyone!

To get started, grab a Google account (or log in to Google documents if you already have one) and create a new Google spreadsheet. Then go and find the URL of the data you want to scrape. To keep things simple I'm going to demonstrate with the highest-grossing films data from Wikipedia. Here's the URL:
 
http://en.wikipedia.org/wiki/List_of_highest-grossing_films#Highest-grossing_films

If you're not familiar with URLs, everything after the # character is a fragment that tells your browser to jump to a named section of the page - in this case the 'Highest-grossing films' section. If you go to the site you will see a table, which is the data we're interested in:



In the top left hand cell of your spreadsheet, use the following command:
 =ImportHtml("http://en.wikipedia.org/wiki/List_of_highest-grossing_films#Highest-grossing_films","Table", 1)

Hit enter and the spreadsheet will slowly populate itself with all the data from the table we're looking at. You can then graph it, or save it as a .csv or text file for later analysis with another program or script.

If it didn't work there are a few things to check:
  1. Make sure the quotation marks are right. There should be straight double quotes around both the URL and the word Table in the above command.
  2. Check that the URL is correct.

So what precisely is happening? The =ImportHtml() command points Google spreadsheets at the particular page, the second argument "Table" tells it to look for an element in the format of a table (you can also target other HTML elements here, such as lists with "List"), and finally the value 1 tells it which matching element to return. The index starts at 1, so if the first result isn't the one you want, keep increasing the number by 1 until you find the data you're looking for.

Have a go at changing the number in the above command and see if you can scrape the table from further down the page with the films adjusted for inflation.
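
If the data you're after sits in a bulleted or numbered list rather than a table, the same trick works with "List" as the second argument. The formula below is only an illustration - which index you need depends on how many lists appear in the page's HTML, so you may have to experiment:
 =ImportHtml("http://en.wikipedia.org/wiki/List_of_highest-grossing_films", "List", 1)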


There's a functional example of the spreadsheet here if you're interested.

It's also worth noting that Google spreadsheets come with several other import functions for pulling in different types of data, including .xml - there are a few illustrative examples after this list:
  • ImportHtml()
  • ImportXML()
  • ImportData()
  • ImportFeed()
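
They all follow a similar pattern: a URL, plus in some cases a query that tells the function what to pull out of the page. The URLs below are just placeholders, so swap in a real page, .csv file or RSS feed of your own:
 =ImportXML("http://example.com/page.html", "//h2")
 =ImportData("http://example.com/data.csv")
 =ImportFeed("http://example.com/feed.xml")
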
Finally, you can also use the other Google spreadsheet commands to sort through the data you get. One of the most useful is the =SPLIT() command, which separates the incoming data on a certain character (so =SPLIT(A1, " ") would split whatever text is in cell A1 wherever there was a white space). Another useful command is =REGEXEXTRACT(), which searches a string for a particular pattern of letters and numbers known as a regular expression and returns the first match. Regex commands take a little getting used to but they're really useful - here's one that searches for the pattern of a year (e.g. 2014) in a string:
=REGEXEXTRACT(A1, "[0-9]{4}")
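
To make that concrete, suppose (purely as a made-up example) that cell A1 of your scraped table contains the text "Titanic (1997)". The formula above would return 1997, so you could drag it down a whole column of titles to pull the years out into their own column. Just be aware that =REGEXEXTRACT() returns an error for any cell where the pattern doesn't match.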


