To get started go grab a google account and/or login to google documents and create a new google spreadsheet. then go find the URL of the data you want to scrape. To keep things simple I'm going to demonstrate with the highest-grossing films data from Wikipedia. Here's the URL:
http://en.wikipedia.org/wiki/List_of_highest-grossing_films#Highest-grossing_films
If you're not familiar with URLs, the # character directs your requests to a specific subset of the page - in this case to the 'highest grossing films' section. If you go to the site you will see a table which is the data we're interested in:
In the top left hand cell of your spreadsheet, use the following command:
=ImportHtml("http://en.wikipedia.org/wiki/List_of_highest-grossing_films#Highest-grossing_films","Table", 1)
Hit enter and the spreadsheet will slowly populate itself with all the data from the table we're looking at and you can graph it or save it as a .csv or text file for later analysis with another program or script.
If it didn't work there's a few things to check:
- Make sure you have enough of the right kind of quotation marks. There should be double quotes around both the URL and the word Table in the above command
- Check the URL is correct.
Have a go changing the number in the above command and see if you can scrape the table from further down the page with films adjusted for inflation.
There's a functional example of the spreadsheet here if you're interested.
It's also worth noting that Google spreadsheet come with several other import tools for importing different types of data including .xml with:
- ImportHtml()
- ImportXML()
- ImportData()
- ImportFeed()
=REGEXEXTRACT(A1,"[0-9]{4}"
)
No comments:
Post a Comment