Wednesday 1 December 2010

Awesome New Tool, Needlebase.

Needlebase lets you view web pages through a virtual browser and point and click to train it on which fields on a page interest you and how those fields relate to each other. The program then scrapes the data from all of those fields, publishes it as a table, list or map, and recommends merges of cells that appear to be mistakenly separate. It's very cool, and it lets non-technical people quickly and easily do things with data that used to require the assistance of someone more technical.
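That merge recommendation is essentially fuzzy duplicate detection. As a rough illustration of the idea (not Needlebase's actual, proprietary logic), here's a minimal Python sketch using difflib; the sample values and the 0.85 threshold are my own assumptions:

```python
# A toy sketch of the "suggest merges" idea: flag pairs of scraped values
# that are similar enough that they probably refer to the same record.
# Needlebase's real logic is proprietary; the cutoff and sample data here
# are assumptions for illustration only.
from difflib import SequenceMatcher
from itertools import combinations

scraped_names = ["Acme Corp.", "Acme Corp", "ACME Corporation", "Initech"]

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for a, b in combinations(scraped_names, 2):
    score = similarity(a, b)
    if score >= 0.85:  # arbitrary cutoff for "probably the same entity"
        print(f"Suggest merging {a!r} and {b!r} (similarity {score:.2f})")
```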

Video is here:

http://www.youtube.com/watch?v=58Gzlq4zSDk&feature=player_embedded

Investigative Journalism

Last month a local newspaper reported that a big new data center had opened in Salt Lake City with a mystery anchor client. The paper believed the client was Twitter, as the company had said it would open its first off-site data center in Utah on an undisclosed date.
We used Needlebase to look at all the tweets from people on the Twitter list of Twitter staff members and extract each one's username, message body and location, where exposed. Needlebase scraped the last 1,500 tweets in less than five minutes. We displayed them on a map and saw that just one tweet in that window came from Utah: a Twitter site operations technician who had just left San Francisco for Salt Lake City was complaining about Qwest router problems. That wasn't quite confirmation, but it sure felt like a valuable clue, and it was very easy to come by thanks to Needlebase.
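For the curious, here's roughly what that job looks like as a hand-rolled Python sketch. The endpoint URL only mimics the shape of Twitter's old v1 list-timeline API, and the "twitter/team" list name is my assumption; the real endpoint, parameters and auth requirements have all changed since then:

```python
# A hand-rolled approximation of the Needlebase job: pull recent tweets
# from a Twitter list and keep the screen name, message body and the
# free-text profile location. The URL and parameters below are
# placeholders modeled on the old v1 API, not a working endpoint.
import requests

LIST_TIMELINE_URL = "https://api.twitter.com/1/lists/statuses.json"  # placeholder

def fetch_list_tweets(owner: str, slug: str, pages: int = 10) -> list[dict]:
    tweets = []
    for page in range(1, pages + 1):
        resp = requests.get(
            LIST_TIMELINE_URL,
            params={"owner_screen_name": owner, "slug": slug, "page": page},
            timeout=10,
        )
        resp.raise_for_status()
        for status in resp.json():
            tweets.append({
                "user": status["user"]["screen_name"],
                "text": status["text"],
                # The profile location is free text and often empty.
                "location": status["user"].get("location", ""),
            })
    return tweets

# Hypothetical usage against the staff list, then filter for Utah:
# tweets = fetch_list_tweets("twitter", "team")
# utah = [t for t in tweets if "salt lake" in t["location"].lower()]
```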

Data Re-Sorting

Last night I found a solution to a long-running issue I've been struggling with. I've got a list of 300 blogs around the web that cover geotechnology (that's a whole other story), and I run them all through PostRank, a service that ranks them from most to least social media and reader engagement per blog post.
Wouldn't it be great to extract that data over time, to track it and turn it into blog posts? I think it would. I couldn't figure out how to get all the data I wanted out of the service, though.
Enter Needlebase. Last night I pointed Needle at my PostRank pages for geotech blogs, and in minutes it pulled down all the data I wanted. I exported that data as a CSV, uploaded it to Google Docs as a spreadsheet, did a little subtraction and now have a chart tracking the top 300 geotech blogs on the web. With the data in my handy spreadsheet, I was able to set up a function showing me which blogs jumped or fell the most in the rankings over the previous week. Thanks, Needlebase!
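That spreadsheet "subtraction" is just a week-over-week rank diff, and it's easy to reproduce in a few lines of Python. This sketch assumes two weekly CSV exports with "blog" and "rank" columns; the file names and headers are made up, since I don't know the exact layout of a PostRank export:

```python
# Diff two weekly rank exports and list the biggest movers.
# File names and the "blog"/"rank" column headers are assumptions
# about the CSV layout, not something PostRank guarantees.
import csv

def load_ranks(path: str) -> dict[str, int]:
    with open(path, newline="") as f:
        return {row["blog"]: int(row["rank"]) for row in csv.DictReader(f)}

last_week = load_ranks("geotech-2010-11-24.csv")
this_week = load_ranks("geotech-2010-12-01.csv")

moves = [
    (blog, last_week[blog] - this_week[blog])  # positive = climbed
    for blog in this_week
    if blog in last_week
]
moves.sort(key=lambda pair: pair[1], reverse=True)

print("Biggest climbers:")
for blog, delta in moves[:10]:
    print(f"  {blog}: {delta:+d}")
```

Event Preparation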
I've written here before about how to use Mechanical Turk to get ready for and rock an industry event. Needlebase can prove useful for that as well.

The DIY Data Hackers Toolkit

In my mind, Needle sits between two other wonderful tools. On one end of the spectrum is the now Yahoo-acquired Dapper, which anyone can use to build an RSS feed from changes made to any field on any web page. (See: The Glory and Bliss of Screen Scraping and How Yahoo's Latest Acquisition Stole and Broke My Heart)
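At its core, that's change detection on a single page field, turned into feed items. Here's a bare-bones sketch of the idea in Python; the URL and CSS selector are made-up examples, and a print statement stands in for the RSS item Dapper would actually emit:

```python
# A bare-bones version of the Dapper idea: watch one field on a page
# and emit a note when it changes. The URL and selector are
# hypothetical; a real version would write RSS items instead of printing.
import time
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

URL = "https://example.com/product"  # hypothetical page
SELECTOR = "span.price"              # hypothetical field

def read_field() -> str:
    html = requests.get(URL, timeout=10).text
    node = BeautifulSoup(html, "html.parser").select_one(SELECTOR)
    return node.get_text(strip=True) if node else ""

last = read_field()
while True:
    time.sleep(3600)  # poll hourly
    current = read_field()
    if current != last:
        print(f"Field changed: {last!r} -> {current!r}")  # would be a feed item
        last = current
```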
On the other end of the spectrum is the brand-new Extractiv, a bulk web-crawling and semantic analysis tool that's also remarkably easy to use. Earlier this month I used Extractiv to search across 300 top geotech blogs for all instances of the word "ESRI," all entities mentioned in relation to ESRI and the words used to describe those relations. The service processed 125,000 pages and spit out my results in less than an hour, for less than a dollar. That's incredible; it's a game changer.
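Extractiv's entity extraction is real NLP, but you can get a feel for the shape of the output with a crude heuristic: find sentences mentioning "ESRI" and count the other capitalized terms in them. This toy sketch is my own stand-in, nothing like Extractiv's actual method:

```python
# A very crude stand-in for the Extractiv query: find sentences that
# mention "ESRI" and tally the other capitalized terms in them as
# candidate co-mentioned entities. A regex heuristic, not real NLP.
import re
from collections import Counter

def esri_co_mentions(text: str) -> Counter:
    counts = Counter()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if "ESRI" not in sentence:
            continue
        for term in re.findall(r"\b[A-Z][A-Za-z]+(?:\s+[A-Z][A-Za-z]+)*\b", sentence):
            if term != "ESRI":
                counts[term] += 1
    return counts

sample = ("ESRI announced a partnership with Microsoft. "
          "Later, ESRI and Google Maps were compared by several bloggers.")
print(esri_co_mentions(sample).most_common(5))
```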
Needlebase is a game changer too, sitting somewhere between Dapper and Extractiv. These tools are democratizing the ability to extract and work with data from across the web. They are to text processing what blogging was to text publishing.
