All posts from Data Mining: Text Mining, Visualization and Social Media

The Recorded Future is Here

Recorded Future is a new venture which mines the web for statements that are associated with some time expressions. It then uses this corpus to describe the future in various geographies for various topics. In addition to the application of information extraction methods, they also present this information in creative visual displays.
 

The site is full of jQuery goodness, but I did find the newcomer experience a little puzzling (how do I navigate to the data visualization? It's not clear).
Finally, I loved this quote from a satisfied customer:

"This definitely reduces time in figuring out what may or may not be happening in the future based on what has been happening in the past. It cuts that time in half." - Advertising Executive
[HT Sundar]
 

The Spectrum of Time Series Forms

I've been playing around with time series data recently. Seeing the many forms that these take, I was planning a post describing a bestiary of time series. This, it turns out, is too much work, so here I have collected some examples that I hope, as a collection, demonstrate at least a small part of the vast spectrum of forms that time series data can take. (See an earlier post on using HTML5 to create this type of output.)

Mobile's Not Zen

How cool is this? The picture below is from the Twitter Map overlay on Bing Maps. It shows images tweeted from the Rush concert at White Water Amphitheater last night.

Some of the tweets from the concert: "Solo still going", "Yay drum solo", etc.
During the concert, the audience was lit in part by the ethereal glow of mobile devices held aloft to capture the spectacle.
To me, none of these people are living in the moment. It is as if they are instituting a homunculus, a little agent which keeps an account of what they are doing: 'OK, now I am watching a drum solo', 'I'm watching this performance through an "eye" that is watching this performance.'
Oh, and my neighbor's seat was row 21, seat 12.

Are Facebook likes Flooding the Internet?

I recently enabled the Facebook 'like' feature on this blog, which is hosted by SixApart's Typepad service. Of late, I haven't been blogging at anything like the rate I'd like, but - O happy day - my traffic (according to Typepad) has been increasing quite a bit. Which blogger wouldn't be happy to see this:

While I might have suddenly become more relevant, I suspect the reason that I'm getting this increase in traffic (except for that large peak, which is legitimate) is related in some way to Facebook's 'like' feature.
Consider the following from my traffic details in my Typepad dashboard:

Generally, this is to be read 'a visitor came from www.facebook.com/plugins to the page with the path /data_mining/datamining.' But the pattern is too predictable and not easily explained by a human visitor.
A possible explanation of the problem is the following: when someone visits a view of my blog which involves aggregates of posts (say, visiting the home page, which collects the most recent 10 posts), the Facebook 'like' button gets rendered. Facebook wakes up and decides to pull the page, somehow leaving behind this plugins reference. Unfortunately, this seems to be happening almost every time, rather than in a sensible, cache-supported manner.
I'm pretty sure I don't have all the details right. For example, when I look at the two other services which I use to track traffic, they don't appear to register these references from Facebook. Does this indicate that it is something to do with the setup between Typepad and Facebook? Or perhaps it is some issue with how Typepad collects and displays visits. Perhaps the other two services are incorrectly removing these references? Generally, when a robot crawls your site, it doesn't leave an indication of where it came 'from', as it would just be fetching from a list of effectively arbitrary URLs. Does that indicate it is some sort of crawler faking its identity as a human user?
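One way to check how much of the increase comes from these plugin references is to split the traffic log by referrer. The following is a minimal sketch, assuming the log can be exported as (referrer, path) pairs; the sample rows, the function names, and the export format are all hypothetical and are not something Typepad is known to provide.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical traffic-log rows: (referrer URL, requested path).
SAMPLE_LOG = [
    ("http://www.facebook.com/plugins/like.php", "/data_mining/"),
    ("http://www.facebook.com/plugins/like.php", "/data_mining/"),
    ("http://www.google.com/search?q=text+mining", "/data_mining/2010/07/"),
    ("", "/data_mining/"),  # direct visit, no referrer recorded
]


def is_facebook_plugin_referrer(referrer):
    """True when the referrer is a www.facebook.com URL under /plugins."""
    parsed = urlparse(referrer)
    return (
        parsed.netloc.lower() == "www.facebook.com"
        and parsed.path.startswith("/plugins")
    )


def split_traffic(rows):
    """Count hits attributed to the Facebook plugin versus everything else."""
    counts = Counter()
    for referrer, _path in rows:
        if is_facebook_plugin_referrer(referrer):
            counts["facebook_plugin"] += 1
        else:
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    print(split_traffic(SAMPLE_LOG))
```

Subtracting the `facebook_plugin` count from the total would give a rough estimate of traffic without these references, under the assumption that none of them are real visitors.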
The Facebook API documentation says:

When does Facebook scrape my page?
Facebook needs to scrape your page to know how to display it around the site.
Facebook scrapes your page every 24 hours to ensure the properties are up to date. The page is also scraped when an admin for the Open Graph page clicks the Like button and when the URL is entered into the Facebook URL Linter. Facebook observes cache headers on your URLs - it will look at "Expires" and "Cache-Control" in order of preference. However, even if you specify a longer time, Facebook will scrape your page every 24 hours.
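One reading of the quoted policy is that a page's cache lifetime is honoured only up to a 24 hour cap. The sketch below illustrates that reading for the `Cache-Control: max-age` case only; it ignores `Expires`, the function names are hypothetical, and it is an interpretation of the quote rather than a description of what Facebook actually does.

```python
import re

DAY_SECONDS = 24 * 60 * 60  # the 24 hour cap stated in the quoted documentation


def max_age_seconds(cache_control):
    """Extract max-age from a Cache-Control header value, or None if absent."""
    match = re.search(r"max-age\s*=\s*(\d+)", cache_control or "", re.IGNORECASE)
    return int(match.group(1)) if match else None


def effective_rescrape_seconds(cache_control):
    """Assumed re-scrape interval: the page's max-age, capped at 24 hours."""
    max_age = max_age_seconds(cache_control)
    if max_age is None:
        return DAY_SECONDS
    return min(max_age, DAY_SECONDS)


if __name__ == "__main__":
    for header in ("max-age=3600", "public, max-age=604800", None):
        print(header, effective_rescrape_seconds(header))
```

Under this reading, even the shortest plausible interval would not account for a fetch on almost every page view, which is why the pattern in the dashboard looks odd.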
At any rate, while I'd be happy to be getting the increased traffic, I'd rather get accurate traffic reports.
Does anyone have any insights? Anyone from SixApart or Facebook? 

BlackBerry - Love it, Hate it, Whatever

I was amused by this moment of journalistic dissonance assembled by Techmeme, where two authors started off their articles with opposing generalities about the BlackBerry.

Square is the new Round, in Web Design

One of the hallmarks of the Web 2.0 movement was the use of rounded corners in pretty much every element that graced the pages of any hot new start up. These rounded corners were often accompanied by two other elements: the lozenge-like lighting on buttons and the mirrored reflections on an imagined surface.
Here's a snippet of the Picnik website with lots of rounded goodness:

Apple adopted this idiom in many of its products. The Safari browser uses the reflections in its grid presentation of browsing history:

and perhaps most famously, the iPhone design language is all about the rounded corners of Web 2.0 stickers:

The new Web 2.0 design idiom, however, is all about squares. For example, the new BBC design uses simple blocks and solid colours:

Stamen design and Infosthetics both adopt a dense, but appealing approach to illustrated navigation:
Stamen

Infosthetics

Making the transition complete, this squared approach to design will be the hallmark of the Windows Phone 7 UX:

As the web adopts these new crisp corners, will the iPhone UI start to look stale, just as it looked fresh and current when it launched?      

The Interpretation of Tables in Texts, 2000

Ten years ago, I submitted the final version of my PhD thesis: The Interpretation of Tables in Texts. At the time, there wasn't a huge amount of research going on in the space. Those working in the area pretty much all knew each other and would meet at a couple of conferences, generally in the OCR community, as there wasn't much interest in table understanding in other research areas.
Now, there is quite a healthy interest in table understanding due in part to the promise of tabular data being a reasonable way to bootstrap semantic relationships via the large scale mining of the web.
Most recently, I spotted this paper by Finin et al.: Exploiting a Web of Semantic Data for Interpreting Tables, WebSci10, 2010, which echoes much of the promise of the 'first generation' of table understanding work by the likes of myself, Dan Lopresti, Thomas Kieninger, Jianying Hu, etc. In fact, the motivating example in that paper:
bears a strong similarity to that from my thesis:

with the latter also illustrating to some extent the complexity of table semantics.
I'm still very much interested in tabular data, perhaps because it represents the simplest transition point from textual presentations of information to graphical, or topological, representations of information.
For posterity, I've embedded the Scribd incarnation of my thesis below.
2000 - Hurst (PhD Thesis) - The Interpretation of Tables in Texts


Crowd Sourcing Butterfly Conservation

The BBC writes about an effort in the UK to use crowd sourcing to populate data recording the number of different types of butterflies: the Big Butterfly Count. Participants are asked to spend 15 minutes spotting butterflies and moths. The data, currently 5121 sightings, is displayed on a map.
A couple of thoughts. Firstly, I think the data could be displayed in a far more engaging manner with a heat map of some sort, with the ability to show clusters of different species at least. The following is an inefficient way to show the data for a species:
Secondly, I wonder if Twitter could be used in some way to channel the data - one could even tweet a picture to the project. That way, the data could be verified and it would come with geolocation and time associated.
Finally, and perhaps most importantly, the site suffers from the age-old problem of inadvertent-tab-ellipsis-renaming:

Augmented Reality - 17 years from Concept to Product?

Almost twenty years ago, I recall coming across a paper (either in the AI library in Edinburgh, or the Computer Science library in Cambridge) which described an augmented reality approach to that most intractable of problems: fixing printers.
A number of forces have conspired to allow me to access a reference to that paper (Google's crawl/search, my memory being prompted repeatedly by augmented reality applications on mobile devices).
At any rate, I suspect the image below, from a document with a 1993 time stamp, may be one of the earliest incarnations of augmented reality.
Feiner, S., MacIntyre, B., and Seligmann, D. (1993) "Knowledge-Based Augmented Reality." Communications of the ACM, Vol. 36(7), pp. 53-62.
Looking around now, what will be hitting mainstream in 17 years?

Visualizing Location and Mood in Twitter

Last year I wrote about some work done by Sune Lehmann and colleagues at the Barabasi Labs which explored the relationship between location and affect signals in Twitter (here). Sune pointed me to a recent update which extends the work to a cartogram approach to visualizing moods and volume: Mood, Twitter and the new shape of America.

The work explores both visualization methods and the data itself. It also demands answers to some interesting questions, not the least of which is the apparent difference between the coasts and the bits in the middle. I'd be interested in seeing an analysis of these distinctions.

FindTheBest.com A Comparison Site for Everything

Jack Middlebrook from FindTheBest.com dropped me a line to tell me about the site, and specifically the matrix it generates for comparing teams playing in the World Cup.
FindTheBest describes itself thus:

an objective comparison search engine that allows you to find a topic, compare your options and decide what's best for you. Ultimately, FindTheBest allows you to make faster and more informed decisions by allowing you to easily compare all the available options.
The site provides a table of entities and attributes, similar to Google Squared.
The site then allows you to select a number of rows, which it then pivots on to provide a readable comparison set.
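The select-then-pivot step can be sketched in a few lines. This is a minimal illustration of the general idea, not FindTheBest's implementation; the table rows, column names, and the `pivot_rows` helper are all hypothetical placeholders.

```python
# Placeholder rows; the names and values are illustrative, not real data.
TABLE = [
    {"name": "Team A", "group": "X", "matches_played": 3},
    {"name": "Team B", "group": "X", "matches_played": 3},
    {"name": "Team C", "group": "Y", "matches_played": 4},
]


def pivot_rows(table, selected, key="name"):
    """Turn the selected rows into one entry per attribute, keyed by entity."""
    chosen = [row for row in table if row[key] in selected]
    if not chosen:
        return {}
    # Every column except the key becomes a row of the comparison view.
    attributes = [column for column in chosen[0] if column != key]
    return {
        attribute: {row[key]: row[attribute] for row in chosen}
        for attribute in attributes
    }


if __name__ == "__main__":
    for attribute, values in pivot_rows(TABLE, {"Team A", "Team C"}).items():
        print(attribute, values)
```

Each attribute becomes a line of the comparison with the selected entities side by side, which tends to be easier to read than scanning across wide rows.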

Visualizing the London Underground

A reader of this blog pointed me to this interesting project by Matthew Somerville, which displays the live position of trains on the London Underground railway system. It is a fun visualization, and a great example of free data and dev ecosystems coming together. I wonder where this could go product-wise? Analytics for train problems? Delays? Many of the tube lines have trains running at a frequency such that one just hops on the next one that arrives. I'm guessing that it might be useful in a late night, mobile scenario where optimizing dynamically over connections could make a big difference to getting around.
