Today, futurists using Twitter, Delicious, Digg and other Web 2.0 services publish a flow of content that is probably already too large for any person to follow, and is growing rapidly.

For example, Twitter publishes roughly 600-700 tweets per day marked with the #future hash tag. The futurists I follow post 70-80 tweets per day (though some of those posts are personal or auto-generated by other agents). Futures-oriented lists on Twitter follow anywhere from a dozen to three hundred people, and almost all of those lists are available via RSS.

Other systems generate equally substantial bodies of content. Users on Delicious, the oldest social bookmarking service, post about 350 bookmarks per day with the tag "future." My network (which includes a select few futurists) posts about 220 bookmarks per year. That translates into about 1120 separate data-points per day, or over 400,000 signals per year — just from three services. Futurists' blogs publish between 100 and 200 posts per week.

Casting one's net wider, one can rapidly capture an enormous number of potential signals. Consider Tweet the Future, a Web site that monitors Twitter for tweets containing the word "future." It finds about 30 tweets every minute (over 40,000 a day), though the vast majority of these tweets have nothing to do with futures or forecasting.

Many, if not most, futurists, consulting companies, and futures-oriented nonprofits are using one or more of these systems. Most of these datastreams are real-time reflections of what people are reading. They represent a vast but untapped resource that could be used to build a picture of the collective attention of the futures community and to detect weak signals; indeed, they can largely replace the kind of commissioned content that fed Delta Scan and Signtific. We no longer have to work alone to find interesting things. Instead, we can detect patterns in our own and our colleagues' datastreams.

How would a social scanning platform work? Here's what I imagine a very simple but useful system doing.

Its core functionality would be an engine that gathers signals from the free and nearly real-time content produced by futurists and subject-matter experts on blogs, Twitter, and other social media platforms; analyzes this content to find subjects and citations that are of greatest interest to the futures community; and clusters together material that shares unusual terms, keywords, or links to common references. This would let us identify both popular subjects and outlying wild cards, and create a body of data that could support other tools or services.

The system would harvest RSS feeds from a list of blogs, Twitter, Digg and other services compiled by the system's managers. The list would have some simple metadata about sources, most notably their authors; the system would also record metadata from its sources, particularly the publication date and time of posts and articles, and whatever tags are attached to the content.
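A minimal sketch of what that harvesting step might look like, assuming each source exposes a plain RSS 2.0 feed. The `Source` and `Signal` records, their field names, and the sample feed are all hypothetical; the point is only that author, publication time, and tags travel with every harvested item.

```python
from dataclasses import dataclass
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import List, Optional, Tuple
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class Source:
    """One entry in the managers' source list (hypothetical fields)."""
    author: str
    service: str
    feed_url: str


@dataclass(frozen=True)
class Signal:
    """One harvested post plus the metadata the system records."""
    author: str
    service: str
    title: str
    link: str
    published: Optional[datetime]
    tags: Tuple[str, ...]


def parse_rss(source: Source, xml_text: str) -> List[Signal]:
    """Turn the <item> elements of an RSS 2.0 document into Signal records."""
    root = ET.fromstring(xml_text)
    signals = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        signals.append(Signal(
            author=source.author,
            service=source.service,
            title=item.findtext("title", default=""),
            link=item.findtext("link", default=""),
            # RSS 2.0 dates use the RFC 822 format; a missing date stays None.
            published=parsedate_to_datetime(pub) if pub else None,
            tags=tuple(c.text for c in item.findall("category") if c.text),
        ))
    return signals


# Made-up feed content, used here instead of a network fetch.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example blog</title>
<item>
  <title>Weak signals in energy storage</title>
  <link>http://example.com/posts/energy-storage</link>
  <pubDate>Mon, 06 Sep 2010 16:45:00 +0000</pubDate>
  <category>future</category>
  <category>energy</category>
</item>
</channel></rss>"""

source = Source(author="A. Futurist", service="blog",
                feed_url="http://example.com/feed.xml")
signals = parse_rss(source, SAMPLE_FEED)
print(signals[0].title, signals[0].published, signals[0].tags)
```

Fetching the feeds, and handling Atom or service-specific formats, is left out of this sketch.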

What would the system do with this datastream? The first key task would be to filter it. By gathering information about the author of each feed, it would be able to associate multiple feeds with the same author. If the same author has several different sources that the system is following, the system would look across those and filter out repeats. For example, if I have a blog and another account, and both automatically push updates to a Twitter account, the system would know to look for cross-posts between those services, and count a blog post that generates a Tweet only once.
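One simple way to picture this first filter is to treat the pair (author, link) as the unit that gets counted. The sketch below assumes that an auto-generated tweet carries the same link as the blog post it announces; the `Post` record and the author names are made up for illustration.

```python
from collections import namedtuple

# Hypothetical record: who posted, on which service, pointing at which link.
Post = namedtuple("Post", ["author", "service", "link"])


def drop_cross_posts(posts):
    """Keep the first post for each (author, link) pair and drop the rest."""
    seen = set()
    kept = []
    for post in posts:
        key = (post.author, post.link)
        if key not in seen:
            seen.add(key)
            kept.append(post)
    return kept


posts = [
    Post("author_a", "blog", "http://example.com/essay"),
    Post("author_a", "twitter", "http://example.com/essay"),  # auto-generated repeat
    Post("author_b", "twitter", "http://example.com/essay"),  # independent citation
]
unique = drop_cross_posts(posts)
print(len(unique))
```

The repeat by the same author is dropped, while the second author's reference to the same link survives as a separate signal.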

The second key piece of filtering involves associating multiple hits on the same subject. Different people may talk about the same event but reference articles published in different places, or the same article published in multiple places: a wire service article that appears in several newspapers, or an article that is reblogged. The system would also need to be able to identify different URLs as pointing to the same article (e.g., the full URL of an article and a shortened URL). Identifying these sources could be done by software, by users, or both. So while repetition by an individual would be controlled for, multiple citations and references by different people would be recorded. The former is noise in the system, but the latter is signal: the more people who tag or blog about a subject, the more important it is. (Citation and referencing also filter out non-professional noise. Many Twitter users combine references to major new articles with announcements like "I am eating a sandwich;" the latter are far less likely to be referenced by others than the former.)
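As a rough sketch of the URL-matching part of this step, the code below reduces each link to a canonical form and then counts distinct authors per article. The `SHORT_LINKS` table is a hypothetical stand-in for whatever redirect resolution or user input a real system would use, and the normalization rules (dropping a leading "www.", trailing slashes, and `utm_` parameters) are assumptions, not a complete solution. It does not attempt the harder case of a wire story appearing on several newspaper sites.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical lookup from shortened URLs to the full URLs they point at.
SHORT_LINKS = {"http://sho.rt/abc": "http://www.example.com/news/story?id=7"}


def canonical_url(url):
    """Reduce different spellings of a link to one comparable form."""
    url = SHORT_LINKS.get(url, url)
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Drop tracking parameters (assumed to start with "utm_") and sort the rest.
    query = sorted((k, v) for k, v in parse_qsl(parts.query)
                   if not k.startswith("utm_"))
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(query), ""))


def citation_counts(citations):
    """Map each canonical URL to the number of distinct authors citing it."""
    authors = defaultdict(set)
    for author, url in citations:
        authors[canonical_url(url)].add(author)
    return {url: len(names) for url, names in authors.items()}


citations = [
    ("author_a", "http://www.example.com/news/story?id=7"),
    ("author_a", "http://sho.rt/abc"),  # same person, shortened repeat
    ("author_b", "http://Example.com/news/story/?id=7&utm_source=feed"),
    ("author_c", "http://example.com/other"),
]
counts = citation_counts(citations)
print(counts)
```

Because authors are collected in a set, one person's repetition adds nothing to the count, while each additional person does.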

In Delta Scan and Signtific, contributors or community members were supposed to formally rate the importance of different trends. In this system, we can simply assume that if someone takes the time to share a link to an article, they consider that article to be worth their attention. More links, especially links over time, indicate the emergence of a group consensus that a link points to a trend worth watching.
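To make "links over time" concrete, here is one possible sketch that counts how many distinct people shared a link in each ISO week. The weekly bucket, the function name, and the sample data are all assumptions chosen for illustration; a real system might use a different window or weighting.

```python
from collections import defaultdict
from datetime import date


def sharers_by_week(shares):
    """Map (url, ISO year, ISO week) to the number of distinct people sharing."""
    people = defaultdict(set)
    for author, url, day in shares:
        iso_year, iso_week, _weekday = day.isocalendar()
        people[(url, iso_year, iso_week)].add(author)
    return {key: len(names) for key, names in people.items()}


# Made-up sharing events: (author, link, date shared).
shares = [
    ("author_a", "http://example.com/story", date(2010, 9, 6)),
    ("author_b", "http://example.com/story", date(2010, 9, 8)),
    ("author_a", "http://example.com/story", date(2010, 9, 9)),   # repeat, same week
    ("author_c", "http://example.com/story", date(2010, 9, 14)),  # following week
]
weekly = sharers_by_week(shares)
for key in sorted(weekly):
    print(key, weekly[key])
```

A link that keeps attracting new sharers week after week would then stand out from one that had a single burst of attention.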

This kind of filtering could be done automatically, and improved by users. People may be able to identify associations between articles that automated systems don't. They could group together content from the data stream by adding tags to specific pieces of content, and they could also identify synonymous tags or terms (e.g., ubiquitous computing, ubicomp, ubic, and ubiq all mean the same thing). My experience with Delta Scan and Signtific suggests, however, that this system needs to be kept as simple as possible. People generally don't classify things unless there are clear incentives and immediate rewards. Even then there are huge variations in the use of hash tags, keywords, etc. between users and across systems, and little chance that people can be induced to adopt standard taxonomies or ontologies. However, when you're working with high social knowledge, and information that by nature exists at the boundaries of the human corpus, it's important to maintain a high degree of ontological flexibility.
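A lightweight way to record synonyms without imposing a taxonomy might look like the sketch below. The `SYNONYMS` table is hypothetical and would be built up by users over time; any tag that is not in it passes through unchanged, which keeps the vocabulary open.

```python
from collections import Counter

# Hypothetical user-maintained table mapping variants to one preferred term.
SYNONYMS = {
    "ubicomp": "ubiquitous computing",
    "ubic": "ubiquitous computing",
    "ubiq": "ubiquitous computing",
}


def canonical_tag(tag):
    """Normalize case and a leading '#', then apply any known synonym."""
    cleaned = tag.strip().lstrip("#").lower()
    return SYNONYMS.get(cleaned, cleaned)


def tag_counts(tags):
    """Count tags after merging known synonyms."""
    return Counter(canonical_tag(t) for t in tags)


tag_totals = tag_counts(["#ubicomp", "Ubiquitous Computing", "ubiq", "future"])
print(tag_totals.most_common())
```

Nothing here forces contributors to adopt a standard vocabulary; the table only merges variants that someone has bothered to link.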

Rewarding people for doing this kind of tagging and associating would send the important signal that community-oriented work deserves to be recognized and encouraged. This kind of work has traditionally been essential for high-quality scholarly and professional activity (think of the legal profession's vast corpus of precedents and codes, the medical profession's reference works, the scientific world's gigantic structures for sharing everything from raw data to polished research) but has either been done largely by professionals (librarians, catalogers, and others) with little professional visibility, or by organizations that extract high rents for their work. By rewarding users for improving the system and contributing to the professional good, we can harvest some of the benefits of that organizational work without incurring its costs.

[This is extracted from a longer essay on social scanning. A PDF of the entire piece is available.]