« July 2008 | Main | September 2008 »

August 29, 2008

The Tall Head of URL Exploration

In The Long Tail of URL Exploration, I looked at the distribution of URL visits by 102K people in a day.  In What does the Nth Explorer of the Web Find?, I looked at how adding users grows the long tail and the number of unique URLs explored.  In this 3rd (of 4) post, I look at the how the tall head changes as we add explorers.

The tall head is the short list of sites that get the most visits.  We could call it the Top 10, Top 100 or whatever we think is relevant.  Later I propose a couple of heuristics for determining tall head membership in real time.

The tall head is made up of URLs that much of the population visits.  These are the "winner take all" URLs of user attention--URLs like cnn.com, google.com, or facebook.com.  For this reason, we might expect that while the long tail is growing with more unique URLs and the number of URLs with 2-3 hits is growing rapidly, the tall head is relatively stable.

This is the case.

One way to think about how stable the tall head might be is to ask how well a subset of the population predicts membership in the tall head for the entire sample.  The data from the last two posts is well-suited to look at this question.

Below is a plot of the accuracy of various subsets of the population (the same subsets we used previously) in predicting the Top 10, Top 20, Top 50 and Top 100 URLs of the entire population.  Just over 40% percent of the population predicts the Top 100 URLs with 90% accuracy.  The Top 10 are predicted to 90% accuracy by 10% of the population.

predicting the tall head


Figure.  Red-Top 10 URLs by visit; Blue Top 20
URLs by visit; Green-Top50 URLs by visits; Cyan-Top
100 URLs by visits.

The composition of the tall head depends relatively weakly on the subset of the population doing the predicting.

How can I predict the tall head for the day by 9 am in the morning? This is the real-time problem of long tail distributions.  The dynamics of the system are that real time Web exploration data appears as a time-ordered list of URLs from whatever users happen to be surfing.  This means that a real time heuristic for determining top URLs for the day has to rely on the properties of the time series including a small surfer sample size and recent counts of visits.

Fortunately, by the results illustrated above, a small sample size is a pretty good bet for determining tall head URLs. What we are still missing is metrics or intuition for how the long tail distribution evolves over time.

We do know that for a URL to end up in the tall head, it must be visited by many Web explorers.  This means that we can rule out all URLs that are visited by only one or two users.  This assumption also leads to a heuristic based on the time between visits--URLs visited by many people should have the same visit/time distribution as the users/time distribution of the entire sample.  More specifically, we might guess that if the time between visits has an average near 1 day/number of visits and relatively low variance, it is likely to be in the tall head.  A project for a little later...

Next post: How does the composition of the tall head change from day to day?

August 28, 2008

Worth Your Weight in Gold - Link

Evil Mad Scientist has fun analysis of this question for all kinds of materials from different forms of money (dimes = quarters by weight) to flour and culminating in antimatter (BTW, very valuable by weight!).  EMS gives the question a few entertaining twists.  Here is an excerpt:

Kopi Luwak coffee costs approximately the same amount per pound as human blood. (Knowing where it comes from, I think I'd rather drink the blood. It's been pointed out before that printer ink is also up there, but I'd rather not drink that either.)

Would you have guessed that peacock feathers can be worth more than their weight in dollar bills? Or that a fancy steak costs twice as much as its weight in dollar coins?

August 27, 2008

What does the Nth explorer of the Web find?

In The Long Tail of URL Exploration, I looked at the distribution of URLs and visits. This was on the way to trying to answer questions like:

  • How much overlap is there between the URLs 10 people visit and those in the 11th person's click stream?
  • How about the 100th or 100,000th person? Does the millionth user explore any unique URLs at all?
  • Can we build a model to answer How many people are required to crawl 10% of the Web?


The second part of the answer is to look at how the model of URLs and visits evolves as we add users.  To get samples with of different sizes using the same click stream data set, randomly select a subset of the users and run the analysis from the previous post.  Through everyone back into the pot and randomly select a slightly larger set.  Repeat.

I reran the model for 3%, 4%, 5%, 7%, 9%, 11%, 15%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% of the overall users in the 1-day data set. The sample sized ranged from 3,000 to 91,200 users. For the entire data set, the average user made 184 URL visits during the day. In the randomly chosen subsets, users made an average of between 181 and 187 URL visits with most of the variation in the smaller sample size as expected.

Do I expect the number of unique URLs be linearly proportional to the number of users? Or if users are visiting many of the same URLs and URLs tend to have the "winner take all" properties we looked at before, we might expect the number of unique URLs added by the 90,000th user to be fewer than the number of unique URLs added by the 1,000th user.

I first plotted the number of unique URLs against the number of users in each sample.  The curve looks straight but may be slightly concave downward.  It is very subtly.  I needed to look at the data in a way the amplified the change over the various subsamples.

Below is a plot of the number of unique URLs/user vs. the number of users. This line is flat if the number of URLs is growing linearly with the number of users.

URLs per user


The blue curve is the best fit to another power function ( f(x)=ax^k ). The first few thousand users are contribute more original URLs (>90 URLs per user) to the sample than the 100,000th (83 URLs).  If you are the first explorer of the a new world, all of your discoveries are original; when you are a late comer, your contributions are around the margins. It may be surprising how much original content being explored by the 100,000th explorer.

Does the long tail get relatively longer or shorter?  For simplicity, I use the URLs with only one visit to represent the long tail.  Then ratio of 1-visit URLs to unique URLs decreases subtly. For the smallest samples size 70.0% of the unique URLs are hit only once; for the overall data set, the ratio is 69.2%. To amplify this change like above, the plot of 1-visit URLs per user is shown below.

long tail per user


At 100,000 users, the long tail is growing at 57 URLs per additional user.  The decrease with each additional user is slowing.  The blue curve is the best fit to another power function.

If the power function is the best explanation of the underlying dynamics, the number of unique URLs and the long tail both continue to grow no matter how many people are exploring.  Since an increasing number of people need to explore to keep the exploration rate constant, the cost of exploration per URL goes up as explorers are added.

Does anything interesting happen in the tall head where the big winners are? That will have to wait for another post.

 

The Long Tail of URL Exploration


Unlike robot crawlers, people visit only the links they think will be interesting.  People follow news stories, follow links from friends, follow links to videos or pictures.  People much more rarely follow links to boring stuff like privacy policies, lists of data or company mission statements.

In order to understand the pattern or people discovering content on the Web, I looked at the click streams of real Web users.  (For privacy hounds, I have real data, but the user names and URLs are hashed so I can't personally identify anyone or look up the actual URLs they visited.)

How much overlap is there between the URLs 10 people visit and those in the 11th person's click stream?  How about the 100th or 100,000th person? Does the millionth user explore any unique URLs at all? Can we build a model to answer How many people are required to crawl 10% of the Web?

The first part of the answer is to look at the distribution of URLs created by a group of users.  The sample has 102,000 user's click streams for 1 day.  I use only 1 day, because using multiple days complicates the estimates due to nearly all users having a set of URLs they visit daily. In the sample, users make 18.6M visits to just under 8.4M unique URLs.

When the URLs are ranked by number of visits, the distribution of visits over the 8.4M URLs shows that a few URLs get many hits while the tail of the distribution (way out to the right) is a long flat curve with many URLs getting only a handful of hits.

The distribution of visits can be fitted fairly accurately with a power law, p(x)=ax^k.  I don't plot the curve, because the head is so tall compared to the tail and the distribution falls off so quickly that the plot is a very sharp "L" shape and we don't get much from looking at it.  It is more useful to look at the cumulative distribution function (CDF) of the distribution.  This is the sum of the probabilities over the rank from highest to lowest.  Summing the probabilities to the lowest ranked URL, gives 100% of the visits recorded. Using the CDF perspective gives insight we can apply to practical situations.

Below is a plot of the CDF.  The red dots are the data points calculated from the sample while the blue line is the best fit to the CDF of the proposed power law distribution.  (The fit parameters are a=0.00219 and k=-0.690.)


CDF of URL Visits



One conclusion that comes out of this view of the data is that URL visits follow the so-called "80/20 Rule." This predicts 80% of the visits for the day went to roughly 20% of the URLs. Actually, for this data, the proportion is about 80/50--80% of the traffic when to the top 4.5M URLs or the top 57% of URLs.

This view shows that the tall head of big visit winners takes a significant fraction of the overall attention of the group.  These are likely URLs like www.google.com or www.cnn.com.

What does the long tail look like? The tail is just as surprising. For this data set, 5.8M URLs or 69% of the URLs visited during the day were visited only once. The number of URLs visited twice is 1.4M.

How do these numbers scale with the size of the group? That's coming in the next post. 

August 13, 2008

Synchronizing Fireflies Link

This is a really cool project.  It combines a great example of collective effects like the weakly coupled pendulum experiment with a simple extension of the Programmable LED project.  Nice work Tinkerlog!

August 12, 2008

Qualifiers with Infinite Denominators Link

I like digging through how we use and perceive words.  Here is a fun one from Chris Andersen's Blog

Many words we use imply a ratio.  Here are the top 5 words whose meaning fizzles when the implied denominator becomes large:

  1. "Most"
  2. "Average"
  3. "Typical"
  4. "All"
  5. "None/No"
It is not only large denominators that undermine these words (more accurately, the success we have using our intuition to make predictions with these words). They also fall apart when probability distributions are anything but Normal (peaked, symmetric and short-tailed).  We don't seem to do a great job of helping people understand multi-modal or skewed distributions with our basic statistics education.  It is increasingly important in planning and executing many business ideas that we become more sophisticated with long-tailed, asymmetric and multi-modal probability distributions.

Chris is the author of The Long Tail and editor of WIRED magazine and curator of TED.  He is working on a new book about "free" products and services.  All recommended--check them out.

August 11, 2008

Copyright Link

I often rant about the current challenges of over-zealous copyright laws.  Andy Oram over at O'Reilly sums it up nicely in a lengthy post on the blog.  If you care about innovation--the future of creativity and individual contribution to invention, art or software development--it is worthwhile to take a few minutes to understand the legal and economic state we have gotten ourselves into by extending copyright at the urging of a few powerful copyright holders.

August 04, 2008

Department Of Homeland Security and Software Tools

Tools extend the capacity for human intentions to manipulate the physical world.  Advances in making and using tools have created huge gains of time or capabilities.

At each step of evolution, as tools have extended capacity for manipulating physical objects, they have also shaped the intentions they were created to fulfill. Examples are everywhere, but we sometimes overlook how the creation and use of a lineage of tools is intricately linked with learning and changing perception of tool users.

Software is a game changer in the evolution of tools in two ways.  First, software reuses complicated, expensive hardware for a broad variety of intentions.  Software allows us to decouple expensive, slowly evolving (matter-manipulating) machines from intentions.  This trend means we think less and less about hardware having a narrow intentional purpose.  We think of hardware as increasingly abstract and serving the purposes of the software.

Secondly, software is closer to being directly about human intention than earlier tools.  This is a punctuated step. A shovel is useful for digging a ditch to irrigate the garden and so is about an important intention.  But the shovel embodies nearly nothing about the human values placed on the garden, the nature of gardens, or the biology of plants.

Software, on the other hand, is increasingly embodying our values, the way our minds and bodies work, our social interactions, and more.  We depend on it to give us instant access to more information than we could possibly remember. We use it to communicate with friends. We use it for analysis and prediction to supplement millions of years evolution of heuristics and biases.

From this perspective, computing devices with the software they execute are a special kind of tool that embodies records, patterns, values, habits and assumptions of our intentions. In the shadow of this claim, the Department Of Homeland Security's cavalier attitude toward seizure of any form of information storage and processing device whether "analog or digital" without cause for suspicion is an extreme violation.

August 01, 2008

Department Of Homeland Security wants my Hand?

From Slashdot:

DHS officials said that the newly disclosed policies--which apply to anyone entering the country, including US citizens--are reasonable and necessary to prevent terrorism... The policies cover 'any device capable of storing information in digital or analog form,' including hard drives, flash drives, cell phones, iPods, pagers, beepers, and video and audio tapes. They also cover 'all papers and other written documentation,' including books, pamphlets and 'written materials commonly referred to as "pocket trash..."'"

It’s "The policies cover 'any device capable of storing information in digital or analog form,..." that really got my attention.  Remember the little trick with your knuckles to keep the number of days in various months straight?  Presumably that would have to count as an analog information storage device.  Goodbye hand.

Hosting by Yahoo!