Main

November 16, 2008

Are you already "socially networked" with everyone you know?

Facebook and LinkedIn Network Connections

Both my Facebook and LinkedIn social networks have gone through periods of rapid grown in the past year. Although I have been a member of LinkedIn for many years, my network expanded rapidly last year, approximately doubling in a couple of months.  I have had a similar experience in the last few months with my Facebook network.

This is likely a convergence of factors.  My contemporaries are reconnecting with high-school and college acquaintances on Facebook.  More of the people I have worked with in the past have joined LinkedIn and started added colleagues from past employers.  This has led to a critical mass of people both connecting and inviting others to connect.

In each case, it seems that when enough people in my circle of acquaintances are connected, new connections form rapidly because it isn't far (in terms of network hops) to other people I know.

So am I already "socially networked" with everyone I know? I think the answer is yes.  Because I am not very active at inviting new friends, it might be more accurate to say that I could be completely socially networked with an evening's work.  In terms of network dynamics, this is because the network of my local group of friends has undergone the phase transition network mathematicians refer to as "percolation."

A simple model of network percolation

Where are without a model? Not very far on this blog. Instead of modeling a complete social network, let's start with a simplified model and see if the behavior I want to describe pops out.

Here is a model that is easy to visualize. Imagine I am about to tile my kitchen floor. I draw a grid on the floor showing where each tile will be placed.  Looking at the kitchen floor as I prepare to lay the tile, you see a simple grid of lines, like graph paper. Now, start adding tiles. But instead of working from one side of the kitchen to the other in rows (like a normal person), I add tiles randomly.

I lay one tile by the sink, then two by the fridge, then another by the stairwell, a couple in the center.  The floor is getting cover red with tiles, but in a hodgepodge sprinkling.  Not too many of them are sitting right next to each other at first.  As I put down more and more tiles, however, there start to be little clusters of adjacent tiles. After a little while, the floor might look like this:

A few tiles in the kitchen tile network


But then, along comes my nephew Hayden, wanting to go to the sink for a drink of water. I don't want him to walk where there are no tiles, because he will track glue everywhere. But he is small, and can't take steps bigger than one tile. So I need to hurry to put down enough tiles so he can walk to the sink.  But I don't want to ruin my beautiful experiment with the random tiles.  How many tiles do I have to put down (still placing them at random locations) before Hayden can walk on adjacent tiles from the living room to the sink to get a drink?

Kitchen tile network percolation


When he can do this, the edges of the network of tiles are connected.  The network of tiles is said to have a percolating group of tiles.

Continue reading "Are you already "socially networked" with everyone you know?" »

October 01, 2008

The Singlularity will always be near

Kevin Kelly's essay Thinkism is an important essay responding to apocalyptic thinking about the "Singularity." This is a very clear and considered essay, written with deep understanding and experience of how order emerges in complex systems. Related, I highly recommend his book Out of Control. It is a little dated, but if you hold this in mind as you read, it is all the more wonderful for Kelly's depth and prescience.

September 03, 2008

Day-to-Day Tall Head of URL Exploration

This is the 4th post on the statistics of URL exploration. In the previous three (The Long Tail of URL Exploration, What does the Nth Explorer of the Web Find? and The Tall Head of URL Exploration) I looked at how adding users grows the long tail and tall head of URLs for a single day.  Today, the data covers 20 days with a relatively constant population.

To get some idea how the tall head evolves, compare the tall head on day 0 with 19 successive days. The plot below shows the Top 10, Top 50, Top 100, Top 500 and Top 1000 URLs for day zero and the fraction of the top URLs on day zero appearing in the tall head on the nth day.

 

Tall Head URLs by day
Figure 1.  Red-Top 10 URLs; Blue Top 50 URLs;
Green-Top 100 URLs; Cyan-Top 500 URLs; Yellow-Top
1000 URLs. (URLs ranked by visits).
 

While the Top 10 and Top 50 URLs show stability day after day, the Top 500 and Top 1000 roll over at a fairly constant rate after day one.  The plot can be used to estimate the size of the persistent tall head of URLs for this population and the rate at which the tall head evolves.

First, look for a change in behavior from maintaining a constant fraction of the day-0 URLs to a steady decline from day to day. By this heuristic, estimate the persistent tall head to be between 50 and 100 URLs.

Secondly, to estimate the turnover of the tall head, choose the approximate desired tall head, e.g., the Top 500 URLs (cyan), and look at the slope of the line for days 1-19. (Alternately, choose a timescale for which the tall head should turn over to a given fraction remaining, say, 75%, giving a timescale of approximately 15 days.)

Tall Head Top 500 Fit


Figure 2.  Red-Top 500 URLs; Blue Top 1000 URLs, shown
for comparison; Green-Fit to Top 500 URLs. (Days 1-20,
URLs ranked by visits).

The plot above shows the Top 500 URLs rollover about 0.5% per day from days 1 to 20.

August 29, 2008

The Tall Head of URL Exploration

In The Long Tail of URL Exploration, I looked at the distribution of URL visits by 102K people in a day.  In What does the Nth Explorer of the Web Find?, I looked at how adding users grows the long tail and the number of unique URLs explored.  In this 3rd (of 4) post, I look at the how the tall head changes as we add explorers.

The tall head is the short list of sites that get the most visits.  We could call it the Top 10, Top 100 or whatever we think is relevant.  Later I propose a couple of heuristics for determining tall head membership in real time.

The tall head is made up of URLs that much of the population visits.  These are the "winner take all" URLs of user attention--URLs like cnn.com, google.com, or facebook.com.  For this reason, we might expect that while the long tail is growing with more unique URLs and the number of URLs with 2-3 hits is growing rapidly, the tall head is relatively stable.

This is the case.

One way to think about how stable the tall head might be is to ask how well a subset of the population predicts membership in the tall head for the entire sample.  The data from the last two posts is well-suited to look at this question.

Below is a plot of the accuracy of various subsets of the population (the same subsets we used previously) in predicting the Top 10, Top 20, Top 50 and Top 100 URLs of the entire population.  Just over 40% percent of the population predicts the Top 100 URLs with 90% accuracy.  The Top 10 are predicted to 90% accuracy by 10% of the population.

predicting the tall head


Figure.  Red-Top 10 URLs by visit; Blue Top 20
URLs by visit; Green-Top50 URLs by visits; Cyan-Top
100 URLs by visits.

The composition of the tall head depends relatively weakly on the subset of the population doing the predicting.

How can I predict the tall head for the day by 9 am in the morning? This is the real-time problem of long tail distributions.  The dynamics of the system are that real time Web exploration data appears as a time-ordered list of URLs from whatever users happen to be surfing.  This means that a real time heuristic for determining top URLs for the day has to rely on the properties of the time series including a small surfer sample size and recent counts of visits.

Fortunately, by the results illustrated above, a small sample size is a pretty good bet for determining tall head URLs. What we are still missing is metrics or intuition for how the long tail distribution evolves over time.

We do know that for a URL to end up in the tall head, it must be visited by many Web explorers.  This means that we can rule out all URLs that are visited by only one or two users.  This assumption also leads to a heuristic based on the time between visits--URLs visited by many people should have the same visit/time distribution as the users/time distribution of the entire sample.  More specifically, we might guess that if the time between visits has an average near 1 day/number of visits and relatively low variance, it is likely to be in the tall head.  A project for a little later...

Next post: How does the composition of the tall head change from day to day?

August 27, 2008

What does the Nth explorer of the Web find?

In The Long Tail of URL Exploration, I looked at the distribution of URLs and visits. This was on the way to trying to answer questions like:

  • How much overlap is there between the URLs 10 people visit and those in the 11th person's click stream?
  • How about the 100th or 100,000th person? Does the millionth user explore any unique URLs at all?
  • Can we build a model to answer How many people are required to crawl 10% of the Web?


The second part of the answer is to look at how the model of URLs and visits evolves as we add users.  To get samples with of different sizes using the same click stream data set, randomly select a subset of the users and run the analysis from the previous post.  Through everyone back into the pot and randomly select a slightly larger set.  Repeat.

I reran the model for 3%, 4%, 5%, 7%, 9%, 11%, 15%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% of the overall users in the 1-day data set. The sample sized ranged from 3,000 to 91,200 users. For the entire data set, the average user made 184 URL visits during the day. In the randomly chosen subsets, users made an average of between 181 and 187 URL visits with most of the variation in the smaller sample size as expected.

Do I expect the number of unique URLs be linearly proportional to the number of users? Or if users are visiting many of the same URLs and URLs tend to have the "winner take all" properties we looked at before, we might expect the number of unique URLs added by the 90,000th user to be fewer than the number of unique URLs added by the 1,000th user.

I first plotted the number of unique URLs against the number of users in each sample.  The curve looks straight but may be slightly concave downward.  It is very subtly.  I needed to look at the data in a way the amplified the change over the various subsamples.

Below is a plot of the number of unique URLs/user vs. the number of users. This line is flat if the number of URLs is growing linearly with the number of users.

URLs per user


The blue curve is the best fit to another power function ( f(x)=ax^k ). The first few thousand users are contribute more original URLs (>90 URLs per user) to the sample than the 100,000th (83 URLs).  If you are the first explorer of the a new world, all of your discoveries are original; when you are a late comer, your contributions are around the margins. It may be surprising how much original content being explored by the 100,000th explorer.

Does the long tail get relatively longer or shorter?  For simplicity, I use the URLs with only one visit to represent the long tail.  Then ratio of 1-visit URLs to unique URLs decreases subtly. For the smallest samples size 70.0% of the unique URLs are hit only once; for the overall data set, the ratio is 69.2%. To amplify this change like above, the plot of 1-visit URLs per user is shown below.

long tail per user


At 100,000 users, the long tail is growing at 57 URLs per additional user.  The decrease with each additional user is slowing.  The blue curve is the best fit to another power function.

If the power function is the best explanation of the underlying dynamics, the number of unique URLs and the long tail both continue to grow no matter how many people are exploring.  Since an increasing number of people need to explore to keep the exploration rate constant, the cost of exploration per URL goes up as explorers are added.

Does anything interesting happen in the tall head where the big winners are? That will have to wait for another post.

 

The Long Tail of URL Exploration


Unlike robot crawlers, people visit only the links they think will be interesting.  People follow news stories, follow links from friends, follow links to videos or pictures.  People much more rarely follow links to boring stuff like privacy policies, lists of data or company mission statements.

In order to understand the pattern or people discovering content on the Web, I looked at the click streams of real Web users.  (For privacy hounds, I have real data, but the user names and URLs are hashed so I can't personally identify anyone or look up the actual URLs they visited.)

How much overlap is there between the URLs 10 people visit and those in the 11th person's click stream?  How about the 100th or 100,000th person? Does the millionth user explore any unique URLs at all? Can we build a model to answer How many people are required to crawl 10% of the Web?

The first part of the answer is to look at the distribution of URLs created by a group of users.  The sample has 102,000 user's click streams for 1 day.  I use only 1 day, because using multiple days complicates the estimates due to nearly all users having a set of URLs they visit daily. In the sample, users make 18.6M visits to just under 8.4M unique URLs.

When the URLs are ranked by number of visits, the distribution of visits over the 8.4M URLs shows that a few URLs get many hits while the tail of the distribution (way out to the right) is a long flat curve with many URLs getting only a handful of hits.

The distribution of visits can be fitted fairly accurately with a power law, p(x)=ax^k.  I don't plot the curve, because the head is so tall compared to the tail and the distribution falls off so quickly that the plot is a very sharp "L" shape and we don't get much from looking at it.  It is more useful to look at the cumulative distribution function (CDF) of the distribution.  This is the sum of the probabilities over the rank from highest to lowest.  Summing the probabilities to the lowest ranked URL, gives 100% of the visits recorded. Using the CDF perspective gives insight we can apply to practical situations.

Below is a plot of the CDF.  The red dots are the data points calculated from the sample while the blue line is the best fit to the CDF of the proposed power law distribution.  (The fit parameters are a=0.00219 and k=-0.690.)


CDF of URL Visits



One conclusion that comes out of this view of the data is that URL visits follow the so-called "80/20 Rule." This predicts 80% of the visits for the day went to roughly 20% of the URLs. Actually, for this data, the proportion is about 80/50--80% of the traffic when to the top 4.5M URLs or the top 57% of URLs.

This view shows that the tall head of big visit winners takes a significant fraction of the overall attention of the group.  These are likely URLs like www.google.com or www.cnn.com.

What does the long tail look like? The tail is just as surprising. For this data set, 5.8M URLs or 69% of the URLs visited during the day were visited only once. The number of URLs visited twice is 1.4M.

How do these numbers scale with the size of the group? That's coming in the next post. 

August 12, 2008

Qualifiers with Infinite Denominators Link

I like digging through how we use and perceive words.  Here is a fun one from Chris Andersen's Blog

Many words we use imply a ratio.  Here are the top 5 words whose meaning fizzles when the implied denominator becomes large:

  1. "Most"
  2. "Average"
  3. "Typical"
  4. "All"
  5. "None/No"
It is not only large denominators that undermine these words (more accurately, the success we have using our intuition to make predictions with these words). They also fall apart when probability distributions are anything but Normal (peaked, symmetric and short-tailed).  We don't seem to do a great job of helping people understand multi-modal or skewed distributions with our basic statistics education.  It is increasingly important in planning and executing many business ideas that we become more sophisticated with long-tailed, asymmetric and multi-modal probability distributions.

Chris is the author of The Long Tail and editor of WIRED magazine and curator of TED.  He is working on a new book about "free" products and services.  All recommended--check them out.

February 13, 2008

Competition III--Another Iterated Prisoner's Dilemma Tournament

These results are a follow on to my last post on competition and iterated prisoner’s dilemma simulation. In the tournament below, I used the tournament rule that every agent plays every agent at each round.  This takes a lot longer to run and the results are different. AvgT4T is still the winner, but T4TForgive beats out T4T.

Tournament 1 was one in which everyone played once each round.  This seems more like a competitive environment in which the other players don't get complete information about the how play proceeded in that round--maybe more like competing for jobs or making friends in high school. 

Tournament 2 seems to give more complete information at each round as a public auction or public disclosure pricing might provide.  I am sure the merits of using one style of tournament over another have been debated plenty.

The difference in outcome illustrates an important characteristic of complex interactions among agents: The initial conditions and the rules of play make all the difference in the world.  For those who are committed to free markets, there seems to be a parallel assumption that the ultimate achievement in free markets is one with now rules at all. Citizens of a connected and crowded planet may choose to design the rules of interacting based on our initial conditions. When I play with these simulations for awhile, I start to understand Case's argument against depending on the simplistic dogma of free-markets to answer all questions of our common wellbeing.

Prisoner's Dilema Population by Round 

Data tables ... 

Continue reading "Competition III--Another Iterated Prisoner's Dilemma Tournament" »

February 11, 2008

Competition II

I couldn't let this one alone.  While reading Case's book on Competition, I decided to simulate the theoretical game Iterated Prisoner's Dilemma. The explanation on Wikipedia is great.  I use the canonical PD payoff matrix {(3,3), (0,5), (5,0) (1,1)} where the scores represent points awarded to (agent1, agent2) for combinations of play {(mum, mum),(mum, snitch),(snitch, mum),(snitch, snitch)} respectively.

The simulation runs 200-game matches for 40 rounds of play.  I have coded both everybody-plays-once-per-round and everybody-plays-everybody-per-round tournaments, although the results here are only from  everybody-plays-once-per-round tournaments. Tournaments start with 25 agents of each strategy type.

I programmed 7 agent strategies:
  • Defect (always snitch)
  • Random (fair coin toss)
  • Tit-for-tat (Do what your opponent did last play--T4T is best overall according to the literature)
  • T4T with occasional snitches
  • T4T with occasional forgiveness
  • Prey on cooperators
  • Average T4T (T4T based on the average of all plays to date)

Below are the results of 40 rounds of play. After each round, the total points earned by each strategy are tallied, then the population is redistributed according to the point breakdown.  Predatory strategies tend to become extinct after 15 to 20 rounds.

 

IPD Population Plot


Average T4T seems to be the population winner in my simulations, settling in at about 40% of the population.  I haven't seen any documentation on this effect.  It may be an artifact of my particular mix of agents when the tournament starts.  Or I may have discovered a new, effective agent strategy.  This doesn't seem likely as it is such an obvious strategy. Has anyone seen this effect before?  Please drop me a line.

The game win-loss-tie record between agent's strategies was a little surprising to me.  Defection wins most of the matches. But it is clear that winning games doesn't correspond closely to achieving the highest scores per round.  Coming in a close second against a wide range of strategies is what explains a particular population's survival and growth.

Match Win-Loss-Tie

 

AvgT4T-Random: 0.492, 0.508, 0.000
T4TDefect-Random:1.000, 0.000, 0.000
T4TDefect-T4TForgive: 1.000, 0.000, 0.000
AvgT4T-T4TForgive:0.000, 0.000, 1.000
Defect-T4TDefect: 1.000, 0.000, 0.000
Defect-Pred1: 1.000, 0.000, 0.000
T4T-AvgT4T: 0.000, 0.000, 1.000
Pred1-T4TForgive:1.000, 0.000, 0.000
T4TDefect-T4T: 0.974, 0.000, 0.026
Pred1-AvgT4T:1.000, 0.000, 0.000
T4T-T4TForgive: 0.000, 0.000, 1.000
Defect-AvgT4T:1.000, 0.000, 0.000
T4TDefect-AvgT4T: 1.000, 0.000, 0.000
Random-T4TForgive: 1.000, 0.000, 0.000
Defect-Random: 1.000, 0.000, 0.000
Pred1-Random: 0.250, 0.750, 0.000
Defect-T4T: 1.000, 0.000, 0.000
Defect-T4TForgive:1.000, 0.000, 0.000
Pred1-T4T: 1.000, 0.000, 0.000
Random-T4T: 0.750, 0.000, 0.250
T4TDefect-Pred1: 1.000, 0.000, 0.000

If you compare the percentage of points earned in the (7!)/(2!)(5!) = 21 possible pairings, it is clear that the point leaders fairly evening split the points with all opponents, while the agent strategies the result in rapid extinction lose to some strategies by wider margins.

Match Scores (Fraction of Points)

T4T-T4TForgive: 0.500, 0.500
T4TDefect-T4TForgive: 0.510, 0.490
Random-T4TDefect:0.492, 0.508
Defect-AvgT4T: 0.506, 0.494
T4TForgive-Random:0.491, 0.509
Pred1-AvgT4T: 0.522, 0.478
Defect-T4T: 0.506, 0.494
T4TForgive-AvgT4T: 0.500, 0.500
T4TDefect-AvgT4T:0.518, 0.482
Pred1-Random: 0.397, 0.603
Defect-Random: 0.858, 0.142
T4T-T4TDefect: 0.496, 0.504
AvgT4T-T4T: 0.500, 0.500
T4TDefect-Pred1:0.512, 0.488
Pred1-T4T: 0.503, 0.497
Pred1-T4TForgive: 0.510, 0.490
Random-T4T:0.502, 0.498
Random-AvgT4T: 0.503, 0.497
Defect-T4TDefect: 0.506, 0.494
Defect-T4TForgive:0.537, 0.463
Defect-Pred1:0.506, 0.494



You can download the code (zip, gzip) here to play with ideas for strategies or see how the dynamics of the system change based on the mix of agents. There is a Readme file explaining how to create new agents with your favorite strategies. Adding new agent types is fairly straight forward and requires only a few lines of new code in most cases. Python 2.5  is required to run the simulations.  The analysis programs require MatPlotLib to create the plots.  This software is available under a non-commercial Creative Commons License.  Have fun!

February 09, 2008

Competition-Mobius Thought Loop

James Case's Competition: The birth of a new science  caught my eye while I was browsing the stacks in Powell's Bookstore over the holidays. The Amazon reviews seem to hit most of the book-review points I might try to make so no use in reproducing that here.

Case has an interesting background as a pro baseball pitcher and a professional mathematician. In Competition, he is making the argument that classical economics is in the process of maturing toward a "real" science by embracing experiments and increasingly holding theories to the standard of allowing the possibility they may shown false by successful explanations of new data.  This is something he both observes and documents at the edges of economic academia. And it is also a cause Case promotes.

Here is one additional thought: This discussion is moving along an ironic and amusing thought loop-maybe, an argumentative Mobius Strip?

Starting on one side, Case explains that it is time for a paradigm shift (Kuhn's kind) in Economics, that data and new analysis abilities are building a compelling argument against the classical economic axioms: Efficient Markets, Gaussian randomness, the tenant that free-agent-self-interest leads to optimizations, free trade always creates domestic benefits (Principle of Comparative Advantage-David Ricardo [1817]--no Wikipedia hits on this), etc. Case is not trying to prove this is happening in one short book; instead he lays out the background science, framing the argument, and citing original works.

Case spends a lot of energy explaining how experiments in actual and simulated competition show that monopoly behaviors (characterized by predatory actions on competitors executed in order to keep markets already taken lucrative) ubiquitous. As you continue to follow the loop, Case goes on to explain some of the ills to society these misconceptions (idealizations) cause.

To show the nature of the resistance, he quotes many criticisms of the "new science" by the classical economists. He catches them ignoring new data, new ideas behind the analysis of competition and real agents (people with partial knowledge, greedy, colluding, etc).

Now in a sense, we seem to be back to where we started.  Case may be overlooking the irony or he may have dulled his sense of absurdity, or he may feel irony is not useful in this discussion.  It seems that he might have dispensed with the straight face and ended arguing that economic classicists will change slowly or not at all because, with regard of their careers as academics and government advisors, the current monopoly, that is, their current economy, is working great.  Why use theory and data to ruin a perfectly good system?

One can imagine a day in the future when some new explanation or technology from the "new science" (empirically falsifiable, complexity, self-organizing complex systems)--or some newer science--is so compelling and valuable that the market decides in its favor over the classical views.  Don't expect Microsoft to start feeling uneasy that it might be a monopoly and contemplate breaking itself up.  Likewise, why would classical economists give up their central positions to make a few model-generated lines run a little closer to a few experimental data points?

October 22, 2007

Networks And Obesity

There is an interesting article in the New England Journal Of Medicine this month in which the authors analyzed data from a long multi-generational health study of over 12,000 people to understand how the network of friends and relations correlates to obesity.  Christakis and Fowler found a number of relationships with increased probablility of shared obesity.

Among their findings,

A person's chances of becoming obese increased by 57% (95% confidence interval [CI], 6 to 123) if he or she had a friend who became obese in a given interval. Among pairs of adult siblings, if one sibling became obese, the chance that the other would become obese increased by 40% (95% CI, 21 to 60). If one spouse became obese, the likelihood that the other spouse would become obese increased by 37% (95% CI, 7 to 73).

Click through to read the whole article on the NEJM site (Free).  It is an interesting look at human behavior and a nice example of the power of network analysis.

 

 

October 11, 2007

Is this long tail distribution a power law?

The discussion of power law vs long tail came up on Chris Anderson's Long Tail blog a couple of days ago. 

"Power law or not?" vs "Long-Tail or not?" are separate questions. If I understand Chris' thesis, Long Tail is the idea that there is a significant population in the "not-hit" part of the distribution, usually of low volume in any rank, but continuing out to very high ranks.

The idea that there is a region where the distribution is essentially "scale free" seems like the key concept. If we start there, interesting questions include: Can we characterize this region with a power law? And (my favorite), what are the dynamics of the system where scale matters? This last question is at the core of the economics of the long tail businesses in general. For example, determining how inexpensive we make "find" and "acquire" activities corresponds with the "knee" in the distribution.

Scale-free is a misleading mathematical idea in that nothing in nature is actually scale free for all domains. For example, an absurdity of assuming scale-free in every domain WRT music or movie hits is that anything created has at least 1 fan (i.e. we don't have an arbitrarily small hit)--this introduces scale and consequently, the region of arbitrarily small hits with less than 1 fan can't be modeled by a power law. That's a toy case, but illustrates how much scale matters.

Instead of including all the points in the power law fit, maybe we can look at the points up to the knee in power law model (it looks like x~3) and then try to understand what interesting dynamics shape the knee for x>3 with the assumption that some scaling has been introduced by cost, potential audience size limits, or whatever...

July 31, 2007

Long Tail Statistics (Kilkki's Formulation Hides Latent Demand)

The expression “The Long Tail” first started to buzz with the WIRED article of that name by Chris Anderson a couple of years ago.  Later, he expanded the article into a book explaining the ideas more fully and giving examples from book sales, music and movies.

The idea is explained in many places.  Start with Wikipedia.  I won’t try to reproduce a basic explanation.

However, recently, I read a paper by Kalevi Kilkki titled "A practical model for analyzing long tails". In it, Kilkki explains an alternate Long Tail distribution to the traditional Power Law distribtution.

I don’t see Kilkki's motivation--or rather, I don't trust it. He seems to be blindly "refining" the distribution to get more precision.  I don't trust this for two reasons (1) precision isn't going to help prediction or design (without understanding underlying processes, at the very least) and (2) I think his quest to line up the dots with the lines hides important effects. We will see. Finally, I think Kilkki over emphasizes the differences and benefits of his model. To try out my main critique, I built a (rough) Mathematica notebook to look at Kilkki’s model and the traditional power law.

As usual, I generated more questions than answers, but, unless I really misunderstood Kilkki, I don’t see the benefits he is pointing to.  I hope you enjoy a quick look at The Long Tail and send comments if you have any thoughts.

June 22, 2007

Random Network Phase Change

My network code is evolving.  Today, I was able to look at the phase change phenomenon in a randomly linked network.

To build a random network choose two nodes at random and connect them together.  As one continues to add links to the system, some nodes by chance end up with more links than others.  The number of nodes directly linked (the nearest neighbors) to a chosen node is called the “degree”.  The distribution of degree for a 1,000 node system is shown for various numbers of total links in the system in my post from a couple of days ago.

As more links are added to the system, more and more nodes are connected together.  Some nodes get connected to nodes that already have links, making larger groups of nodes linked together.  As one adds links to the system, new groups of 2 form as well as groups becoming linked together to form larger groups.

In the system simulated here, I have 1,000 nodes.  If I add 500 random links to the system, this would be enough to link every node once.  But because links are added randomly, some nodes are highly linked, and others aren’t linked at all.  One might expect the groups to get bigger gradually until there are so many links in the system, that nearly all the nodes are in one big group.

Instead, something else interesting happens as we add enough links in the system to theoretically connect every node (500 in my simulation).

If we look at the groups that form, the system goes through an abrupt transition from the unlinked phase to the linked phase when the (links in system)/(nodes in system) ~ ½.  See Figure 1 below.  The largest group rapidly grows from containing a tiny fraction of the nodes to containing nearly all of them.  This is the phase transition in random networks depicted in Watts’ book Six Degrees on pg 45 (hardback edition).  The value Average Links/Node = 0.5 is known as the critical point.  This was explained in 1959 by Erdos and Renyi (without the benefit of a straight-forward simulation like the one here).

 Phase Change Diagram

Figure 1. Phase Transition Diagram 

What is the distribution of group size at Average Links/Node = 0.5? Figure 2 shows a histogram the group sizes for one simulation run.

Distibution of group size  

Figure 2. Distribution of group sizes at the Critical Point. The largest groups are too
far off to the right to show on the histogram.  The largest group in the system
at the critical point has 118 nodes, or about 12% of the nodes.

June 19, 2007

Python Code for Network (Graph) Analysis

For the last week or so I have been revising some code I wrote a couple of years ago when I first read Barabasi's excellent book, Linked. The refactored code is a better object-oriented design and my use of Python has improved. I will make the code available here soon.

To get to some results quickly, I started with a simple case: a network in which the nodes are randomly linked. For a network (graph) where the nodes are linked (edges) randomly, some basic results are fairly easy to get to and straight forward reproduce.

The node rank (or degree) is the number of links a given node has to other nodes.  If one starts with 100s of nodes, but only a few links, it is easy to see that the node rank of most of the nodes will be zero.  As the number of links in the network increases, the rank distribution evolves.  What is the distribution of node rank?  How does it depend on the ratio of links to nodes?

Below are three examples cases from the network simulations.  The first figure shows the distribution of rank for the case when there are half as many links as nodes (l = .5 n), the second figure shows the distributino when the number of links equals the number nodes (l = n), and the third figure shows the case when links are about 2.5 times the number of nodes (l = 2.5 n).  The distributions are said to be Poisson which the results here resemble--will have to test for that later.

 

node rank distribution 1
 

 Figure 1. What I labeled "Node Rank" is also called "Degree" in the literature and is the number of nodes one link away from a given node.

 

 

node rank distribution 2

 Figure 2

 

 

node rank distribution 3
 
Figure 3 
 

 

May 08, 2007

Is the distribution of the digits 0-9 uniform?

Another way of stating this question is "Are all digits equally likely?"

It turns out, no.  For large sets of numbers resulting from measurements of nearly anything, the lower numbers are more common.  In fact, they tend to follow a power law (See below).

But saying so doesn't make it so. How about some examples?

To get some quick results, I wrote a Python script to count digits.  The core counting routine is shown below (download .py, PyX required for making plots).

...

inf = file(options.filename,'r')
buf = inf.readlines()

nre = re.compile('[0-9]')

hist = {0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0}

for line in buf:
    nlist = nre.findall(line)
    for n in nlist:
        hist[int(n)] += 1

... 

Next, find some data.  I started close to home by looking at the data from a monthly report of online performance data and financial performance data for the employer.  For this data, the histogram of 1 month's data looks like Figure 1 below.

 

April Performance Report Histogram
 

 

Figure 1. Distribution of digits 0-9 in monthly performance data for AdPay, Inc.

 

For a quick comparison, let's find some data on the Web for rainfall and population statistics.

Rainfall Histogram

Figure 2.  Distribution of digits 0-9 in rainfall data. Is the digit '3' unusually common in rainfall data?

 

Distribution Population Historgram 

Figure 3.  Distribution of digits 0-9 in population data from a combination of several countries.

 

For more information:

Continue reading "Is the distribution of the digits 0-9 uniform?" »

March 21, 2007

Net Neutrality (Matters!)

People seem to understand the issues fairly quickly, but awareness isn't very broad.  Here's a little film that explains it nicely:  "Humanity Lobotomy." Spread the word so people can make informed decisions.

February 21, 2007

What is 'Local'? Part 3: What is the value of your network?

How does network value scale with network size? Is the number of members of the network the right parameter to look at? Or can we measure something else more indicative of network value?

Look at some of the value scaling ideas currently in play:

  • Sarnoff’s scaling law says a network has value in proportion to the number of users, n.
  • Metcalfe’ scaling law says a network has value in proportion to the number of possible connections between pairs of possible endpoints, n(n-1)/2.
  • Reed’s scaling law</a> says a network has value in proportion to the number of possible non-trivial groups that can be formed by network members, 2^n-n-1.

These scaling laws are covered nicely in the wikipedia entries at Metcalf and Reed. There have been a number of criticisms (see Tom Evslin among many others) of these scaling laws. Many based on very common sense, e.g., the value of a very large network doesn’t double every time a new user joins, as Reed would predict.

Among the more recent scholarly looks at the issues, Odlyzko and Tilly provide a good overview of the question and propose another scaling law, this time network value scales faster than Sarnoff, but slower than Metcalf, at nlog(n). One of their key points is that not every connection in a network has the same value, and that dealing with this when deriving a scaling law is not so simple.

Also, this article is valuable in that it explicitly recognizes the importance and connection of a network scaling law with the economic ideas of the Long Tail and the idea of locality (see some of my previous posts).

I am not convinced that Odlyzko and Tilly have hit the right scaling law, but their hint at the synthesis of these ideas is definitely moving us in the right direction.

When I think about why I join networks and how I use and value them, I think the value of a network to an individual scales with the number of desirable groups one can afford to find and join. Maybe following this line of thinking results in a new scaling law?

What is 'Local'? Part 2: Affinity Over Proximity!

Improving our (cap)ability to “find” and “get” objects or experiences means lowering acquisition costs of acquiring these things. So, for a given threshold 'radius' of acquisition cost, the world of attainable objects and experiences is larger and richer. This lets people make more choices based on affinity, rather than proximity.

What is 'Local'? Part 1

‘Local’... applies when a desirable experience or object is within a radius of acquisition cost that a particular individual is willing to incur.

 

Acquisition Costs are incurred in the form of time, money, emotional effort or intellectual effort. The particular individual is key to sorting out the context and relative value of other desirable experiences and objects. Each individual has a personal system of exchange of time for money, money for emotional, time for intellectual effort etc. based on their predisposition to one sort of activity or another, source of income, family and other commitments, passion for an object or experience. This may sound complicated, but everyone is constantly working this out for themselves.

 

This definition generalizes well for cases where geographic locality isn't a practical restriction. For example, things like buying books at Amazon, selling items on eBay, and playing Go with someone in Norway are all sufficiently within my radius of acquisition cost to be considered 'Local.'