Lecture Notes in Physics
Founding Editors: W. Beiglböck, J. Ehlers, K. Hepp, H. Weidenmüller
Editorial Board
R. Beig, Vienna, Austria
W. Beiglböck, Heidelberg, Germany
W. Domcke, Garching, Germany
B.-G. Englert, Singapore
U. Frisch, Nice, France
F. Guinea, Madrid, Spain
P. Hänggi, Augsburg, Germany
W. Hillebrandt, Garching, Germany
R. L. Jaffe, Cambridge, MA, USA
W. Janke, Leipzig, Germany
H. v. Löhneysen, Karlsruhe, Germany
M. Mangano, Geneva, Switzerland
J.-M. Raimond, Paris, France
D. Sornette, Zurich, Switzerland
S. Theisen, Potsdam, Germany
D. Vollhardt, Augsburg, Germany
W. Weise, Garching, Germany
J. Zittartz, Köln, Germany
The Lecture Notes in Physics
The series Lecture Notes in Physics (LNP), founded in 1969, reports new developments
in physics research and teaching quickly and informally, but with a high quality and
the explicit aim to summarize and communicate current knowledge in an accessible way.
Books published in this series are conceived as bridging material between advanced graduate textbooks and the forefront of research and to serve three purposes:
– to be a compact and modern up-to-date source of reference on a well-defined topic
– to serve as an accessible introduction to the field to postgraduate students and nonspecialist researchers from related areas
– to be a source of advanced teaching material for specialized seminars, courses and schools
Both monographs and multi-author volumes will be considered for publication. Edited volumes should, however, consist of a very limited number of contributions only. Proceedings will not be considered for LNP.
Volumes published in LNP are disseminated both in print and in electronic formats, the electronic archive being available at springerlink.com. The series content is indexed, abstracted and referenced by many abstracting and information services, bibliographic networks, subscription agencies, library networks, and consortia.
Proposals should be sent to a member of the Editorial Board, or directly to the managing
editor at Springer:
Christian Caron
Springer Heidelberg
Physics Editorial Department I
Tiergartenstrasse 17
69121 Heidelberg / Germany
christian.caron@springer.com
J. Reichardt
Structure in Complex Networks
Jörg Reichardt
Univ. Würzburg
Inst. Theoretische Physik
Am Hubland
97074 Würzburg
Germany
Reichardt, J., Structure in Complex Networks, Lect. Notes Phys. 766 (Springer, Berlin
Heidelberg 2009), DOI 10.1007/978-3-540-87833-9
ISBN: 978-3-540-87832-2 e-ISBN: 978-3-540-87833-9
DOI 10.1007/978-3-540-87833-9
Lecture Notes in Physics ISSN: 0075-8450 e-ISSN: 1616-6361
Library of Congress Control Number: 2008936629
© Springer-Verlag Berlin Heidelberg 2009
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9,
1965, in its current version, and permission for use must always be obtained from Springer. Violations are
liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Cover design: Integra Software Services Pvt. Ltd.
Printed on acid-free paper
springer.com
Preface
Data have been the driving force of natural science ever since Galileo Galilei (1564–1642). During the Renaissance, he introduced the comparison of theoretical predictions with experimental data as the ultimate test of any scientific theory. This change of paradigm from a qualitative to a quantitative description of nature has shaped the way we think about how science should proceed. Today, we call it the scientific method: start by establishing a theory and test its predictions in experiments until you find a contradiction with experimental data. Then refine your theory and start all over again.
But how do we come up with a theory to start with? There are two main
approaches to this: the insight of a genius and the careful inspection of data.
As examples of the former, take the idea that the entropy of an ideal gas is
proportional to the log of the number of microstates accessible to the atoms
of this gas in phase space as proposed by Ludwig Boltzmann (1844–1906). Or
the ergodic hypothesis, generally attributed to Josiah Willard Gibbs (1839–
1903), that all microstates compatible with the energy of a system are equally
probable. Or even more strikingly, Sir Isaac Newton’s (1643–1727) idea that
the force which keeps the earth orbiting the sun and the force which makes
an apple fall from a tree are one and the same. Such insights are certainly
motivated by, but evidently not derived from, experimental evidence. On the
other hand, think of the laws of planetary motion discovered by Johannes
Kepler (1571–1630). He used the extensive and accurate observations compiled
by Tycho Brahe (1546–1601) to derive his theory. Other examples may be
Charles Darwin (1809–1882) or Gregor Johann Mendel (1822–1884) who also
relied on extensive gathering of observations to come up with their theories
of evolution and inheritance.
What unites all scientists is that they seek to explain patterns in data, revealing the underlying principles that govern what we see in experiments. Some
have great insight, others have to rely on the inspection of a large body of data
to arrive at a hypothesis or theory. In the above examples, hypotheses were
derived by humans. Kepler, Darwin, Mendel and many others contemplated
their data often for years until they saw a common pattern. Today, computers
facilitate this task. Genome databases are scanned for risk factors for certain diseases, banks let computer algorithms analyze all data available on their customers to assess their credit risk, and online vendors try to detect fraud in an automated fashion, to name but a few examples.
Beyond facilitating the search for patterns connected with a specific question, computers have, more importantly, enabled us to analyze data even without a specific question in mind. We can let computers mine data, asking whether there is any pattern or structure at all. The scientist must then answer the question of what brings about this structure. Naturally, this leads to the question of what we consider to be a pattern or regularity. In this work,
we will understand everything that cannot be explained by randomness as
a pattern. The structures we will be concerned with are characterized by a
maximal deviation from random null models.
Specifically, we will be concerned with one particular aspect of pattern
recognition and data mining: that of clustering and dimensionality reduction.
With the ever-increasing amount of empirical information that scientists from all disciplines are dealing with, there exists a great need for robust, scalable and easy-to-use clustering techniques for data abstraction, dimensionality reduction or visualization to cope with and manage this avalanche of data. The usefulness of statistical physics in solving these problems was recognized as early as the 1950s, long before computers became as abundant as they are today [1, 2]. This monograph will show that even today, methods, and in particular spin models, from statistical mechanics can help in resolving, and more importantly in understanding, the related statistical inference problems.
Clustering techniques belong to the field of unsupervised learning [3].
Given a data set, the goal is to group the data points such that the points
within a cluster are similar to each other and dissimilar to the rest of the
data points [4–7]. The more similar the objects within a cluster are and the
more different the clusters are, the better the clustering. Instead of having to
deal with many individual objects, the researcher can then concentrate on the
properties of only a few clusters or classes of objects with distinct features.
Though intuitively clear, clustering represents a technically ill-posed problem for a number of reasons. First, it is not clear at which level of detail a cluster structure is to be defined, i.e., what the number of clusters in a data set is, or whether a subset of data points should be regarded as one cluster or be divided further. It may also be debated whether data points may belong to more than one cluster. Part of this problem is that the term “cluster” does not have a well-defined meaning. Second, it is not clear what an appropriate similarity or dissimilarity measure is supposed to be. Third and most importantly, it is difficult to tell whether one is merely finding what one is searching for: all clustering techniques will find some cluster structure, even in random, unstructured data.
Because of these problems, there exists no single clustering technique for all types of data, and clustering techniques are still the subject of ongoing research. The simplest case is arguably multivariate data, where each data point is characterized by a D-dimensional feature vector containing real-valued entries. A typical example would be a set of objects characterized by a number of measurements. A natural similarity measure is then the
Euclidean distance. As a naive approach, one can compute the distance between all pairs of data points and successively join close data points into clusters. This is impractical for large data sets, as the number of pairwise distances scales as the square of the number of data points. The method of choice then is to introduce prototypical data points as cluster centers and to find the positions of these cluster centers such that they represent the data set in some optimal way. To do this, only the distance of each data point from each cluster center needs to be determined, which makes the computational effort linear in the number of data points for a given number of clusters. This approach is taken by the k-means algorithm, which is probably the most widely used clustering technique despite its many shortcomings [4, 5].
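To make the procedure concrete, here is a minimal sketch of the standard k-means (Lloyd) iteration in Python; the function and variable names are illustrative, not taken from this monograph:

    import numpy as np

    def k_means(points, k, n_iter=100, seed=None):
        """Minimal sketch of Lloyd's k-means iteration."""
        rng = np.random.default_rng(seed)
        # Initialize centers on k distinct data points chosen at random.
        centers = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(n_iter):
            # Assignment step: each point joins its nearest center.
            # Only N*k distances are needed, linear in N for fixed k.
            dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: move each center to the mean of its cluster;
            # keep the old center if a cluster happens to be empty.
            new_centers = np.array([
                points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers

Each iteration touches every data point only once per cluster center, which is precisely what makes the computational effort linear in the number of data points.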
Note that the introduction of prototypical data points representative of a cluster is only possible when an actual distance measure exists between the data points. It is not possible, for instance, when only a matrix of pairwise similarities is given. This, however, is often the case.
Another problem, known as the “curse of dimensionality” [8], arises when the dimension D of the data set to be clustered increases [9]. The reason is that the data points become increasingly sparse as the dimensionality increases, and the relative difference between the closest and the farthest point from an independently selected point in the data set goes to zero with increasing dimension [9, 10].
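This effect is easy to observe numerically; the following short sketch (illustrative, not from the text) draws random points in increasing dimension and measures the relative contrast between the farthest and closest point:

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1000):
        # 1000 uniform random points and one independent query point in [0,1]^d.
        points = rng.random((1000, d))
        query = rng.random(d)
        dists = np.linalg.norm(points - query, axis=1)
        # The relative contrast (max - min)/min shrinks as d grows.
        contrast = (dists.max() - dists.min()) / dists.min()
        print(f"d={d:5d}  relative contrast={contrast:.3f}")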
Both of these problems arise intrinsically when dealing with relational data. Here, the objects to be clustered are characterized by some sort of relation. Typically, these relations are present or known for only a small fraction of the pairs of objects and can be represented as graphs or networks. The nodes in these networks represent the objects, and their relations are represented by their connections. A typical example is the set of authors of a number of scientific articles, the relation between them being whether or not they have co-authored an article together. Such data are intrinsically sparse, and often the average distance (defined as the number of steps in the network) between two arbitrarily chosen nodes scales as the logarithm of the system's size, i.e., every object is close to every other object. Further, if the graph is connected, objects in different clusters will often be only the minimal distance of one step away from each other. There is no way to introduce prototypical objects, as only pairwise relations are given.
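The logarithmic growth of distances can be checked in a few lines; this sketch assumes the networkx library is available and uses sparse random graphs with a fixed mean degree (parameters chosen for illustration only):

    import math
    import networkx as nx

    # The average shortest-path length in a sparse random graph grows
    # roughly like log(N): every node is a few steps from every other one.
    for n in (100, 400, 1600):
        g = nx.gnp_random_graph(n, 6.0 / n, seed=42)  # mean degree ~ 6
        giant = g.subgraph(max(nx.connected_components(g), key=len))
        print(n, round(nx.average_shortest_path_length(giant), 2),
              round(math.log(n), 2))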
While in the past multivariate data sets have dominated the applications, the increasing use and availability of data warehousing technology provides access to more and more relational data sets. Another aspect is that the first level of description for many complex systems is the topology of their interactions, i.e., networks again. Network clustering techniques hence not only represent exploratory data analysis tools but are also a first step in understanding complex systems [11]. Since conventional clustering techniques are inadequate for networks, a number of novel approaches have been developed in recent years [12, 13]. Despite the many efforts, a number of issues remain.
An “ideal” clustering procedure for graphs should be computationally efficient in order to deal with very large data sets, i.e., it should be able to exploit the sparsity of the data. At the same time, it should be accurate. If there is a trade-off between runtime and accuracy, it should be easy to mediate between the two. It should further allow for overlapping as well as hierarchical clusters and let the user set the level of detail at which a clustering is to be performed. It should have only few parameters, and these should have an intuitive meaning. There should exist a precise interpretation of the clusters found, independent of the clustering technique. And most importantly, an ideal clustering procedure should provide a measure of how strongly the cluster structure found deviates from that found in equivalent random data. While none of the presently available algorithms is able to combine all of these features, the present monograph is intended to provide, analyze and show the application of a clustering procedure that is ideal in these ways.
Chapter 1 will give a short introduction to some graph-theoretical terms necessary for the discussions to follow and provide a brief overview of some important aspects of the study of complex networks. It will illustrate the problem of structure recognition again and underline the importance of novel tests for statistical significance.
Chapter 2 then reviews a number of cluster definitions from different fields of science and surveys the current state of the art in graph clustering, including discussions of the merits and shortcomings of each method.
After this discussion, a first-principles approach to graph clustering in complex networks is developed in Chapters 3 and 4. In particular, the problem of structure detection is tackled via an analogy with a physical system by mapping it onto finding the ground state of an infinite-range Potts spin glass Hamiltonian. The ground state energy of this spin glass is equivalent to the quality of the clustering, with lower energies corresponding to better clusterings. Benchmarks for the accuracy in comparison with other methods are given. Computationally efficient update rules for the optimization routines are presented, which work only on the links of the network and hence take advantage of the sparsity of the system. Model systems with overlapping and hierarchical cluster structures are studied. The equivalence of graph clustering and graph partitioning is derived from the fact that random graphs cluster into equally sized parts. Using known results for the expected cut size in graph partitioning of dense graphs with a Poissonian degree distribution, expectation values for the modularity of such networks are derived.
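For orientation, the quality function behind this mapping can be written, in the general form of the author's published spin glass ansatz (the notation here is a sketch and may differ from that used in Chapters 3 and 4), as

    \mathcal{H}(\{\sigma\}) = -\sum_{i<j} \left( A_{ij} - \gamma\, p_{ij} \right) \delta(\sigma_i, \sigma_j),

where A_{ij} is the adjacency matrix of the network, p_{ij} is the expected weight of link (i,j) under a random null model, \gamma is a resolution parameter and \sigma_i denotes the group label of node i. For \gamma = 1 and p_{ij} = k_i k_j / 2M, minimizing \mathcal{H} is equivalent to maximizing Newman's modularity, which links the spin glass ground state directly to the clustering quality mentioned above.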
In order to extend these results to dense random networks with arbitrary degree distributions, the replica method is used in Chapter 5. The modularity of graphs with arbitrary degree distributions is calculated, which allows the comparison of modularities found in real-world networks and random null models. The results can also be used to improve estimates of the expected cut size in graph partitioning problems.
Chapter 6 is devoted to the study of modularity in sparse random networks of arbitrary degree distribution via the cavity method. This approach is complementary to the replica method and improves on its results in cases of small average connectivity. Furthermore, it allows the calculation of the maximum achievable accuracy of a clustering method and provides insight into the theoretical limitations of data-driven research.
In Chapter 7, the newly developed clustering method is applied to two real-world networks. The first example is an analysis of the structure of the world trade network across a number of different commodities on the level of individual countries. The second application deals with a large market network and studies the segmentation of the individual users of this market. The application shows how a network clustering process can be used to deal with large sparse data sets where conventional analyses fail.
The concluding Chapter 8 summarizes the work and points to directions for further research.
Last but not least, it is my pleasure to acknowledge a number of coworkers and colleagues for their contributions in making this work possible. I am grateful for the fruitful collaboration with Stefan Bornholdt, Peter Ahnert, Michele Leone and Douglas R. White. I have greatly enjoyed and benefited from many discussions with Konstantin Klemm, Holger Kirsten, Stefan Braunewell, Klaus Pawelzik, Albert Diaz-Guilera, Ionas Erb, Peter Stadler, Francesco Rao, Amadeo Caflisch, Geoff Rodgers, Andreas Engel, Riccardo Zecchina, Wolfgang Kinzel, David Saad and Georg Reents. Many parts of this work would not have been possible without the great expertise and support of these colleagues.
Bremen/Würzburg, May 2008
Jörg Reichardt
References
1. E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620–630, 1957.
2. E. T. Jaynes. Information theory and statistical mechanics II. Physical Review, 108(2):171–190, 1957.
3. A. Engel and C. Van den Broeck. Statistical Mechanics of Learning. Cambridge University Press, New York, 2001.
4. L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience, New York, 1990.
5. B. S. Everitt, S. Landau, and M. Leese. Cluster Analysis. Arnold, London, 4th edition, 2001.