Lecture Notes in Physics
Founding Editors: W. Beiglböck, J. Ehlers, K. Hepp, H. Weidenmüller
Editorial Board
R. Beig, Vienna, Austria
W. Beiglböck, Heidelberg, Germany
W. Domcke, Garching, Germany
B.-G. Englert, Singapore
U. Frisch, Nice, France
F. Guinea, Madrid, Spain
P. Hänggi, Augsburg, Germany
W. Hillebrandt, Garching, Germany
R. L. Jaffe, Cambridge, MA, USA
W. Janke, Leipzig, Germany
H. v. Löhneysen, Karlsruhe, Germany
M. Mangano, Geneva, Switzerland
J.-M. Raimond, Paris, France
D. Sornette, Zurich, Switzerland
S. Theisen, Potsdam, Germany
D. Vollhardt, Augsburg, Germany
W. Weise, Garching, Germany
J. Zittartz, Köln, Germany
The Lecture Notes in Physics
The series Lecture Notes in Physics (LNP), founded in 1969, reports new developments
in physics research and teaching quickly and informally, but with a high quality and
the explicit aim to summarize and communicate current knowledge in an accessible way.
Books published in this series are conceived as bridging material between advanced graduate textbooks and the forefront of research and to serve three purposes:
– to be a compact and modern up-to-date source of reference on a well-defined topic
– to serve as an accessible introduction to the field to postgraduate students and nonspecialist researchers from related areas
– to be a source of advanced teaching material for specialized seminars, courses and schools
Both monographs and multi-author volumes will be considered for publication. Edited volumes should, however, consist of a very limited number of contributions only. Proceedings will not be considered for LNP.
Volumes published in LNP are disseminated both in print and in electronic formats, the electronic archive being available at springerlink.com. The series content is indexed, abstracted and referenced by many abstracting and information services, bibliographic networks, subscription agencies, library networks, and consortia.
Proposals should be sent to a member of the Editorial Board, or directly to the managing
editor at Springer:
Christian Caron
Springer Heidelberg
Physics Editorial Department I
Tiergartenstrasse 17
69121 Heidelberg / Germany
christian.caron@springer.com
J. Reichardt
Structure in Complex Networks
Jörg Reichardt
Univ. Würzburg
Inst. Theoretische Physik
Am Hubland
97074 Würzburg
Germany
Reichardt, J., Structure in Complex Networks, Lect. Notes Phys. 766 (Springer, Berlin
Heidelberg 2009), DOI 10.1007/978-3-540-87833-9
ISBN: 978-3-540-87832-2 e-ISBN: 978-3-540-87833-9
DOI 10.1007/978-3-540-87833-9
Lecture Notes in Physics ISSN: 0075-8450 e-ISSN: 1616-6361
Library of Congress Control Number: 2008936629
© Springer-Verlag Berlin Heidelberg 2009
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9,
1965, in its current version, and permission for use must always be obtained from Springer. Violations are
liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Cover design: Integra Software Services Pvt. Ltd.
Printed on acid-free paper
springer.com
Preface
Data have been the driving force of natural science ever since Galileo Galilei (1564–1642). During the Renaissance, he introduced the comparison of theoretical predictions with experimental data as the ultimate test of any scientific theory. This change of paradigm from a qualitative to a quantitative description of nature has shaped the way we think about how science should proceed. Today, we call it the scientific method: start by establishing a theory and test its predictions in experiments until you find a contradiction with experimental data. Then refine your theory and start all over again.
But how do we come up with a theory to start with? There are two main
approaches to this: the insight of a genius and the careful inspection of data.
As examples of the former, take the idea that the entropy of an ideal gas is
proportional to the log of the number of microstates accessible to the atoms
of this gas in phase space as proposed by Ludwig Boltzmann (1844–1906). Or
the ergodic hypothesis, generally attributed to Josiah Willard Gibbs (1839–
1903), that all microstates compatible with the energy of a system are equally
probable. Or even more strikingly, Sir Isaac Newton’s (1643–1727) idea that
the force which keeps the earth orbiting the sun and the force which makes
an apple fall from a tree are one and the same. Such insights are certainly
motivated by, but evidently not derived from, experimental evidence. On the
other hand, think of the laws of planetary motion discovered by Johannes
Kepler (1571–1630). He used the extensive and accurate observations compiled
by Tycho Brahe (1546–1601) to derive his theory. Other examples may be
Charles Darwin (1809–1882) or Gregor Johann Mendel (1822–1884) who also
relied on extensive gathering of observations to come up with their theories
of evolution and inheritance.
What unites all scientists is that they seek to explain patterns in data, revealing the underlying principles that govern what we see in experiments. Some
have great insight, others have to rely on the inspection of a large body of data
to arrive at a hypothesis or theory. In the above examples, hypotheses were
derived by humans. Kepler, Darwin, Mendel and many others contemplated
their data often for years until they saw a common pattern. Today, computers
facilitate this task. Genome databases are scanned for risk factors for certain diseases, banks let computer algorithms analyze all data available on their customers to assess their credit risk, and online vendors try to detect fraud in an automated fashion, to name but a few examples.
Beyond facilitating the search for patterns connected with a specific question, computers have, more importantly, enabled us to analyze data even without a specific question in mind. We can let computers mine data, asking whether there is any pattern or structure at all. The scientist must then answer the question of what brings about this structure. Naturally, this leads to the question of what we consider to be a pattern or regularity. In this work,
we will understand everything that cannot be explained by randomness as
a pattern. The structures we will be concerned with are characterized by a
maximal deviation from random null models.
Specifically, we will be concerned with one particular aspect of pattern
recognition and data mining: that of clustering and dimensionality reduction.
With the ever-increasing amount of empirical information that scientists from all disciplines are dealing with, there exists a great need for robust, scalable and easy-to-use clustering techniques for data abstraction, dimensionality reduction or visualization to cope with and manage this avalanche of data. The usefulness of statistical physics in solving these problems was recognized as early as the 1950s, long before computers became as abundant as they are today [1, 2]. This monograph will show that even today, methods, and in particular spin models, from statistical mechanics can help in resolving, and more importantly in understanding, the related statistical inference problems.
Clustering techniques belong to the field of unsupervised learning [3].
Given a data set, the goal is to group the data points such that the points
within a cluster are similar to each other and dissimilar to the rest of the
data points [4–7]. The more similar the objects within a cluster are and the
more different the clusters are, the better the clustering. Instead of having to
deal with many individual objects, the researcher can then concentrate on the
properties of only a few clusters or classes of objects with distinct features.
Though intuitively clear, clustering represents a technically ill-posed problem for a number of reasons. First, it is not clear at which level of detail a cluster structure is to be defined, i.e., what the number of clusters in a data set is, or whether a subset of data points should be regarded as one cluster or be divided further. It may also be debated whether data points may belong to more than one cluster. Part of this problem is that the term “cluster” does not have a well-defined meaning. Second, it is not clear what an appropriate similarity or dissimilarity measure is supposed to be. Third and most importantly, it is difficult to tell whether one is merely finding what one is searching for: all clustering techniques will find some cluster structure, even in random, unstructured data.
Because of these problems, there exists no single clustering technique for all types of data, and clustering techniques are still the subject of ongoing research. The simplest case is arguably multivariate data, where each data point is characterized by a D-dimensional feature vector containing real-valued entries. A typical example would be a set of objects characterized by a number of measurements. A natural similarity measure is then the
Euclidean distance. As a naive approach, one can compute the distance between all pairs of data points and successively join close data points into clusters. This is impractical for large data sets, as the number of pairwise distances scales as the square of the number of data points. The method of choice then is to introduce prototypical data points as cluster centers and to find the positions of these cluster centers such that they represent the data set in some optimal way. To do this, only the distance of each data point from each cluster center needs to be determined, which makes the computational effort linear in the number of data points for a given number of clusters. This approach is taken by the k-means algorithm, which is probably the most widely used clustering technique despite its many shortcomings [4, 5].
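To make the procedure concrete, here is a minimal sketch of the standard k-means (Lloyd) iteration in Python; the function and variable names are illustrative, not taken from this monograph:

    import numpy as np

    def k_means(points, k, n_iter=100, seed=None):
        """Minimal sketch of Lloyd's k-means iteration."""
        rng = np.random.default_rng(seed)
        # Initialize centers on k distinct data points chosen at random.
        centers = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(n_iter):
            # Assignment step: each point joins its nearest center.
            # Only N*k distances are needed, linear in N for fixed k.
            dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: move each center to the mean of its cluster;
            # keep the old center if a cluster happens to be empty.
            new_centers = np.array([
                points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers

Each iteration touches every data point only once per cluster center, which is precisely what makes the computational effort linear in the number of data points.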
Note that the introduction of prototypical data points representative of a cluster is only possible when an actual distance measure exists between the data points. It is not possible, for instance, when only a matrix of pairwise similarities is given. This, however, is often the case.
Another problem, known as the “curse of dimensionality” [8], arises when the dimension D of the data set to be clustered increases [9]. The reason is that the data points become increasingly sparse as the dimensionality increases, and the relative difference between the closest and the farthest point from an independently selected point in the data set goes to zero with increasing dimension [9, 10].
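This effect is easy to observe numerically; the following short sketch (illustrative, not from the text) draws random points in increasing dimension and measures the relative contrast between the farthest and closest point:

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1000):
        # 1000 uniform random points and one independent query point in [0,1]^d.
        points = rng.random((1000, d))
        query = rng.random(d)
        dists = np.linalg.norm(points - query, axis=1)
        # The relative contrast (max - min)/min shrinks as d grows.
        contrast = (dists.max() - dists.min()) / dists.min()
        print(f"d={d:5d}  relative contrast={contrast:.3f}")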
Both of these problems arise intrinsically when dealing with relational data. Here, the objects to be clustered are characterized by some sort of relation. Typically, these relations are present or known for only a small fraction of the pairs of objects and can be represented as graphs or networks. The nodes in these networks represent the objects, and their relations are represented by their connections. A typical example is the set of authors of a number of scientific articles, the relation between them being whether or not they have co-authored an article together. Such data are intrinsically sparse, and often the average distance (defined as the number of steps in the network) between two arbitrarily chosen nodes scales as the logarithm of the system's size, i.e., every object is close to every other object. Further, if the graph is connected, objects in different clusters will often be only the minimal distance of one step away from each other. There is no way to introduce prototypical objects, as only pairwise relations are given.
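The logarithmic growth of distances can be checked in a few lines; this sketch assumes the networkx library is available and uses sparse random graphs with a fixed mean degree (parameters chosen for illustration only):

    import math
    import networkx as nx

    # The average shortest-path length in a sparse random graph grows
    # roughly like log(N): every node is a few steps from every other one.
    for n in (100, 400, 1600):
        g = nx.gnp_random_graph(n, 6.0 / n, seed=42)  # mean degree ~ 6
        giant = g.subgraph(max(nx.connected_components(g), key=len))
        print(n, round(nx.average_shortest_path_length(giant), 2),
              round(math.log(n), 2))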
While in the past multivariate data sets have dominated the applications, the increasing use and availability of data warehousing technology provides access to more and more relational data sets. Another aspect is that the first level of description for many complex systems is the topology of their interactions, i.e., networks again. Network clustering techniques hence not only represent exploratory data analysis tools but are also a first step in understanding complex systems [11]. Since conventional clustering techniques are inadequate for networks, a number of novel approaches have been developed in recent years [12, 13]. Despite the many efforts, a number of issues remain.
An “ideal” clustering procedure for graphs should be computationally efficient in order to deal with very large data sets, i.e., it should be able to exploit the sparsity of the data. At the same time, it should be accurate. If there is a trade-off between runtime and accuracy, it should be easy to mediate between the two. It should further allow for overlapping as well as hierarchical clusters and let the user set the level of detail at which a clustering is to be performed. It should have only few parameters, and these should have an intuitive meaning. There should exist a precise interpretation of the clusters found, independent of the clustering technique. And most importantly, an ideal clustering procedure should provide a measure of how strongly the cluster structure found deviates from that found in equivalent random data. While none of the presently available algorithms is able to combine all of these features, the present monograph is intended to provide, analyze and show the application of a clustering procedure that is ideal in these ways.
Chapter 1 will give a short introduction to some graph-theoretical terms necessary for the discussions to follow and provide a brief overview of some important aspects of the study of complex networks. It will illustrate the problem of structure recognition again and underline the importance of novel tests for statistical significance.
Chapter 2 then reviews a number of cluster definitions from different fields of science and surveys the current state of the art in graph clustering, including discussions of the merits and shortcomings of each method.
After this discussion, a first-principles approach to graph clustering in complex networks is developed in Chapters 3 and 4. In particular, the problem of structure detection is tackled via an analogy with a physical system by mapping it onto finding the ground state of an infinite-range Potts spin glass Hamiltonian. The ground state energy of this spin glass is equivalent to the quality of the clustering, with lower energies corresponding to better clusterings. Benchmarks for the accuracy in comparison with other methods are given. Computationally efficient update rules for the optimization routines are presented, which work only on the links of the network and hence take advantage of the sparsity of the system. Model systems with overlapping and hierarchical cluster structures are studied. The equivalence of graph clustering and graph partitioning is derived from the fact that random graphs cluster into equally sized parts. Using known results for the expected cut size in graph partitioning of dense graphs with a Poissonian degree distribution, expectation values for the modularity of such networks are derived.
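For orientation, the quality function behind this mapping can be written, in the general form of the author's published spin glass ansatz (the notation here is a sketch and may differ from that used in Chapters 3 and 4), as

    \mathcal{H}(\{\sigma\}) = -\sum_{i<j} \left( A_{ij} - \gamma\, p_{ij} \right) \delta(\sigma_i, \sigma_j),

where A_{ij} is the adjacency matrix of the network, p_{ij} is the expected weight of link (i,j) under a random null model, \gamma is a resolution parameter and \sigma_i denotes the group label of node i. For \gamma = 1 and p_{ij} = k_i k_j / 2M, minimizing \mathcal{H} is equivalent to maximizing Newman's modularity, which links the spin glass ground state directly to the clustering quality mentioned above.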
In order to extend these results to dense random networks with arbitrary degree distributions, the replica method is used in Chapter 5. The modularity of graphs with arbitrary degree distributions is calculated, which allows the comparison of modularities found in real-world networks and random null models. The results can also be used to improve estimates of the expected cut size in graph partitioning problems.
Chapter 6 is devoted to the study of modularity in sparse random networks of arbitrary degree distribution via the cavity method. This approach is complementary to the replica method and improves on its results in cases of small average connectivity. Furthermore, it allows the calculation of the maximum achievable accuracy of a clustering method and provides insight into the theoretical limitations of data-driven research.
In Chapter 7, the newly developed clustering method is applied to two real-world networks. The first example is an analysis of the structure of the world trade network across a number of different commodities on the level of individual countries. The second application deals with a large market network and studies the segmentation of the individual users of this market. The application shows how a network clustering process can be used to deal with large sparse data sets where conventional analyses fail.
The concluding Chapter 8 summarizes the work and points to directions for further research.
Last but not least, it is my pleasure to acknowledge a number of coworkers and colleagues for their contributions in making this work possible. I am grateful for the fruitful collaboration with Stefan Bornholdt, Peter Ahnert, Michele Leone and Douglas R. White. I have greatly enjoyed and benefited from many discussions with Konstantin Klemm, Holger Kirsten, Stefan Braunewell, Klaus Pawelzik, Albert Diaz-Guilera, Ionas Erb, Peter Stadler, Francesco Rao, Amadeo Caflisch, Geoff Rodgers, Andreas Engel, Riccardo Zecchina, Wolfgang Kinzel, David Saad and Georg Reents. Many parts of this work would not have been possible without the great expertise and support of these colleagues.
Bremen/Würzburg, May 2008
Jörg Reichardt
References
1. E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620–630, 1957.
2. E. T. Jaynes. Information theory and statistical mechanics II. Physical Review, 108(2):171–190, 1957.
3. A. Engel and C. Van den Broeck. Statistical Mechanics of Learning. Cambridge University Press, New York, 2001.
4. L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience, New York, 1990.
5. B. S. Everitt, S. Landau, and M. Leese. Cluster Analysis. Arnold, London, 4th edition, 2001.