facilitate this task. Genome databases are scanned for risk factors for certain
diseases, banks let computer algorithms analyze all data available on their
customers to assess their credit risk, and online vendors try to detect fraud in
an automated fashion, to name but a few examples.
Besides facilitating the search for patterns connected with a specific
question, computers have, more importantly, enabled us to analyze data even
without a specific question in mind. We can let computers mine data, asking
whether there is any pattern or structure at all. The scientist must then answer
the question of what brings about this structure. Naturally, this leads to the
question of what we consider to be a pattern or regularity. In this work,
we will understand everything that cannot be explained by randomness as
a pattern. The structures we will be concerned with are characterized by a
maximal deviation from random null models.
Specifically, we will be concerned with one particular aspect of pattern
recognition and data mining: that of clustering and dimensionality reduction.
With the ever-increasing amount of empirical information that scientists from
all disciplines are dealing with, there exists a great need for robust, scalable,
and easy-to-use clustering techniques for data abstraction, dimensionality
reduction, or visualization to cope with and manage this avalanche of data.
The usefulness of statistical physics in solving these problems was recognized
as early as the 1950s, long before computers became as abundant as
they are today [1, 2]. This monograph will show that even today, methods,
and in particular spin models, from statistical mechanics can help in resolving,
and more importantly in understanding, the related statistical inference
problems.
Clustering techniques belong to the field of unsupervised learning [3].
Given a data set, the goal is to group the data points such that the points
within a cluster are similar to each other and dissimilar to the rest of the
data points [4–7]. The more similar the objects within a cluster are and the
more different the clusters are, the better the clustering. Instead of having to
deal with many individual objects, the researcher can then concentrate on the
properties of only a few clusters or classes of objects with distinct features.
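The grouping objective just described can be made concrete with a toy sketch. The following example is illustrative only and not a method developed in this monograph: it applies a plain k-means procedure (Lloyd's algorithm, here assumed to be written with numpy) to two well-separated Gaussian blobs, and then checks that points are closer to their own cluster's centroid than the centroids are to each other.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs: a data set with an obvious cluster structure.
data = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(100, 2)),  # blob around (0, 0)
    rng.normal(loc=5.0, scale=0.5, size=(100, 2)),  # blob around (5, 5)
])

def kmeans(points, k, n_iter=50):
    """Plain Lloyd's algorithm: alternate nearest-centroid assignment
    and centroid update. Deterministic toy init for k == 2: the first
    and last data point (one from each blob by construction)."""
    centroids = points[[0, len(points) - 1]].astype(float)
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

labels, centroids = kmeans(data, 2)

# A good clustering in the sense above: points sit close to their own
# centroid (small within-cluster spread) while the two centroids are far apart.
within = np.mean([np.linalg.norm(data[labels == j] - centroids[j], axis=1).mean()
                  for j in range(2)])
between = np.linalg.norm(centroids[0] - centroids[1])
print(f"within-cluster spread {within:.2f}, centroid separation {between:.2f}")
```

The ratio of within-cluster spread to centroid separation is one crude way of scoring how much "more similar inside, more different between" a partition is.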
Though intuitively clear, clustering represents a technically ill-posed problem
for a number of reasons. First, it is not clear at which level of detail a cluster
structure is to be defined, i.e., what the number of clusters in a data set is or
whether a subset of data points should be regarded as one cluster or be divided
further. It may also be debated whether data points may belong to more than
one cluster. Part of this problem is that the term “cluster” does not have a
well-defined meaning. Second, it is not clear what an appropriate similarity
or dissimilarity measure is supposed to be. Third, and most importantly, it is
difficult to tell whether one is merely finding what one is searching
for: all clustering techniques will find some cluster structure even in
random, unstructured data.
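This third point can be demonstrated directly. The sketch below (again an illustrative numpy toy, with the uniform data and the choice k = 4 as assumptions) runs the same kind of Lloyd iteration on points drawn uniformly at random, where by construction there is no cluster structure, and the algorithm nevertheless reports a partition into k groups.

```python
import numpy as np

rng = np.random.default_rng(1)
# Uniformly random points in the unit square: no underlying cluster structure.
data = rng.uniform(0.0, 1.0, size=(300, 2))

k = 4
# Plain Lloyd iterations, initialized on k distinct data points.
centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
for _ in range(50):
    # Assign each point to its nearest centroid, then update the centroids.
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    for j in range(k):
        if np.any(labels == j):
            centroids[j] = data[labels == j].mean(axis=0)

# The algorithm dutifully reports k "clusters" in pure noise.
sizes = np.bincount(labels, minlength=k)
print("cluster sizes found in unstructured data:", sizes)
```

Deciding whether such reported structure is significant, i.e., whether it deviates from what a random null model would produce, is precisely the perspective taken in this work.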
Because of these problems, there exists no single clustering technique for
all types of data, and clustering techniques are still the subject of ongoing