Data Clustering

Satu Virtanen, 47074N

Report for the 2003 Seminar on String Algorithms (T-106.850, CSE, HUT)
in relation to an oral presentation on March 17th, 2003.

Abstract

This report gives an introductory overview to data clustering; the targets, goals, and methods. We begin by describing the goal of clustering algorithms, mentioning some applications. Then we discuss some clustering approaches starting on general level, proceeding to metric spaces, and then to string data, which is the relevant theme for this seminar. We also review clustering in graphs, which is closely related to our own work. The discussion is intended for a reader with no prior familiarity with clustering and hence concentrates on the principles rather than details.

Contents

  1. Introduction
  2. General clustering methods
  3. Clustering in metric spaces
  4. Clustering string data
  5. Clustering in graphs
  6. Concluding remarks
  7. References

Introduction

Many practical applications of information processing involve massive amounts of data, most of which tends to be noise, and only a small fraction contains semantically interesting information. Clustering is the process of organizing properly represented data into meaningful groups, possibly ignoring as much noise as is feasible. The goal is to interpret from some properties of the data, which pieces (usually data points) of the gathered data are in some sense connected. If the data were a set of coordinates on a Euclidean plane, an intuitive goal for clustering would be to group points that are sufficiently close to each other into a common cluster. Of course any coordinate system can be used as well as arbitrary dimensionality, depending on the data.

The central questions immediately emerge: how close is sufficiently close in order for points to belong to the same cluster, how many points does it take to start a new cluster, does the relation have to be symmetric or transitive, etc. There are no definitive answers to these questions; in this sense the whole concept of clustering is a bit fuzzy. Of course we may define strict rules for what is a proper cluster and what is not, but such rules will inevitably be application-specific instead of global truths. Some issues worth considering in defining a good cluster are the justification behind the definition and the amount of computation needed to determine whether something is indeed a good cluster [2].

Most clustering algorithms will produce clusters regardless of the data - even for uniformly random data (see e.g. [4]). In some applications it may be desirable to leave some of the data points, called outliers, outside of all clusters, as they are regarded as noise or erroneous data. It is however very difficult and error prone to deem a data point an outlier, as ignoring it may cause significant changes in the way the rest of the data points are clustered. Baeza-Yates et al. [2] mention as an example that large holes in the ozone layer went unnoticed for a while as they were disregarded and classified as outliers when analyzing the measurement data.

Some common clustering algorithms produce a fixed number of clusters, where the number of clusters is given as a parameter to the algorithm. With such an algorithm the user must either have some a priori information on the suitable number of clusters or needs to perform the clustering several times for different numbers of clusters to find the most "convincing" clustering. An example of such methods is the K-means clustering algorithm, in which K points in the space in which the data points are spread are chosen as cluster centers and each of the N data points it assigned to a particular center. The resulting clusters are the sets formed at each center. The algorithm is iterated to minimize the total distance from the data points to their cluster centers. If K and the number of iterations L are fixed, this algorithm performs in time O(NKL), which is linear with respect to the number of data points, and requires O(N+K) space.

In addition to the problem of determining beforehand the number of clusters to produce, many algorithms also require setting a threshold value that determines the boundaries of the clusters, i.e., how close in the sense of the defined proximity measure do two data points need to be in order to be classified into the same cluster. Different values of the threshold will yield different clusterings. For example, if the task is to cluster points on a plane, as those of Figure 1 below, it must be determined the desired distance below which data points are included in the same cluster. The threshold value may either be set for the absolute distance or some relative measure. In practice, an ideal algorithm would not require such a threshold but would rather interpret it dynamically depending on the input data. However, in some application areas, naturally defined threshold values exist and hence clustering algorithms that need them are also justified.

Points on a
plane

Figure 1: A set of points on a plane (on the left) can be clustered in various ways, depending on the objective function employed. One possible clustering is shown (on the right) with the dotted lines, another by the colors of the dots. Some measure of physical proximity is generally resorted to in simple clustering setups such as this.

In image processing, the task may be for example to group hand-written letters according to the actual symbol that each one corresponds to, in order to properly interpret the written text. Each symbol can be captured in a small image file, which in turn can be represented by a binary matrix that determines the pixels that are "colored" and those that are "blank". It is the task of some other information processing tool to decide where one character ends and another begins, if some more sophisticated input method than a one-letter-at-a-time interface as that of the early Palm PDAs is being employed. Data points representing the different matrices could then be assigned coordinates depending on their structure and thereafter clustered based on proximity.

Other applications include image segmentation, which allow to reduce noise in images and to emphasize meaningful entities. This is discussed in detail in Section 6.1 of [4]. Clustering is also useful in genome comparison (see e.g. [5]), which is important in recognizing proteins, for example. Also in information retrieval, especially in data mining which is the task of extracting useful information from massive amounts of data, clustering methods help to organize the data into a format that is more approachable for people.

General clustering methods

Usually a clustering algorithm produces clusters for all of the data points simultaneously. Some exceptions exist, where the data points are clustered sequentially such that the cluster of one will be completely decided before another is considered; such methods are called incremental clustering algorithms. The general approach is to "move the points around" until some object function is satisfied. Clustering algorithms can be classified on the basis of whether the clustering produced by an algorithm in unambiguous or allows for several alternative interpretations.

Partitional clustering

Clustering algorithms that classify the given data into a single collection of subsets are called partitional. If the subsets of the collection, i.e. the clusters, do not overlap, the division is called crisp - methods in which one data point may belong to more than one cluster are called fuzzy [2]. In mathematical terms, if the subsets do not overlap, they form a partition of the data set. Overlapping subsets whose union contains all elements of the data set form a cover of the set.

An example of partitional clustering is the Squared Error approach (see e.g. Section 5.2.1 of [4]), in which that partitioning the minimizes the following measure will be chosen as the clustering. From each subset of a given partitioning, a centroid is calculated. In simple terms, if the point are on a plane, the centroid is the center of mass of the data points of the subset. For each data point, represented by a coordinate vector, the sum of squared differences between the data vector and the centroid is calculated. Summing these over the entire set yields the squared error of the cluster, and summing over all clusters we obtain the squared error of that particular partition. Then the partition with minimum squared error is chosen. Some limits on the set of feasible partitions need to be imposed, as certainly assigning each data point an own cluster produces zero error.

Hierarchical clustering

Many clustering methods do not produce one absolute clustering but several different "levels" going from rather coarse clustering (an extreme case being one cluster for all of the data points) to an excessively fine clustering (an extreme case being a separate cluster for each data point). Methods that produce such a hierarchy of clusterings are naturally called hierarchical methods. The hierarchical structure is commonly represented by a diagram such as that of Figure 2, called a dendrogram. It depends on the application which is more feasible, i.e., whether an absolute definition of a good cluster is at all possible or whether the user simply uses his or her own judgment to choose between the possible clusterings provided by a hierarchical method.

An example dendrogram.

Figure 2: An example of a dendrogram. Above each dotted line, a different set of clusters may be read. On the top, all points belong to the same cluster, whereas on the bottom, each point has a cluster of its own.

Clustering in metric spaces

Often the only data given as an input to a clustering algorithm is a matrix of distances between the data points. The distance measure of course may vary. A metric space is (in this context, using the notations of [2]) a universe X together with a distance function d(u, v) that is a metric. A metric is a function that fulfills the following three conditions:

We only consider a finite subset U of the universe X, such that |U | = N. Hence the clustering is a partition (or a cover) of U such that the distance between points in the same cluster is "as small as feasible" and the distance between points in different clusters is "as large as possible". Note that partitioning the subset of the universe in which the points are located also partitions $\dataset$ and hence produces the desired clustering. For practical applications, the distance function d should be light to compute especially for large |U|.

When a feasible metric d has been defined on X, the clusters in the data set P contained in U may be formed for example by selecting the K nearest points of each data point to be its cluster (this is very likely to produce a fuzzy clustering and the value of K should be reasoned out by the user), or by connecting iteratively the points nearest to each other into the same initial cluster until all points have been connected to at least one other point (which produces a crisp clustering, possibly with just one cluster).

A common choice for a metric in continuous spaces is the Euclidean distance: for two K-element vectors, this is the square root of the sum of the square of the pairwise difference of each of the K positions. For a 2-dimensional example, the Euclidean distance of u = (4, 5) and v = (3, 2) is the square root of (4-3)˛ + (5-2)˛. Although clustering can very well be conducted using a proximity measure that is not a metric, metric are usually chosen. An example of a method that uses a non-metrical proximity measure to assign points to clusters is a method by Gowda and Krishna (see [4] and the references therein). With respect to each data point, all other data points are numbered from 1 to N-1 in increasing order of some distance measure (that may very well be a metric), such that the nearest neighbor is assigned value 1 and the farthest point is assigned value N-1. We denote by NN(u, v) the number of data point v with respect to u. The mutual neighborhood distance MND, which is the non-metrical measure used to define the clusters, is defined as

MND(u, v) = NN(u, v) + NN(v, u).

This is obviously symmetric, and by defining NN(u, u) = 0, also the reflexive condition is fulfilled. However, the triangle equality is not satisfied and hence MND is not a metric. In this method, a threshold function needs to be defined to decide how large values of MND are acceptable within a cluster.

Clustering string data

When clustering "multimedia data", no intuitive definition of distance is readily available. When discussing e.g. text documents, there is no dimensional information or obvious coordinate vector attached to the data. Many transformations however exist to introduce a proximity measure, usually even a distance metric for such data. The Hamming distance of two vectors, binary words, strings or practically any two sequences that have the same length is the number of positions in which their contents differ. This can be used as a proximity measure for clustering. However, the requirement that the sequences have the same length is somewhat limiting. Of course one could embed the shorter sequence with some extra character, which would raise the Hamming distance by the number of added symbols, but also another widely used measure for comparing sequences exists: the Levenshtein distance, commonly known as edit distance. Edit distance is the smallest number of atomic operations listed below that are needed to transform sequence A = (a_1, a_2, ..., a_s) to sequence B = (b_1, b_2, ..., b_t):

This can be shown as a metric fairly easily with induction and calculated by dynamic programming in O(st) time. Also generalizations to different sets of atomic operations exist. In addition to edit distance, also other similarity measures may prove useful. In the case of text documents, also pronunciation may be significant in determining whether two words are similar. If the person writing the document has made an error in typing, the correct word is likely to have small edit distance from the erroneous version, but if the person has mixed up words that only sound similar, another approach is needed. This is particularly true for languages that have complicated stemming and conflation rules, such as Finnish. For example, all of the following words have the same stem talo:

talot talossa talosta taloon

Similarly in English: hous is the common stem of house, houses, and housing. At times it does not suffice to just take the best common prefix or suffix, or even the longest common substring. A stemmer that aims to take into account the structural properties of words is called morphological. Such a stemmer can be based on the holomorphic distance, which is in turn based on feature extraction on subsequences of the original string. The following example showing 10 nearest neighbors in a dictionary for the Spanish word "apetitosa" (which means tasty, appetizing, or tempting), illustrating the difference between a holomorphic distance and the edit distance (for more details, see Section 3 of [2]).

Holom. dist.TranslationEdit distanceTranslation
apetitosatastyapetitosatasty
apetitoappetiteapetitosotasty
apetitosotastyaceitosaoily
apetiteappetiteapestosasickening
apetitivaappetizingapetitivaappetizing
apetitivoappetizingapetitoappetite
apetibletemptingaceitosooily
apetecercraveacetososour
apetecedortemptingalentosareassuring
apetenciahunger (for)aparatosapretentious

Also much data that originally is not in textual format can be transformed into a string, such as the DNA sequences. Any binary vector or matrix can also be handled by string methods. Hence also applications of the edit distance are useful even though they do not grasp the finesse of human languages.

Jain et al. [4] consider clustering books, defining the similarity of two books as the similarity of classification strings, such as the ACM CR classification labels. Using the Hamming distance or the edit distance for this is infeasible, as the strings are extremely long and both measures ignore semantic of the classification, and hence new distance measures need to be defined. For the ACM CR labels, using the ratio of the length of the longest common prefix to the length of the first string works. Examples of such labels are H242, I233, I522, where the symbols represent nodes of a tree that contains the classes used.

Parts of the
ACM CR tree.

Figure 3: Some branches of the ACM Computing Classification System tree, from which the ACM CR labels are derived.

Clustering in graphs

In a graph G = (V, E), the data points are represented by the vertex set V and the semantic connections between them by the edge set E. For many different types of data, fluent transformations into graph format exist. For example, the Delaunay graph of a set of points on a plane can be constructed by representing each point by a vertex and placing an edge between each pair of points that are Voronoi neighbors (see [4] and the references therein). The Voronoi neighborhood relation is defined by partitioning the plane that contains N data points into N convex polygons such that the following conditions hold:

The resulting diagram of points are polygons is called a Voronoi diagram or a Dirichlet tessellation. Two points are Voronoi neighbors if their respective polygons share a part of their border. This can also be generalized into higher dimensions than the plane. An illustrative Java Applet that shows the Voronoi polygons and also upon request the edges of the Delaunay graph is available at http://www.cs.cornell.edu/Info/People/chew/Delaunay.html.

Voronoi neighbors.

Figure 4: An example Voronoi diagram (on the left, approximated by hand) and the corresponding Delaunay graph (on the right, sketched with a light-blue dotted line on the Voronoi diagram). The relative positions of the dots in both the Voronoi diagram and the Delaunay graph are the same.

A common approach to cluster graphs is using a minimum spanning tree of the graph (see e.g. [4] and the references therein). A spanning subgraph of a graph is essentially a subset of S of the edge set E such that by traversing only edges that are included in S, one may reach an arbitrary vertex in V from any other vertex. If the spanning subgraph does not contain any cycles, it is a spanning tree. If the edges have been assigned weights of some kind, as often is the case when complex data is represented by a graph, the minimum spanning tree is the one where the sum of the weights over S is the smallest. The minimal spanning tree can be employed to locate clusters by deleting the edges in decreasing order of their weights.

MST clustering

Figure 5: For the example graph, a minimum-weight spanning tree is found by the greedy algorithm (drawn thicker and green in the first picture). Then all other edges of the graph are disregarded (second picture). In the consequent pictures, the edges of the spanning tree are removed starting from the one with the largest weight, eventually splitting the graph hierarchically in clusters that are the connected components of the remaining graph. The process could be continued until each vertex is in a singleton cluster.

Denoting the number of vertices of a graph G = (V, E) by N and the number of edges by M, the density of a simple undirected graph is defined as the fraction M to the maximal number of edges possible N(N-1)/2. Density may also be defined for any induced subgraph, which is formed by a set S of vertices and all edges that have both endpoints in S. In simple terms, a cluster in a graph can be for example a "surprisingly dense induced subgraph". Some authors have only considered the maximal cliques of graphs as proper clusters. A clique is a subgraph of H vertices in which all of the possible H(H-1)/2 edges connecting the H vertices are present, also known as a complete subgraph. A maximal clique is such an H in which no other vertex can be added without losing the completeness property (see e.g. Section 5.2.2 of [4]).

Mihail et al. [7] consider the relative density of a subgraph, defined as the number of edges connecting vertices that both belong to the cluster divided by the number of edges that are at all incident to the cluster. They use spectral methods to identify clusters that have a high relative density. The spectrum of a graph is the vector of its eigenvalues (i.e. the eigenvalues of the adjacency matrix of the graph) and hence computed globally for the entire graph. The computation of eigenvalues can be quite heavy, but fortunately Mihail et al. do not employ the entire spectrum.

Another way to think about clusters in graphs is through connectivity: how many distinct paths exist between a pair of vertices. For some vertices to belong to the same cluster, they should be highly connected to each other, whereas there should not be many paths connecting them to vertices outside the cluster. On such approach is presented in [3], where the authors iteratively split the graph in two parts such that there are as few edges as possible between the parts (i.e. by removing the minimum cut of the graph), considering a remaining part of H vertices a proper cluster if more than H/2 edges would be needed to split it further.

Many clustering approaches for graphs assign weights to the edges of the graph in some manner, then globally pruning the edges with respect to the chosen function until the graph has decomposed into satisfactory clusters. Often a hierarchical structure of clustering emerges, from which the user is to select a proper phase for the application at hand. One possibility for such a weight function is edge-betweenness, which means in common terms the number of shortest paths connecting any pair of vertices that passes through the edge in question (see e.g. [8]). It can be argued that edges with high betweenness must be intercluster edges instead of being within a cluster. Hence a clustering hierarchy emerges if at each iteration the edge with highest betweenness is removed and then the betweenness calculation is repeated [8].

In our own work we consider a different approach that does not require knowledge of the entire graph, as we attempt to find the cluster for only one vertex at a time with local search [9]. We use simulated annealing by Kirkpatrick et al. [6], but also many other approaches exist. Such an approach is interesting because of the massive size of some application networks, such as the World Wide Web, which is certainly infeasible to cluster with a global method. The results of this ongoing work have not yet been published; some preliminary discussion will be included the the Licentiate's thesis [9] of the author, currently submitted for grading. We suggest a fitness function that essentially aims to locate the optimal subgraph containing the start vertex with respect to the ratio of edges within the cluster to the number of edges connecting the candidate cluster to other vertices in the graph.

Concluding remarks

In addition to the methods discussed in this report, also artificial neural networks are commonly used tools for clustering. It would also be possible to perform clustering by genetic algorithms (also known as evolutionary algorithms). In this report we have skipped these interesting branches for brevity; references to such work are available in [4]. We close the review in the closing remarks of Jain et al. [4] that successfully capture the spirit of clustering problems:

"In summary, clustering is an interesting, useful, and challenging problem. It has great potential in applications like object recognition, image segmentation, and information filtering and retrieval. However, it is possible to exploit this potential only after making several design choices carefully."

References

  1. The ACM Computing Classification System. Association for Computing Machinery (ACM), 1998.
    Available online at the ACM website, http://www.acm.org/class/.
  2. Baeza-Yates, R., Chávez, E., Herrera, N., and Navarro, G., Clustering in Metric Spaces with Applications in Information Retrieval. In Wu and Xiong Eds. "Information Retrieval and Clustering", Kluwer Academic Publishers (to appear on 2003).
    Available on-line at Gonzalo Navarros' homepage, http://www.dcc.uchile.cl/~gnavarro/publ.html.
  3. Hartuv, E., and Shamir, R., A clustering algorithm based on graph connectivity. Information Processing Letters, Vol. 76, No. 4-6, pp. 175-181, 2000.
    Available on-line via http://www.sciencedirect.com/ from the HUT domain.
  4. Jain, A.K., Murty, M.N., and Flynn, P.J., Data Clustering: A Review. ACM Computing Surveys, Vol. 31, No. 3, September 1999.
    Available on-line via http://portal.acm.org from the HUT domain.
  5. Kim, S., Graph Theoretic Sequence Clustering Algorithms and Their Applications to Genome Comparison. In Wang, J.T.L., Wu, C.H., and Wang P.P Eds. "Computational Biology and Genome Informatics", Chapter 4. World Scientific Publishing Company, Singapore, (to appear on 2003).
    Available on-line at Sun Kim's homepage at http://bio.informatics.indiana.edu/sunkim/papers/revised-ch4.doc.
  6. Kirkpatrick, S., Gelatt, C.D. Jr, and Vecchi, M.P., Optimization by Simulated Annealing. Science, Vol. 220, No. 4598, pp. 671-680, 1983.
    Available on-line via http://www.sciencedirect.com/.
  7. Mihail, M., Gkantsidis, C., Saberi, A., and Zegura, E., On the semantics of Internet topologies. Technical report GIT-CC-02-07, College of Computing, Georgia Institute of Technology. January, 2002.
    Available on-line at the institute's website, http://www.cc.gatech.edu/tech_reports/index.02.html.
  8. Newman, M.E.J, and Girvan, M., Mixing patterns and community structure in networks. In Pastor-Satorras, R., and Rubi, J. Eds. "Proceedings of the XVIII Sitges Conference on Statistical Mechanics", Springer-Verlag, Berlin, Germany, 2002/2003.
    Available on-line via the cond-mat archive, http://arxiv.org/abs/cond-mat/0210146/.
  9. Virtanen, S.E., Properties of nonuniform random graph models. Draft of a Licentiate's Thesis, Laboratory for Theoretical Computer Science, Helsinki University of Technology, 2003.
    Available on-line at http://www.tcs.hut.fi/~satu/draft.ps.


Latest update: 15 March 2003.