Data Warehouse
Clustering, in computer science, is the classification of data or objects into different groups. It can also be described as the partitioning of a data set into subsets, where the items in each subset ideally share some common traits.
Data clusters are created to meet specific requirements that cannot be met using any of the categorical levels. One can combine data subjects into a temporary group to form a data cluster.
Data clustering is used in many exploratory processes, including pattern analysis, decision making and grouping. It is also heavily used in machine learning settings such as data mining, image segmentation, document retrieval and the classification of data patterns.
There are two general categories of algorithms used in data clustering: hierarchical and partitional. Hierarchical algorithms work by finding successive clusters with the use of previously established clusters. They can be further subcategorized as agglomerative ("bottom-up") or divisive ("top-down"). Partitional algorithms, on the other hand, work by determining all clusters at once and then partitioning the data accordingly.
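As an illustration of the partitional family, the following is a minimal sketch of k-means, one well-known partitional method that the text above does not name. The function name, the sample points and the choice of the first k points as starting centroids are assumptions made only for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(points, k, iterations=10):
    """Partition 2-D points into k clusters (illustrative sketch only)."""
    # Assumption: seed the centroids with the first k points to keep
    # the example deterministic; real implementations choose more carefully.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: every point is assigned to its nearest centroid,
        # so all clusters are determined at once.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centroids


# Hypothetical sample data: two visually separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
```

Unlike a hierarchical method, this sketch never builds on earlier clusters: each pass reassigns every point from scratch against the current centroids.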
Within the data clustering taxonomy, the following issues exist:
Agglomerative vs. Divisive: This issue refers to the algorithmic structure and operation of data clustering. The agglomerative approach starts with each pattern in a distinct cluster (a singleton) and then successively merges clusters until a stopping condition is met. The divisive approach starts with all patterns in a single cluster and then splits it until a stopping condition is satisfied.
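The agglomerative approach can be sketched in a few lines. This example assumes one-dimensional patterns, a single-linkage rule (the distance between two clusters is the distance between their closest members) and a target number of clusters as the stopping condition; none of these choices comes from the text above, and the function name is hypothetical.

```python
def single_linkage(values, target_clusters):
    """Merge singleton clusters bottom-up until target_clusters remain."""
    # Each pattern starts in its own cluster (a singleton).
    clusters = [[v] for v in values]

    def cluster_distance(a, b):
        # Single linkage: distance between the two closest members.
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > target_clusters:
        # Find the pair of clusters that are closest to each other.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        # Merge cluster j into cluster i and drop the now-redundant entry.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical sample data.
values = [1.0, 1.5, 2.0, 10.0, 10.4, 25.0]
result = single_linkage(values, target_clusters=3)
# result == [[1.0, 1.5, 2.0], [10.0, 10.4], [25.0]]
```

A divisive version would run the other way, starting from one cluster holding all six values and splitting it until the same condition holds.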
Monothetic vs. Polythetic: This issue refers to the sequential or simultaneous use of features in the process of clustering. Most data clustering algorithms are polythetic in nature, meaning that all features enter into the computation of distances between patterns. The monothetic approach is simpler. It considers features sequentially, using one feature at a time to divide the given group of patterns.
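The difference can be made concrete with a small sketch. The two-feature patterns, the split thresholds and the function names below are all hypothetical and serve only to contrast the two styles.

```python
import math

# Hypothetical patterns with two features each.
patterns = [(1.0, 5.0), (1.2, 9.0), (4.0, 5.5), (4.1, 8.7)]


def polythetic_distance(a, b):
    """Polythetic: every feature contributes to a single distance value."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def monothetic_split(group, feature, threshold):
    """Monothetic: divide a group by looking at one feature only."""
    low = [p for p in group if p[feature] <= threshold]
    high = [p for p in group if p[feature] > threshold]
    return low, high


# Polythetic use: both features are combined in one computation.
d = polythetic_distance(patterns[0], patterns[2])

# Monothetic use: split on feature 0 first, then split each part on feature 1.
left, right = monothetic_split(patterns, feature=0, threshold=2.5)
groups = [
    part
    for half in (left, right)
    for part in monothetic_split(half, feature=1, threshold=7.0)
]
# groups == [[(1.0, 5.0)], [(1.2, 9.0)], [(4.0, 5.5)], [(4.1, 8.7)]]
```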
Hard vs. Fuzzy: Hard clustering allocates each pattern to a single cluster, both during the clustering operation and in its final output. Fuzzy clustering, on the other hand, assigns each input pattern degrees of membership in several clusters. A fuzzy clustering can be converted to a hard clustering by assigning each pattern to the cluster in which it has the largest measure of membership.
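The conversion from fuzzy to hard clustering amounts to picking, for each pattern, the cluster with the highest membership degree. A minimal sketch, assuming the memberships are already available as one row per pattern (the values and the function name are made up for illustration):

```python
def harden(memberships):
    """Return, for each pattern, the index of its highest-membership cluster."""
    return [
        max(range(len(row)), key=lambda cluster: row[cluster])
        for row in memberships
    ]


# Hypothetical membership degrees: one row per pattern, one column per cluster.
fuzzy = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
]
hard = harden(fuzzy)
# hard == [0, 2, 1]
```

If two clusters tie for the largest membership, this sketch keeps the lower cluster index; a real method would need an explicit tie-breaking rule.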
Incremental vs. Non-incremental: This issue arises when the pattern set to be clustered is very large and constraints on memory space or execution time affect the algorithm's architecture.
The advent of data mining, where relevant data need to be extracted from billions of disparate records within one or more repositories, has furthered the development of clustering algorithms designed to minimize the number of scans over the data and thereby reduce the load on servers. Incremental clustering is based on the assumption that patterns can be considered one at a time and assigned to existing clusters.
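A single-pass scheme of this kind can be sketched as follows. The example assumes one-dimensional patterns and a simple "leader" rule: a pattern joins the first existing cluster whose first member lies within a distance threshold, and otherwise starts a new cluster. The threshold, the rule for opening new clusters and the function name are assumptions for illustration, not details given in the text.

```python
def leader_clustering(stream, threshold):
    """Cluster values in one pass, looking at each pattern exactly once."""
    leaders = []   # first member of each cluster, used as its representative
    clusters = []  # members of each cluster, parallel to leaders
    for value in stream:
        for index, leader in enumerate(leaders):
            if abs(value - leader) <= threshold:
                # Close enough to an existing cluster: assign and move on.
                clusters[index].append(value)
                break
        else:
            # No existing cluster is close enough: start a new one.
            leaders.append(value)
            clusters.append([value])
    return clusters


# Hypothetical data; a generator stands in for records read from a repository.
stream = (v for v in [1.0, 1.2, 9.0, 0.8, 9.5, 20.0])
clusters = leader_clustering(stream, threshold=1.0)
# clusters == [[1.0, 1.2, 0.8], [9.0, 9.5], [20.0]]
```

Because the data is scanned only once and only the cluster leaders are compared against each new pattern, memory use and the number of passes stay small. The trade-off is that the result depends on the order in which patterns arrive.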
The process of data clustering is sometimes closely associated with such terms as cluster analysis, automatic classification and numerical taxonomy.