Many commonly used data-mining techniques utilized across research fields perform poorly when used for large data sets. Sequential agglomerative hierarchical non-overlapping clustering is one technique for which the algorithms’ scaling properties prohibit clustering of a large amount of items. Besides the unfavorable time complexity of O(n2), these algorithms have a space complexity of O(n2), which can be reduced to O(n) if the time complexity is allowed to rise to O(n2 log2n). In this paper, we propose the use of locality-sensitive hashing combined with a novel data structure called twister tries to provide an approximate clustering for average linkage. Our approach requires only linear space. Furthermore, its time complexity is linear in the number of items to be clustered, making it feasible to apply it on a larger scale. We evaluate the approach both analytically and by applying it to several data sets.
Publication of Conference SIGMOD/POD’15 International Conference on Management of Data, Melbourne, May 31-June 4, 2015
Michael Cochez, Hao Mou (University of Jyväskylä): Twister Tries: Approximate Hierarchical Agglomerative Clustering for Average Distance in Linear Time