Session Program

 

  • 11 July 2017
  • 04:00PM - 06:00PM
  • Room: Auditorium
  • Chairs: José Antonio Sanz and Mikel Galar

Fuzzy Methods and Data Mining III: Clustering

Abstract - The existence of large volumes of time series data in many applications has motivated data miners to investigate specialized methods for mining time series data. Clustering is a popular data mining method due to its powerful exploratory nature and its usefulness as a preprocessing step for other data mining techniques. This article develops two novel clustering algorithms for time series data that are extensions of a crisp c-shapes algorithm. The two new algorithms are heuristic derivatives of fuzzy c-means (FCM). Fuzzy c-Shapes plus (FCS+) replaces the inner product norm in the FCM model with a shape-based distance function. Fuzzy c-Shapes double plus (FCS++) uses the shape-based distance, and also replaces the FCM cluster centers with shape-extracted prototypes. Numerical experiments on 48 real time series data sets show that the two new algorithms outperform state-of-the-art shape-based clustering algorithms in terms of accuracy and efficiency. Four external cluster validity indices (the Rand index, Adjusted Rand Index, Variation of Information, and Normalized Mutual Information) are used to match candidate partitions generated by each of the studied algorithms. All four indices agree that for these finite waveform data sets, FCS++ gives a small improvement over FCS+, and in turn, FCS+ is better than the original crisp c-shapes method. Finally, we apply two tests of statistical significance to the three algorithms. The Wilcoxon and Friedman statistics both rank the three algorithms in exactly the same way as the four cluster validity indices.
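The abstract above does not state the distance formula, so the following is only a minimal sketch under an assumption: the shape-based distance is taken to be one minus the maximum normalized cross-correlation over all shifts, the form used by crisp shape-based clustering methods. The names `shape_based_distance` and `fuzzy_memberships` are hypothetical. The membership update is the standard FCM formula, applied here to an arbitrary distance matrix.

```python
import numpy as np

def shape_based_distance(x, y):
    """Assumed form: 1 - max normalized cross-correlation over all shifts."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0.0:
        return 1.0  # convention chosen here for all-zero series
    cc = np.correlate(x, y, mode="full")  # cross-correlation at every lag
    return 1.0 - cc.max() / denom

def fuzzy_memberships(dist, m=2.0, eps=1e-12):
    """Standard FCM update u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1)).

    dist has shape (c, n): distance from each of c prototypes to n series.
    """
    d = np.maximum(np.asarray(dist, dtype=float), eps)  # avoid division by zero
    power = 2.0 / (m - 1.0)
    ratio = (d[:, None, :] / d[None, :, :]) ** power  # ratio[i, j, k] = (d_ik / d_jk)^power
    return 1.0 / ratio.sum(axis=1)
```

Under this assumption, a series and a time-shifted copy of itself have a distance close to zero, and each column of the membership matrix sums to one.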
Abstract - This paper proposes a new iterative fuzzy clustering (IFC) algorithm to impute missing values of datasets. The information provided by fuzzy clustering is used to update the imputed values through iterations. The performance of the IFC algorithm is examined by conducting experiments on three commonly used datasets and a case study on a city mobility database. Experimental results show that the IFC algorithm not only works well for datasets with a small number of missing values but also provides an effective imputation result for datasets where the proportion of missing data is high.
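The abstract above gives no algorithmic detail, so the following is a hedged illustration of one way clustering information could drive iterative imputation, not the authors' IFC algorithm. It assumes missing entries are marked as NaN, starts from column means, and then repeatedly replaces each missing entry with the membership-weighted mix of the fuzzy c-means cluster centers. The function names and parameters are hypothetical.

```python
import numpy as np

def fcm(X, c, m=2.0, iters=50, seed=0):
    """Plain fuzzy c-means with Euclidean distance; returns (U, V)."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0)  # memberships of each point sum to 1
    for _ in range(iters):
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)  # cluster centers, shape (c, d)
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)
        d = np.maximum(d, 1e-12)
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=0)
    return U, V

def impute_iteratively(X, c=2, rounds=10):
    """Assumed scheme: refine NaN entries from fuzzy memberships each round."""
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.nonzero(missing)[1])  # initial guess
    for _ in range(rounds):
        U, V = fcm(X, c)
        estimate = U.T @ V  # per-row weighted combination of centers, shape (n, d)
        X[missing] = estimate[missing]  # observed entries are never modified
    return X
```

For example, `impute_iteratively([[1.0, 2.0], [1.2, float("nan")], [8.0, 9.0]], c=2)` returns a complete array in which only the NaN entry has been changed.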
Abstract - Identifying protein complexes is of great significance for studying protein mechanisms in different cellular systems, and many computational approaches have been proposed to solve the problem. Yet few of them attempt to discover overlapping protein complexes, which is crucial to improving accuracy. Hence, in this paper, we explore the feasibility of using a fuzzy clustering approach to identify overlapping protein complexes in a natural manner. To do so, we first formulate the identification problem as an optimization problem by following certain intuitions, and then develop an algorithm to solve it, so that the memberships of each protein in different protein complexes can be optimized to infer the protein complexes of interest. The experimental results on several yeast protein interaction networks show that our algorithm is promising in terms of accuracy.
Abstract - In recent years, there has been great interest in the field of Machine Learning in data structures such as sequences, trees and graphs. In this work, an unsupervised recursive learning schema for structured data clustering is introduced. The schema allows data organized in trees to be processed for both tree-focused and node-focused applications. The clustering approach is derived from the schema by using a Fuzzy C-Means algorithm. Some experiments are presented to show its performance and to compare it with other approaches known in the literature for node-focused clustering.
Abstract - Multimedia data, particularly digital videos, which contain various modalities (visual, audio, and text), are complex and time-consuming to model, process, and retrieve. Therefore, efficient methods are required for the retrieval of such complex data. In this paper, we propose a multimodal query-level fusion approach using a fuzzy cluster-based learning method to improve the retrieval performance of multimedia data. Experimental results on a real dataset demonstrate that employing fuzzy clustering achieves notable improvement in concept-based query retrieval performance.
Abstract - Data visualization has always been a vital tool to explore and understand underlying data structures and patterns. However, emerging technologies such as the Internet of Things (IoT) have enabled the collection of very large amounts of data over time. The sheer quantity of data available challenges existing time series visualization methods. In this paper we present an introductory analysis of time series clustering with a focus on a novel shape-based measure of similarity, which is invariant under uniform time shift and uniform amplitude scaling. Based on this measure we develop a Visual Assessment of cluster Tendency (VAT) algorithm to assess large time series data sets and demonstrate its advantages in terms of complexity and propensity for implementation in a distributed computing environment. This algorithm is implemented as a cloud application using Spark, where the run-time of the high-complexity dissimilarity matrix calculations is reduced by a factor of up to 7.0 in a 16-core computing cluster, with even higher speed-up factors expected for larger computing clusters.
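As a rough illustration of the VAT idea mentioned above, the sketch below applies the classic VAT reordering to a precomputed dissimilarity matrix. It does not implement the paper's shape-based measure or its Spark implementation; the function name `vat_order` is hypothetical, and any symmetric dissimilarity matrix can be passed in.

```python
import numpy as np

def vat_order(D):
    """Classic VAT reordering of a square dissimilarity matrix D.

    Returns the ordering and the reordered matrix; dark diagonal blocks in an
    image of the reordered matrix suggest cluster structure.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Start from one endpoint of the largest dissimilarity.
    first = int(np.unravel_index(np.argmax(D), D.shape)[0])
    order = [first]
    remaining = [k for k in range(n) if k != first]
    while remaining:
        # Pick the remaining object closest to any already-ordered object.
        sub = D[np.ix_(order, remaining)]
        nxt = remaining[int(np.argmin(sub.min(axis=0)))]
        order.append(nxt)
        remaining.remove(nxt)
    order = np.array(order)
    return order, D[np.ix_(order, order)]
```

The ordering step is far cheaper than building the matrix itself when each pairwise comparison is expensive, which is consistent with the abstract's focus on distributing the dissimilarity calculations.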