The study of social networks has advanced significantly in recent years, with the development of sophisticated techniques for social network analysis and mining, driven by strong demand from Web 2.0 applications: social web sites, e-mail and instant messaging systems. Applications include classification systems (text classification, churn prediction, ...), the detection of user communities, and recommendation systems. Social network analysis faces difficult problems, such as modeling the nature of social interactions, mining structured data (social graphs, text, heterogeneous data), and understanding the dynamics of social networks. Moreover, these applications typically generate huge datasets, with networks counting several million nodes, and the mining algorithms have to deal with the data using limited computing resources. In this communication, we present several problems arising in social network analysis, describe some recent advances, and give some examples showing how social graph encoding can improve data mining tasks.
An Optimization Methodology for Neural Network Weights and Architectures. This talk introduces a methodology for neural network global optimization. The aim is the simultaneous optimization of multilayer perceptron (MLP) network weights and architectures, in order to generate topologies with few connections and high classification performance for a variety of data sets. The approach combines the advantages of simulated annealing, tabu search and the backpropagation training algorithm in order to generate an automatic process for producing networks with high classification performance and low complexity. Experimental results obtained with four classification problems and one prediction problem are better than those obtained by the most commonly used optimization techniques. Considering the data sets used in the work presented in this talk, the methodology was able to automatically generate MLP topologies with many fewer connections than the maximum number allowed. The results also yield interesting conclusions about the importance of each input feature in the classification and prediction tasks. The proposed methodology was not originally designed to deal with different numbers of hidden layers, but it does work with them, and some experiments were made with more than one hidden layer. In any case, a decision needs to be made about the size of the initial topology; in the experiments reported here, the initial topologies have only one hidden layer with all possible feedforward connections.
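The talk's method combines simulated annealing, tabu search and backpropagation; as a rough illustration of the architecture-search ingredient alone, here is a minimal simulated-annealing sketch over binary connection masks. The cost function, which in the actual methodology would combine validation error and connection count, is left abstract; all names and defaults are illustrative, not the talk's implementation:

```python
import math
import random

def anneal_topology(n_in, n_hidden, n_out, cost_fn, t0=1.0, cooling=0.95, steps=200):
    # Binary mask over all feedforward connections of a one-hidden-layer MLP;
    # simulated annealing searches for a low-cost (sparse yet accurate) mask.
    n_conn = n_in * n_hidden + n_hidden * n_out
    mask = [1] * n_conn                      # start fully connected
    cur = best = cost_fn(mask)
    best_mask = mask[:]
    t = t0
    for _ in range(steps):
        cand = mask[:]
        cand[random.randrange(n_conn)] ^= 1  # toggle one connection
        c = cost_fn(cand)
        # accept improvements always, worsenings with Boltzmann probability
        if c < cur or random.random() < math.exp((cur - c) / t):
            mask, cur = cand, c
            if cur < best:
                best, best_mask = cur, mask[:]
        t *= cooling                         # geometric cooling schedule
    return best_mask, best
```

In practice `cost_fn` would train the masked network briefly with backpropagation and return a penalized validation error; the tabu-search component (forbidding recently visited masks) is omitted here for brevity.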
The estimation of a data transformation is very useful for yielding response variables that closely satisfy a normal linear model. Generalized linear models enable the fitting of models to a wide range of data types. These models are based on exponential dispersion models. We propose a new class of transformed generalized linear models to extend the Box and Cox models and the generalized linear models. We use the generalized linear model framework to fit these models and discuss maximum likelihood estimation and inference. We give a simple formula to estimate the parameter that indexes the transformation of the response variable for a subclass of models. We also give a simple formula to estimate the r-th moment of the original dependent variable. We explore the possibility of applying these models to time series data to extend the generalized autoregressive moving average models discussed by Benjamin et al. [Generalized autoregressive moving average models. J. Amer. Statist. Assoc. 98, 214–223]. The usefulness of these models is illustrated in a simulation study and in applications to three real data sets.
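As background for the parameter-estimation result, the classical Box-Cox transformation parameter can be estimated by maximizing its profile log-likelihood. A minimal NumPy sketch using a grid search on simulated log-normal data (for which the true transformation parameter is 0, i.e. the log transform); function names and the grid are illustrative, not the talk's estimator:

```python
import numpy as np

def boxcox_loglik(y, lam):
    # Profile log-likelihood of the Box-Cox parameter
    # (normal model, with the variance profiled out).
    n = len(y)
    z = np.log(y) if abs(lam) < 1e-8 else (y ** lam - 1) / lam
    return -n / 2 * np.log(z.var()) + (lam - 1) * np.log(y).sum()

def boxcox_mle(y, grid=np.linspace(-2, 2, 401)):
    # grid search for the lambda maximizing the profile log-likelihood
    ll = [boxcox_loglik(y, lam) for lam in grid]
    return float(grid[int(np.argmax(ll))])

rng = np.random.default_rng(0)
y = rng.lognormal(0.0, 0.5, size=1000)  # log-normal responses: true lambda is 0
lam_hat = boxcox_mle(y)
```

In practice `scipy.stats.boxcox` returns the same maximum-likelihood estimate directly; the point of the talk's subclass formula is to avoid such numerical searches.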
Cluster analysis has been widely used in numerous fields, including pattern recognition, data mining and image processing. Its aim is to organize a set of items into clusters such that items within a given cluster have a high degree of similarity, whereas items belonging to different clusters have a high degree of dissimilarity. In particular, partitioning clustering models aim to organize a set of items into a pre-defined number of clusters. Our reference clustering model is the partitioning dynamic cluster algorithm (Diday and Simon (1976)). These are iterative two-step relocation clustering algorithms involving, at each iteration, the construction of the clusters and the identification of a suitable representative or prototype (means, factorial axes, probability laws, etc.) of each cluster, by locally optimizing an adequacy criterion between the clusters and their corresponding prototypes. Often, the objects to be clustered are represented as vectors of quantitative features. However, the recording of interval data has become common practice in real-world applications, and nowadays this kind of data is often used to describe objects. Symbolic Data Analysis (SDA) is an area related to multivariate analysis, data mining and pattern recognition, which has provided suitable data analysis methods for managing objects described as vectors of intervals (Bock and Diday (2000)). In this presentation, we review partitioning clustering models and algorithms for interval-valued data, having as reference the dynamic clustering algorithm. For each clustering model, we give the clustering criterion, the best prototype of each cluster, the best distance associated with each cluster (if any), as well as the best partition into a fixed number of clusters. Moreover, various tools for partition and cluster interpretation of interval-valued data furnished by these algorithms are also presented.
Finally, in order to show the usefulness of these algorithms and the merit of the partition and cluster interpretation tools, experiments with real interval-valued data sets are given. References: H.H. Bock and E. Diday, editors (2000). Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Springer, Heidelberg. E. Diday and J.J. Simon (1976). Clustering analysis. In: K.S. Fu (Ed.), Digital Pattern Recognition. Springer, Heidelberg, 47–94.
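The two-step relocation scheme can be sketched for interval-valued data as follows. This is a minimal illustration assuming a squared L2 distance on interval bounds and interval prototypes given by component-wise means of the bounds, which is only one of the model choices reviewed in the talk; all names are illustrative:

```python
import random

def l2_interval(a, b):
    # squared L2 distance between two vectors of intervals [(lo, hi), ...]
    return sum((x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2 for x, y in zip(a, b))

def dynamic_clusters(items, k, iters=20, seed=0):
    # Two-step "dynamic clusters" relocation scheme:
    # allocation to the nearest prototype, then prototype re-estimation.
    rng = random.Random(seed)
    protos = [list(p) for p in rng.sample(items, k)]
    assign = [0] * len(items)
    for _ in range(iters):
        # allocation step: each object goes to its closest prototype
        assign = [min(range(k), key=lambda c: l2_interval(x, protos[c]))
                  for x in items]
        # representation step: component-wise mean of lower and upper bounds
        for c in range(k):
            members = [x for x, a in zip(items, assign) if a == c]
            if members:
                p = len(members[0])
                protos[c] = [(sum(m[j][0] for m in members) / len(members),
                              sum(m[j][1] for m in members) / len(members))
                             for j in range(p)]
    return assign, protos
```

Each iteration locally improves the adequacy criterion (here, the sum of squared distances of objects to their cluster prototypes), mirroring the general scheme of Diday and Simon.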
Plenty of machine learning algorithms have been proposed so far, for both supervised and unsupervised learning. Overall, these algorithms emphasize the metrical aspect more than the topological relations within the data set.
Abstract The first two modalities may be minimized (i) by investing in metering technology (GHAJAR, KHALIFE & RICHANI, 2000; GHAJAR & KHALIFE, 2003), (ii) through an efficient inspection program (CABRAL & GONTIJO, 2004; AHMAD & MOHAMAD, 2007), or (iii) by changing system ownership from public to private or by some other market strategy (PIERCE, Jr., 2003). To accomplish loss minimization through inspection, which is our main concern, one must first focus on the problem of selecting meters to be inspected from a population of consumers such that irregularity detection is maximized. Approaches in the literature vary, although they usually deal with predicting customers’ future consumption and analyzing abnormalities in their demand time series. In this paper, we propose an SPC (Statistical Process Control)-based strategy for detecting unusual behavior in customers’ demand time series. Although we propose a simplified and easily implementable forecasting model to predict demand, our method is essentially grounded in the analysis of historical demand behavior in search of potentially fraudulent customers. For that purpose, we propose the combined use of robust statistics and SPC rules. Our proposal is illustrated in a case study using a large dataset provided by an electricity distributor located in southern Brazil.
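The combined use of robust statistics and SPC rules can be illustrated with a minimal sketch: control limits built from the median and the MAD (scaled by 1.4826 for consistency with the normal standard deviation) over a moving window of past demand, flagging observations that fall outside the limits. Window length, multiplier and function names are illustrative assumptions, not the paper's exact procedure:

```python
import statistics

def robust_limits(history, k=3.0):
    # Robust center and scale: median and MAD, the latter scaled so it
    # estimates the standard deviation under normality.
    med = statistics.median(history)
    mad = statistics.median(abs(x - med) for x in history)
    sigma = 1.4826 * mad
    return med - k * sigma, med + k * sigma

def flag_anomalies(series, window=12, k=3.0):
    # SPC-style rule: flag any observation outside robust control limits
    # computed from the preceding `window` observations.
    flags = []
    for t in range(window, len(series)):
        lo, hi = robust_limits(series[t - window:t], k)
        flags.append(series[t] < lo or series[t] > hi)
    return flags
```

Using the median and MAD rather than the mean and standard deviation keeps the control limits from being inflated by the very anomalies (e.g. a sudden drop in metered consumption) one is trying to detect.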
AHMAD, A.R. & MOHAMAD, A.M. (2007). Intelligent system for detection of abnormalities and probable fraud by metered customers. 19th International Conference on Electricity Distribution, Vienna, 21–24 May 2007.
Title: NEW ADVANCES IN SYMBOLIC DATA ANALYSIS AND SPATIAL CLASSIFICATION Abstract: The usual data mining model is based on two parts: the first concerns the units (called here "individuals"); the second contains their description by several standard variables, including a class variable. The Symbolic Data Analysis model needs two more parts: the first concerns units called "concepts" and the second concerns their "description". The concepts are characterized by a set of properties called the "intent" and by an "extent", defined as the set of individuals that satisfy these properties. These concepts are described by "symbolic data", which are standard categorical or numerical data and, moreover, intervals, histograms, sequences of values, etc. These new kinds of data allow keeping the internal variation of the extent of each concept. New knowledge can then be extracted from this model by new data mining tools extended to concepts considered as new units. Among these tools, spatial classification allows a graphical visualisation of the given concepts on a grid and at different levels of generalisation, organised by a spatial hierarchy or pyramid (allowing overlapping clusters). The SYR software has been developed by the SYROKKO company after the academic SODAS software, developed by two European projects until 2003. The first aim of SYR is to extract, from a data file (.txt, .csv, ACCESS database) of several million units, a reduced set of concepts described by symbolic data. References: L. Billard, E. Diday (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley. ISBN 0-470-09016-2, 351 pages. E. Diday, M. Noirhomme (2008). Symbolic Data Analysis and the SODAS Software. Wiley. ISBN 978-0-470-01883-5, 457 pages. E. Diday (2008). Spatial classification. Discrete Applied Mathematics, Volume 156, Issue 8, page 1271.
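The intent/extent construction can be made concrete with a toy sketch: individuals sharing a concept are aggregated, and each concept is then described by one interval per variable, preserving the internal variation of its extent. The data and names below are made up for illustration:

```python
from collections import defaultdict

# individuals: (concept label, variable 1, variable 2)
individuals = [
    ("species_a", 4.9, 1.4), ("species_a", 5.1, 1.5),
    ("species_b", 6.3, 4.7), ("species_b", 6.0, 4.5),
]

# extent of each concept: the individuals that satisfy its properties
by_concept = defaultdict(list)
for concept, x1, x2 in individuals:
    by_concept[concept].append((x1, x2))

# symbolic description: one [min, max] interval per variable and concept
descriptions = {
    concept: [(min(row[j] for row in rows), max(row[j] for row in rows))
              for j in range(2)]
    for concept, rows in by_concept.items()
}
```

The resulting interval-valued table (concepts as rows, intervals as cells) is the kind of reduced symbolic data file that tools such as SODAS and SYR operate on.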
Title: Some clustering methods on dissimilarity or similarity matrices: uncovering clusters in Web content, structure and usage Abstract: Clustering is one of the most popular techniques in knowledge acquisition, and it is applied in various fields, including data mining and statistical data analysis. Clustering involves organizing a set of individuals into clusters in such a way that individuals within a given cluster have a high degree of similarity, while individuals belonging to different clusters have a high degree of dissimilarity. The definition of a "homogeneous cluster" depends on the particular application. We propose a clustering method for partitioning a set of objects where the relation between two objects is described by a dissimilarity or similarity measure. The clustering criterion, based on the sum of weighted dissimilarities between the objects belonging to the same class, measures the homogeneity of the clusters. We study the mathematical properties of these weighted dissimilarities, implement the corresponding algorithms that optimize the clustering criterion, and provide an empirical framework for their evaluation. The advantage of this approach is that the clustering algorithm recognizes clusters of different shapes and sizes. Clustering is a valuable technique for analyzing the Web. We propose to study clustering approaches in content and structure document mining and in usage mining. The analysis of a web site based on its usage data is an important task, as it provides insight into the organization of the site and its adequacy regarding user needs. We thus define an approach for discovering the profiles of visitor groups.
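A minimal sketch of a relocation heuristic for such a criterion, here unweighted for simplicity: the criterion is the sum of dissimilarities between objects sharing a cluster, and each object is moved to the cluster that most reduces it until no move helps. This works directly on a dissimilarity matrix, with no need for feature vectors; names are illustrative:

```python
def within_cost(D, assign):
    # clustering criterion: sum of dissimilarities between objects
    # belonging to the same cluster
    n = len(D)
    return sum(D[i][j] for i in range(n) for j in range(i + 1, n)
               if assign[i] == assign[j])

def relocation_clustering(D, k, max_iters=50):
    n = len(D)
    assign = [i % k for i in range(n)]  # arbitrary initial partition
    for _ in range(max_iters):
        moved = False
        for i in range(n):
            old = assign[i]
            costs = []
            for c in range(k):          # evaluate each candidate move
                assign[i] = c
                costs.append(within_cost(D, assign))
            best = min(range(k), key=lambda c: costs[c])
            assign[i] = best
            if best != old:
                moved = True
        if not moved:                   # local optimum reached
            break
    return assign
```

Because only pairwise dissimilarities enter the criterion, the same code applies to Web pages compared by content, by hyperlink structure, or by usage profiles.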
The classical view of statistical modelling consists in establishing a parsimonious representation of a random phenomenon, generally based upon the knowledge of an expert in the application field: the aim of a model is to provide a better understanding of the data and of the underlying mechanism that produced it. On the other hand, in data mining and statistical learning, predictive models are merely algorithms, and the quality of a model is assessed by its performance in predicting new observations. In this communication, we develop some general considerations about both aspects of modelling.