The study of social networks has advanced significantly in recent years, with the development of sophisticated techniques for social network analysis and mining, driven by strong demand from Web 2.0 applications: social web sites, e-mail and instant messaging systems. Applications include classification systems (text classification, churn prediction, ...), the detection of user communities, and recommendation systems. Social network analysis faces difficult problems, such as modeling the nature of social interactions, mining structured data (social graphs, text, heterogeneous data), and understanding the dynamics of social networks. Moreover, these applications typically generate huge datasets, with networks counting several million nodes, and the mining algorithms have to deal with the data using limited computing resources. In this communication, we present several problems arising in social network analysis, describe some recent advances, and give some examples showing how social graph encoding can improve data mining tasks.
An Optimization Methodology for Neural Network Weights and Architectures. This talk introduces a methodology for neural network global optimization. The aim is the simultaneous optimization of multilayer perceptron (MLP) network weights and architectures, in order to generate topologies with few connections and high classification performance for a variety of data sets. The approach combines the advantages of simulated annealing, tabu search and the backpropagation training algorithm in order to generate an automatic process for producing networks with high classification performance and low complexity. Experimental results obtained with four classification problems and one prediction problem are better than those obtained by the most commonly used optimization techniques. Considering the data sets used in the work presented in this talk, the methodology was able to automatically generate MLP topologies with many fewer connections than the maximum number allowed. The results also yield interesting conclusions about the importance of each input feature in the classification and prediction tasks. The proposed methodology was not originally designed to deal with different numbers of hidden layers, but it does work with them, and some experiments were made with more than one hidden layer. In any case, a decision needs to be made about the size of the initial topology; in the experiments reported here, the initial topologies have only one hidden layer with all possible feedforward connections.
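The talk's method combines simulated annealing, tabu search and backpropagation; as a rough illustration of the architecture-search ingredient alone, here is a minimal simulated-annealing sketch over binary connection masks. The cost function, which in the actual methodology would combine validation error and connection count, is left abstract; all names and defaults are illustrative, not the talk's implementation:

```python
import math
import random

def anneal_topology(n_in, n_hidden, n_out, cost_fn, t0=1.0, cooling=0.95, steps=200):
    # Binary mask over all feedforward connections of a one-hidden-layer MLP;
    # simulated annealing searches for a low-cost (sparse yet accurate) mask.
    n_conn = n_in * n_hidden + n_hidden * n_out
    mask = [1] * n_conn                      # start fully connected
    cur = best = cost_fn(mask)
    best_mask = mask[:]
    t = t0
    for _ in range(steps):
        cand = mask[:]
        cand[random.randrange(n_conn)] ^= 1  # toggle one connection
        c = cost_fn(cand)
        # accept improvements always, worsenings with Boltzmann probability
        if c < cur or random.random() < math.exp((cur - c) / t):
            mask, cur = cand, c
            if cur < best:
                best, best_mask = cur, mask[:]
        t *= cooling                         # geometric cooling schedule
    return best_mask, best
```

In practice `cost_fn` would train the masked network briefly with backpropagation and return a penalized validation error; the tabu-search component (forbidding recently visited masks) is omitted here for brevity.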
The estimation of a data transformation is very useful for yielding response variables that closely satisfy a normal linear model. Generalized linear models enable the fitting of models to a wide range of data types. These models are based on exponential dispersion models. We propose a new class of transformed generalized linear models to extend the Box and Cox models and the generalized linear models. We use the generalized linear model framework to fit these models and discuss maximum likelihood estimation and inference. We give a simple formula to estimate the parameter that indexes the transformation of the response variable for a subclass of models. We also give a simple formula to estimate the r-th moment of the original dependent variable. We explore the possibility of applying these models to time series data to extend the generalized autoregressive moving average models discussed by Benjamin et al. [Generalized autoregressive moving average models. J. Amer. Statist. Assoc. 98, 214–223]. The usefulness of these models is illustrated in a simulation study and in applications to three real data sets.
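As background for the parameter-estimation result, the classical Box-Cox transformation parameter can be estimated by maximizing its profile log-likelihood. A minimal NumPy sketch using a grid search on simulated log-normal data (for which the true transformation parameter is 0, i.e. the log transform); function names and the grid are illustrative, not the talk's estimator:

```python
import numpy as np

def boxcox_loglik(y, lam):
    # Profile log-likelihood of the Box-Cox parameter
    # (normal model, with the variance profiled out).
    n = len(y)
    z = np.log(y) if abs(lam) < 1e-8 else (y ** lam - 1) / lam
    return -n / 2 * np.log(z.var()) + (lam - 1) * np.log(y).sum()

def boxcox_mle(y, grid=np.linspace(-2, 2, 401)):
    # grid search for the lambda maximizing the profile log-likelihood
    ll = [boxcox_loglik(y, lam) for lam in grid]
    return float(grid[int(np.argmax(ll))])

rng = np.random.default_rng(0)
y = rng.lognormal(0.0, 0.5, size=1000)  # log-normal responses: true lambda is 0
lam_hat = boxcox_mle(y)
```

In practice `scipy.stats.boxcox` returns the same maximum-likelihood estimate directly; the point of the talk's subclass formula is to avoid such numerical searches.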
Cluster analysis has been widely used in numerous fields, including pattern recognition, data mining and image processing. Its aim is to organize a set of items into clusters such that items within a given cluster have a high degree of similarity, whereas items belonging to different clusters have a high degree of dissimilarity. In particular, partitioning clustering models aim to organize a set of items into a pre-defined number of clusters. Our reference clustering model is the partitioning dynamic cluster algorithm (Diday and Simon (1976)). These are iterative two-step relocation clustering algorithms involving, at each iteration, the construction of the clusters and the identification of a suitable representative or prototype (means, factorial axes, probability laws, etc.) of each cluster, by locally optimizing an adequacy criterion between the clusters and their corresponding prototypes. Often, the objects to be clustered are represented as vectors of quantitative features. However, the recording of interval data has become common practice in real-world applications, and nowadays this kind of data is often used to describe objects. Symbolic Data Analysis (SDA) is an area related to multivariate analysis, data mining and pattern recognition, which has provided suitable data analysis methods for managing objects described as vectors of intervals (Bock and Diday (2000)). In this presentation, we review partitioning clustering models and algorithms for interval-valued data, having as reference the dynamic clustering algorithm. For each clustering model, we give the clustering criterion, the best prototype of each cluster, the best distance associated with each cluster (if any), as well as the best partition into a fixed number of clusters. Moreover, various tools for partition and cluster interpretation of interval-valued data furnished by these algorithms are also presented.
Finally, in order to show the usefulness of these algorithms and the merit of the partition and cluster interpretation tools, experiments with real interval-valued data sets are given. References: H.H. Bock and E. Diday, editors (2000). Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Springer, Heidelberg. E. Diday and J.J. Simon (1976). Clustering analysis. In: K.S. Fu (Ed.), Digital Pattern Recognition. Springer, Heidelberg, 47–94.
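The two-step relocation scheme can be sketched for interval-valued data as follows. This is a minimal illustration assuming a squared L2 distance on interval bounds and interval prototypes given by component-wise means of the bounds, which is only one of the model choices reviewed in the talk; all names are illustrative:

```python
import random

def l2_interval(a, b):
    # squared L2 distance between two vectors of intervals [(lo, hi), ...]
    return sum((x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2 for x, y in zip(a, b))

def dynamic_clusters(items, k, iters=20, seed=0):
    # Two-step "dynamic clusters" relocation scheme:
    # allocation to the nearest prototype, then prototype re-estimation.
    rng = random.Random(seed)
    protos = [list(p) for p in rng.sample(items, k)]
    assign = [0] * len(items)
    for _ in range(iters):
        # allocation step: each object goes to its closest prototype
        assign = [min(range(k), key=lambda c: l2_interval(x, protos[c]))
                  for x in items]
        # representation step: component-wise mean of lower and upper bounds
        for c in range(k):
            members = [x for x, a in zip(items, assign) if a == c]
            if members:
                p = len(members[0])
                protos[c] = [(sum(m[j][0] for m in members) / len(members),
                              sum(m[j][1] for m in members) / len(members))
                             for j in range(p)]
    return assign, protos
```

Each iteration locally improves the adequacy criterion (here, the sum of squared distances of objects to their cluster prototypes), mirroring the general scheme of Diday and Simon.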
Plenty of machine learning algorithms have been proposed so far, for both supervised and unsupervised learning. Overall, these algorithms emphasize the metrical aspect more than the topological relations within the data set.
Abstract The first two modalities may be minimized (i) by investing in metering technology (GHAJAR, KHALIFE & RICHANI, 2000; GHAJAR & KHALIFE, 2003), (ii) through an efficient inspection program (CABRAL & GONTIJO, 2004; AHMAD & MOHAMAD, 2007), or (iii) by changing system ownership from public to private or by some other market strategy (PIERCE, Jr., 2003). To accomplish loss minimization through inspection, which is our main concern, one must first focus on the problem of selecting meters to be inspected from a population of consumers such that irregularity detection is maximized. Approaches in the literature vary, although they usually deal with predicting customers’ future consumption and analyzing abnormalities in their demand time series. In this paper, we propose an SPC (Statistical Process Control)-based strategy for detecting unusual behavior in customers’ demand time series. Although we propose a simplified and easily implementable forecasting model to predict demand, our method is essentially grounded in the analysis of historical demand behavior in search of potentially fraudulent customers. For that purpose, we propose the combined use of robust statistics and SPC rules. Our proposal is illustrated in a case study using a large dataset provided by an electricity distributor located in southern Brazil.
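The combined use of robust statistics and SPC rules can be illustrated with a minimal sketch: control limits built from the median and the MAD (scaled by 1.4826 for consistency with the normal standard deviation) over a moving window of past demand, flagging observations that fall outside the limits. Window length, multiplier and function names are illustrative assumptions, not the paper's exact procedure:

```python
import statistics

def robust_limits(history, k=3.0):
    # Robust center and scale: median and MAD, the latter scaled so it
    # estimates the standard deviation under normality.
    med = statistics.median(history)
    mad = statistics.median(abs(x - med) for x in history)
    sigma = 1.4826 * mad
    return med - k * sigma, med + k * sigma

def flag_anomalies(series, window=12, k=3.0):
    # SPC-style rule: flag any observation outside robust control limits
    # computed from the preceding `window` observations.
    flags = []
    for t in range(window, len(series)):
        lo, hi = robust_limits(series[t - window:t], k)
        flags.append(series[t] < lo or series[t] > hi)
    return flags
```

Using the median and MAD rather than the mean and standard deviation keeps the control limits from being inflated by the very anomalies (e.g. a sudden drop in metered consumption) one is trying to detect.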
AHMAD, A.R. & MOHAMAD, A.M. (2007). Intelligent system for detection of abnormalities and probable fraud by metered customers. 19th International Conference on Electricity Distribution, Vienna, 21–24 May 2007.
Title: NEW ADVANCES IN SYMBOLIC DATA ANALYSIS AND SPATIAL CLASSIFICATION Abstract: The usual data mining model is based on two parts: the first concerns the units (called here "individuals"); the second contains their description by several standard variables, including a class variable. The Symbolic Data Analysis model needs two more parts: the first concerns units called "concepts" and the second concerns their "description". The concepts are characterized by a set of properties called the "intent" and by an "extent", defined as the set of individuals that satisfy these properties. These concepts are described by "symbolic data", which are standard categorical or numerical data and, moreover, intervals, histograms, sequences of values, etc. These new kinds of data allow keeping the internal variation of the extent of each concept. New knowledge can then be extracted from this model by new data mining tools extended to concepts considered as new units. Among these tools, spatial classification allows a graphical visualisation of the given concepts on a grid and at different levels of generalisation, organised by a spatial hierarchy or pyramid (allowing overlapping clusters). The SYR software has been developed by the SYROKKO company after the academic SODAS software, developed by two European projects until 2003. The first aim of SYR is to extract, from a data file (.txt, .csv, ACCESS database) of several million units, a reduced set of concepts described by symbolic data. References: L. Billard, E. Diday (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley. ISBN 0-470-09016-2, 351 pages. E. Diday, M. Noirhomme (2008). Symbolic Data Analysis and the SODAS Software. Wiley. ISBN 978-0-470-01883-5, 457 pages. E. Diday (2008). Spatial classification. Discrete Applied Mathematics, Volume 156, Issue 8, page 1271.
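The intent/extent construction can be made concrete with a toy sketch: individuals sharing a concept are aggregated, and each concept is then described by one interval per variable, preserving the internal variation of its extent. The data and names below are made up for illustration:

```python
from collections import defaultdict

# individuals: (concept label, variable 1, variable 2)
individuals = [
    ("species_a", 4.9, 1.4), ("species_a", 5.1, 1.5),
    ("species_b", 6.3, 4.7), ("species_b", 6.0, 4.5),
]

# extent of each concept: the individuals that satisfy its properties
by_concept = defaultdict(list)
for concept, x1, x2 in individuals:
    by_concept[concept].append((x1, x2))

# symbolic description: one [min, max] interval per variable and concept
descriptions = {
    concept: [(min(row[j] for row in rows), max(row[j] for row in rows))
              for j in range(2)]
    for concept, rows in by_concept.items()
}
```

The resulting interval-valued table (concepts as rows, intervals as cells) is the kind of reduced symbolic data file that tools such as SODAS and SYR operate on.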
Title: Some clustering methods on dissimilarity or similarity matrices: uncovering clusters in Web content, structure and usage Abstract: Clustering is one of the most popular techniques in knowledge acquisition, and it is applied in various fields, including data mining and statistical data analysis. Clustering involves organizing a set of individuals into clusters in such a way that individuals within a given cluster have a high degree of similarity, while individuals belonging to different clusters have a high degree of dissimilarity. The definition of a "homogeneous cluster" depends on the particular application. We propose a clustering method for partitioning a set of objects where the relation between two objects is described by a dissimilarity or similarity measure. The clustering criterion, based on the sum of weighted dissimilarities between the objects belonging to the same class, measures the homogeneity of the clusters. We study the mathematical properties of these weighted dissimilarities, implement the corresponding algorithms that optimize the clustering criterion, and provide an empirical framework for their evaluation. The advantage of this approach is that the clustering algorithm recognizes clusters of different shapes and sizes. Clustering is a valuable technique for analyzing the Web. We propose to study clustering approaches in content and structure document mining and in usage mining. The analysis of a web site based on its usage data is an important task, as it provides insight into the organization of the site and its adequacy regarding user needs. We thus define an approach for discovering the profiles of visitor groups.
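A minimal sketch of a relocation heuristic for such a criterion, here unweighted for simplicity: the criterion is the sum of dissimilarities between objects sharing a cluster, and each object is moved to the cluster that most reduces it until no move helps. This works directly on a dissimilarity matrix, with no need for feature vectors; names are illustrative:

```python
def within_cost(D, assign):
    # clustering criterion: sum of dissimilarities between objects
    # belonging to the same cluster
    n = len(D)
    return sum(D[i][j] for i in range(n) for j in range(i + 1, n)
               if assign[i] == assign[j])

def relocation_clustering(D, k, max_iters=50):
    n = len(D)
    assign = [i % k for i in range(n)]  # arbitrary initial partition
    for _ in range(max_iters):
        moved = False
        for i in range(n):
            old = assign[i]
            costs = []
            for c in range(k):          # evaluate each candidate move
                assign[i] = c
                costs.append(within_cost(D, assign))
            best = min(range(k), key=lambda c: costs[c])
            assign[i] = best
            if best != old:
                moved = True
        if not moved:                   # local optimum reached
            break
    return assign
```

Because only pairwise dissimilarities enter the criterion, the same code applies to Web pages compared by content, by hyperlink structure, or by usage profiles.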
The classical view of statistical modelling consists in establishing a parsimonious representation of a random phenomenon, generally based upon the knowledge of an expert in the application field: the aim of a model is to provide a better understanding of the data and of the underlying mechanism that produced it. On the other hand, in data mining and statistical learning, predictive models are merely algorithms, and the quality of a model is assessed by its performance in predicting new observations. In this communication, we develop some general considerations about both aspects of modelling.