- Georges Hébrail, Professor at Telecom-ParisTech
Title: An introduction to data stream querying and mining
Abstract
Human activity is nowadays massively supported by computerized systems. These systems handle data to achieve their operational goals, and it is often of great interest to query and mine these data with a different goal: the supervision of the system. The supervision process is often difficult, or even impossible, to run because the amount of data to analyze, due in particular to its historical dimension, is too large to be stored in a database before being processed.
This problem has been studied intensively for several years, mainly by researchers from the database field. A new model of data management has been defined to handle "data streams", which are infinite sequences of structured records arriving continuously in real time. This model is supported by newly designed data processing systems called "Data Stream Management Systems". These systems can connect to one or several stream sources and are able to process "continuous queries" applied both to streams and to standard data tables. These queries are qualified as continuous because they stay active for a long time while the streaming data are transient. The key feature of these systems is that data produced by streams are not stored permanently but processed on the fly. Note that this is the opposite of standard database systems, where data are permanent and queries are transient. Such continuous queries are typically used either to produce alarms when certain events occur or to build aggregated historical data from the raw data produced by input streams.
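As an illustration, the sliding-window semantics of a continuous query can be sketched in a few lines of Python. This is a minimal simulation of the idea, not the API of any particular Data Stream Management System; the stream, window size, and aggregate are illustrative assumptions:

```python
from collections import deque

def continuous_avg(stream, window_size):
    """Continuous query sketch: emit the average over a sliding window
    of the last `window_size` values, producing one result per arrival.
    Records are processed on the fly; nothing is kept beyond the window."""
    window = deque(maxlen=window_size)  # old records fall out automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

# Simulated stream of measurements arriving one by one.
readings = [10, 20, 30, 40, 50]
results = list(continuous_avg(readings, window_size=3))
# One output per input record, each computed over the current window.
```

The query stays active as long as the stream produces records, while each record itself is transient, which mirrors the "permanent query, transient data" inversion described above.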
As data stored in databases and warehouses are processed by mining algorithms, it is also interesting to mine data streams, i.e. to apply data mining algorithms directly to the streams instead of storing them beforehand in a database. This problem has also been studied extensively, and new data mining algorithms have been developed that are applicable directly to streams. These algorithms process data streams on the fly, and they can also provide results based on a portion of the stream instead of the whole stream seen so far. Portions of streams are defined by fixed or sliding windows.
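As one concrete example of such a one-pass algorithm, the Misra-Gries summary estimates the frequent items of a stream using bounded memory. The abstract does not name specific algorithms; this classic stream-mining technique is shown purely as an illustration:

```python
def misra_gries(stream, k):
    """One-pass frequent-items summary: any item occurring more than
    n/k times in a stream of n items is guaranteed to survive in the
    summary, while at most k-1 counters are ever kept in memory."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Counters are full: decrement all, dropping those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# 'a' occurs 4 times out of 7, well above n/k = 7/3, so it must survive.
stream = ["a", "b", "a", "c", "a", "b", "a"]
summary = misra_gries(stream, k=3)
```

The counts in the summary are underestimates, but the memory bound is independent of the stream length, which is exactly the property needed for processing unbounded streams.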
We will provide an introduction to the field of data stream management and mining. First, the main applications that motivated these developments will be presented (telecommunications, computer networks, stock markets, security, …), and the new concepts related to data streams will be introduced (structure of a stream, timestamps, time windows, …). The second part will present the main concepts and architectures of Data Stream Management Systems. The third part will present the main results on the adaptation of data mining algorithms to the case of streams.
- Ludovic Lebart, Director of Research at CNRS, Telecom-ParisTech
Title: Introduction to Text Mining
Abstract
Principal axes techniques and classification methods play a major role in the computerized exploration of textual corpora. They produce visualizations and/or groupings of elements (free responses in marketing and socioeconomic surveys, discourses, scientific abstracts, patents, broadcast news, financial and economic reports, literary texts, etc.); they highlight associations and patterns; they provide decision aids for attributing a text to an author or a period, for choosing a document within a database, or for coding information expressed in natural language. They also help to achieve more technical objectives such as lexical disambiguation, parsing, selection of statistical units, description of semantic graphs, and speech and optical character recognition.

However, the basic concepts of statistical data analysis must be modified in text analysis. Variables, instead of being declared a priori, are derived from the text. Statistical units (or: observations, subjects, individuals, examples) can be documents (described by their titles or abstracts) in documentary databases, respondents (described by their responses to open questions) in surveys, or segments of text (sentences, context units, paragraphs) in literary applications. Four additional characteristics increase the complexity of the basic data tables: these tables are large (thousands of documents, thousands of words); they are often sparse (a document may contain a relatively small number of words); they come with a huge amount of available meta-data (rules of grammar, semantic networks); and, finally, textual data deal with sequences of occurrences (or: strings) of items, whose order can be of importance, another non-standard feature in multidimensional data analysis.

We will focus our presentation on the assessment of visualizations and the use of meta-data. The application examples concern open-ended questions in an international survey.
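To make the points about derived variables and sparsity concrete, the following Python sketch builds a small document-term count table from a toy corpus. The corpus and the word-level tokenization are illustrative assumptions, not part of the talk:

```python
from collections import Counter

documents = [
    "data mining of textual data",
    "principal axes methods for surveys",
    "open questions in surveys",
]

# Variables (the word types) are derived from the corpus itself,
# not declared a priori.
vocabulary = sorted({w for doc in documents for w in doc.split()})

# Document-term count table: one row per document, one column per word.
counts = [Counter(doc.split()) for doc in documents]
table = [[c[w] for w in vocabulary] for c in counts]

# The table is sparse: most cells are zero, because each document
# uses only a small part of the full vocabulary.
zeros = sum(cell == 0 for row in table for cell in row)
sparsity = zeros / (len(documents) * len(vocabulary))
```

Even on this tiny corpus, well over half of the cells are zero; with thousands of documents and thousands of words, the sparsity becomes extreme, which is why the basic data-analysis machinery must be adapted.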
Author's reference
Lebart L., Salem A., Berry E. (1998). Exploring Textual Data. Kluwer Academic Publishers, Dordrecht.