Specifying the semantics of Web sites, or adding semantics to
semi-structured documents, is an effort widely recognised as useful to
the community, first for facilitating information retrieval and, more
recently, for the design of Web sites.
But what does
semantics mean in such a context?
Semantics in fact appears at several levels. First, a Web site is a
way of presenting some information. To present this information, one
uses a computer language such as HTML or XML; this language has its own
semantics. Second, one may want a model of knowledge (such as an
ontology), which
can itself be represented by means of a computer language. Again this language
has its own semantics. Third, one can annotate a Web document with a
reference to a model of knowledge, once more using a specific
language and its semantics. The whole picture is darkened by the
fact that we tend to use a universal language, namely XML, to
represent very different things. More than an extensible language,
XML can be viewed as a meta-language that allows one to create new
languages. While the syntax of these new languages can be defined
using DTDs, the real problem is to define their semantics.
This is generally not done, and this is the specific problem we want to address.
Following the parallel with programming languages,
one traditionally speaks of
the static semantics of a programming language to express the
constraints that a legal program must satisfy (this very often
corresponds to compile-time verifications), as opposed to the
dynamic semantics, which expresses what happens at run-time.
In this paper, we are concerned only with the static semantics of Web
sites, i.e. we want to be able to define the semantic constraints that
a particular Web site must follow, so as to verify that some
documents are not only well formed and valid but also correct in some sense.
The dynamic semantics of a Web site could be
defined as the way users navigate in the site, that is, the
behaviour of users [Trousse, Jaczynski, and Kanawati1999,Trousse2000]; it is not in the scope of this
paper.
We now give a survey of the different languages and systems
that are available and that take the semantics of Web documents
into account.
In HTML, surely the most popular language on the Web, one manipulates
structured, formatted text, but in a way that mixes the structure of
texts with their presentation. Semantics may appear,
as we will see later, in meta-data attached to the documents, or be disseminated in the text through the use
of <SPAN> tags, as explained in [van Harmelen and van der Meer1999,van Harmelen and Fensel1999].
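For illustration, meta-data and inline annotations might be embedded in an HTML page along the following lines (the property names and class values here are hypothetical, not those of the cited approach):

```html
<HTML>
  <HEAD>
    <!-- Page-level meta-data: hypothetical property names -->
    <META NAME="author" CONTENT="J. Smith">
    <META NAME="keywords" CONTENT="knowledge representation, ontologies">
  </HEAD>
  <BODY>
    <!-- Inline annotation: the CLASS value refers to a concept
         of some external model of knowledge -->
    The project is led by <SPAN CLASS="Person">J. Smith</SPAN>
    at <SPAN CLASS="Organisation">INRIA</SPAN>.
  </BODY>
</HTML>
```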
XML [W3C-XML1998] not only gives the possibility to define new sets of tags,
but also provides a way to clearly separate the syntactic structure of a
document (its tree form) from its presentation by means of
style-sheets. Trees can be attributed with text. By adding a DTD to an XML document, one adds syntactic constraints on the structure of the document itself. We can thus manipulate annotated typed trees.
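As a small illustration, the following DTD (with invented element names) constrains the structure of a publication record, and the document below conforms to it:

```xml
<!-- The DTD expresses purely syntactic constraints:
     a publication has a title, one or more authors, and a year -->
<!DOCTYPE publication [
  <!ELEMENT publication (title, author+, year)>
  <!ELEMENT title  (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT year   (#PCDATA)>
]>
<publication>
  <title>Semantics of Web Sites</title>
  <author>J. Smith</author>
  <year>2000</year>
</publication>
```

Note that such constraints remain structural: a validator cannot, for instance, check that the year is consistent with information found in another document.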
RDF [W3C-RDF1999] and the RDF Schema language [W3C-RDF-Schema1999], recently proposed by the World Wide Web Consortium (W3C), provide
a powerful framework for formalising meta-data as directed labelled graphs,
where nodes represent resources and arcs
represent named properties.
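A minimal example of such a graph, in the RDF/XML serialisation, describes one resource with two properties (the resource URI and property values are invented; the Dublin Core vocabulary is used purely for illustration):

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- The node is the resource; each child element is an arc
       labelled with a property name -->
  <rdf:Description rdf:about="http://www.example.org/report">
    <dc:creator>J. Smith</dc:creator>
    <dc:title>Annual Report</dc:title>
  </rdf:Description>
</rdf:RDF>
```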
SHOE [Heflin, Hendler, and Luke1999] is an HTML(XML)-based knowledge representation
language which adds the tags necessary to embed arbitrary semantic
data into Web pages. SHOE tags are divided into two categories. First,
there are tags for constructing ontologies, i.e., sets of rules which
define what kinds of assertions SHOE documents can make and what these
assertions mean. Second, there are tags for annotating Web
documents.
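A SHOE annotation might look roughly as follows (a sketch in the style of the examples of [Heflin, Hendler, and Luke1999]; the ontology, category, and relation names are invented):

```html
<!-- Annotation of a personal page: the instance declares which
     ontology it uses, claims a category, and asserts a relation -->
<INSTANCE KEY="http://www.example.org/~smith">
  <USE-ONTOLOGY ID="cs-dept-ontology" VERSION="1.0" PREFIX="cs"
                URL="http://www.example.org/onts/cs.html">
  <CATEGORY NAME="cs.GraduateStudent">
  <RELATION NAME="cs.advisor">
    <ARG POS="1" VALUE="http://www.example.org/~jones">
  </RELATION>
</INSTANCE>
```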
On2broker [Fensel et al.1998,Fensel et al.1998b] uses formal ontologies to extract,
reason with, and possibly generate meta-data in the format of the Resource Description
Framework (RDF).
We can distinguish at least two ways of creating coherent Web documents.
The first is to generate documents from a database or from another document (cf. XSL); the second is to edit documents that are then checked against some declarations. We focus here on the second possibility.
XSL [W3C-XSL2000], initially designed to format documents, is in fact
more powerful than that: it can be used
to translate documents from one XML syntax to
another. It works on attributed typed trees, with access to the
context (mainly the information accessible on the path from the root of the tree to
the ``current'' point) and makes it possible to construct new
attributed typed trees.
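A small XSLT fragment illustrates such a translation (the source and target element names are invented), rewriting an <article> element of one syntax into a <publication> element of another:

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Translate a hypothetical <article> syntax into a
       <publication> syntax; select expressions are evaluated
       relative to the current node in the source tree -->
  <xsl:template match="article">
    <publication>
      <title><xsl:value-of select="title"/></title>
      <year><xsl:value-of select="@date"/></year>
    </publication>
  </xsl:template>
</xsl:stylesheet>
```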
In a few words, what is mainly possible today for manipulating Web
documents is context-free: semantics is expressed by means of syntactic
constraints. For example, in XML a DTD validator verifies that an XML
document respects the structure specified in a DTD, if
one is present.
However, other works try to go a step further by proposing context-dependent
document manipulation for the semantic verification of Web documents.
If we pursue our comparison between a Web site and a program: we can specify the syntax of
the language that is used (a DTD is a context-free grammar), we can
define pretty-printers and syntactic translators, but we cannot
perform global computations. The need for global computations has been
studied in WebMaster [van Harmelen and van der Meer1999] and also by PCR99, who extend a
DTD to create attributes that are manipulated by an attribute-grammar
evaluator.
WebMaster [van Harmelen and van der Meer1999] addresses the semantic verification of Web sites and
proposes a constraint language for
representing integrity constraints over HTML or XML documents (for
example, a publication on the page of a member of the group must also be
included in the publication list of the entire group).
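Such a constraint requires a global computation over several documents, which a DTD cannot express. The following sketch, written in Python rather than in WebMaster's actual constraint language, illustrates the kind of cross-document check involved (the element names and page contents are invented):

```python
# A WebMaster-style integrity constraint checked globally:
# every publication listed on a member's page must also appear
# in the group's publication list.
import xml.etree.ElementTree as ET

member_page = ET.fromstring("""
<memberpage>
  <publication>Semantics of Web Sites</publication>
  <publication>Typing XML Documents</publication>
</memberpage>""")

group_page = ET.fromstring("""
<grouppage>
  <publication>Semantics of Web Sites</publication>
</grouppage>""")

def missing_publications(member, group):
    """Return the member's publications absent from the group list."""
    group_titles = {p.text for p in group.iter("publication")}
    return [p.text for p in member.iter("publication")
            if p.text not in group_titles]

print(missing_publications(member_page, group_page))
```

Running the sketch prints `['Typing XML Documents']`, the publication that violates the constraint.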
SiLRI, developed in the
Ontobroker
project, is a logic-based RDF interpreter, able to reason with
meta-data given in the XML serialisation of RDF.
Ontobroker, as mentioned in [Fensel et al.1998b], could weakly support the
maintenance of structured text sources and detect incorrectness,
i.e., inconsistencies among documents and external sources. Such
support could be offered by integrating the inference of WebMaster, or
by using the existing inference engine (and the type system) in a different way.
Such a tool could suggest adding new meta-data
according to the ontology specification.
PCR99 propose the use of attribute grammars to perform semantic
computations on Web documents. Attributes of the grammar are stored in
XML attributes, implying a modification of the syntax.
Like WebMaster and the attribute-grammar approach, we want to allow global computations,
but we aim to overcome some of their limitations: in particular, we want to be able to manipulate a context containing external or computed information.