Specifying the semantics of Web sites, or adding semantics to
semi-structured documents, is an effort widely recognised as useful to
the community, first for facilitating information retrieval and, more
recently, for the design of Web sites.
But what does
semantics mean in such a context?
Semantics in fact appears at several levels. First, a Web site is a
way of presenting some information. To present this information, one
uses a computer language such as HTML or XML; this language has its own
semantics. Second, one may want a model of knowledge (such as an
ontology), which
can itself be represented by means of a computer language. Again this language
has its own semantics. Third, one can annotate a Web document with a
reference to a model of knowledge, once more using a specific
language and its semantics. The whole picture is darkened by the
fact that we tend to use a universal language, namely XML, to
represent very different things. More than an extensible language,
XML can be viewed as a meta-language that allows one to create new
languages. While the syntax of these new languages can be defined
using DTDs, the real problem is to define their semantics.
This is generally not done, and this is the specific problem we want to address.
Following the parallel with programming languages,
one traditionally speaks of
the static semantics of a programming language to express the
constraints that a legal program must satisfy (this very often
corresponds to compile-time verifications), as opposed to the
dynamic semantics, which expresses what happens at run-time.
In this paper, we are concerned only with the static semantics of Web
sites, i.e. we want to be able to define the semantic constraints that
a particular Web site must follow, so as to verify that some
documents are not only well formed and valid but also correct in some sense.
The dynamic semantics of a Web site could be
defined as the way users navigate in the site, that is, the
behaviour of users [Trousse, Jaczynski, and Kanawati1999,Trousse2000]; it is not in the scope of this
paper.
We now give a survey of the different languages and systems
that are available and that take the semantics of Web documents
into account.
In HTML, surely the most popular language on the Web, one manipulates
structured, formatted text, but in a way that mixes the structure of
texts with their presentation. Semantics may appear,
as we will see later, in meta-data attached to the documents, or be disseminated in the text through the use
of <SPAN> tags, as explained in [van Harmelen and van der Meer1999,van Harmelen and Fensel1999].
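For illustration, meta-data and inline annotations might be embedded in an HTML page along the following lines (the property names and class values here are hypothetical, not those of the cited approach):

```html
<HTML>
  <HEAD>
    <!-- Page-level meta-data: hypothetical property names -->
    <META NAME="author" CONTENT="J. Smith">
    <META NAME="keywords" CONTENT="knowledge representation, ontologies">
  </HEAD>
  <BODY>
    <!-- Inline annotation: the CLASS value refers to a concept
         of some external model of knowledge -->
    The project is led by <SPAN CLASS="Person">J. Smith</SPAN>
    at <SPAN CLASS="Organisation">INRIA</SPAN>.
  </BODY>
</HTML>
```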
XML [W3C-XML1998] not only gives the possibility to define new sets of tags,
but also provides a way to clearly separate the syntactic structure of a
document (its tree form) from its presentation by means of
style-sheets. Trees can be attributed with text. By adding a DTD to an XML document, one adds syntactic constraints on the structure of the document itself. We can thus manipulate annotated typed trees.
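As a small illustration, the following DTD (with invented element names) constrains the structure of a publication record, and the document below conforms to it:

```xml
<!-- The DTD expresses purely syntactic constraints:
     a publication has a title, one or more authors, and a year -->
<!DOCTYPE publication [
  <!ELEMENT publication (title, author+, year)>
  <!ELEMENT title  (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT year   (#PCDATA)>
]>
<publication>
  <title>Semantics of Web Sites</title>
  <author>J. Smith</author>
  <year>2000</year>
</publication>
```

Note that such constraints remain structural: a validator cannot, for instance, check that the year is consistent with information found in another document.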
RDF [W3C-RDF1999] and the RDF Schema language [W3C-RDF-Schema1999], recently proposed by the World Wide Web Consortium (W3C), provide
a powerful framework for formalising meta-data as directed labelled graphs,
where nodes represent resources and arcs
represent named properties.
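A minimal example of such a graph, in the RDF/XML serialisation, describes one resource with two properties (the resource URI and property values are invented; the Dublin Core vocabulary is used purely for illustration):

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- The node is the resource; each child element is an arc
       labelled with a property name -->
  <rdf:Description rdf:about="http://www.example.org/report">
    <dc:creator>J. Smith</dc:creator>
    <dc:title>Annual Report</dc:title>
  </rdf:Description>
</rdf:RDF>
```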
SHOE [Heflin, Hendler, and Luke1999] is an HTML(XML)-based knowledge representation
language which adds the tags necessary to embed arbitrary semantic
data into Web pages. SHOE tags are divided into two categories. First,
there are tags for constructing ontologies, i.e., sets of rules which
define what kinds of assertions SHOE documents can make and what these
assertions mean. Second, there are tags for annotating Web
documents.
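A SHOE annotation might look roughly as follows (a sketch in the style of the examples of [Heflin, Hendler, and Luke1999]; the ontology, category, and relation names are invented):

```html
<!-- Annotation of a personal page: the instance declares which
     ontology it uses, claims a category, and asserts a relation -->
<INSTANCE KEY="http://www.example.org/~smith">
  <USE-ONTOLOGY ID="cs-dept-ontology" VERSION="1.0" PREFIX="cs"
                URL="http://www.example.org/onts/cs.html">
  <CATEGORY NAME="cs.GraduateStudent">
  <RELATION NAME="cs.advisor">
    <ARG POS="1" VALUE="http://www.example.org/~jones">
  </RELATION>
</INSTANCE>
```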
On2broker [Fensel et al.1998,Fensel et al.1998b] uses formal ontologies to extract,
reason with, and possibly generate meta-data in the format of the Resource Description
Framework (RDF).
We can distinguish at least two ways of creating coherent Web documents.
The first is to generate documents from a database or from another document (cf. XSL); the second is to edit documents that are then checked against some declarations. We focus here on the second possibility.
XSL [W3C-XSL2000], initially designed to format documents, is in fact
more powerful than that: it can be used
to translate documents from one XML syntax to
another. It works on attributed typed trees, with access to the
context (mainly the information accessible on the path from the root of the tree to
the ``current'' point) and makes it possible to construct new
attributed typed trees.
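A small XSLT fragment illustrates such a translation (the source and target element names are invented), rewriting an <article> element of one syntax into a <publication> element of another:

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Translate a hypothetical <article> syntax into a
       <publication> syntax; select expressions are evaluated
       relative to the current node in the source tree -->
  <xsl:template match="article">
    <publication>
      <title><xsl:value-of select="title"/></title>
      <year><xsl:value-of select="@date"/></year>
    </publication>
  </xsl:template>
</xsl:stylesheet>
```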
In a few words, what is mainly possible today for manipulating Web
documents is context-free: semantics is expressed by means of syntactic
constraints. For example, in XML a DTD validator verifies that an XML
document respects the structure specified in a DTD, if
one is present.
However, other works try to go a step further by proposing context-dependent
document manipulation for the semantic verification of Web documents.
If we pursue our comparison between a Web site and a program: we can specify the syntax of
the language that is used (a DTD is a context-free grammar), we can
define pretty-printers and syntactic translators, but we cannot
perform global computations. The need for global computations has been
studied in WebMaster [van Harmelen and van der Meer1999] and also by PCR99, who extend a
DTD to create attributes that are manipulated by an attribute-grammar
evaluator.
WebMaster [van Harmelen and van der Meer1999] addresses the semantic verification of Web sites and
proposes a constraint language for
representing integrity constraints over HTML or XML documents (for
example, a publication on the page of a member of the group must also be
included in the publication list of the entire group).
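Such a constraint requires a global computation over several documents, which a DTD cannot express. The following sketch, written in Python rather than in WebMaster's actual constraint language, illustrates the kind of cross-document check involved (the element names and page contents are invented):

```python
# A WebMaster-style integrity constraint checked globally:
# every publication listed on a member's page must also appear
# in the group's publication list.
import xml.etree.ElementTree as ET

member_page = ET.fromstring("""
<memberpage>
  <publication>Semantics of Web Sites</publication>
  <publication>Typing XML Documents</publication>
</memberpage>""")

group_page = ET.fromstring("""
<grouppage>
  <publication>Semantics of Web Sites</publication>
</grouppage>""")

def missing_publications(member, group):
    """Return the member's publications absent from the group list."""
    group_titles = {p.text for p in group.iter("publication")}
    return [p.text for p in member.iter("publication")
            if p.text not in group_titles]

print(missing_publications(member_page, group_page))
```

Running the sketch prints `['Typing XML Documents']`, the publication that violates the constraint.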
SiLRI, developed in the
Ontobroker
project, is a logic-based RDF interpreter, able to reason with
meta-data given in the XML serialisation of RDF.
Ontobroker, as mentioned in [Fensel et al.1998b], could weakly support the
maintenance of structured text sources and detect incorrectness,
i.e., inconsistencies among documents and external sources. Such
support could be offered by integrating the inference of WebMaster, or
by using the existing inference engine (and the type system) in a different way.
Such a tool could suggest adding new meta-data
according to the ontology specification.
PCR99 propose the use of attribute grammars to perform semantic
computations on Web documents. Attributes of the grammar are stored in
XML attributes, implying a modification of the syntax.
Like WebMaster and the attribute-grammar approach, we want to allow global computations,
but we aim to overcome some of their limitations: in particular, we want to be able to manipulate a context containing external or computed information.