3. Applying Natural Semantics to Structured Documents

In this section, we first give a quick overview of Natural Semantics, before explaining how we can apply it to Web sites.

3.1 What is Natural Semantics?

Our goal in this section is not to give a complete description of Natural Semantics, but to provide enough elements to show how it can be applied to Web sites.

Natural Semantics [Kahn1987] was developed to specify the semantics of programming languages, in order to allow mechanical manipulation of programs and the generation of compilers. It has two main goals: providing an executable semantic specification formalism [Despeyroux1987] and supporting proofs [Despeyroux1986].

Programs are structured documents that can be represented as trees. The set of syntactical constraints that legal programs must respect is called the abstract syntax of the programming language. If we think of an XML document as a tree, we can identify the notion of abstract syntax with the notion of DTD, even if declaring an element is less expressive than giving the signature of an abstract syntax operator (abstract syntaxes traditionally define operators and types, while a DTD defines only operators, called elements). In both cases these constraints are context-free.
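As an illustration, a DTD element declaration and a corresponding abstract syntax operator signature could look as follows (the names are hypothetical, and the signature uses a generic notation rather than the actual AS syntax). The signature makes the sorts NAME and REFERENCE explicit, while the element declaration only names the allowed sub-elements:

<!ELEMENT category (name, reference*)>

category : NAME x REFERENCE*  ->  CATEGORY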

Natural Semantics has its origin in the structural operational semantics developed by Plotkin [Plotkin1981]. Only the purely natural deduction part of it has been retained, and the style of the definitions is close to the natural deduction style, using Gentzen's sequents [Szabo1969]. The natural semantics of a programming language, or more generally of a computer formalism, is given by one (or more) formal systems (sets of axioms and inference rules) which define sets of valid theorems (theories). Basically, these inference rules express a way to navigate in a program or in a document, checking that some properties (called consequents) are true. But, in contrast to syntactical constraints, these properties can most of the time only be checked in a given environment, which contains more information than what is generally called a context. In the case of the static semantics of a programming language, this environment contains the so-called symbol table, which records the set of visible variables or objects together with their types, while for the dynamic semantics it contains a memory state, that is, a mapping from variable names to values. In this setting, performing a computation (such as type-checking a program) amounts to proving that a particular predicate (this program is correctly typed) holds in the theory. Natural Semantics is declarative and does not define the tactics used to build proofs.

We now try to give a more precise idea of Natural Semantics by showing some example rules.

\begin{displaymath}
\frac{\emptyset \vdash Decls \rightarrow \rho \qquad \rho \vdash Stats}
     {\vdash \textbf{declare } Decls \textbf{ in } Stats}
\end{displaymath}


This rule could be the top rule of the static semantics of a programming language in which statements use objects defined in a preceding declarative part. The declarations Decls are verified in an empty environment and produce a new environment $\rho$ containing the declared objects and their associated types, which is then used to verify the statements Stats.

Most of the rules express the fact that the verification process is recursive and how the environment is propagated. Modifications or accesses to the environment very often appear at the leaves of the tree as in the following axiom (i.e. an inference rule without premises).


\begin{displaymath}
\rho \vdash \textbf{var }X:T\textbf{;} \rightarrow \{X:T\}\cdot \rho
\quad \textbf{(provided }X \textbf{ not in } \rho\textbf{)}
\end{displaymath}


This rule states that a declaration is added to the environment only when the declared object does not already appear in it. X and T are variables that will be unified with actual variable names and type expressions.
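To give these two rules an operational reading, here is a minimal sketch in Python (the program representation and function names are ours, purely illustrative; in Natural Semantics the rules themselves remain declarative and the actual checker is generated from them):

def check_declarations(decls):
    """Axiom above: add each declaration X:T to the environment."""
    env = {}                              # the empty environment
    for name, typ in decls:               # a declaration "var X : T;"
        if name in env:                   # side condition: X not in rho
            raise TypeError(name + " is declared twice")
        env[name] = typ                   # {X:T} . rho
    return env

def check_program(decls, stats, check_statements):
    """Top rule: verify Decls in the empty environment, then Stats in the result."""
    rho = check_declarations(decls)       # empty |- Decls -> rho
    check_statements(rho, stats)          # rho |- Stats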

3.2 Current Implementation of Natural Semantics

Natural Semantics has been implemented by means of a specification formalism called Typol [Despeyroux1988]. It is part of a programming environment generator called Centaur [Borras et al.1988] that allows the definition of the concrete and abstract syntaxes of programming languages, pretty printing rules, and semantic rules. The environment offers support for creating a good man-machine interface. While most of the system is written in Lisp, semantic rules (and possibly parsers and un-parsers) are translated into Prolog code.

This implementation includes two features that are used to construct real tools: an error recovery system that makes it possible to specify special rules that are applied when an error occurs, providing error messages, and the ability to keep track of occurrences in the source tree, to indicate precisely where errors occur.

3.3 Application to Markup Language Based Documents

Markup-based languages are good candidates for having their semantics specified with Natural Semantics. An XML validator checking that an XML page complies with a particular DTD is part of what can be generated with Centaur if the static semantics of XML is specified in Typol. However, this is not exactly our goal, as we want to give a semantics to Web sites. In the following we consider sites implemented in XML, but the method can easily be extended to HTML. We can consider that the real implementation language of a site is not XML but a language defined by a set of XML tags. The usage of these tags can be constrained by syntactical rules: a DTD. To define the language completely, we now have to find a way of defining semantic rules.
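For comparison, plain DTD validation is also available outside Centaur; a minimal sketch in Python, assuming the third-party lxml library and hypothetical file names, could be:

from lxml import etree

dtd = etree.DTD(open("site.dtd"))        # the syntactical constraints
tree = etree.parse("page.xml")           # one page of the site
if not dtd.validate(tree):               # context-free validation only
    for error in dtd.error_log:
        print(error.line, error.message)

Such a validator only checks the context-free constraints; the semantic rules discussed below go beyond what a DTD can express.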

In our running examples, DTDs have been manually translated into AS [Despeyroux1996], a Centaur formalism used to define modular abstract syntaxes (this could quite easily be performed automatically). This is sufficient for Centaur to generate a syntactical editor for our sites. The next step is to define the static semantics in Typol and to take advantage of the Centaur interface to derive an executable version of this semantics. Executable in this context means that, once we have compiled the semantic rules, we get a tool that can check that a particular site complies with them, providing error messages or warnings if necessary.

3.4 Maintaining Web Sites

In our vision, the role of the creator or administrator of a Web site is quite a bit more important than the traditional webmaster's, as he or she must take care of the semantics of the site.

Here we can take advantage of using an executable semantics: from a specification we can produce tools such as validators, compilers, or debuggers.

Here we reach the main distinction between checking a program, where all concepts (types, numbers, etc.) are precisely defined in a mathematical way, and checking a Web site, where we manipulate natural language and representations of knowledge that cannot be handled by a theorem prover. Fortunately, the way Natural Semantics is conceived and implemented makes it possible to delegate some parts of the deductive process to external tools.

In the context of a Web site, two parameters may evolve: the contents and the semantic constraints. In the latter case, one has to regenerate the compiler itself (i.e. the site validator) before checking that the site is still correct. The Centaur environment generates a generic debugger that can be used to debug the semantics itself.

3.5 Examples

In this section we give two examples of static specifications of Web sites that have been written with the help of Natural Semantics.

Specifying a Thematic Directory

A thematic directory is a way of classifying documents found on the Web and presenting them to users in an organised manner. Such directories can be found as services provided by portals such as Yahoo or Voilà. Topics are presented in a hierarchical manner as categories and sub-categories. Most of the time, documents indexed in such a directory are selected and classified manually. Such directories must evolve frequently, as new documents become available very often and because it is sometimes helpful to reorganise the hierarchy of categories. After each modification of such a directory, one would like to check its integrity.

The whole directory is a tree that contains two different types of nodes. The first type of node corresponds to categories that come with a list of sub-categories. For example, the category Travel may contain the sub-categories Boating, Backpacking, etc. In this case, one wants to check that the sub-categories make sense in the context of the upper-level category, by reference to a thesaurus or an ontology. The second type of node corresponds to leaf categories that come with a list of documents, or more exactly with a list of descriptions of and links to documents. Assuming that a list of keywords is provided for each document, one would like to check that this list is relevant in the context in which the document is referenced. The structure of the site (what we call its abstract syntax) can be described by a DTD, while the site itself is an XML document.
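A hypothetical DTD for such a directory could look as follows (element names are ours, chosen to match the rules below; an actual directory would use its own vocabulary):

<!ELEMENT directory   (category+)>
<!ELEMENT category    (name, (category+ | reference+))>
<!ELEMENT name        (#PCDATA)>
<!ELEMENT reference   (description, keywords, uri)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT keywords    (#PCDATA)>
<!ELEMENT uri         (#PCDATA)>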

To specify the coherence of such a site, one assigns a type to each node. A node has a type if and only if all its subtrees are typed. Type expressions are lists of names that should appear in the thesaurus. Sub-categories must get a type that is "more precise" than the type of the upper category. For example, it is possible to say that the type Travel.Boating is more precise than the type Travel because it makes sense to speak of Boating in the context of Travel according to our favourite thesaurus.

The following two rules are equivalent:


\begin{displaymath}
\frac{\textbf{makesense}(Name, \tau) \qquad \vdash List : \tau\cdot Name}
     {\vdash category(name\ Name, List) : \tau}
\end{displaymath}


\begin{displaymath}
\frac{\vdash List : \tau\cdot Name}
     {\vdash category(name\ Name, List) : \tau}
\quad \textbf{(provided makesense}(Name, \tau)\textbf{)}
\end{displaymath}

The expression category(name Name, List) uses the Typol syntax to describe a schema in which Name and List are variables. It matches pieces of XML text (or tree) like the following one:

<category>
  <name>...</name>
  ...
</category>

A category appears in a context in which some category nodes have already been traversed (except for the top ones, for which the context is empty). The predicate makesense is an external predicate that may call a thesaurus (or an ontology) browser to verify that the relation between categories and sub-categories is respected. The order of premises in the upper part of the rules is not theoretically important: all premises must be proved to allow the inference. When a premise can be viewed as a test or a function call, the second form may be preferred, as it makes clear that the rule is a conditional recursive rule.
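To give an operational reading of these rules, here is a minimal sketch in Python. The node representation, the toy thesaurus, and the makesense implementation are ours, purely illustrative; in practice makesense would query an external thesaurus or ontology browser, and the checker itself is generated from the Typol rules.

# Toy thesaurus: which names make sense under a given type (context).
THESAURUS = {
    (): {"Travel", "Computers"},
    ("Travel",): {"Boating", "Backpacking"},
}

def makesense(name, tau):
    """External predicate: does `name` make sense in the context `tau`?"""
    return name in THESAURUS.get(tuple(tau), set())

def type_category(node, tau=()):
    """Assign a type to a category node, or report an error."""
    name = node["name"]
    if not makesense(name, tau):              # side condition of the rule
        raise ValueError(name + " does not make sense under " + ".".join(tau))
    new_tau = tau + (name,)                   # tau . Name
    for sub in node.get("subcategories", []):
        type_category(sub, new_tau)           # premise: |- List : tau . Name
    return new_tau

With the toy thesaurus above, type_category({"name": "Travel", "subcategories": [{"name": "Boating"}]}) succeeds, while a Boating category placed at the top level would be rejected.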

A reference may be assigned the type $\tau$ if it contains a sufficient number of keywords that make sense for this type. We can call this a confidence degree.


\begin{displaymath}
\frac{\textbf{confidence}(Keywords, \tau, N)}
     {\vdash reference(\ldots, Keywords, Uri) : \tau}
\quad \textbf{(provided } N > 50 \textbf{)}
\end{displaymath}

In this rule, we again use an external predicate, which computes a rating number stating what percentage of the list of keywords is relevant in the current context (category). It is the choice of the webmaster to impose this ratio.
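A minimal sketch of such a confidence predicate follows; the relevance test is passed in as a parameter (for instance, the toy makesense predicate of the previous sketch), whereas a real tool would delegate it to a thesaurus or to some linguistic analysis:

def confidence(keywords, tau, relevant):
    """Percentage of keywords that `relevant` accepts in the context `tau`."""
    if not keywords:
        return 0
    hits = sum(1 for k in keywords if relevant(k, tau))
    return 100.0 * hits / len(keywords)

def type_reference(keywords, uri, tau, relevant, threshold=50):
    """Give the type `tau` to a reference if its confidence is high enough."""
    n = confidence(keywords, tau, relevant)
    if n <= threshold:                    # side condition N > 50 of the rule
        raise ValueError("only %d%% of the keywords fit %s" % (n, ".".join(tau)))
    return tau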

In the above rules, we have chosen to call an external predicate that is viewed as an axiom of the theory defined by the rules. It would also be possible to define the thesaurus as a term, defining access procedures to the thesaurus in Typol as well. In this case, the thesaurus would be considered as a parameter (a hypothesis) in the rules, and the sequents would have the form $\rho \vdash X : \tau$, saying that X has type $\tau$ in the context of a particular thesaurus $\rho$.
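For instance, the conditional category rule above could then be sketched as follows (our reformulation, assuming a makesense predicate that consults the thesaurus term $\rho$):

\begin{displaymath}
\frac{\rho \vdash List : \tau\cdot Name}
     {\rho \vdash category(name\ Name, List) : \tau}
\quad \textbf{(provided makesense}(\rho, Name, \tau)\textbf{)}
\end{displaymath}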

When the proof that the Web site is correct according to the specification cannot be constructed, recovery rules are used to provide error messages or warnings.

Specifying the Coherence of an Institutional Site

Verifying an institutional site, such as that of a research institute, is closer to type-checking a program than the previous example. The hypotheses are the following: an authority, maybe the staff, is in charge of the general presentation of the Web site, which is the official view of the institution; some subparts of the site are then edited by other groups of people, maybe researchers in charge of a particular project. We can see this kind of site as containing a declarative part, and another part that must follow these declarations, even when the first "declarative" part is updated. For example, each team may publish a list of other teams with which it cooperates.

In the following figure, the inner box must be considered as a page accessible by the link A.

[Figure: a page presenting the list of teams ("Here is the list of teams; click on one of them ..."), with one entry per team; the inner box, reached through the link A, is a team page stating "... we cooperate with teams B and M."]

As some teams may disappear, it is reasonable to check that a particular team does not claim to collaborate with a team that does not exist any more.

We are facing the usual problem of forward declarations in programming languages: each team page must be checked in an environment containing the list of all teams. Three possibilities are available using Natural Semantics: the most operational one consists of a two-pass process (one pass to construct the list of teams, a second to check the team pages); a second possibility consists in computing a fixed point; a third one consists in considering the whole site as a context, as shown in the following rules.


\begin{displaymath}
\frac{ListofTeams \vdash ListofTeams}
     {\vdash teamspresentation(Texte, ListofTeams)}
\end{displaymath}

Each time a reference to a team exists, we must check that it is in the legal list:


\begin{displaymath}
\frac{\textbf{appears}(Name, ListofTeams)}
     {ListofTeams \vdash team(Name)}
\end{displaymath}

This approach supposes that references to teams of the institute are correctly tagged in the document.
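To illustrate the most operational of the three options mentioned above, here is a minimal two-pass sketch in Python (the site representation and field names are ours, purely illustrative):

def collect_teams(site):
    """First pass: gather the names of all declared teams."""
    return {team["name"] for team in site["teams"]}

def check_team_references(site):
    """Second pass: every tagged team reference must denote a declared team."""
    declared = collect_teams(site)
    errors = []
    for page in site["pages"]:
        for name in page.get("team_refs", []):    # references tagged as teams
            if name not in declared:              # appears(Name, ListofTeams) fails
                errors.append(page["uri"] + ": unknown team " + name)
    return errors

# Hypothetical site: team M has disappeared but is still referenced.
site = {
    "teams": [{"name": "A"}, {"name": "B"}, {"name": "C"}],
    "pages": [{"uri": "teams/A.xml", "team_refs": ["B", "M"]}],
}
print(check_team_references(site))    # ['teams/A.xml: unknown team M']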

