Aquarelle, WG-6 Some Issues about Metadata

Anne-Marie Vercoustre
INRIA
Abstract

This paper reports on the discussion about metadata that arose during the WG6 meeting on 3 December 1996. It also briefly presents some of the world-wide working groups on metadata, their results and recommendations. Finally, it describes two possible scenarios for the creation of metadata for the Aquarelle folders, and proposes in the Annex a DTD for describing Folder Profiles (i.e. folder metadata).


1 Introduction

Metadata is a description of objects, documents or services, which may contain data about their format and content. It has been used for many years by librarians in the form of catalog records for print publications, abstracts and keywords, or index databases. It is also widely used in archives, museums and for document management in companies.

With the explosion of the WWW, the creation and publication of metadata about digital information has been recognized as being of major importance for improving the precision of document retrieval in a distributed environment. Beyond this acknowledgment, there is a need for more precise agreement on which metadata models and description languages could be used. It is very unlikely that a unique set of metadata could ever be agreed upon, but some common understanding and usage can hopefully be reached (see section 3).

Metadata may be part of the resources themselves or kept separately from them. As far as digital documents are concerned, it could be argued that meta-information for several purposes could be extracted directly from the documents (especially from SGML documents). Even for individual documents this approach would not be appropriate, since it would require the author to provide all the required information. Supposing that this information could be foreseen, some of it could still be unknown to the author (publisher, storage location of the document, etc.). Moreover, some objects such as collections, images and sounds will still require their metadata description to be kept outside the object itself.

In the following we describe the set of metadata that has been identified for describing the folders (hereinafter called the folder profile) and the discussion about how to describe and provide it.

We then give a short presentation of the aims, results and recommendations of two world-wide metadata working groups, known as the Dublin Workshop and the Warwick Workshop, as well as a more semantic approach from Apple.

Finally we propose an SGML syntax for the folder profiles, as well as two possible scenarios for generating these metadata.

2 WG6 discussion on Folder Profile

The basis of the discussion was the Folder Profile description provided by ICS-FORTH at the meeting. We recall here that folders are SGML documents which are managed by so-called Folder servers. Documents could be searched and retrieved using the folder description, or using a more advanced query language involving the document structure (queries using the DTD) and full-text content.

We quickly agreed on identifying three different types of metadata:

As Aquarelle is already strongly SGML-based, it seems natural to describe the metadata using SGML as well, at least as a virtual interface between the Folder Server and the external world. The metadata could easily be displayed as an SGML document, and the extra information that has to be provided could be entered using Grif, the SGML editor which is already part of the system. [1]

More precise scenarios for generating the metadata are proposed and discussed in section 4.

Before addressing the issue of generating metadata, we must be clear about which metadata and which format model to use (SGML being only the selected syntax), and make sure that our choices will be compatible with ongoing international initiatives.

3 Metadata Workshops and Initiatives

Metadata has already attracted a lot of interest and work from the international community. Two workshops have been organized to foster "...a common understanding of the problems and potential solutions ... and promote a consensus on a core set of metadata elements to describe networked resources". The first workshop was organized by OCLC/NCSA in March 1995 in Dublin (Ohio). The result of the workshop was a simple resource description record, widely known as the Dublin Core set [2].

The second workshop [1], organized by UKOLN and OCLC in Warwick in April 1996, was intended to broaden the scope of the first meeting and to identify implementation strategies. It was attended by a mix of computer science, text markup, and library professionals. The focus of the workshop very soon turned to the extensibility issue, in order to support richer descriptions and linkage to other description models.

Another initiative, from Apple [3], concentrates more on the representation of content using a knowledge representation language in the spirit of CycL or KIF, rather than a markup language. These languages are best suited to the classification of documents. The intention is to extract the description from the document content rather than using any external description.

All these initiatives have to be carefully considered, especially if we expect Aquarelle to be accessed from the Web or through a Web-based intranet. However, it is obvious that Aquarelle has an immediate need for a more specific metadata description than, let us say, the Dublin Core set. Yet the Folder Profile should include the Dublin Core set as a minimum, or make it possible to export such a description if required.
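As an illustration of what exporting a Dublin Core description could look like, the following Python sketch maps Folder Profile fields onto Dublin Core element names. It is only a sketch under assumptions: the field names and qualifiers are taken from the draft DTD in the Annex, the mapping table itself is hypothetical and was not agreed at the meeting, and the Dublin Core element names used should be checked against the workshop report [2].

```python
# Hypothetical mapping from (field name, qualifier) in the draft Folder
# Profile DTD to Dublin Core element names. The table is an assumption
# made for illustration; it is not part of the WG6 agreement.
DUBLIN_CORE_MAP = {
    ("FOLD-ID", None): "Identifier",
    ("TITLE", None): "Title",
    ("SUBJECT", None): "Subject",
    ("RELATION", None): "Relation",
    ("AUTHOR", "main"): "Author",
    ("AUTHOR", "alpha"): "Author",
    ("AUTHOR", "publish"): "Publisher",
    ("TYPE", "GENRE"): "ObjectType",
    ("TYPE", "LANG"): "Language",
    ("TYPE", "FORM"): "Form",
}


def export_dublin_core(profile):
    """Return (dc_element, value) pairs for the fields that have a mapping.

    `profile` is a list of (field_name, qualifier, value) triples. Fields
    without a Dublin Core counterpart in the table (e.g. SIZE) are skipped.
    """
    record = []
    for name, qualifier, value in profile:
        element = DUBLIN_CORE_MAP.get((name, qualifier))
        if element is not None:
            record.append((element, value))
    return record


# Example profile with made-up values.
profile = [
    ("FOLD-ID", None, "folder-42"),
    ("TITLE", None, "Some Issues about Metadata"),
    ("AUTHOR", "main", "Anne-Marie Vercoustre"),
    ("AUTHOR", "publish", "INRIA"),
    ("TYPE", "LANG", "en"),
    ("SIZE", None, "12 pages"),
]
dc_record = export_dublin_core(profile)
```

Keeping the mapping in a table rather than in code makes it easy to add a second table for another metadata set, in line with the idea of several external views on the same folder.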

Actually, metadata should be regarded as an a posteriori external view upon documents and collections of documents, provided when publishing in a specific context. It should be possible to describe different metadata sets, adapting to the content or format model as required. This approach is developed with scenario 2 in the next section.

4 Metadata production

We agreed upon exporting the Folder profile as a metadata description using SGML syntax. A DTD for that description is proposed in the Annex. We have added a couple of fields that are part of the Dublin Core, and favored a structure that allows compatibility with other descriptions (optional fields), further extensions (using lists rather than fixed ordered elements), and richer descriptions (repeatable fields).

This section proposes two scenarios for the generation of the virtual metadata description. In scenario 1, the server outputs the metadata in the appropriate SGML format, from internal and external data for the document. In scenario 2, an extended SGML description defines the way metadata have to be calculated or directly input.

4.1 Scenario 1

In this approach the Folder server generates the Folder profile using its internal data structures and generating programs, as shown in Figure 1. Data are of two types:

The generator will extract other metadata straight from the Folder SGML source (metadata extractor) or will calculate it from the document (metadata calculator). It will then generate a virtual SGML document, conforming to the Aquarelle Folder_Profile DTD, which includes all the required metadata.

Figure 1 - Generating Folder Profile: Scenario 1
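The generation step of scenario 1 can be sketched in Python as follows. This is a minimal illustration and not the Folder server's actual design: the <Folder_title> element, the dictionary of internal data and the three function names are assumptions introduced here, and the output uses the element names of the example profile given in section 4.2.

```python
import re


def extract_title(folder_sgml):
    """Metadata extractor: read a value straight from the folder source.

    The <Folder_title> element name is hypothetical.
    """
    match = re.search(r"<Folder_title>(.*?)</Folder_title>", folder_sgml,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def calculate_size(folder_sgml):
    """Metadata calculator: derive a value from the document itself."""
    return f"{len(folder_sgml.encode('utf-8'))} bytes"


def generate_profile(internal_data, folder_sgml):
    """Assemble the virtual profile from server data and the folder source."""
    lines = [
        "<Fold-Prof>",
        f"<Fold-ID>{internal_data['id']}</Fold-ID>",
        f"<Loc>{internal_data['location']}</Loc>",
        f"<Size>{calculate_size(folder_sgml)}</Size>",
    ]
    title = extract_title(folder_sgml)
    if title is not None:
        lines.append(f"<Title>{title}</Title>")
    lines.append("</Fold-Prof>")
    return "\n".join(lines)


# Example with made-up data held by the server and a made-up folder.
internal_data = {
    "id": "F-001",
    "location": "http://WWW.inria.fr/Aquarelle-server/",
}
folder_sgml = "<Folder><Folder_title>Venetian glass</Folder_title></Folder>"
profile_doc = generate_profile(internal_data, folder_sgml)
```

In this arrangement the structure of the profile is fixed in the generating program, so a change to the profile DTD means a change to the program.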

Advantages:

Shortcomings:

4.2 Scenario 2

The second approach is to start with the SGML document to be produced, in the required DTD. All the parts that have to be entered externally by a human can be edited straight into this document, using an SGML editor. The other metadata, e.g. the ones that can be calculated by the server, are specified within the SGML document using an SGML Processing Instruction (starting with <?), which specifies how to build the specific elements for a folder referred to by the variable &Folder.

More precisely:

This Folder profile specification will be interpreted to provide the profile according to the same DTD, as shown in Figure 2.

Figure 2 - Specifying metadata: Scenario 2.

The Folder profile specification is thus a declarative and constructive prescription of the SGML metadata description. The specification is itself an extended SGML document that can be stored as a document.

A full description has the following format:

<!DOCTYPE Fold-Prof SYSTEM "Fold-Prof.dtd">
<Fold-Prof>
<Fold-ID> 
   <?content = get_folder_id(&folder) >
</Fold-ID>
<Loc> 
    "http://WWW.inria.fr/Aquarelle-server/"
</Loc>
<Size>
   <?content = get_folder_size(&folder)  >
</Size>
<Title>
   <?content =
      &Title= Query("Aquarelle_server","Folder_profile",
               "folder.header.Folder_title")
   > [2]
</Title>
<Subject> This document reports on the Aquarelle WG6 meeting on 3
December 1996.
</Subject>
<Subject> The document proposes two scenarios for the production of
Folder Profile (Document Metadata).
</Subject>
  etc.
</Fold-Prof>  
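The interpretation of such a specification can be sketched in Python as follows. The sketch is a simplification under assumptions: it handles only processing instructions of the simple form <?content = function(&folder) >, it leaves the more complex Query form untouched, and the folder dictionary and the bodies of get_folder_id and get_folder_size are placeholders invented for the example.

```python
import re

# Matches the simple form  <?content = some_function(&folder) >  only.
# The variable is written &Folder or &folder in the text, so case is ignored.
PI_PATTERN = re.compile(
    r"<\?content\s*=\s*(\w+)\(\s*&folder\s*\)\s*>", re.IGNORECASE
)


def interpret(spec, folder, functions):
    """Replace each simple processing instruction by its computed value.

    `functions` maps function names used in the specification to Python
    callables taking the folder; an unknown name raises KeyError.
    """
    def evaluate(match):
        func = functions[match.group(1)]
        return str(func(folder))

    return PI_PATTERN.sub(evaluate, spec)


# Placeholder implementations of the server functions named in the example.
functions = {
    "get_folder_id": lambda folder: folder["id"],
    "get_folder_size": lambda folder: len(folder["source"]),
}

spec = """<Fold-Prof>
<Fold-ID>
   <?content = get_folder_id(&folder) >
</Fold-ID>
<Size>
   <?content = get_folder_size(&folder)  >
</Size>
<Subject> Entered by hand in the SGML editor. </Subject>
</Fold-Prof>"""

folder = {"id": "F-001", "source": "<Folder>...</Folder>"}
profile = interpret(spec, folder, functions)
```

The hand-entered parts (here the Subject element) pass through unchanged, so the same specification document serves both as the editing form and as the recipe for the computed fields.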
          

Advantages:

Shortcomings:

We think that the second approach emphasizes the idea of multiple metadata descriptions as external views upon the documents that are provided for publishing in various contexts.

5 Summary

Main recommendations:

Optional:

References

[1]
Juha Hakala, Ole Husby, Traugott Koch, "Warwick framework and Dublin core set provide a comprehensive infrastructure for network resource description", Report from the Metadata Workshop II, Warwick, UK, http://www.ub2.lu.se/tk/warwick.html, April 1996.
[2]
Stuart Weibel, Jean Godby, Eric Miller, Ron Daniel, "OCLC/NCSA Metadata Workshop Report", http://www.oclc.org:5046/conferences/metadata/dublin_core_report.html, Dublin, Ohio, March 1995.
[3]
R.V. Guha, "Meta-Content Format", http://mcf.research.apple.com/hs/mcf.html, 1996.
[4]
V. Christophides, S. Abiteboul, S. Cluet, M. Scholl, "From Structured Documents to novel query facilities", SIGMOD'94, pp. 313-324, ACM, 1994.

Annexes


A Folder Profile DTD (draft)

We have commented mainly on the elements and attributes that were not part of the ICS-FORTH proposal and which have been added for compatibility with the Dublin Core.

<!-- DTD   Fold-Prof --> 
<!ENTITY % doctype "Folder_Profile"     >
<!ELEMENT %doctype; - - (FOLD-ID, LOC, FIELD*,KEYW?,
CATG?,RIGHTS?,VERSION?,REVISION?,STATS?)      >
<!ELEMENT FOLD-ID - O  (#PCDATA)                   >
<!ELEMENT LOC       - O  (#PCDATA)   -- Location -->
<!ELEMENT FIELD     - -  (TITLE | AUTHOR | SIZE | EVENT | TYPE | SUBJECT |
CONTENT | COMMENT | RELATION)                                      >
<!ELEMENT TITLE    - O  (#PCDATA)  -- repeatable -->
<!ELEMENT AUTHOR   - O  (#PCDATA)*  -- repeatable -->
<!ATTLIST AUTHOR  Role (main, alpha, publish) alpha --  main: the main
author
        alpha: by alphabetic order
        publish: the publisher
        Agent: other contributors                -->
<!ELEMENT SIZE  - O (#PCDATA) -- various metrics -->
<!ELEMENT TYPE     - O  (#PCDATA)  -- repeatable, can be more general than
just the DTD        -->
<!ATTLIST TYPE  Case (DTD,GENRE, LANG, FORM) DTD -- DTD:  the name of the
DTD must be given 
    GENRE: such as home page, novel, poem, report
    LANG: Language of the document
    FORM: such as text/html, ASCII, Postscript   -->
<!ELEMENT SUBJECT  - O  (#PCDATA)  -- repeatable -->
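To make the ordering and optionality rules of the top-level content model concrete, the following Python sketch checks a sequence of child element names against (FOLD-ID, LOC, FIELD*, KEYW?, CATG?, RIGHTS?, VERSION?, REVISION?, STATS?). It is an illustration only and no substitute for a validating SGML parser: it looks at the order of the direct children alone, and it assumes that element names are compared without regard to case, as in the SGML reference concrete syntax.

```python
import re

# Top-level content model of the draft DTD, written as a regular
# expression over the space-separated names of the child elements:
# (FOLD-ID, LOC, FIELD*, KEYW?, CATG?, RIGHTS?, VERSION?, REVISION?, STATS?)
CONTENT_MODEL = re.compile(
    r"FOLD-ID LOC( FIELD)*( KEYW)?( CATG)?"
    r"( RIGHTS)?( VERSION)?( REVISION)?( STATS)?"
)


def matches_content_model(children):
    """Return True if the child names follow the top-level content model."""
    names = " ".join(name.upper() for name in children)
    return CONTENT_MODEL.fullmatch(names) is not None


# FOLD-ID and LOC are required and come first; FIELD may repeat.
valid = matches_content_model(["Fold-ID", "Loc", "Field", "Field", "Rights"])
# RIGHTS may not precede KEYW.
wrong_order = matches_content_model(["FOLD-ID", "LOC", "RIGHTS", "KEYW"])
# LOC is required.
missing_loc = matches_content_model(["FOLD-ID", "FIELD"])
```

The check shows why the repeatable FIELD wrapper is convenient: any number of titles, authors or subjects can be added without touching the fixed part of the model.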

Footnotes

[1]
Unfortunately, the task of writing the DTD and generating the editor environment has not been planned, and no one seems ready to provide the necessary resources.
[2]
We have introduced here a more complex syntax than in Ex.1 and Ex.2. The variable &Title contains the result of the query, which can be reused later in the specification. The parameters of the query are the name of the database, the name of the DTD, and the actual query. More complex queries could be used instead of a simple path.