The Hypertextual Electronic Library of Mathematics

Main Design Provisos

The next, compelling step for every tool aimed at the automation of formal reasoning and the mechanization of mathematics is its integration with (and exploitation through) the World Wide Web. This simple statement already entails a few major methodological provisos:

Markup and semantic components

The kernel system is only meant to support the basic management of mathematical documents through their lifespan; in order to support additional functionalities, the intelligence encoded in the markup should cover all potential uses to which such documents will be put. To this aim, we must clearly identify all semantic components of a typical mathematical document and classify them into logical categories (content, structure, presentation, notation, metadata, ...). As a main methodological assumption, we keep, as far as possible, a clear distinction between all these components. Similarly, we aim to reuse all existing technology and recommended standardization formats (such as MathML or RDF) for representing and modeling suitable parts of the information. We have already identified the following main components:

Figure 1:


For our development, we have adopted Coq as a paradigmatic example of a typical logical framework. The first phase of HELM has been centered on the problem of exporting Coq objects into XML format.

This is already interesting in its own right, for at least three reasons:

What should NOT be exported from Coq?

Not all the information useful to Coq should be exported. In particular, we have chosen not to export:
Parsing and pretty-printing rules and information
Parsing rules depend on the particular proof engine: to be able to use proof engines other than Coq, we must not rely on Coq's own rules. Similarly, pretty-printing rules should depend only on the browser.
Tactics-related information
This information, too, is proof-engine dependent.
Redundant information added by Coq to the terms of CIC
Coq adds in several places a lot of information to CIC terms in order to speed up type-checking. This information is clearly redundant: it is useless for browsing and could be useless for other type-checkers as well.

What should be exported from Coq?

The following Coq objects are exported:
Constants (definitions/theorems/axioms)
Constant objects of Coq are used to represent definitions, theorems and also axioms. Definitions and theorems are syntactically identical: both have a body and a type. The only difference is semantic: theorems are usually opaque (only their type is used in CIC terms), because of proof-irrelevance, while definitions are transparent (their body is substituted for their occurrences in other terms during type checking). Axioms, instead, are constants with a type but without a body. During extraction, we choose to discriminate only axioms from definitions and theorems, which become indistinguishable once extracted. We suppose that there will exist another level (a semantic one) responsible for ``marking'' non-axiom constants as theorems, definitions, lemmas, facts, etc. Due to the sectioning machinery of Coq, constants defined inside a section occur in the environment many times: once in the uncooked form and once for each section in their section path. We chose to export a constant in its uncooked form, adding, however, the information (the list of variables on which it depends) obtained from the most cooked form.
Variables
Variables have only a type and not a body. A variable behaves like an axiom inside the theory where it is defined and as a parameter when that theory is used.
(Co)Inductive Definitions
In Coq, one can also define blocks of mutual (co)inductive definitions. Each definition inside such a block has a name, a type (also called arity in Coq) and a possibly empty list of constructors. Each constructor has a name and a type. A simple example is the inductive type of naturals, whose name is nat, whose type is Set and whose constructors are O of type nat and S of type (nat -> nat). Blocks, like constants, may depend on variables; the list of variables (parameters) on which all the definitions in the block depend is also exported from Coq.
Proofs in progress
We chose to export unterminated proofs as well. Like a terminated theorem, an unterminated proof has a name, a type and a body; moreover, it also has a list of conjectures on which the body depends. Each conjecture has a type but not a body: to end the proof, a body must be provided for each conjecture.
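The exported objects listed above can be summarized by a small OCaml datatype. This is only an illustrative toy (the concrete obj datatype described later is richer, and terms are left abstract here); all names are ours:

```ocaml
(* Illustrative toy datatype for the exported objects; the real
   internal representation is richer.  Terms are left abstract. *)
type term = Term of string            (* placeholder for a CIC term *)

type obj =
  | Definition of string * term * term          (* name, body, type      *)
  | Axiom of string * term                      (* name, type (no body)  *)
  | Variable of string * term                   (* name, type            *)
  | CurrentProof of string                      (* name                  *)
      * (int * term) list                       (* conjectures: no, type *)
      * term * term                             (* body, type            *)
  | InductiveDefinition of
      (string * term * (string * term) list) list
                                  (* each: name, arity, constructors *)

(* An axiom is exactly a constant without a body. *)
let is_axiom = function Axiom _ -> true | _ -> false

let () =
  let ax = Axiom ("A1", Term "(V1,V2:Prop)V1 -> V2") in
  assert (is_axiom ax);
  assert (not (is_axiom (Variable ("V1", Term "Prop"))));
  print_endline "ok"
```

Note how the choice of discriminating only axioms is reflected in the types: Definition covers both theorems and definitions, which are indistinguishable after extraction.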

Sections and cookings in Coq

In Coq it is possible to use sections. To make a section, you must open it, fill it and then close it. Filling a section means declaring in it other objects, i.e. variables, constants, local definitions, mutual (co)inductive types or subsections. So the structure describing sections and subsections in Coq is naturally a tree. The names given to objects in the current environment must always be unique: for example, if an axiom A is defined in a supersection, you cannot define another one in the current section. When you close a section, though, all the variables and the local definitions defined in it are discharged, and all the constants and (co)inductive definitions in it are cooked, i.e. lambda-abstracted with respect to all the discharged variables (ingredients) that are used, directly or indirectly, inside their bodies. The body of a local definition, instead, is substituted for all the references to it inside each CIC term of a non-discharged object. Example:
Section a.
 Variables V1,V2 : Prop.
 Axiom A1 : V1 -> V2.
 Axiom A2 : (P:Prop)P -> V1.
 Local Apply := [x:V1->V2][y:V1](x y).
 Theorem T : (P:Prop)P -> V2.
  Exact [P:Prop ; H:P](Apply A1 (A2 P H)).
End a.
After End a. the variables V1 and V2 and the local definition Apply are discharged. The other objects become equal to the ones that would be obtained, without using sections, in this way:
Axiom A1 : (V1,V2:Prop)V1 -> V2.
Axiom A2 : (V1,P:Prop)P -> V1.
Theorem T : (V1,V2,P:Prop)P -> V2.
 Exact [V1:Prop ; V2:Prop ; P:Prop ; H:P](A1 V1 V2 (A2 V1 P H)).
Note that A2 is abstracted only w.r.t. V1, because it does not depend on V2, and that T is abstracted also w.r.t. V1, on which it depends only implicitly through A1 and A2. In order to improve performance, Coq's implementation of cooking is not uniform on all the objects:


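The essence of cooking, i.e. abstracting an object over the discharged variables it actually uses, can be sketched on a toy term language. This is only an illustration of the idea: the toy tracks only direct uses and references variables by name, while the real implementation works on full CIC terms and also tracks indirect uses through other cooked objects.

```ocaml
(* Toy sketch of cooking: lambda-abstract a body over the discharged
   variables (ingredients) it uses.  Only direct uses are tracked here;
   real cooking also follows indirect uses through other objects. *)
type term =
  | Var of string                 (* a section variable           *)
  | App of term * term
  | Lam of string * term          (* [x:...]body, types omitted   *)

let rec uses v = function
  | Var x -> x = v
  | App (f, a) -> uses v f || uses v a
  | Lam (_, b) -> uses v b

(* Abstract [body] over every ingredient in [vars] that it uses. *)
let cook vars body =
  List.fold_right
    (fun v t -> if uses v t then Lam (v, t) else t)
    vars body

let () =
  (* Mirrors the example: A2 uses only V1, so it is abstracted
     only w.r.t. V1, not V2. *)
  let a2 = App (Var "V1", Var "P") in
  match cook ["V1"; "V2"] a2 with
  | Lam ("V1", _) -> print_endline "ok"
  | _ -> assert false
```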
Here is the current draft of a DTD for the low-level structure of CIC terms and objects:
<?xml encoding="ISO-8859-1"?>

<!-- DTD FOR CIC OBJECTS:                                            -->
<!--  First draft: September 1999, Claudio Sacerdoti Coen            -->
<!--  Revised: February 3 2000, Claudio Sacerdoti Coen, Irene Schena -->

<!-- CIC term declaration -->


<!-- CIC objects: -->

<!ELEMENT Definition (body, type)>
<!ATTLIST Definition
          name       CDATA      #REQUIRED
          params     CDATA      #REQUIRED
          param_mode (POSSIBLE) #IMPLIED>

<!ELEMENT Axiom (type)>
<!ATTLIST Axiom
          name   CDATA #REQUIRED
          params CDATA #REQUIRED>

<!ELEMENT CurrentProof (Conjecture*,body,type)>
<!ATTLIST CurrentProof
          name CDATA #REQUIRED>

<!ELEMENT InductiveDefinition (InductiveType+)>
<!ATTLIST InductiveDefinition
          noParams NMTOKEN #REQUIRED
          params   CDATA #REQUIRED>

<!ELEMENT Variable (type)>
<!ATTLIST Variable
          name CDATA #REQUIRED>

<!-- Elements used in CIC objects, which are not terms: -->

<!ELEMENT InductiveType (arity,Constructor*)>
<!ATTLIST InductiveType
          name      CDATA        #REQUIRED
          inductive (true|false) #REQUIRED>

<!ELEMENT Conjecture %term;>
<!ATTLIST Conjecture
          no NMTOKEN #REQUIRED>

<!ELEMENT Constructor %term;>
<!ATTLIST Constructor
          id CDATA #REQUIRED>

<!-- CIC terms: -->

<!ELEMENT LAMBDA (source,target)>

<!ELEMENT PROD (source,target)>

<!ELEMENT CAST (term,type)>

          value NMTOKEN #REQUIRED>

          value CDATA #REQUIRED>

<!ELEMENT APPLY (%term;)+>

          relUri CDATA #REQUIRED>

          no NMTOKEN #REQUIRED>


          uri CDATA #REQUIRED>

          uri CDATA #REQUIRED>

          uri    CDATA   #REQUIRED
          noType NMTOKEN #REQUIRED>

          uri      CDATA   #REQUIRED
          noType   NMTOKEN #REQUIRED
          noConstr NMTOKEN #REQUIRED>

<!ELEMENT MUTCASE (patternsType,inductiveTerm,pattern*)>
<!ATTLIST MUTCASE
          uriType CDATA   #REQUIRED
          noType  NMTOKEN #REQUIRED>

<!ELEMENT FIX (FixFunction)+>
<!ATTLIST FIX
          noFun NMTOKEN #REQUIRED>

<!ELEMENT COFIX (CofixFunction)+>
<!ATTLIST COFIX
          noFun NMTOKEN #REQUIRED>

<!-- Elements used in CIC terms: -->

<!ELEMENT FixFunction (type,body)>
<!ATTLIST FixFunction
          name     CDATA   #REQUIRED
          recIndex NMTOKEN #REQUIRED>

<!ELEMENT CofixFunction (type,body)>
<!ATTLIST CofixFunction
          name     CDATA   #REQUIRED>

<!-- Syntactic sugar for CIC terms and for CIC objects: -->

<!ELEMENT source %term;>

<!ELEMENT target %term;>
<!ATTLIST target
          binder CDATA #IMPLIED>

<!ELEMENT term %term;>

<!ELEMENT type  %term;>

<!ELEMENT patternsType  %term;>

<!ELEMENT inductiveTerm  %term;>

<!ELEMENT pattern  %term;>

<!ELEMENT body  %term;>
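As an example of the intended encoding, the uncooked axiom A1 : V1 -> V2 of the section example above could be rendered roughly as follows. This is a hand-written sketch: the VAR element (a reference to a section variable) and the form of its relUri value are assumptions, since their declarations fall in the parts of the DTD not shown here.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Hand-written sketch: the uncooked axiom A1, with params
     recording the variables on which it depends.  The VAR element
     and the relUri values are assumptions. -->
<Axiom name="A1" params="V1 V2">
 <type>
  <PROD>
   <source><VAR relUri="V1.var"/></source>
   <target><VAR relUri="V2.var"/></target>
  </PROD>
 </type>
</Axiom>
```

Note how the syntactic sugar (type, source, target) makes the file readable to anyone knowing CIC, at the price of verbosity.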

2.1.9   The code

We have firmly decided not to modify Coq at all in order to export XML files. So we have written a brand new ``tactic'' named Xml (to be stored in tactics/contrib/xml). The tactic depends heavily on the internal data structures of Coq, which are currently changing version after version. So we hope that the tactic will be accepted into the set of standard contributions to Coq (so that it will be taken into account when modifying Coq's code). Once loaded via ``Require Xml.'', four new commands become available to Coq users: ...EXPLANATION OF THE TACTICS... To implement these new commands we have written some OCaml modules and a vernacular file Xml.v:

Figure 1:


Once exported in XML, we need to parse CIC objects and load them in memory in an internal, useful representation. While the XML representation should be kept as small and non-redundant as possible (redundancy means more space occupied and more consistency checks), the internal representation is meant to speed up the subsequent accesses to the information (e.g. during typechecking), at the price of a heavier parsing phase. Our parser has been built using Markup. Markup is a generic validating XML parser written by ??? entirely and from scratch in OCaml. For a validating XML parser, it is really fast (as fast as the nsgmls parser written by J. Clark for SGML and faster than the validating XML parser XML::Parser ??? written in Perl). In absolute terms, though, it is not that fast (it takes about six minutes to parse the whole standard library of Coq): this is due to the huge size of the XML files (the standard library of Coq is about 80Mb in uncompressed form and reduces to 3-4Mb once compressed) and perhaps also to validation.

How to reduce parsing time

First of all, do we really need a validating XML parser? The first answer seems to be yes, because it is really important for a system like ours to be able to detect errors. But in fact many checks have subsequently been added during parsing, and more will be added in subsequent phases, up to type-checking the entire proof. So, in the next phases, when we will have a clear view of how heavy parsing is on the entire computation, we will go back and reconsider the use of a validating parser. Another point is how much syntactic sugar we should put in XML files. Up to now we have put a lot (50%) of syntactic sugar in XML files, so that the files are easily readable by anyone knowing CIC. For debugging purposes this is great. But, once the development is finished, should XML files really be easily readable by humans? This is one of the aims (a lesser one, though) of XML, but we must remember that in CIC there is no such thing as a short proof: non-trivial proofs are at least some hundreds of lines of XML long and really very difficult to understand without a proof-checker. It is also quite impossible for anyone to write a proof directly in CIC without the help of a proof engine. So, in the next phases, we will also reconsider whether to get rid of the syntactic sugar.

The internal representation

Once parsed, the objects of CIC are stored internally using two concrete data-types (term for the term level and obj for the object level) whose constructors are chosen to match exactly the entities present in the CIC DTD. All uris are translated during parsing into absolute uris, even if they are encoded in the XML file as relative ones. In subsequent phases we will need a fairly elaborate data-structure for uris, to be able, for example, to determine quickly whether two uris are equal or whether one uri is a prefix of another. So we introduce a UriManager, i.e. an OCaml module declaring an abstract data type uri and the functions to use it.
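The UriManager interface just described can be sketched as follows. This is only a minimal sketch with a naive string-backed implementation; the actual module and the "cic:/" uri scheme shown in the usage example are assumptions.

```ocaml
(* Minimal sketch of a UriManager: an abstract type uri plus the two
   operations motivated above (equality and prefix tests).  The
   string-backed implementation is a naive placeholder. *)
module UriManager : sig
  type uri
  val uri_of_string : string -> uri
  val string_of_uri : uri -> string
  val eq : uri -> uri -> bool
  val is_prefix : uri -> uri -> bool
end = struct
  type uri = string
  let uri_of_string s = s
  let string_of_uri u = u
  let eq = String.equal
  let is_prefix p u =
    let lp = String.length p in
    String.length u >= lp && String.sub u 0 lp = p
end

let () =
  let open UriManager in
  (* The "cic:/..." scheme is an assumption used for illustration. *)
  let u1 = uri_of_string "cic:/Coq/Init/Datatypes/nat" in
  let u2 = uri_of_string "cic:/Coq/Init/Datatypes/nat" in
  let pre = uri_of_string "cic:/Coq/Init" in
  assert (eq u1 u2);
  assert (is_prefix pre u1);
  print_endline "ok"
```

Keeping uri abstract allows the naive representation to be replaced later by a more elaborate data-structure (e.g. with hash-consing) without touching any client module.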

The code

Markup's main function takes as input the name of the XML file to parse and a bunch of options; then it parses and validates the file, building the concrete syntax tree that is returned. There are two possible ways to use the tree:
Functional style:
the type of the concrete syntax tree is a concrete data-type, so it is simple to write a bundle of recursive functions that navigate the tree retrieving the needed information.
Object Oriented style:
it is possible to write a set of classes with the same interface, instantiate each class to a clonable object called a prototype, and tell Markup to link each node of the tree to a copy of a particular prototype. The map from XML-element names to prototypes is called a DOM in Markup. Therefore, it is possible to access the tree in an object-oriented way simply by calling the methods of the objects linked to each node of the tree.
Neither of the two solutions is better than the other. The functional style takes less memory (no prototypes and objects) than the object-oriented style, but it may be less efficient: for each node of the tree, the recursive functions used to navigate it must try to match the XML-element name to decide which action should be performed. This means a lot of string comparisons, which are notoriously slow. In the object-oriented style the string matches are done only once, when choosing which prototype to clone, and a hashtable is used to speed up the lookup. Thereafter, for each method called, only a (dynamic) method dispatch is needed, which is not so heavy. From the point of view of software engineering, though, the object-oriented style seems preferable.

What we need is a parser able to translate the generic XML concrete syntax tree into our internal representation of the same tree (defined in the OCaml module Cic). The DTD of CIC has two well-distinct levels: the level of the objects and that of the terms. We have chosen to use the functional style when accessing the object level (which has only a few XML-element types) and the object-oriented style when accessing the term level. So we have split the implementation of the parser into three different modules:
The top-level parser. Its main function, given the name of an XML file containing a CIC object, returns the internal representation of that object once parsed. This should be the only module accessed from other OCaml modules; the next two modules should be used directly only by this one.
The object-level parser. This module defines the recursive functions that convert the object-level nodes of Markup trees to the internal representation. It uses the next module every time a node has a subnode at the term level.
The term-level parser. It defines the DOM used to translate a term-level node of Markup to the internal representation.
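The performance tradeoff between the two access styles discussed above can be sketched as follows; all function and element-handler names are invented for illustration, and the object-oriented style is approximated with a hashtable of closures.

```ocaml
(* Sketch of the two dispatch styles; all names are invented. *)

(* Functional style: one string comparison cascade per node,
   repeated at every visit of the tree. *)
let visit_functional node_name =
  match node_name with
  | "LAMBDA" -> "handle lambda"
  | "PROD"   -> "handle product"
  | "APPLY"  -> "handle application"
  | _        -> "unknown"

(* Object-oriented style, approximated with a hashtable of closures
   standing for the prototypes: the string lookup is paid only once,
   when each node is linked to its handler; afterwards only a cheap
   indirect call is needed per node. *)
let dom : (string, unit -> string) Hashtbl.t = Hashtbl.create 17
let () =
  Hashtbl.add dom "LAMBDA" (fun () -> "handle lambda");
  Hashtbl.add dom "PROD"   (fun () -> "handle product");
  Hashtbl.add dom "APPLY"  (fun () -> "handle application")

let link node_name =            (* done once per node at build time *)
  Hashtbl.find dom node_name

let () =
  let handler = link "PROD" in  (* the string lookup happens here   *)
  assert (handler () = "handle product");
  assert (visit_functional "PROD" = "handle product");
  print_endline "ok"
```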

Figure 2:

(To do: add the explanation of how relative uris are resolved; change get_term in CicParser2 to get_obj.)

Overall architecture

Figure 3:

Figure 3 shows the overall architecture of HELM at this phase. At this early stage we consider only the term and object levels, i.e. the single theorem, axiom, and so on. Obviously we will need to build another level, the theory level, in which whole theories (structured sequences of theorems, axioms, etc.) will be taken into account. As said above, we want to isolate the modules depending on CIC, in order to be able to easily substitute the logical framework with another one.

At the top level we will have many interfaces to browse and typecheck the single objects. At present we plan at least to provide a textual interface and a gtk interface. We also require that adding a brand new top-level interface should be easy and should not require modifications to lower-level modules.

Then we have some modules that are strictly dependent on the logical framework. The pretty printer(s) (CicPp) translate the internal format into something that can be used by the top level (MathML, for example); probably many pretty printers will be developed. The type-checker (CicTypeChecker) proof-checks objects once they are put in the internal format. Every time one of these modules needs an object, it requests it from the cache (CicCache). If a cache miss occurs, the cache is responsible for asking the getter (Getter) to retrieve the XML file describing the object from the Internet; once retrieved, the XML file is passed to the parser, which returns the internal representation of the object to be put into the cache. As just said, the parser, the pretty-printer, the cache and the type-checker depend heavily on the logical framework and on the internal representation of its objects, which will be described by the Cic module. Instead, the Getter and the UriManager modules should be independent of the logical framework.
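The interplay between cache, getter and parser described above can be sketched as follows. The signatures are simplified guesses, not the real interfaces, and the "parsed object" is a plain string standing in for the internal representation.

```ocaml
(* Sketch of the CicCache / Getter / parser interplay; the
   signatures are simplified guesses, not the real interfaces. *)
module type GETTER = sig
  val getxml : string -> string      (* uri -> local XML file name *)
end

module type PARSER = sig
  val obj_of_xml : string -> string  (* XML file -> internal object (stub) *)
end

module MakeCache (G : GETTER) (P : PARSER) = struct
  let cache : (string, string) Hashtbl.t = Hashtbl.create 503

  (* On a cache miss, ask the getter for the XML file, parse it and
     store the result; nothing is ever flushed in this trivial version. *)
  let get_obj uri =
    match Hashtbl.find_opt cache uri with
    | Some obj -> obj
    | None ->
        let obj = P.obj_of_xml (G.getxml uri) in
        Hashtbl.add cache uri obj;
        obj
end

(* Trivial instantiation in the spirit of the current test setup:
   uris and file names coincide, so the getter does nothing at all. *)
module TrivialGetter = struct let getxml uri = uri end
module StubParser = struct let obj_of_xml file = "parsed:" ^ file end
module Cache = MakeCache (TrivialGetter) (StubParser)

let () =
  assert (Cache.get_obj "nat.xml" = "parsed:nat.xml");
  assert (Cache.get_obj "nat.xml" = "parsed:nat.xml");  (* cache hit *)
  print_endline "ok"
```

Functorizing the cache over the getter and the parser keeps the logic of cache misses independent of both the retrieval mechanism and the logical framework.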

In order to test the parsing modules and to debug the tactic developed to export terms from Coq, during this phase we have implemented all the remaining modules (except the typechecker) in a trivial way: uris and filenames are the same thing, so that the getter actually does nothing at all and the UriManager is easily coded. The cache is simply implemented as a hashtable (nothing is ever flushed from the cache). Finally, the pretty-printer translates the internal representation to the same textual description used by the Coq system, and the two interfaces simply display that description, either in a terminal (Experiment) or in a gtk text area (GtkInterface).