The Hypertextual Electronic Library of Mathematics

Content:

Objectives
Contribution to the European Research Programme
Innovation

Objectives

The purpose of the project is the development of a suitable technology for the creation and maintenance of a virtual, distributed, hypertextual library of structured mathematical knowledge, eventually passing through the eXtensible Markup Language. The final aim is to allow mathematical documents to be served, received, and processed on the Web, just as HTML has enabled this functionality for text.

In contrast with current digital libraries, where the real document is usually considered as a single ``text'' element, preventing any automatic elaboration of its content, as well as any form of internal navigation, our purpose is that of modeling the inner structure of mathematical documents, at different levels of detail.

For the deepest levels, where the structure of every formula and every logical step is made explicit, we shall largely rely on the experience of the current tools for the automation of formal reasoning and the mechanization of mathematics (proof assistants and logical frameworks). So, although the emphasis of the project is not on formal mathematics (and in particular we have no foundational ambitions), the proposed structure should be compatible (when restricted to suitable subsets) with most of the existing applications for the mechanization of mathematics, leading to a natural integration of these tools with the most recent technologies for the development of web applications, electronic publishing and metadata modeling. On the other side, any of these tools would immediately provide a simple authoring system for documents of the library.

As we already mentioned, the eXtensible Markup Language, which is rapidly imposing as the main technological tool for all networking applications involving structured information, will play a pivotal role in our project.

XML is a subset of the Standard Generalized Markup Language (SGML), an international Standard which has been widely used in high-end areas of information management and publishing. XML, whose development started in 1996 by a Working Group under the aegis of the World Wide Web Consortium, was actually conceived as a streamlined version of SGML designed to simplify transmission and manipulation of structured documents over the Web. The purpose of XML and SGML is to encode information according to their structure and content, via markup tags and hyperlinks. The markup makes possible manipulation, analysis, storage, linking, and transformation of all media, obviously comprising text. XML and SGML are not programming languages: they do nothing, per se. They are just a way for describing and structuring information. For example, a well known instance of SGML is the HyperText Markup Language (HTML). In this particular case, markup tags are used to structure information according to its expected rendering by a browser. But in XML (SGML) there is no intended semantics for tags, and no limitation on the information that can be contained in markup; what is conveyed varies from application to application, and can be only fixed by convention or agreement (as in the case of HTML).

Note also, by the way, the clear limitations of HTML, from the point of view of our project (which exactly coincide with the usual motivations for adopting XML). HTML has been merely conceived to enable the transmission and visualization of hypertext documents across a network, but is not suitable for representing the structure of the information contained in the document for purposes other than display. So HTML is good for little beyond its original purpose of specifying the overall layout of a document, and it is particularly bad at communicating information to computational agents. Of course, while waiting for a new generation of (real) XML compliant web browsers, i.e. fully supporting the eXtensible Style Language we shall not be able to completely get rid of HTML for displaying and browsing. However, the HTML counterpart can be generated by a suitable transformation of the XML document, and if the browser is at least supporting the style-sheet part of XSL, the transformation can be computed ``on the fly'', as a simple style-sheet application. The design of the style-sheet is obviously another objective of our project. The feasibility of describing mathematical structures using a markup language is already testified by the MathML project. The Mathematical Markup Language is an instance of XML for describing mathematical expressions capturing both its notation and content. The difference with respect to our proposal should also be clear: while the emphasis of MathML is on mathematical expressions, we are mainly concerned with mathematical developments, that not only means precisely structured sets of mathematical entities (that requires a suitable Document Type Description), but also metadata information, in order to support effective searching facilities.

The importance of metadata to support control and management of large developments and collections of documents (especially for improving the possibilities of document retrieval and the search precision) has been by now clearly evinced (see for instance the Dublin Core System.

Metadata is a description of objects, documents or services providing data about their form and content. It may be part of the resources themselves or kept separately from them.

The Resource Description Framework (RDF) provides a general model for representing metadata as well as a syntax for encoding and transporting this metadata in a manner that maximizes the interoperability of independently developed Web servers and clients.

The broad goal of RDF is to define a mechanism for describing resources that makes no assumptions about a particular application domain, nor defines (a priori) the semantics of any application domain. The definition of the mechanism is domain neutral, yet the mechanism is suitable for describing information about any domain, and in particular about the specific domain of mathematical knowledge. The definition of this model and its integration with the other component of our system is still another goal of the project.

It is finally interesting to observe how most of the different pieces of technology under development at W3C naturally fit together in our project. It is among our objectives to submit to the the World Wide Web consortium a proposal for the creation of a working group on this subject.

In summary, HELM aims at the definition of a layer of DTD's specification and a RDF model for structured mathematical knowledge, suited to the creation and maintenance of a distributed, hypertextual, electronic library of mathematics. We {\em shall not} develop the library itself (that would obviously require the contribution of hundreds of people), except for a small subset for validation and demonstration purposes.

On the technological side, we shall develop all the infrastructure required for the consultation of the library via a usual www-browser (including navigation and searching facilities).

On the other side we shall interface some current proof assistants with the XML layer, mainly for authoring purposes. In other words, these applications will be able to produce XML documents conforming to the DTD (and read them back, whenever defined according to a suitable subset).

Contribution to key action objectives of the European Research Programme

The Project contributes to build a user friendly information society, and in particular it meets the following general objectives of this program:

(multimedia content) confirming Europe as a leading force in this field, realizing the potential of its creativity and culture;
(essential technologies and infrastructure) enabling technologies which are the foundations of the information society, driving their development, enhancing their applicability, and accelerating their take up in Europe.

As a matter of fact, the project is based in an essential way on the use of most part of the recent recommendations of the World Wide Web Organization for web publishing and human-computer interaction (XML, XSL, XLL, name space, MathML, RDF, etc.). In particular, we aim to prove how all these specifications naturally fit together, when trying to build a full, integrated description (comprising content, notation, metadata, etc.) of a given field of knowledge. At our knowledge, the project is the first of the kind, and could become a paradigmatic example in the integrated use of these technologies.

The project also addresses most of the issues of the multimedia content key action, namely: electronic publishing, digital heritage and cultural content, education, information access, filtering and handling. Actually, all these aspects are and must be covered in our project, in order to reach our objectives. In particular, the educational potential of our system should not be neglected either: it could become an essential tool for a wider and more friendly dissemination of mathematical knowledge. For instance, if supported by a suitable technology, proving theorems in a proof assistant could be as amusing as playing a video game. We imagine bunches of young researchers contributing to the free development of the library for the mere gratification of seeing their name as actual editor (or, why not, original author) of a specific fragment.

Finally, the project is aimed to improve the access by students and professionals to the fast-growing mathematical knowledge base, allowing mathematical documents to be retrieved, served, and processed directly on the Web. Moreover, our system is meant to be compatible with most of the existing tools for the mechanization of mathematics and the automation of formal reasoning (proof assistants and logical frameworks). The possibility to build coherent sub-libraries of formal mathematical developments would provide an essential (and unique) added value to the library itself, making of Europe a leader in this area.

Maybe, having the possibility to process, analyze and elaborate mathematical structures as data, the time will come when we shall finally be able to start a completely new and exciting field of research on mathematics: namely a scientific, empirical study on the real structure of mathematical entities, and the `way of thinking' of mathematicians.

Innovation

The main technical novelty of the project is in its synergy of different scientific communities and research topics: digital libraries, web publishing, logical frameworks.

From the point of view of web publishing, our project is the first attempt to provide a comprehensive description, from content to metadata, of a given field of knowledge (in our case mathematics), in order to enhance its accessibility, exchange and elaboration via the world wide web. To this aim, we shall put to use most of the technologies recently introduced by the W3C: XML, DOM, XSL, XLL, name spaces, MathML, RDF, etc. From this respect, the project is first of all a complex test for all these technologies, and should hopefully become an example of `best practice' in their use. Here, the emphasis on mathematics is pretty marginal: the final architecture is likely to be extendible to other fields of structured information (such as computer software, for instance, or CORBA-like systems).

From the point of view of digital libraries, our work is aimed to exploit all the potential functionalities offered by the web, and in particular a more integrated use of its browsing and searching facilities. The library is not merely seen as a more or less structured collection of texts, but as a virtual structure inside which we can freely navigate, jumping for instance from and entity to its definition, or peeping inside some information at deeper and deeper levels of details (such as different levels of detail of a proof). This is similar to what we currently do with HTML texts, but in order to enhance the effectiveness of the consultation, we clearly need a good metadata model of the information. Moreover, in such an integrated view, it is hardly conceivable to just apply some `general purpose' metadata model (like the Dublin Core system, say): the metadata model must be eventually specialized to the actual structure of the information it is supposed to model (and more structure we have on the information, more relevant metadata we can usually infer on the document). For instance, metadata could contain the whole signature of a given module of mathematical knowledge. The usual motivation for keeping metadata simple and general is that it is usually difficult to add these informations by hand; but in our case a large part of metadata is supposed to be extracted automatically by the (structured) text itself, allowing for pretty complex metadata models.

Finally, a main aspect of our project is the integration with current tools for the automation of formal reasoning and mechanization of mathematics (proof assistant and logical frameworks). This integration has a mutual benefit. From the point of view of the mathematical library, the first and fundamental role of these systems is that of providing friendly authoring tools (for instance, our `core' library will be automatically extracted from existing libraries of these systems). The relevance of this point should not be underestimated: as a matter of fact, the main reason for the failure of complex markup modelings is usually the lack of suitable authoring tools (it is often painful to add the markup by hand). Of course, they can also provide other functionalities (like automatic proof checking) on fragments of the library (typically, the fragments generated by the tool itself, in its specific logical dialect).

On the other side, there is a compelling need of integration between the current tools for automation of formal reasoning and mechanization of mathematics and the most recent technologies for the development of web applications and electronic publishing. As a matter of fact, the next, compelling step for all these tools is their integration with (and exploitation through) the World Wide Web. From this respect, XML, which is rapidly imposing as a pivotal technology in the future development of all Internet applications, and the main tool for representation, manipulation, and exchange of structured information in the networked age, looks as a natural, almost mandatory, choice for modeling the information.

In this way, we just obey to the very primitive commandment of the web: make your information available. Currently, libraries in logical frameworks are usually saved in two formats: a textual one, in the specific tactical language of the proof assistant, and a compiled (proof checked) one in some internal, concrete representation language. Both representations are obviously unsatisfactory, since they are too oriented to the specific application. On one side, they restrict the access of the libraries to the users of the given application, and at the same time they are too sensible to the evolution and the maintenance of the application itself. A huge amount of formalization work has been done since the 70's using several proof assistant prototypes, but where is this information, now? Often, hopelessly lost in some internal format of a software application which is not maintained any more. If that information had been saved in some clearly defined, application independent format, at least a recovery attempt could have been done, by translating it into some other current logical dialect. If having a common representation layer is not the ultimate solution to all inter-operability problems between different applications, it is however the first and essential step in this direction.

Moreover, as soon as the information is put in a standard format on the web, any kind of research becomes virtually possible, and anybody could start developing his own spider for implementing his own searching requirements. This is clearly a major improvement w.r.t. the present situation. Currently, not only you must rely on the searching facilities offered by the specific applications, but even if you wished to implement your own searching algorithm, you would be prevented by the simple reason that the information is not accessible (in any reasonable sense of the word).

For all this reasons, the simple idea to switch from an application dependent internal format to an international standard hypertextual language as XML is likely to have the same impact, in the realm of Automated Mathematics, of that of passing, in Internet, from FTP to http, and from text (or some other ``internal'' format) to HTML (another ``small'' step that gave birth to the World Wide Web).

Finally, we strongly believe that the choice of an application-independent representation language is also going to improve the modularity of the general architecture of proof assistant and logical frameworks, and to simplify the definition of the applications devoted to specific tasks. Eventually, it should lead to a substantial simplification and re-organization of the current, "monolithic" architecture of these systems. All their different and often loosely connected functionalities (proof checking, proof editing, proof displaying, search and consulting, program extraction, and so on) could be split in more or less autonomous applications, maybe written in different languages and developed by different teams. This is one of the reasons why the `internal language' has an interest in its own, and for many respects should grow independently from the evolution of logical systems and the problems of proof checking.