The Power of Text

An impressive, ever-growing amount of information is available online. The World Wide Web has made it possible for anyone equipped with a computer not only to access information published by others, but also to become a publisher. Published information can be seen almost instantly from practically anywhere in the world. A few decades ago, this reality was a dream that nobody imagined would come true. The challenge of our period is to find the information we are looking for or, even better, to locate pieces of information that are relevant, useful, and up to date. This information also has to be stable enough to be referenced.

The same phenomenon has already occurred in history. After the invention of the printing press and the arrival of industrially produced books, two problems emerged: finding a given book, and finding something inside a particular book. Tools were invented to solve them. Books were enriched with title pages, tables of contents, and later indexes. Libraries started to develop catalogs that gave access to documents by document type (book, article, etc.), by author, by title, and by subject. Classification systems were invented and became standard ways of helping readers find their way through the myriad of printed materials all over the world.

Today, most documents available online can be retrieved on the Internet by means of a system that emulates tables of contents: first find the document, then search inside it for the relevant portion. In the most elaborate cases, the documents are organized using a system similar to a library catalog, called "metadata", which describes them. This book is about making metadata more useful, including the possibility of connecting knowledge using the equivalent of a printed index. A printed index gives direct access by topic of interest, each topic having been marked as appearing on given pages. Using the page as a search unit, it is possible to locate quickly the portion of the book that is relevant to a subject of interest.

Classification schemas and discovery

Classification schemas facilitate discovery... but they can also prevent it from happening.

The invention of typography and grammar as a structure. New symbols (punctuation) to structure the discourse. The invention of mathematical formalism to fit the print.

Reliability.

A universe that expresses a world view.

Finding aids. Invention of tables: tables of contents, glossaries, indexes, library classifications, etc. Classifications, taxonomies, vocabularies. Organization of knowledge and world-views. Indexes are not what they look like. Multiple entry/subentries not hierarchical, etc.

Navigational aids

Add: What is a TOC? What is an index? What is a bibliographic reference? What is a cross-reference? (Difficult to maintain) Punctuation. Tree structure. Capitalization. Typographical rules.

The idea of structure is as old as text.

Language is structured. It has a grammar, which describes, for example, the notion of a sentence. The written form of language has its own structure, and the fact that written material gets printed or, more generally, published adds still more structure to it. The fact that a sentence starts with a capital letter and ends with a period is a manifestation of such structure. It is so commonly used that it goes unnoticed.

Grammar and Typography

The art of typography is not limited to the creation of fonts; it also covers the creation and use of symbols aimed at improving the readability of text for humans. Punctuation is an example of such symbols. Periods, commas, semicolons, colons, etc., are used to split sentences into smaller units, each of which has its own definition. Typographical rules add to the rules of grammar to define the structure in which written language is displayed. The spacing between words, the separation of semantic units into paragraphs, and the use of typographical variants such as capital letters, boldface, and italics are other examples of such structure. Variation in font size also indicates the boundaries of nested sections, whose importance varies with the size of the font used for their headers. A thorough analysis of written text amounts to describing it as a hierarchical structure, organized as an upside-down tree: the root is at the top, and the leaves are at the bottom.

The art of typography has been so successful that the same process has promoted symbolisms other than pure text, such as mathematical and chemical notation. Sometimes, where such symbols already existed, they were altered in order to fit the linearization imposed by the constraints of print technology: the symbol used for the square root is an example.

The emergence of printing¹, a technology that made possible the industrial reproduction of many copies of a single book, led to further structuring of written language. Typesetting a book in preparation for printing required more precision than copying by hand, where copyists exercised some individual creativity in the presentation of a book (no spaces, font size varying with the length of the text so that it would fit on a page, etc.). Printing is about structuring written language.

The idea of structure is at the core of computing.

When computers were first created, they did not handle text processing. They were mainly used to manage databases, which impose a heavy structure on data. Data is chunked into pieces that are stored in a fixed, recognizable way, so that they can be extracted and manipulated as needed. Database records separated into fields have made it possible to manage enormous quantities of data, a task that would have been unthinkable or impracticable without computers.

The realization that text, even if it looks unstructured when compared to databases, has an "invisible" structure that is partially revealed by typographical variations gave rise to structured markup languages. This work was carried out throughout the 1980s with the creation of the "Standard Generalized Markup Language"² (SGML), which evolved into the "Extensible Markup Language"³ (XML), the lingua franca of Web-based applications today. The Hypertext Markup Language, known as "HTML", is also expressed as a markup language and has become ubiquitous on the World Wide Web, owing to its simplicity of use and its power to create links between documents scattered across the Web.

The structure of text, as known by typographers and first computerized by phototypesetting system vendors, and the structure of data, as implemented by database system vendors, now appear as two facets of the same phenomenon. A text can be "exploded" into a set of elements forming a tree structure, thereby constituting a certain form of database, and a database can be expressed as text delimited by separators. In the current state of the technology, therefore, the traditional computer science distinction between text and databases tends to vanish.
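As an illustration of this equivalence, here is a minimal sketch in Python. It assumes a small, made-up XML document (the element names `book`, `chapter`, `title`, and `para` are hypothetical and not taken from any schema discussed here). The text is parsed into a tree, the tree is "exploded" into database-like records, and the records are written back out as separator-delimited text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names are illustrative only.
SAMPLE = (
    "<book><title>The Power of Text</title>"
    "<chapter><title>Grammar and Typography</title>"
    "<para>Punctuation structures discourse.</para></chapter></book>"
)

def explode(element, path=""):
    """Walk the tree depth-first and return (path, text) records."""
    here = f"{path}/{element.tag}"
    records = []
    text = (element.text or "").strip()
    if text:
        records.append((here, text))
    for child in element:
        records.extend(explode(child, here))
    return records

def to_delimited(records, sep="|"):
    """Express the records as one line of separator-delimited text each."""
    return "\n".join(f"{path}{sep}{text}" for path, text in records)

root = ET.fromstring(SAMPLE)
records = explode(root)
print(to_delimited(records))
```

Each record pairs a position in the tree with a piece of content, which is roughly what a database row with a key field and a value field would hold.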

Semantic Structuring

The schemas used to express structures for text, as well as for data in general, do not necessarily provide "meaning" that can be exploited when searching for information. Text structure can be presentation-oriented. Or it can be partly presentation-oriented and partly semantically oriented, with no clear indication of which kind of structure serves which purpose. For example, italics can be used in one context to highlight the important terms being defined, and would therefore provide a good mechanism for search, whereas in another context italics are used for purely aesthetic reasons, without any expectation that the italicized phrases should be used for searching. Sometimes, however, document structures are more distant from pure presentation and qualify the nature of the information they enclose. In that sense, they are not really distinguishable from databases.
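The italics example can be sketched in a few lines of Python. The two fragments below are invented for illustration, and the element names `i`, `term`, and `emph` are assumptions rather than part of any particular schema. With presentational markup, a search for defined terms cannot tell a definition from a stylistic flourish; with semantic markup, it can.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: the same sentence marked up in two ways.
PRESENTATIONAL = "<p>An <i>index</i> gives, <i>in effect</i>, access by topic.</p>"
SEMANTIC = "<p>An <term>index</term> gives, <emph>in effect</emph>, access by topic.</p>"

def phrases(xml_text, tag):
    """Return the text of every element with the given tag."""
    return [el.text for el in ET.fromstring(xml_text).iter(tag)]

# Italics alone mix a defined term with a purely stylistic emphasis.
print(phrases(PRESENTATIONAL, "i"))

# Semantic markup isolates the defined term.
print(phrases(SEMANTIC, "term"))
```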

There is no clear boundary that would delimit presentation-oriented markup from semantic-oriented markup. Some markup can serve both purposes at the same time⁴.

There are also really two levels of structure that are of interest. One is the structure that defines the skeleton of the information set under study; the other is the structure of the knowledge, or the meaning, that is expressed. These two structures are completely different and are in general unrelated. The fact that we sometimes tend to conflate them is another reason why it is hard to understand why computers do not get us where we want to go.

This is also illustrated by the fact that information can be described whether or not the information item itself is structured. We can very well end up with paradoxical situations where, on the one hand, text is heavily structured but the knowledge that can be extracted from it is not, whereas, on the other hand, text is not structured at all but the metadata used to qualify it are extremely structured.

Orthogonal Structuring

The two forms of structuring are orthogonal, i.e. they do not depend on each other. Hence, there is no direct relationship between the granularity of presentation structuring and the granularity of semantic structuring. However, because there are many cases where both kinds of structuring are present, there is hope that starting from structured information provides more hooks on which to build the semantic structure. But again, when this happens, it is somewhat fortuitous.

When pure semantic structuring is used, as is often the case in a database, the knowledge value is greater and information retrieval is made easier. However, it is also constrained by the limitations of the schema under which it was originally created. The simplicity of such schemas can hide the true complexity of the information, and it might not be practical to rely on them for complex tasks. What a user wants to look for might be completely different from what the structure was designed to support in the first place.

Structure and Discovery

Structure, especially semantic structure, looks like the ideal candidate for helping to navigate within a corpus. This is true from the perspective of those who created and designed that particular structure. To get the most out of it, a user should be in conditions as close as possible to those of its creators, for example close in time. Things evolve, and what was meaningful at a particular time and place might lose meaning later. The meaning of some words or keywords may even be lost. Without thorough documentation, it may become increasingly difficult to retrieve information, simply because nobody remembers any more what it meant when it was first created. Remember, we are speaking here about the long-term preservation of meaning. Some classification or access schemas may not be useful outside the particular field of expertise in which they were created, or without the appropriate training. There was a time, not too long ago, when browsing a library catalog was not very easy for most users.

Metadata and Data

Trying to define the difference between metadata and data amounts to finding where the information that is going to be used for discovery is located. If the discovery information is located inside the information itself, it is considered "data". If it is located in a specific set of descriptive fields, it is considered "metadata". But the difference between data and metadata is not always relevant. Deciding what is inside and what is outside depends on where one places the boundary. One person's data might be someone else's metadata. A table of contents, for example, can be considered metadata. But a table of contents is nothing more than a filtered view of an information structure, in which only the headers are displayed and the rest is hidden. A table of contents is therefore regarded sometimes as metadata and sometimes as data, without any change to its content. In other cases, information considered "external" to a document can be made "internal" if the boundaries delimiting the document are redefined. This is true even if the data describing a given document are located outside the document itself, because the notions of internal and external can be applied to a set of documents rather than to an individual document, and the boundary between what is "internal" and what is "external" can be redrawn accordingly.
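The table-of-contents remark can be made concrete with a minimal Python sketch. The in-memory representation below, a list of `(kind, level, text)` tuples, is an assumption chosen for brevity and not a format proposed in this book. The table of contents is produced by filtering the document itself, which is why it can be read either as data or as metadata.

```python
# Hypothetical document model: (kind, level, text) tuples in reading order.
DOCUMENT = [
    ("header", 1, "The Power of Text"),
    ("para", 0, "An ever-growing amount of information is available online."),
    ("header", 2, "Grammar and Typography"),
    ("para", 0, "Punctuation splits sentences into smaller units."),
    ("header", 2, "Metadata and Data"),
    ("para", 0, "A table of contents is a filtered view."),
]

def table_of_contents(document):
    """Keep only the headers; everything else is hidden, not changed."""
    return [(level, text) for kind, level, text in document if kind == "header"]

def render(toc):
    """Indent each entry according to its nesting level."""
    return "\n".join("  " * (level - 1) + text for level, text in toc)

print(render(table_of_contents(DOCUMENT)))
```

Nothing in the output exists outside the document: the "metadata" is a view computed from the "data".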

In general, the distinction between data and metadata is artificial.

A superimposed overlay

Semantic description of information, whether internal or external, can be considered a superimposed overlay. This approach offers the flexibility needed to create individual perspectives, or views, on the information. To hold that only the "internal" metadata can be used, i.e. those created by the originators of the information, is to allow no other perspective on this information, which is obviously an excessive restriction.

The original intent is expressed as published and provides the gateways used to access the information. Any "opinion" on a given piece of information can in turn be published and give rise to another perspective. Or it can stay unpublished and be used as a personalized view of the information, built on concepts that serve a unique, individual purpose.
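One way to picture such overlays is the following Python sketch. It rests on several assumptions: the document identifiers, the subject labels, and the representation of an overlay as a dictionary mapping an identifier to a set of subjects are all hypothetical. The documents themselves are never modified; each perspective is a separate layer that can be consulted alone or combined with others.

```python
# Hypothetical overlays: each maps a document identifier to a set of subjects.
publisher_view = {
    "doc-1": {"printing", "history"},
    "doc-2": {"databases"},
}
reader_view = {
    "doc-1": {"typography"},
    "doc-2": {"history"},
}

def find(subject, *overlays):
    """Return the documents tagged with the subject in any given overlay."""
    hits = set()
    for overlay in overlays:
        for doc_id, subjects in overlay.items():
            if subject in subjects:
                hits.add(doc_id)
    return sorted(hits)

# The originator's perspective alone, then with a reader's overlay added.
print(find("history", publisher_view))
print(find("history", publisher_view, reader_view))
```

Adding the reader's overlay surfaces a document that the originator's description alone would not have revealed for that subject.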

Planning for the unexpected

The other side effect of heavily structured information is that the structure may not integrate harmoniously when viewed at a higher level than the one it was created for. Instead of being a help, a structure may become an obstacle. Here, the problem looks like an impossible one. How can we plan for the unexpected? This is where subject gravity comes to mind. Augmenting local subject gravity facilitates interconnections with the outside, because it dramatically reduces the number of points where things connect, to the point where it becomes possible to get things under control and to design the merging principles in whatever way is considered most appropriate.

¹ The section about printing is to be considered only an example of structure. The important idea to develop in this chapter is that there is a structure for finding aids, which is conceptually equivalent to, but not the same as, the structure of text (essentially hierarchical).
² An ISO standard, published in 1986: ISO 8879, edited by Charles F. Goldfarb.
³ A World Wide Web Consortium standard published in 1998, edited by Tim Bray, Jon Bosak, Jean Paoli and Michael Sperberg-McQueen.
⁴ Eve Maler and Jeanne El Andaloussi, Developing SGML DTDs, Prentice-Hall, 1996.