Subject Integration
Finding aids for the digital era
Information is useful when we can find it, regardless of the technology used to support it. We have invented finding aids for books, adapted to the printing technology. Books are made of numbered pages, which tables of contents, indexes, and cross-references use to help us locate the portions of the content that are relevant to the subjects mentioned. Taxonomies, which are hierarchical classifications of knowledge into domains and subdomains, are used in library catalogs to list the books relevant to each domain of knowledge. In general, each piece of content has an address and a subject. Using the finding aids, addresses can be retrieved by naming a subject. In a library, knowing the call numbers enables librarians to find books on the shelves where they are stored. The mapping between subject and address constitutes a universal basis for information retrieval.
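The mapping between subject and address can be sketched in a few lines of Python. This is only an illustration: the page numbers and subject names are invented, and `build_index` is a hypothetical helper, not part of any library. It inverts a table of pages and their subjects into a back-of-the-book index.

```python
from collections import defaultdict

# Hypothetical book: page number (the address) -> subjects discussed on that page.
pages = {
    12: ["printing", "typesetting"],
    47: ["library catalogs", "taxonomies"],
    48: ["taxonomies"],
    103: ["printing"],
}

def build_index(pages):
    """Invert the page -> subjects table into a subject -> pages index."""
    index = defaultdict(list)
    for page in sorted(pages):
        for subject in pages[page]:
            index[subject].append(page)
    return dict(index)

index = build_index(pages)
print(index["taxonomies"])  # [47, 48]
```

Naming a subject returns its addresses, which is the same operation a reader performs when consulting a printed index.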
At the low level of the operating system of a computer, the classic library metaphor still applies. Each memory location on a computer, whether it is live (RAM) or on a disk, stores a bit (either 0 or 1) at a given address. The operating system is used to read and write bits in each address. Any higher level software combines these operations in processes designed for human consumption. Bits are combined into strings, and displayed as words. Data is stored on computers in files, for which the computer knows the starting and ending locations. Text, sound, images, video are stored exactly in the same way as digitized files. Only the rendering software differs for each format. But we can do better than relying just on files to retrieve information. One of the major advantages of computerized information is that we can divide files further using the structure of the data.
Structure
Information was already highly structured before it became digitized. Text is a structured, powerful and efficient way to store information. Text is made of characters, comprising an alphabet or a set of ideograms and added signs such as spaces, digits, and punctuation characters. Each alphabetic language has a slightly different alphabet, with a set of accented characters and specific characters, for example the ß character in German, or œ in French. Mathematics adds a number of signs with a specific meaning, and emojis are now commonly used to express feelings. Characters are grouped into words, and words separated by spaces constitute a sentence. Sentences are assembled into paragraphs, paragraphs are assembled into sections, chapters, etc. The advent of printing has tightened rules for writing, adding various layers of structuring. In addition to applying grammatical rules, spelling rules, and typographical conventions to text, finding aids were invented to make it easier to find text. Page numbers were added to books, enabling tables of contents to list the chapter titles with the page numbers where they start. Footnotes and endnotes were added as a way to add secondary information, commonly the sources explaining where a piece of information comes from. Back-of-the-book indexes enable direct navigation by subject to the various pages of the book in which occurrences of the subject are found. Here too, the subject/address pair is used as an added structure created to facilitate retrieval of information.
One of the first uses of computers has been to organize information into databases. A database is a storage system made of records. Each record is divided into fields, where data is stored. Therefore, it is possible to specify the nature of every piece of data according to the field it belongs to. A contact database contains one record per person, and examples of fields include the first name, the last name, etc. Finding information in a database is easy, because it is possible to query records by supplying a value for a given field. For example, it is possible to look up a person by last name and view all the fields of that person's record. When several people share a last name, a list of several records is returned instead of just one. A database can be described as a table, where each row is a record and each cell a field. Frequently, the field names are the data contained in the first row.
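As a minimal sketch of such a lookup, the following Python example uses the standard `sqlite3` module with an in-memory database. The table name `contacts`, its fields, and the sample people are assumptions made for illustration only.

```python
import sqlite3

# In-memory contact database: one record (row) per person, one column per field.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows reading each field by its name
conn.execute("CREATE TABLE contacts (first_name TEXT, last_name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        ("Ada", "Lovelace", "London"),
        ("Grace", "Hopper", "New York"),
        ("Anne", "Lovelace", "Paris"),
    ],
)

def find_by_last_name(conn, last_name):
    """Return every record whose last_name field equals the given value."""
    rows = conn.execute(
        "SELECT first_name, last_name, city FROM contacts "
        "WHERE last_name = ? ORDER BY first_name",
        (last_name,),
    )
    return [dict(row) for row in rows]

matches = find_by_last_name(conn, "Lovelace")
for record in matches:
    print(record["first_name"], record["last_name"], record["city"])
```

Because two people share the last name, the query returns a list of two records rather than a single one.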
In the early 1980s, computers, which had been expensive machines used only by big corporations, became "personal" and started to invade offices. It didn't take long until they were used not only to manage data in databases, but also to replace the typewriters used in offices and homes. Desktop publishing software gave us the ability to print text in a professional manner, and personal computers progressively replaced the expensive typesetting machines used in the printing industry.
When typesetting was first computerized, it used software which enabled the inclusion of specific markers indicating various font changes, the amount of space needed between lines and paragraphs, margins, etc. Each manufacturer had a specific set of codes that needed to be entered in order to get the machines to understand what the intended purpose was.
The emergence of a mass market for text composition led to the need to streamline the production of documents by applying a common structure. In offices, people started to regroup around the most commonly used formats, and ended up selecting Microsoft Word as the preferred way to exchange documents. At the same time, the publishing industry agreed on generalizing and standardizing markup languages. The Standard Generalized Markup Language (SGML) was issued. Its level of detail made it possible to replace proprietary typesetting software with flexible ways to encode information: a structure describes how the information is organized, and stylesheets, designed independently, specify how each tag of the markup is rendered for viewing or printing. The main goal was to introduce a separation between the structure of the data and its presentation, and to enable multiple outputs from a single input.
SGML gave birth to two different offspring. The first one to appear was HTML, the "HyperText Markup Language", using the same tagging principle as SGML, even if the Web browsers used to render it were more permissive than the strict validation required by SGML parsers. HTML was small and easy to use, and quickly propagated. Its markup is used to indicate variations in the presentation of information (headers, bold, italic, tables, etc.), to insert various media types (images, sound, video, etc.), and to connect any web page to any other using hyperlinks. The success of HTML is what prompted the emergence of the World Wide Web. The other byproduct of SGML is the "eXtensible Markup Language", or XML, which was primarily designed as a simplification, keeping most core characteristics and removing rarely used features that were difficult to implement. XML is more general than HTML, because it allows its users to define a custom set of tags and rules for their own documents.1 In XML, the structure of the data is expressed inside the text, using markup qualifying the content. Markup can be used either to indicate format information (such as \
XML is used not only as a lingua franca to encode text documents, but also to export and import the contents of databases.2 XML provides a bridge between word processing and databases by considering the text as a set of structured elements, where tags play a role similar to fields, and content plays a role similar to data.
XML query languages have been designed to find information inside XML documents, in a fashion that resembles the way we can find information inside a database. These efforts mark the beginning of a merger between the universes of documents and databases.
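To illustrate how querying an XML document resembles querying a database, here is a small sketch using Python's standard `xml.etree.ElementTree` module, which supports a limited subset of XPath. The element names (`contacts`, `person`, `last_name`, `city`) and the sample data are invented for the example.

```python
import xml.etree.ElementTree as ET

# A small XML document in which tags play the role of database fields.
document = """
<contacts>
  <person>
    <first_name>Ada</first_name>
    <last_name>Lovelace</last_name>
    <city>London</city>
  </person>
  <person>
    <first_name>Grace</first_name>
    <last_name>Hopper</last_name>
    <city>New York</city>
  </person>
</contacts>
"""

root = ET.fromstring(document)

# Select the person elements whose last_name child has the text 'Hopper',
# then read the city child of each match.
cities = [
    person.findtext("city")
    for person in root.findall("./person[last_name='Hopper']")
]
print(cities)  # ['New York']
```

The path expression plays the part of the query, the tag names the part of the field names, and the element content the part of the data.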
Metadata and data
Documents contain "metadata", which are information about the author, the title, the date of publication, the software used to create them, etc. Often, the metadata is used to populate a database, containing information helpful to retrieve the document, playing a similar role as a catalog card in a library. In other words, metadata looks like the structured part of the document, considered otherwise unstructured. But for a document which is entirely structured, for example, an XML or HTML document, there is no difference in nature between the tags that are considered metadata and the other tags. It is therefore possible to extract information from elements which are not explicitly considered metadata, and we can eventually consider that the difference between data and metadata is somewhat artificial, at least when a document is structured.
Content management systems are products used to store documents into a database, and assign them properties, to facilitate their retrieval. These properties are usually metadata, but could be any data extracted from inside the documents.
When XML is used primarily for semantic tagging, where elements describe the nature of the data that they contain, the documents are functionally equivalent to databases. On the opposite side, HTML and XHTML are used primarily to indicate to the browsers how to visually render information, and therefore are generally more focused on presentation than on semantic structure. The consequence is that the information remains, semantically speaking, unstructured. In other words, extracting the content of a paragraph doesn't tell us anything about the content, whereas we expect to get the name of a city when extracting information from a \
Finding information relevant to a given subject in an unstructured data set, such as Web pages, requires a different approach than querying a database. Powerful search technologies have been developed to find information based on word recognition. These technologies have become more sophisticated, and now take into account word stems, synonyms, and translations into various languages. The most popular search engine, developed by Google, utilizes proprietary search algorithms that return search hits in an order that depends on the frequency of use, and other criteria such as paid advertising. Since the algorithm is not public, there is no way to forecast exactly what the results of a given search query will be, even if a whole discipline emerged under the name "Search Engine Optimization", whose purpose is to bring given pages to the top of the results list. Data analytics follows a similar pattern: it consists of returning results from a huge set of data using some algorithm that is often proprietary. Search and data analysis, therefore, are useful to explore an unknown set of data, but there is no guarantee that a particular item will be found even if it is related to the subject of a particular query.
Thus, the main distinction when working with information is whether the information is structured or unstructured. Depending on which side we are on, the technologies used for finding information will differ. Generally speaking, we will use database queries for structured information and data analytics for unstructured information. Structured information needs to be prepared in compliance with the schemas, classification schemes, controlled vocabularies, taxonomies, or ontologies that are in effect. Doing so imposes an agreement to follow certain rules. We are therefore in a highly regulated environment, where compliance is key. In big corporations, government agencies, and international organizations, this is common practice. But it is a heavy, burdensome process that cannot be generalized to other environments where information consumption is more fluid and dynamic, and not always predefined.
The difference between structured and unstructured information is more complex than it seems at first sight. Using structured information implies that the structure is universally understood by its users. Librarians, for example, need specific training to understand the meaning of the fields in catalog records. In more general purpose structured environments, there is an unspoken consensus that field names should be self-explanatory and therefore not ambiguous. In general this is the case when the system is deployed, but over time increasing discrepancies might develop between the way the information is organized and what it actually means. For example, a scientific taxonomy obviously doesn't include discoveries that have not yet been made, and new categories which emerge after the schema has been designed have to be added. For users who don't know which categories are available, finding information might be quite challenging.
Searching within unstructured information is done by looking up a word or an expression. The first generation of search tools would miss a hit unless the lookup string was spelled exactly the same way. That kind of deficiency has generally been fixed in a new generation of more capable technologies, by adding spelling variants, ignoring accented characters or hyphens, using word roots instead of the exact word, and including synonyms. Advances in the domain of artificial intelligence and machine learning raise the hope that the results of search queries will be further improved. Data analytics tools and data mining technologies use patterns that can be domain-specific, and raise overall user satisfaction.
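Some of these improvements can be sketched with Python's standard `unicodedata` module. The example below is a simplified illustration, not a description of how any actual search engine works: it folds case, strips accents and hyphens, and maps synonyms to a single form. The synonym table and the sample documents are invented, and word stemming is left out.

```python
import unicodedata

# Hypothetical synonym table; a real system would rely on a curated thesaurus.
SYNONYMS = {"automobile": "car"}

def normalize(word):
    """Fold case, strip accents and hyphens, then map synonyms to one form."""
    decomposed = unicodedata.normalize("NFKD", word.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    stripped = stripped.replace("-", "")
    return SYNONYMS.get(stripped, stripped)

def search(query, documents):
    """Return the ids of documents containing a word equivalent to the query."""
    target = normalize(query)
    return [
        doc_id
        for doc_id, text in documents.items()
        if target in {normalize(word) for word in text.split()}
    ]

documents = {
    "a": "Le résumé du projet",
    "b": "An e-mail about the automobile",
    "c": "Nothing relevant here",
}

print(search("resume", documents))  # ['a']
print(search("email", documents))   # ['b']
print(search("car", documents))     # ['b']
```

An exact string comparison would have missed all three hits; the normalization step is what makes them match.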
However, sometimes, information is very specific to a narrow context, and human work is irreplaceable. For example, the list of index entries compiled by an author of a book or a professional indexer results from an intellectual work of analyzing the content of the book, the concepts used and their interrelations. The index provides an added value by referring to pages of the book not necessarily because they contain a specific word listed in the index, but because the content is relevant to the subject that the index entry is about. An author writing an index aims at attracting the attention of the readers to the highlights of the book. Sometimes, researchers start by looking at an index to figure out what the book is about. This hidden characteristic of an index, which consists in a detailed intellectual summary of the major concepts developed in the content, provides value besides the simple ability to refer to pages. This aspect of an index is lost in any algorithm developed for a broader context. Browsing web pages requires less focus than reading a book, and therefore we are starting to neglect it and don't consider it as a big loss. This question deserves to be asked, nevertheless.
Relying blindly on algorithms that we don't know makes us more dependent on the technologies to apprehend knowledge. Only those who know the information sources will realize that some information cannot be found when users are asking for words different from those contained in the underlying database, and they will feel powerless, having no way to easily correct the situation. This may become a significant problem, for example, when words are transliterated from a foreign language. Looking for Peking will return no occurrences of Beijing if the authors of the algorithms didn't think of including it, and readers will end up thinking that the city is never mentioned in the document.
It amounts to the level of control desired. An environment where there is a need to fully control the content and the quality of the information is different from an environment where information is shown as it is. This difference is similar to the one between a publisher and a syndicator. A publisher wants more control over the information it creates and makes available, whereas a syndicator shows information as it is. The government is a publisher: the information it makes available needs to be very precise, and terms need to be well defined. Heavy regulatory procedures impose a workflow that cannot easily be changed without deeply disrupting the day-to-day operations.
How subjects are represented
A computer can not contain a subject, but it can contain a representation of the subject. The representation of a subject is a "subject expression". Examples of subject expressions include entries in a printed index, headers in a card catalog, a list of authority keywords comprising a vocabulary, terms used in a thesaurus. Subject expressions together form a space where subjects occupy locations. Subject expressions can also be called "topics", because the Greek etymology of the word implies the idea of space, a topos. Since a subject is not represented by its name, or by a set of names, there needs to be a new mechanism to represent a subject. In some limited environments, names can be used to represent subjects, but it is understood that there is a universe of discourse in which this is valid, and no claim is being made as to the validity of such names in another universe of discourse.
The way subjects are defined, and the way information about them is organized, are a matter of design. It is possible to concentrate many references on one subject, or on the contrary to differentiate subjects at a level of extreme detail, in order to minimize the number of occurrences of each subject. The more concentrated the subjects become, in terms of their relevance, the heavier they are in the subject space, and the greater their gravity. For better semantic integration, the gravity of subjects should be increased, in order to reduce the chances of finding the same subject elsewhere in the subject space.
In engineering and management terms, this approach can be compared to the way airline traffic has been reorganized through hubs. Navigation hubs concentrate flights so that more connections become available, rather than relying on a somewhat random point-to-point scheme. Navigation by subject is similar. We are more likely to succeed in getting to our final destination if we first connect to a "subject hub" providing multiple, well-documented connections to other subjects. Apart from the heavy traffic load guaranteed between major metropolitan areas, airline companies have no way to predict what the seat occupancy is going to be for a trip from a small city A to another small city B. Direct flights, when they existed, were often randomly filled, sometimes nearly empty. Therefore, flights have been rerouted to hubs. The situation for subject navigation is quite similar. How can a company offering web-based search know in advance how many people having a given background (for example, speaking a given language) will want to search for a particular subject? Relying simply on strings of characters helps, but only by accident. There is no guarantee that information on a given subject will be available using exactly the same strings of characters as the ones used in the search string. This is why redirecting search through well-organized hubs offering a number of services (namely, links to related subjects) can dramatically improve the current situation.
Topic maps
Topic mapping is a way to qualify what information is about, independently of where and how information is stored. It is based on a map of subjects using locations that point to the places where information resides. A topic map is a computer-readable graph or network of interconnected subjects. It enables any subject to be related to any other subject, and encompasses taxonomies, which are hierarchical descriptions of subjects. In a taxonomy, a subject must be contained in another one, so that we get to a finer level of detail. A taxonomy can be considered a particular example of a topic map, but a topic map can create connections between subjects at any level, using any kind of relationship, not only predefined ones such as "broader term" or "narrower term", as in a taxonomy. A topic map is a knowledge base: it contains semantic information and can be considered the equivalent of the index of a book, together with a glossary and thesaurus, allowing us to find information using subjects that users can define.
Topic maps is an information architecture that has been designed to organize information by grouping in one location everything which is relevant to a given subject. It is a map because it points to the various places where relevant information is present, and can be created independently of the information sources. Topics are computer representations of objects uniquely describing subjects of conversation. Topics have properties such as names, types, occurrences in sources, and they can be related to other topics through a graph of relationships whose semantics can be entirely defined by the users. Consequently, an index appears as a list of topic names, alphabetically sorted, accompanied by a pointer to the occurrences in the source documents. A glossary is a list of topic names, followed by their occurrences playing the roles of definitions. A cross-reference is a link between two occurrences of the same topic. Technically speaking, the topic maps model ends up as a method for considering the traditional navigational aids as pre-resolved queries in a topic database.
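The idea that an index and a glossary are pre-resolved queries over a topic database can be made concrete with a small Python sketch. The classes below (`Topic`, `Occurrence`, `TopicMap`) are hypothetical and deliberately minimal; they do not reproduce the data model of the Topic Maps standard, and the page locators are invented.

```python
from dataclasses import dataclass, field

@dataclass
class Occurrence:
    role: str     # for example "mention" or "definition"
    locator: str  # address of the information, such as a page or a URL

@dataclass
class Topic:
    identifier: str
    names: list
    occurrences: list = field(default_factory=list)

@dataclass
class TopicMap:
    topics: dict = field(default_factory=dict)
    associations: list = field(default_factory=list)  # (kind, topic_id, topic_id)

    def add(self, topic):
        self.topics[topic.identifier] = topic

    def relate(self, kind, first_id, second_id):
        self.associations.append((kind, first_id, second_id))

    def index(self):
        """Topic names sorted alphabetically, each with the places it is mentioned."""
        entries = []
        for topic in self.topics.values():
            locators = [o.locator for o in topic.occurrences if o.role == "mention"]
            for name in topic.names:
                entries.append((name, locators))
        return sorted(entries)

    def glossary(self):
        """Topic names paired with the occurrences that play the definition role."""
        entries = []
        for topic in self.topics.values():
            for occurrence in topic.occurrences:
                if occurrence.role == "definition":
                    entries.append((topic.names[0], occurrence.locator))
        return sorted(entries)

tm = TopicMap()
tm.add(Topic(
    "sgml",
    ["SGML", "Standard Generalized Markup Language"],
    [Occurrence("mention", "p. 10"), Occurrence("definition", "p. 10, para. 2")],
))
tm.add(Topic(
    "xml",
    ["XML"],
    [Occurrence("mention", "p. 11"), Occurrence("mention", "p. 12")],
))
tm.relate("derived-from", "xml", "sgml")

for name, locators in tm.index():
    print(name, locators)
```

In this sketch the index and the glossary are not stored anywhere: both are computed on demand from the same topics, which is the sense in which they are queries rather than separate artifacts.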
Subject vs. names vs. identifiers.
It is essential to distinguish a subject from its name and its identifier. A subject is an abstract representation of something we talk about; it is the idea that we mentally construct so that we can give it a meaning. A name is a label that is attached to the subject. There can be several names used for a given subject. An identifier is just another name, usually used in the context of computer software, but when it is unique, it serves as the principal mechanism to retrieve the information about the subject.
The difference between name and subject is at the core of topic mapping. A topic is a computer representation of a subject, a proxy for the subject. A topic map is based on the idea that there should be one topic per subject. A subject may have no name. For example, a paragraph in a book can describe a subject without giving it a name. A cross-reference in a book that indicates (for more, see p. xxx) doesn't indicate what subject it is about, but it provides more information about it. A subject may have several names, for example synonyms, or names used in different contexts (surname versus first name, for example). Or it can have names expressed in various languages, in the case of a multilingual information repository. A name is not required to uniquely identify a subject, unlike an identifier. For example, the name Washington can be used to designate the capital city of the United States, or a state on the West Coast. An email address or a phone number, together with the area code and the country code, is supposedly unique. A URL is also a unique identifier.
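The Washington example can be turned into a short Python sketch. The identifiers used as dictionary keys (`city:washington-dc`, `state:washington`) are invented for the illustration and follow no standard scheme.

```python
# Each topic is keyed by a unique identifier and may carry several names.
topics = {
    "city:washington-dc": {"names": ["Washington", "Washington, D.C."]},
    "state:washington": {"names": ["Washington", "Washington State"]},
}

def find_by_name(name):
    """A name may designate several subjects, so the result is a list."""
    return sorted(tid for tid, topic in topics.items() if name in topic["names"])

def find_by_identifier(identifier):
    """An identifier designates at most one subject."""
    return topics.get(identifier)

print(find_by_name("Washington"))
# ['city:washington-dc', 'state:washington']
print(find_by_identifier("state:washington")["names"])
# ['Washington', 'Washington State']
```

Looking up by name returns a list that may hold several candidates, while looking up by identifier returns a single topic or nothing.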
Name-based information retrieval
What a subject means is different from the names used to represent it. This difference is extremely important for understanding the power and limits of name-based information retrieval. Searching by subject is what we do when consulting a library catalog or a book index. When browsing an index or a card catalog, we need to have a predefined idea of how what we are looking for can be expressed, i.e., which words best describe it. In the context of a card catalog or the index of a book, the information to be searched has been prepared by professional indexers or catalogers, who apply a method that is (hopefully) consistent. Understanding how categories have been established, whether fully or partially, even intuitively, is key to the success of the search. Sometimes, names used to qualify subjects have synonyms, and these are explicitly declared with the keyword "See". When employed, the "see" keyword means that the subject being searched under a given word or phrase is to be found under another one. Recognizing the boundaries of the universe of discourse in which the names used to describe subjects are valid is an important step forward. There is always some world view behind a choice of terms, and when it is explicitly stated, it is less prone to lead to wrong interpretations.
On the Internet, we often use search engines which return a list of documents containing the subject we are looking for. But is it the subject itself? Not really. These technologies are powerful because they act as robots which do not necessitate human intervention. But the price to pay for this is that the search is done not on the subject itself, but on the string of characters searched for. In most cases, there is no prompting for synonyms. However, the search engines are using a combination of automatic retrieval based on string identification with manual qualification of some of the major subjects, that constitute a taxonomy.
Graph databases
The interchangeability of topic maps across companies and organizations was one of the main assets provided by the standard. Most of the tools were built to provide this feature. The possibility of merging topic maps was a side effect of this ability. So far, we have not seen much use of this feature. It doesn't mean that it has not been used, but in our extensive experience with topic maps we have not seen it happening. One of the reasons may be that companies that do business with others don't necessarily want to interchange the core of their knowledge with each other, but limit their information exchange to what is strictly necessary. If they have invested a lot into their knowledge assets to be competitive, they may simply not want to share them. Government agencies could be a good candidate for more openness in close information interchange. In some situations, for the sake of preserving individual rights, information exchange is strictly prohibited. For example, the US Census Bureau cannot divulge any information about people. The IRS cannot share tax returns with anyone, except in very special circumstances. Moreover, when information is sharable, there is a big gap between general declarations of intent about openness and transparency and the reality. Most of the time, information is very complex. Even if every government agency were using topic maps, it is very unlikely that they could share them with other institutions. And even if they wanted to, the way information is organized would often be too specific to be easily exchanged.
In other words, the case for topic maps interchange still needs to be made, and does not look as desirable as previously thought. This explains why the focus on interchangeability has probably been a factor in the low adoption of topic maps. The main value of the topic maps paradigm therefore seems to lie not in the interchangeability of topic maps, but rather in the independence between the sources and the knowledge layer.
Independence from the sources.
The most important promise of the topic maps design is to guarantee that the knowledge representation of information as topics is kept independent from the information sources. In other words, users should be able to point to any information sources, including and especially when they change, and create and manage semantics from outside. This offers the guarantee that when sources change, only the occurrences of the topics are redefined, but the other properties of the topic remain, including the relations to other topics, the various names used to designate the topics, including their equivalents in other languages, and the types to which the topics belong. The flexibility of the topic map, its ability to survive modifications in the information sources repository, is accompanied by a wide open ability to use various tools to create topic maps. It is therefore possible to manage topic maps using XML systems, but also spreadsheets, databases (relational, object-oriented, XML-based, NoSQL, graph-based), and content management systems including tagging capabilities, taxonomy management features, etc. Web-based frameworks are commonly equipped with features that emulate what can be done with topic maps, and should be usable as well.
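A minimal Python sketch can show what this independence means in practice. The topic structure, the `relocate` helper, and the example.org addresses are all assumptions made for illustration: when a source moves, only the occurrences are rewritten, while names, types, and relations are left untouched.

```python
import copy

# A hypothetical topic: everything except "occurrences" lives outside the sources.
topic = {
    "identifier": "xml",
    "names": ["XML", "eXtensible Markup Language"],
    "types": ["markup-language"],
    "related": [("derived-from", "sgml")],
    "occurrences": ["https://old.example.org/docs/xml.html"],
}

def relocate(topic, moved):
    """Return a copy of the topic whose occurrences follow the moved sources."""
    updated = copy.deepcopy(topic)
    updated["occurrences"] = [moved.get(loc, loc) for loc in topic["occurrences"]]
    return updated

# Mapping from old source addresses to new ones (invented addresses).
moved = {"https://old.example.org/docs/xml.html": "https://new.example.org/kb/xml"}
after = relocate(topic, moved)

print(after["occurrences"])          # ['https://new.example.org/kb/xml']
print(after["names"] == topic["names"])  # True
```

The same separation holds whatever tool stores the topics, whether a spreadsheet, a database, or an XML file.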
Small Data vs. Big Data
In the information environment in which we live, with the scary amount of information available, why would anyone spend time and effort to proactively organize information around topics? The answer to that question depends on the environment. In many cases, the search technologies yield transient topic maps (i.e., search hits) which are considered "good enough". They are not perfect, especially if the number of hits is astronomical. Besides the world of "Big Data", another world exists, which is less visible, but still very present, that we will call "Small Data". This is a world where the creators or publishers of information know what their content is, and their business is based on guaranteeing that their content has a high quality and high value, so that they can be relied upon. This is the traditional role of the publishing industry, but other industries or activities also depend on the reliability of the information they make public: media, government, international organizations, healthcare, intelligence, finance, manufacturing, and research and development are examples of such sectors. And they represent a non-negligible part of the economy. In these sectors, guaranteeing access to relevant pieces of information is of paramount importance. Sometimes, the purpose of proper information management is to hide information rather than show it, but then it is even more important to use a solid methodology to describe and qualify the information items.
There are situations where automated search capabilities do not return the information a user is looking for. To take a simplistic example, if a search engine is based on strict full text recognition, looking for "George Washington" would not return content that contains "General Washington". The problem with this situation is that the user may think that the information is not there, and the publisher may lose traction and even the trust of its customers if they are not fully confident that they can find the information that matters to them. In some cases, it may even be a life-or-death issue. It would be unthinkable for a manufacturer of aircraft cockpit equipment to let its users, the pilots, rely on unvetted search algorithms to find critical information in case of an emergency.
Other domains have similar requirements: information collected by intelligence agencies must be organized according to complex, not always repeatable, combinations of algorithms and hand-made editing, in order to be able to "connect the dots" and not take a chance to miss an important piece of information, only because it is not tagged exactly like another similar one. When information is published in multiple languages, the ability to synchronize is important. These sophisticated requirements are also playing an important role in finance, healthcare, science and in academic work. In other words, subject matter experts still have a role to play, whether it is to index books, or to perform similar activities on digital information.
The ambiguity in human communication doesn't apply to computers.
When we communicate, we expect the person to whom we are speaking to understand what we mean. We choose to express ourselves with words that convey a meaning. But it often happens that the other person understands something slightly different from what we are saying, and we can't always tell whether the idea we were talking about has gone through or not. Even when our friend tells us "I see what you mean", it doesn't necessarily mean that he has actually grasped it exactly in the way we expect. This is where things become tricky when we use computers to communicate and we assume that the computer software understands perfectly what we meant. A computer only uses strings to classify information, i.e. it puts it in a box where it can be retrieved by finding the identifier. A computer doesn't know anything, and it doesn't understand anything. It can process information that it has stored, provided this information has been retrieved. Ambiguity is not something that works well with computers. Fuzzy logic exists, artificial intelligence can be quite sophisticated, and machine learning means that there is a potential for the computer to acquire information, but it always relies on algorithms that are at the core of the software.
Taxonomies
Librarians use taxonomies to organize knowledge, according to a hierarchy of categories and subcategories. All relevant materials are related to a branch or a leaf of the tree using a common terminology. The "authority terms" comprising it are expected to be used outside the library catalog, as metadata in the sources, enabling links to the taxonomy. For example, every book traditionally gets assigned Library of Congress authority headings as part of its metadata. Taxonomies have further evolved into ontologies, containing rules that facilitate automatic processing for retrieving subjects based on computed properties. The Semantic Web community has developed an ontology language (OWL) that is used on top of RDF to help use artificial intelligence techniques to retrieve data based on various user queries.
The main challenge with taxonomies, and with any knowledge organization scheme, is the cost of creating them and maintaining them over time. Experts need to meet and agree upon a common way to describe knowledge, down to a very detailed level. This seems like a reasonable endeavor, but in reality it turns out to be a very complicated task. Beyond the common ground, the devil is in the details. Experts may explicitly integrate different world views and find ways to account for multiple ways of modeling and qualifying terms. When no prevalent world view is asserted, disagreements may result in misunderstandings and imperfect compromises, jeopardizing the integrity of the knowledge description. Any ambiguity or lack of clear definition is likely to be deepened later by newcomers, who may not have a full understanding of the background context. Subsequent taxonomy editors may eventually mischaracterize some topics, and the overall quality of the taxonomical organization will decrease. The mere passing of time will also take its toll. New information sources may not be describable using the existing categories. Modifying the taxonomies may not be easy, especially if they served as the foundation for customized tools providing user interfaces that rely on the existing content. The procedures for submitting a request to change, add, or delete taxonomy terms may involve many steps that users will be reluctant to take. They may prefer to slightly tweak descriptions to fit an existing term rather than go to the trouble of adjusting the taxonomy. By doing so, they may underestimate the long-term effects of semantic drift, and slowly the taxonomy will fall out of sync with the content of the information, becoming progressively irrelevant, until these effects become impossible to ignore.
The company may then decide that it makes more sense to start from scratch and build a brand new taxonomy, which will eventually go through the same degradation process over time.
Crowd-sourced tagging is sometimes considered an alternative. It leaves each contributor full freedom to create their own terminology, but with the risk of creating semantic inconsistencies. Recently we have been working for the NYU Library on a project to integrate about one hundred book indexes made by different authors, at different times, and published by different publishers. Although each index is internally consistent, mixing them together reveals how delicate semantic integration can be. Fixing variant spellings or presentation for similar terms is the easy part, and can be handled using a variety of policies that enforce consistency after the fact (for example, person names could be harmonized as "last name" followed by "first name"). The most difficult part is the level of semantic granularity that is needed. For example, the term "heart surgery" is a valid index entry in a book describing a range of medical techniques, but it is irrelevant in a book that is entirely devoted to the subject of heart surgery.
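The "easy part" mentioned above, enforcing a consistent presentation after the fact, can be sketched as follows. This is only an illustration under simplifying assumptions: the function name `harmonize_person_name` and the "Last, First" house style are hypothetical, and the rule treats the last word of an unpunctuated name as the last name, which fails for multi-word surnames.

```python
def harmonize_person_name(name: str) -> str:
    """Return a person name as "Last, First" (an assumed house style)."""
    name = " ".join(name.split())  # collapse stray whitespace
    if "," in name:
        # Already in "Last, First" order: just tidy the spacing.
        last, first = (part.strip() for part in name.split(",", 1))
    else:
        parts = name.rsplit(" ", 1)
        if len(parts) == 1:
            return name  # single-word name: nothing to reorder
        first, last = parts
    return f"{last}, {first}"


# Three variants of the same entry, as they might appear in different indexes.
variants = ["Marie Curie", "Curie,  Marie", "Curie, Marie"]
harmonized = {harmonize_person_name(v) for v in variants}
print(harmonized)
```

A policy like this resolves presentation variants only. It says nothing about granularity, which is the part that still requires editorial judgment.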
Ontologies
Philosophically speaking, an ontology is a science or study of being: “specifically, a branch of metaphysics relating to the nature and relations of being; a particular system according to which problems of the nature of being are investigated; first philosophy”. 3
In computer science, "an ontology is the attempt to formulate an exhaustive and rigorous conceptual schema within a given domain, typically a hierarchical data structure containing all the relevant entities and their relationships and rules (theorems, regulations) within that domain."4
Using the philosophical term in the computer science context has a misleading effect: it conveys the idea that it is in the nature of things that they are the way we describe them. However, although things do exist, they cannot be described without a specific world view, be it implicit or explicit. Therefore there are various ways to describe the same thing, each of which is valid in its own right. Variations may be due (among other reasons) either to the fact that the descriptions have been created by different unrelated authors, or that they apply to different contexts, or that they serve a different purpose and are intended to be used by different categories of users...
As we have learned from experience, particularly when designing document structures, there is usually not a single way to describe things and sometimes there are no compelling reasons to decide whether one particular way should be considered better than another. Semantic applications only make it worse, since there may be several ways to speak about the same thing. It appears to be necessary to take into account multiple perspectives.
Bottom-Up versus Top-Down
Implementing an ontology, like a database management system or an XML application, is usually perceived as a top-down approach. The architecture has to be defined before anything can be done. This approach works for information islands, where the structure is simple enough to enable a small team of people to agree on a common ground. Semantic integration, on the other hand, allows for connecting information that was not originally meant to be connected. Using a perspective-based approach enables highly customized integration and a bottom-up approach. This opens powerful new ways to put information together, provided we abandon the requirement to have every layer in the process under control.
There are many examples where using multiple perspectives offers a workable solution. The major interest of such solutions is that they don't require any change to the existing management of information. The diversity of perspectives is respected, and new perspectives are created, each corresponding to a given category of usages. Integration of information comes as a result of mapping; it doesn't require source information materials to be altered. This approach leverages existing information rather than requiring it to be restructured.
Typical use cases include situations where an organization or several interconnected organizations (such as government agencies) need to cooperate. If there is an overall schema aimed at connecting information from various sources, the existing information assets to be interconnected are very diverse and very large in number. Consequently, having everybody agree on a single ontology that would cover all needs at various levels is an unachievable task. Using multiple perspectives in such a context is mainly a matter of common sense: less ambitious, more specific goals are easier to achieve. Still, they can be connected on demand to answer certain needs.
An example is the connection between topics involving Federal regulations and their counterparts at the state level in the United States (or between the European and national levels, or between the federal government and the provinces in Canada). Information returned may be identical, derived, slightly different, or radically different, and it needs to stay that way. A single ontology would never capture all nuances, and filtered views are necessary.
The usual message coming from the standards community is: "Make data interoperable! Merge your maps!" This seems to be such common sense that there is no point in discussing the rationale for doing it. The only interesting discussion is how to best achieve this objective. Actually, interoperability is far from being achievable in most cases without first transforming the data. This is similar to performing experiments in the context of quantum mechanics, where there can be no observation without altering the environment and conditions that are being studied. Merging seems to be more tame, because it can be done by creating a supplementary layer and letting the previous states remain unchanged. But the analysis of real-world situations still makes us wonder whether things which in appearance should merge should actually merge. The notion of perspective, which imposes a user view, is sometimes more important than the information content itself, and merging too extensively may eventually be counter-productive, because too much information is aggregated and this prevents us from using the information in an efficient manner.
The examples presented here are subject maps rendered as graphical maps. These are multiple maps of the same territory, each expressing a different perspective. The perspective can be analyzed by looking at what the map actually displays. Some of the maps express perspectives which result from the merging of other maps. The important point that will be emphasized is the fact that merging is not necessarily the ultimate goal, and the way it is done depends on how it fits specific, well-defined, user needs.
Automation vs. Curation
So far, we have shown the limitations of various semantic approaches to finding information. Using automated algorithms may result in missing crucial information, because the way the information item appears is outside the range of what the algorithm can grasp. Using a strict organization of knowledge may result in a process which is so hard to maintain that over time it becomes progressively irrelevant. Leaving full freedom for tagging information results in the creation of inconsistencies. The difficulties involved seem insurmountable to many of those confronted with these challenges, and, either to lower costs or out of despair, some companies are outsourcing many aspects of their information technology assets, including the knowledge management itself. But when companies or organizations defer the management of their core knowledge assets to third parties, they risk losing their raison d'être.
There should be a better way. There are two directions to look in: first, using the principle of independence between the sources and the knowledge management layer, and second, fine-tuning the balance between automatic processing and manual curation.
The independence between the sources and the knowledge management layer is at the core of the Topic Maps paradigm. But it has been somewhat relegated to the back burner by methods that privilege the merging of topic maps and require users to author topic maps using the syntactic constructs of the standard. Instead, our experience has been to be as pragmatic as possible about the organization of topics. It should not matter whether they appear in a database, in a content management system, in XML elements, in HTML metadata, in RDF-Dublin Core metadata, MARC format, spreadsheets, full text, index entries, etc. There are ways to extract those topics after the fact, and organize them with powerful tools providing a comfortable user interface. Once the topics are extracted, they live independently of the sources they come from. Therefore, a change in the way sources are handled does not affect the knowledge layer as a whole. For example, if a company decides to replace its content management system with a new one, the knowledge layer just needs to disconnect from the old system and connect to the new one. Everything else is preserved. In that sense, the Topic Maps paradigm offers a way to preserve the longevity of the work done on the knowledge layer. For example, the fact that Manhattan is a borough of New York City has nothing to do with the source formats in which the topic "Manhattan" is found. The problem arises if the knowledge management layer is handled inside a particular product. For example, if this information is only present in a content management system, and the content management system is replaced, the information gets lost and has to be recreated, potentially at high cost.
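The disconnect-and-reconnect scenario can be illustrated with a minimal sketch. Everything here is hypothetical: the class names `Topic` and `KnowledgeLayer`, the occurrence key `doc-17`, and the `oldcms://` and `newcms://` addresses are invented for illustration and do not correspond to any Topic Maps API or product. The point is only that topics and associations are stored apart from the table that resolves them to addresses in a source system.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    name: str
    # (association type, other topic name) pairs, held in the knowledge layer
    associations: set = field(default_factory=set)
    # source-independent keys, resolved to addresses by the current connector
    occurrence_keys: set = field(default_factory=set)


class KnowledgeLayer:
    def __init__(self, connector):
        self.topics = {}
        self.connector = connector  # maps occurrence key -> address in a source system

    def add_topic(self, topic):
        self.topics[topic.name] = topic

    def reconnect(self, new_connector):
        """Swap the source system; topics and associations are untouched."""
        self.connector = new_connector

    def locate(self, topic_name):
        topic = self.topics[topic_name]
        return sorted(self.connector[key]
                      for key in topic.occurrence_keys
                      if key in self.connector)


# Two hypothetical content management systems addressing the same document.
old_cms = {"doc-17": "oldcms://articles/17"}
new_cms = {"doc-17": "newcms://content/a17"}

layer = KnowledgeLayer(old_cms)
layer.add_topic(Topic("Manhattan",
                      associations={("borough-of", "New York City")},
                      occurrence_keys={"doc-17"}))

before = layer.locate("Manhattan")
layer.reconnect(new_cms)          # the CMS is replaced
after = layer.locate("Manhattan")
print(before, after)
```

After `reconnect`, the association stating that Manhattan is a borough of New York City is still there; only the addresses returned by `locate` have changed.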
Given the amount of information available at our fingertips, it is unrealistic to rely exclusively on manual qualification of findable information. Back-of-the-book indexes are extremely useful tools, because they have been crafted by hand, as an intellectual work, adding value to the book. But this activity is not scalable, except in specific contexts. Neither is it advisable to rely exclusively on automated processes, because of the numerous exceptions that would be missed. There is no magical answer to that question, but our experience is to find empirically the fragile equilibrium point between these two poles, knowing that this equilibrium point may change over time. Some automatic processes can be added, others need to be removed, and manual tweaking should be possible at various levels. Sometimes, it's more convenient to edit the results of an automatic process than to do everything manually. Sometimes, it's easier to do everything manually, and often more accurate; there is a limit to the accuracy automatic processes can add in decisions about semantic meaning. There is no absolute rule for deciding where the tipping point is.
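As a rough sketch of what "editing the results of an automatic process" can look like, the following combines a deliberately naive keyword tagger with a manual correction step. The function names, the keyword table, and the sample sentence are all assumptions made for this illustration; a real pipeline would use far more capable extraction.

```python
def auto_tag(text: str, keywords: dict[str, str]) -> set[str]:
    """Naive automatic pass: assign a subject when its keyword appears in the text."""
    lowered = text.lower()
    return {subject for keyword, subject in keywords.items() if keyword in lowered}


def curate(auto: set[str],
           add: set[str] = frozenset(),
           remove: set[str] = frozenset()) -> set[str]:
    """Manual pass: an editor removes false positives and adds missed subjects."""
    return (auto - remove) | add


keywords = {"heart surgery": "Heart surgery", "bypass": "Coronary bypass"}
text = "The road bypass relieved traffic near the hospital."

auto = auto_tag(text, keywords)   # the word "bypass" triggers a false positive
final = curate(auto, remove={"Coronary bypass"}, add={"Urban planning"})
print(auto, final)
```

The automatic pass is cheap but misreads "bypass"; the manual pass costs an editor's attention but corrects the meaning. Where to put the boundary between the two remains an empirical decision.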
The combination of both ways is what has proven to work best. Extracting knowledge into an independent layer, and enabling processing at that knowledge layer, with a feedback loop going back to the sources, doesn't seem like the most direct and efficient way to do this. This process is comparable to the publishing workflow where authors insist on using Word, but the publishers want XML. Round-tripping the conversion between the two formats is not efficient, but it's sometimes necessary. Furthermore, this level of indirection is precisely what provides us with the power and freedom to handle knowledge in a way that can be preserved over time, regardless of what happens to the source information and, more specifically, to the systems used to handle it. All the work which was done to describe information, type the topics, create relationships, and manage multilingual equivalences still works. Because it has been managed independently, upgrading a system simply means disconnecting from the old system and reconnecting to the new one.
The lessons learned from working with Topic Maps for more than two decades are mixed. Because of the rapid pace of technological advances, we have been overwhelmed by the success of information technologies. Looking for the immediate next big thing has obscured our capacity to think about the fundamental nature of what we are doing. The notions of trust, reliability, and high-quality content are still central to the long-term success of our enterprises. We need to adjust to the changing ways in which the information we deal with presents itself. It's just the beginning. When we created the Topic Maps standard, we created something that turned out to be a solution without a problem: the possibility to merge knowledge networks across organizations. Despite numerous expectations and many efforts in that direction, this didn't prove to meet enough demand from users. But we also developed the concept of independence between information sources and the knowledge management layers. This may turn out to be what remains in the long term, even if the fact that this idea once went by the name of topic maps may fall into oblivion.
Subjects are interconnected
Semantic integration is the process by which plurality is reduced to uniqueness. It consists of grouping information items around shared subjects. The definition of what delimits a subject is a matter of perspective. In a given perspective, several information items will be considered the same subject, whereas in another one they may be distinguished because the level of granularity of the description of information is finer. Semantic integration aims at creating views in which subjects that have been expressed with different expressions appear to be the same.
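The dependence of grouping on perspective can be made concrete with a small sketch. The item labels and the two mapping tables (`coarse` and `fine`) are hypothetical, chosen only to show how the same items collapse into different subjects depending on the granularity of the perspective applied.

```python
from collections import defaultdict

# Information items as they might be labeled in different sources.
items = [
    {"id": 1, "label": "Lower Manhattan"},
    {"id": 2, "label": "Manhattan"},
    {"id": 3, "label": "Financial District"},
    {"id": 4, "label": "Brooklyn"},
]

# Each perspective maps a label to the subject it counts as.
coarse = {"Lower Manhattan": "Manhattan", "Financial District": "Manhattan"}
fine = {"Financial District": "Lower Manhattan"}


def integrate(items, perspective):
    """Group item ids by subject, as seen from one perspective."""
    groups = defaultdict(list)
    for item in items:
        # Labels the perspective does not mention stand for themselves.
        subject = perspective.get(item["label"], item["label"])
        groups[subject].append(item["id"])
    return dict(groups)


by_borough = integrate(items, coarse)
by_neighborhood = integrate(items, fine)
print(by_borough)
print(by_neighborhood)
```

In the coarse perspective items 1, 2 and 3 are the same subject; in the finer one, item 2 is kept apart. Neither grouping alters the items themselves.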
"Perspective" implies plurality, whereas "ontology" implies singularity.
The distinction between data, information and knowledge
Thomas Davenport and Sirkka Jarvenpaa write: "Our distinction between data/information and knowledge conveys that the source of value does not arise from possessing the information source, but from acting on it in a context of a specific meaning at a specific time".5 What do computers bring to the picture? Computers are the terminals connected to networks which enable information to be transmitted worldwide, at the speed of light. Computers exchange bits of data, sequences of 0s and 1s. When grouped, these bits form strings of characters. And strings of characters resemble what we call words in our natural languages. And words, or sequences of words, are what we use to name things. Names are how we designate things. Therefore, by uttering the names of the things we are talking about, we expect others to see what we mean. But human languages are inherently ambiguous. The first question is: how sure are we that when we intend to use a name, the computer knows the proper thing to do with it? The meaning of some object or concept expressed within a language is subject to interpretation; it depends on the perspective. Computers do not have this problem. In the general case, computers treat as identical things that are expressed with the exact same words, because they are able to establish the identity of two strings of characters. This is why we think that computers know what we mean. But this doesn't take into account the difference that may exist between what we really mean and what the computer understands.[^what-do-computer-understand]
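A short sketch may help make the gap between string identity and meaning tangible. The bit sequence below is the ASCII encoding of one word, and the `mentions` list is a made-up set of examples: the descriptions attached to each name are an assumption of this illustration, visible to a human reader but not to the string comparison.

```python
# Bits grouped into bytes, bytes decoded into a string of characters.
bits = "01010000 01100001 01110010 01101001 01110011"
word = bytes(int(group, 2) for group in bits.split()).decode("ascii")
print(word)

# (name as written, what the writer meant)
mentions = [
    ("Paris", "capital of France"),
    ("Paris", "city in Texas"),
    ("paris", "capital of France"),
]

# Identical strings, different subjects: the comparison says "same".
same_string_different_subject = mentions[0][0] == mentions[1][0]

# Same subject, different strings: the comparison says "different".
same_subject_different_string = mentions[0][0] == mentions[2][0]

print(same_string_different_subject, same_subject_different_string)
```

The comparison is exact and unambiguous, but it answers a question about strings, not the question about subjects that the writer had in mind.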
However, computers store data, not meaning. As far as computers are used as a publishing medium, they are no different from books, and everything that applies to printing applies to computers. The only difference is that the text is displayed on a screen rather than printed on paper. Then, what's so different about computers?
Trusted Published Sources
The reason we traditionally rely on what's printed is that the process of printing is complex, expensive and requires professional skills. In other words, it was a big deal. The publishing industry created the rules to define and preserve intellectual property. The definition of what an author is, what copyright is, etc., dates from the early times of the publishing industry. The publishing industry and the press occupy a significant part. Even in democratic countries where freedom of speech is guaranteed, not everything can be published. And when something gets published, there are a number of rules and constraints to follow. Among those are the obligation to register a publication number, so that the publication can be used for further reference, and the obligation to give a number of copies to national libraries, so that the published item can be found during a certain period of time.
Published sources on the Internet.
In the twenty-first century, the Internet is the medium of choice for publishing information. There is a huge advantage: any information sent to a server connected to the Internet is immediately available worldwide, and the cost of publishing has significantly decreased. Not only is it much cheaper to publish electronically than to print, but above all there are no more costs for shipping printed materials to the places where they are stored and the places where they are sold (not necessarily the same places). But the ease of use and affordable price come at a cost: it is as easy to "unpublish", i.e. to edit, remove or transform any piece of information, as it is to publish. Contrary to a published item, which can be reliably used for future reference, there is no guarantee that what is available at a particular instant on the World Wide Web will still be there even a moment later.[^internet-archive] On the Internet, there is a de facto registering operation due to the fact that every information item must have a "uniform resource locator" (or URL). The sheer size of the computer network known as the World Wide Web and the amount of information available create an illusion that once we have found information on the Web, that's all there is. Some pieces of information are to be found inside others, and may not have an identity as such. There is no obligation to publish something on the Web and store it for a given period of time. It is therefore possible (and actually very frequent) that information items referred to by their URL don't exist any more. This is the source of the phenomenon of "broken links" that are spread all over the Web, and make it a less trustworthy publishing environment than print.
[^internet-archive]: Some sites are preserved in the Internet Archive.
The old saying, "It must be true, because it's printed", is not valid on the Web, unless the information comes from an authoritative source, which we trust. The organizations that are trustworthy when they publish written materials can also be trusted when they publish on the Web. Trust has nothing to do with the medium of publication. Trust comes with acknowledging the reliability of certain authors, with experience, and with word of mouth. Education is part of it. We learn from our teachers which sources of knowledge can be trusted.
Does sharing knowledge mean we need to unify how we think?
There are several ways to organize knowledge (see "Knowledge Organizers"). But the essential question is the scope in which this organization is valid. The community of users of that particular knowledge or information has to be defined: it can be experts in a certain field, people sharing a common language, or people sharing common interests. It's difficult to imagine that a certain configuration of knowledge is valid for everybody.
The natural way to organize information in a way that can be easily manipulated and retrieved by computer systems is to use databases. Databases consist of defining classes of objects, so that each object belonging to a given class can be found where it belongs. Database systems enable users to issue queries based on criteria corresponding to the ways information is organized, returning lists of relevant items.
The temptation exists to reduce information to what computers are able to handle. Since computers need data to be unambiguous to be useful, data are encoded following models which have usually been elaborated with the computer in mind. This may sometimes result in imposing views that reduce complex realities to a set of well-defined, but over-simplified, schemas. This problem can prevent free expression of complex and messy information. It usually also prevents conflicting information from being entered into a computer system. This results in a situation where the information encoded may not reflect reality, or worse, may prevent personnel from entering data according to perspectives which are different from those originally allowed by the designers of the system. On the other hand, when users can create their own terms, the result may be more accurate, but it is still arbitrary and carries the potential for a messy organization. Sometimes the same concept is designated by different words, sometimes a concept is described by a category that is much broader than usual, etc.
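The way a predefined schema can turn away information that doesn't fit may be illustrated with Python's standard `sqlite3` module. The table name `incident` and its three allowed categories are invented for this sketch; what matters is that the schema designer's list of categories is enforced mechanically, and a report that straddles two categories is rejected rather than recorded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE incident (
        id INTEGER PRIMARY KEY,
        category TEXT NOT NULL
            CHECK (category IN ('fire', 'flood', 'theft'))
    )
""")

# A report that fits the designers' categories is stored without trouble.
conn.execute("INSERT INTO incident (category) VALUES ('fire')")

# A messier report, closer to what actually happened, does not fit.
try:
    conn.execute(
        "INSERT INTO incident (category) VALUES ('fire caused by flood damage')"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the CHECK constraint refused the row

rows = conn.execute(
    "SELECT category FROM incident WHERE category = 'fire'"
).fetchall()
print(rejected, rows)
```

Queries against such a table are precise, which is the benefit. The cost is that the second report either gets lost or gets tweaked to fit an existing category.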
The gravity of this problem can be minimized by saying: we have been through this already. For example, when desktop publishing systems replaced typesetting systems, professionals in the typesetting industry complained about the limitations of the computer-based systems and the loss of functionality that followed. But this was a temporary problem, due to the industry being in its infancy. Loss of content is much more serious, because it can't be recovered. We need to do something about it. The second part of this book presents solutions to avoid having to reduce information to whatever current computer software applications are able to absorb.
E Pluribus Unum: Welcoming Diversity
There is a clash between the information that we have to deal with and the information that computers can manipulate. On the one hand, information is diverse, constantly evolving, etc. We live in a global world where a variety of cultures, languages, religions, practices, etc., exist and will continue to co-exist. On the other hand, computerized information works best when it is well-formed, well-organized, unequivocal, and when it conforms to preexisting schemas. If the top-down approach implied by the computerization of information were to become generalized, the world would take another turn, and we would be ruled by a dominant world view, a dominant language, a dominant economic model, etc. This would be a world dominated by the constraints imposed by one technology. Tools are everywhere, and people who know how to manipulate them are in a position to manipulate others as well. If the system doesn't perform as desired, they can answer: "It's the computer system, it's not our fault. There is nothing we can do about it." Yet technology has evolved to accommodate various languages, alphabets, and formats.
Technicians can no longer sustain the claim that technical constraints force us to limit the expressiveness of what gets encoded. In the 1960s and 1970s, computers couldn't accept any text unless it was in upper case, with no accented characters and, worse, no non-Latin alphabets; the first versions of popular operating systems could not handle file names longer than eight characters. Because memory and storage were so scarce, years were encoded with the last two digits only, and it is only in the very last years of the twentieth century that people started to realize that something needed to be done to handle the dates in the new century, and the so-called Y2K bug ended up costing over 300 billion dollars 6.
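The arithmetic behind the two-digit year problem is simple enough to show directly. The function names below are hypothetical, and the birth year is an arbitrary example; the sketch only demonstrates why a calculation that worked in 1999 stopped working in 2000.

```python
def age_from_two_digit_years(birth_yy: int, current_yy: int) -> int:
    """Age computed from years stored as their last two digits only."""
    return current_yy - birth_yy


def age_from_full_years(birth_year: int, current_year: int) -> int:
    """Age computed from years stored with all four digits."""
    return current_year - birth_year


# Someone born in 1965, as seen in 1999 and then in 2000.
in_1999 = age_from_two_digit_years(65, 99)    # correct
in_2000 = age_from_two_digit_years(65, 0)     # negative: "00" reads as earlier than "65"
fixed = age_from_full_years(1965, 2000)

# Sorting two-digit years as text also places 2000 before 1999.
order = sorted(["99", "00"])

print(in_1999, in_2000, fixed, order)
```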
Now that Asia and other parts of the world that were lagging behind are taking the lead in information technologies, computers are able to handle ideographic languages, and the Latin alphabet is becoming merely one possible way of displaying data on computers.
We will not get rid of diversity. But we still tend to think that in Information Land, we need to define a world view which will be valid for many people. It can take the form of one or several classification or database schemas, taxonomies, ontologies, etc. And it gives a feeling of comfort, when working within an organization, to comply with a template under which everything is supposed to take place. We usually feel more comfortable if we are told what to do than if we have to figure it all out. This is fine, as long as information does not overflow. When confronted with the unexpected, even in the small details of everyday life, we are not equipped, and we have to be "creative" about a way to cope with the situation. Sometimes, being creative means tweaking a little bit the way things should be organized so that the overall schema still seems to be working. But this attitude, once generalized, may backfire. There is a growing discrepancy between the intended usages of information when the schemas were created and the way they are used in practice. And this discrepancy is not documented, because documenting it would be asking for trouble. It would end up showing others how smart we are, because we can't be satisfied with what exists, and always need to add our own two cents to the process. This attitude is generally considered undesirable in the era of team play. But what we end up doing is preserving the harmony of information organization even when facts show that it's not up to the task any more. The bottom line is that the information emanating from such an organization is not reliable any more, because it doesn't reflect the inner complexities of what needs to be there. And this is not done with bad intent; people are convinced that what they are doing is the right thing to do, and this is what they are told they should be doing.
The problem is made even worse if the tools used to capture information play an important role in how this information is entered. If all information were entered as plain old text, using word processors and free writing, it would be simpler to author anything one has to say. But most organizations have more elaborate tools to capture information: forms serving as input to databases, or spreadsheets with predefined templates, for example. These tools do a great job of keeping information organized, but they are sometimes too constraining to handle specific information that doesn't fit the predefined schema. This phenomenon can be seen – or more precisely, heard – on the phone when predefined menus are presented. Very often, the options presented do not correspond to the specific request and an operator has to solve the problem. To continue the parallel, information systems very often lack the operator who could capture the specific kind of information that is relevant. A significant (and non-measurable) amount of information is lost simply because it doesn't fit into the predefined categories.
We humans speak different languages. This is not going to change any time soon. Each of us has our own way to express what we think, in a way that can be very original and personal. We are entitled to think differently from each other, and democratic countries let us express what we think. Freedom of speech is considered an inalienable right, and we strive to protect it where we benefit from it, or fight for it where it's missing. This is not going to change any time soon either. Being able to speak with a different voice is therefore considered a good thing, even if from time to time it increases complexity and ambiguity, and makes communication more difficult than if we were all speaking with a single voice. Computers are not well equipped for that, and all the computer models used to represent information tend to go in another direction, i.e., to unify diverse ways of representing subjects in a way that is unique and the same for everybody. Having said that, it is also clear that, within a limited group that shares common interests, some level of agreement is desirable on what makes the group what it is, i.e., what are the common subjects of agreement that identify the group by distinguishing it from any other group. Such groups are called communities of interest, or communities of practice.
Instead of, or in addition to, this top-down approach, a bottom-up approach is possible. Instead of practicing wishful thinking, by insisting on viewing things as they should be, regardless of what they really are, we can start from what there is. Within this approach, we can now look at things with fresh eyes, trying to figure out what their meaning is. In reality, whether we want it or not, we are looking at what they mean to us, and ignoring what they could mean to others, simply because there is no way to know that. What we mean by "us" depends on who "we" are. As an individual, I can only speak for myself. As a member of an organization, "us" means a certain number of people, depending on how powerful I am in the organization. The more powerful I am, the more people will be impacted by my vision of what "we" should think and how "we" should behave.
Diversity is everywhere, not only in the subjects we deal with, but also in the various ways we represent them. It will prevail, and any attempt to get rid of it will eventually fail.
Use Cases for Multiple Perspectives
The goal of integrating diverse information networks can be reached by providing subjects which have enough weight to attract information guaranteed to be relevant to them, including, but not limited to, the various names used to designate a subject, its occurrences in diverse information sources, and the subjects related to it for a diversity of reasons. 7
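As a rough illustration of what such a "weighted" subject might gather, here is a minimal Python sketch. The `Subject` class, its field names, and the example URL are all hypothetical, chosen only to mirror the three kinds of information listed above (names, occurrences, related subjects); they are not taken from any particular standard or library.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Subject:
    """Hypothetical record gathering what is known about one subject."""

    # The various names used to designate the subject.
    names: set[str] = field(default_factory=set)
    # Addresses of information sources in which the subject occurs.
    occurrences: set[str] = field(default_factory=set)
    # Related subjects, grouped by the reason for the relation.
    related: dict[str, set[str]] = field(default_factory=dict)

    def relate(self, reason: str, other: str) -> None:
        """Record that this subject is related to `other` for `reason`."""
        self.related.setdefault(reason, set()).add(other)


# Illustrative data only: the URL below is a placeholder, not a real source.
lower_manhattan = Subject()
lower_manhattan.names.update({"Lower Manhattan", "Downtown Manhattan"})
lower_manhattan.occurrences.add("https://example.org/maps/zoning#lower-manhattan")
lower_manhattan.relate("part of", "Manhattan")
lower_manhattan.relate("contains", "Financial District")
```

The point of the sketch is that a subject is not reduced to a single identifier: it stays open to new names, new occurrences, and new relations as further sources are integrated.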
Conclusions
Should there be one common perspective to describe Lower Manhattan?
From what has been shown here, it may be that the answer is: what for?
On the one hand, it would be nice to have all information relevant to a given territory connected together. It seems rational to do so. The layered model presented in the previous graphic seems to be an appropriate answer. This layered model could be merged and used as a management tool. It is also a discovery tool that enables queries to be performed on various aspects. Harmonization of these aspects could lead to the formulation of a common model that would contain common goals.
On the other hand, the information overload of such a merged approach would be daunting. Merging all these layers would necessarily lead to losing focus on the important aspects that need to be privileged. The absence of a specific perspective would equate to losing relevance and precision in the information. The complexity would increase, because the level of precision of each of these maps differs, and we don't necessarily have enough information to determine where exactly things match together. The maintenance of such an information base would also be made much more difficult, given the fact that many aspects of these layers are interconnected. In addition, achieving such a level of interoperability between the layers would require agreements on each of these aspects, and agreement might be prevented because of possible conflicting requirements.
Integration vs. Interoperability Revisited, or Life beyond the Semantic Web
Semantic interoperability relies on having a well-defined perspective. There is no other way to achieve it. Computers need to understand each other, and that means that everything, down to the most minuscule details, needs to be expressed in such a way as to prevent bottlenecks from occurring during the transition process between various environments. Here we are speaking of a single perspective, which may be distributed over several systems, located in various places, but the overall design, the structure, and the semantics used should be common. Even though the systems may be large and very far from each other, they should all rely on a common architecture, built on one perspective, to function properly. In this environment, computers deal with objects, and these objects may or may not represent semantics, but this is not important. They end up being reducible to unique identifiers, be they URIs (or rather now IRIs to accommodate various alphabets), XML identifiers, digital object identifiers, or Public Subject Identifiers. We are speaking here of a potentially large, but closed, universe.
Semantic integration, on the other hand, corresponds to a situation that is at the opposite end of the spectrum. Here we are speaking of information that allegedly represents some meaning, not always totally explicit, and that can be expressed in an infinite number of ways. Diverse information can co-exist even in very close environments, in the room next door, or even on the same computer system. Of course, inside an organization, examples abound of information that has semantics that could be interconnected but for some reason is not. This is an environment where an infinity of variant perspectives exists, some of which may not even be defined explicitly. This is the "real" information world, as opposed to the previous world which is all pre-processed. And it is much more challenging to try to find ways to make meaning emerge out of this messy information magma, rather than to tailor everything in a way that would reject anything else. Here we are in an all-inclusive environment. Any information, even information not initially meant to be interconnected, should be interconnectable, one way or the other. This is what semantic integration is really about.
The objective of this paper has been to demonstrate that the first step to achieve this daunting goal is to recognize the multiplicity of perspectives. In other words, once some meaning has been uttered, there is a perspective behind it, whether we recognize it or not. It is by looking at the rules that have been applied while creating this piece of information, and by disclosing the rules that are at work – rules for establishing identity of meaning, rules for applying merging to identical subjects, and derived rules, if any – that we can provide hooks that can be used as binding points for other, similar, pieces of information. But sometimes we can't access the original perspective that was at work while creating the information. Never mind! What counts is the perspective through which we are viewing the information we are interested in. It is biased, naturally, but recognizing it makes things simple, and open. If for some reason this bias is not appealing, or not adapted, nothing prevents us from creating another one which is more appropriate. And the reality is that this is what we are doing constantly, whether we recognize it or not.
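To make the idea of disclosed rules concrete, the following Python sketch spells out one possible identity rule and one possible merging rule. Both rules are assumptions made for illustration (two records denote the same subject when they share at least one identifier, and merging takes the union of their properties), and the record layout and `urn:example:` identifiers are hypothetical. A different perspective would simply plug in different rules.

```python
from __future__ import annotations


def same_subject(a: dict, b: dict) -> bool:
    """Identity rule (assumed): records sharing an identifier denote the same subject."""
    return bool(a["identifiers"] & b["identifiers"])


def merge(a: dict, b: dict) -> dict:
    """Merging rule (assumed): take the union of every property of both records."""
    return {key: a[key] | b[key] for key in ("identifiers", "names", "occurrences")}


def integrate(records: list[dict]) -> list[dict]:
    """Apply the two rules above until no two remaining records denote the same subject."""
    merged: list[dict] = []
    for record in records:
        current = record
        rest = []
        # A new record may bridge several existing ones, so fold in every match.
        for existing in merged:
            if same_subject(current, existing):
                current = merge(current, existing)
            else:
                rest.append(existing)
        merged = rest + [current]
    return merged


# Illustrative records coming from two hypothetical sources.
records = [
    {"identifiers": {"urn:example:lower-manhattan"},
     "names": {"Lower Manhattan"}, "occurrences": {"zoning-map"}},
    {"identifiers": {"urn:example:downtown", "urn:example:lower-manhattan"},
     "names": {"Downtown"}, "occurrences": {"transit-map"}},
    {"identifiers": {"urn:example:midtown"},
     "names": {"Midtown"}, "occurrences": {"transit-map"}},
]
result = integrate(records)
```

Because the rules are explicit functions rather than hidden conventions, anyone who finds this bias unsuitable can replace `same_subject` or `merge` and obtain another perspective on the same records.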
Using the word "perspective" in lieu of "ontology" helps us understand this issue, which may turn out to be critical. The meaning of information has little to do with what computers can make out of it. Computers don't get the meaning, they just apply algorithms. It's time to recognize that artificial intelligence is not what we're looking for. We should be more and more aware of the illusory aspects that computers present. Search engines are extremely useful, and we couldn't live without them any more. But they are not providing us with value that is outside of their own, closed, well-defined, usually proprietary domains. Some information users need more than that.
The problem with subjects is that they are ultimately not really computable. What gets to computers is only their expressions. And the relation between subjects and expressions is not a simple one-to-one. The whole issue of perspective comes in the middle of that, along with the difference between top-down, closed, semantic-web oriented, machine-computable semantics, on the one hand, and bottom-up, open, ambiguous, diverse semantics of information on the other hand. I believe this confrontation is going to be the major challenge for the years to come.
-
XHTML represents a synthesis between XML and HTML. It is a reformulation of HTML as an XML application, so an XHTML document must be a well-formed XML document. XHTML is used by many mobile applications and ebook standards, because it's easier to process than regular HTML. It is a stricter, more stringent, variant of HTML. All modern browsers accept XHTML. ↩
-
Application developers tend to prefer the JavaScript Object Notation (JSON) as an interchange format, because it is easier to parse and to implement, and therefore translates into faster and more efficient software. However, XML is better suited for documents containing a significant amount of markup. ↩
-
Dnyanesh Rajpathak, Knowledge Media Institute, The Open University, http://kmi.open.ac.uk/people/dnyanesh/Publications/ontology-1.ppt ↩
-
This definition of "ontology" was found in Wikipedia. As Wikipedia is being edited, the definition has changed, and this quotation is not there any more. The side-effect of the web-based publication process is that there is no way to guarantee stability of information and especially of sources. However, in this case, this definition has been quoted by several authors who all refer to the Wikipedia origin that doesn't exist any more. ↩
-
In Alfred J. Beerli, Svenja Falk, and Daniel Diemers, eds., Knowledge Management and Networked Environments: Leveraging Intellectual Capital in Virtual Business Communities (New York: AMACOM Books, 2003). Quoted in Donald M. Norris, Jon Mason, Robby Robson, Paul Lefrere and Geoff Collier, A Revolution in Knowledge Sharing, Educause Review, September/October 2003. ↩
-
Source: Y2K: Overhyped and oversold?, BBC News. ↩
-
There are four different forms of integration: 1. Portals (or at-the-glass) integration is the shallowest form, bringing potentially disparate applications together in a (typically Web) single entry point. 2. Business-process integration orchestrates processes across application and possibly enterprise boundaries, such as those involved in a supply chain relationship. Web services and their derivatives are becoming important here. 3. Application integration, in which applications that do similar or complementary things communicate with each other, is typically focused on data transformation and message queuing, increasingly in the XML (Extensible Markup Language) domain. 4. Information integration, wherein complementary data are either physically (through warehousing tools) or logically brought together, makes it possible for applications to be written to and make use of all the relevant data in the enterprise, even if the data are not directly under their control. A typical example of this would be a new customer relationship application that combines the relational call log with the speech-to-text translated call itself.
Source: Information integration: A research agenda by A.D. Jhingran N. Mattos H. Pirahesh: Information Integration: A new generation of information technology © 2002, International Business Machines. Information Integration, Order No. G321-0147, http://www.research.ibm.com/journal/sj41-4.html. Copyright doesn't apply to the preceding excerpt which can not be reproduced without the authorization of IBM. ↩