Information Made Accountable

Our information-based society needs to improve trust in information exchanges. This chapter compares what has been done for centuries with money exchanges and asks whether it is applicable to information content. The "double-entry bookkeeping" method used in accounting can serve as a guide to what could happen in the world of information, providing ways to make information content and systems accountable.

Money, information and trust

Trust is a condition for conducting business. Money exchanges, whether directly with cash or through intermediaries like banks or credit card companies, are based on a shared trust that what we exchange has value. We trust the other party, because we can verify that the money is actually there, in other words, because we can hold the involved parties accountable. We assume that their accounting is in good standing because we know that we can check it if we need to.

Information is, generally speaking, not accountable. It is hard to figure out where information is coming from, by whom it is being accessed and where it ends up going. This was not a problem in the infancy of the Internet and the World Wide Web, when online information existed to fill the needs of scientists and technologists who were voluntarily and enthusiastically sharing information with each other, in order to build an interconnected world. But now that online information exchanges have become the prevalent medium for businesses and governments alike, we desperately need to establish whether the information we exchange has any value. The lack of accountability has become a major issue, giving rise to new threats including cyberattacks, identity theft, the spreading of fake news and propaganda, meddling with computerized election systems, and the dismantling of critical computer-powered infrastructure. The situation has worsened to the point that information is now contemplated as a weapon in an arsenal of cyber-warfare. If we want to overcome these hurdles, we need to find better ways to account for the information we deal with.

Accountability is well implemented for money exchanges. Accounting can be somewhat complicated, but it is based on a simple idea, which is that any amount of money comes from somewhere and goes somewhere else. It relies on recording transactions between money accounts.
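The idea that any amount of money comes from somewhere and goes somewhere else can be sketched in a few lines of Python. This is only an illustration, not an accounting system: the account names and the record helper are hypothetical and do not come from any existing library.

```python
from collections import defaultdict

# Running balance of every account, and the journal of recorded transfers.
balances = defaultdict(int)
journal = []

def record(source, destination, amount):
    """Record one transfer as two matching entries (hypothetical helper)."""
    journal.append((source, destination, amount))
    balances[source] -= amount       # the amount leaves one account...
    balances[destination] += amount  # ...and enters another one

record("customer", "shop", 50)
record("shop", "supplier", 30)

# Every amount leaves one account and enters another,
# so the balances always sum to zero: nothing appears from nowhere.
total = sum(balances.values())
print(dict(balances), total)
```

Because each transfer is entered on both sides, any imbalance in the total would point to a transaction that was recorded incorrectly.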

The recent use of digital currency, with Bitcoin, applies accountability concepts to digitized monetary transactions. Its underlying architecture, called blockchain, a distributed database recording transactions in shared, immutable ledgers, aims to go further than monetary transactions. It is starting to be applied, for example, to healthcare, for transactions having to do with the delivery of services, such as drug prescriptions [1].

Accountability applied to the world of information is therefore a very general endeavor. Since accountability is transactional, the objects to which it applies can be best described as a network made of various information items, connected through a graph.

Data Projection

The "Data Projection" model described here is a method to make information accountable by applying to it methods similar to those that accounting applies to monetary exchanges.

To make information accountable, the "Data Projection" model proposes to flatten data and metadata as well as their interrelations and processes in a graph of transactions. This graph connects data and metadata, via processes and semantic relations. Multiple relations to or from an object are decomposed into sets of binary relationships. The resulting graph is a flattened representation of multidimensional information that can be analyzed by creating perspectives isolating objects and transactions of interest.

The Data Projection model came into existence as a natural extension of the experience gained with topic mapping [2]. Data projection is to Topic Maps what the physics of elementary particles is to chemistry: it goes much deeper into the internal components of information, and provides accountability down to details which are invisible at a higher level.

The term "data projection" results from looking at a flattened representation of multidimensional data, in the same way that projections describe how three-dimensional objects are represented on a flat, two-dimensional surface using specific perspectives. It is noteworthy that Luca Pacioli, who formalized the double-entry bookkeeping practice of the Venetian merchants, also worked, together with Leonardo da Vinci, on mathematical aspects of the theory of perspective.

Valid Information

When a new piece of information is created, the very record of its creation as a process guarantees that no information ever exists in a vacuum, i.e., unrelated to any other unit of information. A process applied to an object is of a similar nature to a relation that connects to that object.

Saying: "2 + 3 = 5" expresses an equality relation between the operation "2 + 3" and the value "5". This expression describes a process called addition, with the left part playing the role of "before" and the right part playing the role of "after". If we omit the result of the operation, saying simply "2 + 3" expresses the fact that "2" and "3" are related by the operator "+". This arithmetic operation is made of two operands (2, 3) and one operator (+). In this latter example, we are simply asserting the fact that these two numbers are associated through a process of addition.

Numbers have the property of "additivity", i.e. the potential to be added. We can say: "2 is a number", "number is susceptible to use +". Therefore, "number" is in the ledger for 2, and + is in the ledger for number. Consequently, if we are in a situation where only numbers are allowed to be added, the ledger for + should only contain items that are one step away from "number", for example "2". We can use this chaining of properties to create a validation rule, that would be able to detect a non-number in the ledger for "+" and report it as an error.
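The validation rule described above can be sketched as follows. This is a minimal illustration in Python, under the assumption that ledgers are stored as plain dictionaries and lists; the names item_ledger, plus_ledger and invalid_operands are hypothetical.

```python
# "2 is a number": the ledger for each item lists what it is declared to be.
item_ledger = {
    "2": {"number"},
    "3": {"number"},
    "apple": {"fruit"},
}

# The ledger for "+": every item that has been used as an operand of "+".
plus_ledger = ["2", "3", "apple"]

def invalid_operands(operator_ledger, items, required="number"):
    """Return the items that are not one step away from `required`."""
    return [item for item in operator_ledger
            if required not in items.get(item, set())]

errors = invalid_operands(plus_ledger, item_ledger)
print(errors)
```

Here the rule reports "apple", the only entry in the ledger for "+" that is not declared to be a number.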

Binary relationships

There can be a variety of relationships between information items. Organizing the relationships into logical blocks, and creating an architecture of relationships, can be complex. Taxonomies are hierarchical relationships used to describe how concepts are related to others. Well-known examples include the taxonomy of living species, or library catalogs used to classify sciences into domains and subdomains. Taxonomies used in library science contain essentially two types of relationships: "broader term" and "narrower term". Other information systems are organized according to more complex schemas. For example, family trees are strictly hierarchical (parent - child relationship), but the relationship between spouses is not hierarchical. Each spouse is added to a family tree from another tree. And the evolution of society has created many situations in which the classical family tree representation based on well-established categories needs to be further nuanced with new kinds of relationships.

A graph, or networked representation of the relations between information items, is wide open, because it doesn't constrain the relationships between information items to be hierarchical. An information item can also be related to multiple others. For example, a woman can have many children. Saying that "A" "is the mother" of "B", "C", "D", is equivalent to saying "A is the mother of B", and "A is the mother of C", and "A is the mother of D". The first expression is called an n-ary relation, whereas the second expression is a functionally equivalent set of binary relations. Mathematicians have established that there is a strict equivalence between them. Under the hood, it is possible to converge to a representation relying solely on binary relations, even if for convenience relations are presented to users as grouped.

As a result, we can assert that the whole world of information can be represented ultimately as a set of binary relations. The transformation that takes as input a set of n-ary relations and outputs them as binary relations can be described as a flattening operation.
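The flattening operation can be illustrated with the motherhood example. The sketch below is written in Python and assumes relations are stored as simple tuples; the function names flatten and regroup are hypothetical.

```python
def flatten(subject, relation, objects):
    """Turn one n-ary statement into binary (subject, relation, object) triples."""
    return [(subject, relation, obj) for obj in objects]

def regroup(triples):
    """Group binary triples back by subject and relation, for display purposes."""
    grouped = {}
    for subject, relation, obj in triples:
        grouped.setdefault((subject, relation), []).append(obj)
    return grouped

binary = flatten("A", "is the mother of", ["B", "C", "D"])
print(binary)
print(regroup(binary))
```

Regrouping the binary relations gives back the original grouped statement, which is the sense in which the two forms are equivalent in this small example.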

Perspective in a flattened world

Perspective comes into play when representing a three-dimensional scene on a two-dimensional flat surface such as a painting, a photograph or a computer screen. A scene is seen from the point of view of a viewer whose eye is in a certain location. Objects are scaled depending on their distance: remote objects appear smaller than closer objects. Parallel lines are represented as converging at a point called a "vanishing point". The laws of perspective have been studied by mathematicians and artists.

In the world of information, perspective determines the number of layers we uncover as we follow where a piece of information leads us. It helps us focus on some aspects of the information, while explicitly ignoring others. Classification levels are examples of perspectives: data considered classified will be excluded from a perspective designed for non-authorized viewers. Conversely, an auditor may request the details of what contributes to the income totaled on a tax return. Ultimately, the level of information made visible is a matter of perspective. In information modeling terms, a perspective is a view made of filters. Making information accountable is done by defining the perspectives from which we want to look at it. Furthermore, several perspectives can be defined on the same information repository. There is no universal perspective that makes information accountable once and for all.
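A perspective understood as a view made of filters can be sketched as follows. This is a hypothetical Python illustration: the records, the classified flag and the perspective function are assumptions made for the example, not part of an existing system.

```python
# A small repository; the "classified" flag is an assumed attribute.
records = [
    {"title": "press release", "classified": False},
    {"title": "budget summary", "classified": False},
    {"title": "internal memo", "classified": True},
]

def perspective(*filters):
    """Build a view that keeps only the records accepted by every filter."""
    def view(items):
        return [r for r in items if all(f(r) for f in filters)]
    return view

# Two perspectives defined on the same repository.
public_view = perspective(lambda r: not r["classified"])
authorized_view = perspective()  # no filter: every record stays visible

public_titles = [r["title"] for r in public_view(records)]
authorized_titles = [r["title"] for r in authorized_view(records)]
print(public_titles)
print(authorized_titles)
```

The repository itself is never modified; each perspective only decides what is shown to a given viewer.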

Metadata is also data

Information technology traditionally distinguishes between data and structure. The structure of data, in a database, is defined by a schema. The schema is a framework containing types of information allowing us to identify the nature of the data we are dealing with. For example, a contact database schema would contain fields for the last name, first name, telephone number, email, etc. Another commonly used distinction is between data and metadata. For example, in a document, data is considered to be the content, while metadata contains fields such as the author name, the creation date, etc. Sometimes, metadata are added automatically, sometimes they can be created by the user.

In order to enable full accountability, the first step is to consider that all information is equivalent. Data, metadata, a field name, an XML generic identifier, a character, a byte, etc., should all be treated as information units. Each of them is related to at least one other. Saying that "New York is a city" is not of a different nature, from a semantic point of view, than saying that "New York is in the United States". However, the notion of "city", when considered a type, is privileged over the notion of geographic containment ("in United States"). This vision of information aggregated according to types enables the grouping of information and is based on a distinction between a type and an instance. "City" is considered metadata whereas "New York" is considered data. In the other phrase, the relation is considered as a semantic relation between two instances (New York as a city, United States as a country). But it is possible to consider New York as being part of "all things in United States", and to retrieve all of them the same way we list all cities. Should the United States therefore be considered a type to which New York belongs? Not necessarily. This example shows that the traditional distinction between data and metadata is somewhat artificial, and only applies in a context where a "schema" containing predefined types is rigidly defined. Many relations between information items can't be described using this simple relationship. The difference between data and metadata doesn't hold when trying to analyze what information is at a deeper level, unless we consider it as just one perspective, leaving the door open for considering other perspectives.

There can be any kind of relation between two pieces of information. For example, the fact that the string "New York" starts with "the letter N" is a relation between two pieces of information. The list of strings starting with "N" is typically what gets collected in a dictionary. Therefore, "New York" is to be found in the ledger for "Strings starting with N". This relation is useful, although it's so "obvious" that it usually doesn't need to be expressed explicitly. It is taken for granted by those of us who use an alphabetic character system. It is generally not considered to be of high value, except when collecting all names starting with the same letter in a list, such as in a dictionary listing terms in alphabetic order. However, the situation experienced by data experts can quickly lead to more complex questions. Should names starting with Ñ be listed on the same page as names starting with N? What is the proper alphabetic order? Should a search for words starting with N also return names starting with Ñ? Etc.
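The question about N and Ñ can be made concrete with Python's standard unicodedata module. The sketch below contrasts a strict reading of "starts with N" with one that ignores accents; the list of names and the base_letter helper are hypothetical. It does not settle the proper alphabetic order, which depends on the language (in Spanish, Ñ is a letter of its own).

```python
import unicodedata

def base_letter(name):
    """Return the first character with any combining accent removed."""
    decomposed = unicodedata.normalize("NFD", name)
    return decomposed[0].upper()

# "\u00d1u\u00f1oa" is the name Ñuñoa, written with escapes so that the
# precomposed character U+00D1 is used regardless of file encoding.
names = ["New York", "\u00d1u\u00f1oa", "Nantes", "Oslo"]

# Strict ledger: only names whose first character is exactly "N".
strict = [n for n in names if n.startswith("N")]

# Accent-insensitive ledger: Ñ is treated as N plus a combining tilde.
folded = [n for n in names if base_letter(n) == "N"]

print(strict)
print(folded)
```

The two ledgers differ, which shows that even the "obvious" relation "starts with N" depends on a process (here, a normalization choice) that has to be recorded if the result is to be accountable.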

Naming as a process

Furthermore, it is interesting to introduce a distinction between things and the names by which they are designated. That distinction corresponds to the difference defined in linguistics between "signified" and "signifier". New York may well be the name of a city, but it is also the name of a state and the name of a county. It is not the only name for the city, also referred to as "New York City", "Big Apple", etc., and it is not the only name for the state, also referred to as "New York State", "NY", or "Empire State". New York County occupies the same space as the borough of Manhattan. This is why people from Queens going to Manhattan sometimes say they are going to New York. Generally speaking, an information unit cannot be reduced to its name, even if that name is unique.

An information unit is an unnamed object, which has a mental representation, and to which names can be assigned. The name itself is an information unit, related to the information unit that it describes. And the relation is itself an information unit. For example, the fact that the most populous city on the East Coast of the United States is called "New York" has a historical background (1667). When the name of that city changed from New Amsterdam to New York, it was still the same city.

Consider now the sentence: "Nueva York is the Spanish name for New York". This proposition is in fact misleading. It would be more accurate to say that Nueva York is one possible name for this thing that some call New York. This name happens to be in Spanish. But Spanish is the English way to designate the language designated by its own speakers as Español. In other words, Spanish is English, meaning that "Spanish" is an English word. Worse, even "English" is ambiguous, since it has multiple variants, with variant spellings. It is not necessary to say that "Spanish" is a "British English" word as well as an "American English" word, but if instead we are talking about "organise" or "organize", it suddenly makes sense to do so.

Most computer software, for example relational database systems, assigns unique identifiers to each object. But a unique identifier is also the result of an assignment operation, which varies according to implementations. In some systems, if an object is deleted and its unique identifier becomes available, the same identifier might be reused for a new object. Uniqueness is therefore only valid at a given time, and it would not work reliably for forensic analysis of events that happened in the past. Therefore, the process by which an object gets uniquely identified must itself be taken into account for a system to be fully accountable.
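One way to take the identification process into account is to record every assignment and release of an identifier, so that an identifier can be resolved as of a given time. The following Python sketch is only an assumption about how this could be done; the log layout, the integer timestamps and the resolve function are hypothetical.

```python
# Append-only log of identifier events: (time, identifier, event, object).
# Times are plain integers to keep the sketch simple.
id_log = [
    (1, 42, "assigned", "invoice A"),
    (5, 42, "released", "invoice A"),
    (9, 42, "assigned", "invoice B"),
]

def resolve(identifier, at_time, log):
    """Return the object that held `identifier` at `at_time`, or None."""
    holder = None
    for time, ident, event, obj in sorted(log):
        if ident != identifier or time > at_time:
            continue
        # The most recent event up to `at_time` decides who holds the identifier.
        holder = obj if event == "assigned" else None
    return holder

print(resolve(42, 3, id_log))
print(resolve(42, 7, id_log))
print(resolve(42, 10, id_log))
```

With such a log, the same identifier resolves to different objects, or to none, depending on the moment being examined, which is what a forensic analysis of past events needs.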

The Perspector Notation

The Data Projection Model relies on describing information as sets of transactions between two information items. The nature of the transaction is indicated by the operator, which names the process or the relationship between them. We propose to call each combination of "operand-operator-operand" a perspector and to use the Dirac bra-ket notation from quantum physics to represent it: < operand | operator | operand >. Every perspector is unique. For example, the mathematical expression 2 + 3 can be written as the perspector < 2 | + | 3 >. Perspectors can be nested, i.e. one perspector can be used as an operand in another perspector, as in < < 2 | + | 3 > | = | 5 >. Other examples of perspectors include < New York | is a name for | _New_York > and < Nueva York | is a name for | _New_York > (where, by using _New_York, we have used underscores to distinguish the representation of the city itself from one of its names). From this flattened graph representation, any number of queries can be issued to filter the information which is meaningful in a specific auditing context.
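The perspector notation can be mirrored by a small data structure. The Python sketch below is one possible representation, not the implementation used by the Data Projection Model; the Perspector class and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Perspector:
    """One operand-operator-operand combination; operands may be perspectors."""
    left: object
    operator: str
    right: object

    def __str__(self):
        return f"<{self.left}|{self.operator}|{self.right}>"

# <2|+|3> nested inside < <2|+|3> | = | 5 >
addition = Perspector(2, "+", 3)
equality = Perspector(addition, "=", 5)

# A tiny flattened graph: two names for the same city.
graph = [
    Perspector("New York", "is a name for", "_New_York"),
    Perspector("Nueva York", "is a name for", "_New_York"),
    equality,
]

# A query that filters the graph: all names recorded for the city.
names = [p.left for p in graph
         if p.operator == "is a name for" and p.right == "_New_York"]

print(equality)
print(names)
```

Making the class frozen means two perspectors with the same operands and operator compare equal and can be stored in a set, which is one simple way to keep each perspector unique.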

When information items get decomposed into more elementary components, for example to get from an information unit to the string of characters representing it, until it gets resolved into a sequence of bits (0s and 1s), the amount of overhead created can be enormous, way beyond what current computer systems are able to handle. In most situations, it is unnecessary to account for all those steps, but if we are looking for accountability, we may need to open some of those layers.

As computer technology evolves, new possibilities are appearing on the horizon with quantum computing, which would multiply the capabilities offered by current systems. The increasing need for improved accountability can also act as an incentive to accelerate the development of more powerful computers.

In the meantime, it is possible to filter the data so that we can see more of it, within the limits of what current computer systems allow. For example, we could be interested in looking at all the names used for the city of New York, while ignoring all the processes used to encode each of the names into given character sets.

Early implementation

Accountability of information is at an early stage, and will require a significant increase in computing power to be fully operational. But even now, we can start experimenting with it on small subsets, to become familiar with the way the concepts work, and to prepare for the future.

An early version of the Data Projection Model has been implemented as part of our ongoing project of publishing Taxmap for the IRS (https://taxmap.ntis.gov), a portion of the web site originally designed to help the call centers' assistors quickly find information needed by taxpayers according to topics. Because the relations between topics resulted from applying automated rules, as well as a human-curated knowledge base, it was difficult to figure out where the relations came from. Our first implementation of the Data Projection Model enabled us to answer questions about the provenance of those relationships.

References

Michel Biezunski, "The Data Projection Model: Towards Auditable Information Systems through Unified Declarations of Operations on Data" (https://infoloom.com/media/documents/dataprojection.pdf)


  1. Peter B. Nichol, Blockchain applications for healthcare, CIO, March 17, 2016. 

  2. See What about?