One Size Fits All

Well Organized Data

Computers are powerful tools for retrieving information, provided the source data are properly organized. Databases store information in records, which are divided into fields. Queries on those fields yield results that would be unreachable by other means. For example, a customer database is organized into one record per customer, with fields that hold individual pieces of information such as first name, last name, street address, city, and zip code. The ability to organize information into databases was the first major incentive to use computers, back when they were mainframes at the corporate level. Individual applications appeared later, such as word processing and spreadsheets for professional use, along with access to the many media that are broadcast over the Internet.
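The record-and-field idea can be sketched in a few lines of Python using the standard sqlite3 module. This is only an illustration: the table name, the column names, and the sample customers are invented for the example, not taken from any real system.

```python
import sqlite3


def build_customer_db():
    """Create an in-memory database with one record per customer."""
    conn = sqlite3.connect(":memory:")
    # Each column is a field; each inserted row is a record.
    conn.execute(
        "CREATE TABLE customers ("
        "first_name TEXT, last_name TEXT, "
        "street_address TEXT, city TEXT, zip_code TEXT)"
    )
    # Made-up sample records, for illustration only.
    rows = [
        ("Ada", "Lovelace", "12 Example Street", "Springfield", "12345"),
        ("Alan", "Turing", "34 Sample Avenue", "Springfield", "12345"),
        ("Grace", "Hopper", "56 Demo Road", "Riverton", "67890"),
    ]
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def customers_in_zip(conn, zip_code):
    """Query one field (zip_code) and return the matching names."""
    cursor = conn.execute(
        "SELECT first_name, last_name FROM customers "
        "WHERE zip_code = ? ORDER BY last_name",
        (zip_code,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    db = build_customer_db()
    print(customers_in_zip(db, "12345"))
```

The query works only because every record stores the zip code in the same field. If the zip code were buried in free-form address text, the same question would be much harder to answer.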

A customer database is relatively easy to organize, although some complexity may arise when customers are located in several countries that write addresses differently. This is a relatively minor issue and is easily solved. Family trees are more challenging to organize, because not all families follow the standard pattern of what a traditional Western family is expected to look like. Library catalogs are databases built on agreements about the categories into which their contents should be divided. But there are several competing standards for doing this, and they are not necessarily stable. Big companies and professional sectors create taxonomies that describe their units of knowledge so that they can be reused in multiple situations. This doesn't prevent any organization, or part of an organization, from designing its own way of classifying knowledge.

Information that is processed by computers is subject to an extra layer of rules. To be processable, information is organized into ontologies, which are more restrictive, because the data must be present in a form to which the rules can be applied. Complex automated systems can be built from them, but they allow no variation in the way data is stored, because that would break the rules applied to it.
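The following Python sketch gives a rough sense of how little variation such rules tolerate. It is a deliberately simplified stand-in, not a real ontology language: the REQUIRED_FIELDS table and the violations function are hypothetical names made up for this example.

```python
# Hypothetical rule set: every record must have exactly these fields,
# each holding a value of the stated type.
REQUIRED_FIELDS = {"first_name": str, "last_name": str, "zip_code": str}


def violations(record):
    """Return a list of the ways a record breaks the rules (empty if none)."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for {name}")
    # Fields the rules do not know about are rejected as well.
    for name in record:
        if name not in REQUIRED_FIELDS:
            problems.append(f"unexpected field: {name}")
    return problems


if __name__ == "__main__":
    conforming = {"first_name": "Ada", "last_name": "Lovelace", "zip_code": "12345"}
    variant = {"first_name": "Ada", "surname": "Lovelace", "zip_code": 12345}
    print(violations(conforming))
    print(violations(variant))
```

A human reader would treat the second record as equivalent to the first. The rules reject it because a field has a different name and the zip code is stored as a number rather than as text.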

Artificial intelligence and machine learning are used to go beyond these limits. Such systems are equipped with powerful algorithms that can analyze data even when it is not organized to start with, and extract from it data that can be processed. Machine learning is dynamic, in the sense that the more it is run, the more it learns about new patterns that can exist, and it can be refined further until the processing becomes acceptable.

What these techniques have in common is that they use either schemas or rules that have been defined in order to process information. These rules are not always explicit and are not always decipherable, sometimes by design. For example, the inner workings of the search engines run by companies such as Google are proprietary secrets, and they can't be inspected to understand what kinds of results they provide and, potentially, what kinds of results they exclude or simply miss.

Data Cleaning

To be accepted by an automated computer system, data needs to be cleaned. The process of cleaning data can be straightforward, but it can also be tedious and extremely difficult. When it can't be done by algorithms, the work is done by an army of volunteers or low-paid workers, who spend their time making sure that every piece of data they look at conforms to the mandatory schema or pattern.
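The straightforward end of that spectrum might look like the Python sketch below. It rests on assumptions made only for the example: a hypothetical clean_record function, three fields, and a mandatory five-digit zip code. Records the algorithm cannot repair come back as None, which stands for the cases that would be handed to a person.

```python
import re


def clean_record(raw):
    """Normalize a raw record, or return None if it cannot be cleaned."""
    # Collapse stray whitespace and normalize capitalization of names.
    first = " ".join(raw.get("first_name", "").split()).title()
    last = " ".join(raw.get("last_name", "").split()).title()
    zip_code = raw.get("zip_code", "").strip()

    # Assumed rule for this example: names are mandatory and the
    # zip code must be exactly five digits.
    if not first or not last or not re.fullmatch(r"\d{5}", zip_code):
        return None
    return {"first_name": first, "last_name": last, "zip_code": zip_code}


if __name__ == "__main__":
    print(clean_record({"first_name": "  ada ", "last_name": "LOVELACE", "zip_code": " 12345 "}))
    print(clean_record({"first_name": "Ada", "last_name": "", "zip_code": "12"}))
```

The first record is fixed automatically. The second has no last name and a truncated zip code, and no simple rule can recover what is missing.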
