The Algebra of Data

This book introduces an algebraic model of data. If you are familiar with relational algebra, then you can think of data algebra as a generalization. If you are not familiar with relational algebra, then you have a lot less to un-learn.

In the book, the authors regard the decision to force all data into tables as a historical mistake that we are still living with today. One symptom of forcing data into tables is the null value. As an example, imagine we have a table people that has columns name(string), age(int), and prisonReleaseDate(date). Then we populate it with the following data:

| name            | age | prisonReleaseDate |
|-----------------|-----|-------------------|
| OJ Simpson      |  70 |        2017-10-01 |
| Chelsea Manning |  29 |        2017-05-17 |
| Sandra Bullock  |  53 |              null |

Since Sandra Bullock has never been to prison, there is no value for prisonReleaseDate. If this data were stored in a document store like Riak or MongoDB or whatever the cool HN kids are using these days, it would look like this:

[
  { name: "OJ Simpson",      age: 70, prisonReleaseDate: "2017-10-01" },
  { name: "Chelsea Manning", age: 29, prisonReleaseDate: "2017-05-17" },
  { name: "Sandra Bullock",  age: 53 }
]

JavaScript and JSON have distinct null and undefined values, and the idea of undefined is more appropriate for Sandra Bullock’s prisonReleaseDate in this example. This approach is how many NoSQL datastores handle non-tabular data.

The idea that forcing all data into tables is a mistake is supported by the rise in popularity of NoSQL data stores over the last decade.

However, modern database humor acknowledges the problems with these non-mathematical NoSQL datastores; just observe this gem from @jrecursive:

(image: @jrecursive's tweet on fault-tolerance)

Key-value stores don’t usually have a good way to query data, or at least nothing close to the power of SQL on relational databases. That power comes from the relational algebra. One attempt at solving this for key-value stores is the MapReduce framework that Google released in 2004.

I suspect that the reason MapReduce had to be created was that there was no algebraic model of data that was general enough for the very large, non-tabular data Google has amassed.

Enough about the motivation; if you are with me at this point, you are probably ready to see what data algebra is and what it can do. Data algebra is defined in terms of finite sets. If you want to get rigorous, it works in Zermelo-Fraenkel set theory without the axiom of infinity or the axiom of choice.

Introduction to data algebra

The universe of basic data “atoms” is called the “genesis set”, and is denoted G. It can be any finite set; pick one.

For example, G = {0, 1, a, b}

The Cartesian product of G with itself is the set of all ordered pairs:

GxG = {(0, 0), (0, 1), (0, a), (0, b), 
       (1, 0), (1, 1), (1, a), (1, b), 
       (a, 0), (a, 1), (a, a), (a, b),
       (b, 0), (b, 1), (b, a), (b, b)}
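To make this concrete, here is a small Python sketch (my own illustration, not notation from the book) that builds the same Cartesian product:

from itertools import product

# A genesis set: any finite set of atoms will do.
G = {"0", "1", "a", "b"}

# The Cartesian product G x G: the set of all ordered pairs over G.
GxG = set(product(G, G))

print(len(GxG))  # 16 pairs for a 4-element genesis set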

The basic unit of data is a couplet, which is another name for an ordered pair. In the book, the authors use a superscript notation, which I will write here as x^y, to represent the ordered pair (x, y).

Couplets

As I mentioned above, couplets are the basic unit of data. We can think of key-value stores as sets of couplets of the form value^key. If our values have types, we can think of the typed value as value^type, so that the typed key-value is (value^type)^key.
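For example, a key-value entry like age → 70 becomes the couplet 70^age. Continuing the Python sketch from above (with couplets as plain (yin, yang) tuples, a representation of my own choosing), the typed version is just a nested pair:

# A key-value entry age -> 70 as a couplet (value, key).
entry = (70, "age")

# With a type tag, the typed value is itself a couplet,
# so the whole thing nests as ((value, type), key).
typed_entry = ((70, "int"), "age")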

If we take the 4-element genesis set G, we can write out all the couplets as:

Couplets(G) = { 0^0, 0^1, 0^a, 0^b,
                1^0, 1^1, 1^a, 1^b,
                a^0, a^1, a^a, a^b,
                b^0, b^1, b^a, b^b }

Each couplet is a datum, and any set of couplets is data. The authors also establish a convention: the first element of a couplet is the “yin”, and the second is the “yang”, so that

yin(a^b) = a
yang(a^b) = b
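In the Python sketch, a couplet is a 2-tuple and yin and yang are its two projections:

# A couplet is an ordered pair; yin and yang project out its parts.
def yin(c):
    return c[0]

def yang(c):
    return c[1]

ab = ("a", "b")   # the couplet a^b
assert yin(ab) == "a"
assert yang(ab) == "b"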

Composition

There is a partial operation called composition ∘ which is defined like this:

a^b ∘ b^c = a^c

The operation is only defined for pairs of couplets where the yang of the first is equal to the yin of the second. This is where we transition “up the ladder” into the realm of relations.
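Continuing the sketch (reusing yin and yang from above), one way to model the partial operation in Python is to return None when the composition is undefined:

# Composition of couplets: defined only when the yang of the first
# equals the yin of the second; None marks the undefined case.
def compose_couplets(c, d):
    if yang(c) == yin(d):
        return (yin(c), yang(d))
    return None

assert compose_couplets(("a", "b"), ("b", "c")) == ("a", "c")
assert compose_couplets(("a", "b"), ("c", "d")) is None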

Relations

A relation is a set of couplets, so the set of all relations for the genesis set G is just the power set of GxG:

Relations(G) = P(GxG)

It just so happens that “relations” from relational algebra are also relations in data algebra, but not vice versa. For an example, let’s go back to our people table: each row can be written as a relation in data algebra, and the whole table P as a set of those relations:

P = {{OJ Simpson^name, 70^age, 2017-10-01^prisonReleaseDate},
     {Chelsea Manning^name, 29^age, 2017-05-17^prisonReleaseDate},
     {Sandra Bullock^name, 53^age}}
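In the Python sketch, each row becomes a frozenset of (value, key) couplets, and P is the set of those rows (frozensets so that relations can themselves be members of a set):

P = {
    frozenset({("OJ Simpson", "name"), (70, "age"),
               ("2017-10-01", "prisonReleaseDate")}),
    frozenset({("Chelsea Manning", "name"), (29, "age"),
               ("2017-05-17", "prisonReleaseDate")}),
    frozenset({("Sandra Bullock", "name"), (53, "age")}),
}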

We can define a composition operation on Relations(G) which will be defined for every pair of relations:

A ∘ B = {a1^a2 ∘ b1^b2 : a1^a2 ∈ A and b1^b2 ∈ B}

The composition of two relations is the set of all compositions of the couplets inside them. If the couplet composition is undefined for every pair, the result is the empty set. This pattern of lifting a partial operation to the power set to get a total operation occurs many times in the book.
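Here is the lifted operation in the Python sketch (reusing compose_couplets from above); undefined couplet compositions are simply skipped, which is what makes the lifted operation total:

def compose_relations(A, B):
    # Compose every couplet of A with every couplet of B,
    # keeping only the defined results.
    out = set()
    for a in A:
        for b in B:
            c = compose_couplets(a, b)
            if c is not None:
                out.add(c)
    return frozenset(out)

# Only the pairs whose yangs and yins line up survive.
print(compose_relations({("x", "k"), ("y", "j")}, {("k", "z")}))
# frozenset({('x', 'z')})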

Moving up the ladder again, we call any set of relations a “clan”.

Clans

A clan is a set of relations, so the algebra is Clans(G) = P(P(GxG)) (the power set of the power set of the set of couplets). We can then lift the composition operation up to the level of clans in the same way we lifted it to the level of relations.
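In the Python sketch the lift looks the same one level up, reusing compose_relations from the previous snippet:

def compose_clans(C, D):
    # Compose every relation in C with every relation in D.
    return {compose_relations(X, Y) for X in C for Y in D}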

Cool, we have a binary operator on the algebra Clans(G); what can we do with it?

Select queries

If we wanted to query the age column from the people table in SQL, we could write the following query:

SELECT age FROM people;

(results)

| age |
|-----|
|  70 |
|  29 |
|  53 |

If we wanted to perform this query in data algebra, we could use the clan of relations P (defined above), and then compose it with the singleton clan {{age^age}}. The equivalent to the SELECT age statement above is:

P ∘ {{age^age}} =

{{OJ Simpson^name, 70^age, 2017-10-01^prisonReleaseDate} ∘ {age^age},
 {Chelsea Manning^name, 29^age, 2017-05-17^prisonReleaseDate} ∘ {age^age},
 {Sandra Bullock^name, 53^age} ∘ {age^age}} =

{{70^age ∘ age^age},
 {29^age ∘ age^age},
 {53^age ∘ age^age}} =

{{70^age},
 {29^age},
 {53^age}}
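Running the same query in the Python sketch (with P, compose_relations, and compose_clans defined above) reproduces this result:

# SELECT age FROM people, as composition with the singleton clan {{age^age}}.
selector = {frozenset({("age", "age")})}

print(compose_clans(P, selector))
# three singleton relations: {70^age}, {29^age}, {53^age}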

The book goes into detail on how to express all the types of JOIN statements that are possible in SQL, and also how to represent graphs in data algebra. The algebra is general enough to cover all kinds of data. I highly recommend reading this book; you can sign up for a free copy here

The promise of data algebra

Data algebra provides a single language to represent and operate on all data, whether it’s relational and fits in a table, a graph, a nested document, or an unstructured string of bytes. This uniform language allows for optimizations such as caching intermediate values. For example, a complex query has subexpressions, and those subexpressions are likely to occur as subexpressions of other queries, so a data algebra-aware engine can cache the results of those subexpressions.

For example, suppose the P clan expanded to include a new person; then P ∘ {{age^age}} can use the cached original value and just evaluate the composition on the new relation. It also allows sharing across vastly different data structures. If some other clan represents a graph, it might still be able to make use of intermediate values involved in some query of the P clan.
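As a toy illustration of that point (my own sketch, not machinery from the book): because the lifted composition works relation by relation, extending P only requires composing the new row and unioning it with the cached answer.

# Cached result for the original P.
cached = compose_clans(P, selector)

# A hypothetical new person arrives; only the new row needs composing.
new_row = frozenset({("New Person", "name"), (41, "age")})
updated = cached | compose_clans({new_row}, selector)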

Another benefit is the certainty of mathematical proofs. Getting guarantees on software and data systems is rare. By using data algebra as the foundation of a data system, and framing all the operations on the data in terms of those data algebra operators, you benefit from all the theorems that mathematicians can prove about data algebra, and about sets in general.

Aside from the benefits of mathematical rigor and query optimization, data algebra can be used to design a universal data interchange format. Right now we have XML, JSON, Protobufs, SGML, and so on; a single format grounded in data algebra could describe the data in all of them.

The Future

The authors hinted at working on a more in-depth book; more theorems about clans and the operations on them would help drive adoption of this branch of mathematics. Currently I don’t see anyone using it outside of the Algebraix Data Corporation (the company started by the authors).

Despite its relative obscurity, I suspect this will creep into the mainstream, and in 20 years we will be using this as a foundation for dealing with the unfathomably large ocean of data we will have produced by then.