Using joins
In the previous section, we introduced the mathematical notation for some of the basic operations of relational algebra: Union (∪), Project (π), Cross product (×), Rename (ρ), and Restrict(σ).
In this section, we will go through a few slightly more advanced uses of relational algebra to get a feel for how it works. Rather than introducing more mathematical notation, we will start using Bmg for our examples.
Preliminaries
In what follows, we will work out some queries for the following three relations:
Actor
name : String | birth_year : Integer |
---|---|
Jennifer Jason Leigh | 1962 |
Peter Sellers | 1925 |
Robin Williams | 1949 |
Shelley Duval | 1949 |
Role
actor_name : String | name : String | film_name : String |
---|---|---|
Jennifer Jason Leigh | Lois Kaiser | Short Cuts |
Jennifer Jason Leigh | Pauline Avery | In the Cut |
Peter Sellers | Dr. Strangelove | Dr. Strangelove |
Peter Sellers | Inspector Jacques Clouseau | The Pink Panther |
Robin Williams | Henry Sagan | The Fisher King |
Robin Williams | Popeye | Popeye |
Shelley Duval | Olive Oyl | Popeye |
Shelley Duval | Wendy Torrance | The Shining |
Film
name : String | production_year : Integer | director : String |
---|---|---|
Dr. Strangelove | 1964 | Stanley Kubrik |
Fast Times at Ridgemont High | 1982 | Amy Heckerling |
In the Cut | 2003 | Jane Campion |
Popeye | 1980 | Robert Altman |
Short Cuts | 1993 | Robert Altman |
The Fisher King | 1991 | Terry Gilliam |
The Pink Panther | 1963 | Blake Edwards |
The Shining | 1980 | Stanley Kubrik |
In the tables below, we won’t repeat the domain of the attributes, ie String
/Integer
.
In Bmg, we can create these relations directly:
We can also work with relations based on JSON, CSV, a Redis database, or a SQL database. But here we will stick to plain relations.
Birth years of actors who worked with Stanley Kubrik?
Let’s start with a simple (though contrived) task: find the birth years of actors who appeared in films directed by Stanley Kubrik. To do this, we first filter the Film
relation to obtain a relation with only those rows where director = "Stanley Kubrik"
:
name | production_year | director |
---|---|---|
Dr. Strangelove | 1964 | Stanley Kubrik |
The Shining | 1980 | Stanley Kubrik |
Join (⋈)
Next, we pick the roles associated with these films. To do that, we use another central operator: Join (⋈). If you have any experience at all working with SQL or relational databases, you will be familiar with joins, but it might be worth bracketing that experience, because learning about Join from the ground up could give you another perspective.
Join glues together tuples from two relations, just like Cross product, but also applies a condition on which tuples to include in the new relation, by comparing, for each row, attributes from the first relation with attributes from the second one.
In this case, we want rows where the name attribute from sk_films
equals the film_name attribute from roles
:
film_name | production_year | director | actor_name | name |
---|---|---|---|---|
Dr. Strangelove | 1964 | Stanley Kubrik | Peter Sellers | Dr. Strangelove |
The Shining | 1980 | Stanley Kubrik | Shelley Duval | Wendy Torrance |
The new heading contains all the attributes from both relations, except that name from films
and film_name from role
have been merged. That makes sense, because what we have done is precisely to pick out those tuples where one equals the other. To include both would be redundant.
The only attribute we need from this relation is actor_name
so let’s next perform a project
:
actor_name |
---|
Dr. Strangelove |
Wendy Torrance |
And finally, we perform another Join to get the birth years:
actor_name | birth_year |
---|---|
Peter Sellers | 1925 |
Shelley Duval | 1949 |
Join = Cross product + Restrict
Let’s get even more clear about exactly how Join works. As we have seen, Join combines tuples from two relations, let’s call them R and S, to form a new relation T, but only includes those pairs of tuples where the values of one or more attributes match up. In the simplest case (“natural join”), any tuples that have the same name are compared, and if their values are the same in both tuples, they are included in the output.
Consider these two relations:
X1 | Y1 | Z |
---|---|---|
1 | 2 | 3 |
5 | 6 | 7 |
X2 | Y2 | Z |
---|---|---|
11 | 12 | 7 |
15 | 16 | 13 |
A Natural join compares any attributes with the same name for equality. In this case, it produces:
X1 | Y1 | Z | X2 | Y2 |
---|---|---|---|---|
5 | 6 | 7 | 11 | 12 |
because exactly one pairing of tuples was found to have the same value in an attribute (Z
) with the same name.
To gain a deeper understanding of the underlying algebra, it helps to understand how joins can be built up from simpler operations (although that is not something you would normally ever do).
In essence, the idea is: Cross product + Restrict — first do a Cross product to get all the combinations of tuples, then a Restrict to pick out the combinations that have matching attributes. However, since Cross product expects unique attributes, we first have to rename overlapping attributes. And as a last step, we remove the extra attributes created by the rename. Let’s go through this step by step:
1. Rename attributes
We let the first relation be intact, and change Z
in the second one to Z2
, so it now has the heading:
X2 | Y2 | Z2 |
---|
2. Get the cross product
X1 | Y1 | Z | X2 | Y2 | Z2 |
---|---|---|---|---|---|
1 | 2 | 3 | 11 | 12 | 7 |
1 | 2 | 3 | 15 | 16 | 13 |
5 | 6 | 7 | 11 | 12 | 7 |
5 | 6 | 7 | 15 | 16 | 13 |
3. Restrict to keep only rows where Z = Z2
X1 | Y1 | Z | X2 | Y2 | Z2 |
---|---|---|---|---|---|
5 | 6 | 7 | 11 | 12 | 7 |
4. Project, to keep all attributes except Z2
X1 | Y1 | Z | X2 | Y2 |
---|---|---|---|---|
5 | 6 | 7 | 11 | 12 |
And this is the exact same result as we would get with a Join.
In the case of our films and roles, the situation is slightly different: the attibute they share, name
, is not the one we want to join on, because it means different things in the two relations. Instead, we want to match film_name
from role
with name
from films
. The solution follows the same steps. We rewrite:
as:
Again, the point here is to demystify the Join operation and get a clear sense of the algebraic way of combining operations. Just as Join can be defined by simpler operations, we can use relations algebra to create our own composable building blocks for data querying and transformation.
Actors who appeared in the same film
For our next exercise, let’s ask: Which pairs of actors appeared in the same film? Taking a quick look at the Role relation at the beginning of this section, we immediately see that there is only one film that appears twice (namely Popeye), so we expect only one pair of actors.
To get our answer, we need to join the Role relation with itself on the film_name
attribute. To do that, we first create a new relation that is exactly the same as Role but with actor_name
renamed to actor_name2
, because we need both actors’ names in our output relation. The last step is to project only actor_name
and actor_name2
.
actor_name | actor_name2 |
---|---|
Jennifer Jason Leigh | Jennifer Jason Leigh |
Peter Sellers | Peter Sellers |
Robin Williams | Robin Williams |
Shelley Duval | Robin Williams |
Robin Williams | Shelley Duval |
Shelley Duval | Shelley Duval |
Hmm, that’s not right. We are getting “pairs” where both actors are the same one. After all, every actor appeared in the same film as themselves (at least from the viewpoint of relational algebra). We need to further restrict this relation to only include tuples where actor_name
and actor_name2
are not the same.
actor_name | actor_name2 |
---|---|
Shelley Duval | Robin Williams |
Robin Williams | Shelley Duval |
Still not right. We’re getting the same pair twice, in opposite order. To make sure it’s only included once, we change the condition that they are different to the stronger condition that the first one is smaller than the second. This works since there is a strict ordering between strings.
actor_name | actor_name2 |
---|---|
Robin Williams | Shelley Duval |
Who knows who?
For our last example in this section, we will consider the following question: Which actors have worked with the same director (even if not in the same films)? The answer will take the form of a table like this:
actor1 | director | actor2 |
---|---|---|
Robin Williams | Robert Altman | Jennifer Jason Leigh |
Shelley Duval | Robert Altman | Jennifer Jason Leigh |
… | … | … |
To get started, we’ll create a relation containing only actors:
actor1 |
---|
Jennifer Jason Leigh |
Peter Sellers |
Robin Williams |
Shelley Duval |
(This might seem like an unnecessary and inefficient step, but it will make things more streamlined later on.)
Next, we want to pair each actor with every director they’ve worked with (according to our small database). To do that, we join the actors
relation with roles
, to get film names, and then do another join with the films
relation to get the directors’ names, then keep only the actor and director attributes:
director | actor1 |
---|---|
Robert Altman | Jennifer Jason Leigh |
Jane Campion | Jennifer Jason Leigh |
… | … |
Finally, we use this relation to find other actors who worked with the same director. Again, we must use the roles
and films
relations to link directors and actors, but this time we do the traversal in the other direction, starting with films
. Like with our previous example, we make a restriction that the two actors are not the same, and cannot occur together twice (in reverse order), by using the less than predicate:
actor2 | director | actor1 |
---|---|---|
Robin Williams | Robert Altman | Jennifer Jason Leigh |
Shelley Duval | Robert Altman | Jennifer Jason Leigh |
Shelley Duval | Stanley Kubrik | Peter Sellers |
Shelley Duval | Robert Altman | Robin Williams |