Using joins

In the previous section, we introduced the mathematical notation for some of the basic operations of relational algebra: Union (∪), Project (π), Cross product (×), Rename (ρ), and Restrict(σ).

In this section, we will go through a few slightly more advanced uses of relational algebra to get a feel for how it works. Rather than introducing more mathematical notation, we will start using Bmg for our examples.

Preliminaries

In what follows, we will work out some queries for the following three relations:

Actor

name : String	birth_year : Integer
Jennifer Jason Leigh	1962
Peter Sellers	1925
Robin Williams	1949
Shelley Duval	1949

Role

actor_name : String	name : String	film_name : String
Jennifer Jason Leigh	Lois Kaiser	Short Cuts
Jennifer Jason Leigh	Pauline Avery	In the Cut
Peter Sellers	Dr. Strangelove	Dr. Strangelove
Peter Sellers	Inspector Jacques Clouseau	The Pink Panther
Robin Williams	Henry Sagan	The Fisher King
Robin Williams	Popeye	Popeye
Shelley Duval	Olive Oyl	Popeye
Shelley Duval	Wendy Torrance	The Shining

Film

name : String	production_year : Integer	director : String
Dr. Strangelove	1964	Stanley Kubrik
Fast Times at Ridgemont High	1982	Amy Heckerling
In the Cut	2003	Jane Campion
Popeye	1980	Robert Altman
Short Cuts	1993	Robert Altman
The Fisher King	1991	Terry Gilliam
The Pink Panther	1963	Blake Edwards
The Shining	1980	Stanley Kubrik

In the tables below, we won’t repeat the domain of the attributes, ie String/Integer.

In Bmg, we can create these relations directly:

actors = Bmg::Relation.new([
  { name: "Jennifer Jason Leigh", birth_year: 1962 },
  { name: "Peter Sellers", birth_year: 1925 },
  # ...
])

We can also work with relations based on JSON, CSV, a Redis database, or a SQL database. But here we will stick to plain relations.

Birth years of actors who worked with Stanley Kubrik?

Let’s start with a simple (though contrived) task: find the birth years of actors who appeared in films directed by Stanley Kubrik. To do this, we first filter the Film relation to obtain a relation with only those rows where director = "Stanley Kubrik":

sk_films = films.restrict(director: "Stanley Kubrik")

name	production_year	director
Dr. Strangelove	1964	Stanley Kubrik
The Shining	1980	Stanley Kubrik

Join (⋈)

Next, we pick the roles associated with these films. To do that, we use another central operator: Join (⋈). If you have any experience at all working with SQL or relational databases, you will be familiar with joins, but it might be worth bracketing that experience, because learning about Join from the ground up could give you another perspective.

Join glues together tuples from two relations, just like Cross product, but also applies a condition on which tuples to include in the new relation, by comparing, for each row, attributes from the first relation with attributes from the second one.

In this case, we want rows where the name attribute from sk_films equals the film_name attribute from roles:

sk_roles = roles.join(sk_films, :film_name => :name)

film_name	production_year	director	actor_name	name
Dr. Strangelove	1964	Stanley Kubrik	Peter Sellers	Dr. Strangelove
The Shining	1980	Stanley Kubrik	Shelley Duval	Wendy Torrance

The new heading contains all the attributes from both relations, except that name from films and film_name from role have been merged. That makes sense, because what we have done is precisely to pick out those tuples where one equals the other. To include both would be redundant.

The only attribute we need from this relation is actor_name so let’s next perform a project:

sk_roles2 = sk_roles.project([:actor_name])

actor_name
Dr. Strangelove
Wendy Torrance

And finally, we perform another Join to get the birth years:

sk_roles2.join(actors, :actor_name => :name)

actor_name	birth_year
Peter Sellers	1925
Shelley Duval	1949

Join = Cross product + Restrict

Let’s get even more clear about exactly how Join works. As we have seen, Join combines tuples from two relations, let’s call them R and S, to form a new relation T, but only includes those pairs of tuples where the values of one or more attributes match up. In the simplest case (“natural join”), any tuples that have the same name are compared, and if their values are the same in both tuples, they are included in the output.

Consider these two relations:

X1	Y1	Z
1	2	3
5	6	7

X2	Y2	Z
11	12	7
15	16	13

A Natural join compares any attributes with the same name for equality. In this case, it produces:

X1	Y1	Z	X2	Y2
5	6	7	11	12

because exactly one pairing of tuples was found to have the same value in an attribute (Z) with the same name.

To gain a deeper understanding of the underlying algebra, it helps to understand how joins can be built up from simpler operations (although that is not something you would normally ever do).

In essence, the idea is: Cross product + Restrict — first do a Cross product to get all the combinations of tuples, then a Restrict to pick out the combinations that have matching attributes. However, since Cross product expects unique attributes, we first have to rename overlapping attributes. And as a last step, we remove the extra attributes created by the rename. Let’s go through this step by step:

1. Rename attributes

We let the first relation be intact, and change Z in the second one to Z2, so it now has the heading:

X2	Y2	Z2

2. Get the cross product

X1	Y1	Z	X2	Y2	Z2
1	2	3	11	12	7
1	2	3	15	16	13
5	6	7	11	12	7
5	6	7	15	16	13

3. Restrict to keep only rows where Z = Z2

X1	Y1	Z	X2	Y2	Z2
5	6	7	11	12	7

4. Project, to keep all attributes except Z2

X1	Y1	Z	X2	Y2
5	6	7	11	12

And this is the exact same result as we would get with a Join.

In the case of our films and roles, the situation is slightly different: the attibute they share, name, is not the one we want to join on, because it means different things in the two relations. Instead, we want to match film_name from role with name from films. The solution follows the same steps. We rewrite:

roles.join(sk_films, :film_name => :name)

as:

sk_films2 = sk_films.rename(:name => :film_name2))
roles.cross_product(sk_films2)
     .restrict(Predicate.eq(:film_name, :film_name2))
     .project([
       :film_name, :production_year,
       :director, :actor_name, :name
     ])

Again, the point here is to demystify the Join operation and get a clear sense of the algebraic way of combining operations. Just as Join can be defined by simpler operations, we can use relations algebra to create our own composable building blocks for data querying and transformation.

Actors who appeared in the same film

For our next exercise, let’s ask: Which pairs of actors appeared in the same film? Taking a quick look at the Role relation at the beginning of this section, we immediately see that there is only one film that appears twice (namely Popeye), so we expect only one pair of actors.

To get our answer, we need to join the Role relation with itself on the film_name attribute. To do that, we first create a new relation that is exactly the same as Role but with actor_name renamed to actor_name2, because we need both actors’ names in our output relation. The last step is to project only actor_name and actor_name2.

roles2 = roles.rename(:actor_name => :actor_name2)
roles2.join(roles, [:film_name])
      .project([:actor_name, :actor_name2])

actor_name	actor_name2
Jennifer Jason Leigh	Jennifer Jason Leigh
Peter Sellers	Peter Sellers
Robin Williams	Robin Williams
Shelley Duval	Robin Williams
Robin Williams	Shelley Duval
Shelley Duval	Shelley Duval

Hmm, that’s not right. We are getting “pairs” where both actors are the same one. After all, every actor appeared in the same film as themselves (at least from the viewpoint of relational algebra). We need to further restrict this relation to only include tuples where actor_name and actor_name2 are not the same.

roles2.join(roles, [:film_name])
      .project([:actor_name, :actor_name2])
      .restrict(Predicate.neq(:actor_name, :actor_name2))

actor_name	actor_name2
Shelley Duval	Robin Williams
Robin Williams	Shelley Duval

Still not right. We’re getting the same pair twice, in opposite order. To make sure it’s only included once, we change the condition that they are different to the stronger condition that the first one is smaller than the second. This works since there is a strict ordering between strings.

roles2.join(roles, [:film_name])
      .project([:actor_name, :actor_name2])
      .restrict(Predicate.lt(:actor_name, :actor_name2))

actor_name	actor_name2
Robin Williams	Shelley Duval

Who knows who?

For our last example in this section, we will consider the following question: Which actors have worked with the same director (even if not in the same films)? The answer will take the form of a table like this:

actor1	director	actor2
Robin Williams	Robert Altman	Jennifer Jason Leigh
Shelley Duval	Robert Altman	Jennifer Jason Leigh
…	…	…

To get started, we’ll create a relation containing only actors:

actors = roles.project([:actor_name])
              .rename(:actor_name => :actor1)

actor1
Jennifer Jason Leigh
Peter Sellers
Robin Williams
Shelley Duval

(This might seem like an unnecessary and inefficient step, but it will make things more streamlined later on.)

Next, we want to pair each actor with every director they’ve worked with (according to our small database). To do that, we join the actors relation with roles, to get film names, and then do another join with the films relation to get the directors’ names, then keep only the actor and director attributes:

friends = actors.join(roles, :actor1 => :actor_name)
                .join(films, :film_name => :name)
                .project([:actor1, :director])

director	actor1
Robert Altman	Jennifer Jason Leigh
Jane Campion	Jennifer Jason Leigh
…	…

Finally, we use this relation to find other actors who worked with the same director. Again, we must use the roles and films relations to link directors and actors, but this time we do the traversal in the other direction, starting with films. Like with our previous example, we make a restriction that the two actors are not the same, and cannot occur together twice (in reverse order), by using the less than predicate:

friends = friends.join(films, :director => :director)
                 .join(roles, :name => :film_name)
                 .project([:actor1, :director, :actor_name])
                 .rename(:actor_name => :actor2)
                 .restrict(Predicate.lt(:actor1, :actor2))

actor2	director	actor1
Robin Williams	Robert Altman	Jennifer Jason Leigh
Shelley Duval	Robert Altman	Jennifer Jason Leigh
Shelley Duval	Stanley Kubrik	Peter Sellers
Shelley Duval	Robert Altman	Robin Williams