Skip to content

Using joins

In the previous section, we introduced the mathematical notation for some of the basic operations of relational algebra: Union (), Project (π), Cross product (×), Rename (ρ), and Restrict(σ).

In this section, we will go through a few slightly more advanced uses of relational algebra to get a feel for how it works. Rather than introducing more mathematical notation, we will start using Bmg for our examples.

Preliminaries

In what follows, we will work out some queries for the following three relations:

Actor
name : Stringbirth_year : Integer
Jennifer Jason Leigh1962
Peter Sellers1925
Robin Williams1949
Shelley Duval1949
Role
actor_name : Stringname : Stringfilm_name : String
Jennifer Jason LeighLois KaiserShort Cuts
Jennifer Jason LeighPauline AveryIn the Cut
Peter SellersDr. StrangeloveDr. Strangelove
Peter SellersInspector Jacques ClouseauThe Pink Panther
Robin WilliamsHenry SaganThe Fisher King
Robin WilliamsPopeyePopeye
Shelley DuvalOlive OylPopeye
Shelley DuvalWendy TorranceThe Shining
Film
name : Stringproduction_year : Integerdirector : String
Dr. Strangelove1964Stanley Kubrik
Fast Times at Ridgemont High1982Amy Heckerling
In the Cut2003Jane Campion
Popeye1980Robert Altman
Short Cuts1993Robert Altman
The Fisher King1991Terry Gilliam
The Pink Panther1963Blake Edwards
The Shining1980Stanley Kubrik

In the tables below, we won’t repeat the domain of the attributes, ie String/Integer.

In Bmg, we can create these relations directly:

actors = Bmg::Relation.new([
{ name: "Jennifer Jason Leigh", birth_year: 1962 },
{ name: "Peter Sellers", birth_year: 1925 },
# ...
])

We can also work with relations based on JSON, CSV, a Redis database, or a SQL database. But here we will stick to plain relations.

Birth years of actors who worked with Stanley Kubrik?

Let’s start with a simple (though contrived) task: find the birth years of actors who appeared in films directed by Stanley Kubrik. To do this, we first filter the Film relation to obtain a relation with only those rows where director = "Stanley Kubrik":

sk_films = films.restrict(director: "Stanley Kubrik")
nameproduction_yeardirector
Dr. Strangelove1964Stanley Kubrik
The Shining1980Stanley Kubrik

Join (⋈)

Next, we pick the roles associated with these films. To do that, we use another central operator: Join (⋈). If you have any experience at all working with SQL or relational databases, you will be familiar with joins, but it might be worth bracketing that experience, because learning about Join from the ground up could give you another perspective.

Join glues together tuples from two relations, just like Cross product, but also applies a condition on which tuples to include in the new relation, by comparing, for each row, attributes from the first relation with attributes from the second one.

In this case, we want rows where the name attribute from sk_films equals the film_name attribute from roles:

sk_roles = roles.join(sk_films, :film_name => :name)
film_nameproduction_yeardirectoractor_namename
Dr. Strangelove1964Stanley KubrikPeter SellersDr. Strangelove
The Shining1980Stanley KubrikShelley DuvalWendy Torrance

The new heading contains all the attributes from both relations, except that name from films and film_name from role have been merged. That makes sense, because what we have done is precisely to pick out those tuples where one equals the other. To include both would be redundant.

The only attribute we need from this relation is actor_name so let’s next perform a project:

sk_roles2 = sk_roles.project([:actor_name])
actor_name
Dr. Strangelove
Wendy Torrance

And finally, we perform another Join to get the birth years:

sk_roles2.join(actors, :actor_name => :name)
actor_namebirth_year
Peter Sellers1925
Shelley Duval1949

Join = Cross product + Restrict

Let’s get even more clear about exactly how Join works. As we have seen, Join combines tuples from two relations, let’s call them R and S, to form a new relation T, but only includes those pairs of tuples where the values of one or more attributes match up. In the simplest case (“natural join”), any tuples that have the same name are compared, and if their values are the same in both tuples, they are included in the output.

Consider these two relations:

X1Y1Z
123
567
X2Y2Z
11127
151613

A Natural join compares any attributes with the same name for equality. In this case, it produces:

X1Y1ZX2Y2
5671112

because exactly one pairing of tuples was found to have the same value in an attribute (Z) with the same name.

To gain a deeper understanding of the underlying algebra, it helps to understand how joins can be built up from simpler operations (although that is not something you would normally ever do).

In essence, the idea is: Cross product + Restrict — first do a Cross product to get all the combinations of tuples, then a Restrict to pick out the combinations that have matching attributes. However, since Cross product expects unique attributes, we first have to rename overlapping attributes. And as a last step, we remove the extra attributes created by the rename. Let’s go through this step by step:

1. Rename attributes

We let the first relation be intact, and change Z in the second one to Z2, so it now has the heading:

X2Y2Z2

2. Get the cross product

X1Y1ZX2Y2Z2
12311127
123151613
56711127
567151613

3. Restrict to keep only rows where Z = Z2

X1Y1ZX2Y2Z2
56711127

4. Project, to keep all attributes except Z2

X1Y1ZX2Y2
5671112

And this is the exact same result as we would get with a Join.

In the case of our films and roles, the situation is slightly different: the attibute they share, name, is not the one we want to join on, because it means different things in the two relations. Instead, we want to match film_name from role with name from films. The solution follows the same steps. We rewrite:

roles.join(sk_films, :film_name => :name)

as:

sk_films2 = sk_films.rename(:name => :film_name2))
roles.cross_product(sk_films2)
.restrict(Predicate.eq(:film_name, :film_name2))
.project([
:film_name, :production_year,
:director, :actor_name, :name
])

Again, the point here is to demystify the Join operation and get a clear sense of the algebraic way of combining operations. Just as Join can be defined by simpler operations, we can use relations algebra to create our own composable building blocks for data querying and transformation.

Actors who appeared in the same film

For our next exercise, let’s ask: Which pairs of actors appeared in the same film? Taking a quick look at the Role relation at the beginning of this section, we immediately see that there is only one film that appears twice (namely Popeye), so we expect only one pair of actors.

To get our answer, we need to join the Role relation with itself on the film_name attribute. To do that, we first create a new relation that is exactly the same as Role but with actor_name renamed to actor_name2, because we need both actors’ names in our output relation. The last step is to project only actor_name and actor_name2.

roles2 = roles.rename(:actor_name => :actor_name2)
roles2.join(roles, [:film_name])
.project([:actor_name, :actor_name2])
actor_nameactor_name2
Jennifer Jason LeighJennifer Jason Leigh
Peter SellersPeter Sellers
Robin WilliamsRobin Williams
Shelley DuvalRobin Williams
Robin WilliamsShelley Duval
Shelley DuvalShelley Duval

Hmm, that’s not right. We are getting “pairs” where both actors are the same one. After all, every actor appeared in the same film as themselves (at least from the viewpoint of relational algebra). We need to further restrict this relation to only include tuples where actor_name and actor_name2 are not the same.

roles2.join(roles, [:film_name])
.project([:actor_name, :actor_name2])
.restrict(Predicate.neq(:actor_name, :actor_name2))
actor_nameactor_name2
Shelley DuvalRobin Williams
Robin WilliamsShelley Duval

Still not right. We’re getting the same pair twice, in opposite order. To make sure it’s only included once, we change the condition that they are different to the stronger condition that the first one is smaller than the second. This works since there is a strict ordering between strings.

roles2.join(roles, [:film_name])
.project([:actor_name, :actor_name2])
.restrict(Predicate.lt(:actor_name, :actor_name2))
actor_nameactor_name2
Robin WilliamsShelley Duval

Who knows who?

For our last example in this section, we will consider the following question: Which actors have worked with the same director (even if not in the same films)? The answer will take the form of a table like this:

actor1directoractor2
Robin WilliamsRobert AltmanJennifer Jason Leigh
Shelley DuvalRobert AltmanJennifer Jason Leigh

To get started, we’ll create a relation containing only actors:

actors = roles.project([:actor_name])
.rename(:actor_name => :actor1)
actor1
Jennifer Jason Leigh
Peter Sellers
Robin Williams
Shelley Duval

(This might seem like an unnecessary and inefficient step, but it will make things more streamlined later on.)

Next, we want to pair each actor with every director they’ve worked with (according to our small database). To do that, we join the actors relation with roles, to get film names, and then do another join with the films relation to get the directors’ names, then keep only the actor and director attributes:

friends = actors.join(roles, :actor1 => :actor_name)
.join(films, :film_name => :name)
.project([:actor1, :director])
directoractor1
Robert AltmanJennifer Jason Leigh
Jane CampionJennifer Jason Leigh

Finally, we use this relation to find other actors who worked with the same director. Again, we must use the roles and films relations to link directors and actors, but this time we do the traversal in the other direction, starting with films. Like with our previous example, we make a restriction that the two actors are not the same, and cannot occur together twice (in reverse order), by using the less than predicate:

friends = friends.join(films, :director => :director)
.join(roles, :name => :film_name)
.project([:actor1, :director, :actor_name])
.rename(:actor_name => :actor2)
.restrict(Predicate.lt(:actor1, :actor2))
actor2directoractor1
Robin WilliamsRobert AltmanJennifer Jason Leigh
Shelley DuvalRobert AltmanJennifer Jason Leigh
Shelley DuvalStanley KubrikPeter Sellers
Shelley DuvalRobert AltmanRobin Williams