Data Science for Movies

Wed 12 May 2021

Set Cover with actors and movies

Posted by Maxime Kan in posts   

How to select movies with all actors in as few movies as possible

It sometimes can feel overwhelming how many movies one has to watch to build a strong cinematographic culture. In 1999, the American Film Institute (AFI) defined a list of 50 Hollywood legends of the classical Hollywood era. While some of the actors in this list starred in multiple movies that would actually need to be seen, is it possible to define a strategy to cover all of these actors by watching as few movies as possible?

This problem is actually known as the Set Cover problem. This problem attempts to find the smallest collection of sets which would cover the whole universe. Reformulated for our purpose, it is finding the smallest list of movies which would cover all the actors listed by the AFI.

The Set Cover problem is one of the fundamental algorithmic problems and has been shown to be NP-complete, meaning there is no known way to find a solution for this problem in an efficient way. Therefore, we will be using two approximation approaches to solve this problem: the greedy algorithm and an integer programming approximation.

1. Loading the data

In [1]:
from scipy.optimize import linprog
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
import itertools

Below is the list of legend actors and actresses as defined by the American Film Institute.

In [2]:
afi_female_legends = [
              "Katharine Hepburn", "Bette Davis", "Audrey Hepburn", "Ingrid Bergman",
              "Greta Garbo", "Marilyn Monroe", "Elizabeth Taylor", "Judy Garland",
              "Marlene Dietrich", "Joan Crawford", "Barbara Stanwyck", "Claudette Colbert",
              "Grace Kelly", "Ginger Rogers", "Mae West", "Vivien Leigh",
              "Lillian Gish", "Shirley Temple", "Rita Hayworth", "Lauren Bacall", 
              "Sophia Loren", "Jean Harlow", "Carole Lombard", "Mary Pickford",
              "Ava Gardner"
]
afi_male_legends = [
                    "Humphrey Bogart", "Cary Grant", "James Stewart", "Marlon Brando",
                    "Fred Astaire", "Henry Fonda", "Clark Gable", "James Cagney",
                    "Spencer Tracy", "Charles Chaplin", "Gary Cooper", "Gregory Peck",
                    "John Wayne", "Laurence Olivier", "Gene Kelly", "Orson Welles",
                    "Kirk Douglas", "James Dean", "Burt Lancaster", "Groucho Marx",
                    "Buster Keaton", "Sidney Poitier", "Robert Mitchum", "Edward G. Robinson",
                    "William Holden"
]

legend_actors = afi_female_legends + afi_male_legends

The data comes from IMDB and links each actor to the movies it starred in.

In [3]:
%%capture
! wget "https://datasets.imdbws.com/title.principals.tsv.gz"
! wget "https://datasets.imdbws.com/title.basics.tsv.gz"
! wget "https://datasets.imdbws.com/name.basics.tsv.gz"

! gunzip title.principals.tsv.gz
! gunzip title.basics.tsv.gz
! gunzip name.basics.tsv.gz
In [4]:
cast = pd.read_csv("title.principals.tsv", sep="\t", na_values="\\N", keep_default_na=False, 
                   usecols=["tconst", "nconst", "category", "job"],
                   dtype={"tconst": str, "nconst": str, "category": str, "job": str})
titles = pd.read_csv("title.basics.tsv", sep="\t", na_values="\\N", keep_default_na=False,
                     usecols=["tconst", "primaryTitle", "startYear", "titleType"],
                    dtype={"tconst": str, "primaryTitle": str, "startYear": "Int64","titleType":str})
names = pd.read_csv("name.basics.tsv", sep="\t", na_values="\\N", keep_default_na=False,
                    usecols=["nconst", "primaryName"],
                   dtype={"nconst": str, "primaryName": str})
In [5]:
movie_titles = titles[titles.titleType == "movie"]
In [6]:
set_movies = pd.merge(
    pd.merge(
        names[names.primaryName.isin(legend_actors)],
        cast[cast.category.isin(["actor", "actress"])],
        on="nconst"
        ),
    titles[(titles.titleType == "movie") & (titles.startYear.notna())],
     on="tconst"
     )
In [7]:
print("All these actors have played in %i movies over their careers." %len(set_movies[["primaryTitle", "startYear"]].drop_duplicates()))
All these actors have played in 2258 movies over their careers.

Note that this does not cover absolutely all the movies these actors starred in, since IMDB only gives the first 3 or 4 main names from the end credits. This means that smaller cameo roles are not included in this analysis.

2. Exploration

As a first step, we can then check how many different actors each member of our list starred with.

In [8]:
def plot_collaborations(aggregator, title, y_label, set_movies, legend_actors):
    count_collaborations = set_movies.copy()
    count_collaborations["gender"] = count_collaborations.primaryName.apply(
        lambda x: "male" if x in afi_male_legends else "female"
        )
    count_collaborations = pd.merge(count_collaborations, count_collaborations, on=["primaryTitle", "startYear"], suffixes=[None, "_partner"])
    count_collaborations = count_collaborations[count_collaborations.primaryName != count_collaborations.primaryName_partner]

    counts = (count_collaborations
            .groupby(["primaryName", "gender_partner"], as_index=False)
            .agg({"primaryName_partner": aggregator})
            .merge(pd.DataFrame({"primaryName": legend_actors}), how="right")
            .pivot(index="primaryName", columns="gender_partner", values="primaryName_partner")
            .fillna(0)
            .loc[legend_actors, ["male", "female"]]
    )

    fig, ax = plt.subplots(figsize=(15,5))
    counts.plot.bar(stacked=True, ax=ax, color={"male": "lightcoral", "female": "gold"})
    ax.grid(axis="y", alpha=0.3)
    ax.set(
        title=title,
        xlabel="Actors",
        ylabel=y_label,
        facecolor="whitesmoke"
        )
    plt.show()

plot_collaborations("nunique",
                    "Number of different Hollywood stars involved in collaborations", 
                    "Number of different co-starring actors/actresses",
                    set_movies,
                    legend_actors)

Based on this chart, we can see that the two actors who played with most other actors Cary Grant and Henry Fonda with 16 different co-starring partners of the list. One clear trend that is visible from this chart is that these actresses rarely played with other actresses, while it is more common for male actors to share a movie with another leading actor. Finally, this shows that two members of this list do not share any movie with anyone else: Mary Pickford and Groucho Marx. For the former, it is mostly because she was active in the very early days of Hollywood, for the latter because he specialised in a distinctive kind of comedies.

We can then explore things one level further by looking at exactly what actor played with whom. This can be visualised as a graph where actors are represented by nodes and common movies are represented by an edge connecting them. Below, we show these graphs within both groups of male and female Hollywood legends.

In [9]:
def plot_graph(actor_list, movies, title):
    movies = movies[movies.primaryName.isin(actor_list)]
    combinations = movies[movies.groupby(["primaryTitle", "startYear"]).primaryName.transform("count") > 1]
    G=nx.Graph()
    G.add_nodes_from(actor_list)
    actor_pairs = []
    for movie in combinations.primaryTitle.unique():
        actors = list(combinations[combinations.primaryTitle == movie].primaryName)
        actor_pairs += list(itertools.combinations(actors, 2))
    G.add_edges_from(actor_pairs)

    fig, ax = plt.subplots(figsize=(15,15))
    nx.draw(G, with_labels=True, pos=nx.shell_layout(G), ax=ax, node_color="powderblue", edge_color="grey") 
    ax.axis([-1.2, 1.2, -1.2, 1.2])
    ax.set_title(title)
In [10]:
plot_graph(afi_male_legends, set_movies, "Shared movies between male actors")
In [11]:
plot_graph(afi_female_legends, set_movies, "Shared movies between female actresses")

This again illustrates quite clearly that lead actresses shared way less movies together than lead actors. The actor who starred with most other male actors is Burt Lancaster (7), the actress who starred with most other female actresses is Ginger Rogers (5).

3. Data Preparation

As mentioned above, these actors and actresses played in 2257 movies in total. This can be simplified for our purpose quite dramatically by keeping only the movies where multiple actors appear in the credits. As we have shown above, only Mary Pickford and Groucho Marx would be left out as a result. For them, there is no other option than picking one of their films individually to complete the list.

This simplification will reduce computation time without changing anything to the solution of the problem.

In [12]:
combinations = set_movies[set_movies.groupby(["primaryTitle", "startYear"]).primaryName.transform("count") > 1]
one_each = set_movies[~set_movies.primaryName.isin(combinations.primaryName)].groupby("primaryName", as_index=False).first()
set_movies = pd.concat([combinations, one_each])
set_movies = set_movies.drop_duplicates(["primaryName", "category", "primaryTitle", "startYear"])
set_movies.head()
Out[12]:
nconst primaryName tconst category job titleType primaryTitle startYear
0 nm0000001 Fred Astaire tt0025164 actor NaN movie The Gay Divorcee 1934
1 nm0001677 Ginger Rogers tt0025164 actress NaN movie The Gay Divorcee 1934
2 nm0000001 Fred Astaire tt0026942 actor NaN movie Roberta 1935
3 nm0001677 Ginger Rogers tt0026942 actress NaN movie Roberta 1935
4 nm0000001 Fred Astaire tt0027125 actor NaN movie Top Hat 1935
In [13]:
print("We are considering %i movies for this problem by keeping only the movies shared by multiple actors." 
      %len(set_movies[["primaryTitle", "startYear"]].drop_duplicates()))
We are considering 261 movies for this problem by keeping only the movies shared by multiple actors.

4. Greedy Algorithm

The first algorithm we can use to approximate a solution to this set cover problem is the greedy algorithm. The way it works is very easy: it iteratively selects the movie that covers the largest number of actors left. The algorithm then continues until each actor has been selected. The solution may not be optimal because the greedy algorithm is "short-sighted", it may select a movie first that covers most actors at once, but that then discards more interesting combinations further on.

In [14]:
def greedy_cover(movies):
    selected_movies = []
    selected_years = []
    while not movies.empty:
        title, year = movies.groupby(["primaryTitle", "startYear"]).primaryName.count().idxmax()
        selected_movies.append(title)
        selected_years.append(year)
        actors_selected = movies[(movies.primaryTitle == title) & (movies.startYear == year)].primaryName
        movies = movies[~movies.primaryName.isin(actors_selected)]
    min_set = pd.DataFrame({"title": selected_movies, "year": selected_years})
    return min_set
In [15]:
greedy_min_set = greedy_cover(set_movies)
print("The greedy algorithm returned a min set of length %i." %len(greedy_min_set))
greedy_min_set
The greedy algorithm returned a min set of length 26.
Out[15]:
title year
0 How the West Was Won 1962
1 Boom Town 1940
2 Key Largo 1948
3 Monkey Business 1952
4 Now and Forever 1934
5 Seven Days in May 1964
6 21 Days Together 1940
7 A Countess from Hong Kong 1967
8 Cover Girl 1944
9 Easter Parade 1948
10 Executive Suite 1954
11 Genocide 1982
12 Grand Hotel 1932
13 Guess Who's Coming to Dinner 1967
14 Jimmy the Gent 1934
15 Limelight 1952
16 The Night of the Hunter 1955
17 Blonde Venus 1932
18 Casablanca 1942
19 Charade 1963
20 China Seas 1935
21 Giant 1956
22 High Noon 1952
23 I'm No Angel 1933
24 The Cocoanuts 1929
25 The Daddy's Dream 1911

The solution given by the greedy algorithm is given above. It returns a list of 26 movies which cover all 50 actors and actresses.

5. Integer Programming Approximation

The second approach we can use uses Integer Programming. Indeed, the problem we are trying to solve can be expressed as an Integer Program. Let us note:

  • $n$ the number of actors, $m$ the number of movies
  • $x$ the solution of the problem, a vector that encodes whether or not each movie was selected. $x \in \{0,1\}^m$
  • $M$ a sparse matrix in $\{0,1\}^{n \times m}$ so that $M_{i,j} = 1$ if actor $i$ plays in movie $j$ and else $M_{i,j} = 0$
  • $b$ be a constant vector of ones, $(1)_n$

Then the problem we are trying to solve can be written as:

$$min \sum_{j=1}^m x_i \text{ s.t. } Mx > b$$

This is the exact mathematical translation of the sentence "find the smallest combination of movies so that each actor gets selected at least once". It is called an Integer Program because the solution we are looking for has integer values (we do not want to select half-movies).

Obviously, this is nice, but does it help? Not really per se, because this Integer Program cannot be solved easily either. However, we can approximate this IP with a Linear Program by relaxing the integer constraint, meaning that instead of enforcing that $x \in \{0,1\}$, we can enforce that $ 0 \leq x \leq 1$. With this relaxation, we can easily find an optimal solution $x^*$ in polynomial time to the LP.

However, this optimal solution $x^*$ is most likely not going to be an integer solution anymore. The most straightforward approximation is then to round $x*$ to bring the solution back to our integer world. However, not only is this rounded solution not going to be optimal anymore, it can also violate the constraints of our IP.

Let us implement this approach. Below, we first compute the matrix $M$ and feed it to scipy's LP solving tools (they require slightly different specifications than the mathematical expressions written here, but it is equivalent at the end of the day).

In [16]:
movies_matrix_df = pd.concat([set_movies, pd.get_dummies(set_movies.primaryName)], axis=1)
movies_matrix_df = movies_matrix_df.groupby(["primaryTitle", "startYear"]).sum()
movies_matrix_df.head()
Out[16]:
Audrey Hepburn Ava Gardner Barbara Stanwyck Bette Davis Burt Lancaster Buster Keaton Carole Lombard Cary Grant Charles Chaplin Clark Gable Claudette Colbert Edward G. Robinson Elizabeth Taylor Fred Astaire Gary Cooper Gene Kelly Ginger Rogers Grace Kelly Gregory Peck Greta Garbo Groucho Marx Henry Fonda Humphrey Bogart Ingrid Bergman James Cagney James Dean James Stewart Jean Harlow Joan Crawford John Wayne Judy Garland Katharine Hepburn Kirk Douglas Lauren Bacall Laurence Olivier Lillian Gish Mae West Marilyn Monroe Marlene Dietrich Marlon Brando Mary Pickford Orson Welles Rita Hayworth Robert Mitchum Shirley Temple Sidney Poitier Sophia Loren Spencer Tracy Vivien Leigh William Holden
primaryTitle startYear
20,000 Years in Sing Sing 1932 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
21 Days Together 1940 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
A Child Is Waiting 1963 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A Countess from Hong Kong 1967 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
A Streetcar Named Desire 1951 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
In [17]:
movies_matrix = movies_matrix_df.to_numpy().T
n_actors, n_movies = movies_matrix.shape
c = np.ones(n_movies)
A_ub = -1 * movies_matrix
b_ub = -1 * np.ones(n_actors)
l = np.zeros(n_movies)
u = np.ones(n_movies)

res = linprog(c=c, A_ub=A_ub, b_ub=b_ub, bounds=(0, 1))
/usr/local/lib/python3.7/dist-packages/numpy/core/fromnumeric.py:87: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
In [18]:
rounded_solution = movies_matrix_df.iloc[res.x > 0.5]
actors_covered = rounded_solution.sum(axis=0).sum()

print("The approximation of the Integer Linear Program returned a cover of %i actors with %i movies." %(actors_covered, len(rounded_solution)))
The approximation of the Integer Linear Program returned a cover of 43 actors with 19 movies.
In [19]:
ip_min_set = rounded_solution.reset_index()[["primaryTitle", "startYear"]]
ip_min_set.rename(columns={"primaryTitle": "title", "startYear": "year"}, inplace=True)
ip_min_set
Out[19]:
title year
0 A Countess from Hong Kong 1967
1 Clash by Night 1952
2 Giant 1956
3 Grand Hotel 1932
4 Guess Who's Coming to Dinner 1967
5 How the West Was Won 1962
6 Limelight 1952
7 Manpower 1941
8 Mogambo 1953
9 Murder on the Orient Express 1974
10 Now and Forever 1934
11 Sabrina 1954
12 She Done Him Wrong 1933
13 The Cocoanuts 1929
14 The Daddy's Dream 1911
15 The List of Adrian Messenger 1963
16 The Public Enemy 1931
17 The Whales of August 1987
18 Tomorrow Is Forever 1946
In [20]:
missing_actors = rounded_solution.columns[rounded_solution.sum(axis=0) == 0]
print("The missing actors are %s." %", ".join(list(missing_actors)))
The missing actors are Fred Astaire, Gene Kelly, Ginger Rogers, Judy Garland, Laurence Olivier, Rita Hayworth, Vivien Leigh.

The approximation of the IP returned the list of movies above. However, because of the rounding of the solution described above, some of the constraints of the IP were violated, resulting in some actors missing from this cover.

This can be easily fixed by adding a movie for each of the missing actors to this set. This would result in a set of 26 movies for 50 actors, which would be the same as the solution of the greedy algorithm. However, there might be a way to pick less than 7 movies to cover the remaining 7 actors. For this, we can apply the greedy algorithm on the actors left and see if we can improve the result.

In [21]:
missing_actors_set = greedy_cover(set_movies[set_movies.primaryName.isin(missing_actors)])
missing_actors_set
Out[21]:
title year
0 21 Days Together 1940
1 Carefree 1938
2 Cover Girl 1944
3 A Child Is Waiting 1963
In [22]:
print("Using the greedy algorithm to finish picks %i movies for the remaining %i actors." %(len(missing_actors_set), len(missing_actors)))
Using the greedy algorithm to finish picks 4 movies for the remaining 7 actors.

This leads us to a cover of 19 + 4 = 23 movies, which is 3 movies less than in the first approach using the greedy algorithm alone. A significant reduction! The final list looks like this:

In [23]:
min_set = pd.concat([ip_min_set, missing_actors_set], ignore_index=True)
min_set
Out[23]:
title year
0 A Countess from Hong Kong 1967
1 Clash by Night 1952
2 Giant 1956
3 Grand Hotel 1932
4 Guess Who's Coming to Dinner 1967
5 How the West Was Won 1962
6 Limelight 1952
7 Manpower 1941
8 Mogambo 1953
9 Murder on the Orient Express 1974
10 Now and Forever 1934
11 Sabrina 1954
12 She Done Him Wrong 1933
13 The Cocoanuts 1929
14 The Daddy's Dream 1911
15 The List of Adrian Messenger 1963
16 The Public Enemy 1931
17 The Whales of August 1987
18 Tomorrow Is Forever 1946
19 21 Days Together 1940
20 Carefree 1938
21 Cover Girl 1944
22 A Child Is Waiting 1963

6. Conclusion

In this post, we tried to find the minimal number of movies required to watch at least one movie with each of AFI's top 50 Hollywood stars. This is an application of the Set Cover problem, and while it is a problem that has no straightforward solution, two approximative methodologies were used. The first, greedy algorithm, returned 26 movies. The second, an approximation of the integer program, completed by an iteration of greedy algorithm for missing actors, returned 23 movies.

This selection of 23 movies may or may not be the optimal solution, but this is already a good start: watch these 23 movies and you will have an exhaustive knowledge of American classic cinema at a (almost?) minimal cost!