Analysis of movies- IMDB dataset

We will look at a subset sample of movies, taken from the Kaggle IMDB 5000 movie dataset

Inspecting the dataset

movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)

## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Aveng…
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", …
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorro…
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 2…
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, …
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08,…
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08,…
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 92…
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, …
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 3…
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2,…

skim(movies)

(#tab:load_movies)Data summary
Name	movies
Number of rows	2961
Number of columns	11
_______________________
Column type frequency:
character	3
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
title	1	1	83	2907
genre	1	5	11	17
director	1	3	32	1366

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	1	2.00e+03	9.95e+00	1920.0	2.00e+03	2.00e+03	2.01e+03	2.02e+03	▁▁▁▂▇
duration	1	1.10e+02	2.22e+01	37.0	9.50e+01	1.06e+02	1.19e+02	3.30e+02	▃▇▁▁▁
gross	1	5.81e+07	7.25e+07	703.0	1.23e+07	3.47e+07	7.56e+07	7.61e+08	▇▁▁▁▁
budget	1	4.06e+07	4.37e+07	218.0	1.10e+07	2.60e+07	5.50e+07	3.00e+08	▇▂▁▁▁
cast_facebook_likes	1	1.24e+04	2.05e+04	0.0	2.24e+03	4.60e+03	1.69e+04	6.57e+05	▇▁▁▁▁
votes	1	1.09e+05	1.58e+05	5.0	1.99e+04	5.57e+04	1.33e+05	1.69e+06	▇▁▁▁▁
reviews	1	5.03e+02	4.94e+02	2.0	1.99e+02	3.64e+02	6.31e+02	5.31e+03	▇▁▁▁▁
rating	1	6.39e+00	1.05e+00	1.6	5.80e+00	6.50e+00	7.10e+00	9.30e+00	▁▁▆▇▁

Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:

gross : The gross earnings in the US box office, not adjusted for inflation
budget: The movie’s budget
cast_facebook_likes: the number of facebook likes cast memebrs received
votes: the number of people who voted for (or rated) the movie in IMDB
reviews: the number of reviews for that movie
rating: IMDB average rating

Are there any missing values (NAs)? Are all entries distinct or are there duplicate entries?

No missing but many duplicates across categories. There are 2961 rows and only 2907 unique movie titles, so it can be inferred that some movies show more than 1 times with different genres.

General Analysis

How many movies of each genre are in this dataset?

movies %>% 
  count(genre) %>% 
  arrange(desc(n))

## # A tibble: 17 x 2
##    genre           n
##    <chr>       <int>
##  1 Comedy        848
##  2 Action        738
##  3 Drama         498
##  4 Adventure     288
##  5 Crime         202
##  6 Biography     135
##  7 Horror        131
##  8 Animation      35
##  9 Fantasy        28
## 10 Documentary    25
## 11 Mystery        16
## 12 Sci-Fi          7
## 13 Family          3
## 14 Musical         2
## 15 Romance         2
## 16 Western         2
## 17 Thriller        1

Exploring the relationship between earnings and budget in each film genre

movies %>% 
  mutate(return = gross/budget) %>% 
  group_by (genre) %>% 
  summarise (average_earnings = mean(gross), average_budget = mean(budget), return_on_budget = mean(return)) %>% 
  arrange(desc(return_on_budget))

## # A tibble: 17 x 4
##    genre       average_earnings average_budget return_on_budget
##    <chr>                  <dbl>          <dbl>            <dbl>
##  1 Horror             37713738.      13504916.         88.3    
##  2 Biography          45201805.      28543696.         22.3    
##  3 Musical            92084000        3189500          18.8    
##  4 Family            149160478.      14833333.         14.1    
##  5 Documentary        17353973.       5887852.          8.70   
##  6 Western            20821884        3465000           7.06   
##  7 Fantasy            42408841.      17582143.          6.68   
##  8 Animation          98433792.      61701429.          5.01   
##  9 Comedy             42630552.      24446319.          3.71   
## 10 Mystery            67533021.      39218750           3.27   
## 11 Romance            31264848.      25107500           3.17   
## 12 Drama              37465371.      26242933.          2.95   
## 13 Adventure          95794257.      66290069.          2.41   
## 14 Crime              37502397.      26596169.          2.17   
## 15 Action             86583860.      71354888.          1.92   
## 16 Sci-Fi             29788371.      27607143.          1.58   
## 17 Thriller               2468         300000           0.00823

Table of summary statistics for the 15 top grossing directors

movies %>% 
  group_by(director) %>% 
  summarise(total_amount = sum(gross),mean_amount = mean(gross), median_amount = median(gross), SD_amount = sd(gross, na.rm= TRUE)) %>%  
  slice_max(n=15, total_amount)

## # A tibble: 15 x 5
##    director          total_amount mean_amount median_amount  SD_amount
##    <chr>                    <dbl>       <dbl>         <dbl>      <dbl>
##  1 Steven Spielberg    4014061704  174524422.    164435221  101421051.
##  2 Michael Bay         2231242537  171634041.    138396624  127161579.
##  3 Tim Burton          2071275480  129454718.     76519172  108726924.
##  4 Sam Raimi           2014600898  201460090.    234903076  162126632.
##  5 James Cameron       1909725910  318287652.    175562880. 309171337.
##  6 Christopher Nolan   1813227576  226653447     196667606. 187224133.
##  7 George Lucas        1741418480  348283696     380262555  146193880.
##  8 Robert Zemeckis     1619309108  124562239.    100853835   91300279.
##  9 Clint Eastwood      1378321100   72543216.     46700000   75487408.
## 10 Francis Lawrence    1358501971  271700394.    281666058  135437020.
## 11 Ron Howard          1335988092  111332341     101587923   81933761.
## 12 Gore Verbinski      1329600995  189942999.    123207194  154473822.
## 13 Andrew Adamson      1137446920  284361730     279680930. 120895765.
## 14 Shawn Levy          1129750988  102704635.     85463309   65484773.
## 15 Ridley Scott        1128857598   80632686.     47775715   68812285.

How do movies across different genres perform on IMDB ratings?

movies %>%
  group_by(genre) %>% 
  summarise(count=n(), mean_rating=mean(rating), min_rating=min(rating), max_rating=max(rating), median_rating=median(rating), SD_rating=sd(gross, na.rm= TRUE)) %>% arrange(desc(count))

## # A tibble: 17 x 7
##    genre       count mean_rating min_rating max_rating median_rating  SD_rating
##    <chr>       <int>       <dbl>      <dbl>      <dbl>         <dbl>      <dbl>
##  1 Comedy        848        6.11        1.9        8.8          6.2   48067667.
##  2 Action        738        6.23        2.1        9            6.3   94846029.
##  3 Drama         498        6.73        2.1        8.8          6.8   51331049.
##  4 Adventure     288        6.51        2.3        8.6          6.6   96677829.
##  5 Crime         202        6.92        4.8        9.3          6.9   38309059.
##  6 Biography     135        7.11        4.5        8.9          7.2   49220303.
##  7 Horror        131        5.83        3.6        8.5          5.9   31505660.
##  8 Animation      35        6.65        4.5        8            6.9   86201928.
##  9 Fantasy        28        6.15        4.3        7.9          6.45  27036114.
## 10 Documentary    25        6.66        1.6        8.5          7.4   29441650.
## 11 Mystery        16        6.86        4.6        8.5          6.9   54787287.
## 12 Sci-Fi          7        6.66        5          8.2          6.4   30917838.
## 13 Family          3        6.5         5.7        7.9          5.9  247535241.
## 14 Musical         2        6.75        6.3        7.2          6.75 126255330.
## 15 Romance         2        6.65        6.2        7.1          6.65  44107152.
## 16 Western         2        5.70        4.1        7.3          5.70  29101851.
## 17 Thriller        1        4.8         4.8        4.8          4.8         NA

ggplot(movies, aes(x=rating))+
  geom_density(aes(fill=genre), alpha = 0.3, show.legend = FALSE)+
  facet_wrap(~genre, nrow = 6)+
  labs(title="Ratings distributed by genre", 
       x ="ratings", 
       caption = "Source: IMDB dataset")+
  theme_bw()+
  NULL

Are Facebook likes a good predictor of box office revenue of a movies?

The number of facebook likes that the cast has received is not a good predictor of how much money a movie will make at the box office since the correlation coefficient is only ~0.2, too low to indicate a correlation between these two. Since the Gross Revenue vs Facebook Likes plot is hard to read, we decided to map Log of the number of FB likes to the X axis and Log of gross revenue to the Y axis.

cor(movies$cast_facebook_likes, movies$gross)

## [1] 0.213

  ggplot(movies, aes(x=cast_facebook_likes, y= gross))+
    geom_point()+
    geom_smooth(method = "lm")+
    theme_bw()+
    NULL

fblikes = movies %>%  
  select(gross, cast_facebook_likes) %>% 
  mutate(ln_gross=log(gross), ln_fblikes=log(cast_facebook_likes))

  ggplot(fblikes, aes(x=ln_fblikes, y= ln_gross))+
  geom_point()+
  geom_smooth(method = "lm")+
  theme_bw()+
    labs(
    x = "Log of the number of FB likes",
    y= "Log of gross revenue" ,
    title = "How do Facebook likes correlate to box office performance?",
    caption = "Source: IMDB Dataset"
    )+
    NULL

How does the budget size correlate with movie revenue performance?

The budget is likely to be a good predictor of how much money a movie will make at the box office since the correlation coefficient is about ~0.64, indicating a correlation between these two. Since the first Gross revenue vs budget graph is difficult to interpret we follow it up with another graph in which we decided to map the log of the budget size to the X axis and the log of gross revenue to the Y axis

cor(movies$budget, movies$gross)

## [1] 0.641

ggplot(movies, aes(x=budget, y= gross))+
  geom_point()+
  geom_smooth(method = "lm")+
  theme_bw()+
   NULL

gross_on_budget = movies %>%  
  select(gross, budget) %>% 
  mutate(ln_gross=log(gross), ln_budget=log(budget))

ggplot(gross_on_budget, aes(x=ln_budget, y= ln_gross))+
  geom_point()+
  geom_smooth(method = "lm", fill="lightgrey")+
  theme_bw()+
  labs(
    x = "Log of film budget",
    y= "Log of gross revenue" ,
    title = "Is a bigger budget a good predictor of box office success?",
    caption = "Source: IMDB Dataset") +
    NULL

How do IMDB ratings correlate with box office perfromance across different movie genres?

From these plots it seems that IMDB rating is not a good indicator of a movie’s box office revenue because for most movie genres most movies incline to the same revenue level regardless of their IMDB ratings. It is strange in the dataset that for some genres of movies not much data available.

ggplot(movies, aes(x=rating, y=gross))+
  geom_point()+
  facet_wrap(~genre, nrow = 6)+
  labs(title="IMDB Ratings as a predictor of box office performance", 
       x ="IMDB Ratings", 
       y ="Gross",
       caption = "Source: IMDB dataset")+
  theme_bw()+
  expand_limits(y=0, x = 0)+
  NULL

IMDB ratings: Differences between directors

Now we would like to explore whether the mean IMDB rating for Steven Spielberg and Tim Burton are the same or not. ## Produce the data we will use

Burton_Spielberg <- movies %>% 
  filter(director %in% c("Steven Spielberg", "Tim Burton"))

Below is the calculated the confidence intervals for the mean ratings of these two directors and as you can see they overlap. We will reproduce this graph and run a hpothesis test using both the t.test command and the infer package to simulate from a null distribution, where we assume zero difference between the two.

Confidence intervals for the mean ratings of Steven Spielberg and Tim Burton

g1 <- Burton_Spielberg %>% 
  group_by(director) %>% 
  summarize(n = n(), 
            mean_rating = mean(rating, na.rm = TRUE), 
            SD_directors = sd(rating, na.rm = TRUE),
            SE_directors = SD_directors/sqrt(n),
            t_critical = qt(0.975, n-1),
            lower95_ci = mean_rating - t_critical*SE_directors,
            upper95_ci = mean_rating + t_critical*SE_directors)

g1 %>% 
  ggplot(mapping = aes(x = mean_rating, y= reorder(director, mean_rating)))+
  geom_point(aes(color = as.factor(director)), show.legend = FALSE, size = 5) +
  geom_errorbar(aes(xmin = lower95_ci, xmax = upper95_ci, color = as.factor(director)), width = 0.1, show.legend = FALSE, size = 1.8) +
  geom_rect(aes(xmin = max(lower95_ci), xmax = min(upper95_ci), ymin = -Inf, ymax = Inf), alpha = 0.1, fill = "black") +
  theme_minimal()+
  labs(x = "Mean IMDB Rating",
       y = "",
       title = "Do Spielberg and Burton have the same mean IMDB rating?",
       subtitle = "95% confidence intervals overlap"
       )+
  geom_text(aes(label = round(mean_rating,2)), vjust=-1, size=6)+
  geom_text(aes(x=upper95_ci, label = round(upper95_ci,2)), vjust=-1, size=5)+
  geom_text(aes(x=lower95_ci, label = round(lower95_ci,2)), vjust=-1, size=5)+
  NULL

Hypothesis test with formula

Null hypotheses: The mean IMDB rating for Steven Spielberg and Tim Burton are the same
Alternative hypotheses: The mean IMDB rating for Steven Spielberg and Tim Burton are different
p-value: if p-value is lower than 5%, then reject H0 and think the mean IMDB rating for Steven Spielberg and Tim Burton are different

t.test(rating ~ director, data = Burton_Spielberg)

## 
##  Welch Two Sample t-test
## 
## data:  rating by director
## t = 3, df = 31, p-value = 0.01
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.16 1.13
## sample estimates:
## mean in group Steven Spielberg       mean in group Tim Burton 
##                           7.57                           6.93

Hypothesis test with `infer`

Null hypotheses: The mean IMDB rating for Steven Spielberg and Tim Burton are the same
Alternative hypotheses: The mean IMDB rating for Steven Spielberg and Tim Burton are different
p-value: if p-value is lower than 5%, then reject H0 and think the mean IMDB rating for Steven Spielberg and Tim Burton are different

# initialize the test, which we save as `directors_diff`
directors_diff <- Burton_Spielberg %>%
    specify(rating ~ director) %>%
  
    # the statistic we are searching for is the difference in means, with the order being "Steven Spielberg", "Tim Burton"
    calculate(stat = "diff in means", order = c("Steven Spielberg", "Tim Burton"))

# simulate the test on the null distribution, which we save as null
directors_null_dist <- Burton_Spielberg %>%
    specify(rating ~ director) %>%
    hypothesize(null = "independence") %>%
    generate(reps = 1000, type = "permute") %>%
    calculate(stat = "diff in means", order = c("Steven Spielberg", "Tim Burton"))

# Visualising to see how many of these null permutations have a difference
directors_null_dist %>% visualize() +
  shade_p_value(obs_stat = directors_diff, direction = "two-sided")

# Calculating the p-value for the hypothesis test
directors_null_dist %>%
  get_p_value(obs_stat = directors_diff, direction = "two_sided")

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.004

From the hypothesis test, we can see that the t-stat is 3 and the p-value is 1% which is smaller than 5%, so we reject the null hypothesis and conclude that the the difference in mean IMDB ratings for Steven Spielberg and Tim Burton is statistically significant, specifically the mean rating of Steven Spielberg is higher. This is just like what we expected, as we all prefer Steven Spielberg.