How has hyperdrive technology in Star Wars changed over time? Are there differences between trilogies? We analyze data from the Star Wars API (SWAPI).

Setup

Load the tidyverse, a collection of R packages that share a common philosophy and are designed to work together. Load rwars, which accesses SWAPI. Load additional packages for visualizing the data. Create labels for the trilogies. Pull the film data from SWAPI.

library(tidyverse)
library(rwars)
library(forcats)
library(ggrepel)
library(ggthemes)

trilogies <- c(
  "Prequels: Episode I-III", 
  "Originals: Episode IV-VI", 
  "Sequels: Episode VII"
  )

films <- get_all_films()$results

Package: dplyr

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges: mutating, selecting, filtering, summarizing, and arranging your data. These all combine naturally with group_by(), which allows you to perform any operation "by group". You can learn more about them in vignette("dplyr").

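# Keep only the rows that match a condition
starwars %>% 
  filter(species == "Droid")
# Keep only some columns, chosen by name or by a helper
starwars %>% 
  select(name, ends_with("color"))
# Add a computed column (body mass index, from mass in kg and height in cm)
starwars %>% 
  mutate(name, bmi = mass / ((height / 100)  ^ 2)) %>%
  select(name:mass, bmi)
# Sort the rows, heaviest first
starwars %>% 
  arrange(desc(mass))
# Summarise by group, then keep groups with more than one member
starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    mass = mean(mass, na.rm = TRUE)
  ) %>%
  filter(n > 1)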
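
As a further illustration of working "by group", the sketch below combines group_by() with mutate() instead of summarise(), so every character keeps its own row and gains a column comparing it with the mean of its species. The column name mass_vs_species is made up for this example; starwars is the dataset that ships with dplyr.

```r
library(dplyr)

by_species <- starwars %>%
  group_by(species) %>%
  filter(n() > 1) %>%   # keep groups with more than one member
  # inside mutate(), mean() is computed separately for each species
  mutate(mass_vs_species = mass - mean(mass, na.rm = TRUE)) %>%
  ungroup() %>%
  select(name, species, mass, mass_vs_species)

by_species
```

Because the grouping is declared once with group_by(), the same verbs that work on the whole table work per species without any explicit loop.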

Package: tibble

A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code. Tibbles also have an enhanced print() method which makes them easier to use with large datasets containing complex objects.

In versions of R before 4.0.0, data.frame() converts character vectors into factors by default, whereas tibble() never does. No more stringsAsFactors = FALSE.

data.frame(
  id = 1:3,
  trilogies = trilogies
  )
tibble(
  id = 1:3,
  trilogies = trilogies
  )
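
The "do less, complain more" behaviour can be checked directly. The following is a minimal sketch, not part of the original analysis: the column name trilogy and the shortened labels are chosen for illustration, and stringsAsFactors = TRUE is passed explicitly so the result does not depend on the R version.

```r
library(tibble)

labels <- c("Prequels", "Originals", "Sequels")

# stringsAsFactors = TRUE reproduces the pre-4.0.0 default of data.frame()
df <- data.frame(trilogy = labels, stringsAsFactors = TRUE)
tb <- tibble(trilogy = labels)

class(df$trilogy)  # factor: the type was changed on the way in
class(tb$trilogy)  # character: the type is left alone

# Partial matching: the data.frame quietly resolves `tri` to `trilogy`
df$tri

# The tibble refuses to guess: it returns NULL and warns about the unknown column
tb$tri
```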

Package: purrr

purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. If you’ve never heard of FP before, the best place to start is the family of map() functions which allow you to replace many for loops with code that is both more succinct and easier to read. The best place to learn about the map() functions is the iteration chapter in R for data science.

We will compare lapply to purrr::map and purrr::map_chr.

cat("lapply\n\n")
lapply
lapply(films, function(x) x$title)
[[1]]
[1] "A New Hope"

[[2]]
[1] "Attack of the Clones"

[[3]]
[1] "The Phantom Menace"

[[4]]
[1] "Revenge of the Sith"

[[5]]
[1] "Return of the Jedi"

[[6]]
[1] "The Empire Strikes Back"

[[7]]
[1] "The Force Awakens"
cat("purrr::map\n\n")
purrr::map
map(films, "title")
[[1]]
[1] "A New Hope"

[[2]]
[1] "Attack of the Clones"

[[3]]
[1] "The Phantom Menace"

[[4]]
[1] "Revenge of the Sith"

[[5]]
[1] "Return of the Jedi"

[[6]]
[1] "The Empire Strikes Back"

[[7]]
[1] "The Force Awakens"
cat("purrr::map_chr\n\n")
purrr::map_chr
map_chr(films, "title")
[1] "A New Hope"              "Attack of the Clones"   
[3] "The Phantom Menace"      "Revenge of the Sith"    
[5] "Return of the Jedi"      "The Empire Strikes Back"
[7] "The Force Awakens"      
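
The difference between the three calls is the return type: lapply() and map() always return a list, while map_chr() returns a character vector and fails if any element is not a single string. The sketch below shows this on a small hand-built list that only mimics the shape of the SWAPI film records (films_demo is a stand-in, not the real API response), and includes vapply() as the closest base R counterpart.

```r
library(purrr)

# Hypothetical stand-in for two entries of `films`
films_demo <- list(
  list(title = "A New Hope", episode_id = 4),
  list(title = "The Phantom Menace", episode_id = 1)
)

titles_list <- map(films_demo, "title")           # list of length 2
titles_chr  <- map_chr(films_demo, "title")       # character vector
episodes    <- map_dbl(films_demo, "episode_id")  # numeric vector

# Base R equivalent of map_chr(): vapply() with a declared return type
titles_base <- vapply(films_demo, function(x) x$title, character(1))

identical(titles_chr, titles_base)
```

The typed variants are what make it safe to build tibble columns straight from nested API results.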

Example: The rise of hyperdrive

What percentage of the ships in each movie are starships, that is, ships with hyperdrive? Here "ships" means starships plus vehicles. We will use the tidyverse to organize and visualize Star Wars data.

results <- tibble(
  title = map_chr(films, "title"),
  episode = map_dbl(films, "episode_id"),
  starships = map_dbl(films, ~length(.x$starships)),
  vehicles = map_dbl(films, ~length(.x$vehicles)),
  planets = map_dbl(films, ~length(.x$planets))
  ) %>%
  mutate(ships = vehicles + starships) %>%
  mutate(ratio = starships / ships * 100) %>% 
  mutate(Trilogy = trilogies[findInterval(episode, c(1,4,7))])
results

Notice that in the output format each column is a variable and each row is an observation. This is the tidy format that is useful for modeling and visualization.
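
The Trilogy column in the pipeline above comes from findInterval(), which returns, for each episode number, the index of the interval it falls into given the break points 1, 4 and 7. That index is then used to pick a label. A minimal base R sketch, using the episode order in which SWAPI returned the films above:

```r
trilogies <- c(
  "Prequels: Episode I-III",
  "Originals: Episode IV-VI",
  "Sequels: Episode VII"
)

# Episode numbers in the order the films were listed above
episode <- c(4, 2, 1, 3, 6, 5, 7)

# Intervals are closed on the left: [1, 4) -> 1, [4, 7) -> 2, [7, Inf) -> 3
idx <- findInterval(episode, c(1, 4, 7))
idx
# → 2 1 1 1 2 2 3

data.frame(episode, trilogy = trilogies[idx])
```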

ggplot(results, aes(reorder(title, episode), ratio)) + 
  # geom_col() is the idiomatic form of geom_bar(stat = "identity")
  geom_col(aes(fill = Trilogy)) +
  labs(
    title = "The Rise of Hyperdrive",
    subtitle = "Percentage of Ships with Hyperdrive Capability"
  ) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  theme_fivethirtyeight() +
  # The bars are mapped to fill, so the fill scale (not the colour scale) applies
  scale_fill_fivethirtyeight() +
  theme(
    axis.text.x = element_text(angle = 35, vjust = 0.9, hjust = 0.9)
  )
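
One detail the plot leaves to defaults: Trilogy is a character column, so ggplot2 orders the legend alphabetically rather than in story order. A minimal sketch of one possible fix, which is a suggestion and not something the original analysis does, is to convert the labels to a factor whose levels follow the trilogies vector:

```r
trilogies <- c(
  "Prequels: Episode I-III",
  "Originals: Episode IV-VI",
  "Sequels: Episode VII"
)

# Labels as they might appear in results$Trilogy
labels_chr <- c(
  "Originals: Episode IV-VI",
  "Prequels: Episode I-III",
  "Sequels: Episode VII"
)

# Default factor levels are sorted alphabetically
levels(factor(labels_chr))

# Explicit levels keep the story order: prequels, originals, sequels
labels_fct <- factor(labels_chr, levels = trilogies)
levels(labels_fct)
```

In the pipeline this would be one extra step, for example mutate(Trilogy = factor(Trilogy, levels = trilogies)), placed before the call to ggplot().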

Insights

These data indicate an increased emphasis on hyperdrive from one trilogy to the next in story order. However, it is important to note that the trilogies were made out of order: the originals (Episodes IV-VI) were released before the prequels (Episodes I-III). In production order, the percentage of ships with hyperdrive therefore decreased from the original trilogy to the prequel trilogy.
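
To make the two orderings concrete, the sketch below sorts the episodes by release year. The years are typed in by hand from general knowledge rather than pulled from SWAPI, and the data frame name films_order is invented for this example.

```r
# Episode numbers with their theatrical release years (entered by hand)
films_order <- data.frame(
  episode  = 1:7,
  released = c(1999, 2002, 2005, 1977, 1980, 1983, 2015)
)

# Story order is 1..7; production order starts with the original trilogy
production_order <- films_order$episode[order(films_order$released)]
production_order
# → 4 5 6 1 2 3 7
```

Reordering the x axis of the bar chart by release year instead of episode would show the trend as audiences actually saw it.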

Example: Predicting hyperdrive

We will visually compare the number of ships with hyperdrive (starships) against the total number of ships (starships + vehicles) to determine if there are trends over time or by trilogy. There is a strong correlation between the number of ships with hyperdrive and the total number of ships. Notice that the number of ships increases within each trilogy. Expect more ships in Episode VIII: The Last Jedi.

results %>%
  ggplot(aes(ships, starships)) +
  geom_point(aes(color = Trilogy)) +
  theme_fivethirtyeight() +
  geom_smooth(method = "lm") +
  geom_text(aes(label = title), vjust = -1, size = 2.5) +
  labs(
    title = "Hyperdrive Correlations",
    subtitle = "The Number of Ships with Hyperdrive vs Total Ships"
  )

starship_model <- lm(starships ~ ships, data = results)
coef_ships <- coef(starship_model)['ships']
summary(starship_model)

Call:
lm(formula = starships ~ ships, data = results)

Residuals:
      1       2       3       4       5       6       7 
 1.2740 -1.3325 -1.7260 -0.5865  1.6675  0.9215 -0.2180 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  1.31637    1.30877   1.006  0.36068   
ships        0.45081    0.07858   5.737  0.00225 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.442 on 5 degrees of freedom
Multiple R-squared:  0.8681,    Adjusted R-squared:  0.8418 
F-statistic: 32.92 on 1 and 5 DF,  p-value: 0.002254

The data show a positive trend for the percentage of ships with hyperdrive capability. Notice that 100% of the ships in The Force Awakens had hyperdrive. What will be the percentage for The Last Jedi? Based on our visual inspection, we fit a simple linear model that predicts the number of ships with hyperdrive from the total number of ships. The model indicates that each additional ship introduced is associated with 0.45 more ships with hyperdrive capability. With an intercept of about 1.3, the number of ships with hyperdrive is roughly half of all ships plus one.
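
The arithmetic behind "half of all ships plus one" can be checked with the coefficients printed in the summary above. This is a minimal sketch: the ship counts 10 and 20 are hypothetical inputs, not data, and predict_starships() is a helper written for this example.

```r
# Coefficients copied from the model summary
intercept <- 1.31637
slope     <- 0.45081

# Point prediction from the fitted line: starships = intercept + slope * ships
predict_starships <- function(ships) intercept + slope * ships

hypothetical_ships <- c(10, 20)                # made-up totals for illustration
model_based   <- predict_starships(hypothetical_ships)
rule_of_thumb <- hypothetical_ships / 2 + 1    # "half of all ships plus one"

data.frame(ships = hypothetical_ships, model_based, rule_of_thumb)

# Correlation implied by the reported R-squared (positive, since the slope is positive)
sqrt(0.8681)
```

For 10 and 20 ships the fitted line gives about 5.8 and 10.3 starships, against 6 and 11 from the rule of thumb, so the shortcut slightly overstates the model. With the fitted object available, predict(starship_model, newdata = data.frame(ships = c(10, 20))) returns the same point estimates up to rounding. The model rests on only seven films, so these figures are rough guides at best.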

Predictions

There is a strong correlation between total number of ships and the number of ships with hyperdrive. The model predicts the number of ships with hyperdrive is roughly half of all ships plus one.

We predict that Episode VIII will have more ships overall than Episode VII, and that it will have a very high percentage of ships with hyperdrive.

