Model Step 2 - Model Card

Published

October 1, 2022

Background

A model card provides brief, transparent, responsible reporting for a trained machine learning model. Expand the code chunk below to inspect the code that loads the model, training data, and evaluation data.

Code
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.0.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(vetiver)
library(pins)
library(yardstick)
## For binary classification, the first factor level is assumed to be the event.
## Use the argument `event_level = "second"` to alter this as needed.
## 
## Attaching package: 'yardstick'
## The following object is masked from 'package:readr':
## 
##     spec
library(glue)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows


board <- pins::board_rsconnect()
## Connecting to RSC 2022.09.0 at <https://colorado.rstudio.com/rsc>
v <- vetiver_pin_read(board, params$name, version = params$version)
v_meta <- pin_meta(board, params$name)

con <- odbc::dbConnect(odbc::odbc(), "Content DB", timeout = 10)
bike_model_data <- tbl(con, "bike_model_data")


train_start_date <- lubridate::as_date(v$metadata$user$train_dates[1])
train_end_date <- lubridate::as_date(v$metadata$user$train_dates[2])
test_start_date <- lubridate::as_date(v$metadata$user$test_dates[1])
test_end_date <- lubridate::as_date(v$metadata$user$test_dates[2])

data <- bike_model_data %>%
  filter(date >= train_start_date) %>%
  collect() %>%
  mutate(
    data_type = case_when(
      date <= train_end_date ~ "train",
      date <= test_end_date ~ "test",
      TRUE ~ "latest"
    )
  )

train_data <- data %>%
  filter(data_type == "train")

test_data <- data %>%
  filter(data_type == "test")

Model details

  • Developed by Sam Edwardes (RStudio - Solutions Engineer)
  • A ranger regression modeling workflow using 7 features
  • This model was developed to predict how many capital bikeshare bikes will be available for a given station based on the time of day, and day of the week.
  • The data used to train the model was obtained from the capital bikeshare API.
  • Version 63032 of this model was published at 2022-10-01 09:02:00
  • If you have questions about this model, please create an issue on our GitHub repo: https://github.com/sol-eng/bike_predict/issues.

Intended use

  • The primary intended use of this model is to demonstrate an end-to-end data science workflow using RStudio Connect.

Metrics

  • The metrics used to evaluate this model are:
    • Root Mean Squared Error (RMSE),
    • R Squared (RSQ), and
    • Mean Absolute Error (MAE).
  • We chose these metrics because because they are the most common metrics for assessing the performance of regression models, and that they are well understood by the data science community.

Training data & evaluation data

The training data set for this model was was obtained from the capital bikeshare API. The Model is retrained on a daily basis using refreshed data.

  • Training date range: 2022-09-19 to 2022-09-29.
  • Evaluation date range: 2022-09-30 to 2022-10-01.

The data set for this model has the “prototype” or signature:

Code
glimpse(v$ptype)
Rows: 1
Columns: 7
$ id    <chr> "1"
$ hour  <dbl> 0
$ date  <date> 2022-09-19
$ month <dbl> 9
$ dow   <chr> "Monday"
$ lat   <dbl> 38.85897
$ lon   <dbl> -77.05323

Below is a summary of the training data:

Code
train_data %>%
  select(-n_bikes) %>%
  skimr::skim()
Data summary
Name Piped data
Number of rows 91296
Number of columns 8
_______________________
Column type frequency:
character 3
Date 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
id 0 1 1 3 0 698 0
dow 0 1 6 9 0 7 0
data_type 0 1 5 5 0 1 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2022-09-19 2022-09-29 2022-09-24 11

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
hour 0 1 10.95 6.90 0.00 4.00 10.00 16.00 22.00 ▇▅▅▅▇
month 0 1 9.00 0.00 9.00 9.00 9.00 9.00 9.00 ▁▁▇▁▁
lat 0 1 38.91 0.06 38.78 38.88 38.90 38.94 39.13 ▂▇▃▁▁
lon 0 1 -77.06 0.09 -77.37 -77.09 -77.04 -77.01 -76.83 ▁▁▆▇▁

Below is a summary of the evaluation data:

Code
test_data %>%
  select(-n_bikes) %>%
  skimr::skim()
Data summary
Name Piped data
Number of rows 11866
Number of columns 8
_______________________
Column type frequency:
character 3
Date 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
id 0 1 1 3 0 698 0
dow 0 1 6 8 0 2 0
data_type 0 1 4 4 0 1 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2022-09-30 2022-10-01 2022-09-30 2

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
hour 0 1 8.94 6.80 0.00 4.00 8.00 14.00 22.00 ▇▅▂▂▃
month 0 1 9.29 0.46 9.00 9.00 9.00 10.00 10.00 ▇▁▁▁▃
lat 0 1 38.91 0.06 38.78 38.88 38.90 38.94 39.13 ▂▇▃▁▁
lon 0 1 -77.06 0.09 -77.37 -77.09 -77.04 -77.01 -76.83 ▁▁▆▇▁

Quantitative analyses

Overall model performance against the model evaluation data.

Code
# Explicitly define packages used for model metrics by vetiver so that Connect 
# is able to redeploy.
library(slider)

latest_metrics %>%
  kable() %>%
  kable_material()
.index .n .metric .estimator .estimate
2022-09-30 8376 rmse standard 4.0905045
2022-09-30 8376 rsq standard 0.3249633
2022-09-30 8376 mae standard 3.2270999
2022-10-01 3490 rmse standard 4.2319364
2022-10-01 3490 rsq standard 0.2025737
2022-10-01 3490 mae standard 3.2758238

A comparison of the model prediction vs. the ground truth for the model evaluation data.

Code
preds %>%
  ggplot(aes(n_bikes, .pred)) +
  geom_abline(slope = 1, lty = 2, color = "gray60", size = 1.2) +
  geom_jitter(alpha = 0.5, width = 0.4) +
  labs(
    title = "Model Performance by Prediction",
    subtitle = glue("Using the evaluation data from {test_start_date} to {test_end_date}"),
    x = "Acutal Number of Bikes",
    y = "Predicted Number of Bikes"
  )

A visual assessment of model performance by location.

Code
preds %>%
  mutate(.resid = n_bikes - .pred) %>%
  ggplot(aes(lon, lat, color = .resid)) +
  geom_point(alpha = 0.8) +
  scale_color_gradient2() +
  coord_fixed() + 
  labs(
    title = "Model Performance by Location",
    subtitle = glue("Using the evaluation data from {test_start_date} to {test_end_date}"),
    x = "Longitude",
    y = "Latitude",
    color = "Residual"
  )

Ethical considerations

  • We considered the ethical impact the use of the model may have. We assessed that the overall risk due to an incorrect model prediction is relatively low.

Caveats & recommendations