class: center, middle, inverse, title-slide

.title[
# STA 235H - Review Session & Final Trivia
]
.subtitle[
## Fall 2022
]
.author[
### McCombs School of Business, UT Austin
]

---


# Rules of Final Trivia

1) **.darkorange[Form groups]**: 2 or 3 students (no more, no less).

2) **.darkorange[Choose a name for your group]**: You can be funny or classic.

3) **.darkorange[You need to complete all the questions]**: It doesn't matter if you don't know the answer! Make your best guess.

4) **.darkorange[Ask questions]**: I will give your team time to complete each question; once the time is up, you will submit your answers and we will check them together.

- If something isn't clear, **now is the time to ask**.

5) **.darkorange[There are prizes]**: At the end of the session, we will crown the three teams that perform the best. If there is a tie in scores, the team that submits their answers the fastest moves up.

.center[**.darkorange[Note:]** All slides and answers will be posted on Thursday at 2pm. Make sure to take notes!]

---
background-position: 50% 50%
class: left, bottom, inverse
.big[
Regressions
]
---
# Insurance costs

In this question, we are looking at insurance costs. We have data on insurance premiums in addition to individual data from the customers:

```r
insurance <- read.csv("https://raw.githubusercontent.com/maibennett/website_github/master/exampleSite/content/files/data/insurance.csv")

head(insurance)
```

```
##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622
```

We want to find the relation between these covariates and the outcome (`charges`).

---
# Question 1

.small[

```
## 
## Call:
## lm(formula = charges ~ smoker * bmi, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19768.0  -4400.7   -869.5   2957.7  31055.9 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5879.42     976.87   6.019 2.27e-09 ***
## smokeryes     -19066.00    2092.03  -9.114  < 2e-16 ***
## bmi               83.35      31.27   2.666  0.00778 ** 
## smokeryes:bmi   1389.76      66.78  20.810  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6161 on 1334 degrees of freedom
## Multiple R-squared:  0.7418,	Adjusted R-squared:  0.7412 
## F-statistic:  1277 on 3 and 1334 DF,  p-value: < 2.2e-16
```
]
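
As a reminder of how to read an interaction model like the one above, here is a minimal sketch that plugs the printed coefficients into the slopes they imply. The object names (`b_smoker`, `b_bmi`, `b_int`, `smoker_gap`) are illustrative and do not come from the original code.

```r
# Coefficients copied from the summary output above
b_smoker <- -19066.00  # smokeryes
b_bmi    <- 83.35      # bmi
b_int    <- 1389.76    # smokeryes:bmi

# Slope of charges with respect to bmi, by smoking status
slope_nonsmoker <- b_bmi          # smoker = "no"
slope_smoker    <- b_bmi + b_int  # smoker = "yes"

# Difference in predicted charges (smoker minus non-smoker) at a given bmi
smoker_gap <- function(bmi) b_smoker + b_int * bmi

slope_smoker    # 1473.11
smoker_gap(30)  # 22626.8
```

The main effect `smokeryes` on its own is the gap at `bmi = 0`, which lies outside the range of the data, so the gap is usually evaluated at a realistic value of `bmi`.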

---
# Question 2

.small[

```r
lm1 <- lm(charges ~ smoker*bmi, data = insurance)

summary(lm1)
```
]

---
# Question 3

.small[

```r
lm1 <- lm(charges ~ smoker*bmi, data = insurance)

summary(lm1)
```
]

---
background-position: 50% 50%
class: left, bottom, inverse
.big[
Causal Inference
]
---
# Does Academic Probation Work?

Academic probation is a tool widely used by universities to make sure students maintain minimum academic standards. In this section, we will analyze data from a large Canadian university on the effects of academic probation, originally used in Lindo, Sanders, and Oreopoulos’ (2010) paper, “Ability, Gender, and Performance Standards: Evidence from Academic Probation”.

.tiny[

```r
probation <- read.csv("https://raw.githubusercontent.com/maibennett/website_github/master/exampleSite/content/files/data/probation.csv")
```
]
.pull-left[
.small[
- `creditsY`: Credits attempted in year Y = 1,2.
- `credits_earnedY`: Credits earned in year Y = 1,2.
- `GPA_yearY`: GPA at the end of year Y = 1,2.
- `CGPA_yearY`: Cumulative GPA at the end of year Y = 1,2.
- `sex`: Gender of the student (M: Male, F: Female).
- `age_at_entry`: Age of the student when they first enrolled.]]

.pull-right[
.small[
- `gradinY`: Student graduated in Y years, Y = 4, 5, or 6.
- `left_school`: Whether the student left school or not after the first assessment.
- `hsgrade_pct`: Percentile of the student's high school grades.
- `probation_year1`: Whether the student was on academic probation by the end of year 1.
- `suspended_year1`: Whether the student was suspended by the end of year 1.]]

---
# Question 4

.small[

```r
library(estimatr)  # provides lm_robust()
summary(lm_robust(left_school ~ probation_year1, data = probation))
```

```
## 
## Call:
## lm_robust(formula = left_school ~ probation_year1, data = probation)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##                 Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper    DF
## (Intercept)      0.03755  0.0009849   38.13 5.691e-313  0.03562  0.03948 44360
## probation_year1  0.07165  0.0038290   18.71  7.761e-78  0.06415  0.07916 44360
## 
## Multiple R-squared:  0.01481 ,	Adjusted R-squared:  0.01479 
## F-statistic: 350.2 on 1 and 44360 DF,  p-value: < 2.2e-16
```
]
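
When the only regressor is a 0/1 indicator, the intercept is the mean of the outcome in the group coded 0, and the slope is the difference in means between the two groups. The sketch below first applies this to the estimates printed above, and then checks the identity on simulated data (not the probation data). All object names are illustrative.

```r
# Estimates copied from the lm_robust output above
b0 <- 0.03755  # (Intercept)
b1 <- 0.07165  # probation_year1

rate_no_probation <- b0       # share leaving school, not on probation
rate_probation    <- b0 + b1  # share leaving school, on probation

# Simulated check: the OLS slope equals the difference in group means
set.seed(1)
d <- rbinom(1000, 1, 0.2)              # simulated 0/1 indicator
y <- rbinom(1000, 1, 0.04 + 0.07 * d)  # simulated 0/1 outcome
fit <- lm(y ~ d)
diff_means <- mean(y[d == 1]) - mean(y[d == 0])

slope_matches <- isTRUE(all.equal(unname(coef(fit)["d"]), diff_means))
slope_matches
```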

---
# Question 5

.tiny[

```r
library(dplyr)  # provides %>% and filter()
probation <- probation %>% filter(left_school==0)

summary(lm(GPA_year2 ~ probation_year1 + credits1 + credits_earned1 + GPA_year1 + 
     factor(sex) + age_at_entry + hsgrade_pct, data = probation))
```

```
## 
## Call:
## lm(formula = GPA_year2 ~ probation_year1 + credits1 + credits_earned1 + 
##     GPA_year1 + factor(sex) + age_at_entry + hsgrade_pct, data = probation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3545 -0.3239  0.0646  0.3708  2.5300 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.0557184  0.0803853  13.133  < 2e-16 ***
## probation_year1  0.2827426  0.0132546  21.332  < 2e-16 ***
## credits1        -0.0069394  0.0120652  -0.575   0.5652    
## credits_earned1  0.0245169  0.0116370   2.107   0.0351 *  
## GPA_year1        0.6971113  0.0059157 117.842  < 2e-16 ***
## factor(sex)M    -0.0957468  0.0061847 -15.481  < 2e-16 ***
## age_at_entry    -0.0248064  0.0041120  -6.033 1.63e-09 ***
## hsgrade_pct      0.0032055  0.0001307  24.529  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5817 on 38322 degrees of freedom
##   (3857 observations deleted due to missingness)
## Multiple R-squared:  0.5139,	Adjusted R-squared:  0.5139 
## F-statistic:  5789 on 7 and 38322 DF,  p-value: < 2.2e-16
```
]

---
# Question 6

.tiny[

```r
library(dplyr)  # provides %>% and filter()
probation <- probation %>% filter(left_school==0)

summary(lm(GPA_year2 ~ probation_year1 + credits1 + credits_earned1 + GPA_year1 + 
     factor(sex) + age_at_entry + hsgrade_pct, data = probation))
```
]

---
background-position: 50% 50%
class: left, bottom, inverse
.big[
Prediction
]
---
# Candy, candy, candy

In this section, we will be predicting win percentage for candy bars! We have the following dataset for this:

.tiny[

```r
candy <- read.csv("https://raw.githubusercontent.com/maibennett/website_github/master/exampleSite/content/files/data/candy.csv")
```
]

.pull-left[
.small[
- `competitorname`: Name of the candy
- `chocolate`: Is it chocolate?
- `fruity`: Is it fruit flavored?
- `caramel`: Is there caramel in the candy?
- `peanutalmondy`: Does it contain peanuts, peanut butter or almonds?
- `nougat`: Does it contain nougat?
- `crispedricewafer`: Does it contain crisped rice, wafers, or a cookie component?
]]

.pull-right[
.small[
- `hard`: Is it a hard candy?
- `bar`: Is it a bar?
- `pluribus`: Is it one of many candies in a bag/box?
- `sugarpercent`: The percentile of sugar it falls under within the data set.
- `pricepercent`: The unit price percentile compared to the rest of the set.
- `winpercent`: The overall win percentage according to 269,000 matchups.]]

---
# Question 7

---
# Question 8

Using the code provided [here](https://www.magdalenabennett.com/files/data/Trivia/f2022_sta235h_14_FinalTrivia.R), how does your previous model perform? Write down the number (use two decimal places).

---
# Question 9

Your turn. Fit a random forest to predict the outcome of interest. Tune only the number of randomly selected predictors, and use the code provided as a starting point.

- Provide your code, the optimal number of `mtry`, and the performance of your model.
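
For reference, this is a minimal sketch of the general workflow, run on simulated stand-in data rather than the candy data. It assumes the `caret` and `randomForest` packages are installed. The choice of `method = "rf"`, 5-fold cross-validation, the 80/20 split, and all object names are assumptions made for illustration, and the provided script may set things up differently.

```r
library(caret)  # assumed installed; method = "rf" also needs the randomForest package

set.seed(100)

# Simulated stand-in data (NOT the candy data): 5 predictors, numeric outcome
n <- 200
sim <- data.frame(x1 = rbinom(n, 1, 0.5), x2 = rbinom(n, 1, 0.3),
                  x3 = runif(n), x4 = runif(n), x5 = rnorm(n))
sim$y <- 40 + 15 * sim$x1 - 5 * sim$x2 + 10 * sim$x3 + rnorm(n, sd = 5)

# Hold out 20% of the rows for testing
train_rows <- sample(seq_len(n), size = 0.8 * n)
train_data <- sim[train_rows, ]
test_data  <- sim[-train_rows, ]

# Tune only mtry (number of randomly selected predictors) by cross-validation
rf_fit <- train(y ~ ., data = train_data, method = "rf",
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = expand.grid(mtry = 1:5),
                ntree = 200)  # passed through to randomForest()

best_mtry <- rf_fit$bestTune$mtry

# Out-of-sample performance measured as RMSE on the held-out rows
pred <- predict(rf_fit, newdata = test_data)
rmse_test <- sqrt(mean((test_data$y - pred)^2))

best_mtry
rmse_test
```

With the candy data, the identifier column `competitorname` would need to be dropped before fitting, and the `mtry` grid can run at most up to the number of remaining predictors.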