class: center, middle, inverse, title-slide

.title[
# STA 235H - Review Session & Final Trivia
]
.subtitle[
## Fall 2022
]
.author[
### McCombs School of Business, UT Austin
]

---

<!-- <script type="text/javascript"> -->
<!-- MathJax.Hub.Config({ -->
<!--   "HTML-CSS": { -->
<!--     preferredFont: null, -->
<!--     webFont: "Neo-Euler" -->
<!--   } -->
<!-- }); -->
<!-- </script> -->

<style type="text/css">
.small .remark-code { /*Change made here*/
  font-size: 80% !important;
}
.tiny .remark-code { /*Change made here*/
  font-size: 80% !important;
}
</style>

# Rules of Final Trivia

1) **.darkorange[Form groups]**: 2 or 3 students (no more, no less).

--

2) **.darkorange[Choose a name for your group]**: You can be funny or classic.

--

3) **.darkorange[You need to complete all the questions]**: It doesn't matter if you don't know the answer! Make your best guess.

--

4) **.darkorange[Ask questions]**: I will give your team time to complete each question; once the time is up, you will submit your answers and we will check them.

  - If something isn't clear, **now is the time to ask**.

--

5) **.darkorange[There are prizes]**: At the end of the session, we will crown the three teams that perform the best. If there is a tie in scores, the team that submits their answers the fastest moves up.

--

.center[**.darkorange[Note:]** All slides and answers will be posted on Thursday at 2pm. Make sure to take notes!]

---
background-position: 50% 50%
class: left, bottom, inverse

.big[
Regressions
]

---

# Insurance costs

In this question, we are looking at insurance costs.
We have data on insurance premiums in addition to individual data from the customers:

```r
insurance <- read.csv("https://raw.githubusercontent.com/maibennett/website_github/master/exampleSite/content/files/data/insurance.csv")

head(insurance)
```

```
##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622
```

We want to find the relation between these covariates and the outcome (`charges`).

---

# Question 1

.small[

```
##
## Call:
## lm(formula = charges ~ smoker * bmi, data = insurance)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -19768.0  -4400.7   -869.5   2957.7  31055.9
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)     5879.42     976.87   6.019 2.27e-09 ***
## smokeryes     -19066.00    2092.03  -9.114  < 2e-16 ***
## bmi               83.35      31.27   2.666  0.00778 **
## smokeryes:bmi   1389.76      66.78  20.810  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6161 on 1334 degrees of freedom
## Multiple R-squared:  0.7418, Adjusted R-squared:  0.7412
## F-statistic:  1277 on 3 and 1334 DF,  p-value: < 2.2e-16
```
]

---

# Question 2

.small[

```r
lm1 <- lm(charges ~ smoker*bmi, data = insurance)

summary(lm1)
```

```
##
## Call:
## lm(formula = charges ~ smoker * bmi, data = insurance)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -19768.0  -4400.7   -869.5   2957.7  31055.9
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)     5879.42     976.87   6.019 2.27e-09 ***
## smokeryes     -19066.00    2092.03  -9.114  < 2e-16 ***
## bmi               83.35      31.27   2.666  0.00778 **
## smokeryes:bmi   1389.76      66.78  20.810  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6161 on 1334 degrees of freedom
## Multiple R-squared:  0.7418, Adjusted R-squared:  0.7412
## F-statistic:  1277 on 3 and 1334 DF,  p-value: < 2.2e-16
```
]

---

# Question 3

.small[

```r
lm1 <- lm(charges ~ smoker*bmi, data = insurance)

summary(lm1)
```

```
##
## Call:
## lm(formula = charges ~ smoker * bmi, data = insurance)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -19768.0  -4400.7   -869.5   2957.7  31055.9
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)     5879.42     976.87   6.019 2.27e-09 ***
## smokeryes     -19066.00    2092.03  -9.114  < 2e-16 ***
## bmi               83.35      31.27   2.666  0.00778 **
## smokeryes:bmi   1389.76      66.78  20.810  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6161 on 1334 degrees of freedom
## Multiple R-squared:  0.7418, Adjusted R-squared:  0.7412
## F-statistic:  1277 on 3 and 1334 DF,  p-value: < 2.2e-16
```
]

---
background-position: 50% 50%
class: left, bottom, inverse

.big[
Causal Inference
]

---

# Does Academic Probation Work?

Academic probation is a tool used by most universities to make sure students maintain minimum academic standards. In this section, we will analyze data from a large Canadian university regarding the effects of academic probation, originally used in Lindo, Sanders, and Oreopoulos’ (2010) paper, “Ability, Gender, and Performance Standards: Evidence from Academic Probation”.

.tiny[

```r
probation <- read.csv("https://raw.githubusercontent.com/maibennett/website_github/master/exampleSite/content/files/data/probation.csv")
```
]

.pull-left[
.small[
- `creditsY`: Credits attempted in year Y = 1,2.
- `credits_earnedY`: Credits earned in year Y = 1,2.
- `GPA_yearY`: GPA at the end of year Y = 1,2.
- `CGPA_yearY`: Cumulative GPA at the end of year Y = 1,2.
- `sex`: Gender of the student (M: Male, F: Female).
- `age_at_entry`: Age of the student when they first enrolled.
]]

.pull-right[
.small[
- `gradinY`: Student graduated in Y years, Y = 4, 5, or 6.
- `left_school`: Whether the student left school or not after the first assessment.
- `hsgrade_pct`: The student's high school grade percentile.
- `probation_year1`: Whether the student was on academic probation by the end of year 1.
- `suspended_year1`: Whether the student was suspended by the end of year 1.
]]

---

# Question 4

.small[

```r
summary(lm_robust(left_school ~ probation_year1, data = probation))
```

```
##
## Call:
## lm_robust(formula = left_school ~ probation_year1, data = probation)
##
## Standard error type:  HC2
##
## Coefficients:
##                 Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper    DF
## (Intercept)      0.03755  0.0009849   38.13 5.691e-313  0.03562  0.03948 44360
## probation_year1  0.07165  0.0038290   18.71  7.761e-78  0.06415  0.07916 44360
##
## Multiple R-squared:  0.01481 , Adjusted R-squared:  0.01479
## F-statistic: 350.2 on 1 and 44360 DF,  p-value: < 2.2e-16
```
]

---

# Question 5

.tiny[

```r
probation <- probation %>% filter(left_school==0)

summary(lm(GPA_year2 ~ probation_year1 + credits1 + credits_earned1 + GPA_year1 +
             factor(sex) + age_at_entry + hsgrade_pct, data = probation))
```

```
##
## Call:
## lm(formula = GPA_year2 ~ probation_year1 + credits1 + credits_earned1 +
##     GPA_year1 + factor(sex) + age_at_entry + hsgrade_pct, data = probation)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.3545 -0.3239  0.0646  0.3708  2.5300
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)      1.0557184  0.0803853  13.133  < 2e-16 ***
## probation_year1  0.2827426  0.0132546  21.332  < 2e-16 ***
## credits1        -0.0069394  0.0120652  -0.575   0.5652
## credits_earned1  0.0245169  0.0116370   2.107   0.0351 *
## GPA_year1        0.6971113  0.0059157 117.842  < 2e-16 ***
## factor(sex)M    -0.0957468  0.0061847 -15.481  < 2e-16 ***
## age_at_entry    -0.0248064  0.0041120  -6.033 1.63e-09 ***
## hsgrade_pct      0.0032055  0.0001307  24.529  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5817 on 38322 degrees of freedom
##   (3857 observations deleted due to missingness)
## Multiple R-squared:  0.5139, Adjusted R-squared:  0.5139
## F-statistic:  5789 on 7 and 38322 DF,  p-value: < 2.2e-16
```
]

---

# Question 6

.tiny[

```r
probation <- probation %>% filter(left_school==0)

summary(lm(GPA_year2 ~ probation_year1 + credits1 + credits_earned1 + GPA_year1 +
             factor(sex) + age_at_entry + hsgrade_pct, data = probation))
```

```
##
## Call:
## lm(formula = GPA_year2 ~ probation_year1 + credits1 + credits_earned1 +
##     GPA_year1 + factor(sex) + age_at_entry + hsgrade_pct, data = probation)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.3545 -0.3239  0.0646  0.3708  2.5300
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)      1.0557184  0.0803853  13.133  < 2e-16 ***
## probation_year1  0.2827426  0.0132546  21.332  < 2e-16 ***
## credits1        -0.0069394  0.0120652  -0.575   0.5652
## credits_earned1  0.0245169  0.0116370   2.107   0.0351 *
## GPA_year1        0.6971113  0.0059157 117.842  < 2e-16 ***
## factor(sex)M    -0.0957468  0.0061847 -15.481  < 2e-16 ***
## age_at_entry    -0.0248064  0.0041120  -6.033 1.63e-09 ***
## hsgrade_pct      0.0032055  0.0001307  24.529  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5817 on 38322 degrees of freedom
##   (3857 observations deleted due to missingness)
## Multiple R-squared:  0.5139, Adjusted R-squared:  0.5139
## F-statistic:  5789 on 7 and 38322 DF,  p-value: < 2.2e-16
```
]

---
background-position: 50% 50%
class: left, bottom, inverse

.big[
Prediction
]

---

# Candy, candy, candy

In this section, we will be predicting win percentage for candy bars! We have the following dataset for this:

.tiny[

```r
candy <- read.csv("https://raw.githubusercontent.com/maibennett/website_github/master/exampleSite/content/files/data/candy.csv")
```
]

.pull-left[
.small[
- `competitorname`: Name of the candy
- `chocolate`: Is it chocolate?
- `fruity`: Is it fruit flavored?
- `caramel`: Is there caramel in the candy?
- `peanutalmondy`: Does it contain peanuts, peanut butter or almonds?
- `nougat`: Does it contain nougat?
- `crispedricewafer`: Does it contain crisped rice, wafers, or a cookie component?
]]

.pull-right[
.small[
- `hard`: Is it a hard candy?
- `bar`: Is it a bar?
- `pluribus`: Is it one of many candies in a bag/box?
- `sugarpercent`: The percentile of sugar it falls under within the data set.
- `pricepercent`: The unit price percentile compared to the rest of the set.
- `winpercent`: The overall win percentage according to 269,000 matchups.
]]

---

# Question 7

<img src="f2022_sta235h_17_FinalTrivia_files/figure-html/rd_linear-1.svg" style="display: block; margin: auto;" />

---

# Question 8

Using the code provided [here](https://www.magdalenabennett.com/files/data/Trivia/f2022_sta235h_14_FinalTrivia.R), how does your previous model perform? Write down the number (use two decimal places).

---

# Question 9

Your turn. Fit a random forest to predict the outcome of interest. Tune only the number of randomly selected predictors, and use the code provided as a starting point.

- Provide your code, the optimal value of `mtry`, and the performance of your model.
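
If you are unsure where to begin, here is a minimal sketch of how tuning `mtry` could look. It assumes the `caret` package (with `randomForest` behind `method = "rf"`) is installed. The seed, the 80/20 split, the 10-fold cross-validation, and object names such as `candy_rf` and `tune_grid` are illustrative choices; they are not taken from the provided script, so your numbers may differ from the official answer.

```r
library(caret)  # assumption: caret and randomForest are installed

candy <- read.csv("https://raw.githubusercontent.com/maibennett/website_github/master/exampleSite/content/files/data/candy.csv")

# Drop the candy name: it identifies the row and is not a predictor
candy_rf <- candy[, setdiff(names(candy), "competitorname")]

set.seed(100)  # illustrative seed, only for reproducibility

# Illustrative 80/20 split into training and testing data
train_rows <- sample(seq_len(nrow(candy_rf)), size = floor(0.8 * nrow(candy_rf)))
train_data <- candy_rf[train_rows, ]
test_data  <- candy_rf[-train_rows, ]

# Candidate values for mtry: from 1 up to the number of predictors
n_pred <- ncol(train_data) - 1
tune_grid <- expand.grid(mtry = seq_len(n_pred))

# Random forest tuned only over mtry, using 10-fold cross-validation
rf_fit <- train(winpercent ~ ., data = train_data,
                method = "rf",
                trControl = trainControl(method = "cv", number = 10),
                tuneGrid = tune_grid)

# Value of mtry with the lowest cross-validated RMSE
best_mtry <- rf_fit$bestTune$mtry

# Performance on the held-out data
pred <- predict(rf_fit, newdata = test_data)
test_rmse <- RMSE(pred, test_data$winpercent)
```

Printing `best_mtry` and `test_rmse` gives the two numbers the question asks for under these assumptions.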