class: center, middle, inverse, title-slide

.title[
# STA 235H - Multiple Regression: Binary Outcomes
]
.subtitle[
## Fall 2023
]
.author[
### McCombs School of Business, UT Austin
]

---

<!-- <script type="text/javascript"> -->
<!-- MathJax.Hub.Config({ -->
<!-- "HTML-CSS": { -->
<!-- preferredFont: null, -->
<!-- webFont: "Neo-Euler" -->
<!-- } -->
<!-- }); -->
<!-- </script> -->

<style type="text/css">
.small .remark-code { /*Change made here*/
  font-size: 80% !important;
}
.tiny .remark-code { /*Change made here*/
  font-size: 90% !important;
}
</style>

# Binary Outcomes

.pull-left[
- You have probably used **.darkorange[binary outcomes]** in regressions, but do you know the issues that they may bring to the table?

.box-6[What can we do about them?]
]

.pull-right[
![](https://media.giphy.com/media/HUkOv6BNWc1HO/giphy.gif)
]

---

# How to handle binary outcomes?

.center2[
.box-2tL[Linear Probability Model]

.box-4tL[Logistic Regression]
]

---

# Linear Probability Models

- A Linear Probability Model is just a **.darkorange[traditional regression with a binary outcome]**

--

- Something interesting about a binary outcome is that the expected value of `\(Y\)` is actually a probability!

`$$E[Y|X_1,...,X_p] = Pr(Y = 0|X_1,...,X_p)\cdot 0 + Pr(Y = 1|X_1,...,X_p)\cdot 1$$`
`$$= Pr(Y = 1|X_1,...,X_p)$$`

---

# How to interpret an LPM?

- `\(\hat{\beta}\)`'s are interpreted as **.darkorange[change in probability]**

--

- Example:

`$$GradeA = \beta_0 + \beta_1 \cdot Study + \varepsilon$$`

- `\(\hat{\beta}_1\)` is the average change in probability of getting an A if I study one more hour.
--

- *Studying one more hour is associated with an average increase in the probability of getting an A of `\(\hat{\beta}_1\times100\)` **.darkorange[percentage points]**.*

--

`$$\widehat{GradeA} = 0.2 + 0.125 \cdot Study$$`

- *Studying one more hour is associated with an average increase in the probability of getting an A of `\(12.5\)` **.darkorange[percentage points]**.*

---

# Side note: Difference between percent change and change in percentage points

- Imagine that if you **.darkorange[study 4hrs]** your probability of getting an A is, on average, **.darkorange[70%]**, and if you **.darkorange[study for 5hrs]** that probability increases to **.darkorange[75%]**.

--

- Then, we can say that your probability increased by **.darkorange[5 percentage points]**.

--

- Why is this not the same as saying that your probability increased by 5%?

--

- Remember percent change?

`$$\frac{y_1 - y_0}{y_0} = \frac{75-70}{70} = 0.0714$$`

--

- This means that, in this case, a **.darkorange[5 percentage point increase]** is equivalent to a **.darkorange[7.1% increase in probability]**.

--

.box-7trans[Be aware of the difference between percentage points and percent!]

---

# Let's look at an example

- Home Mortgage Disclosure Act Data (HMDA)

.small[
```r
hmda = read.csv("https://raw.githubusercontent.com/maibennett/sta235/main/exampleSite/content/Classes/Week3/2_OLS_Issues/data/hmda.csv", stringsAsFactors = TRUE)

head(hmda)
```

```
##   deny pirat hirat     lvrat chist mhist phist unemp selfemp insurance condomin
## 1   no 0.221 0.221 0.8000000     5     2    no   3.9      no        no       no
## 2   no 0.265 0.265 0.9218750     2     2    no   3.2      no        no       no
## 3   no 0.372 0.248 0.9203980     1     2    no   3.2      no        no       no
## 4   no 0.320 0.250 0.8604651     1     2    no   4.3      no        no       no
## 5   no 0.360 0.350 0.6000000     1     1    no   3.2      no        no       no
## 6   no 0.240 0.170 0.5105263     1     1    no   3.9      no        no       no
##   afam single hschool
## 1   no     no     yes
## 2   no    yes     yes
## 3   no     no     yes
## 4   no     no     yes
## 5   no     no     yes
## 6   no     no     yes
```
]

---

# Probability of someone getting a mortgage loan denied?
- Getting a mortgage denied (`deny = 1`) based on race, conditional on the payments-to-income ratio (`pirat`)

.pull-left-little_l[
.small[
```r
hmda = hmda %>% mutate(deny = as.numeric(deny) - 1)

summary(lm(deny ~ pirat + afam, data = hmda))
```

```
## 
## Call:
## lm(formula = deny ~ pirat + afam, data = hmda)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62526 -0.11772 -0.09293 -0.05488  1.06815 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.09051    0.02079  -4.354 1.39e-05 ***
## pirat        0.55919    0.05987   9.340  < 2e-16 ***
## afamyes      0.17743    0.01837   9.659  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3123 on 2377 degrees of freedom
## Multiple R-squared:  0.076,  Adjusted R-squared:  0.07523 
## F-statistic: 97.76 on 2 and 2377 DF,  p-value: < 2.2e-16
```
]
]

--

.small[
.pull-right_little_l[
<br>
<br>
<br>
- Holding the payment-to-income ratio constant, an AA client has a probability of getting their loan denied that is **.darkorange[18 pp higher]**, <u>on average, than a non-AA client</u>.

- Being AA is associated with an <u>average</u> increase of **.darkorange[0.177 in the probability]** of getting a loan denied <u>compared to a non-AA client</u>, holding the payment-to-income ratio constant.
]
]

---

# How does this LPM look?

<img src="f2023_sta235h_5_reg_files/figure-html/lpm1-1.svg" style="display: block; margin: auto;" />

---

# Issues with an LPM?

- **.darkorange[Main problems]**:

  - Non-normality of the error term
  - Heteroskedasticity (i.e. the variance of the error term is not constant)
  - Predictions can be outside [0,1]
  - LPM imposes a linearity assumption

---

# Issues with an LPM?
- **.darkorange[Main problems]**: - Non-normality of the error term `\(\rightarrow\)` **.darkorange[Hypothesis testing]** - Heteroskedasticity `\(\rightarrow\)` **.darkorange[Validity of SE]** - Predictions can be outside [0,1] `\(\rightarrow\)` **.darkorange[Issues for prediction]** - LPM imposes linearity assumption `\(\rightarrow\)` **.darkorange[Too strict?]** --- # Are there solutions? .pull-left[ ![](https://media.giphy.com/media/xT5LMHMVvWbyDAtYQ0/giphy.gif) ] .pull-right[ Some solutions we will take into account: - **.darkorange[Don't use small samples]**: With the CLT, non-normality shouldn't matter much. - **.darkorange[Use robust standard errors]**: Package `estimatr` in R is great! ] --- # Run again with robust standard errors .small[ ```r library(estimatr) model1 <- lm(deny ~ pirat + afam, data = hmda) model2 <- lm_robust(deny ~ pirat + afam, data = hmda) ``` ] .small[ <table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> (1) </th> <th style="text-align:center;"> (2) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> −0.091*** </td> <td style="text-align:center;"> −0.091** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.021) </td> <td style="text-align:center;"> (0.031) </td> </tr> <tr> <td style="text-align:left;"> pirat </td> <td style="text-align:center;"> 0.559*** </td> <td style="text-align:center;"> 0.559*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.060) </td> <td style="text-align:center;"> (0.095) </td> </tr> <tr> <td style="text-align:left;"> afamyes </td> <td style="text-align:center;"> 0.177*** </td> <td style="text-align:center;"> 0.177*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.018) </td> <td 
style="text-align:center;"> (0.025) </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table>
]

- Can you interpret these parameters? Do they make sense?

---

.center2[
.box-4[Most issues are solvable, but...]

.box-4[What about prediction?]
]

---

# Logistic Regression

- Typically used in the context of binary outcomes (*Probit is another popular one*)

- **.darkorange[Nonlinear function]** to model the conditional probability function of a binary outcome:

`$$Pr(Y = 1|X_1,...,X_p) = F(\beta_0 + \beta_1 X_1 + ... + \beta_p X_p)$$`

Where in a **.darkorange[logistic regression]**: `\(F(x) = \frac{1}{1+e^{-x}}\)`

- *In the LPM, `\(F(x) = x\)`*

--

- A logistic regression doesn't look pretty:

`$$Pr(Y=1|X_1,...,X_p) = \frac{1}{1+e^{-(\beta_0 + \beta_1X_1+...+\beta_pX_p)}}$$`

--

.box-7trans[A regression with log(Y) is NOT a logistic regression!]

---

# How does this look in a plot?

<img src="f2023_sta235h_5_reg_files/figure-html/logit1-1.svg" style="display: block; margin: auto;" />

---

# When will we use logistic regression?

- As you discovered in the readings, logit is great for prediction (**.darkorange[much better]** than the LPM).

- For explanation, however, the **.darkorange[LPM simplifies interpretation]**.

--

.box-6Trans[Use LPM for explanation and logit for prediction]

--

.box-7Trans[(but remember robust SE!)]

---

# Takeaway points

.pull-left[
- Always make sure to **.darkorange[check your data]**:

  - What are we analyzing? Does the data behave as I would expect? Should I exclude observations?

- For LPM, **.darkorange[always include robust standard errors]**!
]

.pull-right[
<br>
<br>
![](https://media.giphy.com/media/xT5LMNLNu7ZMpRgVgc/giphy.gif)
]

<!-- pagedown::chrome_print('C:/Users/mc72574/Dropbox/Hugo/Sites/sta235/exampleSite/content/Classes/Week2/4_OLS_probs/f2021_sta235h_4_reg.html') -->
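
---

# Appendix: Fitting a logit in R

The slides contrast the LPM and logistic regression but do not show the logit fit in code. A minimal sketch, reusing the same `hmda` data and formula from the earlier chunks (the `Appendix` framing and object names `lpm`/`logit` are ours; `glm()` with `family = binomial` is base R's logistic regression):

.small[
```r
library(estimatr)

hmda <- read.csv("https://raw.githubusercontent.com/maibennett/sta235/main/exampleSite/content/Classes/Week3/2_OLS_Issues/data/hmda.csv", stringsAsFactors = TRUE)
hmda$deny <- as.numeric(hmda$deny) - 1  # recode outcome to 0/1, as before

# LPM with robust standard errors (explanation)
lpm <- lm_robust(deny ~ pirat + afam, data = hmda)

# Logistic regression (prediction)
logit <- glm(deny ~ pirat + afam, data = hmda, family = binomial)

# Fitted probabilities from the logit are always strictly between 0 and 1,
# while the LPM's linear predictions may fall outside [0,1]
range(fitted(logit))
```
]

Note that `fitted()` on a `glm` object returns response-scale values (probabilities), the equivalent of `predict(logit, type = "response")`.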