class: center, middle, inverse, title-slide

.title[
# STA 235H - Multiple Regression: Outliers
]
.subtitle[
## Fall 2023
]
.author[
### McCombs School of Business, UT Austin
]

---

<!-- <script type="text/javascript"> -->
<!-- MathJax.Hub.Config({ -->
<!-- "HTML-CSS": { -->
<!-- preferredFont: null, -->
<!-- webFont: "Neo-Euler" -->
<!-- } -->
<!-- }); -->
<!-- </script> -->

<style type="text/css">
.small .remark-code { /*Change made here*/
  font-size: 80% !important;
}

.tiny .remark-code { /*Change made here*/
  font-size: 90% !important;
}
</style>

<br>
<br>
<br>
<br>
<br>
<br>

.box-5Trans[Why should we inspect our data before doing anything else?]

---

# Identifying outliers

- How do we **.darkorange[identify outliers]**?

--

  - Visual inspection (e.g. plots, tables)

  - Creating thresholds (e.g. z-scores, IQR)

--

- There is **.darkorange[no definitive way to identify outliers]**

  - Like the characterization of pornography, "I know it when I see it" (P. Stewart, 1964)

---

# HMDA Data for Bastrop County

- 2017 data collected under the Home Mortgage Disclosure Act (HMDA) for Bastrop County (near Austin)

--

<img src="f2023_sta235h_4_outliers_files/figure-html/hist_hmda-1.svg" style="display: block; margin: auto;" />

---

# Association between loan amount and income

<img src="f2023_sta235h_4_outliers_files/figure-html/loan_income-1.svg" style="display: block; margin: auto;" />

---

# Identifying outliers

<img src="f2023_sta235h_4_outliers_files/figure-html/loan_income2-1.svg" style="display: block; margin: auto;" />

---

# Association with complete data

<img src="f2023_sta235h_4_outliers_files/figure-html/loan_income3-1.svg" style="display: block; margin: auto;" />

---

# Association after removing outliers

<img src="f2023_sta235h_4_outliers_files/figure-html/loan_income4-1.svg" style="display: block; margin: auto;" />

---

# Compare both coefficients: Complete data

```r
summary(lm(loan_amount_000s ~ applicant_income_000s, data = hmda))
```

```
## 
## Call:
## lm(formula = loan_amount_000s ~ applicant_income_000s, data = hmda)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -458.93  -36.97   -8.77   35.47  365.27 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           141.05028    4.15313   33.96   <2e-16 ***
## applicant_income_000s   0.84000    0.03663   22.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67.04 on 875 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.3754,	Adjusted R-squared:  0.3747 
## F-statistic: 525.8 on 1 and 875 DF,  p-value: < 2.2e-16
```

---

# Compare both coefficients: Data without outliers

```r
summary(lm(loan_amount_000s ~ applicant_income_000s, data = hmda_without_outliers))
```

```
## 
## Call:
## lm(formula = loan_amount_000s ~ applicant_income_000s, data = hmda_without_outliers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -272.22  -36.09   -6.82   34.12  360.06 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           133.52408    4.47317   29.85   <2e-16 ***
## applicant_income_000s   0.92376    0.04171   22.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 64.82 on 873 degrees of freedom
## Multiple R-squared:  0.3597,	Adjusted R-squared:  0.359 
## F-statistic: 490.5 on 1 and 873 DF,  p-value: < 2.2e-16
```

---

# What to do with outliers?

1. **.darkorange[Check them!]**

  - Make sure there's no coding error; try to understand what's happening there.

--

2a. **.darkorange[If they are wrongly coded]**:

  - You can remove them, always adding a note explaining why you did so.
  
  - Be aware of sample selection!

--

2b. **.darkorange[If they are correctly coded]**:

  - Run the analysis both with and without outliers (don't just drop them!).
  
  - Robust results do not depend exclusively on a few observations.

---

background-position: 50% 50%
class: center, middle

.box-3LA[Let's do some exercises!]
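
---

# Sketch: flagging outliers and comparing fits

A minimal sketch of the threshold ideas (z-scores, IQR) and the "with and without outliers" comparison from the earlier slides. It uses **simulated data, not the HMDA data**. The names `sim`, `flag_z`, and `flag_iqr` are hypothetical. The cutoffs (|z| > 3 and 1.5 × IQR) are common conventions, not rules stated in these slides.

```r
# Simulated data (assumption): NOT the HMDA data used in the slides
set.seed(235)
n <- 200
income <- rlnorm(n, meanlog = 4, sdlog = 0.5)     # income, in $000s
loan   <- 140 + 0.9 * income + rnorm(n, sd = 60)  # loan amount, in $000s
sim <- data.frame(income = income, loan = loan)

# Add two artificial extreme observations (rows 201 and 202)
sim <- rbind(sim, data.frame(income = c(900, 1200), loan = c(150, 200)))

# Threshold 1: z-scores; flag points more than 3 SDs from the mean
z <- (sim$income - mean(sim$income)) / sd(sim$income)
flag_z <- abs(z) > 3

# Threshold 2: IQR rule; flag points beyond 1.5 * IQR from the quartiles
q <- quantile(sim$income, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
flag_iqr <- sim$income < q[[1]] - 1.5 * iqr | sim$income > q[[2]] + 1.5 * iqr

# The two rules need not agree: cross-tabulate the flags before deciding
table(z_score = flag_z, iqr_rule = flag_iqr)

# Fit the model with and without the flagged points (here: the z-score flags)
fit_all     <- lm(loan ~ income, data = sim)
fit_without <- lm(loan ~ income, data = sim[!flag_z, ])

# Compare the coefficients side by side
rbind(complete = coef(fit_all), without_outliers = coef(fit_without))
```

If the slope changes a lot between the two rows, the result depends heavily on the flagged observations. Check those observations before deciding what to report.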