Univariate, Bivariate and Multivariate Statistics Using R: Quantitative Tools for Data Analysis and Data Science
By Daniel J. Denis
Contents:
Preface xiii
1 Introduction to Applied Statistics 1
1.1 The Nature of Statistics and Inference 2
1.2 A Motivating Example 3
1.3 What About “Big Data”? 4
1.4 Approach to Learning R 7
1.5 Statistical Modeling in a Nutshell 7
1.6 Statistical Significance Testing and Error Rates 10
1.7 Simple Example of Inference Using a Coin 11
1.8 Statistics Is for Messy Situations 13
1.9 Type I versus Type II Errors 14
1.10 Point Estimates and Confidence Intervals 15
1.11 So What Can We Conclude from One Confidence Interval? 18
1.12 Variable Types 19
1.13 Sample Size, Statistical Power, and Statistical Significance 22
1.14 How “p < 0.05” Happens 23
1.15 Effect Size 25
1.16 The Verdict on Significance Testing 26
1.17 Training versus Test Data 27
1.18 How to Get the Most Out of This Book 28
Exercises 29
2 Introduction to R and Computational Statistics 31
2.1 How to Install R on Your Computer 34
2.2 How to Do Basic Mathematics with R 35
2.2.1 Combinations and Permutations 38
2.2.2 Plotting Curves Using curve() 39
2.3 Vectors and Matrices in R 41
2.4 Matrices in R 44
2.4.1 The Inverse of a Matrix 47
2.4.2 Eigenvalues and Eigenvectors 49
2.5 How to Get Data into R 52
2.6 Merging Data Frames 55
2.7 How to Install a Package in R, and How to Use It 55
2.8 How to View the Top, Bottom, and “Some” of a Data File 58
2.9 How to Select Subsets from a Dataframe 60
2.10 How R Deals with Missing Data 62
2.11 Using ls( ) to See Objects in the Workspace 63
2.12 Writing Your Own Functions 65
2.13 Writing Scripts 65
2.14 How to Create Factors in R 66
2.15 Using the table() Function 67
2.16 Requesting a Demonstration Using the example() Function 68
2.17 Citing R in Publications 69
Exercises 69
3 Exploring Data with R: Essential Graphics and Visualization 71
3.1 Statistics, R, and Visualization 71
3.2 R’s plot() Function 73
3.3 Scatterplots and Depicting Data in Two or More Dimensions 77
3.4 Communicating Density in a Plot 79
3.5 Stem-and-Leaf Plots 85
3.6 Assessing Normality 87
3.7 Box-and-Whisker Plots 89
3.8 Violin Plots 95
3.9 Pie Graphs and Charts 97
3.10 Plotting Tables 98
Exercises 99
4 Means, Correlations, Counts: Drawing Inferences Using Easy-to-Implement Statistical Tests 101
4.1 Computing z and Related Scores in R 101
4.2 Plotting Normal Distributions 105
4.3 Correlation Coefficients in R 106
4.4 Evaluating Pearson’s r for Statistical Significance 110
4.5 Spearman’s Rho: A Nonparametric Alternative to Pearson 111
4.6 Alternative Correlation Coefficients in R 113
4.7 Tests of Mean Differences 114
4.7.1 t-Tests for One Sample 114
4.7.2 Two-Sample t-Test 115
4.7.3 Was the Welch Test Necessary? 117
4.7.4 t-Test via Linear Model Set-up 118
4.7.5 Paired-Samples t-Test 118
4.8 Categorical Data 120
4.8.1 Binomial Test 120
4.8.2 Categorical Data Having More Than Two Possibilities 123
4.9 Radar Charts 126
4.10 Cohen’s Kappa 127
Exercises 129
5 Power Analysis and Sample Size Estimation Using R 131
5.1 What Is Statistical Power? 131
5.2 Does That Mean Power and Huge Sample Sizes Are “Bad?” 133
5.3 Should I Be Estimating Power or Sample Size? 134
5.4 How Do I Know What the Effect Size Should Be? 135
5.4.1 Ways of Setting Effect Size in Power Analyses 135
5.5 Power for t-Tests 136
5.5.1 Example: Treatment versus Control Experiment 137
5.5.2 Extremely Small Effect Size 138
5.6 Estimating Power for a Given Sample Size 140
5.7 Power for Other Designs – The Principles Are the Same 140
5.7.1 Power for One-Way ANOVA 141
5.7.2 Converting R² to f 143
5.8 Power for Correlations 143
5.9 Concluding Thoughts on Power 145
Exercises 146
6 Analysis of Variance: Fixed Effects, Random Effects, Mixed Models, and Repeated Measures 147
6.1 Revisiting t-Tests 147
6.2 Introducing the Analysis of Variance (ANOVA) 149
6.2.1 Achievement as a Function of Teacher 149
6.3 Evaluating Assumptions 152
6.3.1 Inferential Tests for Normality 153
6.3.2 Evaluating Homogeneity of Variances 154
6.4 Performing the ANOVA Using aov() 156
6.4.1 The Analysis of Variance Summary Table 157
6.4.2 Obtaining Treatment Effects 158
6.4.3 Plotting Results of the ANOVA 159
6.4.4 Post Hoc Tests on the Teacher Factor 159
6.5 Alternative Way of Getting ANOVA Results via lm() 161
6.5.1 Contrasts in lm() versus Tukey’s HSD 163
6.6 Factorial Analysis of Variance 163
6.6.1 Why Not Do Two One-Way ANOVAs? 163
6.7 Example of Factorial ANOVA 166
6.7.1 Graphing Main Effects and Interaction in the Same Plot 171
6.8 Should Main Effects Be Interpreted in the Presence of Interaction? 172
6.9 Simple Main Effects 173
6.10 Random Effects ANOVA and Mixed Models 175
6.10.1 A Rationale for Random Factors 176
6.10.2 One-Way Random Effects ANOVA in R 177
6.11 Mixed Models 180
6.12 Repeated-Measures Models 181
Exercises 186
7 Simple and Multiple Linear Regression 189
7.1 Simple Linear Regression 190
7.2 Ordinary Least-Squares Regression 192
7.3 Adjusted R² 198
7.4 Multiple Regression Analysis 199
7.5 Verifying Model Assumptions 202
7.6 Collinearity Among Predictors and the Variance Inflation Factor 206
7.7 Model-Building and Selection Algorithms 209
7.7.1 Simultaneous Inference 209
7.7.2 Hierarchical Regression 210
7.7.2.1 Example of Hierarchical Regression 211
7.8 Statistical Mediation 214
7.9 Best Subset and Forward Regression 217
7.9.1 How Forward Regression Works 218
7.10 Stepwise Selection 219
7.11 The Controversy Surrounding Selection Methods 221
Exercises 223
8 Logistic Regression and the Generalized Linear Model 225
8.1 The “Why” Behind Logistic Regression 225
8.2 Example of Logistic Regression in R 229
8.3 Introducing the Logit: The Log of the Odds 232
8.4 The Natural Log of the Odds 233
8.5 From Logits Back to Odds 235
8.6 Full Example of Logistic Regression 236
8.6.1 Challenger O-ring Data 236
8.7 Logistic Regression on Challenger Data 240
8.8 Analysis of Deviance Table 241
8.9 Predicting Probabilities 242
8.10 Assumptions of Logistic Regression 243
8.11 Multiple Logistic Regression 244
8.12 Training Error Rate Versus Test Error Rate 247
Exercises 248
9 Multivariate Analysis of Variance (MANOVA) and Discriminant Analysis 251
9.1 Why Conduct MANOVA? 252
9.2 Multivariate Tests of Significance 254
9.3 Example of MANOVA in R 257
9.4 Effect Size for MANOVA 259
9.5 Evaluating Assumptions in MANOVA 261
9.6 Outliers 262
9.7 Homogeneity of Covariance Matrices 263
9.7.1 What if the Box-M Test Had Suggested a Violation? 264
9.8 Linear Discriminant Function Analysis 265
9.9 Theory of Discriminant Analysis 266
9.10 Discriminant Analysis in R 267
9.11 Computing Discriminant Scores Manually 270
9.12 Predicting Group Membership 271
9.13 How Well Did the Discriminant Function Analysis Do? 272
9.14 Visualizing Separation 275
9.15 Quadratic Discriminant Analysis 276
9.16 Regularized Discriminant Analysis 278
Exercises 278
10 Principal Component Analysis 281
10.1 Principal Component Analysis Versus Factor Analysis 282
10.2 A Very Simple Example of PCA 283
10.2.1 Pearson’s 1901 Data 284
10.2.2 Assumptions of PCA 286
10.2.3 Running the PCA 288
10.2.4 Loadings in PCA 290
10.3 What Are the Loadings in PCA? 292
10.4 Properties of Principal Components 293
10.5 Component Scores 294
10.6 How Many Components to Keep? 295
10.6.1 The Scree Plot as an Aid to Component Retention 295
10.7 Principal Components of USA Arrests Data 297
10.8 Unstandardized Versus Standardized Solutions 301
Exercises 304
11 Exploratory Factor Analysis 307
11.1 Common Factor Analysis Model 308
11.2 A Technical and Philosophical Pitfall of EFA 310
11.3 Factor Analysis Versus Principal Component Analysis on the Same Data 311
11.3.1 Demonstrating the Non-Uniqueness Issue 311
11.4 The Issue of Factor Retention 314
11.5 Initial Eigenvalues in Factor Analysis 315
11.6 Rotation in Exploratory Factor Analysis 316
11.7 Estimation in Factor Analysis 318
11.8 Example of Factor Analysis on the Holzinger and Swineford Data 318
11.8.1 Obtaining Initial Eigenvalues 323
11.8.2 Making Sense of the Factor Solution 324
Exercises 325
12 Cluster Analysis 327
12.1 A Simple Example of Cluster Analysis 329
12.2 The Concepts of Proximity and Distance in Cluster Analysis 332
12.3 k-Means Cluster Analysis 332
12.4 Minimizing Criteria 333
12.5 Example of k-Means Clustering in R 334
12.5.1 Plotting the Data 335
12.6 Hierarchical Cluster Analysis 339
12.7 Why Clustering Is Inherently Subjective 343
Exercises 344
13 Nonparametric Tests 347
13.1 Mann–Whitney U Test 348
13.2 Kruskal–Wallis Test 349
13.3 Nonparametric Test for Paired Comparisons and Repeated Measures 351
13.3.1 Wilcoxon Signed-Rank Test and Friedman Test 351
13.4 Sign Test 354
Exercises 356
References 359
Index 363