Python Data Cleaning Cookbook, Second Edition
Michael Walker
Table of Contents
Preface xiii
Chapter 1: Anticipating Data Cleaning Issues
When Importing Tabular Data with pandas 1
Technical requirements ……………………………………………………………………………………………… 2
Importing CSV files …………………………………………………………………………………………………… 2
Importing Excel files …………………………………………………………………………………………………. 9
Importing data from SQL databases ……………………………………………………………………………. 18
Importing SPSS, Stata, and SAS data ……………………………………………………………………………. 25
Importing R data …………………………………………………………………………………………………….. 34
Persisting tabular data ……………………………………………………………………………………………… 39
Summary ………………………………………………………………………………………………………………. 43
Chapter 2: Anticipating Data Cleaning Issues
When Working with HTML, JSON, and Spark Data 45
Technical requirements ……………………………………………………………………………………………. 45
Importing simple JSON data ……………………………………………………………………………………… 46
Importing more complicated JSON data from an API ……………………………………………………… 53
Importing data from web pages …………………………………………………………………………………. 58
Working with Spark data …………………………………………………………………………………………… 63
Persisting JSON data ………………………………………………………………………………………………… 68
Versioning data ………………………………………………………………………………………………………. 72
Summary ………………………………………………………………………………………………………………. 74
Chapter 3: Taking the Measure of Your Data 77
Technical requirements ……………………………………………………………………………………………. 78
Getting a first look at your data ………………………………………………………………………………….. 78
Selecting and organizing columns ………………………………………………………………………………. 84
Selecting rows ………………………………………………………………………………………………………… 91
Generating frequencies for categorical variables …………………………………………………………… 98
Generating summary statistics for continuous variables ……………………………………………….. 102
Using generative AI to display descriptive statistics ………………………………………………………. 107
Summary …………………………………………………………………………………………………………….. 115
Chapter 4: Identifying Outliers in Subsets of Data 117
Technical requirements ………………………………………………………………………………………….. 118
Identifying outliers with one variable ………………………………………………………………………… 118
Identifying outliers and unexpected values in bivariate relationships ………………………………. 128
Using subsetting to examine logical inconsistencies in variable relationships ……………………. 136
Using linear regression to identify data points with significant influence ………………………….. 144
Using k-nearest neighbors to find outliers ………………………………………………………………….. 148
Using Isolation Forest to find anomalies …………………………………………………………………….. 152
Using PandasAI to identify outliers …………………………………………………………………………… 156
Summary …………………………………………………………………………………………………………….. 163
Chapter 5: Using Visualizations for the Identification of Unexpected Values 165
Technical requirements ………………………………………………………………………………………….. 166
Using histograms to examine the distribution of continuous variables …………………………….. 166
Using boxplots to identify outliers for continuous variables …………………………………………… 173
Using grouped boxplots to uncover unexpected values in a particular group ……………………… 179
Examining both distribution shape and outliers with violin plots ……………………………………. 186
Using scatter plots to view bivariate relationships ………………………………………………………… 191
Using line plots to examine trends in continuous variables ……………………………………………. 198
Generating a heat map based on a correlation matrix …………………………………………………… 204
Summary …………………………………………………………………………………………………………….. 209
Chapter 6: Cleaning and Exploring Data with Series Operations 211
Technical requirements ………………………………………………………………………………………….. 212
Getting values from a pandas Series ………………………………………………………………………….. 212
Showing summary statistics for a pandas Series ………………………………………………………….. 217
Changing Series values …………………………………………………………………………………………… 221
Changing Series values conditionally ………………………………………………………………………… 225
Evaluating and cleaning string Series data ………………………………………………………………….. 233
Working with dates ………………………………………………………………………………………………… 241
Using OpenAI for Series operations …………………………………………………………………………… 248
Summary …………………………………………………………………………………………………………….. 253
Chapter 7: Identifying and Fixing Missing Values 255
Technical requirements ………………………………………………………………………………………….. 256
Identifying missing values ………………………………………………………………………………………. 256
Cleaning missing values …………………………………………………………………………………………. 260
Imputing values with regression ………………………………………………………………………………. 267
Using k-nearest neighbors for imputation ………………………………………………………………….. 273
Using random forest for imputation ………………………………………………………………………….. 276
Using PandasAI for imputation ………………………………………………………………………………… 280
Summary …………………………………………………………………………………………………………….. 284
Chapter 8: Encoding, Transforming, and Scaling Features 285
Technical requirements ………………………………………………………………………………………….. 286
Creating training datasets and avoiding data leakage ……………………………………………………. 286
Removing redundant or unhelpful features ………………………………………………………………… 289
Encoding categorical features: one-hot encoding …………………………………………………………. 294
Encoding categorical features: ordinal encoding …………………………………………………………. 297
Encoding categorical features with medium or high cardinality ……………………………………… 300
Using mathematical transformations ………………………………………………………………………… 303
Feature binning: equal width and equal frequency ………………………………………………………. 308
k-means binning …………………………………………………………………………………………………… 311
Feature scaling ……………………………………………………………………………………………………… 313
Summary …………………………………………………………………………………………………………….. 316
Chapter 9: Fixing Messy Data When Aggregating 317
Technical requirements ………………………………………………………………………………………….. 318
Looping through data with itertuples (an anti-pattern) …………………………………………………. 318
Calculating summaries by group with NumPy arrays ……………………………………………………. 323
Using groupby to organize data by groups ………………………………………………………………….. 326
Using more complicated aggregation functions with groupby ………………………………………… 330
Using user-defined functions and apply with groupby …………………………………………………… 336
Using groupby to change the unit of analysis of a DataFrame …………………………………………. 341
Using pivot_table to change the unit of analysis of a DataFrame ……………………………………… 344
Summary …………………………………………………………………………………………………………….. 346
Chapter 10: Addressing Data Issues When Combining DataFrames 347
Technical requirements ………………………………………………………………………………………….. 348
Combining DataFrames vertically …………………………………………………………………………….. 348
Doing one-to-one merges ………………………………………………………………………………………… 355
Doing one-to-one merges by multiple columns ……………………………………………………………. 361
Doing one-to-many merges ……………………………………………………………………………………… 365
Doing many-to-many merges …………………………………………………………………………………… 370
Developing a merge routine …………………………………………………………………………………….. 376
Summary …………………………………………………………………………………………………………….. 379
Chapter 11: Tidying and Reshaping Data 381
Technical requirements ………………………………………………………………………………………….. 381
Removing duplicated rows ………………………………………………………………………………………. 382
Fixing many-to-many relationships …………………………………………………………………………… 386
Using stack and melt to reshape data from wide to long format ………………………………………. 393
Melting multiple groups of columns ………………………………………………………………………….. 398
Using unstack and pivot to reshape data from long to wide format ………………………………….. 401
Summary …………………………………………………………………………………………………………….. 404
Chapter 12: Automate Data Cleaning with User-Defined Functions, Classes, and
Pipelines 405
Technical requirements ………………………………………………………………………………………….. 406
Functions for getting a first look at our data ……………………………………………………………….. 406
Functions for displaying summary statistics and frequencies …………………………………………. 411
Functions for identifying outliers and unexpected values ………………………………………………. 417
Functions for aggregating or combining data ……………………………………………………………… 423
Classes that contain the logic for updating Series values ……………………………………………….. 429
Classes that handle non-tabular data structures …………………………………………………………… 435
Functions for checking overall data quality ………………………………………………………………… 440
Pre-processing data with pipelines: a simple example ………………………………………………….. 446
Pre-processing data with pipelines: a more complicated example …………………………………… 449
Summary …………………………………………………………………………………………………………….. 453
Other Books You May Enjoy 457
Index 461