Python Data Cleaning Cookbook, Second Edition by Michael Walker

By

Python Data Cleaning Cookbook, Second Edition

Michael Walker

Python Data Cleaning Cookbook

Table of Contents

Preface xiii

Chapter 1: Anticipating Data Cleaning Issues

When Importing Tabular Data with pandas 1

Technical requirements ……………………………………………………………………………………………… 2

Importing CSV files …………………………………………………………………………………………………… 2

Importing Excel files …………………………………………………………………………………………………. 9

Importing data from SQL databases ……………………………………………………………………………. 18

Importing SPSS, Stata, and SAS data ……………………………………………………………………………. 25

Importing R data …………………………………………………………………………………………………….. 34

Persisting tabular data ……………………………………………………………………………………………… 39

Summary ………………………………………………………………………………………………………………. 43

Chapter 2: Anticipating Data Cleaning Issues

When Working with HTML, JSON, and Spark Data 45

Technical requirements ……………………………………………………………………………………………. 45

Importing simple JSON data ……………………………………………………………………………………… 46

Importing more complicated JSON data from an API ……………………………………………………… 53

Importing data from web pages …………………………………………………………………………………. 58

Working with Spark data …………………………………………………………………………………………… 63

Persisting JSON data ………………………………………………………………………………………………… 68

Versioning data ………………………………………………………………………………………………………. 72

Summary ………………………………………………………………………………………………………………. 74

Chapter 3: Taking the Measure of Your Data 77

Technical requirements ……………………………………………………………………………………………. 78

Getting a first look at your data ………………………………………………………………………………….. 78

Selecting and organizing columns ………………………………………………………………………………. 84

Selecting rows ………………………………………………………………………………………………………… 91

Generating frequencies for categorical variables …………………………………………………………… 98

Generating summary statistics for continuous variables ……………………………………………….. 102

Using generative AI to display descriptive statistics ………………………………………………………. 107

Summary …………………………………………………………………………………………………………….. 115

Chapter 4: Identifying Outliers in Subsets of Data 117

Technical requirements ………………………………………………………………………………………….. 118

Identifying outliers with one variable ………………………………………………………………………… 118

Identifying outliers and unexpected values in bivariate relationships ………………………………. 128

Using subsetting to examine logical inconsistencies in variable relationships ……………………. 136

Using linear regression to identify data points with significant influence ………………………….. 144

Using k-nearest neighbors to find outliers ………………………………………………………………….. 148

Using Isolation Forest to find anomalies …………………………………………………………………….. 152

Using PandasAI to identify outliers …………………………………………………………………………… 156

Summary …………………………………………………………………………………………………………….. 163

Chapter 5: Using Visualizations for the Identification of Unexpected Values 165

Technical requirements ………………………………………………………………………………………….. 166

Using histograms to examine the distribution of continuous variables …………………………….. 166

Using boxplots to identify outliers for continuous variables …………………………………………… 173

Using grouped boxplots to uncover unexpected values in a particular group ……………………… 179

Examining both distribution shape and outliers with violin plots ……………………………………. 186

Using scatter plots to view bivariate relationships ………………………………………………………… 191

Using line plots to examine trends in continuous variables ……………………………………………. 198

Generating a heat map based on a correlation matrix …………………………………………………… 204

Summary …………………………………………………………………………………………………………….. 209

Chapter 6: Cleaning and Exploring Data with Series Operations 211

Technical requirements ………………………………………………………………………………………….. 212

Getting values from a pandas Series ………………………………………………………………………….. 212

Showing summary statistics for a pandas Series ………………………………………………………….. 217

Changing Series values …………………………………………………………………………………………… 221

Changing Series values conditionally ………………………………………………………………………… 225

Evaluating and cleaning string Series data ………………………………………………………………….. 233

Working with dates ………………………………………………………………………………………………… 241

Using OpenAI for Series operations …………………………………………………………………………… 248

Summary …………………………………………………………………………………………………………….. 253

Chapter 7: Identifying and Fixing Missing Values 255

Technical requirements ………………………………………………………………………………………….. 256

Identifying missing values ………………………………………………………………………………………. 256

Cleaning missing values …………………………………………………………………………………………. 260

Imputing values with regression ………………………………………………………………………………. 267

Using k-nearest neighbors for imputation ………………………………………………………………….. 273

Using random forest for imputation ………………………………………………………………………….. 276

Using PandasAI for imputation ………………………………………………………………………………… 280

Summary …………………………………………………………………………………………………………….. 284

Chapter 8: Encoding, Transforming, and Scaling Features 285

Technical requirements ………………………………………………………………………………………….. 286

Creating training datasets and avoiding data leakage ……………………………………………………. 286

Removing redundant or unhelpful features ………………………………………………………………… 289

Encoding categorical features: one-hot encoding …………………………………………………………. 294

Encoding categorical features: ordinal encoding …………………………………………………………. 297

Encoding categorical features with medium or high cardinality ……………………………………… 300

Using mathematical transformations ………………………………………………………………………… 303

Feature binning: equal width and equal frequency ………………………………………………………. 308

k-means binning …………………………………………………………………………………………………… 311

Feature scaling ……………………………………………………………………………………………………… 313

Summary …………………………………………………………………………………………………………….. 316

Chapter 9: Fixing Messy Data When Aggregating 317

Technical requirements ………………………………………………………………………………………….. 318

Looping through data with itertuples (an anti-pattern) …………………………………………………. 318

Calculating summaries by group with NumPy arrays ……………………………………………………. 323

Using groupby to organize data by groups ………………………………………………………………….. 326

Using more complicated aggregation functions with groupby ………………………………………… 330

Using user-defined functions and apply with groupby …………………………………………………… 336

Using groupby to change the unit of analysis of a DataFrame …………………………………………. 341

Using pivot_table to change the unit of analysis of a DataFrame ……………………………………… 344

Summary …………………………………………………………………………………………………………….. 346

Chapter 10: Addressing Data Issues When Combining DataFrames 347

Technical requirements ………………………………………………………………………………………….. 348

Combining DataFrames vertically …………………………………………………………………………….. 348

Doing one-to-one merges ………………………………………………………………………………………… 355

Doing one-to-one merges by multiple columns ……………………………………………………………. 361

Doing one-to-many merges ……………………………………………………………………………………… 365

Doing many-to-many merges …………………………………………………………………………………… 370

Developing a merge routine …………………………………………………………………………………….. 376

Summary …………………………………………………………………………………………………………….. 379

Chapter 11: Tidying and Reshaping Data 381

Technical requirements ………………………………………………………………………………………….. 381

Removing duplicated rows ………………………………………………………………………………………. 382

Fixing many-to-many relationships …………………………………………………………………………… 386

Using stack and melt to reshape data from wide to long format ………………………………………. 393

Melting multiple groups of columns ………………………………………………………………………….. 398

Using unstack and pivot to reshape data from long to wide format ………………………………….. 401

Summary …………………………………………………………………………………………………………….. 404

Chapter 12: Automate Data Cleaning with User-Defined Functions, Classes, and

Pipelines 405

Technical requirements ………………………………………………………………………………………….. 406

Functions for getting a first look at our data ……………………………………………………………….. 406

Functions for displaying summary statistics and frequencies …………………………………………. 411

Functions for identifying outliers and unexpected values ………………………………………………. 417

Functions for aggregating or combining data ……………………………………………………………… 423

Classes that contain the logic for updating Series values ……………………………………………….. 429

Classes that handle non-tabular data structures …………………………………………………………… 435

Functions for checking overall data quality ………………………………………………………………… 440

Pre-processing data with pipelines: a simple example ………………………………………………….. 446

Pre-processing data with pipelines: a more complicated example …………………………………… 449

Summary …………………………………………………………………………………………………………….. 453

Other Books You May Enjoy 457

Index 461

This book is US$10
To get free sample pages OR Buy this book


Share this Book!

Leave a Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.