While brainstorming new projects to explore, I came across a couple of interesting articles about automating R scripts to save time spent on repetitive tasks. Inspiration strikes! With each new research project or dataset, my first step is always data exploration. Instead of manually repeating the same functions for every new dataset, what if I write a function that will automatically call the most common data exploration functions in R and create a summary report?
Unfortunately, after a quick search, I find that packages already exist to accomplish exactly what I have in mind. After digging into two options, DataExplorer and SmartEDA, I don’t see a need to execute the project I had envisioned.
Instead, I will use this post to explore the basics of exploratory data analysis with R—using base R functions, tidyverse functions, and the DataExplorer and SmartEDA packages.
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) should be the first step that any researcher or data scientist takes when presented with a new dataset. EDA is key to familiarizing yourself with the data at hand and identifying/correcting quality issues. EDA can be divided into subgoals, which differ slightly between sources and the application of the data, but they generally include:
- Summarizing data using descriptive statistics
- Identifying missing values
- Visualizing individual variables and relationships between variables with plots
EDA using R
I would recommend new users first become familiar with base R functions, then add to this foundation with tidyverse functions, and, finally, automate the process with specialized packages. In this way, you can get your hands dirty and feel comfortable handling and exploring data with available functions before abstracting away these processes and using dedicated EDA packages such as DataExplorer and SmartEDA.
Code samples in this post will use the iris dataset (which is included with base R) to demonstrate the most commonly used EDA functions.
EDA with base R functions
- head() and tail() - display the first or last n rows (default is 6).
head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa
tail(iris, 2)
#     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
# 149          6.2         3.4          5.4         2.3 virginica
# 150          5.9         3.0          5.1         1.8 virginica
- names() - display the column names.
names(iris)
# [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
- dim() - count the number of rows and columns.
dim(iris)
# [1] 150   5
- str() - explore the structure of a data frame, including the total number of observations (rows) and variables (columns), the variable names and classes, and the first few values of each variable in the dataset. The data is transposed, allowing users to see all variables on one screen, even if the dataset includes many variables.
str(iris)
# 'data.frame':	150 obs. of  5 variables:
#  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
- summary() - display level counts for factors and summary statistics for numeric variables.
summary(iris)
#   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
#  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
#  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
#  Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
#  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
#  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
#  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
The numerical summary statistics can be called individually using the respective functions:
- min(), max(), mean(), and median() - display the minimum, maximum, mean, or median value.
- quantile() - display the value at a given quantile.
# Display the 0.75 quantile (3rd quartile) of a numeric variable
quantile(iris$Sepal.Length, 0.75)
# 75%
# 6.4
- IQR(), mad(), sd(), and var() - calculate the interquartile range, median absolute deviation, standard deviation, or variance of a numeric variable.
- For logical variables: mean() and sum() - calculate the proportion or the count of TRUE values.
- sum(is.na(x)) - count the number of NAs in a data frame or vector x.
- cor() - calculate the correlation coefficient between each pair of numeric variables and create a correlation matrix. The default is Pearson’s correlation coefficient, though Kendall’s or Spearman’s can also be calculated via the method argument.
# Select only numeric variables
iris_numeric <- iris[ , c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
# Create correlation matrix
cor(iris_numeric)
#              Sepal.Length Sepal.Width Petal.Length Petal.Width
# Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
# Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
# Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
# Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
- Plotting functions (e.g., hist(), boxplot(), plot()) can be used to plot variables individually (histograms, boxplots) or pairwise (scatterplots).
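Several of the functions in the list above are described without a code sample, so here is a minimal sketch of how they might be called on iris. The cutoff of 6 used to build the logical vector and all of the object names are my own choices for illustration, not part of any convention.

```r
# Spread of a single numeric variable
sl_iqr <- IQR(iris$Sepal.Length)  # 3rd quartile minus 1st quartile
sl_sd  <- sd(iris$Sepal.Length)   # standard deviation
sl_var <- var(iris$Sepal.Length)  # variance (the square of sl_sd)
sl_mad <- mad(iris$Sepal.Length)  # median absolute deviation

# Logical vector: TRUE counts as 1 and FALSE as 0
# (the cutoff of 6 is an arbitrary value chosen for this example)
is_long   <- iris$Sepal.Length > 6
n_long    <- sum(is_long)   # number of TRUE values
prop_long <- mean(is_long)  # proportion of TRUE values

# Missing values: total for the data frame, then per column
n_missing     <- sum(is.na(iris))
na_per_column <- colSums(is.na(iris))

# Rank-based alternative to the default Pearson correlation
spearman <- cor(iris[, 1:4], method = "spearman")
```

colSums(is.na()) is a small extension of sum(is.na()) that I find handy, because it shows which columns the missing values sit in rather than only how many there are in total.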
EDA with tidyverse (dplyr) functions
Note: Before using tidyverse packages such as dplyr, be sure you have tidy data! dplyr functions expect that 1) each variable is in its own column and 2) each observation (or case) is in its own row. In addition, tidy data do not use row names (because they store a variable outside of columns).
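To illustrate the point about row names, here is a small sketch using mtcars (also included with base R), which stores the car model in its row names. It assumes the tibble package is installed, which is normally the case because dplyr depends on it. The column name "model" is my own choice.

```r
library(tibble)

# mtcars keeps the car model in the row names, outside of any column
head(rownames(mtcars), 3)

# Move the row names into a regular column so the variable is stored with the rest of the data
cars_tidy <- rownames_to_column(mtcars, var = "model")
names(cars_tidy)
```

The iris dataset used in the rest of this post has no meaningful row names, so it can be passed to dplyr functions as-is.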
- glimpse() - see a “glimpse” of your data by showing row and column counts, variable names and classes, and the first values of each variable in the dataset. The result is similar to str(), but it typically displays more data.
library(dplyr)
glimpse(iris)
# Rows: 150
# Columns: 5
# $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.…
# $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0, 4.…
# $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.…
# $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2, 0.…
# $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setos…
- na_if() - replace specific values with NA.
- summarize() (synonymous with summarise()) - apply multiple summary functions to grouped data and display the results in a new table, with one column for each of the summary statistics that are specified. Unlike the summary() function, users must specify which summary statistics will be calculated. Of the functions supported by summarize(), some are included in base R, and some originate in the dplyr package (listed below).
  - n() and n_distinct() - number of values (rows) and number of unique values.
  - first(), last(), and nth() - first, last, and nth value.
library(dplyr)
iris %>%
  group_by(Species) %>%
  summarize(obs = n(),
            mean_SL = mean(Sepal.Length),
            mean_SW = mean(Sepal.Width),
            min_SL = min(Sepal.Length),
            max_SL = max(Sepal.Length))
# A tibble: 3 × 6
#   Species      obs mean_SL mean_SW min_SL max_SL
#   <fct>      <int>   <dbl>   <dbl>  <dbl>  <dbl>
# 1 setosa        50    5.01    3.43    4.3    5.8
# 2 versicolor    50    5.94    2.77    4.9    7
# 3 virginica     50    6.59    2.97    4.9    7.9
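The list above mentions na_if(), n_distinct(), first(), last(), and nth() without showing them in use, so here is a rough sketch of each. The readings vector and its -99 missing-value code are made up for this example, as are the column names in the summary table.

```r
library(dplyr)

# Hypothetical measurements in which -99 was used as a missing-value code
readings <- c(4.2, -99, 5.1, -99, 6.0)
cleaned  <- na_if(readings, -99)  # every -99 becomes NA
sum(is.na(cleaned))

# dplyr helper functions inside summarize(), one row per species
species_summary <- iris %>%
  group_by(Species) %>%
  summarize(
    n_rows      = n(),                       # rows in the group
    distinct_PL = n_distinct(Petal.Length),  # unique petal lengths
    first_PL    = first(Petal.Length),       # first value in the group
    third_PL    = nth(Petal.Length, 3),      # third value in the group
    last_PL     = last(Petal.Length)         # last value in the group
  )
species_summary
```

Note that first(), last(), and nth() depend on row order, so they are most meaningful after sorting the data with arrange().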
EDA with the DataExplorer and SmartEDA packages
DataExplorer and SmartEDA were developed with the same goal: to increase the efficiency of EDA and reduce time wasted on repetitive report generation. A comparison of the features of these two packages (and other R packages designed for data exploration) can be found in Figure 3 of Putatunda et al.’s 2019 article: SmartEDA: An R Package for Automated Exploratory Data Analysis. The key differences:
- Regarding functionality:
  - SmartEDA includes functions that create a table of summary statistics.
  - DataExplorer includes functions for feature binarization/binning and for standardization, missing-value imputation, and outlier diagnosis.
- Regarding report generation:
  - DataExplorer generates a report in the format of your choosing (default HTML), which does not include any code chunks or description of section contents (other than section headers).
  - SmartEDA can only generate HTML reports. The reports contain code snippets and a general description of each section of the report.
The general organization of the reports is also slightly different, and even individual functions with the same goal take different arguments, so I suggest that users explore both packages to identify which one best suits their needs.
DataExplorer
A report can be generated with the create_report() function.
library(DataExplorer)
create_report(iris, report_title = "Exploratory Data Analysis - **Iris** dataset",
output_file = "DataExplorer_report.html")
Individual sections of the report can be created by calling the respective function.
# Calculate only the "Raw Counts" section of the report
introduce(iris)
# rows columns discrete_columns continuous_columns all_missing_columns total_missing_values complete_rows total_observations memory_usage
# 1 150 5 1 4 0 0 150 750 7976
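As a further sketch of calling pieces of the report one at a time, the snippet below uses a few other DataExplorer functions that I believe correspond to report sections. The choice of which sections to show is mine, and the plots themselves are not reproduced here.

```r
library(DataExplorer)

# Table of missing values per column (one row per variable)
missing_profile <- profile_missing(iris)
missing_profile

# Individual plots, drawn on the active graphics device
plot_missing(iris)                            # share of missing values per column
plot_histogram(iris)                          # one histogram per continuous variable
plot_correlation(iris, type = "continuous")   # correlation heatmap of the numeric columns
```

Because iris has no missing values, the missing-value table and plot are uneventful here, but they become useful quickly on messier datasets.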
SmartEDA
To generate a report, use the ExpReport() function. In this report, the calls for each function are listed ahead of the respective table/plot, which makes it easy to identify which functions were used to generate the output.
library(SmartEDA)
ExpReport(iris, op_file = "SmartEDA_report.html", op_dir = getwd())
Individual sections of the report can be created by calling the respective function.
# Calculate only the "Overview of the data" section of the report
ExpData(data = iris, type = 1)
# Descriptions Value
# 1 Sample size (nrow) 150
# 2 No. of variables (ncol) 5
# 3 No. of numeric/interger variables 4
# 4 No. of factor variables 1
# 5 No. of text variables 0
# 6 No. of logical variables 0
# 7 No. of identifier variables 0
# 8 No. of date variables 0
# 9 No. of zero variance variables (uniform) 0
# 10 %. of variables having complete cases 100% (5)
# 11 %. of variables having >0% and <50% missing cases 0% (0)
# 12 %. of variables having >=50% and <90% missing cases 0% (0)
# 13 %. of variables having >=90% missing cases 0% (0)
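SmartEDA's summary tables can be called on their own in the same way. The following is a minimal sketch that assumes the ExpData() and ExpNumStat() arguments behave as I understand them from the package help pages; the object names are mine, and I have left the output out rather than guess at its exact layout.

```r
library(SmartEDA)

# Variable-level overview: one row per column with its type and missing-value counts
var_overview <- ExpData(data = iris, type = 2)
var_overview

# Summary statistics for the numeric variables
# (by = "A" requests statistics over all observations, without grouping)
num_stats <- ExpNumStat(iris, by = "A", round = 2)
num_stats
```

This table of summary statistics is the SmartEDA feature highlighted in the comparison earlier in the post.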
Summary
Exploratory Data Analysis is an essential first step of data analysis and can be executed using a set of R functions. Many of these functions are included in base R and can be supplemented with tidyverse functions from the dplyr package. The DataExplorer and SmartEDA packages can automate EDA and rapidly generate a summary report, making it easy to gain an overview of the data, identify quality issues, and form initial hypotheses.
Learning Round-Up
The joy of a language like R is that a lot of incredible packages exist and continue to be developed, but it is easy to overlook these tools if you have set habits when working in R. The first version of DataExplorer was released in 2016, before I even knew that R existed, and SmartEDA was released in 2019, right after I learned to use R for the first time. And yet I did not know about these packages until I was writing this blog post. It was a great reminder that processes can always be improved, even in areas where you are experienced. I look forward to using these packages to increase my efficiency when handling future datasets, and to continuing a lifetime of learning with R!
Software/Languages Used
R version 4.3.2
RStudio version 2024.04.0+735
R Packages: dplyr (v 1.1.4); DataExplorer (v 0.8.3); SmartEDA (v 0.3.10)
Resources
- Overview of Exploratory Data Analysis (EDA)
- Overview of EDA with R
- Data Transformation chapter of R for Data Science. A crash course introduction to dplyr.
- dplyr Cheatsheet. A cheatsheet of dplyr functions, including the list of functions compatible with summarize().
- Introduction to DataExplorer
- SmartEDA Help Page
- Putatunda, S., et al., (2019). SmartEDA: An R Package for Automated Exploratory Data Analysis. Journal of Open Source Software, 4(41), 1509. DOI: 10.21105/joss.01509