While brainstorming of new projects to explore, I came across a couple of interesting article about automating R scripts to save time spent on repetitive tasks. Inspiration strikes! With each new research project or dataset, my first step is always data exploration. Instead of manually repeating the same functions for every new dataset, what if I write a function that will automatically call the most common data exploration functions in R and create a summary report?

Unfortunately, after a quick search, I find that packages already exist to accomplish exactly what I have in mind. After digging into two options: DataExplorer and SmartEDA, I don’t see a need to execute the project I had envisioned.

Instead, I will use this post to explore the basics of exploratory data analysis with R—using base R functions, tidyverse functions, and the DataExplorer and SmartEDA packages.


Table of Contents:


Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) should be the first step that any researcher or data scientist takes when presented with a new dataset. EDA is key to familiarizing yourself with the data at hand and identifying/correcting quality issues. EDA can be divided into subgoals, which differ slightly between sources and the application of the data, but they generally include:

  • Summarizing data using descriptive statistics
  • Identifying of missing values
  • Visualizing individual variables and relationships between variables with plots


EDA using R

I would recommend new users first become familiar with base R functions, then add to this foundation with tidyverse functions, and, finally, automate the process with specialized packages. In this way, you can get your hands dirty and feel comfortable handling and exploring data with available functions before abstracting away these processes and using dedicated EDA packages such as DataExplorer and SmartEDA.

Code samples in this post will use the iris dataset (which is included with base R) to demonstrate the most commonly used EDA functions.


EDA with base R functions



  • head() and tail() - display the first or last n number of rows (default is 6).
    head(iris)
    
    #   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    # 1          5.1         3.5          1.4         0.2  setosa
    # 2          4.9         3.0          1.4         0.2  setosa
    # 3          4.7         3.2          1.3         0.2  setosa
    # 4          4.6         3.1          1.5         0.2  setosa
    # 5          5.0         3.6          1.4         0.2  setosa
    # 6          5.4         3.9          1.7         0.4  setosa
    
    tail(iris, 2)
    
    #     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
    # 149          6.2         3.4          5.4         2.3 virginica
    # 150          5.9         3.0          5.1         1.8 virginica
    


  • names() - display the column names.
    names(iris)
    
    # [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
    


  • dim() - count the number of rows and columns.
    dim(iris)
    
    # [1] 150   5
    


  • str() - explore the structure of a data frame, including the total number of observations (rows) and variables (columns), the variable names and classes, and the first few values of each variable in the dataset. The data is transposed, allowing users to see all variables on one screen, even if the dataset includes many variables.
      str(iris)
    
      # 'data.frame':	150 obs. of  5 variables:
      # $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
      # $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
      # $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
      # $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
      # $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
    


  • summary() - displays level counts for factors and summary statistics for numeric variables.
    summary(iris)
    
     #  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
     # Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
     # 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
     # Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
     # Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
     # 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                     
     # Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
    

    The numerical summary statistics can be called individually using the respective functions:

    • min(), max(), mean(), and median() - display the minimum, maximum, mean, or median values.
    • quantile() - display the nth quantile value.
        # Display 75th quantile value (3rd quartile) of a numeric variable
        quantile(iris$Sepal.Length, 0.75)
      
        # 75% 
        # 6.4 
      


  • IQR(), mad(), sd(), var() - calculate the interquartile range, median absolute deviation, standard deviation, or variance of a numeric variable.

  • For logical variables: mean() and sum() - calculate the proportion or count the number of TRUEs.

  • sum(is.na()) - count the number of NAs in a dataframe or variable vector.

  • cor() - calculate the correlation coefficient between each pair of numeric variables and create a correlation matrix. The default calculates Pearson’s correlation coefficients, though Kendall or Spearman can also be calculated.

    # Select only numeric variables
    iris_numeric <- iris[ , c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
    
    # Create correlation matrix
    cor(iris_numeric)
    
    #              Sepal.Length Sepal.Width Petal.Length Petal.Width
    # Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
    # Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
    # Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
    # Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
    


  • Plotting functions (e.g., hist(), boxplot(), plot()) can be used to plot variables individually (histograms, boxplots) or pairwise (scatterplots).


EDA with tidyverse (dplyr) functions

Note: Before using tidyverse packages such as dplyr, be sure you have tidy data! dplyr functions expect that 1) each variable is in its own column and 2) each observation (or case) is in its own row. In addition, tidy data do not use row names (because they store a variable outside of columns).

  • glimpse() - see a “glimpse” of your data, by showing row and column counts, variable names and classes, and the first values of each variable in the dataset. The result is similar to str(), but it typically displays more data.

      library(dplyr)
    
      glimpse(iris)
    
      # Rows: 150
      # Columns: 5
      # $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.…
      # $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0, 4.…
      # $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.…
      # $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2, 0.…
      # $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setos…
    


  • na_if() - replace specific values with NA.

  • summarize() (synonymous with summarise()) - apply multiple summary functions to grouped data and display the results in a new table, with one column for each of the summary statistics that are specified. Unlike the summary() function, users must specify which summary statistics will be calculated. Of the functions supported by summarize(), some are included in base R, and some originate in the dplyr package (listed below).

    • n() and n_distinct() - number of values (rows) and number of unique values.

    • first(), last(), and nth() - first, last, and nth value.

      library(dplyr)
    
      iris %>%
      group_by(Species) %>%
      summarize(obs = n(), mean_SL = mean(Sepal.Length), mean_SW = mean(Sepal.Width),
                min_SL = min(Sepal.Length), max_SL = max(Sepal.Length))
    
      # A tibble: 3 × 6
      #   Species      obs mean_SL mean_SW min_SL max_SL
      #   <fct>      <int>   <dbl>   <dbl>  <dbl>  <dbl>
      # 1 setosa        50    5.01    3.43    4.3    5.8
      # 2 versicolor    50    5.94    2.77    4.9    7  
      # 3 virginica     50    6.59    2.97    4.9    7.9
    


EDA with the DataExplorer and SmartEDA packages

DataExplorer and SmartEDA were developed with the same goal: to increase the efficiency of EDA and reduce time wasted on repetitive report generation. A comparison of the features of these two packages (and other R packages designed for data exploration) can be found in Figure 3 of Putatunda et al.’s 2019 article: SmartEDA: An R Package for Automated Exploratory Data Analysis. The key differences:

  • Regarding functionality:
    • SmartEDA includes functions which create a table of summary statistics.
    • DataExplorer includes functions for feature binarization/binning and to standardize/identify missing imputation/diagnose outliers.
  • Regarding report generation:
    • DataExplorer generates a report in the format of your choosing (default HTML) which does not include any code chunks or description of section contents (other than section headers).
    • SmartEDA can only generate HTML reports. The reports contain code snippets and a general description of each section of the report.

The general organization of the reports is also slightly different and even individual functions with the same goal include different arguments, so I would suggest users explore both packages to identify which one best suits their needs.


DataExplorer

A report can be generated with the create_report() function.

library(DataExplorer)

create_report(iris, report_title = "Exploratory Data Analysis - **Iris** dataset", 
              output_file = "DataExplorer_report.html")


Individual sections of the report can be created by calling the respective function.

# Calculate only the "Raw Counts" section of the report
introduce(iris)
#   rows columns discrete_columns continuous_columns all_missing_columns total_missing_values complete_rows total_observations memory_usage
# 1  150       5                1                  4                   0                    0           150                750         7976


SmartEDA

To generate a report, use the ExpReport() function. In this report, the calls for each function are listed ahead of the respective table/plot, which makes it easy to identify which functions were used to generate the output.

library(SmartEDA)

ExpReport(iris, op_file = "SmartEDA_report.html", op_dir = getwd())


Individual sections of the report can be created by calling the respective function.

# Calculate only the "Overview of the data" section of the report
ExpData(data = iris, type = 1)
#                                           Descriptions    Value
# 1                                   Sample size (nrow)      150
# 2                              No. of variables (ncol)        5
# 3                    No. of numeric/interger variables        4
# 4                              No. of factor variables        1
# 5                                No. of text variables        0
# 6                             No. of logical variables        0
# 7                          No. of identifier variables        0
# 8                                No. of date variables        0
# 9             No. of zero variance variables (uniform)        0
# 10               %. of variables having complete cases 100% (5)
# 11   %. of variables having >0% and <50% missing cases   0% (0)
# 12 %. of variables having >=50% and <90% missing cases   0% (0)
# 13          %. of variables having >=90% missing cases   0% (0)


Summary

Exploratory Data Analysis is an essential first step of data analysis and can be executed using a set of R functions. Many functions are included in base R and can be expanded upon by some tidyverse functions included in the dplyr package. The DataExplorer and SmartEDA packages can automate EDA and rapidly generate a summary report for easy study to gain an overview of the data, identify quality issues with datasets, and make initial hypotheses.


Learning Round-Up

The joy of a language like R is that a lot of incredible packages exist and continue to be developed, but it is easy to overlook these tools if you have set habits when working in R. The first version of DataExplorer was released in 2016, before I even knew that R existed, and SmartEDA was released in 2019, right after I learned to use R for the first time. And yet I never knew these packages until I was writing this blog post. It was a great reminder that processes can always be improved, even in areas where you are experienced. I look forward to using these packages to increase my efficiency when handling future datasets, and to continuing a lifetime of learning with R!


Software/Languages Used

R version 4.3.2

RStudio version 2024.04.0+735

R Packages: dplyr (v 1.1.4); DataExplorer (v 0.8.3); SmartEDA (v 0.3.10)


Resources