Graphically representing estimation table

Represent point and interval estimates using ggplot2

By Pallav Routh in Estimation Visualization

August 13, 2022

I have never been a fan of showing clunky long estimation tables in research presentations. In this blog post, I show how to represent the coefficient, standard error, p values and interval estimates from a typical estimation on a graph or plot.

To generate the ggplot, I will continue using the “scientific” theme I had created in a previous post.

theme_scientific <-
  theme(axis.line = element_line(color = "black"),
        axis.text.x = element_text(color = "black",
                                   size = 12,
                                   margin = unit(c(0.2, 0.1, 0.1, 0.5), "cm")),
        axis.text.y = element_text(color = "black",
                                   size = 12,
                                   margin = unit(c(0.1, 0.1, 0.2, 0.1), "cm")),
        axis.title = element_text(colour = "black", size = 12),
        axis.ticks.length = unit(-0.30, "cm"),
        panel.background = element_rect(fill = "white", colour = "white"),
        panel.border = element_rect(colour = "black", fill = NA, size = 0.2),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_line(colour = "#ddd8d8", linetype = 1, size = 0.5),
        strip.placement = "outside")

Let’s begin by running an estimation. I will use the mtcars dataset to run a simple OLS regression.

lmfit <- lm(mpg ~ wt*cyl + am + gear, data = mtcars)

Here is the output from the regression model -

summary(lmfit)

## 
## Call:
## lm(formula = mpg ~ wt * cyl + am + gear, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8339 -1.3594 -0.4773  1.2972  5.3204 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  57.1698     7.5338   7.588 4.69e-08 ***
## wt           -9.3089     2.8998  -3.210  0.00351 ** 
## cyl          -3.9487     1.1331  -3.485  0.00176 ** 
## am           -0.6179     1.8671  -0.331  0.74333    
## gear         -0.1916     1.0636  -0.180  0.85845    
## wt:cyl        0.8600     0.3835   2.243  0.03367 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.435 on 26 degrees of freedom
## Multiple R-squared:  0.8631,	Adjusted R-squared:  0.8368 
## F-statistic: 32.79 on 5 and 26 DF,  p-value: 1.951e-10

Now let’s store the output (the main estimation table) above in a data frame. We can do this easily with the tidy function from the broom package.

lmfit_tidy <- tidy(lmfit)
lmfit_tidy

## # A tibble: 6 × 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)   57.2       7.53      7.59  0.0000000469
## 2 wt            -9.31      2.90     -3.21  0.00351     
## 3 cyl           -3.95      1.13     -3.48  0.00176     
## 4 am            -0.618     1.87     -0.331 0.743       
## 5 gear          -0.192     1.06     -0.180 0.858       
## 6 wt:cyl         0.860     0.383     2.24  0.0337

I am going to make some changes to the above table. First, I will remove the ‘intercept’ row because in day to day situations researchers rarely interpret the intercept. I will rename some of the columns. I will also initiate an index or row number (this will help me later when creating the plot).

lmfit_tidy <- 
  tidy(lmfit) %>% 
    rename("std_error" = "std.error","p_value" = "p.value") %>% 
    filter(term != "(Intercept)") %>% 
    mutate(index = 1:n()) %>% 
    relocate(index, .before = term)

lmfit_tidy

## # A tibble: 5 × 6
##   index term   estimate std_error statistic p_value
##   <int> <chr>     <dbl>     <dbl>     <dbl>   <dbl>
## 1     1 wt       -9.31      2.90     -3.21  0.00351
## 2     2 cyl      -3.95      1.13     -3.48  0.00176
## 3     3 am       -0.618     1.87     -0.331 0.743  
## 4     4 gear     -0.192     1.06     -0.180 0.858  
## 5     5 wt:cyl    0.860     0.383     2.24  0.0337

I will also store the variables from the estimation table in a vector as this will also be useful later on in the preparation of the plot -

terms <- unique(lmfit_tidy$term)

Now let’s create a plot. The main idea is to create a scatter plot where the points are the estimates for each variable which would be represented on one of the axis. The other important aspect is to represent the confidence of the point estimate using a error bar.

Here is the code to do that -

lmfit_tidy %>% 
  mutate(lcl = estimate - 1.94 * std_error,
         ucl = estimate + 1.94 * std_error) %>% 
  # using index rather than variable as y axis 
  # allows me to keep the y axis as continuous rather
  # than discrete which allows for more flexibility
  # such as a duplicate axis for y 
  ggplot(aes(x = estimate, y = index)) +
      geom_point(size = 3) +
      geom_errorbar(mapping = aes(xmin = lcl, xmax = ucl), 
                    width = 0.20, size = 0.8) +
      geom_vline(aes(xintercept = 0), col = "blue", linetype = 2) +
      # here I replace the index with the variable names
      # and specify a duplicate axis
      scale_y_continuous(breaks = 1:length(terms), 
                         labels = terms, 
                         sec.axis = dup_axis(name = " ", labels = NULL)) + 
      scale_x_continuous(breaks = seq(-15,4,2),
                         sec.axis = dup_axis(name = " ", labels = NULL)) +
      labs(title = "Estimates from OLS regression",
           subtitle = "Dependent variable : mpg",
           x = "estimate", 
           y = "variable") +
      theme_scientific

Awesome! This is so much better than reading a table.

The dots represent the point estimates for each variable with the error bar representing the confidence of the point estimates. The vertical line on zero will allow readers to understand whether each estimate is significantly different from 0 (which is usually the null hypothesis).

We can easily improve the above picture by adding colors to each error bar to indicate whether each coefficient is significant (at the 5% level).

Here is how I do that -

lmfit_tidy %>% 
  mutate(lcl = estimate - 1.94 * std_error,
         ucl = estimate + 1.94 * std_error,
         significant = as.factor(ifelse(p_value < 0.1,1,0))) %>%
  ggplot(aes(x = estimate, y = index)) +
      geom_errorbar(mapping = aes(xmin = lcl, xmax = ucl, color = significant), 
                    width = 0.20, size = 0.8) +
      # moved the geom point after the error bar
      geom_point(size = 3) +
      geom_vline(aes(xintercept = 0), col = "blue", linetype = 2) +
      scale_y_continuous(breaks = 1:length(terms), 
                         labels = terms, 
                         sec.axis = dup_axis(name = " ", labels = NULL)) + 
      scale_x_continuous(breaks = seq(-15,4,2),
                         sec.axis = dup_axis(name = " ", labels = NULL))+
      # red indicates insignificant while blue
      # is significant
      scale_color_manual(values = c("red","blue")) +
      labs(title = "Estimates from OLS regression",
           subtitle = "Dependent variable : mpg",
           x = "estimate", 
           y = "variable") +
      theme_scientific +
      # I choose to remove the legend and instead add 
      # a footnote for explaining the color codes
      theme(legend.position = "none") +
      labs(caption = "Red indicates insignificant")

Excellent! We can further improve the above plot in a few ways.

First, we can add the value of the coefficient to make it easier for readers to understand where the point lies on the x axis. Second we can add more information such as the standard error of the estimates and p value associated with each estimate.

I use TeX function from latex2exp package to annotate the textual information to the existing plot. I write a helpful function that lets me format the number appropriately.

Here is how I do that -

fmt_no <- function(number) {
  return(format(round(number,2), nsmall = 2))
}

lmfit_tidy %>% 
  mutate(lcl = estimate - 1.94 * std_error,
         ucl = estimate + 1.94 * std_error,
         significant = as.factor(ifelse(p_value < 0.1,1,0))) %>%
  mutate(lab = glue("$\\beta$ = {fmt_no(estimate)}, se = {fmt_no(std_error)}, p = {fmt_no(p_value)}")) %>% 
  ggplot(aes(x = estimate, y = index, label = TeX(lab, output = "character"))) +
      geom_text(parse = TRUE, nudge_y = -0.5) +
      geom_errorbar(mapping = aes(xmin = lcl, xmax = ucl, color = significant), 
                    width = 0.20, size = 0.8) +
      geom_point(size = 3) +
      geom_vline(aes(xintercept = 0), col = "blue", linetype = 2) +
      scale_y_continuous(breaks = 1:length(terms), 
                         labels = terms, 
                         sec.axis = dup_axis(name = " ", labels = NULL)) + 
      scale_x_continuous(breaks = seq(-15,4,2),
                         sec.axis = dup_axis(name = " ", labels = NULL)) +
      scale_color_manual(values = c("red","blue")) +
      labs(title = "Estimates from OLS regression",
           subtitle = "Dependent variable : mpg",
           x = "estimate", 
           y = "variable") +
      theme_scientific +
      theme(legend.position = "none") +
      labs(caption = "Red indicates insignificant")

So nice!

The final finishing touch would be to change the ordering of the variables - ideally the interaction between wt and cyl should be right after it. Additionally, it would be helpful to replace the : notation with the x notation for relaying interaction between two variables.

One way to move wt:cyl close to wt and cyl, is to group them. Then use arrange to sort wt:cyl with wt and cyl. Finally, initialize a new index to capture this arrangement.

lmfit_tidy_rearranged <-
  lmfit_tidy %>% 
    mutate(wt_cyl = ifelse(grepl("wt|cyl",term),1,0)) %>% 
    group_by(wt_cyl) %>% 
      mutate(group_idx = 1:n()) %>% 
    ungroup() %>% 
    arrange(desc(wt_cyl),group_idx) %>% 
    mutate(index_rearranged = 1:n())  

terms_rearraged <- lmfit_tidy_rearranged$term

I use this rearranged data to re create the plot -

lmfit_tidy_rearranged %>% 
  mutate(lcl = estimate - 1.94 * std_error,
         ucl = estimate + 1.94 * std_error,
         significant = as.factor(ifelse(p_value < 0.1,1,0))) %>%
  mutate(lab = glue("$\\beta$ = {fmt_no(estimate)}, se = {fmt_no(std_error)}, p = {fmt_no(p_value)}")) %>%
  ggplot(aes(x = estimate, y = index_rearranged, label = TeX(lab, output = "character"))) +
      geom_text(parse = TRUE, nudge_y = -0.5) +
      geom_errorbar(mapping = aes(xmin = lcl, xmax = ucl, color = significant), 
                    width = 0.20, size = 0.8) +
      geom_point(size = 3) +
      geom_vline(aes(xintercept = 0), col = "blue", linetype = 2) +
      # reverse y axis
      # use the new rearranged terms here
      scale_y_reverse(breaks = 1:length(terms_rearraged), 
                      labels = terms_rearraged, 
                      sec.axis = dup_axis(name = " ", labels = NULL))  +
      scale_x_continuous(breaks = seq(-15,4,2),
                         sec.axis = dup_axis(name = " ", labels = NULL)) +
      scale_color_manual(values = c("red","blue")) +
      labs(title = "Estimates from OLS regression",
           subtitle = "Dependent variable : mpg",
           x = "estimate", 
           y = "variable") +
      theme_scientific +
      theme(legend.position = "none") +
      labs(caption = "Red indicates insignificant")

In order to replace the : notation with the x notation, it would be helpful to write a function that detects the interaction and then formats the text so that it can used within a TeX function.

format_terms <- function(var_string){
  has_interaction <- grepl(":",var_string)
  if (has_interaction) {
    vars <- unlist(strsplit(var_string,":"))
    return(glue("{vars[1]} $\\times$ {vars[2]}"))
  }
  else {
    return(var_string)
  }
}

terms_rearraged_formatted <- map_chr(terms_rearraged,format_terms)


lmfit_tidy_rearranged %>% 
  mutate(lcl = estimate - 1.94 * std_error,
         ucl = estimate + 1.94 * std_error,
         significant = as.factor(ifelse(p_value < 0.1,1,0))) %>%
  mutate(lab = glue("$\\beta$ = {fmt_no(estimate)}, se = {fmt_no(std_error)}, p = {fmt_no(p_value)}")) %>%
  ggplot(aes(x = estimate, y = index_rearranged, label = TeX(lab, output = "character"))) +
      geom_text(parse = TRUE, nudge_y = -0.5) +
      geom_errorbar(mapping = aes(xmin = lcl, xmax = ucl, color = significant), 
                    width = 0.20, size = 0.8) +
      geom_point(size = 3) +
      geom_vline(aes(xintercept = 0), col = "blue", linetype = 2) +
      scale_y_reverse(breaks = 1:length(terms_rearraged), 
                      labels = TeX(terms_rearraged_formatted), 
                      sec.axis = dup_axis(name = " ", labels = NULL)) + 
      scale_x_continuous(breaks = seq(-15,4,2),
                         sec.axis = dup_axis(name = " ", labels = NULL)) +
      scale_color_manual(values = c("red","blue")) +
      labs(title = "Estimates from OLS regression",
           subtitle = "Dependent variable : mpg",
           x = "estimate", 
           y = "variable") +
      theme_scientific +
      theme(legend.position = "none") +
      labs(caption = "Red indicates insignificant")

You can use this framework in any estimation.