Joins and conditional matching with data.table

Feb 10, 2019 10 min read R, ecohydrology, Data Science

A basic operation

Joining tables by a common key is frequently required by anybody working with data (and especially relational databases) - and an approach that isn’t too difficult to master luckily! This post will showcase how simple and rolling joins work in data.table (and also documents my learning process with this wonderful package). I’ve fallen in love with the latter (rolls) for setting up model grids, which I will demonstrate.

The tidy `dplyr` way

Typically I’ve used the join family of functions from dplyr:

inner_join()
left_join()
right_join()
full_join()
semi_join()
nest_join()
anti_join()

These work fabulously in pipes and have a consistent, and tidyversesque (🤓) API. These functions are part of my basic, data wrangling, munging and handling tool set, and come in very handy when matching up e.g. different measurements for similar sites and/or dates!

Example

library(dplyr)

# create a dummy data frame with data
temperatures <- expand.grid(area = c("east", "west"), 
                     site = letters[1:3]) %>% 
      mutate(temp_c = runif(n = nrow(.), -5, 5))

# create a table related via common key 'site' 
vegetation <- data.frame(site = letters[1:3], 
                          landcover = c("forest",
                                        "steppe",
                                        "wetland"),
                         stringsAsFactors = TRUE)


# magic!
overview <- left_join(temperatures, vegetation, by = "site")

knitr::kable(overview,
             format = "html",
             digits = 1,
             caption = "Joined data set.")

Table 1: Joined data set.
area	site	temp_c	landcover
east	a	3.8	forest
west	a	-3.6	forest
east	b	4.4	steppe
west	b	2.1	steppe
east	c	3.3	wetland
west	c	4.9	wetland

Aggregation is just two lines away:

overview_summary <- overview %>% 
      group_by(landcover) %>% 
      summarise_if(is.numeric, mean)

Table 2: Simple aggregation using mean in dplyr.
landcover	temp_c
forest	0.1
steppe	3.2
wetland	4.1

Setting the `data.table`

In short, data.table is an incredibly powerful and resource-efficient package, that I just began using and loving. It’s syntax is initially a bit difficult to wrap your head around - but that holds true for pretty much any new API/package I started using. It has outstanding documentation (e.g. this intro vignette) and is well-maintained. And so much fun!

This here is taken straight from the introductory PDF, and says it all:

Example

I’ll use the same example data from above to show a few data.table tricks. data.tables require keys to operate on and need to be set manually, or through coercion using as.data.table().

library(data.table)

# create a dummy data frame with data
temperatures <- expand.grid(area = c("east", "west"), 
                     site = letters[1:3]) %>% 
      mutate(temp_c = runif(n = nrow(.), -5, 5)) %>% 
      as.data.table()

# create a table related via common key 'site' 
vegetation <- data.table(site = letters[1:3], 
                          landcover = c("forest",
                                        "steppe",
                                        "wetland"),
                         stringsAsFactors = TRUE) 
 

# magic!
# overview_dt <- DT

temperatures[vegetation, on = "site"]

##    area site    temp_c landcover
## 1: east    a -2.781943    forest
## 2: west    a  3.062782    forest
## 3: east    b -4.427340    steppe
## 4: west    b -4.528919    steppe
## 5: east    c  0.690448   wetland
## 6: west    c -4.127972   wetland

# or after setting keys specifically
setkey(temperatures, site)
setkey(vegetation, site)


overview_dt <- temperatures[vegetation]

Table 3: Joined data set.
area	site	temp_c	landcover
east	a	-2.8	forest
west	a	3.1	forest
east	b	-4.4	steppe
west	b	-4.5	steppe
east	c	0.7	wetland
west	c	-4.1	wetland

And as above, a simple aggregation, yet on a single line:

# same aggregation as before:
overview_dt_summary <- overview_dt[,j = .(temp_c_mean = mean(temp_c)), by = .(landcover)]

Table 4: Simple aggregation using mean in data.table.
landcover	temp_c_mean
forest	0.1
steppe	-4.5
wetland	-1.7

Note how we always come back to the simple syntax of DT[i, j, by] (where, what, grouping). (Side note: the .() syntax indicates that the input type is a list, important when you want to return a named column or use multiple columns to operate on).

All operations, for example, can be done on subsets of the data, too, and multiple calculations in one go are very much possible.

overview_dt[area == "east", j = .(count = .N, max_temp_c = max(temp_c))]

##    count max_temp_c
## 1:     3   0.690448

Getting fancy with a (simple!) rolling join

The problem

The main reason I stumbled over data.table were rolling joins. I was desperately looking for a convenient way of assigning parameters to a 1-D model grid. More precisely, I was simulating lateral flow through a soil column, and wanted to be able to set the stratification in a consistent manner that was independent of my choice of discretization (i.e. how thick each simulated slice of the soil column would be).

There are a few ways I thought this could be achieved by (and certainly more than I could come up with). My starting point was always a table containing general parameters, the soil profile, and a model grid that needs to be populated with the profile based on the general parameters. My ideas were:

looping over each row in the model grid and filling a “soil_type” column based on a conditional statement
using dplyr::mutate() with a nested ifelse()
using a dplyr::case_when() with conditional statements covering the extent of each soil layer
hard-coding a base R solution with [] subsetting and conditional statements

While all of these are pretty-straight forward, I was hoping for something that could be employed in a more programmatic fashion. Enter data.table

Conditional joining with `DT[ , roll = TRUE/-Inf]`

A rolling join matches one row of a table with that of another (based on a common key), as long as the matching condition has not changed. A quick demonstration!

times <- data.table(hour_of_day = 1:24)
mood <- data.table(hour_of_day = c(0, 8, 20, 24),
                   feeling = c("sleep", "coffee please", "ready for bed", "sleep"))

# regular join
mood[times, on = "hour_of_day"]

##     hour_of_day       feeling
##  1:           1          <NA>
##  2:           2          <NA>
##  3:           3          <NA>
##  4:           4          <NA>
##  5:           5          <NA>
##  6:           6          <NA>
##  7:           7          <NA>
##  8:           8 coffee please
##  9:           9          <NA>
## 10:          10          <NA>
## 11:          11          <NA>
## 12:          12          <NA>
## 13:          13          <NA>
## 14:          14          <NA>
## 15:          15          <NA>
## 16:          16          <NA>
## 17:          17          <NA>
## 18:          18          <NA>
## 19:          19          <NA>
## 20:          20 ready for bed
## 21:          21          <NA>
## 22:          22          <NA>
## 23:          23          <NA>
## 24:          24         sleep
##     hour_of_day       feeling

# forward join populates until a condition changes (i.e. a new match in the "on" column is found)
mood[times, on = "hour_of_day", roll = TRUE]

##     hour_of_day       feeling
##  1:           1         sleep
##  2:           2         sleep
##  3:           3         sleep
##  4:           4         sleep
##  5:           5         sleep
##  6:           6         sleep
##  7:           7         sleep
##  8:           8 coffee please
##  9:           9 coffee please
## 10:          10 coffee please
## 11:          11 coffee please
## 12:          12 coffee please
## 13:          13 coffee please
## 14:          14 coffee please
## 15:          15 coffee please
## 16:          16 coffee please
## 17:          17 coffee please
## 18:          18 coffee please
## 19:          19 coffee please
## 20:          20 ready for bed
## 21:          21 ready for bed
## 22:          22 ready for bed
## 23:          23 ready for bed
## 24:          24         sleep
##     hour_of_day       feeling

# backwards works, too!
mood[times, on = "hour_of_day", roll = -Inf]

##     hour_of_day       feeling
##  1:           1 coffee please
##  2:           2 coffee please
##  3:           3 coffee please
##  4:           4 coffee please
##  5:           5 coffee please
##  6:           6 coffee please
##  7:           7 coffee please
##  8:           8 coffee please
##  9:           9 ready for bed
## 10:          10 ready for bed
## 11:          11 ready for bed
## 12:          12 ready for bed
## 13:          13 ready for bed
## 14:          14 ready for bed
## 15:          15 ready for bed
## 16:          16 ready for bed
## 17:          17 ready for bed
## 18:          18 ready for bed
## 19:          19 ready for bed
## 20:          20 ready for bed
## 21:          21         sleep
## 22:          22         sleep
## 23:          23         sleep
## 24:          24         sleep
##     hour_of_day       feeling

Discretizing a soil profile for a simple model

As noted earlier, I stumbled over data.table while looking for the most convenient way of assigning soil properties to cells along a soil profile in a programmatic fashion. This was principally necessary for a simple sensitivity analyses checking the impact of my model discretization, i.e. thickness of each cell / slice along the soil profile from surface to the model boundary (e.g. bedrock).

I used a list to store all of the basic model parameters in model_params:

total depth of soil column (\(z\))
height of each cell (i.e. \(\Delta z\))

The number of cells is then derived as \(n = \frac{ z } {\Delta z}\).

Additionally, I pulled soil profile and texture data from a soil coring log of our field sites and stored them in soil_param as a data frame. The data shown here are a simplified ‘dummy’ set.

model_params <- list(total_depth = 0.8,
                     cell_height_m = 0.1)
# model_params$n_cells = model_params$total_depth / model_params$cell_height_m

# starts from bottom
soil_params <- data.table(z = c(0.4, 0.1, 0),
                         soil_type = c("silt", "organic_mineral_mix", "lfh"),
                         ksat = c(10^-8, 10^-5, 10^-4),
                         key = "z")

I then developed a simple function model_domain() that used a rolling join and returned a data.frame with an option to set total depth and cell height.

#' Creates a discretized model domain along a soil profile
#'
#' @param total.depth Depth of soil column in meters
#' @param cell.height Height of each model cell in meters
#' @param soil.params A data frame containing one column of soil depths (named 'z')and any number of additional parameters of interest. These are added to the model grid using a rolling join. These parameters must be supplied ordered from top to bottom, starting at the respective depths, with zero (0) denoting the soil surface. 
#'
#' @return Returns a data frame representing a soil column and has at least one column for soil depths, discretized based on the supplied cell.height
#' @export
#'
#' @examples
model_domain <- function(total.depth, cell.height, soil.params){
      require(data.table)
      
      soil.params <- data.table(soil.params, key = "z")

      domain <- data.table(z = seq(0, total.depth, by = cell.height), key = "z")
      
      
      joined <- soil.params[domain, on = "z", roll = TRUE]
      
      
      
      return(joined)
      
}

Giving it a spin:

profile_0.1 <- model_domain(total.depth = model_params$total_depth,
             cell.height = model_params$cell_height_m,
             soil.params = soil_params)

profile_0.1

##      z           soil_type  ksat
## 1: 0.0                 lfh 1e-04
## 2: 0.1 organic_mineral_mix 1e-05
## 3: 0.2 organic_mineral_mix 1e-05
## 4: 0.3 organic_mineral_mix 1e-05
## 5: 0.4                silt 1e-08
## 6: 0.5                silt 1e-08
## 7: 0.6                silt 1e-08
## 8: 0.7                silt 1e-08
## 9: 0.8                silt 1e-08

Now, the model_domain() function can be easily used to investigate the impact of varying parameters of the model, and corresponding soil horizons. Rather than a formal investigation, here I’ll only offer a visual representation of how different \(\Delta z\)’s affect the model domain.

library(dplyr)
library(ggplot2)

# make a series of cell heights to discretize the model with
model_params$cell_height_m <- seq(0.02, 0.1, by = 0.02)


# use map (apply-style function) to run model_domain with multiple cell heights
profile_multiple <- purrr::map(model_params$cell_height_m,
                               ~model_domain(total.depth = model_params$total_depth,
                                             cell.height = .x,
                                             soil.params = soil_params)) %>% 
      # supply names for subsequent binding and plotting
      purrr::set_names(model_params$cell_height_m) %>% 
      bind_rows(.id = "height")
      

      
# make plot      
profile_multiple %>%
      ggplot(aes(x = "Profile",
                 y = z,
                 fill = soil_type)) +
      
      # add points
      geom_point(size = 3,
                 shape = 21,
                 color = "gray50") + 
      
      # split for each Delta z
      facet_grid(~height,switch = "x") +
      
      # labelling
      labs(y = "Depth", 
           fill = "Horizon",
           title = "Model Domain with variable Discretization",
           subtitle = "Resolution impacts location of boundaries" ) +
      
      # scales
      scale_y_reverse() +
      scale_fill_viridis_d(begin = 0, end = 0.9,option = "D") +
      
      # theming
      theme_minimal() +
      theme(plot.background = element_rect(fill = "transparent"),
            panel.border = element_rect(fill = NA, color = "transparent"),
            text = element_text(color = "gray40", face = "italic"),
            axis.title.x = element_blank(),
            axis.text.x = element_blank())

`data.table` is a new favorite!

I hope that you, just like me, have found a new friend in the #rstats world with data.table. The use-case I showed here was fairly basic, yet it can be applied in a number of settings, and I’ve found it particularly useful for making model grids. Yet, I can see a number of other applications, such as matching pre-defined or computed events, such as droughts, storms or disturbances between time series of interest. Give it a go yourself, and let me know which applications you can think up.

R data.table modelling

Joins and conditional matching with data.table

A basic operation

The tidy dplyr way

Example

Setting the data.table

Example

Getting fancy with a (simple!) rolling join

The problem

Conditional joining with DT[ , roll = TRUE/-Inf]

Discretizing a soil profile for a simple model

data.table is a new favorite!

Related

The tidy `dplyr` way

Setting the `data.table`

Conditional joining with `DT[ , roll = TRUE/-Inf]`

`data.table` is a new favorite!