Working with matrix-columns in tibbles

What’s a matrix-column?

The tibble package in R allows for the construction of “tibbles”, a sort of “enhanced” data frame. Most of these enhancements are fairly mundane, such as better printing in the console and not modifying column names. One of the more distinctive features of tibbles is how easily they hold a column that is a list. List-columns have been written about fairly extensively as they are a very cool way of working with data in the tidyverse. A less commonly known feature is that matrix-columns are also possible in a tibble. A matrix-column is a column of a tibble that is itself an \(n \times m\) matrix. Because a matrix-column is simultaneously a single column (of a tibble) and \(m\) columns (of the matrix), there are some quirks to working with them.

Creating a matrix-column

Data frames and tibbles handle matrix inputs differently. data.frame() adds an \(n \times m\) matrix as \(m\) separate columns of a data frame, while tibble() keeps it as a single matrix-column.

library(tidyverse) # attaches tibble, dplyr, and readr, all used below
my_matrix <- matrix(rnorm(100), nrow = 10)

With data.frame() there is no matrix-column, just regular columns named mat_col.1 through mat_col.10:

df <- data.frame(x = letters[1:10], mat_col = my_matrix)
dim(df)
## [1] 10 11
colnames(df)
##  [1] "x"          "mat_col.1"  "mat_col.2"  "mat_col.3"  "mat_col.4" 
##  [6] "mat_col.5"  "mat_col.6"  "mat_col.7"  "mat_col.8"  "mat_col.9" 
## [11] "mat_col.10"

To create a matrix-column directly, use tibble() instead of data.frame():

tbl <- tibble(x = letters[1:10], mat_col = my_matrix)
dim(tbl)
## [1] 10  2
colnames(tbl)
## [1] "x"       "mat_col"
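
To confirm that the tibble really stores the whole matrix in one column, you can inspect the column itself. The following is a minimal sketch with a small, made-up matrix (the names `m` and `tb` are illustrative and not part of the examples above):

```r
library(tibble)

# A small 3 x 2 matrix so the result is easy to check by eye
m <- matrix(1:6, nrow = 3)
tb <- tibble(id = c("a", "b", "c"), m = m)

ncol(tb)          # the tibble has 2 columns: id and m
is.matrix(tb$m)   # the column m is still a matrix
dim(tb$m)         # with all 3 rows and 2 columns intact
```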

You can also “group” columns of a data frame or tibble into a matrix-column using dplyr.

df_mat_col <-
  df %>% 
  mutate(matrix_column = as.matrix(select(., starts_with("mat_col.")))) %>% 
  #then remove the originals
  select(-starts_with("mat_col."))

This creates a matrix-column, and the column names of the matrix itself come from the original data frame (i.e. df).

colnames(df_mat_col)
## [1] "x"             "matrix_column"
colnames(df_mat_col$matrix_column)
##  [1] "mat_col.1"  "mat_col.2"  "mat_col.3"  "mat_col.4"  "mat_col.5" 
##  [6] "mat_col.6"  "mat_col.7"  "mat_col.8"  "mat_col.9"  "mat_col.10"
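
Going the other way is also possible. Because data.frame() spreads a matrix into ordinary columns, passing the matrix-column back through it should undo the grouping. A minimal sketch, using a small hypothetical tibble `tb` rather than the objects above:

```r
library(tibble)

# A 3 x 2 matrix with its own column names
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("a1", "a2")))
tb <- tibble(id = c("a", "b", "c"), m = m)

# data.frame() expands the matrix into one ordinary column per matrix column,
# named after the matrix's own column names
flat <- data.frame(id = tb$id, tb$m)
colnames(flat)
```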

When do you need a matrix-column?

Matrix-columns are sometimes useful in modeling, when a predictor or covariate is not just a single variable, but a vector for every observation. For example, in multivariate analyses, certain packages (e.g. ropls) require a matrix as an input. Functional models, which fit continuous functions of some variable (e.g. over time) as a covariate, are another example. (One specific example is distributed lag non-linear models, which I hope to start blogging about soon.) As a simple illustration, the matrix-column can be passed directly to prcomp() through its formula interface:

pca <- prcomp(~ mat_col, data = tbl)
summary(pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.8022 1.6779 1.5645 1.3203 1.02222 0.77201 0.51162
## Proportion of Variance 0.2647 0.2295 0.1995 0.1421 0.08517 0.04858 0.02134
## Cumulative Proportion  0.2647 0.4942 0.6937 0.8358 0.92096 0.96954 0.99087
##                            PC8     PC9      PC10
## Standard deviation     0.31635 0.10918 5.838e-18
## Proportion of Variance 0.00816 0.00097 0.000e+00
## Cumulative Proportion  0.99903 1.00000 1.000e+00

Viewing and using matrix-columns

Matrix-columns are… weird, and as such they have some quirks in how they are printed in RStudio. Some of these may be bugs, but as far as I know, no issues related to matrix-columns had been reported at the time of writing this post. If you are using paged printing of data frames in R Markdown documents, a tibble with a matrix-column will simply not appear in-line. Instead you get an empty viewer box.

Figure: Printing a tibble with a matrix-column shows nothing in RStudio when paged printing of data frames is on.

You can turn off paged printing for a single code chunk with the paged.print = FALSE chunk option, and you’ll see something more like this:

```{r paged.print=FALSE}
tbl <- tibble(x = letters[1:10], mat_col = my_matrix)
tbl
``` 
## # A tibble: 10 x 2
##    x     mat_col[,1]    [,2]    [,3]    [,4]   [,5]   [,6]   [,7]   [,8]   [,9]
##    <chr>       <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 a          0.464  -1.12   -1.01    1.73    0.531  2.10   1.44   0.836  0.369
##  2 b          1.82   -0.239   0.749   1.57   -0.256 -1.41  -0.951 -1.71  -1.77 
##  3 c          0.190  -0.785   1.27   -1.43   -1.82   0.715 -0.593  2.07  -0.228
##  4 d         -1.18    0.271   1.52    0.135  -0.169 -1.23   0.522 -0.410  1.23 
##  5 e         -0.509  -0.944   0.108  -1.03    0.407 -0.953 -0.415 -1.25  -0.621
##  6 f          1.67    0.185  -0.807   0.149   0.114  0.240 -0.791  0.418 -2.13 
##  7 g         -2.04   -2.38    0.786   0.660  -0.114 -0.935  0.519 -1.32  -0.627
##  8 h         -0.0686  0.166  -0.0905 -1.18    0.217 -0.695 -1.53  -0.554 -0.610
##  9 i         -1.65    0.0525 -0.501  -1.64   -0.599 -1.04   0.143 -1.83  -0.626
## 10 j         -0.623  -0.290  -0.430  -0.0352  0.937 -3.33   2.32   1.10  -0.503
## # … with 1 more variable: [,10] <dbl>

Also note that View() only renders the first column of a matrix-column, with no indication that there is more to see.

Figure: View()ing a tibble with a matrix-column shows only the first column of the matrix.

Despite the printing and viewing issues, matrix-columns are surprisingly easy to use. The usual sort of indexing works as expected. You can select the matrix-column by name with [ or dplyr::select(), and you can extract the matrix itself using the $ operator, [[, or dplyr::pull().

#a tibble with only the matrix-column
tbl["mat_col"]
select(tbl, mat_col) 

#the matrix itself:
tbl$mat_col
tbl[["mat_col"]]
pull(tbl, "mat_col")
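
Once extracted, the column behaves like any other matrix, so ordinary matrix indexing applies. A small sketch with made-up data (`tb` and `m` are illustrative names, not the objects used above):

```r
library(tibble)

m <- matrix(1:6, nrow = 3)
tb <- tibble(id = c("a", "b", "c"), m = m)

# `[` keeps the tibble wrapper, while `$` and `[[` return the matrix itself
is.data.frame(tb["m"])
identical(tb$m, tb[["m"]])

# ordinary matrix indexing on the extracted column
tb$m[2, ]   # second row of the matrix: 2 5
tb$m[, 1]   # first column of the matrix: 1 2 3
```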

Indexing rows works with no problem too.

tbl[3, ]
## # A tibble: 1 x 2
##   x     mat_col[,1]   [,2]  [,3]  [,4]  [,5]  [,6]   [,7]  [,8]   [,9] [,10]
##   <chr>       <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>
## 1 c           0.190 -0.785  1.27 -1.43 -1.82 0.715 -0.593  2.07 -0.228  2.15
#dplyr::filter works too
filter(tbl, x %in% c("a", "f", "i"))
## # A tibble: 3 x 2
##   x     mat_col[,1]    [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]
##   <chr>       <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 a           0.464 -1.12   -1.01   1.73   0.531  2.10   1.44   0.836  0.369
## 2 f           1.67   0.185  -0.807  0.149  0.114  0.240 -0.791  0.418 -2.13 
## 3 i          -1.65   0.0525 -0.501 -1.64  -0.599 -1.04   0.143 -1.83  -0.626
## # … with 1 more variable: [,10] <dbl>

And as we saw above, using matrix-columns in model formulas seems to work consistently as long as the input is expected or allowed to be a matrix.
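
As one more hedged illustration with simulated data, lm() accepts a matrix-column on the right-hand side of a formula and fits one coefficient per column of the matrix. The names `X`, `y`, and `sim` are made up for this sketch:

```r
library(tibble)

set.seed(1)
X <- matrix(rnorm(40), nrow = 20)   # 20 observations, 2 predictors
sim <- tibble(y = rnorm(20), X = X)

# the matrix-column enters the formula as a single term
fit <- lm(y ~ X, data = sim)
names(coef(fit))   # an intercept plus one coefficient per column of X
```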

Saving matrix-columns to disk

Ordinary data frames and tibbles (i.e. without list-columns or matrix-columns) can usually be reliably saved as .csv files.

A tibble with a list-column will throw an error if you try to write it to a .csv file:

df_list_col <- tibble(x = 1:10, y = list(1:10))

write_csv(df_list_col, "test.csv")
## Error: Flat files can't store the list column `y`

Tibbles with matrix-columns don’t throw the same error, but unfortunately this is not because they work correctly.

write_csv(tbl, "test.csv")
read_csv("test.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   x = col_character(),
##   mat_col = col_double()
## )
## # A tibble: 10 x 2
##    x     mat_col
##    <chr>   <dbl>
##  1 a      0.464 
##  2 b      1.82  
##  3 c      0.190 
##  4 d     -1.18  
##  5 e     -0.509 
##  6 f      1.67  
##  7 g     -2.04  
##  8 h     -0.0686
##  9 i     -1.65  
## 10 j     -0.623

As you can see, only the first column of the matrix was saved to the .csv file. If you want to use matrix-columns in your work, you should either create them in the same document as your analysis, or save them as .rds files.
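
A minimal sketch of the .rds route, using base R's saveRDS() and readRDS() with a temporary file (the object names are illustrative):

```r
library(tibble)

m <- matrix(1:6, nrow = 3)
tb <- tibble(id = c("a", "b", "c"), m = m)

# write to a temporary .rds file and read it back
path <- tempfile(fileext = ".rds")
saveRDS(tb, path)
tb2 <- readRDS(path)

# the matrix-column comes back with every column, not just the first
is.matrix(tb2$m)
identical(tb2$m, tb$m)
```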

That’s all for now, but please let me know in the comments if you’ve used matrix-columns in your work!

Postdoctoral Researcher

I’m a postdoctoral researcher in Emilio Bruna’s lab at University of Florida working on the effects of drought and habitat fragmentation on a tropical plant. I’m interested in the mechanisms of plant responses to stress and their consequences for natural and agricultural ecosystems.
