Retrieve chemical retention indices from NIST with {webchem}!

My PhD has involved learning a lot more than I expected about analytical chemistry, and as I’ve been learning, I’ve been trying my best to make my life easier by writing R functions to help me out. Some of those functions have found a loving home in the webchem package, part of rOpenSci.

Papers that use gas chromatography to separate and measure chemicals often include a table of the compounds they found along with experimental retention indices and literature retention indices. A retention index is basically a corrected retention time, where the retention time is how long it took a particular chemical to travel through the gas chromatograph (an instrument designed to separate chemicals) to the detector used to identify the compound (e.g. an FID or mass spectrometer). While the retention time for a particular compound might vary from run to run or between labs, the retention index should be comparable. Retention indices are therefore often used to help identify compounds, and NIST maintains a database of retention indices for researchers to refer to.
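To make "corrected retention time" a bit more concrete, here is a minimal sketch of how a linear (Van den Dool and Kratz) retention index can be computed by interpolating between the two n-alkanes that elute on either side of a compound. The function name `linear_ri()` and the alkane retention times are made up for illustration; they are not part of webchem or taken from real data.

```r
# Linear (temperature-programmed) retention index, interpolated between
# the two n-alkanes that bracket each compound.
#   rt:        retention times of the compounds of interest
#   alkane_rt: retention times of the alkane ladder (sorted, same units as rt)
#   alkane_c:  carbon numbers of those alkanes
linear_ri <- function(rt, alkane_rt, alkane_c) {
  stopifnot(length(alkane_rt) == length(alkane_c), !is.unsorted(alkane_rt))
  # Index of the alkane eluting just before (or with) each compound
  i <- findInterval(rt, alkane_rt, rightmost.closed = TRUE)
  # Compounds eluting outside the ladder cannot be interpolated
  i[i < 1 | i >= length(alkane_rt)] <- NA
  frac <- (rt - alkane_rt[i]) / (alkane_rt[i + 1] - alkane_rt[i])
  100 * (alkane_c[i] + (alkane_c[i + 1] - alkane_c[i]) * frac)
}

# Hypothetical alkane ladder (C9 to C12), retention times in minutes
alkane_c  <- 9:12
alkane_rt <- c(6.0, 8.0, 10.0, 12.0)

linear_ri(c(7.0, 11.5, 20.0), alkane_rt, alkane_c)
# 950, 1175, NA (the last compound elutes after the final alkane)
```

Because the index is expressed relative to the alkanes rather than in minutes, it stays roughly the same even when the absolute retention times shift between runs or instruments.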

Figure 1: An example table including literature retention indices from [Kowalsick, et al. 2014](https://doi.org/10.1016/j.chroma.2014.10.058)

Producing such a table of literature retention indices for potentially hundreds of metabolites by hand can be really tedious!

Enter nist_ri(), a handy function I wrote to scrape retention index tables from NIST. Below, I work through an example of how you might use it. First, you need to install the latest version of webchem. My function isn’t in the latest CRAN release at the time of writing this blog post, but you can install the development version from GitHub like so:

devtools::install_github("ropensci/webchem")

To look up a retention index, you need a CAS number for the chemical (for now, at least; other search methods may be implemented in the future). If you don’t already have CAS numbers, you can get them from chemical names or other identifiers using other functions in webchem.

CASs <- c("83-34-1", "119-36-8", "123-35-3", "19700-21-1")
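Since every lookup hinges on the CAS number, it can be worth checking for typos before sending any requests. The last digit of a CAS number is a check digit, so a quick sanity check is possible offline. webchem has its own `is.cas()` helper for this; the base-R version below is only a sketch of the idea, and `cas_checksum_ok()` is a hypothetical name.

```r
# Check the format and check digit of CAS numbers.
# The check digit equals the sum of the other digits, each multiplied by
# its position counted from the right (1, 2, 3, ...), modulo 10.
cas_checksum_ok <- function(cas) {
  vapply(cas, function(x) {
    # Expected shape: 2-7 digits, 2 digits, 1 check digit
    if (!grepl("^[0-9]{2,7}-[0-9]{2}-[0-9]$", x)) return(FALSE)
    digits <- as.integer(strsplit(gsub("-", "", x), "")[[1]])
    check  <- digits[length(digits)]
    body   <- rev(digits[-length(digits)])
    sum(body * seq_along(body)) %% 10 == check
  }, logical(1), USE.NAMES = FALSE)
}

# The four CAS numbers used in this post, plus one with a deliberate typo
cas_checksum_ok(c("83-34-1", "119-36-8", "123-35-3", "19700-21-1", "83-34-2"))
# TRUE TRUE TRUE TRUE FALSE
```

A passing checksum only means the number is well formed, not that NIST has retention index data for it.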

Load the package and take a look at the help file. You’ll see that we need to choose what type of retention index to scrape, what polarity of column, and what kind of temperature program. If you browse one of the NIST pages for a compound, this will make more sense.

library(webchem)
?nist_ri

Let’s get Van Den Dool & Kratz (AKA “linear”) retention indices for non-polar columns using a temperature ramp. This might take a while, depending on your internet connection and how many CAS numbers you request data for. If a certain type of retention index doesn’t exist for a compound, the function will return NA for all columns but the CAS number.

RIs <- nist_ri(CASs, "linear", "non-polar", "ramp")
head(RIs)
##       CAS      type  phase   RI length      gas substrate diameter
## 1 83-34-1 Capillary  SPB-5 1410     60     <NA>      <NA>     0.32
## 2 83-34-1 Capillary DB-5MS 1380     30   Helium      <NA>     0.25
## 3 83-34-1 Capillary  SE-54 1410     50   Helium      <NA>     0.32
## 4 83-34-1 Capillary   DB-5 1381     30 Hydrogen      <NA>     0.25
## 5 83-34-1 Capillary   DB-5 1389     30 Nitrogen      <NA>     0.25
## 6 83-34-1 Capillary DB-5MS 1399     30     <NA>      <NA>     0.25
##   thickness temp_start temp_end temp_rate hold_start hold_end
## 1      1.00         40      230         3          2       10
## 2      0.25         35      225        10          5       25
## 3        NA         40      240         8          2        5
## 4      0.25         35      270         5         NA       20
## 5      0.25         30      250         3          2        2
## 6      0.25         40      200        10          3       20
##                               reference comment
## 1                 Engel and Ratel, 2007 MSDC-RI
## 2   Lozano P.R., Drake M., et al., 2007 MSDC-RI
## 3    Schlutt B., Moran N., et al., 2007 MSDC-RI
## 4            Alves, Pinto, et al., 2005 MSDC-RI
## 5 Colahan-Sederstrom and Peterson, 2005 MSDC-RI
## 6  Whetstine, Cadwallader, et al., 2005 MSDC-RI
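Because compounds with no matching entries come back as a single row of NAs, it helps to list them and set them aside before filtering or averaging. The sketch below uses a small stand-in data frame, `RIs_demo`, rather than a real query; the CAS number "00-00-0" is a placeholder for a compound with no data, and only the CAS and RI columns are mimicked.

```r
# Stand-in for the output of nist_ri(): two real-looking rows plus one
# all-NA row for a (placeholder) compound with no entries
RIs_demo <- data.frame(
  CAS = c("83-34-1", "83-34-1", "00-00-0"),
  RI  = c(1410, 1380, NA),
  stringsAsFactors = FALSE
)

# CAS numbers for which no retention index was returned
no_data <- unique(RIs_demo$CAS[is.na(RIs_demo$RI)])
no_data

# Keep only rows that actually carry a retention index
RIs_clean <- RIs_demo[!is.na(RIs_demo$RI), ]
RIs_clean
```

Keeping the list of missing compounds around makes it easy to retry them later with a different index type or column polarity.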

You can see there are multiple retention indices (RI) for each CAS number. Let’s filter this down some more using some functions from dplyr and stringr.

library(dplyr)
library(stringr)
RIs_filtered <- RIs %>%
  filter(gas == "Helium",
         between(length, 20, 30),
         str_detect(phase, "5"),
         diameter < 0.3,
         thickness == 0.25)
head(RIs_filtered)
##        CAS      type  phase     RI length    gas substrate diameter
## 1  83-34-1 Capillary DB-5MS 1380.0     30 Helium      <NA>     0.25
## 2  83-34-1 Capillary DB-5MS 1396.0     30 Helium      <NA>     0.25
## 3  83-34-1 Capillary   DB-5 1391.0     30 Helium      <NA>     0.26
## 4 119-36-8 Capillary   DB-5 1201.0     25 Helium      <NA>     0.25
## 5 119-36-8 Capillary HP-5MS 1200.7     30 Helium      <NA>     0.25
## 6 119-36-8 Capillary HP-5MS 1190.0     30 Helium      <NA>     0.25
##   thickness temp_start temp_end temp_rate hold_start hold_end
## 1      0.25         35      225        10          5       25
## 2      0.25         35      200        10          5       30
## 3      0.25         50      300         6          4       20
## 4      0.25         60      200         2         NA       60
## 5      0.25         80      300         4         NA       NA
## 6      0.25         60      280         4          5       NA
##                                 reference comment
## 1     Lozano P.R., Drake M., et al., 2007 MSDC-RI
## 2 Karagül-Yüceer, Vlahovich, et al., 2003 MSDC-RI
## 3                Rostad and Pereira, 1986 MSDC-RI
## 4                 Rout, Rao, et al., 2007 MSDC-RI
## 5                Zeng, Zhao, et al., 2007 MSDC-RI
## 6         Saroglou, Dorizas, et al., 2006 MSDC-RI

Now we could summarise to get an average of all the database entries…

RIs_filtered %>% 
  group_by(CAS) %>% 
  summarise(mean_RI = mean(RI))
## # A tibble: 4 x 2
##   CAS        mean_RI
##   <chr>        <dbl>
## 1 119-36-8     1193.
## 2 123-35-3      990.
## 3 19700-21-1   1430 
## 4 83-34-1      1389
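A mean on its own hides how much the database entries disagree, so it may be worth reporting the number of entries and their spread alongside it. This is a sketch on a small illustrative data frame (`RIs_demo`, built from a few of the RI values shown in the filtered output above), not the full query result.

```r
library(dplyr)

# Illustrative subset: a few RI values per compound
RIs_demo <- data.frame(
  CAS = c("83-34-1", "83-34-1", "83-34-1", "119-36-8", "119-36-8", "119-36-8"),
  RI  = c(1380, 1396, 1391, 1201, 1200.7, 1190),
  stringsAsFactors = FALSE
)

RI_summary <- RIs_demo %>%
  group_by(CAS) %>%
  summarise(
    n       = n(),        # number of database entries averaged
    mean_RI = mean(RI),
    sd_RI   = sd(RI),     # spread between entries
    min_RI  = min(RI),
    max_RI  = max(RI)
  )
RI_summary
```

A large standard deviation or range is a hint that the filter on column and temperature program is still too loose for that compound.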

Or if we wanted to pick a single entry for each CAS number with the median RI, we could do that as well.

best_RIs <- RIs_filtered %>%
  group_by(CAS) %>% 
  filter(RI == median(RI)) %>% 
  filter(row_number() == 1) %>% 
  select(CAS, RI, reference)
best_RIs
## # A tibble: 4 x 3
## # Groups:   CAS [4]
##   CAS           RI reference                             
##   <chr>      <dbl> <chr>                                 
## 1 83-34-1     1391 Rostad and Pereira, 1986              
## 2 119-36-8    1191 Aligiannis, Kalpoutzakis, et al., 2004
## 3 123-35-3     991 Maccioni, Baldini, et al., 2007       
## 4 19700-21-1  1430 Dickschat, Wenzel, et al., 2004
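One caveat with filtering on `RI == median(RI)`: when a compound has an even number of entries, the median is the average of the two middle values and may not match any row, so that compound silently drops out. A possible alternative, sketched here on made-up demo data, is to sort within each group and take the middle row (the lower of the two middle rows when the count is even).

```r
library(dplyr)

# Demo data: three entries for one compound, four for the other
RIs_demo <- data.frame(
  CAS = c(rep("83-34-1", 3), rep("119-36-8", 4)),
  RI  = c(1380, 1396, 1391, 1201, 1200.7, 1190, 1191),
  stringsAsFactors = FALSE
)

# Exact-match approach: the four-entry compound has no row equal to its median
exact_median <- RIs_demo %>%
  group_by(CAS) %>%
  filter(RI == median(RI)) %>%
  ungroup()

# Lower-median approach: always returns exactly one row per compound
lower_median <- RIs_demo %>%
  group_by(CAS) %>%
  arrange(RI, .by_group = TRUE) %>%
  slice(ceiling(n() / 2)) %>%
  ungroup()

exact_median   # only 83-34-1 survives
lower_median   # one row each for 119-36-8 and 83-34-1
```

For compounds with an odd number of entries both approaches pick the same RI value, so this only changes the result where the original filter would have returned nothing.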

You could then easily take this table and *_join() it to your GC/MS data, if you have a column for CAS#, and select the RI and reference columns, for example.

fake.data <- data.frame(CAS = CASs,
                 #Name = cts_convert(CASs, from = "CAS", to = "Chemical Name", first = TRUE),
                 Name = c("skatole", "methyl salicylate", "beta-myrcene", "geosmin"),
                 group_1_conc = round(abs(rnorm(4)), 3),
                 group_2_conc = round(abs(rnorm(4)), 3))

left_join(fake.data, best_RIs, by = "CAS") %>%
  select(CAS, Name, RI, everything()) %>% 
  arrange(RI) %>%
  knitr::kable()
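| CAS        | Name              |   RI | group_1_conc | group_2_conc | reference                              |
|:-----------|:------------------|-----:|-------------:|-------------:|:---------------------------------------|
| 123-35-3   | beta-myrcene      |  991 |        0.148 |        0.045 | Maccioni, Baldini, et al., 2007        |
| 119-36-8   | methyl salicylate | 1191 |        0.569 |        0.259 | Aligiannis, Kalpoutzakis, et al., 2004 |
| 83-34-1    | skatole           | 1391 |        1.139 |        0.445 | Rostad and Pereira, 1986               |
| 19700-21-1 | geosmin           | 1430 |        0.151 |        0.474 | Dickschat, Wenzel, et al., 2004        |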
