Importing data from a LI-COR photosynthesis meter into R

The LI-6400XT is a portable device used to measure photosynthesis in plant leaves. As you take measurements by pressing a button on the device, they are recorded into memory. In order to keep track of which measurments go with which plants (or experimental treatments), there is an “add remark” option where you can enter sample information before taking measurements.

When the data are exported, you get a series of .xls files and a plain text file. Both of these have some problems that you’ll have to deal with if you want to read the data into R and use it for statistical analysis or generating reports.

Excel nightmare or text nightmare? Pick your poison.

Figure 1: Excel nightmare or text nightmare? Pick your poison.

Both file types create some problems for easily getting the data into R:

  1. Header information interrupts the data table format. Fortunatly, it’s mostly just information about the instrument configuration that we don’t need.
  2. Untidy handling of remarks. Instead of remarks being in their own column, they appear in the HHMMSS column in the .xls files and in the Obs column in the .txt file! And to indicate that the row is a remark, instead of giving it an observation number in Obs, it just says “Remark =”.
  3. Column headers are spread over two rows. There is a (somewhat mysterious to me) row of “in”s and “out”s under the column headers in the .xls file.
  4. Another problem that you can’t see in Fig. 1 is that I’ve done my measurments in several bouts. This produced two .xls files and a text file with header text inbetween my two sets of measurements.

At this point I had to choose between reading in the .xls files with read_xls() from the readxl package and doing some wrangling from there, or to deal with the text file, which would surely include some regular expression headaches.

For some unknown reason, read_xls() didn’t work on these files, and I had to open them in Excel, then save them as .xlsx files and use read_xlsx() to get them into R. For the sake of full automatedness, I’m going to work through the text file example here.

Tidying up raw text

Figure 2: My approach to wrangling text files generated by the LI-6400XT

My approach is to read in the raw text, tidy it up, then use read_tsv() to get a list of data frames. After that, I planned to combine them into one big data frame and do some more tidying to extract the sample IDs from the remarks. I’ll be using functions from stringr to do the text tidying, and functions from various tidyverse packages to bring it all together into a coherent data frame.

library(tidyverse)
library(stringr)
text.raw <- read_file("licor.txt")

Scrolling through the text a little reveals that, conveniently, the line "$STARTOFDATA$" appears between the header information and the start of the actual data. The headers themselves always begin with "OPEN" followed by a date. I created regular expression patterns for these and used them to split the raw text file first into separate bouts of measurements, then into headers and data, discarding the headers.

header_pattern <- "\"OPEN \\d\\.\\d\\.\\d"
data_pattern <- "\\$STARTOFDATA\\$"

#splits into individual bouts
raw_split <- str_split(text.raw, header_pattern, simplify = TRUE)

#splits further to separate headers from actual data
raw_split2 <- str_split(raw_split, data_pattern, simplify = FALSE)

str(raw_split2)
## List of 3
##  $ : chr ""
##  $ : chr [1:2] "\"\n\"Fri Aug 25 2017 08:29:30\"\n<open><version>\"6.3.4\"</version></open>\n<open><configfile>\"/User/Configs/"| __truncated__ "\n\"Obs\"\t\"HHMMSS\"\t\"FTime\"\t\"EBal?\"\t\"Photo\"\t\"Cond\"\t\"Ci\"\t\"Trmmol\"\t\"VpdL\"\t\"CTleaf\"\t\"A"| __truncated__
##  $ : chr [1:2] "\"\n\"Fri Aug 25 2017 10:32:44\"\n<open><version>\"6.3.4\"</version></open>\n<open><configfile>\"/User/Configs/"| __truncated__ "\n\"Obs\"\t\"HHMMSS\"\t\"FTime\"\t\"EBal?\"\t\"Photo\"\t\"Cond\"\t\"Ci\"\t\"Trmmol\"\t\"VpdL\"\t\"CTleaf\"\t\"A"| __truncated__

It’s a little hard to see here, but now there is a list of 3 elements. The first element contains nothing (because there is nothing before the first header_pattern), the other elements contain two strings—one is the header, the other is the data. Let’s get rid of the headers and the empty list element.

#extract just the second element, the actual data
raw_split3 <- raw_split2 %>%
  map(`[`, 2) %>% #equivalent to doing raw_split2[[i]][2] for every element "i"
  flatten_chr() #converts to a vector

#remove empty elements
raw_split3 <- raw_split3[!is.na(raw_split3)]

Reading in our cleaned text file

Then we can finally read in our cleaned text as a tab-separated (.tsv) file. Here I make use of map() from the purrr package to apply read_tsv() to every string in our raw text vector. skip = 1 gets rid of that weird line of “in”s and “out”s.

input <- raw_split3 %>%
  map(read_tsv, skip = 1)

input.all <- bind_rows(input)
head(input.all, 10)
## # A tibble: 10 x 40
##    Obs        HHMMSS FTime `EBal?` Photo    Cond    Ci Trmmol  VpdL CTleaf
##    <chr>      <time> <dbl>   <int> <dbl>   <dbl> <dbl>  <dbl> <dbl>  <dbl>
##  1 08:30:38 …    NA    NA       NA NA    NA        NA  NA     NA      NA  
##  2 08:32:37 …    NA    NA       NA NA    NA        NA  NA     NA      NA  
##  3 08:40:20 …    NA    NA       NA NA    NA        NA  NA     NA      NA  
##  4 08:40:20 …    NA    NA       NA NA    NA        NA  NA     NA      NA  
##  5 08:40:20 …    NA    NA       NA NA    NA        NA  NA     NA      NA  
##  6 08:40:20 …    NA    NA       NA NA    NA        NA  NA     NA      NA  
##  7 08:42:11 …    NA    NA       NA NA    NA        NA  NA     NA      NA  
##  8 08:43:13 …    NA    NA       NA NA    NA        NA  NA     NA      NA  
##  9 1          08:43   932.       0  6.62  0.0589  202.  0.724  1.25   22.3
## 10 2          08:43   937.       0  6.99  0.0594  193.  0.731  1.25   22.3
## # ... with 30 more variables: Area <int>, BLC_1 <dbl>, StmRat <int>,
## #   BLCond <dbl>, Tair <dbl>, Tleaf <dbl>, TBlk <dbl>, CO2R <dbl>,
## #   CO2S <dbl>, H2OR <dbl>, H2OS <dbl>, RH_R <dbl>, RH_S <dbl>,
## #   Flow <dbl>, PARi <int>, PARo <int>, Press <dbl>, CsMch <dbl>,
## #   HsMch <dbl>, CsMchSD <dbl>, HsMchSD <dbl>, CrMchSD <dbl>,
## #   HrMchSD <dbl>, StableF <dbl>, BLCslope <dbl>, BLCoffst <dbl>,
## #   f_parin <dbl>, f_parout <dbl>, alphaK <dbl>, Status <int>

Extracting useful remarks

The first step is moving remarks into a remark column while keeping the observation numbers in the Obs column. I’m sure there is a more elegant way to do this, but I had recently learned about the safely() function from purrr which allows you to capture errors. I figured I could try converting elements of the Obs column to integers and if it failed, I could use that as a criteria for moving to a new column.

#create a "safe" version of as.integer() that returns a list of a result and error
safe_as.int <- safely(as.integer)
#returns error for text remarks, returns value for integer observation numbers

input.all <- input.all %>% 
  mutate(#create a comment column to indicate if an "Obs" is actually a remark
       comment = is.na(safe_as.int(Obs)$result), 
       #copy those remarks to the remark column
       remark = ifelse(comment == TRUE, Obs, NA),
       #remove remarks from Obs column
       Obs = ifelse(comment == FALSE, Obs, NA)) %>% 
#move the remark column the the begining
select(remark, everything()) %>% 
#remove the comment column.  We're done with it
select(-comment)
head(input.all, 10)
## # A tibble: 10 x 41
##    remark      Obs   HHMMSS FTime `EBal?` Photo    Cond    Ci Trmmol  VpdL
##    <chr>       <chr> <time> <dbl>   <int> <dbl>   <dbl> <dbl>  <dbl> <dbl>
##  1 08:30:38 L… <NA>     NA    NA       NA NA    NA        NA  NA     NA   
##  2 08:32:37 C… <NA>     NA    NA       NA NA    NA        NA  NA     NA   
##  3 08:40:20 L… <NA>     NA    NA       NA NA    NA        NA  NA     NA   
##  4 08:40:20 C… <NA>     NA    NA       NA NA    NA        NA  NA     NA   
##  5 08:40:20 C… <NA>     NA    NA       NA NA    NA        NA  NA     NA   
##  6 08:40:20 F… <NA>     NA    NA       NA NA    NA        NA  NA     NA   
##  7 08:42:11 F… <NA>     NA    NA       NA NA    NA        NA  NA     NA   
##  8 08:43:13 c… <NA>     NA    NA       NA NA    NA        NA  NA     NA   
##  9 <NA>        1     08:43   932.       0  6.62  0.0589  202.  0.724  1.25
## 10 <NA>        2     08:43   937.       0  6.99  0.0594  193.  0.731  1.25
## # ... with 31 more variables: CTleaf <dbl>, Area <int>, BLC_1 <dbl>,
## #   StmRat <int>, BLCond <dbl>, Tair <dbl>, Tleaf <dbl>, TBlk <dbl>,
## #   CO2R <dbl>, CO2S <dbl>, H2OR <dbl>, H2OS <dbl>, RH_R <dbl>,
## #   RH_S <dbl>, Flow <dbl>, PARi <int>, PARo <int>, Press <dbl>,
## #   CsMch <dbl>, HsMch <dbl>, CsMchSD <dbl>, HsMchSD <dbl>, CrMchSD <dbl>,
## #   HrMchSD <dbl>, StableF <dbl>, BLCslope <dbl>, BLCoffst <dbl>,
## #   f_parin <dbl>, f_parout <dbl>, alphaK <dbl>, Status <int>

In the data frame above, you’ll notice that some of the remarks are just me changing parameters of the device, while others are sample IDs (e.g. “08:43:13 c 4 a” is plot c, plant 4, leaf a). I got lucky in my sample naming convention in that the sample IDs are relatively easily distinguishable from other remarks using regular expressions.

#you must replace NA with the literal string "NA" so str_* functions from stringr can deal with it
input.all <- input.all %>% mutate(remark = str_replace_na(remark))

IDpattern <- "[:lower:][:blank:]\\d+[:blank:][:lower:]"
str_view(input.all$remark[1:10], IDpattern)

Now that I’ve figure out a pattern that matches the ID’s I can use str_extract() to move them to a new sampleID column.

input.all <- input.all %>%
  mutate(sampleID = str_extract(remark, IDpattern)) %>% 
  select(sampleID, everything())
head(input.all, 10)
## # A tibble: 10 x 42
##    sampleID remark   Obs   HHMMSS FTime `EBal?` Photo    Cond    Ci Trmmol
##    <chr>    <chr>    <chr> <time> <dbl>   <int> <dbl>   <dbl> <dbl>  <dbl>
##  1 <NA>     08:30:3… <NA>     NA    NA       NA NA    NA        NA  NA    
##  2 <NA>     08:32:3… <NA>     NA    NA       NA NA    NA        NA  NA    
##  3 <NA>     08:40:2… <NA>     NA    NA       NA NA    NA        NA  NA    
##  4 <NA>     08:40:2… <NA>     NA    NA       NA NA    NA        NA  NA    
##  5 <NA>     08:40:2… <NA>     NA    NA       NA NA    NA        NA  NA    
##  6 <NA>     08:40:2… <NA>     NA    NA       NA NA    NA        NA  NA    
##  7 <NA>     08:42:1… <NA>     NA    NA       NA NA    NA        NA  NA    
##  8 c 4 a    08:43:1… <NA>     NA    NA       NA NA    NA        NA  NA    
##  9 <NA>     NA       1     08:43   932.       0  6.62  0.0589  202.  0.724
## 10 <NA>     NA       2     08:43   937.       0  6.99  0.0594  193.  0.731
## # ... with 32 more variables: VpdL <dbl>, CTleaf <dbl>, Area <int>,
## #   BLC_1 <dbl>, StmRat <int>, BLCond <dbl>, Tair <dbl>, Tleaf <dbl>,
## #   TBlk <dbl>, CO2R <dbl>, CO2S <dbl>, H2OR <dbl>, H2OS <dbl>,
## #   RH_R <dbl>, RH_S <dbl>, Flow <dbl>, PARi <int>, PARo <int>,
## #   Press <dbl>, CsMch <dbl>, HsMch <dbl>, CsMchSD <dbl>, HsMchSD <dbl>,
## #   CrMchSD <dbl>, HrMchSD <dbl>, StableF <dbl>, BLCslope <dbl>,
## #   BLCoffst <dbl>, f_parin <dbl>, f_parout <dbl>, alphaK <dbl>,
## #   Status <int>

Fill down

Now, if this were Excel, you could highlight that “c 4 a” and drag the corner down to fill in all the NA’s. In R, you can do exactly this with the fill() function from tidyr.

#get rid of other remarks and fill down the sample ID column
output <- input.all %>% 
  filter(!xor(remark == "NA" , is.na(sampleID))) %>%
  fill(sampleID) %>% 
  #get rid of the rest of the remark rows
  filter(complete.cases(.)) %>% 
  #get rid of the remark column
  select(-remark)
head(output, 10)
## # A tibble: 10 x 41
##    sampleID Obs   HHMMSS FTime `EBal?` Photo   Cond    Ci Trmmol  VpdL
##    <chr>    <chr> <time> <dbl>   <int> <dbl>  <dbl> <dbl>  <dbl> <dbl>
##  1 c 4 a    1     08:43   932.       0  6.62 0.0589  202.  0.724  1.25
##  2 c 4 a    2     08:43   937.       0  6.99 0.0594  193.  0.731  1.25
##  3 c 4 a    3     08:44   942.       0  6.96 0.0596  195.  0.731  1.25
##  4 c 4 b    4     08:44   978.       0  7.08 0.0783  237.  0.941  1.23
##  5 c 4 b    5     08:44   985.       0  6.90 0.0791  242.  0.949  1.23
##  6 c 4 b    6     08:44   990.       0  6.78 0.0801  246.  0.959  1.22
##  7 c 4 c    7     08:45  1030.       0  6.34 0.0654  228.  0.805  1.25
##  8 c 4 c    8     08:45  1037.       0  6.48 0.0648  222.  0.798  1.26
##  9 c 4 c    9     08:45  1044.       0  6.51 0.0664  226.  0.819  1.26
## 10 c 4 c    10    08:45  1050.       0  6.68 0.0678  225.  0.836  1.26
## # ... with 31 more variables: CTleaf <dbl>, Area <int>, BLC_1 <dbl>,
## #   StmRat <int>, BLCond <dbl>, Tair <dbl>, Tleaf <dbl>, TBlk <dbl>,
## #   CO2R <dbl>, CO2S <dbl>, H2OR <dbl>, H2OS <dbl>, RH_R <dbl>,
## #   RH_S <dbl>, Flow <dbl>, PARi <int>, PARo <int>, Press <dbl>,
## #   CsMch <dbl>, HsMch <dbl>, CsMchSD <dbl>, HsMchSD <dbl>, CrMchSD <dbl>,
## #   HrMchSD <dbl>, StableF <dbl>, BLCslope <dbl>, BLCoffst <dbl>,
## #   f_parin <dbl>, f_parout <dbl>, alphaK <dbl>, Status <int>

And finally, we have a cleaned data frame ready for use in analyses! You could go on to separate plot ID, plant ID and leaf ID using separate() from tidyr, and then do any necessary calculations, visualizations, and modeling with the resulting data frame.

Related

comments powered by Disqus