Chemometric Modeling with Tidymodels: A Tutorial for Spectroscopic Data

R
Tidymodels
Chemometrics
Machine Learning
Spectroscopy
In this post, we demonstrate how to build robust chemometric models for spectroscopic data using the Tidymodels framework in R. This workflow is designed to cater to beginners and advanced practitioners alike, offering an end-to-end guide from data preprocessing to model evaluation and interpretation.
Author

Christian L. Goueguel

Published

April 17, 2022

Photo by Robert Lukeman.

For this tutorial, we use the beer dataset, which is publicly available and commonly used for spectroscopy-based regression problems. It contains near-infrared spectroscopy (NIRS) data of beer samples alongside measurements of original gravity (OG), which serves as the target variable. OG is one of the primary metrics brewers use to estimate the potential alcohol content of a beer, because it reflects the fermentable sugar content available for yeast metabolism. By analyzing OG alongside the NIRS spectra, we can explore how the spectral data correlate with this fundamental brewing property, which in turn offers insight into the chemical composition and quality of the beer samples.

Setup

Below, we alias suppressPackageStartupMessages as ssh to silence startup messages, then load the essential packages:

  • tidyverse for data manipulation and visualization.
  • tidymodels for modeling workflows and machine learning.
  • tidymodels_prefer() resolves function-name conflicts between tidymodels and other attached packages in favor of the tidymodels versions.

ssh <- suppressPackageStartupMessages
ssh(library(tidyverse))
ssh(library(tidymodels))
tidymodels_prefer()

Additional libraries include:

  • kknn: Implements k-nearest neighbors (KNN).
  • glmnet: Used for elastic net regression.
  • ranger: Used for random forest.
  • plsmod: Supports partial least squares (PLS) regression.
  • magrittr: Provides pipe operators (%>%, %<>%).
  • patchwork: Simplifies combining ggplot2 plots.
library(kknn)
library(plsmod)
ssh(library(glmnet))
ssh(library(ranger))
ssh(library(magrittr))
library(patchwork)
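
If any of these calls fails, the package is probably not installed. The following is a minimal sketch, not part of the original workflow, for checking availability up front. The names required_pkgs, is_installed, and missing_pkgs are introduced here purely for illustration.

```r
# Packages used in this tutorial (assembled from the library() calls above).
required_pkgs <- c(
  "tidyverse", "tidymodels", "kknn", "plsmod",
  "glmnet", "ranger", "magrittr", "patchwork"
)

# requireNamespace() returns TRUE or FALSE without attaching the package.
is_installed <- vapply(
  required_pkgs, requireNamespace, logical(1), quietly = TRUE
)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All required packages are installed.")
}
```

The install line is left commented out on purpose, since some dependencies may need to come from a repository other than CRAN.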

We set a custom theme with a clean white background and adjusted sizes for all ggplot2 plots.

base_size = 15 
theme_bw(
  base_size = base_size,
  base_line_size = base_size / 22,
  base_rect_size = base_size / 15
  ) %>% 
  theme_set()

Dataset Overview

We begin by loading the beer dataset and identifying the spectral predictor columns, which correspond to the NIRS wavelength variables. I usually store the names of these columns as a character vector named wavelength, because it makes data manipulation easier: the vector can be passed directly to tidyverse selection helpers when selecting, filtering, or grouping columns, and it can be reused in tidymodels preprocessing steps.

beer_data <- read_csv("beer.csv", show_col_types = FALSE)
wavelength <- beer_data %>% select(starts_with("xtrain")) %>% names()
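
To illustrate why a character vector of column names is convenient, here is a small sketch on made-up data. The toy_data tibble, its xtrain.1 to xtrain.4 column names, and all values are assumptions for illustration only; the real column names in beer.csv may differ.

```r
library(dplyr)
library(tibble)

# Toy stand-in for beer_data: 3 samples, 4 spectral columns (values are made up).
toy_data <- tibble(
  originalGravity = c(10.2, 12.5, 15.1),
  xtrain.1 = c(0.21, 0.25, 0.30),
  xtrain.2 = c(0.22, 0.27, 0.31),
  xtrain.3 = c(0.24, 0.28, 0.33),
  xtrain.4 = c(0.23, 0.26, 0.32)
)

# Same idiom as above: collect the predictor column names as a character vector.
toy_wavelength <- toy_data %>% select(starts_with("xtrain")) %>% names()

# Select only the spectral block.
spectra_only <- toy_data %>% select(all_of(toy_wavelength))

# Apply a transformation to every spectral column (here: mean-centering).
centered <- toy_data %>%
  mutate(across(all_of(toy_wavelength), ~ .x - mean(.x)))

# Build a model formula with the response on the left and all spectral
# columns on the right.
model_formula <- reformulate(toy_wavelength, response = "originalGravity")
```

The same vector drives column selection, column-wise transformation, and formula construction, so the spectral block never has to be spelled out by hand.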

Previewing the first rows of the dataset helps us ensure its integrity and understand its structure.

beer_data %>% head(5) %>% DT::datatable()
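
A visual preview can be complemented with a few programmatic checks. The sketch below runs on made-up data with one deliberately missing value. The names toy_data, na_counts, all_numeric, and n_duplicates are hypothetical and not part of the original workflow.

```r
library(dplyr)
library(tibble)

# Toy stand-in for beer_data with one deliberately missing target value.
toy_data <- tibble(
  originalGravity = c(10.2, 12.5, NA),
  xtrain.1 = c(0.21, 0.25, 0.30),
  xtrain.2 = c(0.22, 0.27, 0.31)
)

# Dimensions: number of samples and number of columns.
dim(toy_data)

# Number of missing values in each column.
na_counts <- toy_data %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Are all columns numeric, as expected for spectra plus a numeric target?
all_numeric <- all(vapply(toy_data, is.numeric, logical(1)))

# Number of fully duplicated rows.
n_duplicates <- sum(duplicated(toy_data))
```

Applied to the real dataset, the same checks would use beer_data in place of toy_data.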
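# Give each spectrum an ID, then reshape to long format (one row per sample and wavelength)
p <- beer_data %>% mutate(spectra_id = paste0("s", 1:80)) %>%
  pivot_longer(
    cols = -c(originalGravity, spectra_id),
    names_to = "wavelength",
    values_to = "intensity"
  ) %>%
  # Replace column names with numeric wavelengths: 1100 to 2250 nm in 2 nm steps, repeated per sample
  mutate(wavelength = rep(seq(1100, 2250, 2), times = 80)) %>%
  ggplot() +
  aes(x = wavelength, y = intensity, colour = originalGravity, group = spectra_id) +
  geom_line() +
  scale_color_viridis_c(option = "inferno", direction = 1) +
  labs(
    x = "Wavelength [nm]",
    y = "Absorbance [arb. units]",
    title = "NIRS Spectra of Beer Samples",
    subtitle = "Contains 80 samples, measured from 1100 to 2250 nm",
    color = "Original Gravity") +
  # For this plot only, theme_minimal() replaces the global theme set earlier
  theme_minimal()

# Convert the static ggplot into an interactive plotly widget
plotly::ggplotly(p)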