Chemometric Modeling with Tidymodels: A Tutorial for Spectroscopic Data
ssh <- suppressPackageStartupMessages  # short alias used in the Setup code

For this tutorial, we use the beer dataset, which is publicly available and commonly used for spectroscopy-based regression problems. It contains near-infrared spectroscopy (NIRS) data of beer samples alongside measurements of the original gravity, which serves as the target variable. Original gravity (OG) is one of the primary metrics brewers use to estimate the potential alcohol content of a beer, as it reflects the fermentable sugar content available for yeast metabolism. By analyzing OG alongside the NIRS spectra, we can explore how the spectral data correlate with this fundamental brewing property and what they reveal about the chemical composition and quality of the beer samples.
Setup
Below, we use suppressPackageStartupMessages to suppress startup messages for clarity and load essential packages.
tidyverse for data manipulation and visualization.
tidymodels for modeling workflows and machine learning.
tidymodels_prefer() resolves function-name conflicts between tidymodels and other loaded packages in favor of the tidymodels versions.
ssh(library(tidyverse))
ssh(library(tidymodels))
tidymodels_prefer()
Additional libraries include:
kknn: Implements k-nearest neighbors (KNN).
glmnet: Used for elastic net regression.
ranger: Used for random forest.
plsmod: Supports partial least squares (PLS) regression.
magrittr: Provides pipe operators (%>%, %<>%).
patchwork: Simplifies combining ggplot2 plots.
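As a rough illustration of how the modeling packages in this list typically plug into tidymodels, each one can serve as the engine behind a parsnip model specification. The sketch below is an assumption about usage, not code from this tutorial: the object names (knn_spec, enet_spec, rf_spec, pls_spec) and all hyperparameter values are placeholders chosen for illustration.

```r
library(tidymodels)
library(plsmod)  # registers the "mixOmics" engine for parsnip's pls()

# Placeholder hyperparameters: illustrative values only, not tuned settings.
knn_spec <- nearest_neighbor(neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("regression")

enet_spec <- linear_reg(penalty = 0.01, mixture = 0.5) %>%
  set_engine("glmnet")  # mixture = 0.5 blends ridge and lasso penalties

rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")

pls_spec <- pls(num_comp = 3) %>%
  set_engine("mixOmics") %>%
  set_mode("regression")
```

Defining a specification does not fit anything; the engine package is only needed once the specification is fitted to data.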
library(kknn)
library(plsmod)
ssh(library(glmnet))
ssh(library(ranger))
ssh(library(magrittr))
library(patchwork)
We set a custom theme with a clean white background and adjusted sizes for all ggplot2 plots.
base_size = 15
theme_bw(
  base_size = base_size,
  base_line_size = base_size / 22,
  base_rect_size = base_size / 15
) %>%
  theme_set()
Dataset Overview
We begin by loading the beer dataset and identifying the spectral predictor columns, which correspond to the NIRS wavelength variables. I usually store the names of these columns as a character vector called wavelength because it makes data manipulation easier: the vector can be passed to tidyverse selection helpers when selecting, filtering, or grouping columns, and reused in tidymodels preprocessing steps.
beer_data <- read_csv("beer.csv", show_col_types = FALSE)
wavelength <- beer_data %>% select(starts_with("xtrain")) %>% names()
Previewing the first rows of the dataset helps us ensure its integrity and understand its structure.
beer_data %>% head(5) %>% DT::datatable()
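To make the benefit of keeping the spectral column names in a character vector concrete, here is a minimal sketch of reusing that vector for column selection and inside a recipe step. It runs on a small made-up table, toy_data, standing in for beer_data; the column names xtrain.1 to xtrain.4 and the choice of step_normalize() are assumptions for illustration, not part of the original workflow.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical stand-in for beer_data: 6 samples, 4 spectral columns.
set.seed(1)
toy_data <- tibble(
  originalGravity = runif(6, min = 4, max = 19),
  xtrain.1 = rnorm(6),
  xtrain.2 = rnorm(6),
  xtrain.3 = rnorm(6),
  xtrain.4 = rnorm(6)
)

# Same pattern as in the tutorial: collect the spectral column names.
wavelength <- toy_data %>% select(starts_with("xtrain")) %>% names()

# Reuse 1: select only the spectra.
spectra <- toy_data %>% select(all_of(wavelength))

# Reuse 2: point a preprocessing step at exactly those columns.
rec <- recipe(originalGravity ~ ., data = toy_data) %>%
  step_normalize(all_of(wavelength))

# Estimate the step on toy_data and return the processed training data.
baked <- rec %>% prep() %>% bake(new_data = NULL)
```

Wrapping the vector in all_of() makes the selection strict, so a misspelled or missing column name raises an error instead of being silently ignored.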
p <- beer_data %>%
  mutate(spectra_id = paste0("s", 1:80)) %>%
  pivot_longer(
    cols = -c(originalGravity, spectra_id),
    names_to = "wavelength",
    values_to = "intensity"
  ) %>%
  mutate(wavelength = rep(seq(1100, 2250, 2), times = 80)) %>%
  ggplot() +
  aes(x = wavelength, y = intensity, colour = originalGravity, group = spectra_id) +
  geom_line() +
  scale_color_viridis_c(option = "inferno", direction = 1) +
  labs(
    x = "Wavelength [nm]",
    y = "Absorbance [arb. units]",
    title = "NIRS Spectra of Beer Samples",
    subtitle = "Contains 80 samples, measured from 1100 to 2250 nm",
    color = "Original Gravity"
  ) +
  theme_minimal()
plotly::ggplotly(p)
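The plotting code above hard-codes the number of samples (80) and rebuilds the wavelength axis with rep(). As a hedged alternative, the sketch below attaches wavelengths through a small lookup table, so the reshaping does not depend on the sample count. It uses a made-up toy_data with five spectral columns; the column names xtrain.1 to xtrain.5, the helper names (wl_grid, lookup, long_spectra), and the assumption that column order follows the wavelength grid are all illustrative.

```r
library(tidyverse)

# Hypothetical stand-in for beer_data: 3 samples, 5 spectral points.
set.seed(42)
wl_grid <- seq(1100, 1108, by = 2)  # assumed grid; the tutorial uses 1100 to 2250 nm
spectra <- matrix(runif(3 * length(wl_grid)), nrow = 3)
colnames(spectra) <- paste0("xtrain.", seq_along(wl_grid))
toy_data <- bind_cols(
  tibble(originalGravity = c(5.2, 11.8, 17.4)),
  as_tibble(spectra)
)

# Map each spectral column name to its wavelength (assumes columns are in grid order).
spectral_cols <- toy_data %>% select(starts_with("xtrain")) %>% names()
lookup <- tibble(column = spectral_cols, wavelength = wl_grid)

long_spectra <- toy_data %>%
  mutate(spectra_id = paste0("s", row_number())) %>%  # no hard-coded sample count
  pivot_longer(
    cols = all_of(spectral_cols),
    names_to = "column",
    values_to = "intensity"
  ) %>%
  left_join(lookup, by = "column")  # attach the numeric wavelength per column
```

The resulting long_spectra has one row per sample and wavelength, and could be passed to the same ggplot() call used for the full dataset.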