---
title: "Publish CalCOFI tables to ERDDAP (EDDTableFromParquetFiles)"
editor_options:
chunk_output_type: console
---
## Overview
Publish CalCOFI tabular data to the CalCOFI ERDDAP server
(`erddap/erddap:latest`, currently v2.30; **EDDTableFromParquetFiles** has been
supported natively since v2.27) as a set of EDDTable datasets backed directly by
Parquet files. For each curated table, this workflow:
1. **Stages** an ERDDAP-friendly Parquet file (`libs/erddap.R::stage_table_for_erddap`):
drops `geom`/provenance columns, exposes `latitude`/`longitude`/`depth` as doubles, and
converts `time` to epoch seconds (ERDDAP has no geometry type and wants numeric time).
2. **Generates** an `EDDTableFromParquetFiles` `<dataset>` block
(`erddap_dataset_xml`) with `<dataType>`/`units`/`ioos_category`/`standard_name`
per variable (units pulled from each table's `metadata.json`).
3. Writes the staged Parquet to `data/erddap/{datasetID}/` and assembles all
blocks into `data/erddap/datasets_calcofi.xml`.
**Deploy** (manual, documented at the end): rsync `data/erddap/*` to the server's
`/share/erddap/datasets/`, splice the XML into `CalCOFI/erddap/content/datasets.xml`,
and reload via flag files.
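
The staging step can be pictured with a small base-R sketch. It is not the body of `stage_table_for_erddap` (that function lives in `libs/erddap.R` and works through DuckDB); the input columns `datetime`, `lat_dec`, `lon_dec` and the helper `stage_sketch` are hypothetical and only illustrate the target shape: no geometry column, coordinates as doubles, and `time` as seconds since 1970-01-01T00:00:00Z.

```r
# Hypothetical input table: POSIXct time, integer depth, and a text geometry column
df <- data.frame(
  cast_id  = c("a", "b"),
  datetime = as.POSIXct(c("2020-01-01 00:00:00", "2020-01-01 00:01:00"), tz = "UTC"),
  lat_dec  = c(32.5, 33.0),
  lon_dec  = c(-117.5, -118.0),
  depth_m  = c(10L, 20L))
df$geom <- c("POINT(-117.5 32.5)", "POINT(-118 33)")

# Illustrative helper (an assumption, not the project's function):
# keep only ERDDAP-friendly columns and coerce them to plain numeric types
stage_sketch <- function(d) {
  data.frame(
    cast_id   = d$cast_id,
    time      = as.numeric(d$datetime),  # POSIXct -> epoch seconds (UTC)
    latitude  = as.numeric(d$lat_dec),
    longitude = as.numeric(d$lon_dec),
    depth     = as.numeric(d$depth_m))   # integer -> double; geom is not carried over
}

staged_demo <- stage_sketch(df)
str(staged_demo)
```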
## Setup
```{r}
#| label: setup
#| message: false
librarian::shelf(DBI, duckdb, dplyr, fs, glue, jsonlite, purrr, readr,
stringr, tibble, here, knitr, quiet = T)
options(readr.show_col_types = F)
source(here("libs/erddap.R"))
dir_erddap <- here("data/erddap")
dir_create(dir_erddap)
con <- dbConnect(duckdb()); dbExecute(con, "INSTALL spatial; LOAD spatial")
```
## Curated Tables
The tables configured here carry (or derive) `time`/`latitude`/`longitude` directly.
Measurement tables keyed only by UUID + depth (e.g. `ctd_measurement`, `bottle_measurement`)
are published via their coordinate-bearing companions (`ctd_wide`, joined views)
in a later pass.
```{r}
#| label: config
cfg <- tribble(
~dataset_id, ~parquet, ~title, ~summary, ~cdm, ~depth_col,
"calcofi_ctd", "data/parquet/calcofi_ctd-cast/ctd_wide.parquet", "CalCOFI CTD Profiles (wide)", "Wide-format CTD profiles (one row per cast/depth) from CalCOFI cruises.", "TrajectoryProfile", "depth_m",
"calcofi_casts", "data/parquet/calcofi_bottle/casts.parquet", "CalCOFI Casts", "Cast-level metadata for CalCOFI CTD/bottle stations.", "Point", "",
"calcofi_dic", "data/parquet/calcofi_dic/dic_sample.parquet", "CalCOFI DIC Samples", "Dissolved inorganic carbon / alkalinity sample positions.", "Point", "depth_m",
"calcofi_euphausiids", "data/parquet/cce-lter_euphausiids/euphausiids_tow.parquet", "CalCOFI Euphausiid Tows", "Euphausiid (krill) abundance net tows, 1951-2019.", "Point", "",
"calcofi_zooplankton", "data/parquet/pic_zooplankton/zooplankton_tow.parquet","CalCOFI Zooplankton Tows", "SIO PIC zooplankton net-tow registry (CalCOFI region).", "Point", "",
"calcofi_phytoplankton","data/parquet/calcofi_phytoplankton/phyto_obs.parquet", "CalCOFI Phytoplankton (Venrick)","Phytoplankton abundance (cells/L) by taxon, 1996-2022 (Venrick, EDI knb-lter-cce.254). Region-pooled: latitude/longitude are provisional region centroids.", "Point", ""
) |>
mutate(parquet = here(parquet),
metadata = here(glue("{path_dir(parquet)}/metadata.json")))
cfg |> select(dataset_id, title, cdm) |> kable()
```
## Stage Parquet + Generate datasets.xml
```{r}
#| label: publish
#| results: asis
# pull a column -> units lookup from a table's metadata.json (keys are "table.column")
units_from_metadata <- function(meta_path) {
if (!file.exists(meta_path)) return(list())
m <- fromJSON(meta_path, simplifyVector = FALSE)
setNames(
# `col` rather than `c`, to avoid shadowing base::c(); missing units become ""
lapply(m$columns, function(col) col$units %||% ""),
sub("^[^.]+\\.", "", names(m$columns)))
}
blocks <- character(0)
for (i in seq_len(nrow(cfg))) {
r <- cfg[i, ]
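  if (!file.exists(r$parquet)) {
    # glue() trims leading/trailing newlines, so let cat() emit them;
    # otherwise the `asis` bullets run together on one line
    cat("\n- SKIP ", r$dataset_id, ": ", r$parquet, " not found\n", sep = "")
    next
  }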
tbl <- tools::file_path_sans_ext(basename(r$parquet))
staged <- stage_table_for_erddap(
con, r$parquet, tbl, file.path(dir_erddap, r$dataset_id),
depth_col = if (nzchar(r$depth_col)) r$depth_col else "depth_m")
units_lk <- units_from_metadata(r$metadata)
xml <- erddap_dataset_xml(
staged, dataset_id = r$dataset_id, title = r$title, summary = r$summary,
file_dir = glue("/datasets/{r$dataset_id}/"),
units_lookup = units_lk, cdm_data_type = r$cdm)
writeLines(xml, file.path(dir_erddap, r$dataset_id, glue("{r$dataset_id}.xml")))
blocks <- c(blocks, xml)
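  # glue() trims leading/trailing newlines, so add them in cat() to keep one bullet per line
  cat("\n", glue("- {r$dataset_id}: staged {nrow(staged)} cols, ",
    "{ifelse('time' %in% staged$column,'time OK','NO TIME')}, ",
    "{ifelse('latitude' %in% staged$column,'lat/lon OK','NO COORDS')}"), "\n", sep = "")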
}
```
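
How the units lookup feeds the per-variable XML can be sketched in base R. This is an illustration under stated assumptions, not the body of `erddap_dataset_xml`: the helpers `xml_escape` and `data_variable_xml`, the column names, and the example units are hypothetical. The `<dataVariable>` layout follows ERDDAP's usual `datasets.xml` structure.

```r
# Parsed metadata shaped like metadata.json: keys are "table.column".
# Column names and units are hypothetical.
m <- list(columns = list(
  "ctd_wide.temperature" = list(units = "degree_C"),
  "ctd_wide.cast_id"     = list(description = "cast identifier")))  # no units

# Same idea as units_from_metadata(): strip the "table." prefix, default to ""
units_lk <- setNames(
  lapply(m$columns, function(col) if (is.null(col$units)) "" else col$units),
  sub("^[^.]+\\.", "", names(m$columns)))

# Escape the characters that are special in XML text
xml_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  gsub(">", "&gt;", x, fixed = TRUE)
}

# Render one <dataVariable>; the units attribute is omitted when the lookup has none
data_variable_xml <- function(name, data_type, units_lookup, ioos_category = "Unknown") {
  u <- units_lookup[[name]]
  atts <- sprintf('    <att name="ioos_category">%s</att>', xml_escape(ioos_category))
  if (!is.null(u) && nzchar(u))
    atts <- c(sprintf('    <att name="units">%s</att>', xml_escape(u)), atts)
  paste(c(
    "<dataVariable>",
    sprintf("  <sourceName>%s</sourceName>", xml_escape(name)),
    sprintf("  <destinationName>%s</destinationName>", xml_escape(name)),
    sprintf("  <dataType>%s</dataType>", data_type),
    "  <addAttributes>", atts, "  </addAttributes>",
    "</dataVariable>"), collapse = "\n")
}

cat(data_variable_xml("temperature", "double", units_lk, "Temperature"), "\n")
```

Leaving out an empty `units` attribute, rather than writing `units=""`, is a design choice of this sketch; it makes sparse `metadata.json` entries easy to spot when hand-tuning later.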
## Assemble datasets.xml
```{r}
#| label: assemble
xml_all <- paste0(
"<!-- CalCOFI EDDTableFromParquetFiles datasets (generated by ",
"publish_calcofi_to_erddap.qmd) -->\n", paste(blocks, collapse = "\n\n"))
writeLines(xml_all, file.path(dir_erddap, "datasets_calcofi.xml"))
cat("wrote", length(blocks), "dataset blocks to data/erddap/datasets_calcofi.xml\n")
```
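
ERDDAP requires every `datasetID` in `datasets.xml` to be unique, so a quick check before splicing can save a failed reload. A minimal sketch follows, assuming each generated block opens with a `<dataset ... datasetID="...">` tag; `blocks_demo` and `dataset_ids` are hypothetical stand-ins, not objects from this workflow.

```r
# Two hypothetical blocks standing in for the generated `blocks` vector
blocks_demo <- c(
  '<dataset type="EDDTableFromParquetFiles" datasetID="calcofi_ctd" active="true">\n</dataset>',
  '<dataset type="EDDTableFromParquetFiles" datasetID="calcofi_casts" active="true">\n</dataset>')
xml_demo <- paste(blocks_demo, collapse = "\n\n")

# Extract every datasetID="..." value from an XML string
dataset_ids <- function(xml) {
  hits <- regmatches(xml, gregexpr('datasetID="[^"]+"', xml))[[1]]
  sub('^datasetID="([^"]+)"$', "\\1", hits)
}

ids  <- dataset_ids(xml_demo)
dups <- unique(ids[duplicated(ids)])
if (length(dups) > 0) stop("duplicate datasetID(s): ", paste(dups, collapse = ", "))
ids
```

The same function could be pointed at the server's existing `datasets.xml` (read with `readLines()` and collapsed) to catch collisions with datasets that are already deployed.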
## Deploy (manual)
````{r}
#| label: deploy-instructions
#| echo: false
#| results: asis
# the chunk fence uses four backticks so the three-backtick fences inside the string do not close it
cat('
To deploy to the live ERDDAP server (CalCOFI/server docker-compose, erddap.calcofi.io):

1. **Sync staged Parquet** to the server data dir (container `/datasets/`):

   ```
   rsync -av data/erddap/ <server>:/share/erddap/datasets/
   ```

   (each `data/erddap/{datasetID}/{table}.parquet` -> `/share/erddap/datasets/{datasetID}/`;
   a trailing slash on each dataset directory, as in `data/erddap/*/`, would flatten all
   files into one directory)

2. **Splice the dataset blocks** in `data/erddap/datasets_calcofi.xml` into
   `CalCOFI/erddap/content/datasets.xml` at the `<!-- add dataset definitions below -->`
   marker, and commit/push the `CalCOFI/erddap` repo (it is bind-mounted into the container).

3. **Reload ERDDAP** per dataset (flag), e.g.:

   ```
   touch /share/erddap/data/flag/calcofi_ctd   # or each datasetID
   ```

   or hit `https://erddap.calcofi.io/erddap/setDatasetFlag.txt?datasetID=calcofi_ctd&flagKey=...`

4. **Verify**: `curl https://erddap.calcofi.io/erddap/tabledap/calcofi_ctd.das`
   and check `/share/erddap/data/logs/log.txt` for any "WARNING: Bad line(s)" Parquet messages.

NOTE: confirm each table\'s time/lat/lon column types; hand-tune units/long_name in
datasets.xml where metadata.json was sparse (Parquet carries no CF metadata).
')
````
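
Steps 3 and 4 repeat once per `datasetID`, so the commands can be generated rather than typed. A small sketch, assuming the flag directory and host shown in the instructions; the `ids` vector here is just an illustrative subset, and the printed lines are meant to be pasted into a shell on the server (flags) or locally (`curl`).

```r
ids <- c("calcofi_ctd", "calcofi_casts")  # illustrative subset of the configured datasetIDs

# One flag file per dataset asks ERDDAP to reload it
flag_cmds <- paste0("touch /share/erddap/data/flag/", ids)

# One .das request per dataset confirms it loaded (-s: quiet, -f: fail on HTTP errors)
das_urls  <- paste0("https://erddap.calcofi.io/erddap/tabledap/", ids, ".das")
curl_cmds <- paste("curl -sf", das_urls, "> /dev/null && echo OK")

writeLines(c(flag_cmds, curl_cmds))
```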
## Cleanup
```{r}
#| label: cleanup
dbDisconnect(con, shutdown = TRUE)
```