Publish CalCOFI tables to ERDDAP (EDDTableFromParquetFiles)

Published

2026-06-08

1 Overview

Publish CalCOFI tabular data to the CalCOFI ERDDAP server as a set of EDDTable datasets backed directly by Parquet files. The server runs the erddap/erddap:latest image (v2.30); EDDTableFromParquetFiles has been supported natively since v2.27.

For each curated table this workflow:

  1. Stages an ERDDAP-friendly Parquet file (libs/erddap.R::stage_table_for_erddap): drops geom/provenance columns and exposes latitude/longitude/depth as doubles and time as epoch seconds (ERDDAP has no geometry type and wants numeric time).
  2. Generates an EDDTableFromParquetFiles <dataset> block (erddap_dataset_xml) with <dataType>/units/ioos_category/standard_name per variable (units pulled from each table’s metadata.json).
  3. Writes the staged Parquet to data/erddap/{datasetID}/ and assembles all blocks into data/erddap/datasets_calcofi.xml.
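
The staging in step 1 happens inside stage_table_for_erddap, which is not shown in this document. As a rough base-R illustration of the same idea (not the actual implementation), the sketch below drops a geometry column, renames coordinates, and converts a timestamp to epoch seconds. The data frame and its column names (geom, datetime_utc, lat_dec, lon_dec) are hypothetical.

```r
# Hypothetical input table; column names are assumptions for illustration only.
raw <- data.frame(
  cast_id      = c("a", "b"),
  datetime_utc = as.POSIXct(c("1970-01-02 00:00:00", "2000-01-01 00:00:00"), tz = "UTC"),
  lat_dec      = c(32.5, 33.0),
  lon_dec      = c(-117.3, -118.1),
  geom         = c("POINT (-117.3 32.5)", "POINT (-118.1 33)"),
  stringsAsFactors = FALSE
)

# Sketch of the staging idea: keep plain columns, add ERDDAP-style coordinates.
stage_sketch <- function(d) {
  drop <- c("geom", "datetime_utc", "lat_dec", "lon_dec")
  out  <- d[, setdiff(names(d), drop), drop = FALSE]
  out$time      <- as.numeric(d$datetime_utc)  # POSIXct -> seconds since 1970-01-01T00:00:00Z
  out$latitude  <- as.double(d$lat_dec)
  out$longitude <- as.double(d$lon_dec)
  out
}

staged_demo <- stage_sketch(raw)
str(staged_demo)
```

In the matching datasets.xml entry, a time column stored this way would normally carry the units "seconds since 1970-01-01T00:00:00Z".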

Deploy (manual, documented at the end): rsync data/erddap/* to the server’s /share/erddap/datasets/, splice the XML into CalCOFI/erddap/content/datasets.xml, and reload via flag files.

2 Setup

librarian::shelf(DBI, duckdb, dplyr, fs, glue, jsonlite, purrr, readr,
                 stringr, tibble, here, knitr, quiet = TRUE)
options(readr.show_col_types = FALSE)
source(here("libs/erddap.R"))

dir_erddap <- here("data/erddap")
dir_create(dir_erddap)
con <- dbConnect(duckdb()); dbExecute(con, "INSTALL spatial; LOAD spatial")
[1] 0

3 Curated Tables

The curated tables below carry (or derive) time/latitude/longitude directly. Measurement tables keyed only by UUID+depth (e.g. ctd_measurement, bottle_measurement) are published via their coordinate-bearing companions (ctd_wide, joined views) in a later pass.

cfg <- tribble(
  ~dataset_id,            ~parquet,                                              ~title,                          ~summary,                                                            ~cdm,                ~depth_col,
  "calcofi_ctd",          "data/parquet/calcofi_ctd-cast/ctd_wide.parquet",      "CalCOFI CTD Profiles (wide)",   "Wide-format CTD profiles (one row per cast/depth) from CalCOFI cruises.", "TrajectoryProfile", "depth_m",
  "calcofi_casts",        "data/parquet/calcofi_bottle/casts.parquet",           "CalCOFI Casts",                 "Cast-level metadata for CalCOFI CTD/bottle stations.",                "Point",             "",
  "calcofi_dic",          "data/parquet/calcofi_dic/dic_sample.parquet",         "CalCOFI DIC Samples",           "Dissolved inorganic carbon / alkalinity sample positions.",          "Point",             "depth_m",
  "calcofi_euphausiids",  "data/parquet/cce-lter_euphausiids/euphausiids_tow.parquet", "CalCOFI Euphausiid Tows", "Euphausiid (krill) abundance net tows, 1951-2019.",                  "Point",             "",
  "calcofi_zooplankton",  "data/parquet/pic_zooplankton/zooplankton_tow.parquet","CalCOFI Zooplankton Tows",      "SIO PIC zooplankton net-tow registry (CalCOFI region).",             "Point",             "",
  "calcofi_phytoplankton","data/parquet/calcofi_phytoplankton/phyto_obs.parquet", "CalCOFI Phytoplankton (Venrick)","Phytoplankton abundance (cells/L) by taxon, 1996-2022 (Venrick, EDI knb-lter-cce.254). Region-pooled: latitude/longitude are provisional region centroids.", "Point",       ""
) |>
  mutate(parquet = here(parquet),
         metadata = here(glue("{path_dir(parquet)}/metadata.json")))

cfg |> select(dataset_id, title, cdm) |> kable()
| dataset_id            | title                           | cdm               |
|:----------------------|:--------------------------------|:------------------|
| calcofi_ctd           | CalCOFI CTD Profiles (wide)     | TrajectoryProfile |
| calcofi_casts         | CalCOFI Casts                   | Point             |
| calcofi_dic           | CalCOFI DIC Samples             | Point             |
| calcofi_euphausiids   | CalCOFI Euphausiid Tows         | Point             |
| calcofi_zooplankton   | CalCOFI Zooplankton Tows        | Point             |
| calcofi_phytoplankton | CalCOFI Phytoplankton (Venrick) | Point             |

4 Stage Parquet + Generate datasets.xml

# pull a column -> units lookup from a table's metadata.json (keys are "table.column")
units_from_metadata <- function(meta_path) {
  if (!file.exists(meta_path)) return(list())
  m <- fromJSON(meta_path, simplifyVector = FALSE)
  setNames(
    lapply(m$columns, function(c) c$units %||% ""),
    sub("^[^.]+\\.", "", names(m$columns)))
}

blocks <- character(0)
for (i in seq_len(nrow(cfg))) {
  r <- cfg[i, ]
  if (!file.exists(r$parquet)) {
    # glue() trims a single leading/trailing newline, so let cat() add the line break
    cat(glue("- SKIP {r$dataset_id}: {r$parquet} not found"), "\n", sep = "")
    next
  }
  tbl <- tools::file_path_sans_ext(basename(r$parquet))
  # stage the Parquet file; an empty depth_col in cfg falls back to "depth_m"
  staged <- stage_table_for_erddap(
    con, r$parquet, tbl, file.path(dir_erddap, r$dataset_id),
    depth_col = if (nzchar(r$depth_col)) r$depth_col else "depth_m")
  units_lk <- units_from_metadata(r$metadata)
  # build the <dataset> block; file_dir is the path as seen inside the container
  xml <- erddap_dataset_xml(
    staged, dataset_id = r$dataset_id, title = r$title, summary = r$summary,
    file_dir = glue("/datasets/{r$dataset_id}/"),
    units_lookup = units_lk, cdm_data_type = r$cdm)
  writeLines(xml, file.path(dir_erddap, r$dataset_id, glue("{r$dataset_id}.xml")))
  blocks <- c(blocks, xml)
  # one report line per dataset
  cat(glue("- {r$dataset_id}: staged {nrow(staged)} cols, ",
           "{ifelse('time' %in% staged$column,'time OK','NO TIME')}, ",
           "{ifelse('latitude' %in% staged$column,'lat/lon OK','NO COORDS')}"),
      "\n", sep = "")
}
  • calcofi_ctd: staged 79 cols, time OK, lat/lon OK
  • calcofi_casts: staged 32 cols, time OK, lat/lon OK
  • calcofi_dic: staged 9 cols, time OK, lat/lon OK
  • calcofi_euphausiids: staged 14 cols, time OK, lat/lon OK
  • calcofi_zooplankton: staged 25 cols, time OK, lat/lon OK
  • calcofi_phytoplankton: staged 14 cols, time OK, lat/lon OK
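
To make the "table.column" key handling in units_from_metadata and the per-variable XML more concrete, here is a minimal sketch. The metadata entries are made up, and data_variable_xml is a hypothetical helper, not the erddap_dataset_xml function from libs/erddap.R. The element names follow the usual datasets.xml layout for a dataVariable; the sketch does no XML escaping.

```r
# Stand-in for a parsed metadata.json; keys and units are illustrative only.
m <- list(columns = list(
  "ctd_wide.temperature" = list(units = "degree_C"),
  "ctd_wide.cast_id"     = list()                 # no units recorded
))

# Same lookup idea as units_from_metadata(): strip the "table." prefix from keys.
units_lk <- setNames(
  lapply(m$columns, function(col) if (is.null(col$units)) "" else col$units),
  sub("^[^.]+\\.", "", names(m$columns)))         # "ctd_wide.temperature" -> "temperature"

# Hypothetical helper: one <dataVariable> block for a single column.
data_variable_xml <- function(name, units, data_type = "double") {
  units_att <- if (nzchar(units)) sprintf('\n      <att name="units">%s</att>', units) else ""
  sprintf(paste0(
    "  <dataVariable>\n",
    "    <sourceName>%s</sourceName>\n",
    "    <destinationName>%s</destinationName>\n",
    "    <dataType>%s</dataType>\n",
    "    <addAttributes>%s\n",
    "    </addAttributes>\n",
    "  </dataVariable>"), name, name, data_type, units_att)
}

cat(data_variable_xml("temperature", units_lk[["temperature"]]), "\n")
```

Columns with no units in metadata.json end up with an empty string in the lookup, which is why those variables may need hand-tuning after generation.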

5 Assemble datasets.xml

xml_all <- paste0(
  "<!-- CalCOFI EDDTableFromParquetFiles datasets (generated by ",
  "publish_calcofi_to_erddap.qmd) -->\n", paste(blocks, collapse = "\n\n"))
writeLines(xml_all, file.path(dir_erddap, "datasets_calcofi.xml"))
cat("wrote", length(blocks), "dataset blocks to data/erddap/datasets_calcofi.xml\n")
wrote 6 dataset blocks to data/erddap/datasets_calcofi.xml

6 Deploy (manual)

To deploy to the live ERDDAP server (CalCOFI/server docker-compose, erddap.calcofi.io):

  1. Sync staged parquet to the server data dir (container /datasets/):

    rsync -av data/erddap/* <server>:/share/erddap/datasets/

    (each data/erddap/{datasetID}/{table}.parquet -> /share/erddap/datasets/{datasetID}/)

  2. Splice the dataset blocks in data/erddap/datasets_calcofi.xml into CalCOFI/erddap/content/datasets.xml at the <!-- add dataset definitions below --> marker, and commit/push the CalCOFI/erddap repo (it is bind-mounted into the container).

  3. Reload ERDDAP per dataset (flag), e.g.:

    touch /share/erddap/data/flag/calcofi_ctd        # or each datasetID

    or hit https://erddap.calcofi.io/erddap/setDatasetFlag.html?datasetID=calcofi_ctd&flagKey=…

  4. Verify: curl https://erddap.calcofi.io/erddap/tabledap/calcofi_ctd.das and check /share/erddap/data/log.txt for any “WARNING: Bad line(s)” Parquet messages.
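
The splice in step 2 can be scripted. The sketch below is one possible approach, not part of the existing workflow: splice_blocks is a hypothetical helper that inserts the generated blocks directly after the marker comment, and the demo runs on temporary files rather than the real CalCOFI/erddap/content/datasets.xml.

```r
# Hypothetical helper: insert generated dataset blocks after a marker comment.
splice_blocks <- function(datasets_xml, blocks_xml,
                          marker = "<!-- add dataset definitions below -->") {
  lines <- readLines(datasets_xml, warn = FALSE)
  i <- grep(marker, lines, fixed = TRUE)
  if (length(i) != 1) stop("expected exactly one marker line, found ", length(i))
  new <- readLines(blocks_xml, warn = FALSE)
  writeLines(append(lines, new, after = i), datasets_xml)
  invisible(length(new))   # number of lines inserted
}

# Demo on temporary files (placeholders, not the real repo files).
tmp_ds <- tempfile(fileext = ".xml")
tmp_bl <- tempfile(fileext = ".xml")
writeLines(c("<erddapDatasets>",
             "<!-- add dataset definitions below -->",
             "</erddapDatasets>"), tmp_ds)
writeLines('<dataset type="EDDTableFromParquetFiles" datasetID="demo" active="true"></dataset>',
           tmp_bl)
splice_blocks(tmp_ds, tmp_bl)
spliced <- readLines(tmp_ds)
print(spliced)
```

The helper is not idempotent: running it twice inserts the blocks twice, so previously spliced blocks with the same datasetID would have to be removed first.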

NOTE: confirm each table’s time/lat/lon column types; hand-tune units/long_name in datasets.xml where metadata.json was sparse (Parquet carries no CF metadata).
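
A quick type check along the lines of the note above could look like the following sketch. check_coord_types is a hypothetical helper, and the demo data frame stands in for a staged table that has already been read into R.

```r
# Hypothetical helper: TRUE where a coordinate column exists and is a plain double.
# is.numeric() is FALSE for POSIXct/Date, so un-converted timestamps are flagged too.
check_coord_types <- function(d, cols = c("time", "latitude", "longitude")) {
  vapply(cols, function(nm) {
    nm %in% names(d) && is.numeric(d[[nm]]) && is.double(d[[nm]])
  }, logical(1))
}

# Demo: longitude was (wrongly) left as text, so it fails the check.
demo <- data.frame(time = 86400, latitude = 32.5, longitude = "-117.3",
                   stringsAsFactors = FALSE)
coord_ok <- check_coord_types(demo)
print(coord_ok)
```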

7 Cleanup

dbDisconnect(con, shutdown = TRUE)