---
title: "Ingest NOAA CalCOFI Database"
---
## Overview
**Goal**: Generate the database from source files with workflow scripts
to make updating easier and provenance fully transparent. This allows us
to:
- Rename tables and columns, control data types, and use Unicode
encoding for a consistent database ingestion strategy, per the database
naming conventions in [Database – CalCOFI.io
Docs](https://calcofi.io/docs/db.html).
- Differentiate between raw and derived or updated tables and columns.
For instance, the taxonomy for any given species can change over
time, such as the lumping or splitting of a given taxon, and differs by
taxonomic authority (e.g., WoRMS, ITIS or GBIF). These taxonomic
identifiers and the full taxonomic hierarchy should be updated
regularly, independent of the source observational data, and can
either be updated in the table directly or joined one-to-one with a
separate table in a materialized view (so as not to slow down queries
with a regular view).
This workflow processes NOAA CalCOFI database CSV files and outputs
parquet files. The workflow:
1. Reads CSV files from source directory (with GCS archive sync)
2. Loads into local wrangling DuckDB with transformations
3. Restructures primary keys (natural keys + sequential IDs)
4. Creates lookup table and consolidates ichthyo tables
5. Validates data quality and flags issues
6. Exports to parquet files (for later integration into Working DuckLake)
```{mermaid}
%%| label: overview
%%| fig-cap: "Overview diagram of CSV ingestion process into the database."
%%| file: diagrams/ingest_noaa-calcofi-db_overview.mmd
```
See also [5.3 Ingest datasets with documentation – Database – CalCOFI.io Docs](https://calcofi.io/docs/db.html#ingest-datasets-with-documentation) for a generic overview of the ingestion process.
```{r}
#| label: setup
# devtools::install_local(here::here("../calcofi4db"), force = T)
devtools::load_all(here::here("../calcofi4db"))
devtools::load_all(here::here("../calcofi4r"))
# options(error=NULL)
librarian::shelf(
CalCOFI/calcofi4db,
CalCOFI/calcofi4r,
DBI, dm, dplyr, DT, fs, glue, gargle, googledrive, here, htmltools, janitor,
jsonlite, knitr, listviewer, litedown, lubridate, purrr, readr, rlang, sf,
stringr, tibble, tidyr, units, uuid, webshot2,
quiet = T)
options(readr.show_col_types = F)
options(DT.options = list(scrollX = TRUE))
# define paths
dataset <- "calcofi-db"
provider <- "swfsc.noaa.gov"
dataset_name <- "NOAA CalCOFI Database"
dir_data <- "~/My Drive/projects/calcofi/data-public"
overwrite <- TRUE
dir_parquet <- here(glue("data/parquet/{provider}_{dataset}"))
db_path <- here(glue("data/wrangling/{provider}_{dataset}.duckdb"))
if (overwrite){
if (file_exists(db_path)) file_delete(db_path)
if (dir_exists(dir_parquet)) dir_delete(dir_parquet)
}
dir_create(dirname(db_path))
con <- get_duckdb_con(db_path)
load_duckdb_extension(con, "spatial")
# load data using calcofi4db package
# - reads from local Google Drive mount
# - syncs to GCS archive if files changed (creates new timestamped archive)
# - tracks GCS archive path for provenance
d <- read_csv_files(
provider = provider,
dataset = dataset,
dir_data = dir_data,
sync_archive = TRUE,
metadata_dir = here("metadata")) # workflows/metadata/{provider}/{dataset}/
# show source files summary
message(glue("Loaded {nrow(d$source_files)} tables from {d$paths$dir_csv}"))
message(glue("Total rows: {sum(d$source_files$nrow)}"))
```
## Check for any mismatched tables and fields
```{r}
#| label: data_integrity_checkpoint
#| output: asis
# check data integrity - detects mismatches and controls chunk evaluation
integrity <- check_data_integrity(
d = d,
dataset_name = dataset_name,
halt_on_fail = TRUE)
# render the pass/fail message
render_integrity_message(integrity)
```
## Show Source Files
```{r}
#| label: source-files
show_source_files(d)
```
## Show CSV Tables and Fields to Ingest
```{r}
#| label: tbls_in
d$d_csv$tables |>
datatable(caption = "Tables to ingest.")
```
```{r}
#| label: flds_in
d$d_csv$fields |>
datatable(caption = "Fields to ingest.")
```
## Show tables and fields redefined
```{r}
#| label: tbls_rd
show_tables_redefine(d)
```
```{r}
#| label: flds_rd
show_fields_redefine(d)
```
## Load Tables into Database
```{r}
#| label: load_tbls_to_db
# use ingest_dataset() which handles:
# - transform_data() for applying redefinitions
# - provenance tracking via gcs_path from read_csv_files()
# - automatic uuid column detection
# - ingest_to_working() for each table
tbl_stats <- ingest_dataset(
con = con,
d = d,
mode = if (overwrite) "replace" else "append",
verbose = TRUE)
tbl_stats |>
datatable(rownames = FALSE, filter = "top")
```
## Restructure Primary Keys
This section creates natural keys (where appropriate) and assigns sequential integer IDs
with deterministic sort order for reproducibility.
### Create cruise_key (natural key)
The cruise_key is a natural key in format YYMMKK (2-digit year + 2-digit month + ship_key).
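As a display-only sketch (the authoritative logic is `calcofi4db::create_cruise_key()`, called below), the key can be thought of as pasting the 2-digit year and month from `date_ym` onto `ship_key`; the example rows and ship codes here are hypothetical, and `date_ym` is assumed to parse as a Date.

```r
# display-only sketch, not the calcofi4db implementation:
# YYMMKK = 2-digit year + 2-digit month + 2-letter ship_key
library(dplyr)
library(lubridate)

tibble::tibble(
  date_ym  = as.Date(c("1951-02-01", "2019-11-01")),
  ship_key = c("NH", "SH")) |>           # hypothetical ship codes
  mutate(
    cruise_key = sprintf(
      "%02d%02d%s",
      as.integer(year(date_ym)) %% 100,  # 2-digit year
      as.integer(month(date_ym)),        # 2-digit month
      ship_key))
#> cruise_key: "5102NH", "1911SH"
```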
```{r}
#| label: create_cruise_key
# create cruise_key as natural primary key (YYMMKK format)
create_cruise_key(
con,
cruise_tbl = "cruise",
ship_tbl = "ship",
date_col = "date_ym")
# verify uniqueness
cruise_keys <- tbl(con, "cruise") |> pull(cruise_key)
stopifnot("cruise_key must be unique" = !any(duplicated(cruise_keys)))
# show sample cruise keys
tbl(con, "cruise") |>
select(cruise_uuid, cruise_key, ship_key, date_ym) |>
head(10) |>
collect() |>
datatable(caption = "Sample cruise_key values (YYMMKK format)")
```
### Propagate cruise_key to child tables
Before assigning sequential IDs, propagate the natural key to child tables that need it for sorting.
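Conceptually, `propagate_natural_key()` amounts to a left join of the parent's natural key onto the child via the shared UUID; a display-only sketch using the column names from this workflow:

```r
# display-only sketch of the join performed (in-database) by propagate_natural_key()
site_with_key <- tbl(con, "site") |>
  left_join(
    tbl(con, "cruise") |> select(cruise_uuid, cruise_key),
    by = "cruise_uuid")
```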
```{r}
#| label: propagate_cruise_key
# propagate cruise_key from cruise to site (via cruise_uuid)
propagate_natural_key(
con = con,
child_tbl = "site",
parent_tbl = "cruise",
key_col = "cruise_key",
join_col = "cruise_uuid")
# verify cruise_key is now in site
tbl(con, "site") |>
select(site_uuid, cruise_uuid, cruise_key, orderocc) |>
head(10) |>
collect() |>
datatable(caption = "Sample site rows with cruise_key")
```
### Assign sequential integer IDs
For tables without natural keys, assign sequential integers with explicit sort order
for reproducibility when new data is appended.
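A display-only sketch of the idea behind `assign_sequential_ids()`: sort on explicit columns, then number the rows, so re-running the workflow reproduces the same IDs (the package function does this in-database).

```r
# display-only sketch: deterministic integer IDs from an explicit sort order
tbl(con, "site") |>
  collect() |>
  arrange(cruise_key, orderocc) |>   # explicit, reproducible sort order
  mutate(site_id = row_number())     # sequential integer ID
```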
```{r}
#| label: assign_sequential_ids
# site_id: sorted by cruise_key, orderocc
assign_sequential_ids(
con = con,
table_name = "site",
id_col = "site_id",
sort_cols = c("cruise_key", "orderocc"))
# propagate site_id to tow (via site_uuid)
propagate_natural_key(
con = con,
child_tbl = "tow",
parent_tbl = "site",
key_col = "site_id",
join_col = "site_uuid")
# tow_id: sorted by site_id, time_start
assign_sequential_ids(
con = con,
table_name = "tow",
id_col = "tow_id",
sort_cols = c("site_id", "time_start"))
# propagate tow_id to net (via tow_uuid)
propagate_natural_key(
con = con,
child_tbl = "net",
parent_tbl = "tow",
key_col = "tow_id",
join_col = "tow_uuid")
# net_id: sorted by tow_id, side
assign_sequential_ids(
con = con,
table_name = "net",
id_col = "net_id",
sort_cols = c("tow_id", "side"))
```
## Create Lookup Table
Create a unified lookup table from the vocabularies for egg stages, larva stages, and tow types.
```{r}
#| label: create_lookup_table
# egg stage vocabulary (Moser & Ahlstrom, 1985)
egg_stage_vocab <- tibble(
stage_int = 1:11,
stage_description = c(
"egg, stage 1 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 2 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 3 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 4 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 5 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 6 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 7 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 8 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 9 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 10 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 11 of 11 (Moser & Ahlstrom, 1985)"))
# larva stage vocabulary
larva_stage_vocab <- tibble(
stage_int = 1:5,
stage_txt = c("YOLK", "PREF", "FLEX", "POST", "TRNS"),
stage_description = c(
"larva, yolk sac",
"larva, preflexion",
"larva, flexion",
"larva, postflexion",
"larva, transformation"))
# tow type vocabulary (from tow_type table)
tow_type_vocab <- tbl(con, "tow_type") |>
collect() |>
mutate(
lookup_num = row_number(),
lookup_chr = tow_type_key) |>
select(lookup_num, lookup_chr, description)
# create unified lookup table
create_lookup_table(
con = con,
egg_stage_vocab = egg_stage_vocab,
larva_stage_vocab = larva_stage_vocab,
tow_type_vocab = tow_type_vocab)
# show lookup table
tbl(con, "lookup") |>
collect() |>
datatable(caption = "Lookup table with vocabularies")
```
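For reference, a display-only sketch of how `create_lookup_table()` plausibly stacks these vocabularies into the unified table shown above; the column names (`category`, `lookup_num`, `lookup_chr`, `description`) follow the `lookup` entity in the consolidation diagram in the next section, and the exact assembly is an assumption.

```r
# display-only sketch of assembling the unified lookup (assumed structure;
# create_lookup_table() is authoritative)
lookup_sketch <- bind_rows(
  egg_stage_vocab |>
    transmute(category = "egg_stage", lookup_num = stage_int,
              lookup_chr = NA_character_, description = stage_description),
  larva_stage_vocab |>
    transmute(category = "larva_stage", lookup_num = stage_int,
              lookup_chr = stage_txt, description = stage_description),
  tow_type_vocab |>
    mutate(category = "tow_type")) |>
  mutate(lookup_id = row_number())
```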
## Consolidate to Tidy Ichthyo Table
Consolidate 5 ichthyoplankton tables (`egg`, `egg_stage`, `larva`, `larva_stage`, `larva_size`) into
a single tidy table.
```{mermaid}
%%| label: consolidate_ichthyo_diagram
%%| fig-cap: "Consolidate 5 ichthyoplankton tables into a single tidy ichthyo table with lookup."
erDiagram
egg {
uuid net_uuid FK
int species_id FK
int tally
}
egg_stage {
uuid net_uuid FK
int species_id FK
int stage
int tally
}
larva {
uuid net_uuid FK
int species_id FK
int tally
}
larva_stage {
uuid net_uuid FK
int species_id FK
str stage
int tally
}
larva_size {
uuid net_uuid FK
int species_id FK
dbl length_mm
int tally
}
ichthyo {
int ichthyo_id PK
int net_id FK
int species_id FK
str life_stage
str measurement_type
dbl measurement_value
int tally
}
lookup {
int lookup_id PK
str category
int lookup_num
str lookup_chr
str description
}
egg ||--|{ ichthyo : "life_stage = egg"
egg_stage ||--|{ ichthyo : "life_stage = egg, type = stage"
larva ||--|{ ichthyo : "life_stage = larva"
larva_stage ||--|{ ichthyo : "life_stage = larva, type = stage"
larva_size ||--|{ ichthyo : "life_stage = larva, type = size"
ichthyo }o--|| lookup : "measurement_value"
```
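A display-only sketch of the long ("tidy") reshaping that `consolidate_ichthyo_tables()` performs in-database, using column names from the diagram above; the `measurement_type` labels here are illustrative, not the package's exact vocabulary.

```r
# display-only sketch: each source table contributes rows keyed by
# net_uuid + species_id, with stage/size recast as measurement columns
ichthyo_sketch <- bind_rows(
  tbl(con, "egg_stage") |> collect() |>
    transmute(net_uuid, species_id, tally,
              life_stage = "egg", measurement_type = "stage",
              measurement_value = as.numeric(stage)),
  tbl(con, "larva_size") |> collect() |>
    transmute(net_uuid, species_id, tally,
              life_stage = "larva", measurement_type = "size",
              measurement_value = length_mm))
# egg, larva and larva_stage stack on in the same way
```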
```{r}
#| label: consolidate_ichthyo
message("Consolidating ichthyoplankton tables...")
# consolidate all ichthyo tables (outputs net_uuid, not net_id)
consolidate_ichthyo_tables(
con = con,
output_tbl = "ichthyo",
larva_stage_vocab = larva_stage_vocab)
# convert net_uuid to net_id (join with net table)
replace_uuid_with_id(
con = con,
table_name = "ichthyo",
uuid_col = "net_uuid",
new_id_col = "net_id",
ref_table = "net",
ref_uuid_col = "net_uuid",
ref_id_col = "net_id")
# assign ichthyo_id with deterministic sort order
assign_sequential_ids(
con = con,
table_name = "ichthyo",
id_col = "ichthyo_id",
sort_cols = c("net_id", "species_id", "life_stage", "measurement_type", "measurement_value"))
# show sample rows
tbl(con, "ichthyo") |>
head(20) |>
collect() |>
datatable(caption = "Sample ichthyo table rows (tidy format)")
```
```{r}
#| label: ichthyo_summary
# summarize ichthyo table
ichthyo_summary <- tbl(con, "ichthyo") |>
group_by(life_stage, measurement_type) |>
summarize(
n_rows = n(),
n_species = n_distinct(species_id),
total_tally = sum(tally, na.rm = TRUE),
.groups = "drop") |>
collect()
ichthyo_summary |>
arrange(life_stage, measurement_type) |>
datatable(caption = HTML(mark("The `ichthyo` table summary by life_stage and measurement_type"))) |>
formatCurrency(
columns = c("n_rows", "n_species", "total_tally"),
currency = "",
digits = 0)
```
## Data Quality Improvements
This section applies data corrections and validates referential integrity.
### Data Corrections
Apply known data corrections identified by data managers.
```{r}
#| label: apply_corrections
# apply data corrections
apply_data_corrections(con, verbose = TRUE)
```
### Validate Referential Integrity
Run validation checks and flag invalid rows for review.
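Each `"fk"` validation configured below is essentially an anti-join against the reference table; a display-only sketch for the species check:

```r
# display-only sketch of the species_id foreign-key check:
# ichthyo rows whose species_id has no match in species get flagged
orphan_species <- tbl(con, "ichthyo") |>
  anti_join(tbl(con, "species"), by = "species_id") |>
  collect()
# any rows here would be written to data/flagged/orphan_species.csv
```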
```{r}
#| label: validate_integrity
# ensure flagged directory exists
dir_flagged <- here("data/flagged")
if (!dir.exists(dir_flagged)) {
dir.create(dir_flagged, recursive = TRUE)
}
# validate egg stages (must be 1-11)
invalid_egg_stages <- validate_egg_stages(con, "egg_stage", "stage")
invalid_egg_stages_csv <- file.path(dir_flagged, "invalid_egg_stages.csv")
invalid_egg_stages_desc <- "Egg stage values NOT 1 to 11 (ie, not in Moser & Ahlstrom 1985 vocab)"
if (nrow(invalid_egg_stages) > 0) {
flag_invalid_rows(
invalid_rows = invalid_egg_stages,
output_path = invalid_egg_stages_csv,
description = invalid_egg_stages_desc)
}
show_flagged_file(invalid_egg_stages, invalid_egg_stages_csv, invalid_egg_stages_desc)
# define validation checks
validations <- list(
list(
type = "fk",
data_tbl = "ichthyo",
col = "species_id",
ref_tbl = "species",
ref_col = "species_id",
output_file = "orphan_species.csv",
description = "Species IDs not found in species table"),
list(
type = "fk",
data_tbl = "ichthyo",
col = "net_id",
ref_tbl = "net",
ref_col = "net_id",
output_file = "orphan_nets.csv",
description = "Net IDs not found in net table"))
# run validations
validation_results <- validate_dataset(
con = con,
validations = validations,
output_dir = dir_flagged)
# show validation summary with GitHub links
show_validation_results(validation_results)
```
```{r}
#| label: delete_flagged
# optionally delete flagged rows (dry run first)
if (validation_results$total_flagged > 0) {
message(glue("Found {validation_results$total_flagged} invalid rows"))
# dry run to see what would be deleted
delete_stats <- delete_flagged_rows(
con = con,
validation_results = validation_results,
dry_run = TRUE)
delete_stats |> datatable(caption = "Rows to be deleted (dry run)")
# then actually delete the flagged rows
delete_flagged_rows(con, validation_results, dry_run = FALSE)
}
```
### Drop Deprecated Tables and Columns
The source tables have been consolidated into `ichthyo` (tidy format) and `lookup` (vocabularies).
The `*_uuid` columns are preserved for provenance in `_source_uuid`, so they can be dropped.
Drop the deprecated tables and columns before creating the schema diagram.
```{r}
#| label: drop_deprecated
# tables consolidated into ichthyo
deprecated_ichthyo <- c("egg", "egg_stage", "larva", "larva_stage", "larva_size")
# tables consolidated into lookup
deprecated_lookup <- c("tow_type")
# all deprecated tables
deprecated_tables <- c(deprecated_ichthyo, deprecated_lookup)
# drop each deprecated table
for (tbl_name in deprecated_tables) {
if (tbl_name %in% DBI::dbListTables(con)) {
DBI::dbExecute(con, glue("DROP TABLE {tbl_name}"))
message(glue("Dropped deprecated table: {tbl_name}"))
}
}
# drop uuid columns replaced by natural keys or integer IDs
deprecated_cols <- tribble(
~tbl, ~col,
"cruise", "cruise_uuid",
"site", "site_uuid",
"site", "cruise_uuid",
"tow", "tow_uuid",
"tow", "site_uuid",
"net", "net_uuid",
"net", "tow_uuid")
for (i in seq_len(nrow(deprecated_cols))) {
tbl_name <- deprecated_cols$tbl[i]
col <- deprecated_cols$col[i]
existing <- DBI::dbListFields(con, tbl_name)
if (col %in% existing) {
DBI::dbExecute(con, glue("ALTER TABLE {tbl_name} DROP COLUMN {col}"))
message(glue("Dropped deprecated column: {tbl_name}.{col}"))
}
}
message(glue("\nRemaining tables: {paste(sort(DBI::dbListTables(con)), collapse = ', ')}"))
```
## Schema Documentation
```{r}
#| label: dm_tbls
tbls <- dbListTables(con) |> sort()
# dm object for visualization (keys learned from data, not DB constraints)
dm_dev <- dm_from_con(con, table_names = tbls, learn_keys = FALSE)
dm_draw(dm_dev, view_type = "all")
```
### Primary Key Strategy
**Note:** Indexes are not created on this temporary wrangling DuckDB since the output
is parquet files. DuckDB automatically maintains per-column min-max statistics (zonemaps),
and parquet files carry their own column statistics for query optimization.
| Table | Primary Key | Type |
|-------|-------------|------|
| `cruise` | `cruise_key` | Natural key (YYMMKK) |
| `ship` | `ship_key` | Natural key (2-letter) |
| `site` | `site_id` | Sequential (sorted by cruise_key, orderocc) |
| `tow` | `tow_id` | Sequential (sorted by site_id, time_start) |
| `net` | `net_id` | Sequential (sorted by tow_id, side) |
| `species` | `species_id` | Natural from source |
| `ichthyo` | `ichthyo_id` | Sequential (sorted by net_id, species_id, etc.) |
| `lookup` | `lookup_id` | Sequential (sorted by lookup_type, lookup_num) |
### Foreign Key Relationships
DuckDB doesn't support `ALTER TABLE ADD FOREIGN KEY`. We use the dm package for documentation and visualization only.
**Foreign Key Relationships:**
```
ship.ship_key (PK)
↓
cruise.cruise_key (PK) ←── cruise.ship_key (FK)
↓
site.site_id (PK) ←── site.cruise_key (FK)
↓
tow.tow_id (PK) ←── tow.site_id (FK)
↓ tow.tow_type_key → lookup (lookup_type='tow_type')
net.net_id (PK) ←── net.tow_id (FK)
↓
ichthyo.ichthyo_id (PK) ←── ichthyo.net_id (FK)
ichthyo.species_id (FK) → species.species_id
```
```{r}
#| label: dm_fk
# get remaining tables (the ichthyo source tables are now consolidated and dropped)
tbls <- sort(DBI::dbListTables(con))
# refresh dm after dropping deprecated tables and columns
dm_dev <- dm_from_con(con, table_names = tbls, learn_keys = FALSE)
# define relationships in dm (in-memory for visualization only)
# uses new key structure: natural keys + sequential integer keys
add_cc_keys <- function(dm){
dm |>
dm_add_pk(cruise, cruise_key) |>
dm_add_pk(ship, ship_key) |>
dm_add_pk(site, site_id) |>
dm_add_pk(tow, tow_id) |>
dm_add_pk(net, net_id) |>
dm_add_pk(species, species_id) |>
dm_add_pk(ichthyo, ichthyo_id) |>
dm_add_pk(lookup, lookup_id) |>
dm_add_fk(ichthyo, net_id, net) |>
dm_add_fk(ichthyo, species_id, species) |>
dm_add_fk(net, tow_id, tow) |>
dm_add_fk(tow, site_id, site) |>
dm_add_fk(site, cruise_key, cruise) |>
dm_add_fk(cruise, ship_key, ship)
}
dm_dev_fk <- dm_dev |>
add_cc_keys()
dm_draw(dm_dev_fk, view_type = "all")
```
## Add Spatial
### Add `site.geom`
```{r}
#| label: mk_site_pts
# add geometry column using DuckDB spatial
# note: DuckDB spatial doesn't track SRID metadata (unlike PostGIS)
# all geometries assumed WGS84 (EPSG:4326) by convention
add_point_geom(con, "site", lon_col = "longitude", lat_col = "latitude")
```
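For reference, `add_point_geom()` is roughly equivalent to the following DuckDB spatial SQL; this is a display-only sketch, and the calcofi4db function is the authoritative implementation.

```r
# display-only sketch: build POINT geometries from lon/lat with DuckDB spatial
DBI::dbExecute(con, "ALTER TABLE site ADD COLUMN geom GEOMETRY")
DBI::dbExecute(con, "UPDATE site SET geom = ST_Point(longitude, latitude)")
```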
### Fix calcofi4r `grid`
Problems with
[calcofi4r::`cc_grid`](https://calcofi.io/calcofi4r/articles/calcofi4r.html):
- uses the old station terminology (line, position) rather than the newer site (line, station)
- `sta_lin`, `sta_pos`: stored as integers, so they drop the necessary decimal found in `sta_key`
- `sta_lin == 90, sta_pos == 120` repeats for:
  - `sta_pattern == 'historical'` (`sta_dpos == 20`); and
  - `sta_pattern == 'standard'` (`sta_dpos == 10`)
```{r}
#| label: mk_grid_v2
librarian::shelf(
calcofi4r, mapview,
quiet = T)
cc_grid_v2 <- calcofi4r::cc_grid |>
select(sta_key, shore = sta_shore, pattern = sta_pattern, spacing = sta_dpos) |>
separate_wider_delim(
sta_key, ",", names = c("line", "station"), cols_remove = F) |>
mutate(
line = as.double(line),
station = as.double(station),
grid_key = ifelse(
pattern == "historical",
glue("st{station}-ln{line}_hist"),
glue("st{station}-ln{line}")),
zone = glue("{shore}-{pattern}")) |>
relocate(grid_key, station) |>
st_as_sf() |>
mutate(
area_km2 = st_area(geom) |>
set_units(km^2) |>
as.numeric())
cc_grid_ctrs_v2 <- calcofi4r::cc_grid_ctrs |>
select(sta_key, pattern = sta_pattern) |>
left_join(
cc_grid_v2 |>
st_drop_geometry(),
by = c("sta_key", "pattern")) |>
select(-sta_key) |>
relocate(grid_key)
cc_grid_v2 <- cc_grid_v2 |>
select(-sta_key)
cc_grid_v2 |>
st_drop_geometry() |>
datatable()
mapview(cc_grid_v2, zcol = "zone") +
mapview(cc_grid_ctrs_v2, cex = 1)
```
```{r}
#| label: grid_to_db
grid <- cc_grid_v2 |>
as.data.frame() |>
left_join(
cc_grid_ctrs_v2 |>
as.data.frame() |>
select(grid_key, geom_ctr = geom),
by = "grid_key") |>
st_as_sf(sf_column_name = "geom")
# convert sf geometry to WKB for DuckDB
grid_df <- grid |>
mutate(
geom_wkb = sf::st_as_binary(geom, hex = TRUE),
geom_ctr_wkb = sf::st_as_binary(geom_ctr, hex = TRUE)) |>
sf::st_drop_geometry() |>
select(-geom_ctr)
# write to DuckDB
dbWriteTable(con, "grid", grid_df, overwrite = TRUE)
# convert WKB back to geometry
dbExecute(con, "ALTER TABLE grid ADD COLUMN geom GEOMETRY")
dbExecute(con, "UPDATE grid SET geom = ST_GeomFromHEXWKB(geom_wkb)")
dbExecute(con, "ALTER TABLE grid DROP COLUMN geom_wkb")
dbExecute(con, "ALTER TABLE grid ADD COLUMN geom_ctr GEOMETRY")
dbExecute(con, "UPDATE grid SET geom_ctr = ST_GeomFromHEXWKB(geom_ctr_wkb)")
dbExecute(con, "ALTER TABLE grid DROP COLUMN geom_ctr_wkb")
message("Grid table created with geometry columns")
```
### Update `site.grid_key`
```{r}
#| label: update_site_from_grid
grid_stats <- assign_grid_key(con, "site")
grid_stats |> datatable()
```
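One way `assign_grid_key()` could be implemented is a point-in-polygon update with DuckDB spatial; this is a display-only sketch under that assumption, not the package implementation.

```r
# display-only sketch: assign each site the grid cell that contains its point
DBI::dbExecute(con, "ALTER TABLE site ADD COLUMN IF NOT EXISTS grid_key VARCHAR")
DBI::dbExecute(con, "
  UPDATE site
  SET grid_key = g.grid_key
  FROM grid g
  WHERE ST_Intersects(g.geom, site.geom)")
```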
### Add `segment`: line segments between consecutive sites
```{r}
#| label: mk_segment
segment <- tbl(con, "site") |>
select(cruise_key, orderocc, site_id, lon = longitude, lat = latitude) |>
left_join(
tbl(con, "tow") |>
select(site_id, time_start),
by = "site_id") |>
group_by(
cruise_key, orderocc, site_id, lon, lat) |>
summarize(
time_beg = min(time_start, na.rm = T),
time_end = max(time_start, na.rm = T),
.groups = "drop") |>
collect()
segment <- segment |>
arrange(cruise_key, orderocc, time_beg) |>
group_by(cruise_key) |>
mutate(
site_id_beg = lag(site_id),
lon_beg = lag(lon),
lat_beg = lag(lat),
time_beg = lag(time_beg)) |>
ungroup() |>
filter(!is.na(lon_beg), !is.na(lat_beg)) |>
mutate(
m = pmap(
list(lon_beg, lat_beg, lon, lat),
\(x1, y1, x2, y2){
matrix(c(x1, y1, x2, y2), nrow = 2, byrow = T) }),
geom = map(m, st_linestring)) |>
select(
cruise_key,
site_id_beg,
site_id_end = site_id,
lon_beg,
lat_beg,
lon_end = lon,
lat_end = lat,
time_beg,
time_end,
geom) |>
st_as_sf(
sf_column_name = "geom",
crs = 4326) |>
mutate(
time_hr = as.numeric(difftime(time_end, time_beg, units = "hours")),
length_km = st_length(geom) |>
set_units(km) |>
as.numeric(),
km_per_hr = length_km / time_hr)
# convert to WKB and write to DuckDB
segment_df <- segment |>
mutate(geom_wkb = sf::st_as_binary(geom, hex = TRUE)) |>
sf::st_drop_geometry()
dbWriteTable(con, "segment", segment_df, overwrite = TRUE)
dbExecute(con, "ALTER TABLE segment ADD COLUMN geom GEOMETRY")
dbExecute(con, "UPDATE segment SET geom = ST_GeomFromHEXWKB(geom_wkb)")
dbExecute(con, "ALTER TABLE segment DROP COLUMN geom_wkb")
# assign segment_id sorted by time_beg
assign_sequential_ids(
con = con,
table_name = "segment",
id_col = "segment_id",
sort_cols = c("time_beg"))
message("Segment table created")
```
```{r}
#| label: view_segment
# slowish, so use cached figure
map_segment_png <- here(glue("figures/{provider}_{dataset}_segment_map.png"))
if (!file_exists(map_segment_png)) {
segment_sf <- cc_read_sf(con, "segment") |>
mutate(year = year(time_beg))
m <- mapView(segment_sf, zcol = "year")
mapshot2(m, file = map_segment_png)
}
htmltools::img(src = map_segment_png |> str_replace(here(), "."), width = "600px")
```
## Report
```{r}
#| label: show_latest
all_tbls <- DBI::dbListTables(con)
add_cc_spatial_keys <- function(dm){
dm |>
dm_add_pk(grid, grid_key) |>
dm_add_pk(segment, segment_id) |>
dm_add_fk(site, grid_key, grid) |>
dm_add_fk(segment, cruise_key, cruise)
}
dm_final <- dm_from_con(con, table_names = all_tbls, learn_keys = FALSE) |>
add_cc_keys() |>
add_cc_spatial_keys() |>
dm_set_colors(lightgreen = c(segment, grid))
dm_draw(dm_final, view_type = "all")
```
```{r}
#| label: effort_stats
d_eff <- tbl(con, "segment") |>
mutate(
year = year(time_beg)) |>
group_by(year) |>
summarize(
time_hr = sum(time_hr, na.rm = T),
length_km = sum(length_km, na.rm = T)) |>
collect()
total_hours <- sum(d_eff$time_hr, na.rm = T)
total_km <- sum(d_eff$length_km, na.rm = T)
fmt <- function(x, ...) format(x, big.mark = ",", ...)
message(glue("Total effort: {fmt(round(total_hours))} hours ({fmt(round(total_hours/24))} days, {fmt(round(total_hours/24/365, 1))} years)"))
message(glue("Total distance: {fmt(round(total_km))} km"))
```
## Validate Local Database
Validate data quality in the local wrangling database before exporting to parquet.
The parquet outputs from this workflow can later be used to update the Working DuckLake.
```{r}
#| label: validate
# validate data quality
validation <- validate_for_release(con)
if (validation$passed) {
message("Validation passed!")
if (nrow(validation$checks) > 0) {
validation$checks |>
filter(status != "pass") |>
datatable(caption = "Validation Warnings")
}
} else {
cat("Validation FAILED:\n")
cat(paste("-", validation$errors, collapse = "\n"))
}
```
## Enforce Column Types
Force integer/smallint types on columns that R's `numeric` mapped to `DOUBLE`
during `dbWriteTable()`. Uses `flds_redefine.csv` (`type_new`) as the source of
truth for source-table columns, plus explicit overrides for derived-table columns.
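Under the hood this boils down to per-column type changes; a display-only sketch of one such cast using DuckDB's `ALTER TABLE ... SET DATA TYPE`:

```r
# display-only sketch of a single cast applied by enforce_column_types()
DBI::dbExecute(con, "ALTER TABLE site ALTER site_id SET DATA TYPE INTEGER")
```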
```{r}
#| label: enforce_types
type_changes <- enforce_column_types(
con = con,
d_flds_rd = d$d_flds_rd,
type_overrides = list(
ichthyo.ichthyo_id = "INTEGER",
ichthyo.net_id = "INTEGER",
ichthyo.species_id = "SMALLINT",
ichthyo.tally = "INTEGER",
lookup.lookup_id = "INTEGER",
lookup.lookup_num = "INTEGER",
segment.segment_id = "INTEGER",
segment.site_id_beg = "INTEGER",
segment.site_id_end = "INTEGER",
site.site_id = "INTEGER",
tow.tow_id = "INTEGER",
tow.site_id = "INTEGER",
net.net_id = "INTEGER",
net.tow_id = "INTEGER"),
tables = dbListTables(con),
verbose = TRUE)
if (nrow(type_changes) > 0) {
type_changes |>
datatable(caption = "Column type changes applied")
}
```
## Data Preview
Preview first and last rows of each table before writing parquet outputs.
```{r}
#| label: preview_tables
#| results: asis
preview_tables(con, c(
"cruise", "ship", "site", "tow", "net",
"species", "ichthyo", "lookup", "grid", "segment"))
```
## Write Parquet Outputs
Export tables to parquet files for downstream use.
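Per table, `write_parquet_outputs()` presumably wraps something like DuckDB's `COPY ... TO ... (FORMAT PARQUET)`; a display-only sketch for a single table:

```r
# display-only sketch of a single-table parquet export
DBI::dbExecute(con, glue("COPY site TO '{dir_parquet}/site.parquet' (FORMAT PARQUET)"))
```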
```{r}
#| label: write_parquet
# write parquet files with manifest
parquet_stats <- write_parquet_outputs(
con = con,
output_dir = dir_parquet,
tables = dbListTables(con),
strip_provenance = F)
parquet_stats |>
mutate(file = basename(path)) |>
select(-path) |>
datatable(caption = "Parquet export statistics")
```
## Write Metadata
Build a `metadata.json` sidecar file documenting all tables and columns in the parquet outputs. DuckDB `COMMENT ON` does not propagate to parquet, so this provides the metadata externally.
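For context, a display-only sketch of how comments are set and read back in DuckDB: they live in the `.duckdb` catalog, not in the parquet files, which is why the external sidecar is needed (the comment text here is a placeholder).

```r
# display-only sketch: set a column comment in DuckDB, then read it back;
# these comments do not travel with parquet exports, hence metadata.json
DBI::dbExecute(con, "COMMENT ON COLUMN site.site_id IS 'example column description'")
DBI::dbGetQuery(con, "
  SELECT table_name, column_name, comment
  FROM duckdb_columns()
  WHERE comment IS NOT NULL")
```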
```{r}
#| label: write_metadata
metadata_path <- build_metadata_json(
con = con,
d_tbls_rd = d$d_tbls_rd,
d_flds_rd = d$d_flds_rd,
metadata_derived_csv = here("metadata/swfsc.noaa.gov/calcofi-db/metadata_derived.csv"),
output_dir = dir_parquet,
tables = dbListTables(con),
set_comments = TRUE,
provider = provider,
dataset = dataset,
workflow_url = glue("https://calcofi.io/workflows/ingest_swfsc.noaa.gov_calcofi-db.html"))
# show metadata summary
metadata <- jsonlite::fromJSON(metadata_path)
tibble(
table = names(metadata$tables),
n_cols = map_int(names(metadata$tables), ~sum(grepl(glue("^{.x}\\."), names(metadata$columns)))),
name_long = map_chr(metadata$tables, ~.x$name_long)) |>
datatable(caption = "Table metadata summary")
```
```{r}
#| label: show_metadata_json
listviewer::jsonedit(
jsonlite::fromJSON(metadata_path, simplifyVector = FALSE),
mode = "view")
```
## Upload to GCS
Upload parquet files, manifest, and metadata sidecar to `gs://calcofi-db/ingest/{provider}_{dataset}/`.
```{r}
#| label: upload_gcs
gcs_ingest_prefix <- glue("ingest/{provider}_{dataset}")
gcs_bucket <- "calcofi-db"
# list local files to upload (parquet + json sidecars)
local_files <- list.files(dir_parquet, full.names = TRUE)
# upload each file
walk(local_files, function(f) {
gcs_path <- glue("gs://{gcs_bucket}/{gcs_ingest_prefix}/{basename(f)}")
put_gcs_file(f, gcs_path)
message(glue("Uploaded {basename(f)} -> {gcs_path}"))
})
message(glue(
"\nUploaded {length(local_files)} files to gs://{gcs_bucket}/{gcs_ingest_prefix}/"))
```
## Cleanup
```{r}
#| label: cleanup
# close local wrangling database connection
close_duckdb(con)
message("Local wrangling database connection closed")
# note: parquet outputs are in data/parquet/swfsc.noaa.gov_calcofi-db/
# these can be used to update the Working DuckLake in a separate workflow
message(glue("Parquet outputs written to: {dir_parquet}"))
message(glue("GCS outputs at: gs://{gcs_bucket}/{gcs_ingest_prefix}/"))
```
## TODO
- [ ] Review documentation for more comment descriptions at [Data \> Data Formats \| CalCOFI.org](https://calcofi.com/index.php?option=com_content&view=category&id=73&Itemid=993)
::: {.callout-caution collapse="true"}
## Session Info
```{r session_info}
devtools::session_info()
```
:::