CalCOFI Data Workflow Plan

Published: 2026-02-05

This document outlines a comprehensive strategy for managing and versioning CalCOFI data files, building an integrated database, and publishing datasets to external portals.

1 Overview for Data Managers & Scientists

This section provides a high-level view of the CalCOFI data system. You don’t need to understand all the technical details—just the four main components and how they connect.

1.1 The Big Picture

Code
flowchart TB
    subgraph drive["📁 THE DRIVE"]
        direction LR
        gd_pub["data-public/<br/>Bottle, Larvae, CTD..."]
        gd_priv["data-private/<br/>eDNA, Zooplankton..."]
    end

    subgraph archive["🗄️ THE ARCHIVE"]
        direction LR
        sync["_sync/<br/>(working copy)"]
        hist["archive/<br/>(snapshots)"]
    end

    subgraph workflows["⚙️ THE WORKFLOWS"]
        direction LR
        ingest["Ingest<br/>(CSV → DB)"]
        publish["Publish<br/>(DB → Portals)"]
    end

    subgraph database["🗃️ THE DATABASE"]
        direction LR
        working["Working<br/>(internal)"]
        frozen["Frozen<br/>(public releases)"]
    end

    subgraph outputs["🌐 OUTPUTS"]
        direction LR
        obis["OBIS"]
        apps["Shiny Apps"]
        api["API / R pkg"]
    end

    drive -->|"daily sync"| archive
    archive -->|"read files"| workflows
    workflows -->|"transform"| database
    database -->|"publish"| outputs

    style drive fill:#e8f4e8,stroke:#2e7d32
    style archive fill:#e3f2fd,stroke:#1565c0
    style workflows fill:#fff3e0,stroke:#ef6c00
    style database fill:#f3e5f5,stroke:#7b1fa2
    style outputs fill:#fce4ec,stroke:#c2185b
Figure 1: CalCOFI Data System Overview

1.2 Four Key Components

📁 The Drive

What: The shared Google Drive folder where you upload your data files.

Location: My Drive/projects/calcofi/data-public/ (or data-private/)

Who uses it: Dataset managers (Rasmus, Ed, Nastassia, Linsey, etc.)

How to use it:

  1. Create a folder for your data: {provider}/{dataset}/ (e.g., swfsc.noaa.gov/calcofi-db/)
  2. Inside, organize files into raw/ (original data) and derived/ (processed data)
  3. Simply upload or overwrite files—versioning happens automatically
  4. No special naming required (no date prefixes needed)

Example structure:

data-public/
├── calcofi.org/
│   └── bottle-database/
│       ├── raw/
│       │   └── bottle_export_2026-01-15.csv
│       └── derived/
│           └── bottle_cleaned.csv
├── swfsc.noaa.gov/
│   └── calcofi-db/
│       ├── raw/
│       │   ├── larva.csv
│       │   └── cruise.csv
│       └── derived/

🗄️ The Archive

What: A versioned backup of all Drive files in the cloud¹.

Location: Google Cloud Storage buckets (gs://calcofi-files-public/ and gs://calcofi-files-private/)

Who uses it: Automated systems (you don’t need to interact with it directly)

What it does:

  • Daily sync: Copies all files from the Drive to the Archive
  • Snapshots: Creates timestamped copies so we can always go back in time
  • Never loses data: Even if you delete a file from Drive, it’s preserved in the Archive

Why it matters: Scientists can reproduce analyses using the exact data from any past date.

⚙️ The Workflows

What: Automated scripts that process data from the Archive into the Database.

Location: /workflows/ folder (Quarto notebooks and R scripts)

Who uses it: Data engineers and developers

Two types of workflows:

  1. Ingest workflows (ingest_*.qmd): Read CSV files → Clean/transform → Load into Database
  2. Publish workflows (publish_*.qmd): Read from Database → Format for portals → Upload to OBIS, EDI, etc.

Key feature: Dependencies are managed automatically—if source data changes, all downstream steps re-run.

🗃️ The Database

What: The integrated CalCOFI database combining all datasets into related tables.

Format: DuckDB² with Parquet files (modern, fast, serverless)

Two versions:

Version Purpose Access Features
Working Database Internal processing Workflows only Full provenance tracking, UUIDs, source file references, time travel
Frozen Database Public releases Apps, API, R package Clean tables, versioned releases (v2026.02, v2026.03…), optimized for queries

Why two versions? The Working Database tracks every detail for reproducibility. The Frozen Database provides clean, stable snapshots for users who just want the data.

1.3 How Data Flows

Code
flowchart TB
    subgraph you["👤 YOU"]
        upload["Upload CSV to Drive"]
    end

    subgraph auto["🤖 AUTOMATED"]
        sync["Sync to Archive"]
        snapshot["Create snapshot"]
        transform["Run workflows"]
    end

    subgraph result["📊 RESULT"]
        db["Updated Database"]
        portals["Published to Portals"]
    end

    upload --> sync --> snapshot --> transform --> db --> portals

    style you fill:#e8f4e8,stroke:#2e7d32
    style auto fill:#fff3e0,stroke:#ef6c00
    style result fill:#f3e5f5,stroke:#7b1fa2
Figure 2: Simplified Data Flow

Your role as a dataset manager: Upload files to the Drive. Everything else happens automatically.


2 Executive Summary

This plan modernizes the CalCOFI data workflow architecture to:

  1. Version raw data files in Google Cloud Storage (GCS)
  2. Use DuckDB + Parquet as the primary integrated database format
  3. Leverage targets for workflow dependency management
  4. Use R packages (calcofi4r, calcofi4db) for reusable functions
  5. Maintain reproducible pipelines from raw data to published datasets

3 Current State Analysis

3.1 Existing Infrastructure

Component Current State Location
Workflows Individual Quarto notebooks /workflows/
Database PostgreSQL with dev/prod schemas api.calcofi.io
R Packages calcofi4r (user), calcofi4db (admin) GitHub
Data Sources CSV files on Google Drive ~/My Drive/projects/calcofi/data-public
Documentation Quarto site /docs/

3.2 Key Workflow Files

Workflow Purpose Status
ingest_calcofi.org_bottle-database.qmd Ingest bottle/cast data Partial
ingest_swfsc.noaa.gov_calcofi-db.qmd Ingest NOAA ichthyoplankton data Active
publish_larvae_to_obis.qmd Publish larvae to OBIS Active (needs bottle data)

3.3 Identified Gaps

  1. No dependency management: Workflows run independently without declared dependencies
  2. No file versioning: Source files updated in place without version history
  3. PostgreSQL overhead: Requires server maintenance and API infrastructure
  4. Disconnected workflows: No systematic connection between ingest and publish steps
  5. Manual sequencing: Users must know to run bottle ingestion before larvae ingestion

4 File Versioning Strategy

4.1 The Two-World Problem

CalCOFI data originates from multiple sources managed by different individuals:

  • Rasmus: Bottle/cast data, CTD profiles
  • Ed: Ichthyoplankton (larvae, eggs) data
  • Nastassia: eDNA data
  • Linsey: Zooplankton samples

These dataset managers need a simple interface (Google Drive) to deposit and organize files, while the data engineering side needs version control, immutability, and reproducibility.

4.2 Google Drive: The Human Interface

This is “The Drive” from Section 1. Dataset managers upload files here; everything else is automatic.

4.2.1 Folder Structure

Recommended organization: {provider}/{dataset}/raw/ and {provider}/{dataset}/derived/

calcofi/data-public/                       # Shared Google Drive folder (public data)
├── calcofi.org/
│   └── bottle-database/
│       ├── raw/                           # Original exports from source system
│       │   └── bottle_export_2026-01.csv
│       └── derived/                       # Cleaned/processed versions
│           ├── bottle.csv
│           ├── cast.csv
│           └── field_descriptions.csv
├── swfsc.noaa.gov/
│   └── calcofi-db/
│       ├── raw/                           # Original NOAA database exports
│       │   └── calcofi_db_dump_2026-01.sql
│       └── derived/                       # Extracted CSV tables
│           ├── cruise.csv
│           ├── egg.csv
│           ├── larva.csv
│           ├── net.csv
│           ├── ship.csv
│           ├── site.csv
│           ├── species.csv
│           └── tow.csv
├── coastwatch.pfeg.noaa.gov/
│   └── erdCalCOFIlrvsiz/
│       └── derived/
│           └── larvae_size.csv
└── _archive/                              # Manual archive (optional)
    └── 2025-01-15_bottle.csv              # User-created backup

calcofi/data-private/                      # Private/sensitive data (restricted access)
├── edna/                                  # eDNA samples (Nastassia)
│   ├── raw/
│   └── derived/
├── zooplankton/                           # Zooplankton samples (Linsey)
│   ├── raw/
│   └── derived/
└── _archive/                              # Manual archive (optional)

4.2.2 Google Drive Naming Conventions

  • Top-level folders: {provider}/ - the organization providing the data (e.g., swfsc.noaa.gov)
  • Dataset folders: {dataset}/ - specific dataset name (e.g., calcofi-db)
  • Subfolders (recommended):
    • raw/ - Original data as received from source (database dumps, original CSVs)
    • derived/ - Processed/cleaned versions ready for ingestion
  • Files: Original source names, no date prefixes required
  • Updates: Simply overwrite existing files
  • No versioning in Drive: do not rely on Drive's limited revision history; file versioning is handled by the Archive (Section 4.3)

4.2.3 Dataset Manager Workflow

  1. Create folder structure: {provider}/{dataset}/raw/ and {provider}/{dataset}/derived/
  2. Upload original data to raw/ (preserve original format)
  3. Upload processed/cleaned data to derived/ (CSV format preferred)
  4. Simply overwrite files when updating—versioning happens automatically in the Archive
  5. Optionally notify data team of significant changes

See Section 1 for the big picture of how your data flows through the system.

4.3 Google Cloud Storage: The Versioned Data Lake

This is “The Archive” from Section 1³. It automatically captures versioned snapshots of all files from the Drive.

4.3.1 Bucket Structure

gs://calcofi-files-public/                 # Versioned public source files
├── _sync/                                 # Working directory (rclone syncs here)
│   ├── calcofi.org/
│   │   └── bottle-database/
│   │       ├── bottle.csv
│   │       └── cast.csv
│   └── swfsc.noaa.gov/
│       └── calcofi-db/
│           ├── cruise.csv
│           ├── egg.csv
│           └── ...
├── archive/                               # Timestamped immutable snapshots
│   ├── 2026-02-02_121557/                # YYYY-MM-DD_HHMMSS
│   │   └── calcofi.org/
│   │       └── bottle-database/
│   │           └── bottle.csv            # Snapshot at that time
│   ├── 2026-02-15_120000/
│   │   └── swfsc.noaa.gov/
│   │       └── calcofi-db/
│   │           └── larva.csv
│   └── ...
└── manifests/                             # Version metadata
    ├── manifest_2026-02-02_121557.json
    ├── manifest_2026-02-15_120000.json
    └── manifest_latest.json              # Points to latest archive

gs://calcofi-files-private/                # Versioned private source files
├── _sync/                                 # Working directory (rclone syncs here)
│   ├── edna/
│   └── zooplankton/
├── archive/                               # Timestamped immutable snapshots
└── manifests/

gs://calcofi-db/                           # Integrated database
├── ingest/                                # Per-workflow outputs (intermediate)
│   ├── swfsc.noaa.gov_calcofi-db/         # Flattened {provider}_{dataset}
│   │   ├── cruise.parquet
│   │   ├── ship.parquet
│   │   ├── site.parquet
│   │   ├── ...
│   │   ├── manifest.json                  # Provenance for this workflow
│   │   └── metadata.json                  # Table & column metadata sidecar
│   └── calcofi.org_bottle-database/
│       ├── cast.parquet
│       ├── bottle.parquet
│       ├── manifest.json
│       └── metadata.json
│
└── ducklake/                              # DuckLake manages parquet + catalog
    ├── working/                           # Integrated database (combines ingests)
    │   ├── catalog/                       # DuckLake catalog metadata
    │   └── data/                          # Parquet files with provenance columns
    │       ├── cruise/
    │       ├── site/
    │       └── ...
    └── releases/                          # Frozen public releases
        ├── v2026.02/
        │   ├── catalog/
        │   └── data/                      # Provenance columns stripped
        └── latest -> v2026.02             # Symlink to current

4.4 Workflow Metadata

Redefinition files (column/table renaming rules) live in workflows/metadata/, NOT in the calcofi4db R package. This decouples workflow-specific metadata from the package code.

workflows/metadata/
├── swfsc.noaa.gov/
│   └── calcofi-db/
│       ├── tbls_redefine.csv              # Table renaming rules
│       ├── flds_redefine.csv              # Field renaming/typing rules
│       ├── tbls_raw.csv                   # Auto-generated raw table info
│       └── flds_raw.csv                   # Auto-generated raw field info
└── calcofi.org/
    └── bottle-database/
        ├── flds_rename.csv
        └── ship_renames.csv

Benefits of this structure:

  • Version-controlled alongside workflows (not buried in R package)
  • Easy to edit without rebuilding the package
  • Clear ownership per dataset
  • Allows different projects to use calcofi4db without conflicting metadata

4.4.1 GCS Naming Conventions

  • _sync/: Working directory, exact mirror of Google Drive structure (rclone syncs here)
  • archive/{YYYY-MM-DD_HHMMSS}/: Immutable snapshot of complete data at that timestamp
  • manifests/manifest_{YYYY-MM-DD_HHMMSS}.json: Snapshot metadata
  • manifests/manifest_latest.json: Points to latest archive snapshot

4.5 Rclone Sync: Capturing Daily Changes

4.5.1 Installation

macOS:

brew install rclone

Linux (Ubuntu/Debian):

curl https://rclone.org/install.sh | sudo bash

4.5.2 Configuration

Run rclone config and create two remotes:

Remote 1: Google Drive (gdrive)

n) New remote
name> gdrive
Storage> drive
client_id>                        # leave blank
client_secret>                    # leave blank
scope> 1                          # Full access (or 2 for read-only)
service_account_file>             # leave blank for OAuth, or path to JSON key
Edit advanced config> n
Use auto config> y                # opens browser (use 'n' on headless server)
Configure this as a Shared Drive> n

Remote 2: Google Cloud Storage (gcs)

n) New remote
name> gcs
Storage> google cloud storage
client_id>                        # leave blank
client_secret>                    # leave blank
project_number> ucsd-sio-calcofi
service_account_file>             # leave blank for OAuth, or path to JSON key
Edit advanced config> n
Use auto config> y                # opens browser (use 'n' on headless server)

4.5.3 Headless Server Setup

For Linux servers without a browser, use rclone authorize on a local machine:

# on local machine with browser
rclone authorize "drive"
# copy the token output

# on server, during rclone config
Use auto config> n
# paste the token when prompted

Similarly for GCS:

# on local machine
rclone authorize "google cloud storage"

4.5.4 Service Account (Optional)

For automated/scheduled syncs, use a GCP service account instead of OAuth:

  1. Create service account in GCP Console → IAM → Service Accounts
  2. Grant roles: Storage Object Admin for calcofi-files-public, calcofi-files-private, and calcofi-db buckets
  3. Download JSON key file to secure location (e.g., /etc/rclone/calcofi-sa.json)
  4. In rclone config, provide path to JSON key instead of using OAuth

4.5.5 Verify Configuration

# test Google Drive access
rclone lsd gdrive:projects/calcofi/data-public      # OR
rclone lsd gdrive-ecoquants:projects/calcofi/data-public

# test GCS access
rclone ls gcs:calcofi-files-public --max-depth 1    # OR
rclone ls gcs-calcofi:calcofi-files-public --max-depth 1

# test sync (dry-run first!)
rclone sync gdrive:calcofi/data-public gcs:calcofi-files-public/_sync --dry-run -v
rclone sync gdrive-ecoquants:projects/calcofi/data-public gcs-calcofi:calcofi-files-public/_sync --dry-run -v

4.5.6 How It Works

#!/bin/bash
# sync_gdrive_to_gcs.sh - Run daily via cron
# Uses the rclone remotes configured above: gdrive (Google Drive), gcs (Cloud Storage)

TIMESTAMP=$(date +%Y-%m-%d_%H%M%S)
LOG_FILE="/var/log/rclone/sync_${TIMESTAMP}.log"
mkdir -p /var/log/rclone

# --- PUBLIC DATA ---
# Step 1: Sync Google Drive to _sync/ working directory
rclone sync gdrive:calcofi/data-public gcs:calcofi-files-public/_sync \
  --checksum \
  --drive-export-formats csv \
  --log-file ${LOG_FILE} \
  --log-level INFO

# Step 2: Create immutable archive snapshot
rclone copy gcs:calcofi-files-public/_sync gcs:calcofi-files-public/archive/${TIMESTAMP}

# Step 3: Generate and upload manifest
rclone lsjson gcs:calcofi-files-public/archive/${TIMESTAMP} --recursive \
  > /tmp/manifest_${TIMESTAMP}.json
rclone copy /tmp/manifest_${TIMESTAMP}.json gcs:calcofi-files-public/manifests/
rclone copyto /tmp/manifest_${TIMESTAMP}.json gcs:calcofi-files-public/manifests/manifest_latest.json

# --- PRIVATE DATA ---
# Repeat for private bucket (separate log file)
rclone sync gdrive:calcofi/data-private gcs:calcofi-files-private/_sync \
  --checksum \
  --drive-export-formats csv \
  --log-file ${LOG_FILE%.log}_private.log \
  --log-level INFO

rclone copy gcs:calcofi-files-private/_sync gcs:calcofi-files-private/archive/${TIMESTAMP}

4.5.7 Key Behaviors

Scenario Google Drive Action Rclone Behavior Result in GCS
New file User uploads new.csv rclone sync copies to _sync/ _sync/new.csv created, next archive includes it
Updated file User overwrites bottle.csv rclone sync updates _sync/ _sync/bottle.csv updated, archive snapshots preserve history
Deleted file User deletes old.csv rclone sync removes from _sync/ Removed from _sync/, preserved in prior archives
Renamed file User renames a.csv → b.csv Treated as delete + create a.csv gone from _sync/, b.csv created
No changes Nothing --checksum skips identical files No new sync needed

4.5.8 The Two-Step Archive Strategy

The archiving approach uses two steps for reliability:

  1. Sync to _sync/: rclone syncs Google Drive to the _sync/ working directory
  2. Copy to archive/{timestamp}/: Create an immutable snapshot of the current state

Benefits:

  • Immutable archives: Each archive folder is a complete snapshot, not just deltas
  • Reproducibility: Any archive folder contains everything needed to recreate that state
  • No data loss: If sync fails, previous archives remain intact
  • Public access: calcofi-files-public bucket has public read access for easy data sharing

4.6 Manifest Files

Each manifest captures the complete state of the archive at a point in time:

{
  "generated_at": "2026-02-02T12:15:57Z",
  "archive_timestamp": "2026-02-02_121557",
  "bucket": "gs://calcofi-files-public",
  "archive_path": "archive/2026-02-02_121557",
  "files": [
    {
      "Path": "calcofi.org/bottle-database/bottle.csv",
      "Size": 45234567,
      "ModTime": "2026-02-02T10:30:00Z",
      "MimeType": "text/csv",
      "gcs_url": "gs://calcofi-files-public/archive/2026-02-02_121557/calcofi.org/bottle-database/bottle.csv"
    },
    {
      "Path": "swfsc.noaa.gov/calcofi-db/larva.csv",
      "Size": 12345678,
      "ModTime": "2026-02-01T08:00:00Z",
      "MimeType": "text/csv",
      "gcs_url": "gs://calcofi-files-public/archive/2026-02-02_121557/swfsc.noaa.gov/calcofi-db/larva.csv"
    }
  ],
  "total_files": 112457,
  "total_size_bytes": 35987654321
}
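
Because each manifest is a complete, publicly readable listing, two snapshots can be compared without any special tooling. A minimal R sketch (using the example snapshot timestamps from Section 4.3.1; field names as in the JSON above):

library(dplyr)

read_manifest <- function(ts) {
  url <- glue::glue(
    "https://storage.googleapis.com/calcofi-files-public/manifests/manifest_{ts}.json")
  # files[] simplifies to a data frame with Path, Size, ModTime, MimeType, gcs_url
  jsonlite::fromJSON(url)$files
}

old <- read_manifest("2026-02-02_121557")
new <- read_manifest("2026-02-15_120000")

# files added or changed between the two snapshots
anti_join(new, old, by = c("Path", "Size", "ModTime"))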

4.7 Reconstructing Historical State

To recreate the data lake as it existed on a specific date:

# R function to get the public HTTPS URL of a file in a specific archive snapshot
get_archived_file <- function(path, timestamp = "latest", bucket = "calcofi-files-public") {

  # Manifests are publicly readable, so fetch via HTTPS
  manifest_name <- if (timestamp == "latest") {
    "manifest_latest.json"
  } else {
    glue::glue("manifest_{timestamp}.json")
  }
  manifest_url <- glue::glue(
    "https://storage.googleapis.com/{bucket}/manifests/{manifest_name}")

  manifest <- jsonlite::fromJSON(manifest_url)

  # Construct the public HTTPS URL for direct access to the archived file
  https_url <- glue::glue(
    "https://storage.googleapis.com/{bucket}/archive/{manifest$archive_timestamp}/{path}")

  return(https_url)
}
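
For example, to read the larvae table exactly as it existed in the 2026-02-02 snapshot (file path and timestamp as listed in the manifest above):

url <- get_archived_file(
  "swfsc.noaa.gov/calcofi-db/larva.csv",
  timestamp = "2026-02-02_121557")
larva <- readr::read_csv(url)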

4.8 Automation Schedule

Code
%%---
%%displayMode: compact
%%---

%%{init: {'theme': 'default', 'themeVariables': { 'textColor': '#333'}}}%%

gantt
    title Daily Sync Schedule (UTC)
    dateFormat HH:mm
    axisFormat %H:%M

    section Sync
    rclone           :a1, 00:00, 10m
    archive          :a2, after a1, 5m
    manifest         :a3, after a2, 5m

    section Pipeline
    targets          :b1, after a3, 10m
    parquet          :b2, after b1, 5m
    duckdb           :b3, after b2, 5m

    section Notify
    notify           :c1, after b3, 5m
Figure 3: Daily Sync Schedule (UTC)

The daily automation runs at midnight UTC and completes in under an hour:

  1. Sync phase: rclone copies new/changed files from Google Drive to the _sync/ folder in GCS, then creates an immutable archive/ snapshot and generates a manifest file documenting the state.

  2. Pipeline phase: The targets pipeline detects which source files changed and re-runs only the affected steps—transforming CSVs to Parquet format and updating the DuckDB database.

  3. Notify phase: Sends notifications (email/Slack) summarizing what changed and any errors encountered.

If no files changed in Google Drive, the pipeline completes quickly with minimal processing.
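
A minimal sketch of the pipeline phase as an R script invoked after the sync completes, assuming the targets project lives under workflows/ and the manifests are publicly readable as above:

# run_daily_pipeline.R (sketch)
library(targets)

# latest archive snapshot, for logging/notification
manifest <- jsonlite::fromJSON(
  "https://storage.googleapis.com/calcofi-files-public/manifests/manifest_latest.json")
message("Latest archive snapshot: ", manifest$archive_timestamp)

# targets skips any step whose upstream inputs are unchanged, so this is
# nearly a no-op when nothing new arrived on the Drive
tar_make(store = "workflows/_targets")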


5 Proposed Architecture

5.1 Data Flow Diagram

Code
flowchart TB
    subgraph drive["📁 THE DRIVE"]
        direction LR
        gdrive_pub["data-public/"]
        gdrive_priv["data-private/"]
    end

    subgraph archive["🗄️ THE ARCHIVE"]
        direction LR
        bucket_pub["calcofi-files-public"]
        bucket_priv["calcofi-files-private"]
    end

    subgraph database["🗃️ THE DATABASE"]
        direction LR
        working["Working DuckLake"]
        frozen["Frozen Releases"]
    end

    subgraph outputs["🌐 OUTPUTS"]
        direction LR
        obis["OBIS"]
        edi["EDI"]
        erddap["ERDDAP"]
        apps["Apps/API"]
    end

    drive -->|"rclone sync"| archive
    archive -->|"targets pipeline"| database
    database -->|"publish workflows"| outputs

    style drive fill:#e8f4e8,stroke:#2e7d32
    style archive fill:#e3f2fd,stroke:#1565c0
    style database fill:#f3e5f5,stroke:#7b1fa2
    style outputs fill:#fce4ec,stroke:#c2185b
Figure 4: CalCOFI Data Flow Architecture

5.2 Technology Stack

Layer Technology Purpose Replaces
Storage
gs://calcofi-files-public Versioned public source CSV files Google Drive only
gs://calcofi-files-private Versioned private/sensitive files Google Drive only
gs://calcofi-db Parquet, DuckDB, PMTiles -
Sync rclone Drive → GCS with archive snapshots Manual copying
Format Apache Parquet Efficient columnar storage CSV
Compute DuckDB Serverless SQL queries PostgreSQL (gradual)
Database
Working DuckLake Internal DB with provenance, time travel PostgreSQL
Frozen DuckLake Public versioned releases (v2026.02, etc.) PostgreSQL snapshots
Orchestration targets Dependency management, caching Manual sequencing
Tiles PMTiles + tippecanoe Cloud-native vector tiles pg_tileserv
Mapping mapgl Modern WebGL maps in Shiny leaflet + pg_tileserv
Functions calcofi4r, calcofi4db Reusable R code Ad-hoc scripts

GCP Project: ucsd-sio-calcofi


6 Database Naming Conventions

6.1 Table Names

  • Table names are singular (e.g., cruise, site, species, lookup not cruises, sites, lookups)
  • Table names use all lower case with underscores for multi-word names (snake_case)

6.2 Column Names

  • Use snake_case (lower-case with underscores) for all column names
  • Unique identifiers are suffixed with:
    • *_id for integer keys (sequential or from source)
    • *_key for string/text keys (natural keys)
    • *_uuid for universally unique identifiers (provenance tracking only)
  • Suffix with units where applicable (e.g., length_mm, depth_m, temp_degc)

7 Primary Key Strategy

7.1 General Principle: Natural Keys Over Surrogate Keys

Prefer natural keys (meaningful identifiers from the domain) over surrogate keys (arbitrary integers) unless:

  • No natural key exists
  • Performance requires integer keys for large tables with frequent joins
  • The natural key is unstable or may change

Benefits of natural keys:

  • Self-documenting (cruise_key 2401NH tells you January 2024, New Horizon)
  • No lookup needed to understand relationships
  • Consistent across systems (source data uses same identifiers)
  • Easier debugging and data validation

7.2 Natural Keys (use as PRIMARY KEY)

Table Primary Key Format Example
cruise cruise_key TEXT YYMMKK '2401NH'
ship ship_key TEXT (2-letter) 'NH'
tow_type tow_type_key TEXT 'CB'
species species_id INTEGER 1234 (natural from source)

7.3 Sequential Integer Keys (with explicit sort order)

For tables without natural keys, use sequential integers with deterministic sort order for reproducibility when appending new data:

Table Primary Key Sort Order for ID Assignment
site site_id ORDER BY cruise_key, orderocc
tow tow_id ORDER BY site_id, time_start
net net_id ORDER BY tow_id, side
ichthyo ichthyo_id ORDER BY net_id, species_id, life_stage, measurement_type, measurement_value
lookup lookup_id ORDER BY lookup_type, lookup_num

Why sort order matters: When new cruises are added later, rows get appended. Without deterministic ordering, the same data could get different IDs on different runs, breaking reproducibility.
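
A minimal dplyr sketch of deterministic ID assignment, assuming a site data frame with the sort columns named above; re-deriving IDs this way on every full rebuild yields identical IDs for identical rows:

library(dplyr)

site <- site |>
  arrange(cruise_key, orderocc) |>   # deterministic sort order
  mutate(site_id = row_number())     # stable IDs across rebuilds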

7.4 Foreign Key Relationships

ship.ship_key (PK)
    ↓
cruise.cruise_key (PK)  ←──  cruise.ship_key (FK)
    ↓
site.site_id (PK)       ←──  site.cruise_key (FK)
    ↓
tow.tow_id (PK)         ←──  tow.site_id (FK)
    ↓                        tow.tow_type_key (FK) → tow_type.tow_type_key
net.net_id (PK)         ←──  net.tow_id (FK)
    ↓
ichthyo.ichthyo_id (PK) ←──  ichthyo.net_id (FK)
                             ichthyo.species_id (FK) → species.species_id

--- Bottle chain (linked via cruise bridge) ---

casts.cast_id (PK)      ←──  casts.cruise_key (FK, nullable) → cruise.cruise_key
    ↓                        casts.grid_key (FK) → grid.grid_key
    ↓                        casts.ship_key (FK, nullable) → ship.ship_key
bottle.bottle_id (PK)   ←──  bottle.cast_id (FK)
    ↓
bottle_measurement       ←──  bottle_measurement.bottle_id (FK)
                             bottle_measurement.measurement_type_id → measurement_type

--- Taxonomy ---

species.species_id (PK)      species.worms_id → taxon (WoRMS authority)
                             species.itis_id → taxon (ITIS authority)
taxon.taxonID (PK per authority)
taxa_rank.taxonRank (PK)

--- Spatial ---

grid.grid_key (PK)      ←──  canonical from swfsc (not in bottle export)
segment (reference table)

Note: grid.parquet is only exported from the swfsc workflow (canonical source). The bottle workflow no longer exports grid.

7.5 UUID Columns

  • Remove cruise_uuid, site_uuid, tow_uuid, net_uuid from output tables
  • Retain _source_uuid in Working DuckLake for provenance (stripped in frozen release)

8 Database Architecture: Working vs Frozen DuckLake

The CalCOFI integrated database uses DuckLake, a lakehouse catalog that provides version control for data tables. We maintain two distinct databases to serve different needs:

8.1 The Two-Database Strategy

Code
flowchart TB
    subgraph archive["📁 Archive"]
        csv["CSV source files"]
    end

    subgraph working["🔧 WORKING DUCKLAKE"]
        direction LR
        w_tables["Provenance columns"]
        w_history["Time travel"]
        w_access["Internal only"]
    end

    subgraph frozen["❄️ FROZEN DUCKLAKE"]
        direction LR
        f_versions["v2026.02, v2026.03..."]
        f_clean["Clean tables"]
        f_access["Public access"]
    end

    subgraph outputs["🌐 Public Access"]
        direction LR
        apps["Shiny Apps"]
        api["REST API"]
        rpkg["calcofi4r"]
    end

    csv -->|"ingest workflows"| working
    working -->|"freeze release"| frozen
    frozen --> outputs

    style archive fill:#e3f2fd,stroke:#1565c0
    style working fill:#fff3e0,stroke:#ef6c00
    style frozen fill:#e8f4e8,stroke:#2e7d32
    style outputs fill:#fce4ec,stroke:#c2185b
Figure 5: Working vs Frozen DuckLake Architecture

8.2 Working DuckLake (Internal)

The Working DuckLake is the internal, mutable database used by workflows. It prioritizes:

  • Full provenance tracking: Every row knows where it came from
  • Time travel: Query the database as it existed at any past point
  • Frequent updates: Can be modified whenever new data arrives

Schema features:

Column Type Purpose
_source_file VARCHAR Path to original CSV file in Archive
_source_row INTEGER Row number in source file
_source_uuid UUID Original record ID (e.g., from NOAA database)
_ingested_at TIMESTAMP When this row was added/updated

Example query with provenance:

-- Find all larvae records from a specific source file
SELECT scientific_name, count, _source_file, _source_row
FROM larvae
WHERE _source_file LIKE '%swfsc.noaa.gov/calcofi-db/larva.csv'
  AND cruise_key = '2301CC';

Time travel query:

-- Query the larvae table as it existed on January 15th (DuckLake time travel)
SELECT * FROM larvae
  AT (TIMESTAMP => '2026-01-15 00:00:00');

8.3 Frozen DuckLake (Public Releases)

The Frozen DuckLake provides stable, versioned snapshots for external users, following the “Frozen DuckLake” pattern.

Key characteristics:

  • Immutable: Once released, a version never changes
  • Clean schema: No internal provenance columns (cleaner for users)
  • Semantic versioning: v{YYYY}.{MM} format (e.g., v2026.02)
  • Public access: Anyone can query via HTTP/Parquet

Release process:

Code
flowchart LR
    working["Working DuckLake"]
    validate["Validate data quality"]
    strip["Remove provenance columns"]
    version["Tag version (v2026.02)"]
    upload["Upload to releases/"]
    announce["Announce release"]

    working --> validate --> strip --> version --> upload --> announce

    style working fill:#fff3e0,stroke:#ef6c00
    style validate fill:#fff9c4,stroke:#f9a825
    style strip fill:#fff9c4,stroke:#f9a825
    style version fill:#fff9c4,stroke:#f9a825
    style upload fill:#e8f4e8,stroke:#2e7d32
    style announce fill:#e8f4e8,stroke:#2e7d32
Figure 6: Freezing a Release

Frozen release structure:

gs://calcofi-db/ducklake/releases/
├── v2026.02/
│   ├── catalog.json          # DuckLake catalog
│   ├── parquet/
│   │   ├── bottle.parquet
│   │   ├── cast.parquet
│   │   ├── larvae.parquet
│   │   └── ...
│   └── RELEASE_NOTES.md
├── v2026.03/
│   └── ...
└── latest -> v2026.03        # Symlink to latest

8.4 Accessing Frozen Releases

From R (using calcofi4r):

library(calcofi4r)

# Get latest release (default)
con <- cc_get_db()

# Get specific version
con <- cc_get_db(version = "v2026.02")

# List available versions
cc_list_versions()
#> [1] "v2026.02" "v2026.03" "v2026.04"

Direct Parquet access (no R required):

import duckdb

# Query directly from GCS (no download needed)
con = duckdb.connect()
con.execute("""
    SELECT * FROM read_parquet(
        'https://storage.googleapis.com/calcofi-db/ducklake/releases/latest/parquet/larvae.parquet'
    )
    LIMIT 10
""")

From any HTTP client:

# Download parquet file directly
curl -O https://storage.googleapis.com/calcofi-db/ducklake/releases/v2026.02/parquet/larvae.parquet

8.5 Release Schedule

Frequency Trigger Example
Monthly Regular data updates v2026.02, v2026.03
Ad-hoc Major dataset additions v2026.02.1 (patch)
Annual Year-end comprehensive release v2026.12

Release checklist: validate data quality, strip provenance columns, tag the version, upload to releases/, update the latest pointer, and announce the release (see Figure 6).
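
A sketch of cutting a monthly release with the calcofi4db functions planned in Section 9.3:

library(calcofi4db)

con <- get_working_ducklake()     # connect to the Working DuckLake
validate_for_release(con)         # data-quality checks before freezing
freeze_release(
  version       = "v2026.03",
  release_notes = "Monthly data refresh")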


9 R Package Strategy

9.1 Package Roles

9.1.1 calcofi4db - Database Administration Tools

Target users: Data managers, workflow developers

calcofi4db/R/
├── cloud.R        # GCS operations (@concept cloud)
├── read.R         # CSV/Parquet reading (@concept read)
├── transform.R    # Data transformation (@concept transform)
├── ingest.R       # Database loading (@concept ingest)
├── workflow.R     # Workflow outputs to GCS (@concept workflow)
├── parquet.R      # Parquet operations (@concept parquet)
├── duckdb.R       # DuckDB operations (@concept duckdb)
├── ducklake.R     # DuckLake operations (@concept ducklake)
├── archive.R      # GCS archive sync (@concept archive)
├── version.R      # Schema versioning (@concept version)
├── check.R        # Data validation (@concept check)
├── validate.R     # FK/lookup validation functions (@concept validate)
├── wrangle.R      # Local DuckDB wrangling functions (@concept wrangle)
├── ship.R         # Ship fuzzy matching functions (@concept ship) [NEW]
├── taxonomy.R     # Taxonomy standardization functions (@concept taxonomy) [NEW]
├── spatial.R      # Spatial operations (@concept spatial)
├── freeze.R       # Frozen release operations (@concept freeze)
├── display.R      # Display helpers (@concept display)
└── viz.R          # Diagram generation (@concept viz)

9.1.2 calcofi4r - User-Facing Tools

Target users: Researchers, analysts, app developers

calcofi4r/R/
├── read.R         # Data retrieval (@concept read)
├── database.R     # Database connection (@concept database)
├── analyze.R      # Statistical analysis (@concept analyze)
├── visualize.R    # Plotting and mapping (@concept visualize)
└── functions.R    # Helper functions (@concept utils)

9.2 Function Reuse Across Contexts

Context Package Example Usage
Ingestion workflows calcofi4db read_csv_files(), ingest_csv_to_db()
Publishing workflows calcofi4db get_db_con(), create_db_manifest()
Shiny apps calcofi4r cc_get_db(), cc_plot_timeseries()
Analysis scripts calcofi4r cc_read_bottle(), cc_analyze_trend()
API endpoints calcofi4r cc_query_db(), cc_get_variables()

9.3 New Functions to Implement

9.3.1 In calcofi4db:

# R/cloud.R - GCS operations
get_gcs_file(bucket, path, local_path = NULL)
put_gcs_file(local_path, bucket, path)
sync_gdrive_to_gcs(gdrive_path, gcs_bucket, backup = TRUE)
list_gcs_versions(bucket, path)

# R/parquet.R - Parquet operations
csv_to_parquet(csv_path, schema_def = NULL)
read_parquet_table(path, con = NULL)
write_parquet_table(data, path, partitions = NULL)
add_parquet_metadata(path, metadata_list)

# R/duckdb.R - DuckDB operations
get_duckdb_con(path = ":memory:", read_only = FALSE)
create_duckdb_views(con, manifest)
attach_ducklake(con, catalog_path)
set_duckdb_comments(con, table, comments)

# R/workflow.R - Workflow output functions [NEW]
write_ingest_outputs(data_info, provider, dataset, gcs_bucket)
  # 1. Transform data via transform_data()
  # 2. Write parquet files to gs://calcofi-db/ingest/{provider}_{dataset}/
  # 3. Write manifest.json with provenance info
read_ingest_manifest(provider, dataset, gcs_bucket)
read_ingest_parquet(provider, dataset, table, gcs_bucket)
list_ingest_outputs(gcs_bucket)
integrate_to_working_ducklake(ingests, gcs_bucket, ducklake_path)
  # Combines ingest outputs into ducklake/working/

# R/ducklake.R - DuckLake catalog operations
get_working_ducklake()           # connect to internal Working DuckLake
ingest_dataset(con, d, mode)     # ✓ IMPLEMENTED - high-level dataset ingestion
  # - calls transform_data() internally
  # - handles uuid column detection
  # - tracks provenance via gcs_path
ingest_to_working(data, table, source_file, source_uuid_col = NULL)
add_provenance_columns(data, source_file, source_row_start = 1)
save_working_ducklake(con)       # upload Working DuckLake to GCS
list_working_tables(con)         # list tables with provenance stats

# R/freeze.R - Frozen release operations
freeze_release(version, release_notes = NULL)
validate_for_release(con)        # run all quality checks before freeze
list_frozen_releases()           # list available frozen versions
compare_releases(v1, v2)         # diff two releases

# R/ship.R - Ship code reconciliation [NEW]
match_ships(unmatched_ships, reference_ships, ship_renames_csv, fetch_ices)
  # Multi-source fuzzy matching: manual overrides → exact nodc → exact name
  # → CalCOFI UNOLS → NODC WOD → ICES → word-overlap fuzzy
fetch_ship_ices(ices_api, ices_ship_code_type)
  # Fetch ship codes from ICES Reference Codes API

# R/taxonomy.R - Taxonomy standardization [NEW]
standardize_species(con, species_tbl, include_gbif, batch_size)
  # Query WoRMS → ITIS → GBIF for each species
  # Updates species table with worms_id, itis_id, gbif_id
build_taxon_table(con, species_tbl, taxon_tbl, include_itis, batch_size)
  # Build taxon hierarchy table from WoRMS/ITIS classification
  # Also creates taxa_rank lookup table with rank ordering

9.3.2 In calcofi4r:

# R/database.R - Frozen DuckLake access (user-facing)
cc_get_db(version = "latest", local_cache = TRUE)
  # Connect to frozen release (never Working DuckLake)
cc_list_versions()               # list available frozen releases
cc_db_info(version = "latest")   # release date, row counts, notes
cc_release_notes(version)        # view release notes for a version

# R/database.R - Derived views [NEW]
cc_make_view(con, template, view_name, base_table, column_definitions)
  # Create derived VIEW from template or custom SQL expressions
  # Templates: "casts_extra" (year, month, quarter, julian_day, lat/lon parts, cruise, db_sta_key)
cc_list_view_templates()         # list available prebaked templates

# R/read.R - Convenience functions for common tables
cc_read_bottle(version = "latest", ...)
cc_read_larvae(version = "latest", ...)
cc_read_cast(version = "latest", ...)

# R/query.R - Query helpers
cc_query(sql, version = "latest")
cc_list_tables(version = "latest")
cc_describe_table(table, version = "latest")
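
A sketch of the intended usage, combining the connection and derived-view helpers above (assuming template defaults for view_name and base_table):

library(calcofi4r)

con <- cc_get_db()                           # latest frozen release
cc_make_view(con, template = "casts_extra")  # adds year, month, quarter, ...

DBI::dbGetQuery(con, "
  SELECT year, quarter, COUNT(*) AS n_casts
  FROM casts_extra
  GROUP BY year, quarter
  LIMIT 5")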

10 Workflow Dependency Management with targets

10.1 Pipeline Definition

Create workflows/_targets.R:

library(targets)
library(tarchetypes)

tar_option_set(
  packages = c("calcofi4db", "duckdb", "dplyr", "arrow"),
  format = "qs"  # Fast serialization
)

# Config
DIR_DATA     <- "~/My Drive/projects/calcofi/data-public"
DIR_METADATA <- "metadata"  # workflows/metadata/
GCS_BUCKET   <- "calcofi-db"

# Define pipeline
list(
  # ─── Ingest SWFSC NOAA CalCOFI Database ─────────────────────────
  tar_target(
    ingest_swfsc,
    {
      d <- calcofi4db::read_csv_files(
        provider     = "swfsc.noaa.gov",
        dataset      = "calcofi-db",
        dir_data     = DIR_DATA,
        metadata_dir = DIR_METADATA)

      calcofi4db::write_ingest_outputs(
        data_info  = d,
        provider   = "swfsc.noaa.gov",
        dataset    = "calcofi-db",
        gcs_bucket = GCS_BUCKET)
    }),

  # ─── Ingest CalCOFI.org Bottle Database ─────────────────────────
  tar_target(
    ingest_bottle,
    {
      d <- calcofi4db::read_csv_files(
        provider     = "calcofi.org",
        dataset      = "bottle-database",
        dir_data     = DIR_DATA,
        metadata_dir = DIR_METADATA)

      calcofi4db::write_ingest_outputs(
        data_info  = d,
        provider   = "calcofi.org",
        dataset    = "bottle-database",
        gcs_bucket = GCS_BUCKET)
    }),

  # ─── Integrate Ingest Outputs into Working DuckLake ──────────────
  tar_target(
    working_ducklake,
    calcofi4db::integrate_to_working_ducklake(
      ingests       = list(ingest_swfsc, ingest_bottle),
      gcs_bucket    = GCS_BUCKET,
      ducklake_path = "ducklake/working")),

  # ─── Publish Datasets ───────────────────────────────────────────
  tar_target(
    obis_archive,
    create_obis_archive(working_ducklake, include_bottle = TRUE))
)

The key changes in this pipeline:

  1. metadata_dir parameter: Points to workflows/metadata/ instead of the legacy calcofi4db/inst/ingest/ path
  2. write_ingest_outputs(): Writes parquet files + manifest.json to gs://calcofi-db/ingest/{provider}_{dataset}/
  3. integrate_to_working_ducklake(): Combines ingest parquet files into the Working DuckLake at gs://calcofi-db/ducklake/working/

Note: No separate database/ folder - the Working DuckLake IS the integrated database. The releases/ folder will contain frozen snapshots with provenance columns stripped.
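
To run the pipeline locally from the workflows/ directory:

targets::tar_make()         # rebuilds only targets whose inputs changed
targets::tar_visnetwork()   # inspect the dependency graph (Figure 8)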

10.2 Workflow Lifecycle

Code
flowchart TB
    subgraph dev["Development Cycle"]
        direction LR
        d1["1\. Explore"]
        d2["2\. Ingest"]
        d3["3\. Targets"]
        d4["4\. Run"]
        d5["5\. Publish"]
        d1 --> d2 --> d3 --> d4 --> d5
    end

    subgraph update["Update Cycle"]
        direction LR
        u1["1\. New data"]
        u2["2\. Sync"]
        u3["3\. Run"]
        u4["4\. Republish"]
        u1 --> u2 --> u3 --> u4
    end

    dev -.->|"formalized"| update

    style dev fill:#fff3e0,stroke:#ef6c00
    style update fill:#e8f4e8,stroke:#2e7d32
Figure 7: Development and Update Cycles

10.3 Dependency Graph

Code
flowchart LR
    subgraph archive["📁 GCS Archive"]
        archive_swfsc["swfsc.noaa.gov/<br/>calcofi-db"]
        archive_bottle["calcofi.org/<br/>bottle-database"]
    end

    subgraph ingest["⚙️ Ingest Outputs"]
        ingest_swfsc["ingest_swfsc<br/>(parquet + manifest)"]
        ingest_bottle["ingest_bottle<br/>(parquet + manifest)"]
    end

    subgraph database["🗃️ Working DuckLake"]
        integrate["integrate_to_working_ducklake()"]
        ducklake["ducklake/working/"]
    end

    subgraph publish["🌐 Publish"]
        obis_archive
    end

    archive_swfsc --> ingest_swfsc
    archive_bottle --> ingest_bottle

    ingest_swfsc --> integrate
    ingest_bottle --> integrate

    integrate --> ducklake
    ducklake --> obis_archive

    style archive fill:#e3f2fd,stroke:#1565c0
    style ingest fill:#fff3e0,stroke:#ef6c00
    style database fill:#f3e5f5,stroke:#7b1fa2
    style publish fill:#fce4ec,stroke:#c2185b
Figure 8: targets Pipeline Dependency Graph

The key insight is the separation between ingest outputs (per-workflow parquet + manifest) and the final database (integrated DuckDB). Each ingest workflow:

  1. Reads from GCS archive (immutable source files)
  2. Applies transformations via redefinition metadata
  3. Writes parquet files to gs://calcofi-db/ingest/{provider}_{dataset}/
  4. Writes manifest.json with provenance (archive timestamp, table stats)

The final database step combines all ingest outputs into a unified DuckDB.
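
A sketch of inspecting one ingest output before integration, using the functions planned in Section 9.3 (gcs_bucket as configured in _targets.R):

library(calcofi4db)

m <- read_ingest_manifest(
  provider   = "swfsc.noaa.gov",
  dataset    = "calcofi-db",
  gcs_bucket = "calcofi-db")

cruise <- read_ingest_parquet(
  provider   = "swfsc.noaa.gov",
  dataset    = "calcofi-db",
  table      = "cruise",
  gcs_bucket = "calcofi-db")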


11 Documentation Updates

11.2 Where Documentation Should Live

Content Location Rationale
Database schema docs/db.qmd User-facing documentation
Workflow development workflows/README.md Developer guide
Package functions calcofi4r/, calcofi4db/ pkgdown reference
Data dictionary calcofi4db/inst/ Versioned with code
Architecture decisions docs/ or workflows/_docs/ Long-term reference

12 Implementation Priority

12.1 Phase 1: Foundation ✅ COMPLETE (2026-02-02)

Goal: Establish cloud infrastructure and helper functions

  1. Set up GCS buckets
    • Created gs://calcofi-files-public/ with public read access
    • Created gs://calcofi-files-private/ with restricted access
    • Configured rclone access via OAuth
  2. Add cloud functions to calcofi4db
    • Created R/cloud.R with get_gcs_file(), put_gcs_file()
    • Created R/parquet.R with csv_to_parquet()
    • Created R/duckdb.R with get_duckdb_con(), create_duckdb_views()
  3. Initial sync
    • Synced Google Drive data-public/ to gs://calcofi-files-public/_sync/
    • Created initial archive snapshot: archive/2026-02-02_121557/
    • Generated manifest: manifests/manifest_2026-02-02_121557.json
    • Total: 112,457 files (33.5 GiB)

12.2 Phase 2: Pipeline ✅ COMPLETE (2026-02-05)

Goal: Implement targets pipeline with new architecture

  1. Migrate metadata to workflows/metadata/
    • Created workflows/metadata/{provider}/{dataset}/ directories
    • Copied redefinition files from calcofi4db/inst/ingest/
    • Updated workflows to use metadata_dir parameter
  2. Create calcofi4db/R/workflow.R
    • write_ingest_outputs(): Write parquet + manifest to GCS
    • write_parquet_outputs(): Export tables to parquet with manifest
    • read_ingest_manifest(): Read manifest from GCS
  3. Update workflows/_targets.R
    • Extended with ingest_swfsc_manifest + ingest_bottle_manifest file targets
    • Added merge_manifest target that renders merge_ichthyo_bottle.qmd
    • Pipeline: CSV→parquet→DuckDB + manifest-based merge
  4. Cross-dataset integration (new)
    • Ship matching: calcofi4db/R/ship.R with match_ships(), fetch_ship_ices()
    • Cruise bridge: Derive cruise_key (YYMMKK) from bottle casts via ship_key + datetime
    • Taxonomy standardization: calcofi4db/R/taxonomy.R with standardize_species(), build_taxon_table()
    • Grid canonical source: grid.parquet only from swfsc (removed from bottle export)
  5. Update publish workflow (pending)
    • Modify publish_larvae_to_obis.qmd to read from DuckDB
    • Add bottle data to EMoF extension

12.3 Phase 3: Working DuckLake ✅ COMPLETE (2026-02-05)

Goal: Implement Working DuckLake with provenance tracking

  1. Working DuckLake functions implemented
    • get_working_ducklake(): Connect to Working DuckLake (GCS-backed)
    • ingest_to_working(): Ingest with provenance columns
    • add_provenance_columns(): Helper for _source_file, _source_row, _source_uuid, _ingested_at
    • save_working_ducklake(): Upload to GCS
    • list_working_tables(): Table summary with provenance stats
  2. Merge workflow loads all tables
    • merge_ichthyo_bottle.qmd loads 15+ tables from upstream parquet
    • Modified tables (casts, species, taxon, taxa_rank) exported separately
    • Manifest references GCS URIs for unchanged upstream tables

12.4 Phase 4: Frozen DuckLake ✅ COMPLETE (2026-02-05)

Goal: Implement Frozen DuckLake for public releases

  1. Freeze functions implemented
    • validate_for_release(): Data quality checks
    • freeze_release() / write_parquet_outputs(strip_provenance=TRUE): Create immutable release
    • Strip provenance columns for clean public schema
    • Generate release notes with cross-dataset integration summary
  2. Release structure
    • gs://calcofi-db/ducklake/releases/{version}/
    • Parquet files + manifest + RELEASE_NOTES.md
    • latest pointer to current release
  3. calcofi4r frozen access
    • cc_get_db(version = "latest"): Default to frozen releases
    • cc_list_versions(): List available releases
    • cc_release_notes(): View release notes
    • cc_make_view(con, "casts_extra"): Create derived VIEWs with prebaked templates
    • cc_list_view_templates(): List available templates
  4. First public release: v2026.02 (pending GCS upload)

12.5 Phase 5: Documentation & Polish

Goal: Complete documentation and evaluate API

  1. Update docs/
    • Revise db.qmd for DuckDB/DuckLake
    • Document Working vs Frozen databases
    • Evaluate API necessity
    • Update portal documentation
  2. Clean up workflows/
    • Consolidate README.md with essentials
    • Archive deprecated exploration notebooks
    • Document workflow development process
    • Create “Getting Started” guide for data managers

13 Verification Plan

13.1 End-to-End Test

# 1. Sync test data from Google Drive to GCS (rclone remotes: gdrive, gcs)
system("rclone sync gdrive:calcofi/data-public gcs:calcofi-files-public/_sync")

# 2. Create archive snapshot
timestamp <- format(Sys.time(), "%Y-%m-%d_%H%M%S")
system(glue::glue("rclone copy gcs:calcofi-files-public/_sync gcs:calcofi-files-public/archive/{timestamp}"))

# 3. Run targets pipeline
targets::tar_make()

# 4. Verify database
con <- calcofi4db::get_duckdb_con("calcofi.duckdb")
DBI::dbListTables(con)
DBI::dbGetQuery(con, "SELECT COUNT(*) FROM larvae")

# 5. Verify OBIS archive
obistools::check_fields(readr::read_csv("data/darwincore/larvae/occurrence.csv"))

13.2 Integration Tests
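
The specific tests are still to be enumerated; a hedged sketch of one such test against a frozen release, using testthat and the calcofi4r functions from Section 9.3:

library(testthat)
library(calcofi4r)

test_that("frozen release exposes core tables with rows", {
  con  <- cc_get_db()          # latest frozen release
  tbls <- cc_list_tables()
  expect_true(all(c("cruise", "site", "tow", "net", "larvae") %in% tbls))
  expect_gt(DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM larvae")$n, 0)
})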


14 Migration Strategy: PostgreSQL → DuckDB

14.1 Current PostgreSQL Dependencies

Based on server/docker-compose.yml, the following services depend on PostgreSQL:

Service Purpose Migration Path
postgis Spatial database Keep for vector tiles (short-term)
pg_tileserv Vector tile server Replace with PMTiles
pg_rest REST API (PostgREST) Keep for web apps, add DuckDB option
plumber R API Update to use DuckDB
pgadmin Database admin Deprecate once migrated

14.2 Vector Tiles: pg_tileserv → PMTiles

The current architecture uses pg_tileserv to serve vector tiles from PostGIS for Shiny mapping apps. The migration path:

Code
flowchart LR
    subgraph current["Current Flow (PostgreSQL)"]
        direction LR
        pg["PostGIS"]
        ts["pg_tileserv"]
        lf["Leaflet"]
        pg --> ts --> lf
    end

    subgraph target["Target Flow (Cloud-native)"]
        direction LR
        geo["GeoParquet"]
        tc["tippecanoe"]
        pm["PMTiles"]
        mg["mapgl"]
        geo --> tc --> pm --> mg
    end

    current -.->|"migrate"| target

    style current fill:#ffebee,stroke:#c62828
    style target fill:#e8f4e8,stroke:#2e7d32
Figure 9: Vector Tiles Migration Path

14.2.1 Implementation Steps

  1. Generate PMTiles from PostGIS

    # Export geometries from PostGIS
    ogr2ogr -f GeoJSON stations.geojson \
      PG:"host=postgis user=admin dbname=gis" \
      -sql "SELECT * FROM stations"
    
    # Convert to PMTiles using tippecanoe
    tippecanoe -o stations.pmtiles \
      --minimum-zoom=0 --maximum-zoom=14 \
      --layer=stations \
      stations.geojson
    
    # Upload to GCS
    gsutil cp stations.pmtiles gs://calcofi-db/tiles/
  2. Update Shiny Apps

    # Old: pg_tileserv
    leaflet() |>
      addTiles() |>
      addMapboxGL(
        style = list(
          sources = list(
            stations = list(
              type = "vector",
              tiles = list("https://tile.calcofi.io/public.stations/{z}/{x}/{y}.pbf")
            )
          )
        )
      )
    
    # New: PMTiles with mapgl
    library(mapgl)
    mapboxgl() |>
      add_pmtiles_source(
        id = "stations",
        url = "https://storage.googleapis.com/calcofi-db/tiles/stations.pmtiles"
      ) |>
      add_layer(
        id = "stations-layer",
        type = "circle",
        source = "stations",
        source_layer = "stations"
      )
  3. Add PMTiles generation to targets pipeline

    # In _targets.R
    tar_target(
      pmtiles_stations,
      create_pmtiles(
        geom_source = duckdb_file,
        query = "SELECT * FROM stations",
        output = "tiles/stations.pmtiles"
      )
    )

14.3 Migration Roadmap

Phase Timeline Actions
Phase 1 Now Add DuckDB as primary data access; keep PostgreSQL
Phase 2 +2 months Generate PMTiles for all spatial layers
Phase 3 +4 months Update Shiny apps to use mapgl + PMTiles
Phase 4 +6 months Deprecate pg_tileserv, reduce PostgreSQL to backup
Phase 5 +12 months Fully deprecate PostgreSQL (if no dependencies remain)

15 Resolved Questions

Based on clarifications received:

  1. GCS Project: Use existing project ucsd-sio-calcofi

    • Existing bucket calcofi-db → DuckLake/database files
    • New bucket calcofi-files-public → Versioned public source files (public read access)
    • New bucket calcofi-files-private → Versioned private/sensitive files (restricted access)
  2. Google Drive Structure (implemented 2026-02-02):

    • calcofi/data-public/ → Public datasets (bottle, larvae, etc.)
    • calcofi/data-private/ → Private/sensitive data (eDNA, zooplankton samples)
  3. GCS Folder Structure (implemented 2026-02-02):

    • _sync/ → Working directory (rclone syncs here, mutable)
    • archive/{timestamp}/ → Immutable snapshots (complete copies, not deltas)
    • manifests/ → JSON metadata files with manifest_latest.json pointer
  4. API Strategy: Keep PostgreSQL + API for now with migration roadmap

    • pg_tileserv needed for current Shiny apps
    • Migrate to PMTiles over time using tippecanoe + mapgl
  5. Priority Data: Larvae + Bottle datasets for proof-of-concept

  6. Bottle in OBIS: Include as EMoF extension in larvae dataset

16 Remaining Questions

  1. DuckLake vs MotherDuck: Self-hosted DuckLake catalog or use MotherDuck service?
  2. Automation: GitHub Actions vs server cron for scheduled rclone syncs?
  3. PMTiles hosting: Serve from GCS directly or via CDN?

17 Appendix: Key References

17.1 Documentation

17.2 Technologies

Database & Analytics:

Workflow & Orchestration:

Cloud Storage:

Vector Tiles:

17.3 CalCOFI Resources

Footnotes

  1. Technical term: This is a “versioned data lake” built on Google Cloud Storage with rclone synchronization.

  2. DuckDB is a fast, in-process database that can query Parquet files directly from cloud storage without downloading them.

  3. Technical terms: This is a “versioned data lake” implemented with Google Cloud Storage buckets and rclone synchronization with timestamped archive snapshots.