---
title: "CalCOFI Data Workflow Plan"
format:
  html:
    toc: true
    toc-depth: 3
    number-sections: true
    code-fold: true
    embed-resources: true
---
<style>
/* flowchart boxes */
.mermaid-js .flowchart-link { stroke: #333!important; }
/* gantt chart: make task text visible */
.mermaid-js .taskText,
.mermaid-js .taskTextOutsideRight,
.mermaid-js .taskTextOutsideLeft,
.mermaid-js [class^="taskText"] {
fill: #333!important;
}
</style>
<!--
PROMPT to Claude 2026-01-31:
Help me write a new a data workflow plan to manage and version raw data files, an integrated database and workflows to ingest raw data in and publish datasets out.
- Use documents in the @workflows/_docs folder as suggestions for technologies (with technological preference where in conflict given to the more recent files as given by their date prefix filename).
- We already have a server on the Google Cloud platform, so thinking of using Google Cloud Storage buckets and a new service account for managing that (rather than AWS S3).
- We already have workflows in @workflows, but there are not yet dependencies built between them. Nor is the sequence anywhere stated. For instance, to build the integrated database thus far requires running first @workflows/ingest_calcofi.org_bottle-database.qmd, then @workflows/ingest_calcofi.org_bottle-database.qmd (once completed). How can these be better organized and developed as a dataset is initially explored, then ingested, then updated?
- Consider using R package `targets` to manage dependencies between files and workflows, and `pins` for sharing and versioning files.
- Review and suggest revisions to the existing documentation under @docs/, especially on Database, API and Portals. Could the API and Postgres database be superceded by use of parquet files and DuckDB?
- Describe relevance, strategy and purpose of R packages @calcofi4r and @calcofi4db.
- How can we systematically reuse documented functions in R libraries across workflows, apps, database access, etc?
- For now, write this to @workflows/README_PLAN.md. Eventually, a more succinct version will probably be folded into @workflows/README.md, but also looking for suggestions, such as perhaps more appropriate in the R libraries @calcofi4r or @calcofi4db and/or @docs.
- Be sure to include a systematic plan for implementation, including what should be refactored first. Prioritize producing a tidy, clean version of the integrated database with ingestion from: @workflows/ingest_calcofi.org_bottle-database.qmd, @workflows/ingest_calcofi.org_bottle-database.qmd; and output to @workflows/publish_larvae_to_obis.qmdpublish_larvae_to_obis.qmd (which does not yet include the bottle data). Then get this example to work from storing source files in a versioned cloud storage bucket, to referencing these files in the ingest scripts (perhaps with helper functions to get latest cloud version file, given state of Google Drive), producing and uploading parquet files to cloud storage, registering new versions with a DuckLake, and producing helper functions to get latest (or past) versions of the database as relational duckdb with metadata as COMMENTS and foreign keys registered.
-->
> This document outlines a comprehensive strategy for managing and versioning CalCOFI data files, building an integrated database, and publishing datasets to external portals.
## Overview for Data Managers & Scientists {#sec-overview}
This section provides a high-level view of the CalCOFI data system. **You don't need to understand all the technical details**—just the four main components and how they connect.
### The Big Picture
```{mermaid}
%%| label: fig-overview
%%| fig-cap: "CalCOFI Data System Overview"
flowchart TB
subgraph drive["📁 THE DRIVE"]
direction LR
gd_pub["data-public/<br/>Bottle, Larvae, CTD..."]
gd_priv["data-private/<br/>eDNA, Zooplankton..."]
end
subgraph archive["🗄️ THE ARCHIVE"]
direction LR
sync["_sync/<br/>(working copy)"]
hist["archive/<br/>(snapshots)"]
end
subgraph workflows["⚙️ THE WORKFLOWS"]
direction LR
ingest["Ingest<br/>(CSV → DB)"]
publish["Publish<br/>(DB → Portals)"]
end
subgraph database["🗃️ THE DATABASE"]
direction LR
working["Working<br/>(internal)"]
frozen["Frozen<br/>(public releases)"]
end
subgraph outputs["🌐 OUTPUTS"]
direction LR
obis["OBIS"]
apps["Shiny Apps"]
api["API / R pkg"]
end
drive -->|"daily sync"| archive
archive -->|"read files"| workflows
workflows -->|"transform"| database
database -->|"publish"| outputs
style drive fill:#e8f4e8,stroke:#2e7d32
style archive fill:#e3f2fd,stroke:#1565c0
style workflows fill:#fff3e0,stroke:#ef6c00
style database fill:#f3e5f5,stroke:#7b1fa2
style outputs fill:#fce4ec,stroke:#c2185b
```
### Four Key Components
::: {.callout-note collapse="false"}
## 📁 The Drive
**What**: The shared Google Drive folder where you upload your data files.
**Location**: `My Drive/projects/calcofi/data-public/` (or `data-private/`)
**Who uses it**: Dataset managers (Rasmus, Ed, Nastassia, Linsey, etc.)
**How to use it**:
1. Create a folder for your data: `{provider}/{dataset}/` (e.g., `swfsc.noaa.gov/calcofi-db/`)
2. Inside, organize files into `raw/` (original data) and `derived/` (processed data)
3. Simply upload or overwrite files—versioning happens automatically
4. No special naming required (no date prefixes needed)
**Example structure**:
```
data-public/
├── calcofi.org/
│ └── bottle-database/
│ ├── raw/
│ │ └── bottle_export_2026-01-15.csv
│ └── derived/
│ └── bottle_cleaned.csv
├── swfsc.noaa.gov/
│ └── calcofi-db/
│ ├── raw/
│ │ ├── larva.csv
│ │ └── cruise.csv
│ └── derived/
```
:::
::: {.callout-note collapse="false"}
## 🗄️ The Archive
**What**: A versioned backup of all Drive files in the cloud[^datalake].
**Location**: Google Cloud Storage buckets (`gs://calcofi-files-public/` and `gs://calcofi-files-private/`)
**Who uses it**: Automated systems (you don't need to interact with it directly)
**What it does**:
- **Daily sync**: Copies all files from the Drive to the Archive
- **Snapshots**: Creates timestamped copies so we can always go back in time
- **Never loses data**: Even if you delete a file from Drive, it's preserved in the Archive
**Why it matters**: Scientists can reproduce analyses using the exact data from any past date.
[^datalake]: Technical term: This is a "versioned data lake" built on Google Cloud Storage with rclone synchronization.
:::
::: {.callout-note collapse="false"}
## ⚙️ The Workflows
**What**: Automated scripts that process data from the Archive into the Database.
**Location**: `/workflows/` folder (Quarto notebooks and R scripts)
**Who uses it**: Data engineers and developers
**Two types of workflows**:
1. **Ingest workflows** (`ingest_*.qmd`): Read CSV files → Clean/transform → Load into Database
2. **Publish workflows** (`publish_*.qmd`): Read from Database → Format for portals → Upload to OBIS, EDI, etc.
**Key feature**: Dependencies are managed automatically—if source data changes, all downstream steps re-run.
:::
::: {.callout-note collapse="false"}
## 🗃️ The Database
**What**: The integrated CalCOFI database combining all datasets into related tables.
**Format**: DuckDB[^duckdb] with Parquet files (modern, fast, serverless)
**Two versions**:
| Version | Purpose | Access | Features |
|---------|---------|--------|----------|
| **Working Database** | Internal processing | Workflows only | Full provenance tracking, UUIDs, source file references, time travel |
| **Frozen Database** | Public releases | Apps, API, R package | Clean tables, versioned releases (v2026.02, v2026.03...), optimized for queries |
**Why two versions?** The Working Database tracks every detail for reproducibility. The Frozen Database provides clean, stable snapshots for users who just want the data.
[^duckdb]: [DuckDB](https://duckdb.org/) is a fast, in-process database that can query Parquet files directly from cloud storage without downloading them.
:::
### How Data Flows
```{mermaid}
%%| label: fig-data-flow-simple
%%| fig-cap: "Simplified Data Flow"
flowchart TB
subgraph you["👤 YOU"]
upload["Upload CSV to Drive"]
end
subgraph auto["🤖 AUTOMATED"]
sync["Sync to Archive"]
snapshot["Create snapshot"]
transform["Run workflows"]
end
subgraph result["📊 RESULT"]
db["Updated Database"]
portals["Published to Portals"]
end
upload --> sync --> snapshot --> transform --> db --> portals
style you fill:#e8f4e8,stroke:#2e7d32
style auto fill:#fff3e0,stroke:#ef6c00
style result fill:#f3e5f5,stroke:#7b1fa2
```
**Your role as a dataset manager**: Upload files to the Drive. Everything else happens automatically.
---
## Executive Summary
This plan modernizes the CalCOFI data workflow architecture to:
1. Version raw data files in Google Cloud Storage (GCS)
2. Use DuckDB + Parquet as the primary integrated database format
3. Leverage `targets` for workflow dependency management
4. Use R packages (`calcofi4r`, `calcofi4db`) for reusable functions
5. Maintain reproducible pipelines from raw data to published datasets
---
## Current State Analysis
### Existing Infrastructure
| Component | Current State | Location |
|-----------|--------------|----------|
| **Workflows** | Individual Quarto notebooks | `/workflows/` |
| **Database** | PostgreSQL with dev/prod schemas | api.calcofi.io |
| **R Packages** | `calcofi4r` (user), `calcofi4db` (admin) | GitHub |
| **Data Sources** | CSV files on Google Drive | `~/My Drive/projects/calcofi/data-public` |
| **Documentation** | Quarto site | `/docs/` |
### Key Workflow Files
| Workflow | Purpose | Status |
|----------|---------|--------|
| `ingest_calcofi.org_bottle-database.qmd` | Ingest bottle/cast data | Partial |
| `ingest_swfsc.noaa.gov_calcofi-db.qmd` | Ingest NOAA ichthyoplankton data | Active |
| `publish_larvae_to_obis.qmd` | Publish larvae to OBIS | Active (needs bottle data) |
### Identified Gaps
1. **No dependency management**: Workflows run independently without declared dependencies
2. **No file versioning**: Source files updated in place without version history
3. **PostgreSQL overhead**: Requires server maintenance and API infrastructure
4. **Disconnected workflows**: No systematic connection between ingest and publish steps
5. **Manual sequencing**: Users must know to run bottle ingestion before larvae ingestion
---
## File Versioning Strategy
### The Two-World Problem
CalCOFI data originates from multiple sources managed by different individuals:
- **Rasmus**: Bottle/cast data, CTD profiles
- **Ed**: Ichthyoplankton (larvae, eggs) data
- **Nastassia**: eDNA data
- **Linsey**: Zooplankton samples
These dataset managers need a simple interface (Google Drive) to deposit and organize files, while the data engineering side needs version control, immutability, and reproducibility.
### Google Drive: The Human Interface
This is "**The Drive**" from @sec-overview. Dataset managers upload files here; everything else is automatic.
#### Folder Structure
**Recommended organization**: `{provider}/{dataset}/raw/` and `{provider}/{dataset}/derived/`
```
calcofi/data-public/ # Shared Google Drive folder (public data)
├── calcofi.org/
│ └── bottle-database/
│ ├── raw/ # Original exports from source system
│ │ └── bottle_export_2026-01.csv
│ └── derived/ # Cleaned/processed versions
│ ├── bottle.csv
│ ├── cast.csv
│ └── field_descriptions.csv
├── swfsc.noaa.gov/
│ └── calcofi-db/
│ ├── raw/ # Original NOAA database exports
│ │ └── calcofi_db_dump_2026-01.sql
│ └── derived/ # Extracted CSV tables
│ ├── cruise.csv
│ ├── egg.csv
│ ├── larva.csv
│ ├── net.csv
│ ├── ship.csv
│ ├── site.csv
│ ├── species.csv
│ └── tow.csv
├── coastwatch.pfeg.noaa.gov/
│ └── erdCalCOFIlrvsiz/
│ └── derived/
│ └── larvae_size.csv
└── _archive/ # Manual archive (optional)
└── 2025-01-15_bottle.csv # User-created backup
calcofi/data-private/ # Private/sensitive data (restricted access)
├── edna/ # eDNA samples (Nastassia)
│ ├── raw/
│ └── derived/
├── zooplankton/ # Zooplankton samples (Linsey)
│ ├── raw/
│ └── derived/
└── _archive/ # Manual archive (optional)
```
#### Google Drive Naming Conventions
- **Top-level folders**: `{provider}/` - the organization providing the data (e.g., `swfsc.noaa.gov`)
- **Dataset folders**: `{dataset}/` - specific dataset name (e.g., `calcofi-db`)
- **Subfolders** (recommended):
- `raw/` - Original data as received from source (database dumps, original CSVs)
- `derived/` - Processed/cleaned versions ready for ingestion
- **Files**: Original source names, no date prefixes required
- **Updates**: Simply overwrite existing files
- **No durable versioning in Drive**: Google Drive's built-in version history is limited and expires, so the Archive (not Drive) is the authoritative version history
#### Dataset Manager Workflow
1. Create folder structure: `{provider}/{dataset}/raw/` and `{provider}/{dataset}/derived/`
2. Upload original data to `raw/` (preserve original format)
3. Upload processed/cleaned data to `derived/` (CSV format preferred)
4. Simply overwrite files when updating—versioning happens automatically in the Archive
5. Optionally notify data team of significant changes
**See @sec-overview for the big picture of how your data flows through the system.**
### Google Cloud Storage: The Versioned Data Lake
This is "**The Archive**" from @sec-overview[^datalake-detail]. It automatically captures versioned snapshots of all files from the Drive.
[^datalake-detail]: Technical terms: This is a "versioned data lake" implemented with Google Cloud Storage buckets and rclone synchronization with timestamped archive snapshots.
#### Bucket Structure
```
gs://calcofi-files-public/ # Versioned public source files
├── _sync/ # Working directory (rclone syncs here)
│ ├── calcofi.org/
│ │ └── bottle-database/
│ │ ├── bottle.csv
│ │ └── cast.csv
│ └── swfsc.noaa.gov/
│ └── calcofi-db/
│ ├── cruise.csv
│ ├── egg.csv
│ └── ...
├── archive/ # Timestamped immutable snapshots
│ ├── 2026-02-02_121557/ # YYYY-MM-DD_HHMMSS
│ │ └── calcofi.org/
│ │ └── bottle-database/
│ │ └── bottle.csv # Snapshot at that time
│ ├── 2026-02-15_120000/
│ │ └── swfsc.noaa.gov/
│ │ └── calcofi-db/
│ │ └── larva.csv
│ └── ...
└── manifests/ # Version metadata
├── manifest_2026-02-02_121557.json
├── manifest_2026-02-15_120000.json
└── manifest_latest.json # Points to latest archive
gs://calcofi-files-private/ # Versioned private source files
├── _sync/ # Working directory (rclone syncs here)
│ ├── edna/
│ └── zooplankton/
├── archive/ # Timestamped immutable snapshots
└── manifests/
gs://calcofi-db/ # Integrated database
├── ingest/ # Per-workflow outputs (intermediate)
│ ├── swfsc.noaa.gov_calcofi-db/ # Flattened {provider}_{dataset}
│ │ ├── cruise.parquet
│ │ ├── ship.parquet
│ │ ├── site.parquet
│ │ ├── ...
│ │ ├── manifest.json # Provenance for this workflow
│ │ └── metadata.json # Table & column metadata sidecar
│ └── calcofi.org_bottle-database/
│ ├── cast.parquet
│ ├── bottle.parquet
│ ├── manifest.json
│ └── metadata.json
│
└── ducklake/ # DuckLake manages parquet + catalog
├── working/ # Integrated database (combines ingests)
│ ├── catalog/ # DuckLake catalog metadata
│ └── data/ # Parquet files with provenance columns
│ ├── cruise/
│ ├── site/
│ └── ...
└── releases/ # Frozen public releases
├── v2026.02/
│ ├── catalog/
│ └── data/ # Provenance columns stripped
        └── latest -> v2026.02        # Pointer to current release (GCS has no true symlinks; implemented as a copy or redirect object)
```
### Workflow Metadata
Redefinition files (column/table renaming rules) live in `workflows/metadata/`, NOT in the calcofi4db R package. This decouples workflow-specific metadata from the package code.
```
workflows/metadata/
├── swfsc.noaa.gov/
│ └── calcofi-db/
│ ├── tbls_redefine.csv # Table renaming rules
│ ├── flds_redefine.csv # Field renaming/typing rules
│ ├── tbls_raw.csv # Auto-generated raw table info
│ └── flds_raw.csv # Auto-generated raw field info
└── calcofi.org/
└── bottle-database/
├── flds_rename.csv
└── ship_renames.csv
```
**Benefits of this structure:**
- Version-controlled alongside workflows (not buried in R package)
- Easy to edit without rebuilding the package
- Clear ownership per dataset
- Allows different projects to use calcofi4db without conflicting metadata
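
As an illustration, a `tbls_redefine.csv` might map raw source table names to tidy database names (the column names and values here are assumptions, not the actual schema):

```
tbl_old,tbl_new,description
BottleDataBase,bottle,Measurements per bottle depth
CastDataBase,cast,One row per CTD/bottle cast
```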
#### GCS Naming Conventions
- **_sync/**: Working directory, exact mirror of Google Drive structure (rclone syncs here)
- **archive/{YYYY-MM-DD_HHMMSS}/**: Immutable snapshot of complete data at that timestamp
- **manifests/manifest_{YYYY-MM-DD_HHMMSS}.json**: Snapshot metadata
- **manifests/manifest_latest.json**: Points to latest archive snapshot
### Rclone Sync: Capturing Daily Changes
#### Installation
**macOS:**
```bash
brew install rclone
```
**Linux (Ubuntu/Debian):**
```bash
curl https://rclone.org/install.sh | sudo bash
```
#### Configuration
Run `rclone config` and create two remotes:
**Remote 1: Google Drive (`gdrive`)**
```
n) New remote
name> gdrive
Storage> drive
client_id> # leave blank
client_secret> # leave blank
scope> 1 # Full access (or 2 for read-only)
service_account_file> # leave blank for OAuth, or path to JSON key
Edit advanced config> n
Use auto config> y # opens browser (use 'n' on headless server)
Configure this as a Shared Drive> n
```
<!---
Ben's laptop output on 2026-02-01:
Configuration complete.
Options:
- type: drive
- scope: drive
- token: {"access_token":"...","expiry":"2026-02-01T01:40:32.653983-06:00","expires_in":3599}
- team_drive:
Keep this "gdrive-ecoquants" remote?
y) Yes this is OK (default)
e) Edit this remote
d) Delete this remote
y/e/d> y
Current remotes:
Name Type
==== ====
gcs-calcofi google cloud storage
gdrive-ecoquants drive
-->
**Remote 2: Google Cloud Storage (`gcs`)**
```
n) New remote
name> gcs
Storage> google cloud storage
client_id> # leave blank
client_secret> # leave blank
project_number> ucsd-sio-calcofi
service_account_file> # leave blank for OAuth, or path to JSON key
Edit advanced config> n
Use auto config> y # opens browser (use 'n' on headless server)
```
<!--
Ben's laptop output on 2026-02-01:
Use web browser to automatically authenticate rclone with remote?
* Say Y if the machine running rclone has a web browser you can use
* Say N if running rclone on a (remote) machine without web browser access
If not sure try Y. If Y failed, try N.
y) Yes (default)
n) No
y/n>
2026/02/01 00:29:35 NOTICE: Make sure your Redirect URL is set to "http://127.0.0.1:53682/" in your custom config.
2026/02/01 00:29:35 NOTICE: If your browser doesn't open automatically go to the following link: http://127.0.0.1:53682/auth?state=paZTK8N-GsB5j3kJDEq5SA
2026/02/01 00:29:35 NOTICE: Log in and authorize rclone for access
2026/02/01 00:29:35 NOTICE: Waiting for code...
2026/02/01 00:29:48 NOTICE: Got code
Configuration complete.
Options:
- type: google cloud storage
- project_number: ucsd-sio-calcofi
- object_acl: publicRead
- bucket_acl: publicRead
- bucket_policy_only: true
- token: {"access_token":"...","expiry":"2026-02-01T01:29:48.100866-06:00","expires_in":3599}
Keep this "gcs-calcofi" remote?
y) Yes this is OK (default)
e) Edit this remote
d) Delete this remote
y/e/d> y
Current remotes:
Name Type
==== ====
gcs-calcofi google cloud storage
gdrive drive
-->
#### Headless Server Setup
For Linux servers without a browser, use `rclone authorize` on a local machine:
```bash
# on local machine with browser
rclone authorize "drive"
# copy the token output
# on server, during rclone config
Use auto config> n
# paste the token when prompted
```
Similarly for GCS:
```bash
# on local machine
rclone authorize "google cloud storage"
```
#### Service Account (Optional)
For automated/scheduled syncs, use a GCP service account instead of OAuth:
1. Create service account in GCP Console → IAM → Service Accounts
2. Grant roles: `Storage Object Admin` for `calcofi-files-public`, `calcofi-files-private`, and `calcofi-db` buckets
3. Download JSON key file to secure location (e.g., `/etc/rclone/calcofi-sa.json`)
4. In `rclone config`, provide path to JSON key instead of using OAuth
#### Verify Configuration
```bash
# test Google Drive access
rclone lsd gdrive:projects/calcofi/data-public # OR
rclone lsd gdrive-ecoquants:projects/calcofi/data-public
# test GCS access
rclone ls gcs:calcofi-files-public --max-depth 1 # OR
rclone ls gcs-calcofi:calcofi-files-public --max-depth 1
# test sync (dry-run first!)
rclone sync gdrive:calcofi/data-public gcs:calcofi-files-public/_sync --dry-run -v
rclone sync gdrive-ecoquants:projects/calcofi/data-public gcs-calcofi:calcofi-files-public/_sync --dry-run -v
```
#### How It Works
```bash
#!/bin/bash
# sync_gdrive_to_gcs.sh - Run daily via cron
# Uses the rclone remotes configured above: gdrive (Google Drive), gcs (Cloud Storage).
# Note: rclone addresses buckets as gcs:bucket/path, not gs://bucket/path.
TIMESTAMP=$(date +%Y-%m-%d_%H%M%S)
LOG_FILE="/var/log/rclone/sync_${TIMESTAMP}.log"
# --- PUBLIC DATA ---
# Step 1: Sync Google Drive to _sync/ working directory
rclone sync gdrive:calcofi/data-public gcs:calcofi-files-public/_sync \
  --checksum \
  --drive-export-formats csv \
  --log-file "${LOG_FILE}" \
  --log-level INFO
# Step 2: Create immutable archive snapshot
rclone copy gcs:calcofi-files-public/_sync gcs:calcofi-files-public/archive/${TIMESTAMP}
# Step 3: Generate and upload manifest
rclone lsjson gcs:calcofi-files-public/archive/${TIMESTAMP} --recursive \
  > /tmp/manifest_${TIMESTAMP}.json
rclone copy /tmp/manifest_${TIMESTAMP}.json gcs:calcofi-files-public/manifests/
rclone copyto /tmp/manifest_${TIMESTAMP}.json gcs:calcofi-files-public/manifests/manifest_latest.json
# --- PRIVATE DATA ---
# Repeat for private bucket (separate log file)
rclone sync gdrive:calcofi/data-private gcs:calcofi-files-private/_sync \
  --checksum \
  --drive-export-formats csv \
  --log-file "${LOG_FILE%.log}_private.log" \
  --log-level INFO
rclone copy gcs:calcofi-files-private/_sync gcs:calcofi-files-private/archive/${TIMESTAMP}
```
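
The script above can be scheduled with cron. An illustrative crontab entry (the user, script path, and log path are assumptions):

```
# /etc/cron.d/calcofi-sync — run daily at midnight UTC
0 0 * * * calcofi /usr/local/bin/sync_gdrive_to_gcs.sh >> /var/log/rclone/cron.log 2>&1
```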
#### Key Behaviors
| Scenario | Google Drive Action | Rclone Behavior | Result in GCS |
|----------|--------------------|-----------------|----|
| New file | User uploads `new.csv` | `rclone sync` copies to `_sync/` | `_sync/new.csv` created, next archive includes it |
| Updated file | User overwrites `bottle.csv` | `rclone sync` updates `_sync/` | `_sync/bottle.csv` updated, archive snapshots preserve history |
| Deleted file | User deletes `old.csv` | `rclone sync` removes from `_sync/` | Removed from `_sync/`, preserved in prior archives |
| Renamed file | User renames `a.csv` → `b.csv` | Treated as delete + create | `a.csv` gone from `_sync/`, `b.csv` created |
| No changes | Nothing | `--checksum` skips identical files | No new sync needed |
#### The Two-Step Archive Strategy
The archiving approach uses two steps for reliability:
1. **Sync to `_sync/`**: rclone syncs Google Drive to the `_sync/` working directory
2. **Copy to `archive/{timestamp}/`**: Create an immutable snapshot of the current state
Benefits:
- **Immutable archives**: Each archive folder is a complete snapshot, not just deltas
- **Reproducibility**: Any archive folder contains everything needed to recreate that state
- **No data loss**: If sync fails, previous archives remain intact
- **Public access**: `calcofi-files-public` bucket has public read access for easy data sharing
### Manifest Files
Each manifest captures the complete state of the archive at a point in time:
```json
{
"generated_at": "2026-02-02T12:15:57Z",
"archive_timestamp": "2026-02-02_121557",
"bucket": "gs://calcofi-files-public",
"archive_path": "archive/2026-02-02_121557",
"files": [
{
"Path": "calcofi.org/bottle-database/bottle.csv",
"Size": 45234567,
"ModTime": "2026-02-02T10:30:00Z",
"MimeType": "text/csv",
"gcs_url": "gs://calcofi-files-public/archive/2026-02-02_121557/calcofi.org/bottle-database/bottle.csv"
},
{
"Path": "swfsc.noaa.gov/calcofi-db/larva.csv",
"Size": 12345678,
"ModTime": "2026-02-01T08:00:00Z",
"MimeType": "text/csv",
"gcs_url": "gs://calcofi-files-public/archive/2026-02-02_121557/swfsc.noaa.gov/calcofi-db/larva.csv"
}
],
"total_files": 112457,
"total_size_bytes": 35987654321
}
```
### Reconstructing Historical State
To recreate the data lake as it existed on a specific date:
```r
# R function to get the public HTTPS URL of a file from a specific archive snapshot
library(glue)
library(jsonlite)

get_archived_file <- function(path, timestamp = "latest", bucket = "calcofi-files-public") {
  # manifests in the public bucket are world-readable, so fetch over HTTPS
  manifest_name <- if (timestamp == "latest")
    "manifest_latest.json"
  else
    glue("manifest_{timestamp}.json")
  manifest_url <- glue("https://storage.googleapis.com/{bucket}/manifests/{manifest_name}")
  manifest     <- fromJSON(manifest_url, simplifyVector = FALSE)
  # construct the public HTTPS URL for the archived file
  glue("https://storage.googleapis.com/{bucket}/archive/{manifest$archive_timestamp}/{path}")
}

# example: the bottle CSV as of the latest snapshot
# get_archived_file("calcofi.org/bottle-database/bottle.csv")
```
### Automation Schedule
```{mermaid}
%%| label: fig-schedule
%%| fig-cap: "Daily Sync Schedule (UTC)"
%%---
%%displayMode: compact
%%---
%%{init: {'theme': 'default', 'themeVariables': { 'textColor': '#333'}}}%%
gantt
title Daily Sync Schedule (UTC)
dateFormat HH:mm
axisFormat %H:%M
section Sync
rclone :a1, 00:00, 10m
archive :a2, after a1, 5m
manifest :a3, after a2, 5m
section Pipeline
targets :b1, after a3, 10m
parquet :b2, after b1, 5m
duckdb :b3, after b2, 5m
section Notify
notify :c1, after b3, 5m
```
The daily automation runs at midnight UTC and completes in under an hour:
1. **Sync** phase: `rclone` copies new/changed files from Google Drive to the `_sync/` folder in GCS, then creates an immutable `archive/` snapshot and generates a manifest file documenting the state.
2. **Pipeline** phase: The `targets` pipeline detects which source files changed and re-runs only the affected steps—transforming CSVs to Parquet format and updating the DuckDB database.
3. **Notify** phase: Sends notifications (email/Slack) summarizing what changed and any errors encountered.
If no files changed in Google Drive, the pipeline completes quickly with minimal processing.
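
The Pipeline phase can be sketched as a minimal `_targets.R`. The helper functions (`csv_to_parquet()`, `register_with_ducklake()`, `publish_to_obis()`) are illustrative placeholders for functions that would live in `calcofi4db`; only `get_archived_file()` is defined elsewhere in this plan:

```r
# _targets.R — minimal sketch of the proposed pipeline (target and helper names illustrative)
library(targets)
tar_source()  # load helper functions from R/

list(
  tar_target(
    bottle_csv,
    get_archived_file("calcofi.org/bottle-database/bottle.csv"),
    format = "url"),  # "url" format invalidates downstream targets when the file changes
  tar_target(bottle_parquet, csv_to_parquet(bottle_csv)),
  tar_target(db_snapshot,    register_with_ducklake(bottle_parquet)),
  tar_target(obis_dataset,   publish_to_obis(db_snapshot))
)
```

Running `targets::tar_make()` then rebuilds only the targets whose upstream inputs changed, which is exactly the "detects which source files changed" behavior described above.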
---
## Proposed Architecture
### Data Flow Diagram
```{mermaid}
%%| label: fig-data-flow
%%| fig-cap: "CalCOFI Data Flow Architecture"
flowchart TB
subgraph drive["📁 THE DRIVE"]
direction LR
gdrive_pub["data-public/"]
gdrive_priv["data-private/"]
end
subgraph archive["🗄️ THE ARCHIVE"]
direction LR
bucket_pub["calcofi-files-public"]
bucket_priv["calcofi-files-private"]
end
subgraph database["🗃️ THE DATABASE"]
direction LR
working["Working DuckLake"]
frozen["Frozen Releases"]
end
subgraph outputs["🌐 OUTPUTS"]
direction LR
obis["OBIS"]
edi["EDI"]
erddap["ERDDAP"]
apps["Apps/API"]
end
drive -->|"rclone sync"| archive
archive -->|"targets pipeline"| database
database -->|"publish workflows"| outputs
style drive fill:#e8f4e8,stroke:#2e7d32
style archive fill:#e3f2fd,stroke:#1565c0
style database fill:#f3e5f5,stroke:#7b1fa2
style outputs fill:#fce4ec,stroke:#c2185b
```
### Technology Stack
| Layer | Technology | Purpose | Replaces |
|-------|------------|---------|----------|
| **Storage** | | | |
| | `gs://calcofi-files-public` | Versioned public source CSV files | Google Drive only |
| | `gs://calcofi-files-private` | Versioned private/sensitive files | Google Drive only |
| | `gs://calcofi-db` | Parquet, DuckDB, PMTiles | - |
| **Sync** | rclone | Drive → GCS with archive snapshots | Manual copying |
| **Format** | Apache Parquet | Efficient columnar storage | CSV |
| **Compute** | DuckDB | Serverless SQL queries | PostgreSQL (gradual) |
| **Database** | | | |
| | Working DuckLake | Internal DB with provenance, time travel | PostgreSQL |
| | [Frozen DuckLake](https://ducklake.select/2025/10/24/frozen-ducklake/) | Public versioned releases (v2026.02, etc.) | PostgreSQL snapshots |
| **Orchestration** | `targets` | Dependency management, caching | Manual sequencing |
| **Tiles** | PMTiles + tippecanoe | Cloud-native vector tiles | pg_tileserv |
| **Mapping** | mapgl | Modern WebGL maps in Shiny | leaflet + pg_tileserv |
| **Functions** | `calcofi4r`, `calcofi4db` | Reusable R code | Ad-hoc scripts |
**GCP Project**: `ucsd-sio-calcofi`
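
To illustrate the Compute layer, the R sketch below queries a Parquet file in `gs://calcofi-db` directly over HTTPS via DuckDB's `httpfs` extension, with nothing downloaded or installed server-side. The object path is illustrative, following the bucket layout proposed above:

```r
# Sketch: serverless query of a cloud-hosted Parquet file (object path is an assumption)
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
dbExecute(con, "INSTALL httpfs; LOAD httpfs;")
larva <- dbGetQuery(con, "
  SELECT *
  FROM read_parquet(
    'https://storage.googleapis.com/calcofi-db/ingest/swfsc.noaa.gov_calcofi-db/larva.parquet')
  LIMIT 10")
dbDisconnect(con, shutdown = TRUE)
```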
---
## Database Naming Conventions {#sec-naming}
### Table Names
- **Table names are singular** (e.g., `cruise`, `site`, `species`, `lookup` not `cruises`, `sites`, `lookups`)
- Table names use all lower case with underscores for multi-word names (snake_case)
### Column Names
- Use **snake_case** (lower-case with underscores) for all column names
- Unique identifiers are suffixed with:
- `*_id` for integer keys (sequential or from source)
- `*_key` for string/text keys (natural keys)
- `*_uuid` for universally unique identifiers (provenance tracking only)
- Suffix with units where applicable (e.g., `length_mm`, `depth_m`, `temp_degc`)
---
## Primary Key Strategy {#sec-primary-keys}
### General Principle: Natural Keys Over Surrogate Keys
**Prefer natural keys** (meaningful identifiers from the domain) over surrogate keys (arbitrary integers) unless:
- No natural key exists
- Performance requires integer keys for large tables with frequent joins
- The natural key is unstable or may change
**Benefits of natural keys:**
- Self-documenting (cruise_key `2401NH` tells you January 2024, New Horizon)
- No lookup needed to understand relationships
- Consistent across systems (source data uses same identifiers)
- Easier debugging and data validation
### Natural Keys (use as PRIMARY KEY)
| Table | Primary Key | Format | Example |
|-------|-------------|--------|---------|
| `cruise` | `cruise_key` | TEXT `YYMMKK` | `'2401NH'` |
| `ship` | `ship_key` | TEXT (2-letter) | `'NH'` |
| `tow_type` | `tow_type_key` | TEXT | `'CB'` |
| `species` | `species_id` | INTEGER | `1234` (natural from source) |
### Sequential Integer Keys (with explicit sort order)
For tables without natural keys, use sequential integers **with deterministic sort order** for reproducibility when appending new data:
| Table | Primary Key | Sort Order for ID Assignment |
|-------|-------------|------------------------------|
| `site` | `site_id` | `ORDER BY cruise_key, orderocc` |
| `tow` | `tow_id` | `ORDER BY site_id, time_start` |
| `net` | `net_id` | `ORDER BY tow_id, side` |
| `ichthyo` | `ichthyo_id` | `ORDER BY net_id, species_id, life_stage, measurement_type, measurement_value` |
| `lookup` | `lookup_id` | `ORDER BY lookup_type, lookup_num` |
**Why sort order matters:** When new cruises are added later, rows get appended. Without deterministic ordering, the same data could get different IDs on different runs, breaking reproducibility.
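
A minimal sketch of deterministic ID assignment using DuckDB's `row_number()` over the declared sort order (the sample rows are invented for illustration):

```r
# Sketch: reproducible site_id assignment via deterministic sort order
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
dbExecute(con, "
  CREATE TABLE site_raw AS
  SELECT * FROM (VALUES ('2401NH', 2), ('2401NH', 1), ('2312SH', 1))
    AS t(cruise_key, orderocc)")
site <- dbGetQuery(con, "
  SELECT row_number() OVER (ORDER BY cruise_key, orderocc) AS site_id, *
  FROM site_raw
  ORDER BY site_id")
dbDisconnect(con, shutdown = TRUE)
```

Because the window ordering matches the table's declared sort order, re-running the ingest on the same inputs always yields identical IDs, and appended cruises receive new IDs without renumbering existing rows.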
### Foreign Key Relationships
```
ship.ship_key (PK)
↓
cruise.cruise_key (PK) ←── cruise.ship_key (FK)
↓
site.site_id (PK) ←── site.cruise_key (FK)
↓
tow.tow_id (PK) ←── tow.site_id (FK)
↓ tow.tow_type_key (FK) → tow_type.tow_type_key
net.net_id (PK) ←── net.tow_id (FK)
↓
ichthyo.ichthyo_id (PK) ←── ichthyo.net_id (FK)
ichthyo.species_id (FK) → species.species_id
--- Bottle chain (linked via cruise bridge) ---
casts.cast_id (PK) ←── casts.cruise_key (FK, nullable) → cruise.cruise_key
↓ casts.grid_key (FK) → grid.grid_key
↓ casts.ship_key (FK, nullable) → ship.ship_key
bottle.bottle_id (PK) ←── bottle.cast_id (FK)
↓
bottle_measurement ←── bottle_measurement.bottle_id (FK)
bottle_measurement.measurement_type_id → measurement_type
--- Taxonomy ---
species.species_id (PK) species.worms_id → taxon (WoRMS authority)
species.itis_id → taxon (ITIS authority)
taxon.taxonID (PK per authority)
taxa_rank.taxonRank (PK)
--- Spatial ---
grid.grid_key (PK) ←── canonical from swfsc (not in bottle export)
segment (reference table)
```
**Note**: `grid.parquet` is only exported from the swfsc workflow (canonical source). The bottle workflow no longer exports grid.
### UUID Columns
- **Remove** `cruise_uuid`, `site_uuid`, `tow_uuid`, `net_uuid` from output tables
- **Retain** `_source_uuid` in Working DuckLake for provenance (stripped in frozen release)
---
## Database Architecture: Working vs Frozen DuckLake {#sec-database}
The CalCOFI integrated database uses [DuckLake](https://ducklake.select/), a lakehouse catalog that provides version control for data tables. We maintain **two distinct databases** to serve different needs:
### The Two-Database Strategy
```{mermaid}
%%| label: fig-ducklake-architecture
%%| fig-cap: "Working vs Frozen DuckLake Architecture"
flowchart TB
subgraph archive["📁 Archive"]
csv["CSV source files"]
end
subgraph working["🔧 WORKING DUCKLAKE"]
direction LR
w_tables["Provenance columns"]
w_history["Time travel"]
w_access["Internal only"]
end
subgraph frozen["❄️ FROZEN DUCKLAKE"]
direction LR
f_versions["v2026.02, v2026.03..."]
f_clean["Clean tables"]
f_access["Public access"]
end
subgraph outputs["🌐 Public Access"]
direction LR
apps["Shiny Apps"]
api["REST API"]
rpkg["calcofi4r"]
end
csv -->|"ingest workflows"| working
working -->|"freeze release"| frozen
frozen --> outputs
style archive fill:#e3f2fd,stroke:#1565c0
style working fill:#fff3e0,stroke:#ef6c00
style frozen fill:#e8f4e8,stroke:#2e7d32
style outputs fill:#fce4ec,stroke:#c2185b
```
### Working DuckLake (Internal)
The Working DuckLake is the **internal, mutable database** used by workflows. It prioritizes:
- **Full provenance tracking**: Every row knows where it came from
- **Time travel**: Query the database as it existed at any past point
- **Frequent updates**: Can be modified whenever new data arrives
**Schema features**:
| Column | Type | Purpose |
|--------|------|---------|
| `_source_file` | VARCHAR | Path to original CSV file in Archive |
| `_source_row` | INTEGER | Row number in source file |
| `_source_uuid` | UUID | Original record ID (e.g., from NOAA database) |
| `_ingested_at` | TIMESTAMP | When this row was added/updated |
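Stamping these columns onto incoming rows is mechanical; a sketch of the planned `add_provenance_columns()` behavior (Python for illustration; `_source_uuid` handling omitted since it is optional per source):

```python
from datetime import datetime, timezone


def add_provenance(rows, source_file, source_row_start=1):
    """Stamp each row with the provenance columns described above."""
    now = datetime.now(timezone.utc).isoformat()
    return [
        {**row,
         "_source_file": source_file,
         "_source_row": source_row_start + i,
         "_ingested_at": now}
        for i, row in enumerate(rows)
    ]


stamped = add_provenance(
    [{"cast_id": "C1"}, {"cast_id": "C2"}],
    source_file="swfsc.noaa.gov/calcofi-db/cast.csv")
# stamped[1] carries _source_file, _source_row = 2, and _ingested_at
```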
**Example query with provenance**:
```sql
-- Find all larvae records from a specific source file
SELECT scientific_name, count, _source_file, _source_row
FROM larvae
WHERE _source_file LIKE '%swfsc.noaa.gov/calcofi-db/larva.csv'
AND cruise_id = '202301CC';
```
**Time travel query**:
```sql
-- Query larvae table as it existed on January 15th
SELECT * FROM larvae
  AT (TIMESTAMP => '2026-01-15 00:00:00');
```
### Frozen DuckLake (Public Releases) {#sec-frozen}
The Frozen DuckLake provides **stable, versioned snapshots** for external users. Based on the [Frozen DuckLake](https://ducklake.select/2025/10/24/frozen-ducklake/) pattern:
**Key characteristics**:
- **Immutable**: Once released, a version never changes
- **Clean schema**: No internal provenance columns (cleaner for users)
- **Calendar versioning**: `v{YYYY}.{MM}` format (e.g., `v2026.02`)
- **Public access**: Anyone can query via HTTP/Parquet
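Tags in the `v{YYYY}.{MM}` scheme (with an optional patch component) sort naturally once parsed; a sketch of how a client could resolve the `latest` pointer locally (helper names are illustrative, not part of `calcofi4r`):

```python
def parse_version(tag):
    """Parse a release tag like 'v2026.02' or 'v2026.02.1' into a sortable tuple."""
    parts = tag.lstrip("v").split(".")
    # Pad with zeros so 'v2026.02' and 'v2026.02.1' compare correctly
    return tuple(int(p) for p in parts) + (0,) * (3 - len(parts))


def latest_version(tags):
    """Resolve 'latest' by calendar order; patch releases sort after their base."""
    return max(tags, key=parse_version)


latest_version(["v2026.02", "v2026.03", "v2026.02.1"])
# → 'v2026.03'
```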
**Release process**:
```{mermaid}
%%| label: fig-freeze-process
%%| fig-cap: "Freezing a Release"
flowchart LR
working["Working DuckLake"]
validate["Validate data quality"]
strip["Remove provenance columns"]
version["Tag version (v2026.02)"]
upload["Upload to releases/"]
announce["Announce release"]
working --> validate --> strip --> version --> upload --> announce
style working fill:#fff3e0,stroke:#ef6c00
style validate fill:#fff9c4,stroke:#f9a825
style strip fill:#fff9c4,stroke:#f9a825
style version fill:#fff9c4,stroke:#f9a825
style upload fill:#e8f4e8,stroke:#2e7d32
style announce fill:#e8f4e8,stroke:#2e7d32
```
**Frozen release structure**:
```
gs://calcofi-db/ducklake/releases/
├── v2026.02/
│ ├── catalog.json # DuckLake catalog
│ ├── parquet/
│ │ ├── bottle.parquet
│ │ ├── cast.parquet
│ │ ├── larvae.parquet
│ │ └── ...
│ └── RELEASE_NOTES.md
├── v2026.03/
│ └── ...
└── latest -> v2026.03        # Pointer object to latest release (GCS has no true symlinks)
```
### Accessing Frozen Releases
**From R (using calcofi4r)**:
```r
library(calcofi4r)
# Get latest release (default)
con <- cc_get_db()
# Get specific version
con <- cc_get_db(version = "v2026.02")
# List available versions
cc_list_versions()
#> [1] "v2026.02" "v2026.03" "v2026.04"
```
**Direct Parquet access (no R required)**:
```python
import duckdb
# Query directly from GCS (no download needed)
con = duckdb.connect()
rows = con.execute("""
SELECT * FROM read_parquet(
'https://storage.googleapis.com/calcofi-db/ducklake/releases/latest/parquet/larvae.parquet'
)
LIMIT 10
""").fetchall()
```
**From any HTTP client**:
```bash
# Download parquet file directly
curl -O https://storage.googleapis.com/calcofi-db/ducklake/releases/v2026.02/parquet/larvae.parquet
```
### Release Schedule
| Frequency | Trigger | Example |
|-----------|---------|---------|
| **Monthly** | Regular data updates | v2026.02, v2026.03 |
| **Ad-hoc** | Major dataset additions | v2026.02.1 (patch) |
| **Annual** | Year-end comprehensive release | v2026.12 |
**Release checklist**:
- [ ] All ingest workflows pass
- [ ] Data quality checks pass (no nulls in required fields, valid ranges)
- [ ] Foreign key relationships valid
- [ ] Row counts within expected range
- [ ] Release notes document changes
- [ ] Announce to users
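The first two checklist items can be mechanized; a minimal sketch of the kind of checks `validate_for_release()` might run (field names and ranges are illustrative):

```python
def check_release(rows,
                  required=("cruise_key", "latitude"),
                  ranges={"latitude": (-90, 90)}):
    """Minimal pre-freeze quality checks: required fields non-null, values in range."""
    problems = []
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) is None:
                problems.append(f"row {i}: null in required field '{col}'")
        for col, (lo, hi) in ranges.items():
            v = row.get(col)
            if v is not None and not (lo <= v <= hi):
                problems.append(f"row {i}: {col}={v} outside [{lo}, {hi}]")
    return problems


rows = [{"cruise_key": "2301CC", "latitude": 32.9},
        {"cruise_key": None, "latitude": 95.0}]
check_release(rows)
# → two problems flagged for row 1 (null cruise_key, latitude out of range)
```

A release would proceed only when the returned problem list is empty.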
---
## R Package Strategy
### Package Roles
#### `calcofi4db` - Database Administration Tools
**Target users**: Data managers, workflow developers
```
calcofi4db/R/
├── cloud.R # GCS operations (@concept cloud)
├── read.R # CSV/Parquet reading (@concept read)
├── transform.R # Data transformation (@concept transform)
├── ingest.R # Database loading (@concept ingest)
├── workflow.R # Workflow outputs to GCS (@concept workflow)
├── parquet.R # Parquet operations (@concept parquet)
├── duckdb.R # DuckDB operations (@concept duckdb)
├── ducklake.R # DuckLake operations (@concept ducklake)
├── archive.R # GCS archive sync (@concept archive)
├── version.R # Schema versioning (@concept version)
├── check.R # Data validation (@concept check)
├── validate.R # FK/lookup validation functions (@concept validate)
├── wrangle.R # Local DuckDB wrangling functions (@concept wrangle)
├── ship.R # Ship fuzzy matching functions (@concept ship) [NEW]
├── taxonomy.R # Taxonomy standardization functions (@concept taxonomy) [NEW]
├── spatial.R # Spatial operations (@concept spatial)
├── freeze.R # Frozen release operations (@concept freeze)
├── display.R # Display helpers (@concept display)
└── viz.R # Diagram generation (@concept viz)
```
#### `calcofi4r` - User-Facing Tools
**Target users**: Researchers, analysts, app developers
```
calcofi4r/R/
├── read.R # Data retrieval (@concept read)
├── database.R # Database connection (@concept database)
├── analyze.R # Statistical analysis (@concept analyze)
├── visualize.R # Plotting and mapping (@concept visualize)
└── functions.R # Helper functions (@concept utils)
```
### Function Reuse Across Contexts
| Context | Package | Example Usage |
|---------|---------|---------------|
| Ingestion workflows | `calcofi4db` | `read_csv_files()`, `ingest_csv_to_db()` |
| Publishing workflows | `calcofi4db` | `get_db_con()`, `create_db_manifest()` |
| Shiny apps | `calcofi4r` | `cc_get_db()`, `cc_plot_timeseries()` |
| Analysis scripts | `calcofi4r` | `cc_read_bottle()`, `cc_analyze_trend()` |
| API endpoints | `calcofi4r` | `cc_query_db()`, `cc_get_variables()` |
### New Functions to Implement
#### In `calcofi4db`:
```r
# R/cloud.R - GCS operations
get_gcs_file(bucket, path, local_path = NULL)
put_gcs_file(local_path, bucket, path)
sync_gdrive_to_gcs(gdrive_path, gcs_bucket, backup = TRUE)
list_gcs_versions(bucket, path)
# R/parquet.R - Parquet operations
csv_to_parquet(csv_path, schema_def = NULL)
read_parquet_table(path, con = NULL)
write_parquet_table(data, path, partitions = NULL)
add_parquet_metadata(path, metadata_list)
# R/duckdb.R - DuckDB operations
get_duckdb_con(path = ":memory:", read_only = FALSE)
create_duckdb_views(con, manifest)
attach_ducklake(con, catalog_path)
set_duckdb_comments(con, table, comments)
# R/workflow.R - Workflow output functions [NEW]
write_ingest_outputs(data_info, provider, dataset, gcs_bucket)
# 1. Transform data via transform_data()
# 2. Write parquet files to gs://calcofi-db/ingest/{provider}_{dataset}/
# 3. Write manifest.json with provenance info
read_ingest_manifest(provider, dataset, gcs_bucket)
read_ingest_parquet(provider, dataset, table, gcs_bucket)
list_ingest_outputs(gcs_bucket)
integrate_to_working_ducklake(ingests, gcs_bucket, ducklake_path)
# Combines ingest outputs into ducklake/working/
# R/ducklake.R - DuckLake catalog operations
get_working_ducklake() # connect to internal Working DuckLake
ingest_dataset(con, d, mode) # ✓ IMPLEMENTED - high-level dataset ingestion
# - calls transform_data() internally
# - handles uuid column detection
# - tracks provenance via gcs_path
ingest_to_working(data, table, source_file, source_uuid_col = NULL)
add_provenance_columns(data, source_file, source_row_start = 1)
save_working_ducklake(con) # upload Working DuckLake to GCS
list_working_tables(con) # list tables with provenance stats
# R/freeze.R - Frozen release operations
freeze_release(version, release_notes = NULL)
validate_for_release(con) # run all quality checks before freeze
list_frozen_releases() # list available frozen versions
compare_releases(v1, v2) # diff two releases
# R/ship.R - Ship code reconciliation [NEW]
match_ships(unmatched_ships, reference_ships, ship_renames_csv, fetch_ices)
# Multi-source fuzzy matching: manual overrides → exact nodc → exact name
# → CalCOFI UNOLS → NODC WOD → ICES → word-overlap fuzzy
fetch_ship_ices(ices_api, ices_ship_code_type)
# Fetch ship codes from ICES Reference Codes API
# R/taxonomy.R - Taxonomy standardization [NEW]
standardize_species(con, species_tbl, include_gbif, batch_size)
# Query WoRMS → ITIS → GBIF for each species
# Updates species table with worms_id, itis_id, gbif_id
build_taxon_table(con, species_tbl, taxon_tbl, include_itis, batch_size)
# Build taxon hierarchy table from WoRMS/ITIS classification
# Also creates taxa_rank lookup table with rank ordering
```
#### In `calcofi4r`:
```r
# R/database.R - Frozen DuckLake access (user-facing)
cc_get_db(version = "latest", local_cache = TRUE)
# Connect to frozen release (never Working DuckLake)
cc_list_versions() # list available frozen releases
cc_db_info(version = "latest") # release date, row counts, notes
cc_release_notes(version) # view release notes for a version
# R/database.R - Derived views [NEW]
cc_make_view(con, template, view_name, base_table, column_definitions)
# Create derived VIEW from template or custom SQL expressions
# Templates: "casts_extra" (year, month, quarter, julian_day, lat/lon parts, cruise, db_sta_key)
cc_list_view_templates() # list available prebaked templates
# R/read.R - Convenience functions for common tables
cc_read_bottle(version = "latest", ...)
cc_read_larvae(version = "latest", ...)
cc_read_cast(version = "latest", ...)
# R/query.R - Query helpers
cc_query(sql, version = "latest")
cc_list_tables(version = "latest")
cc_describe_table(table, version = "latest")
```
---
## Workflow Dependency Management with `targets`
### Pipeline Definition
Create `workflows/_targets.R`:
```r
library(targets)
library(tarchetypes)
tar_option_set(
packages = c("calcofi4db", "duckdb", "dplyr", "arrow"),
format = "qs" # Fast serialization
)
# Config
DIR_DATA <- "~/My Drive/projects/calcofi/data-public"
DIR_METADATA <- "metadata" # workflows/metadata/
GCS_BUCKET <- "calcofi-db"
# Define pipeline
list(
# ─── Ingest SWFSC NOAA CalCOFI Database ─────────────────────────
tar_target(
ingest_swfsc,
{
d <- calcofi4db::read_csv_files(
provider = "swfsc.noaa.gov",
dataset = "calcofi-db",
dir_data = DIR_DATA,
metadata_dir = DIR_METADATA)
calcofi4db::write_ingest_outputs(
data_info = d,
provider = "swfsc.noaa.gov",
dataset = "calcofi-db",
gcs_bucket = GCS_BUCKET)
}),
# ─── Ingest CalCOFI.org Bottle Database ─────────────────────────
tar_target(
ingest_bottle,
{
d <- calcofi4db::read_csv_files(
provider = "calcofi.org",
dataset = "bottle-database",
dir_data = DIR_DATA,
metadata_dir = DIR_METADATA)
calcofi4db::write_ingest_outputs(
data_info = d,
provider = "calcofi.org",
dataset = "bottle-database",
gcs_bucket = GCS_BUCKET)
}),
# ─── Integrate Ingest Outputs into Working DuckLake ──────────────
tar_target(
working_ducklake,
calcofi4db::integrate_to_working_ducklake(
ingests = list(ingest_swfsc, ingest_bottle),
gcs_bucket = GCS_BUCKET,
ducklake_path = "ducklake/working")),
# ─── Publish Datasets ───────────────────────────────────────────
tar_target(
obis_archive,
create_obis_archive(working_ducklake, include_bottle = TRUE))
)
```
The key changes in this pipeline:
1. **`metadata_dir` parameter**: Points to `workflows/metadata/` instead of the legacy `calcofi4db/inst/ingest/` path
2. **`write_ingest_outputs()`**: Writes parquet files + manifest.json to `gs://calcofi-db/ingest/{provider}_{dataset}/`
3. **`integrate_to_working_ducklake()`**: Combines ingest parquet files into the Working DuckLake at `gs://calcofi-db/ducklake/working/`
**Note:** There is no separate `database/` folder; the Working DuckLake *is* the integrated database. The `releases/` folder will contain frozen snapshots with provenance columns stripped.
### Workflow Lifecycle
```{mermaid}
%%| label: fig-lifecycle
%%| fig-cap: "Development and Update Cycles"
flowchart TB
subgraph dev["Development Cycle"]
direction LR
d1["1\. Explore"]
d2["2\. Ingest"]
d3["3\. Targets"]
d4["4\. Run"]
d5["5\. Publish"]
d1 --> d2 --> d3 --> d4 --> d5
end
subgraph update["Update Cycle"]
direction LR
u1["1\. New data"]
u2["2\. Sync"]
u3["3\. Run"]
u4["4\. Republish"]
u1 --> u2 --> u3 --> u4
end
dev -.->|"formalized"| update
style dev fill:#fff3e0,stroke:#ef6c00
style update fill:#e8f4e8,stroke:#2e7d32
```
### Dependency Graph
```{mermaid}
%%| label: fig-targets
%%| fig-cap: "targets Pipeline Dependency Graph"
flowchart LR
subgraph archive["📁 GCS Archive"]
archive_swfsc["swfsc.noaa.gov/<br/>calcofi-db"]
archive_bottle["calcofi.org/<br/>bottle-database"]
end
subgraph ingest["⚙️ Ingest Outputs"]
ingest_swfsc["ingest_swfsc<br/>(parquet + manifest)"]
ingest_bottle["ingest_bottle<br/>(parquet + manifest)"]
end
subgraph database["🗃️ Working DuckLake"]
integrate["integrate_to_working_ducklake()"]
ducklake["ducklake/working/"]
end
subgraph publish["🌐 Publish"]
obis_archive
end
archive_swfsc --> ingest_swfsc
archive_bottle --> ingest_bottle
ingest_swfsc --> integrate
ingest_bottle --> integrate
integrate --> ducklake
ducklake --> obis_archive
style archive fill:#e3f2fd,stroke:#1565c0
style ingest fill:#fff3e0,stroke:#ef6c00
style database fill:#f3e5f5,stroke:#7b1fa2
style publish fill:#fce4ec,stroke:#c2185b
```
The key insight is the separation between **ingest outputs** (per-workflow parquet + manifest) and the **integrated database** (the Working DuckLake). Each ingest workflow:
1. Reads from GCS archive (immutable source files)
2. Applies transformations via redefinition metadata
3. Writes parquet files to `gs://calcofi-db/ingest/{provider}_{dataset}/`
4. Writes `manifest.json` with provenance (archive timestamp, table stats)
The integration step then combines all ingest outputs into the unified Working DuckLake.
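What the per-ingest `manifest.json` might hold can be sketched as follows (field names are illustrative; the actual schema is defined by `write_ingest_outputs()`):

```python
import json
from datetime import datetime, timezone


def build_manifest(provider, dataset, tables, archive_timestamp):
    """Assemble a per-ingest manifest: provenance plus per-table row counts."""
    return {
        "provider": provider,
        "dataset": dataset,
        "archive_timestamp": archive_timestamp,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "tables": {name: {"rows": n} for name, n in tables.items()},
    }


m = build_manifest(
    "swfsc.noaa.gov", "calcofi-db",
    tables={"larvae": 1523004, "net": 54210},
    archive_timestamp="2026-02-02_121557")
manifest_json = json.dumps(m, indent=2)  # what would be written alongside the parquet
```

Keeping the archive timestamp in the manifest lets `integrate_to_working_ducklake()` detect stale ingest outputs before merging.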
---
## Documentation Updates
### Recommended Changes to `docs/`
#### `docs/db.qmd` - Database Documentation
**Current**: PostgreSQL-focused with naming conventions and ingestion strategy
**Proposed changes**:
- Add section on DuckDB as primary database
- Document Parquet file organization
- Explain DuckLake versioning and time travel
- Keep naming conventions (snake_case, *_id, *_uuid suffixes)
- Update "Integrated database ingestion strategy" for targets pipeline
#### `docs/api.qmd` - API Documentation
**Current**: REST API endpoints for PostgreSQL queries
**Proposed changes**:
- Note that the REST API may be deprecated for most use cases
- Document direct DuckDB access via `calcofi4r::cc_get_db()`
- Keep the API for web apps that cannot use DuckDB directly
- Consider a lightweight API backed by DuckDB if needed
#### `docs/portals.qmd` - Portal Documentation
**Current**: Good overview of EDI, NCEI, OBIS, ERDDAP
**Proposed changes**:
- Update data flow diagram to show DuckDB → Portals
- Add section on automated publishing via targets pipeline
- Document versioning strategy for portal submissions
### Where Documentation Should Live
| Content | Location | Rationale |
|---------|----------|-----------|
| Database schema | `docs/db.qmd` | User-facing documentation |
| Workflow development | `workflows/README.md` | Developer guide |
| Package functions | `calcofi4r/`, `calcofi4db/` | pkgdown reference |
| Data dictionary | `calcofi4db/inst/` | Versioned with code |
| Architecture decisions | `docs/` or `workflows/_docs/` | Long-term reference |
---
## Implementation Priority
### Phase 1: Foundation ✅ COMPLETE (2026-02-02)
**Goal**: Establish cloud infrastructure and helper functions
1. **Set up GCS buckets** ✅
- Created `gs://calcofi-files-public/` with public read access
- Created `gs://calcofi-files-private/` with restricted access
- Configured rclone access via OAuth
2. **Add cloud functions to `calcofi4db`** ✅
- Created `R/cloud.R` with `get_gcs_file()`, `put_gcs_file()`
- Created `R/parquet.R` with `csv_to_parquet()`
- Created `R/duckdb.R` with `get_duckdb_con()`, `create_duckdb_views()`
3. **Initial sync** ✅
- Synced Google Drive `data-public/` to `gs://calcofi-files-public/_sync/`
- Created initial archive snapshot: `archive/2026-02-02_121557/`
- Generated manifest: `manifests/manifest_2026-02-02_121557.json`
- Total: 112,457 files (33.5 GiB)
### Phase 2: Pipeline ✅ COMPLETE (2026-02-05)
**Goal**: Implement targets pipeline with new architecture
1. **Migrate metadata to `workflows/metadata/`** ✅
- Created `workflows/metadata/{provider}/{dataset}/` directories
- Copied redefinition files from `calcofi4db/inst/ingest/`
- Updated workflows to use `metadata_dir` parameter
2. **Create `calcofi4db/R/workflow.R`** ✅
- `write_ingest_outputs()`: Write parquet + manifest to GCS
- `write_parquet_outputs()`: Export tables to parquet with manifest
- `read_ingest_manifest()`: Read manifest from GCS
3. **Update `workflows/_targets.R`** ✅
- Extended with `ingest_swfsc_manifest` + `ingest_bottle_manifest` file targets
- Added `merge_manifest` target that renders `merge_ichthyo_bottle.qmd`
- Pipeline: CSV→parquet→DuckDB + manifest-based merge
4. **Cross-dataset integration (new)** ✅
- **Ship matching**: `calcofi4db/R/ship.R` with `match_ships()`, `fetch_ship_ices()`
- **Cruise bridge**: Derive `cruise_key` (YYMMKK) from bottle casts via ship_key + datetime
- **Taxonomy standardization**: `calcofi4db/R/taxonomy.R` with `standardize_species()`, `build_taxon_table()`
- **Grid canonical source**: grid.parquet only from swfsc (removed from bottle export)
5. **Update publish workflow** (pending)
- Modify `publish_larvae_to_obis.qmd` to read from DuckDB
- Add bottle data to EMoF extension
### Phase 3: Working DuckLake ✅ COMPLETE (2026-02-05)
**Goal**: Implement Working DuckLake with provenance tracking
1. **Working DuckLake functions implemented** ✅
- `get_working_ducklake()`: Connect to Working DuckLake (GCS-backed)
- `ingest_to_working()`: Ingest with provenance columns
- `add_provenance_columns()`: Helper for `_source_file`, `_source_row`, `_source_uuid`, `_ingested_at`
- `save_working_ducklake()`: Upload to GCS
- `list_working_tables()`: Table summary with provenance stats
2. **Merge workflow loads all tables** ✅
- `merge_ichthyo_bottle.qmd` loads 15+ tables from upstream parquet
- Modified tables (casts, species, taxon, taxa_rank) exported separately
- Manifest references GCS URIs for unchanged upstream tables
### Phase 4: Frozen DuckLake ✅ COMPLETE (2026-02-05)
**Goal**: Implement [Frozen DuckLake](https://ducklake.select/2025/10/24/frozen-ducklake/) for public releases
1. **Freeze functions implemented** ✅
- `validate_for_release()`: Data quality checks
- `freeze_release()` / `write_parquet_outputs(strip_provenance=TRUE)`: Create immutable release
- Strip provenance columns for clean public schema
- Generate release notes with cross-dataset integration summary
2. **Release structure** ✅
- `gs://calcofi-db/ducklake/releases/{version}/`
- Parquet files + manifest + RELEASE_NOTES.md
- `latest` pointer to current release
3. **`calcofi4r` frozen access** ✅
- `cc_get_db(version = "latest")`: Default to frozen releases
- `cc_list_versions()`: List available releases
- `cc_release_notes()`: View release notes
- `cc_make_view(con, "casts_extra")`: Create derived VIEWs with prebaked templates
- `cc_list_view_templates()`: List available templates
4. **First public release**: v2026.02 (pending GCS upload)
### Phase 5: Documentation & Polish
**Goal**: Complete documentation and evaluate API
1. **Update docs/**
- Revise `db.qmd` for DuckDB/DuckLake
- Document Working vs Frozen databases
- Evaluate API necessity
- Update portal documentation
2. **Clean up workflows/**
- Consolidate README.md with essentials
- Archive deprecated exploration notebooks
- Document workflow development process
- Create "Getting Started" guide for data managers
---
## Verification Plan
### End-to-End Test
```r
# 1. Sync test data from Google Drive to GCS
system("rclone sync gdrive:calcofi/data-public gs://calcofi-files-public/_sync")
# 2. Create archive snapshot
timestamp <- format(Sys.time(), "%Y-%m-%d_%H%M%S")
system(glue::glue("rclone copy gs://calcofi-files-public/_sync gs://calcofi-files-public/archive/{timestamp}"))
# 3. Run targets pipeline
targets::tar_make()
# 4. Verify database
con <- calcofi4db::get_duckdb_con("calcofi.duckdb")
DBI::dbListTables(con)
DBI::dbGetQuery(con, "SELECT COUNT(*) FROM larvae")
# 5. Verify OBIS archive
obistools::check_fields(readr::read_csv("data/darwincore/larvae/occurrence.csv"))
```
### Integration Tests
- [ ] GCS upload/download roundtrip
- [ ] Parquet schema preservation
- [ ] DuckDB query performance vs PostgreSQL
- [ ] OBIS archive validation
- [ ] Time travel query accuracy
---
## Migration Strategy: PostgreSQL → DuckDB
### Current PostgreSQL Dependencies
Based on `server/docker-compose.yml`, the following services depend on PostgreSQL:
| Service | Purpose | Migration Path |
|---------|---------|----------------|
| `postgis` | Spatial database | Keep for vector tiles (short-term) |
| `pg_tileserv` | Vector tile server | Replace with PMTiles |
| `pg_rest` | REST API (PostgREST) | Keep for web apps, add DuckDB option |
| `plumber` | R API | Update to use DuckDB |
| `pgadmin` | Database admin | Deprecate once migrated |
### Vector Tiles: pg_tileserv → PMTiles
The current architecture uses `pg_tileserv` to serve vector tiles from PostGIS for Shiny mapping apps. The migration path:
```{mermaid}
%%| label: fig-tiles-migration
%%| fig-cap: "Vector Tiles Migration Path"
flowchart LR
subgraph current["Current Flow (PostgreSQL)"]
direction LR
pg["PostGIS"]
ts["pg_tileserv"]
lf["Leaflet"]
pg --> ts --> lf
end
subgraph target["Target Flow (Cloud-native)"]
direction LR
geo["GeoParquet"]
tc["tippecanoe"]
pm["PMTiles"]
mg["mapgl"]
geo --> tc --> pm --> mg
end
current -.->|"migrate"| target
style current fill:#ffebee,stroke:#c62828
style target fill:#e8f4e8,stroke:#2e7d32
```
#### Implementation Steps
1. **Generate PMTiles from PostGIS**
```bash
# Export geometries from PostGIS
ogr2ogr -f GeoJSON stations.geojson \
PG:"host=postgis user=admin dbname=gis" \
-sql "SELECT * FROM stations"
# Convert to PMTiles using tippecanoe
tippecanoe -o stations.pmtiles \
--minimum-zoom=0 --maximum-zoom=14 \
--layer=stations \
stations.geojson
# Upload to GCS
gsutil cp stations.pmtiles gs://calcofi-db/tiles/
```
2. **Update Shiny Apps**
```r
# Old: pg_tileserv
leaflet() |>
addTiles() |>
addMapboxGL(
style = list(
sources = list(
stations = list(
type = "vector",
tiles = list("https://tile.calcofi.io/public.stations/{z}/{x}/{y}.pbf")
)
)
)
)
# New: PMTiles with mapgl
library(mapgl)
mapboxgl() |>
add_pmtiles_source(
id = "stations",
url = "https://storage.googleapis.com/calcofi-db/tiles/stations.pmtiles"
) |>
add_layer(
id = "stations-layer",
type = "circle",
source = "stations",
source_layer = "stations"
)
```
3. **Add PMTiles generation to targets pipeline**
```r
# In _targets.R
tar_target(
pmtiles_stations,
create_pmtiles(
geom_source = duckdb_file,
query = "SELECT * FROM stations",
output = "tiles/stations.pmtiles"
)
)
```
### Migration Roadmap
| Phase | Timeline | Actions |
|-------|----------|---------|
| **Phase 1** | Now | Add DuckDB as primary data access; keep PostgreSQL |
| **Phase 2** | +2 months | Generate PMTiles for all spatial layers |
| **Phase 3** | +4 months | Update Shiny apps to use mapgl + PMTiles |
| **Phase 4** | +6 months | Deprecate pg_tileserv, reduce PostgreSQL to backup |
| **Phase 5** | +12 months | Fully deprecate PostgreSQL (if no dependencies remain) |
---
## Resolved Questions
Based on clarifications received:
1. **GCS Project**: Use existing project `ucsd-sio-calcofi`
- Existing bucket `calcofi-db` → DuckLake/database files
- New bucket `calcofi-files-public` → Versioned public source files (public read access)
- New bucket `calcofi-files-private` → Versioned private/sensitive files (restricted access)
2. **Google Drive Structure** (implemented 2026-02-02):
- `calcofi/data-public/` → Public datasets (bottle, larvae, etc.)
- `calcofi/data-private/` → Private/sensitive data (eDNA, zooplankton samples)
3. **GCS Folder Structure** (implemented 2026-02-02):
- `_sync/` → Working directory (rclone syncs here, mutable)
- `archive/{timestamp}/` → Immutable snapshots (complete copies, not deltas)
- `manifests/` → JSON metadata files with `manifest_latest.json` pointer
4. **API Strategy**: Keep PostgreSQL + API for now with migration roadmap
- pg_tileserv needed for current Shiny apps
- Migrate to PMTiles over time using tippecanoe + mapgl
5. **Priority Data**: Larvae + Bottle datasets for proof-of-concept
6. **Bottle in OBIS**: Include as EMoF extension in larvae dataset
## Remaining Questions
1. **DuckLake vs MotherDuck**: Self-hosted DuckLake catalog or use MotherDuck service?
2. **Automation**: GitHub Actions vs server cron for scheduled rclone syncs?
3. **PMTiles hosting**: Serve from GCS directly or via CDN?
---
## Appendix: Key References
### Documentation
- [Versioned Data Lake Strategy](workflows/_docs/2026-01-02%20Versioned%20Data%20Lake%20Strategy.md)
- [Data Management Action Plan](workflows/_docs/2026-01-27%20updated%20working%20draft_Data_Management_Action_Plan_CalCOFI.md)
- [Data Pipeline Timeline](workflows/_docs/2026-01-31%20Data%20Pipeline%20Timeline%20.md)
### Technologies
**Database & Analytics**:
- [DuckDB](https://duckdb.org/) - Fast in-process analytical database
- [DuckLake](https://ducklake.select/) - Lakehouse catalog for DuckDB
- [Frozen DuckLake](https://ducklake.select/2025/10/24/frozen-ducklake/) - Immutable versioned releases
- [Apache Parquet](https://parquet.apache.org/) - Columnar storage format
**Workflow & Orchestration**:
- [targets R package](https://books.ropensci.org/targets/) - Pipeline dependency management
**Cloud Storage**:
- [rclone](https://rclone.org/) - Cloud storage sync tool
- [googleCloudStorageR](https://code.markedmondson.me/googleCloudStorageR/) - R interface to GCS
- [Data lakes and big data analytics - Google Cloud Storage](https://cloud.google.com/storage#data-lakes-and-big-data-analytics)
**Vector Tiles**:
- [PMTiles](https://protomaps.com/docs/pmtiles) - Cloud-native vector tile format
- [tippecanoe](https://github.com/felt/tippecanoe) - Build vector tilesets
- [mapgl R package](https://walker-data.com/mapgl/) - Modern WebGL maps in R/Shiny
### CalCOFI Resources
- [calcofi.io/docs](https://calcofi.io/docs) - CalCOFI documentation site
- [calcofi.io/calcofi4r](https://calcofi.io/calcofi4r) - User-facing R package
- [calcofi.io/calcofi4db](https://calcofi.io/calcofi4db) - Database admin R package