CalCOFI Data Workflow Plan

Published: 2026-02-03

This document outlines a comprehensive strategy for managing and versioning CalCOFI data files, building an integrated database, and publishing datasets to external portals.

1 Overview for Data Managers & Scientists

This section provides a high-level view of the CalCOFI data system. You don’t need to understand all the technical details—just the four main components and how they connect.

1.1 The Big Picture

flowchart TB
    subgraph drive["📁 THE DRIVE"]
        direction LR
        gd_pub["data-public/<br/>Bottle, Larvae, CTD..."]
        gd_priv["data-private/<br/>eDNA, Zooplankton..."]
    end

    subgraph archive["🗄️ THE ARCHIVE"]
        direction LR
        sync["_sync/<br/>(working copy)"]
        hist["archive/<br/>(snapshots)"]
    end

    subgraph workflows["⚙️ THE WORKFLOWS"]
        direction LR
        ingest["Ingest<br/>(CSV → DB)"]
        publish["Publish<br/>(DB → Portals)"]
    end

    subgraph database["🗃️ THE DATABASE"]
        direction LR
        working["Working<br/>(internal)"]
        frozen["Frozen<br/>(public releases)"]
    end

    subgraph outputs["🌐 OUTPUTS"]
        direction LR
        obis["OBIS"]
        apps["Shiny Apps"]
        api["API / R pkg"]
    end

    drive -->|"daily sync"| archive
    archive -->|"read files"| workflows
    workflows -->|"transform"| database
    database -->|"publish"| outputs

    style drive fill:#e8f4e8,stroke:#2e7d32
    style archive fill:#e3f2fd,stroke:#1565c0
    style workflows fill:#fff3e0,stroke:#ef6c00
    style database fill:#f3e5f5,stroke:#7b1fa2
    style outputs fill:#fce4ec,stroke:#c2185b
Figure 1: CalCOFI Data System Overview

1.2 Four Key Components

📁 The Drive

What: The shared Google Drive folder where you upload your data files.

Location: My Drive/projects/calcofi/data-public/ (or data-private/)

Who uses it: Dataset managers (Rasmus, Ed, Nastassia, Linsey, etc.)

How to use it:

  1. Create a folder for your data: {provider}/{dataset}/ (e.g., swfsc.noaa.gov/calcofi-db/)
  2. Inside, organize files into raw/ (original data) and derived/ (processed data)
  3. Simply upload or overwrite files—versioning happens automatically
  4. No special naming required (no date prefixes needed)

Example structure:

data-public/
├── calcofi.org/
│   └── bottle-database/
│       ├── raw/
│       │   └── bottle_export_2026-01-15.csv
│       └── derived/
│           └── bottle_cleaned.csv
├── swfsc.noaa.gov/
│   └── calcofi-db/
│       ├── raw/
│       │   ├── larva.csv
│       │   └── cruise.csv
│       └── derived/

🗄️ The Archive

What: A versioned backup of all Drive files in the cloud¹.

Location: Google Cloud Storage buckets (gs://calcofi-files-public/ and gs://calcofi-files-private/)

Who uses it: Automated systems (you don’t need to interact with it directly)

What it does:

  • Daily sync: Copies all files from the Drive to the Archive
  • Snapshots: Creates timestamped copies so we can always go back in time
  • Never loses data: Even if you delete a file from Drive, it’s preserved in the Archive

Why it matters: Scientists can reproduce analyses using the exact data from any past date.

⚙️ The Workflows

What: Automated scripts that process data from the Archive into the Database.

Location: /workflows/ folder (Quarto notebooks and R scripts)

Who uses it: Data engineers and developers

Two types of workflows:

  1. Ingest workflows (ingest_*.qmd): Read CSV files → Clean/transform → Load into Database
  2. Publish workflows (publish_*.qmd): Read from Database → Format for portals → Upload to OBIS, EDI, etc.

Key feature: Dependencies are managed automatically—if source data changes, all downstream steps re-run.

🗃️ The Database

What: The integrated CalCOFI database combining all datasets into related tables.

Format: DuckDB² with Parquet files (modern, fast, serverless)

Two versions:

| Version | Purpose | Access | Features |
|---------|---------|--------|----------|
| Working Database | Internal processing | Workflows only | Full provenance tracking, UUIDs, source file references, time travel |
| Frozen Database | Public releases | Apps, API, R package | Clean tables, versioned releases (v2026.02, v2026.03…), optimized for queries |

Why two versions? The Working Database tracks every detail for reproducibility. The Frozen Database provides clean, stable snapshots for users who just want the data.

1.3 How Data Flows

flowchart TB
    subgraph you["👤 YOU"]
        upload["Upload CSV to Drive"]
    end

    subgraph auto["🤖 AUTOMATED"]
        sync["Sync to Archive"]
        snapshot["Create snapshot"]
        transform["Run workflows"]
    end

    subgraph result["📊 RESULT"]
        db["Updated Database"]
        portals["Published to Portals"]
    end

    upload --> sync --> snapshot --> transform --> db --> portals

    style you fill:#e8f4e8,stroke:#2e7d32
    style auto fill:#fff3e0,stroke:#ef6c00
    style result fill:#f3e5f5,stroke:#7b1fa2
Figure 2: Simplified Data Flow

Your role as a dataset manager: Upload files to the Drive. Everything else happens automatically.


2 Executive Summary

This plan modernizes the CalCOFI data workflow architecture to:

  1. Version raw data files in Google Cloud Storage (GCS)
  2. Use DuckDB + Parquet as the primary integrated database format
  3. Leverage targets for workflow dependency management
  4. Use R packages (calcofi4r, calcofi4db) for reusable functions
  5. Maintain reproducible pipelines from raw data to published datasets

3 Current State Analysis

3.1 Existing Infrastructure

| Component | Current State | Location |
|-----------|---------------|----------|
| Workflows | Individual Quarto notebooks | /workflows/ |
| Database | PostgreSQL with dev/prod schemas | api.calcofi.io |
| R Packages | calcofi4r (user), calcofi4db (admin) | GitHub |
| Data Sources | CSV files on Google Drive | ~/My Drive/projects/calcofi/data-public |
| Documentation | Quarto site | /docs/ |

3.2 Key Workflow Files

| Workflow | Purpose | Status |
|----------|---------|--------|
| ingest_calcofi.org_bottle-database.qmd | Ingest bottle/cast data | Partial |
| ingest_swfsc.noaa.gov_calcofi-db.qmd | Ingest NOAA ichthyoplankton data | Active |
| publish_larvae_to_obis.qmd | Publish larvae to OBIS | Active (needs bottle data) |

3.3 Identified Gaps

  1. No dependency management: Workflows run independently without declared dependencies
  2. No file versioning: Source files updated in place without version history
  3. PostgreSQL overhead: Requires server maintenance and API infrastructure
  4. Disconnected workflows: No systematic connection between ingest and publish steps
  5. Manual sequencing: Users must know to run bottle ingestion before larvae ingestion

4 File Versioning Strategy

4.1 The Two-World Problem

CalCOFI data originates from multiple sources managed by different individuals:

  • Rasmus: Bottle/cast data, CTD profiles
  • Ed: Ichthyoplankton (larvae, eggs) data
  • Nastassia: eDNA data
  • Linsey: Zooplankton samples

These dataset managers need a simple interface (Google Drive) to deposit and organize files, while the data engineering side needs version control, immutability, and reproducibility.

4.2 Google Drive: The Human Interface

This is “The Drive” from Section 1. Dataset managers upload files here; everything else is automatic.

4.2.1 Folder Structure

Recommended organization: {provider}/{dataset}/raw/ and {provider}/{dataset}/derived/

calcofi/data-public/                       # Shared Google Drive folder (public data)
├── calcofi.org/
│   └── bottle-database/
│       ├── raw/                           # Original exports from source system
│       │   └── bottle_export_2026-01.csv
│       └── derived/                       # Cleaned/processed versions
│           ├── bottle.csv
│           ├── cast.csv
│           └── field_descriptions.csv
├── swfsc.noaa.gov/
│   └── calcofi-db/
│       ├── raw/                           # Original NOAA database exports
│       │   └── calcofi_db_dump_2026-01.sql
│       └── derived/                       # Extracted CSV tables
│           ├── cruise.csv
│           ├── egg.csv
│           ├── larva.csv
│           ├── net.csv
│           ├── ship.csv
│           ├── site.csv
│           ├── species.csv
│           └── tow.csv
├── coastwatch.pfeg.noaa.gov/
│   └── erdCalCOFIlrvsiz/
│       └── derived/
│           └── larvae_size.csv
└── _archive/                              # Manual archive (optional)
    └── 2025-01-15_bottle.csv              # User-created backup

calcofi/data-private/                      # Private/sensitive data (restricted access)
├── edna/                                  # eDNA samples (Nastassia)
│   ├── raw/
│   └── derived/
├── zooplankton/                           # Zooplankton samples (Linsey)
│   ├── raw/
│   └── derived/
└── _archive/                              # Manual archive (optional)

4.2.2 Google Drive Naming Conventions

  • Top-level folders: {provider}/ - the organization providing the data (e.g., swfsc.noaa.gov)
  • Dataset folders: {dataset}/ - specific dataset name (e.g., calcofi-db)
  • Subfolders (recommended):
    • raw/ - Original data as received from source (database dumps, original CSVs)
    • derived/ - Processed/cleaned versions ready for ingestion
  • Files: Original source names, no date prefixes required
  • Updates: Simply overwrite existing files
  • No versioning in Drive: Google Drive keeps only limited, short-lived revisions; durable version history lives in the Archive

4.2.3 Dataset Manager Workflow

  1. Create folder structure: {provider}/{dataset}/raw/ and {provider}/{dataset}/derived/
  2. Upload original data to raw/ (preserve original format)
  3. Upload processed/cleaned data to derived/ (CSV format preferred)
  4. Simply overwrite files when updating—versioning happens automatically in the Archive
  5. Optionally notify data team of significant changes

See Section 1 for the big picture of how your data flows through the system.

4.3 Google Cloud Storage: The Versioned Data Lake

This is “The Archive” from Section 1³. It automatically captures versioned snapshots of all files from the Drive.

4.3.1 Bucket Structure

gs://calcofi-files-public/                 # Versioned public source files
├── _sync/                                 # Working directory (rclone syncs here)
│   ├── calcofi.org/
│   │   └── bottle-database/
│   │       ├── bottle.csv
│   │       └── cast.csv
│   └── swfsc.noaa.gov/
│       └── calcofi-db/
│           ├── cruise.csv
│           ├── egg.csv
│           └── ...
├── archive/                               # Timestamped immutable snapshots
│   ├── 2026-02-02_121557/                # YYYY-MM-DD_HHMMSS
│   │   └── calcofi.org/
│   │       └── bottle-database/
│   │           └── bottle.csv            # Snapshot at that time
│   ├── 2026-02-15_120000/
│   │   └── swfsc.noaa.gov/
│   │       └── calcofi-db/
│   │           └── larva.csv
│   └── ...
└── manifests/                             # Version metadata
    ├── manifest_2026-02-02_121557.json
    ├── manifest_2026-02-15_120000.json
    └── manifest_latest.json              # Points to latest archive

gs://calcofi-files-private/                # Versioned private source files
├── _sync/                                 # Working directory (rclone syncs here)
│   ├── edna/
│   └── zooplankton/
├── archive/                               # Timestamped immutable snapshots
└── manifests/

gs://calcofi-db/                           # Integrated database
├── parquet/                               # Transformed tables
│   ├── bottle.parquet
│   ├── cast.parquet
│   ├── larvae.parquet
│   └── ...
├── duckdb/                                # DuckDB database files
│   └── calcofi.duckdb
└── ducklake/                              # DuckLake catalog
    └── catalog.json

4.3.2 GCS Naming Conventions

  • _sync/: Working directory, exact mirror of Google Drive structure (rclone syncs here)
  • archive/{YYYY-MM-DD_HHMMSS}/: Immutable snapshot of complete data at that timestamp
  • manifests/manifest_{YYYY-MM-DD_HHMMSS}.json: Snapshot metadata
  • manifests/manifest_latest.json: Points to latest archive snapshot

4.4 Rclone Sync: Capturing Daily Changes

4.4.1 Installation

macOS:

brew install rclone

Linux (Ubuntu/Debian):

curl https://rclone.org/install.sh | sudo bash

4.4.2 Configuration

Run rclone config and create two remotes:

Remote 1: Google Drive (gdrive)

n) New remote
name> gdrive
Storage> drive
client_id>                        # leave blank
client_secret>                    # leave blank
scope> 1                          # Full access (or 2 for read-only)
service_account_file>             # leave blank for OAuth, or path to JSON key
Edit advanced config> n
Use auto config> y                # opens browser (use 'n' on headless server)
Configure this as a Shared Drive> n

Remote 2: Google Cloud Storage (gcs)

n) New remote
name> gcs
Storage> google cloud storage
client_id>                        # leave blank
client_secret>                    # leave blank
project_number> ucsd-sio-calcofi
service_account_file>             # leave blank for OAuth, or path to JSON key
Edit advanced config> n
Use auto config> y                # opens browser (use 'n' on headless server)

4.4.3 Headless Server Setup

For Linux servers without a browser, use rclone authorize on a local machine:

# on local machine with browser
rclone authorize "drive"
# copy the token output

# on server, during rclone config
Use auto config> n
# paste the token when prompted

Similarly for GCS:

# on local machine
rclone authorize "google cloud storage"

4.4.4 Service Account (Optional)

For automated/scheduled syncs, use a GCP service account instead of OAuth:

  1. Create service account in GCP Console → IAM → Service Accounts
  2. Grant roles: Storage Object Admin for calcofi-files-public, calcofi-files-private, and calcofi-db buckets
  3. Download JSON key file to secure location (e.g., /etc/rclone/calcofi-sa.json)
  4. In rclone config, provide path to JSON key instead of using OAuth

4.4.5 Verify Configuration

# test Google Drive access
rclone lsd gdrive:projects/calcofi/data-public      # OR
rclone lsd gdrive-ecoquants:projects/calcofi/data-public

# test GCS access
rclone ls gcs:calcofi-files-public --max-depth 1    # OR
rclone ls gcs-calcofi:calcofi-files-public --max-depth 1

# test sync (dry-run first!)
rclone sync gdrive:calcofi/data-public gcs:calcofi-files-public/_sync --dry-run -v
rclone sync gdrive-ecoquants:projects/calcofi/data-public gcs-calcofi:calcofi-files-public/_sync --dry-run -v

4.4.6 How It Works

#!/bin/bash
# sync_gdrive_to_gcs.sh - Run daily via cron
# Uses the rclone remotes configured above: gdrive (Google Drive) and gcs (Google Cloud Storage)

TIMESTAMP=$(date +%Y-%m-%d_%H%M%S)
LOG_FILE="/var/log/rclone/sync_${TIMESTAMP}.log"
mkdir -p /var/log/rclone

# --- PUBLIC DATA ---
# Step 1: Sync Google Drive to the _sync/ working directory
rclone sync gdrive:calcofi/data-public gcs:calcofi-files-public/_sync \
  --checksum \
  --drive-export-formats csv \
  --log-file "${LOG_FILE}" \
  --log-level INFO

# Step 2: Create immutable archive snapshot (server-side copy within the bucket)
rclone copy gcs:calcofi-files-public/_sync gcs:calcofi-files-public/archive/${TIMESTAMP}

# Step 3: Generate and upload manifest
rclone lsjson gcs:calcofi-files-public/archive/${TIMESTAMP} --recursive \
  > /tmp/manifest_${TIMESTAMP}.json
rclone copy /tmp/manifest_${TIMESTAMP}.json gcs:calcofi-files-public/manifests/
rclone copyto /tmp/manifest_${TIMESTAMP}.json gcs:calcofi-files-public/manifests/manifest_latest.json

# --- PRIVATE DATA ---
# Repeat for private bucket (separate log file)
rclone sync gdrive:calcofi/data-private gcs:calcofi-files-private/_sync \
  --checksum \
  --drive-export-formats csv \
  --log-file "${LOG_FILE%.log}_private.log" \
  --log-level INFO

rclone copy gcs:calcofi-files-private/_sync gcs:calcofi-files-private/archive/${TIMESTAMP}

4.4.7 Key Behaviors

| Scenario | Google Drive Action | Rclone Behavior | Result in GCS |
|----------|---------------------|-----------------|---------------|
| New file | User uploads new.csv | rclone sync copies to _sync/ | _sync/new.csv created; next archive includes it |
| Updated file | User overwrites bottle.csv | rclone sync updates _sync/ | _sync/bottle.csv updated; archive snapshots preserve history |
| Deleted file | User deletes old.csv | rclone sync removes from _sync/ | Removed from _sync/; preserved in prior archives |
| Renamed file | User renames a.csv → b.csv | Treated as delete + create | a.csv gone from _sync/; b.csv created |
| No changes | Nothing | --checksum skips identical files | No files transferred |

4.4.8 The Two-Step Archive Strategy

The archiving approach uses two steps for reliability:

  1. Sync to _sync/: rclone syncs Google Drive to the _sync/ working directory
  2. Copy to archive/{timestamp}/: Create an immutable snapshot of the current state

Benefits:

  • Immutable archives: Each archive folder is a complete snapshot, not just deltas
  • Reproducibility: Any archive folder contains everything needed to recreate that state
  • No data loss: If sync fails, previous archives remain intact
  • Public access: calcofi-files-public bucket has public read access for easy data sharing

4.5 Manifest Files

Each manifest captures the complete state of the archive at a point in time:

{
  "generated_at": "2026-02-02T12:15:57Z",
  "archive_timestamp": "2026-02-02_121557",
  "bucket": "gs://calcofi-files-public",
  "archive_path": "archive/2026-02-02_121557",
  "files": [
    {
      "Path": "calcofi.org/bottle-database/bottle.csv",
      "Size": 45234567,
      "ModTime": "2026-02-02T10:30:00Z",
      "MimeType": "text/csv",
      "gcs_url": "gs://calcofi-files-public/archive/2026-02-02_121557/calcofi.org/bottle-database/bottle.csv"
    },
    {
      "Path": "swfsc.noaa.gov/calcofi-db/larva.csv",
      "Size": 12345678,
      "ModTime": "2026-02-01T08:00:00Z",
      "MimeType": "text/csv",
      "gcs_url": "gs://calcofi-files-public/archive/2026-02-02_121557/swfsc.noaa.gov/calcofi-db/larva.csv"
    }
  ],
  "total_files": 112457,
  "total_size_bytes": 35987654321
}
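
Because the files bucket is public, the latest manifest can be read directly over HTTPS. A minimal R sketch that summarizes it by provider folder (field names follow the manifest example above):

library(jsonlite)
library(dplyr)

# Read the latest manifest straight from the public bucket
manifest <- fromJSON(
  "https://storage.googleapis.com/calcofi-files-public/manifests/manifest_latest.json")

# File counts and total bytes per provider folder
manifest$files |>
  as_tibble() |>
  group_by(provider = sub("/.*", "", Path)) |>
  summarise(n_files = n(), total_bytes = sum(Size))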

4.6 Reconstructing Historical State

To recreate the data lake as it existed on a specific date:

# R function to get the public HTTPS URL of a file in a specific archive snapshot
get_archived_file <- function(path, timestamp = "latest", bucket = "calcofi-files-public") {
  # Get the manifest for that timestamp (public bucket, so plain HTTPS works)
  manifest_name <- if (timestamp == "latest") {
    "manifest_latest.json"
  } else {
    glue::glue("manifest_{timestamp}.json")
  }
  manifest <- jsonlite::fromJSON(
    glue::glue("https://storage.googleapis.com/{bucket}/manifests/{manifest_name}"))

  # Construct the public HTTPS URL within that archive snapshot
  glue::glue(
    "https://storage.googleapis.com/{bucket}/archive/{manifest$archive_timestamp}/{path}")
}
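
For example, to pin an analysis to the bottle file exactly as captured in the 2026-02-02 snapshot (path and timestamp from the manifest example above):

# Read a file from a specific archive snapshot
url_bottle <- get_archived_file(
  "calcofi.org/bottle-database/bottle.csv",
  timestamp = "2026-02-02_121557")
bottle <- readr::read_csv(url_bottle)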

4.7 Automation Schedule

%%---
%%displayMode: compact
%%---

%%{init: {'theme': 'default', 'themeVariables': { 'textColor': '#333'}}}%%

gantt
    title Daily Sync Schedule (UTC)
    dateFormat HH:mm
    axisFormat %H:%M

    section Sync
    rclone           :a1, 00:00, 10m
    archive          :a2, after a1, 5m
    manifest         :a3, after a2, 5m

    section Pipeline
    targets          :b1, after a3, 10m
    parquet          :b2, after b1, 5m
    duckdb           :b3, after b2, 5m

    section Notify
    notify           :c1, after b3, 5m
Figure 3: Daily Sync Schedule (UTC)

The daily automation runs at midnight UTC and completes in under an hour:

  1. Sync phase: rclone copies new/changed files from Google Drive to the _sync/ folder in GCS, then creates an immutable archive/ snapshot and generates a manifest file documenting the state.

  2. Pipeline phase: The targets pipeline detects which source files changed and re-runs only the affected steps—transforming CSVs to Parquet format and updating the DuckDB database.

  3. Notify phase: Sends notifications (email/Slack) summarizing what changed and any errors encountered.

If no files changed in Google Drive, the pipeline completes quickly with minimal processing.


5 Proposed Architecture

5.1 Data Flow Diagram

flowchart TB
    subgraph drive["📁 THE DRIVE"]
        direction LR
        gdrive_pub["data-public/"]
        gdrive_priv["data-private/"]
    end

    subgraph archive["🗄️ THE ARCHIVE"]
        direction LR
        bucket_pub["calcofi-files-public"]
        bucket_priv["calcofi-files-private"]
    end

    subgraph database["🗃️ THE DATABASE"]
        direction LR
        working["Working DuckLake"]
        frozen["Frozen Releases"]
    end

    subgraph outputs["🌐 OUTPUTS"]
        direction LR
        obis["OBIS"]
        edi["EDI"]
        erddap["ERDDAP"]
        apps["Apps/API"]
    end

    drive -->|"rclone sync"| archive
    archive -->|"targets pipeline"| database
    database -->|"publish workflows"| outputs

    style drive fill:#e8f4e8,stroke:#2e7d32
    style archive fill:#e3f2fd,stroke:#1565c0
    style database fill:#f3e5f5,stroke:#7b1fa2
    style outputs fill:#fce4ec,stroke:#c2185b
Figure 4: CalCOFI Data Flow Architecture

5.2 Technology Stack

| Layer | Technology | Purpose | Replaces |
|-------|------------|---------|----------|
| Storage | gs://calcofi-files-public | Versioned public source CSV files | Google Drive only |
| Storage | gs://calcofi-files-private | Versioned private/sensitive files | Google Drive only |
| Storage | gs://calcofi-db | Parquet, DuckDB, PMTiles | - |
| Sync | rclone | Drive → GCS with archive snapshots | Manual copying |
| Format | Apache Parquet | Efficient columnar storage | CSV |
| Compute | DuckDB | Serverless SQL queries | PostgreSQL (gradual) |
| Database | Working DuckLake | Internal DB with provenance, time travel | PostgreSQL |
| Database | Frozen DuckLake | Public versioned releases (v2026.02, etc.) | PostgreSQL snapshots |
| Orchestration | targets | Dependency management, caching | Manual sequencing |
| Tiles | PMTiles + tippecanoe | Cloud-native vector tiles | pg_tileserv |
| Mapping | mapgl | Modern WebGL maps in Shiny | leaflet + pg_tileserv |
| Functions | calcofi4r, calcofi4db | Reusable R code | Ad-hoc scripts |

GCP Project: ucsd-sio-calcofi


6 Database Architecture: Working vs Frozen DuckLake

The CalCOFI integrated database uses DuckLake, a lakehouse catalog that provides version control for data tables. We maintain two distinct databases to serve different needs:

6.1 The Two-Database Strategy

flowchart TB
    subgraph archive["📁 Archive"]
        csv["CSV source files"]
    end

    subgraph working["🔧 WORKING DUCKLAKE"]
        direction LR
        w_tables["Provenance columns"]
        w_history["Time travel"]
        w_access["Internal only"]
    end

    subgraph frozen["❄️ FROZEN DUCKLAKE"]
        direction LR
        f_versions["v2026.02, v2026.03..."]
        f_clean["Clean tables"]
        f_access["Public access"]
    end

    subgraph outputs["🌐 Public Access"]
        direction LR
        apps["Shiny Apps"]
        api["REST API"]
        rpkg["calcofi4r"]
    end

    csv -->|"ingest workflows"| working
    working -->|"freeze release"| frozen
    frozen --> outputs

    style archive fill:#e3f2fd,stroke:#1565c0
    style working fill:#fff3e0,stroke:#ef6c00
    style frozen fill:#e8f4e8,stroke:#2e7d32
    style outputs fill:#fce4ec,stroke:#c2185b
Figure 5: Working vs Frozen DuckLake Architecture

6.2 Working DuckLake (Internal)

The Working DuckLake is the internal, mutable database used by workflows. It prioritizes:

  • Full provenance tracking: Every row knows where it came from
  • Time travel: Query the database as it existed at any past point
  • Frequent updates: Can be modified whenever new data arrives

Schema features:

| Column | Type | Purpose |
|--------|------|---------|
| _source_file | VARCHAR | Path to original CSV file in Archive |
| _source_row | INTEGER | Row number in source file |
| _source_uuid | UUID | Original record ID (e.g., from NOAA database) |
| _ingested_at | TIMESTAMP | When this row was added/updated |

Example query with provenance:

-- Find all larvae records from a specific source file
SELECT scientific_name, count, _source_file, _source_row
FROM larvae
WHERE _source_file LIKE '%swfsc.noaa.gov/calcofi-db/larva.csv'
  AND cruise_id = '202301CC';

Time travel query:

-- Query larvae table as it existed on January 15th (DuckLake time travel)
SELECT * FROM larvae
  AT (TIMESTAMP => '2026-01-15 00:00:00');

6.3 Frozen DuckLake (Public Releases)

The Frozen DuckLake provides stable, versioned snapshots for external users. Based on the Frozen DuckLake pattern:

Key characteristics:

  • Immutable: Once released, a version never changes
  • Clean schema: No internal provenance columns (cleaner for users)
  • Semantic versioning: v{YYYY}.{MM} format (e.g., v2026.02)
  • Public access: Anyone can query via HTTP/Parquet

Release process:

flowchart LR
    working["Working DuckLake"]
    validate["Validate data quality"]
    strip["Remove provenance columns"]
    version["Tag version (v2026.02)"]
    upload["Upload to releases/"]
    announce["Announce release"]

    working --> validate --> strip --> version --> upload --> announce

    style working fill:#fff3e0,stroke:#ef6c00
    style validate fill:#fff9c4,stroke:#f9a825
    style strip fill:#fff9c4,stroke:#f9a825
    style version fill:#fff9c4,stroke:#f9a825
    style upload fill:#e8f4e8,stroke:#2e7d32
    style announce fill:#e8f4e8,stroke:#2e7d32
Figure 6: Freezing a Release

Frozen release structure:

gs://calcofi-db/ducklake/releases/
├── v2026.02/
│   ├── catalog.json          # DuckLake catalog
│   ├── parquet/
│   │   ├── bottle.parquet
│   │   ├── cast.parquet
│   │   ├── larvae.parquet
│   │   └── ...
│   └── RELEASE_NOTES.md
├── v2026.03/
│   └── ...
└── latest -> v2026.03        # Symlink to latest

6.4 Accessing Frozen Releases

From R (using calcofi4r):

library(calcofi4r)

# Get latest release (default)
con <- cc_get_db()

# Get specific version
con <- cc_get_db(version = "v2026.02")

# List available versions
cc_list_versions()
#> [1] "v2026.02" "v2026.03" "v2026.04"

Direct Parquet access (no R required):

import duckdb

# Query directly from GCS over HTTPS (no download needed; httpfs extension auto-loads)
con = duckdb.connect()
df = con.execute("""
    SELECT * FROM read_parquet(
        'https://storage.googleapis.com/calcofi-db/ducklake/releases/latest/parquet/larvae.parquet'
    )
    LIMIT 10
""").df()
print(df)

From any HTTP client:

# Download parquet file directly
curl -O https://storage.googleapis.com/calcofi-db/ducklake/releases/v2026.02/parquet/larvae.parquet
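
From R without calcofi4r, the same frozen Parquet files can be queried directly with DuckDB. A minimal sketch (assumes the httpfs extension can be installed; table and column names as used elsewhere in this plan):

library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
dbExecute(con, "INSTALL httpfs")   # one-time; enables reading over HTTPS
dbExecute(con, "LOAD httpfs")

# Top taxa by larval counts in the v2026.02 release
dbGetQuery(con, "
  SELECT scientific_name, SUM(\"count\") AS n_larvae
  FROM read_parquet(
    'https://storage.googleapis.com/calcofi-db/ducklake/releases/v2026.02/parquet/larvae.parquet')
  GROUP BY scientific_name
  ORDER BY n_larvae DESC
  LIMIT 10")

dbDisconnect(con, shutdown = TRUE)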

6.5 Release Schedule

| Frequency | Trigger | Example |
|-----------|---------|---------|
| Monthly | Regular data updates | v2026.02, v2026.03 |
| Ad-hoc | Major dataset additions | v2026.02.1 (patch) |
| Annual | Year-end comprehensive release | v2026.12 |

Release checklist:


7 R Package Strategy

7.1 Package Roles

7.1.1 calcofi4db - Database Administration Tools

Target users: Data managers, workflow developers

calcofi4db/R/
├── cloud.R        # GCS operations (@concept cloud)
├── read.R         # CSV/Parquet reading (@concept read)
├── transform.R    # Data transformation (@concept transform)
├── ingest.R       # Database loading (@concept ingest)
├── parquet.R      # Parquet operations (@concept parquet)   [NEW]
├── duckdb.R       # DuckDB operations (@concept duckdb)     [NEW]
├── version.R      # Schema versioning (@concept version)
├── check.R        # Data validation (@concept check)
└── viz.R          # Diagram generation (@concept viz)

7.1.2 calcofi4r - User-Facing Tools

Target users: Researchers, analysts, app developers

calcofi4r/R/
├── read.R         # Data retrieval (@concept read)
├── database.R     # Database connection (@concept database)
├── analyze.R      # Statistical analysis (@concept analyze)
├── visualize.R    # Plotting and mapping (@concept visualize)
└── functions.R    # Helper functions (@concept utils)

7.2 Function Reuse Across Contexts

| Context | Package | Example Usage |
|---------|---------|---------------|
| Ingestion workflows | calcofi4db | read_csv_files(), ingest_csv_to_db() |
| Publishing workflows | calcofi4db | get_db_con(), create_db_manifest() |
| Shiny apps | calcofi4r | cc_get_db(), cc_plot_timeseries() |
| Analysis scripts | calcofi4r | cc_read_bottle(), cc_analyze_trend() |
| API endpoints | calcofi4r | cc_query_db(), cc_get_variables() |

7.3 New Functions to Implement

7.3.1 In calcofi4db:

# R/cloud.R - GCS operations
get_gcs_file(bucket, path, local_path = NULL)
put_gcs_file(local_path, bucket, path)
sync_gdrive_to_gcs(gdrive_path, gcs_bucket, backup = TRUE)
list_gcs_versions(bucket, path)

# R/parquet.R - Parquet operations
csv_to_parquet(csv_path, schema_def = NULL)
read_parquet_table(path, con = NULL)
write_parquet_table(data, path, partitions = NULL)
add_parquet_metadata(path, metadata_list)

# R/duckdb.R - DuckDB operations
get_duckdb_con(path = ":memory:", read_only = FALSE)
create_duckdb_views(con, manifest)
attach_ducklake(con, catalog_path)
set_duckdb_comments(con, table, comments)

# R/ducklake.R - DuckLake catalog operations [NEW]
get_working_ducklake()           # connect to internal Working DuckLake
ingest_to_working(data, table, source_file, source_uuid_col = NULL)
add_provenance_columns(data, source_file, source_row_start = 1)

# R/freeze.R - Frozen release operations [NEW]
freeze_release(version, release_notes = NULL)
  # 1. Validate data quality
  # 2. Strip provenance columns (_source_*, _ingested_at)
  # 3. Export to parquet files
  # 4. Create DuckLake catalog
  # 5. Upload to releases/{version}/
validate_for_release(con)        # run all quality checks before freeze
list_frozen_releases()           # list available frozen versions
compare_releases(v1, v2)         # diff two releases
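
As an illustration, add_provenance_columns() could be as simple as the sketch below (column names per Section 6.2; actual implementation to be decided):

# Sketch only: attach the provenance columns expected by the Working DuckLake
add_provenance_columns <- function(data, source_file, source_row_start = 1) {
  dplyr::mutate(
    data,
    `_source_file` = source_file,
    `_source_row`  = dplyr::row_number() + source_row_start - 1,
    `_ingested_at` = Sys.time()
  )
}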

7.3.2 In calcofi4r:

# R/database.R - Frozen DuckLake access (user-facing)
cc_get_db(version = "latest", local_cache = TRUE)
  # Connect to frozen release (never Working DuckLake)
cc_list_versions()               # list available frozen releases
cc_db_info(version = "latest")   # release date, row counts, notes
cc_release_notes(version)        # view release notes for a version

# R/read.R - Convenience functions for common tables
cc_read_bottle(version = "latest", ...)
cc_read_larvae(version = "latest", ...)
cc_read_cast(version = "latest", ...)

# R/query.R - Query helpers
cc_query(sql, version = "latest")
cc_list_tables(version = "latest")
cc_describe_table(table, version = "latest")

8 Workflow Dependency Management with targets

8.1 Pipeline Definition

Create workflows/_targets.R:

library(targets)
library(tarchetypes)

tar_option_set(
  packages = c("calcofi4db", "duckdb", "dplyr", "arrow"),
  format = "qs"  # Fast serialization
)

# Define pipeline
list(
  # ─── Raw Data from GCS (public bucket, latest archive) ─────────
  tar_target(
    raw_bottle,
    get_gcs_file("gs://calcofi-files-public", "calcofi.org/bottle-database/bottle.csv")
  ),
  tar_target(
    raw_cast,
    get_gcs_file("gs://calcofi-files-public", "calcofi.org/bottle-database/cast.csv")
  ),
  tar_target(
    raw_larvae,
    get_gcs_file("gs://calcofi-files-public", "swfsc.noaa.gov/calcofi-db/larva.csv")
  ),

  # ─── Transform to Parquet ──────────────────────────────────────
  tar_target(
    pqt_bottle,
    csv_to_parquet(raw_bottle, output = "parquet/bottle.parquet")
  ),
  tar_target(
    pqt_cast,
    csv_to_parquet(raw_cast, output = "parquet/cast.parquet")
  ),
  tar_target(
    pqt_larvae,
    csv_to_parquet(raw_larvae, output = "parquet/larvae.parquet")
  ),

  # ─── Create Integrated DuckDB ──────────────────────────────────
  tar_target(
    db_manifest,
    create_db_manifest(
      tables = list(pqt_bottle, pqt_cast, pqt_larvae),
      version = format(Sys.Date(), "%Y.%m.%d")
    )
  ),
  tar_target(
    duckdb_file,
    create_duckdb_from_manifest(db_manifest)
  ),

  # ─── Publish Datasets ──────────────────────────────────────────
  tar_target(
    obis_archive,
    create_obis_archive(duckdb_file, include_bottle = TRUE)
  )
)
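
Typical commands for running and inspecting the pipeline (standard targets functions):

library(targets)

tar_make()              # run the pipeline; only outdated targets are rebuilt
tar_visnetwork()        # interactive view of the dependency graph
tar_outdated()          # list targets that would re-run given upstream changes
tar_read(db_manifest)   # load a built target (e.g., the database manifest) into the session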

8.2 Workflow Lifecycle

flowchart TB
    subgraph dev["Development Cycle"]
        direction LR
        d1["1\. Explore"]
        d2["2\. Ingest"]
        d3["3\. Targets"]
        d4["4\. Run"]
        d5["5\. Publish"]
        d1 --> d2 --> d3 --> d4 --> d5
    end

    subgraph update["Update Cycle"]
        direction LR
        u1["1\. New data"]
        u2["2\. Sync"]
        u3["3\. Run"]
        u4["4\. Republish"]
        u1 --> u2 --> u3 --> u4
    end

    dev -.->|"formalized"| update

    style dev fill:#fff3e0,stroke:#ef6c00
    style update fill:#e8f4e8,stroke:#2e7d32
Figure 7: Development and Update Cycles

8.3 Dependency Graph

flowchart LR
    subgraph raw["📁 Raw CSV"]
        raw_bottle
        raw_cast
        raw_larvae
    end

    subgraph parquet["⚙️ Parquet"]
        pqt_bottle
        pqt_cast
        pqt_larvae
    end

    subgraph database["🗃️ Database"]
        db_manifest
        duckdb_file
    end

    subgraph publish["🌐 Publish"]
        obis_archive
    end

    raw_bottle --> pqt_bottle
    raw_cast --> pqt_cast
    raw_larvae --> pqt_larvae

    pqt_bottle --> db_manifest
    pqt_cast --> db_manifest
    pqt_larvae --> db_manifest

    db_manifest --> duckdb_file
    duckdb_file --> obis_archive

    style raw fill:#e3f2fd,stroke:#1565c0
    style parquet fill:#fff3e0,stroke:#ef6c00
    style database fill:#f3e5f5,stroke:#7b1fa2
    style publish fill:#fce4ec,stroke:#c2185b
Figure 8: targets Pipeline Dependency Graph

9 Documentation Updates

9.2 Where Documentation Should Live

| Content | Location | Rationale |
|---------|----------|-----------|
| Database schema | docs/db.qmd | User-facing documentation |
| Workflow development | workflows/README.md | Developer guide |
| Package functions | calcofi4r/, calcofi4db/ | pkgdown reference |
| Data dictionary | calcofi4db/inst/ | Versioned with code |
| Architecture decisions | docs/ or workflows/_docs/ | Long-term reference |

10 Implementation Priority

10.1 Phase 1: Foundation ✅ COMPLETE (2026-02-02)

Goal: Establish cloud infrastructure and helper functions

  1. Set up GCS buckets
    • Created gs://calcofi-files-public/ with public read access
    • Created gs://calcofi-files-private/ with restricted access
    • Configured rclone access via OAuth
  2. Add cloud functions to calcofi4db
    • Created R/cloud.R with get_gcs_file(), put_gcs_file()
    • Created R/parquet.R with csv_to_parquet()
    • Created R/duckdb.R with get_duckdb_con(), create_duckdb_views()
  3. Initial sync
    • Synced Google Drive data-public/ to gs://calcofi-files-public/_sync/
    • Created initial archive snapshot: archive/2026-02-02_121557/
    • Generated manifest: manifests/manifest_2026-02-02_121557.json
    • Total: 112,457 files (33.5 GiB)

10.2 Phase 2: Pipeline

Goal: Implement targets pipeline for priority datasets

  1. Create workflows/_targets.R
    • Define targets for bottle and larvae data
    • Establish Parquet transformation pipeline
    • Create integrated DuckDB
  2. Migrate ingestion workflows
    • Convert ingest_calcofi.org_bottle-database.qmd into targets pipeline steps
    • Convert ingest_swfsc.noaa.gov_calcofi-db.qmd into targets pipeline steps
  3. Update publish workflow
    • Modify publish_larvae_to_obis.qmd to read from DuckDB
    • Add bottle data to EMoF extension

10.3 Phase 3: Working DuckLake

Goal: Implement Working DuckLake with provenance tracking

  1. Set up Working DuckLake catalog
    • Configure DuckLake at gs://calcofi-db/ducklake/working/
    • Enable time travel queries
    • Create Git-tracked catalog manifest
  2. Add provenance columns to ingest workflows
    • _source_file: Path to source CSV in Archive
    • _source_row: Row number in source file
    • _source_uuid: Original record UUID (if available)
    • _ingested_at: Ingestion timestamp
  3. Implement calcofi4db DuckLake functions
    • get_working_ducklake(): Connect to internal database
    • ingest_to_working(): Ingest with provenance
    • add_provenance_columns(): Helper for provenance

10.4 Phase 4: Frozen DuckLake

Goal: Implement Frozen DuckLake for public releases

  1. Create freeze workflow
    • validate_for_release(): Data quality checks
    • freeze_release(): Create immutable release
    • Strip provenance columns for clean public schema
    • Generate release notes
  2. Set up release structure
    • gs://calcofi-db/ducklake/releases/v2026.02/
    • Parquet files + DuckLake catalog
    • latest symlink to current release
  3. Update calcofi4r for frozen access
    • cc_get_db(version = "latest"): Default to frozen releases
    • cc_list_versions(): List available releases
    • cc_release_notes(): View release notes
  4. First public release: v2026.02
    • Bottle + Cast + Larvae tables
    • Announce to users

10.5 Phase 5: Documentation & Polish

Goal: Complete documentation and evaluate API

  1. Update docs/
    • Revise db.qmd for DuckDB/DuckLake
    • Document Working vs Frozen databases
    • Evaluate API necessity
    • Update portal documentation
  2. Clean up workflows/
    • Consolidate README.md with essentials
    • Archive deprecated exploration notebooks
    • Document workflow development process
    • Create “Getting Started” guide for data managers

11 Verification Plan

11.1 End-to-End Test

# 1. Sync test data from Google Drive to the GCS _sync/ working directory
system("rclone sync gdrive:calcofi/data-public gcs:calcofi-files-public/_sync")

# 2. Create archive snapshot
timestamp <- format(Sys.time(), "%Y-%m-%d_%H%M%S")
system(glue::glue("rclone copy gcs:calcofi-files-public/_sync gcs:calcofi-files-public/archive/{timestamp}"))

# 3. Run targets pipeline
targets::tar_make()

# 4. Verify database
con <- calcofi4db::get_duckdb_con("calcofi.duckdb")
DBI::dbListTables(con)
DBI::dbGetQuery(con, "SELECT COUNT(*) FROM larvae")

# 5. Verify OBIS archive
obistools::check_fields(readr::read_csv("data/darwincore/larvae/occurrence.csv"))

11.2 Integration Tests


12 Migration Strategy: PostgreSQL → DuckDB

12.1 Current PostgreSQL Dependencies

Based on server/docker-compose.yml, the following services depend on PostgreSQL:

| Service | Purpose | Migration Path |
|---------|---------|----------------|
| postgis | Spatial database | Keep for vector tiles (short-term) |
| pg_tileserv | Vector tile server | Replace with PMTiles |
| pg_rest | REST API (PostgREST) | Keep for web apps, add DuckDB option |
| plumber | R API | Update to use DuckDB |
| pgadmin | Database admin | Deprecate once migrated |

12.2 Vector Tiles: pg_tileserv → PMTiles

The current architecture uses pg_tileserv to serve vector tiles from PostGIS for Shiny mapping apps. The migration path:

flowchart LR
    subgraph current["Current Flow (PostgreSQL)"]
        direction LR
        pg["PostGIS"]
        ts["pg_tileserv"]
        lf["Leaflet"]
        pg --> ts --> lf
    end

    subgraph target["Target Flow (Cloud-native)"]
        direction LR
        geo["GeoParquet"]
        tc["tippecanoe"]
        pm["PMTiles"]
        mg["mapgl"]
        geo --> tc --> pm --> mg
    end

    current -.->|"migrate"| target

    style current fill:#ffebee,stroke:#c62828
    style target fill:#e8f4e8,stroke:#2e7d32
Figure 9: Vector Tiles Migration Path

12.2.1 Implementation Steps

  1. Generate PMTiles from PostGIS

    # Export geometries from PostGIS
    ogr2ogr -f GeoJSON stations.geojson \
      PG:"host=postgis user=admin dbname=gis" \
      -sql "SELECT * FROM stations"
    
    # Convert to PMTiles using tippecanoe
    tippecanoe -o stations.pmtiles \
      --minimum-zoom=0 --maximum-zoom=14 \
      --layer=stations \
      stations.geojson
    
    # Upload to GCS
    gsutil cp stations.pmtiles gs://calcofi-db/tiles/
  2. Update Shiny Apps

    # Old: pg_tileserv
    leaflet() |>
      addTiles() |>
      addMapboxGL(
        style = list(
          sources = list(
            stations = list(
              type = "vector",
              tiles = list("https://tile.calcofi.io/public.stations/{z}/{x}/{y}.pbf")
            )
          )
        )
      )
    
    # New: PMTiles with mapgl
    library(mapgl)
    mapboxgl() |>
      add_pmtiles_source(
        id = "stations",
        url = "https://storage.googleapis.com/calcofi-db/tiles/stations.pmtiles"
      ) |>
      add_layer(
        id = "stations-layer",
        type = "circle",
        source = "stations",
        source_layer = "stations"
      )
  3. Add PMTiles generation to targets pipeline

    # In _targets.R
    tar_target(
      pmtiles_stations,
      create_pmtiles(
        geom_source = duckdb_file,
        query = "SELECT * FROM stations",
        output = "tiles/stations.pmtiles"
      )
    )

12.3 Migration Roadmap

| Phase | Timeline | Actions |
|-------|----------|---------|
| Phase 1 | Now | Add DuckDB as primary data access; keep PostgreSQL |
| Phase 2 | +2 months | Generate PMTiles for all spatial layers |
| Phase 3 | +4 months | Update Shiny apps to use mapgl + PMTiles |
| Phase 4 | +6 months | Deprecate pg_tileserv, reduce PostgreSQL to backup |
| Phase 5 | +12 months | Fully deprecate PostgreSQL (if no dependencies remain) |

13 Resolved Questions

Based on clarifications received:

  1. GCS Project: Use existing project ucsd-sio-calcofi

    • Existing bucket calcofi-db → DuckLake/database files
    • New bucket calcofi-files-public → Versioned public source files (public read access)
    • New bucket calcofi-files-private → Versioned private/sensitive files (restricted access)
  2. Google Drive Structure (implemented 2026-02-02):

    • calcofi/data-public/ → Public datasets (bottle, larvae, etc.)
    • calcofi/data-private/ → Private/sensitive data (eDNA, zooplankton samples)
  3. GCS Folder Structure (implemented 2026-02-02):

    • _sync/ → Working directory (rclone syncs here, mutable)
    • archive/{timestamp}/ → Immutable snapshots (complete copies, not deltas)
    • manifests/ → JSON metadata files with manifest_latest.json pointer
  4. API Strategy: Keep PostgreSQL + API for now with migration roadmap

    • pg_tileserv needed for current Shiny apps
    • Migrate to PMTiles over time using tippecanoe + mapgl
  5. Priority Data: Larvae + Bottle datasets for proof-of-concept

  6. Bottle in OBIS: Include as EMoF extension in larvae dataset

14 Remaining Questions

  1. DuckLake vs MotherDuck: Self-hosted DuckLake catalog or use MotherDuck service?
  2. Automation: GitHub Actions vs server cron for scheduled rclone syncs?
  3. PMTiles hosting: Serve from GCS directly or via CDN?

15 Appendix: Key References

15.1 Documentation

15.2 Technologies

Database & Analytics:

Workflow & Orchestration:

Cloud Storage:

Vector Tiles:

15.3 CalCOFI Resources

Footnotes

  1. Technical term: This is a “versioned data lake” built on Google Cloud Storage with rclone synchronization.↩︎

  2. DuckDB is a fast, in-process database that can query Parquet files directly from cloud storage without downloading them.↩︎

  3. Technical terms: This is a “versioned data lake” implemented with Google Cloud Storage buckets and rclone synchronization with timestamped archive snapshots.↩︎