CalCOFI Data Workflow Plan
This document outlines a comprehensive strategy for managing and versioning CalCOFI data files, building an integrated database, and publishing datasets to external portals.
1 Overview for Data Managers & Scientists
This section provides a high-level view of the CalCOFI data system. You don’t need to understand all the technical details—just the four main components and how they connect.
1.1 The Big Picture
```mermaid
flowchart TB
  subgraph drive["📁 THE DRIVE"]
    direction LR
    gd_pub["data-public/<br/>Bottle, Larvae, CTD..."]
    gd_priv["data-private/<br/>eDNA, Zooplankton..."]
  end
  subgraph archive["🗄️ THE ARCHIVE"]
    direction LR
    sync["_sync/<br/>(working copy)"]
    hist["archive/<br/>(snapshots)"]
  end
  subgraph workflows["⚙️ THE WORKFLOWS"]
    direction LR
    ingest["Ingest<br/>(CSV → DB)"]
    publish["Publish<br/>(DB → Portals)"]
  end
  subgraph database["🗃️ THE DATABASE"]
    direction LR
    working["Working<br/>(internal)"]
    frozen["Frozen<br/>(public releases)"]
  end
  subgraph outputs["🌐 OUTPUTS"]
    direction LR
    obis["OBIS"]
    apps["Shiny Apps"]
    api["API / R pkg"]
  end
  drive -->|"daily sync"| archive
  archive -->|"read files"| workflows
  workflows -->|"transform"| database
  database -->|"publish"| outputs
  style drive fill:#e8f4e8,stroke:#2e7d32
  style archive fill:#e3f2fd,stroke:#1565c0
  style workflows fill:#fff3e0,stroke:#ef6c00
  style database fill:#f3e5f5,stroke:#7b1fa2
  style outputs fill:#fce4ec,stroke:#c2185b
```
1.2 Four Key Components
📁 The Drive
What: The shared Google Drive folder where you upload your data files.
Location: My Drive/projects/calcofi/data-public/ (or data-private/)
Who uses it: Dataset managers (Rasmus, Ed, Nastassia, Linsey, etc.)
How to use it:
- Create a folder for your data: {provider}/{dataset}/ (e.g., swfsc.noaa.gov/calcofi-db/)
- Inside, organize files into raw/ (original data) and derived/ (processed data)
- Simply upload or overwrite files—versioning happens automatically
- No special naming required (no date prefixes needed)
Example structure:
data-public/
├── calcofi.org/
│ └── bottle-database/
│ ├── raw/
│ │ └── bottle_export_2026-01-15.csv
│ └── derived/
│ └── bottle_cleaned.csv
├── swfsc.noaa.gov/
│ └── calcofi-db/
│ ├── raw/
│ │ ├── larva.csv
│ │ └── cruise.csv
│ └── derived/
🗄️ The Archive
What: A versioned backup of all Drive files in the cloud.¹
Location: Google Cloud Storage buckets (gs://calcofi-files-public/ and gs://calcofi-files-private/)
Who uses it: Automated systems (you don’t need to interact with it directly)
What it does:
- Daily sync: Copies all files from the Drive to the Archive
- Snapshots: Creates timestamped copies so we can always go back in time
- Never loses data: Even if you delete a file from Drive, it’s preserved in the Archive
Why it matters: Scientists can reproduce analyses using the exact data from any past date.
⚙️ The Workflows
What: Automated scripts that process data from the Archive into the Database.
Location: /workflows/ folder (Quarto notebooks and R scripts)
Who uses it: Data engineers and developers
Two types of workflows:
- Ingest workflows (ingest_*.qmd): Read CSV files → Clean/transform → Load into Database
- Publish workflows (publish_*.qmd): Read from Database → Format for portals → Upload to OBIS, EDI, etc.
Key feature: Dependencies are managed automatically—if source data changes, all downstream steps re-run.
🗃️ The Database
What: The integrated CalCOFI database combining all datasets into related tables.
Format: DuckDB² with Parquet files (modern, fast, serverless)
Two versions:
| Version | Purpose | Access | Features |
|---|---|---|---|
| Working Database | Internal processing | Workflows only | Full provenance tracking, UUIDs, source file references, time travel |
| Frozen Database | Public releases | Apps, API, R package | Clean tables, versioned releases (v2026.02, v2026.03…), optimized for queries |
Why two versions? The Working Database tracks every detail for reproducibility. The Frozen Database provides clean, stable snapshots for users who just want the data.
1.3 How Data Flows
```mermaid
flowchart TB
  subgraph you["👤 YOU"]
    upload["Upload CSV to Drive"]
  end
  subgraph auto["🤖 AUTOMATED"]
    sync["Sync to Archive"]
    snapshot["Create snapshot"]
    transform["Run workflows"]
  end
  subgraph result["📊 RESULT"]
    db["Updated Database"]
    portals["Published to Portals"]
  end
  upload --> sync --> snapshot --> transform --> db --> portals
  style you fill:#e8f4e8,stroke:#2e7d32
  style auto fill:#fff3e0,stroke:#ef6c00
  style result fill:#f3e5f5,stroke:#7b1fa2
```
Your role as a dataset manager: Upload files to the Drive. Everything else happens automatically.
2 Executive Summary
This plan modernizes the CalCOFI data workflow architecture to:
- Version raw data files in Google Cloud Storage (GCS)
- Use DuckDB + Parquet as the primary integrated database format
- Leverage targets for workflow dependency management
- Use R packages (calcofi4r, calcofi4db) for reusable functions
- Maintain reproducible pipelines from raw data to published datasets
3 Current State Analysis
3.1 Existing Infrastructure
| Component | Current State | Location |
|---|---|---|
| Workflows | Individual Quarto notebooks | /workflows/ |
| Database | PostgreSQL with dev/prod schemas | api.calcofi.io |
| R Packages | calcofi4r (user), calcofi4db (admin) | GitHub |
| Data Sources | CSV files on Google Drive | ~/My Drive/projects/calcofi/data-public |
| Documentation | Quarto site | /docs/ |
3.2 Key Workflow Files
| Workflow | Purpose | Status |
|---|---|---|
| ingest_calcofi.org_bottle-database.qmd | Ingest bottle/cast data | Partial |
| ingest_swfsc.noaa.gov_calcofi-db.qmd | Ingest NOAA ichthyoplankton data | Active |
| publish_larvae_to_obis.qmd | Publish larvae to OBIS | Active (needs bottle data) |
3.3 Identified Gaps
- No dependency management: Workflows run independently without declared dependencies
- No file versioning: Source files updated in place without version history
- PostgreSQL overhead: Requires server maintenance and API infrastructure
- Disconnected workflows: No systematic connection between ingest and publish steps
- Manual sequencing: Users must know to run bottle ingestion before larvae ingestion
4 File Versioning Strategy
4.1 The Two-World Problem
CalCOFI data originates from multiple sources managed by different individuals:
- Rasmus: Bottle/cast data, CTD profiles
- Ed: Ichthyoplankton (larvae, eggs) data
- Nastassia: eDNA data
- Linsey: Zooplankton samples
These dataset managers need a simple interface (Google Drive) to deposit and organize files, while the data engineering side needs version control, immutability, and reproducibility.
4.2 Google Drive: The Human Interface
This is “The Drive” from Section 1. Dataset managers upload files here; everything else is automatic.
4.2.1 Folder Structure
Recommended organization: {provider}/{dataset}/raw/ and {provider}/{dataset}/derived/
calcofi/data-public/ # Shared Google Drive folder (public data)
├── calcofi.org/
│ └── bottle-database/
│ ├── raw/ # Original exports from source system
│ │ └── bottle_export_2026-01.csv
│ └── derived/ # Cleaned/processed versions
│ ├── bottle.csv
│ ├── cast.csv
│ └── field_descriptions.csv
├── swfsc.noaa.gov/
│ └── calcofi-db/
│ ├── raw/ # Original NOAA database exports
│ │ └── calcofi_db_dump_2026-01.sql
│ └── derived/ # Extracted CSV tables
│ ├── cruise.csv
│ ├── egg.csv
│ ├── larva.csv
│ ├── net.csv
│ ├── ship.csv
│ ├── site.csv
│ ├── species.csv
│ └── tow.csv
├── coastwatch.pfeg.noaa.gov/
│ └── erdCalCOFIlrvsiz/
│ └── derived/
│ └── larvae_size.csv
└── _archive/ # Manual archive (optional)
└── 2025-01-15_bottle.csv # User-created backup
calcofi/data-private/ # Private/sensitive data (restricted access)
├── edna/ # eDNA samples (Nastassia)
│ ├── raw/
│ └── derived/
├── zooplankton/ # Zooplankton samples (Linsey)
│ ├── raw/
│ └── derived/
└── _archive/ # Manual archive (optional)
4.2.2 Google Drive Naming Conventions
- Top-level folders: {provider}/ - the organization providing the data (e.g., swfsc.noaa.gov)
- Dataset folders: {dataset}/ - specific dataset name (e.g., calcofi-db)
- Subfolders (recommended):
  - raw/ - Original data as received from source (database dumps, original CSVs)
  - derived/ - Processed/cleaned versions ready for ingestion
- Files: Original source names, no date prefixes required
- Updates: Simply overwrite existing files
- No versioning needed in Drive: do not rely on Drive's limited built-in revision history; durable version history is captured in the Archive
4.2.3 Dataset Manager Workflow
- Create folder structure: {provider}/{dataset}/raw/ and {provider}/{dataset}/derived/
- Upload original data to raw/ (preserve original format)
- Upload processed/cleaned data to derived/ (CSV format preferred)
- Simply overwrite files when updating—versioning happens automatically in the Archive
- Optionally notify data team of significant changes
See Section 1 for the big picture of how your data flows through the system.
4.3 Google Cloud Storage: The Versioned Data Lake
This is “The Archive” from Section 1.³ It automatically captures versioned snapshots of all files from the Drive.
4.3.1 Bucket Structure
gs://calcofi-files-public/ # Versioned public source files
├── _sync/ # Working directory (rclone syncs here)
│ ├── calcofi.org/
│ │ └── bottle-database/
│ │ ├── bottle.csv
│ │ └── cast.csv
│ └── swfsc.noaa.gov/
│ └── calcofi-db/
│ ├── cruise.csv
│ ├── egg.csv
│ └── ...
├── archive/ # Timestamped immutable snapshots
│ ├── 2026-02-02_121557/ # YYYY-MM-DD_HHMMSS
│ │ └── calcofi.org/
│ │ └── bottle-database/
│ │ └── bottle.csv # Snapshot at that time
│ ├── 2026-02-15_120000/
│ │ └── swfsc.noaa.gov/
│ │ └── calcofi-db/
│ │ └── larva.csv
│ └── ...
└── manifests/ # Version metadata
├── manifest_2026-02-02_121557.json
├── manifest_2026-02-15_120000.json
└── manifest_latest.json # Points to latest archive
gs://calcofi-files-private/ # Versioned private source files
├── _sync/ # Working directory (rclone syncs here)
│ ├── edna/
│ └── zooplankton/
├── archive/ # Timestamped immutable snapshots
└── manifests/
gs://calcofi-db/ # Integrated database
├── parquet/ # Transformed tables
│ ├── bottle.parquet
│ ├── cast.parquet
│ ├── larvae.parquet
│ └── ...
├── duckdb/ # DuckDB database files
│ └── calcofi.duckdb
└── ducklake/ # DuckLake catalog
└── catalog.json
4.3.2 GCS Naming Conventions
- _sync/: Working directory, exact mirror of Google Drive structure (rclone syncs here)
- archive/{YYYY-MM-DD_HHMMSS}/: Immutable snapshot of complete data at that timestamp
- manifests/manifest_{YYYY-MM-DD_HHMMSS}.json: Snapshot metadata
- manifests/manifest_latest.json: Points to latest archive snapshot
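For example, the available snapshots and manifests under these prefixes can be listed from R with googleCloudStorageR (referenced in Section 15); this is a sketch and assumes you have already authenticated, e.g. with a service account key:

```r
library(googleCloudStorageR)
# authenticate first, e.g. gcs_auth("/etc/rclone/calcofi-sa.json")
# list manifest files in the public bucket
gcs_list_objects("calcofi-files-public", prefix = "manifests/")
# list archive snapshot folders (one per sync timestamp)
gcs_list_objects("calcofi-files-public", prefix = "archive/", delimiter = "/")
```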
4.4 Rclone Sync: Capturing Daily Changes
4.4.1 Installation
macOS:
brew install rclone
Linux (Ubuntu/Debian):
curl https://rclone.org/install.sh | sudo bash
4.4.2 Configuration
Run rclone config and create two remotes:
Remote 1: Google Drive (gdrive)
n) New remote
name> gdrive
Storage> drive
client_id> # leave blank
client_secret> # leave blank
scope> 1 # Full access (or 2 for read-only)
service_account_file> # leave blank for OAuth, or path to JSON key
Edit advanced config> n
Use auto config> y # opens browser (use 'n' on headless server)
Configure this as a Shared Drive> n
Remote 2: Google Cloud Storage (gcs)
n) New remote
name> gcs
Storage> google cloud storage
client_id> # leave blank
client_secret> # leave blank
project_number> ucsd-sio-calcofi
service_account_file> # leave blank for OAuth, or path to JSON key
Edit advanced config> n
Use auto config> y # opens browser (use 'n' on headless server)
4.4.3 Headless Server Setup
For Linux servers without a browser, use rclone authorize on a local machine:
# on local machine with browser
rclone authorize "drive"
# copy the token output
# on server, during rclone config
Use auto config> n
# paste the token when prompted
Similarly for GCS:
# on local machine
rclone authorize "google cloud storage"
4.4.4 Service Account (Optional)
For automated/scheduled syncs, use a GCP service account instead of OAuth:
- Create service account in GCP Console → IAM → Service Accounts
- Grant roles: Storage Object Admin for the calcofi-files-public, calcofi-files-private, and calcofi-db buckets
- Download the JSON key file to a secure location (e.g., /etc/rclone/calcofi-sa.json)
- In rclone config, provide the path to the JSON key instead of using OAuth
4.4.5 Verify Configuration
# test Google Drive access
rclone lsd gdrive:projects/calcofi/data-public # OR
rclone lsd gdrive-ecoquants:projects/calcofi/data-public
# test GCS access
rclone ls gcs:calcofi-files-public --max-depth 1 # OR
rclone ls gcs-calcofi:calcofi-files-public --max-depth 1
# test sync (dry-run first!)
rclone sync gdrive:calcofi/data-public gcs:calcofi-files-public/_sync --dry-run -v
rclone sync gdrive-ecoquants:projects/calcofi/data-public gcs-calcofi:calcofi-files-public/_sync --dry-run -v
4.4.6 How It Works
#!/bin/bash
# sync_gdrive_to_gcs.sh - Run daily via cron
TIMESTAMP=$(date +%Y-%m-%d_%H%M%S)
LOG_FILE="/var/log/rclone/sync_${TIMESTAMP}.log"
# --- PUBLIC DATA ---
# Step 1: Sync Google Drive to _sync/ working directory
rclone sync gdrive:calcofi/data-public gcs:calcofi-files-public/_sync \
  --checksum \
  --drive-export-formats csv \
  --log-file ${LOG_FILE} \
  --log-level INFO
# Step 2: Create immutable archive snapshot
rclone copy gcs:calcofi-files-public/_sync gcs:calcofi-files-public/archive/${TIMESTAMP}
# Step 3: Generate and upload manifest
rclone lsjson gcs:calcofi-files-public/archive/${TIMESTAMP} --recursive \
  > /tmp/manifest_${TIMESTAMP}.json
rclone copy /tmp/manifest_${TIMESTAMP}.json gcs:calcofi-files-public/manifests/
rclone copyto /tmp/manifest_${TIMESTAMP}.json gcs:calcofi-files-public/manifests/manifest_latest.json
# --- PRIVATE DATA ---
# Repeat for private bucket (separate log file)
rclone sync gdrive:calcofi/data-private gcs:calcofi-files-private/_sync \
  --checksum \
  --drive-export-formats csv \
  --log-file ${LOG_FILE%.log}_private.log \
  --log-level INFO
rclone copy gcs:calcofi-files-private/_sync gcs:calcofi-files-private/archive/${TIMESTAMP}
4.4.7 Key Behaviors
| Scenario | Google Drive Action | Rclone Behavior | Result in GCS |
|---|---|---|---|
| New file | User uploads new.csv | rclone sync copies to _sync/ | _sync/new.csv created, next archive includes it |
| Updated file | User overwrites bottle.csv | rclone sync updates _sync/ | _sync/bottle.csv updated, archive snapshots preserve history |
| Deleted file | User deletes old.csv | rclone sync removes from _sync/ | Removed from _sync/, preserved in prior archives |
| Renamed file | User renames a.csv → b.csv | Treated as delete + create | a.csv gone from _sync/, b.csv created |
| No changes | Nothing | --checksum skips identical files | No new sync needed |
4.4.8 The Two-Step Archive Strategy
The archiving approach uses two steps for reliability:
- Sync to _sync/: rclone syncs Google Drive to the _sync/ working directory
- Copy to archive/{timestamp}/: Create an immutable snapshot of the current state
Benefits:
- Immutable archives: Each archive folder is a complete snapshot, not just deltas
- Reproducibility: Any archive folder contains everything needed to recreate that state
- No data loss: If sync fails, previous archives remain intact
- Public access: the calcofi-files-public bucket has public read access for easy data sharing
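Because the bucket is public, any synced or archived file can be read directly over HTTPS with no credentials; for example (file path taken from the _sync/ layout above, shown for illustration):

```r
# read a synced CSV straight from the public bucket (path is illustrative)
bottle <- readr::read_csv(
  "https://storage.googleapis.com/calcofi-files-public/_sync/calcofi.org/bottle-database/bottle.csv")
```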
4.5 Manifest Files
Each manifest captures the complete state of the archive at a point in time:
{
"generated_at": "2026-02-02T12:15:57Z",
"archive_timestamp": "2026-02-02_121557",
"bucket": "gs://calcofi-files-public",
"archive_path": "archive/2026-02-02_121557",
"files": [
{
"Path": "calcofi.org/bottle-database/bottle.csv",
"Size": 45234567,
"ModTime": "2026-02-02T10:30:00Z",
"MimeType": "text/csv",
"gcs_url": "gs://calcofi-files-public/archive/2026-02-02_121557/calcofi.org/bottle-database/bottle.csv"
},
{
"Path": "swfsc.noaa.gov/calcofi-db/larva.csv",
"Size": 12345678,
"ModTime": "2026-02-01T08:00:00Z",
"MimeType": "text/csv",
"gcs_url": "gs://calcofi-files-public/archive/2026-02-02_121557/swfsc.noaa.gov/calcofi-db/larva.csv"
}
],
"total_files": 112457,
"total_size_bytes": 35987654321
}
4.6 Reconstructing Historical State
To recreate the data lake as it existed on a specific date:
# R function to get the public HTTPS URL of a file from a specific archive snapshot
get_archived_file <- function(path, timestamp = "latest", bucket = "calcofi-files-public") {
  # Get manifest for that timestamp (manifests are publicly readable over HTTPS)
  if (timestamp == "latest") {
    manifest_url <- glue::glue("https://storage.googleapis.com/{bucket}/manifests/manifest_latest.json")
  } else {
    manifest_url <- glue::glue("https://storage.googleapis.com/{bucket}/manifests/manifest_{timestamp}.json")
  }
  manifest <- jsonlite::read_json(manifest_url)
  # Public HTTPS URL for direct access to the file in that archive snapshot
  # (equivalent GCS URI: gs://{bucket}/archive/{archive_timestamp}/{path})
  https_url <- glue::glue(
    "https://storage.googleapis.com/{bucket}/archive/{manifest$archive_timestamp}/{path}")
  return(https_url)
}
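For example, to read the bottle CSV exactly as it was captured in a given snapshot (timestamp from the Phase 1 sync, path as in the bucket layout above; both shown for illustration):

```r
url <- get_archived_file(
  "calcofi.org/bottle-database/bottle.csv",
  timestamp = "2026-02-02_121557")
bottle_2026_02_02 <- readr::read_csv(url)
```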
4.7 Automation Schedule
```mermaid
%%---
%%displayMode: compact
%%---
%%{init: {'theme': 'default', 'themeVariables': { 'textColor': '##333'}}}%%
gantt
  title Daily Sync Schedule (UTC)
  dateFormat HH:mm
  axisFormat %H:%M
  section Sync
  rclone   :a1, 00:00, 10m
  archive  :a2, after a1, 5m
  manifest :a3, after a2, 5m
  section Pipeline
  targets  :b1, after a3, 10m
  parquet  :b2, after b1, 5m
  duckdb   :b3, after b2, 5m
  section Notify
  notify   :c1, after b3, 5m
```
The daily automation runs at midnight UTC and completes in under an hour:
- Sync phase: rclone copies new/changed files from Google Drive to the _sync/ folder in GCS, then creates an immutable archive/ snapshot and generates a manifest file documenting the state.
- Pipeline phase: The targets pipeline detects which source files changed and re-runs only the affected steps—transforming CSVs to Parquet format and updating the DuckDB database.
- Notify phase: Sends notifications (email/Slack) summarizing what changed and any errors encountered.
If no files changed in Google Drive, the pipeline completes quickly with minimal processing.
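Whether this schedule runs via server cron or GitHub Actions is still an open question (Section 14). If server cron is chosen, one possible way to register the job from R is with the cronR package; the script path below is hypothetical:

```r
library(cronR)
# schedule the daily sync script at midnight server time (the plan targets 00:00 UTC)
cron_add(
  command     = "/usr/local/bin/sync_gdrive_to_gcs.sh",  # hypothetical install path
  frequency   = "daily",
  at          = "00:00",
  id          = "calcofi_gdrive_gcs_sync",
  description = "Sync Google Drive to GCS and snapshot the archive")
```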
5 Proposed Architecture
5.1 Data Flow Diagram
```mermaid
flowchart TB
  subgraph drive["📁 THE DRIVE"]
    direction LR
    gdrive_pub["data-public/"]
    gdrive_priv["data-private/"]
  end
  subgraph archive["🗄️ THE ARCHIVE"]
    direction LR
    bucket_pub["calcofi-files-public"]
    bucket_priv["calcofi-files-private"]
  end
  subgraph database["🗃️ THE DATABASE"]
    direction LR
    working["Working DuckLake"]
    frozen["Frozen Releases"]
  end
  subgraph outputs["🌐 OUTPUTS"]
    direction LR
    obis["OBIS"]
    edi["EDI"]
    erddap["ERDDAP"]
    apps["Apps/API"]
  end
  drive -->|"rclone sync"| archive
  archive -->|"targets pipeline"| database
  database -->|"publish workflows"| outputs
  style drive fill:#e8f4e8,stroke:#2e7d32
  style archive fill:#e3f2fd,stroke:#1565c0
  style database fill:#f3e5f5,stroke:#7b1fa2
  style outputs fill:#fce4ec,stroke:#c2185b
```
5.2 Technology Stack
| Layer | Technology | Purpose | Replaces |
|---|---|---|---|
| Storage | gs://calcofi-files-public | Versioned public source CSV files | Google Drive only |
| Storage | gs://calcofi-files-private | Versioned private/sensitive files | Google Drive only |
| Storage | gs://calcofi-db | Parquet, DuckDB, PMTiles | - |
| Sync | rclone | Drive → GCS with archive snapshots | Manual copying |
| Format | Apache Parquet | Efficient columnar storage | CSV |
| Compute | DuckDB | Serverless SQL queries | PostgreSQL (gradual) |
| Database | Working DuckLake | Internal DB with provenance, time travel | PostgreSQL |
| Database | Frozen DuckLake | Public versioned releases (v2026.02, etc.) | PostgreSQL snapshots |
| Orchestration | targets | Dependency management, caching | Manual sequencing |
| Tiles | PMTiles + tippecanoe | Cloud-native vector tiles | pg_tileserv |
| Mapping | mapgl | Modern WebGL maps in Shiny | leaflet + pg_tileserv |
| Functions | calcofi4r, calcofi4db | Reusable R code | Ad-hoc scripts |
GCP Project: ucsd-sio-calcofi
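As a sketch of the serverless model, DuckDB can query a Parquet table in gs://calcofi-db directly over HTTPS (bucket layout from Section 4.3.1; this assumes the object is publicly readable):

```r
library(duckdb)
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbExecute(con, "INSTALL httpfs;")
DBI::dbExecute(con, "LOAD httpfs;")
DBI::dbGetQuery(con, "
  SELECT COUNT(*) AS n_rows
  FROM read_parquet('https://storage.googleapis.com/calcofi-db/parquet/larvae.parquet')
")
```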
6 Database Architecture: Working vs Frozen DuckLake
The CalCOFI integrated database uses DuckLake, a lakehouse catalog that provides version control for data tables. We maintain two distinct databases to serve different needs:
6.1 The Two-Database Strategy
```mermaid
flowchart TB
  subgraph archive["📁 Archive"]
    csv["CSV source files"]
  end
  subgraph working["🔧 WORKING DUCKLAKE"]
    direction LR
    w_tables["Provenance columns"]
    w_history["Time travel"]
    w_access["Internal only"]
  end
  subgraph frozen["❄️ FROZEN DUCKLAKE"]
    direction LR
    f_versions["v2026.02, v2026.03..."]
    f_clean["Clean tables"]
    f_access["Public access"]
  end
  subgraph outputs["🌐 Public Access"]
    direction LR
    apps["Shiny Apps"]
    api["REST API"]
    rpkg["calcofi4r"]
  end
  csv -->|"ingest workflows"| working
  working -->|"freeze release"| frozen
  frozen --> outputs
  style archive fill:#e3f2fd,stroke:#1565c0
  style working fill:#fff3e0,stroke:#ef6c00
  style frozen fill:#e8f4e8,stroke:#2e7d32
  style outputs fill:#fce4ec,stroke:#c2185b
```
6.2 Working DuckLake (Internal)
The Working DuckLake is the internal, mutable database used by workflows. It prioritizes:
- Full provenance tracking: Every row knows where it came from
- Time travel: Query the database as it existed at any past point
- Frequent updates: Can be modified whenever new data arrives
Schema features:
| Column | Type | Purpose |
|---|---|---|
| _source_file | VARCHAR | Path to original CSV file in Archive |
| _source_row | INTEGER | Row number in source file |
| _source_uuid | UUID | Original record ID (e.g., from NOAA database) |
| _ingested_at | TIMESTAMP | When this row was added/updated |
Example query with provenance:
-- Find all larvae records from a specific source file
SELECT scientific_name, count, _source_file, _source_row
FROM larvae
WHERE _source_file LIKE '%swfsc.noaa.gov/calcofi-db/larva.csv'
AND cruise_id = '202301CC';
Time travel query:
-- Query larvae table as it existed on January 15th
SELECT * FROM larvae
AT TIMESTAMP '2026-01-15 00:00:00';
6.3 Frozen DuckLake (Public Releases)
The Frozen DuckLake provides stable, versioned snapshots for external users. Based on the Frozen DuckLake pattern:
Key characteristics:
- Immutable: Once released, a version never changes
- Clean schema: No internal provenance columns (cleaner for users)
- Calendar-based versioning: v{YYYY}.{MM} format (e.g., v2026.02)
- Public access: Anyone can query via HTTP/Parquet
Release process:
```mermaid
flowchart LR
  working["Working DuckLake"]
  validate["Validate data quality"]
  strip["Remove provenance columns"]
  version["Tag version (v2026.02)"]
  upload["Upload to releases/"]
  announce["Announce release"]
  working --> validate --> strip --> version --> upload --> announce
  style working fill:#fff3e0,stroke:#ef6c00
  style validate fill:#fff9c4,stroke:#f9a825
  style strip fill:#fff9c4,stroke:#f9a825
  style version fill:#fff9c4,stroke:#f9a825
  style upload fill:#e8f4e8,stroke:#2e7d32
  style announce fill:#e8f4e8,stroke:#2e7d32
```
Frozen release structure:
gs://calcofi-db/ducklake/releases/
├── v2026.02/
│ ├── catalog.json # DuckLake catalog
│ ├── parquet/
│ │ ├── bottle.parquet
│ │ ├── cast.parquet
│ │ ├── larvae.parquet
│ │ └── ...
│ └── RELEASE_NOTES.md
├── v2026.03/
│ └── ...
└── latest -> v2026.03 # Symlink to latest
6.4 Accessing Frozen Releases
From R (using calcofi4r):
library(calcofi4r)
# Get latest release (default)
con <- cc_get_db()
# Get specific version
con <- cc_get_db(version = "v2026.02")
# List available versions
cc_list_versions()
#> [1] "v2026.02" "v2026.03" "v2026.04"
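The returned connection is a standard DBI/DuckDB connection, so dplyr works against it as usual; the column name below is illustrative, based on the provenance query in Section 6.2:

```r
library(dplyr)
# count larvae records per cruise in the frozen release (column name assumed)
tbl(con, "larvae") |>
  count(cruise_id) |>
  arrange(desc(n)) |>
  collect()
```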
Direct Parquet access (no R required):
import duckdb
# Query directly from GCS (no download needed)
con = duckdb.connect()
con.execute("""
SELECT * FROM read_parquet(
'https://storage.googleapis.com/calcofi-db/ducklake/releases/latest/parquet/larvae.parquet'
)
LIMIT 10
""")From any HTTP client:
# Download parquet file directly
curl -O https://storage.googleapis.com/calcofi-db/ducklake/releases/v2026.02/parquet/larvae.parquet
6.5 Release Schedule
| Frequency | Trigger | Example |
|---|---|---|
| Monthly | Regular data updates | v2026.02, v2026.03 |
| Ad-hoc | Major dataset additions | v2026.02.1 (patch) |
| Annual | Year-end comprehensive release | v2026.12 |
Release checklist: validate data quality, strip provenance columns, tag the version, upload to releases/, and announce (see the freeze workflow in Section 10.4).
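A minimal sketch of how a release could be cut once the planned calcofi4db helpers (Section 7.3.1) exist; function names and arguments follow the planned signatures, not an existing implementation:

```r
library(calcofi4db)
con <- get_working_ducklake()   # connect to the Working DuckLake (planned helper)
validate_for_release(con)       # run data quality checks before freezing
freeze_release(
  version       = "v2026.02",
  release_notes = "First public release: bottle, cast, larvae")
```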
7 R Package Strategy
7.1 Package Roles
7.1.1 calcofi4db - Database Administration Tools
Target users: Data managers, workflow developers
calcofi4db/R/
├── cloud.R # GCS operations (@concept cloud)
├── read.R # CSV/Parquet reading (@concept read)
├── transform.R # Data transformation (@concept transform)
├── ingest.R # Database loading (@concept ingest)
├── parquet.R # Parquet operations (@concept parquet) [NEW]
├── duckdb.R # DuckDB operations (@concept duckdb) [NEW]
├── version.R # Schema versioning (@concept version)
├── check.R # Data validation (@concept check)
└── viz.R # Diagram generation (@concept viz)
7.1.2 calcofi4r - User-Facing Tools
Target users: Researchers, analysts, app developers
calcofi4r/R/
├── read.R # Data retrieval (@concept read)
├── database.R # Database connection (@concept database)
├── analyze.R # Statistical analysis (@concept analyze)
├── visualize.R # Plotting and mapping (@concept visualize)
└── functions.R # Helper functions (@concept utils)
7.2 Function Reuse Across Contexts
| Context | Package | Example Usage |
|---|---|---|
| Ingestion workflows | calcofi4db | read_csv_files(), ingest_csv_to_db() |
| Publishing workflows | calcofi4db | get_db_con(), create_db_manifest() |
| Shiny apps | calcofi4r | cc_get_db(), cc_plot_timeseries() |
| Analysis scripts | calcofi4r | cc_read_bottle(), cc_analyze_trend() |
| API endpoints | calcofi4r | cc_query_db(), cc_get_variables() |
7.3 New Functions to Implement
7.3.1 In calcofi4db:
# R/cloud.R - GCS operations
get_gcs_file(bucket, path, local_path = NULL)
put_gcs_file(local_path, bucket, path)
sync_gdrive_to_gcs(gdrive_path, gcs_bucket, backup = TRUE)
list_gcs_versions(bucket, path)
# R/parquet.R - Parquet operations
csv_to_parquet(csv_path, schema_def = NULL)
read_parquet_table(path, con = NULL)
write_parquet_table(data, path, partitions = NULL)
add_parquet_metadata(path, metadata_list)
# R/duckdb.R - DuckDB operations
get_duckdb_con(path = ":memory:", read_only = FALSE)
create_duckdb_views(con, manifest)
attach_ducklake(con, catalog_path)
set_duckdb_comments(con, table, comments)
# R/ducklake.R - DuckLake catalog operations [NEW]
get_working_ducklake() # connect to internal Working DuckLake
ingest_to_working(data, table, source_file, source_uuid_col = NULL)
add_provenance_columns(data, source_file, source_row_start = 1)
# R/freeze.R - Frozen release operations [NEW]
freeze_release(version, release_notes = NULL)
# 1. Validate data quality
# 2. Strip provenance columns (_source_*, _ingested_at)
# 3. Export to parquet files
# 4. Create DuckLake catalog
# 5. Upload to releases/{version}/
validate_for_release(con) # run all quality checks before freeze
list_frozen_releases() # list available frozen versions
compare_releases(v1, v2) # diff two releases
7.3.2 In calcofi4r:
# R/database.R - Frozen DuckLake access (user-facing)
cc_get_db(version = "latest", local_cache = TRUE)
# Connect to frozen release (never Working DuckLake)
cc_list_versions() # list available frozen releases
cc_db_info(version = "latest") # release date, row counts, notes
cc_release_notes(version) # view release notes for a version
# R/read.R - Convenience functions for common tables
cc_read_bottle(version = "latest", ...)
cc_read_larvae(version = "latest", ...)
cc_read_cast(version = "latest", ...)
# R/query.R - Query helpers
cc_query(sql, version = "latest")
cc_list_tables(version = "latest")
cc_describe_table(table, version = "latest")
8 Workflow Dependency Management with targets
8.1 Pipeline Definition
Create workflows/_targets.R:
library(targets)
library(tarchetypes)
tar_option_set(
packages = c("calcofi4db", "duckdb", "dplyr", "arrow"),
format = "qs" # Fast serialization
)
# Define pipeline
list(
# ─── Raw Data from GCS (public bucket, latest archive) ─────────
tar_target(
raw_bottle,
get_gcs_file("gs://calcofi-files-public", "calcofi.org/bottle-database/bottle.csv")
),
tar_target(
raw_cast,
get_gcs_file("gs://calcofi-files-public", "calcofi.org/bottle-database/cast.csv")
),
tar_target(
raw_larvae,
get_gcs_file("gs://calcofi-files-public", "swfsc.noaa.gov/calcofi-db/larva.csv")
),
# ─── Transform to Parquet ──────────────────────────────────────
tar_target(
pqt_bottle,
csv_to_parquet(raw_bottle, output = "parquet/bottle.parquet")
),
tar_target(
pqt_cast,
csv_to_parquet(raw_cast, output = "parquet/cast.parquet")
),
tar_target(
pqt_larvae,
csv_to_parquet(raw_larvae, output = "parquet/larvae.parquet")
),
# ─── Create Integrated DuckDB ──────────────────────────────────
tar_target(
db_manifest,
create_db_manifest(
tables = list(pqt_bottle, pqt_cast, pqt_larvae),
version = format(Sys.Date(), "%Y.%m.%d")
)
),
tar_target(
duckdb_file,
create_duckdb_from_manifest(db_manifest)
),
# ─── Publish Datasets ──────────────────────────────────────────
tar_target(
obis_archive,
create_obis_archive(duckdb_file, include_bottle = TRUE)
)
)
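With the pipeline defined, the standard targets commands run and inspect it; only targets whose upstream files or code changed are rebuilt:

```r
library(targets)
tar_make()        # build or rebuild outdated targets
tar_visnetwork()  # view the dependency graph interactively
tar_outdated()    # list targets that would rerun on the next tar_make()
```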
8.2 Workflow Lifecycle
```mermaid
flowchart TB
  subgraph dev["Development Cycle"]
    direction LR
    d1["1\. Explore"]
    d2["2\. Ingest"]
    d3["3\. Targets"]
    d4["4\. Run"]
    d5["5\. Publish"]
    d1 --> d2 --> d3 --> d4 --> d5
  end
  subgraph update["Update Cycle"]
    direction LR
    u1["1\. New data"]
    u2["2\. Sync"]
    u3["3\. Run"]
    u4["4\. Republish"]
    u1 --> u2 --> u3 --> u4
  end
  dev -.->|"formalized"| update
  style dev fill:#fff3e0,stroke:#ef6c00
  style update fill:#e8f4e8,stroke:#2e7d32
```
8.3 Dependency Graph
```mermaid
flowchart LR
  subgraph raw["📁 Raw CSV"]
    raw_bottle
    raw_cast
    raw_larvae
  end
  subgraph parquet["⚙️ Parquet"]
    pqt_bottle
    pqt_cast
    pqt_larvae
  end
  subgraph database["🗃️ Database"]
    db_manifest
    duckdb_file
  end
  subgraph publish["🌐 Publish"]
    obis_archive
  end
  raw_bottle --> pqt_bottle
  raw_cast --> pqt_cast
  raw_larvae --> pqt_larvae
  pqt_bottle --> db_manifest
  pqt_cast --> db_manifest
  pqt_larvae --> db_manifest
  db_manifest --> duckdb_file
  duckdb_file --> obis_archive
  style raw fill:#e3f2fd,stroke:#1565c0
  style parquet fill:#fff3e0,stroke:#ef6c00
  style database fill:#f3e5f5,stroke:#7b1fa2
  style publish fill:#fce4ec,stroke:#c2185b
```
9 Documentation Updates
9.1 Recommended Changes to docs/
9.1.1 docs/db.qmd - Database Documentation
Current: PostgreSQL-focused with naming conventions and ingestion strategy
Proposed changes:
- Add section on DuckDB as primary database
- Document Parquet file organization
- Explain DuckLake versioning and time travel
- Keep naming conventions (snake_case, *_id, *_uuid suffixes)
- Update “Integrated database ingestion strategy” for targets pipeline
9.1.2 docs/api.qmd - API Documentation
Current: REST API endpoints for PostgreSQL queries
Proposed changes:
- Note that REST API may be deprecated for most use cases
- Document direct DuckDB access via calcofi4r::cc_get_db()
- Keep API for web apps that cannot use DuckDB directly
- Consider lightweight API using DuckDB backend if needed
9.1.3 docs/portals.qmd - Portal Documentation
Current: Good overview of EDI, NCEI, OBIS, ERDDAP
Proposed changes:
- Update data flow diagram to show DuckDB → Portals
- Add section on automated publishing via targets pipeline
- Document versioning strategy for portal submissions
9.2 Where Documentation Should Live
| Content | Location | Rationale |
|---|---|---|
| Database schema | docs/db.qmd | User-facing documentation |
| Workflow development | workflows/README.md | Developer guide |
| Package functions | calcofi4r/, calcofi4db/ | pkgdown reference |
| Data dictionary | calcofi4db/inst/ | Versioned with code |
| Architecture decisions | docs/ or workflows/_docs/ | Long-term reference |
10 Implementation Priority
10.1 Phase 1: Foundation ✅ COMPLETE (2026-02-02)
Goal: Establish cloud infrastructure and helper functions
- Set up GCS buckets ✅
  - Created gs://calcofi-files-public/ with public read access
  - Created gs://calcofi-files-private/ with restricted access
  - Configured rclone access via OAuth
- Add cloud functions to calcofi4db ✅
  - Created R/cloud.R with get_gcs_file(), put_gcs_file()
  - Created R/parquet.R with csv_to_parquet()
  - Created R/duckdb.R with get_duckdb_con(), create_duckdb_views()
- Initial sync ✅
  - Synced Google Drive data-public/ to gs://calcofi-files-public/_sync/
  - Created initial archive snapshot: archive/2026-02-02_121557/
  - Generated manifest: manifests/manifest_2026-02-02_121557.json
  - Total: 112,457 files (33.5 GiB)
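The initial snapshot can be sanity-checked from R by reading the public manifest directly (fields as defined in Section 4.5):

```r
# read the latest manifest from the public bucket
m <- jsonlite::read_json(
  "https://storage.googleapis.com/calcofi-files-public/manifests/manifest_latest.json")
m$archive_timestamp  # "2026-02-02_121557" for the Phase 1 snapshot
m$total_files        # 112457
```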
10.2 Phase 2: Pipeline
Goal: Implement targets pipeline for priority datasets
- Create workflows/_targets.R
  - Define targets for bottle and larvae data
  - Establish Parquet transformation pipeline
  - Create integrated DuckDB
- Migrate ingestion workflows
  - Convert ingest_calcofi.org_bottle-database.qmd to a target
  - Convert ingest_swfsc.noaa.gov_calcofi-db.qmd to a target
- Update publish workflow
  - Modify publish_larvae_to_obis.qmd to read from DuckDB
  - Add bottle data to EMoF extension
10.3 Phase 3: Working DuckLake
Goal: Implement Working DuckLake with provenance tracking
- Set up Working DuckLake catalog
  - Configure DuckLake at gs://calcofi-db/ducklake/working/
  - Enable time travel queries
  - Create Git-tracked catalog manifest
- Add provenance columns to ingest workflows
  - _source_file: Path to source CSV in Archive
  - _source_row: Row number in source file
  - _source_uuid: Original record UUID (if available)
  - _ingested_at: Ingestion timestamp
- Implement calcofi4db DuckLake functions (usage sketched below)
  - get_working_ducklake(): Connect to internal database
  - ingest_to_working(): Ingest with provenance
  - add_provenance_columns(): Helper for provenance
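A minimal sketch of how these planned helpers might be used in an ingest workflow, using the signatures listed in Section 7.3.1 (the file path is illustrative, and the functions are not yet implemented):

```r
# resolve the archived source file and read it
larvae_url  <- get_archived_file("swfsc.noaa.gov/calcofi-db/larva.csv")  # latest snapshot
larvae_raw  <- readr::read_csv(larvae_url)
# attach provenance columns, then load into the Working DuckLake
larvae_prov <- add_provenance_columns(larvae_raw, source_file = larvae_url)
ingest_to_working(larvae_prov, table = "larvae", source_file = larvae_url)
```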
10.4 Phase 4: Frozen DuckLake
Goal: Implement Frozen DuckLake for public releases
- Create freeze workflow
  - validate_for_release(): Data quality checks
  - freeze_release(): Create immutable release
  - Strip provenance columns for clean public schema
  - Generate release notes
- Set up release structure
  - gs://calcofi-db/ducklake/releases/v2026.02/
  - Parquet files + DuckLake catalog
  - latest symlink to current release
- Update calcofi4r for frozen access
  - cc_get_db(version = "latest"): Default to frozen releases
  - cc_list_versions(): List available releases
  - cc_release_notes(): View release notes
- First public release: v2026.02
  - Bottle + Cast + Larvae tables
  - Announce to users
10.5 Phase 5: Documentation & Polish
Goal: Complete documentation and evaluate API
- Update docs/
  - Revise db.qmd for DuckDB/DuckLake
  - Document Working vs Frozen databases
  - Evaluate API necessity
  - Update portal documentation
- Clean up workflows/
  - Consolidate README.md with essentials
  - Archive deprecated exploration notebooks
  - Document workflow development process
  - Create “Getting Started” guide for data managers
11 Verification Plan
11.1 End-to-End Test
# 1. Sync test data from Google Drive to GCS
system("rclone sync gdrive:calcofi/data-public gcs:calcofi-files-public/_sync")
# 2. Create archive snapshot
timestamp <- format(Sys.time(), "%Y-%m-%d_%H%M%S")
system(glue::glue("rclone copy gcs:calcofi-files-public/_sync gcs:calcofi-files-public/archive/{timestamp}"))
# 3. Run targets pipeline
targets::tar_make()
# 4. Verify database
con <- calcofi4db::get_duckdb_con("calcofi.duckdb")
DBI::dbListTables(con)
DBI::dbGetQuery(con, "SELECT COUNT(*) FROM larvae")
# 5. Verify OBIS archive
obistools::check_fields(readr::read_csv("data/darwincore/larvae/occurrence.csv"))
11.2 Integration Tests
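The integration tests could take roughly this shape, assuming testthat and the targets defined in Section 8.1 (raw_larvae is assumed to resolve to a local CSV path):

```r
library(testthat)
test_that("larvae row count in DuckDB matches the ingested source CSV", {
  con   <- calcofi4db::get_duckdb_con("calcofi.duckdb", read_only = TRUE)
  n_db  <- DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM larvae")$n
  n_csv <- nrow(readr::read_csv(targets::tar_read(raw_larvae), show_col_types = FALSE))
  expect_equal(n_db, n_csv)
  DBI::dbDisconnect(con, shutdown = TRUE)
})
```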
12 Migration Strategy: PostgreSQL → DuckDB
12.1 Current PostgreSQL Dependencies
Based on server/docker-compose.yml, the following services depend on PostgreSQL:
| Service | Purpose | Migration Path |
|---|---|---|
| postgis | Spatial database | Keep for vector tiles (short-term) |
| pg_tileserv | Vector tile server | Replace with PMTiles |
| pg_rest | REST API (PostgREST) | Keep for web apps, add DuckDB option |
| plumber | R API | Update to use DuckDB |
| pgadmin | Database admin | Deprecate once migrated |
12.2 Vector Tiles: pg_tileserv → PMTiles
The current architecture uses pg_tileserv to serve vector tiles from PostGIS for Shiny mapping apps. The migration path:
```mermaid
flowchart LR
  subgraph current["Current Flow (PostgreSQL)"]
    direction LR
    pg["PostGIS"]
    ts["pg_tileserv"]
    lf["Leaflet"]
    pg --> ts --> lf
  end
  subgraph target["Target Flow (Cloud-native)"]
    direction LR
    geo["GeoParquet"]
    tc["tippecanoe"]
    pm["PMTiles"]
    mg["mapgl"]
    geo --> tc --> pm --> mg
  end
  current -.->|"migrate"| target
  style current fill:#ffebee,stroke:#c62828
  style target fill:#e8f4e8,stroke:#2e7d32
```
12.2.1 Implementation Steps
1. Generate PMTiles from PostGIS
# Export geometries from PostGIS
ogr2ogr -f GeoJSON stations.geojson \
  PG:"host=postgis user=admin dbname=gis" \
  -sql "SELECT * FROM stations"
# Convert to PMTiles using tippecanoe
tippecanoe -o stations.pmtiles \
  --minimum-zoom=0 --maximum-zoom=14 \
  --layer=stations \
  stations.geojson
# Upload to GCS
gsutil cp stations.pmtiles gs://calcofi-db/tiles/
2. Update Shiny Apps
# Old: pg_tileserv
leaflet() |>
  addTiles() |>
  addMapboxGL(
    style = list(
      sources = list(
        stations = list(
          type = "vector",
          tiles = list("https://tile.calcofi.io/public.stations/{z}/{x}/{y}.pbf")
        )
      )
    )
  )
# New: PMTiles with mapgl
library(mapgl)
mapboxgl() |>
  add_pmtiles_source(
    id = "stations",
    url = "https://storage.googleapis.com/calcofi-db/tiles/stations.pmtiles"
  ) |>
  add_layer(
    id = "stations-layer",
    type = "circle",
    source = "stations",
    source_layer = "stations"
  )
3. Add PMTiles generation to targets pipeline
# In _targets.R
tar_target(
  pmtiles_stations,
  create_pmtiles(
    geom_source = duckdb_file,
    query = "SELECT * FROM stations",
    output = "tiles/stations.pmtiles"
  )
)
12.3 Migration Roadmap
| Phase | Timeline | Actions |
|---|---|---|
| Phase 1 | Now | Add DuckDB as primary data access; keep PostgreSQL |
| Phase 2 | +2 months | Generate PMTiles for all spatial layers |
| Phase 3 | +4 months | Update Shiny apps to use mapgl + PMTiles |
| Phase 4 | +6 months | Deprecate pg_tileserv, reduce PostgreSQL to backup |
| Phase 5 | +12 months | Fully deprecate PostgreSQL (if no dependencies remain) |
13 Resolved Questions
Based on clarifications received:
- GCS Project: Use existing project ucsd-sio-calcofi
  - Existing bucket calcofi-db → DuckLake/database files
  - New bucket calcofi-files-public → Versioned public source files (public read access)
  - New bucket calcofi-files-private → Versioned private/sensitive files (restricted access)
- Google Drive Structure (implemented 2026-02-02):
  - calcofi/data-public/ → Public datasets (bottle, larvae, etc.)
  - calcofi/data-private/ → Private/sensitive data (eDNA, zooplankton samples)
- GCS Folder Structure (implemented 2026-02-02):
  - _sync/ → Working directory (rclone syncs here, mutable)
  - archive/{timestamp}/ → Immutable snapshots (complete copies, not deltas)
  - manifests/ → JSON metadata files with manifest_latest.json pointer
- API Strategy: Keep PostgreSQL + API for now with migration roadmap
  - pg_tileserv needed for current Shiny apps
  - Migrate to PMTiles over time using tippecanoe + mapgl
- Priority Data: Larvae + Bottle datasets for proof-of-concept
- Bottle in OBIS: Include as EMoF extension in larvae dataset
14 Remaining Questions
- DuckLake vs MotherDuck: Self-hosted DuckLake catalog or use MotherDuck service?
- Automation: GitHub Actions vs server cron for scheduled rclone syncs?
- PMTiles hosting: Serve from GCS directly or via CDN?
15 Appendix: Key References
15.1 Documentation
15.2 Technologies
Database & Analytics:
- DuckDB - Fast in-process analytical database
- DuckLake - Lakehouse catalog for DuckDB
- Frozen DuckLake - Immutable versioned releases
- Apache Parquet - Columnar storage format
Workflow & Orchestration:
- targets R package - Pipeline dependency management
Cloud Storage:
- rclone - Cloud storage sync tool
- googleCloudStorageR - R interface to GCS
- Data lakes and big data analytics - Google Cloud Storage
Vector Tiles:
- PMTiles - Cloud-native vector tile format
- tippecanoe - Build vector tilesets
- mapgl R package - Modern WebGL maps in R/Shiny
15.3 CalCOFI Resources
- calcofi.io/docs - CalCOFI documentation site
- calcofi.io/calcofi4r - User-facing R package
- calcofi.io/calcofi4db - Database admin R package
Footnotes
1. Technical term: This is a “versioned data lake” built on Google Cloud Storage with rclone synchronization.
2. DuckDB is a fast, in-process database that can query Parquet files directly from cloud storage without downloading them.
3. Technical terms: This is a “versioned data lake” implemented with Google Cloud Storage buckets and rclone synchronization with timestamped archive snapshots.