---
title: "CalCOFI Data Workflow Plan"
format:
  html:
    toc: true
    toc-depth: 3
    number-sections: true
    code-fold: true
    embed-resources: true
---
<style>
/* flowchart boxes */
.mermaid-js .flowchart-link { stroke: #333!important; }
/* gantt chart: make task text visible */
.mermaid-js .taskText,
.mermaid-js .taskTextOutsideRight,
.mermaid-js .taskTextOutsideLeft,
.mermaid-js [class^="taskText"] {
fill: #333!important;
}
</style>
<!--
PROMPT to Claude 2026-01-31:
Help me write a new a data workflow plan to manage and version raw data files, an integrated database and workflows to ingest raw data in and publish datasets out.
- Use documents in the @workflows/_docs folder as suggestions for technologies (with technological preference where in conflict given to the more recent files as given by their date prefix filename).
- We already have a server on the Google Cloud platform, so thinking of using Google Cloud Storage buckets and a new service account for managing that (rather than AWS S3).
- We already have workflows in @workflows, but there are not yet dependencies built between them. Nor is the sequence anywhere stated. For instance, to build the integrated database thus far requires running first @workflows/ingest_calcofi.org_bottle-database.qmd, then @workflows/ingest_calcofi.org_bottle-database.qmd (once completed). How can these be better organized and developed as a dataset is initially explored, then ingested, then updated?
- Consider using R package `targets` to manage dependencies between files and workflows, and `pins` for sharing and versioning files.
- Review and suggest revisions to the existing documentation under @docs/, especially on Database, API and Portals. Could the API and Postgres database be superceded by use of parquet files and DuckDB?
- Describe relevance, strategy and purpose of R packages @calcofi4r and @calcofi4db.
- How can we systematically reuse documented functions in R libraries across workflows, apps, database access, etc?
- For now, write this to @workflows/README_PLAN.md. Eventually, a more succinct version will probably be folded into @workflows/README.md, but also looking for suggestions, such as perhaps more appropriate in the R libraries @calcofi4r or @calcofi4db and/or @docs.
- Be sure to include a systematic plan for implementation, including what should be refactored first. Prioritize producing a tidy, clean version of the integrated database with ingestion from: @workflows/ingest_calcofi.org_bottle-database.qmd, @workflows/ingest_calcofi.org_bottle-database.qmd; and output to @workflows/publish_larvae_to_obis.qmdpublish_larvae_to_obis.qmd (which does not yet include the bottle data). Then get this example to work from storing source files in a versioned cloud storage bucket, to referencing these files in the ingest scripts (perhaps with helper functions to get latest cloud version file, given state of Google Drive), producing and uploading parquet files to cloud storage, registering new versions with a DuckLake, and producing helper functions to get latest (or past) versions of the database as relational duckdb with metadata as COMMENTS and foreign keys registered.
-->
> This document outlines a comprehensive strategy for managing and versioning CalCOFI data files, building an integrated database, and publishing datasets to external portals.
## Overview for Data Managers & Scientists {#sec-overview}
This section provides a high-level view of the CalCOFI data system. **You don't need to understand all the technical details**—just the four main components and how they connect.
### The Big Picture
```{mermaid}
%%| label: fig-overview
%%| fig-cap: "CalCOFI Data System Overview"
flowchart TB
subgraph drive["📁 THE DRIVE"]
direction LR
gd_pub["data-public/<br/>Bottle, Larvae, CTD..."]
gd_priv["data-private/<br/>eDNA, Zooplankton..."]
end
subgraph archive["🗄️ THE ARCHIVE"]
direction LR
sync["_sync/<br/>(working copy)"]
hist["archive/<br/>(snapshots)"]
end
subgraph workflows["⚙️ THE WORKFLOWS"]
direction LR
ingest["Ingest<br/>(CSV → DB)"]
publish["Publish<br/>(DB → Portals)"]
end
subgraph database["🗃️ THE DATABASE"]
direction LR
working["Working<br/>(internal)"]
frozen["Frozen<br/>(public releases)"]
end
subgraph outputs["🌐 OUTPUTS"]
direction LR
obis["OBIS"]
apps["Shiny Apps"]
api["API / R pkg"]
end
drive -->|"daily sync"| archive
archive -->|"read files"| workflows
workflows -->|"transform"| database
database -->|"publish"| outputs
style drive fill:#e8f4e8,stroke:#2e7d32
style archive fill:#e3f2fd,stroke:#1565c0
style workflows fill:#fff3e0,stroke:#ef6c00
style database fill:#f3e5f5,stroke:#7b1fa2
style outputs fill:#fce4ec,stroke:#c2185b
```
### Four Key Components
::: {.callout-note collapse="false"}
## 📁 The Drive
**What**: The shared Google Drive folder where you upload your data files.
**Location**: `My Drive/projects/calcofi/data-public/` (or `data-private/`)
**Who uses it**: Dataset managers (Rasmus, Ed, Nastassia, Linsey, etc.)
**How to use it**:
1. Create a folder for your data: `{provider}/{dataset}/` (e.g., `swfsc.noaa.gov/calcofi-db/`)
2. Inside, organize files into `raw/` (original data) and `derived/` (processed data)
3. Simply upload or overwrite files—versioning happens automatically
4. No special naming required (no date prefixes needed)
**Example structure**:
```
data-public/
├── calcofi.org/
│ └── bottle-database/
│ ├── raw/
│ │ └── bottle_export_2026-01-15.csv
│ └── derived/
│ └── bottle_cleaned.csv
├── swfsc.noaa.gov/
│ └── calcofi-db/
│ ├── raw/
│ │ ├── larva.csv
│ │ └── cruise.csv
│ └── derived/
```
:::
::: {.callout-note collapse="false"}
## 🗄️ The Archive
**What**: A versioned backup of all Drive files in the cloud[^datalake].
**Location**: Google Cloud Storage buckets (`gs://calcofi-files-public/` and `gs://calcofi-files-private/`)
**Who uses it**: Automated systems (you don't need to interact with it directly)
**What it does**:
- **Daily sync**: Copies all files from the Drive to the Archive
- **Snapshots**: Creates timestamped copies so we can always go back in time
- **Never loses data**: Even if you delete a file from Drive, it's preserved in the Archive
**Why it matters**: Scientists can reproduce analyses using the exact data from any past date.
[^datalake]: Technical term: This is a "versioned data lake" built on Google Cloud Storage with rclone synchronization.
:::
::: {.callout-note collapse="false"}
## ⚙️ The Workflows
**What**: Automated scripts that process data from the Archive into the Database.
**Location**: `/workflows/` folder (Quarto notebooks and R scripts)
**Who uses it**: Data engineers and developers
**Two types of workflows**:
1. **Ingest workflows** (`ingest_*.qmd`): Read CSV files → Clean/transform → Load into Database
2. **Publish workflows** (`publish_*.qmd`): Read from Database → Format for portals → Upload to OBIS, EDI, etc.
**Key feature**: Dependencies are managed automatically—if source data changes, all downstream steps re-run.
:::
::: {.callout-note collapse="false"}
## 🗃️ The Database
**What**: The integrated CalCOFI database combining all datasets into related tables.
**Format**: DuckDB[^duckdb] with Parquet files (modern, fast, serverless)
**Two versions**:
| Version | Purpose | Access | Features |
|---------|---------|--------|----------|
| **Working Database** | Internal processing | Workflows only | Full provenance tracking, UUIDs, source file references, time travel |
| **Frozen Database** | Public releases | Apps, API, R package | Clean tables, versioned releases (v2026.02, v2026.03...), optimized for queries |
**Why two versions?** The Working Database tracks every detail for reproducibility. The Frozen Database provides clean, stable snapshots for users who just want the data.
[^duckdb]: [DuckDB](https://duckdb.org/) is a fast, in-process database that can query Parquet files directly from cloud storage without downloading them.
:::
### How Data Flows
```{mermaid}
%%| label: fig-data-flow-simple
%%| fig-cap: "Simplified Data Flow"
flowchart TB
subgraph you["👤 YOU"]
upload["Upload CSV to Drive"]
end
subgraph auto["🤖 AUTOMATED"]
sync["Sync to Archive"]
snapshot["Create snapshot"]
transform["Run workflows"]
end
subgraph result["📊 RESULT"]
db["Updated Database"]
portals["Published to Portals"]
end
upload --> sync --> snapshot --> transform --> db --> portals
style you fill:#e8f4e8,stroke:#2e7d32
style auto fill:#fff3e0,stroke:#ef6c00
style result fill:#f3e5f5,stroke:#7b1fa2
```
**Your role as a dataset manager**: Upload files to the Drive. Everything else happens automatically.
---
## Executive Summary
This plan modernizes the CalCOFI data workflow architecture to:
1. Version raw data files in Google Cloud Storage (GCS)
2. Use DuckDB + Parquet as the primary integrated database format
3. Leverage `targets` for workflow dependency management
4. Use R packages (`calcofi4r`, `calcofi4db`) for reusable functions
5. Maintain reproducible pipelines from raw data to published datasets
---
## Current State Analysis
### Existing Infrastructure
| Component | Current State | Location |
|-----------|--------------|----------|
| **Workflows** | Individual Quarto notebooks | `/workflows/` |
| **Database** | PostgreSQL with dev/prod schemas | api.calcofi.io |
| **R Packages** | `calcofi4r` (user), `calcofi4db` (admin) | GitHub |
| **Data Sources** | CSV files on Google Drive | `~/My Drive/projects/calcofi/data-public` |
| **Documentation** | Quarto site | `/docs/` |
### Key Workflow Files
| Workflow | Purpose | Status |
|----------|---------|--------|
| `ingest_calcofi.org_bottle-database.qmd` | Ingest bottle/cast data | Partial |
| `ingest_swfsc.noaa.gov_calcofi-db.qmd` | Ingest NOAA ichthyoplankton data | Active |
| `publish_larvae_to_obis.qmd` | Publish larvae to OBIS | Active (needs bottle data) |
### Identified Gaps
1. **No dependency management**: Workflows run independently without declared dependencies
2. **No file versioning**: Source files updated in place without version history
3. **PostgreSQL overhead**: Requires server maintenance and API infrastructure
4. **Disconnected workflows**: No systematic connection between ingest and publish steps
5. **Manual sequencing**: Users must know to run bottle ingestion before larvae ingestion
---
## File Versioning Strategy
### The Two-World Problem
CalCOFI data originates from multiple sources managed by different individuals:
- **Rasmus**: Bottle/cast data, CTD profiles
- **Ed**: Ichthyoplankton (larvae, eggs) data
- **Nastassia**: eDNA data
- **Linsey**: Zooplankton samples
These dataset managers need a simple interface (Google Drive) to deposit and organize files, while the data engineering side needs version control, immutability, and reproducibility.
### Google Drive: The Human Interface
This is "**The Drive**" from @sec-overview. Dataset managers upload files here; everything else is automatic.
#### Folder Structure
**Recommended organization**: `{provider}/{dataset}/raw/` and `{provider}/{dataset}/derived/`
```
calcofi/data-public/ # Shared Google Drive folder (public data)
├── calcofi.org/
│ └── bottle-database/
│ ├── raw/ # Original exports from source system
│ │ └── bottle_export_2026-01.csv
│ └── derived/ # Cleaned/processed versions
│ ├── bottle.csv
│ ├── cast.csv
│ └── field_descriptions.csv
├── swfsc.noaa.gov/
│ └── calcofi-db/
│ ├── raw/ # Original NOAA database exports
│ │ └── calcofi_db_dump_2026-01.sql
│ └── derived/ # Extracted CSV tables
│ ├── cruise.csv
│ ├── egg.csv
│ ├── larva.csv
│ ├── net.csv
│ ├── ship.csv
│ ├── site.csv
│ ├── species.csv
│ └── tow.csv
├── coastwatch.pfeg.noaa.gov/
│ └── erdCalCOFIlrvsiz/
│ └── derived/
│ └── larvae_size.csv
└── _archive/ # Manual archive (optional)
└── 2025-01-15_bottle.csv # User-created backup
calcofi/data-private/ # Private/sensitive data (restricted access)
├── edna/ # eDNA samples (Nastassia)
│ ├── raw/
│ └── derived/
├── zooplankton/ # Zooplankton samples (Linsey)
│ ├── raw/
│ └── derived/
└── _archive/ # Manual archive (optional)
```
#### Google Drive Naming Conventions
- **Top-level folders**: `{provider}/` - the organization providing the data (e.g., `swfsc.noaa.gov`)
- **Dataset folders**: `{dataset}/` - specific dataset name (e.g., `calcofi-db`)
- **Subfolders** (recommended):
- `raw/` - Original data as received from source (database dumps, original CSVs)
- `derived/` - Processed/cleaned versions ready for ingestion
- **Files**: Original source names, no date prefixes required
- **Updates**: Simply overwrite existing files
- **No durable versioning in Drive**: Google Drive's built-in version history is limited and expires, so the Archive (not Drive) is the authoritative version history
#### Dataset Manager Workflow
1. Create folder structure: `{provider}/{dataset}/raw/` and `{provider}/{dataset}/derived/`
2. Upload original data to `raw/` (preserve original format)
3. Upload processed/cleaned data to `derived/` (CSV format preferred)
4. Simply overwrite files when updating—versioning happens automatically in the Archive
5. Optionally notify data team of significant changes
**See @sec-overview for the big picture of how your data flows through the system.**
### Google Cloud Storage: The Versioned Data Lake
This is "**The Archive**" from @sec-overview[^datalake-detail]. It automatically captures versioned snapshots of all files from the Drive.
[^datalake-detail]: Technical terms: This is a "versioned data lake" implemented with Google Cloud Storage buckets and rclone synchronization with timestamped archive snapshots.
#### Bucket Structure
```
gs://calcofi-files-public/ # Versioned public source files
├── _sync/ # Working directory (rclone syncs here)
│ ├── calcofi.org/
│ │ └── bottle-database/
│ │ ├── bottle.csv
│ │ └── cast.csv
│ └── swfsc.noaa.gov/
│ └── calcofi-db/
│ ├── cruise.csv
│ ├── egg.csv
│ └── ...
├── archive/ # Timestamped immutable snapshots
│ ├── 2026-02-02_121557/ # YYYY-MM-DD_HHMMSS
│ │ └── calcofi.org/
│ │ └── bottle-database/
│ │ └── bottle.csv # Snapshot at that time
│ ├── 2026-02-15_120000/
│ │ └── swfsc.noaa.gov/
│ │ └── calcofi-db/
│ │ └── larva.csv
│ └── ...
└── manifests/ # Version metadata
├── manifest_2026-02-02_121557.json
├── manifest_2026-02-15_120000.json
└── manifest_latest.json # Points to latest archive
gs://calcofi-files-private/ # Versioned private source files
├── _sync/ # Working directory (rclone syncs here)
│ ├── edna/
│ └── zooplankton/
├── archive/ # Timestamped immutable snapshots
└── manifests/
gs://calcofi-db/ # Integrated database
├── ingest/ # Per-workflow outputs (intermediate)
│ ├── swfsc.noaa.gov_calcofi-db/ # Flattened {provider}_{dataset}
│ │ ├── cruise.parquet
│ │ ├── ship.parquet
│ │ ├── site.parquet
│ │ ├── ...
│ │ ├── manifest.json # Provenance for this workflow
│ │ └── metadata.json # Table & column metadata sidecar
│ └── calcofi.org_bottle-database/
│ ├── cast.parquet
│ ├── bottle.parquet
│ ├── manifest.json
│ └── metadata.json
│
└── ducklake/ # DuckLake manages parquet + catalog
├── working/ # Integrated database (combines ingests)
│ ├── catalog/ # DuckLake catalog metadata
│ └── data/ # Parquet files with provenance columns
│ ├── cruise/
│ ├── site/
│ └── ...
└── releases/ # Frozen public releases
├── v2026.02/
│ ├── catalog/
│ └── data/ # Provenance columns stripped
        └── latest -> v2026.02        # Pointer to current release (GCS has no true symlinks; implemented as a copy or redirect object)
```
### Workflow Metadata
Redefinition files (column/table renaming rules) live in `workflows/metadata/`, NOT in the calcofi4db R package. This decouples workflow-specific metadata from the package code.
```
workflows/metadata/
├── swfsc.noaa.gov/
│ └── calcofi-db/
│ ├── tbls_redefine.csv # Table renaming rules
│ ├── flds_redefine.csv # Field renaming/typing rules
│ ├── tbls_raw.csv # Auto-generated raw table info
│ └── flds_raw.csv # Auto-generated raw field info
└── calcofi.org/
└── bottle-database/
├── flds_rename.csv
└── ship_renames.csv
```
**Benefits of this structure:**
- Version-controlled alongside workflows (not buried in R package)
- Easy to edit without rebuilding the package
- Clear ownership per dataset
- Allows different projects to use calcofi4db without conflicting metadata
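
As an illustration, a `tbls_redefine.csv` might map raw source table names to tidy database names (the column names and values here are assumptions, not the actual schema):

```
tbl_old,tbl_new,description
BottleDataBase,bottle,Measurements per bottle depth
CastDataBase,cast,One row per CTD/bottle cast
```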
#### GCS Naming Conventions
- **_sync/**: Working directory, exact mirror of Google Drive structure (rclone syncs here)
- **archive/{YYYY-MM-DD_HHMMSS}/**: Immutable snapshot of complete data at that timestamp
- **manifests/manifest_{YYYY-MM-DD_HHMMSS}.json**: Snapshot metadata
- **manifests/manifest_latest.json**: Points to latest archive snapshot
### Rclone Sync: Capturing Daily Changes
#### Installation
**macOS:**
```bash
brew install rclone
```
**Linux (Ubuntu/Debian):**
```bash
curl https://rclone.org/install.sh | sudo bash
```
#### Configuration
Run `rclone config` and create two remotes:
**Remote 1: Google Drive (`gdrive`)**
```
n) New remote
name> gdrive
Storage> drive
client_id> # leave blank
client_secret> # leave blank
scope> 1 # Full access (or 2 for read-only)
service_account_file> # leave blank for OAuth, or path to JSON key
Edit advanced config> n
Use auto config> y # opens browser (use 'n' on headless server)
Configure this as a Shared Drive> n
```
<!---
Ben's laptop output on 2026-02-01:
Configuration complete.
Options:
- type: drive
- scope: drive
- token: {"access_token":"...","expiry":"2026-02-01T01:40:32.653983-06:00","expires_in":3599}
- team_drive:
Keep this "gdrive-ecoquants" remote?
y) Yes this is OK (default)
e) Edit this remote
d) Delete this remote
y/e/d> y
Current remotes:
Name Type
==== ====
gcs-calcofi google cloud storage
gdrive-ecoquants drive
-->
**Remote 2: Google Cloud Storage (`gcs`)**
```
n) New remote
name> gcs
Storage> google cloud storage
client_id> # leave blank
client_secret> # leave blank
project_number> ucsd-sio-calcofi
service_account_file> # leave blank for OAuth, or path to JSON key
Edit advanced config> n
Use auto config> y # opens browser (use 'n' on headless server)
```
<!--
Ben's laptop output on 2026-02-01:
Use web browser to automatically authenticate rclone with remote?
* Say Y if the machine running rclone has a web browser you can use
* Say N if running rclone on a (remote) machine without web browser access
If not sure try Y. If Y failed, try N.
y) Yes (default)
n) No
y/n>
2026/02/01 00:29:35 NOTICE: Make sure your Redirect URL is set to "http://127.0.0.1:53682/" in your custom config.
2026/02/01 00:29:35 NOTICE: If your browser doesn't open automatically go to the following link: http://127.0.0.1:53682/auth?state=paZTK8N-GsB5j3kJDEq5SA
2026/02/01 00:29:35 NOTICE: Log in and authorize rclone for access
2026/02/01 00:29:35 NOTICE: Waiting for code...
2026/02/01 00:29:48 NOTICE: Got code
Configuration complete.
Options:
- type: google cloud storage
- project_number: ucsd-sio-calcofi
- object_acl: publicRead
- bucket_acl: publicRead
- bucket_policy_only: true
- token: {"access_token":"...","expiry":"2026-02-01T01:29:48.100866-06:00","expires_in":3599}
Keep this "gcs-calcofi" remote?
y) Yes this is OK (default)
e) Edit this remote
d) Delete this remote
y/e/d> y
Current remotes:
Name Type
==== ====
gcs-calcofi google cloud storage
gdrive drive
-->
#### Headless Server Setup
For Linux servers without a browser, use `rclone authorize` on a local machine:
```bash
# on local machine with browser
rclone authorize "drive"
# copy the token output
# on server, during rclone config
Use auto config> n
# paste the token when prompted
```
Similarly for GCS:
```bash
# on local machine
rclone authorize "google cloud storage"
```
#### Service Account (Optional)
For automated/scheduled syncs, use a GCP service account instead of OAuth:
1. Create service account in GCP Console → IAM → Service Accounts
2. Grant roles: `Storage Object Admin` for `calcofi-files-public`, `calcofi-files-private`, and `calcofi-db` buckets
3. Download JSON key file to secure location (e.g., `/etc/rclone/calcofi-sa.json`)
4. In `rclone config`, provide path to JSON key instead of using OAuth
#### Verify Configuration
```bash
# test Google Drive access
rclone lsd gdrive:projects/calcofi/data-public # OR
rclone lsd gdrive-ecoquants:projects/calcofi/data-public
# test GCS access
rclone ls gcs:calcofi-files-public --max-depth 1 # OR
rclone ls gcs-calcofi:calcofi-files-public --max-depth 1
# test sync (dry-run first!)
rclone sync gdrive:calcofi/data-public gcs:calcofi-files-public/_sync --dry-run -v
rclone sync gdrive-ecoquants:projects/calcofi/data-public gcs-calcofi:calcofi-files-public/_sync --dry-run -v
```
#### How It Works
```bash
#!/bin/bash
# sync_gdrive_to_gcs.sh - Run daily via cron
# Uses the rclone remotes configured above: gdrive (Google Drive), gcs (Cloud Storage).
# Note: rclone addresses buckets as gcs:bucket/path, not gs://bucket/path.
TIMESTAMP=$(date +%Y-%m-%d_%H%M%S)
LOG_FILE="/var/log/rclone/sync_${TIMESTAMP}.log"
# --- PUBLIC DATA ---
# Step 1: Sync Google Drive to _sync/ working directory
rclone sync gdrive:calcofi/data-public gcs:calcofi-files-public/_sync \
  --checksum \
  --drive-export-formats csv \
  --log-file "${LOG_FILE}" \
  --log-level INFO
# Step 2: Create immutable archive snapshot
rclone copy gcs:calcofi-files-public/_sync gcs:calcofi-files-public/archive/${TIMESTAMP}
# Step 3: Generate and upload manifest
rclone lsjson gcs:calcofi-files-public/archive/${TIMESTAMP} --recursive \
  > /tmp/manifest_${TIMESTAMP}.json
rclone copy /tmp/manifest_${TIMESTAMP}.json gcs:calcofi-files-public/manifests/
rclone copyto /tmp/manifest_${TIMESTAMP}.json gcs:calcofi-files-public/manifests/manifest_latest.json
# --- PRIVATE DATA ---
# Repeat for private bucket (separate log file)
rclone sync gdrive:calcofi/data-private gcs:calcofi-files-private/_sync \
  --checksum \
  --drive-export-formats csv \
  --log-file "${LOG_FILE%.log}_private.log" \
  --log-level INFO
rclone copy gcs:calcofi-files-private/_sync gcs:calcofi-files-private/archive/${TIMESTAMP}
```
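
The script above can be scheduled with cron. An illustrative crontab entry (the user, script path, and log path are assumptions):

```
# /etc/cron.d/calcofi-sync — run daily at midnight UTC
0 0 * * * calcofi /usr/local/bin/sync_gdrive_to_gcs.sh >> /var/log/rclone/cron.log 2>&1
```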
#### Key Behaviors
| Scenario | Google Drive Action | Rclone Behavior | Result in GCS |
|----------|--------------------|-----------------|----|
| New file | User uploads `new.csv` | `rclone sync` copies to `_sync/` | `_sync/new.csv` created, next archive includes it |
| Updated file | User overwrites `bottle.csv` | `rclone sync` updates `_sync/` | `_sync/bottle.csv` updated, archive snapshots preserve history |
| Deleted file | User deletes `old.csv` | `rclone sync` removes from `_sync/` | Removed from `_sync/`, preserved in prior archives |
| Renamed file | User renames `a.csv` → `b.csv` | Treated as delete + create | `a.csv` gone from `_sync/`, `b.csv` created |
| No changes | Nothing | `--checksum` skips identical files | No new sync needed |
#### The Two-Step Archive Strategy
The archiving approach uses two steps for reliability:
1. **Sync to `_sync/`**: rclone syncs Google Drive to the `_sync/` working directory
2. **Copy to `archive/{timestamp}/`**: Create an immutable snapshot of the current state
Benefits:
- **Immutable archives**: Each archive folder is a complete snapshot, not just deltas
- **Reproducibility**: Any archive folder contains everything needed to recreate that state
- **No data loss**: If sync fails, previous archives remain intact
- **Public access**: `calcofi-files-public` bucket has public read access for easy data sharing
### Manifest Files
Each manifest captures the complete state of the archive at a point in time:
```json
{
"generated_at": "2026-02-02T12:15:57Z",
"archive_timestamp": "2026-02-02_121557",
"bucket": "gs://calcofi-files-public",
"archive_path": "archive/2026-02-02_121557",
"files": [
{
"Path": "calcofi.org/bottle-database/bottle.csv",
"Size": 45234567,
"ModTime": "2026-02-02T10:30:00Z",
"MimeType": "text/csv",
"gcs_url": "gs://calcofi-files-public/archive/2026-02-02_121557/calcofi.org/bottle-database/bottle.csv"
},
{
"Path": "swfsc.noaa.gov/calcofi-db/larva.csv",
"Size": 12345678,
"ModTime": "2026-02-01T08:00:00Z",
"MimeType": "text/csv",
"gcs_url": "gs://calcofi-files-public/archive/2026-02-02_121557/swfsc.noaa.gov/calcofi-db/larva.csv"
}
],
"total_files": 112457,
"total_size_bytes": 35987654321
}
```
### Reconstructing Historical State
To recreate the data lake as it existed on a specific date:
```r
# R function to get the public HTTPS URL of a file from a specific archive snapshot
library(glue)
library(jsonlite)

get_archived_file <- function(path, timestamp = "latest", bucket = "calcofi-files-public") {
  # manifests in the public bucket are world-readable, so fetch over HTTPS
  manifest_name <- if (timestamp == "latest")
    "manifest_latest.json"
  else
    glue("manifest_{timestamp}.json")
  manifest_url <- glue("https://storage.googleapis.com/{bucket}/manifests/{manifest_name}")
  manifest     <- fromJSON(manifest_url, simplifyVector = FALSE)
  # construct the public HTTPS URL for the archived file
  glue("https://storage.googleapis.com/{bucket}/archive/{manifest$archive_timestamp}/{path}")
}

# example: the bottle CSV as of the latest snapshot
# get_archived_file("calcofi.org/bottle-database/bottle.csv")
```
### Automation Schedule
```{mermaid}
%%| label: fig-schedule
%%| fig-cap: "Daily Sync Schedule (UTC)"
%%---
%%displayMode: compact
%%---
%%{init: {'theme': 'default', 'themeVariables': { 'textColor': '#333'}}}%%
gantt
title Daily Sync Schedule (UTC)
dateFormat HH:mm
axisFormat %H:%M
section Sync
rclone :a1, 00:00, 10m
archive :a2, after a1, 5m
manifest :a3, after a2, 5m
section Pipeline
targets :b1, after a3, 10m
parquet :b2, after b1, 5m
duckdb :b3, after b2, 5m
section Notify
notify :c1, after b3, 5m
```
The daily automation runs at midnight UTC and completes in under an hour:
1. **Sync** phase: `rclone` copies new/changed files from Google Drive to the `_sync/` folder in GCS, then creates an immutable `archive/` snapshot and generates a manifest file documenting the state.
2. **Pipeline** phase: The `targets` pipeline detects which source files changed and re-runs only the affected steps—transforming CSVs to Parquet format and updating the DuckDB database.
3. **Notify** phase: Sends notifications (email/Slack) summarizing what changed and any errors encountered.
If no files changed in Google Drive, the pipeline completes quickly with minimal processing.
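
The Pipeline phase can be sketched as a minimal `_targets.R`. The helper functions (`csv_to_parquet()`, `register_with_ducklake()`, `publish_to_obis()`) are illustrative placeholders for functions that would live in `calcofi4db`; only `get_archived_file()` is defined elsewhere in this plan:

```r
# _targets.R — minimal sketch of the proposed pipeline (target and helper names illustrative)
library(targets)
tar_source()  # load helper functions from R/

list(
  tar_target(
    bottle_csv,
    get_archived_file("calcofi.org/bottle-database/bottle.csv"),
    format = "url"),  # "url" format invalidates downstream targets when the file changes
  tar_target(bottle_parquet, csv_to_parquet(bottle_csv)),
  tar_target(db_snapshot,    register_with_ducklake(bottle_parquet)),
  tar_target(obis_dataset,   publish_to_obis(db_snapshot))
)
```

Running `targets::tar_make()` then rebuilds only the targets whose upstream inputs changed, which is exactly the "detects which source files changed" behavior described above.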
---
## Proposed Architecture
### Data Flow Diagram
```{mermaid}
%%| label: fig-data-flow
%%| fig-cap: "CalCOFI Data Flow Architecture"
flowchart TB
subgraph drive["📁 THE DRIVE"]
direction LR
gdrive_pub["data-public/"]
gdrive_priv["data-private/"]
end
subgraph archive["🗄️ THE ARCHIVE"]
direction LR
bucket_pub["calcofi-files-public"]
bucket_priv["calcofi-files-private"]
end
subgraph database["🗃️ THE DATABASE"]
direction LR
working["Working DuckLake"]
frozen["Frozen Releases"]
end
subgraph outputs["🌐 OUTPUTS"]
direction LR
obis["OBIS"]
edi["EDI"]
erddap["ERDDAP"]
apps["Apps/API"]
end
drive -->|"rclone sync"| archive
archive -->|"targets pipeline"| database
database -->|"publish workflows"| outputs
style drive fill:#e8f4e8,stroke:#2e7d32
style archive fill:#e3f2fd,stroke:#1565c0
style database fill:#f3e5f5,stroke:#7b1fa2
style outputs fill:#fce4ec,stroke:#c2185b
```
### Technology Stack
| Layer | Technology | Purpose | Replaces |
|-------|------------|---------|----------|
| **Storage** | | | |
| | `gs://calcofi-files-public` | Versioned public source CSV files | Google Drive only |
| | `gs://calcofi-files-private` | Versioned private/sensitive files | Google Drive only |
| | `gs://calcofi-db` | Parquet, DuckDB, PMTiles | - |
| **Sync** | rclone | Drive → GCS with archive snapshots | Manual copying |
| **Format** | Apache Parquet | Efficient columnar storage | CSV |
| **Compute** | DuckDB | Serverless SQL queries | PostgreSQL (gradual) |
| **Database** | | | |
| | Working DuckLake | Internal DB with provenance, time travel | PostgreSQL |
| | [Frozen DuckLake](https://ducklake.select/2025/10/24/frozen-ducklake/) | Public versioned releases (v2026.02, etc.) | PostgreSQL snapshots |
| **Orchestration** | `targets` | Dependency management, caching | Manual sequencing |
| **Tiles** | PMTiles + tippecanoe | Cloud-native vector tiles | pg_tileserv |
| **Mapping** | mapgl | Modern WebGL maps in Shiny | leaflet + pg_tileserv |
| **Functions** | `calcofi4r`, `calcofi4db` | Reusable R code | Ad-hoc scripts |
**GCP Project**: `ucsd-sio-calcofi`
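
To illustrate the Compute layer, the R sketch below queries a Parquet file in `gs://calcofi-db` directly over HTTPS via DuckDB's `httpfs` extension, with nothing downloaded or installed server-side. The object path is illustrative, following the bucket layout proposed above:

```r
# Sketch: serverless query of a cloud-hosted Parquet file (object path is an assumption)
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
dbExecute(con, "INSTALL httpfs; LOAD httpfs;")
larva <- dbGetQuery(con, "
  SELECT *
  FROM read_parquet(
    'https://storage.googleapis.com/calcofi-db/ingest/swfsc.noaa.gov_calcofi-db/larva.parquet')
  LIMIT 10")
dbDisconnect(con, shutdown = TRUE)
```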
---
## Database Naming Conventions {#sec-naming}
### Table Names
- **Table names are singular** (e.g., `cruise`, `site`, `species`, `lookup` not `cruises`, `sites`, `lookups`)
- Table names use all lower case with underscores for multi-word names (snake_case)
### Column Names
- Use **snake_case** (lower-case with underscores) for all column names
- Unique identifiers are suffixed with:
- `*_id` for integer keys (sequential or from source)
- `*_key` for string/text keys (natural keys)
- `*_uuid` for universally unique identifiers (provenance tracking only)
- Suffix with units where applicable (e.g., `length_mm`, `depth_m`, `temp_degc`)
---
## Primary Key Strategy {#sec-primary-keys}
### General Principle: Natural Keys Over Surrogate Keys
**Prefer natural keys** (meaningful identifiers from the domain) over surrogate keys (arbitrary integers) unless:
- No natural key exists
- Performance requires integer keys for large tables with frequent joins
- The natural key is unstable or may change
**Benefits of natural keys:**
- Self-documenting (cruise_key `2401NH` tells you January 2024, New Horizon)
- No lookup needed to understand relationships
- Consistent across systems (source data uses same identifiers)
- Easier debugging and data validation
### Natural Keys (use as PRIMARY KEY)
| Table | Primary Key | Format | Example |
|-------|-------------|--------|---------|
| `cruise` | `cruise_key` | TEXT `YYMMKK` | `'2401NH'` |
| `ship` | `ship_key` | TEXT (2-letter) | `'NH'` |
| `tow_type` | `tow_type_key` | TEXT | `'CB'` |
| `species` | `species_id` | INTEGER | `1234` (natural from source) |
### Sequential Integer Keys (with explicit sort order)
For tables without natural keys, use sequential integers **with deterministic sort order** for reproducibility when appending new data:
| Table | Primary Key | Sort Order for ID Assignment |
|-------|-------------|------------------------------|
| `site` | `site_id` | `ORDER BY cruise_key, orderocc` |
| `tow` | `tow_id` | `ORDER BY site_id, time_start` |
| `net` | `net_id` | `ORDER BY tow_id, side` |
| `ichthyo` | `ichthyo_id` | `ORDER BY net_id, species_id, life_stage, measurement_type, measurement_value` |
| `lookup` | `lookup_id` | `ORDER BY lookup_type, lookup_num` |
**Why sort order matters:** When new cruises are added later, rows get appended. Without deterministic ordering, the same data could get different IDs on different runs, breaking reproducibility.
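
A minimal sketch of deterministic ID assignment using DuckDB's `row_number()` over the declared sort order (the sample rows are invented for illustration):

```r
# Sketch: reproducible site_id assignment via deterministic sort order
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
dbExecute(con, "
  CREATE TABLE site_raw AS
  SELECT * FROM (VALUES ('2401NH', 2), ('2401NH', 1), ('2312SH', 1))
    AS t(cruise_key, orderocc)")
site <- dbGetQuery(con, "
  SELECT row_number() OVER (ORDER BY cruise_key, orderocc) AS site_id, *
  FROM site_raw
  ORDER BY site_id")
dbDisconnect(con, shutdown = TRUE)
```

Because the window ordering matches the table's declared sort order, re-running the ingest on the same inputs always yields identical IDs, and appended cruises receive new IDs without renumbering existing rows.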
### Foreign Key Relationships
```
ship.ship_key (PK)
↓
cruise.cruise_key (PK) ←── cruise.ship_key (FK)
↓
site.site_id (PK) ←── site.cruise_key (FK)
↓
tow.tow_id (PK) ←── tow.site_id (FK)
↓ tow.tow_type_key (FK) → tow_type.tow_type_key
net.net_id (PK) ←── net.tow_id (FK)
↓
ichthyo.ichthyo_id (PK) ←── ichthyo.net_id (FK)
ichthyo.species_id (FK) → species.species_id
--- Bottle chain (linked via cruise bridge) ---
casts.cast_id (PK) ←── casts.cruise_key (FK, nullable) → cruise.cruise_key
↓ casts.grid_key (FK) → grid.grid_key
↓ casts.ship_key (FK, nullable) → ship.ship_key
bottle.bottle_id (PK) ←── bottle.cast_id (FK)
↓
bottle_measurement ←── bottle_measurement.bottle_id (FK)
bottle_measurement.measurement_type_id → measurement_type
--- Taxonomy ---
species.species_id (PK) species.worms_id → taxon (WoRMS authority)
species.itis_id → taxon (ITIS authority)
taxon.taxonID (PK per authority)
taxa_rank.taxonRank (PK)
--- Spatial ---
grid.grid_key (PK) ←── canonical from swfsc (not in bottle export)
segment (reference table)
```
**Note**: `grid.parquet` is only exported from the swfsc workflow (canonical source). The bottle workflow no longer exports grid.
### UUID Columns
- **Remove** `cruise_uuid`, `site_uuid`, `tow_uuid`, `net_uuid` from output tables
- **Retain** `_source_uuid` in Working DuckLake for provenance (stripped in frozen release)
---
## Database Architecture: Working vs Frozen DuckLake {#sec-database}
The CalCOFI integrated database uses [DuckLake](https://ducklake.select/), a lakehouse catalog that provides version control for data tables. We maintain **two distinct databases** to serve different needs:
### The Two-Database Strategy
```{mermaid}
%%| label: fig-ducklake-architecture
%%| fig-cap: "Working vs Frozen DuckLake Architecture"
flowchart TB
subgraph archive["📁 Archive"]
csv["CSV source files"]
end
subgraph working["🔧 WORKING DUCKLAKE"]
direction LR
w_tables["Provenance columns"]
w_history["Time travel"]
w_access["Internal only"]
end
subgraph frozen["❄️ FROZEN DUCKLAKE"]
direction LR
f_versions["v2026.02, v2026.03..."]
f_clean["Clean tables"]
f_access["Public access"]
end
subgraph outputs["🌐 Public Access"]
direction LR
apps["Shiny Apps"]
api["REST API"]
rpkg["calcofi4r"]
end
csv -->|"ingest workflows"| working
working -->|"freeze release"| frozen
frozen --> outputs
style archive fill:#e3f2fd,stroke:#1565c0
style working fill:#fff3e0,stroke:#ef6c00
style frozen fill:#e8f4e8,stroke:#2e7d32
style outputs fill:#fce4ec,stroke:#c2185b
```
### Working DuckLake (Internal)
The Working DuckLake is the **internal, mutable database** used by workflows. It prioritizes:
- **Full provenance tracking**: Every row knows where it came from
- **Time travel**: Query the database as it existed at any past point
- **Frequent updates**: Can be modified whenever new data arrives
**Schema features**:
| Column | Type | Purpose |
|--------|------|---------|
| `_source_file` | VARCHAR | Path to original CSV file in Archive |
| `_source_row` | INTEGER | Row number in source file |
| `_source_uuid` | UUID | Original record ID (e.g., from NOAA database) |
| `_ingested_at` | TIMESTAMP | When this row was added/updated |
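Stamping these columns onto incoming rows is mechanical; a sketch of the planned `add_provenance_columns()` behavior (Python for illustration; `_source_uuid` handling omitted since it is optional per source):

```python
from datetime import datetime, timezone


def add_provenance(rows, source_file, source_row_start=1):
    """Stamp each row with the provenance columns described above."""
    now = datetime.now(timezone.utc).isoformat()
    return [
        {**row,
         "_source_file": source_file,
         "_source_row": source_row_start + i,
         "_ingested_at": now}
        for i, row in enumerate(rows)
    ]


stamped = add_provenance(
    [{"cast_id": "C1"}, {"cast_id": "C2"}],
    source_file="swfsc.noaa.gov/calcofi-db/cast.csv")
# stamped[1] carries _source_file, _source_row = 2, and _ingested_at
```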
**Example query with provenance**:
```sql
-- Find all larvae records from a specific source file
SELECT scientific_name, count, _source_file, _source_row
FROM larvae
WHERE _source_file LIKE '%swfsc.noaa.gov/calcofi-db/larva.csv'
AND cruise_id = '202301CC';
```
**Time travel query**:
```sql
-- Query larvae table as it existed on January 15th
SELECT * FROM larvae
  AT (TIMESTAMP => '2026-01-15 00:00:00');
```
### Frozen DuckLake (Public Releases) {#sec-frozen}
The Frozen DuckLake provides **stable, versioned snapshots** for external users. Based on the [Frozen DuckLake](https://ducklake.select/2025/10/24/frozen-ducklake/) pattern:
**Key characteristics**:
- **Immutable**: Once released, a version never changes
- **Clean schema**: No internal provenance columns (cleaner for users)
- **Calendar versioning**: `v{YYYY}.{MM}` format (e.g., `v2026.02`)
- **Public access**: Anyone can query via HTTP/Parquet
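Tags in the `v{YYYY}.{MM}` scheme (with an optional patch component) sort naturally once parsed; a sketch of how a client could resolve the `latest` pointer locally (helper names are illustrative, not part of `calcofi4r`):

```python
def parse_version(tag):
    """Parse a release tag like 'v2026.02' or 'v2026.02.1' into a sortable tuple."""
    parts = tag.lstrip("v").split(".")
    # Pad with zeros so 'v2026.02' and 'v2026.02.1' compare correctly
    return tuple(int(p) for p in parts) + (0,) * (3 - len(parts))


def latest_version(tags):
    """Resolve 'latest' by calendar order; patch releases sort after their base."""
    return max(tags, key=parse_version)


latest_version(["v2026.02", "v2026.03", "v2026.02.1"])
# → 'v2026.03'
```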
**Release process**:
```{mermaid}
%%| label: fig-freeze-process
%%| fig-cap: "Freezing a Release"
flowchart LR
working["Working DuckLake"]
validate["Validate data quality"]
strip["Remove provenance columns"]
version["Tag version (v2026.02)"]
upload["Upload to releases/"]
announce["Announce release"]
working --> validate --> strip --> version --> upload --> announce
style working fill:#fff3e0,stroke:#ef6c00
style validate fill:#fff9c4,stroke:#f9a825
style strip fill:#fff9c4,stroke:#f9a825
style version fill:#fff9c4,stroke:#f9a825
style upload fill:#e8f4e8,stroke:#2e7d32
style announce fill:#e8f4e8,stroke:#2e7d32
```
**Frozen release structure**:
```
gs://calcofi-db/ducklake/releases/
├── v2026.02/
│ ├── catalog.json # DuckLake catalog
│ ├── parquet/
│ │ ├── bottle.parquet
│ │ ├── cast.parquet
│ │ ├── larvae.parquet
│ │ └── ...
│ └── RELEASE_NOTES.md
├── v2026.03/
│ └── ...
└── latest -> v2026.03        # Pointer object to latest release (GCS has no true symlinks)
```
### Accessing Frozen Releases
**From R (using calcofi4r)**:
```r
library(calcofi4r)
# Get latest release (default)
con <- cc_get_db()
# Get specific version
con <- cc_get_db(version = "v2026.02")
# List available versions
cc_list_versions()
#> [1] "v2026.02" "v2026.03" "v2026.04"
```
**Direct Parquet access (no R required)**:
```python
import duckdb
# Query directly from GCS (no download needed)
con = duckdb.connect()
rows = con.execute("""
SELECT * FROM read_parquet(
'https://storage.googleapis.com/calcofi-db/ducklake/releases/latest/parquet/larvae.parquet'
)
LIMIT 10
""").fetchall()
```
**From any HTTP client**:
```bash
# Download parquet file directly
curl -O https://storage.googleapis.com/calcofi-db/ducklake/releases/v2026.02/parquet/larvae.parquet
```
### Release Schedule
| Frequency | Trigger | Example |
|-----------|---------|---------|
| **Monthly** | Regular data updates | v2026.02, v2026.03 |
| **Ad-hoc** | Major dataset additions | v2026.02.1 (patch) |
| **Annual** | Year-end comprehensive release | v2026.12 |
**Release checklist**:
- [ ] All ingest workflows pass
- [ ] Data quality checks pass (no nulls in required fields, valid ranges)
- [ ] Foreign key relationships valid
- [ ] Row counts within expected range
- [ ] Release notes document changes
- [ ] Announce to users
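The first two checklist items can be mechanized; a minimal sketch of the kind of checks `validate_for_release()` might run (field names and ranges are illustrative):

```python
def check_release(rows,
                  required=("cruise_key", "latitude"),
                  ranges={"latitude": (-90, 90)}):
    """Minimal pre-freeze quality checks: required fields non-null, values in range."""
    problems = []
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) is None:
                problems.append(f"row {i}: null in required field '{col}'")
        for col, (lo, hi) in ranges.items():
            v = row.get(col)
            if v is not None and not (lo <= v <= hi):
                problems.append(f"row {i}: {col}={v} outside [{lo}, {hi}]")
    return problems


rows = [{"cruise_key": "2301CC", "latitude": 32.9},
        {"cruise_key": None, "latitude": 95.0}]
check_release(rows)
# → two problems flagged for row 1 (null cruise_key, latitude out of range)
```

A release would proceed only when the returned problem list is empty.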
---
## R Package Strategy
### Package Roles
#### `calcofi4db` - Database Administration Tools
**Target users**: Data managers, workflow developers
```
calcofi4db/R/
├── cloud.R # GCS operations (@concept cloud)
├── read.R # CSV/Parquet reading (@concept read)
├── transform.R # Data transformation (@concept transform)
├── ingest.R # Database loading (@concept ingest)
├── workflow.R # Workflow outputs to GCS (@concept workflow)
├── parquet.R # Parquet operations (@concept parquet)
├── duckdb.R # DuckDB operations (@concept duckdb)
├── ducklake.R # DuckLake operations (@concept ducklake)
├── archive.R # GCS archive sync (@concept archive)
├── version.R # Schema versioning (@concept version)
├── check.R # Data validation (@concept check)
├── validate.R # FK/lookup validation functions (@concept validate)
├── wrangle.R # Local DuckDB wrangling functions (@concept wrangle)
├── ship.R # Ship fuzzy matching functions (@concept ship) [NEW]
├── taxonomy.R # Taxonomy standardization functions (@concept taxonomy) [NEW]
├── spatial.R # Spatial operations (@concept spatial)
├── freeze.R # Frozen release operations (@concept freeze)
├── display.R # Display helpers (@concept display)
└── viz.R # Diagram generation (@concept viz)
```
#### `calcofi4r` - User-Facing Tools
**Target users**: Researchers, analysts, app developers
```
calcofi4r/R/
├── read.R # Data retrieval (@concept read)
├── database.R # Database connection (@concept database)
├── analyze.R # Statistical analysis (@concept analyze)
├── visualize.R # Plotting and mapping (@concept visualize)
└── functions.R # Helper functions (@concept utils)
```
### Function Reuse Across Contexts
| Context | Package | Example Usage |
|---------|---------|---------------|
| Ingestion workflows | `calcofi4db` | `read_csv_files()`, `ingest_csv_to_db()` |
| Publishing workflows | `calcofi4db` | `get_db_con()`, `create_db_manifest()` |
| Shiny apps | `calcofi4r` | `cc_get_db()`, `cc_plot_timeseries()` |
| Analysis scripts | `calcofi4r` | `cc_read_bottle()`, `cc_analyze_trend()` |
| API endpoints | `calcofi4r` | `cc_query_db()`, `cc_get_variables()` |
### New Functions to Implement
#### In `calcofi4db`:
```r
# R/cloud.R - GCS operations
get_gcs_file(bucket, path, local_path = NULL)
put_gcs_file(local_path, bucket, path)
sync_gdrive_to_gcs(gdrive_path, gcs_bucket, backup = TRUE)
list_gcs_versions(bucket, path)
# R/parquet.R - Parquet operations
csv_to_parquet(csv_path, schema_def = NULL)
read_parquet_table(path, con = NULL)
write_parquet_table(data, path, partitions = NULL)
add_parquet_metadata(path, metadata_list)
# R/duckdb.R - DuckDB operations
get_duckdb_con(path = ":memory:", read_only = FALSE)
create_duckdb_views(con, manifest)
attach_ducklake(con, catalog_path)
set_duckdb_comments(con, table, comments)
# R/workflow.R - Workflow output functions [NEW]
write_ingest_outputs(data_info, provider, dataset, gcs_bucket)
# 1. Transform data via transform_data()
# 2. Write parquet files to gs://calcofi-db/ingest/{provider}_{dataset}/
# 3. Write manifest.json with provenance info
read_ingest_manifest(provider, dataset, gcs_bucket)
read_ingest_parquet(provider, dataset, table, gcs_bucket)
list_ingest_outputs(gcs_bucket)
integrate_to_working_ducklake(ingests, gcs_bucket, ducklake_path)
# Combines ingest outputs into ducklake/working/
# R/ducklake.R - DuckLake catalog operations
get_working_ducklake() # connect to internal Working DuckLake
ingest_dataset(con, d, mode) # ✓ IMPLEMENTED - high-level dataset ingestion
# - calls transform_data() internally
# - handles uuid column detection
# - tracks provenance via gcs_path
ingest_to_working(data, table, source_file, source_uuid_col = NULL)
add_provenance_columns(data, source_file, source_row_start = 1)
save_working_ducklake(con) # upload Working DuckLake to GCS
list_working_tables(con) # list tables with provenance stats
# R/freeze.R - Frozen release operations
freeze_release(version, release_notes = NULL)
validate_for_release(con) # run all quality checks before freeze
list_frozen_releases() # list available frozen versions
compare_releases(v1, v2) # diff two releases
# R/ship.R - Ship code reconciliation [NEW]
match_ships(unmatched_ships, reference_ships, ship_renames_csv, fetch_ices)
# Multi-source fuzzy matching: manual overrides → exact nodc → exact name
# → CalCOFI UNOLS → NODC WOD → ICES → word-overlap fuzzy
fetch_ship_ices(ices_api, ices_ship_code_type)
# Fetch ship codes from ICES Reference Codes API
# R/taxonomy.R - Taxonomy standardization [NEW]
standardize_species(con, species_tbl, include_gbif, batch_size)
# Query WoRMS → ITIS → GBIF for each species
# Updates species table with worms_id, itis_id, gbif_id
build_taxon_table(con, species_tbl, taxon_tbl, include_itis, batch_size)
# Build taxon hierarchy table from WoRMS/ITIS classification
# Also creates taxa_rank lookup table with rank ordering
```
#### In `calcofi4r`:
```r
# R/database.R - Frozen DuckLake access (user-facing)
cc_get_db(version = "latest", local_cache = TRUE)
# Connect to frozen release (never Working DuckLake)
cc_list_versions() # list available frozen releases
cc_db_info(version = "latest") # release date, row counts, notes
cc_release_notes(version) # view release notes for a version
# R/database.R - Derived views [NEW]
cc_make_view(con, template, view_name, base_table, column_definitions)
# Create derived VIEW from template or custom SQL expressions
# Templates: "casts_extra" (year, month, quarter, julian_day, lat/lon parts, cruise, db_sta_key)
cc_list_view_templates() # list available prebaked templates
# R/read.R - Convenience functions for common tables
cc_read_bottle(version = "latest", ...)
cc_read_larvae(version = "latest", ...)
cc_read_cast(version = "latest", ...)
# R/query.R - Query helpers
cc_query(sql, version = "latest")
cc_list_tables(version = "latest")
cc_describe_table(table, version = "latest")
```
---
## Workflow Dependency Management with `targets`
### Pipeline Definition
Create `workflows/_targets.R`:
```r
library(targets)
library(tarchetypes)
tar_option_set(
packages = c("calcofi4db", "duckdb", "dplyr", "arrow"),
format = "qs" # Fast serialization
)
# Config
DIR_DATA <- "~/My Drive/projects/calcofi/data-public"
DIR_METADATA <- "metadata" # workflows/metadata/
GCS_BUCKET <- "calcofi-db"
# Define pipeline
list(
# ─── Ingest SWFSC NOAA CalCOFI Database ─────────────────────────
tar_target(
ingest_swfsc,
{
d <- calcofi4db::read_csv_files(
provider = "swfsc.noaa.gov",
dataset = "calcofi-db",
dir_data = DIR_DATA,
metadata_dir = DIR_METADATA)
calcofi4db::write_ingest_outputs(
data_info = d,
provider = "swfsc.noaa.gov",
dataset = "calcofi-db",
gcs_bucket = GCS_BUCKET)
}),
# ─── Ingest CalCOFI.org Bottle Database ─────────────────────────
tar_target(
ingest_bottle,
{
d <- calcofi4db::read_csv_files(
provider = "calcofi.org",
dataset = "bottle-database",
dir_data = DIR_DATA,
metadata_dir = DIR_METADATA)
calcofi4db::write_ingest_outputs(
data_info = d,
provider = "calcofi.org",
dataset = "bottle-database",
gcs_bucket = GCS_BUCKET)
}),
# ─── Integrate Ingest Outputs into Working DuckLake ──────────────
tar_target(
working_ducklake,
calcofi4db::integrate_to_working_ducklake(
ingests = list(ingest_swfsc, ingest_bottle),
gcs_bucket = GCS_BUCKET,
ducklake_path = "ducklake/working")),
# ─── Publish Datasets ───────────────────────────────────────────
tar_target(
obis_archive,
create_obis_archive(working_ducklake, include_bottle = TRUE))
)
```
The key changes in this pipeline:
1. **`metadata_dir` parameter**: Points to `workflows/metadata/` instead of the legacy `calcofi4db/inst/ingest/` path
2. **`write_ingest_outputs()`**: Writes parquet files + manifest.json to `gs://calcofi-db/ingest/{provider}_{dataset}/`
3. **`integrate_to_working_ducklake()`**: Combines ingest parquet files into the Working DuckLake at `gs://calcofi-db/ducklake/working/`
**Note:** There is no separate `database/` folder; the Working DuckLake *is* the integrated database. The `releases/` folder will contain frozen snapshots with provenance columns stripped.
### Workflow Lifecycle
```{mermaid}
%%| label: fig-lifecycle
%%| fig-cap: "Development and Update Cycles"
flowchart TB
subgraph dev["Development Cycle"]
direction LR
d1["1\. Explore"]
d2["2\. Ingest"]
d3["3\. Targets"]
d4["4\. Run"]
d5["5\. Publish"]
d1 --> d2 --> d3 --> d4 --> d5
end
subgraph update["Update Cycle"]
direction LR
u1["1\. New data"]
u2["2\. Sync"]
u3["3\. Run"]
u4["4\. Republish"]
u1 --> u2 --> u3 --> u4
end
dev -.->|"formalized"| update
style dev fill:#fff3e0,stroke:#ef6c00
style update fill:#e8f4e8,stroke:#2e7d32
```
### Dependency Graph
```{mermaid}
%%| label: fig-targets
%%| fig-cap: "targets Pipeline Dependency Graph"
flowchart LR
subgraph archive["📁 GCS Archive"]
archive_swfsc["swfsc.noaa.gov/<br/>calcofi-db"]
archive_bottle["calcofi.org/<br/>bottle-database"]
end
subgraph ingest["⚙️ Ingest Outputs"]
ingest_swfsc["ingest_swfsc<br/>(parquet + manifest)"]
ingest_bottle["ingest_bottle<br/>(parquet + manifest)"]
end
subgraph database["🗃️ Working DuckLake"]
integrate["integrate_to_working_ducklake()"]
ducklake["ducklake/working/"]
end
subgraph publish["🌐 Publish"]
obis_archive
end
archive_swfsc --> ingest_swfsc
archive_bottle --> ingest_bottle
ingest_swfsc --> integrate
ingest_bottle --> integrate
integrate --> ducklake
ducklake --> obis_archive
style archive fill:#e3f2fd,stroke:#1565c0
style ingest fill:#fff3e0,stroke:#ef6c00
style database fill:#f3e5f5,stroke:#7b1fa2
style publish fill:#fce4ec,stroke:#c2185b
```
The key insight is the separation between **ingest outputs** (per-workflow parquet + manifest) and the **integrated database** (the Working DuckLake). Each ingest workflow:
1. Reads from GCS archive (immutable source files)
2. Applies transformations via redefinition metadata
3. Writes parquet files to `gs://calcofi-db/ingest/{provider}_{dataset}/`
4. Writes `manifest.json` with provenance (archive timestamp, table stats)
The integration step then combines all ingest outputs into the unified Working DuckLake.
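What the per-ingest `manifest.json` might hold can be sketched as follows (field names are illustrative; the actual schema is defined by `write_ingest_outputs()`):

```python
import json
from datetime import datetime, timezone


def build_manifest(provider, dataset, tables, archive_timestamp):
    """Assemble a per-ingest manifest: provenance plus per-table row counts."""
    return {
        "provider": provider,
        "dataset": dataset,
        "archive_timestamp": archive_timestamp,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "tables": {name: {"rows": n} for name, n in tables.items()},
    }


m = build_manifest(
    "swfsc.noaa.gov", "calcofi-db",
    tables={"larvae": 1523004, "net": 54210},
    archive_timestamp="2026-02-02_121557")
manifest_json = json.dumps(m, indent=2)  # what would be written alongside the parquet
```

Keeping the archive timestamp in the manifest lets `integrate_to_working_ducklake()` detect stale ingest outputs before merging.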
---
## Documentation Updates
### Recommended Changes to `docs/`
#### `docs/db.qmd` - Database Documentation
**Current**: PostgreSQL-focused with naming conventions and ingestion strategy
**Proposed changes**:
- Add section on DuckDB as primary database
- Document Parquet file organization
- Explain DuckLake versioning and time travel
- Keep naming conventions (snake_case, *_id, *_uuid suffixes)
- Update "Integrated database ingestion strategy" for targets pipeline
#### `docs/api.qmd` - API Documentation
**Current**: REST API endpoints for PostgreSQL queries
**Proposed changes**:
- Note that the REST API may be deprecated for most use cases
- Document direct DuckDB access via `calcofi4r::cc_get_db()`
- Keep the API for web apps that cannot use DuckDB directly
- Consider a lightweight API backed by DuckDB if needed
#### `docs/portals.qmd` - Portal Documentation
**Current**: Good overview of EDI, NCEI, OBIS, ERDDAP
**Proposed changes**:
- Update data flow diagram to show DuckDB → Portals
- Add section on automated publishing via targets pipeline
- Document versioning strategy for portal submissions
### Where Documentation Should Live
| Content | Location | Rationale |
|---------|----------|-----------|
| Database schema | `docs/db.qmd` | User-facing documentation |
| Workflow development | `workflows/README.md` | Developer guide |
| Package functions | `calcofi4r/`, `calcofi4db/` | pkgdown reference |
| Data dictionary | `calcofi4db/inst/` | Versioned with code |
| Architecture decisions | `docs/` or `workflows/_docs/` | Long-term reference |
---
## Implementation Priority
### Phase 1: Foundation ✅ COMPLETE (2026-02-02)
**Goal**: Establish cloud infrastructure and helper functions
1. **Set up GCS buckets** ✅
- Created `gs://calcofi-files-public/` with public read access
- Created `gs://calcofi-files-private/` with restricted access
- Configured rclone access via OAuth
2. **Add cloud functions to `calcofi4db`** ✅
- Created `R/cloud.R` with `get_gcs_file()`, `put_gcs_file()`
- Created `R/parquet.R` with `csv_to_parquet()`
- Created `R/duckdb.R` with `get_duckdb_con()`, `create_duckdb_views()`
3. **Initial sync** ✅
- Synced Google Drive `data-public/` to `gs://calcofi-files-public/_sync/`
- Created initial archive snapshot: `archive/2026-02-02_121557/`
- Generated manifest: `manifests/manifest_2026-02-02_121557.json`
- Total: 112,457 files (33.5 GiB)
### Phase 2: Pipeline ✅ COMPLETE (2026-02-05)
**Goal**: Implement targets pipeline with new architecture
1. **Migrate metadata to `workflows/metadata/`** ✅
- Created `workflows/metadata/{provider}/{dataset}/` directories
- Copied redefinition files from `calcofi4db/inst/ingest/`
- Updated workflows to use `metadata_dir` parameter
2. **Create `calcofi4db/R/workflow.R`** ✅
- `write_ingest_outputs()`: Write parquet + manifest to GCS
- `write_parquet_outputs()`: Export tables to parquet with manifest
- `read_ingest_manifest()`: Read manifest from GCS
3. **Update `workflows/_targets.R`** ✅
- Extended with `ingest_swfsc_manifest` + `ingest_bottle_manifest` file targets
- Added `merge_manifest` target that renders `merge_ichthyo_bottle.qmd`
- Pipeline: CSV→parquet→DuckDB + manifest-based merge
4. **Cross-dataset integration (new)** ✅
- **Ship matching**: `calcofi4db/R/ship.R` with `match_ships()`, `fetch_ship_ices()`
- **Cruise bridge**: Derive `cruise_key` (YYMMKK) from bottle casts via ship_key + datetime
- **Taxonomy standardization**: `calcofi4db/R/taxonomy.R` with `standardize_species()`, `build_taxon_table()`
- **Grid canonical source**: grid.parquet only from swfsc (removed from bottle export)
5. **Update publish workflow** (pending)
- Modify `publish_larvae_to_obis.qmd` to read from DuckDB
- Add bottle data to EMoF extension
### Phase 3: Working DuckLake ✅ COMPLETE (2026-02-05)
**Goal**: Implement Working DuckLake with provenance tracking
1. **Working DuckLake functions implemented** ✅
- `get_working_ducklake()`: Connect to Working DuckLake (GCS-backed)
- `ingest_to_working()`: Ingest with provenance columns
- `add_provenance_columns()`: Helper for `_source_file`, `_source_row`, `_source_uuid`, `_ingested_at`
- `save_working_ducklake()`: Upload to GCS
- `list_working_tables()`: Table summary with provenance stats
2. **Merge workflow loads all tables** ✅
- `merge_ichthyo_bottle.qmd` loads 15+ tables from upstream parquet
- Modified tables (casts, species, taxon, taxa_rank) exported separately
- Manifest references GCS URIs for unchanged upstream tables
### Phase 4: Frozen DuckLake ✅ COMPLETE (2026-02-05)
**Goal**: Implement [Frozen DuckLake](https://ducklake.select/2025/10/24/frozen-ducklake/) for public releases
1. **Freeze functions implemented** ✅
- `validate_for_release()`: Data quality checks
- `freeze_release()` / `write_parquet_outputs(strip_provenance=TRUE)`: Create immutable release
- Strip provenance columns for clean public schema
- Generate release notes with cross-dataset integration summary
2. **Release structure** ✅
- `gs://calcofi-db/ducklake/releases/{version}/`
- Parquet files + manifest + RELEASE_NOTES.md
- `latest` pointer to current release
3. **`calcofi4r` frozen access** ✅
- `cc_get_db(version = "latest")`: Default to frozen releases
- `cc_list_versions()`: List available releases
- `cc_release_notes()`: View release notes
- `cc_make_view(con, "casts_extra")`: Create derived VIEWs with prebaked templates
- `cc_list_view_templates()`: List available templates
4. **First public release**: v2026.02 (pending GCS upload)
### Phase 5: Documentation & Polish
**Goal**: Complete documentation and evaluate API
1. **Update docs/**
- Revise `db.qmd` for DuckDB/DuckLake
- Document Working vs Frozen databases
- Evaluate API necessity
- Update portal documentation
2. **Clean up workflows/**
- Consolidate README.md with essentials
- Archive deprecated exploration notebooks
- Document workflow development process
- Create "Getting Started" guide for data managers
---
## Verification Plan
### End-to-End Test
```r
# 1. Sync test data from Google Drive to GCS
system("rclone sync gdrive:calcofi/data-public gs://calcofi-files-public/_sync")
# 2. Create archive snapshot
timestamp <- format(Sys.time(), "%Y-%m-%d_%H%M%S")
system(glue::glue("rclone copy gs://calcofi-files-public/_sync gs://calcofi-files-public/archive/{timestamp}"))
# 3. Run targets pipeline
targets::tar_make()
# 4. Verify database
con <- calcofi4db::get_duckdb_con("calcofi.duckdb")
DBI::dbListTables(con)
DBI::dbGetQuery(con, "SELECT COUNT(*) FROM larvae")
# 5. Verify OBIS archive
obistools::check_fields(readr::read_csv("data/darwincore/larvae/occurrence.csv"))
```
### Integration Tests
- [ ] GCS upload/download roundtrip
- [ ] Parquet schema preservation
- [ ] DuckDB query performance vs PostgreSQL
- [ ] OBIS archive validation
- [ ] Time travel query accuracy
---
## Migration Strategy: PostgreSQL → DuckDB
### Current PostgreSQL Dependencies
Based on `server/docker-compose.yml`, the following services depend on PostgreSQL:
| Service | Purpose | Migration Path |
|---------|---------|----------------|
| `postgis` | Spatial database | Keep for vector tiles (short-term) |
| `pg_tileserv` | Vector tile server | Replace with PMTiles |
| `pg_rest` | REST API (PostgREST) | Keep for web apps, add DuckDB option |
| `plumber` | R API | Update to use DuckDB |
| `pgadmin` | Database admin | Deprecate once migrated |
### Vector Tiles: pg_tileserv → PMTiles
The current architecture uses `pg_tileserv` to serve vector tiles from PostGIS for Shiny mapping apps. The migration path:
```{mermaid}
%%| label: fig-tiles-migration
%%| fig-cap: "Vector Tiles Migration Path"
flowchart LR
subgraph current["Current Flow (PostgreSQL)"]
direction LR
pg["PostGIS"]
ts["pg_tileserv"]
lf["Leaflet"]
pg --> ts --> lf
end
subgraph target["Target Flow (Cloud-native)"]
direction LR
geo["GeoParquet"]
tc["tippecanoe"]
pm["PMTiles"]
mg["mapgl"]
geo --> tc --> pm --> mg
end
current -.->|"migrate"| target
style current fill:#ffebee,stroke:#c62828
style target fill:#e8f4e8,stroke:#2e7d32
```
#### Implementation Steps
1. **Generate PMTiles from PostGIS**
```bash
# Export geometries from PostGIS
ogr2ogr -f GeoJSON stations.geojson \
PG:"host=postgis user=admin dbname=gis" \
-sql "SELECT * FROM stations"
# Convert to PMTiles using tippecanoe
tippecanoe -o stations.pmtiles \
--minimum-zoom=0 --maximum-zoom=14 \
--layer=stations \
stations.geojson
# Upload to GCS
gsutil cp stations.pmtiles gs://calcofi-db/tiles/
```
2. **Update Shiny Apps**
```r
# Old: pg_tileserv
leaflet() |>
addTiles() |>
addMapboxGL(
style = list(
sources = list(
stations = list(
type = "vector",
tiles = list("https://tile.calcofi.io/public.stations/{z}/{x}/{y}.pbf")
)
)
)
)
# New: PMTiles with mapgl
library(mapgl)
mapboxgl() |>
add_pmtiles_source(
id = "stations",
url = "https://storage.googleapis.com/calcofi-db/tiles/stations.pmtiles"
) |>
add_layer(
id = "stations-layer",
type = "circle",
source = "stations",
source_layer = "stations"
)
```
3. **Add PMTiles generation to targets pipeline**
```r
# In _targets.R
tar_target(
pmtiles_stations,
create_pmtiles(
geom_source = duckdb_file,
query = "SELECT * FROM stations",
output = "tiles/stations.pmtiles"
)
)
```
### Migration Roadmap
| Phase | Timeline | Actions |
|-------|----------|---------|
| **Phase 1** | Now | Add DuckDB as primary data access; keep PostgreSQL |
| **Phase 2** | +2 months | Generate PMTiles for all spatial layers |
| **Phase 3** | +4 months | Update Shiny apps to use mapgl + PMTiles |
| **Phase 4** | +6 months | Deprecate pg_tileserv, reduce PostgreSQL to backup |
| **Phase 5** | +12 months | Fully deprecate PostgreSQL (if no dependencies remain) |
---
## Resolved Questions
Based on clarifications received:
1. **GCS Project**: Use existing project `ucsd-sio-calcofi`
- Existing bucket `calcofi-db` → DuckLake/database files
- New bucket `calcofi-files-public` → Versioned public source files (public read access)
- New bucket `calcofi-files-private` → Versioned private/sensitive files (restricted access)
2. **Google Drive Structure** (implemented 2026-02-02):
- `calcofi/data-public/` → Public datasets (bottle, larvae, etc.)
- `calcofi/data-private/` → Private/sensitive data (eDNA, zooplankton samples)
3. **GCS Folder Structure** (implemented 2026-02-02):
- `_sync/` → Working directory (rclone syncs here, mutable)
- `archive/{timestamp}/` → Immutable snapshots (complete copies, not deltas)
- `manifests/` → JSON metadata files with `manifest_latest.json` pointer
4. **API Strategy**: Keep PostgreSQL + API for now with migration roadmap
- pg_tileserv needed for current Shiny apps
- Migrate to PMTiles over time using tippecanoe + mapgl
5. **Priority Data**: Larvae + Bottle datasets for proof-of-concept
6. **Bottle in OBIS**: Include as EMoF extension in larvae dataset
## Remaining Questions
1. **DuckLake vs MotherDuck**: Self-hosted DuckLake catalog or use MotherDuck service?
2. **Automation**: GitHub Actions vs server cron for scheduled rclone syncs?
3. **PMTiles hosting**: Serve from GCS directly or via CDN?
---
## Appendix: Key References
### Documentation
- [Versioned Data Lake Strategy](workflows/_docs/2026-01-02%20Versioned%20Data%20Lake%20Strategy.md)
- [Data Management Action Plan](workflows/_docs/2026-01-27%20updated%20working%20draft_Data_Management_Action_Plan_CalCOFI.md)
- [Data Pipeline Timeline](workflows/_docs/2026-01-31%20Data%20Pipeline%20Timeline%20.md)
### Technologies
**Database & Analytics**:
- [DuckDB](https://duckdb.org/) - Fast in-process analytical database
- [DuckLake](https://ducklake.select/) - Lakehouse catalog for DuckDB
- [Frozen DuckLake](https://ducklake.select/2025/10/24/frozen-ducklake/) - Immutable versioned releases
- [Apache Parquet](https://parquet.apache.org/) - Columnar storage format
**Workflow & Orchestration**:
- [targets R package](https://books.ropensci.org/targets/) - Pipeline dependency management
**Cloud Storage**:
- [rclone](https://rclone.org/) - Cloud storage sync tool
- [googleCloudStorageR](https://code.markedmondson.me/googleCloudStorageR/) - R interface to GCS
- [Data lakes and big data analytics - Google Cloud Storage](https://cloud.google.com/storage#data-lakes-and-big-data-analytics)
**Vector Tiles**:
- [PMTiles](https://protomaps.com/docs/pmtiles) - Cloud-native vector tile format
- [tippecanoe](https://github.com/felt/tippecanoe) - Build vector tilesets
- [mapgl R package](https://walker-data.com/mapgl/) - Modern WebGL maps in R/Shiny
### CalCOFI Resources
- [calcofi.io/docs](https://calcofi.io/docs) - CalCOFI documentation site
- [calcofi.io/calcofi4r](https://calcofi.io/calcofi4r) - User-facing R package
- [calcofi.io/calcofi4db](https://calcofi.io/calcofi4db) - Database admin R package