CalCOFI Database Tools package for ingesting and managing datasets in the CalCOFI database using an integrated schema-based workflow.
Overview
This package provides functions for the CalCOFI integrated database ingestion strategy, which uses a two-schema approach:
-
dev
schema: Fresh development schema recreated with each ingestion run for QA/QC -
prod
schema: Versioned production schema for stable public access
Key features:
- Reading CSV files from Google Drive with metadata extraction
- Creating and managing redefinition files for tables and fields
- Transforming data according to redefinition rules
- Ingesting multiple datasets into a fresh database schema
- Managing database relationships and indexes
- Recording schema versions with full provenance
Installation
# Install from GitHub
remotes::install_github("CalCOFI/calcofi4db")
Usage
Master ingestion workflow
The primary workflow is the master ingestion script inst/create_db.qmd
that recreates the dev
schema with all datasets:
- Drops and recreates
dev
schema - Ingests multiple datasets from Google Drive
- Applies transformations via redefinition files
- Creates relationships (primary/foreign keys, indexes)
- Records schema version with metadata
Individual dataset ingestion
For programmatic control, use the core functions:
library(calcofi4db)
library(DBI)
library(RPostgres)
# Connect to database
con <- dbConnect(
Postgres(),
dbname = "gis",
host = "localhost",
port = 5432,
user = "admin",
password = "postgres"
)
# Read CSV files and metadata from Google Drive
d <- read_csv_files(
provider = "swfsc.noaa.gov",
dataset = "calcofi-db"
)
# Transform data according to redefinitions
transformed_data <- transform_data(d)
# Ingest into dev schema
ingest_csv_to_db(
con = con,
schema = "dev",
transformed_data = transformed_data,
d_flds_rd = d$d_flds_rd,
d_gdata = d$d_gdata,
workflow_info = d$workflow_info
)
# Record schema version
record_schema_version(
con = con,
schema = "dev",
version = "1.0.0",
description = "Initial ingestion of NOAA CalCOFI Database",
script_permalink = "https://github.com/CalCOFI/calcofi4db/blob/main/inst/create_db.qmd"
)
# Disconnect
dbDisconnect(con)
Schema versioning
Each successful ingestion records a version in the schema_version
table with:
- version: Semantic version (e.g., “1.0.0”, “1.1.0”)
- description: Changes in this version
- date_created: Ingestion timestamp
- script_permalink: GitHub permalink to ingestion script
Versions are archived as SQL dumps in Google Drive for reproducibility.
Documentation
See the CalCOFI Database Documentation for complete details on:
- Database naming conventions
- Ingestion workflow architecture
- Schema versioning strategy
- Metadata management
- Publishing to data portals