To name columns, use snake_case (i.e., lower case with underscores) to avoid having to quote identifiers in SQL statements. (TIP: Use janitor::clean_names() to convert a table.)
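For example, a minimal sketch (the input column names are made up for illustration):

```r
# convert column names read from a CSV to snake_case
library(janitor)
library(readr)

d <- read_csv("file.csv")   # e.g., columns "Sta ID", "Depth (m)"
d <- clean_names(d)         # -> sta_id, depth_m
```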
Unique identifiers are suffixed with:
*_id for unique integer keys;
*_uuid for universally unique identifiers as defined by RFC 4122, stored in Postgres as the UUID type;
*_key for unique string keys;
*_seq for auto-incrementing sequence integer keys.
Suffix column names with units where applicable (e.g., *_m for meters, *_km for kilometers, *_degc for degrees Celsius). See the units vignette.
Name the geometry column geom (used by the PostGIS spatial extension). If the table has multiple geometry columns, use geom for the default geometry column and geom_{type} for additional geometry columns (e.g., geom_point, geom_line, geom_polygon).
5.1.3 Primary key conventions
Prefer natural keys (meaningful domain identifiers) over surrogate keys where stable (see the parsing sketch after this list):
cruise_key = ‘YYMMKK’ (e.g., 2401NH = January 2024, New Horizon)
ship_key = 2-letter ship code (e.g., NH)
tow_type_key = tow type code (e.g., CB)
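For instance, a minimal sketch parsing a cruise_key into its components (the century handling assumes a 2000s cruise):

```r
# split a cruise_key of the form YYMMKK into year, month and ship code
cruise_key <- "2401NH"
yy       <- substr(cruise_key, 1, 2)  # "24" -> 2024 (assuming 2000s)
mm       <- substr(cruise_key, 3, 4)  # "01" -> January
ship_key <- substr(cruise_key, 5, 6)  # "NH" -> New Horizon
```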
Sequential integer keys should have an explicit, documented sort order for reproducibility (sketched below):
site_id sorted by cruise_key, orderocc
tow_id sorted by site_id, time_start
net_id sorted by tow_id, side
ichthyo_id sorted by net_id, species_id, life_stage, measurement_type, measurement_value
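For example, a minimal sketch (assuming a tow data frame with the columns above) assigning tow_id in its documented order:

```r
# assign tow_id as a sequential integer with an explicit, reproducible sort order
library(dplyr)

tow <- tow |>
  arrange(site_id, time_start) |>
  mutate(tow_id = row_number())
```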
Avoid UUIDs in output tables: use _source_uuid in the Working DuckLake for provenance tracking only (it is stripped in frozen releases).
5.2 Use Unicode for text
The default character encoding for PostgreSQL is Unicode (UTF8), which allows for international characters, accents, and special characters. Improper encoding can royally mess up basic text.
Logging into the server, we can confirm the encoding with, for example, `SHOW server_encoding;` in psql (or `psql -l` to list databases with their encodings).
Use Unicode encoding (utf-8 in Python, UTF8 in PostgreSQL) for all database text values to support international characters and documentation formatting (e.g., tabs and other whitespace preserved for markdown conversion).
In Python, use pandas to read (read_csv()) and write (to_csv()) with UTF-8 encoding (i.e., encoding='utf-8'):
```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost:5432/dbname')

# read from a csv file
df = pd.read_csv('file.csv', encoding='utf-8')

# write to PostgreSQL (encoding is handled by the connection;
# to_sql() does not take an encoding argument)
df.to_sql('table_name', engine, if_exists='replace', index=False,
          method='multi', chunksize=1000)

# read from PostgreSQL
df = pd.read_sql('SELECT * FROM table_name', engine)

# write to a csv file with UTF-8 encoding
df.to_csv('file.csv', index=False, encoding='utf-8')
```
In R, use readr to read and write CSVs (UTF-8 by default) and DBI/RPostgres for the database (RPostgres always uses a UTF-8 client encoding):

```r
library(readr)
library(DBI)
library(RPostgres)

# connect to PostgreSQL
con <- dbConnect(
  RPostgres::Postgres(),
  dbname = "dbname", host = "localhost", port = 5432,
  user = "user", password = "password")

# read from a csv file
df <- read_csv('file.csv', locale = locale(encoding = 'UTF-8'))  # explicit
df <- read_csv('file.csv')                                       # implicit (UTF-8 is the default)

# write to PostgreSQL
dbWriteTable(con, 'table_name', df, overwrite = TRUE)

# read from PostgreSQL
df <- dbReadTable(con, 'table_name')

# write to a csv file (readr always writes UTF-8;
# write_excel_csv() adds a byte-order mark so Excel detects it)
write_excel_csv(df, 'file.csv')
```
5.3 Integrated database ingestion strategy
5.3.1 Overview
The CalCOFI database uses a two-schema strategy for development and production:
dev schema: Development schema where new datasets, tables, fields, and relationships are ingested and QA/QC’d. This schema is recreated fresh with each ingestion run using the master ingestion script.
prod schema: Production schema for stable, versioned data used by public APIs, apps, and data portals (OBIS, EDI, ERDDAP). Once dev is validated, it’s copied to prod with a version number.
The schema version and accompanying metadata are recorded in the schema_version table.
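A hypothetical sketch of that versioning step (the schema_version columns shown are illustrative, not the actual table definition):

```r
# record the release after dev has been validated and copied to prod
library(DBI)

dbExecute(con, "
  INSERT INTO schema_version (version, date_created, description)
  VALUES ('v1.0', now(), 'initial public release')")
```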
Each dataset section in the master script handles:
Reading CSV files from Google Drive
Transforming data according to redefinition rules
Loading into database tables
Adding table/field comments with metadata
flowchart TB
%% Node definitions
gd[("`<b>Source Data</b>
Google Drive:
calcofi/data/{provider}/{dataset}/*.csv`")]
iw["<b>Ingest</b>
calcofi4db: create_db.qmd; sections by {provider},{dataset}"]
dd["<b>Data Definitions</b>
calcofi4db: ingest/{provider}/{dataset}/:
<ul>
<li>tbls_redefine.csv</li>
<li>flds_redefine.csv</li>
</ul>"]
db[("<b>Database</b>")]
api["<b>API Endpoint</b>
/db_tables
/db_columns"]
catalog["<b>R Function</b>
calcofi4r::cc_db_catalog()"]
eml["<b>Publish Workflow</b>
workflows: publish_{dataset}_{portal}.qmd
with {portal}s:
<ul>
<li>erddap</li>
<li>edi</li>
<li>obis</li>
<li>ncei</li>
</ul>"]
%% Edge definitions
gd --> iw
iw -->|"1 auto-generated"| dd
dd -->|"2 manual edit"| iw
iw -->|"3 data"| db
iw --> comments
comments -->|"4 metadata"| db
db --> api
api --> catalog
db --> eml
%% Comments subgraph with internal nodes
subgraph comments["<b>Database Comments</b>
(stored as text in JSON format to differentiate elements)"]
direction TB
h["hideme"]:::hidden
h~~~tbl
h~~~fld
tbl["per <em>Table</em>:
<ul>
<li>description</li>
<li>source (<em>linked</em>)</li>
<li>source_created (<em>datetime</em>)</li>
<li>workflow (<em>linked</em>)</li>
<li>workflow_ingested (<em>datetime</em>)</li>
</ul>"]
fld["per <em>Field</em>:
<ul>
<li>description</li>
<li>units (SI)</li>
</ul>"]
end
%% Clickable links
click gd "https://drive.google.com/drive/folders/1xxdWa4mWkmfkJUQsHxERTp9eBBXBMbV7" "calcofi folder - Google Drive"
click api "https://api.calcofi.io/db_tables" "API endpoint"
click catalog "https://calcofi.io/calcofi4r/reference/cc_db_catalog.html" "R package function"
%% Styling
classDef source fill:#f9f9f9,stroke:#000,stroke-width:2px,color:#000
classDef process fill:#a3e0f2,stroke:#000,stroke-width:2px,color:#000
classDef eml fill:#F0FDF4,stroke:#22C55E,stroke-width:2px,color:#000,text-align:left
classDef data fill:#ffbe75,stroke:#000,stroke-width:2px,color:#000
classDef api fill:#9ad294,stroke:#000,stroke-width:2px,color:#000
classDef meta fill:#c9a6db,stroke:#000,stroke-width:2px,color:#000,text-align:left
classDef hidden display: none;
class gd source
class dd,comments,tbl,fld meta
class iw process
class db data
class api,catalog api
class tbl,fld li
class eml eml
Figure 5.1: Integrated database ingestion scheme.
5.3.3 Using calcofi4db package
The calcofi4db package provides streamlined functions for dataset ingestion.
5.3.3.1 DuckLake Workflow (Recommended)
The preferred approach uses DuckDB with provenance tracking:
```r
library(calcofi4db)

# connect to working DuckLake (downloads from GCS if needed)
con <- get_working_ducklake()

# read CSV files from local directory or GCS archive
# - syncs to GCS archive for immutable provenance tracking
# - loads redefinition metadata for column renaming/typing
d <- read_csv_files(
  provider     = "swfsc.noaa.gov",
  dataset      = "calcofi-db",
  dir_data     = "~/My Drive/projects/calcofi/data-public",
  metadata_dir = "metadata",
  sync_archive = TRUE)

# ingest dataset with automatic provenance tracking
# - transforms data using redefinition files
# - adds _source_file, _source_row, _source_uuid, _ingested_at columns
# - handles uuid column detection automatically
tbl_stats <- ingest_dataset(
  con  = con,
  d    = d,
  mode = "replace")

# save to GCS and close
save_working_ducklake(con)
close_duckdb(con)
```
The ingest_dataset() function handles the following (a provenance query sketch follows this list):
Calling transform_data() to apply table/field redefinitions
Detecting UUID columns automatically for provenance tracking
Calling ingest_to_working() for each table with proper source file tracking
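The added provenance columns can then be inspected directly in the Working DuckLake, for example (table name is illustrative):

```r
# peek at the provenance columns added during ingestion
library(DBI)

dbGetQuery(con, "
  SELECT _source_file, _source_row, _source_uuid, _ingested_at
  FROM   some_table   -- replace with an ingested table
  LIMIT  5")
```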
After the prod schema is versioned, additional workflows publish data to portals (ERDDAP, EDI, OBIS, NCEI) using Ecological Metadata Language (EML) via the EML R package, pulling metadata directly from database comments.
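As a rough sketch of the metadata-from-comments step (queries are illustrative; the actual logic lives in the publish_{dataset}_{portal}.qmd workflows):

```r
# pull table and field descriptions from PostgreSQL comments
library(DBI)

tbl_comment <- dbGetQuery(con, "
  SELECT obj_description('public.aoi_fed_sanctuaries'::regclass, 'pg_class') AS comment")

fld_comments <- dbGetQuery(con, "
  SELECT a.attname AS column_name,
         col_description(a.attrelid, a.attnum) AS comment
  FROM   pg_attribute a
  WHERE  a.attrelid = 'public.aoi_fed_sanctuaries'::regclass
    AND  a.attnum > 0
    AND  NOT a.attisdropped")

# these descriptions then feed EML elements (e.g., abstract, attributeList)
# built with the EML R package before publishing
```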
5.3.7 OR Describe tables and columns directly
Use the COMMENT clause to add descriptions to tables and columns, either through the GUI pgadmin.calcofi.io (by right-clicking on the table or column and selecting Properties) or with SQL. For example:
```sql
COMMENT ON TABLE public.aoi_fed_sanctuaries IS
  'areas of interest (`aoi`) polygons for federal **National Marine Sanctuaries**; loaded by _workflow_ [load_sanctuaries](https://calcofi.io/workflows/load_sanctuaries.html)';
```
Note the use of markdown for including links and formatting (e.g., bold, code, italics), such that the above SQL will render like so:
areas of interest (aoi) polygons for federal National Marine Sanctuaries; loaded by workflow load_sanctuaries
It is especially helpful to link to any workflows responsible for ingesting or updating the input data.
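Comments can also be set programmatically, e.g. from R via DBI (the column and wording below are illustrative):

```r
# add a column-level comment from R
library(DBI)

dbExecute(con, "
  COMMENT ON COLUMN public.aoi_fed_sanctuaries.geom IS
    'default geometry column (PostGIS); sanctuary boundary polygon'")
```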