5  Database

5.1 Database naming conventions

There are only two hard things in Computer Science: cache invalidation and naming things. – Phil Karlton (Netscape architect)

We’re circling the wagons to come up with the best conventions for naming. Here are some ideas:

5.1.1 Name tables

  • Table names are singular and use all lower case.
    • Example: cruise, site, species, lookup (not cruises, sites, lookups)

5.1.2 Name columns

  • To name columns, use snake-case (i.e., lower-case with underscores) so as to prevent the need to quote SQL statements. (TIP: Use janitor::clean_names() to convert a table.)

  • Unique identifiers are suffixed with:

    • *_id for unique integer keys;
    • *_uuid for universally unique identifiers as defined by RFC 4122 and stored in Postgres as UUID Type.
    • *_key for unique string keys;
    • *_seq for auto-incrementing sequence integer keys.
  • Suffix with units where applicable (e.g., *_m for meters, *_km for kilometers, degc for degrees Celsius). See units vignette.

  • Set geometry column to geom (used by PostGIS spatial extension). If the table has multiple geometry columns, use geom for the default geometry column and geom_{type} for additional geometry columns (e.g., geom_point, geom_line, geom_polygon).

5.1.3 Primary key conventions

Prefer natural keys (meaningful domain identifiers) over surrogate keys where stable:

  • cruise_key = ‘YYMMKK’ (e.g., 2401NH = January 2024, New Horizon)
  • ship_key = 2-letter ship code (e.g., NH)
  • tow_type_key = tow type code (e.g., CB)

Sequential integer keys should have explicit sort order documented for reproducibility:

  • site_id sorted by cruise_key, orderocc
  • tow_id sorted by site_id, time_start
  • net_id sorted by tow_id, side
  • ichthyo_id sorted by net_id, species_id, life_stage, measurement_type, measurement_value

Avoid UUIDs in output tables: Use _source_uuid in Working DuckLake for provenance tracking only (stripped in frozen releases).

5.2 Use Unicode for text

The default character encoding for Postgresql is unicode (UTF8), which allows for international characters, accents and special characters. Improper encoding can royally mess up basic text.

Logging into the server, we can see this with the following command:

docker exec -it postgis psql -l
                                  List of databases
        Name        | Owner | Encoding |  Collate   |   Ctype    | Access privileges 
--------------------+-------+----------+------------+------------+-------------------
 gis                | admin | UTF8     | en_US.utf8 | en_US.utf8 | =Tc/admin        +
                    |       |          |            |            | admin=CTc/admin  +
                    |       |          |            |            | ro_user=c/admin
 lter_core_metabase | admin | UTF8     | en_US.utf8 | en_US.utf8 | =Tc/admin        +
                    |       |          |            |            | admin=CTc/admin  +
                    |       |          |            |            | rw_user=c/admin
 postgres           | admin | UTF8     | en_US.utf8 | en_US.utf8 | 
 template0          | admin | UTF8     | en_US.utf8 | en_US.utf8 | =c/admin         +
                    |       |          |            |            | admin=CTc/admin
 template1          | admin | UTF8     | en_US.utf8 | en_US.utf8 | =c/admin         +
                    |       |          |            |            | admin=CTc/admin
 template_postgis   | admin | UTF8     | en_US.utf8 | en_US.utf8 | 
(6 rows)

Use Unicode (utf-8 in Python or UTF8 in Postgresql) encoding for all database text values to support international characters and documentation (i.e., tabs, etc for markdown conversion).

  • In Python, use pandas to read (read_csv()) and write (to_csv()) with UTF-8 encoding (i.e., encoding='utf-8').:

    import pandas as pd
    from sqlalchemy import create_engine
    engine = create_engine('postgresql://user:password@localhost:5432/dbname')
    
    # read from a csv file
    df = pd.read_csv('file.csv', encoding='utf-8')
    
    # write to PostgreSQL
    df.to_sql('table_name', engine, if_exists='replace', index=False, method='multi', chunksize=1000, encoding='utf-8')
    
    # read from PostgreSQL
    df = pd.read_sql('SELECT * FROM table_name', engine, encoding='utf-8')
    
    # write to a csv file with UTF-8 encoding
    df.to_csv('file.csv', index=False, encoding='utf-8')
  • In R, use readr to read (read_csv()) and write (write_excel_csv()) to force UTF-8 encoding.

    library(readr)
    library(DBI)
    library(RPostgres)
    
    # connect to PostgreSQL
    con <- dbConnect(RPostgres::Postgres(), dbname = "dbname", host = "localhost", port = 5432, user = "user", password = "password")
    
    # read from a csv file
    df <- read_csv('file.csv', locale = locale(encoding = 'UTF-8'))  # explicit
    df <- read_csv('file.csv')                                       # implicit
    
    # write to PostgreSQL
    dbWriteTable(con, 'table_name', df, overwrite = TRUE)
    
    # read from PostgreSQL
    df <- dbReadTable(con, 'table_name')
    
    # write to a csv file with UTF-8 encoding
    write_excel_csv(df, 'file.csv', locale = locale(encoding = 'UTF-8'))  # explicit
    write_excel_csv(df, 'file.csv')                                       # implicit

5.3 Integrated database ingestion strategy

5.3.1 Overview

The CalCOFI database uses a two-schema strategy for development and production:

  • dev schema: Development schema where new datasets, tables, fields, and relationships are ingested and QA/QC’d. This schema is recreated fresh with each ingestion run using the master ingestion script.
  • prod schema: Production schema for stable, versioned data used by public APIs, apps, and data portals (OBIS, EDI, ERDDAP). Once dev is validated, it’s copied to prod with a version number.

5.3.2 Master ingestion workflow

All datasets are ingested using a single master Quarto script calcofi4db/inst/create_db.qmd that:

  1. Drops and recreates the dev schema (fresh start each run)
  2. Ingests multiple datasets from Google Drive source files (CSV, potentially SHP/NC in future)
  3. Applies transformations using redefinition files (tbls_redefine.csv, flds_redefine.csv)
  4. Creates relationships (primary keys, foreign keys, indexes)
  5. Records schema version with metadata in schema_version table

Each dataset section in the master script handles:

  • Reading CSV files from Google Drive
  • Transforming data according to redefinition rules
  • Loading into database tables
  • Adding table/field comments with metadata
flowchart TB
    %% Node definitions
    gd[("`<b>Source Data</b>
          Google Drive:
          calcofi/data/{provider}/{dataset}/*.csv`")]
    iw["<b>Ingest</b>
        calcofi4db: create_db.qmd; sections by {provider},{dataset}"]
    dd["<b>Data Definitions</b>
        calcofi4db: ingest/{provider}/{dataset}/:
        <ul>
          <li>tbls_redefine.csv</li>
          <li>flds_redefine.csv</li>
        </ul>"]
    db[("<b>Database</b>")]
    api["<b>API Endpoint</b>
         /db_tables
         /db_columns"]
    catalog["<b>R Function</b>
             calcofi4r::cc_db_catalog()"]
    eml["<b>Publish Workflow</b>
      workflows: publish_{dataset}_{portal}.qmd
      with {portal}s:
      <ul>
        <li>erddap</li>
        <li>edi</li>
        <li>obis</li>
        <li>ncei</li>
      </ul>"]

    %% Edge definitions
    gd --> iw
    iw -->|"1 auto-generated"| dd
    dd -->|"2 manual edit"| iw
    iw -->|"3 data"| db
    iw --> comments
    comments -->|"4 metadata"| db
    db --> api
    api --> catalog
    db --> eml

    %% Comments subgraph with internal nodes
    subgraph comments["<b>Database Comments</b>
              (stored as text in JSON format to differentiate elements)"]
        direction TB
        h["hideme"]:::hidden
        h~~~tbl
        h~~~fld
        tbl["per <em>Table</em>:
            <ul>
              <li>description</li>
              <li>source (<em>linked</em>)</li>
              <li>source_created (<em>datetime</em>)</li>
              <li>workflow (<em>linked</em>)</li>
              <li>workflow_ingested (<em>datetime</em>)</li>
            </ul>"]
        fld["per <em>Field</em>:
            <ul>
              <li>description</li>
              <li>units (SI)`</li>
            </ul>"]
    end

    %% Clickable links
    click gd "https://drive.google.com/drive/folders/1xxdWa4mWkmfkJUQsHxERTp9eBBXBMbV7" "calcofi folder - Google Drive"
    click api "https://api.calcofi.io/db_tables" "API endpoint</b>"
    click catalog "https://calcofi.io/calcofi4r/reference/cc_db_catalog.html" "R package function"

    %% Styling
    classDef source fill:#f9f9f9,stroke:#000,stroke-width:2px,color:#000
    classDef process fill:#a3e0f2,stroke:#000,stroke-width:2px,color:#000
    classDef eml fill:#F0FDF4,stroke:#22C55E,stroke-width:2px,color:#000,text-align:left
    classDef data fill:#ffbe75,stroke:#000,stroke-width:2px,color:#000
    classDef api fill:#9ad294,stroke:#000,stroke-width:2px,color:#000
    classDef meta fill:#c9a6db,stroke:#000,stroke-width:2px,color:#000,text-align:left
    classDef hidden display: none;

    class gd source
    class dd,comments,tbl,fld meta
    class iw process
    class db data
    class api,catalog api
    class tbl,fld li
    class eml eml
Figure 5.1: Integrated database ingestion scheme.

5.3.3 Using calcofi4db package

The calcofi4db package provides streamlined functions for dataset ingestion.

5.3.4 Release versioning

Each rendered release_database.qmd writes a frozen, immutable release to gs://calcofi-db/ducklake/releases/{version}/ where version is vYYYY.MM.DD. The release directory contains:

  • parquet/ — table files (single .parquet or hive-partitioned directories)
  • catalog.json — structural manifest (rows, partitioned, total_size)
  • relationships.json — primary keys and foreign keys
  • metadata.json — table and column descriptions, units, dataset block, measurement-type registry
  • RELEASE_NOTES.md — human-readable summary

A bucket-level versions.json lists every release and latest.txt points at the most recent one.

5.3.5 Metadata and documentation

Sources of truth for table and column descriptions live in this repo (not in the database), so they are reviewable via PRs and travel with the code that produced them:

  • metadata/{provider}/{dataset}/tbls_redefine.csvtbl_new, tbl_description
  • metadata/{provider}/{dataset}/flds_redefine.csvtbl_new, fld_new, fld_description, units
  • metadata/{provider}/{dataset}/metadata_derived.csv — optional markdown/units overlay for derived tables and columns
  • metadata/release_tables.csv and metadata/release_columns.csv — for tables built inside release_database.qmd itself (cruise_summary, _spatial, _spatial_attr, …)
  • metadata/measurement_type.csv — canonical measurement types with units
  • metadata/dataset.csv — dataset-level descriptions, citations, links

These flow through two stages:

  1. Per-ingest: calcofi4db::build_metadata_json() (called inside finalize_ingest()) writes data/parquet/{provider}_{dataset}/metadata.json (schema version 1.0) alongside the parquet outputs.

  2. Per-release: calcofi4db::merge_metadata_json() (called inside release_database.qmd) merges every per-ingest sidecar plus the release-only registries and writes gs://calcofi-db/ducklake/releases/{version}/metadata.json (schema version 1.1) with this shape:

    {
      "schema_version": "1.1",
      "release_version": "v2026.05.14",
      "release_date":    "2026-05-14",
      "datasets":          { "calcofi_bottle": { ... } },
      "tables":            { "bottle":         { "name_long": "Bottle",
                                                 "description_md": "...",
                                                 "provider": "calcofi",
                                                 "dataset":  "bottle" } },
      "columns":           { "bottle.depth_m": { "name_long": "Depth M",
                                                 "units":          "meters",
                                                 "description_md": "Bottle depth" } },
      "measurement_types": { "temperature":    { "description": "...",
                                                 "units":       "degC",
                                                 "is_canonical": true } }
    }

Markdown is supported anywhere in description_md; consumers should render it through a markdown parser when displaying.

5.3.6 Consuming descriptions

From R, fetch the descriptions either column-by-column or as a full interactive catalog. Both calls hit the release metadata.json sidecar:

library(calcofi4r)

# schema + descriptions + units for a single table
tbl <- cc_describe_table("bottle")
attr(tbl, "description_md")    # table-level description

# interactive datatable of every table and column in the release
cc_db_catalog()
cc_db_catalog(tables = c("bottle", "ichthyo"))
cc_db_catalog(version = "v2026.05.14")

When writing a new ingest, populate tbls_redefine.csv and flds_redefine.csv (especially fld_description and units) before running the ingest notebook — build_metadata_json() reads them directly. The Claude Code /generate-metadata and /validate-ingest skills enforce this.

5.3.7 Publishing to portals

Workflows in workflows/publish_*.qmd push selected tables to external portals (ERDDAP, EDI, OBIS, NCEI). They read the release metadata.json to assemble Ecological Metadata Language (EML) and ERDDAP datasets.xml entries, so descriptions stay in sync across portals without re-keying.

5.4 Relationships between tables

5.5 Spatial Tips