5 Database

5.1 Database naming conventions

There are only two hard things in Computer Science: cache invalidation and naming things. – Phil Karlton (Netscape architect)

We’re circling the wagons to come up with the best conventions for naming. Here are some ideas:

5.1.1 Name tables

Table names are singular and use all lower case.

5.1.2 Name columns

To name columns, use snake-case (i.e., lower-case with underscores) so as to prevent the need to quote SQL statements. (TIP: Use janitor::clean_names() to convert a table.)
Unique identifiers are suffixed with:
- *_id for unique integer keys;
- *_uuid for universally unique identifiers as defined by RFC 4122 and stored in Postgres as UUID Type.
- *_key for unique string keys;
- *_seq for auto-incrementing sequence integer keys.
Suffix with units where applicable (e.g., *_m for meters, *_km for kilometers, degc for degrees Celsius). See units vignette.
Set geometry column to geom (used by PostGIS spatial extension). If the table has multiple geometry columns, use geom for the default geometry column and geom_{type} for additional geometry columns (e.g., geom_point, geom_line, geom_polygon).

5.2 Use Unicode for text

The default character encoding for Postgresql is unicode (UTF8), which allows for international characters, accents and special characters. Improper encoding can royally mess up basic text.

Logging into the server, we can see this with the following command:

docker exec -it postgis psql -l

                                  List of databases
        Name        | Owner | Encoding |  Collate   |   Ctype    | Access privileges 
--------------------+-------+----------+------------+------------+-------------------
 gis                | admin | UTF8     | en_US.utf8 | en_US.utf8 | =Tc/admin        +
                    |       |          |            |            | admin=CTc/admin  +
                    |       |          |            |            | ro_user=c/admin
 lter_core_metabase | admin | UTF8     | en_US.utf8 | en_US.utf8 | =Tc/admin        +
                    |       |          |            |            | admin=CTc/admin  +
                    |       |          |            |            | rw_user=c/admin
 postgres           | admin | UTF8     | en_US.utf8 | en_US.utf8 | 
 template0          | admin | UTF8     | en_US.utf8 | en_US.utf8 | =c/admin         +
                    |       |          |            |            | admin=CTc/admin
 template1          | admin | UTF8     | en_US.utf8 | en_US.utf8 | =c/admin         +
                    |       |          |            |            | admin=CTc/admin
 template_postgis   | admin | UTF8     | en_US.utf8 | en_US.utf8 | 
(6 rows)

Use Unicode (utf-8 in Python or UTF8 in Postgresql) encoding for all database text values to support international characters and documentation (i.e., tabs, etc for markdown conversion).

In Python, use pandas to read (read_csv()) and write (to_csv()) with UTF-8 encoding (i.e., encoding='utf-8').:

import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:password@localhost:5432/dbname')

# read from a csv file
df = pd.read_csv('file.csv', encoding='utf-8')

# write to PostgreSQL
df.to_sql('table_name', engine, if_exists='replace', index=False, method='multi', chunksize=1000, encoding='utf-8')

# read from PostgreSQL
df = pd.read_sql('SELECT * FROM table_name', engine, encoding='utf-8')

# write to a csv file with UTF-8 encoding
df.to_csv('file.csv', index=False, encoding='utf-8')

In R, use readr to read (read_csv()) and write (write_excel_csv()) to force UTF-8 encoding.

library(readr)
library(DBI)
library(RPostgres)

# connect to PostgreSQL
con <- dbConnect(RPostgres::Postgres(), dbname = "dbname", host = "localhost", port = 5432, user = "user", password = "password")

# read from a csv file
df <- read_csv('file.csv', locale = locale(encoding = 'UTF-8'))  # explicit
df <- read_csv('file.csv')                                       # implicit

# write to PostgreSQL
dbWriteTable(con, 'table_name', df, overwrite = TRUE)

# read from PostgreSQL
df <- dbReadTable(con, 'table_name')

# write to a csv file with UTF-8 encoding
write_excel_csv(df, 'file.csv', locale = locale(encoding = 'UTF-8'))  # explicit
write_excel_csv(df, 'file.csv')                                       # implicit

5.3 Ingest datasets with documentation

Use Quarto documents with chunks of R code in the workflows Github repository to ingest datasets into the database. For example, see the ingest_noaa-calcofi-db workflow.

flowchart TB
    %% Node definitions
    gd[("`<b>Source Data</b>
          Google Drive:
          calcofi/data/{provider}/{dataset}/*.csv`")]
    iw["<b>Ingest Workflow</b>
        workflows: ingest_{provider}_{dataset}.qmd"]
    dd["<b>Data Definitions</b>
        workflows: /ingest/{provider}/{dataset}/:
        <ul>
          <li>tbls_redefine.csv</li>
          <li>flds_redefine.csv</li>
        </ul>"]
    db[("<b>Database</b>")]
    api["<b>API Endpoint</b>
         /db_tables
         /db_columns"]
    catalog["<b>R Function</b>
             calcofi4r::cc_db_catalog()"]
    eml["<b>Publish Workflow</b>
      workflows: publish_{dataset}_{portal}.qmd
      with {portal}s:
      <ul>
        <li>erddap</li>
        <li>edi</li>
        <li>obis</li>
        <li>ncei</li>
      </ul>"]

    %% Edge definitions
    gd --> iw
    iw -->|"1 auto-generated"| dd
    dd -->|"2 manual edit"| iw
    iw -->|"3 data"| db
    iw --> comments
    comments -->|"4 metadata"| db
    db --> api
    api --> catalog
    db --> eml

    %% Comments subgraph with internal nodes
    subgraph comments["<b>Database Comments</b>
              (stored as text in JSON format to differentiate elements)"]
        direction TB
        h["hideme"]:::hidden
        h~~~tbl
        h~~~fld
        tbl["per <em>Table</em>:
            <ul>
              <li>description</li>
              <li>source (<em>linked</em>)</li>
              <li>source_created (<em>datetime</em>)</li>
              <li>workflow (<em>linked</em>)</li>
              <li>workflow_ingested (<em>datetime</em>)</li>
            </ul>"]
        fld["per <em>Field</em>:
            <ul>
              <li>description</li>
              <li>units (SI)`</li>
            </ul>"]
    end

    %% Clickable links
    click gd "https://drive.google.com/drive/folders/1xxdWa4mWkmfkJUQsHxERTp9eBBXBMbV7" "calcofi folder - Google Drive"
    click api "https://api.calcofi.io/db_tables" "API endpoint</b>"
    click catalog "https://calcofi.io/calcofi4r/reference/cc_db_catalog.html" "R package function"

    %% Styling
    classDef source fill:#f9f9f9,stroke:#000,stroke-width:2px,color:#000
    classDef process fill:#a3e0f2,stroke:#000,stroke-width:2px,color:#000
    classDef eml fill:#F0FDF4,stroke:#22C55E,stroke-width:2px,color:#000,text-align:left
    classDef data fill:#ffbe75,stroke:#000,stroke-width:2px,color:#000
    classDef api fill:#9ad294,stroke:#000,stroke-width:2px,color:#000
    classDef meta fill:#c9a6db,stroke:#000,stroke-width:2px,color:#000,text-align:left
    classDef hidden display: none;

    class gd source
    class dd,comments,tbl,fld meta
    class iw process
    class db data
    class api,catalog api
    class tbl,fld li
    class eml eml

Figure 5.1: Database documentation scheme.

5.3.1 Using calcofi4db package

The calcofi4db package provides functions to streamline dataset ingestion, metadata generation, and change detection. The standard workflow is:

Load data files: load_csv_files() reads CSV files from a directory and prepares them for ingestion
Transform data: transform_data() applies transformations according to redefinition files
Detect changes: detect_csv_changes() compares data with existing database tables
Ingest data: ingest_csv_to_db() writes data to the database with proper metadata

For convenience, the high-level ingest_dataset() function combines these steps:

library(calcofi4db)
library(DBI)
library(RPostgres)

# Connect to database
con <- dbConnect(
  Postgres(),
  dbname = "gis",
  host = "localhost",
  port = 5432,
  user = "admin",
  password = "postgres"
)

# Ingest a dataset
result <- ingest_dataset(
  con = con,
  provider = "swfsc.noaa.gov",
  dataset = "calcofi-db",
  dir_data = "/path/to/data",
  schema = "public",
  dir_googledata = "https://drive.google.com/drive/folders/your-folder-id",
  email = "your.email@example.com"
)

# Examine changes and results
result$changes
result$stats

5.3.2 Workflow details

Google Drive *.csv files get ingested with a workflow per dataset (in Github repository calcofi/workflows as a Quarto document). Data definition CSV files (tbls_redefine.csv , flds_redefine.csv) are auto-generated (if missing) and manually updated to rename and describe tables and fields. After injecting the data for each of the tables, extra metadata is added to the COMMENTs of each table as JSON elements (links in markdown), including at the table level:

description: general description describing contents and how each row is unique
source: CSV (linked to Google Drive source as markdown)
source_created: datetime stamp of when source was created on GoogleDrive
workflow: html (rendered Quarto document on Github)
workflow_ingested: datetime of ingestion

And at the field level:

description: general description of the field
units: using the International System of Units (SI) as much as possible

These comments are then exposed by the API db_tables endpoint, which can be consumed and rendered into a tabular searchable catalog with calcofi4r::cc_db_catalog.

5.3.3 Change detection strategy

The calcofi4db package implements a comprehensive change detection strategy:

Table changes:
- New tables are identified for initial creation
- Existing tables are identified for potential updates
Field changes:
- Added fields: New columns in CSV not present in the database
- Removed fields: Columns in database not present in the CSV
- Type changes: Fields with different data types between CSV and database
Data changes:
- Row counts are compared between source and destination
- Data comparison is handled with checksum verification

If changes are detected, they are displayed to the user who can decide whether to: - Create new tables - Modify existing table schemas - Update data with appropriate strategies (append, replace, merge)

Additional workflows will publish the data to the various Portals (ERDDAP, EDI, OBIS, NCEI) using ecological metadata language (EML) and the EML R package, pulling directly from the structured metadata in the database (on table and field definitions).

5.3.4 OR Describe tables and columns directly

Use the COMMENT clause to add descriptions to tables and columns, either through the GUI pgadmin.calcofi.io (by right-clicking on the table or column and selecting Properties) or with SQL. For example:

COMMENT ON TABLE public.aoi_fed_sanctuaries IS 'areas of interest (`aoi`) polygons for federal **National Marine Sanctuaries**; loaded by _workflow_ [load_sanctuaries](https://calcofi.io/workflows/load_sanctuaries.html)';

Note the use of markdown for including links and formatting (e.g., bold, code, italics), such that the above SQL will render like so:

areas of interest (aoi) polygons for federal National Marine Sanctuaries; loaded by workflow load_sanctuaries
It is especially helpful to link to any workflows that are responsible for the ingesting or updating of the input data.

5.3.5 Display tables and columns with metadata

These descriptions can be viewed in the CalCOFI API api.calcofi.io as CSV tables (see code in calcofi/api: plumber.R):
- api.calcofi.io/db_tables
  fields:
  - schema: (only “public” so far)
  - table_type: “table”, “view”, or “materialized view” (none yet)
  - table: name of table
  - table_description: description of table (possibly in markdown)
- api.calcofi.io/db_columns
  fields:
  - schema: (only “public” so far)
  - table_type: “table”, “view”, or “materialized view” (none yet)
  - table: name of table
  - column: name of column
  - column_type: data type of column
  - column_description: description of column (possibly in markdown)
Fetch and display these descriptions into an interactive table with calcofi4r::cc_db_catalog().

5.4 Relationships between tables

See calcofi/workflows: clean_db
TODO: add calcofi/apps: db to show latest tables, columns and relationsips

5.5 Spatial Tips

Use ST_Subdivide() when running spatial joins on large polygons.