[Figure: data flow diagram. Source Data (Google Drive: calcofi/data/{provider}/{dataset}/*.csv) feeds an Ingest Workflow (workflows: ingest_{provider}_{dataset}.qmd), which (1) auto-generates Data Definitions (/ingest/{provider}/{dataset}/: tbls_redefine.csv, flds_redefine.csv) that are (2) manually edited and fed back. The workflow then writes (3) data and (4) metadata to the Database, with Database Comments stored as text in JSON format: per Table (description, source (linked), source_created (datetime), workflow (linked), workflow_ingested (datetime)) and per Field (description, units (SI)). The Database is exposed by the API endpoints /db_tables and /db_columns, consumed by the R function calcofi4r::cc_db_catalog(), and published by Publish Workflows (publish_{dataset}_{portal}.qmd) to the portals: erddap, edi, obis, ncei.]
5 Database
5.1 Database naming conventions
There are only two hard things in Computer Science: cache invalidation and naming things. – Phil Karlton (Netscape architect)
We’re still converging on the best naming conventions. Here are some ideas:
5.1.1 Name tables
- Table names are singular and use all lower case.
5.1.2 Name columns
- Name columns in snake-case (i.e., lower-case with underscores) to avoid having to quote identifiers in SQL statements. (TIP: Use janitor::clean_names() to convert a table; see the sketch after this list.)
- Unique identifiers are suffixed with:
- Suffix with units where applicable (e.g., *_m for meters, *_km for kilometers, *_degc for degrees Celsius). See the units vignette.
- Set the geometry column to geom (used by the PostGIS spatial extension). If the table has multiple geometry columns, use geom for the default geometry column and geom_{type} for additional geometry columns (e.g., geom_point, geom_line, geom_polygon).
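Where source files arrive with inconsistent headers, janitor::clean_names() handles the snake-case conversion. Below is a minimal sketch with made-up column names (not an actual CalCOFI table):

library(janitor)
library(tibble)

# made-up example with messy column names
df <- tibble(
  `Station ID`   = c("093.3 030.0", "090.0 053.0"),
  `Temp (Deg C)` = c(15.2, 14.8),
  `Depth M`      = c(10, 20))

# convert all column names to snake_case
df <- clean_names(df)
names(df)
#> [1] "station_id" "temp_deg_c" "depth_m"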
5.2 Use Unicode for text
The default character encoding for PostgreSQL is Unicode (UTF8), which allows for international characters, accents, and special characters. Improper encoding can royally mess up basic text.
Logging into the server, we can see this with the following command:
docker exec -it postgis psql -l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
--------------------+-------+----------+------------+------------+-------------------
gis | admin | UTF8 | en_US.utf8 | en_US.utf8 | =Tc/admin +
| | | | | admin=CTc/admin +
| | | | | ro_user=c/admin
lter_core_metabase | admin | UTF8 | en_US.utf8 | en_US.utf8 | =Tc/admin +
| | | | | admin=CTc/admin +
| | | | | rw_user=c/admin
postgres | admin | UTF8 | en_US.utf8 | en_US.utf8 |
template0 | admin | UTF8 | en_US.utf8 | en_US.utf8 | =c/admin +
| | | | | admin=CTc/admin
template1 | admin | UTF8 | en_US.utf8 | en_US.utf8 | =c/admin +
| | | | | admin=CTc/admin
template_postgis | admin | UTF8 | en_US.utf8 | en_US.utf8 |
(6 rows)
Use Unicode encoding (utf-8 in Python, UTF8 in PostgreSQL) for all database text values to support international characters and documentation formatting (e.g., tabs and other characters used in markdown conversion).
In Python, use pandas to read (read_csv()) and write (to_csv()) with UTF-8 encoding (i.e., encoding='utf-8'):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost:5432/dbname')

# read from a csv file with UTF-8 encoding
df = pd.read_csv('file.csv', encoding='utf-8')

# write to PostgreSQL (encoding is handled by the database connection)
df.to_sql('table_name', engine, if_exists='replace', index=False, method='multi', chunksize=1000)

# read from PostgreSQL
df = pd.read_sql('SELECT * FROM table_name', engine)

# write to a csv file with UTF-8 encoding
df.to_csv('file.csv', index=False, encoding='utf-8')
In R, use readr to read (read_csv()) and write (write_excel_csv()) CSV files, which forces UTF-8 encoding:

library(readr)
library(DBI)
library(RPostgres)

# connect to PostgreSQL
con <- dbConnect(
  RPostgres::Postgres(),
  dbname = "dbname", host = "localhost", port = 5432,
  user = "user", password = "password")

# read from a csv file
df <- read_csv('file.csv', locale = locale(encoding = 'UTF-8'))  # explicit
df <- read_csv('file.csv')                                       # implicit (UTF-8 is the default)

# write to PostgreSQL
dbWriteTable(con, 'table_name', df, overwrite = TRUE)

# read from PostgreSQL
df <- dbReadTable(con, 'table_name')

# write to a csv file (write_excel_csv() always writes UTF-8, with a byte order mark for Excel)
write_excel_csv(df, 'file.csv')
5.3 Ingest datasets with documentation
Use Quarto documents with chunks of R code in the workflows GitHub repository to ingest datasets into the database. For example, see the ingest_noaa-calcofi-db workflow.
5.3.1 Using calcofi4db package
The calcofi4db package provides functions to streamline dataset ingestion, metadata generation, and change detection. The standard workflow is:
- Load data files: load_csv_files() reads CSV files from a directory and prepares them for ingestion
- Transform data: transform_data() applies transformations according to the redefinition files
- Detect changes: detect_csv_changes() compares data with existing database tables
- Ingest data: ingest_csv_to_db() writes data to the database with proper metadata
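These steps can also be run one at a time, e.g., to review detected changes before writing anything to the database. The sketch below strings the four functions together; the argument names are illustrative assumptions (only ingest_dataset() is shown in detail below), so consult the calcofi4db reference for the exact signatures:

library(calcofi4db)
library(DBI)
library(RPostgres)

con <- dbConnect(
  Postgres(), dbname = "gis", host = "localhost", port = 5432,
  user = "admin", password = "postgres")

# 1. read the CSV files for a provider/dataset (argument names assumed)
d <- load_csv_files(
  provider = "swfsc.noaa.gov", dataset = "calcofi-db", dir_data = "/path/to/data")

# 2. apply renames/descriptions from tbls_redefine.csv and flds_redefine.csv (assumed call)
d <- transform_data(d)

# 3. compare against what is already in the database before writing (assumed call)
changes <- detect_csv_changes(con, d)
changes

# 4. write the data plus table/field COMMENT metadata (assumed call)
ingest_csv_to_db(con, d, schema = "public")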
For convenience, the high-level ingest_dataset() function combines these steps:
library(calcofi4db)
library(DBI)
library(RPostgres)
# Connect to database
con <- dbConnect(
  Postgres(),
  dbname = "gis",
  host = "localhost",
  port = 5432,
  user = "admin",
  password = "postgres"
)

# Ingest a dataset
result <- ingest_dataset(
  con = con,
  provider = "swfsc.noaa.gov",
  dataset = "calcofi-db",
  dir_data = "/path/to/data",
  schema = "public",
  dir_googledata = "https://drive.google.com/drive/folders/your-folder-id",
  email = "your.email@example.com"
)

# Examine changes and results
result$changes
result$stats
5.3.2 Workflow details
Google Drive *.csv files get ingested with a workflow per dataset (a Quarto document in the GitHub repository calcofi/workflows). Data definition CSV files (tbls_redefine.csv, flds_redefine.csv) are auto-generated (if missing) and manually updated to rename and describe tables and fields. After ingesting the data for each of the tables, extra metadata is added to the COMMENTs of each table as JSON elements (with links in markdown), including at the table level:
- description: general description of the contents and how each row is unique
- source: CSV (linked to Google Drive source as markdown)
- source_created: datetime stamp of when the source was created on Google Drive
- workflow: html (rendered Quarto document on GitHub)
- workflow_ingested: datetime of ingestion
And at the field level:
- description: general description of the field
- units: using the International System of Units (SI) as much as possible
These comments are then exposed by the API db_tables endpoint, which can be consumed and rendered into a tabular searchable catalog with calcofi4r::cc_db_catalog.
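For illustration, here is how such a JSON comment could be written and read back with DBI. The table name public.tow, the example values, and the exact serialization are assumptions (the real comments are written by the ingest workflow), but the JSON elements mirror the lists above:

library(DBI)
library(jsonlite)

# hypothetical metadata for a hypothetical table `public.tow`
comment_json <- toJSON(list(
  description       = "net tows; one row per unique tow",
  source            = "[tow.csv](https://drive.google.com/...)",
  source_created    = "2024-01-15 10:30:00",
  workflow          = "[ingest_swfsc.noaa.gov_calcofi-db](https://calcofi.io/workflows/...)",
  workflow_ingested = format(Sys.time(), "%Y-%m-%d %H:%M:%S")
), auto_unbox = TRUE)

# store the JSON text as the table COMMENT (assumes an open DBI connection `con`)
dbExecute(con, paste(
  "COMMENT ON TABLE public.tow IS",
  dbQuoteString(con, as.character(comment_json))))

# read the comment back from the PostgreSQL catalog
dbGetQuery(con, "SELECT obj_description('public.tow'::regclass) AS tbl_comment")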
5.3.3 Change detection strategy
The calcofi4db package implements a comprehensive change-detection strategy:
- Table changes:
  - New tables are identified for initial creation
  - Existing tables are identified for potential updates
- Field changes:
  - Added fields: new columns in the CSV not present in the database
  - Removed fields: columns in the database not present in the CSV
  - Type changes: fields with different data types between the CSV and the database
- Data changes:
  - Row counts are compared between source and destination
  - Data comparison is handled with checksum verification
If changes are detected, they are displayed to the user, who can decide whether to:
- Create new tables
- Modify existing table schemas
- Update data with appropriate strategies (append, replace, merge)
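As a rough illustration of the row-count and checksum idea (not the calcofi4db implementation), assuming an open connection con, a source CSV, and a destination table:

library(DBI)
library(readr)
library(digest)

# hypothetical source CSV and destination table names
csv_path <- "file.csv"
tbl      <- "table_name"

df_csv <- read_csv(csv_path, show_col_types = FALSE)
df_db  <- dbReadTable(con, tbl)

# compare row counts
nrow(df_csv) == nrow(df_db)

# compare checksums of the normalized contents;
# note: column types must also match (e.g., integer vs double) for the checksums to agree
checksum <- function(df) {
  df <- as.data.frame(df)                       # drop tibble vs data.frame differences
  df <- df[order(names(df))]                    # consistent column order
  df <- df[do.call(order, df), , drop = FALSE]  # consistent row order
  rownames(df) <- NULL
  digest(df, algo = "md5")
}
checksum(df_csv) == checksum(df_db)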
Additional workflows will publish the data to the various portals (ERDDAP, EDI, OBIS, NCEI) using Ecological Metadata Language (EML) and the EML R package, pulling directly from the structured metadata in the database (table and field definitions).
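As a bare-bones illustration of the EML side (the title, person, and identifier below are placeholders; the real publish workflows will populate these from the database comments), the EML R package can assemble and validate a document like so:

library(EML)

# placeholder creator/contact (would come from provider metadata)
person <- list(individualName = list(givenName = "Jane", surName = "Doe"))

doc <- list(
  packageId = "placeholder-id",
  system    = "uuid",
  dataset   = list(
    title   = "CalCOFI example dataset",
    creator = person,
    contact = person,
    pubDate = format(Sys.Date())))

# write the EML document and check it against the schema
write_eml(doc, "example_eml.xml")
eml_validate("example_eml.xml")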
5.3.4 OR Describe tables and columns directly
Use the COMMENT clause to add descriptions to tables and columns, either through the GUI pgadmin.calcofi.io (by right-clicking on the table or column and selecting Properties) or with SQL. For example:

COMMENT ON TABLE public.aoi_fed_sanctuaries IS 'areas of interest (`aoi`) polygons for federal **National Marine Sanctuaries**; loaded by _workflow_ [load_sanctuaries](https://calcofi.io/workflows/load_sanctuaries.html)';
Note the use of markdown for including links and formatting (e.g., bold, code, italics), such that the above SQL will render like so:
areas of interest (aoi) polygons for federal National Marine Sanctuaries; loaded by workflow load_sanctuaries

It is especially helpful to link to any workflows responsible for ingesting or updating the input data.
5.3.5 Display tables and columns with metadata
- These descriptions can be viewed in the CalCOFI API api.calcofi.io as CSV tables (see code in calcofi/api: plumber.R):
  - api.calcofi.io/db_tables fields:
    - schema: (only “public” so far)
    - table_type: “table”, “view”, or “materialized view” (none yet)
    - table: name of table
    - table_description: description of table (possibly in markdown)
  - api.calcofi.io/db_columns fields:
    - schema: (only “public” so far)
    - table_type: “table”, “view”, or “materialized view” (none yet)
    - table: name of table
    - column: name of column
    - column_type: data type of column
    - column_description: description of column (possibly in markdown)
- Fetch and display these descriptions in an interactive table with calcofi4r::cc_db_catalog().
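A minimal sketch of both routes, assuming the endpoints return CSV by default and that cc_db_catalog() can be called with its defaults (see its reference page for arguments):

library(readr)
library(calcofi4r)

# fetch the raw table- and column-level descriptions from the API as data frames
d_tbls <- read_csv("https://api.calcofi.io/db_tables")
d_cols <- read_csv("https://api.calcofi.io/db_columns")

# render an interactive, searchable catalog of the same metadata
cc_db_catalog()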
5.4 Relationships between tables
TODO: add calcofi/apps: db to show latest tables, columns and relationships
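In the meantime, foreign-key relationships can be listed straight from the PostgreSQL information schema, assuming an open DBI connection con:

library(DBI)

# list foreign-key relationships: which column in which table references which
dbGetQuery(con, "
  SELECT
    tc.table_name   AS tbl,
    kcu.column_name AS col,
    ccu.table_name  AS ref_tbl,
    ccu.column_name AS ref_col
  FROM information_schema.table_constraints tc
  JOIN information_schema.key_column_usage kcu
    ON tc.constraint_name = kcu.constraint_name AND tc.table_schema = kcu.table_schema
  JOIN information_schema.constraint_column_usage ccu
    ON tc.constraint_name = ccu.constraint_name AND tc.table_schema = ccu.table_schema
  WHERE tc.constraint_type = 'FOREIGN KEY' AND tc.table_schema = 'public'
  ORDER BY tbl, col")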
5.5 Spatial Tips
- Use ST_Subdivide() when running spatial joins on large polygons.
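For example, a point-in-polygon count against a large, complex polygon layer is typically much faster when the polygons are first subdivided, because the smaller pieces have tighter bounding boxes for the spatial index. This sketch uses hypothetical table names (large_polys, pts) and an open DBI connection con:

library(DBI)

# subdivide large polygons (max 255 vertices each) before the spatial join
dbGetQuery(con, "
  WITH polys AS (
    SELECT id, ST_Subdivide(geom, 255) AS geom
    FROM   large_polys
  )
  SELECT p.id, COUNT(pt.*) AS n_points
  FROM   polys p
  JOIN   pts pt ON ST_Intersects(p.geom, pt.geom)
  GROUP  BY p.id")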