Goal: Systematically evaluate and clean-up the database, including renaming tables and columns, and documenting relationships, per naming conventions being established in CalCOFI/docs: Database.

Code
source(here::here("../apps_dev/libs/db.R")) # con: database connection 

librarian::shelf(
  calcofi/calcofi4r,
  dm, DT, glue, purrr, stringr, tidyr)

1 Transitionary Tables

Let’s evaluate tables that are in transition, i.e. containing in the table name: “old”, “new”, “backup”, “temp”, “temp”, or “test”.

Code
d_tbls <- tibble(
  table = dbListTables(con)) |> 
  mutate(
    subset = case_when(
      str_detect(table, "old")      ~ "old",
      str_detect(table, "new")      ~ "new",
      str_detect(table, "backup")   ~ "backup",
      str_detect(table, "temp|tmp") ~ "temp",
      str_detect(table, "test")     ~ "test",
      TRUE ~ "base")) |> 
  arrange(subset, table)
table(d_tbls$subset)

backup   base    new    old   temp   test 
     3     59      4      4      2      1 
Code
d_tbls |> 
  filter(subset != "base")
# A tibble: 14 × 2
   table                subset
   <chr>                <chr> 
 1 egg_counts_backup    backup
 2 larvae_counts_backup backup
 3 nets_backup          backup
 4 newnet2ctdcast       new   
 5 scrippscast_new      new   
 6 species_codes_new    new   
 7 tow_types_new        new   
 8 egg_counts_old       old   
 9 larvae_counts_old    old   
10 stations_old         old   
11 tows_old             old   
12 tempnet2ctd          temp  
13 tmp_out              temp  
14 test_rast            test  
Code
tbls_other <- d_tbls |> 
  filter(subset != "base") |> 
  pull(table)
cc_db_catalog(tbls_other)
  • TODO: Evaluate these tables and decide what to do with them, i.e. rename or drop.

2 Base Tables

Code
tbls_base <- d_tbls |> 
  filter(subset == "base") |> 
  pull(table)
tibble(
  table = tbls_base)
# A tibble: 59 × 1
   table              
   <chr>              
 1 aoi_fed_sanctuaries
 2 core_stations      
 3 cruises            
 4 ctd_bottles        
 5 ctd_cast_bottles   
 6 ctd_casts          
 7 ctd_dic            
 8 ctd_profiles       
 9 dataset_keywords   
10 dataset_methods    
# ℹ 49 more rows
Code
cc_db_catalog(tables = tbls_base)

2.1 Proposed renames

  • Generalize from “species” to “taxa” (or “taxon” singular), since not all species_codes are at taxonomic level of species.

  • Follow conventions at calcofi.io/docs: Database.

  • See CalCOFI/workflows: modify_db.R for previous SQL commands, including renames, and dates.

  • TODO: Consider other changes and how to document. For instance, perhaps track changes in the pg_dump of only data definition language (DDL) for the whole database (i.e., CREATE TABLE SQL statements), and see git differences over time.

    pg_dump -U user_name -h host database -s -t table_or_view_names -f table_or_view_names.sql
Code
# TODO: species_codes -> taxa
q("ALTER TABLE species_codes RENAME TO taxa")
# TODO: species.spccode -> sp_key
q("ALTER TABLE taxa RENAME COLUMN spccode TO taxa_key")

3 Relationships (dm R package)

3.1 Relationships, "keys_only"

Code
con_dm <- dm_from_con(
  con, 
  table_names = tbls_base,
  learn_keys  = T)
con_dm
── Table source ────────────────────────────────────────────────────────────────
src:  postgres  [admin@localhost:5432/gis]
── Metadata ────────────────────────────────────────────────────────────────────
Tables: `aoi_fed_sanctuaries`, `core_stations`, `cruises`, `ctd_bottles`, `ctd_cast_bottles`, … (59 total)
Columns: 569
Primary keys: 13
Foreign keys: 5
Code
dm_draw(con_dm, view_type = "keys_only")

3.2 Relationships, "all"

This is too much.

Code
# TODO: make SVG zoomable
dm_draw(con_dm, view_type = "all")

3.3 Show existing unique keys across tables

Code
dm_get_all_uks(con_dm)
# A tibble: 13 × 3
   table           uk_col     kind 
   <chr>           <keys>     <chr>
 1 cruises         cruise_id  PK   
 2 ctd_bottles     btl_cnt    PK   
 3 ctd_casts       cast_count PK   
 4 grd_a           rid        PK   
 5 grd_gcs         rid        PK   
 6 grd_mer         rid        PK   
 7 nets            netid      PK   
 8 r_mer100km      rid        PK   
 9 r_sta_cnt       rid        PK   
10 spatial_ref_sys srid       PK   
11 stations        stationid  PK   
12 tows            towid      PK   
13 z_idw           rid        PK   

3.4 Identify candidate keys

Code
# candidate primary keys
dm_enum_pk_candidates(con_dm, "species_codes")
# A tibble: 6 × 3
  columns         candidate why                                                 
  <keys>          <lgl>     <chr>                                               
1 id              TRUE      ""                                                  
2 spccode         TRUE      ""                                                  
3 itis_tsn        FALSE     "has 2 missing values, and duplicate values: 82590 …
4 common_name     FALSE     "has 421 missing values, and duplicate values: Drea…
5 scientific_name FALSE     "has duplicate values: None (6), Octopodidae (5), A…
6 taxon_rank      FALSE     "has duplicate values: Species (743), Genus (262), …
Code
# candidate foreign keys
try(
  dm_enum_fk_candidates(con_dm, "species_codes", "species_groups"))
Error in abort_ref_tbl_has_no_pk(ref_table_name) : 
  ref_table `species_groups` needs a primary key first. Use `dm_enum_pk_candidates()` to find appropriate columns and `dm_add_pk()` to define a primary key.

3.5 Run Shiny app

Code
# install.packages("DiagrammeRsvg")
dm_gui(dm = con_dm)

The dm::dm_gui() Shiny app after clicking on two tables. Haven’t tried editing relationships yet.