Code
source(here::here("../apps_dev/libs/db.R")) # con: database connection
::shelf(
librarian/calcofi4r,
calcofi dm, DT, glue, purrr, stringr, tidyr)
Goal: Systematically evaluate and clean-up the database, including renaming tables and columns, and documenting relationships, per naming conventions being established in CalCOFI/docs: Database.
source(here::here("../apps_dev/libs/db.R")) # con: database connection
::shelf(
librarian/calcofi4r,
calcofi dm, DT, glue, purrr, stringr, tidyr)
Let’s evaluate tables that are in transition, i.e. containing in the table name: “old”, “new”, “backup”, “temp”, “temp”, or “test”.
<- tibble(
d_tbls table = dbListTables(con)) |>
mutate(
subset = case_when(
str_detect(table, "old") ~ "old",
str_detect(table, "new") ~ "new",
str_detect(table, "backup") ~ "backup",
str_detect(table, "temp|tmp") ~ "temp",
str_detect(table, "test") ~ "test",
TRUE ~ "base")) |>
arrange(subset, table)
table(d_tbls$subset)
backup base new old temp test
3 59 4 4 2 1
|>
d_tbls filter(subset != "base")
# A tibble: 14 × 2
table subset
<chr> <chr>
1 egg_counts_backup backup
2 larvae_counts_backup backup
3 nets_backup backup
4 newnet2ctdcast new
5 scrippscast_new new
6 species_codes_new new
7 tow_types_new new
8 egg_counts_old old
9 larvae_counts_old old
10 stations_old old
11 tows_old old
12 tempnet2ctd temp
13 tmp_out temp
14 test_rast test
<- d_tbls |>
tbls_other filter(subset != "base") |>
pull(table)
cc_db_catalog(tbls_other)
TODO:
Evaluate these tables and decide what to do with them, i.e. rename or drop.<- d_tbls |>
tbls_base filter(subset == "base") |>
pull(table)
tibble(
table = tbls_base)
# A tibble: 59 × 1
table
<chr>
1 aoi_fed_sanctuaries
2 core_stations
3 cruises
4 ctd_bottles
5 ctd_cast_bottles
6 ctd_casts
7 ctd_dic
8 ctd_profiles
9 dataset_keywords
10 dataset_methods
# ℹ 49 more rows
cc_db_catalog(tables = tbls_base)
Generalize from “species” to “taxa” (or “taxon” singular), since not all species_codes are at taxonomic level of species.
Follow conventions at calcofi.io/docs: Database.
See CalCOFI/workflows: modify_db.R
for previous SQL commands, including renames, and dates.
TODO:
Consider other changes and how to document. For instance, perhaps track changes in the pg_dump
of only data definition language (DDL) for the whole database (i.e., CREATE TABLE
SQL statements), and see git diff
erences over time.
pg_dump -U user_name -h host database -s -t table_or_view_names -f table_or_view_names.sql
# TODO: species_codes -> taxa
q("ALTER TABLE species_codes RENAME TO taxa")
# TODO: species.spccode -> sp_key
q("ALTER TABLE taxa RENAME COLUMN spccode TO taxa_key")
dm
R package)"keys_only"
<- dm_from_con(
con_dm
con, table_names = tbls_base,
learn_keys = T)
con_dm
── Table source ────────────────────────────────────────────────────────────────
src: postgres [admin@localhost:5432/gis]
── Metadata ────────────────────────────────────────────────────────────────────
Tables: `aoi_fed_sanctuaries`, `core_stations`, `cruises`, `ctd_bottles`, `ctd_cast_bottles`, … (59 total)
Columns: 569
Primary keys: 13
Foreign keys: 5
dm_draw(con_dm, view_type = "keys_only")
"all"
This is too much.
# TODO: make SVG zoomable
dm_draw(con_dm, view_type = "all")
dm_get_all_uks(con_dm)
# A tibble: 13 × 3
table uk_col kind
<chr> <keys> <chr>
1 cruises cruise_id PK
2 ctd_bottles btl_cnt PK
3 ctd_casts cast_count PK
4 grd_a rid PK
5 grd_gcs rid PK
6 grd_mer rid PK
7 nets netid PK
8 r_mer100km rid PK
9 r_sta_cnt rid PK
10 spatial_ref_sys srid PK
11 stations stationid PK
12 tows towid PK
13 z_idw rid PK
# candidate primary keys
dm_enum_pk_candidates(con_dm, "species_codes")
# A tibble: 6 × 3
columns candidate why
<keys> <lgl> <chr>
1 id TRUE ""
2 spccode TRUE ""
3 itis_tsn FALSE "has 2 missing values, and duplicate values: 82590 …
4 common_name FALSE "has 421 missing values, and duplicate values: Drea…
5 scientific_name FALSE "has duplicate values: None (6), Octopodidae (5), A…
6 taxon_rank FALSE "has duplicate values: Species (743), Genus (262), …
# candidate foreign keys
try(
dm_enum_fk_candidates(con_dm, "species_codes", "species_groups"))
Error in abort_ref_tbl_has_no_pk(ref_table_name) :
ref_table `species_groups` needs a primary key first. Use `dm_enum_pk_candidates()` to find appropriate columns and `dm_add_pk()` to define a primary key.
# install.packages("DiagrammeRsvg")
dm_gui(dm = con_dm)