Changelog
Source: `NEWS.md`
calcofi4db 2.6.2
Invert consolidation, pipeline exclusions, and missing species corrections
- `consolidate_ichthyo_tables()` gains `invert_tbl` parameter — folds Ed Weber's `inverts.csv` into the unified `ichthyo` table with `life_stage = "invert"`.
- `build_targets_list()` gains `exclude` parameter — skip targets by name (e.g., `exclude = "ingest_calcofi_ctd-cast"`). Excluded targets are also stripped from other targets' dependency lists. Hyphens are normalized to underscores for matching.
- `apply_data_corrections()` adds 6 missing invert species (including Market squid, *Doryteuthis opalescens*) sourced from ERDDAP `erdCalCOFIinvcnt`. Dynamically matches columns to avoid errors when `gbif_id` hasn't been added yet.
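The hyphen-to-underscore normalization for `exclude` matching can be sketched in base R; this is the inferred logic, not the package's actual implementation:

```r
# Sketch of the name normalization described above: hyphens become
# underscores before comparing excluded names against target names.
normalize_target <- function(x) gsub("-", "_", x, fixed = TRUE)
normalize_target("ingest_calcofi_ctd-cast")
#> "ingest_calcofi_ctd_cast"
```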
calcofi4db 2.6.1
Sorted parquet output with ST_Hilbert spatial ordering
- **`sort_by` parameter**: `write_parquet_outputs()` gains a `sort_by` named list to specify row ordering per table. Sorted row groups enable predicate pushdown (min/max statistics skip irrelevant chunks).
- **Hilbert spatial sort**: use `"hilbert:lon_col,lat_col"` syntax in `sort_by` to order rows by `ST_Hilbert()` curve position — clusters spatially nearby records for fast bounding-box queries.
- **`paste0()` in COPY TO**: SQL construction in `write_parquet_outputs()` uses `paste0()` instead of `glue::glue()` to prevent cli `{variable}` interpolation errors when propagating through targets.
- **`sort_by` in `manifest.json`**: sort specifications are recorded alongside `partition_by` for downstream consumers.
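A minimal sketch of how the new parameter might be supplied; everything other than the `sort_by` name and the `"hilbert:"` prefix syntax (both stated above) is an assumption about the call shape, not verified against the package:

```r
# Illustrative only: per-table ordering, including a Hilbert spatial sort.
# Table and column names are hypothetical.
write_parquet_outputs(
  tables,  # hypothetical list of tables to write
  sort_by = list(
    cast    = "hilbert:longitude,latitude",  # ST_Hilbert() curve ordering
    species = "scientific_name"              # plain column sort
  )
)
```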
calcofi4db 2.6.0
Native GEOMETRY storage via DuckDB v1.5 — removes spatial workaround
- **`storage_compatibility_version = 'latest'`**: `get_duckdb_con()` now sets this in the default config, enabling DuckDB v1.5's native built-in GEOMETRY type. This fixes the "Buffer overflow" / "Skipping beyond end of binary data" spatial serialization bug that occurred with the old v0.10.2 storage format.
- **Removed `geom_wkb` workaround**: `assign_grid_key()` no longer refreshes grid geometry from a stored WKB column — native GEOMETRY storage is reliable.
- **Requires duckdb >= 1.5.1**: added a minimum version constraint in DESCRIPTION to ensure the native GEOMETRY type is available.
- **Avoid glue in spatial.R**: `assign_grid_key()` uses `paste0()` instead of `glue::glue()` to prevent cli from intercepting `{variable}` patterns in error messages propagated through targets.
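One way to set this flag on a plain connection, outside `get_duckdb_con()` — a sketch using the duckdb R driver's `config` argument:

```r
# Illustrative only: set the storage compatibility version at connect time,
# as get_duckdb_con() now does internally in its default config.
library(DBI)
con <- dbConnect(duckdb::duckdb(), config = list(
  storage_compatibility_version = "latest"  # enables native GEOMETRY (DuckDB v1.5+)
))
dbDisconnect(con, shutdown = TRUE)
```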
calcofi4db 2.5.6 (superseded)
Grid geometry refresh workaround for DuckDB spatial bug (removed in 2.6.0)
calcofi4db 2.5.5
Server-side GCS copy for archives & sync_to_gcs replaces put_gcs_file loops
- **Server-side archive copy**: `.sync_to_gcs_archive()` now checks `_sync/{provider}/{dataset}/` on GCS before uploading from local. If a file exists with a matching MD5, uses `copy_gcs_file()` for an instant server-side copy — no local I/O or GD mount needed.
- **`copy_gcs_file(src, dst)`**: new helper for server-side GCS-to-GCS copy via `gcloud storage cp`.
- **Bottle & DIC uploads**: replaced `put_gcs_file()` loops in QMDs with `sync_to_gcs()` for hash-based deduplication (idempotent re-renders).
calcofi4db 2.5.4
Consolidated sync_to_gcs() with archive mode, exclude patterns & GCS logging
- **Unified sync function**: `sync_to_gcs()` gains `archive`, `exclude`, and `log_to_gcs` parameters. When `archive = TRUE`, creates timestamped immutable snapshots (replacing `sync_to_gcs_archive()` internals). When `FALSE` (the default), runs in standard mirror mode.
- **Exclude patterns**: new `exclude` parameter accepts glob patterns (e.g., `c(".DS_Store", "*.tmp")`) to skip files during sync.
- **GCS action logging**: `log_to_gcs = TRUE` writes a timestamped JSON log to `gs://{bucket}/{prefix}/_logs/sync_YYYY-MM-DD_HHMMSS.json` documenting every upload, skip, and delete.
- **Richer results**: the sync results tibble now includes `size` and `reason` columns (e.g., "checksum match", "new file", "crc32c changed").
- **`sync_to_gcs_archive()` deprecated**: now a thin wrapper calling `sync_to_gcs(archive = TRUE)`. Existing callers work unchanged.
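Pulling the new parameters together, a hedged usage sketch; the path arguments and their names are placeholders, since only `archive`, `exclude`, and `log_to_gcs` are documented above:

```r
# Illustrative call shape only; dir_local/dir_gcs names and the gs:// path
# are hypothetical placeholders.
sync_to_gcs(
  dir_local  = "data/ingest/calcofi/bottle",
  dir_gcs    = "gs://example-bucket/ingest/calcofi/bottle",
  archive    = TRUE,                    # timestamped immutable snapshot
  exclude    = c(".DS_Store", "*.tmp"), # glob patterns to skip
  log_to_gcs = TRUE                     # JSON action log under _logs/
)
```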
calcofi4db 2.5.3
DuckDB driver lifecycle, idempotent ingestion & defensive ALTER TABLE
- **DuckDB driver lifecycle**: `get_duckdb_con()` now creates a named driver via `duckdb::duckdb(dbdir = ...)` and stores it as an attribute; `close_duckdb()` calls `duckdb_shutdown()` for a proper WAL flush. Also sets `autoload_known_extensions = "true"` so the spatial extension loads during WAL replay.
- **Idempotent DuckLake ingestion**: `ingest_to_working()` checks `_source_file` before appending — skips if rows from the same source already exist, making notebook re-renders safe.
- **Defensive `ADD COLUMN IF NOT EXISTS`**: all `ALTER TABLE … ADD COLUMN` calls across `load_prior_tables()`, `load_gcs_parquet_to_duckdb()`, `standardize_species_local()`, `standardize_species()`, `finalize_ingest()`, `create_cruise_key()`, `propagate_natural_key()`, `assign_sequential_ids()`, and `replace_uuid_with_id()` now use `IF NOT EXISTS` to prevent errors on re-runs.
- **Better duplicate-key warnings**: `create_cruise_key()` now shows the top 10 examples with counts in the warning message.
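The defensive pattern is plain DuckDB SQL, demonstrable outside the package; a minimal sketch with an invented table:

```r
# Illustrative sketch: DuckDB accepts IF NOT EXISTS on ADD COLUMN, so
# re-running the same ingest step is a no-op instead of an error.
library(DBI)
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "CREATE TABLE station (station_id INTEGER)")
dbExecute(con, "ALTER TABLE station ADD COLUMN IF NOT EXISTS cruise_key TEXT")
dbExecute(con, "ALTER TABLE station ADD COLUMN IF NOT EXISTS cruise_key TEXT")  # safe on re-run
dbDisconnect(con, shutdown = TRUE)
```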
calcofi4db 2.5.2
VIEWs for dependencies, GCS server-side copy, crc32c sync & spatial consolidation
- **VIEW-based dependency loading**: `load_prior_tables()` gains `as_view` parameter — creates VIEWs instead of TABLEs for zero-copy parquet reads. Dependency tables are no longer duplicated across ingests.
- **`calcofi.modifies` frontmatter**: new YAML field declares which dependency tables an ingest modifies (e.g., `ship`). `parse_qmd_frontmatter()` parses it; `build_release_table_registry()` discovers `_new` delta sidecars from the filesystem.
- **GCS server-side copy for releases**: `release_database.qmd` copies parquet from `ingest/` to `releases/` on GCS via `gcloud storage cp` instead of re-uploading from local. Only derived/merged tables are exported locally.
- **crc32c hash comparison**: `sync_to_gcs()` uses `gcloud storage ls --json` for crc32c hashes; `list_gcs_files()` returns a `crc32c` column. Unchanged files are skipped entirely.
- **Stale file cleanup**: `sync_to_gcs()` gains `delete_stale` parameter to remove orphaned GCS files after partition-key or table renames.
- **`export_parquet()`**: new helper using DuckDB-native `COPY TO ... PARQUET` — handles GEOMETRY columns (as WKB); preferred over `arrow::write_parquet()`.
- **`build_release_table_registry()`**: auto-discovers table-to-ingest mapping from manifests, with canonical-source marking for duplicates.
- **Archive listing fix**: `get_latest_archive_timestamp()` uses a non-recursive `gcloud storage ls` instead of the recursive `--json` scan that was hanging on large archives.
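A hypothetical `.qmd` frontmatter fragment using the new field; the exact nesting is inferred from the `calcofi.modifies` name above and not verified against `parse_qmd_frontmatter()`:

```yaml
# Hypothetical frontmatter sketch for an ingest .qmd
calcofi:
  modifies:
    - ship
```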
calcofi4db 2.5.1
Mismatch tracking, supplemental table support, targets integration & bug fix
- **New mismatch collectors**: added `collect_ship_mismatches()`, `collect_measurement_type_mismatches()`, and `collect_cruise_key_mismatches()` to detect unresolved entities and populate the `mismatches` section of `manifest.json`.
- **Supplemental table support**: `write_parquet_outputs()` gains `mismatches` and `supplemental` parameters; `load_prior_tables()` and `finalize_ingest()` gain `include_supplemental` to exclude supplemental tables (e.g., wide-format ERDDAP outputs) by default.
- **New spatial manifest**: added `write_spatial_manifest()` to generate `manifest.json` for spatial parquet directories.
- **New ship helper**: added `ensure_interim_ships()` to insert placeholder ship entries for unmatched codes so downstream FK joins can proceed.
- **Targets integration**: added `parse_qmd_frontmatter()` and `build_targets_list()` to build a `targets` pipeline from `calcofi:` YAML frontmatter in `.qmd` workflow files. Added `yaml` to Imports and `targets` to Suggests.
- **Relationships refactor**: `build_relationships_json()` now accepts a `rels` list as an alternative to a `dm` object, removing the hard dependency on the `dm` package.
- **Partition change detection**: `write_parquet_outputs()` now detects when partition values change and forces a re-write.
- **Bug fix**: fixed `print_csv_change_stats()` using `fields_added` instead of `fields_removed` when counting removed fields.
calcofi4db 2.5.0
Simplified provider/dataset naming, taxonomy & workflow improvements
- **Dataset renaming**: renamed dataset providers from URL-style to short names (e.g., `swfsc.noaa.gov/calcofi-db` -> `swfsc/ichthyo`, `calcofi.org/bottle-database` -> `calcofi/bottle`); moved corresponding `inst/ingest/` config files to match.
- **New taxonomy functions**: added `standardize_species_local()` for fast local species standardization via `spp.duckdb` with optional WoRMS API fallback. Added `build_taxon_hierarchy()` to build taxonomic hierarchies from the local `spp.duckdb` using recursive CTEs.
- **New workflow function**: added `finalize_ingest()`, a high-level function to push parquet tables to the Working DuckLake with provenance tracking.
- **New cloud helpers**: added GCS cleanup helpers `delete_gcs_prefix()` and `cleanup_gcs_obsolete()`.
- **New display helper**: added `dt()` display helper for interactive DataTables with CSV export.
- **New wrangle helpers**: added relationship JSON helpers `build_relationships_json()`, `merge_relationships_json()`, `read_relationships_json()`. Added `assign_deterministic_uuids_md5()` using DuckDB-native `md5`.
- **Improved sync**: `sync_to_gcs()` now supports recursive/hive-partitioned subdirectories.
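The recursive-CTE approach behind `build_taxon_hierarchy()` can be illustrated with a toy table; the `taxon` table and its columns here are invented for the sketch, not the actual `spp.duckdb` schema:

```r
# Illustrative recursive CTE walking parent_id links to build each taxon's
# full lineage path (toy data; not the package's schema or query).
library(DBI)
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "CREATE TABLE taxon (id INT, parent_id INT, name TEXT)")
dbExecute(con, "INSERT INTO taxon VALUES
  (1, NULL, 'Animalia'), (2, 1, 'Mollusca'), (3, 2, 'Doryteuthis opalescens')")
hier <- dbGetQuery(con, "
  WITH RECURSIVE lineage AS (
    SELECT id, parent_id, name, name AS path
    FROM taxon WHERE parent_id IS NULL
    UNION ALL
    SELECT t.id, t.parent_id, t.name, l.path || ' > ' || t.name
    FROM taxon t JOIN lineage l ON t.parent_id = l.id)
  SELECT * FROM lineage")
# hier$path holds each taxon's full ancestry, e.g. ending in
# 'Animalia > Mollusca > Doryteuthis opalescens'
dbDisconnect(con, shutdown = TRUE)
```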
calcofi4db 2.4.0
Use `_uuid` over `_id`, smarter sync with GCS
- Revert from integer `_id` back to `_uuid` as the preferred unique identifiers for the SWFSC ichthyo db.
- Use smarter synchronizing with GCS via MD5 hash checks and modified-time file naming.
calcofi4db 2.3.0
Addition of ship, taxonomy functions
Added helper functions for processing:
- ships: `fetch_ship_ices()`, `match_ships()`, `add_ship_info()`.
- taxonomy: `build_taxon_table()`, `standardize_species()`.
calcofi4db 2.2.1
Addition of spatial, parquet, viz helper functions
- Added functions to help with spatial data processing: `add_point_geom()`, `assign_grid_key()`.
- Added parquet helper function: `load_gcs_parquet_to_duckdb()`.
- Added ingest-workflow table visualization helper: `preview_tables()`.
calcofi4db 2.2.0
Improvements to cloud plan functions
The workflow `ingest_swfsc.noaa.gov_calcofi-db.qmd` now fully automates ingestion of the CalCOFI database from the SWFSC NOAA archive to parquet files in Google Cloud Storage. Many new functions added.
calcofi4db 2.1.0
Addition of functions for phase 2 of cloud plan
- Added DuckLake and freeze functions. Updated documentation with concepts.
calcofi4db 1.0.0
Initial production release with NOAA CalCOFI Database
- Complete NOAA CalCOFI Database ingestion with spatial features
- Add synchronized versioning system for package and database
- Create master ingestion workflow with integrity checks
- Implement comprehensive metadata management
calcofi4db 0.1.1
- Fix `detect_csv_changes()` to compare CSV files with `read_csv_files()` output.
- Add type-mismatch checks for fields in the CSV files.
- Add `print_csv_change_stats()` function for a textual summary of changes.
- Add `display_csv_changes()` to display changes in a color-coded table, compatible with multiple output formats: interactive DataTable, static kable, or raw tibble.
- Expand documentation for `read_csv_files()` and `detect_csv_changes()`.