calcofi4db 2.8.2

  • merge_metadata_json() adds each dataset’s workflow_url (from the ingest calcofi: YAML) to its datasets[] entry, so the schema site can link the rendered ingest notebook next to the calcofi.org / data-source links.

calcofi4db 2.8.1

  • Content-hash dedup ignores provenance columns — the per-table/partition signature now always excludes _source_file, _source_row, _source_uuid, and _ingested_at (even when strip_provenance = FALSE). Otherwise _ingested_at (set to the current time on every ingest) made every table look changed, defeating the dedup for tables exported with provenance.

calcofi4db 2.8.0

Content-hash dedup of parquet uploads + Parquet V2 / zstd defaults

  • write_parquet_outputs() content-hash dedup — computes an order-independent content signature per table (and per partition for partitioned tables), stored in manifest.json as data_hash. On re-run, unchanged tables/partitions are reused from the previous run instead of being re-written and re-uploaded. A few new cruises (or a metadata-only change) now rewrite only the affected partitions, not all 15 GB of ctd_measurement. Replaces the previous coarse row-count check that forced a full-table rewrite whenever any partition value changed.
  • Parquet V2 + zstd defaults: COPY TO now writes PARQUET_VERSION V2 and defaults compression = "zstd" (was "snappy") for better compression at minimal cost. Native DuckDB GEOMETRY (v1.5+) round-trips correctly under both. The encoding is recorded in manifest.json as parquet_format; a format change forces a one-time full rewrite so the new encoding actually applies (content hashes track data, not file bytes). ROW_GROUP_SIZE_BYTES is intentionally not set on these writes because it requires preserve_insertion_order=false, which conflicts with ordered output.
  • primary_keys parameter — optional named list (table → PK column) appended as a final ORDER BY tiebreaker for a stable total order (better row-group statistics; byte-stable single-file outputs).
  • sync_to_gcs() crc32c fix: gcloud storage hash is now called without the removed --crc32c flag (rejected by gcloud ≥ 5xx), which had silently degraded change detection to a size-only comparison.
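
The content-hash dedup described above can be illustrated with a minimal sketch. It is not the package's implementation (the changelog does not say how data_hash is computed); it only shows, in Python, one way to get a signature that ignores row order and the four provenance columns. The function name content_signature and the choice of summing per-row SHA-256 digests modulo 2^64 are assumptions made for illustration.

```python
import hashlib

# Provenance columns named in the 2.8.1 entry; excluded from the signature.
PROVENANCE_COLS = {"_source_file", "_source_row", "_source_uuid", "_ingested_at"}


def content_signature(rows):
    """Return an order-independent signature for an iterable of row dicts.

    Hypothetical helper: each row is hashed on its own and the digests are
    summed modulo 2**64, so row order does not matter and duplicate rows
    still count (unlike XOR, where a repeated row would cancel out).
    """
    total = 0
    n_rows = 0
    for row in rows:
        # Sort the columns so dict ordering cannot change the row hash.
        kept = sorted(
            (col, repr(val)) for col, val in row.items()
            if col not in PROVENANCE_COLS
        )
        digest = hashlib.sha256(repr(kept).encode("utf-8")).digest()
        total = (total + int.from_bytes(digest[:8], "big")) % (1 << 64)
        n_rows += 1
    return f"{n_rows}:{total:016x}"


if __name__ == "__main__":
    first_run = [
        {"cast_id": 1, "temp": 10.5, "_ingested_at": "2024-01-01T00:00:00"},
        {"cast_id": 2, "temp": 11.0, "_ingested_at": "2024-01-01T00:00:00"},
    ]
    # Same data, different row order and a new ingest timestamp.
    second_run = [
        {"cast_id": 2, "temp": 11.0, "_ingested_at": "2025-06-01T00:00:00"},
        {"cast_id": 1, "temp": 10.5, "_ingested_at": "2025-06-01T00:00:00"},
    ]
    print(content_signature(first_run) == content_signature(second_run))
```

A table or partition whose signature matches the previous manifest entry could then be reused rather than rewritten, which is the behaviour the entries above describe.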

calcofi4db 2.7.1

  • parse_qmd_frontmatter() now reads the whole file when locating the YAML front matter delimiters instead of only the first 50 lines, so workflows with long calcofi: blocks (e.g. dataset_meta + additional_datasets) are parsed and not silently dropped from the targets pipeline / release registry.

calcofi4db 2.7.0

YAML-authoritative dataset metadata, per-dataset contributions, and richer release sidecars

  • read_ingest_yaml() / read_calcofi_meta() read the calcofi: YAML block from ingest_*.qmd workflows — the authoritative source for provider/dataset, dataset_meta, tables_owned, workflow_url, and erd.color. Replaces metadata/dataset.csv.
  • ingest_yaml_to_dataset_df() rebuilds the in-database dataset registry table from the ingest YAML (including additional_datasets: folded into one ingest, e.g. swfsc_invert), so ingests no longer read dataset.csv.
  • build_metadata_json() gains tables_owned — emits a contributions block (per-table COUNT(*), owned/shared flags) for owned tables only, avoiding mis-attribution of reference tables loaded from prior ingests. Per-ingest schema bumped to "1.1".
  • merge_metadata_json() now (a) builds the datasets block from ingest_yaml= (authoritative; dataset_csv= kept as deprecated fallback), (b) propagates each table’s workflow link, (c) aggregates a release-level contributions block (rows + pct per dataset, with over_attributed flag and table_rows= denominators), (d) adds erd_legend, datasets[].tables, and measurement_types[].datasets (from _source_datasets). Release schema bumped to "1.2". All new fields are additive.

calcofi4db 2.6.2

Invert consolidation, pipeline exclusions, and missing species corrections

  • consolidate_ichthyo_tables() gains invert_tbl parameter — folds Ed Weber’s inverts.csv into the unified ichthyo table with life_stage = "invert".
  • build_targets_list() gains exclude parameter — skip targets by name (e.g., exclude = "ingest_calcofi_ctd-cast"). Excluded targets are also stripped from other targets’ dependency lists. Normalizes hyphens to underscores for matching.
  • apply_data_corrections() adds 6 missing invert species (including Market squid, Doryteuthis opalescens) sourced from ERDDAP erdCalCOFIinvcnt. Dynamically matches columns to avoid errors when gbif_id hasn’t been added yet.
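
The exclude behaviour of build_targets_list() described above (skip targets by name, strip them from other targets' dependency lists, treat hyphens and underscores as equivalent) can be sketched as follows. This is a language-neutral illustration in Python under stated assumptions, not the package's R code: the function apply_exclude, the dict-of-lists representation of targets, and the target names in the example other than ingest_calcofi_ctd-cast are hypothetical.

```python
def _normalize(name):
    """Treat hyphens and underscores as equivalent when matching names."""
    return name.replace("-", "_")


def apply_exclude(targets, exclude):
    """Drop excluded targets and remove them from remaining dependency lists.

    targets: dict mapping target name -> list of dependency names
             (an assumed representation, chosen for the sketch).
    exclude: iterable of target names to skip.
    """
    excluded = {_normalize(name) for name in exclude}
    kept = {}
    for name, deps in targets.items():
        if _normalize(name) in excluded:
            continue  # the excluded target itself is skipped
        # Surviving targets must not wait on a target that will never run.
        kept[name] = [d for d in deps if _normalize(d) not in excluded]
    return kept


if __name__ == "__main__":
    targets = {
        "ingest_calcofi_ctd_cast": [],
        "ingest_calcofi_bottle": [],
        "release_database": ["ingest_calcofi_ctd_cast", "ingest_calcofi_bottle"],
    }
    print(apply_exclude(targets, exclude=["ingest_calcofi_ctd-cast"]))
```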

calcofi4db 2.6.1

Sorted parquet output with ST_Hilbert spatial ordering

  • sort_by parameter: write_parquet_outputs() gains a sort_by named list to specify row ordering per table. Sorted row groups enable predicate pushdown (min/max statistics skip irrelevant chunks).
  • Hilbert spatial sort: use "hilbert:lon_col,lat_col" syntax in sort_by to order rows by ST_Hilbert() curve position, which clusters spatially nearby records for fast bounding-box queries.
  • paste0() in COPY TO: SQL construction in write_parquet_outputs() uses paste0() instead of glue::glue() to prevent cli {variable} interpolation errors when propagating through targets.
  • sort_by in manifest.json: sort specifications are recorded alongside partition_by for downstream consumers.
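
To show why Hilbert ordering clusters nearby records, here is a small sketch of the idea in Python. It is not DuckDB's ST_Hilbert() and makes no claim about how that function quantizes coordinates; it uses the textbook cell-to-distance mapping on a 2^order by 2^order grid, and the helper names hilbert_index, _to_cell, and hilbert_sort are hypothetical.

```python
def hilbert_index(x, y, order):
    """Distance along a Hilbert curve for integer cell (x, y).

    The grid is 2**order cells on each side; this is the standard
    iterative xy-to-distance mapping.
    """
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def _to_cell(value, lo, hi, n):
    """Quantize a coordinate in [lo, hi] to a grid cell in 0..n-1."""
    if hi == lo:
        return 0
    return min(n - 1, int((value - lo) / (hi - lo) * n))


def hilbert_sort(points, order=16):
    """Sort (lon, lat) tuples by Hilbert distance within their bounding box."""
    if not points:
        return []
    n = 1 << order
    lons = [p[0] for p in points]
    lats = [p[1] for p in points]
    lo_x, hi_x, lo_y, hi_y = min(lons), max(lons), min(lats), max(lats)
    return sorted(
        points,
        key=lambda p: hilbert_index(
            _to_cell(p[0], lo_x, hi_x, n), _to_cell(p[1], lo_y, hi_y, n), order
        ),
    )


if __name__ == "__main__":
    stations = [(-117.3, 32.9), (-124.0, 34.5), (-117.4, 32.8), (-123.9, 34.4)]
    for lon, lat in hilbert_sort(stations):
        print(lon, lat)
```

Because consecutive positions on the curve are always neighbouring cells, rows sorted this way tend to land in the same row groups as their spatial neighbours, so the min/max statistics of each row group cover a compact area.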

calcofi4db 2.6.0

Native GEOMETRY storage via DuckDB v1.5 — removes spatial workaround

  • storage_compatibility_version = 'latest': get_duckdb_con() now sets this in the default config, enabling DuckDB v1.5’s native built-in GEOMETRY type. This fixes the “Buffer overflow” / “Skipping beyond end of binary data” spatial serialization bug that occurred with the old v0.10.2 storage format.
  • Removed geom_wkb workaround: assign_grid_key() no longer refreshes grid geometry from a stored WKB column, now that native GEOMETRY storage is reliable.
  • Requires duckdb >= 1.5.1: added a minimum version constraint in DESCRIPTION to ensure the native GEOMETRY type is available.
  • Avoid glue in spatial.R: assign_grid_key() uses paste0() instead of glue::glue() to prevent cli from intercepting {variable} patterns in error messages propagated through targets.

calcofi4db 2.5.6 (superseded)

Grid geometry refresh workaround for DuckDB spatial bug (removed in 2.6.0)

calcofi4db 2.5.5

Server-side GCS copy for archives & sync_to_gcs replaces put_gcs_file loops

  • Server-side archive copy: .sync_to_gcs_archive() now checks _sync/{provider}/{dataset}/ on GCS before uploading from local. If a file exists with matching MD5, it uses copy_gcs_file() for an instant server-side copy, with no local I/O or GD mount needed.
  • copy_gcs_file(src, dst): new helper for server-side GCS-to-GCS copy via gcloud storage cp.
  • Bottle & DIC uploads: replaced put_gcs_file() loops in QMDs with sync_to_gcs() for hash-based deduplication (idempotent re-renders).

calcofi4db 2.5.4

Consolidated sync_to_gcs() with archive mode, exclude patterns & GCS logging

  • Unified sync function: sync_to_gcs() gains archive, exclude, and log_to_gcs parameters. When archive = TRUE, it creates timestamped immutable snapshots (replacing sync_to_gcs_archive() internals). When archive = FALSE (default), it runs in standard mirror mode.
  • Exclude patterns: new exclude parameter accepts glob patterns (e.g., c(".DS_Store", "*.tmp")) to skip files during sync.
  • GCS action logging: log_to_gcs = TRUE writes a timestamped JSON log to gs://{bucket}/{prefix}/_logs/sync_YYYY-MM-DD_HHMMSS.json documenting every upload, skip, and delete.
  • Richer results: the sync results tibble now includes size and reason columns (e.g., “checksum match”, “new file”, “crc32c changed”).
  • sync_to_gcs_archive() deprecated: now a thin wrapper calling sync_to_gcs(archive = TRUE). Existing callers work unchanged.
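
The mirror-mode decisions listed above (glob excludes, checksum comparison, a reason per file) can be sketched as a small planning function. This Python sketch is illustrative only and does not reproduce sync_to_gcs(): the function plan_sync, the dict inputs, the "excluded" reason string, and the choice to match glob patterns against the file's base name are all assumptions. The other three reason strings are the ones quoted in the changelog.

```python
import posixpath
from fnmatch import fnmatchcase


def plan_sync(local, remote, exclude=()):
    """Decide what to do with each local file before any upload happens.

    local, remote: dicts mapping relative path -> crc32c string
                   (assumed inputs for the sketch).
    exclude: glob patterns, matched here against the base name only.
    Returns a list of (path, action, reason) tuples.
    """
    plan = []
    for path in sorted(local):
        name = posixpath.basename(path)
        if any(fnmatchcase(name, pattern) for pattern in exclude):
            plan.append((path, "skip", "excluded"))  # hypothetical reason label
        elif path not in remote:
            plan.append((path, "upload", "new file"))
        elif remote[path] == local[path]:
            plan.append((path, "skip", "checksum match"))
        else:
            plan.append((path, "upload", "crc32c changed"))
    return plan


if __name__ == "__main__":
    local = {
        "bottle/.DS_Store": "aaaa",
        "bottle/cast.csv": "bbbb",
        "bottle/bottle.csv": "cccc",
        "bottle/new.csv": "dddd",
    }
    remote = {"bottle/cast.csv": "bbbb", "bottle/bottle.csv": "zzzz"}
    for row in plan_sync(local, remote, exclude=[".DS_Store", "*.tmp"]):
        print(row)
```

Computing the whole plan before touching the bucket makes re-runs idempotent and gives a natural record to write out as the action log.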

calcofi4db 2.5.3

DuckDB driver lifecycle, idempotent ingestion & defensive ALTER TABLE

calcofi4db 2.5.2

VIEWs for dependencies, GCS server-side copy, crc32c sync & spatial consolidation

  • VIEW-based dependency loading: load_prior_tables() gains an as_view parameter that creates VIEWs instead of TABLEs for zero-copy parquet reads. Dependency tables are no longer duplicated across ingests.
  • calcofi.modifies frontmatter: new YAML field declares which dependency tables an ingest modifies (e.g., ship). parse_qmd_frontmatter() parses it; build_release_table_registry() discovers _new delta sidecars from the filesystem.
  • GCS server-side copy for releases: release_database.qmd copies parquet from ingest/ to releases/ on GCS via gcloud storage cp instead of re-uploading from local. Only derived/merged tables are exported locally.
  • crc32c hash comparison: sync_to_gcs() uses gcloud storage ls --json for crc32c hashes; list_gcs_files() returns a crc32c column. Unchanged files are skipped entirely.
  • Stale file cleanup: sync_to_gcs() gains a delete_stale parameter to remove orphaned GCS files after partition key or table renames.
  • export_parquet(): new helper using DuckDB native COPY TO PARQUET; handles GEOMETRY columns (as WKB) and is preferred over arrow::write_parquet().
  • build_release_table_registry(): auto-discovers the table-to-ingest mapping from manifests, with canonical source marking for duplicates.
  • Archive listing fix: get_latest_archive_timestamp() uses a non-recursive gcloud storage ls instead of the recursive --json scan that was hanging on large archives.

calcofi4db 2.5.1

Mismatch tracking, supplemental table support, targets integration & bug fix

calcofi4db 2.5.0

Simplified provider/dataset naming, taxonomy & workflow improvements

calcofi4db 2.4.0

Use _uuid over _id, smarter sync with GCS

  • Revert from integer _id to _uuid as the preferred unique identifiers for the SWFSC ichthyo db
  • Synchronize with GCS more intelligently, using MD5 hash checks and modified-time file naming

calcofi4db 2.3.0

Addition of ship, taxonomy functions

Added helper functions for processing ship and taxonomy data.

calcofi4db 2.2.1

Addition of spatial, parquet, viz helper functions

calcofi4db 2.2.0

Improvements to cloud plan functions

Workflow ingest_swfsc.noaa.gov_calcofi-db.qmd now fully automates ingestion of CalCOFI database from SWFSC NOAA archive to parquet files in Google Cloud Storage. Many new functions added.

calcofi4db 2.1.0

Addition of functions for phase 2 of cloud plan

  • Added ducklake and freeze functions. Updated documentation with concepts.

calcofi4db 1.2.0

Addition of functions for phase 1 of cloud plan

calcofi4db 1.1.0

Addition of CalCOFI Bottle Database

calcofi4db 1.0.0

Initial production release with NOAA CalCOFI Database

  • Complete NOAA CalCOFI Database ingestion with spatial features
  • Add synchronized versioning system for package and database
  • Create master ingestion workflow with integrity checks
  • Implement comprehensive metadata management

calcofi4db 0.1.1