Cloud Storage (GCS)

functions for Google Cloud Storage operations

cleanup_gcs_obsolete()
Clean up obsolete GCS directories from dataset renames
copy_gcs_file()
Server-side copy between GCS paths
create_gcs_manifest()
Create a manifest of current GCS files
delete_gcs_prefix()
Delete all objects under a GCS prefix
get_calcofi_file()
Get a CalCOFI file from the immutable archive
get_gcs_file()
Get a file from Google Cloud Storage
get_historical_file()
Get historical file from a specific date
get_manifest()
Get manifest for a specific date
list_calcofi_files()
List CalCOFI files from manifest
list_gcs_files()
List files in a GCS bucket/prefix
list_gcs_versions()
List versions of a file in GCS archive
put_gcs_file()
Upload a file to Google Cloud Storage
sync_to_gcs()
Sync local files to GCS, skipping unchanged files
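
A typical round trip with these helpers might look like the following sketch; the argument names shown are assumptions for illustration only, not the documented signatures:

```r
# hypothetical sketch -- argument names are illustrative, not the real API
put_gcs_file("data/bottle.csv", bucket = "calcofi")  # upload one file
sync_to_gcs("data/", bucket = "calcofi")             # sync a directory, skipping unchanged files
f <- get_gcs_file("bottle.csv", bucket = "calcofi")  # fetch a file back
list_gcs_versions("bottle.csv")                      # inspect archived versions
```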

Parquet

functions for Apache Parquet file operations

add_parquet_metadata()
Add metadata to Parquet file
csv_to_parquet()
Convert CSV file to Parquet format
export_parquet()
Export a DuckDB Table or Query to Parquet
get_parquet_metadata()
Get Parquet file metadata
read_parquet_table()
Read a Parquet table
upload_parquet()
Upload Parquet file to GCS
write_parquet_table()
Write data to Parquet format

DuckDB

basic DuckDB database operations

close_duckdb()
Disconnect from DuckDB and shut down
create_duckdb_from_parquet()
Create DuckDB from Parquet files
create_duckdb_views()
Create views from a manifest
duckdb_to_parquet()
Export DuckDB table to Parquet
get_duckdb_con()
Get a DuckDB connection
get_duckdb_tables()
Get table information from DuckDB
load_duckdb_extension()
Install and load DuckDB extension
save_duckdb_to_gcs()
Save DuckDB to GCS
set_duckdb_comments()
Set table and column comments in DuckDB

Working DuckLake

functions for the internal Working DuckLake with provenance tracking

add_provenance_columns()
Add provenance columns to a data frame
get_working_ducklake()
Get Working DuckLake connection
ingest_dataset()
Ingest Dataset into Working DuckLake
ingest_to_working()
Ingest data to Working DuckLake
list_working_tables()
List tables with provenance in Working DuckLake
load_prior_tables()
Load Tables from a Prior Ingest's Parquet Directory
query_at_time()
Query Working DuckLake at a point in time
save_working_ducklake()
Save Working DuckLake to GCS
strip_provenance_columns()
Strip provenance columns from data
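
The provenance-tracked workflow these functions support might be sketched as follows; signatures and argument names here are assumptions for illustration, not the actual API:

```r
# hypothetical sketch -- signatures are assumptions, not the real API
con <- get_working_ducklake()                    # connect to the Working DuckLake
ingest_dataset(con, "bottle")                    # ingest with provenance tracking
list_working_tables(con)                         # list tables with provenance
d <- query_at_time(con, "bottle", "2024-01-01")  # point-in-time query
save_working_ducklake(con)                       # persist to GCS
```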

Frozen Releases

functions for creating and managing frozen DuckLake releases

compare_releases()
Compare two frozen releases
freeze_release()
Freeze a release of the DuckLake
get_release_metadata()
Get metadata for a frozen release
list_frozen_releases()
List available frozen releases
upload_frozen_release()
Upload Frozen Release to GCS
validate_for_release()
Validate Working DuckLake for release
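
Taken together, these functions imply a validate-then-freeze flow, sketched below; the version strings and argument names are hypothetical, for illustration only:

```r
# hypothetical sketch -- arguments and version strings are illustrative
con <- get_working_ducklake()
validate_for_release(con)                # check the Working DuckLake first
freeze_release(con, version = "v1.0.0")  # snapshot a frozen release
upload_frozen_release("v1.0.0")          # push it to GCS
compare_releases("v0.9.0", "v1.0.0")     # diff two releases
```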

Read

functions for reading data, particularly from CSV files in the Google Drive CalCOFI data folder

create_redefinition_files()
Create Redefinition Files for Tables and Fields
determine_field_types()
Determine Field Types for Database
read_csv_files()
Read CSV Files and Their Metadata
read_csv_metadata()
Read CSV Files and Extract Metadata

Transform

functions for transforming data, after reading and before ingesting into the database

detect_csv_changes()
Detect Changes in CSV Files
display_csv_changes()
Display CSV Changes in a Formatted Table
print_csv_change_stats()
Print CSV Change Statistics
transform_data()
Transform Data for Database Ingestion

Check

functions for data validation and integrity checking

check_data_integrity()
Check Data Integrity for Ingestion
check_multiple_datasets()
Check Multiple Datasets for Integrity
render_integrity_message()
Render Data Integrity Check Message

Validate

functions for referential integrity validation and flagging invalid rows

delete_flagged_rows()
Delete Flagged Rows from Database
flag_invalid_rows()
Flag and Export Invalid Rows
validate_dataset()
Run All Validations for a Dataset
validate_egg_stages()
Validate Egg Stage Values
validate_fk_references()
Validate Foreign Key References
validate_lookup_values()
Validate Lookup Values Exist

Ingest

functions for ingesting data into the database (PostgreSQL deprecated, use DuckLake)

ingest_csv_to_db() deprecated
Ingest CSV data to PostgreSQL database (DEPRECATED)
ingest_dataset_pg() deprecated
Ingest a Dataset to PostgreSQL (DEPRECATED)

Version

functions for schema and package versioning

get_schema_versions()
Get Schema Version History
init_schema_version_csv()
Initialize Schema Version CSV
record_schema_version()
Record Schema Version

Utilities

utility functions for database operations (PostgreSQL deprecated)

copy_schema()
Copy Database Schema
get_db_con() deprecated
Get a database connection to the CalCOFI PostgreSQL database (DEPRECATED)

Wrangle

functions for local DuckDB wrangling (keys, IDs, table consolidation)

apply_data_corrections()
Apply Data Corrections
assign_deterministic_uuids()
Assign deterministic UUIDs from composite key columns
assign_deterministic_uuids_md5()
Assign deterministic UUIDs using DuckDB-native md5
assign_sequential_ids()
Assign Sequential IDs with Deterministic Sort Order
build_metadata_json()
Build Metadata JSON for Parquet Outputs
build_relationships_json()
Build Relationships JSON from dm Object
collect_cruise_key_mismatches()
Collect Cruise Key Mismatches
collect_measurement_type_mismatches()
Collect Measurement Type Mismatches
collect_ship_mismatches()
Collect Ship Mismatches
consolidate_ichthyo_tables()
Consolidate Ichthyoplankton Tables into Tidy Format
convert_cruise_key_format()
Convert Old YYMMKK Cruise Key to YYYY-MM-NODC Format
create_cruise_key()
Create Cruise Key from Ship NODC Code and Date
create_lookup_table()
Create Lookup Table from Vocabulary Definitions
enforce_column_types()
Enforce Column Types Before Export
merge_relationships_json()
Merge Multiple Relationships JSON Files
propagate_natural_key()
Propagate Key from Parent to Child Table
read_relationships_json()
Read Relationships JSON and Optionally Apply to dm
replace_uuid_with_id()
Replace UUIDs with Integer Foreign Keys
standardize_site_key()
Standardize Site Key from Line and Station Columns
write_parquet_outputs()
Write Tables to Parquet Files
write_spatial_manifest()
Write Spatial Manifest

Workflow

functions for orchestrating ingestion workflow steps

build_release_table_registry()
Build Release Table Registry from Ingest Manifests
build_targets_list()
Build Targets List from Quarto Frontmatter
finalize_ingest()
Finalize Ingest — Push Parquet Tables to Working DuckLake
integrate_to_working_ducklake()
Integrate Ingest Outputs into Working DuckLake
list_ingest_outputs()
List Available Ingest Outputs
parse_qmd_frontmatter()
Parse YAML Frontmatter from Quarto Notebooks
read_ingest_manifest()
Read Ingest Manifest from GCS
read_ingest_parquet()
Read Ingest Parquet Table from GCS
write_ingest_outputs()
Write Ingest Workflow Outputs to GCS

Display

helper functions for workflow display outputs (GitHub links, validation tables)

dt()
Create Interactive Data Table with CSV Export
github_file_link()
Create GitHub File Link
preview_tables()
Preview Tables with Head and Tail Rows
show_flagged_file()
Show Flagged File Result
show_validation_results()
Show Validation Results with GitHub Links

Archive

functions for archiving and managing historical data snapshots

cleanup_duplicate_archives()
Remove duplicate archives from GCS
compare_local_vs_archive()
Compare local files with GCS archive
download_archive()
Download archive to local directory
get_archive_manifest()
Get archive manifest (file metadata)
get_latest_archive_timestamp()
Get latest archive timestamp from GCS
get_local_manifest()
Get local file manifest
sync_to_gcs_archive()
Sync local files to GCS archive (deprecated wrapper)

Version Sync

functions for synchronizing schema and package versions

commit_version_and_permalink()
Commit Version Changes and Get Permalink
complete_version_release()
Complete Version Release Workflow
get_package_version()
Get Current Package Version
suggest_next_version()
Suggest Next Version
update_package_version()
Synchronized Version Management for Package and Database

Visualize

functions for visualizing diagnostic outputs, particularly color-coded data tables

show_fields_redefine()
Show fields to redefine
show_source_files()
Show source files
show_tables_redefine()
Show tables to redefine

Other

other functions or datasets not captured by the categories above

add_point_geom()
Add Point Geometry Column to a DuckDB Table
assign_grid_key()
Assign Grid Key via Spatial Join
build_taxon_hierarchy()
Build Taxon Hierarchy from Local spp.duckdb via Recursive CTEs
build_taxon_table()
Build Taxonomic Hierarchy Table from WoRMS
derive_cruise_key_on_casts()
Derive Cruise Key on Bottle Casts via Ship Matching
ensure_interim_ships()
Ensure Interim Ship Entries for Unmatched Ships
fetch_ship_ices()
Fetch Ship Codes from ICES Reference Codes API
load_gcs_parquet_to_duckdb()
Load a GCS Parquet File into DuckDB
match_ships()
Match Ship Codes Across Datasets Using Multi-Source References
report_ship_matches()
Report Ship Matching Status for a Dataset
standardize_species()
Standardize Species Identifiers Using WoRMS/ITIS/GBIF APIs
standardize_species_local()
Standardize Species Using Local spp.duckdb Lookups