Cloud Storage (GCS)

functions for Google Cloud Storage operations

create_gcs_manifest()
Create a manifest of current GCS files
get_calcofi_file()
Get a CalCOFI file from the immutable archive
get_gcs_file()
Get a file from Google Cloud Storage
get_historical_file()
Get historical file from a specific date
get_manifest()
Get manifest for a specific date
list_calcofi_files()
List CalCOFI files from manifest
list_gcs_files()
List files in a GCS bucket/prefix
list_gcs_versions()
List versions of a file in GCS archive
put_gcs_file()
Upload a file to Google Cloud Storage

Parquet

functions for Apache Parquet file operations

add_parquet_metadata()
Add metadata to Parquet file
csv_to_parquet()
Convert CSV file to Parquet format
get_parquet_metadata()
Get Parquet file metadata
read_parquet_table()
Read a Parquet table
upload_parquet()
Upload Parquet file to GCS
write_parquet_table()
Write data to Parquet format

DuckDB

basic DuckDB database operations

close_duckdb()
Disconnect from DuckDB and shut down
create_duckdb_from_parquet()
Create DuckDB from Parquet files
create_duckdb_views()
Create views from a manifest
duckdb_to_parquet()
Export DuckDB table to Parquet
get_duckdb_con()
Get a DuckDB connection
get_duckdb_tables()
Get table information from DuckDB
load_duckdb_extension()
Install and load DuckDB extension
save_duckdb_to_gcs()
Save DuckDB to GCS
set_duckdb_comments()
Set table and column comments in DuckDB

Working DuckLake

functions for the internal Working DuckLake with provenance tracking

add_provenance_columns()
Add provenance columns to a data frame
get_working_ducklake()
Get Working DuckLake connection
ingest_to_working()
Ingest data to Working DuckLake
list_working_tables()
List tables with provenance in Working DuckLake
query_at_time()
Query Working DuckLake at a point in time
save_working_ducklake()
Save Working DuckLake to GCS
strip_provenance_columns()
Strip provenance columns from data

Frozen Releases

functions for creating and managing frozen DuckLake releases

compare_releases()
Compare two frozen releases
freeze_release()
Freeze a release of the DuckLake
get_release_metadata()
Get metadata for a frozen release
list_frozen_releases()
List available frozen releases
validate_for_release()
Validate Working DuckLake for release

Read

functions for reading data, particularly from CSV files in the CalCOFI data folder on Google Drive

create_redefinition_files()
Create Redefinition Files for Tables and Fields
determine_field_types()
Determine Field Types for Database
read_csv_files()
Read CSV Files and Their Metadata
read_csv_metadata()
Read CSV Files and Extract Metadata

Transform

functions to transform data after it is read and before it is ingested into the database

detect_csv_changes()
Detect Changes in CSV Files
display_csv_changes()
Display CSV Changes in a Formatted Table
print_csv_change_stats()
Print CSV Change Statistics
transform_data()
Transform Data for Database Ingestion

Validation

functions for data validation and integrity checking

check_data_integrity()
Check Data Integrity for Ingestion
check_multiple_datasets()
Check Multiple Datasets for Integrity
render_integrity_message()
Render Data Integrity Check Message

Ingest

functions for ingesting data into the database (the PostgreSQL path is deprecated; use DuckLake instead)

ingest_csv_to_db() deprecated
Ingest CSV data to PostgreSQL database (DEPRECATED)
ingest_dataset() deprecated
Ingest a Dataset (DEPRECATED)

Version

functions for schema and package versioning

get_schema_versions()
Get Schema Version History
init_schema_version_csv()
Initialize Schema Version CSV
record_schema_version()
Record Schema Version

Utilities

utility functions for database operations (PostgreSQL support is deprecated)

copy_schema()
Copy Database Schema
get_db_con() deprecated
Get a database connection to the CalCOFI PostgreSQL database (DEPRECATED)

Visualize

functions for visualizing diagnostic outputs, particularly color-coded data tables

show_fields_redefine()
Show fields to redefine
show_googledrive_files()
Show Google Drive files
show_tables_redefine()
Show tables to redefine

Other

other functions or datasets not captured by the categories above

commit_version_and_permalink()
Commit Version Changes and Get Permalink
complete_version_release()
Complete Version Release Workflow
get_package_version()
Get Current Package Version
suggest_next_version()
Suggest Next Version
update_package_version()
Synchronized Version Management for Package and Database