Compares MD5 hashes across the archive timestamps of a given provider/dataset.
When multiple archives have identical content, the earliest is kept and
the rest are removed. With the default dry_run = TRUE, nothing is deleted.
Usage
cleanup_duplicate_archives(
provider,
dataset,
gcs_bucket = "calcofi-files-public",
archive_prefix = "archive",
dry_run = TRUE
)
Arguments
- provider
Data provider (e.g., "swfsc.noaa.gov")
- dataset
Dataset name (e.g., "calcofi-db")
- gcs_bucket
Google Cloud Storage (GCS) bucket name (default "calcofi-files-public")
- archive_prefix
Archive folder prefix (default "archive")
- dry_run
If TRUE (default), only report what would be removed; if FALSE, remove the duplicate archives
Value
A tibble of the archive timestamps that were removed (or, when dry_run = TRUE, would be removed)
Examples
if (FALSE) { # \dontrun{
# preview what would be removed
cleanup_duplicate_archives("swfsc.noaa.gov", "calcofi-db")
# actually remove duplicates
cleanup_duplicate_archives("swfsc.noaa.gov", "calcofi-db", dry_run = FALSE)
} # }
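The "keep the earliest, remove the rest" rule from the description can be sketched in base R. This is only an illustration of the rule, not the function's actual implementation: the data frame, its column names (timestamp, md5), the timestamp format, and the hash values are all hypothetical, and it assumes each archive is summarised by a single hash.

```r
# Hypothetical listing: one row per archive timestamp with one MD5 digest.
# Column names, timestamp format, and hash values are illustrative assumptions.
archives <- data.frame(
  timestamp = c("2024-01-01_000000", "2024-02-01_000000",
                "2024-03-01_000000", "2024-04-01_000000"),
  md5       = c("aaa", "aaa", "bbb", "aaa"),
  stringsAsFactors = FALSE
)

# Sort so the earliest timestamp comes first
# (these zero-padded strings sort chronologically).
archives <- archives[order(archives$timestamp), ]

# duplicated() flags every repeat of a hash after its first occurrence,
# so the earliest archive for each distinct hash is the one that is kept.
is_dup    <- duplicated(archives$md5)
to_keep   <- archives[!is_dup, ]
to_remove <- archives[is_dup, ]

print(to_remove$timestamp)
```

Here to_remove plays the role of the dry-run report: it lists the timestamps whose content matches an earlier archive, without deleting anything.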