---
title: "Publish Larvae to OBIS"
editor_options:
chunk_output_type: console
---
This workflow converts CalCOFI ichthyoplankton (fish eggs and larvae) data to an OBIS-compliant DarwinCore Archive for publication. The archive follows the Event Core structure with Occurrence and ExtendedMeasurementOrFact (eMoF) extensions.
**References**: [CalCOFI Fish Eggs & Larvae](https://calcofi.org/data/marine-ecosystem-data/fish-eggs-larvae/) | [erdCalCOFIlrvcnt | CoastWatch ERDDAP](https://coastwatch.pfeg.noaa.gov/erddap/info/erdCalCOFIlrvcnt/index.html) | [OBIS Manual](https://manual.obis.org/) | [DarwinCore Terms](https://dwc.tdwg.org/terms/)
```{r}
#| label: setup
librarian::shelf(
CalCOFI / calcofi4db,
DBI,
dm,
dplyr,
DT,
EML,
glue,
here,
lubridate,
purrr,
stringr,
readr,
tibble,
tidyr,
uuid,
xml2,
quiet = T
)
options(readr.show_col_types = F)
# Dataset metadata
dataset_title <- "CalCOFI Fish Larvae Tows"
dataset_abstract <- "Fish larvae counts and standardized counts for eggs captured in CalCOFI ichthyoplankton nets (primarily vertical [Calvet or Pairovet], oblique [bongo or ring nets], and surface tows [Manta nets]). Surface tows are normally standardized to count per 1,000 m^3 strained. Oblique tows are normally standardized to count per 10 m^2 of surface sampled."
dataset_keywords <- c(
"atmosphere",
"biology",
"biosphere",
"calcofi",
"earth science",
"environment",
"latitude",
"longitude",
"ocean",
"time"
) # GCMD Science Keywords (source: https://catalog.data.gov/dataset/calcofi-larvae-counts-positive-tows)
# dataset_uuid <- UUIDgenerate() # Or use existing dataset UUID
country_code <- "US"
water_body <- "California Current" # LME: http://marineregions.org/mrgid/8549
dataset_id <- glue(
"calcofi.io_workflow_larvae_to_obis_{format(Sys.Date(), '%Y-%m-%d')}"
)
sampling_methods_description <- "Standard CalCOFI ichthyoplankton tows. The standard oblique tow uses a bongo net (71 cm diameter, 0.505 mm mesh) or 1-m ring net (prior to 1978) retrieved at a constant wire angle (45 degrees) from 210 m depth to surface. Surface tows use a Manta net; vertical tows use CalVET/Pairovet nets. Flowmeters measure volume filtered. Samples are preserved in 5% formalin. In the lab, fish eggs and larvae are sorted, identified to lowest taxon possible, enumerated, and measured."
study_extent_description <- "The study covers the California Current ecosystem, primarily off Southern California (standard lines 77-93) but historically extending from the border of Canada to the tip of Baja California. The time series for this dataset generally begins in 1951 and continues to the present, with quarterly cruises."
sampling_description <- "Samples are collected at fixed stations along the CalCOFI grid. Oblique tows are standardized to counts per 10 m^2 of sea surface. Surface tows are standardized to counts per 1,000 m^3. Data includes raw counts (tallies) and standardized abundances."
funding_information <- "CalCOFI is a partnership between the NOAA National Marine Fisheries Service (NMFS), Scripps Institution of Oceanography (SIO), and California Department of Fish and Wildlife (CDFW)."
rights <- "The data may be used and redistributed for free but is not intended for legal use, since it may contain inaccuracies. Neither the data Contributor, ERD, NOAA, nor the United States Government, nor any of their employees or contractors, makes any warranty, express or implied, including warranties of merchantability and fitness for a particular purpose, or assumes any legal liability for the accuracy, completeness, or usefulness, of this information."
# source: [erdCalCOFIlrvcnt | CoastWatch ERDDAP](https://coastwatch.pfeg.noaa.gov/erddap/info/erdCalCOFIlrvcnt/index.html)
license <- "To the extent possible under law, the publisher has waived all rights to these data and has dedicated them to the Public Domain (CC0 1.0). Users may copy, modify, distribute and use the work, including for commercial purposes, without restriction."
# source: [erdCalCOFIlrvcnt | CoastWatch ERDDAP](https://coastwatch.pfeg.noaa.gov/erddap/info/erdCalCOFIlrvcnt/index.html)
license_xml <- c(
"Public Domain (CC0 1.0)" = '<ulink url="http://creativecommons.org/publicdomain/zero/1.0/legalcode"><citetitle>Public Domain (CC0 1.0)</citetitle></ulink>'
)
intellectual_rights <- list(
para = paste(rights, license, collapse = " ")
)
dir_out <- here("data/darwincore/larvae")
# get database connection
schema <- "dev"
con <- get_db_con(schema)
```
## Database Schema
The CalCOFI database organizes ichthyoplankton data hierarchically: cruises contain sites, sites contain tows, and tows contain net samples. Biological observations (eggs, larvae) link to net samples via `net_uuid`.
```{r}
#| label: fig-dm_tbls
#| fig-cap: "Entity relationship diagram (ERD) of the CalCOFI database."
# exclude experimental tables
tbls <- dbListTables(con) |>
sort() |>
setdiff(
c("schema_version", "grid", "site_seg", "bottle", "cast")
)
# learn relations from database and draw
dm <- dm_from_con(con, schema = schema, table_names = tbls, learn_keys = T)
dm_draw(dm, view_type = "all")
```
```{mermaid}
%%| label: fig-calcofi_erd
%%| fig-cap: "Entity relationship diagram (ERD) of the CalCOFI database. Created from image above and prompt to [Claude](https://claude.ai/): 'Generate an ERD mermaid diagram from the image.'"
%%| file: diagrams/calcofi_erd.mmd
```
## Life Stage Reference
Egg development stages follow [Moser & Ahlstrom (1985)](https://repository.library.noaa.gov/view/noaa/5695/noaa_5695_DS1.pdf). Larva stages include: yolk sac, preflexion, flexion, postflexion, and transformation.
::: {.callout-note collapse="true"}
## Egg Stage Descriptions (Moser & Ahlstrom, 1985)
#### Stage I
Cell division has not yet begun. In intact eggs the cytoplasm of the
single cell appears as a clear hemisphere at one pole, easily dif-
ferentiated from the yolk mass which is divided into granules. The
cytoplasm may be displaced to other locations around the periphery
of the yolk mass, but there is usually some accumulation at one pole,
which allows the stage to be identified.
#### Stage II
This begins with the division of the single cell into two cells or
blastomeres. The division is first noticeable when a furrow develops
in the middle of the cytoplasmic cap. Small bubble-like structures
(probably artifacts) are often visible along the furrow and help iden-
tify it. The next cleavage plane is at right angles to the first, and
subsequent synchronous divisions in both meridional and latitudinal
planes produce a hemispherical mound of cells, termed the blasto-
disc. After the 5th or 6th division, the blastodisc has a berry-like
appearance, the so-called "mulberry stage," and with subsequent
divisions the blastomeres become increasingly smaller and more
difficult to distinguish individually. During a certain phase of the
early divisions the blastomeres are about the same size as the yolk
granules. Ifthe blastodisc and yolk mass become disrupted during
collection or preservation, the blastomeres may become distributed
among the yolk granules. They may be distinguished from one
another since they have different refractive indices and the
blastomeres appear darker when viewed with transmitted light.
#### Stage III
Ahlstrom (1943) defined this stage in sardine eggs as beginning with
the appearance of the segmentation cavity. The segmentation cavity
of teleost eggs is the space formed between the blastodisc and the
yolk mass during late cleavage. In most anchovy eggs in our collec-
tions the blastodisc is shrunken and somewhat cup-shaped and con-
sequently the segmentation cavity, which is a delicate structure, is
obliterated. We have found it preferable to define the beginning of
Stage ill on the basis of the appearance of the blastoderm, i.e., when
it has the appearance of tissue rather than of a collection of individual
cells. This stage marks the beginning of gastrulation. The margin
of the blastodisc becomes slightly thickened and is termed the germ
ring. At one region of the germ ring the thickening extends inward
to form the embryonic shield, which defines the future axis of the
embryo. Gastrulation proceeds by further proliferation and downward
movement of cells in the region of the germ ring by a process known
as epiboly. Simultaneously, proliferation and inward migration (em-
boly) of cells from the margin of the embryonic shield produce the
organ-forming cell layers of the primordial embryo. At the end of
Stage III the germ ring is one-third down the yolk mass and the
bilateral nature of the primordial embryo is apparent.
#### Stage IV
At the beginning of this stage the germ ring has enclosed one-third
of the yolk mass and the embryo is beginning to form along the
median region of the embryonic shield. At the end of this stage,
defined by the germ ring enveloping two-thirds of the yolk, the head
region of the embryo is becoming apparent.
#### Stage V
This stage begins with the germ ring two-thirds down the yolk and
ends with the closure of the blastopore and the complete enclosure
of the yolk by the cellular sheath of the embryo. The stage is charac-
terized by rapid differentiation resulting in the formation of several
somites in the midregion of the embryonic axis, development of the
notochord which can be seen from a dorsal viewpoint, and differ-
entiation of the optic vesicles from the brain.
#### Stage VI
This stage begins with the closure of the blastopore and ends when
the tail starts to separate from the yolk mass. The embryonic sheath
of cells is extremely thin and, in some samples, it may be difficult
to determine the point of blastopore closure. In these cases the event
can be estimated from the position of the caudal terminus of the
embryonic axis, since it grows toward the pole with the edge of the
cellular sheath. Initially the caudal region lies flat against the polar
region of the yolk, then gradually thickens and becomes more round-
ed at the tip until it is clearly separate from the yolk. During this
stage the somites are apparent along the entire body axis (except
at the caudal portion), the lens primordium appears in the eye, and
the regions of the brain begin to differentiate.
#### Stage VII
At the beginning of this stage the tip of the tail free from the yolk
is broadly rounded, then begins to narrow as it elongates. The noto-
chord extends almost to the tip and the finfold is just becoming visi-
ble. At the end of this stage the length of the free tail is one-half
the length of the head. For this purpose, head length is considered
the distance from the tip of the snout to the back of the cerebellum
(see Fig. 2H). Relative tail length is the criterion for each remain-
ing stage.
#### Stage VIII
This stage begins when the free tail length is l:,reater than one-half
the head length and ends when tail length equals head length. The
tail becomes pointed during this stage and begins to bend away from
the axis of the body, to the right or left side. The curvature of the
tail generally increases with development, but is subject to individual
variability (Fig. 2). Judgement is required in compensating for curva-
ture in estimating relative tail length; however, accuracy and preci-
sion increase rapidly with practice.
#### Stage IX
This stage begins with the tail extending one-quarter the length of
the yolk sac and ends when it reaches one-half the yolk sac length.
The gut is now apparent along the ventral surface of the tail, and
its terminal section passes through the fin fold which is now con-
siderably wider than in the previous stage. The pectoral fin buds
appear as lateral thickenings as do the otic vesicles.
#### Stage X
This stage starts when the tail is one-half the length of the yolk sac
and ends when it reaches three-quarters of the yolk sac length.
#### Stage XI
This is the final stage before hatching and is defined by a tail length
greater than three-quarters of the length of the yolk sac.
:::
::: {.callout-note collapse="true"}
## Larva Stage Descriptions
Developmental stages of fish larvae captured in CalCOFI ichthyoplankton nets:
- **YOLK**: Yolk-sac larva - newly hatched, still absorbing yolk
- **PREF**: Preflexion larva - notochord straight, before flexion
- **FLEX**: Flexion larva - notochord flexing upward
- **POST**: Postflexion larva - notochord flexion complete
- **TRNS**: Transformation stage - transitioning to juvenile
:::
## Conversion Design Decisions
Key design decisions for converting CalCOFI ichthyoplankton data to DarwinCore:
1. **Event hierarchy**: cruise → site → tow → net (following OBIS sampling-event model)
2. **Occurrence consolidation**: One occurrence per life stage (egg or larva) per species per net sample, rather than separate occurrences for each developmental stage
3. **ExtendedMeasurementOrFact (eMoF)**: Stage-specific counts, body lengths, and sample-level measurements linked via `occurrenceID` or `eventID`
4. **Vocabulary mappings**: S11 for life stages, P06 for units, custom descriptive terms for CalCOFI-specific developmental stages
5. **Sample size**: `vol_sampled_m3` stored as `sampleSizeValue` in Event core (not in eMoF)
## Data Extraction from Database
First, let's extract all necessary tables from the database:
```{r}
#| label: extract-tables
# Get simple lazy table references
tbl_ship <- tbl(con, "ship")
tbl_cruise <- tbl(con, "cruise")
tbl_site <- tbl(con, "site")
tbl_tow <- tbl(con, "tow")
tbl_tow_type <- tbl(con, "tow_type")
tbl_net <- tbl(con, "net")
tbl_species <- tbl(con, "species")
# biological observation tables
tbl_egg <- tbl(con, "egg")
tbl_egg_stage <- tbl(con, "egg_stage")
tbl_larva <- tbl(con, "larva")
tbl_larva_stage <- tbl(con, "larva_stage")
tbl_larva_size <- tbl(con, "larva_size")
```
## Create Life Stage Vocabulary Lookup Tables
Define mappings for egg and larva developmental stages to descriptive terms:
```{r}
#| label: vocabulary-lookups
# Egg stage vocabulary mapping (Moser & Ahlstrom, 1985)
egg_stage_vocab <- tibble(
stage = 1:15,
stage_description = c(
"egg, stage 1 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 2 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 3 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 4 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 5 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 6 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 7 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 8 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 9 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 10 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 11 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 12 of 11 (Moser & Ahlstrom, 1985)", # TODO: stage 12 to 15 of 11 remove?
"egg, stage 13 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 14 of 11 (Moser & Ahlstrom, 1985)",
"egg, stage 15 of 11 (Moser & Ahlstrom, 1985)"
)
)
# Larva stage vocabulary mapping
larva_stage_vocab <- tibble(
stage = c("YOLK", "PREF", "FLEX", "POST", "TRNS"),
stage_description = c(
"larva, yolk sac",
"larva, preflexion",
"larva, flexion",
"larva, postflexion",
"larva, transformation"
)
)
# P06 Measurement Unit vocabulary (NERC)
measurement_unit_vocab <- tibble(
measurementUnit = c(
"individuals",
"millimeters",
"dimensionless",
"cubic meters",
"grams",
"proportion",
"m^3"
),
measurementUnitID = c(
"http://vocab.nerc.ac.uk/collection/P06/current/UUUU/",
"http://vocab.nerc.ac.uk/collection/P06/current/UXMM/",
"http://vocab.nerc.ac.uk/collection/P06/current/UUUU/",
"http://vocab.nerc.ac.uk/collection/P06/current/MCUB/",
"http://vocab.nerc.ac.uk/collection/P06/current/UGRM/",
"http://vocab.nerc.ac.uk/collection/P06/current/UUUU/", # or UPCT?
"http://vocab.nerc.ac.uk/collection/P06/current/MCUB/"
)
)
```
## Create Event Hierarchy
DarwinCore uses an event-based model. We'll create a hierarchical event structure:
- cruise
- site
- tow
- net
```{mermaid}
%%| label: fig-larvae_event_hierarchy
%%| fig-cap: "Event hierarchy for CalCOFI larvae data in DarwinCore Archive format with Extended Measurement or Fact."
%%| file: diagrams/larvae_event_hierarchy.mmd
```
Per [3.3.2. Simple nested datasets with Project-level information – Guide for publishing biological survey and monitoring data to GBIF](https://docs.gbif.org/guide-publishing-survey-data/en/#simple-nested-datasets-with-project-level-information)
```{r}
#| label: build-event-hierarchy
# Build complete event hierarchy
q_events <- tbl_net |>
left_join(tbl_tow, by = "tow_uuid") |>
left_join(tbl_tow_type, by = "tow_type_key") |>
left_join(tbl_site, by = "site_uuid") |>
left_join(tbl_cruise, by = "cruise_uuid") |>
left_join(tbl_ship, by = "ship_key")
# Create event core at different levels ----
flds_to_iso8601 <- function(flds, output = "datetime") {
# helper function to convert date/time fields to ISO 8601 format
# per https://dwc.tdwg.org/terms/ eventDate: add 2007-03-01T13:00:00Z/2008-05-11T15:30:00Z (some time within the interval beginning 1 March 2007 at 1pm UTC and before 11 May 2008 at 3:30pm UTC)
stopifnot(length(flds) %in% c(1, 2))
stopifnot(output %in% c("datetime", "time"))
if (length(flds) == 1 & output == "datetime") {
s <- glue(
"to_char({flds}, 'YYYY-MM-DD\"T\"HH24:MI:SS\"Z\"')"
)
}
if (length(flds) == 2 & output == "datetime") {
s <- glue(
"to_char({flds[1]}, 'YYYY-MM-DD\"T\"HH24:MI:SS\"Z\"') || '/' ||
to_char({flds[2]}, 'YYYY-MM-DD\"T\"HH24:MI:SS\"Z\"')"
)
}
if (length(flds) == 1 & output == "time") {
s <- glue(
"to_char({flds}, 'HH24:MI:SS\"Z\"')"
)
}
if (length(flds) == 2 & output == "time") {
s <- glue(
"to_char({flds[1]}, 'HH24:MI:SS\"Z\"') || '/' ||
to_char({flds[2]}, 'HH24:MI:SS\"Z\"')"
)
}
as.character(s)
}
# * Cruise level events ----
# get min/max datetimes for cruises
q_cruise_event_date <- q_events |>
group_by(cruise_uuid) |>
summarize(
dtime_min = min(time_start, na.rm = T),
dtime_max = max(time_start, na.rm = T),
.groups = "drop"
) |>
mutate(
eventDate = sql(flds_to_iso8601(c("dtime_min", "dtime_max")))
) |>
select(cruise_uuid, eventDate)
q_cruise_events <- q_events |>
distinct(cruise_uuid, ship_name, ship_nodc, date_ym) |>
mutate(
eventID = cruise_uuid,
# parentEventID = NA, # no parent for cruise level; NA generated with union_all
eventRemarks = paste("Cruise on", ship_name),
samplingProtocol = "Marine plankton survey cruise",
sampleSizeValue = NA_real_,
sampleSizeUnit = NA_character_,
habitat = "marine",
eventType = "cruise"
) |>
left_join(
q_cruise_event_date,
by = "cruise_uuid"
) |>
select(
eventID, # parentEventID,
eventType,
eventDate,
samplingProtocol,
eventRemarks,
habitat
)
# Site level events
q_site_events <- q_events |>
distinct(
site_uuid,
cruise_uuid,
longitude,
latitude,
line,
station,
orderocc
) |>
mutate(
eventID = site_uuid,
parentEventID = cruise_uuid,
decimalLatitude = latitude,
decimalLongitude = longitude,
eventRemarks = paste("CalCOFI station", station, "on line", line),
# per https://proj.org/en/stable/operations/projections/calcofi.html
locationID = paste0(line, "_", station),
eventType = "site",
samplingProtocol = "Marine plankton survey station"
) |>
select(
eventID,
parentEventID,
eventType,
samplingProtocol,
eventRemarks,
decimalLatitude,
decimalLongitude,
locationID
)
# Tow level events
q_tow_events <- q_events |>
distinct(
tow_uuid,
site_uuid,
tow_number,
tow_type_key,
tow_type_description = description,
time_start
) |>
mutate(
eventID = tow_uuid,
parentEventID = site_uuid,
eventDate = sql(flds_to_iso8601("time_start")),
eventRemarks = paste("Tow", tow_number, "-", tow_type_description),
eventType = "tow",
samplingProtocol = paste(tow_type_key, ":", tow_type_description)
) |>
select(
eventID,
parentEventID,
eventType,
eventDate,
samplingProtocol,
eventRemarks
)
# Net level events (sampling events)
q_net_events_allflds <- q_events |>
distinct(
net_uuid,
tow_uuid,
side,
std_haul_factor,
vol_sampled_m3,
prop_sorted,
smallplankton,
totalplankton
) |>
mutate(
eventID = net_uuid,
parentEventID = tow_uuid,
eventRemarks = paste(
"Net sample from side",
side,
"; only",
prop_sorted,
"proportion sorted of",
vol_sampled_m3,
"cubic meters sampled"
),
eventType = "net_sample",
samplingProtocol = paste("Plankton net tow -", side, "side"),
sampleSizeValue = vol_sampled_m3,
sampleSizeUnit = "m^3"
)
q_net_events <- q_net_events_allflds |>
select(
eventID,
parentEventID,
eventType,
samplingProtocol,
eventRemarks,
sampleSizeValue,
sampleSizeUnit
)
# Combine all event levels
db_version <- tbl(con, "schema_version") |> pull(version)
# dataset_id <- paste0(
# # "https://github.com/CalCOFI/calcofi4db/releases/tag/v",
# "calcofi_db.",
# db_version,
# "_larvae.",
# format(Sys.Date(), "%Y-%m-%d")
# )
# "calcofi_db.1.1.0_larvae.2025-10-09"
q_event_core <- list(
q_cruise_events |>
mutate(
parentEventID = sql("NULL::uuid"),
sampleSizeValue = sql("NULL::numeric"),
sampleSizeUnit = NA_character_
),
q_site_events |>
mutate(
sampleSizeValue = sql("NULL::numeric"),
sampleSizeUnit = NA_character_
),
q_tow_events |>
mutate(
sampleSizeValue = sql("NULL::numeric"),
sampleSizeUnit = NA_character_
),
q_net_events
) |>
reduce(union_all) |>
mutate(
# basisOfRecord = "HumanObservation",
geodeticDatum = "WGS84",
coordinateUncertaintyInMeters = 1000, # TODO: adjust based on accuracy of GPS/Loran/dead-reckoning over time
countryCode = country_code, # country_code <- "US"
waterBody = water_body,
datasetID = dataset_id
)
# TODO: datasetID: permalink to this qmd in calcofi4db release?
```
## Create Occurrence Records
Now let's combine all biological observations into occurrence records:
```{r}
#| label: build-occurrences
# Create consolidated egg occurrences (one per species per net)
q_egg_occurrences <- tbl_egg |>
left_join(tbl_species, by = "species_id") |>
mutate(
occurrenceID = sql("gen_random_uuid()"),
eventID = net_uuid,
scientificName = scientific_name,
scientificNameID = paste0("urn:lsid:marinespecies.org:taxname:", worms_id),
kingdom = "Animalia",
occurrenceStatus = ifelse(tally > 0, "present", "absent"),
organismQuantity = tally, # TOTAL count from egg table
organismQuantityType = "individuals",
lifeStage = "egg",
#lifeStage = "https://vocab.nerc.ac.uk/collection/S11/current/S1122/",
# lifeStageID = "http://vocab.nerc.ac.uk/collection/S11/current/S1110/",
preparations = "egg sample",
basisOfRecord = "HumanObservation"
) |>
select(
occurrenceID,
eventID,
species_id,
scientificName,
scientificNameID,
kingdom,
occurrenceStatus,
organismQuantity,
organismQuantityType,
lifeStage, # lifeStageID,
preparations,
basisOfRecord
)
# Create consolidated larva occurrences (one per species per net)
q_larva_occurrences <- tbl_larva |>
left_join(tbl_species, by = "species_id") |>
mutate(
occurrenceID = sql("gen_random_uuid()"),
eventID = net_uuid,
scientificName = scientific_name,
scientificNameID = paste0("urn:lsid:marinespecies.org:taxname:", worms_id),
kingdom = "Animalia",
occurrenceStatus = ifelse(tally > 0, "present", "absent"),
organismQuantity = tally, # TOTAL count from larva table
organismQuantityType = "individuals",
lifeStage = "larva",
# lifeStageID = "http://vocab.nerc.ac.uk/collection/S11/current/S1133/",
preparations = "larva sample",
basisOfRecord = "HumanObservation"
) |>
select(
occurrenceID,
eventID,
species_id,
scientificName,
scientificNameID,
kingdom,
occurrenceStatus,
organismQuantity,
organismQuantityType,
lifeStage, # lifeStageID,
preparations,
basisOfRecord
)
# Combine egg and larva occurrences
q_occurrence_extension <- list(
q_egg_occurrences,
q_larva_occurrences
) |>
reduce(union_all) |>
# Add additional required/recommended fields
mutate(
# TODO: identifiedBy, dateIdentified,identificationReferences
# identifiedBy = "[NEED IDENTIFIER NAME]", # Prompt: Add taxonomist name
# dateIdentified = "[NEED IDENTIFICATION DATE]", # Prompt: Add identification date
# identificationReferences = "[NEED REFERENCES]", # Prompt: Add identification guides used
modified = sql("CURRENT_DATE")
) |>
# Remove species_id (internal field used only for joins)
select(
occurrenceID,
eventID,
scientificName,
scientificNameID,
kingdom,
occurrenceStatus,
organismQuantity,
organismQuantityType,
lifeStage, # lifeStageID,
preparations,
basisOfRecord,
modified
)
```
## Create ExtendedMeasurementOrFact Extension
Create eMoF records for three types of measurements:
1. Sample-level measurements (linked to eventID)
2. Stage-specific abundance (linked to occurrenceID)
3. Body length measurements (linked to occurrence ID)
```{r}
#| label: build-emof
# 1. Sample-level measurements (eventID, occurrenceID = NA) ----
d_sample_measurements <- q_net_events_allflds |>
select(
eventID,
std_haul_factor,
prop_sorted,
smallplankton,
totalplankton
) |> # vol_sampled_m3 moved to Event core
collect() |>
pivot_longer(
cols = c(std_haul_factor, prop_sorted, smallplankton, totalplankton),
names_to = "meas_type",
values_to = "measurementValue"
) |>
filter(!is.na(measurementValue)) |>
mutate(
measurementID = UUIDgenerate(n = n()),
occurrenceID = NA_character_,
measurementType = case_when(
meas_type == "std_haul_factor" ~ "standardized haul factor",
meas_type == "prop_sorted" ~ "proportion of sample sorted",
meas_type == "smallplankton" ~ "small plankton biomass",
meas_type == "totalplankton" ~ "total plankton biomass"
),
measurementTypeID = case_when(
meas_type == "std_haul_factor" ~ NA_character_, # TODO: search P01 for standardization factor
meas_type == "prop_sorted" ~ NA_character_, # TODO: search P01 for subsample fraction
meas_type == "smallplankton" ~ NA_character_, # TODO: search P01 for zooplankton biomass
meas_type == "totalplankton" ~ NA_character_
), # TODO: search P01 for zooplankton biomass
measurementUnit = case_when(
meas_type == "std_haul_factor" ~ "dimensionless",
meas_type == "prop_sorted" ~ "dimensionless",
meas_type %in% c("smallplankton", "totalplankton") ~ "grams"
),
measurementMethod = "https://oceanview.pfeg.noaa.gov/CalCOFI/calcofi_info.html",
measurementRemarks = case_when(
meas_type ==
"std_haul_factor" ~ "Standardization factor accounting for water filtered; abundances expressed as per 10 m² (Smith 1977)",
meas_type == "prop_sorted" ~ "Fraction of total sample examined",
TRUE ~ NA_character_
)
) |>
left_join(measurement_unit_vocab, by = "measurementUnit") |>
select(
eventID,
occurrenceID,
measurementID,
measurementType,
measurementTypeID,
measurementValue,
measurementUnit,
measurementUnitID,
measurementMethod,
measurementRemarks
)
# 2. Stage-specific abundance (eventID = NA, occurrenceID) ----
# Egg stage abundances - need to join with egg occurrences to get occurrenceID
d_egg_stage_abundance <- tbl_egg_stage |>
left_join(
egg_stage_vocab |>
copy_to(
con,
df = _,
name = "egg_stage_vocab_tmp",
temporary = TRUE,
overwrite = T
),
by = "stage"
) |>
collect() |>
left_join(
q_egg_occurrences |>
collect() |>
select(net_uuid = eventID, species_id, occurrenceID),
by = c("net_uuid", "species_id")
) |>
mutate(
eventID = NA_character_,
measurementID = UUIDgenerate(n = n()),
measurementType = "abundance by life stage",
measurementTypeID = NA_character_, # TODO: search P01 for "abundance" + "development stage"
measurementValue = tally,
measurementValueID = NA_character_, # Could reference custom vocabulary for egg stages
measurementUnit = "individuals",
measurementMethod = "https://oceanview.pfeg.noaa.gov/CalCOFI/calcofi_info.html",
measurementRemarks = stage_description
) |>
left_join(measurement_unit_vocab, by = "measurementUnit") |>
select(
eventID,
occurrenceID,
measurementID,
measurementType,
measurementTypeID,
measurementValue,
measurementValueID,
#measurementUnit, measurementUnitID = measurementUnitID.y, measurementMethod,
measurementUnit,
measurementUnitID,
measurementMethod,
measurementRemarks
)
# Larva stage abundances - need to join with larva occurrences to get occurrenceID
d_larva_stage_abundance <- tbl_larva_stage |>
left_join(
larva_stage_vocab |>
copy_to(
con,
df = _,
name = "larva_stage_vocab_tmp",
temporary = TRUE,
overwrite = T
),
by = "stage"
) |>
collect() |>
left_join(
q_larva_occurrences |>
collect() |>
select(net_uuid = eventID, species_id, occurrenceID),
by = c("net_uuid", "species_id")
) |>
mutate(
eventID = NA_character_,
measurementID = UUIDgenerate(n = n()),
measurementType = "abundance by life stage",
measurementTypeID = NA_character_, # TODO: search P01 for "abundance" + "development stage"
measurementValue = tally,
measurementValueID = NA_character_, # Could reference custom vocabulary for larva stages
measurementUnit = "individuals",
measurementMethod = "https://oceanview.pfeg.noaa.gov/CalCOFI/calcofi_info.html",
measurementRemarks = stage_description
) |>
left_join(measurement_unit_vocab, by = "measurementUnit") |>
select(
eventID,
occurrenceID,
measurementID,
measurementType,
measurementTypeID,
measurementValue,
measurementValueID,
measurementUnit,
measurementUnitID,
measurementMethod,
measurementRemarks
)
# 3. Body length measurements (eventID = NA, occurrenceID) ----
d_body_length <- tbl_larva_size |>
collect() |>
left_join(
q_larva_occurrences |>
collect() |>
select(net_uuid = eventID, species_id, occurrenceID),
by = c("net_uuid", "species_id")
) |>
filter(!is.na(length_mm)) |>
mutate(
eventID = NA_character_,
measurementID = UUIDgenerate(n = n()),
measurementType = "body length",
measurementTypeID = NA_character_, # TODO: search P01 for fish larvae body length (e.g., OBSINDLX)
measurementValue = length_mm,
measurementValueID = NA_character_,
measurementUnit = "millimeters",
measurementMethod = "https://oceanview.pfeg.noaa.gov/CalCOFI/calcofi_info.html",
measurementRemarks = paste0("Count: ", tally, "; Total length measurement")
) |>
left_join(measurement_unit_vocab, by = "measurementUnit") |>
select(
eventID,
occurrenceID,
measurementID,
measurementType,
measurementTypeID,
measurementValue,
measurementValueID,
measurementUnit,
measurementUnitID,
measurementMethod,
measurementRemarks
)
# Combine all eMoF records ----
d_extendedmeasurementorfact_extension <- bind_rows(
d_sample_measurements,
d_egg_stage_abundance,
d_larva_stage_abundance,
d_body_length
)
# table(is.na(d_extendedmeasurementorfact_extension$occurrenceID))
# FALSE TRUE
# 367,218 243,598
#
# table(is.na(d_extendedmeasurementorfact_extension$eventID))
# FALSE TRUE
# 243,598 367,218
```
## Write DarwinCore Archive Files
Export the event core and extension files (occurrence, eMoF) as CSV files:
```{r}
#| label: export-dwc
# Create output directory
dir.create(dir_out, showWarnings = F)
# Write core and extension files
write_csv(q_event_core |> collect(), file.path(dir_out, "event.csv"))
write_csv(
q_occurrence_extension |> collect(),
file.path(dir_out, "occurrence.csv")
)
write_csv(
d_extendedmeasurementorfact_extension,
file.path(dir_out, "extendedMeasurementOrFact.csv")
)
```
## Create meta.xml programmatically by cross-walking CSV fields to DwC terms
Generate `meta.xml` by mapping CSV column names to DarwinCore term URIs:
```{r}
#| label: create-meta-xml
# Define DarwinCore term mappings for each file type
dwc_terms <- list(
event = list(
rowType = "http://rs.tdwg.org/dwc/terms/Event",
idField = "eventID",
terms = c(
eventID = "http://rs.tdwg.org/dwc/terms/eventID",
parentEventID = "http://rs.tdwg.org/dwc/terms/parentEventID",
eventType = "http://rs.tdwg.org/dwc/terms/eventType",
eventDate = "http://rs.tdwg.org/dwc/terms/eventDate",
eventTime = "http://rs.tdwg.org/dwc/terms/eventTime",
samplingProtocol = "http://rs.tdwg.org/dwc/terms/samplingProtocol",
sampleSizeValue = "http://rs.tdwg.org/dwc/terms/sampleSizeValue",
sampleSizeUnit = "http://rs.tdwg.org/dwc/terms/sampleSizeUnit",
eventRemarks = "http://rs.tdwg.org/dwc/terms/eventRemarks",
habitat = "http://rs.tdwg.org/dwc/terms/habitat",
decimalLatitude = "http://rs.tdwg.org/dwc/terms/decimalLatitude",
decimalLongitude = "http://rs.tdwg.org/dwc/terms/decimalLongitude",
geodeticDatum = "http://rs.tdwg.org/dwc/terms/geodeticDatum",
coordinateUncertaintyInMeters = "http://rs.tdwg.org/dwc/terms/coordinateUncertaintyInMeters",
locationID = "http://rs.tdwg.org/dwc/terms/locationID",
countryCode = "http://rs.tdwg.org/dwc/terms/countryCode",
waterBody = "http://rs.tdwg.org/dwc/terms/waterBody",
datasetID = "http://rs.tdwg.org/dwc/terms/datasetID"
# basisOfRecord = "http://rs.tdwg.org/dwc/terms/basisOfRecord"
)
),
occurrence = list(
rowType = "http://rs.tdwg.org/dwc/terms/Occurrence",
idField = "occurrenceID",
coreIdField = "eventID",
terms = c(
occurrenceID = "http://rs.tdwg.org/dwc/terms/occurrenceID",
eventID = "http://rs.tdwg.org/dwc/terms/eventID",
scientificName = "http://rs.tdwg.org/dwc/terms/scientificName",
scientificNameID = "http://rs.tdwg.org/dwc/terms/scientificNameID",
kingdom = "http://rs.tdwg.org/dwc/terms/kingdom",
occurrenceStatus = "http://rs.tdwg.org/dwc/terms/occurrenceStatus",
organismQuantity = "http://rs.tdwg.org/dwc/terms/organismQuantity",
organismQuantityType = "http://rs.tdwg.org/dwc/terms/organismQuantityType",
lifeStage = "http://rs.tdwg.org/dwc/terms/lifeStage",
# lifeStageID = "http://rs.tdwg.org/dwc/terms/lifeStageID",
preparations = "http://rs.tdwg.org/dwc/terms/preparations",
basisOfRecord = "http://rs.tdwg.org/dwc/terms/basisOfRecord",
modified = "http://purl.org/dc/terms/modified"
)
),
extendedmeasurementorfact = list(
rowType = "http://rs.iobis.org/obis/terms/ExtendedMeasurementOrFact",
idField = "measurementID",
coreIdField = "eventID", # Primary linking field
terms = c(
eventID = "http://rs.tdwg.org/dwc/terms/eventID",
occurrenceID = "http://rs.tdwg.org/dwc/terms/occurrenceID",
measurementID = "http://rs.tdwg.org/dwc/terms/measurementID",
measurementType = "http://rs.tdwg.org/dwc/terms/measurementType",
measurementTypeID = "http://rs.iobis.org/obis/terms/measurementTypeID",
measurementValue = "http://rs.tdwg.org/dwc/terms/measurementValue",
measurementValueID = "http://rs.iobis.org/obis/terms/measurementValueID",
measurementUnit = "http://rs.tdwg.org/dwc/terms/measurementUnit",
measurementUnitID = "http://rs.iobis.org/obis/terms/measurementUnitID",
measurementMethod = "http://rs.tdwg.org/dwc/terms/measurementMethod",
measurementRemarks = "http://rs.tdwg.org/dwc/terms/measurementRemarks"
)
)
)
# Function to create field elements for meta.xml
create_field_elements <- function(csv_file, term_map, coreid_field = NULL) {
# Read column names from CSV
col_names <- names(read_csv(csv_file, n_max = 0, show_col_types = FALSE))
# Map columns to DwC terms
field_elements <- map(seq_along(col_names), function(i) {
col <- col_names[i]
# Skip coreid column if specified (it's defined separately in <coreid> element)
if (!is.null(coreid_field) && col == coreid_field) {
return(NULL)
}
term <- term_map$terms[[col]]
if (!is.null(term)) {
glue(' <field index="{i-1}" term="{term}"/>')
} else {
# Field exists in CSV but not in DwC mapping - skip or warn
message(
"Warning: Column '",
col,
"' in ",
csv_file,
" has no DwC mapping and will be skipped in meta.xml."
)
NULL
}
})
# Remove NULL entries and combine
field_elements <- compact(field_elements)
paste(field_elements, collapse = "\n")
}
# Function to determine coreID index
get_coreid_index <- function(csv_file, coreid_field) {
col_names <- names(read_csv(csv_file, n_max = 0, show_col_types = FALSE))
# Handle multiple possible coreID fields (eventID or occurrenceID)
if (length(coreid_field) > 1) {
# Find which one exists in the CSV
coreid_field <- intersect(coreid_field, col_names)[1]
}
which(col_names == coreid_field) - 1 # 0-indexed
}
# Generate meta.xml content
generate_meta_xml <- function(dir_out) {
# Core (Event)
event_file <- file.path(dir_out, "event.csv")
event_fields <- create_field_elements(event_file, dwc_terms$event)
event_id_idx <- which(
names(read_csv(event_file, n_max = 0, show_col_types = FALSE)) ==
dwc_terms$event$idField
) -
1
# Extension: Occurrence
occ_file <- file.path(dir_out, "occurrence.csv")
occ_fields <- create_field_elements(
occ_file,
dwc_terms$occurrence,
coreid_field = dwc_terms$occurrence$coreIdField
)
occ_coreid_idx <- get_coreid_index(occ_file, dwc_terms$occurrence$coreIdField)
# Extension: ExtendedMeasurementOrFact
emof_file <- file.path(dir_out, "extendedMeasurementOrFact.csv")
emof_fields <- create_field_elements(
emof_file,
dwc_terms$extendedmeasurementorfact,
coreid_field = dwc_terms$extendedmeasurementorfact$coreIdField
)
emof_coreid_idx <- get_coreid_index(
emof_file,
dwc_terms$extendedmeasurementorfact$coreIdField
)
# Build complete meta.xml
meta_xml <- glue(
'<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://rs.tdwg.org/dwc/text/ http://rs.tdwg.org/dwc/text/tdwg_dwc_text.xsd">
<core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\\n"
fieldsEnclosedBy=\'"\' ignoreHeaderLines="1" rowType="{dwc_terms$event$rowType}">
<files>
<location>event.csv</location>
</files>
<id index="{event_id_idx}" />
{event_fields}
</core>
<extension encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\\n"
fieldsEnclosedBy=\'"\' ignoreHeaderLines="1" rowType="{dwc_terms$occurrence$rowType}">
<files>
<location>occurrence.csv</location>
</files>
<coreid index="{occ_coreid_idx}" />
{occ_fields}
</extension>
<extension encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\\n"
fieldsEnclosedBy=\'"\' ignoreHeaderLines="1" rowType="{dwc_terms$extendedmeasurementorfact$rowType}">
<files>
<location>extendedMeasurementOrFact.csv</location>
</files>
<coreid index="{emof_coreid_idx}" />
{emof_fields}
</extension>
</archive>'
)
return(meta_xml)
}
# Generate and write meta.xml
meta_xml <- generate_meta_xml(dir_out)
writeLines(meta_xml, file.path(dir_out, "meta.xml"))
cat("Generated meta.xml with field mappings:\n")
cat(meta_xml)
```
## Create EML Metadata
Build EML (Ecological Metadata Language) document with geographic, temporal, and taxonomic coverage:
```{r}
#| label: build-eml
# Geographic coverage
geo_coverage <- tbl_site |>
summarise(
westBoundingCoordinate = min(longitude, na.rm = T),
eastBoundingCoordinate = max(longitude, na.rm = T),
northBoundingCoordinate = max(latitude, na.rm = T),
southBoundingCoordinate = min(latitude, na.rm = T)
) |>
collect()
# Temporal coverage
temp_coverage <- tbl_tow |>
summarise(
beginDate = min(time_start, na.rm = TRUE),
endDate = max(time_start, na.rm = TRUE)
) |>
collect() |>
pivot_longer(
cols = everything(),
names_to = "name",
values_to = "value"
) |>
mutate(
value = as.Date(value)
) |>
deframe() |>
as.list()
# Taxonomic coverage
db <- "https://file.calcofi.io/data/calcofi.duckdb"
tmp_db <- file.path(tempdir(), "calcofi.duckdb")
con_dk <- dbConnect(::duckdb(), dbdir = tmp_db, read_only = F)
# attach remote duckdb database, but first detect if already attached
if (!"taxonomy" %in% dbListTables(con_dk)) {
dbSendQuery(con_dk, glue::glue("ATTACH DATABASE '{db}' AS cc;"))
}
# dbListTables(con)
# [1] "bottle" "cast" "cruise" "egg" "egg_stage" "grid"
# [7] "larva" "larva_size" "larva_stage" "net" "schema_version" "ship"
# [13] "site" "site_seg" "species" "taxa_rank" "taxonomy" "tow"
# [19] "tow_type"
d_taxa <- dbReadTable(con_dk, Id(schema = "cc", table = "taxonomy"))
# WoRMS superseded identifiers for some CalCOFI taxa, otherwise missing in d_taxa so no scientificName found (EML fails validation)
d_worms_superseded = tribble(
~sci_old , ~worms_old , ~sci_new , ~worms_new ,
"Serraninae" , 151756 , "Serranidae" , 125561 ,
"Anthiadinae" , 1061345 , "Anthiadidae" , 1758484 ,
"Epinephelinae" , 151758 , "Epinephelidae" , 1776907 ,
"Paranthias colonus" , 282083 , "Cephalopholis colonus" , 1750630
)
taxa_coverage <- tbl_species |>
mutate(
worms_id = case_when(
is.na(worms_id) | worms_id == 0 ~ NA_integer_,
TRUE ~ worms_id
)
) |>
collect() |>
# replace worms_id with updated WoRMS IDs where applicable
left_join(
d_worms_superseded |> select(worms_old, worms_new),
by = join_by(worms_id == worms_old)
) |>
mutate(
worms_id = case_when(
!is.na(worms_new) ~ worms_new,
TRUE ~ worms_id
)
) |>
select(-worms_new) |>
filter(!is.na(worms_id)) |>
distinct(species_id, worms_id, common_name) |>
left_join(
d_taxa |>
distinct(
taxonID,
taxonRank,
scientificName
),
by = join_by(worms_id == taxonID)
) |>
# De-duplicate in case multiple local species map to same WoRMS ID
group_by(worms_id, taxonRank, scientificName) |>
summarize(
species_id = first(species_id),
common_name = paste(unique(na.omit(common_name)), collapse = ", "),
.groups = "drop"
) |>
arrange(species_id) |>
pmap(function(
species_id,
taxonRank,
scientificName,
common_name,
worms_id,
...
) {
item <- list(
id = glue("calcofi_species_{species_id}"),
taxonRankName = taxonRank,
taxonRankValue = scientificName,
taxonId = list(
provider = "https://www.marinespecies.org",
as.character(worms_id)
)
)
if (!is.na(common_name) && common_name != "") {
item$commonName <- common_name
}
return(item)
})
# Metadata validation result: Invalid
# Basic Metadata Edit
# eml.intellectualRights.license: Data Licence is required.
# Taxonomic Coverage Edit
# eml.taxonomicCoverages[0].taxonKeywords[340].scientificName: Scientific Name is required.
# eml.taxonomicCoverages[0].taxonKeywords[345].scientificName: Scientific Name is required.
# eml.taxonomicCoverages[0].taxonKeywords[348].scientificName: Scientific Name is required.
# eml.taxonomicCoverages[0].taxonKeywords[363].scientificName: Scientific Name is required.
# find taxa_coverage items with missing scientificName
missing_taxa <- keep(taxa_coverage, ~ is.na(.x$taxonRankValue))
missing_sp_ids <- map_chr(missing_taxa, "id") |>
str_replace("calcofi_species_", "") |>
as.integer()
missing_sp <- tbl_species |>
filter(species_id %in% missing_sp_ids)
# species_id scientific_name itis_id worms_id common_name
# <int> <chr> <int> <int> <chr>
# 1 430 Serraninae 643089 151756 NA
# 2 435 Anthiinae 167675 1061345 NA
# 3 438 Grammistini 643093 151758 Soapfishes
# 4 453 Paranthias colonus 551113 282083 Pacific creolefish
if (length(missing_taxa) > 0) {
cat("WARNING: The following taxa are missing scientific names:\n")
print(missing_taxa)
cat(
"\nPlease add scientific names for these taxa before finalizing the dataset.\n"
)
}
# Create EML document
my_eml <- list(
packageId = dataset_id,
system = "calcofi.io",
dataset = list(
title = dataset_title,
intellectualRights = intellectual_rights,
creator = list(
individualName = list(
givenName = "Ed",
surName = "Weber"
),
organizationName = "NOAA SWFSC",
electronicMailAddress = "ed.weber@noaa.gov",
userId = list(
directory = "https://orcid.org/",
userId = "0000-0002-0942-434X"
)
),
abstract = dataset_abstract,
keywordSet = list(
keyword = dataset_keywords,
keywordThesaurus = "GCMD Science Keywords"
),
coverage = list(
geographicCoverage = list(
geographicDescription = "California Current Large Marine Ecoregion",
boundingCoordinates = geo_coverage
),
temporalCoverage = list(
rangeOfDates = list(
beginDate = list(
calendarDate = as.character(temp_coverage$beginDate)
),
endDate = list(calendarDate = as.character(temp_coverage$endDate))
)
),
taxonomicCoverage = list(
generalTaxonomicCoverage = "Marine ichthyoplankton and fish eggs",
taxonomicClassification = taxa_coverage
)
),
contact = list(
individualName = list(
givenName = "Ed",
surName = "Weber"
),
electronicMailAddress = "ed.weber@noaa.gov"
),
methods = list(
methodStep = list(
description = list(
para = sampling_methods_description
)
),
sampling = list(
studyExtent = list(
description = list(
para = study_extent_description
)
),
samplingDescription = list(
para = sampling_description
)
)
),
project = list(
title = dataset_title,
personnel = list(
individualName = list(
givenName = "Ed",
surName = "Weber"
),
role = "Data Manager"
),
funding = list(
para = funding_information
)
)
)
)
eml_xml <- glue("{dir_out}/eml.xml")
# Write EML
write_eml(my_eml, eml_xml)
# replace license placeholder with actual license xml
readLines(eml_xml) |>
str_replace(fixed(names(license_xml)), license_xml) |>
writeLines(eml_xml)
validation_result <- eml_validate(eml_xml)
if (!validation_result) {
cat("EML validation errors:\n")
print(attr(validation_result, "errors"))
stop()
} else {
cat("EML is valid!\n")
}
```
## Data Quality Checks
Validate data quality: check for missing WoRMS IDs, orphan records, and output summary statistics:
```{r}
#| label: check-data-quality
# Check for missing WoRMS IDs
d_missing_worms <- tbl_species |>
filter(is.na(worms_id)) |>
select(species_id, scientific_name) |>
collect()
if (nrow(d_missing_worms) > 0) {
cat("WARNING: The following species are missing WoRMS IDs:\n")
print(d_missing_worms)
cat(
"\nPlease add WoRMS AphiaIDs for these species before finalizing the dataset.\n"
)
}
# Check for orphan records
d_orphan_occurrences <- q_occurrence_extension |>
anti_join(q_event_core, by = "eventID") |>
collect()
if (nrow(d_orphan_occurrences) > 0) {
cat(
"\nWARNING: Found",
nrow(d_orphan_occurrences),
"occurrence records without matching events.\n"
)
}
# Summary statistics
n_events <- q_event_core |> summarize(n = n()) |> pull(n)
n_events <- q_event_core |>
filter(eventType == "net_sample") |>
summarize(n = n()) |>
pull(n)
n_cruises <- q_event_core |>
filter(eventType == "cruise") |>
summarize(n = n()) |>
pull(n)
n_sites <- q_event_core |>
filter(eventType == "site") |>
summarize(n = n()) |>
pull(n)
n_tows <- q_event_core |>
filter(eventType == "tow") |>
summarize(n = n()) |>
pull(n)
n_nets <- q_event_core |>
filter(eventType == "net_sample") |>
summarize(n = n()) |>
pull(n)
n_occs <- q_occurrence_extension |> summarize(n = n()) |> pull(n)
n_species <- q_occurrence_extension |>
distinct(scientificName) |>
summarize(n = n()) |>
pull(n)
n_measures <- d_extendedmeasurementorfact_extension |> nrow()
glue(
"
=== Dataset Summary ===
Total events: {prettyNum(n_events, big.mark = ',')}
- Cruises: {prettyNum(n_cruises, big.mark = ',')}
- Sites: {prettyNum(n_sites, big.mark = ',')}
- Tows: {prettyNum(n_tows, big.mark = ',')}
- Net samples: {prettyNum(n_nets, big.mark = ',')}
Total occurrences: {prettyNum(n_occs, big.mark = ',')}
Total species: {prettyNum(n_species, big.mark = ',')}
Total measurements: {prettyNum(n_measures, big.mark = ',')}"
)
```
## Package Creation
Create the final DarwinCore Archive as a zip file containing all CSV files, meta.xml, and eml.xml:
```{r}
#| label: create-dwc-archive
# Create DwC-A zip file
zip_file <- file.path(dir_out, paste0("../larvae_", Sys.Date(), ".zip"))
zip(
zip_file,
files = file.path(
dir_out,
c(
"event.csv",
"occurrence.csv",
"extendedMeasurementOrFact.csv",
"meta.xml",
"eml.xml"
)
),
flags = "-j"
) # exclude parent folders
cat("\nDarwin Core Archive created:", zip_file, "\n")
```
## Check CSV
Verify the exported CSV files and inspect measurement types and units:
```{r}
#| label: check-csv
d_event <- read_csv(glue("{dir_out}/event.csv"))
d_occurrence <- read_csv(glue("{dir_out}/occurrence.csv"))
d_emof <- read_csv(glue("{dir_out}/extendedMeasurementOrFact.csv"))
# d_event
# d_occurrence
# d_emof
# d_emof |> select(measurementType, measurementUnit) |> table(useNA = "ifany")
# measurementUnit
# measurementType dimensionless grams individuals millimeters
# abundance by life stage 0 0 125347 0
# body length 0 0 0 241871
# proportion of sample sorted 76512 0 0 0
# small plankton biomass 0 45287 0 0
# standardized haul factor 76512 0 0 0
# total plankton biomass 0 45287 0 0
```
## Dataset metadata
- **Dataset ID** \
`r dataset_id` \
_(This points to this workflow. See other <https://manual.obis.org/examples.html>, e.g. `datasetID`: <https://marineinfo.org/id/dataset/6403>)_\
- **Dataset title** \
`r dataset_title`\
- **Dataset abstract** \
_(200-500 words describing the dataset)_\
`r dataset_abstract` \
_(Source: [erdCalCOFIlrvcnt | CoastWatch ERDDAP](https://coastwatch.pfeg.noaa.gov/erddap/info/erdCalCOFIlrvcnt/index.html))_
- **Keywords** \
_(3-5 relevant keywords)_ \
`r dataset_keywords`\
- **Country code** \
_(ISO 3166-1 alpha-2)_ \
`r country_code`\
- **Water body name** \
`r water_body`\
- **Sampling Methods Description**\
`r sampling_methods_description`\
- **Study Extent Description** \
`r study_extent_description`\
- **Sampling Description** \
`r sampling_description`\
- **Funding Information** \
`r funding_information`\
- **Intellectual Rights** \
`r intellectual_rights`\
_(Sources: [erdCalCOFIlrvcnt | CoastWatch ERDDAP](https://coastwatch.pfeg.noaa.gov/erddap/info/erdCalCOFIlrvcnt/index.html); [Applying a License to a Dataset | GBIF IPT User Manual](https://ipt.gbif.org/manual/en/ipt/latest/applying-license#supplementary-information))_
<!--
## Other Checklist Items
### Personnel Information
- [ ] Dataset creator name and ORCID
- [ ] Creator organization
- [ ] Creator email
- [ ] Dataset contact name and email
- [ ] Principal investigator name
### Project Information
- [ ] Project title
- [ ] Funding information (agency and grant numbers)
### Methodology
- [ ] Detailed sampling methods description
- [ ] Study extent description
- [ ] Sampling design and protocols
- [ ] Measurement methods for environmental variables
- [ ] Size measurement methods for larvae
- [ ] Taxonomist name(s) who identified specimens
- [ ] Identification date(s)
- [ ] Identification references used
### Geographic Information
- [ ] Geographic description of the study area
- [ ] GPS accuracy (coordinate uncertainty in meters)
## Done
- [x] drop `lifeStageID` (does not exist) and make **eMoF** like ["Lifestage" in 16 Dataset Examples: ENV-DATA | The OBIS Manual](https://manual.obis.org/examples.html#:~:text=TripNR3242TripStationNR16781MidasTripActionID105598occurenceIDTA_105598_(Pseudo%2D)pediastrum_5-,Lifestage,-http%3A//vocab.nerc):
- `measurementType`: "Lifestage"
- `measurementTypeID`: "http://vocab.nerc.ac.uk/collection/P01/current/LSTAGE01/"
- `measurementValue`: "adult"
- `measurementValueID`: "http://vocab.nerc.ac.uk/collection/S11/current/S1116/"
- `measurementUnit`:
- `measurementUnitID`:
- `measurementDeterminedBy`: "Flanders Marine Institute"
- `measurementMethod`: "identified and counted by image analysis and normalised to a unit volume of water body, validated by human"
-->