This file describes the steps required to map the data to the Species Distribution extension.
Load libraries:
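The exact set of packages is not shown here; a minimal sketch of what a tidyverse-based mapping like this typically loads (the selection is an assumption):

```r
library(tidyverse)  # data transformation: dplyr, tidyr, stringr, readr
library(magrittr)   # extra pipe operators
library(janitor)    # cleaning column names
library(here)       # file paths relative to the project root
```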
Define data types:
Read data:
Merge source data:
Preview of the data:
Remove all taxonIDs that are not included in the taxon core:
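A minimal sketch of this filter, assuming the taxon core lives in a data frame called `taxon`:

```r
library(dplyr)

# Keep only the distribution records whose taxonID is present in the taxon core
input_distributions <- input_distributions %>%
  filter(taxonID %in% taxon$taxonID)
```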
Scan for duplicated taxa (see this issue):
Six fields in input_distributions are used to map locationID and locality:
Two of these fields contain verbatim location information (mapped to locality):

- region_country
- region_coast

Two of these fields contain a location code (mapped to locationID):

- code_region (for locations linked to countries)
- code_coast (for locations linked to coastal areas)

Two of these fields indicate the standard used/developed for these codes (also mapped to locationID):

- system_country: TDWG or DAISIE consortium (for code_region)
- system_coast: IHO23_4 or DIHO23_4 (for code_coast)
Note that coastal information is not always provided (NA). For these records, we will exclude information related to coastal regions:
| region_country | region_coast | locationID | locality |
|---|---|---|---|
| country | NA | system_country: code_region | region_country |
| country | coast | system_country: code_region | system_coast: code_coast |
Convert system_country to uppercase:
Map locationID:
Map locality:
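A sketch of both mappings, following the mapping table above; the data frame name `distribution` and the implementation details are assumptions:

```r
library(dplyr)

distribution <- distribution %>%
  mutate(
    # locationID: standard plus country-level code
    dwc_locationID = paste(system_country, code_region, sep = ": "),
    # locality: verbatim country, or the coded coastal area when provided
    dwc_locality = if_else(
      is.na(region_coast),
      region_country,
      paste(system_coast, code_coast, sep = ": ")
    )
  )
```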
Some information in the description extension (e.g. for impact on use or impact on ecology) is a property of a particular taxon in a particular region, identified by id_sp_region, rather than a property of a species as a whole. To emphasize that this descriptor is linked to a taxon in a particular region, we will add the locationID to the descriptor:
| idspecies | sp_in_region | description | description_with_locationID |
|---|---|---|---|
| 1 | 1.1 | impact_on_use_1.1 | impact_on_use_1.1 (locationID) |
| 1 | 1.2 | impact_on_use_1.2 | impact_on_use_1.2 (locationID) |
Here we save the link between id_sp_region and locationID in a separate data frame, to use later in the mapping of the description extension:
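A sketch of that lookup table, assuming the freshly mapped dwc_locationID is available in the same data frame; the file path in the comment is only an example:

```r
library(dplyr)

sp_in_region_with_location <- distribution %>%
  select(id_sp_region, locationID = dwc_locationID) %>%
  distinct()

# readr::write_csv(sp_in_region_with_location, "path/to/sp_in_region_with_location.csv")
```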
Save sp_in_region_with_location as .csv:

Information to map occurrenceStatus is contained in abundance and population_status.
In most cases, we can translate abundance to the GBIF controlled vocabulary for occurrenceStatus:
| abundance | occurrenceStatus |
|---|---|
| Absent or extinct | absent (or extinct, see below) |
| Abundant | common |
| Common | common |
| Local | present |
| Rare | rare |
| Single record | present |
| Sporadic | irregular |
| Unknown | doubtful |
However, for 30 taxa, population_status contains the value Extinct, which is valuable information for occurrenceStatus:
| population_status | abundance | records |
|---|---|---|
| Extinct | Absent or extinct | 9 |
| Extinct | Local | 6 |
| Extinct | Rare | 11 |
| Extinct | Single record | 4 |
| Extinct | Unknown | 8 |
We decided to set occurrenceStatus to extinct, irrespective of the content of abundance (see this issue).
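A sketch of the combined mapping, applying the translation table above together with the Extinct override (implementation details are assumptions):

```r
library(dplyr)

distribution <- distribution %>%
  mutate(dwc_occurrenceStatus = case_when(
    population_status == "Extinct"   ~ "extinct",   # override, irrespective of abundance
    abundance == "Absent or extinct" ~ "absent",
    abundance == "Abundant"          ~ "common",
    abundance == "Common"            ~ "common",
    abundance == "Local"             ~ "present",
    abundance == "Rare"              ~ "rare",
    abundance == "Single record"     ~ "present",
    abundance == "Sporadic"          ~ "irregular",
    abundance == "Unknown"           ~ "doubtful"
  ))
```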
Information for the mapping of establishmentMeans is contained in species_status:
For mapping species_status to establishmentMeans, see issue 15 on GitHub:
Inspect content of start_year, which contains the information for eventDate:
Besides a lot of NA values, we have many YYYY-formatted years (good to go) and a smaller group of other formats:
The remaining values will require some manual cleanup.
Get an overview of the number of records with just a YYYY year format:
## [1] 23753
We also have a number of negative years to take into account. Let's only consider those with 3 or 4 digits:
## [1] 178
Let's clean the start_year information step by step:
Set everything that is NA or should be NA to NA:
Set all negative values to NA:
When a value uses a < or > sign, possibly with a ? or \n added, just take the year:
When a full date is available, parse it to ISO 8601 date format:
When the text contains one or more years, extract the first year:
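A sketch of this extraction with stringr; str_extract() returns the first match, so a regex for four consecutive digits picks up the first year in the text (the exact regex used in the original chunk is an assumption):

```r
library(stringr)

# A few example values as they might appear in start_year
years <- c("1995", "<1987?", "1950, 1960", "ca. 2003")
str_extract(years, "[0-9]{4}")
#> [1] "1995" "1987" "1950" "2003"
```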
Replace some special values still present:
First, have a look at the remaining special values that are not an integer year (one or more digits [0-9]) or a formatted date:
Then replace those values:
Recheck the cleanup:
Show content:
Inspect content of end_year, which contains the information for eventDate:
Besides a lot of NA values, we have many YYYY-formatted years (good to go) and a smaller group of other formats:
The remaining values will require some manual cleanup.
Get an overview of the number of records with just a YYYY year format:
## [1] 18222
We also have a number of negative years to take into account. Let's only consider those with 3 or 4 digits:
## [1] 129
Let's clean the end_year information step by step:
Set everything that is NA or should be NA to NA:
Set all negative values to NA:
When a value uses a < or > sign, possibly with a ? or \n added, just take the year:
When a full date is available, parse it to ISO 8601 date format:
When the text contains one or more years, extract the first year:
Replace some special values still present:
First, have a look at the remaining special values that are not an integer year (one or more digits [0-9]) or a formatted date:
Then replace those values:
Recheck the cleanup:
Show content:
Inspect all combinations of start_year and end_year:
Inspect records where end_year falls before start_year:
Create eventDate:
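A sketch of the eventDate mapping, combining the cleaned years into an ISO 8601 interval (start/end); the handling of missing values is an assumption:

```r
library(dplyr)

# Assuming start_year and end_year are character after cleaning
distribution <- distribution %>%
  mutate(dwc_eventDate = case_when(
    is.na(start_year) & is.na(end_year) ~ NA_character_,
    is.na(end_year)                     ~ start_year,
    is.na(start_year)                   ~ end_year,
    TRUE                                ~ paste(start_year, end_year, sep = "/")
  ))
```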
In this field we will gather all the extra information related to a distribution. The syntax we will use is: field: value | field: value. Even if a column does not contain information, it will be included as field: NA which will make it easier for end users to separate the occurrenceRemarks field back into multiple columns. Note that input_pathways and input_impact have a sourceid column that will be ignored in this mapping (it is very seldom populated).
First, we clean the region_of_first_record information by removing trailing semicolons:
Map dwc_occurrenceRemarks:
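A sketch of the field: value | field: value syntax described above, for a few of the source columns; the exact selection of columns included in occurrenceRemarks is an assumption:

```r
library(dplyr)

distribution <- distribution %>%
  mutate(dwc_occurrenceRemarks = paste(
    paste("population_status:", population_status),
    paste("abundance:", abundance),
    paste("region_of_first_record:", region_of_first_record),
    sep = " | "
  ))
# Missing values automatically end up as "field: NA", as described above
```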
Some records in occurrenceRemarks contain a carriage return. We remove these here:
There’s no sourceid to link the sources in literature_references with the distribution extension. For this, we need the file distribution_sources, generated earlier.
Thus, we need to match distribution_sources:
| id_sp_region | field_name | source |
|---|---|---|
| 1 | property_A | source_1 |
| 1 | property_B | source_2 |
| 2 | property_A | source_3 |
| 2 | property_B | source_4 |
| 3 | property_A | source_5 |
| 3 | property_B | source_6 |
with distribution:
| id_sp_region | property_A | property_B |
|---|---|---|
| 1 | A1 | B1 |
| 2 | A2 | B2 |
| 3 | A3 | B2 |
To generate:

| id_sp_region | property_A | property_B | source |
|---|---|---|---|
| 1 | A1 | B1 | source_1 \| source_2 |
| 2 | A2 | B2 | source_3 \| source_4 |
| 3 | A3 | B2 | source_5 \| source_6 |
To link both datasets, we use id_sp_region.
Transform distribution_sources from a long to a wide dataset:
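A sketch of the reshaping with tidyr, assuming distribution_sources has the columns id_sp_region, field_name and source shown earlier:

```r
library(dplyr)
library(tidyr)

distribution_sources <- distribution_sources %>%
  pivot_wider(
    id_cols     = id_sp_region,
    names_from  = field_name,
    values_from = source,
    # collapse multiple sources for the same field into one cell
    values_fn   = function(x) paste(x, collapse = " | ")
  )
```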
Clean column names in distribution_sources:
Inspect column names:
## [1] "current_distrib" "current_distribution" "distribution"
## [4] "ecoimpact_id" "ecological_impact" "first_observation"
## [7] "general_references" "history_is_known" "impact_on_uses"
## [10] "introduction_dates" "introduction_history" "status"
## [13] "useimpact_id"
Some remarks:

- Some field names are linked to variables in distribution:
  - distribution
  - general_references
  - introduction_dates
  - introduction_history
- Some field names are linked to variables to be included in description:
  - current_distrib
  - current_distribution
  - ecoimpact_id
  - ecological_impact
  - first_observation
  - impact_on_uses
  - status
  - useimpact_id
Map source information for distribution using `|` as a separator:
Remove |NA:
Add source information to distribution using id_sp_region:
Rename source:
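Taken together, the four steps above could look roughly like this; the selection of distribution-related source columns follows the remarks above, the rest is an assumption:

```r
library(dplyr)
library(stringr)

distribution_sources <- distribution_sources %>%
  # Paste the distribution-related source fields together, separated by "|"
  mutate(source = paste(distribution, general_references,
                        introduction_dates, introduction_history,
                        sep = " | ")) %>%
  # Remove empty contributions ("| NA")
  mutate(source = str_remove_all(source, " \\| NA")) %>%
  select(id_sp_region, source)

# Add the source information to the distribution data frame and rename
distribution <- distribution %>%
  left_join(distribution_sources, by = "id_sp_region") %>%
  rename(dwc_source = source)
```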
Only keep the Darwin Core columns:
Drop the dwc_ prefix:
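A sketch of these two steps, assuming all Darwin Core terms were mapped into columns with a dwc_ prefix:

```r
library(dplyr)
library(stringr)

distribution <- distribution %>%
  select(starts_with("dwc_")) %>%            # keep only the Darwin Core columns
  rename_with(~ str_remove(.x, "^dwc_"))     # drop the dwc_ prefix
```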
Preview data: