This file describes the steps required to map the data to Species Distribution.
Load libraries:
Define data types
Read data
Merge source data:
Preview of the data:
Remove all `taxonID`s that are not included in the taxon core:
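A minimal sketch of this step, assuming the merged data lives in a data frame `input_distributions` and the taxon core in `taxon` (both names are assumptions):

```r
library(dplyr)

# Keep only records whose taxonID also occurs in the taxon core
input_distributions <- input_distributions %>%
  filter(taxonID %in% taxon$taxonID)
```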
Scan for duplicated `taxonID`s (see this issue):
Six fields in `input_distributions` are used to map `locationID` and `locality`:

Two of these fields refer to verbatim location information (mapped to `locality`):

- `region_country`
- `region_coast`

Two of these fields contain a location code (mapped to `locationID`):

- `code_region` (for locations linked to countries)
- `code_coast` (for locations linked to coastal areas)
The codes in `code_region` and `code_coast` refer to the standards used/developed for these codes (mapped to `locationID`), respectively:

- TDWG or DAISIE consortium (`system_country`)
- IHO23_4 or DIHO23_4 (`system_coast`)
Note that coastal information is not always provided (`NA`). For these records, we will exclude information related to coastal regions:

| region_country | region_coast | locationID | locality |
|---|---|---|---|
| country | NA | system_country: code_region | region_country |
| country | coast | system_country: code_region | system_coast: code_coast |
Convert `system_country` to uppercase:
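This step could look as follows (the data frame name is an assumption):

```r
library(dplyr)

# Uppercase the standard name, e.g. "tdwg" -> "TDWG"
input_distributions <- input_distributions %>%
  mutate(system_country = toupper(system_country))
```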
Map `locationID`:

Map `locality`:
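A sketch of these two mapping steps, following the table above literally (the data frame and target column names are assumptions):

```r
library(dplyr)

input_distributions <- input_distributions %>%
  mutate(
    # locationID is always built from the country-level code
    dwc_locationID = paste(system_country, code_region, sep = ": "),
    # locality depends on whether coastal information is provided
    dwc_locality = if_else(
      is.na(region_coast),
      region_country,
      paste(system_coast, code_coast, sep = ": ")
    )
  )
```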
Some information in the description extension (e.g. for impact on use or impact on ecology) is a property of a particular taxon in a particular region, identified by `id_sp_region`, rather than a property of a species as a whole. To emphasize that such a descriptor is linked to a taxon in a particular region, we will add the `locationID` to the descriptor:
| idspecies | sp_in_region | description | description_with_locationID |
|---|---|---|---|
| 1 | 1.1 | impact_on_use_1.1 | impact_on_use_1.1 (locationID) |
| 1 | 1.2 | impact_on_use_1.2 | impact_on_use_1.2 (locationID) |
We here save the link between `id_sp_region` and `locationID` in a separate data frame, `sp_in_region_with_location`, to use it later in the mapping of the description extension. Save it as `.csv`:

Information to map `occurrenceStatus` is contained in `abundance` and `population_status`.
In most cases, we can translate `abundance` to the GBIF controlled vocabulary for `occurrenceStatus`:
| abundance | occurrenceStatus |
|---|---|
| Absent or extinct | absent (or extinct, see below) |
| Abundant | common |
| Common | common |
| Local | present |
| Rare | rare |
| Single record | present |
| Sporadic | irregular |
| Unknown | doubtful |
However, for 30 taxa, `population_status` contains the value `Extinct`, which is valuable information for `occurrenceStatus`:
| population_status | abundance | records |
|---|---|---|
| Extinct | Absent or extinct | 9 |
| Extinct | Local | 6 |
| Extinct | Rare | 11 |
| Extinct | Single record | 4 |
| Extinct | Unknown | 8 |
We decided to set `occurrenceStatus` to `extinct` for these taxa, irrespective of the content of `abundance` (see this issue).
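Combining the translation table with the extinct override, the mapping could be sketched as (the data frame and column names are assumptions):

```r
library(dplyr)

input_distributions <- input_distributions %>%
  mutate(dwc_occurrenceStatus = case_when(
    population_status == "Extinct"             ~ "extinct",  # takes precedence
    abundance == "Absent or extinct"           ~ "absent",
    abundance %in% c("Abundant", "Common")     ~ "common",
    abundance %in% c("Local", "Single record") ~ "present",
    abundance == "Rare"                        ~ "rare",
    abundance == "Sporadic"                    ~ "irregular",
    abundance == "Unknown"                     ~ "doubtful"
  ))
```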
Information for the mapping of `establishmentMeans` is contained in `species_status`:
For the mapping of `species_status` to `establishmentMeans`, see issue 15 on GitHub:
Inspect the content of `start_year`, which contains the information for `eventDate`:
Besides a lot of `NA` values, we have many `YYYY`-formatted years (good to go) and a smaller group of other formats:

The remaining values will probably require some manual cleanup.
Get an overview of the number of records with just a `YYYY` year format:
## [1] 23753
We also have a number of negative years to take into account. Let's only consider those with 3 or 4 digits:
## [1] 178
Let's clean the start year information step by step:
Set everything that is `NA` or should be interpreted as `NA` to `NA`:
Set all negative values to `NA`:
When a `<` or `>` sign is used, possibly with a `?` or `\n` added, just take the year:
When a full date is available, parse it to ISO 8601 date format:
When the text contains a single year or multiple years, extract the first year:
Replace some special values still present:

First have a look at the remaining special values that are not an integer year (1 or more digits `[0-9]`) or a formatted date:

and replace those values:
Recheck the cleanup:
Show content:
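Taken together, the cleaning steps above could be sketched as follows (the regular expressions are assumptions, not the exact ones used):

```r
library(dplyr)
library(stringr)

input_distributions <- input_distributions %>%
  # Negative years: set to NA
  mutate(start_year = if_else(str_detect(start_year, "^-"),
                              NA_character_, start_year)) %>%
  # Drop <, >, ? and newlines, keeping the year itself
  mutate(start_year = str_remove_all(start_year, "[<>?]|\\n")) %>%
  # Extract a full ISO date if present, otherwise the first 4-digit year
  mutate(start_year = str_extract(start_year, "\\d{4}(-\\d{2}-\\d{2})?"))
```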
Inspect the content of `end_year`, which contains the information for `eventDate`:
Besides a lot of `NA` values, we have many `YYYY`-formatted years (good to go) and a smaller group of other formats:

The remaining values will probably require some manual cleanup.
Get an overview of the number of records with just a `YYYY` year format:
## [1] 18222
We also have a number of negative years to take into account. Let's only consider those with 3 or 4 digits:
## [1] 129
Let's clean the end year information step by step:
Set everything that is `NA` or should be interpreted as `NA` to `NA`:
Set all negative values to `NA`:
When a `<` or `>` sign is used, possibly with a `?` or `\n` added, just take the year:
When a full date is available, parse it to ISO 8601 date format:
When the text contains a single year or multiple years, extract the first year:
Replace some special values still present:

First have a look at the remaining special values that are not an integer year (1 or more digits `[0-9]`) or a formatted date:

and replace those values:
Recheck the cleanup:
Show content:
Inspect all combinations of `start_year` and `end_year`:
Inspect which `end_year` falls before `start_year`:
Create `eventDate`:
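A sketch of the `eventDate` creation, expressing a period as an ISO 8601 interval (the exact handling of missing years is an assumption):

```r
library(dplyr)

input_distributions <- input_distributions %>%
  mutate(dwc_eventDate = case_when(
    is.na(start_year) & is.na(end_year) ~ NA_character_,
    is.na(end_year)                     ~ start_year,  # single year only
    TRUE                                ~ paste(start_year, end_year, sep = "/")
  ))
```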
In this field we will gather all the extra information related to a distribution. The syntax we will use is `field: value | field: value`. Even if a column does not contain information, it will be included as `field: NA`, which will make it easier for end users to separate the `occurrenceRemarks` field back into multiple columns. Note that `input_pathways` and `input_impact` have a `sourceid` column that will be ignored in this mapping (it is very seldom populated).
First, we clean the `region_of_first_record` information by removing the trailing `;`:
Map `dwc_occurrenceRemarks`:
Some records in `occurrenceRemarks` contain a carriage return. We remove these here:
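A sketch of this mapping for two of the columns (which columns are included is an assumption):

```r
library(dplyr)
library(stringr)

input_distributions <- input_distributions %>%
  # Assemble "field: value | field: value" pairs, keeping NA values
  mutate(dwc_occurrenceRemarks = paste0(
    "population_status: ", population_status, " | ",
    "region_of_first_record: ", region_of_first_record
  )) %>%
  # Carriage returns would break the resulting text file
  mutate(dwc_occurrenceRemarks = str_remove_all(dwc_occurrenceRemarks, "[\r\n]"))
```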
There's no `sourceid` to link the sources in `literature_references` with the distribution extension. For this, we need the file `distribution_sources`, generated earlier.

Thus, we need to match `distribution_sources`:
| id_sp_region | field_name | source |
|---|---|---|
| 1 | property_A | source_1 |
| 1 | property_B | source_2 |
| 2 | property_A | source_3 |
| 2 | property_B | source_4 |
| 3 | property_A | source_5 |
| 3 | property_B | source_6 |
with `distribution`:
| id_sp_region | property_A | property_B |
|---|---|---|
| 1 | A1 | B1 |
| 2 | A2 | B2 |
| 3 | A3 | B2 |
To generate:

| id_sp_region | property_A | property_B | source |
|---|---|---|---|
| 1 | A1 | B1 | source_1 \| source_2 |
| 2 | A2 | B2 | source_3 \| source_4 |
| 3 | A3 | B2 | source_5 \| source_6 |
To link both datasets, we use `id_sp_region`.
Transform `distribution_sources` from a long to a wide dataset:
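With tidyr, this transformation could look like:

```r
library(tidyr)

# One column per field_name, filled with the corresponding source
distribution_sources <- distribution_sources %>%
  pivot_wider(
    id_cols = id_sp_region,
    names_from = field_name,
    values_from = source
  )
```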
Clean the column names in `distribution_sources`:
Inspect column names:
## [1] "current_distrib" "current_distribution" "distribution"
## [4] "ecoimpact_id" "ecological_impact" "first_observation"
## [7] "general_references" "history_is_known" "impact_on_uses"
## [10] "introduction_dates" "introduction_history" "status"
## [13] "useimpact_id"
Some remarks:

- Some field names are linked to variables in `distribution`:
  - `distribution`
  - `general_references`
  - `introduction_dates`
  - `introduction_history`
- Some field names are linked to variables to be included in `description`:
  - `current_distrib`
  - `current_distribution`
  - `ecoimpact_id`
  - `ecological_impact`
  - `first_observation`
  - `impact_on_uses`
  - `status`
  - `useimpact_id`
Map the source information for `distribution`, using `|` as a separator.

Remove `|NA`:
Add the source information to `distribution`, using `id_sp_region`:
Rename `source`:
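The last few steps could be sketched as follows (the four source columns and the final term name are assumptions):

```r
library(dplyr)
library(stringr)
library(tidyr)

distribution_sources <- distribution_sources %>%
  # Concatenate per-field sources with "|" as a separator
  unite("source", distribution, general_references,
        introduction_dates, introduction_history, sep = " | ") %>%
  # Remove empty entries ("| NA" or "NA |")
  mutate(source = str_remove_all(source, "NA \\| |\\| NA"))

distribution <- distribution %>%
  # Attach the source information via id_sp_region
  left_join(select(distribution_sources, id_sp_region, source),
            by = "id_sp_region") %>%
  rename(dwc_source = source)  # target term name is an assumption
```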
Only keep the Darwin Core columns:

Drop the `dwc_` prefix:
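These final steps could be sketched as:

```r
library(dplyr)
library(stringr)

distribution <- distribution %>%
  select(starts_with("dwc_")) %>%        # keep Darwin Core columns only
  rename_with(~ str_remove(., "^dwc_"))  # drop the dwc_ prefix
```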
Preview data