This document describes the pre-processing of occurrence data of alien species. The outputs of this pipeline will be used as input for modelling.
Load libraries:
library(tidyverse) # To do data science
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
library(tidylog) # To provide feedback on dplyr functions
library(here) # To find files
library(lubridate) # To work with dates
library(rgbif) # To get taxa from publisehd unified checklist
We download occurrence cubes from Zenodo deposit: https://zenodo.org/records/10527772.
We read occurrence data related to alien species in Belgium:
df <- read_csv(
file = "https://zenodo.org/records/10527772/files/be_alientaxa_cube.csv?download=1",
col_types = cols(
year = col_double(),
eea_cell_code = col_character(),
taxonKey = col_double(),
n = col_double(),
min_coord_uncertainty = col_double()
),
na = ""
)
We remove columns we will never use:
df <-
df %>%
tidylog::select(-min_coord_uncertainty)
and rename column n
to obs
:
df <-
df %>%
tidylog::rename(obs = n)
We also remove data not assigned to any cell code:
df <-
df %>%
tidylog::filter(!is.na(eea_cell_code))
We read occurrence related data at class level used for correcting the research bias effort:
df_bl <- read_csv(
file = "https://zenodo.org/records/10527772/files/be_classes_cube.csv?download=1",
col_types = cols(
year = col_double(),
eea_cell_code = col_character(),
classKey = col_double(),
n = col_double(),
min_coord_uncertainty = col_double()
),
na = ""
)
We remove columns we will never use:
df_bl <-
df_bl %>%
tidylog::select(-min_coord_uncertainty)
and rename column n
to cobs
:
df_bl <-
df_bl %>%
tidylog::rename(cobs = n)
We remove data not assigned to any cell code:
df_bl <-
df_bl %>%
tidylog::filter(!is.na(eea_cell_code))
We remove also data with classKey
equal to
NA
. These data are related to IAS species not linked to any
class:
df_bl <-
df_bl %>%
tidylog::filter(!is.na(classKey))
We read input data based on Global
Register of Introduced and Invasive Species - Belgium, typically
referred to as the unified checklist. These data contain
taxonomic information as well as distribution, description and species
profiles. We limit to taxa introduced in Belgium
(locationId
: ISO_3166:BE
).
taxa_df <-
read_tsv(here::here("data", "interim", "data_input_checklist_indicators.tsv"),
na = "",
guess_max = 5000
)
taxa_df <-
taxa_df %>%
tidylog::filter(locationId == "ISO_3166:BE")
The occurrence cubes were created based on GBIF occurrences filtered
using country = BE
. The cell codes were then created by
manipulating the coordinates as strings, i.e. no spatial filter with the
EEA grid was applied. So, it can happen that (few) cells in the cube are
not linked with occurrences in Belgian territory as
country = BE
is in rare cases not enough. We remove them by
using the geopackage containing the regional information of the EEA grid
cells as shown in the following step. The regional information will be
added at a later stage.
Read the EEA grid with regional information from
utm1_bel_with_regions.gpkg
in data/output
:
gpkg_path <- here::here("data", "output", "utm1_bel_with_regions.gpkg")
eea_grid_regions <- sf::st_read(gpkg_path)
## Reading layer `utm1_bel_with_regions' from data source
## `C:\Users\damiano_oldoni\Documents\GitHub\indicators\data\output\utm1_bel_with_regions.gpkg'
## using driver `GPKG'
## Simple feature collection with 51726 features and 7 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 3768000 ymin: 2926000 xmax: 4080000 ymax: 3237000
## Projected CRS: ETRS89 / ETRS-LAEA
Preview:
eea_grid_regions %>% head()
Notice that the column isBelgium
is the union of the
columns isFlanders
, isWallonia
and
isBrussels
. The 15km buffer zone and the maritime federal
zones are not covered. At the moment we will not use the column
isBelgium
as occurrences on the maritime federal zones or
in proximity of the coast would be otherwise excluded. Still, notice
that these areas cover almost 40% of the EEA grid:
eea_grid_regions %>%
# Transform to data.frame
as.data.frame() %>%
# Count the number of cells in Belgium
count(isBelgium) %>%
# Add percentage as a column
mutate(percentage = n / sum(n) * 100)
We identify the rows to remove in alien species occurrence cube,
df
, and baseline, df_bl
, by doing an
anti_join
:
df %>%
tidylog::filter(!eea_cell_code %in% eea_grid_regions$CELLCODE)
df_bl %>%
tidylog::filter(!eea_cell_code %in% eea_grid_regions$CELLCODE)
We remove them:
df <- df %>%
tidylog::filter(eea_cell_code %in% eea_grid_regions$CELLCODE)
df_bl <- df_bl %>%
tidylog::inner_join(eea_grid_regions %>% as.data.frame(),
by = c("eea_cell_code" = "CELLCODE")
)
There is a group of taxa defined in the unified checklist which are not present in the occurrence data, i.e. the presence of these taxa, although confirmed by authoritative’s checklists, is still not sustained by observations:
alien_taxa_without_occs <-
taxa_df %>%
tidylog::filter(!nubKey %in% df$taxonKey) %>%
tidylog::distinct(
key,
nubKey,
canonicalName,
first_observed,
last_observed,
class,
kingdom
) %>%
dplyr::arrange(nubKey)
alien_taxa_without_occs
where key
is the unique GBIF identifier of the taxon as
published in the unified checklist, while nubKey
is the
unique GBIF identifier of the taxon in the GBIF Backbone Taxonomy. We
save this data.frame as alien_taxa_without_occurrences.tsv
in data/output
:
write_tsv(
alien_taxa_without_occs,
here::here("data", "output", "alien_taxa_without_occs.tsv"),
na = ""
)
We get information about the intersection of grid cells with Belgian
protected areas by reading file
intersect_EEA_ref_grid_protected_areas.tsv
in
data/output
:
df_prot_areas <- readr::read_tsv(
here::here(
"data",
"interim",
"intersect_EEA_ref_grid_protected_areas.tsv"
),
na = ""
)
Preview:
df_prot_areas %>%
head()
We are interested in areas included in Natura2000. We remove columns we are not interested in:
df_prot_areas <-
df_prot_areas %>%
tidylog::select(
CELLCODE,
natura2000
)
For better interpretation of the data, we retrieve scientific names (without authorship) of the species whose data we are going to analyze. We also retrieve the class of each alien species to link occurrence data with the related baseline data. This step requires an internet connection and can take long time as we are querying the GBIF API:
taxon_key <-
df %>%
tidylog::distinct(taxonKey) %>%
pull()
spec_names <- purrr::map_df(
taxon_key,
function(k) {
name_usage(key = k)$data
}
) %>%
tidylog::select(
taxonKey = key,
canonicalName,
scientificName,
kingdomKey, classKey
) %>%
tidylog::mutate(canonicalName = ifelse(
is.na(canonicalName), scientificName, canonicalName
))
In the rare case multiple taxa share the same scientific name, authorship is added:
spec_names <-
spec_names %>%
tidylog::group_by(canonicalName) %>%
tidylog::add_tally() %>%
tidylog::ungroup() %>%
tidylog::mutate(canonicalName = if_else(n > 1,
scientificName,
canonicalName
)) %>%
tidylog::select(-c(n, scientificName))
We also retrieve kingdom and class:
class_key <-
spec_names %>%
tidylog::distinct(classKey) %>%
tidylog::filter(!is.na(classKey)) %>%
pull()
kingdom_class <- map_df(
class_key,
function(x) {
name_usage(key = x)$data
}
) %>%
tidylog::select(classKey, class, kingdomKey, kingdom)
and we add them:
# add class
spec_names <-
spec_names %>%
tidylog::left_join(
kingdom_class %>%
tidylog::distinct(.data$classKey, .data$class),
by = "classKey"
)
# add kingdom
spec_names <-
spec_names %>%
tidylog::left_join(
kingdom_class %>%
tidylog::distinct(.data$kingdomKey, .data$kingdom),
by = "kingdomKey"
)
Preview:
spec_names %>% head(10)
Taxa not assigned to any class:
spec_names %>%
tidylog::filter(is.na(classKey))
Extract coordinates from column eea_cell_code
. For
example, the coordinates of cellcode 1kmE3895N3116
are
x = 3895
and y = 3116
:
df_xy <-
df %>%
tidylog::distinct(eea_cell_code) %>%
dplyr::bind_cols(
dplyr::tibble(
x = unlist(stringr::str_extract_all(unique(df$eea_cell_code),
pattern = "(?<=E)[0-9\\-]+"
)),
y = unlist(stringr::str_extract_all(unique(df$eea_cell_code),
pattern = "(?<=N)[0-9\\-]+"
))
) %>%
tidylog::mutate(across(where(is.character), as.integer))
)
Preview:
df_xy %>% head()
Do the same for baseline data:
df_bl_xy <-
df_bl %>%
tidylog::distinct(eea_cell_code) %>%
bind_cols(
tibble(
x = unlist(str_extract_all(unique(df_bl$eea_cell_code),
pattern = "(?<=E)[0-9\\-]+"
)),
y = unlist(str_extract_all(unique(df_bl$eea_cell_code),
pattern = "(?<=N)[0-9\\-]+"
))
) %>%
mutate_all(as.integer)
)
Preview:
df_bl_xy %>% head()
Select alien species introduced after 1950
(year_of_introduction
: 1950) or whose date of introduction
is not known:
year_of_introduction <- 1950
recent_alien_species <-
taxa_df %>%
# Lower limit on date of first introduction if present
tidylog::filter(first_observed >= year_of_introduction | is.na(first_observed)) %>%
# Remove duplicates due to other columns as pathways
tidylog::distinct(nubKey, first_observed, last_observed) %>%
# Get classKey info and filter out taxa without data
tidylog::inner_join(spec_names, by = c("nubKey" = "taxonKey")) %>%
tidylog::distinct(
taxonKey = nubKey,
.data$canonicalName,
.data$first_observed,
.data$last_observed,
.data$class,
.data$kingdom,
.data$classKey,
.data$kingdomKey
)
Number of taxa recently introduced or whose introduction year is not specified:
nrow(recent_alien_species)
## [1] 1898
List of alien species excluded as first introduction date is earlier than 1950:
old_introductions_taxa <-
taxa_df %>%
tidylog::filter(first_observed < year_of_introduction) %>%
tidylog::anti_join(
recent_alien_species %>%
tidylog::select(taxonKey),
by = c("nubKey" = "taxonKey")
) %>%
tidylog::distinct(
nubKey,
first_observed,
last_observed
) %>%
tidylog::inner_join(spec_names, by = c("nubKey" = "taxonKey")) %>%
tidylog::select(
taxonKey = nubKey,
canonicalName,
first_observed,
last_observed,
class,
kingdom,
everything()
)
old_introductions_taxa
We save them in file
taxa_introduced_in_BE_before_1950.tsv
in
data/output
:
write_tsv(old_introductions_taxa,
here::here(
"data",
"output",
paste0(
"taxa_introduced_in_BE_before_",
year_of_introduction,
".tsv"
)
),
na = ""
)
We also limit the time series analysis on observations from 1950, start of state of invasion researches:
first_year <- 1950
Apply temporal restraint to both occurrence data:
df <-
df %>%
# Recently introduced taxa
tidylog::filter(taxonKey %in% recent_alien_species$taxonKey) %>%
tidylog::left_join(df_xy, by = "eea_cell_code") %>%
# Recent history
tidylog::filter(year > first_year)
and baseline data:
df_bl <-
df_bl %>%
tidylog::left_join(df_bl_xy, by = "eea_cell_code") %>%
# Recent history
tidylog::filter(year > first_year)
We add whether the grid cell intersects any of the Natura2000 Belgian protected areas.
df <-
df %>%
tidylog::left_join(df_prot_areas,
by = c("eea_cell_code" = "CELLCODE")
)
Preview:
df %>% head()
We do the same for baseline data:
df_bl <-
df_bl %>%
tidylog::left_join(df_prot_areas,
by = c("eea_cell_code" = "CELLCODE")
)
Preview:
df_bl %>% head()
For each species, define cells with at least one observation:
df_cc <-
df %>%
tidylog::group_by(taxonKey) %>%
tidylog::distinct(eea_cell_code) %>%
tidylog::ungroup()
For each species, identify the first year with at least one observation:
df_begin_year <-
df %>%
tidylog::group_by(taxonKey) %>%
tidylog::summarize(begin_year = min(year))
For each species, combine begin_year
and unique
eea_cell_code
as found above:
df_cc <-
df_cc %>%
tidylog::left_join(df_begin_year, by = "taxonKey") %>%
tidylog::select(taxonKey, begin_year, eea_cell_code)
Preview:
df_cc %>% head()
For each cell (eea_cell_code
) and species
(taxonKey
) we can now create a time series:
make_time_series <- function(eea_cell_code, taxonKey, begin_year, last_year) {
expand_grid(
eea_cell_code = eea_cell_code,
taxonKey = taxonKey,
year = seq(from = begin_year, to = last_year)
)
}
# create timeseries slots
df_ts <- purrr::pmap_dfr(df_cc,
.f = make_time_series,
last_year = year(Sys.Date())
)
## Add data
# Add occurrence data and protected areas
df_ts <-
df_ts %>%
tidylog::left_join(
df %>% tidylog::select(
taxonKey,
year,
eea_cell_code,
obs
),
by = c("taxonKey", "year", "eea_cell_code")
)
# Add membership to protected areas
df_ts <-
df_ts %>%
tidylog::left_join(
df_prot_areas %>%
tidylog::select(CELLCODE, natura2000),
by = c("eea_cell_code" = "CELLCODE")
)
# Add regional information
df_ts <-
df_ts %>%
tidylog::left_join(
eea_grid_regions %>%
as.data.frame() %>%
tidylog::select(
CELLCODE,
isFlanders,
isWallonia,
isBrussels
),
by = c("eea_cell_code" = "CELLCODE")
)
# Add classKey
df_ts <-
df_ts %>%
tidylog::left_join(spec_names %>% tidylog::select(taxonKey, classKey),
by = "taxonKey"
)
To correct the effect of research effort of an alien speies, we calculate the number of observations at class level excluding the observations of the alien species itself:
# Add baseline data (at class level) diminished by obs of specific alien taxon
df_ts <-
df_ts %>%
tidylog::left_join(
df_bl %>%
tidylog::select(year, eea_cell_code, classKey, cobs),
by = c("year", "eea_cell_code", "classKey")
) %>%
tidylog::mutate(cobs = cobs - obs)
# replace NAs with 0
df_ts <-
df_ts %>%
tidylog::replace_na(list(cobs = 0, obs = 0))
Preview:
df_ts %>% head(n = 30)
Add column for presence (1) or absence (0)
df_ts <-
df_ts %>%
tidylog::mutate(
pa_cobs = if_else(cobs > 0, 1, 0),
pa_obs = if_else(obs > 0, 1, 0)
)
Arrange order columns:
df_ts <-
df_ts %>%
tidylog::relocate(
taxonKey,
year,
eea_cell_code,
obs,
pa_obs,
cobs,
pa_cobs,
classKey,
isFlanders,
isWallonia,
isBrussels,
natura2000
)
Save df_ts
as it will be used as start point of
modelling pipelines:
write_tsv(df_ts,
file = here::here("data", "interim", "df_timeseries.tsv"),
na = ""
)
We save also the taxonomic information:
write_tsv(spec_names,
file = here::here(
"data",
"interim",
"timeseries_taxonomic_info.tsv"
),
na = ""
)