This document describes the pre-processing of occurrence data of alien species. The outputs of this pipeline will be used as input for modelling.

1 Setup

Load libraries:

library(tidyverse) # To do data science
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
library(tidylog) # To provide feedback on dplyr functions
library(here) # To find files
library(lubridate) # To work with dates
library(rgbif) # To get taxa from publisehd unified checklist

2 Get data

2.1 Occurrence data

2.1.1 Download

We download occurrence cubes from Zenodo deposit: https://zenodo.org/records/10527772.

2.1.2 Alien species

We read occurrence data related to alien species in Belgium:

df <- read_csv(
  file = "https://zenodo.org/records/10527772/files/be_alientaxa_cube.csv?download=1",
  col_types = cols(
    year = col_double(),
    eea_cell_code = col_character(),
    taxonKey = col_double(),
    n = col_double(),
    min_coord_uncertainty = col_double()
  ),
  na = ""
)

We remove columns we will never use:

df <-
  df %>%
  tidylog::select(-min_coord_uncertainty)

and rename column n to obs:

df <-
  df %>%
  tidylog::rename(obs = n)

We also remove data not assigned to any cell code:

df <-
  df %>%
  tidylog::filter(!is.na(eea_cell_code))

2.1.3 Baseline

We read occurrence related data at class level used for correcting the research bias effort:

df_bl <- read_csv(
  file = "https://zenodo.org/records/10527772/files/be_classes_cube.csv?download=1",
  col_types = cols(
    year = col_double(),
    eea_cell_code = col_character(),
    classKey = col_double(),
    n = col_double(),
    min_coord_uncertainty = col_double()
  ),
  na = ""
)

We remove columns we will never use:

df_bl <-
  df_bl %>%
  tidylog::select(-min_coord_uncertainty)

and rename column n to cobs:

df_bl <-
  df_bl %>%
  tidylog::rename(cobs = n)

We remove data not assigned to any cell code:

df_bl <-
  df_bl %>%
  tidylog::filter(!is.na(eea_cell_code))

We remove also data with classKey equal to NA. These data are related to IAS species not linked to any class:

df_bl <-
  df_bl %>%
  tidylog::filter(!is.na(classKey))

2.2 Checklist data

We read input data based on Global Register of Introduced and Invasive Species - Belgium, typically referred to as the unified checklist. These data contain taxonomic information as well as distribution, description and species profiles. We limit to taxa introduced in Belgium (locationId : ISO_3166:BE).

taxa_df <-
  read_tsv(here::here("data", "interim", "data_input_checklist_indicators.tsv"),
    na = "",
    guess_max = 5000
  )
taxa_df <-
  taxa_df %>%
  tidylog::filter(locationId == "ISO_3166:BE")

2.3 Remove some taxa

2.3.1 Cells in occurrence cubes not in EEA Belgian grid

The occurrence cubes were created based on GBIF occurrences filtered using country = BE. The cell codes were then created by manipulating the coordinates as strings, i.e. no spatial filter with the EEA grid was applied. So, it can happen that (few) cells in the cube are not linked with occurrences in Belgian territory as country = BE is in rare cases not enough. We remove them by using the geopackage containing the regional information of the EEA grid cells as shown in the following step. The regional information will be added at a later stage.

Read the EEA grid with regional information from utm1_bel_with_regions.gpkg in data/output:

gpkg_path <- here::here("data", "output", "utm1_bel_with_regions.gpkg")
eea_grid_regions <- sf::st_read(gpkg_path)
## Reading layer `utm1_bel_with_regions' from data source 
##   `C:\Users\damiano_oldoni\Documents\GitHub\indicators\data\output\utm1_bel_with_regions.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 51726 features and 7 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 3768000 ymin: 2926000 xmax: 4080000 ymax: 3237000
## Projected CRS: ETRS89 / ETRS-LAEA

Preview:

eea_grid_regions %>% head()

Notice that the column isBelgium is the union of the columns isFlanders, isWallonia and isBrussels. The 15km buffer zone and the maritime federal zones are not covered. At the moment we will not use the column isBelgium as occurrences on the maritime federal zones or in proximity of the coast would be otherwise excluded. Still, notice that these areas cover almost 40% of the EEA grid:

eea_grid_regions %>%
  # Transform to data.frame
  as.data.frame() %>%
  # Count the number of cells in Belgium
  count(isBelgium) %>%
  # Add percentage as a column
  mutate(percentage = n / sum(n) * 100)

We identify the rows to remove in alien species occurrence cube, df, and baseline, df_bl, by doing an anti_join:

df %>%
  tidylog::filter(!eea_cell_code %in% eea_grid_regions$CELLCODE)
df_bl %>%
  tidylog::filter(!eea_cell_code %in% eea_grid_regions$CELLCODE)

We remove them:

df <- df %>%
  tidylog::filter(eea_cell_code %in% eea_grid_regions$CELLCODE)
df_bl <- df_bl %>%
  tidylog::inner_join(eea_grid_regions %>% as.data.frame(),
    by = c("eea_cell_code" = "CELLCODE")
  )

2.3.2 Taxa in unified checklist without occurrences

There is a group of taxa defined in the unified checklist which are not present in the occurrence data, i.e. the presence of these taxa, although confirmed by authoritative’s checklists, is still not sustained by observations:

alien_taxa_without_occs <-
  taxa_df %>%
  tidylog::filter(!nubKey %in% df$taxonKey) %>%
  tidylog::distinct(
    key,
    nubKey,
    canonicalName,
    first_observed,
    last_observed,
    class,
    kingdom
  ) %>%
  dplyr::arrange(nubKey)
alien_taxa_without_occs

where key is the unique GBIF identifier of the taxon as published in the unified checklist, while nubKey is the unique GBIF identifier of the taxon in the GBIF Backbone Taxonomy. We save this data.frame as alien_taxa_without_occurrences.tsv in data/output:

write_tsv(
  alien_taxa_without_occs,
  here::here("data", "output", "alien_taxa_without_occs.tsv"),
  na = ""
)

2.4 Grid cells intersecting protected areas

We get information about the intersection of grid cells with Belgian protected areas by reading file intersect_EEA_ref_grid_protected_areas.tsv in data/output:

df_prot_areas <- readr::read_tsv(
  here::here(
    "data",
    "interim",
    "intersect_EEA_ref_grid_protected_areas.tsv"
  ),
  na = ""
)

Preview:

df_prot_areas %>%
  head()

We are interested in areas included in Natura2000. We remove columns we are not interested in:

df_prot_areas <-
  df_prot_areas %>%
  tidylog::select(
    CELLCODE,
    natura2000
  )

3 Pre-processing

3.1 Add informative columns

For better interpretation of the data, we retrieve scientific names (without authorship) of the species whose data we are going to analyze. We also retrieve the class of each alien species to link occurrence data with the related baseline data. This step requires an internet connection and can take long time as we are querying the GBIF API:

taxon_key <-
  df %>%
  tidylog::distinct(taxonKey) %>%
  pull()
spec_names <- purrr::map_df(
  taxon_key,
  function(k) {
    name_usage(key = k)$data
  }
) %>%
  tidylog::select(
    taxonKey = key,
    canonicalName,
    scientificName,
    kingdomKey, classKey
  ) %>%
  tidylog::mutate(canonicalName = ifelse(
    is.na(canonicalName), scientificName, canonicalName
  ))

In the rare case multiple taxa share the same scientific name, authorship is added:

spec_names <-
  spec_names %>%
  tidylog::group_by(canonicalName) %>%
  tidylog::add_tally() %>%
  tidylog::ungroup() %>%
  tidylog::mutate(canonicalName = if_else(n > 1,
    scientificName,
    canonicalName
  )) %>%
  tidylog::select(-c(n, scientificName))

We also retrieve kingdom and class:

class_key <-
  spec_names %>%
  tidylog::distinct(classKey) %>%
  tidylog::filter(!is.na(classKey)) %>%
  pull()
kingdom_class <- map_df(
  class_key,
  function(x) {
    name_usage(key = x)$data
  }
) %>%
  tidylog::select(classKey, class, kingdomKey, kingdom)

and we add them:

# add class
spec_names <-
  spec_names %>%
  tidylog::left_join(
    kingdom_class %>%
      tidylog::distinct(.data$classKey, .data$class),
    by = "classKey"
  )

# add kingdom
spec_names <-
  spec_names %>%
  tidylog::left_join(
    kingdom_class %>%
      tidylog::distinct(.data$kingdomKey, .data$kingdom),
    by = "kingdomKey"
  )

Preview:

spec_names %>% head(10)

Taxa not assigned to any class:

spec_names %>%
  tidylog::filter(is.na(classKey))

3.2 Extract x-y coordinates from cellcodes

Extract coordinates from column eea_cell_code. For example, the coordinates of cellcode 1kmE3895N3116 are x = 3895 and y = 3116:

df_xy <-
  df %>%
  tidylog::distinct(eea_cell_code) %>%
  dplyr::bind_cols(
    dplyr::tibble(
      x = unlist(stringr::str_extract_all(unique(df$eea_cell_code),
        pattern = "(?<=E)[0-9\\-]+"
      )),
      y = unlist(stringr::str_extract_all(unique(df$eea_cell_code),
        pattern = "(?<=N)[0-9\\-]+"
      ))
    ) %>%
      tidylog::mutate(across(where(is.character), as.integer))
  )

Preview:

df_xy %>% head()

Do the same for baseline data:

df_bl_xy <-
  df_bl %>%
  tidylog::distinct(eea_cell_code) %>%
  bind_cols(
    tibble(
      x = unlist(str_extract_all(unique(df_bl$eea_cell_code),
        pattern = "(?<=E)[0-9\\-]+"
      )),
      y = unlist(str_extract_all(unique(df_bl$eea_cell_code),
        pattern = "(?<=N)[0-9\\-]+"
      ))
    ) %>%
      mutate_all(as.integer)
  )

Preview:

df_bl_xy %>% head()

3.3 Apply temporal and spatial restraints

Select alien species introduced after 1950 (year_of_introduction: 1950) or whose date of introduction is not known:

year_of_introduction <- 1950

recent_alien_species <-
  taxa_df %>%
  # Lower limit on date of first introduction if present
  tidylog::filter(first_observed >= year_of_introduction | is.na(first_observed)) %>%
  # Remove duplicates due to other columns as pathways
  tidylog::distinct(nubKey, first_observed, last_observed) %>%
  # Get classKey info and filter out taxa without data
  tidylog::inner_join(spec_names, by = c("nubKey" = "taxonKey")) %>%
  tidylog::distinct(
    taxonKey = nubKey,
    .data$canonicalName,
    .data$first_observed,
    .data$last_observed,
    .data$class,
    .data$kingdom,
    .data$classKey,
    .data$kingdomKey
  )

Number of taxa recently introduced or whose introduction year is not specified:

nrow(recent_alien_species)
## [1] 1898

List of alien species excluded as first introduction date is earlier than 1950:

old_introductions_taxa <-
  taxa_df %>%
  tidylog::filter(first_observed < year_of_introduction) %>%
  tidylog::anti_join(
    recent_alien_species %>%
      tidylog::select(taxonKey),
    by = c("nubKey" = "taxonKey")
  ) %>%
  tidylog::distinct(
    nubKey,
    first_observed,
    last_observed
  ) %>%
  tidylog::inner_join(spec_names, by = c("nubKey" = "taxonKey")) %>%
  tidylog::select(
    taxonKey = nubKey,
    canonicalName,
    first_observed,
    last_observed,
    class,
    kingdom,
    everything()
  )

old_introductions_taxa

We save them in file taxa_introduced_in_BE_before_1950.tsv in data/output:

write_tsv(old_introductions_taxa,
  here::here(
    "data",
    "output",
    paste0(
      "taxa_introduced_in_BE_before_",
      year_of_introduction,
      ".tsv"
    )
  ),
  na = ""
)

We also limit the time series analysis on observations from 1950, start of state of invasion researches:

first_year <- 1950

Apply temporal restraint to both occurrence data:

df <-
  df %>%
  # Recently introduced taxa
  tidylog::filter(taxonKey %in% recent_alien_species$taxonKey) %>%
  tidylog::left_join(df_xy, by = "eea_cell_code") %>%
  # Recent history
  tidylog::filter(year > first_year)

and baseline data:

df_bl <-
  df_bl %>%
  tidylog::left_join(df_bl_xy, by = "eea_cell_code") %>%
  # Recent history
  tidylog::filter(year > first_year)

3.4 Add information about presence in protected areas

We add whether the grid cell intersects any of the Natura2000 Belgian protected areas.

df <-
  df %>%
  tidylog::left_join(df_prot_areas,
    by = c("eea_cell_code" = "CELLCODE")
  )

Preview:

df %>% head()

We do the same for baseline data:

df_bl <-
  df_bl %>%
  tidylog::left_join(df_prot_areas,
    by = c("eea_cell_code" = "CELLCODE")
  )

Preview:

df_bl %>% head()

3.5 Create time series

For each species, define cells with at least one observation:

df_cc <-
  df %>%
  tidylog::group_by(taxonKey) %>%
  tidylog::distinct(eea_cell_code) %>%
  tidylog::ungroup()

For each species, identify the first year with at least one observation:

df_begin_year <-
  df %>%
  tidylog::group_by(taxonKey) %>%
  tidylog::summarize(begin_year = min(year))

For each species, combine begin_year and unique eea_cell_code as found above:

df_cc <-
  df_cc %>%
  tidylog::left_join(df_begin_year, by = "taxonKey") %>%
  tidylog::select(taxonKey, begin_year, eea_cell_code)

Preview:

df_cc %>% head()

For each cell (eea_cell_code) and species (taxonKey) we can now create a time series:

make_time_series <- function(eea_cell_code, taxonKey, begin_year, last_year) {
  expand_grid(
    eea_cell_code = eea_cell_code,
    taxonKey = taxonKey,
    year = seq(from = begin_year, to = last_year)
  )
}

# create timeseries slots
df_ts <- purrr::pmap_dfr(df_cc,
  .f = make_time_series,
  last_year = year(Sys.Date())
)

## Add data

# Add occurrence data and protected areas
df_ts <-
  df_ts %>%
  tidylog::left_join(
    df %>% tidylog::select(
      taxonKey,
      year,
      eea_cell_code,
      obs
    ),
    by = c("taxonKey", "year", "eea_cell_code")
  )
# Add membership to protected areas
df_ts <-
  df_ts %>%
  tidylog::left_join(
    df_prot_areas %>%
      tidylog::select(CELLCODE, natura2000),
    by = c("eea_cell_code" = "CELLCODE")
  )
# Add regional information
df_ts <-
  df_ts %>%
  tidylog::left_join(
    eea_grid_regions %>%
      as.data.frame() %>%
      tidylog::select(
        CELLCODE,
        isFlanders,
        isWallonia,
        isBrussels
      ),
    by = c("eea_cell_code" = "CELLCODE")
  )
# Add classKey
df_ts <-
  df_ts %>%
  tidylog::left_join(spec_names %>% tidylog::select(taxonKey, classKey),
    by = "taxonKey"
  )

3.6 Research effort correction

To correct the effect of research effort of an alien speies, we calculate the number of observations at class level excluding the observations of the alien species itself:

# Add baseline data (at class level) diminished by obs of specific alien taxon
df_ts <-
  df_ts %>%
  tidylog::left_join(
    df_bl %>%
      tidylog::select(year, eea_cell_code, classKey, cobs),
    by = c("year", "eea_cell_code", "classKey")
  ) %>%
  tidylog::mutate(cobs = cobs - obs)

# replace NAs with 0
df_ts <-
  df_ts %>%
  tidylog::replace_na(list(cobs = 0, obs = 0))

Preview:

df_ts %>% head(n = 30)

Add column for presence (1) or absence (0)

df_ts <-
  df_ts %>%
  tidylog::mutate(
    pa_cobs = if_else(cobs > 0, 1, 0),
    pa_obs = if_else(obs > 0, 1, 0)
  )

Arrange order columns:

df_ts <-
  df_ts %>%
  tidylog::relocate(
    taxonKey,
    year,
    eea_cell_code,
    obs,
    pa_obs,
    cobs,
    pa_cobs,
    classKey,
    isFlanders,
    isWallonia,
    isBrussels,
    natura2000
  )

4 Save data

4.1 Save timeseries and taxonomic metadata

Save df_ts as it will be used as start point of modelling pipelines:

write_tsv(df_ts,
  file = here::here("data", "interim", "df_timeseries.tsv"),
  na = ""
)

We save also the taxonomic information:

write_tsv(spec_names,
  file = here::here(
    "data",
    "interim",
    "timeseries_taxonomic_info.tsv"
  ),
  na = ""
)