Skip to contents

Introduction

This vignette provides worked examples for fitting joint species distribution models in the spOccupancy R package (Doser et al. 2021). Joint species distribution models (JSDMs) are a series of regression-based approaches that explicitly accommodate residual species correlations (Latimer et al. 2009; Ovaskainen, Hottola, and Siitonen 2010). spOccupancy provides a series of functions that account for various combinations of the three major complexities often encountered in multi-species detection-nondetection data: (1) residual species correlations (Ovaskainen, Hottola, and Siitonen 2010), (2) imperfect detection (MacKenzie et al. 2002), and (3) spatial autocorrelation (Finley, Banerjee, and McRoberts 2009). In this vignette, we will provide step by step examples on how to fit the following models:

  1. A latent factor multi-species occupancy model using lfMsPGOcc() that accommodates residual species correlations and imperfect detection.
  2. A spatial latent factor multi-species occupancy model using sfMsPGOcc() that accommodates residual species correlations, imperfect detection, and spatial autocorrelation.
  3. A spatial latent factor joint species distribution model using sfJSDM() that accommodates residual species correlations and spatial autocorrelation.
  4. A latent factor joint species distribution model using lfJSDM() that accommodates residual species correlations.

For a detailed vignette on non-spatial and spatial multi-species occupancy models that do not account for residual species correlations, see the introductory spOccupancy vignette.

As with all models implemented in spOccupancy, we use Pólya-Gamma data augmentation for computational efficiency (Polson, Scott, and Windle 2013). Here we provide a brief description of each model, with full statistical details provided on the Gibbs sampler implementations of the models in Appendix S2. In addition to fitting each model, we will show how spOccupancy provides functionality for posterior predictive checks as a Goodness of Fit assessment, model comparison and assessment using the Widely Applicable Information Criterion (WAIC), k-fold cross-validation, and out-of-sample predictions using standard R helper functions (e.g., predict()).

Below, we first load the spOccupancy package, the coda package for some additional MCMC diagnostics, as well as the stars and ggplot2 packages to create some basic plots of our results. We also set a seed so you can reproduce the same results we do.

Example data set: Foliage-gleaning birds at Hubbard Brook

As an example data set throughout this vignette, we will use data from twelve foliage-gleaning bird species collected from point count surveys at Hubbard Brook Experimental Forest (HBEF) in New Hampshire, USA. Specific details on the data set, which is just a small subset from a long-term survey, are available on the Hubbard Brook website and Doser et al. (2022). The data are provided as part of the spOccupancy package and are loaded with data(hbef2015). Point count surveys were conducted at 373 sites over three replicates, each of 10 minutes in length and with a detection radius of 100m. In the data set provided here, we converted these data to detection-nondetection data. Some sites were not visited for all three replicates. Additional information on the data set (including individual species in the data set) can be viewed in the man page using help(hbef2015).

data(hbef2015) # Load the data set.
str(hbef2015) # Get an overview of what's in the data.
List of 4
 $ y       : num [1:12, 1:373, 1:3] 0 0 0 0 0 1 0 0 0 0 ...
  ..- attr(*, "dimnames")=List of 3
  .. ..$ : chr [1:12] "AMRE" "BAWW" "BHVI" "BLBW" ...
  .. ..$ : chr [1:373] "1" "2" "3" "4" ...
  .. ..$ : chr [1:3] "1" "2" "3"
 $ occ.covs: num [1:373, 1] 475 494 546 587 588 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr "Elevation"
 $ det.covs:List of 2
  ..$ day: num [1:373, 1:3] 156 156 156 156 156 156 156 156 156 156 ...
  ..$ tod: num [1:373, 1:3] 330 346 369 386 409 425 447 463 482 499 ...
 $ coords  : num [1:373, 1:2] 280000 280000 280000 280001 280000 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:373] "1" "2" "3" "4" ...
  .. ..$ : chr [1:2] "X" "Y"
# Species codes
sp.names <- rownames(hbef2015$y)

The object hbef2015 is a list comprised of the detection-nondetection data (y), covariates on the occurrence portion of the model (occ.covs), covariates on the detection portion of the model (det.covs), and the spatial coordinates of each site (coords) for use in spatial occupancy models and in plotting. Note that spOccupancy functions assume the spatial coordinates are specified in a projected coordinate system. This list is the format required for input to spOccupancy model functions. hbef2015 contains data on 12 species in the three-dimensional array y, where the dimensions of y correspond to species (12), sites (373), and replicates (3).

Latent factor multi-species occupancy models

Basic model description

Let \(z_{i, j}\) be the true presence (1) or absence (0) of some species \(i\) at site \(j\) for a total of \(i = 1, \dots, N\) species and \(j = 1, \dots, J\) sites. For our HBEF example, \(N = 12\) and \(J = 373\). We assume \(z_{i, j}\) arises from a Bernoulli process following

\[\begin{equation} \begin{split} &z_{i, j} \sim \text{Bernoulli}(\psi_{i, j}), \\ &\text{logit}(\psi_{i, j}) = \boldsymbol{x}^{\top}_{j} \boldsymbol{\beta}_i + \text{w}^*_{i, j}, \end{split} \end{equation}\]

where \(\psi_{i, j}\) is the probability of occurrence of species \(i\) at site \(j\), which is a function of site-specific covariates \(\boldsymbol{X}\), a vector of species-specific regression coefficients (\(\boldsymbol{\beta}_i\)) for those covariates, and a latent process \(\text{w}^*_{i, j}\). We incorporate residual species correlations through the formulation of the latent process \(\text{w}^*_{i, j}\). We use a factor modeling approach, which is a dimension reduction technique that can account for correlations among a large number of species without a massive computational cost (Hogan and Tchernis 2004). Specifically, we decompose \(\text{w}^*_{i, j}\) into a linear combination of \(q\) latent variables (i.e., factors) and their associated species-specific coefficients (i.e., factor loadings). Thus, we have

\[\begin{equation} \text{w}^*_{i, j} = \boldsymbol{\lambda}_i^\top\textbf{w}_j, \end{equation}\]

where \(\boldsymbol{\lambda}_i\) is the \(i\)th row of factor loadings from an \(N \times q\) matrix \(\boldsymbol{\Lambda}\), and \(\textbf{w}_j\) is a \(q \times 1\) vector of independent latent factors at site \(j\). We achieve computational improvements by setting \(q << N\), where often a small number of factors (e.g., \(q = 5\)) is sufficient (Taylor-Rodriguez et al. 2019). For our HBEF example, our community is relatively small (\(N = 12\)) and so we use \(q = 2\) latent factors as our initial choice, and will discuss assessing this choice of the number of factors later in the example. We account for residual species correlations via their individual responses (i.e., loadings) to the \(q\) latent spatial factors. We can envision the latent variables \(\textbf{w}_j\) as unmeasured site-specific covariates that are treated as random variables in the model estimation procedure. For the non-spatial latent factor model, we assign a standard normal prior distribution to the latent factors (i.e., we assume each latent factor is independent and arises from a normal distribution with mean 0 and standard deviation 1).

We envision the species-specific regression coefficients (\(\boldsymbol{\beta}_i\)) as random effects arising from a common community-level distribution:

\[\begin{equation} \boldsymbol{\beta}_i \sim \text{Normal}(\boldsymbol{\mu_{\beta}}, \boldsymbol{T}_{\beta}), \end{equation}\]

where \(\boldsymbol{\mu_{\beta}}\) is a vector of community-level mean effects for each occurrence covariate effect (including the intercept) and \(\boldsymbol{T}_{\beta}\) is a diagonal matrix with diagonal elements \(\boldsymbol{\tau}^2_{\beta}\) that represent the variance of each occurrence covariate effect among species in the community.

We do not directly observe \(z_{i, j}\), but rather we observe an imperfect representation of the latent occurrence process. Let \(y_{i, j, k}\) be the observed detection (1) or nondetection (0) of a species \(i\) of interest at site \(j\) during replicate \(k\) for each of \(k = 1, \dots, K_j\) replicates at each site \(j\). We envision the detection-nondetection data as arising from a Bernoulli process conditional on the true latent occurrence process:

\[\begin{equation} \begin{split} &y_{i, j, k} \sim \text{Bernoulli}(p_{i, j, k}z_{i, j}), \\ &\text{logit}(p_{i, j, k}) = \boldsymbol{v}^{\top}_{i, j, k}\boldsymbol{\alpha}_i, \end{split} \end{equation}\]

where \(p_{i, j, k}\) is the probability of detecting species \(i\) at site \(j\) during replicate \(k\) (given it is present at site \(j\)), which is a function of site and replicate-specific covariates \(\boldsymbol{V}\) and a vector of species-specific regression coefficients (\(\boldsymbol{\alpha}_i\)). Similarly to the occurrence regression coefficients, the species-specific detection coefficients are envisioned as random effects arising from a common community-level distribution:

\[\begin{equation} \boldsymbol{\alpha}_i \sim \text{Normal}(\boldsymbol{\mu_{\alpha}}, \boldsymbol{T}_{\alpha}), \end{equation}\]

where \(\boldsymbol{\mu_{\alpha}}\) is a vector of community-level mean effects for each detection covariate effect (including the intercept) and \(\boldsymbol{T}_{\alpha}\) is a diagonal matrix with diagonal elements \(\boldsymbol{\tau}^2_{\alpha}\) that represent the variability of each detection covariate effect among species in the community.

We assign multivariate normal priors for the community-level occurrence (\(\boldsymbol{\mu_{\beta}}\)) and detection (\(\boldsymbol{\mu_{\alpha}}\)) means, and assign independent inverse-Gamma priors on the community-level occurrence (\(\tau^2_{\beta}\)) and detection (\(\tau^2_{\alpha}\)) variance parameters. To ensure identifiability of the latent factors and factor loadings, we set all elements in the upper triangle of the factor loadings matrix \(\boldsymbol{\Lambda}\) equal to 0 and its diagonal elements equal to 1.

Fitting latent factor multi-species occupancy models with lfMsPGOcc()

The lfMsPGOcc() function fits latent factor multi-species occupancy models. lfMsPGOcc() has the following arguments:

lfMsPGOcc(occ.formula, det.formula, data, inits, priors, n.factors, 
          n.samples, n.omp.threads = 1, verbose = TRUE, n.report = 100, 
          n.burn = round(.10 * n.samples), n.thin = 1, n.chains = 1,
          k.fold, k.fold.threads = 1, k.fold.seed, ...)

The first two arguments, occ.formula and det.formula, use standard R model syntax to denote the covariates to be included in the occurrence and detection portions of the model, respectively. We only specify the right hand side of the formula. We can include random intercepts in both the occurrence and detection portions of the model using lme4 syntax (Bates et al. 2015). The names of variables given in the formulas should correspond to those found in data, which is a list consisting of the following tags: y (detection-nondetection data), occ.covs (occurrence covariates), det.covs (detection covariates), and coords (spatial coordinates of sites). y is a three-dimensional array with dimensions corresponding to species, sites, and replicates, occ.covs is a matrix or data frame with site-specific covariate values, and det.covs is a list with each list element corresponding to a covariate to include in the detection portion of the model. Covariates on detection can vary by site and/or survey, and so these covariates may be specified as a site by survey matrix for survey-level covariates or as a one-dimensional vector for survey level covariates. The hbef2015 list is already in the required format. For our example, we will model species-specific occurrence as a function of linear and quadratic elevation, and will include three observational covariates (linear and quadratic day of survey, time of day of survey) on the detection portion of the model. We standardize all covariates by using the scale() function in our model specification, and use the I() function to specify quadratic effects.

occ.formula <- ~ scale(Elevation) + I(scale(Elevation)^2)
det.formula <- ~ scale(day) + scale(tod) + I(scale(day)^2)

Next, we will specify the number of latent factors to use in our model. This is not an arbitrary decision, and it is difficult to determine the optimal number of latent factors in the model. While other approaches exist to estimate the “optimal” number of factors directly in the modeling framework (Tikhonov, Duan, et al. 2020; Ovaskainen et al. 2016), these approaches do not allow for interpretability of the latent factors and the latent factor loadings (see Figure 2 in Doser, Finley, Banerjee (2022)). The specific restraints and priors we place on the factor loadings matrix (\(\boldsymbol{\Lambda}\)) in our approach allows for interpretation of the latent factors and the factor loadings, but does not automatically determine the number of factors for optimal predictive performance. Thus, there is a tradeoff between interpretability of the latent factor and factor loadings and optimal predictive performance. In the spOccupancy implementation, we chose to allow for interpretability of the factor and factor loadings at risk of inferior predictive performance if too many or too few factors are specified by the user.

The number of latent factors can range from 1 to \(N\) (the total number of species in the modeled community). Conceptually, choosing the number of factors is similar to performing a Principal Components Analysis and looking at which components explain a large amount of variation. We want to choose the number of factors that explains an adequate amount of variability among species in the community, but we want to keep this number as small as possible to avoid overfitting the model and large model run times. When initially specifying the number of factors, we suggest the following:

  1. Consider the size of the community and how much variation you expect from species to species. If you expect large variation in occurrence patterns for all species in the community, you may require a larger number of factors. If your modeled community is comprised of certain groups of species that you expect to behave similarly (e.g., insectivores, frugivores, granivores), then a smaller number of factors may suffice. Further, as shown by Tikhonov, Duan, et al. (2020), as the number of species increases, you will likely need more factors to adequately represent the community.
  2. Consider how much time you have to run the model. The more factors included in the model, the longer the model will take to run. Under certain circumstances, like when you are running a model across a large number of spatial locations, you may simply be restricted to a small number of factors in order to achieve reasonable run times.
  3. Consider how rare your species in the community are, how many data locations you have (i.e., sites), and how many replicates you have at each site. Models with more latent factors have more parameters to estimate, and thus require more data. If you have a lot of rare species in the community, you will likely be limited to a very small number of factors, as models with more than a few factors may not be identifiable. The same can be said if you are working with a small number of spatial locations (e.g., 30 sites) or replicates (e.g., 1 or 2 replicates at each site).

In our HBEF example, the community is relatively small (\(N = 12\)) and the species are all quite similar (after all, they are all classified as foliage-gleaning birds). Let’s take a look at the raw probabilities of occurrence from the detection-nondetection data (ignoring imperfect detection) to give an idea of how rare the species are

apply(hbef2015$y, 1, mean, na.rm = TRUE)
       AMRE        BAWW        BHVI        BLBW        BLPW        BTBW 
0.003616637 0.039783002 0.059674503 0.379746835 0.059674503 0.559674503 
       BTNW        CAWA        MAWA        NAWA        OVEN        REVI 
0.556057866 0.024412297 0.084990958 0.023508137 0.531645570 0.554249548 

It looks like we have some really rare species (e.g., AMRE, NAWA, BAWW) and some pretty common species (e.g., OVEN, REVI, BTBW). Taking all of this in consideration, it makes sense to initially try the model with a small number of factors, and so we will work with \(q = 2\) factors.

# Number of latent factors (q in statistical notation)
n.factors <- 2

Because of the restrictions we place on the factor loadings matrix (diagonal elements equal to 1 and upper triangle elements equal to 0), another important modeling decision we need to make is how to order the species in our detection-nondetection array. More specifically, we need to carefully choose the first \(q\) species in the our array, as these are the species that will have restrictions on their factor loadings. While from a theoretical perspective the order of the species will only influence the resulting interpretation of the latent factors and factor loadings matrix and not the model estimates, this decision does have practical implications. We have found that careful consideration of the ordering of species can lead to (1) increased interpretability of the factors and factor loadings; (2) faster model convergence; and (3) improved mixing. Determining the order of the factors is less important when you have an adequate number of observations for all species in the community, but it becomes increasingly important the more rare species you have in your data set. We suggest the following when considering the order of the species in the detection-nondetection array (y):

  1. Place a common species first. The first species has all of its factor loadings set to fixed values, and so it can have a large influence on the resulting interpretation on the factor loadings and latent factors. We have also found that having a very rare species first can result in very slow mixing and increased sensitivity to initial values of the latent factor loadings matrix.
  2. For the remaining \(q - 1\) factors, place species that you believe will show different occurrence patterns than the first species, and the other species placed before it. Place these remaining \(q - 1\) species in order of decreasing differences from the initial factor. For example, if I had \(q = 3\), for the second species in the array, I would place a species that I a priori think is most different from the first species. For the third species in the array, I would place a species that I think will show different occurrence patterns than both the first and second species, but its patterns may not be as noticeably different compared to the first and second species.

In our HBEF example, it is clear that there is a set of fairly common species as well as very rare species. This is likely related to the specific elevation these species tend to occurr at as a result of varied habitat requirements. Accordingly, we will reorder the species matrix (hbef2015$y) to place one of the common species first that occurs at relatively moderate elevations (OVEN) and then place a rare species second that tends to occurr at high elevational habitat (BLPW). The order of the remaining \(N - q = 10\) species does not matter. Below we reorder the species following this logic, and then create a new data object hbef.ordered that we will supply to lfMsPGOcc().

# Current species ordering
sp.names
 [1] "AMRE" "BAWW" "BHVI" "BLBW" "BLPW" "BTBW" "BTNW" "CAWA" "MAWA" "NAWA"
[11] "OVEN" "REVI"
# Reorder species. 
sp.ordered <- c('OVEN', 'BLPW', 'AMRE', 'BAWW', 'BHVI', 'BLBW', 
                'BTBW', 'BTNW', 'CAWA', 'MAWA', 'NAWA', 'REVI')
# Create new detection-nondetection data matrix in the new order
y.new <- hbef2015$y[sp.ordered, , ]
# Create a new data array
hbef.ordered <- hbef2015
# Change the data to the new ordered data
hbef.ordered$y <- y.new
str(hbef.ordered)
List of 4
 $ y       : num [1:12, 1:373, 1:3] 1 0 0 0 0 0 1 0 0 0 ...
  ..- attr(*, "dimnames")=List of 3
  .. ..$ : chr [1:12] "OVEN" "BLPW" "AMRE" "BAWW" ...
  .. ..$ : chr [1:373] "1" "2" "3" "4" ...
  .. ..$ : chr [1:3] "1" "2" "3"
 $ occ.covs: num [1:373, 1] 475 494 546 587 588 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr "Elevation"
 $ det.covs:List of 2
  ..$ day: num [1:373, 1:3] 156 156 156 156 156 156 156 156 156 156 ...
  ..$ tod: num [1:373, 1:3] 330 346 369 386 409 425 447 463 482 499 ...
 $ coords  : num [1:373, 1:2] 280000 280000 280000 280001 280000 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:373] "1" "2" "3" "4" ...
  .. ..$ : chr [1:2] "X" "Y"

Next we specify the initial values in inits and the prior distributions in priors. These arguments are optional, as spOccupancy will set default initial values and prior distributions if these arguments are not specified. If verbose = TRUE, messages will be printed to the screen to indicate what initial values and priors are used by default for each model parameter. Here (and throughout this vignette), we will explicitly specify initial values and priors.

However, we will point out that all models in spOccupancy that use a factor modeling approach can be fairly sensitive to the initial values of the latent factor loadings. This is primarily an issue when there are a large number of rare species. If you encounter difficulties in model convergence when running factor models in spOccupancy across multiple chains, we recommend first running a single chain of the model for a moderate number of iterations until the traceplots look like they are settling around a value (i.e., convergence is closed to being reached). Then extract the estimated mean values for the factor loadings matrix (\(\boldsymbol{\Lambda}\)) and supply these as initial values to the spOccupancy function when running the full model across multiple chains. When running multiple chains when not paying much attention to the initial values, you may see large discrepancies between certain chains with very large Rhat values for the latent factor loadings matrix (and spatial range parameters for spatially-explicit factor models). However, this may not necessarily be a convergence issue. Rather, what may happen is that depending on the initial values, the specific factors in the model may be estimated in a different order. For example, if estimating a model with two latent factors with two chains, the latent factors may correspond to a latitudinal and a longitudinal gradient in the first chain, but in the second chain these factors could be reversed with the first factor corresponding to the longitudinal gradient and the second factor corresponding to the latitudinal gradient. This is because it is only the sum of the product of the factor loadings and factors that influences occurrence probability, and so the specific ordering of the factors may switch depending on (1) the first \(q\) species relationships to the latent factors and (2) the initial values. Thus, we encourage looking at the traceplots of each individual chain for the latent factor loadings (and spatial range parameters if using a spatial factor model). If the chain has an adequately large effective sample size for the parameters and appears to have reached convergence, we then recommend fixing the initial values at the estimated means from the preliminary model run and then running multiple chains to further assess convergence.

In lfMsPGOcc(), we will supply initial values for the following parameters: alpha.comm (community-level detection coefficients), beta.comm (community-level occurrence coefficients), alpha (species-level detection coefficients), beta (species-level occurrence coefficients), tau.sq.beta (community-level occurrence variance parameters), tau.sq.alpha (community-level detection variance parameters), lambda (the species-specific factor loadings), and z (latent occurrence variables for all species). These are all specified in a single list. Initial values for community-level parameters are either vectors of length corresponding to the number of community-level detection or occurrence parameters in the model (including the intercepts) or a single value if all parameters are assigned the same initial values. Initial values for species level regression coefficients are either matrices with the number of rows indicating the number of species, and each column corresponding to a different regression parameter, or a single value if the same initial value is used for all species and parameters. Initial values for the species-specific factor loadings (lambda) are specified as a numeric matrix with \(N\) rows and \(q\) columns, where \(N\) is the number of species and \(q\) is the number of latent factors used in the model. The diagonal elements of the matrix must be 1, and values in the upper triangle must be set to 0 to ensure identifiability of the latent factors. The initial values for the latent occurrence matrix are specified as a matrix with \(N\) rows corresponding to the number of species and \(J\) columns corresponding to the number of sites.

# Number of species
N <- nrow(hbef.ordered$y)
# Initiate all lambda initial values to 0. 
lambda.inits <- matrix(0, N, n.factors)
# Set diagonal elements to 1
diag(lambda.inits) <- 1
# Set lower triangular elements to random values from a standard normal distribution
lambda.inits[lower.tri(lambda.inits)] <- rnorm(sum(lower.tri(lambda.inits)))
# Check it out. Note this is also how spOccupancy specifies default
# initial values for lambda.
lambda.inits
             [,1]        [,2]
 [1,]  1.00000000  0.00000000
 [2,] -0.50219235  1.00000000
 [3,]  0.13153117  0.09627446
 [4,] -0.07891709 -0.20163395
 [5,]  0.88678481  0.73984050
 [6,]  0.11697127  0.12337950
 [7,]  0.31863009 -0.02931671
 [8,] -0.58179068 -0.38885425
 [9,]  0.71453271  0.51085626
[10,] -0.82525943 -0.91381419
[11,] -0.35986213  2.31029682
[12,]  0.08988614 -0.43808998
# Create list of initial values. 
inits <- list(alpha.comm = 0,
              beta.comm = 0,
              beta = 0,
              alpha = 0,
              tau.sq.beta = 1,
              tau.sq.alpha = 1,
              lambda = lambda.inits, 
              z = apply(hbef.ordered$y, c(1, 2), max, na.rm = TRUE))

Notice that we set initial values of the latent species occurrence (\(z\)) to 1 if there was at least one observation of the species at the given site, and 0 if the species was not detected at that site (this is also the default value spOccupancy will use if initial values for \(z\) are not provided). We set the lower triangular elements of the factor loadings matrix to random values from a standard normal distribution, as we have found these parameters to be relatively insensitive to initial values for this specific data set.

We specify the priors in the priors argument with the following tags: beta.comm.normal (normal prior on the community-level occurrence mean effects), alpha.comm.normal (normal prior on the community-level detection mean effects), tau.sq.beta.ig (inverse-Gamma prior on the community-level occurrence variance parameters), tau.sq.alpha.ig (inverse-Gamma prior on the community-level detection variance parameters). Each tag consists of a list with elements corresponding to the mean and variance for normal priors and scale and shape for inverse-Gamma priors. Values can be specified individually for each parameter or as a single value if the same prior is assigned to all parameters of a given type.

Below we specify normal priors to be relatively vague on the probability scale with a mean of 0 and a variance of 2.72, and specify vague inverse gamma priors on the community-level variance parameters setting both the shape and scale parameters to 0.1.

priors <- list(beta.comm.normal = list(mean = 0, var = 2.72),
               alpha.comm.normal = list(mean = 0, var = 2.72),
               tau.sq.beta.ig = list(a = 0.1, b = 0.1),
               tau.sq.alpha.ig = list(a = 0.1, b = 0.1))

Our next step is to specify the number of samples to produce with the MCMC algorithm (n.samples), the length of burn-in (n.burn), the rate at which we want to thin the posterior samples (n.thin), and the number of MCMC chains to run (n.chains). Note that currently spOccupancy runs multiple chains sequentially and does not allow chains to be run simultaneously in parallel across multiple threads. Instead, we allow for within-chain parallelization using the n.omp.threads argument. We can set n.omp.threads to a number greater than 1 and smaller than the number of threads on the computer you are using. Generally, setting n.omp.threads > 1 will not result in decreased run times for non-spatial joint species distribution models in spOccupancy, but can substantially decrease run time when fitting spatially-explicit models (Finley, Datta, and Banerjee 2020). Here we set n.omp.threads = 1.

n.samples <- 5000
n.burn <- 1000
n.thin <- 8
n.chains <- 3

We are now nearly set to run the latent factor multi-species occupancy model. The verbose argument is a logical value indicating whether or not MCMC sampler progress is reported to the screen. If verbose = TRUE, sampler progress is reported after every multiple of the specified number of iterations in the n.report argument. We set verbose = TRUE and n.report = 1000 to report progress after every 1000th MCMC iteration. Additionally, the last three arguments of lfMsPGOcc() (and all spOccupancy model fitting functions), k.fold, k.fold.threads, and k.fold.seed, allow us to perform k-fold cross-validation after fitting the model. Here we will not perform k-fold cross-validation, but see the introductory spOccupancy vignette for details and examples of running spOccupancy functions for k-fold cross-validation.

# Approx run time: 78 seconds
out.lfMsPGOcc <- lfMsPGOcc(occ.formula = occ.formula, det.formula = det.formula, 
                           data = hbef.ordered, inits = inits, priors = priors, 
                           n.factors = n.factors, n.samples = n.samples, 
                           n.omp.threads = 1, verbose = TRUE, n.report = 1000, 
                           n.burn = n.burn, n.thin = n.thin, n.chains = n.chains)
----------------------------------------
    Preparing to run the model
----------------------------------------
----------------------------------------
    Model description
----------------------------------------
Latent Factor Multispecies Occupancy Model with Polya-Gamma latent
variable fit with 373 sites and 12 species.

Samples per Chain: 5000 
Burn-in: 1000 
Thinning Rate: 8 
Number of Chains: 3 
Total Posterior Samples: 1500 

Using 2 latent factors.

Source compiled with OpenMP support and model fit using 1 thread(s).

----------------------------------------
    Chain 1
----------------------------------------
Sampling ... 
Sampled: 1000 of 5000, 20.00%
-------------------------------------------------
Sampled: 2000 of 5000, 40.00%
-------------------------------------------------
Sampled: 3000 of 5000, 60.00%
-------------------------------------------------
Sampled: 4000 of 5000, 80.00%
-------------------------------------------------
Sampled: 5000 of 5000, 100.00%
----------------------------------------
    Chain 2
----------------------------------------
Sampling ... 
Sampled: 1000 of 5000, 20.00%
-------------------------------------------------
Sampled: 2000 of 5000, 40.00%
-------------------------------------------------
Sampled: 3000 of 5000, 60.00%
-------------------------------------------------
Sampled: 4000 of 5000, 80.00%
-------------------------------------------------
Sampled: 5000 of 5000, 100.00%
----------------------------------------
    Chain 3
----------------------------------------
Sampling ... 
Sampled: 1000 of 5000, 20.00%
-------------------------------------------------
Sampled: 2000 of 5000, 40.00%
-------------------------------------------------
Sampled: 3000 of 5000, 60.00%
-------------------------------------------------
Sampled: 4000 of 5000, 80.00%
-------------------------------------------------
Sampled: 5000 of 5000, 100.00%

The resulting object out.lfMsPGOcc is a list of class lfMsPGOcc consisting primarily of posterior samples of all community and species-level parameters, as well as some additional objects that are used for summaries, prediction, and model fit/evaluation. We can display a nice summary of these results using the summary() function. When using summary, we can specify the level of parameters we want to summarize. We do this using the argument level, which takes values community, species, or both to print results for community-level parameters, species-level parameters, or all parameters. The default value prints a summary for all model parameters.

summary(out.lfMsPGOcc)

Call:
lfMsPGOcc(occ.formula = occ.formula, det.formula = det.formula, 
    data = hbef.ordered, inits = inits, priors = priors, n.factors = n.factors, 
    n.samples = n.samples, n.omp.threads = 1, verbose = TRUE, 
    n.report = 1000, n.burn = n.burn, n.thin = n.thin, n.chains = n.chains)

Samples per Chain: 5000
Burn-in: 1000
Thinning Rate: 8
Number of Chains: 3
Total Posterior Samples: 1500
Run Time (min): 1.4978

----------------------------------------
    Community Level
----------------------------------------
Occurrence Means (logit scale): 
                         Mean     SD    2.5%     50%  97.5%   Rhat  ESS
(Intercept)            0.0983 0.9393 -1.7194  0.1360 1.9742 1.0299 1500
scale(Elevation)       0.3859 0.5727 -0.7515  0.4000 1.4938 1.0078 1297
I(scale(Elevation)^2) -0.2024 0.3304 -0.8804 -0.2076 0.4526 1.0794  396

Occurrence Variances (logit scale): 
                         Mean     SD   2.5%     50%   97.5%   Rhat ESS
(Intercept)           16.5364 9.1439 6.2711 14.1860 40.7536 1.0342 488
scale(Elevation)       4.3913 2.6404 1.4700  3.7561 10.9150 1.1785 192
I(scale(Elevation)^2)  1.0229 0.7694 0.2342  0.8224  2.9948 1.0663 269

Detection Means (logit scale): 
                   Mean     SD    2.5%     50%  97.5%   Rhat  ESS
(Intercept)     -0.7395 0.4376 -1.6115 -0.7384 0.0991 1.0123  919
scale(day)       0.0578 0.0921 -0.1230  0.0555 0.2385 1.0024 1500
scale(tod)      -0.0403 0.0778 -0.1958 -0.0396 0.1171 1.0068 1500
I(scale(day)^2) -0.0241 0.0855 -0.1973 -0.0233 0.1449 1.0046 1451

Detection Variances (logit scale): 
                  Mean     SD   2.5%    50%  97.5%   Rhat  ESS
(Intercept)     2.3581 1.3893 0.7984 2.0007 5.9680 1.2190  191
scale(day)      0.0687 0.0444 0.0230 0.0584 0.1812 1.0046 1500
scale(tod)      0.0498 0.0361 0.0174 0.0411 0.1345 1.0294 1500
I(scale(day)^2) 0.0572 0.0383 0.0188 0.0476 0.1492 1.0106 1342

----------------------------------------
    Species Level
----------------------------------------
Occurrence (logit scale): 
                              Mean     SD    2.5%     50%   97.5%   Rhat  ESS
(Intercept)-OVEN            2.3819 0.2746  1.8890  2.3693  2.9307 1.0033 1262
(Intercept)-BLPW           -5.7981 0.7856 -7.4283 -5.7374 -4.4167 1.0018  324
(Intercept)-AMRE           -3.1334 1.5491 -5.6628 -3.3035  0.2687 1.1664   41
(Intercept)-BAWW            3.5527 2.0656  0.4644  3.2926  8.6067 2.2650   43
(Intercept)-BHVI            0.0589 0.8946 -1.3927 -0.0413  1.9872 1.0384  130
(Intercept)-BLBW            3.3122 1.0281  1.9871  3.0715  5.9279 1.0255  111
(Intercept)-BTBW            4.7973 0.8272  3.4016  4.7032  6.6545 1.0311  280
(Intercept)-BTNW            3.1657 0.7288  2.1357  3.0296  5.0666 1.0034  155
(Intercept)-CAWA           -2.5220 0.7233 -3.9607 -2.4866 -1.1603 1.0251  123
(Intercept)-MAWA           -2.7620 0.5592 -4.0112 -2.7019 -1.7845 1.0678  237
(Intercept)-NAWA           -3.7956 0.9032 -5.8622 -3.6998 -2.2654 1.1006  179
(Intercept)-REVI            3.6403 0.6587  2.6577  3.5435  5.1140 1.0827  198
scale(Elevation)-OVEN      -1.8300 0.2723 -2.4344 -1.8043 -1.3788 1.0072  867
scale(Elevation)-BLPW       3.1954 0.6998  2.0554  3.1158  4.8313 1.0360  236
scale(Elevation)-AMRE       1.2678 1.0381 -0.4522  1.1464  3.6175 1.1226  155
scale(Elevation)-BAWW      -1.1782 1.3752 -4.3842 -0.9053  1.3580 1.5140   42
scale(Elevation)-BHVI       0.3713 0.6421 -1.0379  0.3884  1.6027 1.1416  188
scale(Elevation)-BLBW      -0.5649 0.3611 -1.4759 -0.4963 -0.0543 1.0547  128
scale(Elevation)-BTBW      -0.5976 0.2290 -1.0939 -0.5782 -0.1876 1.0126  865
scale(Elevation)-BTNW       0.8977 0.4053  0.2703  0.8343  1.8792 1.0064  303
scale(Elevation)-CAWA       2.4264 0.9637  1.0664  2.2803  4.8781 1.0918   88
scale(Elevation)-MAWA       2.3958 0.5101  1.5349  2.3516  3.5091 1.0115  316
scale(Elevation)-NAWA       1.4823 0.4991  0.6339  1.4355  2.6224 1.0263  309
scale(Elevation)-REVI      -2.4054 0.7792 -4.3578 -2.2740 -1.2449 1.1294  158
I(scale(Elevation)^2)-OVEN -0.4932 0.2057 -0.8738 -0.5086 -0.0524 1.0010 1128
I(scale(Elevation)^2)-BLPW  1.0555 0.4565  0.1176  1.0792  1.9069 1.0370  468
I(scale(Elevation)^2)-AMRE -0.8837 0.7393 -2.4072 -0.8233  0.4487 1.1940  135
I(scale(Elevation)^2)-BAWW -1.0515 0.8698 -2.4716 -1.1552  1.0262 1.7579   64
I(scale(Elevation)^2)-BHVI  0.7924 0.7444 -0.2617  0.6347  2.6085 1.0712  103
I(scale(Elevation)^2)-BLBW -0.7275 0.2628 -1.3234 -0.6884 -0.2929 1.0035  286
I(scale(Elevation)^2)-BTBW -1.3390 0.2649 -1.9159 -1.3090 -0.8928 1.0247  357
I(scale(Elevation)^2)-BTNW -0.1502 0.2229 -0.5672 -0.1520  0.2983 1.0044  988
I(scale(Elevation)^2)-CAWA -0.5121 0.5531 -1.5695 -0.5374  0.6275 1.0623  194
I(scale(Elevation)^2)-MAWA  0.5407 0.3595 -0.1331  0.5284  1.2832 1.0483  582
I(scale(Elevation)^2)-NAWA  0.4177 0.3460 -0.2477  0.4159  1.1289 1.0067  644
I(scale(Elevation)^2)-REVI -0.0398 0.4435 -0.7662 -0.0729  0.9455 1.0730  197

Detection (logit scale): 
                        Mean     SD    2.5%     50%   97.5%   Rhat  ESS
(Intercept)-OVEN      0.8165 0.1196  0.5951  0.8124  1.0628 1.0010 1363
(Intercept)-BLPW     -0.4724 0.2635 -0.9964 -0.4683  0.0273 1.0105 1500
(Intercept)-AMRE     -2.4870 1.2526 -4.9511 -2.3714 -0.2973 1.2739   34
(Intercept)-BAWW     -2.8617 0.3271 -3.4228 -2.8892 -2.1637 1.7479   32
(Intercept)-BHVI     -2.2621 0.2915 -2.8154 -2.2785 -1.6644 1.0479   94
(Intercept)-BLBW     -0.0662 0.1266 -0.3095 -0.0680  0.1844 1.0029  389
(Intercept)-BTBW      0.6475 0.1063  0.4383  0.6488  0.8676 1.0051 1500
(Intercept)-BTNW      0.5803 0.1166  0.3545  0.5766  0.8116 1.0030 1291
(Intercept)-CAWA     -1.6942 0.5870 -2.8377 -1.6788 -0.6573 1.0555  111
(Intercept)-MAWA     -0.7661 0.2227 -1.2077 -0.7739 -0.3237 1.0003  887
(Intercept)-NAWA     -1.4381 0.4353 -2.2542 -1.4329 -0.6062 1.0764  436
(Intercept)-REVI      0.5441 0.1072  0.3404  0.5427  0.7570 1.0037 1500
scale(day)-OVEN      -0.0733 0.0734 -0.2161 -0.0715  0.0663 1.0002 1500
scale(day)-BLPW       0.0673 0.1576 -0.2373  0.0685  0.3704 1.0262 1500
scale(day)-AMRE       0.0417 0.2338 -0.4395  0.0431  0.4862 0.9999 1500
scale(day)-BAWW       0.2223 0.1452 -0.0539  0.2166  0.5204 1.0036 1142
scale(day)-BHVI       0.2027 0.1235 -0.0302  0.2036  0.4369 1.0012 1500
scale(day)-BLBW      -0.2200 0.0680 -0.3554 -0.2200 -0.0909 1.0142 1500
scale(day)-BTBW       0.2695 0.0694  0.1320  0.2708  0.4074 1.0004 1500
scale(day)-BTNW       0.1479 0.0663  0.0155  0.1498  0.2758 1.0063 1500
scale(day)-CAWA      -0.0173 0.1730 -0.3621 -0.0167  0.3156 1.0277 1500
scale(day)-MAWA       0.1066 0.1230 -0.1337  0.1044  0.3420 0.9999 1500
scale(day)-NAWA       0.0150 0.1769 -0.3379  0.0168  0.3596 0.9993 1500
scale(day)-REVI      -0.0728 0.0678 -0.2024 -0.0740  0.0558 1.0059 1347
scale(tod)-OVEN      -0.0424 0.0718 -0.1859 -0.0420  0.1030 1.0002 1500
scale(tod)-BLPW       0.0924 0.1429 -0.1794  0.0861  0.3785 0.9999 1659
scale(tod)-AMRE      -0.0429 0.2246 -0.4729 -0.0417  0.4225 1.0092 1500
scale(tod)-BAWW      -0.1761 0.1326 -0.4448 -0.1735  0.0728 1.0033 1272
scale(tod)-BHVI      -0.0519 0.1133 -0.2855 -0.0506  0.1619 0.9995 1500
scale(tod)-BLBW       0.0542 0.0664 -0.0807  0.0544  0.1840 1.0025 1500
scale(tod)-BTBW      -0.0324 0.0667 -0.1605 -0.0314  0.0960 1.0060 1790
scale(tod)-BTNW       0.0361 0.0645 -0.0839  0.0371  0.1600 1.0001 1076
scale(tod)-CAWA      -0.2184 0.1691 -0.5704 -0.2047  0.0911 1.0062 1500
scale(tod)-MAWA       0.0083 0.1122 -0.2102  0.0058  0.2340 1.0011 1673
scale(tod)-NAWA      -0.0854 0.1630 -0.4101 -0.0857  0.2299 1.0072 1500
scale(tod)-REVI      -0.0742 0.0648 -0.2004 -0.0768  0.0493 1.0034 1624
I(scale(day)^2)-OVEN  0.0203 0.0907 -0.1509  0.0193  0.1943 1.0007 1500
I(scale(day)^2)-BLPW  0.2047 0.1879 -0.1301  0.1994  0.5952 1.0117 1500
I(scale(day)^2)-AMRE -0.1214 0.2292 -0.6184 -0.1070  0.2963 1.0024 1500
I(scale(day)^2)-BAWW -0.0338 0.1467 -0.3225 -0.0290  0.2552 1.0440  975
I(scale(day)^2)-BHVI  0.0673 0.1322 -0.1893  0.0672  0.3244 0.9997 1439
I(scale(day)^2)-BLBW -0.1737 0.0777 -0.3240 -0.1747 -0.0190 0.9994 1500
I(scale(day)^2)-BTBW -0.0552 0.0795 -0.2172 -0.0538  0.0984 1.0097 1500
I(scale(day)^2)-BTNW -0.0602 0.0812 -0.2189 -0.0581  0.1016 1.0039 1500
I(scale(day)^2)-CAWA -0.0277 0.1841 -0.4067 -0.0193  0.3241 1.0087 1500
I(scale(day)^2)-MAWA  0.0108 0.1393 -0.2647  0.0095  0.2861 1.0019 1458
I(scale(day)^2)-NAWA -0.1359 0.1890 -0.5249 -0.1284  0.2144 0.9994 1500
I(scale(day)^2)-REVI  0.0411 0.0789 -0.1198  0.0414  0.1954 1.0057 1500

We see the summary() function displays the posterior mean, standard deviation, and posterior quantiles (2.5%, 50%, and 97.5%) for a quick summarization of model findings, with all summaries of parameters on the logit scale. Note that all spOccupancy summary() functions have a quantiles argument where you can supply the specific quantiles you want to be displayed in the summary output (by default, this is set to quantiles = c(0.025, 0.5, 0.975)). Looking at the community-level parameters, we see there is large variation in average occurrence (i.e., the occurrence intercept) across the study region, and more moderate variation in the effect of elevation on occurrence of the 12 bird species across the region. On average, bird occurrence in the community tends to peak at mid-level elevations (i.e., the community-level quadratic effect of elevation is negative).

Additionally, summary() returns Rhat (the Gelman-Rubin diagnostic; Brooks and Gelman (1998)) as well as the effective sample size (ESS) for convergence assessments. Here we see most Rhat values are less than 1.1 and the ESS values are sufficiently large. For a complete analysis, we would run the model for longer to ensure all Rhat values were less than 1.1 and ESS values were sufficiently large. Further, we can use the coda::plot() function to plot traceplots of the individual model parameters that are contained in the resulting lfMsPGOcc object. All posterior samples are stored in objects that end in “samples” in the resulting out.lfMsPGOcc object.

# Check out traceplot of the community-level occurrence means. 
plot(out.lfMsPGOcc$beta.comm.samples, density = FALSE)

The summary() function does not present any information on the latent factor loadings or latent factors, but the full posterior samples are available in the lambda.samples and w.samples tags in the out.lfMsPGOcc object, respectively. Below we display the posterior summaries of the latent factor loadings.

summary(out.lfMsPGOcc$lambda.samples)

Iterations = 1:1500
Thinning interval = 1 
Number of chains = 1 
Sample size per chain = 1500 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

            Mean     SD Naive SE Time-series SE
OVEN-1  1.000000 0.0000  0.00000        0.00000
BLPW-1 -0.882363 0.4511  0.01165        0.02217
AMRE-1 -0.637875 0.7588  0.01959        0.05350
BAWW-1  0.024814 0.7766  0.02005        0.05943
BHVI-1 -1.716477 0.6326  0.01633        0.04218
BLBW-1 -0.067492 0.6211  0.01604        0.05822
BTBW-1  1.496820 0.4924  0.01271        0.02859
BTNW-1  1.021993 0.4880  0.01260        0.03120
CAWA-1 -0.645987 0.5674  0.01465        0.04594
MAWA-1 -1.932342 0.5223  0.01349        0.03154
NAWA-1 -1.657577 0.6200  0.01601        0.04169
REVI-1  0.731117 0.4818  0.01244        0.03350
OVEN-2  0.000000 0.0000  0.00000        0.00000
BLPW-2  1.000000 0.0000  0.00000        0.00000
AMRE-2  0.008694 1.0353  0.02673        0.12548
BAWW-2  0.001111 0.9166  0.02367        0.08514
BHVI-2  0.226287 0.8916  0.02302        0.08868
BLBW-2  0.189532 1.0451  0.02698        0.12456
BTBW-2 -0.201146 0.6622  0.01710        0.04808
BTNW-2 -0.079416 0.9649  0.02491        0.10007
CAWA-2 -0.225772 0.9260  0.02391        0.10478
MAWA-2  0.160195 0.6962  0.01798        0.07025
NAWA-2  0.012736 0.9140  0.02360        0.11989
REVI-2  0.071562 0.7949  0.02052        0.07954

2. Quantiles for each variable:

          2.5%     25%        50%     75%    97.5%
OVEN-1  1.0000  1.0000  1.000e+00  1.0000  1.00000
BLPW-1 -1.7738 -1.1818 -8.703e-01 -0.5789 -0.03538
AMRE-1 -2.1539 -1.1327 -6.248e-01 -0.1693  0.94785
BAWW-1 -1.5394 -0.4532  6.063e-02  0.5014  1.49338
BHVI-1 -2.9867 -2.1184 -1.722e+00 -1.3356 -0.34155
BLBW-1 -1.4868 -0.4116 -4.060e-02  0.2989  1.11847
BTBW-1  0.6056  1.1460  1.486e+00  1.8326  2.49709
BTNW-1  0.1507  0.6859  9.871e-01  1.3286  2.09466
CAWA-1 -1.9103 -0.9899 -6.086e-01 -0.2511  0.41199
MAWA-1 -3.0176 -2.2753 -1.890e+00 -1.5646 -1.03734
NAWA-1 -3.0121 -2.0421 -1.616e+00 -1.2145 -0.57469
REVI-1 -0.1109  0.3913  6.897e-01  1.0115  1.78439
OVEN-2  0.0000  0.0000  0.000e+00  0.0000  0.00000
BLPW-2  1.0000  1.0000  1.000e+00  1.0000  1.00000
AMRE-2 -2.0837 -0.7007  3.519e-05  0.7192  2.06135
BAWW-2 -1.7059 -0.6712  3.792e-03  0.6437  1.80803
BHVI-2 -1.7020 -0.3498  2.549e-01  0.8181  1.95112
BLBW-2 -1.9739 -0.4545  2.906e-01  0.9197  1.99340
BTBW-2 -1.4853 -0.6514 -2.150e-01  0.2411  1.11520
BTNW-2 -2.0424 -0.6893 -5.008e-02  0.5353  1.81312
CAWA-2 -2.1012 -0.8626 -2.003e-01  0.4129  1.59162
MAWA-2 -1.2693 -0.3094  1.727e-01  0.6293  1.53020
NAWA-2 -1.7374 -0.6092 -7.732e-03  0.6351  1.83879
REVI-2 -1.4076 -0.4847  6.178e-02  0.5960  1.72262

The latent factor loadings and latent factors can provide information on the additional environmental drivers of species occurrence patterns, and what species are respondingly similarly to these environmental gradients. See Figure 2 in Doser, Finley, Banerjee (2022) for an example using North American Breeding Bird Survey data.

As previously discussed, determining the number of factors to include in the model is not straightforward. However, looking at the posterior summaries of the latent factor loadings can provide information on how many factors are necessary for the given data set. In particular, we can look at the posterior mean or median of the latent factor loadings for each factor. If the factor loadings for all species are very close to zero for a given factor, that suggests that factor is not an important driver of species-specific occurrence across space, and thus you may consider removing it from the model. Additionally, we can look at the 95% credible intervals, and if the 95% credible intervals for the factor loadings of all species for a specific factor all contain zero this is further support to reduce the number of factors in the model. In our HBEF example, we see the factor loadings for the second factor for all species are very close to zero, and zero is contained in all 95% credible intervals. On the other hand, the estimated factor loadings for the first factor range from significantly positive to significantly negative values for different species, indicating it is an important driver of occurrence across space. Given these findings, we refit the model below with only a single latent factor. Note also that we fit the model with default initial values and priors, and when we set verbose = TRUE this information is clearly printed out to the screen.

# Use default initial values and priors
# Approx. run time: 75 seconds
out.lfMsPGOcc.2 <- lfMsPGOcc(occ.formula = occ.formula, det.formula = det.formula, 
                             data = hbef.ordered, n.factors = 1, n.samples = n.samples, 
                             n.omp.threads = 1, verbose = TRUE, n.report = 1000, 
                             n.burn = n.burn, n.thin = n.thin, n.chains = n.chains)
----------------------------------------
    Preparing to run the model
----------------------------------------
No prior specified for beta.comm.normal.
Setting prior mean to 0 and prior variance to 2.72
No prior specified for alpha.comm.normal.
Setting prior mean to 0 and prior variance to 2.72
No prior specified for tau.sq.beta.ig.
Setting prior shape to 0.1 and prior scale to 0.1
No prior specified for tau.sq.alpha.ig.
Setting prior shape to 0.1 and prior scale to 0.1
z is not specified in initial values.
Setting initial values based on observed data
beta.comm is not specified in initial values.
Setting initial values to random values from the prior distribution
alpha.comm is not specified in initial values.
Setting initial values to random values from the prior distribution
tau.sq.beta is not specified in initial values.
Setting initial values to random values between 0.5 and 10
tau.sq.alpha is not specified in initial values.
Setting initial values to random values between 0.5 and 10
beta is not specified in initial values.
Setting initial values to random values from the community-level normal distribution
alpha is not specified in initial values.
Setting initial values to random values from the community-level normal distribution
lambda is not specified in initial values.
Setting initial values of the lower triangle to random values from a standard normal
----------------------------------------
    Model description
----------------------------------------
Latent Factor Multispecies Occupancy Model with Polya-Gamma latent
variable fit with 373 sites and 12 species.

Samples per Chain: 5000 
Burn-in: 1000 
Thinning Rate: 8 
Number of Chains: 3 
Total Posterior Samples: 1500 

Using 1 latent factors.

Source compiled with OpenMP support and model fit using 1 thread(s).

----------------------------------------
    Chain 1
----------------------------------------
Sampling ... 
Sampled: 1000 of 5000, 20.00%
-------------------------------------------------
Sampled: 2000 of 5000, 40.00%
-------------------------------------------------
Sampled: 3000 of 5000, 60.00%
-------------------------------------------------
Sampled: 4000 of 5000, 80.00%
-------------------------------------------------
Sampled: 5000 of 5000, 100.00%
----------------------------------------
    Chain 2
----------------------------------------
Sampling ... 
Sampled: 1000 of 5000, 20.00%
-------------------------------------------------
Sampled: 2000 of 5000, 40.00%
-------------------------------------------------
Sampled: 3000 of 5000, 60.00%
-------------------------------------------------
Sampled: 4000 of 5000, 80.00%
-------------------------------------------------
Sampled: 5000 of 5000, 100.00%
----------------------------------------
    Chain 3
----------------------------------------
Sampling ... 
Sampled: 1000 of 5000, 20.00%
-------------------------------------------------
Sampled: 2000 of 5000, 40.00%
-------------------------------------------------
Sampled: 3000 of 5000, 60.00%
-------------------------------------------------
Sampled: 4000 of 5000, 80.00%
-------------------------------------------------
Sampled: 5000 of 5000, 100.00%

Posterior Predictive Checks

The spOccupancy function ppcOcc() performs a posterior predictive check for all spOccupancy model objects as an assessment of Goodness of Fit (GoF). The key idea of GoF testing is that a good model should generate data that closely align with the observed data. If there are large differences in the observed data from the model-generated data, our model is likely not very useful (Hooten and Hobbs 2015). We can use the ppcOcc() and summary() functions to generate a Bayesian p-value as a quick assessment of model fit. A Bayesian p-value that hovers around 0.5 indicates adequate model fit, while values less than 0.1 or greater than 0.9 suggest our model does not fit the data well. ppcOcc will return an overall Bayesian p-value for the entire community, as well as an individual Bayesian p-value for each species. See the introductory spOccupancy vignette and the ppcOcc() help page for additional details. Below we perform a posterior predictive check with the Freeman-Tukey statistic, grouping the data by individual sites.

# Approx run time: 30 seconds
ppc.out <- ppcOcc(out.lfMsPGOcc.2, fit.stat = 'freeman-tukey', group = 1)
Currently on species 1 out of 12
Currently on species 2 out of 12
Currently on species 3 out of 12
Currently on species 4 out of 12
Currently on species 5 out of 12
Currently on species 6 out of 12
Currently on species 7 out of 12
Currently on species 8 out of 12
Currently on species 9 out of 12
Currently on species 10 out of 12
Currently on species 11 out of 12
Currently on species 12 out of 12
# Calculate Bayesian p-values
summary(ppc.out)

Call:
ppcOcc(object = out.lfMsPGOcc.2, fit.stat = "freeman-tukey", 
    group = 1)

Samples per Chain: 5000
Burn-in: 1000
Thinning Rate: 8
Number of Chains: 3
Total Posterior Samples: 1500

----------------------------------------
    Community Level
----------------------------------------
Bayesian p-value:  0.4933 

----------------------------------------
    Species Level
----------------------------------------
OVEN Bayesian p-value: 0.3673
BLPW Bayesian p-value: 0.644
AMRE Bayesian p-value: 0.5893
BAWW Bayesian p-value: 0.6693
BHVI Bayesian p-value: 0.8627
BLBW Bayesian p-value: 0.2547
BTBW Bayesian p-value: 0.58
BTNW Bayesian p-value: 0.2633
CAWA Bayesian p-value: 0.29
MAWA Bayesian p-value: 0.464
NAWA Bayesian p-value: 0.4747
REVI Bayesian p-value: 0.46
Fit statistic:  freeman-tukey 

Here, our overall Bayesian p-value for the full community is close to 0.5, and the individual species Bayesian p-values also indicate adequate model fit.

Model Selection using WAIC

The spOccupancy function waicOCC() calculates the Widely Applicable Information Criterion (Watanabe 2010) for all spOccupancy fitted model objects. The WAIC is a useful fully Bayesian information criterion that is adequate for comparing a set of hierarchical models and selecting the best-performing model for final analysis (see the introductory spOccupancy vignette for further details on WAIC implementation in spOccupancy). Smaller values of WAIC indicate a better performing model. Below, we fit a multi-species occupancy model without species correlations using the msPGOcc() function, and subsequently compare the model to the model that does account for species correlations. Syntax for the msPGOcc() function is identical to that for lfMsPGOcc(), with the exception of the n.factors argument no longer being included (since species correlations are not accommodated).

# Approx run time: 71 seconds
out.msPGOcc <- msPGOcc(occ.formula = occ.formula, det.formula = det.formula, 
                       data = hbef.ordered, inits = inits, priors = priors, 
                       n.samples = n.samples, n.omp.threads = 1, 
                       verbose = TRUE, n.report = 1000, n.burn = n.burn, 
                       n.thin = n.thin, n.chains = n.chains)
----------------------------------------
    Preparing to run the model
----------------------------------------
----------------------------------------
    Model description
----------------------------------------
Multispecies Occupancy Model with Polya-Gamma latent
variable fit with 373 sites and 12 species.

Samples per Chain: 5000 
Burn-in: 1000 
Thinning Rate: 8 
Number of Chains: 3 
Total Posterior Samples: 1500 


Source compiled with OpenMP support and model fit using 1 thread(s).

----------------------------------------
    Chain 1
----------------------------------------
Sampling ... 
Sampled: 1000 of 5000, 20.00%
-------------------------------------------------
Sampled: 2000 of 5000, 40.00%
-------------------------------------------------
Sampled: 3000 of 5000, 60.00%
-------------------------------------------------
Sampled: 4000 of 5000, 80.00%
-------------------------------------------------
Sampled: 5000 of 5000, 100.00%
----------------------------------------
    Chain 2
----------------------------------------
Sampling ... 
Sampled: 1000 of 5000, 20.00%
-------------------------------------------------
Sampled: 2000 of 5000, 40.00%
-------------------------------------------------
Sampled: 3000 of 5000, 60.00%
-------------------------------------------------
Sampled: 4000 of 5000, 80.00%
-------------------------------------------------
Sampled: 5000 of 5000, 100.00%
----------------------------------------
    Chain 3
----------------------------------------
Sampling ... 
Sampled: 1000 of 5000, 20.00%
-------------------------------------------------
Sampled: 2000 of 5000, 40.00%
-------------------------------------------------
Sampled: 3000 of 5000, 60.00%
-------------------------------------------------
Sampled: 4000 of 5000, 80.00%
-------------------------------------------------
Sampled: 5000 of 5000, 100.00%
# Compute WAIC for the the latent factor multi-species occupancy model.
waicOcc(out.lfMsPGOcc.2)
      elpd         pD       WAIC 
-4354.6047   184.8315  9078.8723 
# Compute WAIC for the basic multi-species occupancy model.
waicOcc(out.msPGOcc)
       elpd          pD        WAIC 
-4531.69679    65.77426  9194.94210 

Here, we see the WAIC for the latent factor multi-species occupancy model is substantially lower than the WAIC for the multi-species occupancy model, suggesting that accommodating residual species correlations leads to improved model performance in the foliage-gleaning bird data set.

Prediction

Finally, we can use the predict() function with all spOccupancy model-fitting functions to generate a series of posterior predictive samples at new locations, given a set of covariates and their spatial locations. spOccupancy supports prediction of both new occurrence values at a set of spatial locations and as of v0.3.0, spOccupancy supports predictions of detection probability over a range of covariate values.

First, we show how to use predict() to create a map of species richness across HBEF. The object hbefElev (which comes as part of the spOccupancy package) contains elevation data at a 30x30m resolution from the National Elevation Dataset across the entire HBEF. We load the data below.

data(hbefElev)
str(hbefElev)
'data.frame':   46090 obs. of  3 variables:
 $ val     : num  914 916 918 920 922 ...
 $ Easting : num  276273 276296 276318 276340 276363 ...
 $ Northing: num  4871424 4871424 4871424 4871424 4871424 ...

The column val contains the elevation values, while Easting and Northing contain the spatial coordinates of the prediction sites. Below we standardize our new elevation values using the mean and standard deviation of the elevation values we used to fit the data, and then predict occurrence for each species across all 46090 spatial locations. The out.pred object consists of posterior predictive samples of the latent occurrence probability (psi.0.samples) as well as the latent occurrence state (z.0.samples). We can calculate species richness as a derived quantity by summing up the latent occurrence states for each species at each MCMC sample.

# Not run (note this takes a large amount of memory to run).
elev.pred <- (hbefElev$val - mean(hbef2015$occ.covs[, 1])) / sd(hbef2015$occ.covs[, 1])
# Order: intercept, elevation (linear), elevation (quadratic)
X.0 <- cbind(1, elev.pred, elev.pred^2) 
# Spatial coordinates
coords.0 <- as.matrix(hbefElev[, c('Easting', 'Northing')])
# type = 'occupancy' specifies prediction of occupancy (or occurrence). 
# This is also the default.
# Approximate run time: 30 sec
out.pred <- predict(out.lfMsPGOcc, X.0, coords.0, type = 'occupancy')
str(out.pred)
# Species richness samples
rich.pred <- apply(out.pred$z.0.samples, c(1, 3), sum)
plot.dat <- data.frame(x = hbefElev$Easting, 
                       y = hbefElev$Northing, 
                       rich.mean = apply(rich.pred, 2, mean), 
                       rich.sd = apply(rich.pred, 2, sd))
# Plot species richness of the foliage-gleaning bird community
# across the Hubbard Brook Experimental Forest
dat.stars <- st_as_stars(plot.dat, dims = c('x', 'y'))
ggplot() + 
  geom_stars(data = dat.stars, aes(x = x, y = y, fill = rich.mean)) +
  scale_fill_viridis_c(na.value = 'transparent') +
  labs(x = 'Easting', y = 'Northing', fill = '', 
       title = 'Mean Species Richness') +
  theme_bw()

Note that when predicting new values using a latent factor multi-species occupancy model (or a latent factor joint species distribution model as we will see with lfJSDM()), we make predictions at both sampled and non-sampled locations using the latent factors. At sampled locations, we directly use the posterior samples from the model fit in the prediction algorithm, which will generally lead to improved predictions at the sampled sites compared to a multi-species model that does not account for species correlations and incorporate these factors. At non-sampled sites, we do not know the value of the latent factors, and so we simply draw random values from a standard normal distribution at each iteration of the posterior predictive sampling algorithm. Because these values are drawn form a normal distribution with a mean of 0, including these predictions of the latent factors at new sites will not change the overall mean estimate of occurrence probability at that site, but it will account for the uncertainty we have in the latent factor values, and thus will fully propagate uncertainty from our model fit to the resulting predictions.

Finally, we can generate predicted values of detection probability across a range of covariate values to generate a marginal response curve for each individual species across any given covariate value. Below we predict and plot the relationships between detection probability and time of day for each of the 12 species, while holding the day of the year at it’s mean value (0).

# Minimum observed time of day
min.tod <- min(hbef2015$det.covs$tod, na.rm = TRUE)
# Maximum
max.tod <- max(hbef2015$det.covs$tod, na.rm = TRUE)
# Generate 100 time of day values between the observed range
J.0 <- 100
tod.pred.vals <- seq(from = min.tod, to = max.tod, length.out = J.0)
# Standardize the new values by the mean and sd of the data
# used to fit the model. 
mean.tod <- mean(hbef2015$det.covs$tod, na.rm = TRUE)
sd.tod <- sd(hbef2015$det.covs$tod, na.rm = TRUE)
tod.stand <- (tod.pred.vals - mean.tod) / sd.tod
# Generate covariate matrix for prediction
X.p.0 <- cbind(1, 0, tod.stand, 0)
colnames(X.p.0) <- c('intercept', 'day', 'tod', 'day2')
out.det.pred <- predict(out.lfMsPGOcc, X.p.0, type = 'detection')
str(out.det.pred)
List of 2
 $ p.0.samples: num [1:1500, 1:12, 1:100] 0.697 0.68 0.717 0.617 0.71 ...
 $ call       : language predict.lfMsPGOcc(object = out.lfMsPGOcc, X.0 = X.p.0, type = "detection")
 - attr(*, "class")= chr "predict.lfMsPGOcc"

The p.0.samples tag in the out.det.pred object consists of the posterior predictive samples of detection probability for each species across the 100 generated time of day values. We finally create a marginal response curve for each species using ggplot2.

# Extract the means from the posterior samples and convert to vector
p.0.ests <- c(apply(out.det.pred$p.0.samples, c(2, 3), mean))
p.plot.dat <- data.frame(det.prob = p.0.ests, 
                         sp = rep(sp.names, J.0), 
                         tod = rep(tod.pred.vals, each = N))
ggplot(p.plot.dat, aes(x = tod, y = det.prob)) + 
  geom_line() + 
  theme_bw() + 
  scale_y_continuous(limits = c(0, 1)) + 
  facet_wrap(vars(sp)) + 
  labs(x = 'Time of day (min since sunrise)', y = 'Detection Probability') 

The relatively flat lines here for most species indicates that detection probability does not vary to a large extent across the time of day range that is sampled in the data, although there is some apparent variability in the effect across species (e.g., BAWW vs. MAWA).

Spatial factor multi-species occupancy models

Basic model description

While the latent factor multi-species occupancy model accounts for species correlations and imperfect detection, it fails to address spatial autocorrelation. The spatial factor multi-species occupancy model is identical to the latent factor multi-species occupancy model (lfMsPGOcc()), except the latent factors are now assumed to arise from a spatial process rather than a standard normal distribution, which accounts for spatial autocorrelation in latent species occurrence. More specifically, each latent factor (now called a spatial factor) \(\textbf{w}_r\) for each \(r = 1, \dots, q\) is modeled using a Nearest Neighbor Gaussian Process (Datta et al. 2016), i.e.,

\[\begin{equation} \text{w}_r(\boldsymbol{s}_j) \sim N(\boldsymbol{0}, \tilde{\boldsymbol{C}}_r(\boldsymbol{\theta}_r)), \end{equation}\]

where \(\tilde{\boldsymbol{C}}_r(\boldsymbol{\theta}_r)\) is the NNGP-derived covariance matrix for the \(r^{\text{th}}\) spatial factor. The vector \(\boldsymbol{\theta}_r\) consists of parameters governing the spatial process according to a spatial correlation function (Banerjee, Carlin, and Gelfand 2014). spOccupancy implements four spatial correlation functions: exponential, spherical, Gaussian, and Matern. For the exponential, spherical, and Gaussian functions, \(\boldsymbol{\theta}_r\) includes a spatial variance parameter, \(\sigma^2_r\), and a spatial range parameter, \(\phi_r\), while the Matern correlation function includes an additional spatial smoothness parameter, \(\nu_r\).

We assume the same priors and identifiability constraints as the latent factor multi-species occupancy model. We assign a uniform prior the spatial range parameters, \(\phi_r\), and the spatial smoothness parameters, \(\nu_r\), if using a Matern correlation function.

Fitting spatial factor multi-species occupancy models with sfMsPGOcc

The function sfMsPGOcc() fits spatial factor multi-species occupancy models using Pólya-Gamma data augmentation. sfMsPGOcc() has the following arguments:

sfMsPGOcc(occ.formula, det.formula, data, inits, priors, 
          tuning, cov.model = 'exponential', NNGP = TRUE, 
          n.neighbors = 15, search.type = "cb", n.factors, 
          n.batch, batch.length, accept.rate = 0.43,
          n.omp.threads = 1, verbose = TRUE, n.report = 100, 
          n.burn = round(.10 * n.batch * batch.length), 
          n.thin = 1, n.chains = 1, k.fold, k.fold.threads = 1, 
          k.fold.seed = 100, ...){

We will walk through each of the arguments to sfMsPGOcc() in the context of our HBEF example. The occurrence (occ.formula) and detection (det.formula) formulas, as well as the list of data (data) follow the same form as we saw in lfMsPGOcc(). We will specify these again below for clarity.

occ.formula <- ~ scale(Elevation) + I(scale(Elevation)^2)
det.formula <- ~ scale(day) + scale(tod) + I(scale(day)^2)
# Remind ourselves what the data look like
str(hbef.ordered)
List of 4
 $ y       : num [1:12, 1:373, 1:3] 1 0 0 0 0 0 1 0 0 0 ...
  ..- attr(*, "dimnames")=List of 3
  .. ..$ : chr [1:12] "OVEN" "BLPW" "AMRE" "BAWW" ...
  .. ..$ : chr [1:373] "1" "2" "3" "4" ...
  .. ..$ : chr [1:3] "1" "2" "3"
 $ occ.covs: num [1:373, 1] 475 494 546 587 588 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr "Elevation"
 $ det.covs:List of 2
  ..$ day: num [1:373, 1:3] 156 156 156 156 156 156 156 156 156 156 ...
  ..$ tod: num [1:373, 1:3] 330 346 369 386 409 425 447 463 482 499 ...
 $ coords  : num [1:373, 1:2] 280000 280000 280000 280001 280000 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:373] "1" "2" "3" "4" ...
  .. ..$ : chr [1:2] "X" "Y"

Following our findings from using lfMsPGOcc(), we will use 1 latent spatial factor.

n.factors <- 1

We will next specify the initial values in the inits argument. Just as before, this argument is optional as spOccupancy will by default set the initial values based on the prior distributions. Valid tags for initial values include all the parameters described for the latent factor multi-species occupancy model using lfMsPGOcc() as well as parameters associated with the spatial latent processes. These include: phi (the spatial range parameter) and nu (the spatial smoothness parameter), where the latter is only specified if adopting a Matern covariance function (i.e., cov.model = 'matern'). spOccupancy supports four spatial covariance models (exponential, spherical, gaussian, and matern), which are specified in the cov.model argument. Here we will use an exponential correlation function. When using an exponential correlation function, \(\frac{3}{\phi}\) is the effective range, or the distance at which the residual spatial correlation between two sites drops to 0.05 (Banerjee, Carlin, and Gelfand 2014). As an initial value for the spatial range parameter phi, we compute the mean distance between points in HBEF and then set it equal to 3 divided by this mean distance. Thus, our initial guess for the effective range is the average distance between sites across HBEF. We will set all other initial values to the same values we used for lfMsPGOcc().

# Pair-wise distance between all sites
dist.hbef <- dist(hbef.ordered$coords)
# Exponential correlation model
cov.model <- "exponential"
# Specify all other initial values identical to lfMsPGOcc() from before
# Number of species
N <- nrow(hbef.ordered$y)
# Initiate all lambda initial values to 0. 
lambda.inits <- matrix(0, N, n.factors)
# Set diagonal elements to 1
diag(lambda.inits) <- 1
# Set lower triangular elements to random values from a standard normal dist
lambda.inits[lower.tri(lambda.inits)] <- rnorm(sum(lower.tri(lambda.inits)))
# Check it out
lambda.inits
             [,1]
 [1,]  1.00000000
 [2,]  0.08726913
 [3,]  1.04191535
 [4,] -0.57058400
 [5,] -1.03868418
 [6,]  0.57980601
 [7,] -0.91943441
 [8,]  1.54189049
 [9,]  0.39333145
[10,]  1.75861735
[11,]  2.15505834
[12,]  0.11110001
# Create list of initial values. 
inits <- list(alpha.comm = 0,
              beta.comm = 0,
              beta = 0,
              alpha = 0,
              tau.sq.beta = 1,
              tau.sq.alpha = 1,
              lambda = lambda.inits, 
              phi = 3 / mean(dist.hbef),
              z = apply(hbef.ordered$y, c(1, 2), max, na.rm = TRUE))

The next three arguments (n.batch, batch.length, and accept.rate) are all related to the Adaptive MCMC sampler used when we fit the model. Updates for the spatial range parameter (and the smoothness parameter if cov.model = 'matern') require the use of a Metropolis-Hastings algorithm. We implement an adaptive Metropolis-Hastings algorithm as discussed in Roberts and Rosenthal (2009). This algorithm adjusts the tuning values for each parameter that requires a Metropolis-Hastings update within the sampler itself. This process results in a more efficient sampler than if we were to fix the tuning parameters prior to fitting the model. The parameter accept.rate is the target acceptance rate for each parameter, and the algorithm will adjust the tuning parameters to hover around this value. The default value is 0.43, which we suggest leaving as is unless you have a good reason to change it. The tuning parameters are updated after a single “batch”. In lfMsPGOcc(), we specified an n.samples argument which consisted of the total number of samples to run each chain of the MCMC. For sfMsPGOcc() (and all spatially-explicit models in spOccupancy), we break up the total number of MCMC samples into a set of “batches”, where each batch has a specific number of samples. We must specify both the total number of batches (n.batch) as well as the number of MCMC samples each batch contains (batch.length). Thus, the total number of MCMC samples is n.batch * batch.length. Typically, we set batch.length = 25 and then play around with n.batch until convergence is reached. We recommend keeping this at 25 unless you have a specific reason to change it. Here we set n.batch = 200 for a total of 5000 MCMC samples in each of 3 chains. We will additionally specify a burn-in period of length 3000 and a thinning rate of 2. Importantly, we also need to specify an initial value for the tuning parameters for the spatial decay and smoothness parameters (if applicable). These values are supplied as input in the form of a list with tags phi and nu. The initial tuning value can be any value greater than 0, but we recommend starting the value out around 0.5. After some initial runs of the model, if you notice the final acceptance rate of a parameter is much larger or smaller than the target acceptance rate (accept.rate), you can then change the initial tuning value to get closer to the target rate. Here we set the initial tuning value for phi to 1 after some initial exploratory runs of the model.

batch.length <- 25
n.batch <- 200
n.burn <- 3000
n.thin <- 2
n.chains <- 3

Priors are again specified in a list in the priors argument. We assume uniform priors for the spatial decay parameter phi and smoothness parameter nu (if using the Matern correlation function), with the associated tags phi.unif and nu.unif. The lower and upper bounds of the uniform distribution are passed as a two-element vector for the uniform priors.

Here we use an exponential correlation function, so we only need to specify priors for the spatial decay parameter phi for each of the spatial factors (which in this case is just 1). We recommend determining the bounds of the uniform distribution by computing the smallest distance between sites and the largest distance between sites in the observed data set. We then set the lower bound of the uniform to 3/max and the upper bound to 3/min, where min and max correspond to the predetermined distances between sites. This equates to a vague prior that states that spatial autocorrelation in the spatial factors could only be between sites that are very close to each other, or could span across the entire observed study area. We recommend using these bounds for the prior unless you have prior information about the range of the spatial autocorrelation. The remaining priors are identical to what we saw in lfMsPGOcc(). We use the same priors for all other parameters as those we used for lfMsPGOcc().

min.dist <- min(dist.hbef)
max.dist <- max(dist.hbef)
priors <- list(beta.comm.normal = list(mean = 0, var = 2.72),
               alpha.comm.normal = list(mean = 0, var = 2.72),
               tau.sq.beta.ig = list(a = 0.1, b = 0.1),
               tau.sq.alpha.ig = list(a = 0.1, b = 0.1), 
               phi.unif = list(3 / max.dist, 3 / min.dist))

Importantly, we also need to specify an initial value for the tuning parameters for the spatial decay and smoothness parameters (if applicable). These values are supplied as input in the form of a list with tags phi and nu. The initial tuning value can be any value greater than 0, but we recommend starting the value out around 0.5. After some initial runs of the model, if you notice the final acceptance rate of a parameter is much larger or smaller than the target acceptance rate (accept.rate), you can then change the initial tuning value to get closer to the target rate. Here we set the initial tuning value for phi to 1 after some initial exploratory runs of the model.

tuning <- list(phi = 1)

The argument n.omp.threads specifies the number of threads to use for within-chain parallelization, which can greatly decrease run time for spatially-explicit models (Finley, Datta, and Banerjee 2020), while verbose specifies whether or not to print the progress of the sampler. We highly recommend setting verbose = TRUE for all spatial models to ensure the adaptive MCMC is working as you want (and this is the reason for why this is the default for this argument). The argument n.report specifies the interval to report the Metropolis-Hastings sampler acceptance rate. Note that n.report is specified in terms of batches, not the overall number of samples. Below we set n.report = 50, which will result in information on the acceptance rate and tuning parameters every 50th batch (not sample). Ideally, you should see the printed acceptance rate values however around the value supplied to the accept.rate argument (which by default is 0.43). If by the end of the MCMC run you see the values are well below the target acceptance rate, we recommend rerunning the model with a larger initial tuning parameter (higher than the final reported value in the displayed output of model progress). If you see the values are well above the target acceptance rate, we recommend rerunning the model with a smaller initial tuning parameter (smaller than the final reported value).

n.omp.threads <- 1
verbose <- TRUE
n.report <- 50 # Report progress at every 50th batch.

The parameters NNGP, n.neighbors, and search.type relate to whether or not you want to fit the model with a Gaussian Process or with NNGP, which is a much more computationally efficient approximation. The argument NNGP is a logical value indicating whether to fit the model with an NNGP (TRUE) or a regular Gaussian Process (FALSE). Currently, sfMsPGOcc() only supports NNGP models, so you will receive an error message if you set NNGP = FALSE that tells you to switch to NNGP = TRUE. We plan to implement these models with a full Gaussian Process in future development. The argument n.neighbors and search.type specify the number of neighbors used in the NNGP and the nearest neighbor search algorithm, respectively, to use for the NNGP model. Generally, the default values of these arguments will be adequate. Datta et al. (2016) showed that setting n.neighbors = 15 is usually sufficient, although for certain data sets a good approximation can be achieved with as few as five neighbors, which could substantially decrease run time for the model. We generally recommend leaving search.type = "cb", as this results in a fast code book nearest neighbor search algorithm. However, details on when you may want to change this are described in Finley, Datta, and Banerjee (2020). We will run an NNGP model using the default value for search.type and setting n.neighbors = 5. Here we choose 5 neighbors because we found in Doser et al. (2021) that estimates from a spatial multi-species occupancy model for this data set were relatively robust to the number of neighbors in the model. Generally, we recommend using the default value of n.neighbors = 15, and if additional decreases in computation time are desired, you can fit the model with n.neighbors = 5 and compare their performance using WAIC and/or k-fold cross-validation.

We now fit the model using sfMsPGOcc() and summarize the results using summary().

# Approx run time: 2 min
out.sfMsPGOcc <- sfMsPGOcc(occ.formula = occ.formula, 
                           det.formula = det.formula, 
                           data = hbef.ordered, 
                           inits = inits, 
                           n.batch = n.batch, 
                           batch.length = batch.length, 
                           accept.rate = 0.43, 
                           priors = priors, 
                           n.factors = n.factors,
                           cov.model = cov.model, 
                           tuning = tuning, 
                           n.omp.threads = n.omp.threads, 
                           verbose = TRUE, 
                           NNGP = TRUE, 
                           n.neighbors = 5, 
                           n.report = n.report, 
                           n.burn = n.burn, 
                           n.thin = n.thin, 
                           n.chains = n.chains)
----------------------------------------
    Preparing to run the model
----------------------------------------
----------------------------------------
    Building the neighbor list
----------------------------------------
----------------------------------------
Building the neighbors of neighbors list
----------------------------------------
----------------------------------------
    Model description
----------------------------------------
Spatial Factor NNGP Multispecies Occupancy Model with Polya-Gamma latent
variable fit with 373 sites and 12 species.

Samples per chain: 5000 (200 batches of length 25)
Burn-in: 3000 
Thinning Rate: 2 
Number of Chains: 3 
Total Posterior Samples: 3000 

Using the exponential spatial correlation model.

Using 1 latent spatial factors.
Using 5 nearest neighbors.

Source compiled with OpenMP support and model fit using 1 thread(s).

Adaptive Metropolis with target acceptance rate: 43.0
----------------------------------------
    Chain 1
----------------------------------------
Sampling ... 
Batch: 50 of 200, 25.00%
    Latent Factor   Parameter   Acceptance  Tuning
    1       phi     16.0        0.60050
-------------------------------------------------
Batch: 100 of 200, 50.00%
    Latent Factor   Parameter   Acceptance  Tuning
    1       phi     20.0        0.37908
-------------------------------------------------
Batch: 150 of 200, 75.00%
    Latent Factor   Parameter   Acceptance  Tuning
    1       phi     52.0        0.29820
-------------------------------------------------
Batch: 200 of 200, 100.00%
----------------------------------------
    Chain 2
----------------------------------------
Sampling ... 
Batch: 50 of 200, 25.00%
    Latent Factor   Parameter   Acceptance  Tuning
    1       phi     52.0        0.28083
-------------------------------------------------
Batch: 100 of 200, 50.00%
    Latent Factor   Parameter   Acceptance  Tuning
    1       phi     48.0        0.28083
-------------------------------------------------
Batch: 150 of 200, 75.00%
    Latent Factor   Parameter   Acceptance  Tuning
    1       phi     40.0        0.29820
-------------------------------------------------
Batch: 200 of 200, 100.00%
----------------------------------------
    Chain 3
----------------------------------------
Sampling ... 
Batch: 50 of 200, 25.00%
    Latent Factor   Parameter   Acceptance  Tuning
    1       phi     52.0        0.27527
-------------------------------------------------
Batch: 100 of 200, 50.00%
    Latent Factor   Parameter   Acceptance  Tuning
    1       phi     32.0        0.26982
-------------------------------------------------
Batch: 150 of 200, 75.00%
    Latent Factor   Parameter   Acceptance  Tuning
    1       phi     28.0        0.25924
-------------------------------------------------
Batch: 200 of 200, 100.00%
# Take a look at the resulting object
names(out.sfMsPGOcc)
 [1] "rhat"                 "beta.comm.samples"    "alpha.comm.samples"  
 [4] "tau.sq.beta.samples"  "tau.sq.alpha.samples" "beta.samples"        
 [7] "alpha.samples"        "lambda.samples"       "theta.samples"       
[10] "z.samples"            "w.samples"            "psi.samples"         
[13] "like.samples"         "X.p"                  "X.p.re"              
[16] "lambda.p"             "X.re"                 "lambda.psi"          
[19] "ESS"                  "X"                    "y"                   
[22] "call"                 "n.samples"            "x.names"             
[25] "sp.names"             "x.p.names"            "theta.names"         
[28] "type"                 "coords"               "cov.model.indx"      
[31] "n.neighbors"          "q"                    "n.post"              
[34] "n.thin"               "n.burn"               "n.chains"            
[37] "pRE"                  "psiRE"                "run.time"            
summary(out.sfMsPGOcc)

Call:
sfMsPGOcc(occ.formula = occ.formula, det.formula = det.formula, 
    data = hbef.ordered, inits = inits, priors = priors, tuning = tuning, 
    cov.model = cov.model, NNGP = TRUE, n.neighbors = 5, n.factors = n.factors, 
    n.batch = n.batch, batch.length = batch.length, accept.rate = 0.43, 
    n.omp.threads = n.omp.threads, verbose = TRUE, n.report = n.report, 
    n.burn = n.burn, n.thin = n.thin, n.chains = n.chains)

Samples per Chain: 5000
Burn-in: 3000
Thinning Rate: 2
Number of Chains: 3
Total Posterior Samples: 3000
Run Time (min): 2.4385

----------------------------------------
    Community Level
----------------------------------------
Occurrence Means (logit scale): 
                         Mean     SD    2.5%     50%  97.5%   Rhat ESS
(Intercept)            0.4138 0.9397 -1.4779  0.3957 2.3251 1.2572 305
scale(Elevation)       0.4409 0.5785 -0.6575  0.4315 1.6466 1.0221 694
I(scale(Elevation)^2) -0.2005 0.3121 -0.8185 -0.2045 0.4300 1.0172 507

Occurrence Variances (logit scale): 
                         Mean     SD   2.5%     50%   97.5%   Rhat ESS
(Intercept)           15.3672 8.9701 5.2263 13.1667 37.3060 1.1911 194
scale(Elevation)       4.1324 2.8989 1.2564  3.4288 11.5424 1.0062 187
I(scale(Elevation)^2)  1.0268 0.7498 0.2302  0.8291  2.9140 1.0933 159

Detection Means (logit scale): 
                   Mean     SD    2.5%     50%  97.5%   Rhat  ESS
(Intercept)     -0.8440 0.5077 -1.8520 -0.8361 0.1299 1.1440  655
scale(day)       0.0538 0.0865 -0.1116  0.0519 0.2276 1.0035 2085
scale(tod)      -0.0402 0.0783 -0.2038 -0.0382 0.1106 1.0043 1746
I(scale(day)^2) -0.0245 0.0874 -0.2030 -0.0236 0.1471 1.0071 1849

Detection Variances (logit scale): 
                  Mean     SD   2.5%    50%  97.5%   Rhat  ESS
(Intercept)     3.2253 1.9799 0.9487 2.7714 8.0954 1.6838   91
scale(day)      0.0661 0.0392 0.0237 0.0565 0.1665 1.0074 2360
scale(tod)      0.0513 0.0330 0.0165 0.0425 0.1409 1.0033 1895
I(scale(day)^2) 0.0594 0.0418 0.0182 0.0486 0.1654 1.0049 2009

----------------------------------------
    Species Level
----------------------------------------
Occurrence (logit scale): 
                              Mean     SD    2.5%     50%   97.5%   Rhat ESS
(Intercept)-OVEN            2.2232 0.3314  1.5895  2.2180  2.8748 1.0701 177
(Intercept)-BLPW           -5.7697 1.0465 -8.0832 -5.6610 -3.9472 1.0100 107
(Intercept)-AMRE            1.2309 4.6123 -4.2302 -0.2234 10.8491 6.3902   4
(Intercept)-BAWW            3.3036 1.9751  0.3983  3.0504  8.2504 1.6138  28
(Intercept)-BHVI            0.1886 0.7891 -1.2358  0.1418  1.8634 1.0618  94
(Intercept)-BLBW            2.6816 0.5819  1.7909  2.5941  4.0177 1.1033 233
(Intercept)-BTBW            4.8988 0.9606  3.3064  4.8145  7.0738 1.1175  83
(Intercept)-BTNW            2.7775 0.6249  1.9130  2.6695  4.2350 1.1857  93
(Intercept)-CAWA           -1.9070 0.7719 -3.1182 -2.0276 -0.0458 1.4897  90
(Intercept)-MAWA           -2.9031 0.6689 -4.2598 -2.8512 -1.6852 1.0923 122
(Intercept)-NAWA           -2.9662 0.7356 -4.4512 -2.9766 -1.3863 1.0539 130
(Intercept)-REVI            3.2167 0.5093  2.3532  3.1647  4.3385 1.0479 252
scale(Elevation)-OVEN      -1.7234 0.3096 -2.3872 -1.7024 -1.1968 1.0083 320
scale(Elevation)-BLPW       3.1076 0.8360  1.8207  2.9639  5.0454 1.1226  83
scale(Elevation)-AMRE       1.0788 1.3709 -1.8998  1.0092  3.8415 1.2556  38
scale(Elevation)-BAWW      -1.0660 1.1156 -4.1174 -0.8335  0.4457 1.4923  45
scale(Elevation)-BHVI       0.5318 0.6454 -0.6206  0.4600  1.9835 1.0225 109
scale(Elevation)-BLBW      -0.4785 0.2256 -0.9609 -0.4619 -0.0797 1.0268 512
scale(Elevation)-BTBW      -0.4066 0.2933 -1.0163 -0.3921  0.1442 1.0263 186
scale(Elevation)-BTNW       0.7903 0.3206  0.2689  0.7513  1.5356 1.0517 272
scale(Elevation)-CAWA       2.4901 1.0073  1.0395  2.3054  4.9912 1.6194  61
scale(Elevation)-MAWA       2.4449 0.6415  1.3712  2.4044  3.8010 1.0154 124
scale(Elevation)-NAWA       1.1467 0.4250  0.4113  1.1183  2.0699 1.0113 258
scale(Elevation)-REVI      -2.2433 0.7679 -4.0150 -2.1394 -1.0763 1.0355  86
I(scale(Elevation)^2)-OVEN -0.4673 0.2336 -0.9013 -0.4776  0.0200 1.0079 397
I(scale(Elevation)^2)-BLPW  1.0045 0.5287 -0.0138  0.9963  2.0751 1.0390 183
I(scale(Elevation)^2)-AMRE -0.8195 0.8992 -2.5996 -0.7959  1.0328 1.2683  57
I(scale(Elevation)^2)-BAWW -1.2670 0.6803 -2.6098 -1.2460  0.3018 1.6690  91
I(scale(Elevation)^2)-BHVI  0.6318 0.5286 -0.2078  0.5625  1.9148 1.0919 108
I(scale(Elevation)^2)-BLBW -0.5973 0.1854 -0.9983 -0.5924 -0.2655 1.0772 425
I(scale(Elevation)^2)-BTBW -1.3991 0.3022 -2.0297 -1.3830 -0.8683 1.0601 134
I(scale(Elevation)^2)-BTNW -0.2038 0.2099 -0.6486 -0.1993  0.1901 1.0110 246
I(scale(Elevation)^2)-CAWA -0.3640 0.6941 -1.6325 -0.4155  1.1594 1.2152  89
I(scale(Elevation)^2)-MAWA  0.6933 0.4285 -0.0979  0.6780  1.5865 1.0528 256
I(scale(Elevation)^2)-NAWA  0.3176 0.3450 -0.3511  0.3166  0.9757 1.0241 279
I(scale(Elevation)^2)-REVI  0.0477 0.4042 -0.6370  0.0156  0.9618 1.0113 124

Detection (logit scale): 
                        Mean     SD    2.5%     50%   97.5%   Rhat  ESS
(Intercept)-OVEN      0.8300 0.1164  0.6087  0.8260  1.0617 1.0031 2790
(Intercept)-BLPW     -0.4649 0.2634 -0.9933 -0.4538  0.0274 1.0033 2029
(Intercept)-AMRE     -3.8784 1.5851 -6.1965 -4.2629 -0.7433 4.1472    7
(Intercept)-BAWW     -2.8206 0.3262 -3.4224 -2.8417 -2.1062 1.1753   75
(Intercept)-BHVI     -2.2355 0.2699 -2.7663 -2.2290 -1.7087 1.0502  156
(Intercept)-BLBW     -0.0464 0.1268 -0.3027 -0.0456  0.2008 1.0321  568
(Intercept)-BTBW      0.6366 0.1073  0.4288  0.6359  0.8542 1.0030 3000
(Intercept)-BTNW      0.5789 0.1133  0.3628  0.5772  0.8085 1.0040 1284
(Intercept)-CAWA     -2.0038 0.6100 -3.0334 -2.0647 -0.8061 1.7153   47
(Intercept)-MAWA     -0.7535 0.2118 -1.1669 -0.7552 -0.3235 1.0151 1100
(Intercept)-NAWA     -1.5087 0.4549 -2.4288 -1.4964 -0.6260 1.1064  215
(Intercept)-REVI      0.5461 0.1089  0.3334  0.5456  0.7645 1.0006 2444
scale(day)-OVEN      -0.0758 0.0740 -0.2247 -0.0746  0.0679 1.0025 3000
scale(day)-BLPW       0.0606 0.1553 -0.2413  0.0608  0.3704 1.0011 2751
scale(day)-AMRE       0.0262 0.2355 -0.4381  0.0250  0.4916 1.0044 1213
scale(day)-BAWW       0.2124 0.1417 -0.0597  0.2103  0.4976 1.0014 1276
scale(day)-BHVI       0.1884 0.1215 -0.0517  0.1850  0.4272 1.0086 1858
scale(day)-BLBW      -0.2241 0.0677 -0.3617 -0.2226 -0.0929 1.0064 3000
scale(day)-BTBW       0.2702 0.0695  0.1330  0.2690  0.4039 1.0003 2660
scale(day)-BTNW       0.1464 0.0661  0.0163  0.1454  0.2758 1.0022 3000
scale(day)-CAWA      -0.0218 0.1746 -0.3657 -0.0192  0.3250 1.0007 2156
scale(day)-MAWA       0.1064 0.1242 -0.1440  0.1053  0.3572 1.0072 2689
scale(day)-NAWA       0.0136 0.1770 -0.3505  0.0130  0.3610 1.0021 2373
scale(day)-REVI      -0.0722 0.0677 -0.2047 -0.0725  0.0604 1.0069 3016
scale(tod)-OVEN      -0.0413 0.0729 -0.1846 -0.0418  0.1037 1.0061 2723
scale(tod)-BLPW       0.0899 0.1493 -0.1961  0.0901  0.3912 1.0069 2396
scale(tod)-AMRE      -0.0206 0.2148 -0.4509 -0.0170  0.4142 1.0006 1033
scale(tod)-BAWW      -0.1808 0.1370 -0.4663 -0.1767  0.0821 1.0012 1179
scale(tod)-BHVI      -0.0495 0.1120 -0.2736 -0.0477  0.1715 1.0011 1752
scale(tod)-BLBW       0.0573 0.0664 -0.0707  0.0566  0.1904 1.0007 2527
scale(tod)-BTBW      -0.0308 0.0662 -0.1650 -0.0294  0.0937 1.0023 2794
scale(tod)-BTNW       0.0374 0.0640 -0.0889  0.0377  0.1623 1.0024 3000
scale(tod)-CAWA      -0.2210 0.1682 -0.5842 -0.2132  0.0761 1.0017 1846
scale(tod)-MAWA       0.0110 0.1144 -0.2139  0.0108  0.2333 1.0023 3000
scale(tod)-NAWA      -0.0792 0.1596 -0.4079 -0.0747  0.2330 1.0012 2418
scale(tod)-REVI      -0.0770 0.0664 -0.2056 -0.0751  0.0508 1.0103 3000
I(scale(day)^2)-OVEN  0.0189 0.0867 -0.1491  0.0182  0.1860 1.0028 2762
I(scale(day)^2)-BLPW  0.2073 0.1868 -0.1405  0.1973  0.5880 1.0091 2382
I(scale(day)^2)-AMRE -0.1300 0.2429 -0.6376 -0.1160  0.3248 1.0090  957
I(scale(day)^2)-BAWW -0.0341 0.1524 -0.3317 -0.0322  0.2650 0.9999 1506
I(scale(day)^2)-BHVI  0.0707 0.1378 -0.2004  0.0691  0.3410 1.0014 1794
I(scale(day)^2)-BLBW -0.1706 0.0797 -0.3288 -0.1698 -0.0155 1.0026 3000
I(scale(day)^2)-BTBW -0.0587 0.0784 -0.2104 -0.0577  0.0919 1.0036 3000
I(scale(day)^2)-BTNW -0.0613 0.0773 -0.2113 -0.0610  0.0841 0.9996 3000
I(scale(day)^2)-CAWA -0.0331 0.1800 -0.3767 -0.0334  0.3204 1.0090 2374
I(scale(day)^2)-MAWA  0.0064 0.1351 -0.2610  0.0068  0.2742 1.0002 3000
I(scale(day)^2)-NAWA -0.1421 0.1833 -0.5126 -0.1381  0.2072 1.0015 2129
I(scale(day)^2)-REVI  0.0383 0.0797 -0.1245  0.0367  0.1953 1.0008 3000

----------------------------------------
    Spatial Covariance
----------------------------------------
        Mean    SD   2.5%    50%  97.5%   Rhat ESS
phi-1 0.0023 6e-04 0.0013 0.0023 0.0037 1.0728 112

We see pretty adequate convergence and effective sample sizes for the parameters, although we certainly would run the model longer for a full analysis to ensure all Rhat values are less than 1.1. If we compare the community-level parameters from sfMsPGOcc() with those from lfMsPGOcc(), we see their is a fair amount of correspondence between the two models.

Next we summarize the spatial factor loadings

summary(out.sfMsPGOcc$lambda.samples)

Iterations = 1:3000
Thinning interval = 1 
Number of chains = 1 
Sample size per chain = 3000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

          Mean     SD Naive SE Time-series SE
OVEN-1  1.0000 0.0000 0.000000        0.00000
BLPW-1 -1.4086 0.5045 0.009210        0.04614
AMRE-1 -0.4813 0.9145 0.016697        0.14416
BAWW-1 -0.1296 0.7035 0.012844        0.07139
BHVI-1 -1.8301 0.6017 0.010985        0.05991
BLBW-1 -0.2936 0.3516 0.006420        0.02705
BTBW-1  1.5974 0.4802 0.008768        0.04102
BTNW-1  0.8370 0.4206 0.007679        0.04157
CAWA-1 -0.7803 0.6066 0.011074        0.07098
MAWA-1 -2.3117 0.4869 0.008890        0.03550
NAWA-1 -1.2409 0.4898 0.008943        0.03818
REVI-1  0.3939 0.2817 0.005143        0.01437

2. Quantiles for each variable:

          2.5%     25%     50%      75%   97.5%
OVEN-1  1.0000  1.0000  1.0000  1.00000  1.0000
BLPW-1 -2.4559 -1.7424 -1.3791 -1.05149 -0.5230
AMRE-1 -2.1920 -1.0884 -0.5577  0.11222  1.4148
BAWW-1 -1.5199 -0.5772 -0.1546  0.28082  1.3638
BHVI-1 -3.0779 -2.2152 -1.7996 -1.41289 -0.7257
BLBW-1 -1.0295 -0.5137 -0.2770 -0.06451  0.3682
BTBW-1  0.7241  1.2596  1.5705  1.90856  2.6074
BTNW-1  0.1758  0.5509  0.7753  1.05371  1.8625
CAWA-1 -2.2071 -1.1255 -0.6748 -0.34939  0.1264
MAWA-1 -3.4184 -2.5836 -2.2794 -1.97962 -1.4586
NAWA-1 -2.4027 -1.5085 -1.1926 -0.89590 -0.4276
REVI-1 -0.1318  0.2015  0.3831  0.58346  0.9681

Here we see variable responses to the latent spatial factor. In particular, we see that common species (OVEN, BTBW, BTNW, REVI) that occur at low-mid level elevations have a positive coefficent, while more rare species (MAWA, NAWA) have negative values of the coefficient. As we did in Figure 2 of Doser, Finley, Banerjee (2022) for the Breeding Bird Survey data, we could plot the factor loadings and the estimated spatial factor side-by-side to better understand what these effects mean for the specific community of interest.

Posterior predictive checks

Analogous to lfMsPGOcc(), we can perform a posterior predictive check using ppcOcc().

# Takes a few seconds to run. 
ppc.occ.out <- ppcOcc(out.sfMsPGOcc, 'freeman-tukey', group = 2)
Currently on species 1 out of 12
Currently on species 2 out of 12
Currently on species 3 out of 12
Currently on species 4 out of 12
Currently on species 5 out of 12
Currently on species 6 out of 12
Currently on species 7 out of 12
Currently on species 8 out of 12
Currently on species 9 out of 12
Currently on species 10 out of 12
Currently on species 11 out of 12
Currently on species 12 out of 12
summary(ppc.occ.out)

Call:
ppcOcc(object = out.sfMsPGOcc, fit.stat = "freeman-tukey", group = 2)

Samples per Chain: 5000
Burn-in: 3000
Thinning Rate: 2
Number of Chains: 3
Total Posterior Samples: 3000

----------------------------------------
    Community Level
----------------------------------------
Bayesian p-value:  0.5096 

----------------------------------------
    Species Level
----------------------------------------
OVEN Bayesian p-value: 0.391
BLPW Bayesian p-value: 0.376
AMRE Bayesian p-value: 0.7317
BAWW Bayesian p-value: 0.5307
BHVI Bayesian p-value: 0.5723
BLBW Bayesian p-value: 0.292
BTBW Bayesian p-value: 0.4993
BTNW Bayesian p-value: 0.516
CAWA Bayesian p-value: 0.546
MAWA Bayesian p-value: 0.5757
NAWA Bayesian p-value: 0.6013
REVI Bayesian p-value: 0.4837
Fit statistic:  freeman-tukey 

Model selection using WAIC

Below we compute the WAIC using waicOcc() and compare it to the WAIC for the non-spatial multi-species occupancy model.

waicOcc(out.sfMsPGOcc)
      elpd         pD       WAIC 
-4352.3203   151.0262  9006.6930 
waicOcc(out.lfMsPGOcc.2)
      elpd         pD       WAIC 
-4354.6047   184.8315  9078.8723 

As always, remember there is Monte Carlo error in these numbers, and so the values you receive will be slightly different if you run this on your own machine. The WAIC for the spatial factor model is much smaller than the WAIC for the latent factor model, suggesting that accounting for spatial autocorrelation improves model fit. However, in a complete analysis we should ensure the models fully converge (and we have adequate ESS) before performing any model selection or comparison.

Prediction

Finally, we can use the predict() function as we saw with lfMsPGOcc() to predict new occurrence or detection probability values given a set of covariates and spatial locations.

Below, we provide code to produce a map of species richness across HBEF, which is exactly analogous to our code for lfMsPGOcc().

# Not run (note this takes a large amount of memory to run).
data(hbefElev)
str(hbefElev)
elev.pred <- (hbefElev$val - mean(hbef.ordered$occ.covs[, 1])) / sd(hbef.ordered$occ.covs[, 1])
# Order: intercept, elevation (linear), elevation (quadratic)
X.0 <- cbind(1, elev.pred, elev.pred^2) 
# Spatial coordinates
coords.0 <- as.matrix(hbefElev[, c('Easting', 'Northing')])
# type = 'occupancy' specified prediction of occupancy (or occurrence). 
# This is also the default.
# Approximate run time: 30 sec
out.pred <- predict(out.sfMsPGOcc, X.0, coords.0, type = 'occupancy')
str(out.pred)
# Species richness samples
rich.pred <- apply(out.pred$z.0.samples, c(1, 3), sum)
plot.dat <- data.frame(x = hbefElev$Easting, 
                       y = hbefElev$Northing, 
                       rich.mean = apply(rich.pred, 2, mean), 
                       rich.sd = apply(rich.pred, 2, sd))
# Plot species richness of the foliage-gleaning bird community
# across the Hubbard Brook Experimental Forest
dat.stars <- st_as_stars(plot.dat, dims = c('x', 'y'))
ggplot() + 
  geom_stars(data = dat.stars, aes(x = x, y = y, fill = rich.mean)) +
  scale_fill_viridis_c(na.value = 'transparent') +
  labs(x = 'Easting', y = 'Northing', fill = '', 
       title = 'Mean Species Richness') +
  theme_bw()

Additionally, we can generate predicted values of detection probability across a range of covariate values to generate a marginal response curve for each individual species across any given covariate value. As we saw with lfMsPGOcc(), below we predict and plot the relationships between detection probability and time of day for each of the 12 species, while holding the day of the year at it’s mean value (0).

# Minimum observed time of day
min.tod <- min(hbef2015$det.covs$tod, na.rm = TRUE)
# Maximum
max.tod <- max(hbef2015$det.covs$tod, na.rm = TRUE)
# Generate 100 time of day values between the observed range
J.0 <- 100
tod.pred.vals <- seq(from = min.tod, to = max.tod, length.out = J.0)
mean.tod <- mean(hbef2015$det.covs$tod, na.rm = TRUE)
sd.tod <- sd(hbef2015$det.covs$tod, na.rm = TRUE)
tod.stand <- (tod.pred.vals - mean.tod) / sd.tod
# Generate covariate matrix for prediction
X.p.0 <- cbind(1, 0, tod.stand, 0)
colnames(X.p.0) <- c('intercept', 'day', 'tod', 'day2')
out.det.pred <- predict(out.sfMsPGOcc, X.p.0, type = 'detection')
str(out.det.pred)
List of 4
 $ p.0.samples : num [1:3000, 1:12, 1:100] 0.747 0.71 0.717 0.759 0.71 ...
 $ run.time    : 'proc_time' Named num [1:5] 1.71 1.62 0.43 0 0
  ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
 $ call        : language predict.sfMsPGOcc(object = out.sfMsPGOcc, X.0 = X.p.0, type = "detection")
 $ object.class: chr "sfMsPGOcc"
 - attr(*, "class")= chr "predict.sfMsPGOcc"
# Extract the means from the posterior samples and convert to vector
p.0.ests <- c(apply(out.det.pred$p.0.samples, c(2, 3), mean))
p.plot.dat <- data.frame(det.prob = p.0.ests, 
                         sp = rep(sp.names, J.0), 
                         tod = rep(tod.pred.vals, each = N))
ggplot(p.plot.dat, aes(x = tod, y = det.prob)) + 
  geom_line() + 
  theme_bw() + 
  scale_y_continuous(limits = c(0, 1)) + 
  facet_wrap(vars(sp)) + 
  labs(x = 'Time of day (min since sunrise)', y = 'Detection Probability') 

Spatial factor joint species distribution models

The spOccupancy function sfJSDM() fits a spatial factor joint species distribution model. The spatial factor JSDM is a joint species distribution model that ignores imperfect detection, but accounts for species residual correlations and spatial autocorrelation. As in the spatial factor multi-species occupancy model, we account for species correlations using a spatial factor model, where the spatial factors arise from \(q\) independent NNGPs. This model is very similar to the NNGP model of Tikhonov, Opedal, et al. (2020), with the only differences being in the prior distributions and the identifiability constraints placed on the spatial factor loadings matrix.

While sfJSDM() (and it’s non-spatial counterpart lfJSDM()) are not occupancy models since they do not account for imperfect detection, we included them in spOccupancy to allow for direct comparison of these traditional JSDMs (which historically have not accounted for imperfect detection) with the JSDMs with imperfect detection fit by lfMSPGOcc() and sfMsPGOcc(). We hope inclusion of these functions, together with lfMsPGOcc(), sfMsPGOcc(), and the multi-species occupancy models that do not account for species correlations (msPGOcc() and spMsPGOcc()), will provide users and practitioners with practical tools to assess whether or not they need to account for species correlations, imperfect detection, and/or spatial autocorrelation in their specific data sets.

Basic model description

Because this model does not account for imperfect detection, we eliminate the detection sub-model and rather directly model a simplifieid version of the replicated detection-nondetection data. Define \(y^*_{i, j} = I(\sum_{k = 1}^{K_j}y_{i, j, k} > 0)\), with \(I(\cdot)\) an indicator function denoting whether or not species \(i\) was detected during at least one of the \(K_j\) replicate surveys at site \(j\). Note that this model does not require there to be more than one replicate survey at any location (since we do not account for imperfect detection), and thus may be fit to data where lfMsPGOcc() and sfMsPGOcc() will not provide reliable estimates. However, this comes at the cost of not explicitly accounting for imperfect detection, and thus we need to interpret all covariate effects as effects on a confounded process of detection and occurrence rather than explicitly separating the two as we have seen in the previous two models. The model description of the spatial factor joint species distribution model is identical to the occurrence model of the spatial factor multi-species occupancy model, except we replace the latent occurrence \(z_{i, j}\) with the observed data \(y^*_{i, j}\). This model can be thought of as a Generalized Linear Mixed Model with a binary response and spatial random effects that are modeled using a spatial factor approach.

Fitting spatial factor joint species distribution models with sfJSDM

The function sfJSDM() fits spatial factor joint species distribution models using Pólya-Gamma data augmentation. sfJSDM() has the following arguments:

sfJSDM(formula, data, inits, priors, tuning, 
       cov.model = 'exponential', NNGP = TRUE, 
       n.neighbors = 15, search.type = 'cb', n.factors, n.batch, 
       batch.length, accept.rate = 0.43, n.omp.threads = 1, 
       verbose = TRUE, n.report = 100, 
       n.burn = round(.10 * n.batch * batch.length), n.thin = 1, 
       n.chains = 1, k.fold, k.fold.threads = 1, k.fold.seed, ...)

Notice the similarity between the arguments for sfJSDM() and sfMsPGOcc(). The main differences when fitting JSDMs in spOccupancy is that there is now only a single formula that is specified in the model (formula), which is where we specify all covariate effects we think influences our observations. Let’s walk through the arguments in the context of our HBEF example. The data argument for JSDMs in spOccupancy has three required elements: y (the collapsed detection-nondetection data), covs (the covariates), and coords (spatial coordinates of sites). y is a matrix with rows corresponding to species and columns corresponding to sites. covs is a matrix or data frame with site-specific covariate values. coords is a matrix of spatial coordinates that is exactly the same as we have seen before. For our example, we need to collapse the replicated detection-nondetection data into the required format for JSDMs. We can do this using our good friend the apply() function.

# Form collapsed detection-nondetection data
y.star <- apply(hbef.ordered$y, c(1, 2), max, na.rm = TRUE)
str(y.star)
 num [1:12, 1:373] 1 0 0 0 0 1 1 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:12] "OVEN" "BLPW" "AMRE" "BAWW" ...
  ..$ : chr [1:373] "1" "2" "3" "4" ...

Notice in the code above we set the value for each site \(j\) and each species \(i\) to the maximum value observed across the 3 replicate surveys, which is equivalent to setting the value to 1 if the species was observed at that site and 0 if not. Next we specify our covariates. Because we have eliminated the replicate surveys, we can only include site-level covariates in our formula. In the context of our example, this means we cannot include the day or tod covariates that we placed on the detection portion of our occupancy models, as these covariates vary across the replicate surveys. Instead, we will simply include elevation in our model (which is what we used for modeling latent occurrence in our occupancy models).

covs <- hbef.ordered$occ.covs
str(covs)
 num [1:373, 1] 475 494 546 587 588 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr "Elevation"

We finally can just grab the spatial coordinates we used in our occupancy models, and then combine all three elements into a list that we will use to fit the JSDM.

# Grab spatial coordinates of the sites
coords <- hbef.ordered$coords
# Put it all together in a list
jsdm.list <- list(y = y.star, 
                  covs = covs, 
                  coords = coords)
str(jsdm.list)
List of 3
 $ y     : num [1:12, 1:373] 1 0 0 0 0 1 1 0 0 0 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:12] "OVEN" "BLPW" "AMRE" "BAWW" ...
  .. ..$ : chr [1:373] "1" "2" "3" "4" ...
 $ covs  : num [1:373, 1] 475 494 546 587 588 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr "Elevation"
 $ coords: num [1:373, 1:2] 280000 280000 280000 280001 280000 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:373] "1" "2" "3" "4" ...
  .. ..$ : chr [1:2] "X" "Y"

We next specify the covariates we wish to include in the model with formula, as well as the number of latent spatial factors to include in the model. We will include linear and quadratic elevation as covariates in the model, and fit the model with a single spatial factor.

jsdm.formula <- ~ scale(Elevation) + I(scale(Elevation)^2)
n.factors <- 1

The remaining arguments are all identical to those we saw with the spatial factor multi-species occupancy model using sfMsPGOcc(). We specify values for all additional arguments below. See the section on “Fitting models with sfMsPGOcc()” for specific details on each of these arguments. Note that for the initial values and priors, we only need to specify these values for the “occurrence” values since there is no detection sub-model. Further, we do not need to specify initial values for the latent occurrence values \(z\) (since there aren’t any).

# Initial values ----------------------
# Pair-wise distance between all sites
dist.hbef <- dist(hbef.ordered$coords)
# Exponential correlation model
cov.model <- "exponential"
# Specify all other initial values identical to sfMsPGOcc() from before
# Number of species
N <- nrow(jsdm.list$y)
# Initiate all lambda initial values to 0. 
lambda.inits <- matrix(0, N, n.factors)
# Set diagonal elements to 1
diag(lambda.inits) <- 1
# Set lower triangular elements to random values from a standard normal dist
lambda.inits[lower.tri(lambda.inits)] <- rnorm(sum(lower.tri(lambda.inits)))
# Check it out
lambda.inits
             [,1]
 [1,]  1.00000000
 [2,] -0.81591886
 [3,]  0.52773493
 [4,]  0.24716672
 [5,] -0.38416333
 [6,]  0.03897993
 [7,] -0.02972734
 [8,]  0.34170062
 [9,]  0.45539728
[10,]  0.59588668
[11,]  0.78005116
[12,]  0.45098759
# Create list of initial values. 
inits <- list(beta.comm = 0,
              beta = 0,
              tau.sq.beta = 1,
              lambda = lambda.inits, 
              phi = 3 / mean(dist.hbef))
# Priors ------------------------------
min.dist <- min(dist.hbef)
max.dist <- max(dist.hbef)
priors <- list(beta.comm.normal = list(mean = 0, var = 2.72),
               tau.sq.beta.ig = list(a = 0.1, b = 0.1),
               phi.unif = list(3 / max.dist, 3 / min.dist))
# Additional arguments ----------------
batch.length <- 25
n.batch <- 200
n.burn <- 3000
n.thin <- 2
n.chains <- 3
tuning <- list(phi = 1)
n.omp.threads <- 1
verbose <- TRUE
n.report <- 50 # Report progress at every 50th batch.

We are now ready to run the model. We run the model for 200 batches of length 25 (a total of 5000 MCMC samples), discarding the first 3000 as burn-in and thinning by every 2 samples. We fit the model with 5 nearest neighbors.

# Approx run time: 1.25 min
out.sfJSDM <- sfJSDM(formula = jsdm.formula, 
                     data = jsdm.list, 
                     inits = inits, 
                     n.batch = n.batch, 
                     batch.length = batch.length, 
                     accept.rate = 0.43, 
                     priors = priors, 
                     n.factors = n.factors,
                     cov.model = cov.model, 
                     tuning = tuning, 
                     n.omp.threads = n.omp.threads, 
                     verbose = TRUE, 
                     NNGP = TRUE, 
                     n.neighbors = 5, 
                     n.report = n.report, 
                     n.burn = n.burn, 
                     n.thin = n.thin, 
                     n.chains = n.chains)
----------------------------------------
    Preparing to run the model
----------------------------------------
----------------------------------------
    Building the neighbor list
----------------------------------------
----------------------------------------
Building the neighbors of neighbors list
----------------------------------------
----------------------------------------
    Model description
----------------------------------------
Spatial Factor NNGP JSDM with Polya-Gamma latent
variable fit with 373 sites and 12 species.

Samples per chain: 5000 (200 batches of length 25)
Burn-in: 3000 
Thinning Rate: 2 
Number of Chains: 3 
Total Posterior Samples: 3000 

Using the exponential spatial correlation model.

Using 1 latent spatial factors.
Using 5 nearest neighbors.

Source compiled with OpenMP support and model fit using 1 thread(s).

Adaptive Metropolis with target acceptance rate: 43.0
----------------------------------------
    Chain 1
----------------------------------------
Sampling ... 
Batch: 50 of 200, 25.00%
    Latent Factor   Parameter   Acceptance  Tuning
    1       phi     20.0        0.60050
-------------------------------------------------
Batch: 100 of 200, 50.00%
    Latent Factor   Parameter   Acceptance  Tuning
    1       phi     36.0        0.41895
-------------------------------------------------
Batch: 150 of 200, 75.00%
    Latent Factor   Parameter   Acceptance  Tuning
    1       phi     40.0        0.31664
-------------------------------------------------
Batch: 200 of 200, 100.00%
----------------------------------------
    Chain 2
----------------------------------------
Sampling ... 
Batch: 50 of 200, 25.00%
    Latent Factor   Parameter   Acceptance  Tuning
    1       phi     44.0        0.28650
-------------------------------------------------
Batch: 100 of 200, 50.00%
    Latent Factor   Parameter   Acceptance  Tuning
    1       phi     48.0        0.30422
-------------------------------------------------
Batch: 150 of 200, 75.00%
    Latent Factor   Parameter   Acceptance  Tuning
    1       phi     40.0        0.28650
-------------------------------------------------
Batch: 200 of 200, 100.00%
----------------------------------------
    Chain 3
----------------------------------------
Sampling ... 
Batch: 50 of 200, 25.00%
    Latent Factor   Parameter   Acceptance  Tuning
    1       phi     60.0        0.29229
-------------------------------------------------
Batch: 100 of 200, 50.00%
    Latent Factor   Parameter   Acceptance  Tuning
    1       phi     40.0        0.28083
-------------------------------------------------
Batch: 150 of 200, 75.00%
    Latent Factor   Parameter   Acceptance  Tuning
    1       phi     32.0        0.26982
-------------------------------------------------
Batch: 200 of 200, 100.00%
# Take a look at the resulting object
names(out.sfJSDM)
 [1] "rhat"                "beta.comm.samples"   "tau.sq.beta.samples"
 [4] "beta.samples"        "lambda.samples"      "theta.samples"      
 [7] "w.samples"           "psi.samples"         "like.samples"       
[10] "z.samples"           "X.re"                "lambda.psi"         
[13] "ESS"                 "X"                   "y"                  
[16] "call"                "n.samples"           "x.names"            
[19] "sp.names"            "theta.names"         "type"               
[22] "coords"              "cov.model.indx"      "n.neighbors"        
[25] "q"                   "n.post"              "n.thin"             
[28] "n.burn"              "n.chains"            "psiRE"              
[31] "run.time"           
summary(out.sfJSDM)

Call:
sfJSDM(formula = jsdm.formula, data = jsdm.list, inits = inits, 
    priors = priors, tuning = tuning, cov.model = cov.model, 
    NNGP = TRUE, n.neighbors = 5, n.factors = n.factors, n.batch = n.batch, 
    batch.length = batch.length, accept.rate = 0.43, n.omp.threads = n.omp.threads, 
    verbose = TRUE, n.report = n.report, n.burn = n.burn, n.thin = n.thin, 
    n.chains = n.chains)

Samples per Chain: 5000
Burn-in: 3000
Thinning Rate: 2
Number of Chains: 3
Total Posterior Samples: 3000
Run Time (min): 1.4971

----------------------------------------
    Community Level
----------------------------------------
Means (logit scale): 
                         Mean     SD    2.5%     50%  97.5%   Rhat  ESS
(Intercept)           -0.8311 0.8219 -2.4196 -0.8201 0.8121 1.0005 2728
scale(Elevation)       0.4029 0.3883 -0.3413  0.3932 1.1922 1.0016 2102
I(scale(Elevation)^2) -0.2481 0.1763 -0.5932 -0.2487 0.1015 1.0027 1598

Variances (logit scale): 
                         Mean     SD   2.5%    50%   97.5%   Rhat  ESS
(Intercept)           10.5487 5.6222 4.2831 9.1443 25.6212 1.0163 2035
scale(Elevation)       1.7494 1.0207 0.6029 1.5089  4.5507 1.0269  438
I(scale(Elevation)^2)  0.3183 0.2169 0.0756 0.2608  0.8848 1.0430  374

----------------------------------------
    Species Level
----------------------------------------
Estimates (logit scale): 
                              Mean     SD    2.5%     50%   97.5%   Rhat  ESS
(Intercept)-OVEN            1.9339 0.2792  1.3750  1.9401  2.4779 1.1627  162
(Intercept)-BLPW           -5.0366 0.6279 -6.3697 -5.0051 -3.9002 1.0155  249
(Intercept)-AMRE           -4.5973 0.6693 -6.1307 -4.5315 -3.4824 1.0244  275
(Intercept)-BAWW           -1.5917 0.2088 -2.0038 -1.5876 -1.1945 1.0082 1424
(Intercept)-BHVI           -2.0248 0.3442 -2.7563 -2.0061 -1.4005 1.2583  175
(Intercept)-BLBW            1.1991 0.1692  0.8728  1.1948  1.5361 1.0409  750
(Intercept)-BTBW            2.9897 0.3417  2.3650  2.9764  3.6881 1.1102  372
(Intercept)-BTNW            1.8800 0.2215  1.4622  1.8742  2.3325 1.1169  472
(Intercept)-CAWA           -3.1196 0.3378 -3.8325 -3.1036 -2.4838 1.0030  637
(Intercept)-MAWA           -3.0628 0.5689 -4.2426 -3.0257 -2.0524 1.4451  141
(Intercept)-NAWA           -3.8924 0.5040 -4.9812 -3.8497 -3.0380 1.0469  222
(Intercept)-REVI            2.3058 0.2381  1.8676  2.3026  2.7803 1.0097  860
scale(Elevation)-OVEN      -1.3848 0.2146 -1.8195 -1.3755 -0.9706 1.0211  302
scale(Elevation)-BLPW       2.5926 0.6479  1.5748  2.4979  4.0726 1.2151  114
scale(Elevation)-AMRE       0.6686 0.6334 -0.5163  0.6657  1.9183 1.0240  219
scale(Elevation)-BAWW      -0.0657 0.2360 -0.5133 -0.0706  0.3994 1.0099 1142
scale(Elevation)-BHVI       0.0386 0.2070 -0.3815  0.0406  0.4364 1.0335  360
scale(Elevation)-BLBW      -0.2289 0.1189 -0.4656 -0.2284  0.0027 1.0001 1854
scale(Elevation)-BTBW      -0.2206 0.1739 -0.5591 -0.2229  0.1294 1.0101  285
scale(Elevation)-BTNW       0.5137 0.1644  0.2036  0.5082  0.8579 1.0023  813
scale(Elevation)-CAWA       1.5813 0.4465  0.7736  1.5530  2.5287 1.0110  273
scale(Elevation)-MAWA       1.6413 0.3896  0.9113  1.6278  2.4372 1.0159  181
scale(Elevation)-NAWA       0.9658 0.3477  0.3412  0.9402  1.6990 1.0102  389
scale(Elevation)-REVI      -1.0863 0.1849 -1.4653 -1.0770 -0.7482 1.0051 1328
I(scale(Elevation)^2)-OVEN -0.4559 0.1590 -0.7726 -0.4565 -0.1378 1.0091  801
I(scale(Elevation)^2)-BLPW  0.6305 0.3772 -0.1057  0.6467  1.3346 1.1001  216
I(scale(Elevation)^2)-AMRE -0.6310 0.4324 -1.5442 -0.5989  0.1186 1.0332  303
I(scale(Elevation)^2)-BAWW -0.7095 0.2261 -1.1852 -0.6957 -0.3071 1.0508  694
I(scale(Elevation)^2)-BHVI  0.0484 0.1511 -0.2524  0.0531  0.3371 1.0006  512
I(scale(Elevation)^2)-BLBW -0.3254 0.0941 -0.5064 -0.3244 -0.1367 1.0009 1577
I(scale(Elevation)^2)-BTBW -0.8646 0.1446 -1.1590 -0.8608 -0.5961 1.0202  532
I(scale(Elevation)^2)-BTNW -0.1234 0.1164 -0.3550 -0.1227  0.1009 1.0031 1292
I(scale(Elevation)^2)-CAWA -0.5240 0.2876 -1.1299 -0.5062  0.0211 1.0101  341
I(scale(Elevation)^2)-MAWA  0.0620 0.2709 -0.4481  0.0624  0.5939 1.0001  444
I(scale(Elevation)^2)-NAWA  0.2045 0.2426 -0.2974  0.2088  0.6670 1.0063  437
I(scale(Elevation)^2)-REVI -0.3592 0.1414 -0.6302 -0.3623 -0.0656 1.0038 1507

----------------------------------------
    Spatial Covariance
----------------------------------------
        Mean    SD   2.5%    50%  97.5%   Rhat ESS
phi-1 0.0025 6e-04 0.0014 0.0025 0.0039 1.0216  76

Notice the difference in run time between sfJSDM() and sfMsPGOcc(). The run time for sfJSDM() is almost half that of sfMsPGOcc(). This is a nice benefit of models that don’t account for imperfect detection, but of course it comes with big sacrifices. Looking at the above output, we see all parameters have converged.

Model selection using WAIC

We can compute the WAIC for spatial factor JSDMs using the waicOcc() function just as we have seen previously.

waicOcc(out.sfJSDM)
      elpd         pD       WAIC 
-1279.8983   141.2529  2842.3023 

However, the WAIC values for JSDM models are not directly comparable to those from lfMsPGOcc() or sfMsPGOcc() because they do not use the same data (JSDMs use a collapsed form of the replicated detection-nondetection data used in the occupancy models). This makes comparisons between models that do and do not account for imperfect detection a bit tricky. See Doser, Finley, Banerjee (2022) for one cross-validation approach to comparing predictive performance of JSDMs and occupancy models. While k-fold cross-validation is implemented for JSDMs in spOccupancy and all other model-fitting functions, we have yet to implement the specific approach we used in Doser, Finley, Banerjee (2022) to directly compare predictive performance JSDMs and occupancy models. We plan to do this in future development of the package.

For now, we can do some visual comparisons of the predicted occurrence probabilities from the spatial factor JSDM and the spatial factor multi-species occupancy model.

# Extract mean occurrence probabilities for each species from sfMsPGOcc
psi.mean.sfMsPGOcc <- apply(out.sfMsPGOcc$psi.samples, c(2, 3), mean)
# Extract mean occurrence probabilities for each species from sfJSDM
psi.mean.sfJSDM <- apply(out.sfJSDM$psi.samples, c(2, 3), mean)
# Plot results for the Red-eyed Vireo (REVI)
curr.sp <- which(sp.ordered == 'REVI')
# Color the points blue if sfJSDM > sfMsPGOcc, red otherwise
my.col <- ifelse(psi.mean.sfMsPGOcc[curr.sp, ] > psi.mean.sfJSDM[curr.sp, ], 
         'lightsalmon', 'lightskyblue1')
plot(psi.mean.sfMsPGOcc[curr.sp, ], psi.mean.sfJSDM[curr.sp, ], pch = 21,
     bg = my.col, xlab = 'sfMsPGOcc', ylab = 'sfJSDM', main = 'Red-eyed Vireo', 
     ylim = c(0, 1), xlim = c(0, 1))
abline(0, 1)

Not surprisingly, most estimates of occurrence probability are smaller for the model that does not account for imperfect detection (sfJSDM()). However, the differences here are not very large for this species, which is exactly what we would expect for a fairly common species. Let’s take a look at the results for a more rare species (CAWA (Canda Warbler)).

curr.sp <- which(sp.ordered == 'CAWA')
# Color the points blue if sfJSDM > sfMsPGOcc, red otherwise
my.col <- ifelse(psi.mean.sfMsPGOcc[curr.sp, ] > psi.mean.sfJSDM[curr.sp, ], 
         'lightsalmon', 'lightskyblue1')
plot(psi.mean.sfMsPGOcc[curr.sp, ], psi.mean.sfJSDM[curr.sp, ], pch = 21,
     bg = my.col, xlab = 'sfMsPGOcc', ylab = 'sfJSDM', main = 'Canada Warbler', 
     ylim = c(0, 1), xlim = c(0, 1))
abline(0, 1)

Here we see a clear discrepancy between the occurrence probability estimates from models that do and do not account for imperfect detection.

Prediction

We can use the predict() function to generate new predictions of “occurrence” probability values given a set of covariates and spatial locations. Note that these predictions are estimates of a confounded probability of occurrence and detection (hence the quotations around occurrence throughout this section). This code looks analogous to what we saw with sfMsPGOcc().

# Not run (note this takes a large amount of memory to run).
data(hbefElev)
str(hbefElev)
elev.pred <- (hbefElev$val - mean(hbef.ordered$occ.covs[, 1])) / sd(hbef.ordered$occ.covs[, 1])
# Order: intercept, elevation (linear), elevation (quadratic)
X.0 <- cbind(1, elev.pred, elev.pred^2) 
# Spatial coordinates
coords.0 <- as.matrix(hbefElev[, c('Easting', 'Northing')])
# Approximate run time: 30 sec
out.pred <- predict(out.sfJSDM, X.0, coords.0)

Latent factor joint species distribution models

The spOccupancy function lfJSDM() fits a latent factor joint species distribution model. This model is analogous to the latent factor multi-species occupancy model, except it does not account for imperfect detection. Alternatively, the model can be viewed as a non-spatial alternative to sfJSDM(). Just as we saw with lfMsPGOcc(), we will account for species correlations using a latent factor model, where the latent factors are assumed to arise from independent standard normal distributions.

Basic model description

The latent factor joint species distribution model is identical to the spatial factor joint species distribution model fit by sfJSDM(), except now the latent factors do not have spatial structure and are modeled using independent standard normal distributions. See previous model descriptions for the spatial factor joint species distribution model as well as the latent factor multi-species occupancy model for additional details.

Fitting latent factor joint species distribution models with lfJSDM

The function lfJSDM() fits latent factor joint species distribution models using Pólya-Gamma data augmentation. lfJSDM() has the following arguments.

lfJSDM(formula, data, inits, priors, n.factors, 
       n.samples, n.omp.threads = 1, verbose = TRUE, n.report = 100, 
       n.burn = round(.10 * n.samples), n.thin = 1, n.chains = 1,
       k.fold, k.fold.threads = 1, k.fold.seed, ...)

There are no new arguments in lfJSDM() that we have not seen before in previous factor model functions. Notice that like sfJSDM(), there is only a single formula since we do not explicitly separate imperfect detection from latent occurrence. Similar to sfJSDM(), the data list should consist of the collapsed detection-nondetection data matrix (y), the covariates (covs), and the spatial coordinates (coords). We can use the same data list we constructed previously for sfJSDM(), and we remind ourselves of it’s structure below.

str(jsdm.list)
List of 3
 $ y     : num [1:12, 1:373] 1 0 0 0 0 1 1 0 0 0 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:12] "OVEN" "BLPW" "AMRE" "BAWW" ...
  .. ..$ : chr [1:373] "1" "2" "3" "4" ...
 $ covs  : num [1:373, 1] 475 494 546 587 588 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr "Elevation"
 $ coords: num [1:373, 1:2] 280000 280000 280000 280001 280000 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:373] "1" "2" "3" "4" ...
  .. ..$ : chr [1:2] "X" "Y"

Specifying covariates in lfJSDM() is exactly analogous to what we saw with sfJSDM(). Again, note that we can include random intercepts in formula using lme4 syntax (Bates et al. 2015). As we have done before, we will use a single latent factor.

jsdm.formula <- ~ scale(Elevation) + I(scale(Elevation)^2)
n.factors <- 1

The remaining arguments are all identical to the arguments we saw in lfMsPGOcc() when fitting a latent factor multi-species occupancy model. See the section on “Fitting models with lfMsPGOcc()” for specific details on these arguments. Note that for the initial values and priors, we only need to specify these values for the “occurrence” values since there is no detection sub-model. Just like with sfJSDM(), we do not need to specify initial values for the latent occurrence values \(z\) (since we assume perfect detection).

# Initial values ----------------------
# Number of species
N <- nrow(jsdm.list$y)
# Initiate all lambda initial values to 0. 
lambda.inits <- matrix(0, N, n.factors)
# Set diagonal elements to 1
diag(lambda.inits) <- 1
# Set lower triangular elements to random values from a standard normal dist
lambda.inits[lower.tri(lambda.inits)] <- rnorm(sum(lower.tri(lambda.inits)))
# Create list of initial values. 
inits <- list(beta.comm = 0,
              beta = 0,
              tau.sq.beta = 1,
              lambda = lambda.inits)
# Priors ------------------------------
priors <- list(beta.comm.normal = list(mean = 0, var = 2.72),
               tau.sq.beta.ig = list(a = 0.1, b = 0.1))
# Additional arguments ----------------
n.samples <- 5000
n.burn <- 3000
n.thin <- 2
n.chains <- 3
n.report <- 1000

With all the arguments set for lfJSDM(), we are now ready to run the model. We run the model for 5000 MCMC samples, eliminating the first 3000 as burn-in and thinning by every 2 samples.

# Approx run time: 32 seconds
out.lfJSDM <- lfJSDM(formula = jsdm.formula, 
                     data = jsdm.list, 
                     inits = inits, 
                     priors = priors, 
                     n.factors = n.factors, 
                     n.samples = n.samples, 
                     n.report = n.report, 
                     n.burn = n.burn, 
                     n.thin = n.thin, 
                     n.chains = n.chains)
----------------------------------------
    Preparing to run the model
----------------------------------------
----------------------------------------
    Model description
----------------------------------------
Latent Factor JSDM with Polya-Gamma latent
variable fit with 373 sites and 12 species.

Samples per Chain: 5000 
Burn-in: 3000 
Thinning Rate: 2 
Number of Chains: 3 
Total Posterior Samples: 3000 

Using 1 latent factors.

Source compiled with OpenMP support and model fit using 1 thread(s).

----------------------------------------
    Chain 1
----------------------------------------
Sampling ... 
Sampled: 1000 of 5000, 20.00%
-------------------------------------------------
Sampled: 2000 of 5000, 40.00%
-------------------------------------------------
Sampled: 3000 of 5000, 60.00%
-------------------------------------------------
Sampled: 4000 of 5000, 80.00%
-------------------------------------------------
Sampled: 5000 of 5000, 100.00%
----------------------------------------
    Chain 2
----------------------------------------
Sampling ... 
Sampled: 1000 of 5000, 20.00%
-------------------------------------------------
Sampled: 2000 of 5000, 40.00%
-------------------------------------------------
Sampled: 3000 of 5000, 60.00%
-------------------------------------------------
Sampled: 4000 of 5000, 80.00%
-------------------------------------------------
Sampled: 5000 of 5000, 100.00%
----------------------------------------
    Chain 3
----------------------------------------
Sampling ... 
Sampled: 1000 of 5000, 20.00%
-------------------------------------------------
Sampled: 2000 of 5000, 40.00%
-------------------------------------------------
Sampled: 3000 of 5000, 60.00%
-------------------------------------------------
Sampled: 4000 of 5000, 80.00%
-------------------------------------------------
Sampled: 5000 of 5000, 100.00%
# Quick summary of model results
summary(out.lfJSDM)

Call:
lfJSDM(formula = jsdm.formula, data = jsdm.list, inits = inits, 
    priors = priors, n.factors = n.factors, n.samples = n.samples, 
    n.report = n.report, n.burn = n.burn, n.thin = n.thin, n.chains = n.chains)

Samples per Chain: 5000
Burn-in: 3000
Thinning Rate: 2
Number of Chains: 3
Total Posterior Samples: 3000
Run Time (min): 0.605

----------------------------------------
    Community Level
----------------------------------------
Means (logit scale): 
                         Mean     SD    2.5%     50%  97.5%   Rhat  ESS
(Intercept)           -0.7768 0.8133 -2.3371 -0.7967 0.8630 1.0034 3000
scale(Elevation)       0.4188 0.3971 -0.3412  0.4102 1.2394 0.9996 2134
I(scale(Elevation)^2) -0.2662 0.1782 -0.6278 -0.2651 0.0797 1.0012 1522

Variances (logit scale): 
                         Mean     SD   2.5%    50%   97.5%   Rhat  ESS
(Intercept)           10.5757 5.6231 4.4303 9.3197 24.4341 1.0057 2497
scale(Elevation)       1.8959 1.1071 0.6755 1.6301  4.6075 1.0117  680
I(scale(Elevation)^2)  0.3156 0.2065 0.0919 0.2655  0.8436 1.0141  996

----------------------------------------
    Species Level
----------------------------------------
Estimates (logit scale): 
                              Mean     SD    2.5%     50%   97.5%   Rhat  ESS
(Intercept)-OVEN            2.0007 0.2145  1.5814  1.9985  2.4254 1.0048 1903
(Intercept)-BLPW           -4.8914 0.5929 -6.1678 -4.8319 -3.8850 1.0191  268
(Intercept)-AMRE           -4.5202 0.6114 -5.8544 -4.4678 -3.4781 1.0149  253
(Intercept)-BAWW           -1.6207 0.2082 -2.0454 -1.6193 -1.2163 1.0167 1413
(Intercept)-BHVI           -2.0333 0.2851 -2.6500 -2.0151 -1.5325 1.0525  394
(Intercept)-BLBW            1.1727 0.1559  0.8710  1.1683  1.4796 1.0058 3000
(Intercept)-BTBW            3.2629 0.4083  2.5502  3.2366  4.1926 1.0130  362
(Intercept)-BTNW            1.9528 0.2168  1.5635  1.9415  2.4134 1.0074 1153
(Intercept)-CAWA           -3.1560 0.3488 -3.8744 -3.1382 -2.5091 1.0070  718
(Intercept)-MAWA           -2.6909 0.3831 -3.5258 -2.6663 -2.0247 1.0140  317
(Intercept)-NAWA           -4.0745 0.5281 -5.2628 -4.0400 -3.1555 1.0660  302
(Intercept)-REVI            2.3963 0.2548  1.9264  2.3836  2.9345 1.0043 1130
scale(Elevation)-OVEN      -1.5022 0.1786 -1.8592 -1.5028 -1.1698 1.0025 1604
scale(Elevation)-BLPW       2.6723 0.6449  1.7315  2.5644  4.2724 1.0176  109
scale(Elevation)-AMRE       0.7609 0.6631 -0.4128  0.7205  2.2415 1.0135  226
scale(Elevation)-BAWW      -0.0704 0.2454 -0.5459 -0.0738  0.4236 1.0015 1082
scale(Elevation)-BHVI       0.1588 0.1650 -0.1741  0.1570  0.4829 1.0029 1843
scale(Elevation)-BLBW      -0.2001 0.1120 -0.4175 -0.1999  0.0165 1.0071 3000
scale(Elevation)-BTBW      -0.3329 0.1554 -0.6540 -0.3291 -0.0401 1.0028 1822
scale(Elevation)-BTNW       0.4808 0.1560  0.1935  0.4764  0.8035 1.0028 1503
scale(Elevation)-CAWA       1.6392 0.4801  0.7903  1.5983  2.6980 1.0148  262
scale(Elevation)-MAWA       1.6759 0.3306  1.0903  1.6556  2.3668 1.0030  335
scale(Elevation)-NAWA       1.1210 0.3374  0.5345  1.0912  1.8490 1.0204  436
scale(Elevation)-REVI      -1.1424 0.1896 -1.5356 -1.1339 -0.7981 1.0083 1018
I(scale(Elevation)^2)-OVEN -0.4884 0.1481 -0.7778 -0.4887 -0.2037 1.0041 1738
I(scale(Elevation)^2)-BLPW  0.5959 0.3625 -0.1435  0.6132  1.2355 1.0028  211
I(scale(Elevation)^2)-AMRE -0.6322 0.4274 -1.5787 -0.6060  0.1337 1.0109  356
I(scale(Elevation)^2)-BAWW -0.7019 0.2175 -1.1601 -0.6915 -0.3052 1.0090  821
I(scale(Elevation)^2)-BHVI  0.0506 0.1316 -0.2123  0.0527  0.3031 1.0020 2163
I(scale(Elevation)^2)-BLBW -0.3146 0.0912 -0.4970 -0.3139 -0.1374 1.0044 3143
I(scale(Elevation)^2)-BTBW -0.9298 0.1569 -1.2688 -0.9221 -0.6398 1.0015  586
I(scale(Elevation)^2)-BTNW -0.1305 0.1127 -0.3500 -0.1351  0.1072 1.0031 2239
I(scale(Elevation)^2)-CAWA -0.5292 0.3078 -1.2001 -0.5045  0.0116 1.0208  355
I(scale(Elevation)^2)-MAWA  0.0450 0.2201 -0.3924  0.0531  0.4524 1.0082  683
I(scale(Elevation)^2)-NAWA  0.1991 0.2243 -0.2456  0.2088  0.6254 1.0042  707
I(scale(Elevation)^2)-REVI -0.3759 0.1429 -0.6547 -0.3757 -0.0899 1.0185 1640

Not surprisingly, lfJSDM() is the fastest of the four models we have fit to the Hubbard Brook data. This makes sense since it only accounts for one (residual species correlations) of the three complexities and ignores the other two (imperfect detection and spatial autocorrelation). Note that we also see nice converging and mixing of all model parameters. This is a constant battle we fight with fitting Bayesian models. Generally, the more complex a model is, the longer it takes to run to reach convergence. For more simple models like the latent factor joint species distribution model, models usually run quite fast and we often will see adequate convergence and mixing without having to run too many MCMC samples.

Model selection using WAIC

We compute the WAIC for latent factor JSDMs using the waicOcc() function, and we compare this value to the WAIC for the spatial factor JSDM we fit with sfJSDM().

waicOcc(out.lfJSDM)
     elpd        pD      WAIC 
-1270.350   182.064  2904.828 
waicOcc(out.sfJSDM)
      elpd         pD       WAIC 
-1279.8983   141.2529  2842.3023 

The WAIC value for the spatial factor JSDM is smaller than that of the latnet factor JSDM, indicating that accounting for spatial autocorrelation improves model fit. As we noted for sfJSDM(), we cannot directly compare the WAIC value from a latent factor JSDM to those obtained from lfMsPGOcc() or sfMsPGOcc() since they do not use the same data. As with all factor model functions, we can additionaly use k-fold cross-validation to assess out-of-sample predictive performance. See the main spOccupancy vignette for examples of cross-validation using spOccupancy model functions.

Prediction

We can use the predict() function to generate new predictions of “occurrence” probability values given a set of covariates and spatial locations. Again, note that these predictions are estimates of a confounded probability of occurrence and detection. This code is analogous to what we saw with lfMsPGOcc().

# Not run (note this takes a large amount of memory to run).
data(hbefElev)
elev.pred <- (hbefElev$val - mean(hbef2015$occ.covs[, 1])) / sd(hbef2015$occ.covs[, 1])
# Order: intercept, elevation (linear), elevation (quadratic)
X.0 <- cbind(1, elev.pred, elev.pred^2) 
# Spatial coordinates
coords.0 <- as.matrix(hbefElev[, c('Easting', 'Northing')])
# Approximate run time: 30 sec
out.pred <- predict(out.lfJSDM, X.0, coords.0)
str(out.pred)
# Species richness samples
rich.pred <- apply(out.pred$z.0.samples, c(1, 3), sum)
plot.dat <- data.frame(x = hbefElev$Easting, 
                       y = hbefElev$Northing, 
                       rich.mean = apply(rich.pred, 2, mean), 
                       rich.sd = apply(rich.pred, 2, sd))
# Plot species richness of the foliage-gleaning bird community
# across the Hubbard Brook Experimental Forest
dat.stars <- st_as_stars(plot.dat, dims = c('x', 'y'))
ggplot() + 
  geom_stars(data = dat.stars, aes(x = x, y = y, fill = rich.mean)) +
  scale_fill_viridis_c(na.value = 'transparent') +
  labs(x = 'Easting', y = 'Northing', fill = '', 
       title = 'Mean Species Richness') +
  theme_bw()

References

Banerjee, Sudipto, Bradley P Carlin, and Alan E Gelfand. 2014. Hierarchical Modeling and Analysis for Spatial Data. CRC press.

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.

Brooks, Stephen P., and Andrew Gelman. 1998. “General Methods for Monitoring Convergence of Iterative Simulations.” Journal of Computational and Graphical Statistics 7 (4): 434–55.

Datta, Abhirup, Sudipto Banerjee, Andrew O Finley, and Alan E Gelfand. 2016. “Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets.” Journal of the American Statistical Association 111 (514): 800–812.

Doser, Jeffrey W, Andrew O Finley, Marc Kéry, and Elise F Zipkin. 2021. “spOccupancy: An R package for single species, multispecies, and integrated spatial occupancy models.” arXiv Preprint arXiv:2111.12163.

Doser, Jeffrey W, Wendy Leuenberger, T Scott Sillett, Michael T Hallworth, and Elise F Zipkin. 2022. “Integrated Community Occupancy Models: A Framework to Assess Occurrence and Biodiversity Dynamics Using Multiple Data Sources.” Methods in Ecology and Evolution.

Finley, Andrew O., Sudipto Banerjee, and Ronald E. McRoberts. 2009. “Hierarchical spatial models for predicting tree species assemblages across large domains.” The Annals of Applied Statistics 3 (3): 1052–79. https://doi.org/10.1214/09-AOAS250.

Finley, Andrew O, Abhirup Datta, and Sudipto Banerjee. 2020. “spNNGP R package for nearest neighbor Gaussian process models.” arXiv Preprint arXiv:2001.09111.

Hogan, Joseph W, and Rusty Tchernis. 2004. “Bayesian Factor Analysis for Spatially Correlated Data, with Application to Summarizing Area-Level Material Deprivation from Census Data.” Journal of the American Statistical Association 99 (466): 314–24.

Hooten, Mevin B, and N Thompson Hobbs. 2015. “A guide to Bayesian model selection for ecologists.” Ecological Monographs 85 (1): 3–28.

Latimer, AM, S Banerjee, H Sang Jr, ES Mosher, and JA Silander Jr. 2009. “Hierarchical Models Facilitate Spatial Analysis of Large Data Sets: A Case Study on Invasive Plant Species in the Northeastern United States.” Ecology Letters 12 (2): 144–54.

MacKenzie, Darryl I, James D Nichols, Gideon B Lachman, Sam Droege, J Andrew Royle, and Catherine A Langtimm. 2002. “Estimating Site Occupancy Rates When Detection Probabilities Are Less Than One.” Ecology 83 (8): 2248–55.

Ovaskainen, Otso, Jenni Hottola, and Juha Siitonen. 2010. “Modeling Species Co-Occurrence by Multivariate Logistic Regression Generates New Hypotheses on Fungal Interactions.” Ecology 91 (9): 2514–21.

Ovaskainen, Otso, David B Roy, Richard Fox, and Barbara J Anderson. 2016. “Uncovering Hidden Spatial Structure in Species Communities with Spatially Explicit Joint Species Distribution Models.” Methods in Ecology and Evolution 7 (4): 428–36.

Polson, Nicholas G, James G Scott, and Jesse Windle. 2013. “Bayesian inference for logistic models using Pólya–Gamma latent variables.” Journal of the American Statistical Association 108 (504): 1339–49.

Roberts, Gareth O, and Jeffrey S Rosenthal. 2009. “Examples of Adaptive Mcmc.” Journal of Computational and Graphical Statistics 18 (2): 349–67.

Taylor-Rodriguez, Daniel, Andrew O Finley, Abhirup Datta, Chad Babcock, Hans-Erik Andersen, Bruce D Cook, Douglas C Morton, and Sudipto Banerjee. 2019. “Spatial factor models for high-dimensional and large spatial data: An application in forest variable mapping.” Statistica Sinica 29: 1155.

Tikhonov, Gleb, Li Duan, Nerea Abrego, Graeme Newell, Matt White, David Dunson, and Otso Ovaskainen. 2020. “Computationally Efficient Joint Species Distribution Modeling of Big Spatial Data.” Ecology 101 (2): e02929.

Tikhonov, Gleb, Øystein H Opedal, Nerea Abrego, Aleksi Lehikoinen, Melinda MJ de Jonge, Jari Oksanen, and Otso Ovaskainen. 2020. “Joint species distribution modelling with the r-package Hmsc.” Methods in Ecology and Evolution 11 (3): 442–47.

Watanabe, Sumio. 2010. “Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory.” Journal of Machine Learning Research 11 (12).