## Introduction

This vignette provides worked examples and explanations for fitting single-species, multi-species, and integrated occupancy models available in the spOccupancy R package. We will provide step by step examples on how to fit the following models:

1. Occupancy model using PGOcc().
2. Spatial occupancy model using spPGOcc().
3. Multi-species occupancy model using msPGOcc().
4. Spatial multi-species occupancy model using spMsPGOcc().
5. Integrated occupancy model using intPGOcc().
6. Spatial integrated occupancy model using spIntPGOcc().

We fit all occupancy models using Pólya-Gamma data augmentation (Polson, Scott, and Windle 2013) for computational efficiency (Pólya-Gamma is where the PG comes from in all spOccupancy model fitting function names). In this vignette, we will provide a brief description of each model, with full statistical details provided in a separate MCMC sampler vignette. We will also show how spOccupancy provides functions for posterior predictive checks as a Goodness of Fit assessment, model comparison and assessment using the Widely Applicable Information Criterion (WAIC), k-fold cross-validation, and out of sample predictions using standard R helper functions (e.g., predict()).

To get started, we load the spOccupancy package, as well as the coda package, which we will use for some MCMC summary and diagnostics. We will also use the stars and ggplot2 packages to create some basic plots of our results. We then set a seed so you can reproduce the same results as we do.

library(spOccupancy)
library(coda)
library(stars)
library(ggplot2)
set.seed(102)

### Example data set: Foliage-gleaning birds at Hubbard Brook

As an example data set throughout this vignette, we will use data from twelve foliage-gleaning bird species collected from point count surveys at Hubbard Brook Experimental Forest (HBEF) in New Hampshire, USA. Specific details on the data set, which is just a small subset from a long-term survey, are available on the Hubbard Brook website and Doser et al. (2022). The data are provided as part of the spOccupancy package and are loaded with data(hbef2015). Point count surveys were conducted at 373 sites over three replicates, each of 10 minutes in length and with a detection radius of 100m. In the data set provided here, we converted these data to detection-nondetection data. Some sites were not visited for all three replicates. Additional information on the data set (including individual species in the data set) can be viewed in the man page using help(hbef2015).

data(hbef2015) # Load the data set.
str(hbef2015) # Get an overview of what's in the data.
List of 4
$y : num [1:12, 1:373, 1:3] 0 0 0 0 0 1 0 0 0 0 ... ..- attr(*, "dimnames")=List of 3 .. ..$ : chr [1:12] "AMRE" "BAWW" "BHVI" "BLBW" ...
.. ..$: chr [1:373] "1" "2" "3" "4" ... .. ..$ : chr [1:3] "1" "2" "3"
$occ.covs: num [1:373, 1] 475 494 546 587 588 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : NULL
.. ..$: chr "Elevation"$ det.covs:List of 2
..$day: num [1:373, 1:3] 156 156 156 156 156 156 156 156 156 156 ... ..$ tod: num [1:373, 1:3] 330 346 369 386 409 425 447 463 482 499 ...
$coords : num [1:373, 1:2] 280000 280000 280000 280001 280000 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : chr [1:373] "1" "2" "3" "4" ...
.. ..$: chr [1:2] "X" "Y" Thus, the object hbef2015 is a list comprised of the detection-nondetection data (y), covariates on the occurrence portion of the model (occ.covs), covariates on the detection portion of the model (det.covs), and the spatial coordinates of each site (coords) for use in spatial occupancy models and in plotting. This list is in the exact format required for input to spOccupancy model functions. hbef2015 contains data on 12 species in the three-dimensional array y, where the dimensions of y correspond to species (12), sites (373), and replicates (3). For single-species occupancy models in Section 2 and 3, we will only use data on the charming Ovenbird (OVEN; Seiurus aurocapilla), so we next subset the hbef2015 list to only include data from OVEN in a new object ovenHBEF. sp.names <- dimnames(hbef2015$y)[[1]]
ovenHBEF <- hbef2015
ovenHBEF$y <- ovenHBEF$y[sp.names == "OVEN", , ]
table(ovenHBEF$y) # Quick summary.  0 1 518 588  We see that OVEN is detected at a little over half of all site-replicate combinations. ## Single-species occupancy models ### Basic model description Let $$z_j$$ be the true presence (1) or absence (0) of a species at site $$j$$, with $$j = 1, \dots, J$$. For our OVEN example, $$J = 373$$. Following the basic occupancy model (MacKenzie et al. 2002; Tyre et al. 2003), we assume this latent occurrence variable arises from a Bernoulli process following $$$\begin{split} &z_j \sim \text{Bernoulli}(\psi_j), \\ &\text{logit}(\psi_j) = \boldsymbol{x}^{\top}_j\boldsymbol{\beta}, \end{split}$$$ where $$\psi_j$$ is the probability of occurrence at site $$j$$, which is a function of site-specific covariates $$\boldsymbol{X}$$ and a vector of regression coefficients ($$\boldsymbol{\beta}$$). We do not directly observe $$z_j$$, but rather we observe an imperfect representation of the latent occurrence process as a result of imperfect detection (i.e., the failure to detect a species at a site when it is truly present). Let $$y_{j, k}$$ be the observed detection (1) or nondetection (0) of a species of interest at site $$j$$ during replicate $$k$$ for each of $$k = 1, \dots, K_j$$ replicates. Note that the number of replicates, $$K_j$$, can vary by site and in practical applications will often be equal to 1 for a subset of sites (i.e., some sites will have no replicate surveys). For our OVEN example, the maximum value of $$K_j$$ is three. We envision the detection-nondetection data as arising from a Bernoulli process conditional on the true latent occurrence process: $$$\begin{split} &y_{j, k} \sim \text{Bernoulli}(p_{j, k}z_j), \\ &\text{logit}(p_{j, k}) = \boldsymbol{v}^{\top}_{j, k}\boldsymbol{\alpha}, \end{split}$$$ where $$p_{j, k}$$ is the probability of detecting a species at site $$j$$ during replicate $$k$$ (given it is present at site $$j$$), which is a function of site and replicate specific covariates $$\boldsymbol{V}$$ and a vector of regression coefficients ($$\boldsymbol{\alpha}$$). To complete the Bayesian specification of the model, we assign multivariate normal priors for the occurrence ($$\boldsymbol{\beta}$$) and detection ($$\boldsymbol{\alpha}$$) regression coefficients. To yield an efficient implementation of the occupancy model using a logit link function, we use Pólya-Gamma data augmentation (Polson, Scott, and Windle 2013), which is described in depth in a separate MCMC sampler vignette. ### Fitting single-species occupancy models with PGOcc() The PGOcc() function fits single-species occupancy models using Pólya-Gamma latent variables, which makes it more efficient than standard Bayesian implementations of occupancy models using a logit link function (Polson, Scott, and Windle 2013; Clark and Altwegg 2019). PGOcc() has the following arguments: PGOcc(occ.formula, det.formula, data, inits, priors, n.samples, n.omp.threads = 1, verbose = TRUE, n.report = 100, n.burn = round(.10 * n.samples), n.thin = 1, n.chains = 1, k.fold, k.fold.threads = 1, k.fold.seed, ...) The first two arguments, occ.formula and det.formula, use standard R model syntax to denote the covariates to be included in the occurrence and detection portions of the model, respectively. Only the right hand side of the formulas are included. Random intercepts can be included in both the occurrence and detection portions of the single-species occupancy model using lme4 syntax (Bates et al. 2015). For example, to include a random intercept for different observers in the detection portion of the model, we would include (1 | observer) in the det.formula, where observer indicates the specific observer for each data point. The names of variables given in the formulas should correspond to those found in data, which is a list consisting of the following tags: y (detection-nondetection data), occ.covs (occurrence covariates), det.covs (detection covariates). y should be stored as a sites x replicate matrix, occ.covs as a matrix or data frame with site-specific covariate values, and det.covs as a list with each list element corresponding to a covariate to include in the detection portion of the model. Covariates on detection can vary by site and/or survey, and so these covariates may be specified as a site by survey matrix for survey-level covariates or as a one-dimensional vector for survey level covariates. The ovenHBEF list is already in the required format. Here we will model OVEN occurrence as a function of linear and quadratic elevation and will include three observational covariates (linear and quadratic day of survey, time of day of survey) on the detection portion of the model. We standardize all covariates by using the scale() function in our model specification, and use the I() function to specify quadratic effects: oven.occ.formula <- ~ scale(Elevation) + I(scale(Elevation)^2) oven.det.formula <- ~ scale(day) + scale(tod) + I(scale(day)^2) # Check out the format of ovenHBEF str(ovenHBEF) List of 4$ y       : num [1:373, 1:3] 1 1 0 1 0 0 1 0 1 1 ...
..- attr(*, "dimnames")=List of 2
.. ..$: chr [1:373] "1" "2" "3" "4" ... .. ..$ : chr [1:3] "1" "2" "3"
$occ.covs: num [1:373, 1] 475 494 546 587 588 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : NULL
.. ..$: chr "Elevation"$ det.covs:List of 2
..$day: num [1:373, 1:3] 156 156 156 156 156 156 156 156 156 156 ... ..$ tod: num [1:373, 1:3] 330 346 369 386 409 425 447 463 482 499 ...
$coords : num [1:373, 1:2] 280000 280000 280000 280001 280000 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : chr [1:373] "1" "2" "3" "4" ...
.. ..$: chr [1:2] "X" "Y" Next, we specify the initial values for the MCMC sampler in inits. PGOcc() (and all other spOccupancy model fitting functions) will set initial values by default, but here we will do this explicitly, since in more complicated cases setting initial values close to the presumed solutions can be vital for success of an MCMC-based analysis. For all models described in this vignette (in particular the non-spatial models), choice of the initial values is largely inconsequential, with the exception being that specifying initial values close to the presumed solutions can decrease the amount of samples you need to run to arrive at convergence of the MCMC chains. Thus, when first running a model in spOccupancy, we recommend fitting the model using the default initial values that spOccupancy provides, which are based on the prior distributions. After running the model for a reasonable period, if you find the chains are taking a long time to reach convergence, you then may wish to set the initial values to the mean estimates of the parameters from the initial model fit, as this will likely help reduce the amount of time you need to run the model. The default initial values for occurrence and detection regression coefficients (including the intercepts) are random values from the prior distributions adopted in the model, while the default initial values for the latent occurrence effects are set to 1 if the species was observed at a site and 0 otherwise. Initial values are specified in a list with the following tags: z (latent occurrence values), alpha (detection intercept and regression coefficients), and beta (occurrence intercept and regression coefficients). Below we set all initial values of the regression coefficients to 0, and set initial values for z based on the detection-nondetection data matrix. For the occurrence (beta) and detection (alpha) regression coefficients, the initial values are passed either as a vector of length equal to the number of estimated parameters (including an intercept, and in the order specified in the model formula), or as a single value if setting the same initial value for all parameters (including the intercept). Below we take the latter approach. To specify the initial values for the latent occurrence at each site (z), we must ensure we set the value to 1 at all sites where OVEN was detected at least once, because if we detect OVEN at a site then the value of z is 1 with complete certainty (under the assumption of the model that there are no false positives). If the initial value for z is set to 0 at one or more sites when the species was detected, the occupancy model will fail. spOccupancy will provide a clear error message if the supplied initial values for z are invalid. Below we use the raw detection-nondetection data and the apply() function to set the initial values to 1 if OVEN was detected at that site and 0 otherwise. # Format with explicit specification of inits for alpha and beta # with four detection parameters and three occurrence parameters # (including the intercept). oven.inits <- list(alpha = c(0, 0, 0, 0), beta = c(0, 0, 0), z = apply(ovenHBEF$y, 1, max, na.rm = TRUE))
# Format with abbreviated specification of inits for alpha and beta.
oven.inits <- list(alpha = 0,
beta = 0,
z = apply(ovenHBEF$y, 1, max, na.rm = TRUE)) We next specify the priors for the occurrence and detection regression coefficients. The Pólya-Gamma data augmentation algorithm employed by spOccupancy assumes normal priors for both the detection and occurrence regression coefficients. These priors are specified in a list with tags beta.normal for occurrence and alpha.normal for detection parameters (including intercepts). Each list element is then itself a list, with the first element of the list consisting of the hypermeans for each coefficient and the second element of the list consisting of the hypervariances for each coefficient. Alternatively, the hypermean and hypervariances can be specified as a single value if the same prior is used for all regression coefficients. By default, spOccupancy will set the hypermeans to 0 and the hypervariances to 2.72, which corresponds to a relatively flat prior on the probability scale (0, 1; Lunn et al. (2013)). Broms, Hooten, and Fitzpatrick (2016), Northrup and Gerber (2018), and others show such priors are an adequate choice when the goal is to specify relatively non-informative priors. We will use these default priors here, but we specify them explicitly below for clarity oven.priors <- list(alpha.normal = list(mean = 0, var = 2.72), beta.normal = list(mean = 0, var = 2.72)) Our last step is to specify the number of samples to produce with the MCMC algorithm (n.samples), the length of burn-in (n.burn), the rate at which we want to thin the posterior samples (n.thin), and the number of MCMC chains to run (n.chains). Note that currently spOccupancy runs multiple chains sequentially and does not allow chains to be run simultaneously in parallel across multiple threads. Instead, we allow for within-chain parallelization using the n.omp.threads argument. We can set n.omp.threads to a number greater than 1 and smaller than the number of threads on the computer you are using. Generally, setting n.omp.threads > 1 will not result in decreased run times for non-spatial models in spOccupancy, but can substantially decrease run time when fitting spatial models (Finley, Datta, and Banerjee 2020). Here we set n.omp.threads = 1. For a simple single-species occupancy model, we shouldn’t need too many samples and will only need a small amount of burn-in and thinning. We will run the model using three chains to assess convergence using the Gelman-Rubin diagnostic (Rhat; Brooks and Gelman (1998)). n.samples <- 5000 n.burn <- 3000 n.thin <- 2 n.chains <- 3 We are now nearly set to run the occupancy model. The verbose argument is a logical value indicating whether or not MCMC sampler progress is reported to the screen. If verbose = TRUE, sampler progress is reported after every multiple of the specified number of iterations in the n.report argument. We set verbose = TRUE and n.report = 1000 to report progress after every 1000th MCMC iteration. The last three arguments to PGOcc() (k.fold, k.fold.threads, k.fold.seed) are used for performing k-fold cross-validation for model assessment, which we will illustrate in a subsequent section. For now, we won’t specify the arguments, which will tell PGOcc() not to perform k-fold cross-validation. out <- PGOcc(occ.formula = oven.occ.formula, det.formula = oven.det.formula, data = ovenHBEF, inits = oven.inits, n.samples = n.samples, priors = oven.priors, n.omp.threads = 1, verbose = TRUE, n.report = 1000, n.burn = n.burn, n.thin = n.thin, n.chains = n.chains) ---------------------------------------- Preparing to run the model ---------------------------------------- ---------------------------------------- Model description ---------------------------------------- Occupancy model with Polya-Gamma latent variable fit with 373 sites. Samples per Chain: 5000 Burn-in: 3000 Thinning Rate: 2 Number of Chains: 3 Total Posterior Samples: 3000 Source compiled with OpenMP support and model fit using 1 thread(s). ---------------------------------------- Chain 1 ---------------------------------------- Sampling ... Sampled: 1000 of 5000, 20.00% ------------------------------------------------- Sampled: 2000 of 5000, 40.00% ------------------------------------------------- Sampled: 3000 of 5000, 60.00% ------------------------------------------------- Sampled: 4000 of 5000, 80.00% ------------------------------------------------- Sampled: 5000 of 5000, 100.00% ---------------------------------------- Chain 2 ---------------------------------------- Sampling ... Sampled: 1000 of 5000, 20.00% ------------------------------------------------- Sampled: 2000 of 5000, 40.00% ------------------------------------------------- Sampled: 3000 of 5000, 60.00% ------------------------------------------------- Sampled: 4000 of 5000, 80.00% ------------------------------------------------- Sampled: 5000 of 5000, 100.00% ---------------------------------------- Chain 3 ---------------------------------------- Sampling ... Sampled: 1000 of 5000, 20.00% ------------------------------------------------- Sampled: 2000 of 5000, 40.00% ------------------------------------------------- Sampled: 3000 of 5000, 60.00% ------------------------------------------------- Sampled: 4000 of 5000, 80.00% ------------------------------------------------- Sampled: 5000 of 5000, 100.00% names(out) # Look at the contents of the resulting object.  [1] "rhat" "beta.samples" "alpha.samples" "z.samples" [5] "psi.samples" "like.samples" "ESS" "X" [9] "X.p" "X.re" "X.p.re" "lambda.p" [13] "y" "n.samples" "call" "n.post" [17] "n.thin" "n.burn" "n.chains" "pRE" [21] "psiRE" "run.time"  PGOcc() returns a list of class PGOcc with a suite of different objects, many of them being coda::mcmc objects of posterior samples. Note the “Preparing to run the model” printed section doesn’t have any information shown in it. spOccupancy model fitting functions will present messages when preparing the data for the model in this section, or will print out the default priors or initial values used when they are not specified in the function call. Here we specified everything explicitly so no information was reported. For a concise yet informative summary of the regression parameters and convergence of the MCMC chains, we can use summary() on the resulting PGOcc() object. summary(out)  Call: PGOcc(occ.formula = oven.occ.formula, det.formula = oven.det.formula, data = ovenHBEF, inits = oven.inits, priors = oven.priors, n.samples = n.samples, n.omp.threads = 1, verbose = TRUE, n.report = 1000, n.burn = n.burn, n.thin = n.thin, n.chains = n.chains) Samples per Chain: 5000 Burn-in: 3000 Thinning Rate: 2 Number of Chains: 3 Total Posterior Samples: 3000 Run Time (min): 0.137 Occurrence (logit scale): Mean SD 2.5% 50% 97.5% Rhat ESS (Intercept) 2.1370 0.2728 1.6490 2.1188 2.7079 1.0001 834 scale(Elevation) -1.7024 0.2983 -2.4095 -1.6701 -1.2336 1.0100 402 I(scale(Elevation)^2) -0.3992 0.2038 -0.7687 -0.4103 0.0481 1.0030 645 Detection (logit scale): Mean SD 2.5% 50% 97.5% Rhat ESS (Intercept) 0.7942 0.1269 0.5394 0.7940 1.0407 1.0018 2694 scale(day) -0.0904 0.0773 -0.2450 -0.0910 0.0587 1.0021 3814 scale(tod) -0.0460 0.0781 -0.2008 -0.0457 0.1030 0.9996 3000 I(scale(day)^2) 0.0238 0.0966 -0.1625 0.0231 0.2209 1.0053 3000 Note that all coefficients are printed on the logit scale. We see OVEN is fairly widespread (at the scale of the survey) in the forest given the large intercept value, which translates to a value of 0.89 on the probability scale (run plogis(2.13)). In addition, the negative linear and quadratic terms for Elevation suggest occurrence probability peaks at mid-elevations. On average, detection probability is about 0.69. In principle, before we even look at the parameter summaries, we ought to ensure that the MCMC chains have reached convergence. The summary function also returns the Gelman-Rubin diagnostic (Brooks and Gelman 1998) and the effective samples size (ESS) of the posterior samples, which we can use to assess convergence. Here we see all Rhat values are less than 1.1 and the ESS values indicate adequate mixing of the MCMC chains. Additionally, the posterior samples in the PGOcc object are coda::mcmc objects, which we can quickly assess for convergence visually using trace plots. plot(out$beta.samples, density = FALSE) # Occupancy parameters.

plot(outalpha.samples, density = FALSE) # Detection parameters. ### Posterior predictive checks The function ppcOcc() performs a posterior predictive check on all spOccupancy model objects as a Goodness of Fit (GoF) assessment. The fundamental idea of GoF testing is that a good model should generate data that closely align with the observed data. If there are drastic differences in the true data from the model generated data, our model is likely not very useful (Hobbs and Hooten 2015). GoF assessments are more complicated using binary data, like detection-nondetection used in occupancy models, as standard approaches are not valid assessments for the raw data in binary response models such as occupancy models (Broms, Hooten, and Fitzpatrick 2016; McCullagh and Nelder 2019). Thus, any approach to assess model fit for detection-nondetection data must bin, or aggregate, the raw values in some manner, and then perform a model fit assessment on the binned values. There are numerous ways we could envision binning the raw detection-nondetection values (MacKenzie and Bailey 2004; Kéry and Royle 2016). In spOccupancy, a posterior predictive check broadly takes the following steps: 1. Fit the model using any of the model-fitting functions (here PGOcc()), which generates replicated values for all detection-nondetection data points. 2. Bin both the actual and the replicated detection-nondetection data in a suitable manner, such as by site or replicate (MacKenzie and Bailey 2004). 3. Compute a fit statistic on both the actual data and also on the model-generated ‘replicate data’. 4. Compare the fit statistics for the true data and replicate data. If they are widely different, this suggests a lack of fit of the model to the actual data set at hand. To perform a posterior predictive check, we send the resulting PGOcc model object as input to the ppcOcc() function, along with a fit statistic (fit.stat) and a numeric value indicating how to group, or bin, the data (group). Currently supported fit statistics include the Freeman-Tukey statistic and the Chi-Squared statistic (freeman-tukey or chi-squared, respectively, Kéry and Royle (2016)). Currently, ppcOcc() allows the user to group the data by row (site; group = 1) or column (replicate; group = 2). ppcOcc() will then return a set of posterior samples for the fit statistic (or discrepancy measure) using the actual data (fit.y) and model generated replicate data set (fit.y.rep), summed across all data points in the chosen manner. We generally recommend performing a posterior predictive check when grouping data both across sites (group = 1) as well as across replicates (group = 2), as they may reveal (or fail to reveal) different inadequacies of the model for the specific data set at hand (Kéry and Royle 2016). In particular, binning the data across sites (group = 1) may help reveal whether the model fails to adequately represent variation in occurrence and detection probability across space, while binning the data across replicates (group = 2) may help reveal whether the model fails to adequately represent variation in detection probability across the different replicate surveys. Similarly, we suggest exploring posterior predictive checks using both the Freeman-Tukey Statistic as well as the Chi-Squared statistic. Generally, the more ways we explore the fit of our model to the data, the more confidence we have that our model is adequately representing the data and we can draw reasonable conclusions from it. By performing posterior predictive checks with the two fit statistics and two ways of grouping the data, spOccupancy provides us with four different measures that we can use to assess the validity of our model. Throughout this vignette, we will display different types of posterior predictive checks using different combinations of the fit statistic and grouping approach. In a more complete analysis, we would run all four types of posterior predictive checks for each model we fit as a more complete GoF assessment. The resulting values from a call to ppcOcc() can be used with the summary() function to generate a Bayesian p-value, which is the probability, under the fitted model, to obtain a value of the fit statistic that is more extreme (i.e., larger) than the one observed, i.e., for the actual data. Bayesian p-values are sensitive to individual values, so we should also explore the discrepancy measures for each “grouped” data point. ppcOcc() returns a matrix of posterior quantiles for the fit statistic for both the observed (fit.y.group.quants) and model generated, replicate data (fit.y.rep.group.quants) for each “grouped” data point. We next perform a posterior predictive check using the Freeman-Tukey statistic grouping the data by sites. We summarize the posterior predictive check with the summary() function, which reports a Bayesian p-value. A Bayesian p-value that hovers around 0.5 indicates adequate model fit, while values less than 0.1 or greater than 0.9 suggest our model does not fit the data well (Hobbs and Hooten 2015). As always with a simulation-based analysis using MCMC, you will get numerically slightly different values. ppc.out <- ppcOcc(out, fit.stat = 'freeman-tukey', group = 1) summary(ppc.out)  Call: ppcOcc(object = out, fit.stat = "freeman-tukey", group = 1) Samples per Chain: 5000 Burn-in: 3000 Thinning Rate: 2 Number of Chains: 3 Total Posterior Samples: 3000 Bayesian p-value: 0.2403 Fit statistic: freeman-tukey  The Bayesian p-value is the proportion of posterior samples of the fit statistic of the model generated data that are greater than the corresponding fit statistic of the true data, summed across all “grouped” data points. We can create a visual representation of the Bayesian p-value as follows (Kéry and Royle 2016): ppc.df <- data.frame(fit = ppc.outfit.y,
fit.rep = ppc.out$fit.y.rep, color = 'lightskyblue1') ppc.df$color[ppc.df$fit.rep > ppc.df$fit] <- 'lightsalmon'
plot(ppc.df$fit, ppc.df$fit.rep, bg = ppc.df$color, pch = 21, ylab = 'Fit', xlab = 'True') lines(ppc.df$fit, ppc.df$fit, col = 'black') Our Bayesian p-value is greater than 0.1 indicating no lack of fit, although the above plot shows most of the fit statistics are smaller for the replicate data than the actual data set. Relying solely on the Bayesian p-value as an assessment of model fit is not always a great option, as individual data points can have an overbearing influence on the resulting summary value. Instead of summing across all data points for a single discrepancy measure, ppcOcc() also allows us to explore discrepancy measures on a “grouped” point by point basis. The resulting ppcOcc object will contain the objects fit.y.group.quants and fit.y.rep.group.quants, which contain quantiles of the posterior distributions for the discrepancy measures of each grouped data point. Below we plot the difference in the discrepancy measure between the replicate and actual data across each of the sites. diff.fit <- ppc.out$fit.y.rep.group.quants[3, ] - ppc.outfit.y.group.quants[3, ] plot(diff.fit, pch = 19, xlab = 'Site ID', ylab = 'Replicate - True Discrepancy') We see there are four sites that contribute to the overall GoF measure far more than in proportion to their number, i.e., whose discrepancy measure for the actual data is much larger than the corresponding discrepancy for the replicate data. Here we will ignore this, but in a real analysis it might be very insightful to further explore these sites to see what could explain this pattern (e.g., are the sites close together in space? Could the data be the result of erroneous recording, or of extraordinary local habitat that is not adequately described by the occurrence covariates?). Although rarely done, closer investigations of outlying values in statistical analyses may sometimes teach one more than mere acceptance of a fitting model. ### Model selection using WAIC and k-fold cross-validation Posterior predictive checks allow us to assess how well our model fits the data, but they are not very useful if we want to compare multiple competing models and ultimately select a final model based on some criterion. Bayesian model selection is very much a constantly changing field. See Hooten and Hobbs (2015) for an accessible overview of Bayesian model selection for ecologists. For Bayesian hierarchical models like occupancy models, the most common Bayesian model selection criterion, the deviance information criterion or DIC, is not applicable (Hooten and Hobbs 2015). Instead, the Widely Applicable Information Criterion (Watanabe 2010) is recommended to compare a set of models and select the best-performing model for final analysis. The WAIC is calculated for all spOccupancy model objects using the function waicOcc(). We calculate the WAIC as $\text{WAIC} = -2 \times (\text{elpd} - \text{pD}),$ where elpd is the expected log point-wise predictive density and PD is the effective number of parameters. We calculate elpd by calculating the likelihood for each posterior sample, taking the mean of these likelihood values, taking the log of the mean of the likelihood values, and summing these values across all sites. We calculate the effective number of parameters by calculating the variance of the log likelihood for each site taken over all posterior samples, and then summing these values across all sites. See Appendix S1 from Broms, Hooten, and Fitzpatrick (2016) for more details. We calculate the WAIC using waicOcc() for our OVEN model below (as always, note some slight differences with your solutions due to Monte Carlo error). waicOcc(out)  elpd pD WAIC -632.960543 6.290149 1278.501382  For illustration, let’s next rerun the OVEN model, but this time we assume occurrence is constant across the HBEF, and subsequently compare the WAIC value to the full model out.small <- PGOcc(occ.formula = ~ 1, det.formula = oven.det.formula, data = ovenHBEF, inits = oven.inits, n.samples = n.samples, priors = oven.priors, n.omp.threads = 1, verbose = FALSE, n.burn = n.burn, n.thin = n.thin, n.chains = n.chains) waicOcc(out.small)  elpd pD WAIC -692.046183 4.932945 1393.958257  Smaller values of WAIC indicate models with better performance. We see the WAIC for the model with elevation is smaller than the intercept-only model, indicating elevation is an important predictor for OVEN occurrence in HBEF. When focusing primarily on predictive performance, a k-fold cross-validation (CV) approach is another attractive, though computationally more intensive, alternative to compare a series of models, especially since WAIC may not always be reliable for occupancy models (Broms, Hooten, and Fitzpatrick 2016). In spOccupancy, k-fold cross-validation is accomplished using the arguments k.fold, k.fold.threads, and k.fold.seed in the model fitting function. A k-fold cross validation approach requires fitting a model $$k$$ times, where each time the model is fit using $$J / k$$ data points, where $$J$$ is the total number of sites surveyed at least once in the data set. Each time the model is fit, it uses a different portion of the data and then predicts the remaining $$J - J/k$$ hold out values. Because the data are not used to fit the model, CV yields true samples from the posterior predictive distribution that we can use to assess the predictive capacity of the model. As a measure of out-of-sample predictive performance, we use the deviance as a scoring rule following Hooten and Hobbs (2015). For K-fold cross-validation, it is computed as $$$-2 \sum_{k = 1}^K \text{log}\Bigg(\frac{\sum_{q = 1}^Q \text{Bernoulli}(\boldsymbol{y}_k \mid \boldsymbol{p}^{(q)}\boldsymbol{z}_k^{(q)})}{Q}\Bigg),$$$ where $$\boldsymbol{p}^{(q)}$$ and $$\boldsymbol{z}_k^{(q)}$$ are MCMC samples of detection probability and latent occurrence, respectively, arising from a model that is fit without the observations $$\boldsymbol{y}_k$$, and $$Q$$ is the total number of posterior samples produced from the MCMC sampler. The -2 is used so that smaller values indicate better model fit, which aligns with most information criteria used for model assessment (like the WAIC implemented using waicOcc()). The final three arguments in PGOcc() (k.fold, k.fold.threads, k.fold.seed) control whether or not k-fold cross validation is performed following the complete fit of the model using the entire data set. The k.fold argument indicates the number of $$k$$ folds to use for cross-validation. If k.fold is not specified, cross-validation is not performed and k.fold.threads and k.fold.seed are ignored. The k.fold.threads argument indicates the number of threads to use for running the $$k$$ models in parallel across multiple threads. Parallel processing is accomplished using the R packages foreach and doParallel. Specifying k.fold.threads > 1 can substantially decrease run time since it allows for models to be fit simultaneously on different threads rather than sequentially. The k.fold.seed indicates the seed used to randomly split the data into $$k$$ groups. This is by default set to 100. Below we refit the occupancy model with elevation (linear and quadratic) as an occurrence predictor this time performing 4-fold cross-validation. We set k.fold = 4 to perform 4-fold cross-validation and k.fold.threads = 1 to run the model using 1 thread. Normally we would set k.fold.threads = 4, but using multiple threads leads to complications when compiling this vignette, so we leave that to you to explore the computational improvements of performing cross-validation across multiple cores. out.k.fold <- PGOcc(occ.formula = oven.occ.formula, det.formula = oven.det.formula, data = ovenHBEF, inits = oven.inits, n.samples = n.samples, priors = oven.priors, n.omp.threads = 1, verbose = TRUE, n.report = 1000, n.burn = n.burn, n.thin = n.thin, n.chains = n.chains, k.fold = 4, k.fold.threads = 1) ---------------------------------------- Preparing to run the model ---------------------------------------- ---------------------------------------- Model description ---------------------------------------- Occupancy model with Polya-Gamma latent variable fit with 373 sites. Samples per Chain: 5000 Burn-in: 3000 Thinning Rate: 2 Number of Chains: 3 Total Posterior Samples: 3000 Source compiled with OpenMP support and model fit using 1 thread(s). ---------------------------------------- Chain 1 ---------------------------------------- Sampling ... Sampled: 1000 of 5000, 20.00% ------------------------------------------------- Sampled: 2000 of 5000, 40.00% ------------------------------------------------- Sampled: 3000 of 5000, 60.00% ------------------------------------------------- Sampled: 4000 of 5000, 80.00% ------------------------------------------------- Sampled: 5000 of 5000, 100.00% ---------------------------------------- Chain 2 ---------------------------------------- Sampling ... Sampled: 1000 of 5000, 20.00% ------------------------------------------------- Sampled: 2000 of 5000, 40.00% ------------------------------------------------- Sampled: 3000 of 5000, 60.00% ------------------------------------------------- Sampled: 4000 of 5000, 80.00% ------------------------------------------------- Sampled: 5000 of 5000, 100.00% ---------------------------------------- Chain 3 ---------------------------------------- Sampling ... Sampled: 1000 of 5000, 20.00% ------------------------------------------------- Sampled: 2000 of 5000, 40.00% ------------------------------------------------- Sampled: 3000 of 5000, 60.00% ------------------------------------------------- Sampled: 4000 of 5000, 80.00% ------------------------------------------------- Sampled: 5000 of 5000, 100.00% ---------------------------------------- Cross-validation ---------------------------------------- Performing 4-fold cross-validation using 1 thread(s). We subsequently refit the intercept only occupancy model, and compare the deviance metrics from the 4-fold cross-validation. # Model fitting information is suppressed for space. out.int.k.fold <- PGOcc(occ.formula = ~ 1, det.formula = oven.det.formula, data = ovenHBEF, inits = oven.inits, n.samples = n.samples, priors = oven.priors, n.omp.threads = 1, verbose = FALSE, n.report = 1000, n.burn = n.burn, n.thin = n.thin, n.chains = n.chains, k.fold = 4, k.fold.threads = 1) The cross-validation metric (model deviance) is stored in the k.fold.deviance tag of the resulting model object. out.k.foldk.fold.deviance # Larger model. 
[1] 1510.854
out.int.k.fold$k.fold.deviance # Intercept-only model. [1] 1532.225 Similar to the results from the WAIC, CV also suggests that the model including elevation with a predictor outperforms the intercept only model. ### Prediction All model objects resulting from a call to spOccupancy model-fitting functions can be used with predict() to generate a series of posterior predictive samples at new locations, given the values of all covariates used in the model fitting process. The object hbefElev (which comes as part of the spOccupancy package) contains elevation data at a 30x30m resolution from the National Elevation Data set across the entire HBEF. We load the data below data(hbefElev) str(hbefElev) 'data.frame': 46090 obs. of 3 variables:$ val     : num  914 916 918 920 922 ...
$Easting : num 276273 276296 276318 276340 276363 ...$ Northing: num  4871424 4871424 4871424 4871424 4871424 ...

The column val contains the elevation values, while Easting and Northing contain the spatial coordinates that we will use for plotting. We can obtain posterior predictive samples for the occurrence probabilities at these sites by using the predict() function and our PGOcc fitted model object. Given that we standardized the elevation values when we fit the model, we need to standardize the elevation values for prediction using the exact same values of the mean and standard deviation of the elevation values used to fit the data.

elev.pred <- (hbefElev$val - mean(ovenHBEF$occ.covs[, 1])) / sd(ovenHBEF$occ.covs[, 1]) # These are the new intercept and covariate data. X.0 <- cbind(1, elev.pred, elev.pred^2) out.pred <- predict(out, X.0) For PGOcc objects, the predict function takes two arguments: (1) the PGOcc fitted model object; and (2) a matrix or data frame consisting of the design matrix for the prediction locations (which must include and intercept if our model contained one). The resulting object consists of posterior predictive samples for the latent occurrence probabilities (psi.0.samples) and latent occurrence values (z.0.samples). The beauty of the Bayesian paradigm, and the MCMC computing machinery, is that these predictions all have fully propagated uncertainty. We can use these values to create plots of the predicted mean occurrence values, as well as of their standard deviation as a measure of the uncertainty in these predictions akin to a standard error associated with maximum likelihood estimates. We could also produce a map for the Bayesian credible interval length as an alternative measure of prediction uncertainty or two maps, one showing the lower and the other the upper limit of, say, a 95% credible interval. plot.dat <- data.frame(x = hbefElev$Easting,
y = hbefElev$Northing, mean.psi = apply(out.pred$psi.0.samples, 2, mean),
sd.psi = apply(out.pred$psi.0.samples, 2, sd), stringsAsFactors = FALSE) # Make a species distribution map showing the point estimates, # or predictions (posterior means) dat.stars <- st_as_stars(plot.dat, dims = c('x', 'y')) ggplot() + geom_stars(data = dat.stars, aes(x = x, y = y, fill = mean.psi)) + scale_fill_viridis_c(na.value = 'transparent') + labs(x = 'Easting', y = 'Northing', fill = '', title = 'Mean OVEN occurrence probability') + theme_bw() # Map of the associated uncertainty of these predictions # (i.e., posterior sds) ggplot() + geom_stars(data = dat.stars, aes(x = x, y = y, fill = sd.psi)) + scale_fill_viridis_c(na.value = 'transparent') + labs(x = 'Easting', y = 'Northing', fill = '', title = 'SD OVEN occurrence probability') + theme_bw() ## Single-species spatial occupancy models ### Basic model description When working across large spatial domains, accounting for residual spatial autocorrelation in species distributions can often improve predictive performance of a model, leading to more accurate species distribution maps (Guélat and Kéry 2018; Lany et al. 2020). We here extend the basic single-species occupancy model to incorporate a spatial Gaussian Process that accounts for unexplained spatial variation in species occurrence across a region of interest. Let $$\boldsymbol{s}_j$$ denote the geographical coordinates of site $$j$$ for $$j = 1, \dots, J$$. In all spatially-explicit models, we include $$\boldsymbol{s}_j$$ directly in the notation of spatially-indexed variables to indicate the model is spatially-explicit. More specifically, the occurrence probability at site $$j$$ with coordinates $$\boldsymbol{s}_j$$, $$\psi(\boldsymbol{s}_j)$$, now takes the form $$$\text{logit}(\psi(\boldsymbol{s}_j) = \boldsymbol{x}(\boldsymbol{s}_j)^{\top}\boldsymbol{\beta} + \omega(\boldsymbol{s}_j),$$$ where $$\omega_j$$ is a realization from a zero-mean spatial Gaussian Process, i.e., $$$\boldsymbol{\omega}(\boldsymbol{s}) \sim N(\boldsymbol{0}, \boldsymbol{\Sigma}(\boldsymbol{\boldsymbol{s}, \boldsymbol{s}', \theta})).$$$ We define $$\boldsymbol{\Sigma}(\boldsymbol{s}, \boldsymbol{s}', \boldsymbol{\theta})$$ as a $$J \times J$$ covariance matrix that is a function of the distances between any pair of site coordinates $$\boldsymbol{s}$$ and $$\boldsymbol{s}'$$ and a set of parameters $$(\boldsymbol{\theta})$$ that govern the spatial process. The vector $$\boldsymbol{\theta}$$ is equal to $$\boldsymbol{\theta} = \{\sigma^2, \phi, \nu\}$$, where $$\sigma^2$$ is a spatial variance parameter, $$\phi$$ is a spatial decay parameter, and $$\nu$$ is a spatial smoothness parameter. $$\nu$$ is only specified when using a Matern correlation function. The detection portion of the occupancy model remains unchanged from the non-spatial occupancy model. Single-species spatial occupancy models, like all models in spOccupancy are fit using Pólya-Gamma data augmentation (see MCMC sampler vignette for details). When the number of sites is moderately large, say 1000, the above described spatial Gaussian process model can become drastically slow as a result of the need to take the inverse of the spatial covariance matrix $$\boldsymbol{\Sigma}(\boldsymbol{s}, \boldsymbol{s}', \boldsymbol{\theta})$$ at each MCMC iteration. Numerous approximation methods exist to reduce this computational cost (see Heaton et al. (2019) for an overview and comparison of multiple methods). One attractive approach is the Nearest Neighbor Gaussian Process (NNGP; Datta et al. (2016)). Instead of modeling the spatial process using a full Gaussian Process, we replace the Gaussian Process prior specification with a NNGP, which leads to drastic improvements of run time with nearly identical inference and prediction as the full Gaussian Process specification. See Datta et al. (2016), Finley et al. (2019), and the MCMC sampler vignette for additional statistical details on NNGPs and their implementation in spatial occupancy models. ### Fitting single-species spatial occupancy models with spPGOcc() The function spPGOcc() fits single-species spatial occupancy models using Pólya-Gamma latent variables, where spatial autocorrelation is accounted for using a spatial Gaussian Process. spPGOcc() fits spatial occupancy models using either a full Gaussian process or an NNGP. See Finley, Datta, and Banerjee (2020) for details on using NNGPs with Pólya-Gamma latent variables. We will fit the same occupancy model for OVEN that we fit previously using PGOcc(), but we will now make the model spatially explicit by incorporating a spatial process with spPGOcc(). First, let’s take a look at the arguments for spPGOcc(): spPGOcc(occ.formula, det.formula, data, inits, n.batch, batch.length, accept.rate = 0.43, priors, cov.model = "exponential", tuning, n.omp.threads = 1, verbose = TRUE, NNGP = FALSE, n.neighbors = 15, search.type = "cb", n.report = 100, n.burn = round(.10 * n.batch * batch.length), n.thin = 1, n.chains = 1, k.fold, k.fold.threads = 1, k.fold.seed = 100, ...) We will walk through each of the arguments to spPGOcc() in the context of our Ovenbird example. The occurrence (occ.formula) and detection (det.formula) formulas, as well as the list of data (data), take the same form as we saw in PGOcc, with random intercepts allowed in both the occurrence and detection models. Notice the coords matrix in the ovenHBEF list of data. We did not use this for PGOcc() but specifying the spatial coordinates in data is required for all spatially explicit models in spOccupancy. oven.occ.formula <- ~ scale(Elevation) + I(scale(Elevation)^2) oven.det.formula <- ~ scale(day) + scale(tod) + I(scale(day)^2) str(ovenHBEF) # coords is required for spPGOcc. List of 4$ y       : num [1:373, 1:3] 1 1 0 1 0 0 1 0 1 1 ...
..- attr(*, "dimnames")=List of 2
.. ..$: chr [1:373] "1" "2" "3" "4" ... .. ..$ : chr [1:3] "1" "2" "3"
$occ.covs: num [1:373, 1] 475 494 546 587 588 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : NULL
.. ..$: chr "Elevation"$ det.covs:List of 2
..$day: num [1:373, 1:3] 156 156 156 156 156 156 156 156 156 156 ... ..$ tod: num [1:373, 1:3] 330 346 369 386 409 425 447 463 482 499 ...
$coords : num [1:373, 1:2] 280000 280000 280000 280001 280000 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : chr [1:373] "1" "2" "3" "4" ...
.. ..$: chr [1:2] "X" "Y" The initial values (inits) are again specified in a list. Valid tags for initial values now additionally include the parameters associated with the spatial random effects. These include: sigma.sq (spatial variance parameter), phi (spatial range parameter), w (the latent spatial random effects at each site), and nu (spatial smoothness parameter), where the latter is only specified if adopting a Matern covariance function (i.e., cov.model = 'matern'). spOccupancy supports four spatial covariance models (exponential, spherical, gaussian, and matern), which are specified in the cov.model argument. Throughout this vignette, we will use an exponential covariance model, which we often use as our default covariance model when fitting spatially-explicit models and is commonly used throughout ecology. To determine which covariance function to use, we can fit models with the different covariance functions and compare them using WAIC or k-fold cross-validation to select the best performing function. We will note that the Matern covariance function has the additional spatial smoothness parameter $$\nu$$ and thus can often be more flexible than the other functions. However, because we need to estimate an additional parameter, this also tends to require more data (i.e., a larger number of sites) than the other covariance functions, and so we encourage use of the three simpler functions if your data set is sparse. We note that model estimates are generally fairly robust to the different covariance functions, although certain functions may provide substantially better estimates depending on the specific form of the underlying spatial autocorrelation in the data. For example, the Gaussian covariance function is often useful for accounting for spatial autocorrelation that is very smooth (i.e., long range spatial dependence). See Chapter 2 in Banerjee, Carlin, and Gelfand (2003) for a more thorough discussion of these functions and their mathematical properties. The default initial values for phi, sigma.sq, and nu are all set to random values from the prior distribution, which generally will be sufficient for the models discussed in this vignette. In all spatially-explicit models described in this vignette, the spatial range parameter phi is the most sensitive to initial values. In general, the spatial range parameter will often have poor mixing and take longer to converge than the rest of the parameters in the model, so specifying an initial value that is reasonably close to the resulting value can really help decrease run times for complicated models. As an initial value for the spatial range parameter phi, we compute the mean distance between points in HBEF and then set it equal to 3 divided by this mean distance. When using an exponential covariance function, $$\frac{3}{\phi}$$ is the effective range, or the distance at which the residual spatial correlation between two sites drops to 0.05 (Banerjee, Carlin, and Gelfand 2003). Thus our initial guess for this effective range is the average distance between sites across HBEF. As with all other parameters, we generally recommend using the default initial values for an initial model run, and if the model is taking a very long time to converge you can rerun the model with initial values based on the posterior means of estimated parameters from the initial model fit. For the spatial variance parameter sigma.sq, we set the initial value to 2. This corresponds to a fairly small spatial variance, which we expect based on our previous work with this data set. Further, we set the initial values of the latent spatial random effects at each site to 0. The initial values for these random effects has an extremely small influence on the model results, so we generally recommend setting their initial values to 0 as we have done here (this is also the default). However, if you are running your model for a very long time and are seeing very slow convergence of the MCMC chains, setting the initial values of the spatial random effects to the mean estimates from a previous run of the model could help reach convergence faster. # Pair-wise distances between all sites dist.hbef <- dist(ovenHBEF$coords)
# Exponential covariance model
cov.model <- "exponential"
# Specify list of inits
oven.inits <- list(alpha = 0,
beta = 0,
z = apply(ovenHBEF$y, 1, max, na.rm = TRUE), sigma.sq = 2, phi = 3 / mean(dist.hbef), w = rep(0, nrow(ovenHBEF$y)))

The next three arguments (n.batch, batch.length, and accept.rate) are all related to the specific type of MCMC sampler we use when we fit the model. The spatial range parameter (and the spatial smoothness parameter if cov.model = 'matern') are the two hardest parameters to estimate in spatially-explicit models in spOccupancy. In other words, you will often see slow mixing and convergence of the MCMC chains for these parameters. To try to speed up this slow mixing and convergence of these parameters, we use an algorithm called an adaptive Metropolis-Hastings algorithm for all spatially-explicit models in spOccupancy (see Roberts and Rosenthal (2009) for more details on this algorithm). In this approach, we break up the total number of MCMC samples into a set of “batches”, where each batch has a specific number of MCMC samples. When we fit a spatially-explicit model in spOccupancy, instead of specifying the total number of MCMC samples in the n.samples argument like we did in PGOcc(), we must specify the total number of batches (n.batch) as well as the number of MCMC samples each batch contains. Thus, the total number of MCMC samples is n.batch * batch.length. Typically, we set batch.length = 25 and then play around with n.batch until convergence of all model parameters is reached. We recommend setting batch.length = 25 unless you have a specific reason to change it. Here we set n.batch = 400 for a total of 10,000 MCMC samples in each of 3 chains. We additionally specify a burn-in period of length 2000 and a thinning rate of 20.

batch.length <- 25
n.batch <- 400
n.burn <- 2000
n.thin <- 20
n.chains <- 3

Importantly, we also need to specify an acceptance rate and a tuning parameter for the spatial range parameter (and spatial smoothness parameter if cov.model = 'matern'), which are both features of the adaptive algorithm we use to sample these parameters. In this algorithm, we propose new values for phi (and nu), compare them to our previous values, and use a statistical algorithm to determine if we should accept the new proposed value or keep the old one. The accept.rate argument specifies the ideal proportion of times we will accept the newly proposed values for these parameters. Roberts and Rosenthal (2009) show that if we accept new values around 43% of the time, this will lead to optimal mixing and convergence of the MCMC chains. Following these recommendations, we should strive for an algorithm that accepts new values about 43% of the time. Thus, we recommend setting accept.rate = 0.43 unless you have a specific reason not to (this is the default value). The values specified in the tuning argument helps control the initial values we will propose for both phi and nu. These values are supplied as input in the form of a list with tags phi and nu. The initial tuning value can be any value greater than 0, but we generally recommend starting the value out around 0.5. This initial tuning value is closely related to the ideal (or target) acceptance rate we specified in accept.rate. In short, the algorithm we use is “adaptive” because the algorithm will change the tuning values after each batch of the MCMC to yield acceptance rates that are close to our target acceptance rate that we specified in the accept.rate argument. Information on the acceptance rates for phi and nu in your model will be displayed when setting verbose = TRUE. After some initial runs of the model, if you notice the final acceptance rate is much larger or smaller than the target acceptance rate (accept.rate), you can then change the initial tuning value to get closer to the target rate. While use of this algorithm requires us to specify more arguments in spatially-explicit models, this leads to much shorter run times compared to a more simple approach where we do not have an “adaptive” sampling approach, and it should thus save you time in the long haul when waiting for these models to run. For our example here, we set the initial tuning value to 1 after some initial exploratory runs of the model.

oven.tuning <- list(phi = 1)
# accept.rate = 0.43 by default, so we do not specify it.

Priors are again specified in a list in the argument priors. We follow standard recommendations for prior distributions from the spatial statistics literature (Banerjee, Carlin, and Gelfand 2003). We assume an inverse gamma prior for the spatial variance parameter sigma.sq (the tag of which is sigma.sq.ig), and uniform priors for the spatial decay parameter phi and smoothness parameter nu (if using the Matern correlation function), with the associated tags phi.unif and nu.unif. The hyperparameters of the inverse Gamma are passed as a vector of length two, with the first and second elements corresponding to the shape and scale, respectively. The lower and upper bounds of the uniform distribution are passed as a two-element vector for the uniform priors.

As of v0.3.2, we allow users to specify a uniform prior on the spatial variance parameter sigma.sq (using the tag sigma.sq.unif) instead of an inverse-Gamma prior. This can be useful in certain situations when working with a binary response variable (which we inherently do in occupancy models), as there is a confounding between the spatial variance parameter sigma.sq and the occurrence intercept. This occurs as a result of the logit transformation and a mathematical statement known as Jensen’s Inequality (Bolker 2015). Briefly, Jensen’s Inequality tells us that while the spatial random effects don’t influence the mean on the logit scale (since we give them a prior with mean 0), they do have an influence on the mean on the actual probability scale (after back-transforming), which is what leads to potential confounding between the spatial variance parameter and the occurrence intercept. Generally, we have found this confounding to be inconsequential, as the spatial structure of the random effects helps to separate the spatial variance from the occurrence intercept. However, there may be certain circumstances when $$\sigma^2$$ is estimated to be extremely large, and the estimate of the occurrence intercept is a very large magnitude negative number in order to compensate. It can be helpful in these situations to use a uniform distribution on sigma.sq to restrict it to taking more reasonable values. We have rarely encountered such situations, and so throughout this vignette (and in our the vast majority of our analyses) we will use an inverse-Gamma prior.

Note that the priors for the spatial parameters in a spatially-explicit model must be at least weakly informative for the model to converge (Banerjee, Carlin, and Gelfand 2003). For the inverse-Gamma prior on the spatial variance, we typically set the shape parameter to 2 and the scale parameter equal to our best guess of the spatial variance. The default prior hyperparameter values for the spatial variance $$\sigma^2$$ are a shape parameter of 2 and a scale parameter of 1. This weakly informative prior suggests a prior mean of 1 for the spatial variance, which is a moderately small amount of spatial variation. Based on our previous work with these data, we expected the residual spatial variation to be relatively small, and so we use this default value and set the scale parameter below to 1. For the spatial decay parameter, our default approach is to set the lower and upper bounds of the uniform prior based on the minimum and maximum distances between sites in the data. More specifically, by default we set the lower bound to 3 / max and the upper bound to 3 / min, where min and max are the minimum and maximum distances between sites in the data set, respectively. This equates to a vague prior that states the spatial autocorrelation in the data could only exist between sites that are very close together, or could span across the entire observed study area. If additional information is known on the extent of the spatial autocorrelation in the data, you may place more restrictive bounds on the uniform prior, which would reduce the amount of time needed for adequate mixing and convergence of the MCMC chains. Here we use this default approach, but will explicitly set the values for transparency.

min.dist <- min(dist.hbef)
max.dist <- max(dist.hbef)
oven.priors <- list(beta.normal = list(mean = 0, var = 2.72),
alpha.normal = list(mean = 0, var = 2.72),
sigma.sq.ig = c(2, 1),
phi.unif = c(3/max.dist, 3/min.dist))

The argument n.omp.threads specifies the number of threads to use for within-chain parallelization, while verbose specifies whether or not to print the progress of the sampler. We highly recommend setting verbose = TRUE for all spatial models to ensure the adaptive MCMC is working as you want (and this is the reason for why this is the default for this argument). The argument n.report specifies the interval to report the Metropolis-Hastings sampler acceptance rate. Note that n.report is specified in terms of batches, not the overall number of samples. Below we set n.report = 100, which will result in information on the acceptance rate and tuning parameters every 100th batch (not sample).

n.omp.threads <- 1
verbose <- TRUE
n.report <- 100 # Report progress at every 100th batch.

The parameters NNGP, n.neighbors, and search.type relate to whether or not you want to fit the model with a Gaussian Process or with NNGP, which is a much more computationally efficient approximation. The argument NNGP is a logical value indicating whether to fit the model with an NNGP (TRUE) or a regular Gaussian Process (FALSE). For data sets that have more than 1000 locations, using an NNGP will have substantial improvements in run time. Even for more modest-sized data sets, like the HBEF data set, using an NNGP will be quite a bit faster (especially for multi-species models as shown in subsequent sections). Unless your data set is particularly small (e.g., 100 points) and you are concerned about the NNGP approximation, we recommend setting NNGP = TRUE, which is the default. The arguments n.neighbors and search.type specify the number of neighbors used in the NNGP and the nearest neighbor search algorithm, respectively, to use for the NNGP model. Generally, the default values of these arguments will be adequate. Datta et al. (2016) showed that setting n.neighbors = 15 is usually sufficient, although for certain data sets a good approximation can be achieved with as few as five neighbors, which could substantially decrease run time for the model. We generally recommend leaving search.type = "cb", as this results in a fast code book nearest neighbor search algorithm. However, details on when you may want to change this are described in Finley, Datta, and Banerjee (2020). We will run an NNGP model using the default value for search.type and setting n.neighbors = 5, which we have found in exploratory analysis to closely approximate a full Gaussian Process.

We now fit the model (without k-fold cross-validation) and summarize the results using summary().

# Approx. run time: < 1 minute
out.sp <- spPGOcc(occ.formula = oven.occ.formula,
det.formula = oven.det.formula,
data = ovenHBEF,
inits = oven.inits,
n.batch = n.batch,
batch.length = batch.length,
priors = oven.priors,
cov.model = cov.model,
NNGP = TRUE,
n.neighbors = 5,
tuning = oven.tuning,
n.report = n.report,
n.burn = n.burn,
n.thin = n.thin,
n.chains = n.chains)
----------------------------------------
Preparing to run the model
----------------------------------------
----------------------------------------
Building the neighbor list
----------------------------------------
----------------------------------------
Building the neighbors of neighbors list
----------------------------------------
----------------------------------------
Model description
----------------------------------------
NNGP Occupancy model with Polya-Gamma latent
variable fit with 373 sites.

Samples per chain: 10000 (400 batches of length 25)
Burn-in: 2000
Thinning Rate: 20
Number of Chains: 3
Total Posterior Samples: 1200

Using the exponential spatial correlation model.

Using 5 nearest neighbors.

Source compiled with OpenMP support and model fit using 1 thread(s).

Adaptive Metropolis with target acceptance rate: 43.0
----------------------------------------
Chain 1
----------------------------------------
Sampling ...
Batch: 100 of 400, 25.00%
Parameter   Acceptance  Tuning
phi     16.0        0.42741
-------------------------------------------------
Batch: 200 of 400, 50.00%
Parameter   Acceptance  Tuning
phi     20.0        0.30422
-------------------------------------------------
Batch: 300 of 400, 75.00%
Parameter   Acceptance  Tuning
phi     44.0        0.29229
-------------------------------------------------
Batch: 400 of 400, 100.00%
----------------------------------------
Chain 2
----------------------------------------
Sampling ...
Batch: 100 of 400, 25.00%
Parameter   Acceptance  Tuning
phi     36.0        0.29820
-------------------------------------------------
Batch: 200 of 400, 50.00%
Parameter   Acceptance  Tuning
phi     28.0        0.31664
-------------------------------------------------
Batch: 300 of 400, 75.00%
Parameter   Acceptance  Tuning
phi     40.0        0.28083
-------------------------------------------------
Batch: 400 of 400, 100.00%
----------------------------------------
Chain 3
----------------------------------------
Sampling ...
Batch: 100 of 400, 25.00%
Parameter   Acceptance  Tuning
phi     48.0        0.28650
-------------------------------------------------
Batch: 200 of 400, 50.00%
Parameter   Acceptance  Tuning
phi     56.0        0.28650
-------------------------------------------------
Batch: 300 of 400, 75.00%
Parameter   Acceptance  Tuning
phi     40.0        0.29229
-------------------------------------------------
Batch: 400 of 400, 100.00%
class(out.sp) # Look at what spOccupancy produced.
[1] "spPGOcc"
names(out.sp)
 [1] "rhat"           "beta.samples"   "alpha.samples"  "theta.samples"
[5] "coords"         "z.samples"      "X"              "X.re"
[9] "w.samples"      "psi.samples"    "like.samples"   "X.p"
[13] "X.p.re"         "lambda.p"       "y"              "ESS"
[17] "call"           "n.samples"      "n.neighbors"    "cov.model.indx"
[21] "type"           "n.post"         "n.thin"         "n.burn"
[25] "n.chains"       "pRE"            "psiRE"          "run.time"      
summary(out.sp)

Call:
spPGOcc(occ.formula = oven.occ.formula, det.formula = oven.det.formula,
data = ovenHBEF, inits = oven.inits, priors = oven.priors,
tuning = oven.tuning, cov.model = cov.model, NNGP = TRUE,
n.neighbors = 5, n.batch = n.batch, batch.length = batch.length,
n.report = n.report, n.burn = n.burn, n.thin = n.thin, n.chains = n.chains)

Samples per Chain: 10000
Burn-in: 2000
Thinning Rate: 20
Number of Chains: 3
Total Posterior Samples: 1200
Run Time (min): 1.4349

Occurrence (logit scale):
Mean     SD    2.5%     50%   97.5%   Rhat ESS
(Intercept)            3.0292 0.7355  1.5816  2.9908  4.5583 1.0203 221
scale(Elevation)      -2.6150 0.6037 -3.9895 -2.5396 -1.6115 1.0433 226
I(scale(Elevation)^2) -0.7462 0.3753 -1.5448 -0.7321  0.0096 1.0259 462

Detection (logit scale):
Mean     SD    2.5%     50%  97.5%   Rhat  ESS
(Intercept)      0.8220 0.1208  0.5832  0.8195 1.0659 1.0201 1102
scale(day)      -0.0919 0.0829 -0.2569 -0.0934 0.0726 1.0013 1476
scale(tod)      -0.0465 0.0790 -0.1950 -0.0480 0.1090 1.0025 1200
I(scale(day)^2)  0.0257 0.0959 -0.1568  0.0247 0.2120 1.0145 1200

Spatial Covariance:
Mean     SD   2.5%    50%   97.5%   Rhat ESS
sigma.sq 6.4647 3.9863 1.7700 5.4539 17.5299 1.1522 120
phi      0.0022 0.0009 0.0008 0.0020  0.0043 1.0187 186

Most Rhat values are less than 1.1, although for a full analysis we should run the model longer to ensure the spatial variance parameter (sigma.sq) has converged. The effective sample sizes of the spatial covariance parameters are somewhat low, which indicates limited mixing of these parameters. This is common when fitting spatial models. We note that there is substantial MC error in the spatial covariance parameters, and so your estimates may be slightly different from what we have shown here. We see spPGOcc() returns a list of class spPGOcc and consists of posterior samples for all parameters. Note that posterior samples for spatial parameters are stored in the list element theta.samples.

### Posterior predictive checks

For our posterior predictive check, we send the spPGOcc model object to the ppcOcc() function, this time grouping by replicate, or survey occasion, (group = 2) instead of by site (group = 1).

ppc.sp.out <- ppcOcc(out.sp, fit.stat = 'freeman-tukey', group = 2)
summary(ppc.sp.out)

Call:
ppcOcc(object = out.sp, fit.stat = "freeman-tukey", group = 2)

Samples per Chain: 10000
Burn-in: 2000
Thinning Rate: 20
Number of Chains: 3
Total Posterior Samples: 1200

Bayesian p-value:  0.3525
Fit statistic:  freeman-tukey 

The Bayesian p-value indicates adequate fit of the spatial occupancy model.

### Model selection using WAIC and k-fold cross-validation

We next use the waicOcc() function to compute the WAIC, which we can compare to the non-spatial model to assess the benefit of incorporating the spatial random effects in the occurrence model.

waicOcc(out.sp)
      elpd         pD       WAIC
-563.78006   49.80549 1227.17109 
# Compare to non-spatial model
waicOcc(out)
       elpd          pD        WAIC
-632.960543    6.290149 1278.501382 

We see the WAIC value for the spatial model is substantially reduced compared to the nonspatial model, indicating that incorporation of the spatial random effects yields an improvement in predictive performance.

k-fold cross-validation is accomplished by specifying the k.fold argument in spPGOcc just as we saw in PGOcc.

### Prediction

Finally, we can perform out of sample prediction using the predict function just as before. Prediction for spatial models is more computationally intensive than for non-spatial models, and so the predict function for spPGOcc class objects also has options for parallelization (n.omp.threads) and reporting sampler progress (verbose and n.report). Note that for spPGOcc(), you also need to supply the coordinates of the out of sample prediction locations in addition to the covariate values.

# Do prediction.
coords.0 <- as.matrix(hbefElev[, c('Easting', 'Northing')])
# Approx. run time: 6 min
out.sp.pred <- predict(out.sp, X.0, coords.0, verbose = FALSE)
# Produce a species distribution map (posterior predictive means of occupancy)
plot.dat <- data.frame(x = hbefElev$Easting, y = hbefElev$Northing,
mean.psi = apply(out.sp.pred$psi.0.samples, 2, mean), sd.psi = apply(out.sp.pred$psi.0.samples, 2, sd))
dat.stars <- st_as_stars(plot.dat, dims = c('x', 'y'))
ggplot() +
geom_stars(data = dat.stars, aes(x = x, y = y, fill = mean.psi)) +
scale_fill_viridis_c(na.value = 'transparent') +
labs(x = 'Easting', y = 'Northing', fill = '',
title = 'Mean OVEN occurrence probability') +
theme_bw()

# Produce an uncertainty map of the SDM (posterior predictive SDs of occupancy)
ggplot() +
geom_stars(data = dat.stars, aes(x = x, y = y, fill = sd.psi)) +
scale_fill_viridis_c(na.value = 'transparent') +
labs(x = 'Easting', y = 'Northing', fill = '',
title = 'SD OVEN occurrence probability') +
theme_bw()

Comparing this to the non-spatial occupancy model, the spatial model appears to identify areas in HBEF with low OVEN occurrence that are not captured in the non-spatial model. We will resist trying to hypothesize what environmental factors could lead to these patterns.

## Multi-species occupancy models

### Basic model description

Let $$z_{i, j}$$ be the true presence (1) or absence (0) of a species $$i$$ at site $$j$$, with $$j = 1, \dots, J$$ and $$i = 1, \dots, N$$. We assume the latent occurrence variable arises from a Bernoulli process following

$$$\begin{split} &z_{i, j} \sim \text{Bernoulli}(\psi_{i, j}), \\ &\text{logit}(\psi_{i, j}) = \boldsymbol{x}^{\top}_{j}\boldsymbol{\beta}_i, \end{split}$$$

where $$\psi_{i, j}$$ is the probability of occurrence of species $$i$$ at site $$j$$, which is a function of site-specific covariates $$\boldsymbol{X}$$ and a vector of species-specific regression coefficients ($$\boldsymbol{\beta}_i$$). The regression coefficients in multi-species occupancy models are envisioned as random effects arising from a common community level distribution:

$$$\boldsymbol{\beta}_i \sim \text{Normal}(\boldsymbol{\mu}_{\beta}, \boldsymbol{T}_{\beta}),$$$

where $$\boldsymbol{\mu}_{\beta}$$ is a vector of community level mean effects for each occurrence covariate effect (including the intercept) and $$\boldsymbol{T}_{\beta}$$ is a diagonal matrix with diagonal elements $$\boldsymbol{\tau}^2_{\beta}$$ that represent the variability of each occurrence covariate effect among species in the community.

We do not directly observe $$z_{i, j}$$, but rather we observe an imperfect representation of the latent occurrence process. Let $$y_{i, j, k}$$ be the observed detection (1) or nondetection (0) of a species $$i$$ at site $$j$$ during replicate/survey $$k$$ for each of $$k = 1, \dots, K_j$$ replicates at each site $$j$$. We envision the detection-nondetection data as arising from a Bernoulli process conditional on the true latent occurrence process:

$$$\begin{split} &y_{i, j, k} \sim \text{Bernoulli}(p_{i, j, k}z_{i, j}), \\ &\text{logit}(p_{i, j, k}) = \boldsymbol{v}^{\top}_{i, j, k}\boldsymbol{\alpha}_i, \end{split}$$$

where $$p_{i, j, k}$$ is the probability of detecting species $$i$$ at site $$j$$ during replicate $$k$$ (given it is present at site $$j$$), which is a function of site and replicate specific covariates $$\boldsymbol{V}$$ and a vector of species-specific regression coefficients ($$\boldsymbol{\alpha}_i$$). Similarly to the occurrence regression coefficients, the species specific detection coefficients are envisioned as random effects arising from a common community level distribution:

$$$\boldsymbol{\alpha}_i \sim \text{Normal}(\boldsymbol{\mu}_{\alpha}, \boldsymbol{T}_{\alpha}),$$$

where $$\boldsymbol{\mu}_{\alpha}$$ is a vector of community level mean effects for each detection covariate effect (including the intercept) and $$\boldsymbol{T}_{\alpha}$$ is a diagonal matrix with diagonal elements $$\boldsymbol{\tau}^2_{\alpha}$$ that represent the variability of each detection covariate effect among species in the community.

To complete the Bayesian specification of the model, we assign multivariate normal priors for the occurrence ($$\boldsymbol{\mu}_{\beta}$$) and detection ($$\boldsymbol{\mu}_{\alpha}$$) community-level regression coefficient means and independent inverse-Gamma priors for each element of $$\boldsymbol{\tau}^2_{\beta}$$ and $$\boldsymbol{\tau}^2_{\alpha}$$. We again use Pólya-Gamma data augmentation to yield an efficient implementation of the multi-species occupancy model, which is described in depth in the MCMC sampler vignette.

### Fitting multi-species occupancy models with msPGOcc()

spOccupancy uses nearly identical syntax for fitting multi-species occupancy models as it does for single-species models and provides the same functionality for posterior predictive checks, model assessment and selection using WAIC and k-fold cross-validation, and out of sample prediction. The msPGOcc() function fits nonspatial multi-species occupancy models using Pólya-Gamma latent variables, which results in substantial increases in run time compared to standard implementations of logit link multi-species occupancy models. msPGOcc() has exactly the same arguments as PGOcc():

msPGOcc(occ.formula, det.formula, data, inits, n.samples, priors,
n.omp.threads = 1, verbose = TRUE, n.report = 100,
n.burn = round(.10 * n.samples), n.thin = 1, n.chains = 1,
k.fold, k.fold.threads = 1, k.fold.seed, ...)

For illustration, we will again use the Hubbard Brook data in hbef2015, but we will now model occurrence for all 12 species in the community. Below we reload the hbef2015 data set to get a fresh copy.

data(hbef2015)
# str(hbef2015) # Remind ourselves of what's in it (not shown).

We will model occurrence for all species as a function of linear and quadratic terms for elevation, and detection as a function of linear and quadratic day of survey as well as the time of day during which the survey was conducted. These models are specified in occ.formula and det.formula as before, which reference variables stored in the data list. Random intercepts can be included in both the occurrence and detection portions of the occupancy model using lme4 syntax (Bates et al. 2015). For multi-species models, the multi-species detection-nondetection data y is now a three-dimensional array with dimensions corresponding to species, sites, and replicates. This is how the data are provided in the hbef2015 object, so we don’t need to do any additional preparations.

occ.ms.formula <- ~ scale(Elevation) + I(scale(Elevation)^2)
det.ms.formula <- ~ scale(day) + scale(tod) + I(scale(day)^2)
str(hbef2015)
List of 4
$y : num [1:12, 1:373, 1:3] 0 0 0 0 0 1 0 0 0 0 ... ..- attr(*, "dimnames")=List of 3 .. ..$ : chr [1:12] "AMRE" "BAWW" "BHVI" "BLBW" ...
.. ..$: chr [1:373] "1" "2" "3" "4" ... .. ..$ : chr [1:3] "1" "2" "3"
$occ.covs: num [1:373, 1] 475 494 546 587 588 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : NULL
.. ..$: chr "Elevation"$ det.covs:List of 2
..$day: num [1:373, 1:3] 156 156 156 156 156 156 156 156 156 156 ... ..$ tod: num [1:373, 1:3] 330 346 369 386 409 425 447 463 482 499 ...
$coords : num [1:373, 1:2] 280000 280000 280000 280001 280000 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : chr [1:373] "1" "2" "3" "4" ...
.. ..$: chr [1:2] "X" "Y" Next we specify the initial values in inits. For multi-species occupancy models, we supply initial values for community-level and species-level parameters. In msPGOcc(), we will supply initial values for the following parameters: alpha.comm (community-level detection coefficients), beta.comm (community-level occurrence coefficients), alpha (species-level detection coefficients), beta (species-level occurrence coefficients), tau.sq.beta (community-level occurrence variance parameters), tau.sq.alpha (community-level detection variance parameters, z (latent occurrence variables for all species). These are all specified in a single list. Initial values for community-level parameters are either vectors of length corresponding to the number of community-level detection or occurrence parameters in the model (including the intercepts) or a single value if all parameters are assigned the same initial values. Initial values for species level parameters are either matrices with the number of rows indicating the number of species, and each column corresponding to a different regression parameter, or a single value if the same initial value is used for all species and parameters. The initial values for the latent occurrence matrix are specified as a matrix with $$N$$ rows corresponding to the number of species and $$J$$ columns corresponding to the number of sites. N <- dim(hbef2015$y)[1]
ms.inits <- list(alpha.comm = 0,
beta.comm = 0,
beta = 0,
alpha = 0,
tau.sq.beta = 1,
tau.sq.alpha = 1,
z = apply(hbef2015$y, c(1, 2), max, na.rm = TRUE)) In multi-species models, we specify priors on the community-level coefficients (or hyperparameters) rather than the species-level effects. For nonspatial models, these priors are specified with the following tags: beta.comm.normal (normal prior on the community-level occurrence mean effects), alpha.comm.normal (normal prior on the community-level detection mean effects), tau.sq.beta.ig (inverse-Gamma prior on the community-level occurrence variance parameters), tau.sq.alpha.ig (inverse-Gamma prior on the community-level detection variance parameters). Each tag consists of a list with elements corresponding to the mean and variance for normal priors and scale and shape for inverse-Gamma priors. Values can be specified individually for each parameter or as a single value if the same prior is assigned to all parameters of a given type. By default, we set the prior hyperparameter values for the community-level means to a mean of 0 and a variance of 2.72, which results in a vague prior on the probability scale as we discussed for the single-species spatial occupancy model. Below we specify these priors explicitly. For the community-level variance parameters, by default we set the scale and shape parameters to 0.1 following the recommendations of (Lunn et al. 2013), which results in a weakly informative prior on the community-level variances. This may lead to shrinkage of the community-level variance towards zero under certain circumstances so that all species will have fairly similar values for the species-specific covariate effect (Gelman 2006), although we have found multi-species occupancy models to be relatively robust to this prior specification. However, if you want to explore the impact of this prior on species-specific covariate effects, we recommend running single-species occupancy models for a select number of species in the data set, and assessing how large the differences in the estimated species-specific effects are between the single-species model estimates and those from the multi-species model. Below we explicitly set these default prior values for our HBEF example. ms.priors <- list(beta.comm.normal = list(mean = 0, var = 2.72), alpha.comm.normal = list(mean = 0, var = 2.72), tau.sq.beta.ig = list(a = 0.1, b = 0.1), tau.sq.alpha.ig = list(a = 0.1, b = 0.1)) All that’s now left to do is specify the number of threads to use (n.omp.threads), the number of MCMC samples (n.samples), the amount of samples to discard as burn-in (n.burn), the thinning rate (n.thin), and arguments to control the display of sampler progress (verbose, n.report). # Approx. run time: 6 min out.ms <- msPGOcc(occ.formula = occ.ms.formula, det.formula = det.ms.formula, data = hbef2015, inits = ms.inits, n.samples = 30000, priors = ms.priors, n.omp.threads = 1, verbose = TRUE, n.report = 6000, n.burn = 10000, n.thin = 50, n.chains = 3) ---------------------------------------- Preparing to run the model ---------------------------------------- ---------------------------------------- Model description ---------------------------------------- Multispecies Occupancy Model with Polya-Gamma latent variable fit with 373 sites and 12 species. Samples per Chain: 30000 Burn-in: 10000 Thinning Rate: 50 Number of Chains: 3 Total Posterior Samples: 1200 Source compiled with OpenMP support and model fit using 1 thread(s). ---------------------------------------- Chain 1 ---------------------------------------- Sampling ... Sampled: 6000 of 30000, 20.00% ------------------------------------------------- Sampled: 12000 of 30000, 40.00% ------------------------------------------------- Sampled: 18000 of 30000, 60.00% ------------------------------------------------- Sampled: 24000 of 30000, 80.00% ------------------------------------------------- Sampled: 30000 of 30000, 100.00% ---------------------------------------- Chain 2 ---------------------------------------- Sampling ... Sampled: 6000 of 30000, 20.00% ------------------------------------------------- Sampled: 12000 of 30000, 40.00% ------------------------------------------------- Sampled: 18000 of 30000, 60.00% ------------------------------------------------- Sampled: 24000 of 30000, 80.00% ------------------------------------------------- Sampled: 30000 of 30000, 100.00% ---------------------------------------- Chain 3 ---------------------------------------- Sampling ... Sampled: 6000 of 30000, 20.00% ------------------------------------------------- Sampled: 12000 of 30000, 40.00% ------------------------------------------------- Sampled: 18000 of 30000, 60.00% ------------------------------------------------- Sampled: 24000 of 30000, 80.00% ------------------------------------------------- Sampled: 30000 of 30000, 100.00% The resulting object out.ms is a list of class msPGOcc() consisting primarily of posterior samples of all community and species-level parameters, as well as some additional objects that are used for summaries, prediction, and model fit evaluation. We can display a nice summary of these results using the summary() function as before. For multi-species objects, when using summary we need to specify the level of parameters we want to summarize. We do this using the argument level, which takes values community, species, or both to print results for community-level parameters, species-level parameters, or all parameters. By default, level is set to both to display both species and community-level parameters. summary(out.ms)  Call: msPGOcc(occ.formula = occ.ms.formula, det.formula = det.ms.formula, data = hbef2015, inits = ms.inits, priors = ms.priors, n.samples = 30000, n.omp.threads = 1, verbose = TRUE, n.report = 6000, n.burn = 10000, n.thin = 50, n.chains = 3) Samples per Chain: 30000 Burn-in: 10000 Thinning Rate: 50 Number of Chains: 3 Total Posterior Samples: 1200 Run Time (min): 8.0261 ---------------------------------------- Community Level ---------------------------------------- Occurrence Means (logit scale): Mean SD 2.5% 50% 97.5% Rhat ESS (Intercept) 0.3543 0.8370 -1.3016 0.3240 1.9512 1.0004 978 scale(Elevation) 0.2679 0.5016 -0.7169 0.2776 1.2843 1.0060 1200 I(scale(Elevation)^2) -0.2165 0.2595 -0.7027 -0.2207 0.3206 1.0020 1101 Occurrence Variances (logit scale): Mean SD 2.5% 50% 97.5% Rhat ESS (Intercept) 10.9032 6.5399 3.9471 9.3436 26.5235 1.0116 712 scale(Elevation) 2.9349 1.8947 1.0017 2.4329 7.8181 1.0053 767 I(scale(Elevation)^2) 0.6147 0.4394 0.1440 0.4932 1.7338 1.0208 941 Detection Means (logit scale): Mean SD 2.5% 50% 97.5% Rhat ESS (Intercept) -0.7672 0.4744 -1.7072 -0.7646 0.1168 1.0143 877 scale(day) 0.0589 0.0840 -0.1060 0.0604 0.2326 0.9991 1200 scale(tod) -0.0395 0.0767 -0.1867 -0.0413 0.1152 1.0022 1200 I(scale(day)^2) -0.0246 0.0842 -0.1770 -0.0282 0.1552 1.0006 1318 Detection Variances (logit scale): Mean SD 2.5% 50% 97.5% Rhat ESS (Intercept) 2.7007 1.6171 0.8984 2.3029 6.8622 1.0259 299 scale(day) 0.0655 0.0368 0.0238 0.0562 0.1667 1.0000 1328 scale(tod) 0.0507 0.0360 0.0173 0.0423 0.1320 1.0371 1052 I(scale(day)^2) 0.0559 0.0439 0.0178 0.0458 0.1574 1.0376 1119 ---------------------------------------- Species Level ---------------------------------------- Occurrence (logit scale): Mean SD 2.5% 50% 97.5% Rhat ESS (Intercept)-AMRE -1.8107 2.1105 -4.5386 -2.4000 3.3652 1.0593 66 (Intercept)-BAWW 2.0466 1.5090 -0.1170 1.8089 5.9299 1.0190 111 (Intercept)-BHVI 3.1041 2.3441 0.0144 2.6511 8.9521 1.0531 156 (Intercept)-BLBW 2.5512 0.5865 1.7183 2.4376 4.0163 1.0083 1044 (Intercept)-BLPW -4.6594 0.6160 -5.9920 -4.6079 -3.5216 1.0030 915 (Intercept)-BTBW 3.6416 0.4762 2.8556 3.6093 4.7216 1.0039 1387 (Intercept)-BTNW 2.4139 0.3400 1.8539 2.3723 3.1978 0.9995 1200 (Intercept)-CAWA -2.1378 0.5583 -3.0914 -2.1840 -0.9588 1.0051 598 (Intercept)-MAWA -1.7047 0.2654 -2.2323 -1.6992 -1.1922 1.0083 1102 (Intercept)-NAWA -2.5000 0.5463 -3.4426 -2.5546 -1.3104 1.0062 617 (Intercept)-OVEN 2.1586 0.2735 1.6550 2.1462 2.7440 1.0045 1200 (Intercept)-REVI 3.1395 0.4624 2.3684 3.1025 4.1672 1.0059 921 scale(Elevation)-AMRE 0.8762 1.0901 -1.1293 0.8184 3.1547 1.0240 248 scale(Elevation)-BAWW -0.7958 1.0684 -3.9238 -0.5719 0.5389 1.0023 157 scale(Elevation)-BHVI 0.1010 1.0025 -1.9299 0.1079 1.9618 1.0150 615 scale(Elevation)-BLBW -0.4381 0.2303 -0.9829 -0.4111 -0.0692 1.0089 1079 scale(Elevation)-BLPW 2.6891 0.6631 1.6980 2.5727 4.1881 1.0116 718 scale(Elevation)-BTBW -0.5043 0.1649 -0.8348 -0.5021 -0.2107 1.0024 1200 scale(Elevation)-BTNW 0.6480 0.2862 0.1913 0.6210 1.2933 1.0075 1009 scale(Elevation)-CAWA 1.8800 0.6213 0.8493 1.8250 3.1988 1.0011 750 scale(Elevation)-MAWA 1.5670 0.2892 1.0519 1.5393 2.1608 0.9993 1421 scale(Elevation)-NAWA 1.1545 0.3681 0.5569 1.1172 1.9559 1.0007 963 scale(Elevation)-OVEN -1.6726 0.2695 -2.2725 -1.6422 -1.2330 1.0062 1200 scale(Elevation)-REVI -2.1313 0.7091 -3.6974 -2.0146 -1.1043 1.0082 792 I(scale(Elevation)^2)-AMRE -0.6650 0.6718 -2.0704 -0.6461 0.6285 1.0058 310 I(scale(Elevation)^2)-BAWW -0.9815 0.5035 -1.9458 -0.9884 0.1314 1.0054 428 I(scale(Elevation)^2)-BHVI 0.4369 0.7635 -0.8651 0.3335 2.2730 1.0225 456 I(scale(Elevation)^2)-BLBW -0.5725 0.1888 -0.9976 -0.5535 -0.2554 1.0050 1200 I(scale(Elevation)^2)-BLPW 0.7498 0.3787 -0.0799 0.7684 1.4496 1.0055 957 I(scale(Elevation)^2)-BTBW -1.0344 0.1657 -1.3904 -1.0289 -0.7354 1.0021 1200 I(scale(Elevation)^2)-BTNW -0.1412 0.1713 -0.4694 -0.1397 0.1908 1.0020 1200 I(scale(Elevation)^2)-CAWA -0.5376 0.4077 -1.3442 -0.5209 0.2371 0.9999 1100 I(scale(Elevation)^2)-MAWA 0.2938 0.2421 -0.1924 0.3018 0.7695 0.9996 1200 I(scale(Elevation)^2)-NAWA 0.2807 0.2661 -0.2549 0.2872 0.7932 1.0004 1200 I(scale(Elevation)^2)-OVEN -0.4236 0.1852 -0.7832 -0.4328 -0.0435 1.0063 1200 I(scale(Elevation)^2)-REVI -0.0335 0.3699 -0.6636 -0.0664 0.7631 1.0070 865 Detection (logit scale): Mean SD 2.5% 50% 97.5% Rhat ESS (Intercept)-AMRE -2.9297 1.4415 -5.6275 -2.8097 -0.5810 1.0243 83 (Intercept)-BAWW -2.6905 0.3594 -3.3086 -2.7218 -1.8952 1.0352 340 (Intercept)-BHVI -2.7047 0.2316 -3.1381 -2.7138 -2.2058 1.0174 398 (Intercept)-BLBW -0.0359 0.1256 -0.2896 -0.0376 0.2047 1.0058 1200 (Intercept)-BLPW -0.4388 0.2629 -0.9502 -0.4314 0.0596 1.0020 1200 (Intercept)-BTBW 0.6365 0.1097 0.4368 0.6346 0.8605 1.0010 1200 (Intercept)-BTNW 0.5921 0.1129 0.3705 0.5908 0.8058 1.0097 1200 (Intercept)-CAWA -1.5263 0.5334 -2.6624 -1.5019 -0.5798 1.0132 583 (Intercept)-MAWA -0.7789 0.2419 -1.2226 -0.7784 -0.2928 1.0026 1200 (Intercept)-NAWA -1.5088 0.4980 -2.5089 -1.4999 -0.5645 1.0005 750 (Intercept)-OVEN 0.7971 0.1207 0.5621 0.7997 1.0311 1.0060 1200 (Intercept)-REVI 0.5440 0.1067 0.3347 0.5435 0.7494 1.0020 1200 scale(day)-AMRE 0.0434 0.2364 -0.3831 0.0330 0.5803 1.0087 1200 scale(day)-BAWW 0.2136 0.1379 -0.0410 0.2118 0.4997 1.0025 1200 scale(day)-BHVI 0.2082 0.1230 -0.0238 0.2060 0.4398 1.0180 1200 scale(day)-BLBW -0.2208 0.0651 -0.3477 -0.2183 -0.0926 1.0106 1200 scale(day)-BLPW 0.0707 0.1512 -0.2244 0.0707 0.3578 1.0013 1200 scale(day)-BTBW 0.2680 0.0678 0.1354 0.2668 0.3958 1.0041 1200 scale(day)-BTNW 0.1522 0.0671 0.0192 0.1545 0.2838 1.0034 1200 scale(day)-CAWA -0.0115 0.1715 -0.3490 -0.0061 0.3302 1.0119 1200 scale(day)-MAWA 0.1134 0.1268 -0.1379 0.1106 0.3603 1.0010 1200 scale(day)-NAWA 0.0302 0.1780 -0.3133 0.0292 0.3746 1.0059 1200 scale(day)-OVEN -0.0747 0.0782 -0.2262 -0.0772 0.0692 1.0017 994 scale(day)-REVI -0.0719 0.0663 -0.2053 -0.0692 0.0609 1.0078 1013 scale(tod)-AMRE -0.0369 0.2195 -0.5051 -0.0311 0.3818 1.0028 1104 scale(tod)-BAWW -0.1770 0.1323 -0.4445 -0.1678 0.0687 1.0019 1270 scale(tod)-BHVI -0.0482 0.1108 -0.2573 -0.0445 0.1703 1.0036 1200 scale(tod)-BLBW 0.0573 0.0644 -0.0675 0.0551 0.1895 1.0014 1200 scale(tod)-BLPW 0.1075 0.1463 -0.1677 0.1094 0.3982 1.0048 1200 scale(tod)-BTBW -0.0338 0.0669 -0.1733 -0.0332 0.0912 1.0112 1589 scale(tod)-BTNW 0.0402 0.0642 -0.0863 0.0402 0.1644 1.0064 1063 scale(tod)-CAWA -0.2118 0.1639 -0.5613 -0.2053 0.0806 1.0003 1021 scale(tod)-MAWA 0.0216 0.1151 -0.2088 0.0249 0.2393 1.0043 1200 scale(tod)-NAWA -0.0600 0.1591 -0.3768 -0.0564 0.2305 1.0105 1200 scale(tod)-OVEN -0.0455 0.0728 -0.1935 -0.0457 0.0959 1.0025 1269 scale(tod)-REVI -0.0783 0.0661 -0.2009 -0.0782 0.0536 1.0077 1200 I(scale(day)^2)-AMRE -0.1226 0.2318 -0.5927 -0.1243 0.3385 1.0053 1580 I(scale(day)^2)-BAWW -0.0423 0.1500 -0.3701 -0.0339 0.2375 1.0145 817 I(scale(day)^2)-BHVI 0.0475 0.1273 -0.1930 0.0456 0.3014 1.0117 1090 I(scale(day)^2)-BLBW -0.1748 0.0793 -0.3251 -0.1748 -0.0090 1.0166 1200 I(scale(day)^2)-BLPW 0.1902 0.1800 -0.1453 0.1816 0.5770 1.0006 1200 I(scale(day)^2)-BTBW -0.0616 0.0782 -0.2208 -0.0608 0.0869 1.0035 1200 I(scale(day)^2)-BTNW -0.0666 0.0773 -0.2194 -0.0677 0.0947 1.0072 1200 I(scale(day)^2)-CAWA -0.0336 0.1827 -0.3951 -0.0347 0.3359 1.0072 1200 I(scale(day)^2)-MAWA 0.0051 0.1327 -0.2636 0.0079 0.2564 1.0057 1200 I(scale(day)^2)-NAWA -0.1335 0.1874 -0.5206 -0.1331 0.2350 1.0099 1200 I(scale(day)^2)-OVEN 0.0176 0.0877 -0.1537 0.0177 0.1852 1.0002 1200 I(scale(day)^2)-REVI 0.0372 0.0772 -0.1105 0.0360 0.1877 1.0011 1200 # Or more explicitly # summary(out.ms, level = 'both') Looking at the community-level variance parameters, we see substantial variability in the average occurrence (the intercept) for the twelve species, as well as considerable variability in the effect of elevation across the community. There appears to be less variability across species in the detection portion of the model. We can look directly at the species-specific effects to confirm this. All community-level Rhat values are less than 1.1 and most species-level Rhat values are less than 1.1, with the exception of a few of the parameters for rare species. Mixing of the MCMC chains appears adequate when looking at the ESS values, although we can clearly see the ESS is lower for rare species (e.g., AMRE, BAWW) compared to more common species in the forest (e.g., OVEN, REVI). ### Posterior predictive checks We can use the ppcOcc() function to perform a posterior predictive check, and summarize the check with a Bayesian p-value using the summary() function, all as shown before. The summary() function again requires the level argument to specify if you want an overall Bayesian p-value for the entire community (level = 'community'), each individual species (level = 'species'), or for both (level = 'both'). By default, we set level = ‘both’. # Approx. run time: 20 sec ppc.ms.out <- ppcOcc(out.ms, 'chi-squared', group = 1) Currently on species 1 out of 12 Currently on species 2 out of 12 Currently on species 3 out of 12 Currently on species 4 out of 12 Currently on species 5 out of 12 Currently on species 6 out of 12 Currently on species 7 out of 12 Currently on species 8 out of 12 Currently on species 9 out of 12 Currently on species 10 out of 12 Currently on species 11 out of 12 Currently on species 12 out of 12 summary(ppc.ms.out)  Call: ppcOcc(object = out.ms, fit.stat = "chi-squared", group = 1) Samples per Chain: 30000 Burn-in: 10000 Thinning Rate: 50 Number of Chains: 3 Total Posterior Samples: 1200 ---------------------------------------- Community Level ---------------------------------------- Bayesian p-value: 0.3697 ---------------------------------------- Species Level ---------------------------------------- AMRE Bayesian p-value: 0.585 BAWW Bayesian p-value: 0.6608 BHVI Bayesian p-value: 0.5775 BLBW Bayesian p-value: 0.1042 BLPW Bayesian p-value: 0.6292 BTBW Bayesian p-value: 0.21 BTNW Bayesian p-value: 0.04 CAWA Bayesian p-value: 0.4733 MAWA Bayesian p-value: 0.4708 NAWA Bayesian p-value: 0.545 OVEN Bayesian p-value: 0.0533 REVI Bayesian p-value: 0.0875 Fit statistic: chi-squared  The Bayesian p-value for the overall community suggests an adequate model fit, but we see clear variation in the values for each individual species. We should explore this further in a complete analysis. ### Model selection using WAIC and k-fold cross-validation We can compute the WAIC for comparison with alternative models using the waicOcc() function. waicOcc(out.ms)  elpd pD WAIC -4531.71692 65.44891 9194.33166  k-fold cross-validation is again accomplished using the k.fold argument as we have seen previously. For multi-species occupancy models, using multiple threads can greatly reduce the time needed for k-fold cross-validation, so we encourage the use of multiple threads if such computing power is readily available. Using up to $$k$$ threads will generally involve substantial decreases in run time. For multi-species models, a separate deviance measure is reported for each species as a measure of predictive capacity, allowing for comparisons across multiple models for individual species, as well as for the entire community (by summing all species-specific values). ### Prediction Prediction with msPGOcc objects is exactly analogous to what we saw with PGOcc. We can use the predict function along with a data frame of covariates at new locations. We can predict across the entire HBEF for all twelve species using the elevation data stored in hbefElev. elev.pred <- (hbefElev$val - mean(ovenHBEF$occ.covs[, 1])) / sd(ovenHBEF$occ.covs[, 1])
X.0 <- cbind(1, elev.pred, elev.pred^2)
# Approx. run time: 100 sec
out.ms.pred <- predict(out.ms, X.0)

## Multi-species spatial occupancy models

### Basic model description

Residual spatial autocorrelation may perhaps be more prominent in multi-species occupancy models compared to single-species models, as a single set of covariates is used to explain occurrence probability across a region of interest for every one of the species in the modeled community. Given the large variety individual species show in habitat requirements, this may result in important drivers of occurrence probability not being included for certain species, resulting in many species having high residual spatial autocorrelation. We extend the previous multi-species occupancy model to incorporate a distinct spatial Gaussian Process (GP) for each species that accounts for unexplained spatial variation in each individual species occurrence across a spatial region. Occurrence probability for species $$i$$ at site $$j$$ with spatial coordinates $$\boldsymbol{s}_j$$, $$\psi_{i}(\boldsymbol{s}_j)$$, now takes the form

$$$\text{logit}(\psi_{i}(\boldsymbol{s}_j)) = \boldsymbol{x}^{\top}(\boldsymbol{s}_j)\boldsymbol{\beta}_i + \omega_{i}(\boldsymbol{s}_j),$$$

where the species-specific regression coefficients $$\boldsymbol{\beta}_i$$ follow the community-level distribution described previously for nonspatial multi-species occupancy models, and $$\omega_{i}(\boldsymbol{s}_j)$$ is a realization from a zero-mean spatial GP, i.e.,

$$$\boldsymbol{\omega}_{i}(\boldsymbol{s}) \sim \text{Normal}(\boldsymbol{0}, \boldsymbol{\Sigma}_i(\boldsymbol{s}, \boldsymbol{s}', \boldsymbol{\theta}_i)).$$$

We define $$\boldsymbol{\Sigma}_i(\boldsymbol{s}, \boldsymbol{s}', \boldsymbol{\theta}_i)$$ as a $$J \times J$$ covariance matrix that is a function of the distances between any pair of site coordinates $$\boldsymbol{s}$$ and $$\boldsymbol{s}'$$ and a set of parameters $$(\boldsymbol{\theta}_i)$$ that govern the spatial process. The vector $$\boldsymbol{\theta}_i$$ is equal to $$\boldsymbol{\theta}_i = \{\sigma^2_i, \phi_i, \nu_i\}$$, where $$\sigma^2_i$$ is a spatial variance parameter for species $$i$$, $$\phi_i$$ is a spatial decay parameter for species $$i$$, and $$\nu_i$$ is a spatial smoothness parameter for species $$i$$, where as before, $$\nu_i$$ is only specified when using a Matern correlation function.

The detection portion of the multi-species spatial occupancy model remains unchanged from the non-spatial multi-species occupancy model. We fit the model again using Pólya-Gamma data augmentation to enable an efficient Gibbs sampler (see MCMC sampler vignette for details). Similar to our discussion on the single-species spatial occupancy model, we also allow for specification of the spatial process using an NNGP instead of a full GP. This leads to even larger computational gains over the full GP than in the case of a single-species model, given that a separate covariance matrix is specified for each species in the model. See Datta et al. (2016), Finley, Datta, and Banerjee (2020), and the MCMC sampler vignette for additional details on NNGPs and their implementation in multi-species spatial occupancy models.

### Fitting multi-species spatial occupancy models with spMsPGOcc()

The function spMsPGOcc() fits spatially explicit multi-species occupancy models. Similar to single-species models using spPGOcc(), models can be fit using either a full Gaussian Process (GP) or a Nearest Neighbor Gaussian Process (NNGP). spMsPGOcc() fits a separate spatial process for each species. The syntax for spMsPGOcc() is analogous to the syntax for single-species spatially-explicit models using spPGOcc().

spMsPGOcc(occ.formula, det.formula, data, inits, n.batch,
batch.length, accept.rate = 0.43, priors,
cov.model = "exponential", tuning, n.omp.threads = 1,
verbose = TRUE, NNGP = TRUE, n.neighbors = 15,
search.type = "cb", n.report = 100,
n.burn = round(.10 * n.batch * batch.length), n.thin = 1,
n.chains = 1, k.fold, k.fold.threads = 1, k.fold.seed, ...)

We will again showcase the model using the HBEF foliage-gleaning bird data set, with the same predictors in our occurrence and detection models

occ.ms.sp.formula <- ~ scale(Elevation) + I(scale(Elevation)^2)
det.ms.sp.formula <- ~ scale(day) + scale(tod) + I(scale(day)^2)

Our initial values in the inits argument will look analogous to what we had specified for the nonspatial multi-species occupancy model using msPGOcc(), but we will now also include additional initial values for the parameters controlling the spatial processes: sigma.sq is the species-specific spatial variance parameter, phi is the species specific spatial decay parameter, and w is the latent spatial process for each species at each site. We will use an exponential covariance model, but note that when using a Matern covariance model we must also specify initial values for nu, the species-specific spatial smoothness parameter. Note that all species-specific spatial parameters are independent of each other. We currently do not leverage any correlation between spatial processes of different species, although this is something we plan to incorporate for future spOccupancy development. Initial values for phi, sigma.sq, and nu (if applicable) are specified as vectors with $$N$$ elements (the number of species being modeled) or as a single value that is used for all species, while the initial values for the latent spatial processes are specified as a matrix with $$N$$ rows (i.e., species) and $$J$$ columns (i.e., sites). Here we set the initial value for the spatial variances equal to 2 for all species and set the initial values for the spatial decay parameter to yield an effective range of the average distance between sites across the HBEF.

# Number of species
N <- dim(hbef2015$y)[1] # Distances between sites dist.hbef <- dist(hbef2015$coords)
# Exponential covariance model
cov.model <- "exponential"
ms.inits <- list(alpha.comm = 0,
beta.comm = 0,
beta = 0,
alpha = 0,
tau.sq.beta = 1,
tau.sq.alpha = 1,
z = apply(hbef2015$y, c(1, 2), max, na.rm = TRUE), sigma.sq = 2, phi = 3 / mean(dist.hbef), w = matrix(0, N, dim(hbef2015$y)[2]))

We next specify the priors in the priors argument. The priors are the same as those we specified for the non-spatial multi-species model, with the addition of priors for the parameters controlling the species-specific spatial processes. We assume independent priors for all spatial parameters across the different species. Priors (and their default values in spOccupancy) for the spatial parameters are exactly analogous to those we saw for the single-species spatially-explicit occupancy model. For each species, we assign an inverse gamma prior for the spatial variance parameter sigma.sq (tag is sigma.sq.ig). As of v0.3.2, we can also specify a uniform prior for the spatial variance parameter with tag sigma.sq.unif (see relevant discussion in single-species spatial occupancy models). We assign uniform priors for the spatial decay parameter phi and smoothness parameter nu (if cov.model = 'matern'), with the associated tags phi.unif and nu.unif. All priors are specified as lists with two elements. For the inverse-Gamma prior, the first element is a length $$N$$ vector of shape parameters for each species, and the second element is a length $$N$$ vector of scale parameters for each species. As we have seen many times before, if the same prior is used for all species, both elements can be specified as single values. For the uniform priors, the first element is a length $$N$$ vector of the lower bounds for each species, and the second element is a length $$N$$ vector of upper bounds for each species. If the same prior is used for all species, both the lower and upper bounds can be specified as single values. For the inverse-Gamma prior on the spatial variances, here we set the shape parameter to 2 and the scale parameter also equal to 2. For a more formal analysis, we may want to do some exploratory data analysis to obtain a better guess for the spatial variance for each species, and then replace the scale parameter with this estimated guess for each species. For the spatial decay parameter, we determine the bounds of the uniform distribution by computing the smallest distance between sites and the largest distance between sites. We then set the lower bound of the uniform to 3/max and the upper bound to 3/min, where min and max correspond to the predetermined distances between sites. Again, these are the default hyperparameter values for the prior distribution of phi that spOccupancy will assign.

# Minimum value is 0, so need to grab second element.
min.dist <- sort(unique(dist.hbef))[2]
max.dist <- max(dist.hbef)
ms.priors <- list(beta.comm.normal = list(mean = 0, var = 2.72),
alpha.comm.normal = list(mean = 0, var = 2.72),
tau.sq.beta.ig = list(a = 0.1, b = 0.1),
tau.sq.alpha.ig = list(a = 0.1, b = 0.1),
sigma.sq.ig = list(a = 2, b = 2),
phi.unif = list(a = 3 / max.dist, b = 3 / min.dist))

We next set the parameters controlling the Adaptive MCMC algorithm (see spPGOcc() section for details). Notice our specification of the initial tuning values is exactly the same as for spPGOcc. We assume the same initial tuning value for all species. However, the adaptive algorithm will allow for species-specific tuning parameters, so these will be adjusted in the algorithm as needed (and reported to the R console if verbose = TRUE).

batch.length <- 25
n.batch <- 400
n.burn <- 2000
n.thin <- 20
n.chains <- 3
ms.tuning <- list(phi = 0.5)
# Values for reporting
verbose <- TRUE
n.report <- 100

Spatially explicit multi-species occupancy models are currently the most computationally intensive models fit by spOccupancy. Even for modest sized data sets, we encourage the use of NNGPs instead of full GPs when fitting spatially-explicit models to ease the computational burden of fitting these models. We fit the model with an NNGP below using 5 neighbors and summarize it using the summary() function, where we specify that we want to summarize both species and community-level parameters.

# Approx. run time: 10 min
out.sp.ms <- spMsPGOcc(occ.formula = occ.ms.sp.formula,
det.formula = det.ms.sp.formula,
data = hbef2015,
inits = ms.inits,
n.batch = n.batch,
batch.length = batch.length,
accept.rate = 0.43,
priors = ms.priors,
cov.model = cov.model,
tuning = ms.tuning,
verbose = TRUE,
NNGP = TRUE,
n.neighbors = 5,
n.report = n.report,
n.burn = n.burn,
n.thin = n.thin,
n.chains = n.chains)
----------------------------------------
Preparing to run the model
----------------------------------------
----------------------------------------
Building the neighbor list
----------------------------------------
----------------------------------------
Building the neighbors of neighbors list
----------------------------------------
----------------------------------------
Model description
----------------------------------------
NNGP Multispecies Occupancy Model with Polya-Gamma latent
variable fit with 373 sites and 12 species.

Samples per chain: 10000 (400 batches of length 25)
Burn-in: 2000
Thinning Rate: 20
Number of Chains: 3
Total Posterior Samples: 1200

Using the exponential spatial correlation model.

Using 5 nearest neighbors.

Source compiled with OpenMP support and model fit using 1 thread(s).

Adaptive Metropolis with target acceptance rate: 43.0
----------------------------------------
Chain 1
----------------------------------------
Sampling ...
Batch: 100 of 400, 25.00%
Species     Parameter   Acceptance  Tuning
1       phi     60.0        0.50503
2       phi     76.0        0.58092
3       phi     20.0        0.50503
4       phi     12.0        0.50503
5       phi     36.0        0.35946
6       phi     48.0        0.39727
7       phi     32.0        0.46620
8       phi     76.0        0.47561
9       phi     56.0        0.31250
10      phi     20.0        0.80000
11      phi     40.0        0.33853
12      phi     44.0        0.41348
-------------------------------------------------
Batch: 200 of 400, 50.00%
Species     Parameter   Acceptance  Tuning
1       phi     28.0        0.70953
2       phi     44.0        0.45697
3       phi     76.0        0.73849
4       phi     40.0        0.84947
5       phi     28.0        0.36672
6       phi     36.0        0.50503
7       phi     28.0        0.48522
8       phi     88.0        0.59265
9       phi     36.0        0.28847
10      phi     20.0        0.73849
11      phi     52.0        0.31250
12      phi     52.0        0.42183
-------------------------------------------------
Batch: 300 of 400, 75.00%
Species     Parameter   Acceptance  Tuning
1       phi     12.0        0.56941
2       phi     60.0        0.55814
3       phi     72.0        1.05850
4       phi     40.0        0.81616
5       phi     40.0        0.30631
6       phi     44.0        0.62930
7       phi     20.0        0.53625
8       phi     32.0        0.54709
9       phi     40.0        0.31881
10      phi     60.0        0.60462
11      phi     40.0        0.28847
12      phi     68.0        0.42183
-------------------------------------------------
Batch: 400 of 400, 100.00%
----------------------------------------
Chain 2
----------------------------------------
Sampling ...
Batch: 100 of 400, 25.00%
Species     Parameter   Acceptance  Tuning
1       phi     16.0        0.51523
2       phi     24.0        0.92022
3       phi     0.0     1.14666
4       phi     48.0        0.60462
5       phi     48.0        0.39727
6       phi     60.0        0.35234
7       phi     20.0        0.62930
8       phi     48.0        0.55814
9       phi     44.0        0.31250
10      phi     8.0     0.59265
11      phi     52.0        0.30025
12      phi     32.0        0.38169
-------------------------------------------------
Batch: 200 of 400, 50.00%
Species     Parameter   Acceptance  Tuning
1       phi     32.0        0.58092
2       phi     64.0        0.99686
3       phi     8.0     1.14666
4       phi     12.0        0.75341
5       phi     20.0        0.34537
6       phi     56.0        0.37413
7       phi     16.0        0.65498
8       phi     20.0        0.44792
9       phi     40.0        0.31250
10      phi     68.0        0.75341
11      phi     60.0        0.30025
12      phi     48.0        0.48522
-------------------------------------------------
Batch: 300 of 400, 75.00%
Species     Parameter   Acceptance  Tuning
1       phi     64.0        0.59265
2       phi     24.0        0.43035
3       phi     20.0        1.26725
4       phi     48.0        0.81616
5       phi     32.0        0.34537
6       phi     36.0        0.37413
7       phi     36.0        0.49502
8       phi     60.0        0.55814
9       phi     44.0        0.31250
10      phi     36.0        0.88413
11      phi     48.0        0.27716
12      phi     32.0        0.37413
-------------------------------------------------
Batch: 400 of 400, 100.00%
----------------------------------------
Chain 3
----------------------------------------
Sampling ...
Batch: 100 of 400, 25.00%
Species     Parameter   Acceptance  Tuning
1       phi     36.0        0.73849
2       phi     40.0        0.53625
3       phi     24.0        0.81616
4       phi     20.0        0.90199
5       phi     32.0        0.50503
6       phi     48.0        0.38169
7       phi     76.0        0.40529
8       phi     32.0        0.55814
9       phi     28.0        0.28276
10      phi     0.0     1.67674
11      phi     36.0        0.30025
12      phi     44.0        0.38940
-------------------------------------------------
Batch: 200 of 400, 50.00%
Species     Parameter   Acceptance  Tuning
1       phi     80.0        0.66821
2       phi     20.0        0.62930
3       phi     40.0        0.37413
4       phi     36.0        0.95777
5       phi     48.0        0.37413
6       phi     28.0        0.48522
7       phi     60.0        0.46620
8       phi     36.0        0.41348
9       phi     40.0        0.28847
10      phi     36.0        1.48714
11      phi     76.0        0.33853
12      phi     48.0        0.40529
-------------------------------------------------
Batch: 300 of 400, 75.00%
Species     Parameter   Acceptance  Tuning
1       phi     24.0        0.70953
2       phi     40.0        0.52564
3       phi     96.0        0.40529
4       phi     72.0        0.80000
5       phi     92.0        0.36672
6       phi     24.0        0.35234
7       phi     68.0        0.64201
8       phi     48.0        0.44792
9       phi     44.0        0.28276
10      phi     0.0     1.37280
11      phi     32.0        0.31881
12      phi     36.0        0.40529
-------------------------------------------------
Batch: 400 of 400, 100.00%
summary(out.sp.ms, level = 'both')

Call:
spMsPGOcc(occ.formula = occ.ms.sp.formula, det.formula = det.ms.sp.formula,
data = hbef2015, inits = ms.inits, priors = ms.priors, tuning = ms.tuning,
cov.model = cov.model, NNGP = TRUE, n.neighbors = 5, n.batch = n.batch,
verbose = TRUE, n.report = n.report, n.burn = n.burn, n.thin = n.thin,
n.chains = n.chains)

Samples per Chain: 10000
Burn-in: 2000
Thinning Rate: 20
Number of Chains: 3
Total Posterior Samples: 1200
Run Time (min): 17.2565

----------------------------------------
Community Level
----------------------------------------
Occurrence Means (logit scale):
Mean     SD    2.5%     50%  97.5%   Rhat  ESS
(Intercept)            0.3597 1.0176 -1.6049  0.3543 2.4254 1.0011 1032
scale(Elevation)       0.3016 0.6899 -1.0360  0.3053 1.7081 1.0064  883
I(scale(Elevation)^2) -0.2811 0.3538 -0.9803 -0.2856 0.4529 1.0218  723

Occurrence Variances (logit scale):
Mean      SD   2.5%     50%   97.5%   Rhat ESS
(Intercept)           20.7689 11.6274 7.2632 17.8888 50.1195 1.0069 187
scale(Elevation)       6.3631  3.9216 1.9893  5.2830 17.1310 1.0900 246
I(scale(Elevation)^2)  1.1639  0.8402 0.2748  0.9586  3.3140 1.0161 258

Detection Means (logit scale):
Mean     SD    2.5%     50%  97.5%   Rhat  ESS
(Intercept)     -0.7027 0.4561 -1.6310 -0.6969 0.2389 1.0159  424
scale(day)       0.0533 0.0887 -0.1241  0.0519 0.2276 1.0034 1200
scale(tod)      -0.0367 0.0756 -0.1884 -0.0363 0.1095 1.0007 1200
I(scale(day)^2) -0.0207 0.0863 -0.1922 -0.0177 0.1447 1.0011 1415

Detection Variances (logit scale):
Mean     SD   2.5%    50%  97.5%   Rhat  ESS
(Intercept)     2.3997 1.3965 0.8328 2.0825 5.6853 1.1251  244
scale(day)      0.0647 0.0350 0.0240 0.0553 0.1576 1.0085 1200
scale(tod)      0.0495 0.0309 0.0163 0.0423 0.1265 1.0196 1200
I(scale(day)^2) 0.0582 0.0390 0.0188 0.0475 0.1583 1.0055 1200

----------------------------------------
Species Level
----------------------------------------
Occurrence (logit scale):
Mean     SD    2.5%     50%   97.5%   Rhat ESS
(Intercept)-AMRE           -2.7839 2.7995 -6.2666 -3.3099  5.8533 2.4781  20
(Intercept)-BAWW            2.3955 2.3571 -1.0727  1.9490  8.3598 1.8358  33
(Intercept)-BHVI            3.6254 3.0337 -1.1396  3.2159 10.4335 1.4752  47
(Intercept)-BLBW            3.3546 1.0681  2.0451  3.1021  6.0533 1.0495 115
(Intercept)-BLPW           -5.7827 1.0190 -8.0843 -5.6333 -4.1817 1.0164 121
(Intercept)-BTBW            5.1493 1.1764  3.4881  4.9594  8.0912 1.0921  81
(Intercept)-BTNW            3.3242 0.8880  2.0654  3.1474  5.6730 1.0717  94
(Intercept)-CAWA           -2.9981 0.9436 -5.0567 -2.9225 -1.2636 1.0179 129
(Intercept)-MAWA           -3.8120 1.2331 -6.4453 -3.6571 -1.7271 1.1076  92
(Intercept)-NAWA           -3.1808 0.7509 -4.9280 -3.1228 -1.7643 1.0078 218
(Intercept)-OVEN            3.6986 1.1148  1.5970  3.6054  6.2405 1.2508  80
(Intercept)-REVI            4.3757 1.0710  2.8313  4.1667  7.1326 1.1002 109
scale(Elevation)-AMRE       1.2484 1.3159 -0.8497  1.1333  4.2380 1.0957  96
scale(Elevation)-BAWW      -1.1052 1.7476 -6.4918 -0.7208  1.5054 2.0653  32
scale(Elevation)-BHVI       0.2855 1.3816 -2.4782  0.2416  3.2837 1.0068 169
scale(Elevation)-BLBW      -0.5659 0.5475 -1.4730 -0.4810 -0.0115 1.4423  76
scale(Elevation)-BLPW       3.1770 0.8314  1.9339  3.0566  5.0817 1.0212 368
scale(Elevation)-BTBW      -0.6607 0.3144 -1.3604 -0.6364 -0.0997 1.0356 295
scale(Elevation)-BTNW       0.8616 0.4211  0.2232  0.7996  1.8574 1.0040 535
scale(Elevation)-CAWA       2.5017 0.8703  1.1382  2.4132  4.5013 1.0093 314
scale(Elevation)-MAWA       3.1731 1.0251  1.5879  3.0129  5.4575 1.0739  81
scale(Elevation)-NAWA       1.3518 0.4610  0.5993  1.2988  2.3902 1.0532 467
scale(Elevation)-OVEN      -3.0981 0.9067 -5.2549 -2.9631 -1.7147 1.0773 116
scale(Elevation)-REVI      -2.9657 1.1394 -5.8219 -2.7039 -1.4194 1.0982 182
I(scale(Elevation)^2)-AMRE -0.8691 0.8807 -2.6965 -0.8757  0.9554 1.2924 109
I(scale(Elevation)^2)-BAWW -1.2662 0.7206 -2.7113 -1.2555  0.2190 1.2940 113
I(scale(Elevation)^2)-BHVI  0.5053 0.9266 -1.3355  0.4406  2.4339 1.0144 241
I(scale(Elevation)^2)-BLBW -0.7195 0.3017 -1.3739 -0.6956 -0.2447 1.0260 151
I(scale(Elevation)^2)-BLPW  1.0131 0.5114 -0.0250  1.0373  1.9844 1.0328 415
I(scale(Elevation)^2)-BTBW -1.4685 0.3629 -2.3426 -1.4104 -0.9319 1.0927  98
I(scale(Elevation)^2)-BTNW -0.1944 0.2512 -0.7000 -0.2055  0.2947 1.0404 844
I(scale(Elevation)^2)-CAWA -0.6908 0.5777 -1.8527 -0.6590  0.4648 1.0016 450
I(scale(Elevation)^2)-MAWA  0.6772 0.5442 -0.2918  0.6537  1.8326 1.0053 517
I(scale(Elevation)^2)-NAWA  0.3880 0.3277 -0.2428  0.3777  1.0331 1.0142 653
I(scale(Elevation)^2)-OVEN -0.8304 0.4283 -1.6853 -0.8213 -0.0175 1.0024 497
I(scale(Elevation)^2)-REVI  0.0063 0.5554 -0.9405 -0.0508  1.2308 1.0512 239

Detection (logit scale):
Mean     SD    2.5%     50%   97.5%   Rhat  ESS
(Intercept)-AMRE     -2.5565 1.4008 -5.4953 -2.2636 -0.2155 1.5742   45
(Intercept)-BAWW     -2.5949 0.3923 -3.2812 -2.6311 -1.8298 1.4104  106
(Intercept)-BHVI     -2.6425 0.3212 -3.1665 -2.6867 -1.8837 1.7077   34
(Intercept)-BLBW     -0.0252 0.1279 -0.2801 -0.0212  0.2179 1.0528  378
(Intercept)-BLPW     -0.4344 0.2682 -0.9807 -0.4349  0.0584 1.0014 1099
(Intercept)-BTBW      0.6272 0.1042  0.4179  0.6247  0.8268 0.9995 1088
(Intercept)-BTNW      0.5902 0.1105  0.3695  0.5917  0.8056 1.0080 1200
(Intercept)-CAWA     -1.4753 0.5489 -2.6137 -1.4608 -0.4972 1.0107  265
(Intercept)-MAWA     -0.6802 0.2323 -1.1362 -0.6740 -0.1938 0.9998  961
(Intercept)-NAWA     -1.4636 0.4748 -2.3708 -1.4631 -0.5867 1.0142  567
(Intercept)-OVEN      0.8149 0.1185  0.5891  0.8109  1.0389 1.0053 1200
(Intercept)-REVI      0.5407 0.1087  0.3332  0.5448  0.7498 1.0036 1097
scale(day)-AMRE       0.0377 0.2349 -0.4194  0.0360  0.4849 1.0122 1200
scale(day)-BAWW       0.2076 0.1435 -0.0631  0.2091  0.4954 1.0047  972
scale(day)-BHVI       0.2048 0.1186 -0.0141  0.2015  0.4361 1.0104 1200
scale(day)-BLBW      -0.2304 0.0680 -0.3618 -0.2286 -0.0932 1.0012 1200
scale(day)-BLPW       0.0629 0.1562 -0.2476  0.0635  0.3640 1.0004 1200
scale(day)-BTBW       0.2670 0.0665  0.1411  0.2664  0.3976 1.0100 1200
scale(day)-BTNW       0.1461 0.0686  0.0115  0.1453  0.2761 1.0106 1200
scale(day)-CAWA      -0.0230 0.1747 -0.3750 -0.0247  0.3076 1.0055 1200
scale(day)-MAWA       0.1047 0.1252 -0.1323  0.1032  0.3559 1.0166 1341
scale(day)-NAWA       0.0126 0.1796 -0.3268  0.0063  0.3637 1.0037 1200
scale(day)-OVEN      -0.0758 0.0707 -0.2084 -0.0743  0.0633 1.0037 1200
scale(day)-REVI      -0.0754 0.0652 -0.2042 -0.0755  0.0427 1.0019 1200
scale(tod)-AMRE      -0.0318 0.2173 -0.4623 -0.0309  0.4051 1.0212  987
scale(tod)-BAWW      -0.1747 0.1392 -0.4514 -0.1718  0.0780 1.0016 1200
scale(tod)-BHVI      -0.0424 0.1101 -0.2709 -0.0447  0.1704 1.0174 1200
scale(tod)-BLBW       0.0594 0.0673 -0.0738  0.0580  0.1875 1.0121 1093
scale(tod)-BLPW       0.1024 0.1455 -0.1827  0.1033  0.3946 1.0030 1200
scale(tod)-BTBW      -0.0361 0.0658 -0.1692 -0.0372  0.0938 1.0018 1200
scale(tod)-BTNW       0.0401 0.0644 -0.0885  0.0412  0.1721 1.0026 1200
scale(tod)-CAWA      -0.2041 0.1746 -0.5550 -0.1954  0.1211 1.0044 1200
scale(tod)-MAWA       0.0180 0.1162 -0.2001  0.0147  0.2474 1.0040 1200
scale(tod)-NAWA      -0.0610 0.1612 -0.3602 -0.0686  0.2751 1.0123 1200
scale(tod)-OVEN      -0.0425 0.0734 -0.1828 -0.0420  0.0964 1.0030 1200
scale(tod)-REVI      -0.0800 0.0653 -0.2125 -0.0812  0.0438 1.0060 1200
I(scale(day)^2)-AMRE -0.1050 0.2253 -0.5649 -0.0978  0.3139 1.0013 1241
I(scale(day)^2)-BAWW -0.0382 0.1456 -0.3232 -0.0383  0.2489 1.0033 1200
I(scale(day)^2)-BHVI  0.0552 0.1320 -0.1815  0.0538  0.3287 1.0020 1200
I(scale(day)^2)-BLBW -0.1704 0.0814 -0.3267 -0.1727 -0.0116 1.0029 1200
I(scale(day)^2)-BLPW  0.2060 0.1897 -0.1579  0.2028  0.6264 1.0039 1200
I(scale(day)^2)-BTBW -0.0604 0.0786 -0.2210 -0.0599  0.0990 1.0005 1200
I(scale(day)^2)-BTNW -0.0599 0.0807 -0.2176 -0.0610  0.0969 1.0066 1143
I(scale(day)^2)-CAWA -0.0353 0.1837 -0.3963 -0.0350  0.3161 1.0051 1088
I(scale(day)^2)-MAWA  0.0007 0.1399 -0.2760  0.0019  0.2884 1.0057 1200
I(scale(day)^2)-NAWA -0.1388 0.1903 -0.5233 -0.1325  0.2080 1.0004 1200
I(scale(day)^2)-OVEN  0.0208 0.0883 -0.1535  0.0197  0.1985 0.9996 1200
I(scale(day)^2)-REVI  0.0394 0.0771 -0.1070  0.0390  0.1841 0.9993 1200

Spatial Covariance:
Mean      SD   2.5%     50%   97.5%   Rhat ESS
sigma.sq-AMRE  2.1899  2.2940 0.4064  1.4374  9.0988 1.1310 102
sigma.sq-BAWW  6.5982 16.9455 0.3967  1.4193 54.3471 2.7628  19
sigma.sq-BHVI  9.7972 19.1097 0.4265  1.7880 70.9977 7.6504  15
sigma.sq-BLBW  2.8018  3.6071 0.3978  1.6198 13.0921 1.2352  78
sigma.sq-BLPW  2.1438  1.9955 0.3600  1.5393  8.2392 1.0245  62
sigma.sq-BTBW  2.8815  2.4739 0.5351  2.0934 10.0705 1.0782  64
sigma.sq-BTNW  3.3354  3.4263 0.4612  2.2685 12.7757 1.0235  90
sigma.sq-CAWA  2.6432  2.5349 0.4600  1.7529  9.9263 1.0666  71
sigma.sq-MAWA 12.9775  9.8118 2.3417 10.9231 34.1622 1.1662  48
sigma.sq-NAWA  1.7696  1.7041 0.3605  1.1927  7.6151 1.1013 103
sigma.sq-OVEN  8.9635  5.9542 2.1749  7.4681 25.3373 1.0642 108
sigma.sq-REVI  3.0870  2.8910 0.5851  2.1013 11.3286 1.1086  86
phi-AMRE       0.0124  0.0084 0.0010  0.0103  0.0291 1.0080 106
phi-BAWW       0.0106  0.0087 0.0008  0.0077  0.0283 1.1699  57
phi-BHVI       0.0135  0.0088 0.0008  0.0126  0.0288 1.8660  59
phi-BLBW       0.0139  0.0079 0.0024  0.0122  0.0292 1.0137 156
phi-BLPW       0.0062  0.0067 0.0007  0.0037  0.0266 1.2457  55
phi-BTBW       0.0068  0.0070 0.0007  0.0040  0.0269 1.1759  41
phi-BTNW       0.0101  0.0077 0.0008  0.0074  0.0284 1.2180  77
phi-CAWA       0.0100  0.0091 0.0006  0.0064  0.0288 1.0189  27
phi-MAWA       0.0019  0.0008 0.0007  0.0018  0.0037 1.1800 150
phi-NAWA       0.0150  0.0085 0.0020  0.0152  0.0293 1.0669  60
phi-OVEN       0.0021  0.0008 0.0007  0.0020  0.0040 1.0501 180
phi-REVI       0.0074  0.0061 0.0011  0.0053  0.0250 1.0618  95

The resulting object out.sp.ms is a list of class spMsPGOcc consisting primarily of posterior samples of all community and species-level parameters, as well as some additional objects that are used for summaries, predictions, and model fit evaluation. The Rhat values indicate convergence of most community and species-level parameters, but some parameters for rare species have not yet converged and the ESS of many species-level parameters are quite small. This is a clear reminder that spatially-explicit models often have slow mixing rates, and so you will often need to run these models for a large number of samples in order to achieve convergence and adequate ESS.

### Posterior predictive checks

We perform posterior predictive checks to assess Goodness of Fit using ppcOcc() just as we have previously seen.

# Takes a few seconds to run.
ppc.sp.ms.out <- ppcOcc(out.sp.ms, 'freeman-tukey', group = 2)
Currently on species 1 out of 12
Currently on species 2 out of 12
Currently on species 3 out of 12
Currently on species 4 out of 12
Currently on species 5 out of 12
Currently on species 6 out of 12
Currently on species 7 out of 12
Currently on species 8 out of 12
Currently on species 9 out of 12
Currently on species 10 out of 12
Currently on species 11 out of 12
Currently on species 12 out of 12
summary(ppc.sp.ms.out, level = 'both')

Call:
ppcOcc(object = out.sp.ms, fit.stat = "freeman-tukey", group = 2)

Samples per Chain: 10000
Burn-in: 2000
Thinning Rate: 20
Number of Chains: 3
Total Posterior Samples: 1200

----------------------------------------
Community Level
----------------------------------------
Bayesian p-value:  0.5103

----------------------------------------
Species Level
----------------------------------------
AMRE Bayesian p-value: 0.7467
BAWW Bayesian p-value: 0.5342
BHVI Bayesian p-value: 0.5783
BLBW Bayesian p-value: 0.3092
BLPW Bayesian p-value: 0.4025
BTBW Bayesian p-value: 0.4958
BTNW Bayesian p-value: 0.48
CAWA Bayesian p-value: 0.5483
MAWA Bayesian p-value: 0.57
NAWA Bayesian p-value: 0.5992
OVEN Bayesian p-value: 0.385
REVI Bayesian p-value: 0.475
Fit statistic:  freeman-tukey 

### Model selection using WAIC

Below we compute the WAIC using waicOcc() and compare it to the WAIC for the non-spatial multi-species occupancy model.

waicOcc(out.sp.ms)
      elpd         pD       WAIC
-4198.9219   310.4972  9018.8380 
waicOcc(out.ms)
       elpd          pD        WAIC
-4531.71692    65.44891  9194.33166 

As always, remember there is MC error in these numbers, and so the values you receive will be slightly different. The WAIC for the spatial model is considerably smaller than that for the nonspatial model, indicating the species-specific spatial processes improve prediction across the entire community. However, in a complete analysis we should ensure the models fully converge (and we have adequate ESS) before performing any model selection or comparison.

k-fold cross-validation proceeds using the k.fold argument as we have seen previously, returning a separate scoring rule (deviance) for each species.

### Prediction

Prediction with spMsPGOcc objects again uses the predict() function given a set of covariates and spatial coordinates of a set of locations. Results are very similar to the nonspatial multi-species model, so we do not execute the following code, since prediction in spatial models takes some time and run time is multiplied by the number of species here.

elev.pred <- (hbefElev$val - mean(ovenHBEF$occ.covs[, 1])) / sd(ovenHBEF$occ.covs[, 1]) X.0 <- cbind(1, elev.pred, elev.pred^2) coords.0 <- as.matrix(hbefElev[, c('Easting', 'Northing')]) out.sp.ms.pred <- predict(out.sp.ms, X.0, coords.0) ## Single-species integrated occupancy models Data integration is a model-based approach that leverages multiple data sources to provide inference and prediction on some latent process of interest (Miller et al. 2019). Data integration is particularly relevant in ecology as many data sources are often collected to better understand a single ecological phenomenon, with each data source having advantages and disadvantages, say in terms of sample size, spatial extent, or data reliability. Often, multiple detection-nondetection data sources are available to study the occurrence and distribution of some species of interest. For example, both point count surveys performed by human observers and autonomous recording units could be used to monitor a bird species (Doser et al. 2021). Different types of data have different sources of observation error, which we should explicitly incorporate into a model to avoid attributing any variation in detection probability to the true ecological process. Here we describe single-species integrated occupancy models, which combine multiple sources of detection-nondetection data in a single hierarchical modeling framework. These data sources may or may not be replicated, but we note that the combination of two or more sources always results in a replicated data set. ### Basic model description The integrated occupancy model has an identical process model to that of the single-species occupancy model, but has a distinct detection model for each data source, each of which is conditional on the same, shared ecological process (i.e., species occurrence). Let $$z_j$$ be the presence or absence of a species at site $$j$$, with $$j = 1, \dots, J$$. We assume this latent occurrence variable arises from a Bernoulli process following $$$\begin{split} &z_j \sim \text{Bernoulli}(\psi_j), \\ &\text{logit}(\psi_j) = \boldsymbol{x}^{\top}_j\boldsymbol{\beta}, \end{split}$$$ where $$\psi_j$$ is the probability of occurrence at site $$j$$, which is a function of site-specific covariates $$\boldsymbol{X}$$ and a vector of regression coefficients ($$\boldsymbol{\beta}$$). We do not directly observe $$z_j$$, but rather we observe an imperfect representation of the latent occurrence process. In integrated models, we have $$r = 1, \dots, R$$ distinct sources of data that are all imperfect representations of a single, shared occurrence process. Let $$y_{r, a, k}$$ be the observed detection (1) or nondetection (0) of a species of interest in data set $$r$$ at site $$a$$ during replicate $$k$$. Because different data sources have different variables influencing the observation process, we envision a separate detection model for each data source that is defined conditional on a single, shared ecological process described above. We envision the detection-nondetection data from source $$r$$ as arising from a Bernoulli process conditional on the true latent occurrence process: $$$\begin{split} &y_{r, a, k} \sim \text{Bernoulli}(p_{r, a, k}z_{j[a]}), \\ &\text{logit}(p_{r, a, k}) = \boldsymbol{v}^{\top}_{r, a, k}\boldsymbol{\alpha}_r, \end{split}$$$ where $$p_{r, a, k}$$ is the probability of detecting a species at site $$a$$ during replicate $$k$$ (given it is present at site $$a$$) for data source $$r$$, which is a function of site, replicate, and data source specific covariates $$\boldsymbol{V}_r$$ and a vector of regression coefficients specific to each data source ($$\boldsymbol{\alpha}_r$$). Note that $$z_{j[a]}$$ is the true occurrence status at site $$j$$ corresponding to the $$a$$th data source site in the given data set $$r$$. Each data source may be available at all $$J$$ sites in the region of interest or at a subset of the $$J$$ sites. Additionally, data sources can overlap in the sites they sample, or they can be obtained at distinct sites within all $$J$$ sites of interest in the overall region. We assume multivariate normal priors for the occurrence ($$\boldsymbol{\beta}$$) and data-set specific detection ($$\boldsymbol{\alpha}$$) regression coefficients to complete the Bayesian specification of a single-species occupancy model. Pólya-Gamma data augmentation is implemented in an analogous manner to that of the previous models to yield an efficient implementation of integrated occupancy models. ### Example data sources: Ovenbird occurrence in the White Mountain National Forest To illustrate an integrated occupancy model, we will use two data sets that come from the White Mountain National Forest (WMNF) in New Hampshire and Maine, USA. Our goal is to model the occurrence of OVEN in the WMNF in 2015. Our first data source is the HBEF data set we have used to illustrate all single data source models. Our second data source comes from the National Ecological Observatory Network (NEON) at Bartlett Experimental Forest (Barnett et al. 2019; National Ecological Observatory Network (NEON) 2021). The Bartlett Forest and HBEF both lie within the larger WMNF. Suppose we are interested in OVEN occurrence across the entire WMNF. By leveraging both data sources in a single integrated model, we will expand the range of covariates across which we can make reliable predictions, and may obtain results that are more indicative across the entire region of interest and not just a single data source location (Doser et al. 2022). In this particular case, there is no overlap between the two data sources (i.e., Bartlett Forest and HBEF do not overlap spatially). However, the integrated occupancy models fit by spOccupancy can integrate data sources with no overlap, partial overlap, or complete overlap. The NEON data are provided along with spOccupancy in the neon2015 list. We load the NEON data along with the HBEF data below data(hbef2015) data(neon2015) str(neon2015) # Look at what's in the new data set. List of 4$ y       : num [1:12, 1:80, 1:3] 0 0 0 0 0 0 1 0 0 0 ...
$occ.covs: num [1:80, 1] 390 425 443 382 441 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : NULL
.. ..$: chr "Elevation"$ det.covs:List of 2
..$day: num [1:80] 169 169 169 169 169 169 169 169 169 170 ... ..$ tod: int [1:80] 8 8 6 7 8 6 7 7 7 8 ...
$coords : num [1:80, 1:2] 318472 318722 318972 318472 318722 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : chr [1:80] "1" "2" "3" "4" ...
.. ..$: chr [1:2] "X" "Y" Details on the NEON data set are provided in the package documentation as well as in Doser et al. (2022). The NEON data are collected at 80 point count sites in Bartlett Forest using a removal protocol with three time periods, resulting in replicated detection-nondetection data that can be used in an occupancy modeling framework (after reducing counts to binary detection indicators). The neon2015 list, like the hbef2015 object, contains the detection-nondetection data for 12 foliage-gleaning bird species (y), occurrence covariates stored in occ.covs, detection covariates stored in det.covs, and the coordinates of the 80 point count locations stored in coords. Below we subset the detection-nondetection data in both data sources to solely work with OVEN. sp.names <- dimnames(hbef2015$y)[[1]]
ovenHBEF <- hbef2015
ovenHBEF$y <- ovenHBEF$y[sp.names == "OVEN", , ]
ovenNEON <- neon2015
ovenNEON$y <- ovenNEON$y[sp.names == "OVEN", , ]
table(ovenHBEF$y)  0 1 518 588  table(ovenNEON$y)

0  1
61 62 

Thus, the ovenbird is observed in a little over half of the possible site/replicate combinations in both of the data sources.

### Fitting single-species integrated occupancy models with intPGOcc()

The function intPGOcc() fits single-species integrated occupancy models in spOccupancy. Syntax is very similar to that of single data source models, and specifically takes the following form:

intPGOcc(occ.formula, det.formula, data, inits, n.samples, priors,
n.omp.threads = 1, verbose = TRUE, n.report = 1000,
n.burn = round(.10 * n.samples), n.thin = 1, n.chains = 1,
k.fold, k.fold.threads = 1, k.fold.seed, k.fold.data, ...)

The data argument contains the list of data elements necessary for fitting an integrated occupancy model. For nonspatial integrated occupancy models, data should be a list comprised of the following objects: y (list of detection-nondetection data matrices for each data source), occ.covs (data frame or matrix of covariates for occurrence model), det.covs (a list of lists where each element of the list corresponds to the detection-nondetection data for the given data source), sites (a list where each element consists of the site indices for the given data source.

The ovenHBEF and ovenNEON lists are currently formatted for use in single data source models and so we need to combine these data sources together. Perhaps the trickiest part of data integration is ensuring each point count location in each data source lines up with the correct geographical location where you want to determine the true presence/absence of the species of interest. In spOccupancy, most of this bookkeeping is done under the hood, but we will need to combine the two data sources together into a single list in which we are consistent about how the data sources are sorted. To accomplish this, we recommend first creating the occurrence covariates matrix for all data sources. Because our two data sources do not overlap spatially, this is relatively simple here as we can just use rbind().

occ.covs.int <- rbind(ovenHBEF$occ.covs, ovenNEON$occ.covs)
str(occ.covs.int)
 num [1:453, 1] 475 494 546 587 588 ...
- attr(*, "dimnames")=List of 2
..$: NULL ..$ : chr "Elevation"

Notice the order in which we placed these covariates: all covariate values for HBEF come first, followed by all covariates for NEON. We need to ensure that we use this identical ordering for all objects in the data list. Next, we create the site indices stored in sites. sites should be a list with two elements (one for each data source), where each element consists of a vector that indicates the rows in occ.covs that correspond with the specific row of the detection-nondetection data for that data source. When the data sources stem from distinct points (like in our current case), this is again relatively straightforward as the indices simply correspond to how we ordered the points in occ.covs.

sites.int <- list(hbef = 1:nrow(ovenHBEF$occ.covs), neon = 1:nrow(ovenNEON$occ.covs) + nrow(ovenHBEF$occ.covs)) str(sites.int) List of 2$ hbef: int [1:373] 1 2 3 4 5 6 7 8 9 10 ...
$neon: int [1:80] 374 375 376 377 378 379 380 381 382 383 ... Next we create the combined detection-nondetection data y. For integrated models in spOccupancy, y is a list of matrices, with each element containing the detection-nondetection matrix for the specific data source. Again, we must ensure that we place the data sources in the correct order. y.int <- list(hbef = ovenHBEF$y,
neon = ovenNEON$y) str(y.int) List of 2$ hbef: num [1:373, 1:3] 1 1 0 1 0 0 1 0 1 1 ...
..- attr(*, "dimnames")=List of 2
.. ..$: chr [1:373] "1" "2" "3" "4" ... .. ..$ : chr [1:3] "1" "2" "3"
$neon: num [1:80, 1:3] 1 1 0 1 1 0 1 1 0 1 ... Lastly, we create the detection covariates det.covs. This should be a list of the detection covariates from each individual data source. Because individual data source detection covariates are stored as lists for single data source models in spOccupancy, det.covs is now a list of lists for integrated occupancy models. det.covs.int <- list(hbef = ovenHBEF$det.covs,
neon = ovenNEON$det.covs) Finally, we package everything together into a single list, which we call data.int. data.int <- list(y = y.int, occ.covs = occ.covs.int, det.covs = det.covs.int, sites = sites.int) str(data.int) List of 4$ y       :List of 2
..$hbef: num [1:373, 1:3] 1 1 0 1 0 0 1 0 1 1 ... .. ..- attr(*, "dimnames")=List of 2 .. .. ..$ : chr [1:373] "1" "2" "3" "4" ...
.. .. ..$: chr [1:3] "1" "2" "3" ..$ neon: num [1:80, 1:3] 1 1 0 1 1 0 1 1 0 1 ...
$occ.covs: num [1:453, 1] 475 494 546 587 588 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : NULL
.. ..$: chr "Elevation"$ det.covs:List of 2
..$hbef:List of 2 .. ..$ day: num [1:373, 1:3] 156 156 156 156 156 156 156 156 156 156 ...
.. ..$tod: num [1:373, 1:3] 330 346 369 386 409 425 447 463 482 499 ... ..$ neon:List of 2
.. ..$day: num [1:80] 169 169 169 169 169 169 169 169 169 170 ... .. ..$ tod: int [1:80] 8 8 6 7 8 6 7 7 7 8 ...
$sites :List of 2 ..$ hbef: int [1:373] 1 2 3 4 5 6 7 8 9 10 ...
..neon: int [1:80] 374 375 376 377 378 379 380 381 382 383 ... We specify the occurrence and detection model formulas using the occ.formula, and det.formula arguments. The occ.formula remains unchanged from previous models, and we will specify occurrence of the ovenbird as a function of linear and quadratic terms of elevation. occ.formula.int <- ~ scale(Elevation) + I(scale(Elevation)^2) For the detection models, we need to specify a different detection model for each data source. We do this by supplying a list to the det.formula argument, where each element of the list is the model formula for that given data set. Here we specify the detection model for HBEF as a function of linear and quadratic terms for ‘day of survey’ as well as a linear term for ‘time of survey’. In this case, we include the same covariates for the NEON model (though we note that different coefficients are estimated for the two data sources). However, there is no requirement for the data sources to be a function of the same covariates. det.formula.int = list(hbef = ~ scale(day) + scale(tod) + I(scale(day)^2), neon = ~ scale(day) + scale(tod) + I(scale(day)^2)) Next we specify the initial values. Initial values are specified in a list with the following tags: z (latent occurrence values), alpha (detection regression coefficients), and beta (occurrence regression coefficients). This aligns with fitting single-species occupancy models using PGOcc(). However, since we now have multiple detection models with different coefficients for each data source, initial values for alpha are now passed to intPGOcc() as a list, with each element of the list corresponding to the initial detection parameter values for a given data source. These are specified either as a vector with a value for each parameter or a single value for all parameters). # Total number of sites J <- nrow(data.intocc.covs)
inits.list <- list(alpha = list(0, 0),
beta = 0,
z = rep(1, J))

We next specify the priors for all parameters in the integrated occupancy model in a list that is passed into the priors argument. We specify normal priors for both the occurrence and detection regression coefficients, using tags beta.normal and alpha.normal, respectively.

prior.list <- list(beta.normal = list(mean = 0, var = 2.72),
alpha.normal = list(mean = list(0, 0),
var = list(2.72, 2.72)))

Priors for the occurrence regression coefficients are specified as we have seen in previous models. Because we have multiple detection-nondetection data sets each with distinct detection parameters, we specify the hypermeans and hypervariances in individual lists, where each element of the list corresponds to a specific data source. Again, the ordering of the data sources in the lists must align with the order the data sources are saved in the detection-nondetection data supplied to the data argument.

Finally, we specify the number of samples, burn-in, and thinning rate using the same approach we have used for previous models.

n.samples <- 8000
n.burn <- 3000
n.thin <- 25

We can now run the integrated occupancy model. Below we set the number of threads used to 1 and print out sampler progress after every 2000th iteration.

# Approx. run time: < 15 sec
out.int <- intPGOcc(occ.formula = occ.formula.int,
det.formula = det.formula.int,
data = data.int,
inits = inits.list,
n.samples = n.samples,
priors = prior.list,
verbose = TRUE,
n.report = 2000,
n.burn = n.burn,
n.thin = n.thin,
n.chains = 3) 
----------------------------------------
Preparing to run the model
----------------------------------------
----------------------------------------
Model description
----------------------------------------
Integrated Occupancy Model with Polya-Gamma latent
variable fit with 453 sites.

Integrating 2 occupancy data sets.

Samples per Chain: 8000
Burn-in: 3000
Thinning Rate: 25
Number of Chains: 3
Total Posterior Samples: 600

Source compiled with OpenMP support and model fit using 1 thread(s).

----------------------------------------
Chain 1
----------------------------------------
Sampling ...
Sampled: 2000 of 8000, 25.00%
-------------------------------------------------
Sampled: 4000 of 8000, 50.00%
-------------------------------------------------
Sampled: 6000 of 8000, 75.00%
-------------------------------------------------
Sampled: 8000 of 8000, 100.00%
----------------------------------------
Chain 2
----------------------------------------
Sampling ...
Sampled: 2000 of 8000, 25.00%
-------------------------------------------------
Sampled: 4000 of 8000, 50.00%
-------------------------------------------------
Sampled: 6000 of 8000, 75.00%
-------------------------------------------------
Sampled: 8000 of 8000, 100.00%
----------------------------------------
Chain 3
----------------------------------------
Sampling ...
Sampled: 2000 of 8000, 25.00%
-------------------------------------------------
Sampled: 4000 of 8000, 50.00%
-------------------------------------------------
Sampled: 6000 of 8000, 75.00%
-------------------------------------------------
Sampled: 8000 of 8000, 100.00%

We again make a call to the summary function for a concise description of the model results.

summary(out.int)

Call:
intPGOcc(occ.formula = occ.formula.int, det.formula = det.formula.int,
data = data.int, inits = inits.list, priors = prior.list,
n.samples = n.samples, n.omp.threads = 1, verbose = TRUE,
n.report = 2000, n.burn = n.burn, n.thin = n.thin, n.chains = 3)

Samples per Chain: 8000
Burn-in: 3000
Thinning Rate: 25
Number of Chains: 3
Total Posterior Samples: 600
Run Time (min): 0.2103

Occurrence (logit scale):
Mean     SD    2.5%     50%   97.5%   Rhat ESS
(Intercept)            2.2633 0.2418  1.8281  2.2556  2.8053 1.0011 600
scale(Elevation)      -1.2818 0.1915 -1.6741 -1.2750 -0.9491 1.0178 748
I(scale(Elevation)^2) -0.6001 0.1572 -0.8958 -0.5980 -0.3042 1.0050 674

Data source 1 Detection (logit scale):
Mean     SD    2.5%     50%  97.5%   Rhat ESS
(Intercept)      0.8069 0.1262  0.5620  0.8134 1.0549 1.0078 600
scale(day)      -0.0898 0.0752 -0.2419 -0.0865 0.0543 1.0018 600
scale(tod)      -0.0447 0.0794 -0.2072 -0.0456 0.1113 1.0234 600
I(scale(day)^2)  0.0278 0.0960 -0.1637  0.0210 0.2122 1.0031 600

Data source 2 Detection (logit scale):
Mean     SD    2.5%     50%   97.5%   Rhat ESS
(Intercept)      2.0365 0.5611  0.8986  2.0194  3.1347 1.0031 600
scale(day)       0.8911 0.3726  0.2556  0.8660  1.6360 1.0002 600
scale(tod)       0.8419 0.5257 -0.3892  0.9016  1.7465 1.0099 600
I(scale(day)^2) -0.7988 0.3373 -1.4846 -0.7995 -0.1368 1.0117 532

The summary function for integrated models returns the detection parameters separately for each detection covariate. The Rhat and ESS values clearly indicate adequate convergence and mixing of the MCMC chains. Looking at the occurrence parameters, we see fairly similar estimates to those from the single data source model using HBEF data only. Of course this is only a half-surprise due to the unequal sizes of the two data sets, where the HBEF data will be expected to dominate estimation of the occupancy parameters. But an alternative explanation for the relative similarity is that the two processes are indeed very similar in the two areas within the greater WMNF region.

### Posterior predictive checks

We perform posterior predictive checks using ppcOcc() as before. GoF assessment for integrated models is an active area of research and no consensus has so far appeared on the best approach. In spOccupancy, we compute posterior predictive checks separately for each data set in the integrated model.

ppc.int.out <- ppcOcc(out.int, 'freeman-tukey', group = 2)
summary(ppc.int.out)

Call:
ppcOcc(object = out.int, fit.stat = "freeman-tukey", group = 2)

Samples per Chain: 8000
Burn-in: 3000
Thinning Rate: 25
Number of Chains: 3
Total Posterior Samples: 600

Data Source 1

Bayesian p-value: 0.3717
Fit statistic: freeman-tukey

Data Source 2

Bayesian p-value: 0.1217
Fit statistic: freeman-tukey 

The low Bayesian p-value for NEON suggests a potential lack of fit which we would explore in a complete analysis.

### Model selection using WAIC and k-fold cross-validation

We use waicOcc() to compute the WAIC for integrated occupancy models. Similar to the posterior predictive check, individual WAIC values are reported for each data set. These can be summed across all data sources for an overall WAIC value if desired.

waicOcc(out.int)
        elpd       pD      WAIC
1 -633.34832 5.849407 1278.3955
2  -57.95017 7.730643  131.3616

k-fold cross-validation is implemented using the k.fold argument as we have seen in previous spOccupancy model fitting functions. Cross-validation for models with multiple data sources is not as straightforward as single data source models. For instance, we could envision splitting the data in multiple different ways to assess predictive performance, depending on the purpose of our comparison. In spOccupancy, we implement two approaches for cross-validation of integrated occupancy models. In the first approach, we hold out locations irrespective of what data source they came from. This results in a scoring rule (deviance) for each individual data source based on the hold out sites where that data source is sampled. More specifically, our first algorithm for K-fold cross-validation is:

1. Randomly split the total number of sites with at least one data source into $$K$$ groups.
2. For each $$k = 1, \dots, K$$, fit the model without the data at the sites in the $$k$$th group of hold-out locations.
3. Predict the detection-nondetection data at the locations in the $$k$$th hold out set.
4. Compute the deviance for each hold out data point.
5. Sum the deviance values separately for each data source to yield a scoring rule for each data source separately.

This form of k-fold cross-validation is applicable for model-selection between different integrated occupancy models. In other words, this approach can be used to compare models that integrate the same data sources but include different covariates in the occurrence and/or detection portion of the occupancy model. Using our example, we implement 4-fold cross-validation to compare the full integrated model with covariates to an intercept only integrated occupancy model. We do this using the k.fold, k.fold.threads, and k.fold.seed arguments as with previous spOccupancy models. Below we use the default values for k.fold.threads and k.fold.seed.

out.int.k.fold <- intPGOcc(occ.formula = occ.formula.int,
det.formula = det.formula.int,
data = data.int,
inits = inits.list,
n.samples = n.samples,
priors = prior.list,
verbose = FALSE,
n.report = 2000,
n.burn = n.burn,
n.thin = n.thin,
n.chains = 1,
k.fold = 4)
out.int.k.fold.small <- intPGOcc(occ.formula = ~ 1,
det.formula = list(hbef = ~ 1, neon = ~ 1),
data = data.int,
inits = inits.list,
n.samples = n.samples,
priors = prior.list,
verbose = FALSE,
n.report = 2000,
n.burn = n.burn,
n.thin = n.thin,
n.chains = 1,
k.fold = 4)
# Summarize the CV results
out.int.k.fold$k.fold.deviance [1] 1504.9896 157.0448 out.int.k.fold.small$k.fold.deviance
[1] 1532.9758  205.6215

We see the deviance for the full model is lower for both data sources compared to the intercept only model.

Alternatively, we may not wish to compare different integrated occupancy models together, but rather wish to assess whether or not data integration is necessary compared to using a single data source occupancy model. To accomplish this task, we can perform cross-validation with the integrated occupancy model using only a single data source as the hold out set, and then compare the deviance scoring rule to a scoring rule obtained from cross-validation with a single data source occupancy model. We accomplish this by using the argument k.fold.data. If k.fold.data is specified as an integer between 1 and the number of data sources integrated (in this case 2), only the data source corresponding to that integer will be used in the hold out and k-fold cross-validation process. If k.fold.data is not specified, k-fold cross-validation holds out data irrespective of the data sources at the given location. Here, we set k.fold.data = 1 and compare the cross-validation results of the integrated model to the occupancy model using only HBEF we fit with PGOcc() (which is stored in the out.k.fold object).

out.int.k.fold.hbef <- intPGOcc(occ.formula = occ.formula.int,
det.formula = det.formula.int,
data = data.int,
inits = inits.list,
n.samples = n.samples,
priors = prior.list,
verbose = TRUE,
n.report = 2000,
n.burn = n.burn,
n.thin = n.thin,
k.fold = 4,
k.fold.data = 1) 
----------------------------------------
Preparing to run the model
----------------------------------------
----------------------------------------
Model description
----------------------------------------
Integrated Occupancy Model with Polya-Gamma latent
variable fit with 453 sites.

Integrating 2 occupancy data sets.

Samples per Chain: 8000
Burn-in: 3000
Thinning Rate: 25
Number of Chains: 1
Total Posterior Samples: 200

Source compiled with OpenMP support and model fit using 1 thread(s).

----------------------------------------
Chain 1
----------------------------------------
Sampling ...
Sampled: 2000 of 8000, 25.00%
-------------------------------------------------
Sampled: 4000 of 8000, 50.00%
-------------------------------------------------
Sampled: 6000 of 8000, 75.00%
-------------------------------------------------
Sampled: 8000 of 8000, 100.00%
----------------------------------------
Cross-validation
----------------------------------------
Performing 4-fold cross-validation using 1 thread(s).
Only holding out data from data source 1.
# Look at CV results again
# Single data source model
out.k.fold$k.fold.deviance [1] 1510.854 # Integrated model out.int.k.fold.hbef$k.fold.deviance
[1] 1516.394

Here we see that integration of the two data sources does not improve predictive performance at HBEF alone. We should also do the same approach with the NEON data. We close this section by emphasizing that there are potentially numerous other benefits to data integration than just predictive performance that must be carefully considered when trying to determine if data integration is necessary or not. See Simmonds et al. (2020), the discussion in Doser et al. (2022), and Chapter 10 in AHM2 for more on this topic.

### Prediction

Prediction for integrated occupancy models proceeds exactly as before using predict(). Here we predict occurrence across HBEF for comparison with the single data source models (and remember to load the stars and ggplot2 packages).

# Make sure to standardize using mean and sd from fitted model
elev.pred <- (hbefElev$val - mean(data.int$occ.covs[, 1])) / sd(data.int$occ.covs[, 1]) X.0 <- cbind(1, elev.pred, elev.pred^2) out.int.pred <- predict(out.int, X.0) # Producing an SDM for HBEF alone (posterior mean) plot.dat <- data.frame(x = hbefElev$Easting,
y = hbefElev$Northing, mean.psi = apply(out.int.pred$psi.0.samples, 2, mean),
sd.psi = apply(out.int.pred$psi.0.samples, 2, sd)) dat.stars <- st_as_stars(plot.dat, dims = c('x', 'y')) ggplot() + geom_stars(data = dat.stars, aes(x = x, y = y, fill = mean.psi)) + scale_fill_viridis_c(na.value = 'transparent') + labs(x = 'Easting', y = 'Northing', fill = '', title = 'Mean OVEN occurrence probability using intPGOcc') + theme_bw() ggplot() + geom_stars(data = dat.stars, aes(x = x, y = y, fill = sd.psi)) + scale_fill_viridis_c(na.value = 'transparent') + labs(x = 'Easting', y = 'Northing', fill = '', title = 'SD OVEN occurrence probability using intPGOcc') + theme_bw() ## Single-species spatial integrated occupancy models ### Basic model description Single-species spatial integrated occupancy models are identical to integrated occupancy models except the ecological process model now incorporates a spatially-structured random effect following the discussion with single-species spatial occupancy models. All details for the single-species integrated spatial occupancy model have already been presented in previous model descriptions. ### Fitting single-species spatial integrated occupancy models using spIntPGOcc() The function spIntPGOcc() fits single-species spatial integrated occupancy models in spOccupancy. Syntax is very similar to that of the single data source models and specifically takes the following form: spIntPGOcc(occ.formula, det.formula, data, inits, n.batch, batch.length, accept.rate = 0.43, priors, cov.model = "exponential", tuning, n.omp.threads = 1, verbose = TRUE, NNGP = TRUE, n.neighbors = 15, search.type = 'cb', n.report = 100, n.burn = round(.10 * n.batch * batch.length), n.thin = 1, n.chains = 1, k.fold, k.fold.threads = 1, k.fold.seed, k.fold.data, ...) The occ.formula, det.formula, and data arguments are analogous to what we saw with the nonspatial integrated occupancy model. However, as for all spatial models in spOccupancy, the data list must also contain the spatial coordinates in the coords tag, which we add below. data.int$coords <- rbind(hbef2015$coords, neon2015$coords)
occ.formula.int <- ~ scale(Elevation) + I(scale(Elevation)^2)
det.formula.int <- list(hbef = ~ scale(day) + scale(tod) + I(scale(day)^2),
neon = ~ scale(day) + scale(tod) + I(scale(day)^2))

Initial values specified in inits and priors in priors are specified in the same form as for intPGOcc() with the additional values for spatial parameters. Analogous to all other spatial models in spOccupancy, the spatial variance parameter takes an inverse-Gamma prior and the spatial range parameter (as well as the spatial smoothness parameter if cov.model = 'matern') takes a uniform prior (see discussion in single-species spatial occupancy models section on how to specify a uniform prior for the spatial variance parameter instead of an inverse-Gamma).

dist.int <- dist(data.int$coords) min.dist <- min(dist.int) max.dist <- max(dist.int) J <- nrow(data.int$occ.covs)
# Exponential covariance model
cov.model <- "exponential"
inits.list <- list(alpha = list(0, 0),
beta = 0,
z = rep(1, J),
sigma.sq = 2,
phi = 3 / mean(dist.int),
w = rep(0, J))
prior.list <- list(beta.normal = list(mean = 0, var = 2.72),
alpha.normal = list(mean = list(0, 0),
var = list(2.72, 2.72)),
sigma.sq.ig = c(2, 1),
phi.unif = c(3 / max.dist, 3 / min.dist))

Finally, we specify the remaining parameters regarding the NNGP specifications, tuning parameters, and the length of the MCMC sampler we will run. We are then all set to fit the model. Remember that spatially-explicit models in spOccupancy are implemented using an efficient adaptive MCMC sampler that requires us to specify the number of MCMC batches (n.batch) and the length of each MCMC batch (batch.length), which together determine the number of MCMC samples (i.e., n.samples = n.batch * batch.length). We run the model and summarize the results with summary().

batch.length <- 25
n.batch <- 800
n.burn <- 10000
n.thin <- 40
tuning <- list(phi = .2)
# Approx. run time: 2 min
out.sp.int <- spIntPGOcc(occ.formula = occ.formula.int,
det.formula = det.formula.int,
data = data.int,
inits = inits.list,
priors = prior.list,
tuning = tuning,
cov.model = cov.model,
NNGP = TRUE,
n.neighbors = 5,
n.batch = n.batch,
n.burn = n.burn,
n.chains = 3,
batch.length = batch.length,
n.report = 200) 
----------------------------------------
Preparing to run the model
----------------------------------------
----------------------------------------
Building the neighbor list
----------------------------------------
----------------------------------------
Building the neighbors of neighbors list
----------------------------------------
----------------------------------------
Model description
----------------------------------------
NNGP Integrated Occupancy Model with Polya-Gamma latent
variable fit with 453 sites.

Integrating 2 occupancy data sets.

Samples per chain: 20000 (800 batches of length 25)
Burn-in: 10000
Thinning Rate: 1
Number of Chains: 3
Total Posterior Samples: 30000

Using the exponential spatial correlation model.

Using 5 nearest neighbors.

Source compiled with OpenMP support and model fit using 1 thread(s).

Adaptive Metropolis with target acceptance rate: 43.0
----------------------------------------
Chain 1
----------------------------------------
Sampling ...
Batch: 200 of 800, 25.00%
Parameter   Acceptance  Tuning
phi     32.0        0.25681
-------------------------------------------------
Batch: 400 of 800, 50.00%
Parameter   Acceptance  Tuning
phi     40.0        0.22326
-------------------------------------------------
Batch: 600 of 800, 75.00%
Parameter   Acceptance  Tuning
phi     52.0        0.22777
-------------------------------------------------
Batch: 800 of 800, 100.00%
----------------------------------------
Chain 2
----------------------------------------
Sampling ...
Batch: 200 of 800, 25.00%
Parameter   Acceptance  Tuning
phi     72.0        0.21450
-------------------------------------------------
Batch: 400 of 800, 50.00%
Parameter   Acceptance  Tuning
phi     44.0        0.23237
-------------------------------------------------
Batch: 600 of 800, 75.00%
Parameter   Acceptance  Tuning
phi     48.0        0.25172
-------------------------------------------------
Batch: 800 of 800, 100.00%
----------------------------------------
Chain 3
----------------------------------------
Sampling ...
Batch: 200 of 800, 25.00%
Parameter   Acceptance  Tuning
phi     40.0        0.23237
-------------------------------------------------
Batch: 400 of 800, 50.00%
Parameter   Acceptance  Tuning
phi     24.0        0.19801
-------------------------------------------------
Batch: 600 of 800, 75.00%
Parameter   Acceptance  Tuning
phi     52.0        0.21025
-------------------------------------------------
Batch: 800 of 800, 100.00%
summary(out.sp.int)

Call:
spIntPGOcc(occ.formula = occ.formula.int, det.formula = det.formula.int,
data = data.int, inits = inits.list, priors = prior.list,
tuning = tuning, cov.model = cov.model, NNGP = TRUE, n.neighbors = 5,
n.batch = n.batch, batch.length = batch.length, n.report = 200,
n.burn = n.burn, n.chains = 3)

Samples per Chain: 20000
Burn-in: 10000
Thinning Rate: 1
Number of Chains: 3
Total Posterior Samples: 30000
Run Time (min): 3.4521

Occurrence (logit scale):
Mean     SD    2.5%     50%   97.5%   Rhat ESS
(Intercept)            2.9401 0.7900  1.4555  2.9126  4.5962 1.0678 232
scale(Elevation)      -2.1644 0.5536 -3.4284 -2.0923 -1.2591 1.0566 326
I(scale(Elevation)^2) -0.9483 0.3445 -1.6763 -0.9306 -0.3178 1.0177 907

Data source 1 Detection (logit scale):
Mean     SD    2.5%     50%  97.5%   Rhat   ESS
(Intercept)      0.8257 0.1242  0.5853  0.8246 1.0721 1.0004 19380
scale(day)      -0.0922 0.0780 -0.2464 -0.0913 0.0592 1.0005 22828
scale(tod)      -0.0458 0.0779 -0.1991 -0.0460 0.1059 1.0001 22911
I(scale(day)^2)  0.0302 0.0961 -0.1574  0.0300 0.2209 1.0000 21751

Data source 2 Detection (logit scale):
Mean     SD    2.5%     50%  97.5%   Rhat  ESS
(Intercept)      2.1169 0.6172  0.8954  2.1104 3.3298 1.0078 2756
scale(day)       0.6557 0.5436 -0.5584  0.7031 1.6240 1.0160 1379
scale(tod)       0.4792 0.6259 -0.7387  0.4899 1.6173 1.0034 1313
I(scale(day)^2) -0.4309 0.6723 -1.4176 -0.5895 1.1811 1.0252  830

Spatial Covariance:
Mean     SD   2.5%    50%   97.5%   Rhat ESS
sigma.sq 6.9605 4.4000 1.8155 5.8916 18.4958     NA 121
phi      0.0019 0.0008 0.0005 0.0018  0.0037 1.1038 158

We see the occurrence and detection parameters have converged. We should run the chains for a bit longer to ensure convergence of the spatial covariance parameters, but for this example we will continue on.

### Posterior predictive checks

Below we perform a posterior predictive check for each of the data sets included in the occupancy model using ppcOcc().

ppc.sp.int.out <- ppcOcc(out.sp.int, 'freeman-tukey', group = 2)
summary(ppc.sp.int.out)

Call:
ppcOcc(object = out.sp.int, fit.stat = "freeman-tukey", group = 2)

Samples per Chain: 20000
Burn-in: 10000
Thinning Rate: 1
Number of Chains: 3
Total Posterior Samples: 30000

Data Source 1

Bayesian p-value: 0.3688
Fit statistic: freeman-tukey

Data Source 2

Bayesian p-value: 0.301
Fit statistic: freeman-tukey 

According to the Bayesian p-values, there is no lack of fit for either the HBEF or NEON data.

### Model selection using WAIC and k-fold cross-validation

We can perform model selection using WAIC with the waicOcc() function as we have seen previously. Here we compare the WAIC from the spatial integrated model to the non-spatial integrated model.

waicOcc(out.int)
        elpd       pD      WAIC
1 -633.34832 5.849407 1278.3955
2  -57.95017 7.730643  131.3616
waicOcc(out.sp.int)
        elpd       pD      WAIC
1 -564.74617 49.06561 1227.6236
2  -43.54207 18.99561  125.0754

We see the spatial model performs better for both the HBEF data and the NEON data.

Two forms of k-fold cross-validation are implemented for spIntPGOcc(), analogous to those discussed for non-spatial integrated occupancy models. We first use cross-validation to compare the predictive performance of the spatial integrated model to the nonspatial integrated model across all sites.

out.sp.int.k.fold <- spIntPGOcc(occ.formula = occ.formula.int,
det.formula = det.formula.int,
data = data.int,
inits = inits.list,
priors = prior.list,
tuning = tuning,
cov.model = cov.model,
NNGP = TRUE,
n.neighbors = 5,
n.batch = n.batch,
n.burn = n.burn,
batch.length = batch.length,
verbose = FALSE,
k.fold = 4)
# Non-spatial model
out.int.k.fold$k.fold.deviance [1] 1504.9896 157.0448 # Spatial model out.sp.int.k.fold$k.fold.deviance
[1] 1516.2585  164.1131

Further, we perform cross-validation using only the HBEF data source as a hold out data source to compare with single data source models to assess the benefit of integration.

out.sp.int.k.fold.hbef <- spIntPGOcc(occ.formula = occ.formula.int,
det.formula = det.formula.int,
data = data.int,
inits = inits.list,
priors = prior.list,
tuning = tuning,
cov.model = cov.model,
NNGP = TRUE,
n.neighbors = 5,
n.batch = n.batch,
n.burn = n.burn,
batch.length = batch.length,
verbose = FALSE,
k.fold = 4,
k.fold.data = 1)
out.sp.int.k.fold.hbef$k.fold.deviance [1] 1517.555 # Non-spatial single data source model out.k.fold$k.fold.deviance
[1] 1510.854

### Prediction

Prediction for spatial integrated occupancy models proceeds exactly analogous to our approach using nonspatial integrated occupancy models. The only difference is that now we must also provide the coordinates of the nonsampled locations. The following code is not run as it provides similar results to the nonspatial integrated occupancy model.

# Make sure to standardize using mean and sd from fitted model
elev.pred <- (hbefElev$val - mean(data.int$occ.covs[, 1])) / sd(data.int\$occ.covs[, 1])
X.0 <- cbind(1, elev.pred, elev.pred^2)
coords.0 <- as.matrix(hbefElev[, c(2, 3)])
out.sp.int.pred <- predict(out.sp.int, X.0, coords.0)`

## References

Banerjee, Sudipto, Bradley P Carlin, and Alan E Gelfand. 2003. Hierarchical Modeling and Analysis for Spatial Data. Chapman; Hall/CRC.

Barnett, David T, Paul A Duffy, David S Schimel, Rachel E Krauss, Kathryn M Irvine, Frank W Davis, John E Gross, et al. 2019. “The Terrestrial Organism and Biogeochemistry Spatial Sampling Design for the National Ecological Observatory Network.” Ecosphere 10 (2): e02540.

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.

Bolker, Benjamin M. 2015. “GLMM worked examples.” https://ms.mcmaster.ca/~bolker/R/misc/foxchapter/bolker_chap.html.

Broms, Kristin M, Mevin B Hooten, and Ryan M Fitzpatrick. 2016. “Model Selection and Assessment for Multi-Species Occupancy Models.” Ecology 97 (7): 1759–70.

Brooks, Stephen P., and Andrew Gelman. 1998. “General Methods for Monitoring Convergence of Iterative Simulations.” Journal of Computational and Graphical Statistics 7 (4): 434–55.

Clark, Allan E, and Res Altwegg. 2019. “Efficient Bayesian Analysis of Occupancy Models with Logit Link Functions.” Ecology and Evolution 9 (2): 756–68.

Datta, Abhirup, Sudipto Banerjee, Andrew O Finley, and Alan E Gelfand. 2016. “Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geostatistical Datasets.” Journal of the American Statistical Association 111 (514): 800–812.

Doser, Jeffrey W, Andrew O Finley, Aaron S Weed, and Elise F Zipkin. 2021. “Integrating Automated Acoustic Vocalization Data and Point Count Surveys for Estimation of Bird Abundance.” Methods in Ecology and Evolution 12 (6): 1040–9.

Doser, Jeffrey W, Wendy Leuenberger, T Scott Sillett, Michael T Hallworth, and Elise F Zipkin. 2022. “Integrated Community Occupancy Models: A Framework to Assess Occurrence and Biodiversity Dynamics Using Multiple Data Sources.” Methods in Ecology and Evolution.

Finley, Andrew O, Abhirup Datta, and Sudipto Banerjee. 2020. “SpNNGP R Package for Nearest Neighbor Gaussian Process Models.” arXiv Preprint arXiv:2001.09111.

Finley, Andrew O, Abhirup Datta, Bruce D Cook, Douglas C Morton, Hans E Andersen, and Sudipto Banerjee. 2019. “Efficient Algorithms for Bayesian Nearest Neighbor Gaussian Processes.” Journal of Computational and Graphical Statistics 28 (2): 401–14.

Gelman, Andrew. 2006. “Prior Distributions for Variance Parameters in Hierarchical Models (Comment on Article by Browne and Draper).” Bayesian Analysis 1 (3): 515–34.

Guélat, Jérôme, and Marc Kéry. 2018. “Effects of Spatial Autocorrelation and Imperfect Detection on Species Distribution Models.” Methods in Ecology and Evolution 9 (6): 1614–25.

Heaton, Matthew J, Abhirup Datta, Andrew O Finley, Reinhard Furrer, Joseph Guinness, Rajarshi Guhaniyogi, Florian Gerber, et al. 2019. “A Case Study Competition Among Methods for Analyzing Large Spatial Data.” Journal of Agricultural, Biological and Environmental Statistics 24 (3): 398–425.

Hobbs, N Thompson, and Mevin B Hooten. 2015. Bayesian Models. Princeton University Press.

Hooten, Mevin B, and N Thompson Hobbs. 2015. “A Guide to Bayesian Model Selection for Ecologists.” Ecological Monographs 85 (1): 3–28.

Kéry, Marc, and J Andrew Royle. 2016. Applied Hierarchical Modeling in Ecology: Volume 1: Prelude and Static Models. Elsevier Science.

Lany, Nina K, Phoebe L Zarnetske, Andrew O Finley, and Deborah G McCullough. 2020. “Complementary Strengths of Spatially-Explicit and Multi-Species Distribution Models.” Ecography 43 (3): 456–66.

Lunn, David, Christopher Jackson, Nicky Best, Andrew Thomas, and David Spiegelhalter. 2013. “The Bugs Book: A Practical Introduction to Bayesian Analysis.”

MacKenzie, Darryl I, and Larissa L Bailey. 2004. “Assessing the Fit of Site-Occupancy Models.” Journal of Agricultural, Biological, and Environmental Statistics 9 (3): 300–318.

MacKenzie, Darryl I, James D Nichols, Gideon B Lachman, Sam Droege, J Andrew Royle, and Catherine A Langtimm. 2002. “Estimating Site Occupancy Rates When Detection Probabilities Are Less Than One.” Ecology 83 (8): 2248–55.

McCullagh, Peter, and John A Nelder. 2019. Generalized Linear Models. Routledge.

Miller, David AW, Krishna Pacifici, Jamie S Sanderlin, and Brian J Reich. 2019. “The Recent Past and Promising Future for Data Integration Methods to Estimate Species’ Distributions.” Methods in Ecology and Evolution 10 (1): 22–37.

National Ecological Observatory Network (NEON). 2021. “Breeding Landbird Point Counts (Dp1.10003.001).” National Ecological Observatory Network (NEON). https://data.neonscience.org/data-products/DP1.10003.001.

Northrup, Joseph M, and Brian D Gerber. 2018. “A Comment on Priors for Bayesian Occupancy Models.” PloS One 13 (2): e0192819.

Polson, Nicholas G, James G Scott, and Jesse Windle. 2013. “Bayesian Inference for Logistic Models Using Pólya–Gamma Latent Variables.” Journal of the American Statistical Association 108 (504): 1339–49.

Roberts, Gareth O, and Jeffrey S Rosenthal. 2009. “Examples of Adaptive Mcmc.” Journal of Computational and Graphical Statistics 18 (2): 349–67.

Simmonds, Emily G, Susan G Jarvis, Peter A Henrys, Nick JB Isaac, and Robert B O’Hara. 2020. “Is More Data Always Better? A Simulation Study of Benefits and Limitations of Integrated Distribution Models.” Ecography 43 (10): 1413–22.

Tyre, Andrew J, Brigitte Tenhumberg, Scott A Field, Darren Niejalke, Kirsten Parris, and Hugh P Possingham. 2003. “Improving Precision and Reducing Bias in Biological Surveys: Estimating False-Negative Error Rates.” Ecological Applications 13 (6): 1790–1801.

Watanabe, Sumio. 2010. “Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory.” Journal of Machine Learning Research 11 (12).