# Introduction to spOccupancy

#### Jeffrey W. Doser, Andrew O. Finley, Marc Kéry, Elise F. Zipkin

#### July 13, 2022

Source:`vignettes/modelFitting.Rmd`

`modelFitting.Rmd`

## Introduction

This vignette provides worked examples and explanations for fitting single-species, multi-species, and integrated occupancy models available in the `spOccupancy`

R package. We will provide step by step examples on how to fit the following models:

- Occupancy model using
`PGOcc()`

. - Spatial occupancy model using
`spPGOcc()`

. - Multi-species occupancy model using
`msPGOcc()`

. - Spatial multi-species occupancy model using
`spMsPGOcc()`

. - Integrated occupancy model using
`intPGOcc()`

. - Spatial integrated occupancy model using
`spIntPGOcc()`

.

We fit all occupancy models using Pólya-Gamma data augmentation (Polson, Scott, and Windle 2013) for computational efficiency (**P**ólya-**G**amma is where the `PG`

comes from in all `spOccupancy`

model fitting function names). In this vignette, we will provide a brief description of each model, with full statistical details provided in a separate MCMC sampler vignette. We will also show how `spOccupancy`

provides functions for posterior predictive checks as a Goodness of Fit assessment, model comparison and assessment using the Widely Applicable Information Criterion (WAIC), k-fold cross-validation, and out of sample predictions using standard R helper functions (e.g., `predict()`

).

To get started, we load the `spOccupancy`

package, as well as the `coda`

package, which we will use for some MCMC summary and diagnostics. We will also use the `stars`

and `ggplot2`

packages to create some basic plots of our results. We then set a seed so you can reproduce the same results as we do.

### Example data set: Foliage-gleaning birds at Hubbard Brook

As an example data set throughout this vignette, we will use data from twelve foliage-gleaning bird species collected from point count surveys at Hubbard Brook Experimental Forest (HBEF) in New Hampshire, USA. Specific details on the data set, which is just a small subset from a long-term survey, are available on the Hubbard Brook website and Doser et al. (2022). The data are provided as part of the `spOccupancy`

package and are loaded with `data(hbef2015)`

. Point count surveys were conducted at 373 sites over three replicates, each of 10 minutes in length and with a detection radius of 100m. In the data set provided here, we converted these data to detection-nondetection data. Some sites were not visited for all three replicates. Additional information on the data set (including individual species in the data set) can be viewed in the man page using `help(hbef2015)`

.

```
List of 4
$ y : num [1:12, 1:373, 1:3] 0 0 0 0 0 1 0 0 0 0 ...
..- attr(*, "dimnames")=List of 3
.. ..$ : chr [1:12] "AMRE" "BAWW" "BHVI" "BLBW" ...
.. ..$ : chr [1:373] "1" "2" "3" "4" ...
.. ..$ : chr [1:3] "1" "2" "3"
$ occ.covs: num [1:373, 1] 475 494 546 587 588 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "Elevation"
$ det.covs:List of 2
..$ day: num [1:373, 1:3] 156 156 156 156 156 156 156 156 156 156 ...
..$ tod: num [1:373, 1:3] 330 346 369 386 409 425 447 463 482 499 ...
$ coords : num [1:373, 1:2] 280000 280000 280000 280001 280000 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:373] "1" "2" "3" "4" ...
.. ..$ : chr [1:2] "X" "Y"
```

Thus, the object `hbef2015`

is a list comprised of the detection-nondetection data (`y`

), covariates on the occurrence portion of the model (`occ.covs`

), covariates on the detection portion of the model (`det.covs`

), and the spatial coordinates of each site (`coords`

) for use in spatial occupancy models and in plotting. This list is in the exact format required for input to `spOccupancy`

model functions. `hbef2015`

contains data on 12 species in the three-dimensional array `y`

, where the dimensions of `y`

correspond to species (12), sites (373), and replicates (3). For single-species occupancy models in Section 2 and 3, we will only use data on the charming Ovenbird (OVEN; *Seiurus aurocapilla*), so we next subset the `hbef2015`

list to only include data from OVEN in a new object `ovenHBEF`

.

```
sp.names <- dimnames(hbef2015$y)[[1]]
ovenHBEF <- hbef2015
ovenHBEF$y <- ovenHBEF$y[sp.names == "OVEN", , ]
table(ovenHBEF$y) # Quick summary.
```

```
0 1
518 588
```

We see that OVEN is detected at a little over half of all site-replicate combinations.

## Single-species occupancy models

### Basic model description

Let \(z_j\) be the true presence (1) or absence (0) of a species at site \(j\), with \(j = 1, \dots, J\). For our OVEN example, \(J = 373\). Following the basic occupancy model (MacKenzie et al. 2002; Tyre et al. 2003), we assume this latent occurrence variable arises from a Bernoulli process following

\[\begin{equation} \begin{split} &z_j \sim \text{Bernoulli}(\psi_j), \\ &\text{logit}(\psi_j) = \boldsymbol{x}^{\top}_j\boldsymbol{\beta}, \end{split} \end{equation}\]

where \(\psi_j\) is the probability of occurrence at site \(j\), which is a function of site-specific covariates \(\boldsymbol{X}\) and a vector of regression coefficients (\(\boldsymbol{\beta}\)).

We do not directly observe \(z_j\), but rather we observe an imperfect representation of the latent occurrence process as a result of imperfect detection (i.e., the failure to detect a species at a site when it is truly present). Let \(y_{j, k}\) be the observed detection (1) or nondetection (0) of a species of interest at site \(j\) during replicate \(k\) for each of \(k = 1, \dots, K_j\) replicates. Note that the number of replicates, \(K_j\), can vary by site and in practical applications will often be equal to 1 for a subset of sites (i.e., some sites will have no replicate surveys). For our OVEN example, the maximum value of \(K_j\) is three. We envision the detection-nondetection data as arising from a Bernoulli process conditional on the true latent occurrence process:

\[\begin{equation} \begin{split} &y_{j, k} \sim \text{Bernoulli}(p_{j, k}z_j), \\ &\text{logit}(p_{j, k}) = \boldsymbol{v}^{\top}_{j, k}\boldsymbol{\alpha}, \end{split} \end{equation}\]

where \(p_{j, k}\) is the probability of detecting a species at site \(j\) during replicate \(k\) (given it is present at site \(j\)), which is a function of site and replicate specific covariates \(\boldsymbol{V}\) and a vector of regression coefficients (\(\boldsymbol{\alpha}\)).

To complete the Bayesian specification of the model, we assign multivariate normal priors for the occurrence (\(\boldsymbol{\beta}\)) and detection (\(\boldsymbol{\alpha}\)) regression coefficients. To yield an efficient implementation of the occupancy model using a logit link function, we use Pólya-Gamma data augmentation (Polson, Scott, and Windle 2013), which is described in depth in a separate MCMC sampler vignette.

### Fitting single-species occupancy models with `PGOcc()`

The `PGOcc()`

function fits single-species occupancy models using Pólya-Gamma latent variables, which makes it more efficient than standard Bayesian implementations of occupancy models using a logit link function (Polson, Scott, and Windle 2013; Clark and Altwegg 2019). `PGOcc()`

has the following arguments:

```
PGOcc(occ.formula, det.formula, data, inits, priors, n.samples,
n.omp.threads = 1, verbose = TRUE, n.report = 100,
n.burn = round(.10 * n.samples), n.thin = 1, n.chains = 1,
k.fold, k.fold.threads = 1, k.fold.seed, ...)
```

The first two arguments, `occ.formula`

and `det.formula`

, use standard R model syntax to denote the covariates to be included in the occurrence and detection portions of the model, respectively. Only the right hand side of the formulas are included. Random intercepts can be included in both the occurrence and detection portions of the single-species occupancy model using `lme4`

syntax (Bates et al. 2015). For example, to include a random intercept for different observers in the detection portion of the model, we would include `(1 | observer)`

in the `det.formula`

, where `observer`

indicates the specific observer for each data point. The names of variables given in the formulas should correspond to those found in `data`

, which is a list consisting of the following tags: `y`

(detection-nondetection data), `occ.covs`

(occurrence covariates), `det.covs`

(detection covariates). `y`

should be stored as a sites x replicate matrix, `occ.covs`

as a matrix or data frame with site-specific covariate values, and `det.covs`

as a list with each list element corresponding to a covariate to include in the detection portion of the model. Covariates on detection can vary by site and/or survey, and so these covariates may be specified as a site by survey matrix for survey-level covariates or as a one-dimensional vector for survey level covariates. The `ovenHBEF`

list is already in the required format. Here we will model OVEN occurrence as a function of linear and quadratic elevation and will include three observational covariates (linear and quadratic day of survey, time of day of survey) on the detection portion of the model. We standardize all covariates by using the `scale()`

function in our model specification, and use the `I()`

function to specify quadratic effects:

```
oven.occ.formula <- ~ scale(Elevation) + I(scale(Elevation)^2)
oven.det.formula <- ~ scale(day) + scale(tod) + I(scale(day)^2)
# Check out the format of ovenHBEF
str(ovenHBEF)
```

```
List of 4
$ y : num [1:373, 1:3] 1 1 0 1 0 0 1 0 1 1 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:373] "1" "2" "3" "4" ...
.. ..$ : chr [1:3] "1" "2" "3"
$ occ.covs: num [1:373, 1] 475 494 546 587 588 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "Elevation"
$ det.covs:List of 2
..$ day: num [1:373, 1:3] 156 156 156 156 156 156 156 156 156 156 ...
..$ tod: num [1:373, 1:3] 330 346 369 386 409 425 447 463 482 499 ...
$ coords : num [1:373, 1:2] 280000 280000 280000 280001 280000 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:373] "1" "2" "3" "4" ...
.. ..$ : chr [1:2] "X" "Y"
```

Next, we specify the initial values for the MCMC sampler in `inits`

. `PGOcc()`

(and all other `spOccupancy`

model fitting functions) will set initial values by default, but here we will do this explicitly, since in more complicated cases setting initial values close to the presumed solutions can be vital for success of an MCMC-based analysis. For all models described in this vignette (in particular the non-spatial models), choice of the initial values is largely inconsequential, with the exception being that specifying initial values close to the presumed solutions can decrease the amount of samples you need to run to arrive at convergence of the MCMC chains. Thus, when first running a model in `spOccupancy`

, we recommend fitting the model using the default initial values that `spOccupancy`

provides, which are based on the prior distributions. After running the model for a reasonable period, if you find the chains are taking a long time to reach convergence, you then may wish to set the initial values to the mean estimates of the parameters from the initial model fit, as this will likely help reduce the amount of time you need to run the model.

The default initial values for occurrence and detection regression coefficients (including the intercepts) are random values from the prior distributions adopted in the model, while the default initial values for the latent occurrence effects are set to 1 if the species was observed at a site and 0 otherwise. Initial values are specified in a list with the following tags: `z`

(latent occurrence values), `alpha`

(detection intercept and regression coefficients), and `beta`

(occurrence intercept and regression coefficients). Below we set all initial values of the regression coefficients to 0, and set initial values for `z`

based on the detection-nondetection data matrix. For the occurrence (`beta`

) and detection (`alpha`

) regression coefficients, the initial values are passed either as a vector of length equal to the number of estimated parameters (including an intercept, and in the order specified in the model formula), or as a single value if setting the same initial value for all parameters (including the intercept). Below we take the latter approach. To specify the initial values for the latent occurrence at each site (`z`

), we must ensure we set the value to 1 at all sites where OVEN was detected at least once, because if we detect OVEN at a site then the value of `z`

is 1 with complete certainty (under the assumption of the model that there are no false positives). If the initial value for `z`

is set to 0 at one or more sites when the species was detected, the occupancy model will fail. `spOccupancy`

will provide a clear error message if the supplied initial values for `z`

are invalid. Below we use the raw detection-nondetection data and the `apply()`

function to set the initial values to 1 if OVEN was detected at that site and 0 otherwise.

```
# Format with explicit specification of inits for alpha and beta
# with four detection parameters and three occurrence parameters
# (including the intercept).
oven.inits <- list(alpha = c(0, 0, 0, 0),
beta = c(0, 0, 0),
z = apply(ovenHBEF$y, 1, max, na.rm = TRUE))
# Format with abbreviated specification of inits for alpha and beta.
oven.inits <- list(alpha = 0,
beta = 0,
z = apply(ovenHBEF$y, 1, max, na.rm = TRUE))
```

We next specify the priors for the occurrence and detection regression coefficients. The Pólya-Gamma data augmentation algorithm employed by `spOccupancy`

assumes normal priors for both the detection and occurrence regression coefficients. These priors are specified in a list with tags `beta.normal`

for occurrence and `alpha.normal`

for detection parameters (including intercepts). Each list element is then itself a list, with the first element of the list consisting of the hypermeans for each coefficient and the second element of the list consisting of the hypervariances for each coefficient. Alternatively, the hypermean and hypervariances can be specified as a single value if the same prior is used for all regression coefficients. By default, `spOccupancy`

will set the hypermeans to 0 and the hypervariances to 2.72, which corresponds to a relatively flat prior on the probability scale (0, 1; Lunn et al. (2013)). Broms, Hooten, and Fitzpatrick (2016), Northrup and Gerber (2018), and others show such priors are an adequate choice when the goal is to specify relatively non-informative priors. We will use these default priors here, but we specify them explicitly below for clarity

```
oven.priors <- list(alpha.normal = list(mean = 0, var = 2.72),
beta.normal = list(mean = 0, var = 2.72))
```

Our last step is to specify the number of samples to produce with the MCMC algorithm (`n.samples`

), the length of burn-in (`n.burn`

), the rate at which we want to thin the posterior samples (`n.thin`

), and the number of MCMC chains to run (`n.chains`

). Note that currently `spOccupancy`

runs multiple chains sequentially and does not allow chains to be run simultaneously in parallel across multiple threads. Instead, we allow for within-chain parallelization using the `n.omp.threads`

argument. We can set `n.omp.threads`

to a number greater than 1 and smaller than the number of threads on the computer you are using. Generally, setting `n.omp.threads > 1`

will not result in decreased run times for non-spatial models in `spOccupancy`

, but can substantially decrease run time when fitting spatial models (Finley, Datta, and Banerjee 2020). Here we set `n.omp.threads = 1`

.

For a simple single-species occupancy model, we shouldn’t need too many samples and will only need a small amount of burn-in and thinning. We will run the model using three chains to assess convergence using the Gelman-Rubin diagnostic (Rhat; Brooks and Gelman (1998)).

```
n.samples <- 5000
n.burn <- 3000
n.thin <- 2
n.chains <- 3
```

We are now nearly set to run the occupancy model. The `verbose`

argument is a logical value indicating whether or not MCMC sampler progress is reported to the screen. If `verbose = TRUE`

, sampler progress is reported after every multiple of the specified number of iterations in the `n.report`

argument. We set `verbose = TRUE`

and `n.report = 1000`

to report progress after every 1000th MCMC iteration. The last three arguments to `PGOcc()`

(`k.fold`

, `k.fold.threads`

, `k.fold.seed`

) are used for performing k-fold cross-validation for model assessment, which we will illustrate in a subsequent section. For now, we won’t specify the arguments, which will tell `PGOcc()`

not to perform k-fold cross-validation.

```
out <- PGOcc(occ.formula = oven.occ.formula,
det.formula = oven.det.formula,
data = ovenHBEF,
inits = oven.inits,
n.samples = n.samples,
priors = oven.priors,
n.omp.threads = 1,
verbose = TRUE,
n.report = 1000,
n.burn = n.burn,
n.thin = n.thin,
n.chains = n.chains)
```

```
----------------------------------------
Preparing to run the model
----------------------------------------
----------------------------------------
Model description
----------------------------------------
Occupancy model with Polya-Gamma latent
variable fit with 373 sites.
Samples per Chain: 5000
Burn-in: 3000
Thinning Rate: 2
Number of Chains: 3
Total Posterior Samples: 3000
Source compiled with OpenMP support and model fit using 1 thread(s).
----------------------------------------
Chain 1
----------------------------------------
Sampling ...
Sampled: 1000 of 5000, 20.00%
-------------------------------------------------
Sampled: 2000 of 5000, 40.00%
-------------------------------------------------
Sampled: 3000 of 5000, 60.00%
-------------------------------------------------
Sampled: 4000 of 5000, 80.00%
-------------------------------------------------
Sampled: 5000 of 5000, 100.00%
----------------------------------------
Chain 2
----------------------------------------
Sampling ...
Sampled: 1000 of 5000, 20.00%
-------------------------------------------------
Sampled: 2000 of 5000, 40.00%
-------------------------------------------------
Sampled: 3000 of 5000, 60.00%
-------------------------------------------------
Sampled: 4000 of 5000, 80.00%
-------------------------------------------------
Sampled: 5000 of 5000, 100.00%
----------------------------------------
Chain 3
----------------------------------------
Sampling ...
Sampled: 1000 of 5000, 20.00%
-------------------------------------------------
Sampled: 2000 of 5000, 40.00%
-------------------------------------------------
Sampled: 3000 of 5000, 60.00%
-------------------------------------------------
Sampled: 4000 of 5000, 80.00%
-------------------------------------------------
Sampled: 5000 of 5000, 100.00%
```

`names(out) # Look at the contents of the resulting object.`

```
[1] "rhat" "beta.samples" "alpha.samples" "z.samples"
[5] "psi.samples" "like.samples" "ESS" "X"
[9] "X.p" "X.re" "X.p.re" "lambda.p"
[13] "y" "n.samples" "call" "n.post"
[17] "n.thin" "n.burn" "n.chains" "pRE"
[21] "psiRE" "run.time"
```

`PGOcc()`

returns a list of class `PGOcc`

with a suite of different objects, many of them being `coda::mcmc`

objects of posterior samples. Note the “Preparing to run the model” printed section doesn’t have any information shown in it. `spOccupancy`

model fitting functions will present messages when preparing the data for the model in this section, or will print out the default priors or initial values used when they are not specified in the function call. Here we specified everything explicitly so no information was reported.

For a concise yet informative summary of the regression parameters and convergence of the MCMC chains, we can use `summary()`

on the resulting `PGOcc()`

object.

`summary(out)`

```
Call:
PGOcc(occ.formula = oven.occ.formula, det.formula = oven.det.formula,
data = ovenHBEF, inits = oven.inits, priors = oven.priors,
n.samples = n.samples, n.omp.threads = 1, verbose = TRUE,
n.report = 1000, n.burn = n.burn, n.thin = n.thin, n.chains = n.chains)
Samples per Chain: 5000
Burn-in: 3000
Thinning Rate: 2
Number of Chains: 3
Total Posterior Samples: 3000
Run Time (min): 0.137
Occurrence (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept) 2.1370 0.2728 1.6490 2.1188 2.7079 1.0001 834
scale(Elevation) -1.7024 0.2983 -2.4095 -1.6701 -1.2336 1.0100 402
I(scale(Elevation)^2) -0.3992 0.2038 -0.7687 -0.4103 0.0481 1.0030 645
Detection (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept) 0.7942 0.1269 0.5394 0.7940 1.0407 1.0018 2694
scale(day) -0.0904 0.0773 -0.2450 -0.0910 0.0587 1.0021 3814
scale(tod) -0.0460 0.0781 -0.2008 -0.0457 0.1030 0.9996 3000
I(scale(day)^2) 0.0238 0.0966 -0.1625 0.0231 0.2209 1.0053 3000
```

Note that all coefficients are printed on the logit scale. We see OVEN is fairly widespread (at the scale of the survey) in the forest given the large intercept value, which translates to a value of 0.89 on the probability scale (run `plogis(2.13)`

). In addition, the negative linear and quadratic terms for `Elevation`

suggest occurrence probability peaks at mid-elevations. On average, detection probability is about 0.69.

In principle, before we even look at the parameter summaries, we ought to ensure that the MCMC chains have reached convergence. The `summary`

function also returns the Gelman-Rubin diagnostic (Brooks and Gelman 1998) and the effective samples size (ESS) of the posterior samples, which we can use to assess convergence. Here we see all Rhat values are less than 1.1 and the ESS values indicate adequate mixing of the MCMC chains. Additionally, the posterior samples in the `PGOcc`

object are `coda::mcmc`

objects, which we can quickly assess for convergence visually using trace plots.

`plot(out$beta.samples, density = FALSE) # Occupancy parameters.`

`plot(out$alpha.samples, density = FALSE) # Detection parameters.`

### Posterior predictive checks

The function `ppcOcc()`

performs a posterior predictive check on all `spOccupancy`

model objects as a Goodness of Fit (GoF) assessment. The fundamental idea of GoF testing is that a good model should generate data that closely align with the observed data. If there are drastic differences in the true data from the model generated data, our model is likely not very useful (Hobbs and Hooten 2015). GoF assessments are more complicated using binary data, like detection-nondetection used in occupancy models, as standard approaches are not valid assessments for the raw data in binary response models such as occupancy models (Broms, Hooten, and Fitzpatrick 2016; McCullagh and Nelder 2019). Thus, any approach to assess model fit for detection-nondetection data must bin, or aggregate, the raw values in some manner, and then perform a model fit assessment on the binned values. There are numerous ways we could envision binning the raw detection-nondetection values (MacKenzie and Bailey 2004; Kéry and Royle 2016). In `spOccupancy`

, a posterior predictive check broadly takes the following steps:

- Fit the model using any of the model-fitting functions (here
`PGOcc()`

), which generates replicated values for all detection-nondetection data points. - Bin both the actual and the replicated detection-nondetection data in a suitable manner, such as by site or replicate (MacKenzie and Bailey 2004).
- Compute a fit statistic on both the actual data and also on the model-generated ‘replicate data’.
- Compare the fit statistics for the true data and replicate data. If they are widely different, this suggests a lack of fit of the model to the actual data set at hand.

To perform a posterior predictive check, we send the resulting `PGOcc`

model object as input to the `ppcOcc()`

function, along with a fit statistic (`fit.stat`

) and a numeric value indicating how to group, or bin, the data (`group`

). Currently supported fit statistics include the Freeman-Tukey statistic and the Chi-Squared statistic (`freeman-tukey`

or `chi-squared`

, respectively, Kéry and Royle (2016)). Currently, `ppcOcc()`

allows the user to group the data by row (site; `group = 1`

) or column (replicate; `group = 2`

). `ppcOcc()`

will then return a set of posterior samples for the fit statistic (or discrepancy measure) using the actual data (`fit.y`

) and model generated replicate data set (`fit.y.rep`

), summed across all data points in the chosen manner. We generally recommend performing a posterior predictive check when grouping data both across sites (`group = 1`

) as well as across replicates (`group = 2`

), as they may reveal (or fail to reveal) different inadequacies of the model for the specific data set at hand (Kéry and Royle 2016). In particular, binning the data across sites (`group = 1`

) may help reveal whether the model fails to adequately represent variation in occurrence and detection probability across space, while binning the data across replicates (`group = 2`

) may help reveal whether the model fails to adequately represent variation in detection probability across the different replicate surveys. Similarly, we suggest exploring posterior predictive checks using both the Freeman-Tukey Statistic as well as the Chi-Squared statistic. Generally, the more ways we explore the fit of our model to the data, the more confidence we have that our model is adequately representing the data and we can draw reasonable conclusions from it. By performing posterior predictive checks with the two fit statistics and two ways of grouping the data, `spOccupancy`

provides us with four different measures that we can use to assess the validity of our model. Throughout this vignette, we will display different types of posterior predictive checks using different combinations of the fit statistic and grouping approach. In a more complete analysis, we would run all four types of posterior predictive checks for each model we fit as a more complete GoF assessment.

The resulting values from a call to `ppcOcc()`

can be used with the `summary()`

function to generate a Bayesian p-value, which is the probability, under the fitted model, to obtain a value of the fit statistic that is more extreme (i.e., larger) than the one observed, i.e., for the actual data. Bayesian p-values are sensitive to individual values, so we should also explore the discrepancy measures for each “grouped” data point. `ppcOcc()`

returns a matrix of posterior quantiles for the fit statistic for both the observed (`fit.y.group.quants`

) and model generated, replicate data (`fit.y.rep.group.quants`

) for each “grouped” data point.

We next perform a posterior predictive check using the Freeman-Tukey statistic grouping the data by sites. We summarize the posterior predictive check with the `summary()`

function, which reports a Bayesian p-value. A Bayesian p-value that hovers around 0.5 indicates adequate model fit, while values less than 0.1 or greater than 0.9 suggest our model does not fit the data well (Hobbs and Hooten 2015). As always with a simulation-based analysis using MCMC, you will get numerically slightly different values.

```
Call:
ppcOcc(object = out, fit.stat = "freeman-tukey", group = 1)
Samples per Chain: 5000
Burn-in: 3000
Thinning Rate: 2
Number of Chains: 3
Total Posterior Samples: 3000
Bayesian p-value: 0.2403
Fit statistic: freeman-tukey
```

The Bayesian p-value is the proportion of posterior samples of the fit statistic of the model generated data that are greater than the corresponding fit statistic of the true data, summed across all “grouped” data points. We can create a visual representation of the Bayesian p-value as follows (Kéry and Royle 2016):

```
ppc.df <- data.frame(fit = ppc.out$fit.y,
fit.rep = ppc.out$fit.y.rep,
color = 'lightskyblue1')
ppc.df$color[ppc.df$fit.rep > ppc.df$fit] <- 'lightsalmon'
plot(ppc.df$fit, ppc.df$fit.rep, bg = ppc.df$color, pch = 21,
ylab = 'Fit', xlab = 'True')
lines(ppc.df$fit, ppc.df$fit, col = 'black')
```

Our Bayesian p-value is greater than 0.1 indicating no lack of fit, although the above plot shows most of the fit statistics are smaller for the replicate data than the actual data set. Relying solely on the Bayesian p-value as an assessment of model fit is not always a great option, as individual data points can have an overbearing influence on the resulting summary value. Instead of summing across all data points for a single discrepancy measure, `ppcOcc()`

also allows us to explore discrepancy measures on a “grouped” point by point basis. The resulting `ppcOcc`

object will contain the objects `fit.y.group.quants`

and `fit.y.rep.group.quants`

, which contain quantiles of the posterior distributions for the discrepancy measures of each grouped data point. Below we plot the difference in the discrepancy measure between the replicate and actual data across each of the sites.

```
diff.fit <- ppc.out$fit.y.rep.group.quants[3, ] - ppc.out$fit.y.group.quants[3, ]
plot(diff.fit, pch = 19, xlab = 'Site ID', ylab = 'Replicate - True Discrepancy')
```

We see there are four sites that contribute to the overall GoF measure far more than in proportion to their number, i.e., whose discrepancy measure for the actual data is much larger than the corresponding discrepancy for the replicate data. Here we will ignore this, but in a real analysis it might be very insightful to further explore these sites to see what could explain this pattern (e.g., are the sites close together in space? Could the data be the result of erroneous recording, or of extraordinary local habitat that is not adequately described by the occurrence covariates?). Although rarely done, closer investigations of outlying values in statistical analyses may sometimes teach one more than mere acceptance of a fitting model.

### Model selection using WAIC and k-fold cross-validation

Posterior predictive checks allow us to assess how well our model fits the data, but they are not very useful if we want to compare multiple competing models and ultimately select a final model based on some criterion. Bayesian model selection is very much a constantly changing field. See Hooten and Hobbs (2015) for an accessible overview of Bayesian model selection for ecologists.

For Bayesian hierarchical models like occupancy models, the most common Bayesian model selection criterion, the deviance information criterion or DIC, is not applicable (Hooten and Hobbs 2015). Instead, the Widely Applicable Information Criterion (Watanabe 2010) is recommended to compare a set of models and select the best-performing model for final analysis.

The WAIC is calculated for all `spOccupancy`

model objects using the function `waicOcc()`

. We calculate the WAIC as

\[ \text{WAIC} = -2 \times (\text{elpd} - \text{pD}), \]

where elpd is the expected log point-wise predictive density and PD is the effective number of parameters. We calculate elpd by calculating the likelihood for each posterior sample, taking the mean of these likelihood values, taking the log of the mean of the likelihood values, and summing these values across all sites. We calculate the effective number of parameters by calculating the variance of the log likelihood for each site taken over all posterior samples, and then summing these values across all sites. See Appendix S1 from Broms, Hooten, and Fitzpatrick (2016) for more details.

We calculate the WAIC using `waicOcc()`

for our OVEN model below (as always, note some slight differences with your solutions due to Monte Carlo error).

`waicOcc(out)`

```
elpd pD WAIC
-632.960543 6.290149 1278.501382
```

For illustration, let’s next rerun the OVEN model, but this time we assume occurrence is constant across the HBEF, and subsequently compare the WAIC value to the full model

```
out.small <- PGOcc(occ.formula = ~ 1,
det.formula = oven.det.formula,
data = ovenHBEF,
inits = oven.inits,
n.samples = n.samples,
priors = oven.priors,
n.omp.threads = 1,
verbose = FALSE,
n.burn = n.burn,
n.thin = n.thin,
n.chains = n.chains)
waicOcc(out.small)
```

```
elpd pD WAIC
-692.046183 4.932945 1393.958257
```

Smaller values of WAIC indicate models with better performance. We see the WAIC for the model with elevation is smaller than the intercept-only model, indicating elevation is an important predictor for OVEN occurrence in HBEF.

When focusing primarily on predictive performance, a k-fold cross-validation (CV) approach is another attractive, though computationally more intensive, alternative to compare a series of models, especially since WAIC may not always be reliable for occupancy models (Broms, Hooten, and Fitzpatrick 2016). In `spOccupancy`

, k-fold cross-validation is accomplished using the arguments `k.fold`

, `k.fold.threads`

, and `k.fold.seed`

in the model fitting function. A k-fold cross validation approach requires fitting a model \(k\) times, where each time the model is fit using \(J / k\) data points, where \(J\) is the total number of sites surveyed at least once in the data set. Each time the model is fit, it uses a different portion of the data and then predicts the remaining \(J - J/k\) hold out values. Because the data are not used to fit the model, CV yields true samples from the posterior predictive distribution that we can use to assess the predictive capacity of the model.

As a measure of out-of-sample predictive performance, we use the deviance as a scoring rule following Hooten and Hobbs (2015). For K-fold cross-validation, it is computed as

\[\begin{equation} -2 \sum_{k = 1}^K \text{log}\Bigg(\frac{\sum_{q = 1}^Q \text{Bernoulli}(\boldsymbol{y}_k \mid \boldsymbol{p}^{(q)}\boldsymbol{z}_k^{(q)})}{Q}\Bigg), \end{equation}\]

where \(\boldsymbol{p}^{(q)}\) and \(\boldsymbol{z}_k^{(q)}\) are MCMC samples of detection probability and latent occurrence, respectively, arising from a model that is fit without the observations \(\boldsymbol{y}_k\), and \(Q\) is the total number of posterior samples produced from the MCMC sampler. The -2 is used so that smaller values indicate better model fit, which aligns with most information criteria used for model assessment (like the WAIC implemented using `waicOcc()`

).

The final three arguments in `PGOcc()`

(`k.fold`

, `k.fold.threads`

, `k.fold.seed`

) control whether or not k-fold cross validation is performed following the complete fit of the model using the entire data set. The `k.fold`

argument indicates the number of \(k\) folds to use for cross-validation. If `k.fold`

is not specified, cross-validation is not performed and `k.fold.threads`

and `k.fold.seed`

are ignored. The `k.fold.threads`

argument indicates the number of threads to use for running the \(k\) models in parallel across multiple threads. Parallel processing is accomplished using the R packages `foreach`

and `doParallel`

. Specifying `k.fold.threads > 1`

can substantially decrease run time since it allows for models to be fit simultaneously on different threads rather than sequentially. The `k.fold.seed`

indicates the seed used to randomly split the data into \(k\) groups. This is by default set to 100.

Below we refit the occupancy model with elevation (linear and quadratic) as an occurrence predictor this time performing 4-fold cross-validation. We set `k.fold = 4`

to perform 4-fold cross-validation and `k.fold.threads = 1`

to run the model using 1 thread. Normally we would set `k.fold.threads = 4`

, but using multiple threads leads to complications when compiling this vignette, so we leave that to you to explore the computational improvements of performing cross-validation across multiple cores.

```
out.k.fold <- PGOcc(occ.formula = oven.occ.formula,
det.formula = oven.det.formula,
data = ovenHBEF,
inits = oven.inits,
n.samples = n.samples,
priors = oven.priors,
n.omp.threads = 1,
verbose = TRUE,
n.report = 1000,
n.burn = n.burn,
n.thin = n.thin,
n.chains = n.chains,
k.fold = 4,
k.fold.threads = 1)
```

```
----------------------------------------
Preparing to run the model
----------------------------------------
----------------------------------------
Model description
----------------------------------------
Occupancy model with Polya-Gamma latent
variable fit with 373 sites.
Samples per Chain: 5000
Burn-in: 3000
Thinning Rate: 2
Number of Chains: 3
Total Posterior Samples: 3000
Source compiled with OpenMP support and model fit using 1 thread(s).
----------------------------------------
Chain 1
----------------------------------------
Sampling ...
Sampled: 1000 of 5000, 20.00%
-------------------------------------------------
Sampled: 2000 of 5000, 40.00%
-------------------------------------------------
Sampled: 3000 of 5000, 60.00%
-------------------------------------------------
Sampled: 4000 of 5000, 80.00%
-------------------------------------------------
Sampled: 5000 of 5000, 100.00%
----------------------------------------
Chain 2
----------------------------------------
Sampling ...
Sampled: 1000 of 5000, 20.00%
-------------------------------------------------
Sampled: 2000 of 5000, 40.00%
-------------------------------------------------
Sampled: 3000 of 5000, 60.00%
-------------------------------------------------
Sampled: 4000 of 5000, 80.00%
-------------------------------------------------
Sampled: 5000 of 5000, 100.00%
----------------------------------------
Chain 3
----------------------------------------
Sampling ...
Sampled: 1000 of 5000, 20.00%
-------------------------------------------------
Sampled: 2000 of 5000, 40.00%
-------------------------------------------------
Sampled: 3000 of 5000, 60.00%
-------------------------------------------------
Sampled: 4000 of 5000, 80.00%
-------------------------------------------------
Sampled: 5000 of 5000, 100.00%
----------------------------------------
Cross-validation
----------------------------------------
```

`Performing 4-fold cross-validation using 1 thread(s).`

We subsequently refit the intercept only occupancy model, and compare the deviance metrics from the 4-fold cross-validation.

```
# Model fitting information is suppressed for space.
out.int.k.fold <- PGOcc(occ.formula = ~ 1,
det.formula = oven.det.formula,
data = ovenHBEF,
inits = oven.inits,
n.samples = n.samples,
priors = oven.priors,
n.omp.threads = 1,
verbose = FALSE,
n.report = 1000,
n.burn = n.burn,
n.thin = n.thin,
n.chains = n.chains,
k.fold = 4,
k.fold.threads = 1)
```

The cross-validation metric (model deviance) is stored in the `k.fold.deviance`

tag of the resulting model object.

`out.k.fold$k.fold.deviance # Larger model. `

`[1] 1510.854`

`out.int.k.fold$k.fold.deviance # Intercept-only model.`

`[1] 1532.225`

Similar to the results from the WAIC, CV also suggests that the model including elevation with a predictor outperforms the intercept only model.

### Prediction

All model objects resulting from a call to `spOccupancy`

model-fitting functions can be used with `predict()`

to generate a series of posterior predictive samples at new locations, given the values of all covariates used in the model fitting process. The object `hbefElev`

(which comes as part of the `spOccupancy`

package) contains elevation data at a 30x30m resolution from the National Elevation Data set across the entire HBEF. We load the data below

```
'data.frame': 46090 obs. of 3 variables:
$ val : num 914 916 918 920 922 ...
$ Easting : num 276273 276296 276318 276340 276363 ...
$ Northing: num 4871424 4871424 4871424 4871424 4871424 ...
```

The column `val`

contains the elevation values, while `Easting`

and `Northing`

contain the spatial coordinates that we will use for plotting. We can obtain posterior predictive samples for the occurrence probabilities at these sites by using the `predict()`

function and our `PGOcc`

fitted model object. Given that we standardized the elevation values when we fit the model, we need to standardize the elevation values for prediction using the exact same values of the mean and standard deviation of the elevation values used to fit the data.

```
elev.pred <- (hbefElev$val - mean(ovenHBEF$occ.covs[, 1])) / sd(ovenHBEF$occ.covs[, 1])
# These are the new intercept and covariate data.
X.0 <- cbind(1, elev.pred, elev.pred^2)
out.pred <- predict(out, X.0)
```

For `PGOcc`

objects, the `predict`

function takes two arguments: (1) the `PGOcc`

fitted model object; and (2) a matrix or data frame consisting of the design matrix for the prediction locations (which must include and intercept if our model contained one). The resulting object consists of posterior predictive samples for the latent occurrence probabilities (`psi.0.samples`

) and latent occurrence values (`z.0.samples`

). The beauty of the Bayesian paradigm, and the MCMC computing machinery, is that these predictions all have fully propagated uncertainty. We can use these values to create plots of the predicted mean occurrence values, as well as of their standard deviation as a measure of the uncertainty in these predictions akin to a standard error associated with maximum likelihood estimates. We could also produce a map for the Bayesian credible interval length as an alternative measure of prediction uncertainty or two maps, one showing the lower and the other the upper limit of, say, a 95% credible interval.

```
plot.dat <- data.frame(x = hbefElev$Easting,
y = hbefElev$Northing,
mean.psi = apply(out.pred$psi.0.samples, 2, mean),
sd.psi = apply(out.pred$psi.0.samples, 2, sd),
stringsAsFactors = FALSE)
# Make a species distribution map showing the point estimates,
# or predictions (posterior means)
dat.stars <- st_as_stars(plot.dat, dims = c('x', 'y'))
ggplot() +
geom_stars(data = dat.stars, aes(x = x, y = y, fill = mean.psi)) +
scale_fill_viridis_c(na.value = 'transparent') +
labs(x = 'Easting', y = 'Northing', fill = '',
title = 'Mean OVEN occurrence probability') +
theme_bw()
```

```
# Map of the associated uncertainty of these predictions
# (i.e., posterior sds)
ggplot() +
geom_stars(data = dat.stars, aes(x = x, y = y, fill = sd.psi)) +
scale_fill_viridis_c(na.value = 'transparent') +
labs(x = 'Easting', y = 'Northing', fill = '',
title = 'SD OVEN occurrence probability') +
theme_bw()
```

## Single-species spatial occupancy models

### Basic model description

When working across large spatial domains, accounting for residual spatial autocorrelation in species distributions can often improve predictive performance of a model, leading to more accurate species distribution maps (Guélat and Kéry 2018; Lany et al. 2020). We here extend the basic single-species occupancy model to incorporate a spatial Gaussian Process that accounts for unexplained spatial variation in species occurrence across a region of interest. Let \(\boldsymbol{s}_j\) denote the geographical coordinates of site \(j\) for \(j = 1, \dots, J\). In all spatially-explicit models, we include \(\boldsymbol{s}_j\) directly in the notation of spatially-indexed variables to indicate the model is spatially-explicit. More specifically, the occurrence probability at site \(j\) with coordinates \(\boldsymbol{s}_j\), \(\psi(\boldsymbol{s}_j)\), now takes the form

\[\begin{equation} \text{logit}(\psi(\boldsymbol{s}_j) = \boldsymbol{x}(\boldsymbol{s}_j)^{\top}\boldsymbol{\beta} + \omega(\boldsymbol{s}_j), \end{equation}\]

where \(\omega_j\) is a realization from a zero-mean spatial Gaussian Process, i.e.,

\[\begin{equation} \boldsymbol{\omega}(\boldsymbol{s}) \sim N(\boldsymbol{0}, \boldsymbol{\Sigma}(\boldsymbol{\boldsymbol{s}, \boldsymbol{s}', \theta})). \end{equation}\]

We define \(\boldsymbol{\Sigma}(\boldsymbol{s}, \boldsymbol{s}', \boldsymbol{\theta})\) as a \(J \times J\) covariance matrix that is a function of the distances between any pair of site coordinates \(\boldsymbol{s}\) and \(\boldsymbol{s}'\) and a set of parameters \((\boldsymbol{\theta})\) that govern the spatial process. The vector \(\boldsymbol{\theta}\) is equal to \(\boldsymbol{\theta} = \{\sigma^2, \phi, \nu\}\), where \(\sigma^2\) is a spatial variance parameter, \(\phi\) is a spatial decay parameter, and \(\nu\) is a spatial smoothness parameter. \(\nu\) is only specified when using a Matern correlation function.

The detection portion of the occupancy model remains unchanged from the non-spatial occupancy model. Single-species spatial occupancy models, like all models in `spOccupancy`

are fit using Pólya-Gamma data augmentation (see MCMC sampler vignette for details).

When the number of sites is moderately large, say 1000, the above described spatial Gaussian process model can become drastically slow as a result of the need to take the inverse of the spatial covariance matrix \(\boldsymbol{\Sigma}(\boldsymbol{s}, \boldsymbol{s}', \boldsymbol{\theta})\) at each MCMC iteration. Numerous approximation methods exist to reduce this computational cost (see Heaton et al. (2019) for an overview and comparison of multiple methods). One attractive approach is the Nearest Neighbor Gaussian Process (NNGP; Datta et al. (2016)). Instead of modeling the spatial process using a full Gaussian Process, we replace the Gaussian Process prior specification with a NNGP, which leads to drastic improvements of run time with nearly identical inference and prediction as the full Gaussian Process specification. See Datta et al. (2016), Finley et al. (2019), and the MCMC sampler vignette for additional statistical details on NNGPs and their implementation in spatial occupancy models.

### Fitting single-species spatial occupancy models with `spPGOcc()`

The function `spPGOcc()`

fits single-species spatial occupancy models using Pólya-Gamma latent variables, where spatial autocorrelation is accounted for using a spatial Gaussian Process. `spPGOcc()`

fits spatial occupancy models using either a full Gaussian process or an NNGP. See Finley, Datta, and Banerjee (2020) for details on using NNGPs with Pólya-Gamma latent variables.

We will fit the same occupancy model for OVEN that we fit previously using `PGOcc()`

, but we will now make the model spatially explicit by incorporating a spatial process with `spPGOcc()`

. First, let’s take a look at the arguments for `spPGOcc()`

:

```
spPGOcc(occ.formula, det.formula, data, inits, n.batch,
batch.length, accept.rate = 0.43, priors,
cov.model = "exponential", tuning, n.omp.threads = 1,
verbose = TRUE, NNGP = FALSE, n.neighbors = 15,
search.type = "cb", n.report = 100,
n.burn = round(.10 * n.batch * batch.length),
n.thin = 1, n.chains = 1, k.fold, k.fold.threads = 1,
k.fold.seed = 100, ...)
```

We will walk through each of the arguments to `spPGOcc()`

in the context of our Ovenbird example. The occurrence (`occ.formula`

) and detection (`det.formula`

) formulas, as well as the list of data (`data`

), take the same form as we saw in `PGOcc`

, with random intercepts allowed in both the occurrence and detection models. Notice the `coords`

matrix in the `ovenHBEF`

list of data. We did not use this for `PGOcc()`

but specifying the spatial coordinates in `data`

is required for all spatially explicit models in `spOccupancy`

.

```
oven.occ.formula <- ~ scale(Elevation) + I(scale(Elevation)^2)
oven.det.formula <- ~ scale(day) + scale(tod) + I(scale(day)^2)
str(ovenHBEF) # coords is required for spPGOcc.
```

```
List of 4
$ y : num [1:373, 1:3] 1 1 0 1 0 0 1 0 1 1 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:373] "1" "2" "3" "4" ...
.. ..$ : chr [1:3] "1" "2" "3"
$ occ.covs: num [1:373, 1] 475 494 546 587 588 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "Elevation"
$ det.covs:List of 2
..$ day: num [1:373, 1:3] 156 156 156 156 156 156 156 156 156 156 ...
..$ tod: num [1:373, 1:3] 330 346 369 386 409 425 447 463 482 499 ...
$ coords : num [1:373, 1:2] 280000 280000 280000 280001 280000 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:373] "1" "2" "3" "4" ...
.. ..$ : chr [1:2] "X" "Y"
```

The initial values (`inits`

) are again specified in a list. Valid tags for initial values now additionally include the parameters associated with the spatial random effects. These include: `sigma.sq`

(spatial variance parameter), `phi`

(spatial range parameter), `w`

(the latent spatial random effects at each site), and `nu`

(spatial smoothness parameter), where the latter is only specified if adopting a Matern covariance function (i.e., `cov.model = 'matern'`

). `spOccupancy`

supports four spatial covariance models (`exponential`

, `spherical`

, `gaussian`

, and `matern`

), which are specified in the `cov.model`

argument. Throughout this vignette, we will use an exponential covariance model, which we often use as our default covariance model when fitting spatially-explicit models and is commonly used throughout ecology. To determine which covariance function to use, we can fit models with the different covariance functions and compare them using WAIC or k-fold cross-validation to select the best performing function. We will note that the Matern covariance function has the additional spatial smoothness parameter \(\nu\) and thus can often be more flexible than the other functions. However, because we need to estimate an additional parameter, this also tends to require more data (i.e., a larger number of sites) than the other covariance functions, and so we encourage use of the three simpler functions if your data set is sparse. We note that model estimates are generally fairly robust to the different covariance functions, although certain functions may provide substantially better estimates depending on the specific form of the underlying spatial autocorrelation in the data. For example, the Gaussian covariance function is often useful for accounting for spatial autocorrelation that is very smooth (i.e., long range spatial dependence). See Chapter 2 in Banerjee, Carlin, and Gelfand (2003) for a more thorough discussion of these functions and their mathematical properties.

The default initial values for `phi`

, `sigma.sq`

, and `nu`

are all set to random values from the prior distribution, which generally will be sufficient for the models discussed in this vignette. In all spatially-explicit models described in this vignette, the spatial range parameter `phi`

is the most sensitive to initial values. In general, the spatial range parameter will often have poor mixing and take longer to converge than the rest of the parameters in the model, so specifying an initial value that is reasonably close to the resulting value can really help decrease run times for complicated models. As an initial value for the spatial range parameter `phi`

, we compute the mean distance between points in HBEF and then set it equal to 3 divided by this mean distance. When using an exponential covariance function, \(\frac{3}{\phi}\) is the effective range, or the distance at which the residual spatial correlation between two sites drops to 0.05 (Banerjee, Carlin, and Gelfand 2003). Thus our initial guess for this effective range is the average distance between sites across HBEF. As with all other parameters, we generally recommend using the default initial values for an initial model run, and if the model is taking a very long time to converge you can rerun the model with initial values based on the posterior means of estimated parameters from the initial model fit. For the spatial variance parameter `sigma.sq`

, we set the initial value to 2. This corresponds to a fairly small spatial variance, which we expect based on our previous work with this data set. Further, we set the initial values of the latent spatial random effects at each site to 0. The initial values for these random effects has an extremely small influence on the model results, so we generally recommend setting their initial values to 0 as we have done here (this is also the default). However, if you are running your model for a very long time and are seeing very slow convergence of the MCMC chains, setting the initial values of the spatial random effects to the mean estimates from a previous run of the model could help reach convergence faster.

```
# Pair-wise distances between all sites
dist.hbef <- dist(ovenHBEF$coords)
# Exponential covariance model
cov.model <- "exponential"
# Specify list of inits
oven.inits <- list(alpha = 0,
beta = 0,
z = apply(ovenHBEF$y, 1, max, na.rm = TRUE),
sigma.sq = 2,
phi = 3 / mean(dist.hbef),
w = rep(0, nrow(ovenHBEF$y)))
```

The next three arguments (`n.batch`

, `batch.length`

, and `accept.rate`

) are all related to the specific type of MCMC sampler we use when we fit the model. The spatial range parameter (and the spatial smoothness parameter if `cov.model = 'matern'`

) are the two hardest parameters to estimate in spatially-explicit models in `spOccupancy`

. In other words, you will often see slow mixing and convergence of the MCMC chains for these parameters. To try to speed up this slow mixing and convergence of these parameters, we use an algorithm called an adaptive Metropolis-Hastings algorithm for all spatially-explicit models in `spOccupancy`

(see Roberts and Rosenthal (2009) for more details on this algorithm). In this approach, we break up the total number of MCMC samples into a set of “batches”, where each batch has a specific number of MCMC samples. When we fit a spatially-explicit model in `spOccupancy`

, instead of specifying the total number of MCMC samples in the `n.samples`

argument like we did in `PGOcc()`

, we must specify the total number of batches (`n.batch`

) as well as the number of MCMC samples each batch contains. Thus, the total number of MCMC samples is `n.batch * batch.length`

. Typically, we set `batch.length = 25`

and then play around with `n.batch`

until convergence of all model parameters is reached. We recommend setting `batch.length = 25`

unless you have a specific reason to change it. Here we set `n.batch = 400`

for a total of 10,000 MCMC samples in each of 3 chains. We additionally specify a burn-in period of length 2000 and a thinning rate of 20.

```
batch.length <- 25
n.batch <- 400
n.burn <- 2000
n.thin <- 20
n.chains <- 3
```

Importantly, we also need to specify an acceptance rate and a tuning parameter for the spatial range parameter (and spatial smoothness parameter if `cov.model = 'matern'`

), which are both features of the adaptive algorithm we use to sample these parameters. In this algorithm, we propose new values for `phi`

(and `nu`

), compare them to our previous values, and use a statistical algorithm to determine if we should accept the new proposed value or keep the old one. The `accept.rate`

argument specifies the ideal proportion of times we will accept the newly proposed values for these parameters. Roberts and Rosenthal (2009) show that if we accept new values around 43% of the time, this will lead to optimal mixing and convergence of the MCMC chains. Following these recommendations, we should strive for an algorithm that accepts new values about 43% of the time. Thus, we recommend setting `accept.rate = 0.43`

unless you have a specific reason not to (this is the default value). The values specified in the `tuning`

argument helps control the initial values we will propose for both `phi`

and `nu`

. These values are supplied as input in the form of a list with tags `phi`

and `nu`

. The initial tuning value can be any value greater than 0, but we generally recommend starting the value out around 0.5. This initial tuning value is closely related to the ideal (or target) acceptance rate we specified in `accept.rate`

. In short, the algorithm we use is “adaptive” because the algorithm will change the tuning values after each batch of the MCMC to yield acceptance rates that are close to our target acceptance rate that we specified in the `accept.rate`

argument. Information on the acceptance rates for `phi`

and `nu`

in your model will be displayed when setting `verbose = TRUE`

. After some initial runs of the model, if you notice the final acceptance rate is much larger or smaller than the target acceptance rate (`accept.rate`

), you can then change the initial tuning value to get closer to the target rate. While use of this algorithm requires us to specify more arguments in spatially-explicit models, this leads to much shorter run times compared to a more simple approach where we do not have an “adaptive” sampling approach, and it should thus save you time in the long haul when waiting for these models to run. For our example here, we set the initial tuning value to 1 after some initial exploratory runs of the model.

```
oven.tuning <- list(phi = 1)
# accept.rate = 0.43 by default, so we do not specify it.
```

Priors are again specified in a list in the argument `priors`

. We follow standard recommendations for prior distributions from the spatial statistics literature (Banerjee, Carlin, and Gelfand 2003). We assume an inverse gamma prior for the spatial variance parameter `sigma.sq`

(the tag of which is `sigma.sq.ig`

), and uniform priors for the spatial decay parameter `phi`

and smoothness parameter `nu`

(if using the Matern correlation function), with the associated tags `phi.unif`

and `nu.unif`

. The hyperparameters of the inverse Gamma are passed as a vector of length two, with the first and second elements corresponding to the shape and scale, respectively. The lower and upper bounds of the uniform distribution are passed as a two-element vector for the uniform priors.

As of v0.3.2, we allow users to specify a uniform prior on the spatial variance parameter `sigma.sq`

(using the tag `sigma.sq.unif`

) instead of an inverse-Gamma prior. This can be useful in certain situations when working with a binary response variable (which we inherently do in occupancy models), as there is a confounding between the spatial variance parameter `sigma.sq`

and the occurrence intercept. This occurs as a result of the logit transformation and a mathematical statement known as Jensen’s Inequality (Bolker 2015). Briefly, Jensen’s Inequality tells us that while the spatial random effects don’t influence the mean on the logit scale (since we give them a prior with mean 0), they do have an influence on the mean on the actual probability scale (after back-transforming), which is what leads to potential confounding between the spatial variance parameter and the occurrence intercept. Generally, we have found this confounding to be inconsequential, as the spatial structure of the random effects helps to separate the spatial variance from the occurrence intercept. However, there may be certain circumstances when \(\sigma^2\) is estimated to be extremely large, and the estimate of the occurrence intercept is a very large magnitude negative number in order to compensate. It can be helpful in these situations to use a uniform distribution on `sigma.sq`

to restrict it to taking more reasonable values. We have rarely encountered such situations, and so throughout this vignette (and in our the vast majority of our analyses) we will use an inverse-Gamma prior.

Note that the priors for the spatial parameters in a spatially-explicit model must be at least weakly informative for the model to converge (Banerjee, Carlin, and Gelfand 2003). For the inverse-Gamma prior on the spatial variance, we typically set the shape parameter to 2 and the scale parameter equal to our best guess of the spatial variance. The default prior hyperparameter values for the spatial variance \(\sigma^2\) are a shape parameter of 2 and a scale parameter of 1. This weakly informative prior suggests a prior mean of 1 for the spatial variance, which is a moderately small amount of spatial variation. Based on our previous work with these data, we expected the residual spatial variation to be relatively small, and so we use this default value and set the scale parameter below to 1. For the spatial decay parameter, our default approach is to set the lower and upper bounds of the uniform prior based on the minimum and maximum distances between sites in the data. More specifically, by default we set the lower bound to `3 / max`

and the upper bound to `3 / min`

, where `min`

and `max`

are the minimum and maximum distances between sites in the data set, respectively. This equates to a vague prior that states the spatial autocorrelation in the data could only exist between sites that are very close together, or could span across the entire observed study area. If additional information is known on the extent of the spatial autocorrelation in the data, you may place more restrictive bounds on the uniform prior, which would reduce the amount of time needed for adequate mixing and convergence of the MCMC chains. Here we use this default approach, but will explicitly set the values for transparency.

```
min.dist <- min(dist.hbef)
max.dist <- max(dist.hbef)
oven.priors <- list(beta.normal = list(mean = 0, var = 2.72),
alpha.normal = list(mean = 0, var = 2.72),
sigma.sq.ig = c(2, 1),
phi.unif = c(3/max.dist, 3/min.dist))
```

The argument `n.omp.threads`

specifies the number of threads to use for within-chain parallelization, while `verbose`

specifies whether or not to print the progress of the sampler. We *highly* recommend setting `verbose = TRUE`

for all spatial models to ensure the adaptive MCMC is working as you want (and this is the reason for why this is the default for this argument). The argument `n.report`

specifies the interval to report the Metropolis-Hastings sampler acceptance rate. Note that `n.report`

is specified in terms of batches, not the overall number of samples. Below we set `n.report = 100`

, which will result in information on the acceptance rate and tuning parameters every 100th batch (not sample).

```
n.omp.threads <- 1
verbose <- TRUE
n.report <- 100 # Report progress at every 100th batch.
```

The parameters `NNGP`

, `n.neighbors`

, and `search.type`

relate to whether or not you want to fit the model with a Gaussian Process or with NNGP, which is a much more computationally efficient approximation. The argument `NNGP`

is a logical value indicating whether to fit the model with an NNGP (`TRUE`

) or a regular Gaussian Process (`FALSE`

). For data sets that have more than 1000 locations, using an NNGP will have substantial improvements in run time. Even for more modest-sized data sets, like the HBEF data set, using an NNGP will be quite a bit faster (especially for multi-species models as shown in subsequent sections). Unless your data set is particularly small (e.g., 100 points) and you are concerned about the NNGP approximation, we recommend setting `NNGP = TRUE`

, which is the default. The arguments `n.neighbors`

and `search.type`

specify the number of neighbors used in the NNGP and the nearest neighbor search algorithm, respectively, to use for the NNGP model. Generally, the default values of these arguments will be adequate. Datta et al. (2016) showed that setting `n.neighbors = 15`

is usually sufficient, although for certain data sets a good approximation can be achieved with as few as five neighbors, which could substantially decrease run time for the model. We generally recommend leaving `search.type = "cb"`

, as this results in a fast code book nearest neighbor search algorithm. However, details on when you may want to change this are described in Finley, Datta, and Banerjee (2020). We will run an NNGP model using the default value for `search.type`

and setting `n.neighbors = 5`

, which we have found in exploratory analysis to closely approximate a full Gaussian Process.

We now fit the model (without k-fold cross-validation) and summarize the results using `summary()`

.

```
# Approx. run time: < 1 minute
out.sp <- spPGOcc(occ.formula = oven.occ.formula,
det.formula = oven.det.formula,
data = ovenHBEF,
inits = oven.inits,
n.batch = n.batch,
batch.length = batch.length,
priors = oven.priors,
cov.model = cov.model,
NNGP = TRUE,
n.neighbors = 5,
tuning = oven.tuning,
n.report = n.report,
n.burn = n.burn,
n.thin = n.thin,
n.chains = n.chains)
```

```
----------------------------------------
Preparing to run the model
----------------------------------------
----------------------------------------
Building the neighbor list
----------------------------------------
----------------------------------------
Building the neighbors of neighbors list
----------------------------------------
----------------------------------------
Model description
----------------------------------------
NNGP Occupancy model with Polya-Gamma latent
variable fit with 373 sites.
Samples per chain: 10000 (400 batches of length 25)
Burn-in: 2000
Thinning Rate: 20
Number of Chains: 3
Total Posterior Samples: 1200
Using the exponential spatial correlation model.
Using 5 nearest neighbors.
Source compiled with OpenMP support and model fit using 1 thread(s).
Adaptive Metropolis with target acceptance rate: 43.0
----------------------------------------
Chain 1
----------------------------------------
Sampling ...
Batch: 100 of 400, 25.00%
Parameter Acceptance Tuning
phi 16.0 0.42741
-------------------------------------------------
Batch: 200 of 400, 50.00%
Parameter Acceptance Tuning
phi 20.0 0.30422
-------------------------------------------------
Batch: 300 of 400, 75.00%
Parameter Acceptance Tuning
phi 44.0 0.29229
-------------------------------------------------
Batch: 400 of 400, 100.00%
----------------------------------------
Chain 2
----------------------------------------
Sampling ...
Batch: 100 of 400, 25.00%
Parameter Acceptance Tuning
phi 36.0 0.29820
-------------------------------------------------
Batch: 200 of 400, 50.00%
Parameter Acceptance Tuning
phi 28.0 0.31664
-------------------------------------------------
Batch: 300 of 400, 75.00%
Parameter Acceptance Tuning
phi 40.0 0.28083
-------------------------------------------------
Batch: 400 of 400, 100.00%
----------------------------------------
Chain 3
----------------------------------------
Sampling ...
Batch: 100 of 400, 25.00%
Parameter Acceptance Tuning
phi 48.0 0.28650
-------------------------------------------------
Batch: 200 of 400, 50.00%
Parameter Acceptance Tuning
phi 56.0 0.28650
-------------------------------------------------
Batch: 300 of 400, 75.00%
Parameter Acceptance Tuning
phi 40.0 0.29229
-------------------------------------------------
Batch: 400 of 400, 100.00%
```

`class(out.sp) # Look at what spOccupancy produced.`

`[1] "spPGOcc"`

`names(out.sp)`

```
[1] "rhat" "beta.samples" "alpha.samples" "theta.samples"
[5] "coords" "z.samples" "X" "X.re"
[9] "w.samples" "psi.samples" "like.samples" "X.p"
[13] "X.p.re" "lambda.p" "y" "ESS"
[17] "call" "n.samples" "n.neighbors" "cov.model.indx"
[21] "type" "n.post" "n.thin" "n.burn"
[25] "n.chains" "pRE" "psiRE" "run.time"
```

`summary(out.sp)`

```
Call:
spPGOcc(occ.formula = oven.occ.formula, det.formula = oven.det.formula,
data = ovenHBEF, inits = oven.inits, priors = oven.priors,
tuning = oven.tuning, cov.model = cov.model, NNGP = TRUE,
n.neighbors = 5, n.batch = n.batch, batch.length = batch.length,
n.report = n.report, n.burn = n.burn, n.thin = n.thin, n.chains = n.chains)
Samples per Chain: 10000
Burn-in: 2000
Thinning Rate: 20
Number of Chains: 3
Total Posterior Samples: 1200
Run Time (min): 1.4349
Occurrence (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept) 3.0292 0.7355 1.5816 2.9908 4.5583 1.0203 221
scale(Elevation) -2.6150 0.6037 -3.9895 -2.5396 -1.6115 1.0433 226
I(scale(Elevation)^2) -0.7462 0.3753 -1.5448 -0.7321 0.0096 1.0259 462
Detection (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept) 0.8220 0.1208 0.5832 0.8195 1.0659 1.0201 1102
scale(day) -0.0919 0.0829 -0.2569 -0.0934 0.0726 1.0013 1476
scale(tod) -0.0465 0.0790 -0.1950 -0.0480 0.1090 1.0025 1200
I(scale(day)^2) 0.0257 0.0959 -0.1568 0.0247 0.2120 1.0145 1200
Spatial Covariance:
Mean SD 2.5% 50% 97.5% Rhat ESS
sigma.sq 6.4647 3.9863 1.7700 5.4539 17.5299 1.1522 120
phi 0.0022 0.0009 0.0008 0.0020 0.0043 1.0187 186
```

Most Rhat values are less than 1.1, although for a full analysis we should run the model longer to ensure the spatial variance parameter (`sigma.sq`

) has converged. The effective sample sizes of the spatial covariance parameters are somewhat low, which indicates limited mixing of these parameters. This is common when fitting spatial models. We note that there is substantial MC error in the spatial covariance parameters, and so your estimates may be slightly different from what we have shown here. We see `spPGOcc()`

returns a list of class `spPGOcc`

and consists of posterior samples for all parameters. Note that posterior samples for spatial parameters are stored in the list element `theta.samples`

.

### Posterior predictive checks

For our posterior predictive check, we send the `spPGOcc`

model object to the `ppcOcc()`

function, this time grouping by replicate, or survey occasion, (`group = 2`

) instead of by site (`group = 1`

).

```
Call:
ppcOcc(object = out.sp, fit.stat = "freeman-tukey", group = 2)
Samples per Chain: 10000
Burn-in: 2000
Thinning Rate: 20
Number of Chains: 3
Total Posterior Samples: 1200
Bayesian p-value: 0.3525
Fit statistic: freeman-tukey
```

The Bayesian p-value indicates adequate fit of the spatial occupancy model.

### Model selection using WAIC and k-fold cross-validation

We next use the `waicOcc()`

function to compute the WAIC, which we can compare to the non-spatial model to assess the benefit of incorporating the spatial random effects in the occurrence model.

`waicOcc(out.sp)`

```
elpd pD WAIC
-563.78006 49.80549 1227.17109
```

```
# Compare to non-spatial model
waicOcc(out)
```

```
elpd pD WAIC
-632.960543 6.290149 1278.501382
```

We see the WAIC value for the spatial model is substantially reduced compared to the nonspatial model, indicating that incorporation of the spatial random effects yields an improvement in predictive performance.

k-fold cross-validation is accomplished by specifying the `k.fold`

argument in `spPGOcc`

just as we saw in `PGOcc`

.

### Prediction

Finally, we can perform out of sample prediction using the `predict`

function just as before. Prediction for spatial models is more computationally intensive than for non-spatial models, and so the `predict`

function for `spPGOcc`

class objects also has options for parallelization (`n.omp.threads`

) and reporting sampler progress (`verbose`

and `n.report`

). Note that for `spPGOcc()`

, you also need to supply the coordinates of the out of sample prediction locations in addition to the covariate values.

```
# Do prediction.
coords.0 <- as.matrix(hbefElev[, c('Easting', 'Northing')])
# Approx. run time: 6 min
out.sp.pred <- predict(out.sp, X.0, coords.0, verbose = FALSE)
# Produce a species distribution map (posterior predictive means of occupancy)
plot.dat <- data.frame(x = hbefElev$Easting,
y = hbefElev$Northing,
mean.psi = apply(out.sp.pred$psi.0.samples, 2, mean),
sd.psi = apply(out.sp.pred$psi.0.samples, 2, sd))
dat.stars <- st_as_stars(plot.dat, dims = c('x', 'y'))
ggplot() +
geom_stars(data = dat.stars, aes(x = x, y = y, fill = mean.psi)) +
scale_fill_viridis_c(na.value = 'transparent') +
labs(x = 'Easting', y = 'Northing', fill = '',
title = 'Mean OVEN occurrence probability') +
theme_bw()
```

```
# Produce an uncertainty map of the SDM (posterior predictive SDs of occupancy)
ggplot() +
geom_stars(data = dat.stars, aes(x = x, y = y, fill = sd.psi)) +
scale_fill_viridis_c(na.value = 'transparent') +
labs(x = 'Easting', y = 'Northing', fill = '',
title = 'SD OVEN occurrence probability') +
theme_bw()
```

Comparing this to the non-spatial occupancy model, the spatial model appears to identify areas in HBEF with low OVEN occurrence that are not captured in the non-spatial model. We will resist trying to hypothesize what environmental factors could lead to these patterns.

## Multi-species occupancy models

### Basic model description

Let \(z_{i, j}\) be the true presence (1) or absence (0) of a species \(i\) at site \(j\), with \(j = 1, \dots, J\) and \(i = 1, \dots, N\). We assume the latent occurrence variable arises from a Bernoulli process following

\[\begin{equation} \begin{split} &z_{i, j} \sim \text{Bernoulli}(\psi_{i, j}), \\ &\text{logit}(\psi_{i, j}) = \boldsymbol{x}^{\top}_{j}\boldsymbol{\beta}_i, \end{split} \end{equation}\]

where \(\psi_{i, j}\) is the probability of occurrence of species \(i\) at site \(j\), which is a function of site-specific covariates \(\boldsymbol{X}\) and a vector of species-specific regression coefficients (\(\boldsymbol{\beta}_i\)). The regression coefficients in multi-species occupancy models are envisioned as random effects arising from a common community level distribution:

\[\begin{equation} \boldsymbol{\beta}_i \sim \text{Normal}(\boldsymbol{\mu}_{\beta}, \boldsymbol{T}_{\beta}), \end{equation}\]

where \(\boldsymbol{\mu}_{\beta}\) is a vector of community level mean effects for each occurrence covariate effect (including the intercept) and \(\boldsymbol{T}_{\beta}\) is a diagonal matrix with diagonal elements \(\boldsymbol{\tau}^2_{\beta}\) that represent the variability of each occurrence covariate effect among species in the community.

We do not directly observe \(z_{i, j}\), but rather we observe an imperfect representation of the latent occurrence process. Let \(y_{i, j, k}\) be the observed detection (1) or nondetection (0) of a species \(i\) at site \(j\) during replicate/survey \(k\) for each of \(k = 1, \dots, K_j\) replicates at each site \(j\). We envision the detection-nondetection data as arising from a Bernoulli process conditional on the true latent occurrence process:

\[\begin{equation} \begin{split} &y_{i, j, k} \sim \text{Bernoulli}(p_{i, j, k}z_{i, j}), \\ &\text{logit}(p_{i, j, k}) = \boldsymbol{v}^{\top}_{i, j, k}\boldsymbol{\alpha}_i, \end{split} \end{equation}\]

where \(p_{i, j, k}\) is the probability of detecting species \(i\) at site \(j\) during replicate \(k\) (given it is present at site \(j\)), which is a function of site and replicate specific covariates \(\boldsymbol{V}\) and a vector of species-specific regression coefficients (\(\boldsymbol{\alpha}_i\)). Similarly to the occurrence regression coefficients, the species specific detection coefficients are envisioned as random effects arising from a common community level distribution:

\[\begin{equation} \boldsymbol{\alpha}_i \sim \text{Normal}(\boldsymbol{\mu}_{\alpha}, \boldsymbol{T}_{\alpha}), \end{equation}\]

where \(\boldsymbol{\mu}_{\alpha}\) is a vector of community level mean effects for each detection covariate effect (including the intercept) and \(\boldsymbol{T}_{\alpha}\) is a diagonal matrix with diagonal elements \(\boldsymbol{\tau}^2_{\alpha}\) that represent the variability of each detection covariate effect among species in the community.

To complete the Bayesian specification of the model, we assign multivariate normal priors for the occurrence (\(\boldsymbol{\mu}_{\beta}\)) and detection (\(\boldsymbol{\mu}_{\alpha}\)) community-level regression coefficient means and independent inverse-Gamma priors for each element of \(\boldsymbol{\tau}^2_{\beta}\) and \(\boldsymbol{\tau}^2_{\alpha}\). We again use Pólya-Gamma data augmentation to yield an efficient implementation of the multi-species occupancy model, which is described in depth in the MCMC sampler vignette.

### Fitting multi-species occupancy models with `msPGOcc()`

`spOccupancy`

uses nearly identical syntax for fitting multi-species occupancy models as it does for single-species models and provides the same functionality for posterior predictive checks, model assessment and selection using WAIC and k-fold cross-validation, and out of sample prediction. The `msPGOcc()`

function fits nonspatial multi-species occupancy models using Pólya-Gamma latent variables, which results in substantial increases in run time compared to standard implementations of logit link multi-species occupancy models. `msPGOcc()`

has exactly the same arguments as `PGOcc()`

:

```
msPGOcc(occ.formula, det.formula, data, inits, n.samples, priors,
n.omp.threads = 1, verbose = TRUE, n.report = 100,
n.burn = round(.10 * n.samples), n.thin = 1, n.chains = 1,
k.fold, k.fold.threads = 1, k.fold.seed, ...)
```

For illustration, we will again use the Hubbard Brook data in `hbef2015`

, but we will now model occurrence for all 12 species in the community. Below we reload the `hbef2015`

data set to get a fresh copy.

```
data(hbef2015)
# str(hbef2015) # Remind ourselves of what's in it (not shown).
```

We will model occurrence for all species as a function of linear and quadratic terms for elevation, and detection as a function of linear and quadratic day of survey as well as the time of day during which the survey was conducted. These models are specified in `occ.formula`

and `det.formula`

as before, which reference variables stored in the `data`

list. Random intercepts can be included in both the occurrence and detection portions of the occupancy model using `lme4`

syntax (Bates et al. 2015). For multi-species models, the multi-species detection-nondetection data `y`

is now a three-dimensional array with dimensions corresponding to species, sites, and replicates. This is how the data are provided in the `hbef2015`

object, so we don’t need to do any additional preparations.

```
occ.ms.formula <- ~ scale(Elevation) + I(scale(Elevation)^2)
det.ms.formula <- ~ scale(day) + scale(tod) + I(scale(day)^2)
str(hbef2015)
```

```
List of 4
$ y : num [1:12, 1:373, 1:3] 0 0 0 0 0 1 0 0 0 0 ...
..- attr(*, "dimnames")=List of 3
.. ..$ : chr [1:12] "AMRE" "BAWW" "BHVI" "BLBW" ...
.. ..$ : chr [1:373] "1" "2" "3" "4" ...
.. ..$ : chr [1:3] "1" "2" "3"
$ occ.covs: num [1:373, 1] 475 494 546 587 588 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "Elevation"
$ det.covs:List of 2
..$ day: num [1:373, 1:3] 156 156 156 156 156 156 156 156 156 156 ...
..$ tod: num [1:373, 1:3] 330 346 369 386 409 425 447 463 482 499 ...
$ coords : num [1:373, 1:2] 280000 280000 280000 280001 280000 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:373] "1" "2" "3" "4" ...
.. ..$ : chr [1:2] "X" "Y"
```

Next we specify the initial values in `inits`

. For multi-species occupancy models, we supply initial values for community-level and species-level parameters. In `msPGOcc()`

, we will supply initial values for the following parameters: `alpha.comm`

(community-level detection coefficients), `beta.comm`

(community-level occurrence coefficients), `alpha`

(species-level detection coefficients), `beta`

(species-level occurrence coefficients), `tau.sq.beta`

(community-level occurrence variance parameters), `tau.sq.alpha`

(community-level detection variance parameters, `z`

(latent occurrence variables for all species). These are all specified in a single list. Initial values for community-level parameters are either vectors of length corresponding to the number of community-level detection or occurrence parameters in the model (including the intercepts) or a single value if all parameters are assigned the same initial values. Initial values for species level parameters are either matrices with the number of rows indicating the number of species, and each column corresponding to a different regression parameter, or a single value if the same initial value is used for all species and parameters. The initial values for the latent occurrence matrix are specified as a matrix with \(N\) rows corresponding to the number of species and \(J\) columns corresponding to the number of sites.

```
N <- dim(hbef2015$y)[1]
ms.inits <- list(alpha.comm = 0,
beta.comm = 0,
beta = 0,
alpha = 0,
tau.sq.beta = 1,
tau.sq.alpha = 1,
z = apply(hbef2015$y, c(1, 2), max, na.rm = TRUE))
```

In multi-species models, we specify priors on the community-level coefficients (or hyperparameters) rather than the species-level effects. For nonspatial models, these priors are specified with the following tags: `beta.comm.normal`

(normal prior on the community-level occurrence mean effects), `alpha.comm.normal`

(normal prior on the community-level detection mean effects), `tau.sq.beta.ig`

(inverse-Gamma prior on the community-level occurrence variance parameters), `tau.sq.alpha.ig`

(inverse-Gamma prior on the community-level detection variance parameters). Each tag consists of a list with elements corresponding to the mean and variance for normal priors and scale and shape for inverse-Gamma priors. Values can be specified individually for each parameter or as a single value if the same prior is assigned to all parameters of a given type.

By default, we set the prior hyperparameter values for the community-level means to a mean of 0 and a variance of 2.72, which results in a vague prior on the probability scale as we discussed for the single-species spatial occupancy model. Below we specify these priors explicitly. For the community-level variance parameters, by default we set the scale and shape parameters to 0.1 following the recommendations of (Lunn et al. 2013), which results in a weakly informative prior on the community-level variances. This may lead to shrinkage of the community-level variance towards zero under certain circumstances so that all species will have fairly similar values for the species-specific covariate effect (Gelman 2006), although we have found multi-species occupancy models to be relatively robust to this prior specification. However, if you want to explore the impact of this prior on species-specific covariate effects, we recommend running single-species occupancy models for a select number of species in the data set, and assessing how large the differences in the estimated species-specific effects are between the single-species model estimates and those from the multi-species model. Below we explicitly set these default prior values for our HBEF example.

```
ms.priors <- list(beta.comm.normal = list(mean = 0, var = 2.72),
alpha.comm.normal = list(mean = 0, var = 2.72),
tau.sq.beta.ig = list(a = 0.1, b = 0.1),
tau.sq.alpha.ig = list(a = 0.1, b = 0.1))
```

All that’s now left to do is specify the number of threads to use (`n.omp.threads`

), the number of MCMC samples (`n.samples`

), the amount of samples to discard as burn-in (`n.burn`

), the thinning rate (`n.thin`

), and arguments to control the display of sampler progress (`verbose`

, `n.report`

).

```
# Approx. run time: 6 min
out.ms <- msPGOcc(occ.formula = occ.ms.formula,
det.formula = det.ms.formula,
data = hbef2015,
inits = ms.inits,
n.samples = 30000,
priors = ms.priors,
n.omp.threads = 1,
verbose = TRUE,
n.report = 6000,
n.burn = 10000,
n.thin = 50,
n.chains = 3)
```

```
----------------------------------------
Preparing to run the model
----------------------------------------
----------------------------------------
Model description
----------------------------------------
Multispecies Occupancy Model with Polya-Gamma latent
variable fit with 373 sites and 12 species.
Samples per Chain: 30000
Burn-in: 10000
Thinning Rate: 50
Number of Chains: 3
Total Posterior Samples: 1200
Source compiled with OpenMP support and model fit using 1 thread(s).
----------------------------------------
Chain 1
----------------------------------------
Sampling ...
Sampled: 6000 of 30000, 20.00%
-------------------------------------------------
Sampled: 12000 of 30000, 40.00%
-------------------------------------------------
Sampled: 18000 of 30000, 60.00%
-------------------------------------------------
Sampled: 24000 of 30000, 80.00%
-------------------------------------------------
Sampled: 30000 of 30000, 100.00%
----------------------------------------
Chain 2
----------------------------------------
Sampling ...
Sampled: 6000 of 30000, 20.00%
-------------------------------------------------
Sampled: 12000 of 30000, 40.00%
-------------------------------------------------
Sampled: 18000 of 30000, 60.00%
-------------------------------------------------
Sampled: 24000 of 30000, 80.00%
-------------------------------------------------
Sampled: 30000 of 30000, 100.00%
----------------------------------------
Chain 3
----------------------------------------
Sampling ...
Sampled: 6000 of 30000, 20.00%
-------------------------------------------------
Sampled: 12000 of 30000, 40.00%
-------------------------------------------------
Sampled: 18000 of 30000, 60.00%
-------------------------------------------------
Sampled: 24000 of 30000, 80.00%
-------------------------------------------------
Sampled: 30000 of 30000, 100.00%
```

The resulting object `out.ms`

is a list of class `msPGOcc()`

consisting primarily of posterior samples of all community and species-level parameters, as well as some additional objects that are used for summaries, prediction, and model fit evaluation. We can display a nice summary of these results using the `summary()`

function as before. For multi-species objects, when using summary we need to specify the level of parameters we want to summarize. We do this using the argument `level`

, which takes values `community`

, `species`

, or `both`

to print results for community-level parameters, species-level parameters, or all parameters. By default, `level`

is set to `both`

to display both species and community-level parameters.

`summary(out.ms)`

```
Call:
msPGOcc(occ.formula = occ.ms.formula, det.formula = det.ms.formula,
data = hbef2015, inits = ms.inits, priors = ms.priors, n.samples = 30000,
n.omp.threads = 1, verbose = TRUE, n.report = 6000, n.burn = 10000,
n.thin = 50, n.chains = 3)
Samples per Chain: 30000
Burn-in: 10000
Thinning Rate: 50
Number of Chains: 3
Total Posterior Samples: 1200
Run Time (min): 8.0261
----------------------------------------
Community Level
----------------------------------------
Occurrence Means (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept) 0.3543 0.8370 -1.3016 0.3240 1.9512 1.0004 978
scale(Elevation) 0.2679 0.5016 -0.7169 0.2776 1.2843 1.0060 1200
I(scale(Elevation)^2) -0.2165 0.2595 -0.7027 -0.2207 0.3206 1.0020 1101
Occurrence Variances (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept) 10.9032 6.5399 3.9471 9.3436 26.5235 1.0116 712
scale(Elevation) 2.9349 1.8947 1.0017 2.4329 7.8181 1.0053 767
I(scale(Elevation)^2) 0.6147 0.4394 0.1440 0.4932 1.7338 1.0208 941
Detection Means (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept) -0.7672 0.4744 -1.7072 -0.7646 0.1168 1.0143 877
scale(day) 0.0589 0.0840 -0.1060 0.0604 0.2326 0.9991 1200
scale(tod) -0.0395 0.0767 -0.1867 -0.0413 0.1152 1.0022 1200
I(scale(day)^2) -0.0246 0.0842 -0.1770 -0.0282 0.1552 1.0006 1318
Detection Variances (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept) 2.7007 1.6171 0.8984 2.3029 6.8622 1.0259 299
scale(day) 0.0655 0.0368 0.0238 0.0562 0.1667 1.0000 1328
scale(tod) 0.0507 0.0360 0.0173 0.0423 0.1320 1.0371 1052
I(scale(day)^2) 0.0559 0.0439 0.0178 0.0458 0.1574 1.0376 1119
----------------------------------------
Species Level
----------------------------------------
Occurrence (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept)-AMRE -1.8107 2.1105 -4.5386 -2.4000 3.3652 1.0593 66
(Intercept)-BAWW 2.0466 1.5090 -0.1170 1.8089 5.9299 1.0190 111
(Intercept)-BHVI 3.1041 2.3441 0.0144 2.6511 8.9521 1.0531 156
(Intercept)-BLBW 2.5512 0.5865 1.7183 2.4376 4.0163 1.0083 1044
(Intercept)-BLPW -4.6594 0.6160 -5.9920 -4.6079 -3.5216 1.0030 915
(Intercept)-BTBW 3.6416 0.4762 2.8556 3.6093 4.7216 1.0039 1387
(Intercept)-BTNW 2.4139 0.3400 1.8539 2.3723 3.1978 0.9995 1200
(Intercept)-CAWA -2.1378 0.5583 -3.0914 -2.1840 -0.9588 1.0051 598
(Intercept)-MAWA -1.7047 0.2654 -2.2323 -1.6992 -1.1922 1.0083 1102
(Intercept)-NAWA -2.5000 0.5463 -3.4426 -2.5546 -1.3104 1.0062 617
(Intercept)-OVEN 2.1586 0.2735 1.6550 2.1462 2.7440 1.0045 1200
(Intercept)-REVI 3.1395 0.4624 2.3684 3.1025 4.1672 1.0059 921
scale(Elevation)-AMRE 0.8762 1.0901 -1.1293 0.8184 3.1547 1.0240 248
scale(Elevation)-BAWW -0.7958 1.0684 -3.9238 -0.5719 0.5389 1.0023 157
scale(Elevation)-BHVI 0.1010 1.0025 -1.9299 0.1079 1.9618 1.0150 615
scale(Elevation)-BLBW -0.4381 0.2303 -0.9829 -0.4111 -0.0692 1.0089 1079
scale(Elevation)-BLPW 2.6891 0.6631 1.6980 2.5727 4.1881 1.0116 718
scale(Elevation)-BTBW -0.5043 0.1649 -0.8348 -0.5021 -0.2107 1.0024 1200
scale(Elevation)-BTNW 0.6480 0.2862 0.1913 0.6210 1.2933 1.0075 1009
scale(Elevation)-CAWA 1.8800 0.6213 0.8493 1.8250 3.1988 1.0011 750
scale(Elevation)-MAWA 1.5670 0.2892 1.0519 1.5393 2.1608 0.9993 1421
scale(Elevation)-NAWA 1.1545 0.3681 0.5569 1.1172 1.9559 1.0007 963
scale(Elevation)-OVEN -1.6726 0.2695 -2.2725 -1.6422 -1.2330 1.0062 1200
scale(Elevation)-REVI -2.1313 0.7091 -3.6974 -2.0146 -1.1043 1.0082 792
I(scale(Elevation)^2)-AMRE -0.6650 0.6718 -2.0704 -0.6461 0.6285 1.0058 310
I(scale(Elevation)^2)-BAWW -0.9815 0.5035 -1.9458 -0.9884 0.1314 1.0054 428
I(scale(Elevation)^2)-BHVI 0.4369 0.7635 -0.8651 0.3335 2.2730 1.0225 456
I(scale(Elevation)^2)-BLBW -0.5725 0.1888 -0.9976 -0.5535 -0.2554 1.0050 1200
I(scale(Elevation)^2)-BLPW 0.7498 0.3787 -0.0799 0.7684 1.4496 1.0055 957
I(scale(Elevation)^2)-BTBW -1.0344 0.1657 -1.3904 -1.0289 -0.7354 1.0021 1200
I(scale(Elevation)^2)-BTNW -0.1412 0.1713 -0.4694 -0.1397 0.1908 1.0020 1200
I(scale(Elevation)^2)-CAWA -0.5376 0.4077 -1.3442 -0.5209 0.2371 0.9999 1100
I(scale(Elevation)^2)-MAWA 0.2938 0.2421 -0.1924 0.3018 0.7695 0.9996 1200
I(scale(Elevation)^2)-NAWA 0.2807 0.2661 -0.2549 0.2872 0.7932 1.0004 1200
I(scale(Elevation)^2)-OVEN -0.4236 0.1852 -0.7832 -0.4328 -0.0435 1.0063 1200
I(scale(Elevation)^2)-REVI -0.0335 0.3699 -0.6636 -0.0664 0.7631 1.0070 865
Detection (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept)-AMRE -2.9297 1.4415 -5.6275 -2.8097 -0.5810 1.0243 83
(Intercept)-BAWW -2.6905 0.3594 -3.3086 -2.7218 -1.8952 1.0352 340
(Intercept)-BHVI -2.7047 0.2316 -3.1381 -2.7138 -2.2058 1.0174 398
(Intercept)-BLBW -0.0359 0.1256 -0.2896 -0.0376 0.2047 1.0058 1200
(Intercept)-BLPW -0.4388 0.2629 -0.9502 -0.4314 0.0596 1.0020 1200
(Intercept)-BTBW 0.6365 0.1097 0.4368 0.6346 0.8605 1.0010 1200
(Intercept)-BTNW 0.5921 0.1129 0.3705 0.5908 0.8058 1.0097 1200
(Intercept)-CAWA -1.5263 0.5334 -2.6624 -1.5019 -0.5798 1.0132 583
(Intercept)-MAWA -0.7789 0.2419 -1.2226 -0.7784 -0.2928 1.0026 1200
(Intercept)-NAWA -1.5088 0.4980 -2.5089 -1.4999 -0.5645 1.0005 750
(Intercept)-OVEN 0.7971 0.1207 0.5621 0.7997 1.0311 1.0060 1200
(Intercept)-REVI 0.5440 0.1067 0.3347 0.5435 0.7494 1.0020 1200
scale(day)-AMRE 0.0434 0.2364 -0.3831 0.0330 0.5803 1.0087 1200
scale(day)-BAWW 0.2136 0.1379 -0.0410 0.2118 0.4997 1.0025 1200
scale(day)-BHVI 0.2082 0.1230 -0.0238 0.2060 0.4398 1.0180 1200
scale(day)-BLBW -0.2208 0.0651 -0.3477 -0.2183 -0.0926 1.0106 1200
scale(day)-BLPW 0.0707 0.1512 -0.2244 0.0707 0.3578 1.0013 1200
scale(day)-BTBW 0.2680 0.0678 0.1354 0.2668 0.3958 1.0041 1200
scale(day)-BTNW 0.1522 0.0671 0.0192 0.1545 0.2838 1.0034 1200
scale(day)-CAWA -0.0115 0.1715 -0.3490 -0.0061 0.3302 1.0119 1200
scale(day)-MAWA 0.1134 0.1268 -0.1379 0.1106 0.3603 1.0010 1200
scale(day)-NAWA 0.0302 0.1780 -0.3133 0.0292 0.3746 1.0059 1200
scale(day)-OVEN -0.0747 0.0782 -0.2262 -0.0772 0.0692 1.0017 994
scale(day)-REVI -0.0719 0.0663 -0.2053 -0.0692 0.0609 1.0078 1013
scale(tod)-AMRE -0.0369 0.2195 -0.5051 -0.0311 0.3818 1.0028 1104
scale(tod)-BAWW -0.1770 0.1323 -0.4445 -0.1678 0.0687 1.0019 1270
scale(tod)-BHVI -0.0482 0.1108 -0.2573 -0.0445 0.1703 1.0036 1200
scale(tod)-BLBW 0.0573 0.0644 -0.0675 0.0551 0.1895 1.0014 1200
scale(tod)-BLPW 0.1075 0.1463 -0.1677 0.1094 0.3982 1.0048 1200
scale(tod)-BTBW -0.0338 0.0669 -0.1733 -0.0332 0.0912 1.0112 1589
scale(tod)-BTNW 0.0402 0.0642 -0.0863 0.0402 0.1644 1.0064 1063
scale(tod)-CAWA -0.2118 0.1639 -0.5613 -0.2053 0.0806 1.0003 1021
scale(tod)-MAWA 0.0216 0.1151 -0.2088 0.0249 0.2393 1.0043 1200
scale(tod)-NAWA -0.0600 0.1591 -0.3768 -0.0564 0.2305 1.0105 1200
scale(tod)-OVEN -0.0455 0.0728 -0.1935 -0.0457 0.0959 1.0025 1269
scale(tod)-REVI -0.0783 0.0661 -0.2009 -0.0782 0.0536 1.0077 1200
I(scale(day)^2)-AMRE -0.1226 0.2318 -0.5927 -0.1243 0.3385 1.0053 1580
I(scale(day)^2)-BAWW -0.0423 0.1500 -0.3701 -0.0339 0.2375 1.0145 817
I(scale(day)^2)-BHVI 0.0475 0.1273 -0.1930 0.0456 0.3014 1.0117 1090
I(scale(day)^2)-BLBW -0.1748 0.0793 -0.3251 -0.1748 -0.0090 1.0166 1200
I(scale(day)^2)-BLPW 0.1902 0.1800 -0.1453 0.1816 0.5770 1.0006 1200
I(scale(day)^2)-BTBW -0.0616 0.0782 -0.2208 -0.0608 0.0869 1.0035 1200
I(scale(day)^2)-BTNW -0.0666 0.0773 -0.2194 -0.0677 0.0947 1.0072 1200
I(scale(day)^2)-CAWA -0.0336 0.1827 -0.3951 -0.0347 0.3359 1.0072 1200
I(scale(day)^2)-MAWA 0.0051 0.1327 -0.2636 0.0079 0.2564 1.0057 1200
I(scale(day)^2)-NAWA -0.1335 0.1874 -0.5206 -0.1331 0.2350 1.0099 1200
I(scale(day)^2)-OVEN 0.0176 0.0877 -0.1537 0.0177 0.1852 1.0002 1200
I(scale(day)^2)-REVI 0.0372 0.0772 -0.1105 0.0360 0.1877 1.0011 1200
```

```
# Or more explicitly
# summary(out.ms, level = 'both')
```

Looking at the community-level variance parameters, we see substantial variability in the average occurrence (the intercept) for the twelve species, as well as considerable variability in the effect of elevation across the community. There appears to be less variability across species in the detection portion of the model. We can look directly at the species-specific effects to confirm this. All community-level Rhat values are less than 1.1 and most species-level Rhat values are less than 1.1, with the exception of a few of the parameters for rare species. Mixing of the MCMC chains appears adequate when looking at the ESS values, although we can clearly see the ESS is lower for rare species (e.g., AMRE, BAWW) compared to more common species in the forest (e.g., OVEN, REVI).

### Posterior predictive checks

We can use the `ppcOcc()`

function to perform a posterior predictive check, and summarize the check with a Bayesian p-value using the `summary()`

function, all as shown before. The `summary()`

function again requires the `level`

argument to specify if you want an overall Bayesian p-value for the entire community (`level = 'community'`

), each individual species (`level = 'species'`

), or for both (`level = 'both'`

). By default, we set `level = ‘both’.

```
# Approx. run time: 20 sec
ppc.ms.out <- ppcOcc(out.ms, 'chi-squared', group = 1)
```

`Currently on species 1 out of 12`

`Currently on species 2 out of 12`

`Currently on species 3 out of 12`

`Currently on species 4 out of 12`

`Currently on species 5 out of 12`

`Currently on species 6 out of 12`

`Currently on species 7 out of 12`

`Currently on species 8 out of 12`

`Currently on species 9 out of 12`

`Currently on species 10 out of 12`

`Currently on species 11 out of 12`

`Currently on species 12 out of 12`

`summary(ppc.ms.out)`

```
Call:
ppcOcc(object = out.ms, fit.stat = "chi-squared", group = 1)
Samples per Chain: 30000
Burn-in: 10000
Thinning Rate: 50
Number of Chains: 3
Total Posterior Samples: 1200
----------------------------------------
Community Level
----------------------------------------
Bayesian p-value: 0.3697
----------------------------------------
Species Level
----------------------------------------
AMRE Bayesian p-value: 0.585
BAWW Bayesian p-value: 0.6608
BHVI Bayesian p-value: 0.5775
BLBW Bayesian p-value: 0.1042
BLPW Bayesian p-value: 0.6292
BTBW Bayesian p-value: 0.21
BTNW Bayesian p-value: 0.04
CAWA Bayesian p-value: 0.4733
MAWA Bayesian p-value: 0.4708
NAWA Bayesian p-value: 0.545
OVEN Bayesian p-value: 0.0533
REVI Bayesian p-value: 0.0875
Fit statistic: chi-squared
```

The Bayesian p-value for the overall community suggests an adequate model fit, but we see clear variation in the values for each individual species. We should explore this further in a complete analysis.

### Model selection using WAIC and k-fold cross-validation

We can compute the WAIC for comparison with alternative models using the `waicOcc()`

function.

`waicOcc(out.ms)`

```
elpd pD WAIC
-4531.71692 65.44891 9194.33166
```

k-fold cross-validation is again accomplished using the `k.fold`

argument as we have seen previously. For multi-species occupancy models, using multiple threads can greatly reduce the time needed for k-fold cross-validation, so we encourage the use of multiple threads if such computing power is readily available. Using up to \(k\) threads will generally involve substantial decreases in run time. For multi-species models, a separate deviance measure is reported for each species as a measure of predictive capacity, allowing for comparisons across multiple models for individual species, as well as for the entire community (by summing all species-specific values).

## Multi-species spatial occupancy models

### Basic model description

Residual spatial autocorrelation may perhaps be more prominent in multi-species occupancy models compared to single-species models, as a single set of covariates is used to explain occurrence probability across a region of interest for every one of the species in the modeled community. Given the large variety individual species show in habitat requirements, this may result in important drivers of occurrence probability not being included for certain species, resulting in many species having high residual spatial autocorrelation. We extend the previous multi-species occupancy model to incorporate a distinct spatial Gaussian Process (GP) for each species that accounts for unexplained spatial variation in each individual species occurrence across a spatial region. Occurrence probability for species \(i\) at site \(j\) with spatial coordinates \(\boldsymbol{s}_j\), \(\psi_{i}(\boldsymbol{s}_j)\), now takes the form

\[\begin{equation} \text{logit}(\psi_{i}(\boldsymbol{s}_j)) = \boldsymbol{x}^{\top}(\boldsymbol{s}_j)\boldsymbol{\beta}_i + \omega_{i}(\boldsymbol{s}_j), \end{equation}\]

where the species-specific regression coefficients \(\boldsymbol{\beta}_i\) follow the community-level distribution described previously for nonspatial multi-species occupancy models, and \(\omega_{i}(\boldsymbol{s}_j)\) is a realization from a zero-mean spatial GP, i.e.,

\[\begin{equation} \boldsymbol{\omega}_{i}(\boldsymbol{s}) \sim \text{Normal}(\boldsymbol{0}, \boldsymbol{\Sigma}_i(\boldsymbol{s}, \boldsymbol{s}', \boldsymbol{\theta}_i)). \end{equation}\]

We define \(\boldsymbol{\Sigma}_i(\boldsymbol{s}, \boldsymbol{s}', \boldsymbol{\theta}_i)\) as a \(J \times J\) covariance matrix that is a function of the distances between any pair of site coordinates \(\boldsymbol{s}\) and \(\boldsymbol{s}'\) and a set of parameters \((\boldsymbol{\theta}_i)\) that govern the spatial process. The vector \(\boldsymbol{\theta}_i\) is equal to \(\boldsymbol{\theta}_i = \{\sigma^2_i, \phi_i, \nu_i\}\), where \(\sigma^2_i\) is a spatial variance parameter for species \(i\), \(\phi_i\) is a spatial decay parameter for species \(i\), and \(\nu_i\) is a spatial smoothness parameter for species \(i\), where as before, \(\nu_i\) is only specified when using a Matern correlation function.

The detection portion of the multi-species spatial occupancy model remains unchanged from the non-spatial multi-species occupancy model. We fit the model again using Pólya-Gamma data augmentation to enable an efficient Gibbs sampler (see MCMC sampler vignette for details). Similar to our discussion on the single-species spatial occupancy model, we also allow for specification of the spatial process using an NNGP instead of a full GP. This leads to even larger computational gains over the full GP than in the case of a single-species model, given that a separate covariance matrix is specified for each species in the model. See Datta et al. (2016), Finley, Datta, and Banerjee (2020), and the MCMC sampler vignette for additional details on NNGPs and their implementation in multi-species spatial occupancy models.

### Fitting multi-species spatial occupancy models with `spMsPGOcc()`

The function `spMsPGOcc()`

fits spatially explicit multi-species occupancy models. Similar to single-species models using `spPGOcc()`

, models can be fit using either a full Gaussian Process (GP) or a Nearest Neighbor Gaussian Process (NNGP). `spMsPGOcc()`

fits a separate spatial process for each species. The syntax for `spMsPGOcc()`

is analogous to the syntax for single-species spatially-explicit models using `spPGOcc()`

.

```
spMsPGOcc(occ.formula, det.formula, data, inits, n.batch,
batch.length, accept.rate = 0.43, priors,
cov.model = "exponential", tuning, n.omp.threads = 1,
verbose = TRUE, NNGP = TRUE, n.neighbors = 15,
search.type = "cb", n.report = 100,
n.burn = round(.10 * n.batch * batch.length), n.thin = 1,
n.chains = 1, k.fold, k.fold.threads = 1, k.fold.seed, ...)
```

We will again showcase the model using the HBEF foliage-gleaning bird data set, with the same predictors in our occurrence and detection models

```
occ.ms.sp.formula <- ~ scale(Elevation) + I(scale(Elevation)^2)
det.ms.sp.formula <- ~ scale(day) + scale(tod) + I(scale(day)^2)
```

Our initial values in the `inits`

argument will look analogous to what we had specified for the nonspatial multi-species occupancy model using `msPGOcc()`

, but we will now also include additional initial values for the parameters controlling the spatial processes: `sigma.sq`

is the species-specific spatial variance parameter, `phi`

is the species specific spatial decay parameter, and `w`

is the latent spatial process for each species at each site. We will use an exponential covariance model, but note that when using a Matern covariance model we must also specify initial values for `nu`

, the species-specific spatial smoothness parameter. Note that all species-specific spatial parameters are independent of each other. We currently do not leverage any correlation between spatial processes of different species, although this is something we plan to incorporate for future `spOccupancy`

development. Initial values for `phi`

, `sigma.sq`

, and `nu`

(if applicable) are specified as vectors with \(N\) elements (the number of species being modeled) or as a single value that is used for all species, while the initial values for the latent spatial processes are specified as a matrix with \(N\) rows (i.e., species) and \(J\) columns (i.e., sites). Here we set the initial value for the spatial variances equal to 2 for all species and set the initial values for the spatial decay parameter to yield an effective range of the average distance between sites across the HBEF.

```
# Number of species
N <- dim(hbef2015$y)[1]
# Distances between sites
dist.hbef <- dist(hbef2015$coords)
# Exponential covariance model
cov.model <- "exponential"
ms.inits <- list(alpha.comm = 0,
beta.comm = 0,
beta = 0,
alpha = 0,
tau.sq.beta = 1,
tau.sq.alpha = 1,
z = apply(hbef2015$y, c(1, 2), max, na.rm = TRUE),
sigma.sq = 2,
phi = 3 / mean(dist.hbef),
w = matrix(0, N, dim(hbef2015$y)[2]))
```

We next specify the priors in the `priors`

argument. The priors are the same as those we specified for the non-spatial multi-species model, with the addition of priors for the parameters controlling the species-specific spatial processes. We assume independent priors for all spatial parameters across the different species. Priors (and their default values in `spOccupancy`

) for the spatial parameters are exactly analogous to those we saw for the single-species spatially-explicit occupancy model. For each species, we assign an inverse gamma prior for the spatial variance parameter `sigma.sq`

(tag is `sigma.sq.ig`

). As of v0.3.2, we can also specify a uniform prior for the spatial variance parameter with tag `sigma.sq.unif`

(see relevant discussion in single-species spatial occupancy models). We assign uniform priors for the spatial decay parameter `phi`

and smoothness parameter `nu`

(if `cov.model = 'matern'`

), with the associated tags `phi.unif`

and `nu.unif`

. All priors are specified as lists with two elements. For the inverse-Gamma prior, the first element is a length \(N\) vector of shape parameters for each species, and the second element is a length \(N\) vector of scale parameters for each species. As we have seen many times before, if the same prior is used for all species, both elements can be specified as single values. For the uniform priors, the first element is a length \(N\) vector of the lower bounds for each species, and the second element is a length \(N\) vector of upper bounds for each species. If the same prior is used for all species, both the lower and upper bounds can be specified as single values. For the inverse-Gamma prior on the spatial variances, here we set the shape parameter to 2 and the scale parameter also equal to 2. For a more formal analysis, we may want to do some exploratory data analysis to obtain a better guess for the spatial variance for each species, and then replace the scale parameter with this estimated guess for each species. For the spatial decay parameter, we determine the bounds of the uniform distribution by computing the smallest distance between sites and the largest distance between sites. We then set the lower bound of the uniform to `3/max`

and the upper bound to `3/min`

, where `min`

and `max`

correspond to the predetermined distances between sites. Again, these are the default hyperparameter values for the prior distribution of `phi`

that `spOccupancy`

will assign.

```
# Minimum value is 0, so need to grab second element.
min.dist <- sort(unique(dist.hbef))[2]
max.dist <- max(dist.hbef)
ms.priors <- list(beta.comm.normal = list(mean = 0, var = 2.72),
alpha.comm.normal = list(mean = 0, var = 2.72),
tau.sq.beta.ig = list(a = 0.1, b = 0.1),
tau.sq.alpha.ig = list(a = 0.1, b = 0.1),
sigma.sq.ig = list(a = 2, b = 2),
phi.unif = list(a = 3 / max.dist, b = 3 / min.dist))
```

We next set the parameters controlling the Adaptive MCMC algorithm (see `spPGOcc()`

section for details). Notice our specification of the initial tuning values is exactly the same as for `spPGOcc`

. We assume the same initial tuning value for all species. However, the adaptive algorithm will allow for species-specific tuning parameters, so these will be adjusted in the algorithm as needed (and reported to the R console if `verbose = TRUE`

).

```
batch.length <- 25
n.batch <- 400
n.burn <- 2000
n.thin <- 20
n.chains <- 3
ms.tuning <- list(phi = 0.5)
n.omp.threads <- 1
# Values for reporting
verbose <- TRUE
n.report <- 100
```

Spatially explicit multi-species occupancy models are currently the most computationally intensive models fit by `spOccupancy`

. Even for modest sized data sets, we encourage the use of NNGPs instead of full GPs when fitting spatially-explicit models to ease the computational burden of fitting these models. We fit the model with an NNGP below using 5 neighbors and summarize it using the `summary()`

function, where we specify that we want to summarize both species and community-level parameters.

```
# Approx. run time: 10 min
out.sp.ms <- spMsPGOcc(occ.formula = occ.ms.sp.formula,
det.formula = det.ms.sp.formula,
data = hbef2015,
inits = ms.inits,
n.batch = n.batch,
batch.length = batch.length,
accept.rate = 0.43,
priors = ms.priors,
cov.model = cov.model,
tuning = ms.tuning,
n.omp.threads = n.omp.threads,
verbose = TRUE,
NNGP = TRUE,
n.neighbors = 5,
n.report = n.report,
n.burn = n.burn,
n.thin = n.thin,
n.chains = n.chains)
```

```
----------------------------------------
Preparing to run the model
----------------------------------------
----------------------------------------
Building the neighbor list
----------------------------------------
----------------------------------------
Building the neighbors of neighbors list
----------------------------------------
----------------------------------------
Model description
----------------------------------------
NNGP Multispecies Occupancy Model with Polya-Gamma latent
variable fit with 373 sites and 12 species.
Samples per chain: 10000 (400 batches of length 25)
Burn-in: 2000
Thinning Rate: 20
Number of Chains: 3
Total Posterior Samples: 1200
Using the exponential spatial correlation model.
Using 5 nearest neighbors.
Source compiled with OpenMP support and model fit using 1 thread(s).
Adaptive Metropolis with target acceptance rate: 43.0
----------------------------------------
Chain 1
----------------------------------------
Sampling ...
Batch: 100 of 400, 25.00%
Species Parameter Acceptance Tuning
1 phi 60.0 0.50503
2 phi 76.0 0.58092
3 phi 20.0 0.50503
4 phi 12.0 0.50503
5 phi 36.0 0.35946
6 phi 48.0 0.39727
7 phi 32.0 0.46620
8 phi 76.0 0.47561
9 phi 56.0 0.31250
10 phi 20.0 0.80000
11 phi 40.0 0.33853
12 phi 44.0 0.41348
-------------------------------------------------
Batch: 200 of 400, 50.00%
Species Parameter Acceptance Tuning
1 phi 28.0 0.70953
2 phi 44.0 0.45697
3 phi 76.0 0.73849
4 phi 40.0 0.84947
5 phi 28.0 0.36672
6 phi 36.0 0.50503
7 phi 28.0 0.48522
8 phi 88.0 0.59265
9 phi 36.0 0.28847
10 phi 20.0 0.73849
11 phi 52.0 0.31250
12 phi 52.0 0.42183
-------------------------------------------------
Batch: 300 of 400, 75.00%
Species Parameter Acceptance Tuning
1 phi 12.0 0.56941
2 phi 60.0 0.55814
3 phi 72.0 1.05850
4 phi 40.0 0.81616
5 phi 40.0 0.30631
6 phi 44.0 0.62930
7 phi 20.0 0.53625
8 phi 32.0 0.54709
9 phi 40.0 0.31881
10 phi 60.0 0.60462
11 phi 40.0 0.28847
12 phi 68.0 0.42183
-------------------------------------------------
Batch: 400 of 400, 100.00%
----------------------------------------
Chain 2
----------------------------------------
Sampling ...
Batch: 100 of 400, 25.00%
Species Parameter Acceptance Tuning
1 phi 16.0 0.51523
2 phi 24.0 0.92022
3 phi 0.0 1.14666
4 phi 48.0 0.60462
5 phi 48.0 0.39727
6 phi 60.0 0.35234
7 phi 20.0 0.62930
8 phi 48.0 0.55814
9 phi 44.0 0.31250
10 phi 8.0 0.59265
11 phi 52.0 0.30025
12 phi 32.0 0.38169
-------------------------------------------------
Batch: 200 of 400, 50.00%
Species Parameter Acceptance Tuning
1 phi 32.0 0.58092
2 phi 64.0 0.99686
3 phi 8.0 1.14666
4 phi 12.0 0.75341
5 phi 20.0 0.34537
6 phi 56.0 0.37413
7 phi 16.0 0.65498
8 phi 20.0 0.44792
9 phi 40.0 0.31250
10 phi 68.0 0.75341
11 phi 60.0 0.30025
12 phi 48.0 0.48522
-------------------------------------------------
Batch: 300 of 400, 75.00%
Species Parameter Acceptance Tuning
1 phi 64.0 0.59265
2 phi 24.0 0.43035
3 phi 20.0 1.26725
4 phi 48.0 0.81616
5 phi 32.0 0.34537
6 phi 36.0 0.37413
7 phi 36.0 0.49502
8 phi 60.0 0.55814
9 phi 44.0 0.31250
10 phi 36.0 0.88413
11 phi 48.0 0.27716
12 phi 32.0 0.37413
-------------------------------------------------
Batch: 400 of 400, 100.00%
----------------------------------------
Chain 3
----------------------------------------
Sampling ...
Batch: 100 of 400, 25.00%
Species Parameter Acceptance Tuning
1 phi 36.0 0.73849
2 phi 40.0 0.53625
3 phi 24.0 0.81616
4 phi 20.0 0.90199
5 phi 32.0 0.50503
6 phi 48.0 0.38169
7 phi 76.0 0.40529
8 phi 32.0 0.55814
9 phi 28.0 0.28276
10 phi 0.0 1.67674
11 phi 36.0 0.30025
12 phi 44.0 0.38940
-------------------------------------------------
Batch: 200 of 400, 50.00%
Species Parameter Acceptance Tuning
1 phi 80.0 0.66821
2 phi 20.0 0.62930
3 phi 40.0 0.37413
4 phi 36.0 0.95777
5 phi 48.0 0.37413
6 phi 28.0 0.48522
7 phi 60.0 0.46620
8 phi 36.0 0.41348
9 phi 40.0 0.28847
10 phi 36.0 1.48714
11 phi 76.0 0.33853
12 phi 48.0 0.40529
-------------------------------------------------
Batch: 300 of 400, 75.00%
Species Parameter Acceptance Tuning
1 phi 24.0 0.70953
2 phi 40.0 0.52564
3 phi 96.0 0.40529
4 phi 72.0 0.80000
5 phi 92.0 0.36672
6 phi 24.0 0.35234
7 phi 68.0 0.64201
8 phi 48.0 0.44792
9 phi 44.0 0.28276
10 phi 0.0 1.37280
11 phi 32.0 0.31881
12 phi 36.0 0.40529
-------------------------------------------------
Batch: 400 of 400, 100.00%
```

`summary(out.sp.ms, level = 'both')`

```
Call:
spMsPGOcc(occ.formula = occ.ms.sp.formula, det.formula = det.ms.sp.formula,
data = hbef2015, inits = ms.inits, priors = ms.priors, tuning = ms.tuning,
cov.model = cov.model, NNGP = TRUE, n.neighbors = 5, n.batch = n.batch,
batch.length = batch.length, accept.rate = 0.43, n.omp.threads = n.omp.threads,
verbose = TRUE, n.report = n.report, n.burn = n.burn, n.thin = n.thin,
n.chains = n.chains)
Samples per Chain: 10000
Burn-in: 2000
Thinning Rate: 20
Number of Chains: 3
Total Posterior Samples: 1200
Run Time (min): 17.2565
----------------------------------------
Community Level
----------------------------------------
Occurrence Means (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept) 0.3597 1.0176 -1.6049 0.3543 2.4254 1.0011 1032
scale(Elevation) 0.3016 0.6899 -1.0360 0.3053 1.7081 1.0064 883
I(scale(Elevation)^2) -0.2811 0.3538 -0.9803 -0.2856 0.4529 1.0218 723
Occurrence Variances (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept) 20.7689 11.6274 7.2632 17.8888 50.1195 1.0069 187
scale(Elevation) 6.3631 3.9216 1.9893 5.2830 17.1310 1.0900 246
I(scale(Elevation)^2) 1.1639 0.8402 0.2748 0.9586 3.3140 1.0161 258
Detection Means (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept) -0.7027 0.4561 -1.6310 -0.6969 0.2389 1.0159 424
scale(day) 0.0533 0.0887 -0.1241 0.0519 0.2276 1.0034 1200
scale(tod) -0.0367 0.0756 -0.1884 -0.0363 0.1095 1.0007 1200
I(scale(day)^2) -0.0207 0.0863 -0.1922 -0.0177 0.1447 1.0011 1415
Detection Variances (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept) 2.3997 1.3965 0.8328 2.0825 5.6853 1.1251 244
scale(day) 0.0647 0.0350 0.0240 0.0553 0.1576 1.0085 1200
scale(tod) 0.0495 0.0309 0.0163 0.0423 0.1265 1.0196 1200
I(scale(day)^2) 0.0582 0.0390 0.0188 0.0475 0.1583 1.0055 1200
----------------------------------------
Species Level
----------------------------------------
Occurrence (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept)-AMRE -2.7839 2.7995 -6.2666 -3.3099 5.8533 2.4781 20
(Intercept)-BAWW 2.3955 2.3571 -1.0727 1.9490 8.3598 1.8358 33
(Intercept)-BHVI 3.6254 3.0337 -1.1396 3.2159 10.4335 1.4752 47
(Intercept)-BLBW 3.3546 1.0681 2.0451 3.1021 6.0533 1.0495 115
(Intercept)-BLPW -5.7827 1.0190 -8.0843 -5.6333 -4.1817 1.0164 121
(Intercept)-BTBW 5.1493 1.1764 3.4881 4.9594 8.0912 1.0921 81
(Intercept)-BTNW 3.3242 0.8880 2.0654 3.1474 5.6730 1.0717 94
(Intercept)-CAWA -2.9981 0.9436 -5.0567 -2.9225 -1.2636 1.0179 129
(Intercept)-MAWA -3.8120 1.2331 -6.4453 -3.6571 -1.7271 1.1076 92
(Intercept)-NAWA -3.1808 0.7509 -4.9280 -3.1228 -1.7643 1.0078 218
(Intercept)-OVEN 3.6986 1.1148 1.5970 3.6054 6.2405 1.2508 80
(Intercept)-REVI 4.3757 1.0710 2.8313 4.1667 7.1326 1.1002 109
scale(Elevation)-AMRE 1.2484 1.3159 -0.8497 1.1333 4.2380 1.0957 96
scale(Elevation)-BAWW -1.1052 1.7476 -6.4918 -0.7208 1.5054 2.0653 32
scale(Elevation)-BHVI 0.2855 1.3816 -2.4782 0.2416 3.2837 1.0068 169
scale(Elevation)-BLBW -0.5659 0.5475 -1.4730 -0.4810 -0.0115 1.4423 76
scale(Elevation)-BLPW 3.1770 0.8314 1.9339 3.0566 5.0817 1.0212 368
scale(Elevation)-BTBW -0.6607 0.3144 -1.3604 -0.6364 -0.0997 1.0356 295
scale(Elevation)-BTNW 0.8616 0.4211 0.2232 0.7996 1.8574 1.0040 535
scale(Elevation)-CAWA 2.5017 0.8703 1.1382 2.4132 4.5013 1.0093 314
scale(Elevation)-MAWA 3.1731 1.0251 1.5879 3.0129 5.4575 1.0739 81
scale(Elevation)-NAWA 1.3518 0.4610 0.5993 1.2988 2.3902 1.0532 467
scale(Elevation)-OVEN -3.0981 0.9067 -5.2549 -2.9631 -1.7147 1.0773 116
scale(Elevation)-REVI -2.9657 1.1394 -5.8219 -2.7039 -1.4194 1.0982 182
I(scale(Elevation)^2)-AMRE -0.8691 0.8807 -2.6965 -0.8757 0.9554 1.2924 109
I(scale(Elevation)^2)-BAWW -1.2662 0.7206 -2.7113 -1.2555 0.2190 1.2940 113
I(scale(Elevation)^2)-BHVI 0.5053 0.9266 -1.3355 0.4406 2.4339 1.0144 241
I(scale(Elevation)^2)-BLBW -0.7195 0.3017 -1.3739 -0.6956 -0.2447 1.0260 151
I(scale(Elevation)^2)-BLPW 1.0131 0.5114 -0.0250 1.0373 1.9844 1.0328 415
I(scale(Elevation)^2)-BTBW -1.4685 0.3629 -2.3426 -1.4104 -0.9319 1.0927 98
I(scale(Elevation)^2)-BTNW -0.1944 0.2512 -0.7000 -0.2055 0.2947 1.0404 844
I(scale(Elevation)^2)-CAWA -0.6908 0.5777 -1.8527 -0.6590 0.4648 1.0016 450
I(scale(Elevation)^2)-MAWA 0.6772 0.5442 -0.2918 0.6537 1.8326 1.0053 517
I(scale(Elevation)^2)-NAWA 0.3880 0.3277 -0.2428 0.3777 1.0331 1.0142 653
I(scale(Elevation)^2)-OVEN -0.8304 0.4283 -1.6853 -0.8213 -0.0175 1.0024 497
I(scale(Elevation)^2)-REVI 0.0063 0.5554 -0.9405 -0.0508 1.2308 1.0512 239
Detection (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept)-AMRE -2.5565 1.4008 -5.4953 -2.2636 -0.2155 1.5742 45
(Intercept)-BAWW -2.5949 0.3923 -3.2812 -2.6311 -1.8298 1.4104 106
(Intercept)-BHVI -2.6425 0.3212 -3.1665 -2.6867 -1.8837 1.7077 34
(Intercept)-BLBW -0.0252 0.1279 -0.2801 -0.0212 0.2179 1.0528 378
(Intercept)-BLPW -0.4344 0.2682 -0.9807 -0.4349 0.0584 1.0014 1099
(Intercept)-BTBW 0.6272 0.1042 0.4179 0.6247 0.8268 0.9995 1088
(Intercept)-BTNW 0.5902 0.1105 0.3695 0.5917 0.8056 1.0080 1200
(Intercept)-CAWA -1.4753 0.5489 -2.6137 -1.4608 -0.4972 1.0107 265
(Intercept)-MAWA -0.6802 0.2323 -1.1362 -0.6740 -0.1938 0.9998 961
(Intercept)-NAWA -1.4636 0.4748 -2.3708 -1.4631 -0.5867 1.0142 567
(Intercept)-OVEN 0.8149 0.1185 0.5891 0.8109 1.0389 1.0053 1200
(Intercept)-REVI 0.5407 0.1087 0.3332 0.5448 0.7498 1.0036 1097
scale(day)-AMRE 0.0377 0.2349 -0.4194 0.0360 0.4849 1.0122 1200
scale(day)-BAWW 0.2076 0.1435 -0.0631 0.2091 0.4954 1.0047 972
scale(day)-BHVI 0.2048 0.1186 -0.0141 0.2015 0.4361 1.0104 1200
scale(day)-BLBW -0.2304 0.0680 -0.3618 -0.2286 -0.0932 1.0012 1200
scale(day)-BLPW 0.0629 0.1562 -0.2476 0.0635 0.3640 1.0004 1200
scale(day)-BTBW 0.2670 0.0665 0.1411 0.2664 0.3976 1.0100 1200
scale(day)-BTNW 0.1461 0.0686 0.0115 0.1453 0.2761 1.0106 1200
scale(day)-CAWA -0.0230 0.1747 -0.3750 -0.0247 0.3076 1.0055 1200
scale(day)-MAWA 0.1047 0.1252 -0.1323 0.1032 0.3559 1.0166 1341
scale(day)-NAWA 0.0126 0.1796 -0.3268 0.0063 0.3637 1.0037 1200
scale(day)-OVEN -0.0758 0.0707 -0.2084 -0.0743 0.0633 1.0037 1200
scale(day)-REVI -0.0754 0.0652 -0.2042 -0.0755 0.0427 1.0019 1200
scale(tod)-AMRE -0.0318 0.2173 -0.4623 -0.0309 0.4051 1.0212 987
scale(tod)-BAWW -0.1747 0.1392 -0.4514 -0.1718 0.0780 1.0016 1200
scale(tod)-BHVI -0.0424 0.1101 -0.2709 -0.0447 0.1704 1.0174 1200
scale(tod)-BLBW 0.0594 0.0673 -0.0738 0.0580 0.1875 1.0121 1093
scale(tod)-BLPW 0.1024 0.1455 -0.1827 0.1033 0.3946 1.0030 1200
scale(tod)-BTBW -0.0361 0.0658 -0.1692 -0.0372 0.0938 1.0018 1200
scale(tod)-BTNW 0.0401 0.0644 -0.0885 0.0412 0.1721 1.0026 1200
scale(tod)-CAWA -0.2041 0.1746 -0.5550 -0.1954 0.1211 1.0044 1200
scale(tod)-MAWA 0.0180 0.1162 -0.2001 0.0147 0.2474 1.0040 1200
scale(tod)-NAWA -0.0610 0.1612 -0.3602 -0.0686 0.2751 1.0123 1200
scale(tod)-OVEN -0.0425 0.0734 -0.1828 -0.0420 0.0964 1.0030 1200
scale(tod)-REVI -0.0800 0.0653 -0.2125 -0.0812 0.0438 1.0060 1200
I(scale(day)^2)-AMRE -0.1050 0.2253 -0.5649 -0.0978 0.3139 1.0013 1241
I(scale(day)^2)-BAWW -0.0382 0.1456 -0.3232 -0.0383 0.2489 1.0033 1200
I(scale(day)^2)-BHVI 0.0552 0.1320 -0.1815 0.0538 0.3287 1.0020 1200
I(scale(day)^2)-BLBW -0.1704 0.0814 -0.3267 -0.1727 -0.0116 1.0029 1200
I(scale(day)^2)-BLPW 0.2060 0.1897 -0.1579 0.2028 0.6264 1.0039 1200
I(scale(day)^2)-BTBW -0.0604 0.0786 -0.2210 -0.0599 0.0990 1.0005 1200
I(scale(day)^2)-BTNW -0.0599 0.0807 -0.2176 -0.0610 0.0969 1.0066 1143
I(scale(day)^2)-CAWA -0.0353 0.1837 -0.3963 -0.0350 0.3161 1.0051 1088
I(scale(day)^2)-MAWA 0.0007 0.1399 -0.2760 0.0019 0.2884 1.0057 1200
I(scale(day)^2)-NAWA -0.1388 0.1903 -0.5233 -0.1325 0.2080 1.0004 1200
I(scale(day)^2)-OVEN 0.0208 0.0883 -0.1535 0.0197 0.1985 0.9996 1200
I(scale(day)^2)-REVI 0.0394 0.0771 -0.1070 0.0390 0.1841 0.9993 1200
Spatial Covariance:
Mean SD 2.5% 50% 97.5% Rhat ESS
sigma.sq-AMRE 2.1899 2.2940 0.4064 1.4374 9.0988 1.1310 102
sigma.sq-BAWW 6.5982 16.9455 0.3967 1.4193 54.3471 2.7628 19
sigma.sq-BHVI 9.7972 19.1097 0.4265 1.7880 70.9977 7.6504 15
sigma.sq-BLBW 2.8018 3.6071 0.3978 1.6198 13.0921 1.2352 78
sigma.sq-BLPW 2.1438 1.9955 0.3600 1.5393 8.2392 1.0245 62
sigma.sq-BTBW 2.8815 2.4739 0.5351 2.0934 10.0705 1.0782 64
sigma.sq-BTNW 3.3354 3.4263 0.4612 2.2685 12.7757 1.0235 90
sigma.sq-CAWA 2.6432 2.5349 0.4600 1.7529 9.9263 1.0666 71
sigma.sq-MAWA 12.9775 9.8118 2.3417 10.9231 34.1622 1.1662 48
sigma.sq-NAWA 1.7696 1.7041 0.3605 1.1927 7.6151 1.1013 103
sigma.sq-OVEN 8.9635 5.9542 2.1749 7.4681 25.3373 1.0642 108
sigma.sq-REVI 3.0870 2.8910 0.5851 2.1013 11.3286 1.1086 86
phi-AMRE 0.0124 0.0084 0.0010 0.0103 0.0291 1.0080 106
phi-BAWW 0.0106 0.0087 0.0008 0.0077 0.0283 1.1699 57
phi-BHVI 0.0135 0.0088 0.0008 0.0126 0.0288 1.8660 59
phi-BLBW 0.0139 0.0079 0.0024 0.0122 0.0292 1.0137 156
phi-BLPW 0.0062 0.0067 0.0007 0.0037 0.0266 1.2457 55
phi-BTBW 0.0068 0.0070 0.0007 0.0040 0.0269 1.1759 41
phi-BTNW 0.0101 0.0077 0.0008 0.0074 0.0284 1.2180 77
phi-CAWA 0.0100 0.0091 0.0006 0.0064 0.0288 1.0189 27
phi-MAWA 0.0019 0.0008 0.0007 0.0018 0.0037 1.1800 150
phi-NAWA 0.0150 0.0085 0.0020 0.0152 0.0293 1.0669 60
phi-OVEN 0.0021 0.0008 0.0007 0.0020 0.0040 1.0501 180
phi-REVI 0.0074 0.0061 0.0011 0.0053 0.0250 1.0618 95
```

The resulting object `out.sp.ms`

is a list of class `spMsPGOcc`

consisting primarily of posterior samples of all community and species-level parameters, as well as some additional objects that are used for summaries, predictions, and model fit evaluation. The Rhat values indicate convergence of most community and species-level parameters, but some parameters for rare species have not yet converged and the ESS of many species-level parameters are quite small. This is a clear reminder that spatially-explicit models often have slow mixing rates, and so you will often need to run these models for a large number of samples in order to achieve convergence and adequate ESS.

### Posterior predictive checks

We perform posterior predictive checks to assess Goodness of Fit using `ppcOcc()`

just as we have previously seen.

```
# Takes a few seconds to run.
ppc.sp.ms.out <- ppcOcc(out.sp.ms, 'freeman-tukey', group = 2)
```

`Currently on species 1 out of 12`

`Currently on species 2 out of 12`

`Currently on species 3 out of 12`

`Currently on species 4 out of 12`

`Currently on species 5 out of 12`

`Currently on species 6 out of 12`

`Currently on species 7 out of 12`

`Currently on species 8 out of 12`

`Currently on species 9 out of 12`

`Currently on species 10 out of 12`

`Currently on species 11 out of 12`

`Currently on species 12 out of 12`

`summary(ppc.sp.ms.out, level = 'both')`

```
Call:
ppcOcc(object = out.sp.ms, fit.stat = "freeman-tukey", group = 2)
Samples per Chain: 10000
Burn-in: 2000
Thinning Rate: 20
Number of Chains: 3
Total Posterior Samples: 1200
----------------------------------------
Community Level
----------------------------------------
Bayesian p-value: 0.5103
----------------------------------------
Species Level
----------------------------------------
AMRE Bayesian p-value: 0.7467
BAWW Bayesian p-value: 0.5342
BHVI Bayesian p-value: 0.5783
BLBW Bayesian p-value: 0.3092
BLPW Bayesian p-value: 0.4025
BTBW Bayesian p-value: 0.4958
BTNW Bayesian p-value: 0.48
CAWA Bayesian p-value: 0.5483
MAWA Bayesian p-value: 0.57
NAWA Bayesian p-value: 0.5992
OVEN Bayesian p-value: 0.385
REVI Bayesian p-value: 0.475
Fit statistic: freeman-tukey
```

### Model selection using WAIC

Below we compute the WAIC using `waicOcc()`

and compare it to the WAIC for the non-spatial multi-species occupancy model.

`waicOcc(out.sp.ms)`

```
elpd pD WAIC
-4198.9219 310.4972 9018.8380
```

`waicOcc(out.ms)`

```
elpd pD WAIC
-4531.71692 65.44891 9194.33166
```

As always, remember there is MC error in these numbers, and so the values you receive will be slightly different. The WAIC for the spatial model is considerably smaller than that for the nonspatial model, indicating the species-specific spatial processes improve prediction across the entire community. However, in a complete analysis we should ensure the models fully converge (and we have adequate ESS) before performing any model selection or comparison.

k-fold cross-validation proceeds using the `k.fold`

argument as we have seen previously, returning a separate scoring rule (deviance) for each species.

### Prediction

Prediction with `spMsPGOcc`

objects again uses the `predict()`

function given a set of covariates and spatial coordinates of a set of locations. Results are very similar to the nonspatial multi-species model, so we do not execute the following code, since prediction in spatial models takes some time and run time is multiplied by the number of species here.

## Single-species integrated occupancy models

Data integration is a model-based approach that leverages multiple data sources to provide inference and prediction on some latent process of interest (Miller et al. 2019). Data integration is particularly relevant in ecology as many data sources are often collected to better understand a single ecological phenomenon, with each data source having advantages and disadvantages, say in terms of sample size, spatial extent, or data reliability. Often, multiple detection-nondetection data sources are available to study the occurrence and distribution of some species of interest. For example, both point count surveys performed by human observers and autonomous recording units could be used to monitor a bird species (Doser et al. 2021). Different types of data have different sources of observation error, which we should explicitly incorporate into a model to avoid attributing any variation in detection probability to the true ecological process. Here we describe single-species integrated occupancy models, which combine multiple sources of detection-nondetection data in a single hierarchical modeling framework. These data sources may or may not be replicated, but we note that the combination of two or more sources always results in a replicated data set.

### Basic model description

The integrated occupancy model has an identical process model to that of the single-species occupancy model, but has a distinct detection model for each data source, each of which is conditional on the same, shared ecological process (i.e., species occurrence).

Let \(z_j\) be the presence or absence of a species at site \(j\), with \(j = 1, \dots, J\). We assume this latent occurrence variable arises from a Bernoulli process following

\[\begin{equation} \begin{split} &z_j \sim \text{Bernoulli}(\psi_j), \\ &\text{logit}(\psi_j) = \boldsymbol{x}^{\top}_j\boldsymbol{\beta}, \end{split} \end{equation}\]

where \(\psi_j\) is the probability of occurrence at site \(j\), which is a function of site-specific covariates \(\boldsymbol{X}\) and a vector of regression coefficients (\(\boldsymbol{\beta}\)).

We do not directly observe \(z_j\), but rather we observe an imperfect representation of the latent occurrence process. In integrated models, we have \(r = 1, \dots, R\) distinct sources of data that are all imperfect representations of a single, shared occurrence process. Let \(y_{r, a, k}\) be the observed detection (1) or nondetection (0) of a species of interest in data set \(r\) at site \(a\) during replicate \(k\). Because different data sources have different variables influencing the observation process, we envision a separate detection model for each data source that is defined conditional on a single, shared ecological process described above. We envision the detection-nondetection data from source \(r\) as arising from a Bernoulli process conditional on the true latent occurrence process:

\[\begin{equation} \begin{split} &y_{r, a, k} \sim \text{Bernoulli}(p_{r, a, k}z_{j[a]}), \\ &\text{logit}(p_{r, a, k}) = \boldsymbol{v}^{\top}_{r, a, k}\boldsymbol{\alpha}_r, \end{split} \end{equation}\]

where \(p_{r, a, k}\) is the probability of detecting a species at site \(a\) during replicate \(k\) (given it is present at site \(a\)) for data source \(r\), which is a function of site, replicate, and data source specific covariates \(\boldsymbol{V}_r\) and a vector of regression coefficients specific to each data source (\(\boldsymbol{\alpha}_r\)). Note that \(z_{j[a]}\) is the true occurrence status at site \(j\) corresponding to the \(a\)th data source site in the given data set \(r\). Each data source may be available at all \(J\) sites in the region of interest or at a subset of the \(J\) sites. Additionally, data sources can overlap in the sites they sample, or they can be obtained at distinct sites within all \(J\) sites of interest in the overall region.

We assume multivariate normal priors for the occurrence (\(\boldsymbol{\beta}\)) and data-set specific detection (\(\boldsymbol{\alpha}\)) regression coefficients to complete the Bayesian specification of a single-species occupancy model. Pólya-Gamma data augmentation is implemented in an analogous manner to that of the previous models to yield an efficient implementation of integrated occupancy models.

### Example data sources: Ovenbird occurrence in the White Mountain National Forest

To illustrate an integrated occupancy model, we will use two data sets that come from the White Mountain National Forest (WMNF) in New Hampshire and Maine, USA. Our goal is to model the occurrence of OVEN in the WMNF in 2015. Our first data source is the HBEF data set we have used to illustrate all single data source models. Our second data source comes from the National Ecological Observatory Network (NEON) at Bartlett Experimental Forest (Barnett et al. 2019; National Ecological Observatory Network (NEON) 2021). The Bartlett Forest and HBEF both lie within the larger WMNF. Suppose we are interested in OVEN occurrence across the entire WMNF. By leveraging both data sources in a single integrated model, we will expand the range of covariates across which we can make reliable predictions, and may obtain results that are more indicative across the entire region of interest and not just a single data source location (Doser et al. 2022). In this particular case, there is no overlap between the two data sources (i.e., Bartlett Forest and HBEF do not overlap spatially). However, the integrated occupancy models fit by `spOccupancy`

can integrate data sources with no overlap, partial overlap, or complete overlap.

The NEON data are provided along with `spOccupancy`

in the `neon2015`

list. We load the NEON data along with the HBEF data below

```
List of 4
$ y : num [1:12, 1:80, 1:3] 0 0 0 0 0 0 1 0 0 0 ...
$ occ.covs: num [1:80, 1] 390 425 443 382 441 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "Elevation"
$ det.covs:List of 2
..$ day: num [1:80] 169 169 169 169 169 169 169 169 169 170 ...
..$ tod: int [1:80] 8 8 6 7 8 6 7 7 7 8 ...
$ coords : num [1:80, 1:2] 318472 318722 318972 318472 318722 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:80] "1" "2" "3" "4" ...
.. ..$ : chr [1:2] "X" "Y"
```

Details on the NEON data set are provided in the package documentation as well as in Doser et al. (2022). The NEON data are collected at 80 point count sites in Bartlett Forest using a removal protocol with three time periods, resulting in replicated detection-nondetection data that can be used in an occupancy modeling framework (after reducing counts to binary detection indicators). The `neon2015`

list, like the `hbef2015`

object, contains the detection-nondetection data for 12 foliage-gleaning bird species (`y`

), occurrence covariates stored in `occ.covs`

, detection covariates stored in `det.covs`

, and the coordinates of the 80 point count locations stored in `coords`

. Below we subset the detection-nondetection data in both data sources to solely work with OVEN.

```
sp.names <- dimnames(hbef2015$y)[[1]]
ovenHBEF <- hbef2015
ovenHBEF$y <- ovenHBEF$y[sp.names == "OVEN", , ]
ovenNEON <- neon2015
ovenNEON$y <- ovenNEON$y[sp.names == "OVEN", , ]
table(ovenHBEF$y)
```

```
0 1
518 588
```

`table(ovenNEON$y)`

```
0 1
61 62
```

Thus, the ovenbird is observed in a little over half of the possible site/replicate combinations in both of the data sources.

### Fitting single-species integrated occupancy models with `intPGOcc()`

The function `intPGOcc()`

fits single-species integrated occupancy models in `spOccupancy`

. Syntax is very similar to that of single data source models, and specifically takes the following form:

```
intPGOcc(occ.formula, det.formula, data, inits, n.samples, priors,
n.omp.threads = 1, verbose = TRUE, n.report = 1000,
n.burn = round(.10 * n.samples), n.thin = 1, n.chains = 1,
k.fold, k.fold.threads = 1, k.fold.seed, k.fold.data, ...)
```

The `data`

argument contains the list of data elements necessary for fitting an integrated occupancy model. For nonspatial integrated occupancy models, `data`

should be a list comprised of the following objects: `y`

(list of detection-nondetection data matrices for each data source), `occ.covs`

(data frame or matrix of covariates for occurrence model), `det.covs`

(a list of lists where each element of the list corresponds to the detection-nondetection data for the given data source), `sites`

(a list where each element consists of the site indices for the given data source.

The `ovenHBEF`

and `ovenNEON`

lists are currently formatted for use in single data source models and so we need to combine these data sources together. Perhaps the trickiest part of data integration is ensuring each point count location in each data source lines up with the correct geographical location where you want to determine the true presence/absence of the species of interest. In `spOccupancy`

, most of this bookkeeping is done under the hood, but we will need to combine the two data sources together into a single list in which we are consistent about how the data sources are sorted. To accomplish this, we recommend first creating the occurrence covariates matrix for all data sources. Because our two data sources do not overlap spatially, this is relatively simple here as we can just use `rbind()`

.

```
num [1:453, 1] 475 494 546 587 588 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr "Elevation"
```

Notice the order in which we placed these covariates: all covariate values for HBEF come first, followed by all covariates for NEON. We need to ensure that we use this identical ordering for all objects in the `data`

list. Next, we create the site indices stored in `sites`

. `sites`

should be a list with two elements (one for each data source), where each element consists of a vector that indicates the rows in `occ.covs`

that correspond with the specific row of the detection-nondetection data for that data source. When the data sources stem from distinct points (like in our current case), this is again relatively straightforward as the indices simply correspond to how we ordered the points in `occ.covs`

.

```
sites.int <- list(hbef = 1:nrow(ovenHBEF$occ.covs),
neon = 1:nrow(ovenNEON$occ.covs) + nrow(ovenHBEF$occ.covs))
str(sites.int)
```

```
List of 2
$ hbef: int [1:373] 1 2 3 4 5 6 7 8 9 10 ...
$ neon: int [1:80] 374 375 376 377 378 379 380 381 382 383 ...
```

Next we create the combined detection-nondetection data `y`

. For integrated models in `spOccupancy`

, `y`

is a list of matrices, with each element containing the detection-nondetection matrix for the specific data source. Again, we must ensure that we place the data sources in the correct order.

```
List of 2
$ hbef: num [1:373, 1:3] 1 1 0 1 0 0 1 0 1 1 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:373] "1" "2" "3" "4" ...
.. ..$ : chr [1:3] "1" "2" "3"
$ neon: num [1:80, 1:3] 1 1 0 1 1 0 1 1 0 1 ...
```

Lastly, we create the detection covariates `det.covs`

. This should be a list of the detection covariates from each individual data source. Because individual data source detection covariates are stored as lists for single data source models in `spOccupancy`

, `det.covs`

is now a list of lists for integrated occupancy models.

```
det.covs.int <- list(hbef = ovenHBEF$det.covs,
neon = ovenNEON$det.covs)
```

Finally, we package everything together into a single list, which we call `data.int`

.

```
data.int <- list(y = y.int,
occ.covs = occ.covs.int,
det.covs = det.covs.int,
sites = sites.int)
str(data.int)
```

```
List of 4
$ y :List of 2
..$ hbef: num [1:373, 1:3] 1 1 0 1 0 0 1 0 1 1 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:373] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:3] "1" "2" "3"
..$ neon: num [1:80, 1:3] 1 1 0 1 1 0 1 1 0 1 ...
$ occ.covs: num [1:453, 1] 475 494 546 587 588 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "Elevation"
$ det.covs:List of 2
..$ hbef:List of 2
.. ..$ day: num [1:373, 1:3] 156 156 156 156 156 156 156 156 156 156 ...
.. ..$ tod: num [1:373, 1:3] 330 346 369 386 409 425 447 463 482 499 ...
..$ neon:List of 2
.. ..$ day: num [1:80] 169 169 169 169 169 169 169 169 169 170 ...
.. ..$ tod: int [1:80] 8 8 6 7 8 6 7 7 7 8 ...
$ sites :List of 2
..$ hbef: int [1:373] 1 2 3 4 5 6 7 8 9 10 ...
..$ neon: int [1:80] 374 375 376 377 378 379 380 381 382 383 ...
```

We specify the occurrence and detection model formulas using the `occ.formula`

, and `det.formula`

arguments. The `occ.formula`

remains unchanged from previous models, and we will specify occurrence of the ovenbird as a function of linear and quadratic terms of elevation.

For the detection models, we need to specify a different detection model for each data source. We do this by supplying a list to the `det.formula`

argument, where each element of the list is the model formula for that given data set. Here we specify the detection model for HBEF as a function of linear and quadratic terms for ‘day of survey’ as well as a linear term for ‘time of survey’. In this case, we include the same covariates for the NEON model (though we note that different coefficients are estimated for the two data sources). However, there is no requirement for the data sources to be a function of the same covariates.

```
det.formula.int = list(hbef = ~ scale(day) + scale(tod) + I(scale(day)^2),
neon = ~ scale(day) + scale(tod) + I(scale(day)^2))
```

Next we specify the initial values. Initial values are specified in a list with the following tags: `z`

(latent occurrence values), `alpha`

(detection regression coefficients), and `beta`

(occurrence regression coefficients). This aligns with fitting single-species occupancy models using `PGOcc()`

. However, since we now have multiple detection models with different coefficients for each data source, initial values for `alpha`

are now passed to `intPGOcc()`

as a list, with each element of the list corresponding to the initial detection parameter values for a given data source. These are specified either as a vector with a value for each parameter or a single value for all parameters).

```
# Total number of sites
J <- nrow(data.int$occ.covs)
inits.list <- list(alpha = list(0, 0),
beta = 0,
z = rep(1, J))
```

We next specify the priors for all parameters in the integrated occupancy model in a list that is passed into the `priors`

argument. We specify normal priors for both the occurrence and detection regression coefficients, using tags `beta.normal`

and `alpha.normal`

, respectively.

```
prior.list <- list(beta.normal = list(mean = 0, var = 2.72),
alpha.normal = list(mean = list(0, 0),
var = list(2.72, 2.72)))
```

Priors for the occurrence regression coefficients are specified as we have seen in previous models. Because we have multiple detection-nondetection data sets each with distinct detection parameters, we specify the hypermeans and hypervariances in individual lists, where each element of the list corresponds to a specific data source. Again, the ordering of the data sources in the lists must align with the order the data sources are saved in the detection-nondetection data supplied to the `data`

argument.

Finally, we specify the number of samples, burn-in, and thinning rate using the same approach we have used for previous models.

```
n.samples <- 8000
n.burn <- 3000
n.thin <- 25
```

We can now run the integrated occupancy model. Below we set the number of threads used to 1 and print out sampler progress after every 2000th iteration.

```
# Approx. run time: < 15 sec
out.int <- intPGOcc(occ.formula = occ.formula.int,
det.formula = det.formula.int,
data = data.int,
inits = inits.list,
n.samples = n.samples,
priors = prior.list,
n.omp.threads = 1,
verbose = TRUE,
n.report = 2000,
n.burn = n.burn,
n.thin = n.thin,
n.chains = 3)
```

```
----------------------------------------
Preparing to run the model
----------------------------------------
----------------------------------------
Model description
----------------------------------------
Integrated Occupancy Model with Polya-Gamma latent
variable fit with 453 sites.
Integrating 2 occupancy data sets.
Samples per Chain: 8000
Burn-in: 3000
Thinning Rate: 25
Number of Chains: 3
Total Posterior Samples: 600
Source compiled with OpenMP support and model fit using 1 thread(s).
----------------------------------------
Chain 1
----------------------------------------
Sampling ...
Sampled: 2000 of 8000, 25.00%
-------------------------------------------------
Sampled: 4000 of 8000, 50.00%
-------------------------------------------------
Sampled: 6000 of 8000, 75.00%
-------------------------------------------------
Sampled: 8000 of 8000, 100.00%
----------------------------------------
Chain 2
----------------------------------------
Sampling ...
Sampled: 2000 of 8000, 25.00%
-------------------------------------------------
Sampled: 4000 of 8000, 50.00%
-------------------------------------------------
Sampled: 6000 of 8000, 75.00%
-------------------------------------------------
Sampled: 8000 of 8000, 100.00%
----------------------------------------
Chain 3
----------------------------------------
Sampling ...
Sampled: 2000 of 8000, 25.00%
-------------------------------------------------
Sampled: 4000 of 8000, 50.00%
-------------------------------------------------
Sampled: 6000 of 8000, 75.00%
-------------------------------------------------
Sampled: 8000 of 8000, 100.00%
```

We again make a call to the `summary`

function for a concise description of the model results.

`summary(out.int)`

```
Call:
intPGOcc(occ.formula = occ.formula.int, det.formula = det.formula.int,
data = data.int, inits = inits.list, priors = prior.list,
n.samples = n.samples, n.omp.threads = 1, verbose = TRUE,
n.report = 2000, n.burn = n.burn, n.thin = n.thin, n.chains = 3)
Samples per Chain: 8000
Burn-in: 3000
Thinning Rate: 25
Number of Chains: 3
Total Posterior Samples: 600
Run Time (min): 0.2103
Occurrence (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept) 2.2633 0.2418 1.8281 2.2556 2.8053 1.0011 600
scale(Elevation) -1.2818 0.1915 -1.6741 -1.2750 -0.9491 1.0178 748
I(scale(Elevation)^2) -0.6001 0.1572 -0.8958 -0.5980 -0.3042 1.0050 674
Data source 1 Detection (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept) 0.8069 0.1262 0.5620 0.8134 1.0549 1.0078 600
scale(day) -0.0898 0.0752 -0.2419 -0.0865 0.0543 1.0018 600
scale(tod) -0.0447 0.0794 -0.2072 -0.0456 0.1113 1.0234 600
I(scale(day)^2) 0.0278 0.0960 -0.1637 0.0210 0.2122 1.0031 600
Data source 2 Detection (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept) 2.0365 0.5611 0.8986 2.0194 3.1347 1.0031 600
scale(day) 0.8911 0.3726 0.2556 0.8660 1.6360 1.0002 600
scale(tod) 0.8419 0.5257 -0.3892 0.9016 1.7465 1.0099 600
I(scale(day)^2) -0.7988 0.3373 -1.4846 -0.7995 -0.1368 1.0117 532
```

The `summary`

function for integrated models returns the detection parameters separately for each detection covariate. The Rhat and ESS values clearly indicate adequate convergence and mixing of the MCMC chains. Looking at the occurrence parameters, we see fairly similar estimates to those from the single data source model using HBEF data only. Of course this is only a half-surprise due to the unequal sizes of the two data sets, where the HBEF data will be expected to dominate estimation of the occupancy parameters. But an alternative explanation for the relative similarity is that the two processes are indeed very similar in the two areas within the greater WMNF region.

### Posterior predictive checks

We perform posterior predictive checks using `ppcOcc()`

as before. GoF assessment for integrated models is an active area of research and no consensus has so far appeared on the best approach. In `spOccupancy`

, we compute posterior predictive checks separately for each data set in the integrated model.

```
Call:
ppcOcc(object = out.int, fit.stat = "freeman-tukey", group = 2)
Samples per Chain: 8000
Burn-in: 3000
Thinning Rate: 25
Number of Chains: 3
Total Posterior Samples: 600
Data Source 1
Bayesian p-value: 0.3717
Fit statistic: freeman-tukey
Data Source 2
Bayesian p-value: 0.1217
Fit statistic: freeman-tukey
```

The low Bayesian p-value for NEON suggests a potential lack of fit which we would explore in a complete analysis.

### Model selection using WAIC and k-fold cross-validation

We use `waicOcc()`

to compute the WAIC for integrated occupancy models. Similar to the posterior predictive check, individual WAIC values are reported for each data set. These can be summed across all data sources for an overall WAIC value if desired.

`waicOcc(out.int)`

```
elpd pD WAIC
1 -633.34832 5.849407 1278.3955
2 -57.95017 7.730643 131.3616
```

k-fold cross-validation is implemented using the `k.fold`

argument as we have seen in previous `spOccupancy`

model fitting functions. Cross-validation for models with multiple data sources is not as straightforward as single data source models. For instance, we could envision splitting the data in multiple different ways to assess predictive performance, depending on the purpose of our comparison. In `spOccupancy`

, we implement two approaches for cross-validation of integrated occupancy models. In the first approach, we hold out locations irrespective of what data source they came from. This results in a scoring rule (deviance) for each individual data source based on the hold out sites where that data source is sampled. More specifically, our first algorithm for K-fold cross-validation is:

- Randomly split the total number of sites with at least one data source into \(K\) groups.
- For each \(k = 1, \dots, K\), fit the model without the data at the sites in the \(k\)th group of hold-out locations.
- Predict the detection-nondetection data at the locations in the \(k\)th hold out set.
- Compute the deviance for each hold out data point.
- Sum the deviance values separately for each data source to yield a scoring rule for each data source separately.

This form of k-fold cross-validation is applicable for model-selection between different integrated occupancy models. In other words, this approach can be used to compare models that integrate the same data sources but include different covariates in the occurrence and/or detection portion of the occupancy model. Using our example, we implement 4-fold cross-validation to compare the full integrated model with covariates to an intercept only integrated occupancy model. We do this using the `k.fold`

, `k.fold.threads`

, and `k.fold.seed`

arguments as with previous `spOccupancy`

models. Below we use the default values for `k.fold.threads`

and `k.fold.seed`

.

```
out.int.k.fold <- intPGOcc(occ.formula = occ.formula.int,
det.formula = det.formula.int,
data = data.int,
inits = inits.list,
n.samples = n.samples,
priors = prior.list,
n.omp.threads = 1,
verbose = FALSE,
n.report = 2000,
n.burn = n.burn,
n.thin = n.thin,
n.chains = 1,
k.fold = 4)
out.int.k.fold.small <- intPGOcc(occ.formula = ~ 1,
det.formula = list(hbef = ~ 1, neon = ~ 1),
data = data.int,
inits = inits.list,
n.samples = n.samples,
priors = prior.list,
n.omp.threads = 1,
verbose = FALSE,
n.report = 2000,
n.burn = n.burn,
n.thin = n.thin,
n.chains = 1,
k.fold = 4)
# Summarize the CV results
out.int.k.fold$k.fold.deviance
```

`[1] 1504.9896 157.0448`

`out.int.k.fold.small$k.fold.deviance`

`[1] 1532.9758 205.6215`

We see the deviance for the full model is lower for both data sources compared to the intercept only model.

Alternatively, we may not wish to compare different integrated occupancy models together, but rather wish to assess whether or not data integration is necessary compared to using a single data source occupancy model. To accomplish this task, we can perform cross-validation with the integrated occupancy model using only a single data source as the hold out set, and then compare the deviance scoring rule to a scoring rule obtained from cross-validation with a single data source occupancy model. We accomplish this by using the argument `k.fold.data`

. If `k.fold.data`

is specified as an integer between 1 and the number of data sources integrated (in this case 2), only the data source corresponding to that integer will be used in the hold out and k-fold cross-validation process. If `k.fold.data`

is not specified, k-fold cross-validation holds out data irrespective of the data sources at the given location. Here, we set `k.fold.data = 1`

and compare the cross-validation results of the integrated model to the occupancy model using only HBEF we fit with `PGOcc()`

(which is stored in the `out.k.fold`

object).

```
out.int.k.fold.hbef <- intPGOcc(occ.formula = occ.formula.int,
det.formula = det.formula.int,
data = data.int,
inits = inits.list,
n.samples = n.samples,
priors = prior.list,
n.omp.threads = 1,
verbose = TRUE,
n.report = 2000,
n.burn = n.burn,
n.thin = n.thin,
k.fold = 4,
k.fold.data = 1)
```

```
----------------------------------------
Preparing to run the model
----------------------------------------
----------------------------------------
Model description
----------------------------------------
Integrated Occupancy Model with Polya-Gamma latent
variable fit with 453 sites.
Integrating 2 occupancy data sets.
Samples per Chain: 8000
Burn-in: 3000
Thinning Rate: 25
Number of Chains: 1
Total Posterior Samples: 200
Source compiled with OpenMP support and model fit using 1 thread(s).
----------------------------------------
Chain 1
----------------------------------------
Sampling ...
Sampled: 2000 of 8000, 25.00%
-------------------------------------------------
Sampled: 4000 of 8000, 50.00%
-------------------------------------------------
Sampled: 6000 of 8000, 75.00%
-------------------------------------------------
Sampled: 8000 of 8000, 100.00%
----------------------------------------
Cross-validation
----------------------------------------
```

`Performing 4-fold cross-validation using 1 thread(s).`

`Only holding out data from data source 1.`

```
# Look at CV results again
# Single data source model
out.k.fold$k.fold.deviance
```

`[1] 1510.854`

```
# Integrated model
out.int.k.fold.hbef$k.fold.deviance
```

`[1] 1516.394`

Here we see that integration of the two data sources does not improve predictive performance at HBEF alone. We should also do the same approach with the NEON data. We close this section by emphasizing that there are potentially numerous other benefits to data integration than just predictive performance that must be carefully considered when trying to determine if data integration is necessary or not. See Simmonds et al. (2020), the discussion in Doser et al. (2022), and Chapter 10 in AHM2 for more on this topic.

### Prediction

Prediction for integrated occupancy models proceeds exactly as before using `predict()`

. Here we predict occurrence across HBEF for comparison with the single data source models (and remember to load the `stars`

and `ggplot2`

packages).

```
# Make sure to standardize using mean and sd from fitted model
elev.pred <- (hbefElev$val - mean(data.int$occ.covs[, 1])) / sd(data.int$occ.covs[, 1])
X.0 <- cbind(1, elev.pred, elev.pred^2)
out.int.pred <- predict(out.int, X.0)
```

```
# Producing an SDM for HBEF alone (posterior mean)
plot.dat <- data.frame(x = hbefElev$Easting,
y = hbefElev$Northing,
mean.psi = apply(out.int.pred$psi.0.samples, 2, mean),
sd.psi = apply(out.int.pred$psi.0.samples, 2, sd))
dat.stars <- st_as_stars(plot.dat, dims = c('x', 'y'))
ggplot() +
geom_stars(data = dat.stars, aes(x = x, y = y, fill = mean.psi)) +
scale_fill_viridis_c(na.value = 'transparent') +
labs(x = 'Easting', y = 'Northing', fill = '',
title = 'Mean OVEN occurrence probability using intPGOcc') +
theme_bw()
```

```
ggplot() +
geom_stars(data = dat.stars, aes(x = x, y = y, fill = sd.psi)) +
scale_fill_viridis_c(na.value = 'transparent') +
labs(x = 'Easting', y = 'Northing', fill = '',
title = 'SD OVEN occurrence probability using intPGOcc') +
theme_bw()
```

## Single-species spatial integrated occupancy models

### Basic model description

Single-species spatial integrated occupancy models are identical to integrated occupancy models except the ecological process model now incorporates a spatially-structured random effect following the discussion with single-species spatial occupancy models. All details for the single-species integrated spatial occupancy model have already been presented in previous model descriptions.

### Fitting single-species spatial integrated occupancy models using `spIntPGOcc()`

The function `spIntPGOcc()`

fits single-species spatial integrated occupancy models in `spOccupancy`

. Syntax is very similar to that of the single data source models and specifically takes the following form:

```
spIntPGOcc(occ.formula, det.formula, data, inits, n.batch,
batch.length, accept.rate = 0.43, priors,
cov.model = "exponential", tuning, n.omp.threads = 1,
verbose = TRUE, NNGP = TRUE, n.neighbors = 15,
search.type = 'cb', n.report = 100,
n.burn = round(.10 * n.batch * batch.length),
n.thin = 1, n.chains = 1, k.fold, k.fold.threads = 1,
k.fold.seed, k.fold.data, ...)
```

The `occ.formula`

, `det.formula`

, and `data`

arguments are analogous to what we saw with the nonspatial integrated occupancy model. However, as for all spatial models in `spOccupancy`

, the `data`

list must also contain the spatial coordinates in the `coords`

tag, which we add below.

```
data.int$coords <- rbind(hbef2015$coords, neon2015$coords)
occ.formula.int <- ~ scale(Elevation) + I(scale(Elevation)^2)
det.formula.int <- list(hbef = ~ scale(day) + scale(tod) + I(scale(day)^2),
neon = ~ scale(day) + scale(tod) + I(scale(day)^2))
```

Initial values specified in `inits`

and priors in `priors`

are specified in the same form as for `intPGOcc()`

with the additional values for spatial parameters. Analogous to all other spatial models in `spOccupancy`

, the spatial variance parameter takes an inverse-Gamma prior and the spatial range parameter (as well as the spatial smoothness parameter if `cov.model = 'matern'`

) takes a uniform prior (see discussion in single-species spatial occupancy models section on how to specify a uniform prior for the spatial variance parameter instead of an inverse-Gamma).

```
dist.int <- dist(data.int$coords)
min.dist <- min(dist.int)
max.dist <- max(dist.int)
J <- nrow(data.int$occ.covs)
# Exponential covariance model
cov.model <- "exponential"
inits.list <- list(alpha = list(0, 0),
beta = 0,
z = rep(1, J),
sigma.sq = 2,
phi = 3 / mean(dist.int),
w = rep(0, J))
prior.list <- list(beta.normal = list(mean = 0, var = 2.72),
alpha.normal = list(mean = list(0, 0),
var = list(2.72, 2.72)),
sigma.sq.ig = c(2, 1),
phi.unif = c(3 / max.dist, 3 / min.dist))
```

Finally, we specify the remaining parameters regarding the NNGP specifications, tuning parameters, and the length of the MCMC sampler we will run. We are then all set to fit the model. Remember that spatially-explicit models in `spOccupancy`

are implemented using an efficient adaptive MCMC sampler that requires us to specify the number of MCMC batches (`n.batch`

) and the length of each MCMC batch (`batch.length`

), which together determine the number of MCMC samples (i.e., `n.samples = n.batch * batch.length`

). We run the model and summarize the results with `summary()`

.

```
batch.length <- 25
n.batch <- 800
n.burn <- 10000
n.thin <- 40
tuning <- list(phi = .2)
# Approx. run time: 2 min
out.sp.int <- spIntPGOcc(occ.formula = occ.formula.int,
det.formula = det.formula.int,
data = data.int,
inits = inits.list,
priors = prior.list,
tuning = tuning,
cov.model = cov.model,
NNGP = TRUE,
n.neighbors = 5,
n.batch = n.batch,
n.burn = n.burn,
n.chains = 3,
batch.length = batch.length,
n.report = 200)
```

```
----------------------------------------
Preparing to run the model
----------------------------------------
----------------------------------------
Building the neighbor list
----------------------------------------
----------------------------------------
Building the neighbors of neighbors list
----------------------------------------
----------------------------------------
Model description
----------------------------------------
NNGP Integrated Occupancy Model with Polya-Gamma latent
variable fit with 453 sites.
Integrating 2 occupancy data sets.
Samples per chain: 20000 (800 batches of length 25)
Burn-in: 10000
Thinning Rate: 1
Number of Chains: 3
Total Posterior Samples: 30000
Using the exponential spatial correlation model.
Using 5 nearest neighbors.
Source compiled with OpenMP support and model fit using 1 thread(s).
Adaptive Metropolis with target acceptance rate: 43.0
----------------------------------------
Chain 1
----------------------------------------
Sampling ...
Batch: 200 of 800, 25.00%
Parameter Acceptance Tuning
phi 32.0 0.25681
-------------------------------------------------
Batch: 400 of 800, 50.00%
Parameter Acceptance Tuning
phi 40.0 0.22326
-------------------------------------------------
Batch: 600 of 800, 75.00%
Parameter Acceptance Tuning
phi 52.0 0.22777
-------------------------------------------------
Batch: 800 of 800, 100.00%
----------------------------------------
Chain 2
----------------------------------------
Sampling ...
Batch: 200 of 800, 25.00%
Parameter Acceptance Tuning
phi 72.0 0.21450
-------------------------------------------------
Batch: 400 of 800, 50.00%
Parameter Acceptance Tuning
phi 44.0 0.23237
-------------------------------------------------
Batch: 600 of 800, 75.00%
Parameter Acceptance Tuning
phi 48.0 0.25172
-------------------------------------------------
Batch: 800 of 800, 100.00%
----------------------------------------
Chain 3
----------------------------------------
Sampling ...
Batch: 200 of 800, 25.00%
Parameter Acceptance Tuning
phi 40.0 0.23237
-------------------------------------------------
Batch: 400 of 800, 50.00%
Parameter Acceptance Tuning
phi 24.0 0.19801
-------------------------------------------------
Batch: 600 of 800, 75.00%
Parameter Acceptance Tuning
phi 52.0 0.21025
-------------------------------------------------
Batch: 800 of 800, 100.00%
```

`summary(out.sp.int)`

```
Call:
spIntPGOcc(occ.formula = occ.formula.int, det.formula = det.formula.int,
data = data.int, inits = inits.list, priors = prior.list,
tuning = tuning, cov.model = cov.model, NNGP = TRUE, n.neighbors = 5,
n.batch = n.batch, batch.length = batch.length, n.report = 200,
n.burn = n.burn, n.chains = 3)
Samples per Chain: 20000
Burn-in: 10000
Thinning Rate: 1
Number of Chains: 3
Total Posterior Samples: 30000
Run Time (min): 3.4521
Occurrence (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept) 2.9401 0.7900 1.4555 2.9126 4.5962 1.0678 232
scale(Elevation) -2.1644 0.5536 -3.4284 -2.0923 -1.2591 1.0566 326
I(scale(Elevation)^2) -0.9483 0.3445 -1.6763 -0.9306 -0.3178 1.0177 907
Data source 1 Detection (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept) 0.8257 0.1242 0.5853 0.8246 1.0721 1.0004 19380
scale(day) -0.0922 0.0780 -0.2464 -0.0913 0.0592 1.0005 22828
scale(tod) -0.0458 0.0779 -0.1991 -0.0460 0.1059 1.0001 22911
I(scale(day)^2) 0.0302 0.0961 -0.1574 0.0300 0.2209 1.0000 21751
Data source 2 Detection (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept) 2.1169 0.6172 0.8954 2.1104 3.3298 1.0078 2756
scale(day) 0.6557 0.5436 -0.5584 0.7031 1.6240 1.0160 1379
scale(tod) 0.4792 0.6259 -0.7387 0.4899 1.6173 1.0034 1313
I(scale(day)^2) -0.4309 0.6723 -1.4176 -0.5895 1.1811 1.0252 830
Spatial Covariance:
Mean SD 2.5% 50% 97.5% Rhat ESS
sigma.sq 6.9605 4.4000 1.8155 5.8916 18.4958 NA 121
phi 0.0019 0.0008 0.0005 0.0018 0.0037 1.1038 158
```

We see the occurrence and detection parameters have converged. We should run the chains for a bit longer to ensure convergence of the spatial covariance parameters, but for this example we will continue on.

### Posterior predictive checks

Below we perform a posterior predictive check for each of the data sets included in the occupancy model using `ppcOcc()`

.

```
Call:
ppcOcc(object = out.sp.int, fit.stat = "freeman-tukey", group = 2)
Samples per Chain: 20000
Burn-in: 10000
Thinning Rate: 1
Number of Chains: 3
Total Posterior Samples: 30000
Data Source 1
Bayesian p-value: 0.3688
Fit statistic: freeman-tukey
Data Source 2
Bayesian p-value: 0.301
Fit statistic: freeman-tukey
```

According to the Bayesian p-values, there is no lack of fit for either the HBEF or NEON data.

### Model selection using WAIC and k-fold cross-validation

We can perform model selection using WAIC with the `waicOcc()`

function as we have seen previously. Here we compare the WAIC from the spatial integrated model to the non-spatial integrated model.

`waicOcc(out.int)`

```
elpd pD WAIC
1 -633.34832 5.849407 1278.3955
2 -57.95017 7.730643 131.3616
```

`waicOcc(out.sp.int)`

```
elpd pD WAIC
1 -564.74617 49.06561 1227.6236
2 -43.54207 18.99561 125.0754
```

We see the spatial model performs better for both the HBEF data and the NEON data.

Two forms of k-fold cross-validation are implemented for `spIntPGOcc()`

, analogous to those discussed for non-spatial integrated occupancy models. We first use cross-validation to compare the predictive performance of the spatial integrated model to the nonspatial integrated model across all sites.

```
out.sp.int.k.fold <- spIntPGOcc(occ.formula = occ.formula.int,
det.formula = det.formula.int,
data = data.int,
inits = inits.list,
priors = prior.list,
tuning = tuning,
cov.model = cov.model,
NNGP = TRUE,
n.neighbors = 5,
n.batch = n.batch,
n.burn = n.burn,
batch.length = batch.length,
verbose = FALSE,
k.fold = 4)
# Non-spatial model
out.int.k.fold$k.fold.deviance
```

`[1] 1504.9896 157.0448`

```
# Spatial model
out.sp.int.k.fold$k.fold.deviance
```

`[1] 1516.2585 164.1131`

Further, we perform cross-validation using only the HBEF data source as a hold out data source to compare with single data source models to assess the benefit of integration.

```
out.sp.int.k.fold.hbef <- spIntPGOcc(occ.formula = occ.formula.int,
det.formula = det.formula.int,
data = data.int,
inits = inits.list,
priors = prior.list,
tuning = tuning,
cov.model = cov.model,
NNGP = TRUE,
n.neighbors = 5,
n.batch = n.batch,
n.burn = n.burn,
batch.length = batch.length,
verbose = FALSE,
k.fold = 4,
k.fold.data = 1)
out.sp.int.k.fold.hbef$k.fold.deviance
```

`[1] 1517.555`

```
# Non-spatial single data source model
out.k.fold$k.fold.deviance
```

`[1] 1510.854`

### Prediction

Prediction for spatial integrated occupancy models proceeds exactly analogous to our approach using nonspatial integrated occupancy models. The only difference is that now we must also provide the coordinates of the nonsampled locations. The following code is not run as it provides similar results to the nonspatial integrated occupancy model.

## References

Banerjee, Sudipto, Bradley P Carlin, and Alan E Gelfand. 2003. *Hierarchical Modeling and Analysis for Spatial Data*. Chapman; Hall/CRC.

Barnett, David T, Paul A Duffy, David S Schimel, Rachel E Krauss, Kathryn M Irvine, Frank W Davis, John E Gross, et al. 2019. “The Terrestrial Organism and Biogeochemistry Spatial Sampling Design for the National Ecological Observatory Network.” *Ecosphere* 10 (2): e02540.

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” *Journal of Statistical Software* 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.

Bolker, Benjamin M. 2015. “GLMM worked examples.” https://ms.mcmaster.ca/~bolker/R/misc/foxchapter/bolker_chap.html.

Broms, Kristin M, Mevin B Hooten, and Ryan M Fitzpatrick. 2016. “Model Selection and Assessment for Multi-Species Occupancy Models.” *Ecology* 97 (7): 1759–70.

Brooks, Stephen P., and Andrew Gelman. 1998. “General Methods for Monitoring Convergence of Iterative Simulations.” *Journal of Computational and Graphical Statistics* 7 (4): 434–55.

Clark, Allan E, and Res Altwegg. 2019. “Efficient Bayesian Analysis of Occupancy Models with Logit Link Functions.” *Ecology and Evolution* 9 (2): 756–68.

Datta, Abhirup, Sudipto Banerjee, Andrew O Finley, and Alan E Gelfand. 2016. “Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geostatistical Datasets.” *Journal of the American Statistical Association* 111 (514): 800–812.

Doser, Jeffrey W, Andrew O Finley, Aaron S Weed, and Elise F Zipkin. 2021. “Integrating Automated Acoustic Vocalization Data and Point Count Surveys for Estimation of Bird Abundance.” *Methods in Ecology and Evolution* 12 (6): 1040–9.

Doser, Jeffrey W, Wendy Leuenberger, T Scott Sillett, Michael T Hallworth, and Elise F Zipkin. 2022. “Integrated Community Occupancy Models: A Framework to Assess Occurrence and Biodiversity Dynamics Using Multiple Data Sources.” *Methods in Ecology and Evolution*.

Finley, Andrew O, Abhirup Datta, and Sudipto Banerjee. 2020. “SpNNGP R Package for Nearest Neighbor Gaussian Process Models.” *arXiv Preprint arXiv:2001.09111*.

Finley, Andrew O, Abhirup Datta, Bruce D Cook, Douglas C Morton, Hans E Andersen, and Sudipto Banerjee. 2019. “Efficient Algorithms for Bayesian Nearest Neighbor Gaussian Processes.” *Journal of Computational and Graphical Statistics* 28 (2): 401–14.

Gelman, Andrew. 2006. “Prior Distributions for Variance Parameters in Hierarchical Models (Comment on Article by Browne and Draper).” *Bayesian Analysis* 1 (3): 515–34.

Guélat, Jérôme, and Marc Kéry. 2018. “Effects of Spatial Autocorrelation and Imperfect Detection on Species Distribution Models.” *Methods in Ecology and Evolution* 9 (6): 1614–25.

Heaton, Matthew J, Abhirup Datta, Andrew O Finley, Reinhard Furrer, Joseph Guinness, Rajarshi Guhaniyogi, Florian Gerber, et al. 2019. “A Case Study Competition Among Methods for Analyzing Large Spatial Data.” *Journal of Agricultural, Biological and Environmental Statistics* 24 (3): 398–425.

Hobbs, N Thompson, and Mevin B Hooten. 2015. *Bayesian Models*. Princeton University Press.

Hooten, Mevin B, and N Thompson Hobbs. 2015. “A Guide to Bayesian Model Selection for Ecologists.” *Ecological Monographs* 85 (1): 3–28.

Kéry, Marc, and J Andrew Royle. 2016. *Applied Hierarchical Modeling in Ecology: Volume 1: Prelude and Static Models*. Elsevier Science.

Lany, Nina K, Phoebe L Zarnetske, Andrew O Finley, and Deborah G McCullough. 2020. “Complementary Strengths of Spatially-Explicit and Multi-Species Distribution Models.” *Ecography* 43 (3): 456–66.

Lunn, David, Christopher Jackson, Nicky Best, Andrew Thomas, and David Spiegelhalter. 2013. “The Bugs Book: A Practical Introduction to Bayesian Analysis.”

MacKenzie, Darryl I, and Larissa L Bailey. 2004. “Assessing the Fit of Site-Occupancy Models.” *Journal of Agricultural, Biological, and Environmental Statistics* 9 (3): 300–318.

MacKenzie, Darryl I, James D Nichols, Gideon B Lachman, Sam Droege, J Andrew Royle, and Catherine A Langtimm. 2002. “Estimating Site Occupancy Rates When Detection Probabilities Are Less Than One.” *Ecology* 83 (8): 2248–55.

McCullagh, Peter, and John A Nelder. 2019. *Generalized Linear Models*. Routledge.

Miller, David AW, Krishna Pacifici, Jamie S Sanderlin, and Brian J Reich. 2019. “The Recent Past and Promising Future for Data Integration Methods to Estimate Species’ Distributions.” *Methods in Ecology and Evolution* 10 (1): 22–37.

National Ecological Observatory Network (NEON). 2021. “Breeding Landbird Point Counts (Dp1.10003.001).” National Ecological Observatory Network (NEON). https://data.neonscience.org/data-products/DP1.10003.001.

Northrup, Joseph M, and Brian D Gerber. 2018. “A Comment on Priors for Bayesian Occupancy Models.” *PloS One* 13 (2): e0192819.

Polson, Nicholas G, James G Scott, and Jesse Windle. 2013. “Bayesian Inference for Logistic Models Using Pólya–Gamma Latent Variables.” *Journal of the American Statistical Association* 108 (504): 1339–49.

Roberts, Gareth O, and Jeffrey S Rosenthal. 2009. “Examples of Adaptive Mcmc.” *Journal of Computational and Graphical Statistics* 18 (2): 349–67.

Simmonds, Emily G, Susan G Jarvis, Peter A Henrys, Nick JB Isaac, and Robert B O’Hara. 2020. “Is More Data Always Better? A Simulation Study of Benefits and Limitations of Integrated Distribution Models.” *Ecography* 43 (10): 1413–22.

Tyre, Andrew J, Brigitte Tenhumberg, Scott A Field, Darren Niejalke, Kirsten Parris, and Hugh P Possingham. 2003. “Improving Precision and Reducing Bias in Biological Surveys: Estimating False-Negative Error Rates.” *Ecological Applications* 13 (6): 1790–1801.

Watanabe, Sumio. 2010. “Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory.” *Journal of Machine Learning Research* 11 (12).