The SPARSE symposium will take place in Canberra on Wednesday 16 November 2022, from 9.15am to 5.00pm.

Location: Hanna Neumann Building 145, room 1.33 (ground floor).

Registration: on Eventbrite, click here. The symposium is free but please register for catering purposes. If you have any dietary requirements please email

Draft program:

9.15 Emily Banks
Opening remarks
9.30 Ray Chambers
Data Linkage, robustness and SAE
Discussant: James Brown
10.30 Tea break
10.45 Steve Haslett
The role of block diagonal matrices in equality of BLUEs of full, small, and intermediate linear models under covariance change, and their link to data confidentiality and encryption
11.30 Sumonkanti Das
Estimation of daily smoking prevalence for disaggregated statistical areas in Australia
1.15 Susanna Cramb, Peter Baade, Jess Cameron
Communicating insights from disease maps: innovative visualisations and analyses from the Australian Cancer Atlas
1.45 James Hogg
Risk factors and the Australian Cancer Atlas: the trepidation of instability and sparsity in small area estimation
2.15 Lydia Lucchesi
Using Vizumap to visualise uncertainty on smoking prevalence maps
2.45 Bernard Baffour
The utility of socioeconomic and remoteness indicators in understanding geographical variation in the prevalence of early childhood vulnerability in Australia
3.30 Tea break
4.00 Mu Li
Small area estimation using spatial Bayesian hierarchical models
4.30 Panel discussion on implementation and impact
Alice Richardson, Ginny Sargent, Hannah Gisz
5.00 Informal discussion

Abstracts of the presentations follow.

Ray Chambers, University of Wollongong: Data Linkage, robustness and SAE
There is growing interest in a data integration approach to survey sampling, particularly where population registers are linked for sampling and subsequent analysis. The reason for doing this is simple: it is only by linking the same individuals in the different sources that it becomes possible to create a data set suitable for analysis. But data linkage is not error free. Many linkages are nondeterministic, based on how likely a linking decision corresponds to a correct match, that is, it brings together the same individual in all sources. High quality linking will ensure that the probability of this happening is high. Analysis of the linked data should take account of this additional source of error when this is not the case. This is especially true for secondary analysis carried out without access to the linking information, that is, the often confidential data that agencies use in their record matching.

We describe an inferential framework that allows for linkage errors when sampling from linked registers. After first reviewing current research activity in this area, we focus on secondary analysis and linear regression modelling, including the important special case of estimation of subpopulation and small area means. In doing so we consider both robustness and efficiency of the resulting linked data inferences.

Stephen Haslett, Massey University: The role of block diagonal matrices in equality of BLUEs of full, small, and intermediate linear models under covariance change, and their link to data confidentiality and encryption
A necessary and sufficient condition for BLUEs of estimable functions of parameters in a linear fixed effect model to be unaltered by a change in error covariance structure is due to Rao (1971). Structural insight into Rao’s condition can be gained by writing the quadratic form that is permitted to be added to the original covariance in block diagonal form. When the original full linear model is made smaller by reducing the number of regressors (which may include interactions of any order), block diagonal or diagonal matrices also provide insight into conditions for the entire set of full, small, and intermediate models each to retain their own BLUEs. Using these results, it is possible to generate new datasets with exactly the same parameter estimates and even with the same estimated parameter covariance. The paper outlines the role that such changes in error covariance structure can play in data confidentiality and data encryption, especially when the covariance of the BLUEs is also retained. Extensions to linear mixed models and BLUPs are outlined in principle.
Keywords: BLUE · BLUP · Confidentialised unit record files · Covariance · Data cloning · Data confidentiality · Encryption · Linear model · Residuals
Mathematics Subject Classification (2010): 62J05 · 62J10
Rao CR (1971) Unified theory of linear estimation (corr: 72v34 p194; 72v34 p477). Sankhya Series A 33:371–394.
(This is joint research with Simo Puntanen, University of Tampere, Finland.)
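
As an illustration of the kind of invariance the abstract describes, the following minimal sketch checks one simple special case, not the full generality of the talk. It assumes an original covariance V0 = I and adds X A X' + Z B Z', where the columns of Z are orthogonal to those of X. The design matrix, the response, and the matrices A, B and Z are all hypothetical values chosen for the example. Exact rational arithmetic is used so that the equality of the two estimates can be checked without rounding error.

```python
from fractions import Fraction as F

def matmul(A, B):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def add(*Ms):
    """Elementwise sum of matrices of equal shape."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*Ms)]

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination, exactly, using Fractions."""
    n = len(A)
    M = [[F(v) for v in list(ra) + list(rb)] for ra, rb in zip(A, B)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)  # first nonzero pivot
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def gls(X, V, y):
    """Generalised least squares estimate (X' V^-1 X)^-1 X' V^-1 y."""
    Xt = transpose(X)
    return solve(matmul(Xt, solve(V, X)), matmul(Xt, solve(V, y)))

# Hypothetical data: intercept plus one regressor, four observations.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [[1], [3], [2], [5]]
I4 = [[1 if i == j else 0 for j in range(4)] for i in range(4)]

# Hypothetical positive definite A and B; columns of Z satisfy Z'X = 0.
A = [[2, 1], [1, 3]]
B = [[1, 0], [0, 2]]
Z = [[1, 0], [-2, 1], [1, -2], [0, 1]]

# Changed covariance: V = V0 + X A X' + Z B Z' with V0 = I.
V = add(I4, matmul(matmul(X, A), transpose(X)), matmul(matmul(Z, B), transpose(Z)))

beta_v0 = gls(X, I4, y)  # BLUE under the original covariance (ordinary least squares)
beta_v = gls(X, V, y)    # BLUE under the changed covariance
```

In this example the two estimates coincide exactly, even though V is far from the identity. Response vectors generated under either covariance would therefore give the same fitted coefficients by either estimator.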

Sumonkanti Das, ANU: Estimation of daily smoking prevalence for disaggregated statistical areas in Australia
Official statistics on health outcomes for small domains are prized by policymakers and researchers alike, for measuring and monitoring progress of communities towards healthy lifestyles. Countries like Australia use their National Health Survey to monitor adult health behaviours such as daily smoking. Because of sparse and remote populations and limited data collection, the nationwide survey data are not sufficient to estimate smoking prevalence accurately for disaggregated statistical areas in Australia, and sometimes not even at the sub-national level, particularly for the Northern Territory, where one-fifth of the population live in very remote areas.

This study aims to estimate the daily smoking prevalence for Australian adults aged 18 years and above at various sub-national levels comprising 8 States and Territories, 88 Statistical Areas Level 4 (SA4s), and 334 Statistical Areas Level 3 (SA3s) by developing widely used small area models like the Fay-Herriot, spatial Fay-Herriot and the Besag-York-Mollié models beginning at the SA3 level, the most detailed level here. Detailed level direct estimates of daily smoking prevalence and their smoothed standard errors extracted from the 2017-18 National Health Survey have been used as input for developing the small area models, which are expressed in a hierarchical Bayesian framework and fitted by Markov Chain Monte Carlo simulation. SA3 level estimates obtained directly from the fitted models are then aggregated to obtain estimates at higher aggregation levels.
Models accounting for the detailed level structured (spatial) and unstructured (non-spatial) random effects provide slightly higher bias and lower coverage at the higher aggregation SA4 level. Extending the considered models to include SA4 level non-spatial random effects, in addition to the detailed level spatial and non-spatial effects, is proposed to reduce the bias and improve the coverage of the model-based estimators at higher aggregation levels, and examination of model performance statistics finds the extension suitable for the Australian smoking prevalence data.
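
The two ideas in this abstract, shrinking noisy direct estimates towards a model prediction and then aggregating fine-level estimates upwards, can be sketched in a few lines. This is a simplified plug-in version of the basic Fay-Herriot composite estimator with the model variance treated as known. It is not the hierarchical Bayesian MCMC fit described above, and every number below (direct estimates, standard errors, synthetic predictions, model variance, populations) is hypothetical.

```python
def fay_herriot_shrink(direct, se, synthetic, sigma2_u):
    """Composite estimate gamma * direct + (1 - gamma) * synthetic.

    gamma = sigma2_u / (sigma2_u + se^2), so areas with large sampling
    error are pulled more strongly towards the synthetic (regression) value.
    """
    gamma = sigma2_u / (sigma2_u + se ** 2)
    return gamma * direct + (1 - gamma) * synthetic

def aggregate(estimates, populations):
    """Population-weighted mean of fine-level prevalence estimates."""
    total = sum(populations)
    return sum(e * p for e, p in zip(estimates, populations)) / total

# Hypothetical inputs for three SA3s that make up one SA4.
direct = [0.20, 0.10, 0.16]       # direct survey estimates of prevalence
se = [0.02, 0.04, 0.01]           # smoothed standard errors
synthetic = [0.15, 0.15, 0.15]    # assumed regression predictions
sigma2_u = 0.0004                 # assumed area-level model variance
populations = [1000, 3000, 6000]  # assumed adult populations

sa3 = [fay_herriot_shrink(d, s, m, sigma2_u)
       for d, s, m in zip(direct, se, synthetic)]
sa4 = aggregate(sa3, populations)
```

The second SA3 has the largest standard error, so its estimate moves furthest from the direct value towards the synthetic one. The SA4 figure is then a population-weighted average of the three shrunken SA3 estimates.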

Peter Baade, Susanna Cramb & Jessica Cameron, Cancer Queensland: Communicating insights from disease maps: innovative visualisations and analyses from the Australian Cancer Atlas
This talk delves into the Australian Cancer Atlas, a world-first ongoing research program mapping small-area patterns in cancer-related measures, which currently includes cancer incidence and survival. Approaches to visualise and analyse the small-area estimates will be presented, including innovative current developments and future plans.

James Hogg, QUT: Risk factors and the Australian Cancer Atlas: the trepidation of instability and sparsity in small area estimation
With the rise in popularity of digital atlases to communicate spatial variation in health to the public, there is an increasing need for robust small-area proportion estimates. However, current small-area estimation methods suffer from various modelling problems when data are very sparse or the areas are very small.
More widely, recent work has shown significant benefits in modelling at both the individual and area levels. Building on this two-stage approach, we propose a two-stage Bayesian hierarchical small area estimation model that can: account for survey design; use both individual-level survey-only covariates and area-level census covariates; reduce the instability, variance and bias in modelled prevalence estimates; and generate prevalence estimates for SA2s with no survey data.

Using a simulation study we show that, compared with existing Bayesian SAE methods, our model can provide optimal predictive performance of proportions under a variety of data conditions, including excessively sparse data. We compare models in terms of Bayesian mean relative root mean squared error and mean absolute bias. Finally, we discuss applying the novel methodology to the 2017-2018 National Health Survey data and 2016 census data to derive prevalence estimates for current smokers in Australia.
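
The two comparison metrics named above can be written down in several ways. The sketch below uses one plausible frequentist reading, computed across simulation repetitions rather than posterior draws, and should be taken as an assumption, not as the definitions used in the talk. The function names and the toy numbers are hypothetical.

```python
from math import sqrt

def mrrmse(estimates, truth):
    """Mean relative root mean squared error.

    estimates[r][i] is the estimate for area i in repetition r;
    truth[i] is the true proportion for area i.
    """
    n_rep = len(estimates)
    relative_rmse = []
    for i, t in enumerate(truth):
        mse = sum((est[i] - t) ** 2 for est in estimates) / n_rep
        relative_rmse.append(sqrt(mse) / t)
    return sum(relative_rmse) / len(relative_rmse)

def mean_absolute_bias(estimates, truth):
    """Average over areas of |mean estimate - truth|."""
    n_rep = len(estimates)
    abs_bias = []
    for i, t in enumerate(truth):
        mean_est = sum(est[i] for est in estimates) / n_rep
        abs_bias.append(abs(mean_est - t))
    return sum(abs_bias) / len(abs_bias)

# Toy example: two areas, two simulation repetitions.
truth = [0.10, 0.20]
estimates = [[0.12, 0.18], [0.08, 0.22]]

score_rrmse = mrrmse(estimates, truth)
score_bias = mean_absolute_bias(estimates, truth)
```

In the toy data the errors cancel within each area, so the mean absolute bias is essentially zero while the relative RMSE is not. This is why the two metrics are reported together.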

Lydia Lucchesi, ANU: Using Vizumap to visualise uncertainty on smoking prevalence maps
Uncertainty is information often left off maps presenting areal data estimates. The Vizumap R package is a tool for making maps that include uncertainty. In this talk, we will discuss the four visualisation methods offered by the software and present a case study on mapping smoking prevalence in Australia. We will also demo VizumApp, the interactive Shiny app for visualising uncertainty in spatial data.

Bernard Baffour, ANU: The utility of socioeconomic and remoteness indicators in understanding geographical variation in the prevalence of early childhood vulnerability in Australia
Background: In rural and remote Australia, the family lives of children and their early childhood development outcomes are attributable to the level of disadvantage in their local area, and the distance of their local area from a major city. This study aims to investigate how local-area disadvantage, as reflected in the Socio-Economic Indexes for Areas (SEIFA), and remoteness, as reflected in the Accessibility/Remoteness Index of Australia (ARIA), contribute to improved estimates of the prevalence of children considered developmentally vulnerable in statistical areas level 3 (SA3) and 4 (SA4) across Australia.

Methodology: Target outcome variables, such as the proportion of developmentally vulnerable children in a particular development domain (for example, language and cognitive development), have been extracted at the SA3 level from the 2018 Australian Early Development Census (AEDC). The study included 308,953 children involved in the AEDC 2018, of whom roughly one in ten are considered to be developmentally vulnerable in each domain. The socio-economic index SEIFA and the geographic accessibility index ARIA constructed from census data are considered in this study to explain the spatial variability in child development vulnerability. We developed models in a hierarchical Bayesian framework considering SA3 level SEIFA and ARIA indices as covariates to account for spatial differences and unobserved heterogeneity at the SA3 level. The SA3 level estimates of the target outcome parameters are obtained from the developed model and then the SA4 level estimates are obtained by aggregation over SA3 level estimates. The performance of the developed models is examined by their consistency with the direct estimates at SA3, SA4, and state level, so as to obtain numerically consistent estimates at the various disaggregation levels.

Results: The results reveal that socio-economic disadvantage makes a significant contribution to explaining the spatial variation in childhood development vulnerability across small domains in Australia. Since the SEIFA score does not consider the remoteness of the locality, the inclusion of the ARIA score in the model further improves the model performance and provides better accuracy, particularly in remote and very remote regions. Therefore, the developed model provides significant improvements in estimation of the prevalence of developmental vulnerability at the disaggregated level, compared to the direct estimates, in terms of accuracy.
Conclusion: There are a number of sparsely populated areas where direct estimation leads to unreliable estimates of the relatively small prevalence of child vulnerability. The use of socio-economic disadvantage and geographic remoteness of the finer level domains in the Bayesian spatial model helps to explain the geographical variation in child vulnerability, and to provide more accurate estimates of childhood vulnerability for the sparsely populated areas across Australia.

Keywords: Australian Early Development Census, Socio-Economic Indexes, Remoteness of locality, Spatial Bayesian Model, Disaggregated Statistical Area, developmental vulnerability

Mu Li, ANU: Small area estimation using spatial Bayesian hierarchical models
When data are sparse in small-area statistics, borrowing strength from other reliable covariates (such as Census data or socio-economic indexes) can be a powerful strategy. However, covariates may not capture all of the variation, and the remaining spatial correlation appears as a random effect. Conditional autoregressive (CAR) models are commonly used to capture the spatial autocorrelation in data. Incorporating spatial adjacency information within a Bayesian hierarchical model framework is a typical approach to small area estimation of disease and vulnerability.
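
To make the role of adjacency information concrete, the sketch below builds the precision structure Q = D - W of an intrinsic CAR prior from a neighbour list, where W is the 0/1 adjacency matrix and D holds the neighbour counts. The four-area map and the effect values are hypothetical, and this is only the structural ingredient of such a prior, not a fitted model.

```python
def icar_precision(neighbours):
    """Return Q = D - W for areas labelled 0..n-1.

    neighbours maps each area to the list of areas it shares a border with.
    """
    n = len(neighbours)
    Q = [[0] * n for _ in range(n)]
    for i, nbrs in neighbours.items():
        Q[i][i] = len(nbrs)   # D: number of neighbours
        for j in nbrs:
            Q[i][j] = -1      # -W: minus the adjacency indicator
    return Q

def quadratic_form(Q, u):
    """u' Q u, which equals the sum of (u_i - u_j)^2 over neighbouring pairs."""
    n = len(u)
    return sum(u[i] * Q[i][j] * u[j] for i in range(n) for j in range(n))

# Hypothetical map: borders 0-1, 1-2, 1-3 and 2-3.
neighbours = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
Q = icar_precision(neighbours)

# Hypothetical spatial random effects for the four areas.
u = [1.0, 0.0, 2.0, -1.0]
penalty = quadratic_form(Q, u)
```

The quadratic form grows when neighbouring areas have very different effects, so a prior with density proportional to exp(-tau * u'Qu / 2) favours spatially smooth surfaces. Each row of Q sums to zero, which is why the intrinsic CAR prior is improper and is usually paired with a sum-to-zero constraint.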

Based on the Besag-York-Mollié (BYM) CAR model framework, there are many CAR models that try to capture and separate the spatial random effect. In addition, there are two extensions of the CAR Bayesian hierarchical model. One is the spatial-temporal BYM model, in which estimation may be more accurate when the temporal pattern is considered in the model. The other is the multivariate CAR model, which builds a joint distribution over multiple estimators with more flexible assumptions on the SAE results.

In this symposium, I would like to share a comparison of the results (and restrictions) of the simple BYM model, the spatial-temporal CAR model and the multivariate CAR model for the developmental vulnerability of young children, based on data from the Australian Early Development Census (AEDC).