Introduction
Imagine a world where you could access all possible information about individual companies, households or people. Privacy issues aside, this information would clearly be of use for many socially beneficial applications in the hands of a benevolent state or non-governmental organisation. In reality, researchers must work with information that is incomplete, lacking spatial or temporal resolution, or missing key attributes. Spatial microsimulation provides a statistical approximation of individual-level data at high spatial resolution. These spatial microdata can provide a basis for work in many fields to better understand the world.
Spatial microsimulation is a statistical technique for combining individual and geographical data, typically from aggregated Census results and surveys of a sample of the population. The resulting spatial microdata are useful in many situations where individual-level and geographically specific processes are in operation, enabling modelling and analysis on multiple levels. Spatial microsimulation can also be seen as an approach to better understand the world. The term is little known outside the fields of human geography and regional science, yet its methods have the potential to be useful in a wide range of applications. Spatial microsimulation has great potential for informing public policy for social benefit in housing, transport and sustainable urban planning, helping to prepare for a post-carbon world after we stop burning fossil fuels.
There is growing interest in spatial microsimulation. This is due largely to its practical utility in an era of ‘evidence-based policy’ but is also driven by changes in the wider research environment inside and outside of academia. Continued improvements in computers, software and data availability mean spatial microsimulation is more accessible than ever. It is now possible to simulate the populations of small administrative areas at the individual-level almost anywhere in the world. This opens new possibilities for a range of applications, not least policy evaluation.
Still, the meaning of spatial microsimulation is ambiguous for many. This is partly because the technique is inherently difficult to understand and partly due to researchers themselves: some uses of the term in the academic literature are unclear or inconsistent about what the method entails. Worse is work that treats spatial microsimulation as a magical black box. This book is about demystifying spatial microsimulation.
The term ‘spatial microsimulation’ can seem ambiguous because it covers any method for generating data that are statistically similar to the real state of the entities under study at a disaggregated level. Depending on the available data and the aim of the research, spatial microsimulation can therefore be understood either as a technique or as an approach:
- A method for generating spatial microdata (individuals allocated to zones; see Figure 1.1) by combining individual and geographically aggregated datasets. Used in this way, ‘spatial microsimulation’ is roughly synonymous with ‘population synthesis’.
- An approach to understanding multi-level phenomena based on spatial microdata — simulated or real.
Figure 1.1 is a very simplified flow diagram showing the difference between spatial microdata (bottom) and more commonly found official data types (top). There are clear disadvantages with both official data types: aggregate counts lack information about individuals, while individual-level survey data usually lack geographical detail.
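To make these two data types concrete, consider the following base R sketch; all names and values here are invented for illustration:

# Hypothetical aggregate data: counts of young and old people per zone
cons <- data.frame(zone = c("a", "b"), n_young = c(2, 1), n_old = c(1, 2))
# Hypothetical spatial microdata: one row per individual, allocated to a zone
ind <- data.frame(zone = c("a", "a", "a", "b", "b", "b"),
                  age = c("young", "young", "old", "young", "old", "old"))

The aggregate table tells us how many people of each type live in each zone but nothing about any individual; the microdata describe individuals directly, each attached to a zone.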
Throughout this book we use the term in the wider of these two senses. We thus define spatial microsimulation as follows:
The creation, analysis and modelling of individual-level data allocated to geographic zones.
To make the distinction between the methodology and the approach, we use the term population synthesis to describe the narrower process of generating spatial microdata; ‘spatial microsimulation’ is reserved for the wider process, which usually involves population synthesis. Unless you are lucky enough to have access to real spatial microdata, a scarce but increasingly available commodity, spatial microdata must be generated by combining zone-level and individual-level data. Throughout the course of the book the emphasis shifts from the former to the latter side of the spatial microsimulation process: the focus in Chapters 4 to 6 is on generating spatial microdata, while Chapters 7 onwards focus on how to use this rich data type.
Another issue tackled in this book is reproducibility. Most findings in the field cannot easily be replicated, meaning there is no way of independently checking the results. In today’s age of fast Internet connections, open access datasets and free software, there is little excuse for this. The issue is not unique to the field of spatial microsimulation. Opaque methods, impossible to replicate, are widespread in academia, leading to calls for an ‘Open Regional Science’ (Rey 2014). Similar proposals made in other research areas are gaining traction (Ince et al. 2012; Peng et al. 2006; McNutt 2014). This book encourages such a shift towards transparency in the field of spatial microsimulation.
Reproducibility is encouraged throughout via provision of code for readers to actually do spatial microsimulation. Small yet realistic datasets are provided to run the methods on your own computer. All the findings presented in this book can therefore be reproduced using code and data in the book’s GitHub repository.
Why spend time and effort on reproducibility? The first answer is pragmatic: reproducibility can actually save time in the long run by ensuring more readable code, and it can increase the re-usability and impact of research by allowing methods to be re-run or modified at a later date. The second reason is more profound: reproducibility is a prerequisite of falsifiability, and falsifiability is the backbone of science (Popper 1959). The results of non-reproducible research cannot be verified, reducing scientific credibility. These philosophical observations inform the book’s practical nature.
This book presents spatial microsimulation as a living, evolving set of techniques rather than a prescriptive formula for arriving at the ‘right’ answer. Spatial microsimulation is largely defined by its user-community, made up of a growing number of people worldwide. This book aims to contribute to the community by encouraging collaboration, innovation and rigour. It also encourages playing with the methods and ‘getting your hands dirty’ with the code. As Kabacoff (2011) put it regarding R, “the best way to learn is to experiment”.
Why spatial microsimulation with R?
Software decisions have a major impact on the flexibility, efficiency and reproducibility of research. Nearly three decades ago Clarke and Holm (1987) observed that “little attention is paid to the choice of programming language used” for microsimulation. This appears to be as true now as it was then. Software is rarely discussed in papers on the subject and there are few mature spatial microsimulation packages.1 Factors that should influence software selection include cost, maturity, features and performance. Perhaps most important for busy researchers are the ease and speed of learning, writing, adapting and communicating the analysis. R excels in each of these areas.
R is a low-level language compared with statistical programs based on a strong graphical user interface (GUI) such as Microsoft Excel and SPSS. R offers great flexibility for analysing and modelling data and enables easy creation of user-defined functions. These are all desirable attributes of software for undertaking spatial microsimulation. On the other hand, R is high-level compared with general purpose languages such as C and Python. Instead of writing code to perform statistical operations ‘from scratch’, R users generally use pre-made functions. To calculate the mean value of a variable x, for example, one would need to type 20 characters in Python: float(sum(x))/len(x).2 In pure R just 7 characters are sufficient: mean(x). This terseness and range of pre-made functions is useful for ease of reading and writing spatial microsimulation models and analysing the results.
The example of calculating the mean in R and Python may be trite but illustrates a wider point: R was designed to work with statistical data, so many functions in the default R installation (e.g. lm(), to create a linear regression model) perform statistical analysis ‘out of the box’. In agent-based modelling, the statistical analysis of results often occupies more time than running the model itself (Thiele 2014). The same applies to spatial microsimulation, making R an ideal choice due to its statistical analysis capabilities.
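As a brief illustration, the listing below simulates a simple dataset and fits a linear model to it in a single call; the variables are invented for the example:

set.seed(1) # make the simulated data reproducible
age <- 20:70 # a hypothetical explanatory variable
income <- 1000 + 20 * age + rnorm(length(age), sd = 100) # simulated response
model <- lm(income ~ age) # fit a linear regression 'out of the box'
summary(model) # coefficients, R-squared and other diagnostics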
R has an active and growing user community and is easy to extend. R’s extreme flexibility allows it to call code written in other programming languages. This means that ‘glass ceilings’ encountered in other environments are not an issue in R: there is a wide range of algorithms that can be used from within R. This ability has been used to dramatically speed up R code. The new R package dplyr, for example, uses C++ code to do the ‘heavy lifting’ of data manipulation and is used in subsequent chapters for the data preparation tasks essential to spatial microsimulation.
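As a taste of the kind of data preparation task dplyr streamlines, the sketch below aggregates a hypothetical spatial microdataset up to the zone level; the data frame and column names are invented:

library(dplyr) # fast data manipulation, with C++ doing the heavy lifting
ind <- data.frame(zone = c("a", "a", "b", "b"), age = c(20, 30, 25, 45))
ind %>%
  group_by(zone) %>% # split the microdata by zone
  summarise(mean_age = mean(age)) # compute a zone-level average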
As a result of R’s flexibility, open source ethic and strong user community there are thousands of packages that extend R’s capabilities by providing new functions and improvements are being added all the time. This book seeks to provide an up-to-date overview of the functions from mature packages for spatial microsimulation at the time of publication. It is a fast-moving game, however, so it is worth searching the internet for updates to existing packages and new packages for spatial microsimulation. New software can save both research and computer time so it’s worth staying up-to-date with the latest developments.
The ipfp and mipfp packages, for example, can greatly reduce the number of lines of code and the computational time needed for population synthesis compared with ‘hard-coding’ the method in R from scratch. (For the curious reader, the ‘IPF’ letters in the package names stand for ‘Iterative Proportional Fitting’, one of a number of technical terms we will use frequently in subsequent chapters. See the Glossary for a definition.)
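To give a flavour of what such packages automate, the following is a minimal sketch of iterative proportional fitting for a single zone, ‘hard-coded’ in base R; the individuals and constraint totals are invented, and later chapters develop the method properly:

# Individual-level data and hypothetical zone totals for two constraints
ind <- data.frame(age = c("young", "old", "young", "old"),
                  sex = c("m", "m", "f", "f"), stringsAsFactors = FALSE)
con_age <- c(young = 8, old = 4) # the zone contains 8 young, 4 old people
con_sex <- c(m = 6, f = 6) # and 6 males, 6 females
w <- rep(1, nrow(ind)) # start with uniform weights
for (i in 1:10) { # each step rescales weights to match one constraint
  w <- w * con_age[ind$age] / tapply(w, ind$age, sum)[ind$age]
  w <- w * con_sex[ind$sex] / tapply(w, ind$sex, sum)[ind$sex]
}
round(w, 2) # the weighted totals now match both constraints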
These time-saving add-ons to R are described in Chapter 5. A clear advantage of R is that anyone can write a package. This is also potentially a disadvantage: it has been argued there are too many packages, making it difficult for new users to identify which packages are most reliable and which are best suited to different situations (Hornik 2012). However, this problem is mitigated by the open-source nature of R: users can see precisely how any particular function works (provided they are willing to learn some programming), rather than relying on a “black box”.
A recent development that has made R far more accessible is RStudio, an Integrated Development Environment (IDE) for R that helps new users become familiar with the language (see Figure 4.1). Further information about why R is a good choice for spatial microsimulation is provided in the Appendix, a tutorial introduction to R for spatial microsimulation applications. The next section describes approaches to learning R in general terms.
Learning the R language
Having learned a little about why R is a good tool for the job, it is worth considering at this stage how R should be used. It is useful to think of R not as a series of isolated commands, but as an interconnected language. The code is used not only for the computer to crunch numbers, but also to communicate ideas, from one person to another. In other words, this book teaches spatial microsimulation in the language of R. Of course, English is more appropriate than R for explaining rather than merely describing the method and the language of mathematics is ideal for describing quantitative relationships conceptually. However, because the practical components of this book are implemented in R, you will gain more from it if you are fluent in R. To this end the book aims to improve your R skills as well as your ability to perform spatial microsimulation, especially in the earlier practical chapters. Some prior knowledge of R will make reading this book easier, but R novices should be able to follow the worked examples, with reference to appropriate supplementary material. (See the references listed in the Appendix.) As with learning Spanish or Chinese, frequent practice, persistence and experimentation will ensure deep learning.
A more practical piece of advice is to organise your workflow. Each project should have its own self-contained folder containing all that is needed to replicate the analysis, except perhaps large input datasets. This could include the raw (unchanged) input data3, R code for analysis, the graphical outputs and files containing data outputs. To avoid clutter, it is sensible to arrange this content into folders, as illustrated below (thanks to Colin Gillespie for this tip):
|-- book.Rmd
|-- data
|-- figures
|-- output
|-- R
| |-- load.R
| `-- parallel-ipfp.R
`-- spatial-microsim-book.Rproj
The example directory structure above is taken from an early version of this book. It contains the document for the write-up (book.Rmd, which could equally be a .docx or .tex file) and RStudio’s .Rproj file in the source directory. The rest of the entities are folders: one for the input data, one for figures generated, one for data outputs and one for R scripts. The R scripts should have meaningful names and contain only code that works and is commented. An additional backup directory could be used to store experimental code. There is no need to be prescriptive in following this structure, but projects using spatial microdata tend to be complex, so imposing order on your workflow early will likely yield dividends in the long run. If you work in a team, this kind of structure helps each member to rapidly understand and continue your work.
The same applies to learning the R language. Fluency allows complex numerical ideas to be described with a few keystrokes. If you are a new R user it is therefore worth spending some time learning the R language. To this end the Appendix provides a primer on R from the perspective of spatial microsimulation.
Typographic conventions
The following typographic conventions are followed to make the practical examples easier to follow:
- In-line code is provided in monospace font to show it is something the computer understands.
- Larger blocks of code, referred to as listings, are provided on separate lines and have coloured syntax highlighting to distinguish between values, names and functions:
x <- c(1, 2, 5, 10) # create a vector
sqrt(x) # find the square root of x
## [1] 1.000 1.414 2.236 3.162
- Output from the R console is preceded by the ## symbol, as illustrated above.
- Comments are preceded by a single # symbol to explain specific lines.
- Often, reference will be made to files contained within the book’s project folder. The notation used to refer to the location of these files follows the way we refer to files and folders on Linux computers. Thus ‘R/CakeMap.R’ refers to a file titled ‘CakeMap.R’ within the ‘R’ directory of the project’s folder, as in the example below.
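For instance, to run that script from the project’s root folder (assuming the file is present), one would enter:

source("R/CakeMap.R") # run the script; the path is relative to the project root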
There are many ways to write R code that will generate the same results. However, to ensure clarity and consistency, a single style, advocated in a chapter in Hadley Wickham’s Advanced R, is followed throughout (Wickham 2014a). Consistent style and plentiful comments will make your code readable by yourself and others for decades to come.
An overview of the book
This document is a working draft of a book to be published in CRC Press’s R Series in summer 2015. Any comments, relating to code, content or clarity of explanation will be gratefully received.4
The book’s chapters progress in a logical order that corresponds to the steps typically taken during a spatial microsimulation research project that generates, analyses and models spatial microdata. The next two chapters are primarily conceptual, introducing the concept and applications of spatial microsimulation in more detail, whilst the remaining chapters are practical. Because we’ll be making heavy use of terminology specific to spatial microsimulation, it is recommended that you read these introductory chapters before tackling the practical sections that follow, although this is not essential. If any of the words do not make sense, the Glossary at the end of the book may help clarify what they mean. Chapters 4, 5, 6 and 7 encapsulate the process of population synthesis, which generates spatial microdata and is key to much spatial microsimulation work. They should be read together in that order. Chapters 8, 9 and 10 explain and demonstrate methods of using spatial microdata to better understand the world. These chapters are more self-standing and can be read in any order. The chapter titles and brief descriptions of their contents are as follows:
- What is spatial microsimulation? introduces the concepts and applications of spatial microsimulation not only as a narrow methodology but as an approach to understanding real-world phenomena.
- SimpleWorld is a simple and reproducible explanation of the spatial microsimulation method with reference to an imaginary planet.
- Data preparation is dedicated to the boring but vital task of loading and ‘cleaning’ the input data, ready for spatial microsimulation.
- Spatial microsimulation in R introduces the main functions and techniques that are used to generate spatial microdata building on the previous chapters.
- Population synthesis without input microdata shows how to produce spatial microdata when no individual-level input data are available.
- Spatial microsimulation in the wild is a larger and more involved example using real data. In this chapter you will estimate geographical variation in cake consumption.
- Model checking and evaluation tackles the understudied but crucially important questions of ‘is the model working?’ and ‘do the results coincide with reality?’. Here you will learn appropriate tests and checks to verify spatial microsimulation outputs and to think about their underlying assumptions.
- Visualising and analysing spatial microdata is dedicated to displaying the complex data generated through spatial microsimulation in a concise, compelling and beautiful way.
- Spatial microdata in agent-based models describes how the output from spatial microsimulation can be used as an input into more complex models.
- Additional tools and techniques introduces further methods and applications, including a description of R packages for spatial microsimulation.
Notes
1. The Flexible Modelling Framework (FMF), written in Java, is a notable exception that can perform various modelling tasks.
2. The float function is needed in case whole numbers are used. This can be reduced to 13 characters with the excellent NumPy package: import numpy; x = [1,3,9]; numpy.mean(x) would generate the desired result. The R equivalent is x = c(1,3,9); mean(x).
3. Raw data should be kept safely on an external hard disk or a server if it is large or sensitive.
4. Feedback can be left via email to r.lovelace@leeds.ac.uk or via the project’s GitHub page (http://github.com/Robinlovelace/spatial-microsim-book).