9 Population synthesis without microdata

Sometimes no representative individual level dataset is available as an input for population synthesis. In this case, the methods described in the previous chapters must be adapted accordingly. The challenge is still to generate spatial microdata that fits all the constraint tables, but based on a purely synthetic ‘seed’ input cross-tabulated contingency table. Many combinations of individual level data could correspond to these distributions. Depending on the aim of the spatial microsimulation model, simply finding one reasonable fit can be sufficient.

In other cases a fit based on entropy maximisation may be required. This concept involves finding the population that is most likely to represent the micro level populations (Bierlaire, 1991) (M. 1991). This chapter demonstrates two options for population synthesis when real individual level data is unavailable:

• Global cross-tables and local marginal distributions (9.1) explains a method for cases where the constraints consist in cross-tables not spatially located and local marginal distributions.
• Two level aggregated data (9.2) contains a procedure to make a spatial microsimulation when having data at different aggregated levels, for example, one for the provinces and one for the districts.

9.1 Global cross-tables and local marginal distributions

Assume we have a contingency table of constraint variables for the entire study area (but not at the local level) in the aggregate level data. This multi-dimensional cross-table (the seed) could be the result of a previous step such as the implementation of IPF re-weight individual level data to fit the case-study area of interest.

If the marginal distributions for small areas are known, we can use the mipfp function as previously shown. If, however, the only information about the zones is the total population living there, the function is usable only when considering the zone as a variable. In this specific case, having no additional data, the only option corresponds to re-scale the global cross-table for each zone. Note that this implies that the correlations between the variables are independent of the zone in question.

To illustrate, we will develop the SimpleWorld example (which can be loaded from the book’s data directory by entering source("code/SimpleWorld.R") or was previously loaded if you as followed (4)) with adapted constraints. When watching the available data in an aggregated level, we have for the moment:

# Cross-tabulation of individual level dataset
table(ind$age, ind$sex)
##
##         f m
##   a0_49 1 1
##   a50+  1 2
(total_pop <- rowSums(con_sex)) # total population of each zone
## [1] 12 10 11

To illustrate this section, the local constraint will be the total number of people in each zone (last column of consTot). The global constraint is a matrix of the form of the cross-table between age and sex, but including the total population (33 people for SimpleWorld). The new constraints could be:

# Global Constraint possible for SimpleWorld
global_cons <- table(ind$age, ind$sex)
global_cons[1,] <- c(6,9)
global_cons[2,] <- c(7,11)

# Local Constraint for SimpleWorld
local_cons <- total_pop

When only the total population is known for each zone, the best way to create a synthetic population is to simply re-scale the cross-table. For each zone, a table proportional to the global one is created. The results are stored in a three dimensional array, which first dimension represents the zone. The initialisation of the resulting matrix is the first step. We here fill in the table with “0”.

# initialise result's array and its names
resNames <- list(1:nrow(cons), rownames(global_cons),
colnames(global_cons))
res <- array(0, dim=c(nrow(cons), dim(global_cons)),
dimnames=resNames)

Now the final weight table is calculated, simply by taking the global matrix and re-scaling it to fit the the desired marginals. In this way we keep the global proportions, but with the correct total per zone. Note that making this process is exactly the same as running mipfp on the seed table with as constraints only the zone marginals.

# Re-scale the cross-table to fit the zone's constraints
for (zone in 1:length(total_pop)){ # loop over the zones
res[zone,,] <- global_cons * total_pop[zone] / sum(global_cons)
}

# Print the cross-table for zone 1
res[1,,]
##          f    m
## a0_49 2.18 3.27
## a50+  2.55 4.00

We can verify that the total population per zone is of the desired size. We can also check the global table of age and sex. This means that we have now weights fitting well all available data.

# Check the local constraints for each zone (should be TRUE)
for (zone in 1:length(total_pop)){
print( sum(round(res[zone,,])) == total_pop[zone] )
}
## [1] TRUE
## [1] TRUE
## [1] TRUE
# Save the global final table
SimTot <- apply(res,c(2,3),sum)

# Check the global constraint (should be 0)
sum(SimTot - global_cons)
## [1] 0

As with IPF, the fractional result needs to be integerised to create spatial microdata. The round() function generally provides a reasonable approximation, in terms of fitting the constraints. However, the aforementioned integerisation algorithms such as truncate, replicate, sample (TRS) can also be used. This method cannot be followed exactly, because we want to perfectly fit the few constraints we have. In our example, a satisfactory result is achieved by using the round function, as shown in the code below.

# Integerisation by simply using round
resRound <- round(res)
resTruncate <- floor(res) # take the minimum integer value

# Zero error achieved by rounding for global constraint
apply(resRound, c(2,3), sum) - global_cons
##       f m
## a0_49 0 0
## a50+  0 0
# Zero error achieved by rounding for local constraint
apply(resRound,c(1),sum) - local_cons
## 1 2 3
## 0 0 0

It is due to luck (and the small size of the SimpleWorld example) that the round method works in this case: in most cases there will be errors due to rounding. If a zone had 4 individuals and three categories, for example, the resulting weights could be $$(\frac{4}{3},\frac{4}{3},\frac{4}{3})$$. Then, the rounding gives $$(1,1,1)$$ and there would be too few individuals in the synthetic population (3 not 4). We can try the algorithms proposed in (5.3). However, as illustrated by the following code chunks, these integerisation methods lead to errors in relation to the constraints.

# Integerisation with pp
res_pp <- int_pp(res)

apply(res_pp, c(2,3), sum) - global_cons
##
##          f  m
##   a0_49  0  3
##   a50+   1 -4

These errors are often very small and if you model a whole country, the relative error is small. Note that this little error comes from the random draw at last stage of the algorithm. Here, TRS is better than PP, as explained in Chapter 5.

# Integerisation with trs
set.seed(17)
res_trs <- res_pp <- array(dim = dim(res))
# Apply trs (see code/functions.R to see how int_trs works)
res_trs[] <- int_trs(res)

# Print the errors
apply(res_trs, c(2,3), sum) - global_cons
##
##          f  m
##   a0_49  0 -1
##   a50+   1  0

If desired, we can adapt TRS to ensure it fits fit the constraints at the end of the process. To adapt the method to use TRS, we first truncate33 the data and identify the missing individuals, in terms of constraints.

# Truncate
resTruncate <- floor(res)

# number of missing individuals
sum(total_pop) - sum(resTruncate)
## [1] 4

This means that, in total, 4 individuals are missing after we have truncated. We will have to chose which categories will be incremented. For this, the basic TRS take the decimal parts of the weights (that were forgotten when truncate) and make a random draw inside this distribution. This is in this step that we can add an error in terms of the constraints. To make a better fit after integerisation, we need to observe in which category and in which zone we have to add individuals.

# Calculate the total simulated cross-table
# After truncate
SimTotTruncate <- apply(resTruncate,c(2,3),sum)

# Number of missing individuals per category
# After truncate
ToAdd
##
##         f m
##   a0_49 1 1
##   a50+  1 1
# Number of missing individuals per zone
# After truncate
ToComplete <- local_cons - apply(resTruncate,c(1),sum)
ToComplete
## 1 2 3
## 1 2 1

We observe that the individuals to add are one per category of age and sex. In terms of zones, one individual is missing in zone one and three, whereas two persons will have to be added in zone 2.

The principle now is to add people in the not completed zones and categories. The cells to be incremented are always chosen as the one with the bigger decimal parts (whereas in TRS, these decimals act as probabilities). Note that we chose to adapt the resTruncate instead of defining another tabular.

The code works as followed: As long as there are missing individuals in the matrix, we look at next biggest decimal and add an individual in the corresponding cell if someone is missing in this category and in this zone.

# Calculate the decimals left by truncate
decimals <- res - resTruncate

# Adapting resTruncate to fit all constraints
while (sum(total_pop) - sum(resTruncate) > 0){
# find the biggest decimals
i <- which( decimals == max(decimals), arr.ind = TRUE)

# remember we already considered this cell
decimals[i] <- 0

# if this zone still miss individuals
if (ToComplete[i[1]] > 0){
# if this category still miss individuals
resTruncate[i] <- resTruncate[i] + 1
ToComplete[i[1]] <- ToComplete[i[1]] - 1
}
}
}

The new values in resTruncate follow all constraints. The adaptation of TRS could be avoided by using combinatorial optimization to integerise. Indeed, this process could choose for each cell to take the integer just under or just above the weight by optimising the fit to the constraints. We use TRS here because it is faster and requires fewer lines of code (see Lovelace and Ballas 2013 for more detail).

After the integerisation, the last step to get the final individual dataset is the expansion. This stage is intuitive, since we have now a table containing the number of individuals in each category. Thus, we simply need to replicate the combination of categories the right number of times.

We can first flatten the 3–dimensional matrix. Then, the final individual micro dataset ind_data is created.

countData <- as.data.frame.table(resTruncate)
indices <- rep(1:nrow(countData), countData\$Freq)
ind_data <- countData[indices,]

9.2 Two level aggregated data

We present here how to find a possible distribution per zone when having only aggregated data, but in two different levels of aggregation. For example, we have some data for the municipalities and other for the districts. A first proposition can be to use a genetic algorithm that minimises the distance between the constraint and the simulation. This can give very good solutions, but need a high level understand of optimisation and is rare in the literature for the moment. For classical problems, a simpler method is available. The basis of this method is explained here. The solution proposed by Barthélemy and Toint (2013), and used in this book, is to generate a ‘seed’ before executing IPF.

This paper demonstrates the simulation of a population with four characteristics per individual: the gender, the age class, the diploma level and the activity status and at the municipality level. Their available data was:

1. At municipality level: the cross table gender x age and the marginals of diploma level and activity status;
2. At district level: the cross tables gender x activity status, gender x diploma level, age x activity status and age x diploma level.

Note that a district contains several municipalities, but each municipality is associated to only one district. We consider the marginals of the tables being consistent. If not, a preliminary step is necessary to re-scale the data to avoid shifting to probabilities. We chose to do this to have the best chance to fit the data well. When shifting to probabilities, it is more difficult to adapt the distributions during the iteration. Indeed, when considering the theoretical counts, if you create a young women, you just need to take the cell ‘young’ and ‘women’ and make minus one. When considering probabilities, when you create a young woman, you have to recalculate all probabilities, because you still need less women and proportionally, more men than before. This is the reason why we prefer to adapt the distributions to the one we are the more confident.

The global idea of their method is to proceed in two steps. First, they simulate the cross table of the four variables per district. Then, this table is considered as the seed of the IPF algorithm to simulate the distributions per municipality. During this second stage, the data concerning the municipality are used as constraints. How to execute the second part has been explained in the first section of this chapter. The point here is to develop the process, per district, to simulate the four–dimensional cross table, with the available data. This is also done in two steps :

1. Generate age x gender x diploma level and age x gender x professional status;
2. Generate age x gender x diploma level x professional status.

For the first step, we will explain only the creation of the first cross table, since the second reasoning is similar. The idea is simply to proceed proportionally to respect both available tables. The pseudo-code below corresponds to the code provided by Barthélemy and Toint (2013).

For the clarity of the formal formula, we rename gender (A), age (B) and diploma level (C). To create the cross table of these three variables, we have at the district level the cross tables gender x diploma level (renamed AC) and age x diploma level (renamed BC). Then, the cells of the three–dimensional table is defined for each gender $$g$$, age $$a$$ and diploma level $$d$$ as followed :

$ABC(g,a,d)=\frac{AC(g,d)}{margin(d)}BC(a,d)$

The formula is intuitive. The fraction gives the proportion of each gender inside the considered category of diploma level. Then, this proportion splits the number of persons having characteristics a and d into the category of g. For example, in the specific case of defining (Male, Young, Academics), we will have :

$ABC(Male, Young, Aca)=\frac{AC(Male,Aca)}{\#Aca}BC(Young,Aca)$

Suppose we have 50 young academics out of 150 academics (90 female and 60 male). We would have:

$ABC(Male, Young, Aca)=\frac{60}{150}*50=20$ $ABC(Female, Young, Aca)=\frac{90}{150}*50=30$

Thus, the tables age x gender x diploma level and age x gender x professional status are simulated. The seed for the IPF function can now be established, with help of the two contingencies. These initial weights will be the distribution of the four variables inside the whole district.

This seed is generated by several iterations. The initialisation of the cross table is simply a matrix with the right number of dimensions, with “0” in impossible cells and “1” in potentially non empty cases. For example, individuals of less than 10 years cannot hold a diploma from university.

With this initial point, an IPF can be performed to fit the two previously determined three–dimensional tables. The result is a table with the four variables per district.

The final step is explained in the previous section. Indeed, we have a contingency table at the district level and the zone margins. Note that you can imagine a lot of combinations of IPF to perform a population synthesis adapted to your own data.

9.3 Chapter summary

In summary spatial microsimulation can be used in situations where no sample data is available. The techniques can be adapted to work with a synthetic population. This chapter presented two methods for creating synthetic populations in R, the selection of which should depend on the type of input constraint data available. The first method assumed access to global cross tables and local marginals. The second assumed having aggregate data at different levels.

References

M., Bierlaire. 1991. “Evaluation de la demande en trafic : quelques méthodes de distribution.” Annales de La Société Scientifique de Bruxelles 105. University of Namur: 17–66.

Lovelace, Robin, and Dimitris Ballas. 2013. “‘Truncate, replicate, sample’: A method for creating integer weights for spatial microsimulation.” Computers, Environment and Urban Systems 41 (September). Elsevier Ltd: 1–11. doi:10.1016/j.compenvurbsys.2013.03.004.

1. Note that truncate means round each weight to the first integer under the weight. This implies that we underestimated the population.