Spatial Microsimulation with R by Robin Lovelace


Alternative approaches to population synthesis

GREGWT

As described in the Introduction, IPF is just one strategy for obtaining a spatial microdataset. However, researchers tend to select one method that they are comfortable with and stick with it for their models. This is understandable because setting up a method is usually time consuming: most researchers rightly focus on applying the methods to the real world rather than fretting about the details. On the other hand, if alternative methods work better for a particular application, resistance to change can result in poor model fit. In the case of very large datasets, spatial microsimulation may not be possible unless certain methods, optimised to deal with large datasets, are used. Above all, there is no consensus about which methods are ‘best’ for different applications, so it is worth experimenting to identify which method is most suitable for each application.

An interesting alternative to the IPF method is the GREGWT algorithm. First implemented in the SAS language by the Statistical Service area of the Australian Bureau of Statistics (ABS), the algorithm reweights a set of initial weights using a Generalized Regression Weighting procedure (hence the name GREGWT). The resulting weights ensure that, when aggregated, the individuals selected for each small area fit the constraint variables. Like IPF, GREGWT results in non-integer weights, meaning that some kind of integerisation algorithm will be needed to obtain a final individual-level population, for example if the output is to be used in agent-based modelling (ABM). The macros developed by the ABS add a weight restriction to ensure positive weights; the ABS uses the Linear Truncated Method described in Singh and Mohl (1996) to enforce these restrictions.

A clear, simplified example of this algorithm (and other algorithms) can be found in Rahman (2009). Tanton et al. (2011) provide a full description of the algorithm, and for a deeper understanding of the SAS macros see Bell (1999). An R implementation of GREGWT has been created by Esteban Muñoz and can be found in the GitHub repository mikrosim.

We use this implementation of the GREGWT algorithm with the SimpleWorld data to present the use of the GREGWT library and the resulting weights matrix.

The code below loads the SimpleWorld data and creates new data frames holding this data. Version 1.4 of the R library requires the data to be in binary form. The required inputs for the R function are X, representing the individual-level survey; dx, representing the initial weights of the survey; and Tx, representing the small-area benchmarks.
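A minimal sketch of this data preparation is shown first. The file paths and the exact construction of X are assumptions for illustration; the essential point is that X holds the survey in binary (dummy variable) form, with columns matching the benchmarks, and dx holds uniform starting weights.

# Load the SimpleWorld data (file paths assumed for illustration)
ind <- read.csv("data/SimpleWorld/ind.csv") # individual-level survey
age <- read.csv("data/SimpleWorld/age.csv") # age constraints per zone
sex <- read.csv("data/SimpleWorld/sex.csv") # sex constraints per zone
# Binary (dummy variable) form of the survey, matching the benchmark columns
X <- data.frame("a0.49" = as.numeric(ind$age <  50),
                "a.50"  = as.numeric(ind$age >= 50),
                "f"     = as.numeric(ind$sex == "f"),
                "m"     = as.numeric(ind$sex == "m"))
dx <- rep(1, nrow(ind)) # uniform initial weights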

# Install GREGWT (uncomment/alter as appropriate)
# url = "https://github.com/emunozh/mikrosim/raw/master/GREGWT_1.4.tar.gz"
# download.file(url = url, method = "wget", destfile = "GREGWT_1.4.tar.gz")
# install.packages("GREGWT_1.4.tar.gz", repos = NULL, type = "source")
# load the library (1.4)
library('GREGWT')
fweights <- NULL # vector in which to accumulate the weights for each area
# We loop through each area
for(area in seq(dim(age)[1])){
    # True population totals for this area (Benchmarks)
    Tx <- data.frame("a0.49" = age[area, 1],
                     "a.50"  = age[area, 2],
                     "f"     = sex[area, 2],
                     "m"     = sex[area, 1])
    # Get new weights with GREGWT
    Weights.GREGWT <- GREGWT(X, dx, Tx, bounds = c(0, Inf))
    fw <- Weights.GREGWT$Final.Weights
    # store the new weights
    fweights <- c(fweights, fw)
}
#> GREGWT .OK   OK  GREGWT .OK  OK  GREGWT .OK  OK  

The R library implementing the GREGWT algorithm reweights the survey for one area at a time, and therefore we need to construct a loop to iterate over each area. The estimated weights are combined into a long vector that will later be transformed into the weights matrix.

fweights <- matrix(fweights, nrow = nrow(ind))
fweights
#>       [,1]   [,2]   [,3]
#> [1,] 1.143 2.5714 1.8571
#> [2,] 1.143 2.5714 1.8571
#> [3,] 3.714 0.8571 4.2857
#> [4,] 1.714 2.8571 0.2857
#> [5,] 4.286 1.1429 2.7143

In the last step we transform the vector into a matrix and see the results of the reweighting process.

Population synthesis as an optimization problem

In general terms, an optimization problem consists of a function, the result of which must be minimised or maximised, called an objective function. This function is not necessarily defined for the entire domain of possible inputs. The domain where this function is defined is called the solution space (or the feasible space in formal mathematics). This implies that optimization problems can be unconstrained or constrained, by limits on the values that the arguments (or a function of the arguments) of the objective function can take (Boyd 2004). If there are constraints, the solution space includes only a part of the domain of the objective function: the constraints define the solution space. Under this framework, population synthesis can be seen as a constrained optimisation problem. Suppose x is a vector of length n (x_1, x_2, ..., x_n) whose values are to be adjusted. In this case the value of the objective function, f_0(x), depends on x. The possible values of x are defined by par, a vector of predefined arguments or parameters of length m, where m is the number of constraints (par_1, par_2, ..., par_m). This kind of problem can be expressed as:


$$\left\{ \begin{array}{l} min \hspace{0.2cm}f_0(x_1,x_2,..,x_n) \\ \\ s.c. \hspace{0.2cm} f_i(x) \leq par_i,\ i = 1, ..., m \end{array} \right.$$

Applying this to the problem of population synthesis, the parameters par_i are all 0, since all cells have to be non-negative. The objective function f_0(x), to be minimised, is the distance between the aggregated weight matrix and the aggregate constraint variable cons. Here x represents the weights, which will be calculated so as to minimise f_0(x).

To illustrate the concept further, consider the case of aircraft design. Imagine that the aim (the objective function) is to minimise the weight of the aircraft by changing its shape and materials. These modifications must proceed subject to some constraints, because the airplane must be safe and sufficiently voluminous to transport people. Constrained optimisation in this case would involve searching for combinations of shape and material (to include in x) that minimise the weight (the result of f_0, a single value depending on x). This search must take place under constraints relating to volume (depending on the shape) and safety (par_1 and par_2 in the above notation). Thus the par values define the domain of the solution space, and we search inside this domain for the combination of x_i values that minimises the weight.

The case of spatial microsimulation has relatively simple constraints: all weights must be positive or zero:


$$ \left\{ weight_{ij} \in \mathbb{R}^+ \cup \{0\} \hspace{0.5cm} \forall i,j \right\} $$

Seeing spatial microsimulation as an optimisation problem allows solutions to be found using established techniques of constrained optimisation. The main advantage of this re-framing is that it allows any optimisation algorithm to perform the reweighting.

To see population synthesis as a constrained optimization problem analogous to aircraft design, we must define the problem to optimise, the variable x, and then set the constraints.
Intuitively, we search for the number of occurrences of each individual needed to form the final population that best fits the constraints. We could take the weight matrix as x and, as the objective function, the difference between the population implied by this weight matrix and the constraints. However, we also want to include information about the distribution of the sample. Thus, we want to find a vector of parameters par that will multiply the indu matrix, which is similar to ind_cat but contains only unique rows, with the cells containing the number of times that each kind of individual appears in the sample. We want the result of this multiplication to be as close as possible to the constraints.

As with IPF, we will proceed zone by zone. Our optimization problem for the first zone can thus be written as follows:


$$\left\{ \begin{array}{l} min \hspace{0.2cm} f(par_1,..,par_m) = DIST(sim, cons[1,]) \\ \hspace{0.8cm} where \hspace{0.2cm} sim=colSums(indu * par)\\ \\ s.c. \hspace{0.2cm} par_i \geq 0,\ i = 1, ..., m \end{array} \right.$$

Key to this is interpreting individual weights as parameters (the vector par, of length m above) that are iteratively modified to optimise the fit between individual and aggregate-level data. Note that, in comparison with the theoretical definition of an optimisation problem, our par plays the role of the theoretical x. The measure of fit (the distance) we use in this context is Total Absolute Error (TAE):


$$\left\{ \begin{array}{l} min \hspace{0.2cm} f(par_1,..,par_m) = TAE(sim, cons[1,]) \\ \hspace{0.8cm} where \hspace{0.2cm} sim=colSums(indu * par)\\ \\ s.c. \hspace{0.2cm} par_i \geq 0,\ i = 1, ..., m \end{array} \right.$$

We have chosen the distance metric TAE here, but the same could be done with other distance metrics.

Note that in the above, par is equivalent to the weights object we have created in previous sections to represent how representative each individual is of each zone. The main issue with this definition of reweighting is therefore the large number of free parameters: one per individual in the individual-level dataset. Clearly this number can be very large. To overcome this issue, we must ‘compress’ the individual-level dataset to its essence, so that it contains only individuals who are unique with respect to the constraint variables (constraint-unique individuals).

The challenge is to convert the binary ‘model matrix’ form of the individual-level data (ind_cat in the previous examples) into a new matrix (indu) that has fewer rows of data. Information about the frequency of each constraint-unique individual is kept by increasing the value of the ‘1’ entries in each column for replicated individuals by the number of individuals who share the same combination of attributes. This may sound abstract, so let’s use the example of SimpleWorld to illustrate the point.
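A minimal sketch of this compression is shown below, assuming ind_cat for the five SimpleWorld individuals as constructed in previous chapters (the column names and helper code here are illustrative, not the book’s exact script):

# ind_cat for SimpleWorld: one row per individual, binary dummy columns
ind_cat <- data.frame("agea0_49" = c(0, 0, 1, 0, 1),
                      "agea50"   = c(1, 1, 0, 1, 0),
                      "sexm"     = c(1, 0, 1, 0, 0),
                      "sexf"     = c(0, 1, 0, 1, 1))
# Identify constraint-unique individuals and how often each occurs
key <- apply(ind_cat, 1, paste, collapse = "")
counts <- as.numeric(table(key)[unique(key)])
# Keep one row per unique combination, scaled by its frequency
indu <- ind_cat[!duplicated(key), ] * counts
indu # rows 1, 3, 4 and 5 remain; row 1 now contains 2s (individuals 1 and 2)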

Reweighting with optim and GenSA

The base R function optim provides a general purpose optimization framework for numerically solving objective functions. Based on the objective function for spatial microsimulation described above, we can use any general optimization algorithm for reweighting the individual-level dataset. But which to use?
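To make this concrete, below is a minimal sketch of the TAE objective function and a single optim() call, assuming indu and cons as defined above; the function and argument names are illustrative, and the book’s actual tests are in ‘R/optim-tests-SimpleWorld.R’. The "L-BFGS-B" method is used here because it is the only optim method that can enforce the lower bound par_i ≥ 0 directly; the other optim methods discussed below do not enforce bounds.

# Total Absolute Error between simulated and observed aggregates
tae <- function(par, indu, con) {
  sim <- colSums(indu * par) # aggregates implied by the candidate weights
  sum(abs(sim - con))        # distance to the zone's constraints
}
# Reweight for zone 1; lower = 0 enforces non-negative weights
opt <- optim(par = rep(1, nrow(indu)), fn = tae, indu = indu,
             con = as.numeric(cons[1, ]), method = "L-BFGS-B", lower = 0)
opt$par   # estimated weights for the constraint-unique individuals
opt$value # remaining total absolute error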

Different reweighting strategies are suitable in different contexts and there is no clear winner for every occasion. However, testing a range of strategies makes it clear that certain algorithms are more efficient than others for spatial microsimulation. Figure x demonstrates this variability by plotting total absolute error as a function of the number of iterations for various optimization algorithms available from the base function optim and the GenSA package. Note that the comparisons are performed only for zone 1.

Figure x shows that all algorithms achieve a substantial improvement during the first iteration. The advantage of the IPF algorithm we have been using, which converges rapidly to an error very close to zero (as seen before, exactly zero is not reachable in finite-precision computation), is observable after only the first iteration. At the other end of the spectrum is R’s default optimization algorithm, the Nelder-Mead method. Although the graph shows no improvement from one iteration to the next, the algorithm is just ‘warming up’ at this stage, and each of its iterations is very fast, as we shall see. After 400 iterations (which happen in the time that other algorithms take for a single iteration!), Nelder-Mead begins to converge: it works effectively. However, it requires far more iterations than IPF to converge to a value approximating zero. Next best in terms of iterations is GenSA, the Generalized Simulated Annealing Function from the GenSA package. GenSA attained a near-perfect fit after only two full iterations.

The remaining algorithms shown are, like Nelder-Mead, available from within R’s default optimisation function optim. The implementations with method set to "BFGS" (short for the Broyden–Fletcher–Goldfarb–Shanno algorithm) and "CG" (‘conjugate gradients’) performed roughly the same, steadily approaching zero error and catching up with IPF and GenSA after 10 iterations. Finally, the SANN method, also available in optim, performed the most erratically of the methods tested. This is another implementation of simulated annealing, and it demonstrates that optimisation functions that depend on random numbers do not always improve the fit from one iteration to the next. Even after 200 iterations, the fit continues to oscillate without improving at all.

The code used to test these alternative methods for reweighting is provided in the script ‘R/optim-tests-SimpleWorld.R’. The results should be reproducible on any computer, provided the book’s supplementary materials have been downloaded. There are many other optimisation algorithms available in R through a wide range of packages, and new and improved functions are being made available all the time. Enthusiastic readers are encouraged to experiment with the methods presented here: it is possible that an algorithm exists which outperforms all of those tested for this book. Also, it should be noted that the algorithms were tested on the extremely simple and rather contrived example dataset of SimpleWorld. Some algorithms may perform better with larger datasets than others and may be sensitive to changes in the initial conditions, such as the problem of ‘empty cells’.
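A minimal template for such experiments, assuming the tae() function and the indu and cons objects sketched above, might look as follows (a starting point for your own tests, not the book’s exact script):

# Compare final TAE across optim methods after a fixed iteration budget
methods <- c("Nelder-Mead", "BFGS", "CG", "SANN")
res <- sapply(methods, function(m) {
  set.seed(2014) # SANN uses random numbers; fix the seed for repeatability
  optim(rep(1, nrow(indu)), tae, indu = indu, con = as.numeric(cons[1, ]),
        method = m, control = list(maxit = 10))$value
})
res # total absolute error per method after 10 iterations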

Therefore these results, as with any modelling exercise, should be interpreted with a healthy dose of skepticism: just because an algorithm converges after few ‘iterations’ does not mean it is inherently any faster or more useful than another. The results are context specific, so it is recommended that the testing framework in ‘R/optim-tests-SimpleWorld.R’ is used as a basis for further tests of algorithm performance on the datasets you are using. IPF has performed well in the situations I have tested it in (especially via the ipfp function, which performs disproportionately faster than the pure R implementation on large datasets), but this does not mean that it is always the best approach.

To overcome the caveat that the meaning of an ‘iteration’ changes dramatically from one algorithm to the next, further tests measured the time taken for each reweighting algorithm to run. To keep the graph readable, we do not plot the error as a function of time, but the time per algorithm as a function of the maxit argument (Figure xx). This figure demonstrates that an iteration of GenSA takes a long time in comparison with the other algorithms. "BFGS" and "CG" follow a similar curve, below that of GenSA. Iterations of Nelder-Mead, SANN and ipf take less time. Overall, ipf appears to be the best choice here: it needs few iterations and the time taken for those few iterations is short.
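Timing a single run is straightforward with base R’s system.time(); a minimal sketch, assuming the tae() function and indu from above:

# Time a Nelder-Mead run with a 400-iteration budget for zone 1
system.time(
  optim(rep(1, nrow(indu)), tae, indu = indu, con = as.numeric(cons[1, ]),
        method = "Nelder-Mead", control = list(maxit = 400))
)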

Nelder-Mead is fast at reaching a good approximation of the constraint data, despite taking many iterations. GenSA, on the other hand, is shown to be much slower than the others, despite only requiring 2 iterations to arrive at a good level of fit.

Note that these results are biased by the example, which is very small and runs only for the first zone.

Combinatorial optimisation

Combinatorial optimisation is an entire field offering a different approach to solving optimisation problems. Instead of having one candidate solution that evolves through the iterations, combinatorial optimisation forms a set of feasible candidates and then evaluates them using the objective function to be minimised. There are several types of combinatorial optimisation, depending on how the candidate combinations are chosen.
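To illustrate the candidate-and-evaluate idea in its most naive form, the toy random search below proposes integer candidate weights and keeps the best one, assuming indu and cons as above; a real combinatorial optimisation algorithm would generate candidates far more cleverly:

set.seed(2014)
best <- NULL
best_tae <- Inf
for (i in 1:1000) {
  cand <- sample(0:10, nrow(indu), replace = TRUE) # integer candidate weights
  e <- sum(abs(colSums(indu * cand) - as.numeric(cons[1, ])))
  if (e < best_tae) { # keep the candidate with the lowest TAE so far
    best <- cand
    best_tae <- e
  }
}
best; best_tae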

This technique can be seen as an alternative to IPF for allocating individuals to zones. The strategy is probabilistic and results in integer weights (since it selects a combination of individuals), as opposed to the fractional weights generated by IPF. Combinatorial optimisation may be more appropriate for applications where the input individual microdatasets are very large: the speed benefits of the deterministic IPF algorithm shrink as the size of the survey dataset increases. As seen before, IPF creates non-integer weights, but we have proposed two solutions for transforming them into a final individual-level population. The proportionality of IPF is thus more intuitive, but IPF needs to calculate the whole weights matrix in each iteration, whereas combinatorial optimisation (CO) just proposes candidates. However, if the objective function takes a long time to evaluate, CO must perform this evaluation for each candidate.

Genetic algorithms belong to this field and have become popular in some domains, such as industry. These kinds of algorithms can be very effective when the objective function has several local minima and the aim is to find the global one (Hermes and Poulsen, 2012).

There are two approaches to reweighting using combinatorial optimisation in R: shuffling individuals in and out of each area, and constrained optimisation with the domain of the solution space set to allow integer-only results.

The second approach to combinatorial optimisation in R depends on methods that allow only integer solutions to the general constrained optimisation formulae demonstrated in the previous section. Integer programming is the branch of computer science dedicated to this area, and it is associated with its own algorithms and approaches, some of which have been implemented in R (Zubizarreta, 2012).

To illustrate how the approach works in general terms, we can use the data.type.int argument of the genoud function in the rgenoud package. This ensures that the genetic algorithm used to select the parameters returns only integer results:

# Load the genetic optimisation package
library(rgenoud)
# Set min and maximum values of the weights with 'Domains'
m <- matrix(c(0, 100), ncol = 2)[rep(1, nrow(indu)),]
set.seed(2014)
# fun is the objective function (TAE) defined in 'R/optim-tests-SimpleWorld.R'
genoud(nrow(indu), fn = fun, ind_num = indu, con = cons[1,],
  control = list(maxit = 1000), data.type.int = TRUE, Domains = m)

This command, implemented in ‘optim-tests-SimpleWorld.R’, results in weights for the unique individuals 1 to 4 of 1, 4, 2 and 4 respectively. This means a final population whose aggregated data are equal to the following (as seen in the previous section):

colSums(indu * c(1, 4, 2, 4))
#> ind$agea0_49  ind$agea50+     ind$sexm     ind$sexf 
#>            8            4            6            6

Note that we performed the test only for zone 1, and these aggregated data exactly match the first constraint. Moreover, because the algorithm directly considers only integer weights, we avoid the issue of fractional weights associated with IPF, which generates a perfect fit only for non-integer weights. Combinatorial optimisation algorithms for population synthesis do not rely on integerisation, which can damage model fit. The fact that the gradient contains NA values at the end of the algorithm’s output is not a problem; it just means that gradients were not calculated.

Note that there can be several solutions which attain a perfect fit, and the result depends on the random seed chosen for the random draw. Indeed, if we choose a seed of 0 (by writing set.seed(0) instead of set.seed(2014) as before), we obtain the weights (0, 6, 4, 2), which also result in a perfect fit for zone 1:

colSums(indu * c(0, 6, 4, 2))
#> ind$agea0_49  ind$agea50+     ind$sexm     ind$sexf 
#>            8            4            6            6

These two potential synthetic populations reach a perfect fit, but are quite different. Indeed, we can observe the two populations:

indu * c(1, 4, 2, 4)
#>   ind$agea0_49 ind$agea50+ ind$sexm ind$sexf
#> 1            0           2        2        0
#> 3            4           0        4        0
#> 4            0           2        0        2
#> 5            4           0        0        4
indu * c(0, 6, 4, 2)
#>   ind$agea0_49 ind$agea50+ ind$sexm ind$sexf
#> 1            0           0        0        0
#> 3            6           0        6        0
#> 4            0           4        0        4
#> 5            2           0        0        2

One point of comparison is that the second population contains no males over 50 years old, whereas the first contains two. With this method, there cannot be a population with exactly one male over 50, because we use integer weights and this category is represented by two men in the sample, so each unit of weight adds two such individuals. This is a disadvantage of algorithms that produce integer weights directly. With IPF, a weight between 0 and 1 for such an individual can still yield one person in this category.

genoud is used here only to provide a practical demonstration of the possibilities of combinatorial optimisation using existing R packages.

For combinatorial optimisation algorithms designed for spatial microsimulation we must, for now, look for programs outside the R ‘ecosystem’. Harland (2013) provides a practical tutorial introducing the subject based on the Flexible Modelling Framework (FMF) Java program.