Particle Swarm Optimization

Particle Swarm Optimization (PSO) models the swarming behaviour of animals, such as flocking birds and schooling fish.

We maintain a swarm (population) of particles (solutions). Each particle has velocity, which is tempered over time. The particles have knowledge of the global best known solution that the swarm has discovered, and also the local best known solution that the individual particle has discovered. They are attracted (think magnetism or gravity) to each of these historical points according to weighting parameters.

The algorithm is as follows. As usual, I’ve simplified this a little for readability. Have a look at the pseudocode on Wikipedia for a more complete description.

  1. Given fitness function “f”
  2. For each of n particles,
    1. Select a starting point xᵢ at random (uniformly distributed) within the solution space.
    2. Assign initial velocity vᵢ at random.  This is chosen almost identically to how xᵢ is chosen, except that it may also be negative in each direction. The velocity is a vector with the dimensionality of the decision space.
  3. For each particle,
    1. Update the particle position, given xᵢ = xᵢ + vᵢ.
    2. If f(xᵢ) < f(pᵢ), replace pᵢ with xᵢ. pᵢ is the best known solution discovered by this particle.
    3. If f(xᵢ) < f(g), replace g with xᵢ. g is the best known solution discovered globally by the swarm.
    4. Update the particle velocity, given vᵢ = ω vᵢ + φp rp (pᵢ − xᵢ) + φg rg (g − xᵢ). rp and rg are random numbers between 0 and 1, drawn afresh for each update. ω, φp and φg are parameters chosen by the user.
  4. Continue (from step 3) until the termination criteria are met.

Several parameters were used above; I’ll define them now.

  • ω governs the reduction of particle velocity over time.
  • φp is a weighting factor, governing how much the particle is attracted to pᵢ, the “particle best” solution.
  • φg, also a weighting factor, governs how much the particle is attracted to g, the “global best” solution.

You will have noted that the initial velocity is quite high. Particles will initially be zipping right from one side of the decision space to the other. The velocity is slowed over time according to parameter ω, favouring exploration early on, and exploitation later on.
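
To make the loop concrete, here’s a minimal Python sketch of the algorithm above. The parameter defaults, the bounds, and the sphere objective in the usage line are illustrative assumptions of mine, not canonical values.

import random

def pso(f, bounds, n_particles=30, iters=200, w=0.9, phi_p=1.5, phi_g=1.5):
    # Step 2: random initial positions within the bounds; initial velocities
    # are drawn the same way, except they may also be negative in each dimension.
    xs = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vs = [[random.uniform(-(hi - lo), hi - lo) for lo, hi in bounds] for _ in range(n_particles)]
    ps = [x[:] for x in xs]   # p: best position found by each particle so far
    g = min(ps, key=f)[:]     # g: best position found by the whole swarm
    for _ in range(iters):
        for i in range(n_particles):
            # Step 3.1: move the particle.
            xs[i] = [x + v for x, v in zip(xs[i], vs[i])]
            # Steps 3.2 and 3.3: update the particle-best and global-best.
            if f(xs[i]) < f(ps[i]):
                ps[i] = xs[i][:]
                if f(ps[i]) < f(g):
                    g = ps[i][:]
            # Step 3.4: update the velocity.
            rp, rg = random.random(), random.random()
            vs[i] = [w * v + phi_p * rp * (p - x) + phi_g * rg * (gb - x)
                     for v, x, p, gb in zip(vs[i], xs[i], ps[i], g)]
    return g

# Usage: minimise the 2-D sphere function over [-5, 5] x [-5, 5].
best = pso(lambda x: sum(c * c for c in x), [(-5.0, 5.0), (-5.0, 5.0)])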

We predict that PSO will perform poorly on problems of high decision space dimensionality, where the metaphor of velocity becomes strained.

We predict that PSO will perform poorly in a discretized solution space, where the number of possible values in each dimension is small. Again, the metaphor of velocity becomes strained here.


Optimized Relative Step Size Random Search (ORSSRS)

Here we’re looking at the 1976 paper “Optimized relative step size random searches” (ORSSRS). The authors build on the earlier FSSRS and ASSRS. FSSRS used a step size of fixed distance to define the hypersphere within which samples would be taken. ASSRS adapts the step size based on the successful and failed samples.  ORSSRS proposes a more advanced method for selecting the step size as the optimization progresses.

  1. Given fitness function “f”
  2. Select a starting point x at random within the solution space.
  3. Let best = x.
  4. Generate vector d, a random vector of length r (with the dimensionality of the solution space).
  5. Calculate a new position x from x = best + d.
  6. If f(x) < f(best), replace best with x, replace r with r − δr (reduce the step size), and continue (from step 4) until the termination criteria are met.
  7. Otherwise, reverse the direction of the search by calculating a new position x from x = best − d.
  8. If f(x) < f(best), replace best with x, replace r with r − δr (reduce the step size), and continue (from step 4) until the termination criteria are met.

The user must provide the initial value of r (σ⁰ is used in the paper).

δr (again, the paper uses its own notation) is given by a complex derivation provided in the paper. I must confess that I did not fully understand the derivation, so I won’t foster any further confusion on your part by attempting a poor explanation here.
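
For the curious, here’s a rough Python sketch of the loop above. Since I’m not reproducing the paper’s derivation, delta_r below is a plain user-supplied constant standing in for the optimized reduction, and the defaults are placeholders of my own.

import math
import random

def orssrs(f, bounds, r0=1.0, delta_r=0.001, iters=1000):
    dim = len(bounds)
    # Steps 2-3: random starting point.
    best = [random.uniform(lo, hi) for lo, hi in bounds]
    r = r0
    for _ in range(iters):
        # Step 4: a random direction, scaled to length r.
        d = [random.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in d))
        d = [r * c / norm for c in d]
        # Steps 5-8: try best + d, then the reversal best - d.
        for cand in ([b + c for b, c in zip(best, d)],
                     [b - c for b, c in zip(best, d)]):
            if f(cand) < f(best):
                best = cand
                r = max(r - delta_r, 1e-12)  # reduce the step size
                break
        # If neither direction improved, draw a fresh d and try again.
    return best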

References

  • Schrack, G.; Choit, M. (1976). “Optimized relative step size random searches”. Mathematical Programming 10: 230.

Adaptive Step Size Random Search (ASSRS)

Adaptive Step Size Random Search (ASSRS) was proposed by Schumer and Steiglitz (1968) as an attempt to improve upon FSSRS by dynamically adapting the radius of the hypersphere to be sampled.

The approach is described as follows. Note that this is slightly simplified relative to the source paper.

  1. Given fitness function “f”
  2. Select a starting point x at random within the solution space.
  3. Let best = x.
  4. Sample a new position x within a given hypersphere of a specified radius r around best. That is, randomly select x to be a new point within a specified distance r of best.
  5. Sample a new position y within a given hypersphere of a specified radius r+δr around best. That is, y is chosen from a slightly larger hypersphere than x.
  6. If f(y) < f(x), replace r with r+δr (that is, we slightly increase the step size r).
  7. If the predicate at step 6 fails several times, replace r with r-δr (that is, we slightly reduce the step size r).
  8. If f(x) < f(best), replace best with x.
  9. If f(y) < f(best), replace best with y.
  10. Continue (from step 4) until the termination criteria are met.

The user must choose the initial radius r, as well as δr.
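
Here’s a minimal Python sketch of those steps. The defaults for r and delta_r, and the failure threshold standing in for “several times” at step 7, are illustrative guesses rather than values from the paper.

import math
import random

def sample_in_sphere(center, radius):
    # A uniformly random point within a hypersphere of the given radius.
    d = [random.gauss(0.0, 1.0) for _ in range(len(center))]
    norm = math.sqrt(sum(c * c for c in d))
    dist = radius * random.random() ** (1.0 / len(center))
    return [c + dist * u / norm for c, u in zip(center, d)]

def assrs(f, bounds, r=1.0, delta_r=0.1, max_fails=5, iters=1000):
    best = [random.uniform(lo, hi) for lo, hi in bounds]  # steps 2-3
    fails = 0
    for _ in range(iters):
        x = sample_in_sphere(best, r)            # step 4
        y = sample_in_sphere(best, r + delta_r)  # step 5
        if f(y) < f(x):                          # step 6: grow the step size
            r += delta_r
            fails = 0
        else:                                    # step 7: shrink it after
            fails += 1                           # several failures
            if fails >= max_fails:
                r = max(r - delta_r, 1e-12)
                fails = 0
        if f(x) < f(best):                       # step 8
            best = x
        if f(y) < f(best):                       # step 9
            best = y
    return best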

Schrack and Choit later propose a further enhancement, ORSSRS.

References

  • Schumer, M.A.; Steiglitz, K. (1968). “Adaptive step size random search”. IEEE Transactions on Automatic Control 13 (3): 270–276. doi:10.1109/tac.1968.1098903.

Fixed Step Size Random Search (FSSRS)

In a 1963 paper, Rastrigin presents an improved version of Random Search (Rastrigin, 1963).

  1. Given fitness function “f”
  2. Select a starting point x at random within the solution space.
  3. Let best = x.
  4. Sample a new position x within a given hypersphere of a specified radius r around best. That is, randomly select x to be a new point within a specified distance r of best.
  5. If f(x) < f(best), replace best with x.
  6. Continue (from step 4) until the termination criteria are met.

The user must choose the radius r.
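
A minimal Python sketch, with illustrative defaults:

import math
import random

def fssrs(f, bounds, r=1.0, iters=1000):
    dim = len(bounds)
    best = [random.uniform(lo, hi) for lo, hi in bounds]  # steps 2-3
    for _ in range(iters):
        # Step 4: a uniformly random point within distance r of best.
        d = [random.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in d))
        dist = r * random.random() ** (1.0 / dim)
        x = [b + dist * c / norm for b, c in zip(best, d)]
        if f(x) < f(best):                                # step 5
            best = x
    return best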

Later authors proposed ways to dynamically adjust r. See ASSRS and ORSSRS.

References

  • Rastrigin, L.A. (1963). “About Convergence of Random Search Method in Extremal Control of Multi-Parameter Systems” (mathnet.ru; includes PDF, in Russian).

Random Search

Going way back to basics today with a look at Random Search. In this simple algorithm, the solution space is sampled purely at random, with no attempt to learn from previous samples.

The approach is described as follows.

  • Given fitness function “f”
  • Select a starting point x at random within the solution space.
  • Let best = x.
  • Sample a new position x at random within the solution space.
  • If f(x) < f(best), replace best with x.
  • Continue until the termination criteria are met.
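
In Python, the whole thing fits in a few lines (the bounds and the iteration budget are illustrative):

import random

def random_search(f, bounds, iters=1000):
    def sample():
        return [random.uniform(lo, hi) for lo, hi in bounds]
    best = sample()      # starting point, chosen at random
    for _ in range(iters):
        x = sample()     # every sample ignores all previous samples
        if f(x) < f(best):
            best = x
    return best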

In the worst case, random search may perform worse than full enumeration. The reason is that this search method has no memory, and so may sample the same “x” multiple times.

This type of search, sometimes called “true random search” or “pure random search”, is often used as a point of reference in literature.

Variants

  • Rastrigin (1963) presents an improved variant, FSSRS, which samples around the previous best within a hypersphere of fixed radius “r”.

References

  • Rastrigin, L.A. (1963). “About Convergence of Random Search Method in Extremal Control of Multi-Parameter Systems” (mathnet.ru; includes PDF, in Russian).

Drift Stall

[W]hen using mutation at a fairly high rate the lower salient genes do not fully converge since the selective pressure acting on these genes is counterbalanced by the disruptive effect of the mutation operator. – Thierens (1998)¹

“Drift Stall” or “Convergence Stall” can result from overly high rates of mutation. Thierens (1998)¹ credited Rudnick (1992)² for the observation that where the mutation rate is very high, lower salient genes may be prevented from converging at all, due to the level of noise created by the mutation operator.

The key factor is to ensure that the diversity of the less salient building blocks is preserved long enough in order to buy time for the selection-recombination operators to reach this level of fitness scale. – Thierens (1998)¹

References

  1. Domino Convergence, Drift, and the Temporal-Salience Structure of Problems – Dirk Thierens, David E. Goldberg, Angela Guimaraes Pereira, 1998
  2. Rudnick M. (1992). Genetic Algorithms and Fitness Variance with an Application to the Automated Design of Artificial Neural Networks. PhD thesis, Oregon Graduate Institute of Science and Technology.

Genetic Drift

[…] in a finite-sized population the proportion of alleles fluctuates due to stochastic sampling errors, so even in the absence of any selective pressure, the genes will eventually become fixated at one particular allele. – Thierens (1998)¹

Genetic Drift (or just “drift”) is the phenomenon of the progressive loss of genetic information among less salient building blocks. While the algorithm is working hard to converge the salient building blocks, the less salient building blocks are also converging. But because they have lower marginal fitness contribution, they converge by chance (“drift”) alone.

Studies suggest that the expected time for a gene to converge due to genetic drift alone is, in very general terms, proportional to the population size.
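
As a toy illustration (my own sketch, not drawn from the literature), consider a single neutral binary gene in a population that is simply resampled each generation, with no selection at all. The average time to fixation grows roughly in proportion to the population size:

import random

def fixation_time(pop_size, p0=0.5):
    # One neutral gene: each generation resamples the allele count from
    # the previous frequency (pure sampling error, no selective pressure).
    ones = int(pop_size * p0)
    gens = 0
    while 0 < ones < pop_size:
        p = ones / pop_size
        ones = sum(random.random() < p for _ in range(pop_size))
        gens += 1
    return gens

# Average fixation time grows roughly in proportion to population size.
for n in (20, 40, 80):
    avg = sum(fixation_time(n) for _ in range(200)) / 200
    print(n, round(avg, 1))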

According to the Domino Convergence model (Thierens 1998¹), the time complexity to fully converge on the optimal solution is linear in the encoding length l (i.e., O(l)). Note that this holds only for an algorithm exhibiting constant selection intensity; see Domino Convergence.

In the absence of other factors, the conclusion might be that as the encoding length increases, we must increase the population size proportionately to ensure that genetic drift does not overtake domino convergence. In practical observation, we observe that the rule generally seems to hold, but that the required adjustment is somewhat less than strictly linear.

So, dilettante beware! We might expect to observe that as our encoding length increases, if our effective population size is not adjusted to compensate, we may find ourselves suffering from the effects of genetic drift.

References

  1. Domino Convergence, Drift, and the Temporal-Salience Structure of Problems – Dirk Thierens, David E. Goldberg, Angela Guimaraes Pereira, 1998

Domino Convergence

Building blocks with higher marginal fitness contribution – salient building blocks – converge before those with lower marginal fitness. Moreover, there comes a point in time – and fitness scale – when strong convergence runs out of steam, and low salience building blocks drift to substantial convergence at random with a resulting high probability of error. – Thierens (1998)¹

Domino Convergence refers to the observation that elements (“building blocks”) in an encoding string that more significantly affect the measured objective value (“fitness”) converge earlier.

Thierens (1998)¹ concludes, in the absence of other factors, that the time complexity to converge ranges from linear (O(l)) to exponential (O(2ˡ)) in the encoding length l. In the case of constant selection intensity, convergence time is linear; where selection intensity is proportional to the fitness difference, it is exponential.

Proportional selection intensity is exhibited, for example, by genetic algorithms that employ “roulette wheel” selection. Acting on Thierens’ conclusion above, we might prefer the alternative “tournament” selection approach instead.

References

  1. Domino Convergence, Drift, and the Temporal-Salience Structure of Problems – Dirk Thierens, David E. Goldberg, Angela Guimaraes Pereira, 1998

How to use Windows Error Reporting

How do you use Windows Error Reporting?

Unfortunately, you can’t. It doesn’t work. The documentation is imprecise and convoluted. Try to log in to the various services you’re supposed to use, and you’ll see some error or another. Give up now, and use something else. Unfortunately, you won’t ever get Windows Error Reporting to work for you.

Working with GIT Submodules

GIT submodules are a feature for linking together multiple distinct repositories. They’re useful, for example, if you have one library that you use in multiple different projects.

The easy way

The easiest way to use GIT submodules is to have your submodules track a branch.

To add a submodule tracking a branch, use the following commands (note the “-b” argument; don’t forget to commit and push your changes afterwards):

git submodule add -b [branch_to_track] [git_repo_url] [relative_dir]
git submodule update --init --recursive --remote

The “update” command above can also be used to pull updates.

To remove a submodule, use the following commands (don’t forget to commit and push your changes afterwards):

git submodule deinit mysubmodule
git rm --cached mysubmodule

The other way

So I mentioned that tracking a branch was “the easy way” to use submodules. However, tracking a branch is a relatively new feature. The previous usage of submodules “pins” each submodule to a specific revision rather than a branch. Advocates of this approach will point out that specifying a specific revision means that your code will never be broken unexpectedly by upstream changes. If you find yourself working with a project in which this is the case, you will from time to time want to update your submodule references to a newer revision. Here’s how you do that.

These commands pull changes in the main repo and in its submodules, then update each submodule individually to the latest revision on the master branch, recursing into nested submodules. Finally, we commit our changes.

git pull --recurse-submodules
git submodule update --init --recursive --remote
git submodule foreach git pull origin master
git commit -a -m "updating submodule references"

There you have it. Submodules. Enjoy.