Least cost routing using swarm intelligence

A few years back, before pay-as-you-go SMS sending services like Twilio existed, I needed to send SMS’s to a number of my subscribers, but couldn’t afford the monthly cost, and up-front fees that bulk SMS providers charged. I decided to set up my own SMS sending service using 2G modems plugged in to serial ports of a PC, and driven using AT commands to send and receive SMS’s.

I then ran in to the very real problem of “how do I decide which modem to send an SMS on?” Service providers charge different rates to send to different mobile networks, and at any given instant, a GSM modem could be down because of hardware or software failure.

This is how I designed a reliable service on top of unreliable hardware, and complex least cost routing rules.

Intro to least cost routing

I needed to send SMS’s on three mobile networks: MTN, Vodacom, Cell C. Depending on the contract with each mobile network, they charge different rates for sending SMS’s on the same network, versus sending SMS’s to different networks; they come with free bundles (e.g. first 1000 SMS’s per month are free, to any network), and different rates depending on time of the day. (Cheaper at night.)

For an SMS that I needed to send I wanted to use the cheapest option.

However, if a particular mobile network or modem to that network was down, I didn’t want to wait indefinitely to send an SMS. If an SMS had been waiting to send for a while then I wanted to send the SMS, almost regardless of the cost.

Proposed solution: Ant colony optimisation

I’ve always been a fan of swarm intelligence, and in particular ant colony optimisation techniques.

Ant colonies are incredibly robust societies, able to survive in the face of natural challenges, without requiring centralised control. Each ant (or agent) follows a number of simple rules, and interacts with ants in its immediate vicinity, and with the environment.

A classic example is division of labour in the hive. Each ant is a member of a caste, typically worker ants and warrior ants. Worker ants are smaller in size and perform duties such as brood feeding and hive cleaning. Warrior ants are larger in size and protect the hive from intruders.

Division of labour

Each caste responds to stimulus proportionally based on the mount of stimulus it receives, as well as its propensity to reacting to that stimulation.

For example, when a warrior ant comes across a hungry grub (baby ant) it tends to leave it alone; and when a worker ant comes across a hungry grub it tends to feed it. However, when there is a large number of hungry grubs the warrior ants join in to help with the feeding.

(For a much better description of this see [1])

Modeling with sigmoid curves

This behavior can be modeled with a Sigmoid curve.

Let’s look at some examples:

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.font_manager import FontProperties
import pandas as pd

fontP = FontProperties()

def sigmoid(x, n, theta):
    return x**n / (x**n + theta**n)

x = np.arange(0,1,0.01)

plt.plot(x, sigmoid(x, n=10, theta=0.5), x, sigmoid(x, n=5, theta=0.5), x, sigmoid(x, n=5, theta=0.25))
plt.legend(['n=10, theta=0.5', 'n=5, theta=0.5', 'n=5, theta=0.25'], prop=fontP, frameon = False)
plt.xlabel('Stimulus (s)')
plt.ylabel('Probability of performing an action (P)')


The sigmoid curve I have chosen is

\begin{equation} P_s = \frac{s^n }{s^n + \Theta^n} \end{equation}

where is the steepness of the curve, with larger making the curve steeper.

And for : If then and if then and if then

Feeding a hungry brood

Below we have two curves, the blue one for ‘worker’ ants, and the green curve for ‘warrior’ ants. The x-axis is the stimulus, the level of how hungry the grubs are, and the y-axis is the probability that a given worker or warrior ant will get involved with feeding the grubs.

x = np.arange(0,100,0.01)
plt.semilogx(x, sigmoid(x, n=2.0, theta=0.5), x, sigmoid(x, n=2.0, theta=10.0))
plt.legend(['Worker ants', 'Warrior ants'], prop=fontP, frameon = False)
plt.xlabel('Stimulus intensity, level of hungry ($s$)')
plt.ylabel('Probability of feeding grubs ($P_s$)')


As the stimulus (the number of hungry grubs) increases, the proportion of the ants responding to the stimulus (feeding the grubs) goes up. The worker ants respond sooner to the stimulus than the warrior ants, but if the worker ants cannot keep up with demand and the stimulus continues to increase then warrior ants begin to help with feeding.

Sending SMS messages

Back to sending SMS messages.

We want messages to be sent:

  1. As soon as possible (minimise delay)
  2. As cheaply as possible (minimise cost)

These two goals oppose each other. We could send an SMS via any mobile network, to any destination mobile network, but networks often charge more to send an SMS to a competing network. Therefore cost would be higher.

We could send an SMS only on the cheapest route, but if that route is down then the SMS would be stuck until that route came back up, potentially leading to a long delay.

We need to find a balance that we are happy with.

Therefore, the decision to send a text message is based on:

  1. The cost of the message
  2. The time that the message has been waiting to be sent
  3. The number of messages that are waiting to be sent

Point 3 is required because the length of the queue is a leading indicator to the amount of time that a message will spend in the queue. If a message is added to the back of a long queue, even though it has not yet been waiting long to be sent, it will wait a time long because all the messages in front of it need to be sent first. (For some more background info, see Little’s Law.)

The SMS queue

In our system, we have one queue where all messages are added. We also have a number of SMS modems (or routers) that send the SMS’s.

Starting at the head of the queue, each router inspects the messages in the queue and decides whether it will send that message or not. If it does, then it removes the message from the queue and sends it. If it decides not to send the message then it leaves it on the queue and moves on to the next message in the queue.

Note that this is not a strict FIFO queue. A modem could decide not to send the first four messages in the queue but send the fifth. After it sends the fifth it will then start from the front of the queue again and repeat the process. If it reaches the end of the queue it will restart at the front.

SMS routers: Deciding to send or not


Based on the source and destination mobile networks, free SMS’s bundled in to the contract, and time of day a router can calculate the cost to send an SMS.

As the cost of an SMS goes up we want the probability of sending it to go down.


cost = np.arange(0, 2, 2.0/1000)
n = 3
theta = 0.3
probability_of_sending_cost = 1 - sigmoid(cost, n, theta)

plt.plot(cost, probability_of_sending_cost)
plt.xlabel('Cost to send SMS ($cost$)')
plt.ylabel('Probability of sending SMS ($P_{cost}$)')


For typical values of cost, the probability is:

c = np.array([0.0, 0.19, 0.21, 0.23, 0.25, 0.35, 0.40, 0.50, 0.75, 1.30, 1.74])
p = 1 - sigmoid(c, n, theta)
data = {'${cost}$': c, 'Probability of sending SMS ($P_{cost}$)': p}
pd.DataFrame(data, columns=['${cost}$', 'Probability of sending SMS ($P_{cost}$)'])
${cost}$ Probability of sending SMS ($P_{cost}$)
0 0.00 1.000000
1 0.19 0.797425
2 0.21 0.744602
3 0.23 0.689356
4 0.25 0.633431
5 0.35 0.386404
6 0.40 0.296703
7 0.50 0.177632
8 0.75 0.060150
9 1.30 0.012140
10 1.74 0.005099

When it costs nothing to send an SMS then there is 100% chance of the SMS being sent. And as the cost goes up that chance goes down. By the time the cost is 30c there is a 50% chance of it being sent, and by the time the cost is R 1.74 there is about a 1 in 200 chance of the SMS being sent.

Send the SMS?

To make the final decision to send the SMS we:

  1. Map the stimulus (cost) to the probability of sending, using the function described above
  2. Pick a random number between 0 and 1
  3. If the random number <= probability of sending then send the message, else don’t send it

Cost isn’t the only factor in deciding to send the message. We need to take in to account time in the queue, and the length of the queue.

Time in queue

If a message has been in the queue for a long time then we want to have a higher probability of it being sent. Time is the number of seconds that the message has been waiting to be sent.

\begin{equation} P_{time} = \frac{time^n }{time^n + \Theta^n} \end{equation}

n = 1.0
theta = 20
time = np.arange(0,300,300.0/1000)
probability_of_sending_time = sigmoid(time, n, theta)

plt.plot(time, probability_of_sending_time)
plt.xlabel('Time in queue (${time}, s$)')
plt.ylabel('Probability of sending SMS ($P_{time}$)')


After 20 seconds in the queue the message has a 50% probability of being sent, and then more slowly aproches 100%.

Length of queue

Queue length is a leading indicator to the time time that a message will wait in the queue. In a long queue, messages at the back of the queue will take longer to be sent than if the queue was short. We want a higher probability of sending messages when the queue is backed up than if the queue is short.

\begin{equation} P_{length} = \frac{length^n }{length^n + \Theta^n} \end{equation}

n = 3.0
theta = 200
length = np.arange(0,1000,1)
probability_of_sending_length = sigmoid(length, n, theta)

plt.plot(length, probability_of_sending_length)
plt.xlabel('Number of messages in the queue (${length}$)')
plt.ylabel('Probability of sending SMS ($P_{length}$)')


Combining the cost, time and queue length

To combine all three factors we use a simple weighting, and multiply each probability by its weighting and sum the three results in to a final probability:

The weightings that I chose were:


Using this routing method allowed us to send SMS’s in bulk across a number of mobile networks that had complex costing schemes (fixed number of free, bundled SMS’s to any network, unlimited free SMS’s only on home network, and costing that varied by on- and off-peak times), and was robust to hardware failures of mobile routers.

This technique can be used whereever there is a need to divide work among multiple consumers, and there is a complex decision matrix on whether a consumer should perform that work or not.

While the solution is not guaranteed to be the global optimum, it will more often than not be a good solution over a wide range of changes in the underlying system, making it more robust than hard-coded rules.

Further reading

[1] Swarm Intelligence: From Natural to Artificial Systems