(The code examples in this post are also available here.)

Although I have already written something on binary exponentiation, as applied to modular exponentiation, there’s plenty more to be said about it in more general settings. Basically, today we are after computing the n-th power of a, where a need not be an integer, rational, real or even complex number, but can be a more general mathematical object that can be raised to a power, such as a square matrix or a polynomial…

For an n-th power to exist, the object being exponentiated must fulfill certain conditions, which can basically be summarized as belonging to a more general set with a well-behaved multiplication operation defined on it. Good behavior translates into closure and associativity, which in mathematical slang can be expressed as the object belonging to a semigroup. This will often mean that the object belongs to a monoid, or even to a ring or rng… But I’ll follow V.I. Arnold’s advice and not get too carried away with abstract generalizations, so enough said: the thing must be multipliable, and the result must be of the same kind.

#### Defining Multiplication

Because we are dealing with general mathematical objects, we first need to define the multiplication operation. There are two ways of dealing with this in Python. To work on an example, say we want to deal with 2×2 matrices. We could define a `Matrix2x2` class, and write a `__mul__` method for it, so that whenever we write `a*b`, what gets evaluated is `a.__mul__(b)`. For the same price, I’ve thrown in an `__add__` method…

```python
class Matrix2x2(object) :
    def __init__(self, elements = None) :
        if elements :
            self._data_ = elements[:4]
        else :
            self._data_ = [0] * 4

    def __repr__(self) :
        as_string = [str(j) for j in self._data_]
        str_length = [len(j) for j in as_string]
        longest = max(str_length)
        for j in xrange(4) :
            as_string[j] = ' ' * (longest - str_length[j]) + as_string[j]
        ret = 'Matrix2x2 object:'
        for j in xrange(2) :
            ret += '\n[ %s %s ]' % (as_string[2*j], as_string[2*j+1])
        return ret

    def __mul__(self, b) :
        ret = [self._data_[0]*b._data_[0] + self._data_[1]*b._data_[2]]
        ret += [self._data_[0]*b._data_[1] + self._data_[1]*b._data_[3]]
        ret += [self._data_[2]*b._data_[0] + self._data_[3]*b._data_[2]]
        ret += [self._data_[2]*b._data_[1] + self._data_[3]*b._data_[3]]
        return Matrix2x2(ret)

    def __add__(self, b) :
        return Matrix2x2([self._data_[j] + b._data_[j] for j in xrange(4)])
```

Or, if we don’t want to go through the hassle of dealing with objects, we can agree with ourselves to keep the matrix’s data in a list in row-major order, and define a standalone multiplication function…

```python
def mat_mul(a, b) :
    """
    Returns the product of two 2x2 square matrices.

    Computes the product of two 2x2 matrices, each stored in a four
    element list in row major order.
    """
    ret = [a[0]*b[0] + a[1]*b[2]]
    ret += [a[0]*b[1] + a[1]*b[3]]
    ret += [a[2]*b[0] + a[3]*b[2]]
    ret += [a[2]*b[1] + a[3]*b[3]]
    return ret
```

#### Direct Multiplication

To set a baseline for performance, the starting point has to be the most direct implementation of exponentiation: repeated multiplication. There really isn’t much mystery to it…

```python
import nrp_base

@nrp_base.performance_timer
def direct_pow(a, n, **kwargs) :
    """
    Computes a**n by direct multiplication.

    Arguments
     a   - The object to be exponentiated.
     n   - The integral exponent.

    Keyword Argument
     mul - A function taking two arguments of the same type as a, and
           returning their product. If undefined, a's __mul__ method
           will be used.
    """
    mul = kwargs.pop('mul', None)
    if mul is None :
        mul = lambda x, y : x * y
    ret = a
    for j in xrange(n-1) :
        ret = mul(ret, a)
    return ret
```

#### The Two Flavors of Binary Exponentiation

OK, so now we are ready for the fun to begin… The basic idea behind speeding exponentiation up is that a^{4} can be computed with two, instead of three, multiplications, if rather than doing it as a^{4} = a·a·a·a, we do it as a^{2} = a·a, a^{4} = a^{2}·a^{2}, storing the intermediate result for a^{2}…

To expand this idea to exponents other than 4, we basically need to write the exponent in binary. Examples are great for figuring out this kind of thing, so we’ll take computing a^{19} as ours. Knowing that 19 in binary is 10011, we now have a choice on how to proceed, which basically amounts to starting with the bits of the exponent from the least significant one, or from the most significant one…
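Before going on, it helps to see how the bits come out of an exponent in Python: `n & 1` reads the least significant bit, and `n >>= 1` discards it. A quick check (plain Python, not part of the post’s code):

```python
n = 19
bits = []
while n :
    bits.append(n & 1)   # least significant bit first
    n >>= 1
# bits is now [1, 1, 0, 0, 1], i.e. 10011 read backwards
```

Getting the bits most significant first requires either storing this list and reading it backwards, or a trick like `bin(19)`, which returns the string `'0b10011'`.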

##### Least Significant Bit First

That 19 in binary is 10011 can also be interpreted as 19 = 2^{0} + 2^{1} + 2^{4}, and so a^{19} = a^{2^{0}}·a^{2^{1}}·a^{2^{4}}… The algorithm then is pretty straightforward:

- start computing sequentially from j = 0 the values of a^{2^{j}}, where each new value is the square of the preceding one,
- when the first non-zero bit is found, set the return value to the corresponding a^{2^{j}} value,
- for subsequent non-zero bits, multiply the return value by the corresponding a^{2^{j}} value,
- once all bits of the exponent have been searched, the return value holds the sought value.

If the exponent is n, then there will be log_{2} n squarings to compute the a^{2^{j}} values, plus one fewer multiplication than there are 1’s in the binary expansion of the exponent.
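The steps above can be sketched in a few lines (a stand-alone sketch: `lsb_pow` is a hypothetical name, and plain `*` stands in for the generic multiplication):

```python
def lsb_pow(a, n) :
    """Least-significant-bit-first binary exponentiation (sketch)."""
    ret = None          # will hold the result once the first 1-bit is seen
    sq = a              # runs through a, a**2, a**4, ... by repeated squaring
    while n :
        if n & 1 :      # current bit is set: fold a**(2**j) into the result
            ret = sq if ret is None else ret * sq
        n >>= 1
        if n :          # avoid one useless squaring after the last bit
            sq = sq * sq
    return ret
```

For n = 19 this computes a^{2^{0}}, a^{2^{1}}, …, a^{2^{4}}, and multiplies together the three of them that correspond to set bits.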

##### Most Significant Bit First

It is also possible to write a^{19} = ((a^{2^{3}}·a)^{2})·a. While it is easy to verify that this is the case, it may be harder to see where it comes from. The algorithm is as follows:

- Take the exponent bits from most to least significant,
- since the first bit is always a 1, set the return value to a,
- for every bit processed, first square the return value, and then, if the bit is a 1, multiply the return value by a,
- again, once the last bit is used, the return value holds the sought value.

The analysis of the algorithm is similar to the previous one: there will again be log_{2} n squarings, and the same number of multiplications as well.
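For comparison, these steps can also be sketched stand-alone (`msb_pow` is a hypothetical name, and Python’s `bin` is leaned on to get at the bits):

```python
def msb_pow(a, n) :
    """Most-significant-bit-first binary exponentiation (sketch)."""
    ret = a                    # the leading bit of n is always a 1
    for bit in bin(n)[3:] :    # remaining bits, most significant first
        ret = ret * ret        # always square...
        if bit == '1' :        # ...and multiply by a when the bit is set
            ret = ret * a
    return ret
```

Note that every multiplication that is not a squaring involves the original a, which is the property discussed next.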

This second flavor has one clear disadvantage: there’s no simple way to generate the bits of a number from most to least significant. So this approach requires computing and storing the bits from least to most significant, and then retrieving them in reverse order. Not that it is anything too complicated, but there are additional time costs involved.

So why would one even worry about this scheme, then? Well, it also has a potential advantage: the multiplications performed always involve the original object being exponentiated. This can prove beneficial in two different ways:

- it lends itself better for optimization of particular cases, and
- if the original element involves small numbers, but the end result involves large numbers, the multiplications are usually faster than in the previous algorithm.

To see the differences with examples, it is better to first write the python code…

```python
@nrp_base.performance_timer
def bin_pow(a, n, **kwargs) :
    """
    Computes a**n by binary exponentiation.

    Arguments
     a   - The object to be exponentiated.
     n   - The integral exponent.

    Keyword Arguments
     mul - A function taking two arguments of the same type as a, and
           returning their product. If undefined, a's __mul__ method
           will be used.
     sqr - A function taking a single argument of the same type as a,
           and returning its square. If undefined, mul(a,a) will be
           used. To use a method of the object's class, use
           sqr=ClassName.function_name.
     lsb - Set to True to use least significant bits first. Default is
           False.
     msb - Set to True to use most significant bits first. Overrides
           lsb. Default is True.
    """
    mul = kwargs.pop('mul', None)
    sqr = kwargs.pop('sqr', None)
    lsb = kwargs.pop('lsb', None)
    msb = kwargs.pop('msb', None)
    if mul is None :
        mul = lambda x, y : x * y
    if sqr is None :
        sqr = lambda x : mul(x, x)
    if lsb is None and msb is None :
        msb = True
        lsb = False
    elif msb is None :
        msb = not lsb
    else :
        lsb = not msb
    if msb :
        bits = []
        while n > 1 : # The last bit is always a 1...
            if n & 1 :
                bits += [1]
            else :
                bits += [0]
            n >>= 1
        ret = a
        while bits :
            ret = sqr(ret)
            if bits.pop() :
                ret = mul(ret, a)
    else :
        val = a
        while not n & 1 :
            n >>= 1
            val = sqr(val)
        ret = val
        while n > 1 :
            n >>= 1
            val = sqr(val)
            if n & 1 :
                ret = mul(ret, val)
    return ret
```

#### Taking `bin_pow` for a Ride…

To get a better grip on how the algorithm performs, we’ll take a very simple matrix, which we will find again in a future post, when we discuss Fibonacci numbers…

```
>>> a = Matrix2x2([1,1,1,0])
>>> a
Matrix2x2 object:
[ 1 1 ]
[ 1 0 ]
```

So let’s raise it first to a small power, and see how the two algorithms perform…

```
>>> val = bin_pow(a,10,lsb=True,verbose=True,timer_loops=100)
Call to bin_pow took:
2.84952e-05 sec. (min)
0.000205613 sec. (max)
3.363e-05 sec. (avg)
0.000178629 sec. (st_dev)
>>> val = bin_pow(a,10,msb=True,verbose=True,timer_loops=100)
Call to bin_pow took:
3.12889e-05 sec. (min)
5.16825e-05 sec. (max)
3.32891e-05 sec. (avg)
2.93951e-05 sec. (st_dev)
```

Not many surprises here, as the algorithm having to store and retrieve bits is a little slower than the other. But let’s see what happens if we scale things up…

```
>>> val = bin_pow(a,1000000,lsb=True,verbose=True)
Call to bin_pow took 2.599 sec.
>>> val = bin_pow(a,1000000,msb=True,verbose=True)
Call to bin_pow took 1.76194 sec.
```

This sort of shows that the cost of the individual multiplications can be very relevant. A last little experiment gives more insight into the workings of this…

```
>>> 2**20
1048576
>>> val = bin_pow(a,2**20,lsb=True,verbose=True)
Call to bin_pow took 1.95651 sec.
>>> val = bin_pow(a,2**20,msb=True,verbose=True)
Call to bin_pow took 1.90171 sec.
```

Funny, isn’t it? Why would it be that the time differences are so much smaller now, even though the exponents are roughly the same? These other timings may help answer that…

```
>>> val = bin_pow(a,2**20-1,lsb=True,verbose=True)
Call to bin_pow took 2.61069 sec.
>>> val = bin_pow(a,2**20-1,msb=True,verbose=True)
Call to bin_pow took 1.89222 sec.
```

Got it now? 2^{20} has a binary expression full of 0’s, 20 of them, preceded by a single 1. This means that neither algorithm is performing any multiplications, just squarings, so the performance is almost identical. On the other hand, 2^{20}-1 has a binary expression full of 1’s, 20 of them, with no 0 anywhere. This means that, although there will be one squaring less than in the previous case, there will be 19 more multiplications. And here’s where the `msb` approach excels, because the multiplications involved are simpler.
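The operation counts in this explanation are easy to verify with a little stand-alone tally (a hypothetical `count_ops` helper, mirroring the least-significant-bit-first scheme):

```python
def count_ops(n) :
    """Tally (squarings, multiplications) used by binary exponentiation."""
    sqr = mul = first = 0
    while n :
        if n & 1 :
            if first :
                mul += 1    # fold another a**(2**j) into the result
            first = 1       # the first 1-bit only initializes the result
        n >>= 1
        if n :
            sqr += 1        # squarings needed to reach the next bit
    return sqr, mul
```

`count_ops(2**20)` gives `(20, 0)`, while `count_ops(2**20 - 1)` gives `(19, 19)`, in line with the timings above.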
