## Good boy!

November 30, 2009

Just a quick pat on my own shoulder…

Project Euler’s problem 266, posted this very last weekend, was proposed by yours truly:

The divisors of 12 are: 1,2,3,4,6 and 12.
The largest divisor of 12 that does not exceed the square root of 12 is 3.
We shall call the largest divisor of an integer n that does not exceed the square root of n the pseudo square root (PSR) of n.
It can be seen that PSR(3102)=47.

Let p be the product of the primes below 190.
Find PSR(p) mod 10^16.
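Just to make the definition concrete (and nothing more), a brute-force PSR for small numbers fits in a couple of lines; needless to say, it is hopeless for the actual problem:

```python
def psr(n) :
    """Largest divisor of n not exceeding sqrt(n), by brute-force trial division."""
    return max(d for d in range(1, int(n**0.5) + 1) if n % d == 0)

print(psr(12))    # 3
print(psr(3102))  # 47
```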

I had something much more modest in mind, such as “all primes below 100”, since the way I figured it, this was to be solved by treating it as a variation of the knapsack problem and then using dynamic programming on it. An approach that will see you die of old age if you try it on the proposed value…

But I then had the pleasure of witnessing how xan, hk and daniel.is.fischer started coming up with alternative, much more sophisticated and elegant methods, which pushed the question all the way up to where it is now. It is both a humbling and an edifying experience to see these people at work, so if you have an idea for a problem, do propose it to them.

I won’t go into any more details as to how to solve this particular problem: the right place to ask questions is here.

## Binary Exponentiation

June 5, 2009

(The code examples in this post are also available here.)

Although I have already written something on binary exponentiation, as it applied to modular exponentiation, there’s plenty more to be said about it in more general settings. Basically, today we are after computing the n-th power of a, but a need not be an integer, rational, real or even complex number: it can be any more general mathematical object that can be raised to a power, such as a square matrix or a polynomial…

For an n-th power to exist, the object being exponentiated must fulfill certain conditions, which can basically be summarized as belonging to a more general set that has a well-behaved multiplication operation defined. Good behavior can be translated into closure and associativity, which in mathematical slang can be expressed as the object belonging to a semigroup. This will often mean that the object belongs to a monoid, or even to a ring or rng… But I’ll follow V.I. Arnold’s advice, and not get too carried away with the abstract generalizations, so enough said: the thing must be multipliable, and the result must be of the same kind.

#### Defining Multiplication

Because we are dealing with general mathematical objects, we first need to define the multiplication operation. There are two ways of dealing with this in python. To work on an example, say we want to deal with 2×2 matrices. We could define a Matrix2x2 class, and write a __mul__ method for it, so that whenever we write a*b, what gets evaluated is a.__mul__(b). For the same price, I’ve thrown in an __add__ method…

class Matrix2x2(object) :
    def __init__(self, elements = None) :
        if elements :
            self._data_ = elements[:4]
        else :
            self._data_ = [0] * 4

    def __repr__(self) :
        as_string = [str(j) for j in self._data_]
        str_length = [len(j) for j in as_string]
        longest = max(str_length)
        for j in xrange(4) :
            as_string[j] = ' ' * (longest - str_length[j]) + as_string[j]
        ret = 'Matrix2x2 object:'
        for j in xrange(2) :
            ret += '\n[ %s %s ]' % (as_string[2*j], as_string[2*j+1])
        return ret

    def __mul__(self, b) :
        ret = [self._data_[0]*b._data_[0] + self._data_[1]*b._data_[2]]
        ret += [self._data_[0]*b._data_[1] + self._data_[1]*b._data_[3]]
        ret += [self._data_[2]*b._data_[0] + self._data_[3]*b._data_[2]]
        ret += [self._data_[2]*b._data_[1] + self._data_[3]*b._data_[3]]
        return Matrix2x2(ret)

    def __add__(self, b) :
        return Matrix2x2([self._data_[j] + b._data_[j] for j in xrange(4)])


Or if we don’t want to go through the hassle of dealing with objects, we can agree with ourselves in keeping the matrix’s data in a list in row-major order, and define a standalone multiplication function…

def mat_mul(a, b) :
"""
Returns the product of two 2x2 square matrices.

Computes the product of two 2x2 matrices, each stored in a four element
list in row major order.
"""

ret = [a[0]*b[0] + a[1]*b[2]]
ret += [a[0]*b[1] + a[1]*b[3]]
ret += [a[2]*b[0] + a[3]*b[2]]
ret += [a[2]*b[1] + a[3]*b[3]]
return ret


#### Direct Multiplication

To set a baseline for performance, the starting point has to be the most direct implementation of exponentiation: repeated multiplication. There really isn’t much mystery to it…

import nrp_base

@nrp_base.performance_timer
def direct_pow(a, n, **kwargs) :
    """
    Computes a**n by direct multiplication.

    Arguments
     a - The object to be exponentiated.
     n - The integral exponent.

    Keyword Argument
     mul - A function taking two arguments of the same type as a, and
           returning their product. If undefined, a's __mul__ method will
           be used.

    """
    mul = kwargs.pop('mul', None)
    ret = a
    if mul is None :
        mul = lambda x, y : x * y
    for j in xrange(n-1) :
        ret = mul(ret, a)
    return ret


#### The Two Flavors of Binary Exponentiation

OK, so now we are ready for the fun to begin… The basic idea behind speeding exponentiation up is that a^4 can be computed with two, instead of three, multiplications, if rather than doing it as a^4 = a·a·a·a, we do it as a^2 = a·a, a^4 = a^2·a^2, storing the intermediate result for a^2…

To expand this idea to exponents other than 4 we basically need to write the exponent in binary. Examples are great for figuring out this kind of thing, so we’ll take computing a^19 as ours. Knowing that 19 in binary is 10011, we now have a choice on how to proceed, which basically amounts to starting with the bits of the exponent from the least significant one, or from the most significant one…

##### Least Significant Bit First

That 19 in binary is 10011 can also be interpreted as 19 = 2^0 + 2^1 + 2^4, and so a^19 = a^(2^0)·a^(2^1)·a^(2^4)… The algorithm then is pretty straightforward:

• start computing sequentially from j = 0 the values of a^(2^j), where each new value is the square of the preceding one,
• when the first non-zero bit is found, set the return value to the corresponding a^(2^j) value,
• for subsequent non-zero bits, multiply the return value by the corresponding a^(2^j) value,
• once all bits of the exponent have been searched, the return value holds the sought value.

If the exponent is n, then there will be about log_2 n squarings, to compute the a^(2^j) values, plus one multiplication less than there are 1’s in the binary expansion of the exponent.
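Stripped down to plain integer arithmetic (the general version, with custom multiplication, comes below), the steps above fit in a handful of lines:

```python
def lsb_pow(a, n) :
    """a**n for integer n >= 1, consuming exponent bits least significant first."""
    ret = None
    while n :
        if n & 1 :                 # non-zero bit: fold in the current a**(2**j)
            ret = a if ret is None else ret * a
        a *= a                     # square to get the next a**(2**j)
        n >>= 1
    return ret

print(lsb_pow(3, 19))  # 1162261467, i.e. 3**19
```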

##### Most Significant Bit First

It is also possible to write a^19 = ((a^(2^3)·a)^2)·a. While it is easy to verify that this is the case, it may be harder to see where it comes from. The algorithm is as follows:

• Take the exponent bits from most to least significant,
• since the first bit is always a 1, set the return value to a,
• for every bit processed, first square the return value, and then, if the bit is a 1, multiply the return value by a,
• again, once the last bit is used, the return value holds the sought value.

The analysis of the algorithm is similar to the previous one: there will again be about log_2 n squarings, and the same number of multiplications as well.

This second flavor has one clear disadvantage: there’s no simple way to generate the bits of a number from most to least significant. So this approach requires computing and storing the bits from least to most significant, and then retrieving them in inverse order. Not that it is anything too complicated, but there are additional time costs involved.

So why would one even worry about this scheme, then? Well, it also has a potential advantage: the multiplications performed always involve the original object being exponentiated. This can prove beneficial in two different ways:

1. it lends itself better to optimization of particular cases, and
2. if the original element involves small numbers, but the end result involves large numbers, the multiplications are usually faster than in the previous algorithm.

To see the differences with examples, it is better to first write the python code…

@nrp_base.performance_timer
def bin_pow(a, n, **kwargs) :
    """
    Computes a**n by binary exponentiation.

    Arguments
     a - The object to be exponentiated.
     n - The integral exponent.

    Keyword Arguments
     mul - A function taking two arguments of the same type as a, and
           returning their product. If undefined, a's __mul__ method will
           be used.
     sqr - A function taking a single argument of the same type as a, and
           returning its square. If undefined, mul(a,a) will be used. To
           use a method of the object's class, use sqr=ClassName.function_name.
     lsb - Set to True to use least significant bits first. Default is False.
     msb - Set to True to use most significant bits first. Overrides lsb.
           Default is True.

    """
    mul = kwargs.pop('mul', None)
    sqr = kwargs.pop('sqr', None)
    lsb = kwargs.pop('lsb', None)
    msb = kwargs.pop('msb', None)
    if mul is None :
        mul = lambda x, y : x * y
    if sqr is None :
        sqr = lambda x : mul(x, x)
    if lsb is None and msb is None :
        msb = True
        lsb = False
    elif msb is None :
        msb = not lsb
    else :
        lsb = not msb

    if msb :
        bits = []
        while n > 1 : # The last bit is always a 1...
            if n & 1 :
                bits += [1]
            else :
                bits += [0]
            n >>= 1
        ret = a
        while bits :
            ret = sqr(ret)
            if bits.pop() :
                ret = mul(ret, a)
    else :
        val = a
        while not n & 1 :
            n >>= 1
            val = sqr(val)
        ret = val
        while n > 1 :
            n >>= 1
            val = sqr(val)
            if n & 1 :
                ret = mul(ret, val)
    return ret


#### Taking bin_pow for a Ride…

To get a better grip on how the algorithm performs, we’ll take a very simple matrix, which we will find again in a future post, when we discuss Fibonacci numbers

>>> a = Matrix2x2([1,1,1,0])
>>> a
Matrix2x2 object:
[ 1 1 ]
[ 1 0 ]

So let’s raise it first to a small power, and see how the two algorithms perform…

>>> val = bin_pow(a,10,lsb=True,verbose=True,timer_loops=100)
Call to bin_pow took:
    2.84952e-05 sec. (min)
    0.000205613 sec. (max)
    3.363e-05 sec. (avg)
    0.000178629 sec. (st_dev)
>>> val = bin_pow(a,10,msb=True,verbose=True,timer_loops=100)
Call to bin_pow took:
    3.12889e-05 sec. (min)
    5.16825e-05 sec. (max)
    3.32891e-05 sec. (avg)
    2.93951e-05 sec. (st_dev)

Not many surprises here, as the algorithm having to store and retrieve bits is a little slower than the other. But let’s see what happens if we scale things…

>>> val = bin_pow(a,1000000,lsb=True,verbose=True)
Call to bin_pow took 2.599 sec.
>>> val = bin_pow(a,1000000,msb=True,verbose=True)
Call to bin_pow took 1.76194 sec.

This sort of shows that the multiplication thing can be very relevant. A last little experiment gives more insight into the workings of this…

>>> 2**20
1048576
>>> val = bin_pow(a,2**20,lsb=True,verbose=True)
Call to bin_pow took 1.95651 sec.
>>> val = bin_pow(a,2**20,msb=True,verbose=True)
Call to bin_pow took 1.90171 sec.

Funny, isn’t it? Why would it be that the time differences are so much smaller now, even though the exponents are roughly the same? These other timings may help answer that…

>>> val = bin_pow(a,2**20-1,lsb=True,verbose=True)
Call to bin_pow took 2.61069 sec.
>>> val = bin_pow(a,2**20-1,msb=True,verbose=True)
Call to bin_pow took 1.89222 sec.

Got it now? 2^20 has a binary expression full of 0’s, 20 of them, preceded by a single 1. This means that neither algorithm is performing any multiplication, just squarings, so the performance is almost identical. On the other hand, 2^20−1 has a binary expression full of 1’s, 20 of them, with no 0 anywhere. This means that, although there will be one squaring less than in the previous case, there will be 19 multiplications more. And here’s where the msb approach excels, because the involved multiplications are simpler.

## Divisors

May 13, 2009

When I wrote the first post of the FFT series I confessed it was out of boredom from number theory. That statement got me into a coffee break discussion with a friend and mathematician, who claimed that the FFT is as number theory as it gets, and went on to explain why. Since I usually don’t understand a word of what he is saying when he gets started on one of these rants, which invariably lead to Riemann’s zeta function, I just pretended to acknowledge his doubts, and then forgot about the conversation altogether.

Well, a few weeks down the road it seems he was right: while trying to make my FFT code work, I have found myself dealing with all sorts of number theoretical stuff, ranging from simple ones, like factorization, to arcane ones, like the Chinese remainder theorem. And since I have this thing about complete expositions, I’m afraid that this interleaving of apparently unrelated number theory posts with the FFT ones is going to be the norm. As a matter of fact the odd post on the Euclidean algorithm was supposed to be a stepping stone to an improved radix 2 FFT, which I finally managed to do without.

But let’s get going with today’s subject: computing the divisors of a positive integer. It may not seem a very sexy topic at first glance, but there’s beauty in it, as the striking Divisor Plot website put together by Jeffrey Ventrella proves…

Getting into the algorithmic treatment of the problem, there is of course the dumber approach, along the lines of “to compute all divisors of a number, do trial division by all numbers smaller than it.” I will not pursue that line of thinking, though, and will focus on computing the divisors once the prime factorization of the number is known, which is the efficient way of dealing with this problem.

So I will assume that we know the expansion of the number into its prime factors as

$\displaystyle n = \prod_{i=1}^{\omega(n)} p_i^{a_i}$,

where the function ω(n) represents the number of distinct prime factors, p_i, each appearing with the corresponding exponent, a_i.

A divisor will have a subset of the prime factors of n as its own factors, so we could represent any divisor as a list of ω(n) numbers, the i-th ranging from 0 up to and including a_i. This scheme makes it very easy to calculate the total number of divisors as

$\displaystyle d(n)=\prod_{i=1}^{\omega(n)} (a_i+1)$.
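For instance, 540 = 2^2·3^3·5 has (2+1)·(3+1)·(1+1) = 24 divisors. A quick sketch of d(n), taking the same kind of factor list (with repeated entries) used throughout this post:

```python
from collections import Counter

def num_divisors(factors) :
    """d(n) from a list of (possibly repeating) prime factors."""
    d = 1
    for a in Counter(factors).values() :  # a is the exponent of a distinct prime
        d *= a + 1
    return d

print(num_divisors([2, 2, 3, 3, 3, 5]))  # 24, the number of divisors of 540
```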

#### Unordered Generation of Divisors
Furthermore, it points out that the computation of all the divisors could be achieved using ω(n) nested loops. Of course the number of loops to nest will vary from input to input, so recursion is probably the best of choices, or so thinks Tim Peters, as the code below has only cosmetic modifications to what he posted on the Python-list@python.org mailing list:

def divisors(factors) :
    """
    Generates all divisors, unordered, from the prime factorization.
    """
    ps = sorted(set(factors))
    omega = len(ps)

    def rec_gen(n = 0) :
        if n == omega :
            yield 1
        else :
            pows = [1]
            for j in xrange(factors.count(ps[n])) :
                pows += [pows[-1] * ps[n]]
            for q in rec_gen(n + 1) :
                for p in pows :
                    yield p * q

    for p in rec_gen() :
        yield p


This approach, while extremely fast, is not much use if you want to produce the divisors sorted, as you will then have to first generate and store all divisors, then sort them.

#### Ordered Generation of Divisors
If what you are after is not the full set of divisors, or any arbitrary subset of them, but something like “the first divisor larger than x” or “the last divisor smaller than x,” it may be worth trying other algorithms which, while being less efficient, manage to produce the divisors properly sorted.

There’s little literature on the subject, so it’s hard to find a valid benchmark. Even PARI/GP, which is the gold standard in computational number theory, seems, from the little sense I have managed to make out of their Subversion repository, to be generating all divisors (in a similar manner to above) and then sorting them.

So here’s a calculation scheme that can be effectively modified to yield ordered divisors… We start with a pivot value of 1, and an ordered list of all prime factors. Before every factor we split our flow in two:

• one branch skips the factor, and proceeds to the first factor different from the current one, keeping the pivot unchanged
• the other branch first updates the pivot, multiplying it by the factor, yields the result, and then proceeds to the next factor, with the modified pivot.

The flow is probably easier to understand with the example below, showing the branching when computing the divisors of 540 = 22·33·5. Blue arrows skip to the first different factors, red ones go to the next factors, a line going through a factor implies multiplication by it…

To keep track of this structure, I’m going to get fancy, and will build a tree-like data structure, where each item will be a list, holding the factor in the first position, and pointers, i.e. nested lists, to the next elements in the second and third ones. Again, to make things clear, here’s the data structure for 540, using the same color code as before…

Rather than traversing this structure recursively, we will approach it in a more structured way, by keeping a separate list of factors that can be visited in a next step. In the example above, when we start out, we can visit the first factor (a 2), or skip to the third (the first 3), so the list will hold two elements. If we select going to the first 2, we then have two new options, i.e. going on to the second 2, or taking the shortcut to the first 3. Note that the two instances pointing at the first 3 are different, since one has a pivot value of 1 (and will thus generate the divisor 3), while the other has it set to 2 (and will generate the divisor 6).

If we don’t care about the order of the divisors, this method can even be implemented by hand, by keeping a list of pivots and remaining factors, and on each step erasing the first element of the list and generating a divisor, plus adding up to two new items at the bottom. The first steps, plus the resulting list of divisors, are worked out below, again for the case of n = 540.

Now, while on paper there is not much choice other than processing the items sequentially, in code we could choose to always process the element that produces the smallest divisor, and thus produce them in order. The simplest implementation of this idea would simply sort the list of processable items before yielding the next divisor…

def divisors_ordered(factors) :
    """
    Generates all divisors, ordered, from the prime factorization.
    """
    factors = [(j, factors.count(j)) for j in sorted(set(factors))]
    omega = len(factors)
    fcts = [factors[0][0], [], []]
    aux = [fcts, None]
    for j in xrange(omega) :
        p, a = factors[j]
        if omega - j > 1 :
            aux[1] = [factors[j+1][0], [], []]
        else :
            aux[1] = None
        for k in xrange(1, a) :
            b = [p, [], []]
            aux[0][1] = b
            aux[0][2] = aux[1]
            aux[0] = b
        aux[0][1] = aux[1]
        aux[0][2] = aux[1]
        aux[0] = aux[1]
    div_tree = [[fcts[0], 1, fcts]]
    yield 1
    while div_tree :
        div_tree.sort()
        divisor, pivot, pointer = div_tree.pop(0)
        yield divisor
        if pointer[1] :
            div_tree += [[divisor * pointer[1][0], divisor, pointer[1]]]
        if pointer[2] :
            div_tree += [[pivot * pointer[2][0], pivot, pointer[2]]]


Of course, doing a full sort before yielding every divisor kills the efficiency of this algorithm. The fancy (or should I simply say correct?) way of keeping an ordered list, popping only the smallest item, and pushing arbitrary elements, is with a priority queue, which I will implement as a binary heap. Since we will be building our heap from scratch, we only need to worry about the push and pop functions.
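As an aside, python’s standard library already provides a binary min-heap in the heapq module. It doesn’t map directly onto the tree-pruning scheme above, but a simpler (if more wasteful) ordered generator can be sketched with it, using a set to weed out duplicates — all names here are my own:

```python
import heapq

def heapq_divisors(factors) :
    """Yields divisors in increasing order, via heapq and a seen set."""
    primes = sorted(set(factors))
    n = 1
    for p in factors :
        n *= p
    heap, seen = [1], {1}
    while heap :
        d = heapq.heappop(heap)   # smallest divisor not yet yielded
        yield d
        for p in primes :
            m = d * p
            if m not in seen and n % m == 0 :
                seen.add(m)
                heapq.heappush(heap, m)

print(list(heapq_divisors([2, 2, 3, 3, 3, 5]))[:8])  # [1, 2, 3, 4, 5, 6, 9, 10]
```

Still, rolling our own heap shows what is actually going on, so on with the push and pop functions…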

def push_heap(item, heap = None, compare = cmp) :
    """
    Pushes an item into a min-heap, or creates one.
    """
    heap_size = None
    if not heap :
        heap = [item, 1]
    else :
        heap_size = heap.pop()
        heap += [item, heap_size + 1]
    i = heap[-1] - 1
    val = heap[i]
    heap_ordered = False
    while i > 0 and not heap_ordered :
        n_i = i - 1
        if n_i & 1 : # faster than n_i % 2
            n_i -= 1
        n_i >>= 1 # faster than n_i // 2
        n_val = heap[n_i]
        if compare(val, n_val) < 0 :
            heap[i], heap[n_i] = heap[n_i], heap[i]
            i, val = n_i, heap[n_i]
        else :
            heap_ordered = True
    return heap

def pop_heap(heap, compare = cmp) :
    """
    Pops the smallest item from a min-heap.
    """
    heap_size = heap.pop()
    if not heap_size :
        return None
    elif heap_size == 1 :
        return heap.pop()
    ret, heap[0] = heap[0], heap.pop()
    heap_size -= 1
    heap += [heap_size]
    i = 0
    heap_ordered = False
    while not heap_ordered :
        n_i = 2 * i + 1
        if n_i < heap_size :
            if n_i + 1 < heap_size and compare(heap[n_i], heap[n_i + 1]) > 0 :
                n_i += 1
            if compare(heap[i], heap[n_i]) > 0 :
                heap[i], heap[n_i] = heap[n_i], heap[i]
                i = n_i
            else :
                heap_ordered = True
        else :
            heap_ordered = True
    return ret

def divisors_ordered(factors) :
    """
    Generates an ordered list of all divisors.
    """
    factors = [(j, factors.count(j)) for j in sorted(set(factors))]
    omega = len(factors)
    fcts = [factors[0][0], [], []]
    aux = [fcts, None]
    for j in xrange(omega) :
        p, a = factors[j]
        if omega - j > 1 :
            aux[1] = [factors[j+1][0], [], []]
        else :
            aux[1] = None
        for k in xrange(1, a) :
            b = [p, [], []]
            aux[0][1] = b
            aux[0][2] = aux[1]
            aux[0] = b
        aux[0][1] = aux[1]
        aux[0][2] = aux[1]
        aux[0] = aux[1]
    yield 1
    div_heap = push_heap([fcts[0], 1, fcts])
    while div_heap :
        divisor, pivot, pointer = pop_heap(div_heap)
        yield divisor
        if pointer[1] :
            div_heap = push_heap([divisor * pointer[1][0], divisor, pointer[1]], div_heap)
        if pointer[2] :
            div_heap = push_heap([pivot * pointer[2][0], pivot, pointer[2]], div_heap)


The following graph shows the time required by each of the three implementations presented to produce an ordered list of all divisors of p_n#, i.e. the product of the first n prime numbers. Do notice how the lower computational complexity of the heap sorting eventually beats the initial speed advantage of the sorting version. Still, for full lists of divisors, generating all of them and then sorting beats any other approach. It is also worth pointing out how the time required for the heap version of the ordered algorithm eventually settles at around 14 times slower than the unordered generator with delayed sorting, i.e. the computational complexities of both are the same, which probably means that sorting dominates the time of the unordered algorithm.

Just to show why we could want to use an algorithm that is more than ten times slower, the following graph shows the time needed to compute the first divisor larger than 1000 for the first few primorials. While with the heap version of the ordered algorithm the time is close to constant, the unordered algorithm shows the same behavior as in the previous test.

It is nevertheless worth noting that the break-even point is p_11# = 200,560,490,130, so unless you are after the first few divisors of a huge number, the first algorithm presented should always be the choice.

## Computing the Greatest Common Divisor

April 30, 2009

Here comes my take on one of the most ubiquitous algorithms you can find googling…

The least common multiple of 2 numbers, lcm(a,b), is the smallest number that is divisible by both a and b. The greatest common divisor of 2 numbers, gcd(a,b), is the largest number that divides both a and b. But just how do you go about calculating them? A most important property is that one can be calculated from the other based on the relation lcm(a,b)·gcd(a,b) = a·b, which ends up meaning that in practice the gcd is always calculated first. But let’s not get ahead of ourselves…
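That relation is indeed how the lcm is computed in practice; in modern python, which ships a gcd in the math module, it would read like this (dividing before multiplying, to keep the intermediate numbers small):

```python
from math import gcd

def lcm(a, b) :
    # lcm(a,b)·gcd(a,b) = a·b, so divide first to avoid a needlessly big product
    return a // gcd(a, b) * b

print(lcm(12, 18))  # 36
```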

#### Brute Force
I will disregard the dumber brute force approaches, such as “to get gcd(a,b), try all numbers down from min(a,b), until you find one that divides both a and b”, but a very resource-wasteful implementation could use a sieving approach for the lcm. See, the lcm can be no larger than a·b, which is definitely a multiple of both a and b. So if we set up an array of zeros of size a·b, we could use sieving to first set all multiples of a to 1, then do a search over all multiples of b to find the first common multiple. This approach will definitely break down for even moderately large a or b, but it is probably as good as most of us would get on our own without wikipedia…

from time import clock

def lcmBruteForce(a, b, verbose = False) :
    t = clock()
    sieve = [0] * a * b
    a, b = max(a, b), min(a, b)
    for j in range(a-1, a*b, a) :
        sieve[j] += 1
    ret = 0
    for j in range(b-1, a*b, b) :
        if sieve[j] == 1 :
            ret = j + 1
            break
    t = clock() - t
    if verbose :
        print "Calculated the lcm of", a, "and", b, "in", t, "sec."
    return ret


#### Integer Factorization
In my case I believe it was Ernesto, back in 1985 or 1986, who let me in on the secret of calculating the lcm and the gcd from the prime factorization of both numbers:

• to find the gcd, keep only the factors common to both a and b, raised to the smaller exponent, and
• for the lcm, keep all factors, common or not, raised to the largest exponent.

This approach will prove to be very inefficient, but let’s give it a go. We will reuse the factorAndSieve function from the post on integer factorization. I have also written an auxiliary function, list_to_dict, that turns the list of factors with repeating entries into a dictionary, where keys are factors and the corresponding entries are exponents:

import time

def list_to_dict(seq) :
    ret = dict()
    for j in seq :
        if j in ret :
            ret[j] += 1
        else :
            ret[j] = 1
    return ret

def gcd_factor(a, b, verbose = False) :
    t = time.clock()
    aa = list_to_dict(factorAndSieve(a))
    bb = list_to_dict(factorAndSieve(b))
    rr = dict()
    for j in bb :
        if j in aa :
            rr[j] = min(aa[j], bb[j])
    ret = 1
    for j in rr :
        ret *= j**rr[j]
    t = time.clock() - t
    if verbose :
        print "Calculated the gcd of", a, "and", b, "in", t, "sec."
    return ret
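The lcm rule is the mirror image of the gcd one: every factor, common or not, at the largest exponent. A hypothetical companion to gcd_factor, written with Counter for brevity and assuming the same kind of factor lists, could be:

```python
from collections import Counter

def lcm_factor(factors_a, factors_b) :
    """lcm from two prime factor lists: all factors, raised to the largest exponent."""
    ea, eb = Counter(factors_a), Counter(factors_b)
    ret = 1
    for p in set(ea) | set(eb) :          # union of the primes of both numbers
        ret *= p ** max(ea[p], eb[p])     # missing primes count as exponent 0
    return ret

print(lcm_factor([2, 2, 3], [2, 3, 3]))  # lcm(12, 18) = 36
```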


#### The Euclidean Algorithm
When we looked at the calculation of the modular multiplicative inverse, we already took a look at the extended Euclidean algorithm, and pointed out that it gives the gcd of the two passed numbers as a by-product. Not that it was an interesting thing then, since when calculating multiplicative inverses we are normally interested in relatively prime number pairs, i.e. gcd(a,b) = 1. But if the adjective extended is dropped from the name, we are left with the Euclidean algorithm, which is the standard way to calculate the gcd.

The basic idea behind the algorithm is the observation that if a > b, and g is the gcd of a and b, then it is also the gcd of a − b and b. To make it more efficient, rather than subtracting, we can cut corners by using the equivalent fact that, if g is the gcd of a and b, then it is also the gcd of b and a mod b. If we perform that same operation recursively, the result of the modular operation will eventually reach zero after a finite number of iterations, and the gcd then is the last non-zero remainder. Or, in code…

import time

def gcd_euclid(a, b, verbose = False) :
    t = time.clock()
    aa, bb = a, b
    while b != 0 :
        a, b = b, a % b
    t = time.clock() - t
    if verbose :
        print "Calculated the gcd of", aa, "and", bb, "in", t, "sec."
    return a


#### The Binary GCD Algorithm
Euclid’s algorithm is so fast that there really is not much point in trying to improve it. If we were using C rather than python, we could have taken advantage of the fact that computers do all their arithmetic in binary, and used the binary GCD algorithm. In python the resulting code is less clear, and the run times are slightly slower, probably owing to the fact that bitwise operations “don’t get optimized down to single CPU instructions (…) like in C,” or so says a commenter here… The truth is I haven’t been able to implement the algorithm so that it runs faster than the standard Euclidean one, so I’ll spare you my code…
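For completeness, though, the textbook form of the binary GCD (Stein’s algorithm) goes roughly as follows — only shifts, parity tests and subtractions, and, as said, in python no faster than gcd_euclid:

```python
def gcd_binary(a, b) :
    """Stein's binary GCD algorithm, for non-negative a and b."""
    if a == 0 : return b
    if b == 0 : return a
    shift = 0
    while not (a | b) & 1 :     # both even: pull out a common factor of 2
        a >>= 1 ; b >>= 1 ; shift += 1
    while not a & 1 :           # factors of 2 of a single side don't affect the gcd
        a >>= 1
    while b :
        while not b & 1 :
            b >>= 1
        if a > b :
            a, b = b, a
        b -= a                  # both odd here, so the difference is even
    return a << shift

print(gcd_binary(48, 18))  # 6
```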

## Naive Integer Factorization

April 9, 2009

After three posts (1, 2, 3) on calculating prime numbers, it is probably worth putting that knowledge to a more useful task. As we will see in the near future, integer factorization, i.e. breaking down a (composite) number into its prime factors, is one such task. Strictly speaking, factoring a number n is simply decomposing it as the product of two smaller non-trivial divisors, i.e. divisors different from 1 and n itself. But by repeatedly factoring the divisors one will eventually come up with a unique set of primes which, when multiplied together, render the original number, or so says the fundamental theorem of arithmetic… The point is, we will consider factorization a synonym of prime decomposition, be it formally correct or not.

There are some very sophisticated methods to factor very large numbers, but they use a lot of extremely complex math, so I doubt they will ever find their way onto this blog. So we are going to be left with the naive, straightforward approach as our only option, although I will try to give it an efficiency boost. What is this naive approach? Trial division, of course: given a composite number n, we know that its smallest prime factor will not exceed the square root of n, so we can simply try and see if any of those numbers divide it. No, I will not try to code that yet… If you have read the entries on determining prime numbers, it should come as no surprise that we really do not need to do trial division by all numbers smaller than the square root of n, but only by the primes among them. This is a consequence of the fact that, if a composite number divides n, then each of the prime factors of that composite number will also divide n. According to the prime number theorem, the number of primes below x is asymptotic to x / log x. So by limiting our trials to prime numbers we can reduce the number of tests from n^(1/2) to something around 2·n^(1/2) / log n. If we rescue the primeListSofE function from the post on the sieve of Eratosthenes, a python implementation of naive factorization could look something like this…

from time import clock

def factor(n, verbose = False) :
    """
    Returns all prime factors of n, using trial division by prime
    numbers only. Returns a list of (possibly repeating) prime factors.
    """
    t = clock()
    ret = []
    nn = n
    maxFactor = int(n**0.5)
    primes = primeListSofE(maxFactor, verbose)
    for p in primes :
        while nn % p == 0 :
            nn //= p
            ret += [p]
        if nn == 1 :
            break
    if nn != 1 :
        ret += [nn]
    t = clock() - t
    if verbose :
        print "Calculated factors of", n, "in", t, "sec."
    return ret


While this function will be about as good as we can make it for numbers which are the product of two large prime factors, it will be terribly inefficient for most numbers. Consider, as an extreme example, that we are trying to factor 2^55 ~ 3.6·10^16. We would first calculate all primes up to 1.9·10^8, a challenging feat in itself with our available tools, only to find out that we only needed the first of those primes, i.e. 2. Taking into account that 50% of all numbers are divisible by 2, 33% are divisible by 3, 20% are divisible by 5… it doesn’t seem wise to disregard the potential time savings. What we can do to profit from this is to do the trial division checks at the same time as we determine the prime numbers, updating the largest prime to test on the fly. This has to be done in two stages, the first while we sieve up to n^(1/4), the second while we search the rest of the sieve up to n^(1/2) looking for more primes. The following Franken-code has been written mostly by cut-and-paste from primeListSofE and factor, which hopefully hasn’t affected its readability much:

from time import clock

def factorAndSieve(n, verbose = False) :
    """
    Returns all prime factors of n, using trial division while sieving
    for primes. Returns a list of (possibly repeating) prime factors.
    """
    t = clock()
    ret = []
    nn = n
    while nn % 2 == 0 : # remove 2's first, as 2 is not in sieve
        nn //= 2
        ret += [2]
    maxFactor = int(nn**0.5)
    maxI = (maxFactor-3) // 2
    maxP = int(maxFactor**0.5)
    sieve = [True for j in xrange(maxI+1)]
    i = 0
    for p in xrange(3, maxP+1, 2) : # we then sieve as far as needed
        if p > maxP :
            break
        i = (p-3) // 2
        if sieve[i] :
            while nn % p == 0 :
                nn //= p
                ret += [p]
                maxFactor = int(nn**0.5)
                maxI = (maxFactor-3) // 2
                maxP = int(maxFactor**0.5)
            if nn == 1 :
                break
            else :
                i2 = (p*p - 3) // 2
                for k in xrange(i2, maxI+1, p) :
                    sieve[k] = False
    index = i
    for i in xrange(index, maxI+1) : # and inspect the rest of the sieve
        if i > maxI :
            break
        if sieve[i] :
            p = 2*i + 3
            while nn % p == 0 :
                nn //= p
                ret += [p]
                maxFactor = int(nn**0.5)
                maxI = (maxFactor-3) // 2
                maxP = int(maxFactor**0.5)
            if nn == 1 :
                break
    if nn != 1 :
        ret += [nn]
    t = clock() - t
    if verbose :
        print "Calculated factors of", n, "in", t, "sec."
        print "Stopped trial division at", 2*i+3, "instead of", int(n**0.5)
    return ret


This new code will very often be much faster than the previous one, but at times it will be just about as slow, or even slower, since mixing both codes introduces some inefficiencies. The most extreme examples of such cases would be a prime number, or the square of a prime, on one side, and a power of 2 on the other.

The graph above plots times to calculate the factors of the numbers between 10^6 and 10^6 + 100. Prime numbers in this interval stick out as the red dots among the blue ones: 10^6 +3, +33, the twin primes +37 and +39, +81 and +99. And numbers with many small prime factors populate the bottom of the red cloud.

If the above graph is not enough to convince you of the benefits of the second approach, maybe these timings for very large numbers will:

>>> factor(10**15+37,True)
Calculated primes to 31622776 in 6.760 sec.
Calculated factors of 1000000000000037 in 8.466 sec.
[1000000000000037L]
>>> factorAndSieve(10**15+37,True)
Calculated factors of 1000000000000037 in 8.666 sec.
Stopped trial division at 31622775 instead of 31622776
[1000000000000037L]


>>> factor(2**55,True)
Calculated primes to 189812531 in 42.811 sec.
Calculated factors of 36028797018963968 in 43.261 sec.
[2, ..., 2]
>>> factorAndSieve(2**55,True)
Calculated factors of 36028797018963968 in 8.632e-05 sec.
Stopped trial division at 3 instead of 189812531
[2, ..., 2]

## Prime Numbers 3. Wheel Factorization

March 24, 2009

In an earlier post, I described the sieve of Eratosthenes, a very fast algorithm to calculate all primes below a given limit, n. At least when programming it in python, speed is not an issue at all with this algorithm, because we start having memory problems much earlier, due to the fact that we need to keep in memory a list of size n…

This is especially problematic in python because, although all we need is a list of booleans (is prime / is not prime), each of which could fit in a single bit, I'm afraid python assigns several bytes to each. How many is probably machine-dependent, but I'm almost certain it is at least 4. So it is very likely that we could improve memory usage by a factor of 32, maybe more, if we could get around this. There are not-so-standard libraries that do just that, but I've sworn not to use them, and one day I may expose my failed attempts at using very long ints as bit arrays to the world, but python is definitely not the best language to mess around with low-level optimizations, so I'm going to leave this line of thinking aside for the time being.
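In fact, a quick check with sys.getsizeof supports this. Here is a minimal sketch (Python 3 syntax; the exact sizes are implementation- and platform-dependent, and the figure in the comment assumes a 64-bit CPython):

```python
import sys

# A Python list stores one machine pointer per element (8 bytes on a
# 64-bit CPython build), no matter how small the element itself is,
# so a list of n booleans costs far more than the n bits we need.
n = 1000
flags = [True] * n
per_element = (sys.getsizeof(flags) - sys.getsizeof([])) / n
print(per_element)  # typically 8.0 on 64-bit builds, i.e. 64 bits per flag
```

Note that sys.getsizeof only counts the list object itself, not the (single, shared) True object it points to, which is exactly the per-slot overhead we care about here.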

I also presented a very naive optimization, which simply discards all even numbers, which we already know are not prime, and thus cuts in half the memory requirements. More formally, what was being done there was to partition all numbers into two groups: those of the form 2·n, and those of the form 2·n+1, and then discard the former, since no prime of that form exists, and keep the latter only. This line of thinking is much more promising, as it leads to a very neat way of further reducing the memory required, known as wheel factorization.

For some odd reason that escapes my shallow understanding, the standard description of this algorithm places much emphasis on setting all numbers in circles, and then eliminating what it calls spokes. I personally don't see any benefit in such a mental image, especially since the numbers are going to end up arranged in a rectangular array. So for the sake of clarity, I will not honor tradition, and will spare you the circles and the spokes.

We begin by choosing a first few prime numbers. Say the first three: 2, 3 and 5. We multiply them together to get M = 30. We are now going to partition the set of all numbers into M groups, of the form M·m + k_i, where the k_i are the numbers 1 through M, both inclusive. Most of these subsets happen to hold no prime number (other than, at most, one of the primes dividing M itself), and thus need not be considered at all. To find out which ones, remember that M is the product of several prime numbers, M = p_1·p_2·…·p_n. Therefore, if k_i is a multiple of any of those primes, i.e. k_i = p_j·k'_i, we can always extract the common factor p_j from both summands in M·m + k_i, which means that every number in that subset is composite. These subsets, which we discard, are very closely related to the spokes I was telling you about a while ago.

If M = 30, of the 30 potential k_i we are left with just 8 (1, 7, 11, 13, 17, 19, 23 and 29), so we have cut the memory required to about 27% of the straightforward implementation of the sieve of Eratosthenes. If we use the first 4 primes, we then have M = 210, and of the 210 potential k_i we only need to keep 48. This improves memory use, which is now roughly 23% of the original. But you can probably see a law of diminishing returns at work here: the improvement in memory use gets smaller and smaller with every prime you add to the calculation of M. You can find some more data regarding this at the end of this link.
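The surviving k_i are exactly the numbers between 1 and M that are coprime to M, which makes the counts above easy to check. A quick sketch in Python 3 (wheel_residues is a name of my choosing, not from the post):

```python
from math import gcd

def wheel_residues(primes):
    """Return (M, residues): M is the product of the given primes,
    and residues are the k in 1..M with gcd(k, M) == 1, the only
    classes M*m + k that can contain primes not dividing M."""
    M = 1
    for p in primes:
        M *= p
    return M, [k for k in range(1, M + 1) if gcd(k, M) == 1]

M, ks = wheel_residues([2, 3, 5])
print(M, ks)         # 30 [1, 7, 11, 13, 17, 19, 23, 29]
M2, ks2 = wheel_residues([2, 3, 5, 7])
print(M2, len(ks2))  # 210 48, i.e. roughly 23% of the plain sieve
```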

Of course we are still going to have to sieve each of our subsets, in much the same way we did with the sieve of Eratosthenes. But sieving all multiples of a prime p when you have several subsets to consider is anything but trivial. Let's see how to tackle it…

The question to answer is: what integer values of m fulfill the equation M·m + k_i = p·n? Using modular arithmetic, this is equivalent to solving M·m ≡ −k_i (mod p). That requires calculating M^−1 (mod p), the modular multiplicative inverse of M modulo p, a subject already treated in a previous post. Given that, we can simply write m_i = M^−1·(−k_i) (mod p), and therefore determine all multiples of p in the subset of k_i as m = p·j + m_i.

Last, but not least, we want to start sieving at the smallest multiple of p in each subset that is larger than or equal to p^2, and so we want to increment m_i by p·j_i, where j_i is the smallest non-negative integer with j_i ≥ (p^2 − k_i − M·m_i) / (M·p).
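Putting the last two paragraphs together, here is a small standalone check for M = 30 and p = 7, written for Python 3 (it uses the built-in pow(M, -1, p), available since Python 3.8, in place of the modInvEuclid function this post relies on):

```python
# Starting rows for sieving multiples of p = 7 in each residue
# class of the M = 30 wheel.
M = 30
ks = [1, 7, 11, 13, 17, 19, 23, 29]
p = 7

invM = pow(M, -1, p)                  # M^-1 (mod p), Python 3.8+
starts = []
for ki in ks:
    mi = invM * (-ki % p) % p         # smallest m >= 0 with p | M*m + ki
    ji = -(-(p*p - ki - M*mi) // (M*p))   # ceiling division; never negative here
    mi += ji * p                      # skip ahead so that M*mi + ki >= p**2
    first = M * mi + ki
    # the first sieved entry is a multiple of p, lies in the right
    # class, is at least p**2, and is the smallest such number:
    assert first % p == 0 and first % M == ki
    assert first >= p*p and first - M*p < p*p
    starts.append(mi)
print(starts)
```

For instance, for k_i = 19 the very first candidate is 30·1 + 19 = 49 = p^2 itself, so no bump is needed, while for k_i = 7 the class starts at 7 < 49 and must be pushed up to 217 = 7·31.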

Let's try to put all these ideas together in a python script. To make it work, you will need the primeListSofE function from this post, as well as the extEuclideanAlg and modInvEuclid ones from this other post.

from math import sqrt
from time import time

def primeListWF(n, m, verbose = True) :
    """Returns a list of prime numbers smaller than or equal to n,
    using wheel factorization, with a wheel size of the product of
    all primes smaller than or equal to m
    """
    t = time()
    # We use the standard sieve of Eratosthenes first...
    primes = primeListSofE(m)
    M = 1
    for p in primes :
        M *= p
    if verbose :
        print "Size of factorization wheel is",M
    # We now sieve out all multiples of these primes, including
    # the primes themselves, to get the k[i]
    sieve = [True] * M
    for p in primes :
        sieve[p-1] = False
        for j in xrange(p*p,M+1,p) :
            sieve[j-1] = False
    k = [j+1 for j in range(M) if sieve[j]]
    if verbose :
        print "Memory use is only",len(k),"/",M,"=",float(len(k))/M
    N = n
    if N % M :
        N += M
    N //= M
    maxP = int(sqrt(M*N))
    if verbose:
        print "Primes will be calculated up to",M*N,"not",n

    sieve = [[True]*N for j in range(len(k))]
    sieve[0][0] = False # Take care of the non primality of 1
    for row in range(N) :
        baseVal = M * row
        for subset in range(len(k)) :
            if sieve[subset][row] :
                p = baseVal + k[subset]
                primes += [p]
                if p > maxP :
                    continue
                # Sieve all multiples of p...
                invMp = modInvEuclid(M,p)
                for i in range(len(k)) :
                    mi = invMp * (-k[i] % p)
                    mi %= p
                    # ...starting at p**2
                    ji = (p*p - k[i] - M*mi)
                    if ji % (M*p) :
                        ji += M*p
                    ji //= M*p
                    mi += ji*p
                    for r in range(mi,N,p) :
                        sieve[i][r] = False
    if verbose :
        print "Calculated primes to",n,"in",time()-t,"sec."

    return primes


You can play around with this algorithm and look at the time required. It probably depends on the value of n, but there seems to be a sweet spot for speed somewhere around m = 7 (and thus M = 210) or m = 11 (M = 2310). Being a general-purpose wheel factorization algorithm, it runs faster than the unoptimized sieve of Eratosthenes code, but it cannot beat the optimized one, which is in fact a particular case of wheel factorization, with m = M = 2.

On the other hand, using m = 7 I have managed to compute all primes up to 10^8, a feat unreachable for the previous codes, at least without finding a workaround for the MemoryError they produce. 10^9 is still out of reach, though…

## Modular Multiplicative Inverse

March 20, 2009

Now that we are all proficient with modular exponentiation thanks to my previous post, it is time to tackle a more complicated issue, still in the realm of modular arithmetic.
While addition, subtraction and multiplication are “compatible with the congruence relation” introduced by modular arithmetic, the same doesn't happen with division. This means that solving a simple equation such as a·x ≡ 1 (mod m) is anything but trivial. In fact, determining the x = a^−1 (mod m) that fulfills that equation is the whole object of this post…

There is, as always, the brute force approach, which I always like to visit first, so that I can later congratulate myself on how much the efficiency has improved thanks to my deep insight. In this particular case, brute force means trying every number between 1 and m−1, and checking whether its product with a is 1 modulo m:

def modInv1(a,m) :
    """
    Computes the modular multiplicative inverse of a modulo m,
    using brute force
    """
    a %= m
    for x in range(1,m) :
        if a*x % m == 1 :
            return x
    return None


Do notice that the possibility of no multiplicative inverse existing is contemplated in the code. This will actually be the case if a and m have any common factors, i.e. if they are not coprime. So the previous code could serve as a moron’s approach to determining if two numbers are relatively prime. Not that I recommend it, though, as it should be clear that the code I have just produced will be very inefficient for large values of m.
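For instance, in Python 3 syntax (with modInv1 repeated so the snippet runs on its own):

```python
from math import gcd

def modInv1(a, m):
    """Brute-force modular inverse: the function above, repeated
    here in Python 3 syntax so this snippet is self-contained."""
    a %= m
    for x in range(1, m):
        if a * x % m == 1:
            return x
    return None

print(modInv1(3, 7))   # 5, since 3*5 = 15 is 1 (mod 7)
print(modInv1(4, 8))   # None: gcd(4, 8) = 2, so no inverse exists
# the inverse exists exactly when a and m are coprime:
assert all((modInv1(a, 26) is None) == (gcd(a, 26) > 1) for a in range(1, 26))
```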

The Extended Euclidean Algorithm
As stated, we are after a number x that fulfills a·x = 1 (mod m). This can of course be written as well as a·x = 1 + m·y, which rearranges neatly into a·x – m·y = 1. Since x and y need not be positive, we can write it as well in the standard form, a·x + m·y = 1. If a and m are coprime, this is a particular instance of a·x + b·y = gcd(a,b), where gcd(a,b) is the greatest common divisor of a and b. This equation is known as Bézout’s identity, and can be efficiently solved using the extended Euclidean algorithm. To find a solution to our particular equation, this algorithm would proceed as follows:

1. Rewrite the equation as a_n·x_n + m_n·y_n = 1, starting with n = 0
2. Compute the quotient, q_n, and the remainder, r_n, of a_n divided by m_n, and rewrite the original equation as (m_n·q_n + r_n)·x_n + m_n·y_n = 1
3. Rearrange the terms in the equation to get m_n·(q_n·x_n + y_n) + r_n·x_n = 1
4. Do the changes of variables
• x_{n+1} = q_n·x_n + y_n,
• y_{n+1} = x_n,
• a_{n+1} = m_n, and
• m_{n+1} = r_n
5. Repeat the process until a_n = 1 and m_n = 0.
6. Such an equation has the trivial solution x_n = 1, y_n = 0.
7. Work your way back to x_0 and y_0, noting that
• x_{n−1} = y_n,
• y_{n−1} = x_n − q_{n−1}·y_n
8. x_0 is the number we are after.
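The same computation can also be written iteratively, maintaining the Bézout coefficients as it goes instead of back-tracking at the end. A small Python 3 sketch (ext_euclid_iter is a name of my choosing):

```python
def ext_euclid_iter(a, b):
    """Returns (x, y, g) with a*x + b*y == g == gcd(a, b), keeping
    two rows of Bezout coefficients updated at every division step."""
    x0, y0, x1, y1 = 1, 0, 0, 1
    while b:
        q = a // b
        a, b = b, a - q * b
        x0, x1 = x1, x0 - q * x1
        y0, y1 = y1, y0 - q * y1
    return x0, y0, a

x, y, g = ext_euclid_iter(240, 46)
print(x, y, g)               # a Bezout pair for gcd(240, 46) = 2
assert 240 * x + 46 * y == g == 2
```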

There is a certain component of faith in believing that, eventually, a_n = 1 and m_n = 0. It is in fact not always the case: if a and m are not coprime, then the values we arrive at are m_n = 0, but a_n = gcd(a,m). As a matter of fact, the gcd is normally computed using the Euclidean algorithm, dropping the adjective extended, which basically does the same as above, but without the back-tracking of the x_n's and the y_n's. With all of this in mind, we can write two functions, one calling the other, as follows:

def extEuclideanAlg(a, b) :
    """
    Computes a solution to a x + b y = gcd(a,b), as well as gcd(a,b)
    """
    if b == 0 :
        return 1, 0, a
    else :
        x, y, gcd = extEuclideanAlg(b, a % b)
        return y, x - y * (a // b), gcd

def modInvEuclid(a,m) :
    """
    Computes the modular multiplicative inverse of a modulo m,
    using the extended Euclidean algorithm
    """
    x, y, gcd = extEuclideanAlg(a,m)
    if gcd == 1 :
        return x % m
    else :
        return None


The gains in speed are tremendous with this new approach. Again, write your own testing function as an exercise, but mine spits out the following results…

>>> testModInv(1234,5,10000)
Got 4 using method 1 in 0.0160000324249 sec.
Got 4 using Euclidean algorithm in 0.0309998989105 sec.
>>> testModInv(1234,56789,10000)
Got 31800 using method 1 in 59.4000101089 sec.
Got 31800 using Euclidean algorithm in 0.047000169754 sec.

So while brute force is good enough, or even the best choice, for very small values of m, as soon as m grows the benefits of the more elaborate algorithm become all too obvious. According to wikipedia, the algorithm we have coded has complexity O(log(m)^2), i.e. the time required to compute the modular multiplicative inverse is proportional to log(m)^2.
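As an aside for readers on recent interpreters: since Python 3.8, the built-in pow accepts negative exponents and computes modular inverses natively, which makes a convenient cross-check for the functions above:

```python
from math import gcd

# pow(a, -1, m) returns the modular inverse (Python 3.8+), raising
# ValueError when gcd(a, m) != 1, i.e. when no inverse exists.
for a in range(1, 50):
    for m in range(2, 50):
        if gcd(a, m) == 1:
            assert a * pow(a, -1, m) % m == 1
print(pow(1234, -1, 56789))  # 31800, matching the timing runs above
```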