The radix-2 Cooley-Tukey FFT Algorithm with Decimation in Time

April 30, 2009

EDIT May 29th 2009: The code presented in this post has a major bug in the calculation of inverse DFTs using the FFT algorithm. Some explanation can be found here, and fixed code can be found here.

Once the DFT has been introduced, it is time to start computing it efficiently. Focusing on the direct transform,

\displaystyle X_k = \sum_{n=0}^{N-1}{x_n e^{-2\pi i k n / N}},

if the size of the input is even, we can write N = 2·M and it is possible to split the N element summation in the previous formula into two M element ones, one over n = 2·m, another over n = 2·m + 1. By doing so, after some cancellations in the exponents we arrive at

\displaystyle X_k = \sum_{m=0}^{M-1}{x_{2m} e^{-2\pi i \frac{m k}{M} }} + e^{-2\pi i\frac{k}{N}} \sum_{m=0}^{M-1}{x_{2m+1} e^{-2\pi i\frac{m k}{M}}}.

There’s something very promising here: each of the two summations are very close to being the expression of a DFT of half the size of the original. The only thing preventing them from fully being so is the fact that k runs from 0 to N-1, rather than up to M-1 only. To get around this, we will consider two different cases: k < M and k ≥ M. The first trivially produces a pair of half sized DFTs, which we will term Ek and Ok, owing to one running over the even indexes, the other over the odd ones:

\displaystyle E_k = \sum_{m=0}^{M-1}{x_{2m} e^{-2\pi i \frac{m k}{M} }},  O_k = \sum_{m=0}^{M-1}{x_{2m+1} e^{-2\pi i \frac{m k}{M} }}.

As for the case k ≥ M, we may as well write it out as k = M + k’. And if we plug this value into the original formula, work a little with the algebra, and then undo the variable change, i.e. plug k’ = M – k back in, we arrive at the following combined formula,

X_k  =  \left\{ \begin{array}{ll} E_k +  e^{-2\pi i \frac{k}{N}} O_k & \mbox{if } k < M \\ E_{k-M} -  e^{-2\pi i\frac{k-M}{N}} O_{k-M} & \mbox{if } k \geq M \end{array} \right..

The cool thing about this decomposition is that we have turned our N element DFT into two N/2 DFTs, plus some overhead composing the results, which includes calculating twiddle factors, i.e. the complex exponential multiplying the O’s in the last formula. Given the quadratic nature of the time required to calculate the DFT, this amounts to cutting the time in half. And the best part is that we need not stop after one such decomposition: if the original size is a power of two, we can keep doing the same trick recursively until we reach a DFT of only two elements, for which we already have produced explicit formulas. A similar approach can be used for the inverse DFT, the details are built into the following code…

from __future__ import division
import math
import time

def fft_CT(x, inverse = False, verbose = False) :
    """
    Computes the DFT of x using Cooley-Tukey's FFT algorithm.
    """
    t = time.clock()
    N = len(x)
    inv = -1 if not inverse else 1
    if N % 2 :
        return dft(x, inverse, False)
    x_e = x[::2]
    x_o  = x[1::2]
    X_e = fft_CT(x_e, inverse, False)
    X_o  = fft_CT(x_o, inverse, False)
    X = []
    M = N // 2
    for k in range(M) :
        X += [X_e[k] + X_o[k] * math.e ** (inv * 2j * math.pi * k / N)]
    for k in range(M,N) :
        X += [X_e[k-M] - X_o[k-M] * math.e ** (inv * 2j * math.pi * (k-M) / N)]
    if inverse :
        X = [j/2 for j in X]
    t = time.clock() - t
    if verbose :
        print "Computed","an inverse" if inverse else "a","CT FFT of size",N,
        print "in",t,"sec."
    return X

The speed gain with this approach, restricted for now to power of two sizes, are tremendous even for moderately large input sizes. Mostly so due to the fact the new algorithm scales as N log N. For instance, a 220 (~106) item input can be processed with the FFT approach in a couple of minutes, while the projected duration using the naive DFT will be more like a couple of weeks…

Twiddle Factors Again
While it may not be immediately obvious, the previous code is generously wasting resources by calculating over and over again the same twiddle factors. Imagine, for instance, a 1024 element DFT computed using this algorithm. At some point in the recursion the original DFT will have been exploded into 64 DFTs of 16 elements each. Each of this 64 DFTs must then be composed with the exact same 8 different twiddle factors, but right now they are being calculated from scratch each of the 64 times. Furthermore, half of this 8 different twiddle factors are the same that where needed to compose the 128 DFTs of 8 elements into the 64 DFTs of 16 elements… Summing it all up, to compute a FFT of size 2n, which requires the calculation of 2n-1 different twiddle factors, we are calculating as many as n·2n-1 of them.

We can easily work around this by precalculating them first, and then passing the list to other functions. For instance as follows…

def fft_CT_twiddles(x, inverse = False, verbose = False, twiddles = None) :
    """
    Computes the DFT of x using Cooley-Tukey's FFT algorithm. Twiddle factors
    are precalculated in the first function call, then passed down recursively.
    """
    t = time.clock()
    N = len(x)
    inv = -1 if not inverse else 1
    if N % 2 :
        return dft(x, inverse, False)
    M = N // 2
    if not twiddles :
        twiddles = [math.e**(inv*2j*math.pi*k/N) for k in xrange(M)]+[N]
    x_e = x[::2]
    x_o  = x[1::2]
    X_e = fft_CT_twiddles(x_e, inverse, False, twiddles)
    X_o  = fft_CT_twiddles(x_o, inverse, False, twiddles)
    X = []
    for k in range(M) :
        X += [X_e[k] + X_o[k] * twiddles[k * twiddles[-1] // N]]
    for k in range(M,N) :
        X += [X_e[k-M] - X_o[k-M] * twiddles[(k - M) * twiddles[-1] // N]]
    if inverse :
        X = [j/2 for j in X]
    t = time.clock() - t
    if verbose :
        print "Computed","an inverse" if inverse else "a","CT FFT of size",N,
        print "in",t,"sec."
    return X

This additional optimization does not improve the time scaling of the algorithm, but reduces the calculation time to about two thirds of the original one.

Advertisements

Computing the Greatest Common Divisor

April 30, 2009

Here comes my take at one of the most ubiquitous algorithms you can find googling

The least common multiple of 2 numbers, lcm(a,b), is the smallest number that is divisible by both a and b. The greatest common divisor of 2 numbers, gcd(a,b), is the largest number that divides both a and b. But just how do you go about calculating them? A most important property is that one can be calculated from the other based on the relation lcm(a,b)·gcd(a,b) = a·b, which ends up meaning that in practice the gcd is always calculated first. But lets not get ahead of ourselves…

Brute Force
I will disregard the dumber brute force approaches, such as “to get gcd(a,b), try all numbers down from min(a,b), until you find one that divides both a and b”, but a very resource wasteful implementation could use a sieving approach for the lcm. See, the lcm has to be smaller than a·b, which is definitely a multiple of both a and b. So if we set up an array of zeros of size a·b, we could use sieving to first set all multiples of a to 1, then do a search over all multiples of b to find the first common multiple. This approach will definitely break down for even moderately large a or b, but it is probably as good as most of us would get on our own without wikipedia…

from time import clock

def lcmBruteForce(a, b, verbose = False) :
    t = clock()
    sieve = [0]*a*b
    a,b = max(a,b), min(a,b)
    for j in range(a-1,a*b,a) :
        sieve[j] += 1
    ret = 0
    for j in range(b-1,a*b,b) :
        if sieve[j] == 1 :
            ret = j + 1
            break
    t = clock()-t
    if verbose :
        print "Calculated the lcm of",a,"and",b,"in",t,"sec."
    return ret

Integer Factorization
In my case I believe it was Ernesto, back in 1985 or 1986, who showed me into the secret of calculating the lcm and the gcd from the prime factorization of both numbers:

  • to find the gcd, keep only factors common to both and b, raised to the smaller exponent, and
  • for the lcm, keep all factors, common or not, raised to the largest exponent.

This approach will prove to be very inefficient, but lets give it a go. We will reuse the factorAndSieve function from the post on integer factorization. I have also written an auxiliary function, listToDict, that turns the list of factors with repeating entries, into a dictionary, where keys are factors and corresponding entries are exponents:

import time

def list_to_dict(seq) :
    ret = dict()
    for j in seq :
        if j in ret :
            ret[j] += 1
        else :
            ret[j] = 1
    return ret

def gcd_factor(a, b, verbose = False) :
    t = time.clock()
    aa = list_to_dict(factorAndSieve(a))
    bb = list_to_dict(factorAndSieve(b))
    rr = dict()
    for j in bb :
        if j in aa :
            rr[j] = min(aa[j], bb[j])
    ret = 1
    for j in rr :
        ret *= j**rr[j]
    t = time.clock() - t
    if verbose :
        print "Calculated the gcd of",a,"and",b,"in",t,"sec."
    return ret

The Euclidean Algorithm
When we looked at the calculation of the modular multiplicative inverse, we already took a look at the extended Euclidean algorithm, and pointed out it gives the gcd of the two passed numbers as a by-product. Not that it was an interesting thing then, since when calculating multiplicative inverses, we normally are interested in relatively prime number pairs, i.e. gcd(a,b) = 1. But if the adjective extended is dropped from the name, we are left with the Euclidean algorithm, which is the standard to calculate the gcd.

The basic idea behind the algorithm is the observation that if a > b, and g is the gcd of a and b, then it is also the gcd of a – b and b. To make it more efficient, rather than subtracting, we can cut corners by using the equivalent fact that, if g is the gcd of a and b, then it is also the gcd of b and a mod b. If we perform that same operation recursively, after a finite number of iterations, the result of the modular operation will eventually reach zero, and the gcd then is equal to the last non-zero remainder. Or

import time

def gcd_euclid(a, b, verbose = False) :
    t = time.clock()
    aa, bb = a,b
    while b != 0 :
        a, b = b, a%b
    t = time.clock() - t
    if verbose :
        print "Calculated the gcd of",aa,"and",bb,"in",t,"sec."
    return a

The Binary GCD Algorithm
Euclid’s algorithm is so fast that there really is not much point in trying to improve it. If we were using C rather than python, we could have taken advantage of the fact that a computer that does all its arithmetic in binary, and use the binary GCD algorithm. In python the resulting code is less clear, and the run times are slightly slower, probably owing to the fact that bitwise operations “don’t get optimized down to single CPU instructions (…) like in C,” or so says a commenter here… The truth is I haven’t been able to implement the algorithm so that it runs faster than the standard Euclidean one, so I’ll spare you my code…


The Discrete Fourier Transform

April 30, 2009

I’m currently a little fed up with number theory, so its time to change topics completely. Specially since the post on basic integer factorization completes what I believe is a sufficient toolkit to tackle a very cool subject: the fast Fourier transform (FFT).

I have some mixed feelings about how does Fourier analysis qualify for the “uncomplicated complexity” rule I imposed on myself when starting this blog. There certainly is a lot of very elaborate math behind it, but I think that if you skip all the mathematical subtleties and dive head first into a “the Fourier transform gives you the frequencies present in the signal” type of intuitive description, the journey from the more naive implementations of the discrete Fourier transform (DFT) to the many flavors of the FFT is a very enjoyable one, which requires only a limited amount of complication, but gentle doses of complexity. Which is exactly what I’m after…

So I will take the definition of the DFT for granted, and concentrate on the algorithms to compute it efficiently. There will have to be a few leaps of faith, but I will try to mark them clearly, so you can whisper “amen” and keep going without a second thought. Although things may unfold differently, I think there is material for about half a dozen posts. Towards the end I may devote some time to optimizing the code, with things such as avoiding recursion, or doing the calculations in place, but be warned that most of my efforts will be devoted to optimizing the math behind the code, not the code itself.

The Discrete Fourier Transform
Without further explanation, we will begin by writing down the analytical expression of the DFT,

\displaystyle X_k = \sum_{n=0}^{N-1}{x_n e^{-2\pi i k n / N}},

and of its corresponding inverse transform,

\displaystyle x_n =\frac{1}{N} \sum_{k=0}^{N-1}{X_k e^{2\pi i k n / N}}.

With the built-in support for complex arithmetic, there really isn’t much mistery in turning these two formulas into python functions, or as I have chosen, one with an inverse switch:

from __future__ import division
import math
import time

def dft(x, inverse = False, verbose = False) :
    t = time.clock()
    N = len(x)
    inv = -1 if not inverse else 1
    X =[0] * N
    for k in xrange(N) :
        for n in xrange(N) :
            X[k] += x[n] * math.e**(inv * 2j * math.pi * k * n / N)
        if inverse :
            X[k] /= N
    t = time.clock() - t
    if verbose :
        print "Computed","an inverse" if inverse else "a","DFT of size",N,
        print "in",t,"sec."
    return X

Of course, having two nested loops of the size of the input means that computation time will grow with N2, which is extremely inefficient.

Twiddle Factors
Apart from the FFT algorithms we will start with in the next post, there are some pretty straightforward optimizations at hand. While some texts reserve the name for similar factors appearing in FFT algorithms, the complex exponentials coming up in the DFT formula are often known as twiddle factors. While in principle there are N2 of them to calculate, we can take advantage of the fact that

\displaystyle e^{\varphi i} = \cos \varphi + i \sin \varphi,

and owing to the periodic nature of trigonometric functions, it isn’t hard to see that n·k can be replaced by n·k mod N without altering the result, which means only N different ones do exist.

A python implementation with pre-calculated twiddle factors could look something like this…

from __future__ import division
import math
import time

def dft(x, inverse = False, verbose = False) :
    t = time.clock()
    N = len(x)
    inv = -1 if not inverse else 1
    twiddles = [math.e**(inv*2j*math.pi*k/N) for k in xrange(N)]
    X =[0] * N
    for k in xrange(N) :
        for n in xrange(N) :
            X[k] += x[n] * twiddles[n * k % N]
        if inverse :
            X[k] /= N
    t = time.clock() - t
    if verbose :
        print "Computed","an inverse" if inverse else "a","DFT of size",N,
        print "in",t,"sec."
    return X

On my machine this code is about 5 times faster than the original one, although it still scales with time as N2, so we are only postponing the inevitable. Do notice as well that the increase in speed comes at the cost of an additional array of complex of size N memory requirement.

Particular Cases
Without being touched with insightful inspiration, the only other route to optimization is by working out the math on paper for particular cases, and code explicitly the resulting formulas to avoid doing costly exponentiations.

As examples, if N = 2, it is easy to show, that

\displaystyle X_0 = x_0 + x_1, \\ X_1 = x_0 - x_1,

and if N = 4, one can similarly arrive at

\displaystyle X_0 = x_0 + x_1  + x_2 + x_3, \\ X_1 = x_0 - x_2 - i (x_1 - x_3), \\ X_2 = x_0 + x_2 - (x_1+x_3), \\ X_3 = x_0 - x_2 + i (x_1 - x_3).

Adding these particular cases to the python code could be done as follows…

def dft(x, inverse = False, verbose = False) :
    t = time.clock()
    N = len(x)
    X =[0] * N
    inv = -1 if not inverse else 1
    if N == 1 :
        X[0] = x[0]
    elif N == 2 :
        X[0] = x[0] + x[1]
        X[1] = x[0] - x[1]
    elif N == 4 :
        a = x[0] + x[2]
        b = x[0] - x[2]
        c = x[1] + x[3]
        d = x[1] - x[3]
        X[0] = a + c
        X[1] = b + inv * 1j * d
        X[2] = a - c
        X[3] = b - inv * 1j * d
        if inverse :
            X = [k/N for k in X]
    else :
        twiddles = [math.e**(inv*2j*math.pi*k/N) for k in xrange(N//2+1)]
        for k in xrange(N//2+1,N) :
            twiddles += [twiddles[N-k].conjugate()]
        for k in xrange(N) :
            for n in xrange(N) :
                X[k] += x[n] * twiddles[n * k % N]
            if inverse :
                X[k] /= N
    t = time.clock() - t
    if verbose :
        print "Computed","an inverse" if inverse else "a","DFT of size",N,
        print "in",t,"sec."
    return X

Overall, and for this particular cases only, an additional factor of 3 increase in speed is achieved in this way. But this solution is not scalable to arbitrary input sizes, and so wiser approaches are needed to tackle even moderately large inputs of few tens of thousands samples…


Naive Integer Factorization

April 9, 2009

After three posts (1, 2, 3) on calculating prime numbers, it is probably worth putting that knowledge to a more useful task. As we will see in a near future, integer factorization, i.e. breaking down a (composite) number into its prime factors is one such task. In purity, factoring a number n is simply decomposing it as the product of two smaller non-trivial, i.e. different from 1 and n itself, divisors. But by repeatedly factoring the divisors one will eventually come up with a unique set of primes which, when multiplied together, render the original number, or so says the fundamental theorem of arithmetic… The point is, we will consider factorization a synonym of prime decomposition, be it formally correct or not.

There are some very sophisticated methods to factor very large numbers, but they use a lot of extremely complex math, so I doubt they will ever find their way onto this blog. So we are going to be left with the naive, straightforward approach as our only option, although I will try to give it an efficiency boost. What is this naive approach? Trial division, of course: given a number n, we know that its smallest factor will be smaller than the square root of n, so we can simply try and see if any of those numbers divide it. No, I will not try to code that yet… If you have read the entries on determining prime numbers, it should come as no surprise that we really do not need to do trial division by all numbers smaller than the square root of n, but only by the primes within. This is a consequence of the fact that, if a composite number divides n, then each of the prime factors of that composite number will also divide n. According to the prime number theorem the number of primes below x is asymptotic to x / log x. So by limiting our trials to prime numbers we can reduce the number of tests from n1/2 to something around 2 n1/2 / log n. If we rescue the primeListSofE function from the post on the sieve of Erathostenes, a python implementation of naive factorization could look something like this…

from time import clock

def factor(n, verbose = False) :
    """
    Returns all prime factors of n, using trial division by prime
    numbers only. Returns a list of (possibly repeating) prime factors
    """
    t = clock()
    ret =[]
    nn = n
    maxFactor = int(n**0.5)
    primes = primeListSofE(maxFactor, verbose)
    for p in primes :
        while nn % p == 0 :
            nn //= p
            ret += [p]
        if nn == 1 :
            break
    if nn != 1 :
        ret += [nn]
    t = clock() - t
    if verbose :
        print "Calculated factors of",n,"in",t,"sec."
    return ret

While this function will be about as good as we can make it for numbers which are the product of two large prime factors, it will be terribly inefficient for most numbers. Consider, as an extreme example, that we are trying to factor 255 ~ 3.6·1016. We would first calculate all primes up to 1.9·108, a challenging feat in itself with our available tools, only to find out that we only needed the first of those primes, i.e. 2. Taking into account that 50% of all numbers are divisible by 2, 33% are divisible by 3, 20% are divisible by 5… it doesn’t seem wise to disregard the potential time savings. What we can do to profit from this is to do the trial division checks at the same time as we determine the prime numbers, updating the largest prime to test on the fly. This has to be done in two stages, the first while we sieve up to n1/4, the second while we search the rest of the sieve up to n1/2 searching for more primes. The following Franken-code has been written mostly by cut-and-paste from primeListSofE and factor, which hopefully hasn’t affected its readability much:

from time import clock

def factorAndSieve(n, verbose = False) :
    """
    Returns all prime factors of n, using trial division while sieving
    for primes. Returns a list of (possibly repeating) prime factors
    """
    t = clock()
    ret =[]
    nn = n
    while nn % 2 == 0 : # remove 2's first, as 2 is not in sieve
        nn //= 2
        ret += [2]
    maxFactor = int(nn**0.5)
    maxI = (maxFactor-3) // 2
    maxP = int(maxFactor**0.5)
    sieve = [True for j in xrange(maxI+1)]
    i = 0
    for p in xrange(3, maxP+1,2) : # we then sieve as far as needed
        if p > maxP :
            break
        i = (p-3) // 2
        if sieve[i] :
            while nn % p == 0 :
                nn //= p
                ret += [p]
                maxFactor = int(nn**0.5)
                maxI = (maxFactor-3) // 2
                maxP = int(maxFactor**0.5)
            if nn == 1 :
                break
            else :
                i2 = (p*p - 3) // 2
                for k in xrange(i2, maxI+1, p) :
                    sieve[k] = False
    index = i
    for i in xrange(index, maxI+1) : # and inspect the rest of the sieve
        if i > maxI :
            break
        if sieve[i] :
            p = 2*i + 3
            while nn % p == 0 :
                nn //= p
                ret += [p]
                maxFactor = int(nn**0.5)
                maxI = (maxFactor-3) // 2
                maxP = int(maxFactor**0.5)
            if nn == 1 :
                break
    if nn != 1 :
        ret += [nn]
    t = clock() - t
    if verbose :
        print "Calculated factors of",n,"in",t,"sec."
        print "Stopped trial division at",2*i+3,"instead of",int(n**0.5)
    return ret

This new code will very often be much faster than the other one, but at times it will be just about as slow as in the other case, or even slower, since the mixing of both codes introduces some inefficiencies. The most extreme examples of such cases would be a prime number, or the square of a prime number on one side, and a power of 2 on the other one.

The graph above plots times to calculate the factors of numbers between 106 and 106 + 100. Prime numbers in this interval stick out as the red dots among the blue ones: 106 +3, +33, the twin primes +37 and +39, +81 and +99. And numbers with many small prime factors populate the bottom of the red cloud.

If the above graph is not enough to convince you of the benefits of the second approach, maybe this timings for very large numbers will:


>>> factor(10**15+37,True)
Calculated primes to 31622776 in 6.760 sec.
Calculated factors of 1000000000000037 in 8.466 sec.
[1000000000000037L]
>>> factorAndSieve(10**15+37,True)
Calculated factors of 1000000000000037 in 8.666 sec.
Stopped trial division at 31622775 instead of 31622776
[1000000000000037L]

>>> factor(2**55,True)
Calculated primes to 189812531 in 42.811 sec.
Calculated factors of 36028797018963968 in 43.261 sec.
[2, ..., 2]
>>> factorAndSieve(2**55,True)
Calculated factors of 36028797018963968 in 8.632e-05 sec.
Stopped trial division at 3 instead of 189812531
[2, ..., 2]


Prime Numbers 3. Wheel Factorization

March 24, 2009

In an earlier post, I described the sieve of Eratosthenes, a very fast algorithm to calculate all primes below a given limit, n. At least when programming it in python, speed is not an issue at all with this algorithm, because we start having memory problems much earlier, due to the fact that we need to keep in memory a list of size n…

This is specially problematic with python, because although all we need is a list of booleans (is prime / is not prime), which could fit in a bit each, I’m afraid python assigns several bytes to each. How many of them is probably machine dependent, but I’m almost certain it is at least 4 of them. So it is very likely that we could improve memory usage by a factor of 32, maybe more, if we could get around this. There are not-so-standard libraries that do just that, but I’ve sworn not to use them, and one day I may expose my failed attempts at using very long ints as bit arrays to the world, but python is definitely not the best language to mess around with low level optimizations, so I’m going to leave this line of thinking aside for the time being.

I also presented a very naive optimization, which simply discards all even numbers, which we already know are not prime, and thus cuts in half the memory requirements. More formally, what was being done there was to partition all numbers into two groups: those of the form 2·n, and those of the form 2·n+1, and then discard the former, since no prime of that form exists, and keep the latter only. This line of thinking is much more promising, as it leads to a very neat way of further reducing the memory required, known as wheel factorization.

For some odd reason that escapes my shallow understanding, the standard description of this algorithm places much emphasis on setting all numbers in circles, and then eliminating what it calls spokes. I personally don’t see any benefit of such a mental image, specially since the number are going to end up arranged in a rectangular array. So for the sake of clarity, I will not honor tradition, and spare you the circles and the spokes.

We begin by choosing a first few prime numbers. Say the first three: 2, 3 and 5. We multiply them together to get M = 30. We are now going to partition the set of all numbers into M groups, of the form M·m + ki, where the ki are the numbers 1 through M, both inclusive. Most of these subsets happen to hold no prime number, and thus need not be considered at all. To find out which of them, remember that M is the product of several prime numbers, M = p1·p2 … pn. Therefore, if ki is a multiple of any of those prime numbers, i.e. ki = pj·k’i, we can always extract the common factor pj from both summands in M·m + ki, which means that all numbers in that subset are composite. These subsets, which we discard, are very closely related to the spokes I was telling you about a while ago.

If M = 30, of the 30 potential ki we are left with just 8 (1, 7, 11, 13, 17, 19, 23 and 29), so we have cut memory required to about 27% of the straightforward implementation of the sieve of Eratosthenes. If we use the first 4 primes, we then have M = 210, and of the 210 potential ki we only need to keep 48. This improves memory use, which now is roughly 23% of the original. But you can probably see a law of diminishing returns at wortk here: the improvement in memory use gets smaller and smaller with every prime you add to the calculation of M. You can find some more data regarding this at the end of this link.

Of course we are still going to have to sieve each of our subsets, in much the same way we did with the sieve of Eratothenes. But sieving all multiples of a prime p when you have several subsets to consider is anything but trivial. Lets see how to tackle it…

The question to answer is, what integer values of m fulfill the equation M·m + ki = p·n? Using modular arithmetic, this equation is equivalent to finding M·m = -ki (mod p). This requires calculating M-1 (mod p), the modular multiplicative inverse of M modulo p, a subject already treated in a previous post. Given that, we can simply write mi = M-1·(-ki) (mod p), and therefore determine all multiples of p in the subset of ki as m = p·j + mi.

Last, but not least, we want to start sieving at the smallest multiple of p in each subset that is larger than or equal to p2, and so we want to increment mi with p·ji, where ji ≥ (p2-ki– M·mi) / (m·p).

Lets try to put all this ideas together in a python script. To make it work, you will need the primeListSofE function from this post, as well as the extEuclideanAlg and modInvEuclid ones from this other post

def primeListWF(n, m, verbose = True) :
    """Returns a list of prime numbers smaller than or equal to n,
    using wheel factorization, with a wheel size of the product of
    all primes smaller than or equal to m
    """
    t = time()
    # We use the standard sieve of Eratosthenes first...
    primes = primeListSofE(m)
    M = 1
    for p in primes :
        M *= p
    if verbose :
        print "Size of factorization wheel is",M
    # We now sieve out all multiples of these primes,including
    # the primes themselves, to get the k[i]
    sieve = [True] * M
    for p in primes :
        sieve[p-1] = False
        for j in xrange(p*p,M+1,p) :
            sieve[j-1] = False
    k = [j+1 for j in range(M) if sieve[j]]
    if verbose :
        print "Memory use is only",len(k),"/",M,"=",len(k)/M
    N = n
    if N %M :
        N += M
    N //= M
    maxP = int(sqrt(M*N))
    if verbose:
        print "Primes will be calculated up to",M*N,"not",n

    sieve = [[True]*N for j in range(len(k))]
    sieve[0][0] = False # Take care of the non primality of 1
    for row in range(N) :
        baseVal = M * row
        for subset in range(len(k)) :
            if sieve[subset][row] :
                p = baseVal + k[subset]
                primes += [p]
                if p > maxP :
                    continue
                # Sieve all multiples of p...
                invMp = modInvEuclid(M,p)
                for i in range(len(k)) :
                    mi = invMp * (-k[i] % p)
                    mi %= p
                    # ...starting at p**2
                    ji = (p*p - k[i] - M*mi)
                    if ji % (M*p) :
                        ji += M*p
                    ji //= M*p
                    mi+= ji*p

                    for r in range(mi,N,p) :
                        sieve[j][r] = False
    if verbose :
        print "Calculated primes to",n,"in",time()-t,"sec."
        
    return primes

You can play around with this algorithm and look at the time required. It probably depends on the value of n, but there seems to be a sweet spot for speed somewhere around m = 7 (and thus M = 210) or m = 11 (M = 2310). This being a general purpose wheel factorization algorithm, it has speeds that are faster than the unoptimized sieve of Eratosthenes code, but it cannot beat the optimized code, which in fact is a particular case of wheel factorization, for m = M = 2.

On the other hand, using m = 7 I have managed to compute all primes up to 108, a feat unreachable for the previous code, at least without finding a workaround the MemoryError they produce. 109 is still out of reach, though…


Modular Multiplicative Inverse

March 20, 2009

Now that we are all proficient with modular exponentiation thanks to my previous post, it is time to tackle a more complicated issue, still in the realm of modular arithmetic.
While addition, subtraction and multiplication are “compatible with the congruence relation” introduced by modular arithmetic, the same doesn’t happen with division. This means that solving a simple equation such as a·x = 1 (mod m) is anything but trivial. In fact, determining x = a-1 (mod m) that fulfills that equation is the whole object of this post…

There is, as always, the brute force approach, which I always like to visit first, so that I can later congratulate myself on how much the efficiency has improved with my deep insight. In this particular case, brute force means trying every number between 1 and m-1, and see if the m-modulo of its product with a is 1. :

def modInv1(a,m) :
    """
    Computes the modular multiplicative inverse of a modulo m,
    using brute force
    """
    a %= m
    for x in range(1,m) :
        if a*x%m == 1 :
        return x
    return None

Do notice that the possibility of no multiplicative inverse existing is contemplated in the code. This will actually be the case if a and m have any common factors, i.e. if they are not coprime. So the previous code could serve as a moron’s approach to determining if two numbers are relatively prime. Not that I recommend it, though, as it should be clear that the code I have just produced will be very inefficient for large values of m.

The Extended Euclidean Algorithm
As stated, we are after a number x that fulfills a·x = 1 (mod m). This can of course be written as well as a·x = 1 + m·y, which rearranges neatly into a·x – m·y = 1. Since x and y need not be positive, we can write it as well in the standard form, a·x + m·y = 1. If a and m are coprime, this is a particular instance of a·x + b·y = gcd(a,b), where gcd(a,b) is the greatest common divisor of a and b. This equation is known as Bézout’s identity, and can be efficiently solved using the extended Euclidean algorithm. To find a solution to our particular equation, this algorithm would proceed as follows:

  1. Rewrite the equation as an·xn + mn·yn = 1, starting with n = 0
  2. Compute the quotient, qn, and the remainder, rn, of an divided by mn, and rewrite the original equation as (mn·qn + rn)·xn + mn·yn =1
  3. Rearrange the terms in the equation to get mn·(qn·xn+yn) + rn·xn = 1
  4. Do the changes of variables
    • xn+1 = qn·xn+yn,
    • yn+1 = xn,
    • an+1 = mn, and
    • mn+1 = rn
  5. Repeat the process, until an = 1, and mn = 0.
  6. Such equations has the trivial solution xn = 1, yn = 0.
  7. Work your way back to x0 and y0, noting that
    • xn-1 = yn,
    • yn-1 = xn – qn-1·yn
  8. x0 is the number we are after.

There is a certain component of faith in believing that, eventually, an = 1 and mn = 0. It is in fact not always the case, as if a and m are not coprime, then the values we will arrive at are mn = 0, but an = gcd(a,m). As a matter of fact, the gcd is normally computed using the Euclidean algorithm, dropping the adjective extended, which basically does the same as above, but without the back-tracking of the xn‘s and the yn‘s. With all of this in mind, we can write two functions, one calling the other, as follows:

def extEuclideanAlg(a, b) :
    """
    Computes a solution  to a x + b y = gcd(a,b), as well as gcd(a,b)
    """
    if b == 0 :
        return 1,0,a
    else :
        x, y, gcd = extEuclideanAlg(b, a % b)
        return y, x - y * (a // b),gcd

def modInvEuclid(a,m) :
    """
    Computes the modular multiplicative inverse of a modulo m,
    using the extended Euclidean algorithm
    """
    x,y,gcd = extEuclideanAlg(a,m)
    if gcd == 1 :
        return x % m
    else :
        return None

The gains in speed are tremendous with this new approach. Again, write your own testing function as an exercise, but mine spits out the following results…


>>> testModInv(1234,5,10000)
Got 4 using method 1 in 0.0160000324249 sec.
Got 4 using Euclidean algorithm in 0.0309998989105 sec.
>>> testModInv(1234,56789,10000)
Got 31800 using method 1 in 59.4000101089 sec.
Got 31800 using Euclidean algorithm in 0.047000169754 sec.

So while brute force is good enough, or even the best choice, for very small values of m, as soon as it grows, the benefits of the more elaborate algorithm are all too obvious. According to wikipedia, the algorithm we have coded has complexity O(log(m)2), i.e. time required to compute the modular multiplicative inverse is proportional to log(m)2.


Modular Exponentiation

March 20, 2009

I have a post cooking on wheel factorization, yet another sieving algorithm for the computation of all primes below a given number n. It turns out that while implementing it, one comes across a number of not so obvious preliminary questions. It is a rule of this blog not to resort to any non standard library, and I feel it would be breaking the spirit of that rule to post code which hasn’t been thoroughly explained in advance. So in order not to make that soon-to-be post on prime numbers longer than anyone could bear, I find myself in need of covering a couple of preliminaries, and so here comes this thrilling post on modular exponentiation

What we are after is determining ab (mod m), or to put it in plain English, the remainder of a to the power of b, divided by m. Doesn’t seem much, does it? We could just go ahead and write:

def modExp1(a, b, m) :
    """
    Computes a to the power b, modulo m, directly
    """
    return a**b % m

The problems begin if a and b are both big. Python’s integers don’t overflow, and we all love it for it, but the memory juggling it has to do to accommodate a huge number in memory does come at a speed cost. Specially if m is small, it definitely seems a little overkill to keep a zillion digit number in memory, given that the answer will be smaller than m…

We can, and will, use modular arithmetic to our advantage. See, wikipedia says that “the congruence relation (introduced by modular arithmetic) on the integers is compatible with the multiplication operation of the ring of integers,” and if it’s on the internet it has to be true, right? Translated to plain English, the idea is that if we want to take the product of two numbers, a and b, modulo m, we can first multiply them, and then take the m-modulo of the result, or we can take the m-modulo of a and/or b, prior to multiplication, and the m-modulo of the result will still be the right answer. So the least a conscious programmer should do is to rewrite the previous function as:

def modExp2(a, b, m) :
    """
    Computes a to the power b, modulo m, a little less directly
    """
    return (a%m)**b % m

Of course, if b is large enough, you may still find yourself in the realm of zillion digit numbers, which is a place we don’t really want to go. It turns out then that it may be a good option to remember that exponentiation is just repeated multiplication. If we multiply a by itself b times, and take the m-modulo of the result after every multiplication, the larger number we will ever have in memory will be m2. It doesn’t really seem right to put a loop of b iterations into the code, since what we are worried about are large b’s, but lets do it anyway:

def modExp3(a, b, m) :
    """
    Computes a to the power b, modulo m, by iterative multiplication
    """
    ret = 1
    a %= m
    while b :
        ret *= a
        ret %= m
        b -= 1
    return ret

The only reason why this last function deserves a place here, is because it helps introduce the really cool algorithm that this whole post really is about, binary exponentiation. In John D. Cook’s blog there is a great explanation on how it could be implemented iteratively, based on the binary representation of the exponent, but I’ll go for a recursive version, which is essentially the same, but which I find easier for my brain’s wiring…

So what we are about to do is, when asked to compute ab, check if b is even, in which case we will compute it as the square of ab/2. If b is odd, we will compute it as a·ab-1. If b is 0, we will of course return 1. If we also throw in all the modulo m stuff I have been discussing previously, we finally arrive at THE function to do modular exponentiation:

def modExp(a, b, m) :
    """
    Computes a to the power b, modulo m, using binary exponentiation
    """
    a %= m
    ret = None
    if b == 0 :
        ret = 1
    elif b%2 :
        ret = a * modExp(a,b-1,m)
    else :
        ret = modExp(a,b//2,m)
        ret *= ret
    return ret%m

Now that we have all four versions of the same function written, we can do a little testing, I will leave writing the testing code as an exercise for the reader (I’ve been wanting to say this since my first year of university…), but when computing 1234 to the power 5678, modulo 90, a 1000 times with each function, these is the performance I get:


>>> testModExp(1234,5678,90,1000)
Got 46 using method 1 in 3.73399996758 sec.
Got 46 using method 2 in 0.546999931335 sec.
Got 46 using method 3 in 1.25 sec.
Got 46 using THE method in 0.0150001049042 sec.

Results which really don’t require much commenting…

EDIT: There is some very interesting information in the comments: python comes with a built-in pow function, which can take two or three arguments, the third being, precisely, the modulo. It probably implements an algorithm similar to what has been presented, but is more than an order of magnitude faster.