Mostly Technical: 2008

I've been trying to write an efficient prime number generator. So far the best I've been able to do is about 2.3 seconds for the prime numbers less than one million. Using a normal sieve, it is possible for a C++ program to generate the primes in less than 0.05 seconds. So my goal is to get it to 0.5 seconds without having to resort to funky unsafe operations or basically writing the C solution in Haskell.

Here is the current code:

import Data.List
import Prime.Queue

type Prime = Int
type PrimeCache = ([Prime], Queue Prime)

data Primes = Primes !(Prime, PrimeCache) Primes

primeList :: [Prime] 
primeList = 2:3:(nextPrimes (3, ([2], enq newQueue 3)))
    where 
      p2l (Primes (p, _) next) = p:(p2l next)

nextPrimes :: (Prime, PrimeCache) -> [Prime]
nextPrimes (p, cache) = newP:(nextPrimes h)
    where
      h@(newP, _) = sweep (fst cache, enq (snd cache) p) (p+2)

sweep c n | not $ any ((== 0).(mod n)) (fst shortList) = (n, shortList)
          | otherwise                     = sweep shortList (n+2) -- once odd, always odd
    where
      shortList = extendedList (limit n) c 

extendedList :: Int -> PrimeCache -> PrimeCache
extendedList limit (l@(x:xs), q) | x>=limit || empty q
                                       = (l, q)
                                   | otherwise 
                                       = extendedList limit (min:l, newSet)
          where
            (min, newSet) = deq q 

limit n = floor $ sqrt (fromIntegral n :: Float)

Yes, it's not pretty. All it does is keep a list of prime numbers already found and see if the candidate is a factor of any of them.

It implements two basic optimization:

After 2, there are no more even primes, so starting at 3 it adds 2 to the next number to test for primeness.
It doesn't have to check all previously found prime numbers, only prime numbers that are < sqrt(p).

It uses the handy Queue code from Eric Kidd. So before any optimization is really done other than the simplest algorithmic optimizations, we'll benchmark it.

The profiling has:

 Sun Jun  8 00:14 2008 Time and Allocation Profiling Report  (Final)

    10 +RTS -P -s10.stats -RTS

 total time  =        3.80 secs   (190 ticks @ 20 ms)
 total alloc = 532,086,816 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc  ticks     bytes

sweep                          Prime.PrimeList3      82.6   77.3    157 102850044
limit                          Prime.PrimeList3      15.8   18.4     30  24537068
nextPrimes                     Prime.PrimeList3       0.5    1.1      1   1491443
main                           Main                   0.5    1.4      1   1851678
extendedList                   Prime.PrimeList3       0.0    1.1      0   1502535


                                                                                               individual    inherited
COST CENTRE              MODULE                                               no.    entries  %time %alloc   %time %alloc  ticks     bytes

MAIN                     MAIN                                                   1           0   0.0    0.0   100.0  100.0      0       152
 main                    Main                                                 174           1   0.0    0.0     0.0    0.0      0      8792
 CAF                     Main                                                 168           3   0.0    0.0     0.5    1.4      0        24
  main                   Main                                                 175           0   0.5    1.4     0.5    1.4      1   7397920
 CAF                     GHC.Num                                              156           1   0.0    0.0     0.0    0.0      0       104
 CAF                     GHC.Handle                                           120           4   0.0    0.0     0.0    0.0      0      8620
 CAF                     Prime.PrimeList3                                      95           4   0.0    0.0    99.5   98.6      0        32
  primeList              Prime.PrimeList3                                     176           1   0.0    0.0    99.5   98.6      0        12
   enq                   Prime.Queue                                          183           1   0.0    0.0     0.0    0.0      0        24
   nextPrimes            Prime.PrimeList3                                     177      156994   0.5    1.1    99.5   98.6      1   5965772
    sweep                Prime.PrimeList3                                     179      500000  82.6   77.3    98.4   96.9    157 411400176
     extendedList        Prime.PrimeList3                                     181      500169   0.0    1.1     0.0    1.1      0   6010140
      deq                Prime.Queue                                          184         169   0.0    0.0     0.0    0.0      0     13856
      empty              Prime.Queue                                          182         169   0.0    0.0     0.0    0.0      0         0
     limit               Prime.PrimeList3                                     180      500000  15.8   18.4    15.8   18.4     30  98148272
    enq                  Prime.Queue                                          178       78207   0.5    0.6     0.5    0.6      1   3132920

Not surprisingly, the function that does a square root and type conversion is rather expensive in time as well as space (15.8% of the time and 18.4% in space). The biggest factor though is the sweep function. Nothing jumps out as a problem in sweep, but we can fix the limit problem by doing the comparison. Instead of x>sqrt(y), we can do (x*x)>y:

extendedList :: Int -> PrimeCache -> PrimeCache
extendedList limit (l@(x:_), q) | (x*x)>=limit || empty q  -- n to the square rather then x to root(n)
                                       = (l, q)
                                   | otherwise 
                                       = extendedList limit (min:l, newSet)
          where
            (min, newSet) = deq q 

limit n=n

Now for the profile:

 Sun Jun  8 00:43 2008 Time and Allocation Profiling Report  (Final)

    10 +RTS -P -s10.stats -RTS

 total time  =        3.36 secs   (168 ticks @ 20 ms)
 total alloc = 434,489,520 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc  ticks     bytes

sweep                          Prime.PrimeList3      98.8   94.8    166 102989282
main                           Main                   0.6    1.7      1   1851678
nextPrimes                     Prime.PrimeList3       0.0    1.4      0   1491443
extendedList                   Prime.PrimeList3       0.0    1.4      0   1502535


                                                                                               individual    inherited
COST CENTRE              MODULE                                               no.    entries  %time %alloc   %time %alloc  ticks     bytes

MAIN                     MAIN                                                   1           0   0.0    0.0   100.0  100.0      0       152
 main                    Main                                                 174           1   0.0    0.0     0.0    0.0      0      8792
 CAF                     Main                                                 168           3   0.0    0.0     0.6    1.7      0        24
  main                   Main                                                 175           0   0.6    1.7     0.6    1.7      1   7397920
 CAF                     GHC.Num                                              156           1   0.0    0.0     0.0    0.0      0       104
 CAF                     GHC.Handle                                           120           4   0.0    0.0     0.0    0.0      0      8620
 CAF                     Prime.PrimeList3                                      95           4   0.0    0.0    99.4   98.3      0        32
  primeList              Prime.PrimeList3                                     176           1   0.0    0.0    99.4   98.3      0        12
   enq                   Prime.Queue                                          182           1   0.0    0.0     0.0    0.0      0        24
   nextPrimes            Prime.PrimeList3                                     177      156994   0.0    1.4    99.4   98.3      0   5965772
    sweep                Prime.PrimeList3                                     179      500000  98.8   94.8    98.8   96.2    166 411957128
     extendedList        Prime.PrimeList3                                     180      500169   0.0    1.4     0.0    1.4      0   6010140
      deq                Prime.Queue                                          183         169   0.0    0.0     0.0    0.0      0     11432
      empty              Prime.Queue                                          181         169   0.0    0.0     0.0    0.0      0         0
    enq                  Prime.Queue                                          178       78059   0.6    0.7     0.6    0.7      1   3129368

The execution time (without profiling) dropped to about 1.9 seconds, or 82% the time.

Not bad, but now sweep is where all the time is being spent and all the memory is being allocated. The nastiest part of the function is the test for factors, splitting it out:

sweep c n | hasFactor n (fst shortList) = (n, shortList)
          | otherwise                     = sweep shortList (n+2) -- once odd, always odd
    where
      shortList = extendedList (limit n) c 

hasFactor n = not.(any ((== 0).(mod n)))

And running that, we find (as a diff):

-sweep                          Prime.PrimeList3      98.8   94.8    166 102989282
-main                           Main                   0.6    1.7      1   1851678
+hasFactor                      Prime.PrimeList3      95.8   91.9    159  99832288
+sweep                          Prime.PrimeList3       2.4    2.9      4   3156994
+main                           Main                   1.2    1.7      2   1851678

Most of the time and allocation is done in the fairly simple hasFactor function. Here's a guess, most of the time, the small numbers are the factors, but the list is in reverse order. So put the list in the other order:

extendedList :: Int -> PrimeCache -> PrimeCache
extendedList limit (l, q) | (x*x)>=limit || empty q
                              = (l, q)
                          | otherwise 
                              = extendedList limit (l++[min], newSet)
          where
            (min, newSet) = deq q 
            x = last l

And we now have a running time of 0.9 seconds, the profile diff looks like:

<       total time  =        3.62 secs   (181 ticks @ 20 ms)
<       total alloc = 438,489,520 bytes  (excludes profiling overheads)
---
>       total time  =        1.44 secs   (72 ticks @ 20 ms)
>       total alloc = 154,277,996 bytes  (excludes profiling overheads)
10,14c10,15
< hasNoFactor                    Prime.PrimeList3      96.7   91.1    175  99832288
< nextPrimes                     Prime.PrimeList3       2.2    1.4      4   1491443
< sweep                          Prime.PrimeList3       0.6    3.8      1   4156994
< main                           Main                   0.6    1.7      1   1851678
< extendedList                   Prime.PrimeList3       0.0    1.4      0   1502535
---
> hasNoFactor                    Prime.PrimeList3      56.9   74.4     41  28678852
> extendedList                   Prime.PrimeList3      36.1    4.2     26   1603090
> main                           Main                   4.2    4.8      3   1851678
> nextPrimes                     Prime.PrimeList3       2.8    3.9      2   1491443
> enq                            Prime.Queue            0.0    2.0      0    782348
> sweep                          Prime.PrimeList3       0.0   10.8      0   4156994

Figuring out if a number has a factor is now only using half the time, but it's still doing the bulk of the allocations. I'm not sure anything more can be done to speed it up, but on a guess, perhaps it's the partial functions and composition which is causing all the allocations. Lets remove them:


hasNoFactor n = any (\ x -> 0==(n `mod` x))

This does result in an improvement. The run time for the -O2 (no profiling) version takes a small amount under 0.9 seconds. But the amount of allocation is way down.

<       total time  =        1.44 secs   (72 ticks @ 20 ms)
<       total alloc = 154,277,996 bytes  (excludes profiling overheads)
---
<       total time  =        1.18 secs   (59 ticks @ 20 ms)
<       total alloc =  43,562,588 bytes  (excludes profiling overheads)
10,15c10,15
< hasNoFactor                    Prime.PrimeList3      56.9   74.4     41  28678852
< extendedList                   Prime.PrimeList3      36.1    4.2     26   1603090
< main                           Main                   4.2    4.8      3   1851678
< nextPrimes                     Prime.PrimeList3       2.8    3.9      2   1491443
< enq                            Prime.Queue            0.0    2.0      0    782348
< sweep                          Prime.PrimeList3       0.0   10.8      0   4156994
---
< hasNoFactor                    Prime.PrimeList3      71.2    9.2     42   1000000
< extendedList                   Prime.PrimeList3      27.1   14.7     16   1603090
< nextPrimes                     Prime.PrimeList3       1.7   13.7      1   1491443
< enq                            Prime.Queue            0.0    7.2      0    782348
< sweep                          Prime.PrimeList3       0.0   38.2      0   4156994
< main                           Main                   0.0   17.0      0   1851678

I don't think we're going to squeeze any more speed out of the factoring part, but 27% of the time is spent extending the list. The likely culprit is the last function being run all the time. So lets keep track of the last number in that list, and while we're at it, keep the squared number instead of the original since it's squared:

import Data.List
import qualified Data.IntSet as IntSet
import Prime.Queue

type Prime = Int
type PrimeCache = (([Prime], Int), Queue Prime)

data Primes = Primes (Prime, PrimeCache) Primes

primeList :: [Prime] 
primeList = 2:3:(nextPrimes (3, (([2], 4), enq newQueue 3)))
    where 
      p2l (Primes (p, _) next) = p:(p2l next)

nextPrimes :: (Prime, PrimeCache) -> [Prime]
nextPrimes (p, cache) = newP:(nextPrimes h)
    where
      h@(newP, _) = sweep (fst cache, enq (snd cache) p) (p+2)

sweep c n | hasNoFactor n (fst $ fst shortList) = sweep shortList (n+2) -- once odd, always odd
          | otherwise                     = (n, shortList)
    where
      shortList = extendedList n c 

hasNoFactor n = any ((== 0).(mod n))

extendedList :: Int -> PrimeCache -> PrimeCache
extendedList limit (l, q) | x>=limit || empty q
                              = (l, q)
                          | otherwise 
                              = extendedList limit (((fst l)++[min], min*min), newSet)
          where
            (min, newSet) = deq q 
            x = snd l

Amazingly, this change brought the running time down to 0.6 seconds. The profile is now:

 total time  =        1.10 secs   (55 ticks @ 20 ms)
 total alloc = 164,277,996 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc  ticks     bytes

hasNoFactor                    Prime.PrimeList3      92.7   69.8     51  28678852
sweep                          Prime.PrimeList3       3.6   16.2      2   6656994
enq                            Prime.Queue            1.8    1.9      1    782348
nextPrimes                     Prime.PrimeList3       1.8    3.6      1   1491443
extendedList                   Prime.PrimeList3       0.0    3.9      0   1603090
main                           Main                   0.0    4.5      0   1851678


                                                                                               individual    inherited
COST CENTRE              MODULE                                               no.    entries  %time %alloc   %time %alloc  ticks     bytes

MAIN                     MAIN                                                   1           0   0.0    0.0   100.0  100.0      0       152
 main                    Main                                                 174           1   0.0    0.0     0.0    0.0      0      8792
 CAF                     Main                                                 168           3   0.0    0.0     0.0    4.5      0        24
  main                   Main                                                 175           0   0.0    4.5     0.0    4.5      0   7397920
 CAF                     GHC.Num                                              156           1   0.0    0.0     0.0    0.0      0       104
 CAF                     GHC.Handle                                           120           4   0.0    0.0     0.0    0.0      0      8620
 CAF                     Prime.PrimeList3                                      95           4   0.0    0.0   100.0   95.5      0        32
  primeList              Prime.PrimeList3                                     176           1   0.0    0.0   100.0   95.5      0        12
   enq                   Prime.Queue                                          182           1   0.0    0.0     0.0    0.0      0        24
   nextPrimes            Prime.PrimeList3                                     177      156994   1.8    3.6   100.0   95.5      1   5965772
    sweep                Prime.PrimeList3                                     179      500000   3.6   16.2    96.4   89.9      2  26627976
     hasNoFactor         Prime.PrimeList3                                     184      500000  92.7   69.8    92.7   69.8     51 114715408
     extendedList        Prime.PrimeList3                                     180      500169   0.0    3.9     0.0    3.9      0   6412360
      deq                Prime.Queue                                          183         169   0.0    0.0     0.0    0.0      0     11432
      empty              Prime.Queue                                          181         169   0.0    0.0     0.0    0.0      0         0
    enq                  Prime.Queue                                          178       78059   1.8    1.9     1.8    1.9      1   3129368

At 0.6 seconds to find all prime numbers under 1E6, I'm happy to stop, especially since the C++ version cannot create an infinite list of prime numbers.

Mostly Technical

2008-06-07

Prime Number Profiling

2008-05-02

Foreign Function Interface in Haskell

Blog Archive

Followers