Installing libraries for use with LispCabinet

For Microsoft Windows, third-party libraries are often provided as precompiled DLL files in 32-bit and 64-bit flavours. Normally, when an executable requires to be linked with a DLL, Windows will search for it in a number of standard places so putting a copy of the DLL in one of these, typically c:\windows\system32, is the usual thing to do.

However, there appear to be a couple of caveats when using LispCabinet (the easiest way to get Common Lisp on the Windows platform). First of all, LispCabinet seems to require 32-bit libraries even on 64-bit Windows. Secondly, if you put the library in one of the standard places it won’t be found by Lisp when you install the quicklisp package that uses the library. To install e.g. the SDL library for use with the lispbuilder-sdl bindings, you therefore have to download the 32-bit version of the DLL and copy it into the LispCabinetHome directory in your user directory (not the main LispCabinet install directory).

If you use the 64-bit DLL instead of the 32-bit DLL by mistake, or don’t put the DLL where Lisp can find it, then when quicklisp attempts to compile the package that uses the DLL it will complain that it can’t find it (with a numerical reason code that doesn’t appear to be documented anywhere!) and drops into the debugger. The debugger presents several restart options, including retrying loading the DLL, so once you’ve fixed the problem you can resume from where you left off.

Posted in Computing, Lisp | Leave a comment

Using OpenGL with Common Lisp

Edit: After posting this article to Reddit I received comments that my OpenGL is stone age and doesn’t reflect modern practice in that uses the slow, inefficient API (“direct mode”) instead of drawing objects using vertex buffers, and doesn’t make use of programmable shaders. Guilty as charged, although I would say that OpenGL 1 still offers the least painful way of getting something to display if you’re not worried about performance. I may post a more “modern” version in the future. In the meantime, the information below offers the path of least resistance, with the above caveats.

OpenGL is an API for three-dimensional computer graphics rendering. Its main advantages over other 3D graphics APIs are that it is cross-platform, an industry standard and published freely (i.e. you don’t need to pay to peek at the documentation!). Another API typically used in conjunction with OpenGL, called GLU (GL Utility library), provides a number of convenience functions for OpenGL. While there are no standard Common Lisp language bindings, the OpenGL and GLU APIs can be used with Common Lisp through a CFFI (C Foreign Function Interface).

Platform-specific things such as managing windows, mouse and keyboard events etc. are not within the scope of OpenGL and must be handled separately. An early library called the GLUT was developed for the purpose of providing a window for OpenGL rendering and mouse and keyboard handling. Nowadays GLFW nowadays provides a more modern alternative. Alternatively, other libraries such as for games and graphical user interface libraries, such as SDL and Qt, provide facilities for OpenGL.

Using OpenGL with Common Lisp

To use OpenGL with Common Lisp, you need the following bits and bobs.

  1. An OpenGL library and drivers installed on your computer. Most desktop computers nowadays ship with an OpenGL C library and 3D graphics hardware drivers already installed. While these are usually proprietary, Mesa provides an open source alternative (but is a software renderer only).
  2. A GLU library if you want to use GLU functions. Most OpenGL library packages also throw in a GLU library as well so if you already have OpenGL installed you probably won’t need to worry about it.
  3. A library that will provide OpenGL windows, events etc., installed where your Common Lisp will find it. (See my blog post for caveats when installing libraries for LispCabinet on Windows.)
  4. Common Lisp bindings for the above.

CLiki lists a number of Common Lisp packages related to OpenGL.

For interfacing to OpenGL and GLU, I use cl-opengl for no other reason than that it was the first I tried and it works. Installation via quicklisp on Mac OS X, Microsoft Windows 7 and Linux is straightforward: just (ql:quickload :cl-opengl) and (ql:quickload :cl-glu) at the REPL. Once they’re installed, with SBCL you can just put (require 'cl-opengl) and (require 'cl-glu) at the top of any source file that uses them and they will be loaded automagically. (As of the time of writing, April 2013, CLISP under Mac OS X 10.8 won’t work because the CFFI library it uses is broken on that platform.)

Interfacing OpenGL with the platform (window system, mouse, keyboard etc.) is where you have the most choice and it partly depends on what you want to do with OpenGL. For a lightweight library that just provides an OpenGL window and input, GLFW appears to fit the bill and has a set of Common Lisp CFFI bindings provided by cl-glfw. While simple to use, I found that under both Linux and Mac OS X, the window didn’t double buffer so there was an annoying flicker when refreshing the graphics. (I’ve not tried it on Windows.)

I therefore instead used the Simple Directmedia Layer (SDL) cross-platform multimedia library and the LispBuilder SDL Common Lisp bindings. SDL is cross-platform, seems well-supported and has good documentation.

A problem with Common Lisp CFFI binding libraries in general is a lack of documentation. This is understandable to some extent – there’s usually documentation and tutorials for the original library (usually in C) and the bulk of the Common Lisp bindings are generated from the C API automatically. However, there’s often not much documentation on how to use the Common Lisp bindings and it seems that more often that not, you’re supposed to work out the mapping convention for function names etc. (and what functions/features have been left out) yourself! Some packages thoughtfully provide examples and limited docs. With the quicklisp package management system, look in the appropriate directory under quicklisp/dists/quicklisp/software below the quicklisp directory itself. Lispbuilder-SDL stands out from the crowd in that it provides installation and download instructions, a user guide, an API reference and a sample code snippet on its web site. Kudos!

Installing Lispbuilder-SDL is explained in its documentation, except that it assumes the use of ASDF rather than quicklisp. There’s a minor hiccup when installing lispbuilder-sdl on Mac OS X. If you download it with quicklisp, there will be an error during compilation because it requires a ‘cocoahelper’ library to be compiled. To fix this problem, go into the lispbuilder-sdl/cocoahelper directory under the appropriate quicklisp directory (e.g. quicklisp/software/lispbuilder-XXXX-svn/lispbuilder-sdl/cocoahelper), type make to build and install the helper, then retry.

A simple framework

The following code is a very simple skeleton program that uses the SDL library to open a window for rendering OpenGL, then draws an image which is updated 20 times a second.

(require 'cl-opengl)
(require 'cl-glu)
(require 'lispbuilder-sdl)

(defun main-loop ()
  (sdl:with-init ()
    (sdl:window 800 800
        :title-caption "OpenGL Test"
        :opengl t
        :opengl-attributes '((:sdl-gl-doublebuffer 1)
                             (:sdl-gl-depth-size 16)))
    (setf (sdl:frame-rate) 20)

    (init-gl)
    (draw-frame)
    (sdl:update-display)

    (sdl:with-events ()
      (:quit-event () t)
      (:video-expose-event ()
        (sdl:update-display))
      (:idle ()
        (draw-frame)
        (sdl:update-display)))))

Using SBCL, if the cl-opengl and Lispbuilder-SDL libraries have been installed using ASDF or quicklisp, they can be pulled by code that requires them using the require expressions at lines 1–3.

(require 'cl-opengl)
(require 'cl-glu)
(require 'lispbuilder-sdl)

When opening a window, you can specify it to be used for OpenGL rendering using :opengl T and specify a list of OpenGL attributes. Here, we enable double buffering and set the depth buffer to 16-bits.

    (sdl:window 800 800
        :title-caption "OpenGL Test"
        :opengl t
        :opengl-attributes '((:sdl-gl-doublebuffer 1)
                             (:sdl-gl-depth-size 16)))

The following code then carries out OpenGL initialisation (set up the viewing parameters, lighting, culling, modelview matrix etc.) by evaluating (init-gl) and draws a frame with (draw-frame). The function sdl:update-display refreshes the window, swapping the front and back rendering buffers.

    (init-gl)
    (draw-frame)
    (sdl:update-display)

Lines 18–24 form the main event loop. The program enters the loop, waits for events (user closes the window, the window is ‘exposed’, mouse movement, key press etc.) and then takes the action specified in the program. :idle event specifies what action to take if no other events are pending. If we set the frame rate (line 12 of the program (setf (sdl:frame-rate) 20)), it results in the :idle handling code being called that many times per second. Here we simply draw the image again (which may have been animated).

    (sdl:with-events ()
      (:quit-event () t)
      (:video-expose-event ()
        (sdl:update-display))
      (:idle ()
        (draw-frame)
        (sdl:update-display)))))

That’s the basic framework of an SDL program. The rest is the OpenGL rendering code and the program logic.

Sample OpenGL Program

ogl-cubeIn response to a request in the comments, I’ve added some OpenGL bones to the framework as a simple demo that draws a spinning cube. The program is presented below.

(require 'cl-opengl)
(require 'cl-glu)
(require 'lispbuilder-sdl)

(defconstant +window-width+  600)
(defconstant +window-height+ 600)

(defconstant +cube-vertices+
  #(#(0 0 0)
    #(0 1 0)
    #(1 1 0)
    #(1 0 0)
    #(0 0 1)
    #(0 1 1)
    #(1 1 1)
    #(1 0 1)))

(defconstant +cube-faces+
  '((#(4 7 6 5) #(0 0 1))
    (#(5 6 2 1) #(0 1 0))
    (#(1 2 3 0) #(0 0 -1))
    (#(0 3 7 4) #(0 -1 0))
    (#(4 5 1 0) #(-1 0 0))
    (#(3 2 6 7) #(1 0 0))))

(defun draw-figure (verts faces)
  (labels ((set-normal (n)
             (gl:normal (aref n 0) (aref n 1) (aref n 2)))
           (set-vertex (index)
             (let ((v (aref verts index)))
               (gl:vertex (aref v 0) (aref v 1) (aref v 2))))
           (draw-face (vertex-indices normal)
             (set-normal normal)
             (gl:begin :quads)
             (map 'nil #'set-vertex vertex-indices)
             (gl:end)))

    (map 'nil #'(lambda (x) (draw-face (first x) (second x))) faces)))


(defun draw-frame (rotx roty rotz)
  (gl:matrix-mode :modelview)
  (gl:push-matrix)
  (gl:translate 0.5 0.5 0.5)
  (gl:rotate rotx 1 0 0)
  (gl:rotate roty 0 1 0)
  (gl:rotate rotz 0 0 1)
  (gl:translate -0.5 -0.5 -0.5)
  (draw-figure +cube-vertices+ +cube-faces+)
  (gl:pop-matrix))
  

(defun start ()
  (let ((rotx 0)
        (roty 0)
        (rotz 0))
    (sdl:with-init ()
      (sdl:window +window-width+ +window-height+ 
                  :opengl t
                  :opengl-attributes '((:sdl-gl-depth-size   16)
                                       (:sdl-gl-doublebuffer 1)))
      (setf (sdl:frame-rate) 10)

      (gl:viewport 0 0 +window-width+ +window-height+)
      (gl:matrix-mode :projection)
      (gl:load-identity)
      (glu:perspective 50 (/ +window-height+ +window-width+) 1.0 10.0)
      (glu:look-at -2 2 4 
                    0.5 0.5 0.5 
                    0 1 0)

      (gl:matrix-mode :modelview)
      (gl:load-identity)

      (gl:clear-color 0 0 0 0)
      (gl:shade-model :flat)
      (gl:cull-face :back)
      (gl:polygon-mode :front :fill)
      (gl:draw-buffer :back)
      (gl:material :front :ambient-and-diffuse #(0.7 0.7 0.7 0.4))
      (gl:light :light0 :position #(0 0 1 0))
      (gl:light :light0 :diffuse #(1 0 0 0))
      (gl:light :light1 :position #(-1 2 -0.5 0))
      (gl:light :light1 :diffuse #(0 1 0 0))
      (gl:enable :cull-face :depth-test
                 :lighting :light0 :light1)

      (gl:clear :color-buffer :depth-buffer)
      (draw-frame rotx roty rotz)
      (sdl:update-display)

      (sdl:with-events ()
        (:quit-event () t)
        (:video-expose-event () (sdl:update-display))
        (:idle
          (setq rotx (mod (+ rotx 2.5) 360.0))
          (setq roty (mod (+ roty 0.7) 360.0))
          (setq rotz (mod (+ rotz 4.4) 360.0))
          (gl:clear :color-buffer :depth-buffer)
          (draw-frame rotx roty rotz)
          (sdl:update-display))))))

The cube vertices are stored in the +cube-vertices+ array. We will draw each face of the cube as a GL quad primitive (four-side polygon), and store the vertex indices and normal vector of each face in +cube-faces+.

(defconstant +cube-vertices+
  #(#(0 0 0)
    #(0 1 0)
    #(1 1 0)
    #(1 0 0)
    #(0 0 1)
    #(0 1 1)
    #(1 1 1)
    #(1 0 1)))

(defconstant +cube-faces+
  '((#(4 7 6 5) #(0 0 1))
    (#(5 6 2 1) #(0 1 0))
    (#(1 2 3 0) #(0 0 -1))
    (#(0 3 7 4) #(0 -1 0))
    (#(4 5 1 0) #(-1 0 0))
    (#(3 2 6 7) #(1 0 0))))

The function draw-figure draws these. There are a couple of points of note regarding the interface provided by cl-opengl. The fact that we are using a C interface binding makes it difficult to pass a Common Lisp array (vector) to functions like glVertex3fv because they expect a pointer; I instead use helper functions set-normal and set-vertex to pass the x, y and z values as separate parameters. It is possible to use vertex buffers with cl-opengl for better performance, but it involves some kludgery with C data structures and their associated memory management — see the example program opengl-array.lisp provided with the cl-opengl release.

(defun draw-figure (verts faces)
  (labels ((set-normal (n)
             (gl:normal (aref n 0) (aref n 1) (aref n 2)))
           (set-vertex (index)
             (let ((v (aref verts index)))
               (gl:vertex (aref v 0) (aref v 1) (aref v 2))))
           (draw-face (vertex-indices normal)
             (set-normal normal)
             (gl:begin :quads)
             (map 'nil #'set-vertex vertex-indices)
             (gl:end)))

    (map 'nil #'(lambda (x) (draw-face (first x) (second x))) faces)))

The draw-frame function draws the cube rotated about the x, y and z axes by specified angles by appropriate manipulation of OpenGL’s modelview matrix (which maps from local object coordinates to world coordinates). We translate to the centre of the cube before applying the rotations so that the cube will be spin around its middle rather than a corner.

(defun draw-frame (rotx roty rotz)
  (gl:matrix-mode :modelview)
  (gl:push-matrix)
  (gl:translate 0.5 0.5 0.5)
  (gl:rotate rotx 1 0 0)
  (gl:rotate roty 0 1 0)
  (gl:rotate rotz 0 0 1)
  (gl:translate -0.5 -0.5 -0.5)
  (draw-figure +cube-vertices+ +cube-faces+)
  (gl:pop-matrix))

Finally, the top-level form. We use Lispbuilder SDL for window management in this case, so we must specify :opengl t and pass any OpenGL attributes as :opengl-attributes when the window is created: here we request a 16-bit depth buffer and double buffering. We also set sdl:frame-rate to set how often the idle event is called in the SDL main event loop.

(sdl:with-init ()
      (sdl:window +window-width+ +window-height+ 
                  :opengl t
                  :opengl-attributes '((:sdl-gl-depth-size   16)
                                       (:sdl-gl-doublebuffer 1)))
      (setf (sdl:frame-rate) 10)

There then follows some OpenGL initialisation to set the viewport to the whole window, and set the viewing (projection) and world coordinate transformation (modelview) matrices. The GLU convenience functions glu:perspective and glu:look-at are used to set the projection and viewing parameters.

      
      (gl:viewport 0 0 +window-width+ +window-height+)
      (gl:matrix-mode :projection)
      (gl:load-identity)
      (glu:perspective 50 (/ +window-height+ +window-width+) 1.0 10.0)
      (glu:look-at -2 2 4 
                    0.5 0.5 0.5 
                    0 1 0)

      (gl:matrix-mode :modelview)
      (gl:load-identity)

We now set the background to be cleared to black with gl:clear-color and configure polygon shading and culling, the cube colour properties and the lighting. The cube is white, and illuminated by a green light and a red light.

  
      (gl:clear-color 0 0 0 0)
      (gl:shade-model :flat)
      (gl:cull-face :back)
      (gl:polygon-mode :front :fill)
      (gl:draw-buffer :back)
      (gl:material :front :ambient-and-diffuse #(0.7 0.7 0.7 0.4))
      (gl:light :light0 :position #(0 0 1 0))
      (gl:light :light0 :diffuse #(1 0 0 0))
      (gl:light :light1 :position #(-1 2 -0.5 0))
      (gl:light :light1 :diffuse #(0 1 0 0))
      (gl:enable :cull-face :depth-test
                 :lighting :light0 :light1)

With those initialisations done, we clear the window and draw the first frame. The function sdl:update-display swaps the OpenGL buffers.

 
      (gl:clear :color-buffer :depth-buffer)
      (draw-frame rotx roty rotz)
      (sdl:update-display)

Finally, we enter the SDL main event loop. Handling the quit-event will cause the main loop to exit when we close the window. The idle event is called at the set frame rate and animates the cube by updating the rotation angles and redrawing it.

  
      (sdl:with-events ()
        (:quit-event () t)
        (:video-expose-event () (sdl:update-display))
        (:idle
          (setq rotx (mod (+ rotx 2.5) 360.0))
          (setq roty (mod (+ roty 0.7) 360.0))
          (setq rotz (mod (+ rotz 4.4) 360.0))
          (gl:clear :color-buffer :depth-buffer)
          (draw-frame rotx roty rotz)
          (sdl:update-display))))))
Posted in Computing, Graphics, Lisp | 3 Comments

Prime Number Searches — Functional and Imperative revisited

tl;dr In a previous blog post, I compared an imperative implementation of the Sieve of Eratosthenes prime number search algorithm with trial by division and a naïve functional version. Based on feedback, comments and further reading of Melissa O’Neill’s paper [1], I revisit the functional Sieve of Erathosthenes with an improved version using lists of prime factors instead of a single sieve, which outperforms trial by division. I then further improve the functional sieve by using a priority queue and lazy evaluation. The performance of the lazy sieve is a couple of orders olf magnitude slower than the imperative sieve, but does not appear to get significantly worse than that with increasing sieve sizes. Furthermore, it is far more scalable due to lower memory requirements.

Recap

Imperative Sieve of Eratosthenes

My initial imperative implementation of the Sieve of Eratosthenes was straightforward but I missed a further optimisation mentioned in the Rosetta Code guidelines [2]: if we’re finding primes up to n then we can discontinue “crossing off” prime multiples from the sieve when we reach the square root of n.

(defun primes (n)
  (let ((sieve (make-array (1+ n) :element-type '(unsigned-byte 8)
                                  :initial-element 0)))

    (loop for i from 2 upto (isqrt n) when (zerop (aref sieve i)) do
       (loop for j from (* i i) upto n by i do
          (setf (aref sieve j) 1)))

    (loop for i from 2 upto n when (zerop (aref sieve i)) collect i)))

Trial by Division

I’ve also made some tweaks to my original recursive trial by division version to improve readability and to reduce consing and garbage collection.

(defun primes (n)
   (labels ((find-primes (primes candidates)
     (if (null candidates)
        primes
        (let ((i (first candidates)))
          (find-primes (cons i primes)
                       (delete-if #'(lambda (x) (zerop (rem x i))) 
                                 (rest candidates)))))))

      (nreverse 
        (find-primes nil (loop for i from 2 upto n collect i)))))

Performance

The times to compute primes up to a search limit n of the imperative sieve and trial by division functions above, and the naïve functional sieve from my previous blog post, are plotted below (SBCL 1.0.58, Microsoft Windows 7 64-bit, 2.8 GHz Intel Core i5 760).

Performance: imperative sieve, trial by division and naive functional sieve
Trial by division is an order of magnitude slower than the imperative sieve for n=100 and its relative performance steadily grows worse with increasing n. The naïve functional sieve shows the same trend as trial by division but starts off two orders of magnitude slower than the imperative sieve at n=100. For calculating primes to 1,000,000, the imperative sieve takes around 20 milliseconds while trial by division takes over a minute. I got bored waiting for the naïve functional sieve.

Functional Sieve of Eratosthenes revisited

A list-based sieve

My first attempt at a functional Eratosthenes sieve stuck too rigidly to the idea of the imperative version by representing the sieve as a single “cross-off” list with an entry for every number. In my previous blog post, I defined a k-sieve for prime number k as the multiples of k from k2 up to our prime search limit n. (These are the numbers we “cross out” from the sieve for k.) Each time a prime was discovered, its k-sieve was calculated and then combined into the cross-off list which, in accordance with the functional programming tenet of no mutable data, was recreated each time. Further, although primes are found increasingly less frequently as the search progresses, searching for the next prime involved traversing the cross-off list to find the first zero value, which became progressively further from the head of the list.

To avoid repeatedly merging the k-sieves for each prime into a single sieve, we can store each k-sieve separately in a list of sieves. To avoid searching through numbers we’ve already eliminated, we can remove numbers from the list as they are crossed off. I argue that this is still the Sieve of Eratosthenes algorithm because we’re still “checking off” multiples of primes, even though the sieve itself is not structured as a linear list or array and is progressively discarded. The algorithm is as follows:

  1. Start with a list of candidate numbers from 2 to n, an empty list of primes and an empty list of k-sieves.
  2. Search for the next candidate i in all the k-sieves. (If the sieves are generated in order, we only have to test i against the first element of each k-sieve.)
  3. If we don’t find a match, i must be prime. Add it to the list of primes, and if i is less than the square root of n, add its k-sieve to the list of sieves.
  4. If we do find a match, then i cannot be prime. Remove it from the all of the k-sieves in which it occurs. If any k-sieves become empty as a result, remove them from the sieve list.
  5. Iterate until no candidates remain.

Here’s an example to illustrate this. We start with an empty list of sieves, an empty list of primes and a list of candidates from 2 to n.

Sieves     -> ()
Primes     -> ()
Candidates -> ( 2 3 4 5 6 ... n )

Our first candidate number is 2. There are no sieves, so we declare it to be prime, add it to the list of primes, and add the 2-sieve to the list of sieves.

Sieves     -> ( ( 4  6  8 10 12 14 ... ) )
Primes     -> ( 2 )
Candidates -> ( 3 4 5 6 7 ... n )

We now look at the next candidate, 3. It doesn’t occur in any of the k-sieves, so we again declare it to be prime and add the 3-sieve to the list of sieves.

Sieves     -> ( ( 9 12 15 18 21 24 ...)
                ( 4  6  8 10 12 14 ...) )
Primes     -> ( 2 3 )
Candidates -> ( 4 5 6 7 8 ... n )

The next candidate is 4. It occurs in the 2-sieve, so it can’t be prime. We delete it from any k-sieves in which it occurs, and continue to the next candidate.

Sieves     -> ( ( 9 12 15 18 21 24 ... )
                ( 6  8 10 12 14 16 ... ) )
Primes     -> ( 2 3 )
Candidates -> ( 5 6 7 8 9 ... n)

Similarly to the imperative sieve above, when a prime greater than the square root of n is found, we do not need to generate its k-sieve since its first element will be greater than n.

Knocking up a simple implementation is straightforward and takes less than 60 lines of code, including a few comments.

(defun make-sieve (k n)
  "Create a sieve for prime k consisting of the
   square of k, then each multiple of k up to n."
  (loop for i from (* k k) upto n by k collect i))

(defun in-sieve-p (i sieve)
  "Test whether i is in the sieve. (Tests whether i is equal 
   to the first element of sieve.)
   Returns the result of the test and either a copy of the
   sieve with i removed or the original sieve."
  (let ((in-sieve (equal i (first sieve))))
    (values in-sieve
            (if in-sieve (rest sieve) sieve))))

(defun prime-p (i sieve-list)
  "Test whether i is in any of the sieves in sieve-list.
   Returns the result and updated sieves."
  (labels ((test-sieve (is-prime out-sieves in-sieves)
             (if (null in-sieves)
                 (values is-prime out-sieves)
                 (multiple-value-bind (not-prime new-sieve)
                                      (in-sieve-p i (first in-sieves))
                   (test-sieve (if not-prime nil is-prime)
                               (cons new-sieve out-sieves)
                               (rest in-sieves))))))

     (test-sieve t nil sieve-list)))

(defun primes (n)
  (let ((sqrt-n (isqrt n)))

    (labels ((test-prime (primes sieves candidates)
               (if (null candidates)
                   primes
                   (let ((i (first candidates)))
                     (multiple-value-bind (is-prime new-sieve-list)
                                          (prime-p i sieves)
                       (test-prime (if is-prime (cons i primes) primes)
                                   (if (and is-prime (< i sqrt-n))
                                       (cons (make-sieve i n) new-sieve-list)
                                       new-sieve-list)
                                   (rest candidates)))))))

      (nreverse
        (test-prime nil nil (loop for i from 2 upto n collect i))))))

What about its performance?
Untitled-2

This is a major win. It not only performs an order of magnitude better than than the naïve functional sieve for n=100, it also outperforms trial by division despite a more complex implementation. Not only that, the gradient is flatter; it remains within two orders of magnitude of the imperative sieve performance until around n=1,000,000. Can we do better still?

Sorting the list: a Priority Queue

Further consideration of our list-based approach reveals a few possible optimisations. New k-sieves are added onto the front of the sieve list, and the prime number search function prime-p examines all the sieves. However, If we can keep the list of k-sieves ordered by their first element, we can stop searching as soon as we encounter a sieve with a first element greater than i. To keep the list sorted, we need to consider whether or not i prime.

  • If i is prime, we add the i-sieve to the list. Its first element will be greater than that of any other sieve in the list, so it will be added to the end of the list. For a purely functional (immutable) list structure, this is an expensive operation since it means that we will have to copy all of the cons cells in the original list. Fortunately, primes occur progressively less frequently as i increases.
  • If i is not prime, we delete the corresponding first elements from the matching k-sieves. This means we will have to sort the list, but the list will be in an almost sorted state. Sorting can be accomplished by removing the k-sieves that have i as their first element from the sieve list, deleting their first elements, then re-inserting them into the sieve list at the appropriate locations.

This in fact will effectively implement a simple priority queue [3], with the sieve head elements determining the priority. Melissa O’Neill also examines this approach in her paper.

I implemented a simple purely functional (i.e. immutable) priority queue package with the following functions: pq-insert and pq-remove, which removes and returns all elements of a given priority as well as the updated queue. The code is as follows:

(defun pq-insert (element queue)
  "Insert element into the priority queue."
   (if (null queue)
       (list element)
       (let ((priority (first element))
	     (next-element (first queue)))
	 (if (>= (first next-element) priority)
	     (cons element queue)
	     (cons next-element (pq-insert element (rest queue)))))))

(defun pq-remove (priority queue)
  "Remove elements with a given priority from the queue
   and return them along with the updated queue."
  (labels ((delete-elements (deleted q)
             (if (null q)
                 (values deleted nil)
                 (let* ((next-element (first q))
                        (cmp (- priority (first next-element))))
                   (cond ((minusp cmp)
                          (values deleted q))
                         ((zerop cmp)
                          (delete-elements 
                             (cons next-element deleted)
                             (rest q)))
                         (t
                          (multiple-value-bind (del-lst in-lst)
                                               (delete-elements deleted (rest q))
                            (values del-lst (cons next-element in-lst)))))))))

  (delete-elements nil queue)))

(I implement a combined “delete-and-return matching sieves” function because if search and delete are separate functions, in the majority of cases, where i is not prime, we will need to search the list twice — first for matching sieves and then to delete them. However, there is a penalty when a prime is encountered because the entire list will be copied. In practice, there doesn’t seem to be a huge performance difference between the two approaches.)

The Eratosthenes sieve is implemented based on this data structure as follows:

(defun build-sieve (k n) 
  (loop for i from (* k k) upto n by k collect i))

(defun prime-p (i sieves)
  (multiple-value-bind (matches new-sieves) 
                       (pq-remove i sieves)
    (if (null matches)
        (values t sieves)
        (values nil (reduce #'(lambda (q s) (pq-insert s q))
                            (remove-if 'null
                                       (mapcar 'rest matches))
                            :initial-value new-sieves)))))

(defun primes (n) 
  (let ((sqrt-n (isqrt n)))

    (labels ((test-prime (primes sieves candidates)
                (if (null candidates)
                primes
                (let ((i (first candidates)))
                   (multiple-value-bind (is-prime new-sieves)
                                        (prime-p i sieves)
                      (test-prime (if is-prime (cons i primes) primes)
                                  (if (and is-prime (< i sqrt-n))
                                      (pq-insert (make-sieve i n) new-sieves)
                                      new-sieves)
                                  (rest candidates)))))))

      (nreverse
        (test-prime nil nil (loop for i from 2 upto n collect i))))))

The question is whether the overhead of maintaining the priority queue is worth the effort. Let’s see:

Surprisingly, the priority queue overhead is sufficiently high that our brain-dead list-based implementation beats it for n less than around 100,000, and also performs worse than trial by division for n less than around 2,000. To understand why, let’s find out what the maximum length the sieve list will reach for n=100,000 — it should be equal to the number of primes found by sqrt(100,000):

CL-USER> (isqrt 100000)
316
CL-USER> (length (functional-list-primes 316))
66

So the maximum number of k-sieves in the list in this case will be 66, which will be attained at i=316, after which it will continue to diminish as we approach i=100,000.

This explains why simply using an unsorted list and testing all its elements serves us well for so long; the maximum number of sieves in the list only grows slowly with increasing n. The greater overhead of sorting (be it a priority queue, a set, a binary tree or whatever abstraction) therefore doesn’t pay off until n is fairly large. Even after the crossover point the performance gap only grows gradually because primes become increasingly scarce so the sieve list growth becomes slower and slower with increasing n. Moreover, by the time we start getting performance wins from sorting, our approach starts running into another scalability issue: memory consumption.

No sets please, we’re British

What about a more sophisticated data structure? Finding whether a candidate number i is prime can be viewed as testing whether i is a member of a set that is the union of all k-sieves. (In fact, we only need to test whether it is in the set of lowest values in all k-sieves if we remove elements as they are “crossed off”.) Sets (and general priority queues, for that matter) are often implemented based on a binary search tree [4] or some variant thereof, so would this not outperform a simple list?

To test this, I implemented a pure functional binary search tree and tried again. While a BST may be suitable for general applications which require sorting and rapid seaching, it may not be so useful for an Erathosthenes sieve. Because we generate the sieve incrementally, the tree becomes highly unbalanced; k-sieves are only added to the right subtree, and the tree thus degenerates into a linear linked list. (There may be the occasional left subtree of some nodes depending on the order in which we re-insert k-sieves after removing their first elements.) A self-balancing tree would be more complex to implement and might actually reduce performance because (a) the balancing operation adds overhead, and (b) the most frequently accessed nodes are those with the lowest values (the 2-sieve, followed by the 3-sieve, followed by the 5-sieve, etc.), but these would be pushed deeper into the tree by re-balancing.

Lazy Evaluation

Our list- or priority queue-based sieves run into a memory usage problem — the input list of candidates contains n-2 elements and each k-sieve is initialised with all multiples of k from k2 to n. Memory consumption is an issue that also afflicts the imperative sieve. We can address the issue by using lazy evaluation. A lazy Eratosthenes sieve would not only reduce memory usage considerably, it would also allow us to answer questions like what is the xth prime?, for which we would otherwise need to know the size of the sieve in advance or find it by trial and error.

Although Common Lisp uses strict evaluation, we can implement lazy evaluation quite easily using a generator; a function that, when called, returns the next value in the sequence. A lightweight method would be to store each k-list as a tuple containing the “current value”, the prime k and the square root of the search limit n. Given those three quantities we can simply calculate the next value when required using a general function. A heavier method (but one that is generally used) is to wrap these values in a thunk (a piece of code) that closes around them, hiding them from view and keeping the functional abstraction at a higher level.

With functional programming in non-pure languages, there’s often a tradeoff in terms of the level at which we present the “no mutable data” abstraction. Of course, even languages like Miranda and Haskell must alter memory cells to work but this is hidden behind the scenes by the compiler. Thunks that mutate data inside closures are used in functional programming to implement memoisation and lazy evaluation.

In this implementation, I’ll select the latter approach and represent a k-list as a dotted pair (cons cell) containing the current value in the CAR and the thunk in the CDR. We can access the next value in the k-list in the same way as we did before using first (or car, which does the same thing.) However, to get the next value, instead of using rest we need a lazy equivalent lazy-rest which calls the thunk to generate the next value. Here’s the lazy list abstraction:

(declaim (inline make-lazy-list
                 lazy-first
                 lazy-rest))

(defun make-lazy-list (thunk)
  (let ((value (funcall thunk)))
    (if (null value)
        nil
        (cons value thunk))))

(defun lazy-first (lazy-list)
  (car lazy-list))

(defun lazy-rest (lazy-list)
  (make-lazy-list (cdr lazy-list)))

Here’s a couple of functions that create a lazy k-list and a lazy list of candidates. (These are so similar it’d be worth creating a macro to make an arbitrary lazy list comprehension if we used this technique often.)

(defun make-lazy-sieve (k n)
  "Lazy k-sieve."
  (let ((next-value (* k k)))
    (make-lazy-list #'(lambda ()
             (if (null next-value) 
                 nil
                 (let ((current-value next-value)
                       (nv (+ k next-value)))
                   (setf next-value (if (> nv n) nil nv))
                   current-value))))))

(defun make-lazy-candidates (n)
  "Lazy prime number candidates list."
  (let ((next-value 2))
    (make-lazy-list #'(lambda ()
      (if (null next-value)
          nil
          (let ((current-value next-value)
                (nv (1+ next-value)))
               (setf next-value (if (> nv n) nil nv))
               current-value))))))

Our priority queue functions remain as before. (We could substitute function calls to first with lazy-first but these do exactly the same thing.) There are only slight modifications to the prime-p function and primes (shown underlined):

(defun prime-p (i sieves)
  (multiple-value-bind (matches new-sieves) 
                       (pq-remove i sieves)
    (if (null matches)
        (values t sieves)
        (values nil (reduce #'(lambda (q s) (pq-insert s q))
                            (remove-if 'null
                                       (mapcar 'lazy-rest matches))
                            :initial-value new-sieves)))))

(defun primes (n) 
  (let ((sqrt-n (isqrt n)))

    (labels ((test-prime (primes sieves candidates)
                (if (null candidates)
                primes
                (let ((i (lazy-first candidates)))
                   (multiple-value-bind (is-prime new-sieves)
                                        (prime-p i sieves)
                      (test-prime (if is-prime (cons i primes) primes)
                                  (if (and is-prime (< i sqrt-n))
                                      (pq-insert (make-sieve i n) new-sieves)
                                      new-sieves)
                                  (lazy-rest candidates)))))))

      (nreverse
        (test-prime nil nil (make-lazy-candidates n))))))

Now let’s look and see how big a penalty we had to pay for that laziness.

Basically, very little. The lazy sieve (using a priority queue) keeps pace with the non-lazy priority queue while remaining a couple of orders magnitude slower than the imperative sieve up to the greatest value of n tested (5 x 107). It also boasts far lower memory consumption; the maximum k-list size (number of entries in the priority queue) is the number of primes up to sqrt(50,000,000), which is 908 — a very modest number.

Conclusion

In my original blog article, I concluded that the Sieve of Eratosthenes algorithm was basically imperative, and that even with a superior data structure the functional version would have no advantage over the imperative version so would not be worth the effort. Revisiting the problem has shown that re-framing it in a way more amenable to tackling using functional techniques makes it not only tractable, the functional approach using lazy evaluation is superior to the imperative sieve in an important aspect; it is ultimately more scalable because it has far lower memory consumption. Furthermore, because the sieve is evaluated in a lazy fashion, with some very minor modifications we can write a program to calculate the nth prime number, something which we could not do with non-lazy or imperative sieves without determining the sieve size in advance.

References

[1] M. E. O’Neill, The Genuine Sieve of Eratosthenes.
[2] Rosetta Code. Sieve of Eratosthenes.
[3] Wikipedia. Priority Queue.
[4] Wikipedia. Binary Search Tree.

Posted in Computing, Lisp | Leave a comment

Prime Number Searches — Functional and Imperative

tl;dr Implementations of the Sieve of Eratosthenes prime number search algorithm are often given as functional programming language examples to show off their brevity. However, closer inspection reveals that some implementations are actually not the Sieve but use trial by division. I compare prime number searches using imperative Sieve, functional trial by division, and functional Sieve implementations in Common Lisp. The imperative implementation wins in terms of raw performance, while with a suitable algorithm the functional version is brief and its performance is acceptable. However, the Eratosthenes Sieve is an explicitly imperative algorithm and is a poor fit for FP; my functional solution is neither small nor elegant, and its performance is terrible. Careful choice of data structures may alleviate this at the expense of complexity.

Edit: (2013/2/26) See my later blog post . for an update, where I revisit the problem with different functional approaches that offer better performance and scalability while still following the Eratosthenes Sieve approach.

I’m currently exploring functional programming using Common Lisp as a vehicle. An example that is sometimes held up to show off power and brevity of functional programming languages is a prime number search by the Sieve of Erathosthenes. An example on the Miranda home page is given as:

        
    primes = sieve [2..]
             where
             sieve (p:x) = p : sieve [n | n <- x; n mod p > 0]

Miranda is a lazy evaluation, pure functional language. However, I’m currently trying to make Common Lisp my language of choice, and I attempted to create imperative and functional versions of the Sieve of Eratosthenes for comparison.

However, while I was digging through Rosetta Code (http://rosettacode.org/wiki/Sieve_of_Eratosthenes) to see how others tackled the problem in eager evaluation functional languages, I came across a few implementations that had been highlighted because they did not actually implement the Sieve of Erathosthenes algorithm but rather used trial division. My suspicions were aroused and looking more closely at the example on the Miranda homepage, yes is uses trial division, so is not the bona fide Sieve. The Sieve of Eratosthenes is an algorithm—a description of a method for solving a problem—so implementing another method and then claiming it to be the Sieve of Eratosthenes is false even if it achieves the same result.

I therefore set about to create imperative and two functional versions, one using trial division and the other the Sieve of Eratosthenes algorithm, for comparison.

Edit: Melissa E. O’Neill has also noticed this, and pipped me to the post with a sieve in Haskell. Check out her paper at http://www.cs.hmc.edu/~oneill/papers/Sieve-JFP.pdf.

Sieve of Eratothenes

As I understand it, the Sieve of Eratothenes algorithm to find primes up to some number n is basically as follows:

  • Create a table of prime number candidates from 2 to n.
  • Starting with 2, proceed to “cross out” multiples of 2 from the table. (Since they’re multiples of a prime they can’t ipso facto be prime themselves so can be eliminated from consideration.)
  • The next number in the table that hasn’t been crossed out is the next prime. Record it, then cross out its multiples from the table.
  • Rinse and repeat till you run out of candidate numbers.

A permissible optimisation (for Rosetta Code entries) is, for prime k, to start the crossing out with the square of k, since smaller multiples will have been taken care of by the earlier crossings out.

Imperative version

The Sieve of Eratosthenes is straightforward to implement in an imperative language, including BASIC and assembly language. In Common Lisp:

 1) (defun imperative-prime (n)
 2)    "Sieve of Eratosthenes to find primes up to n.
 3)     Imperative implementation."
 4)
 5)    (let ((sieve (make-array (1+ n) :element-type '(unsigned-byte 8)
 6)                                    :initial-element 0)))
 7)
 8)      (loop for i from 2 upto n when (zerop (aref sieve i)) do
 9)        (loop for j from (* i i) upto n by i do (setf (aref sieve j) 1)))
10)
11)      (loop for i from 2 upto n when (zerop (aref sieve i)) collect i)))

Lines 5–6 create an array to hold our “crossing out” table, initialised to all zeros (no values crossed out). The only slight complication is that in CL the first value in the array has the index 0, so I create an extra element so that the table entry for number k has the index k rather than k-1. Lines 8–9 walk through the table crossing out multiples of each prime found by setting them to 1, and line 11 collects all the numbers for which the table entry is 0 (the primes) into a list. Eleven lines of code, including comments and blanks for readability.

Functional Version 1 (trial division)

My first stab at a functional version is also short and simple.

 1) (defun functional-prime-1 (n)
 2)  "Find primes up to n.
 3)   Functional implementation using trial division."
 4)
 5)   (labels ((x-primes (primes candidates)
 6)     (if (null candidates)
 7)        primes
 8)        (let ((pr (car candidates)))
 9)          (x-primes (cons pr primes)
10)                     (remove-if #'(lambda (x) (zerop (rem x pr))) 
11)                                  (cdr candidates)))))))
12)
13)      (nreverse 
14)        (x-primes nil (loop for i from 2 upto n collect i)))))

The syntax of Lisp can make this hard to follow but the approach is classic recursion. Lines 5–11 define a local auxiliary function that takes as its arguments two lists: a list of primes we’ve found so far and a list of candidate numbers. We first test whether the list of candidates is empty (line 6) and if it is, we simply return the list of primes (line 7). Otherwise, we take the next prime number from the list of candidates, pr (line 8), then call the auxiliary function again with pr tacked onto the primes list (line 9) and the list of candidates with multiples of pr removed (line 10). Finally, lines 12–13 kick off the recursive function with initial values of the primes list (empty) and the candidates (natural numbers from 2 to n). A small detail is that because it’s much quicker to add an item to the front of a list than the back, the list of primes is built backwards so line 12 reverses it before returning it to the caller.

The same algorithm can be expressed in Miranda (which I personally find easier to grok):

x_primes :: [num] -> [num]
x_primes primes, []      = primes
x_primes primes, [x:xs]  = x_primes( x ++ primes, [k | k <- xs; k mod x > 0])

functional_prime_1 :: num -> [num]
functional_prime_1 n = reverse( x-primes([], [1..n]) )

Elegant and beautiful, no? Maybe, but is is not the Sieve; it’s using trial division to remove candidates, rather than crossing off multiples in a list, and testing every candidate.

Functional Version 2 (Sieve?)

This is my attempt to implement the Sieve in a functional style using Common Lisp. I aim for simplicity and correctness, at the expense of some optimisations; for example, I choose to represent the sieve as a list rather than an array.

For each prime k, we need to “set” the list elements corresponding to multiples of k to 1 (starting at k2). The data are immutable, so I chose a recursive approach, at each step creating a new sieve for each prime with just its multiples set, then combining it with the sieve from the previous step before passing it to the next iteration. Slightly differing from the iterative version, I also cross out the prime itself, and accumulate primes in a separate list. That way, finding the next prime simply consists of searching for the first element of the sieve that is 0.

For example, the initial sieve is all zeros except for the first element that corresponds to the natural number 1.

primes -> ()
sieve  -> (1 0 0 0 0 0 0 0 0 0 ...)

The first prime is the first element set to zero, which is the second in the list and so the number two. We add this to the list of primes, then generate a sieve for 2 and its multiples starting at 4 (22).

primes  -> ( 2 )
2-sieve -> (0 1 0 1 0 1 0 1 0 1 ...)
sieve   -> (1 0 0 0 0 0 0 0 0 0 ...)

We then recursive with the primes list and a new sieve generated by the logical or of the original sieve and the 2-sieve

primes -> ( 2 )
sieve  -> (1 1 0 1 0 1 0 1 0 1 ...)

The condition to terminate recursion is when there are no more 0s in the sieve, in which case we return the list of primes.

Now for the implementation. Say we want to find the primes less than some number n. We first create a function that, given a prime k, generates a list of sieve elements to “set”.

 1) (defun build-multiples-list (k n) 
 2)   "Generate a list consisting of the number k, its square, then
 3)    each multiple of k, up to n. The generated list is backwards."
 4)
 5)  (labels ((x-multiple (acc nxt)
 6)    (if (> nxt n) 
 7)        acc
 8)        (x-multiple (cons nxt acc) (+ k nxt)))))
 9)
10)    (x-multiple (list k) (* k k))))

I could use the mighty loop macro here but I want to use recursion. Lines 5–8 define an auxiliary function that takes as its arguments an accumulator holding the list so far, and the next value to add to the list. We first check if the next value is greater than n, the limit of our primes search (line 6) and if it is, we return the list of accumulated values (line 7). Otherwise, we recurse with the value prepended to the accumulator and the next multiple of k (line 8). The auxiliary function is invoked with k itself in the accumulator and the square of k as the first value (line 10). Because we prepend to the list, the list will be backwards.

We test it as follows:

CL-USER> (nreverse (build-factors-list 3 100))
(3 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75 78 81 84 87 90 93 96 99)

We now need to build a sieve corresponding to the list of multiples. Consider the list of multiples for 3, which is (3 9 12 15 …). The 3-sieve is

(0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 ...)

We observe that this is the concatenation of a number of smaller lists consisting of a sequence of 0s followed by a 1, i.e.

(0 0 1) ++ (0 0 0 0 0 1) ++ (0 0 1) ++ (0 0 1) ++  ...

where ++ denotes list concatenation. The length of each sublist in the sieve is simply the difference between successive multiples. Let’s create a function that, given a list of multiples, returns the length of each sublist in the k-sieve.

 1) (defun get-subsieve-lengths (vals)
 2)    (labels ((x-subsieve (outlst inlst)
 3)                (if (null (cdr inlst))
 4)                    outlst
 5)                    (x-subsieve (cons (- (cadr inlst) (car inlst)) 
 6)                                      outlst) 
 7)                                (cdr inlst)))))
 8)
 9)        (nreverse (x-subsieve nil (cons 0 vals)))))

Let’s test this. Recalling that the list of multiples to set in the 3-sieve is (3 9 12 15…)

CL-USER> (get-subsieve-lengths (nreverse (build-multiples-list 3 100)))
(3 6 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3)

which gives is the value of the first element of our list of multiples, followed by the differences between successive elements in the list.

Constructing a k-sieve is accomplished by the following functions:

 1) (defun gimme-n-of (what n)
 2)  "Return a list of n whats"
 3)  (loop repeat n collect what))
 4)
 5) (defun build-sieve (k n)
 6)  "Construct the 'cross-off' list (Eratosthenes sieve) for prime k."
 7)
 8)  (flet ((build-sub (len)
 9)    "Build a list of len-1 zeros and a 1 on the end"
10)     (nreverse (cons 1 (gimme-n-of 0 (1- len))))))
11)
12)       (let* ((multiples     (build-multiples-list k n))
13)              (last-multiple (car multiples))
14)              (sieve         (mapcan #'build-sub (get-subsieve-lengths (nreverse multiples)))))
15)
16)       (if (= n last-multiple)
17)           sieve
18)           (nconc sieve (gimme-n-of 0 (- n last-multiple)))))))

gimme-n-of is simply a convenience function that returns a list consisting of a specified number of constant elements. The sieve itself is constructed at line 14 through a pipeline of functions, using a local function build-sub (lines 8–10) to construct each subsieve. A minor complication is the we need to add zeros to the end of the sieve after the last subsieve to bring us up to the total sieve length (lines 16–18). The number of zeros we need to add is the search limit n minus the last multiple of k, extracted at line 13. (This is why I don’t reverse the multiples list before returning it — it makes it more efficient to get the “last” value.)

We now have the bits in place to build a functional Eratosthenes sieve. This function itself is short and straightforward.

 1)  (defun functional-prime-2 (n)
 2)   "Find primes up to n.
 3)    Functional implementation using Sieve of Eratosthenes."
 4)
 5)   (labels ((find-next-prime (primes sieve)
 6)              (let* ((pos (position 0 sieve)))
 7)                 (if (null pos)
 8)                     primes
 9)                     (let ((next-prime (1+ pos)))
10)                       (find-next-prime (cons next-prime primes)
11)                                        (mapcar #'logior (build-sieve next-prime n) 
12)                                                         sieve)))))))
13) 
14)      (nreverse 
15)        (find-next-prime nil (cons 1 (gimme-n-of 0 (1- n)))))))

And let’s check its output:

CL-USER> (functional-prime-2 100)
(2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97)

However, we expect performance to be terrible — we’re creating two sieves at each iteration, each a list of boxed values, involving a lot of consing and creating a lot of garbage in the process. Further, on each iteration we’re searching the sieve for the first zero, which is getting further and further away, each time walking down the list from the start. How bad will it be? Let’s find out.

Performance

We now have three programs to find prime numbers up to a given number. Let’s try some benchmarking. For reference, the machine is an Intel Core i5 780 2.80 GHz with 8GB RAM, running SBCL on Microsoft Windows 7 64-bit. No special optimisation is used.

We’ll measure the time it takes for 100,000 runs of finding the primes to 1,000. First, the imperative version:

CL-USER> (time (loop repeat 100000 do (imperative-prime 1000)))
Evaluation took:
  2.195 seconds of real time
  2.199614 seconds of total run time (2.152814 user, 0.046800 system)
  [ Run times consist of 0.079 seconds GC time, and 2.121 seconds non-GC time. ]
  100.23% CPU
  6,162,712,538 processor cycles
  236,807,008 bytes consed

Finding primes to 1,000 by our imperative method takes about 22 microseconds. The garbage collection and consing are probably to do with building the output primes list.

Now the trial division recursive version:

CL-USER> (time (loop repeat 100000 do (functional-prime-1 1000)))
Evaluation took:
  37.963 seconds of real time
  37.783443 seconds of total run time (37.580641 user, 0.202802 system)
  [ Run times consist of 2.188 seconds GC time, and 35.596 seconds non-GC time. ]
  99.53% CPU
  106,557,807,692 processor cycles
  13,168,932,696 bytes consed

About 380 microseconds to find the primes to 1,0000: about 17 times slower than the imperative version but still not bad in absolute terms. Garbage collection is less than 1% of runtime.

What about the functional Eratosthenes Sieve?

CL-USER> (time (loop repeat 100000 do (functional-prime-2 1000)))
Evaluation took:
  575.893 seconds of real time
  574.380082 seconds of total run time (570.636058 user, 3.744024 system)
  [ Run times consist of 36.774 seconds GC time, and 537.607 seconds non-GC time. ]
  99.74% CPU
  1,616,532,834,439 processor cycles
  274,064,489,504 bytes consed

About 5.8 milliseconds: 15 times slower than the trial by division functional version, 260 times slower than the imperative version. Also, the garbage collector kicked in for about 6% of the runtime and there was a lot of consing, as expected.

For our functional Sieve, using alternative data structures (arrays of unboxed integers instead of lists for the sieves) and reducing the time to search for the next prime by for example truncating the sieve at the last prime found may speed things up, at the expense of implementation complexity, but there is still a large amount of data copying going on.

(In contrast, on the 6MHz 68000-based system I built in the mid-1980s, the Sieve of Eratosthenes written in assembler took about 100ms to calculate the primes to 1,000.)

Conclusions

Functional programming can be regarded as a series of operations on data, with the flow of data through a processing pipeline of functions being emphasised. Where it tends to fall down is where we are required to apply repeated change to large collections of data, when the “no mutable data” means that in-place updates of items is eschewed and leads to huge amounts of copying instead. Where an algorithm (or data structure) that is a good fit for the functional paradigm exists (trial by division here) then the performance of the functional version is acceptable. However, the Sieve of Erathosthenes is an explicitly imperative algorithm, and attempting a naive functional implementation can be time consuming and the result is slow; it’s probably not worth the effort.

Posted in Computing, Lisp | 8 Comments

Installing Vector Linux on an old PC

Edit: I’ve since had to remove Vector Linux since updating the packages appeared to cause a dependency problem as a result of which a shared library couldn’t be found by a service at boot time and I could no longer start X windows. I’ll leave this post for posterity, however, on the grounds that it could well be a user error on my part.

Some years ago I bought an old IBM NetVista very cheaply for use as a low-cost Linux hacking box. Its small size makes it a good fit for the limited space I have – it measures just 30cm wide and less than 9cm high, and its case is rigid enough to serve as a monitor stand. It has an Intel Pentium III CPU running at 730 MHz, 294 MB of RAM, a 20GB hard disk and a low-profile CD-ROM drive. It still has a “Windows 2000 Professional” sticker on it, so was probably used in an office in its previous life. Not much horsepower, but more than I had available in the mid-1990s.

I used a version of Debian Linux with KDE3 on it for a while before letting it go fallow, but recently wanted to use it again, not having the funds to build myself a compact “PC” machine or to buy a Mac Mini, and having had issues with virtual machines in the past. However, the latest version of Debian was couldn’t be installed – the size of the OS notwithstanding as soon as I fired up the installer the screen would flicker uncontrollably. I don’t particularly want a huge bloated OS with lots of packages. In the early 1990s I used Sun SPARCstations running TWM as the desktop environment; basically enough to give me lots of terminal windows, an xclock and emacs. Netscape for web browsing, the email client was pine or elm (run in a terminal window) with xbiff for mail notification, and xview for viewing images. When hacking at work I use a MacPro with — lots of terminal windows and vim — so I’d be happy with that baseline.

I came across Vector Linux and decided to give the “Light” version a go. This is based on the old Slackware, a distro I was using in the mid 1990s and liked a lot for its simplicity, and comes with LXDE, IceWM or JWM desktops. The philosophy appears to be avoiding bloat by giving you a basic set of packages and then letting you extend or customise the system as you want.

Install

The ISO image fits nicely on a CD-ROM. The installer is curses-based text so no attempts to enter a GUI and no attempts to enter an unsupported video mode – hooray! There appears to be no net installer, but that would have been icing on the cake.

It tries to keep things simple but whether this is straightforward for ease of use by beginners or for simplicity of implementation is hard to fathom. The options are more limited than other OSes, and it will please neither camp; some things like having you specify which services to run in which runlevels will bewilder novices, but not supporting LVM and only allowing a limited set of partitioning options with no scope for customisation will annoy experienced users. (Partitioning the hard disk also requires a reboot; something I’ve not seen in other installers.)

The installer tries to use informal English in places to keep things informal and perhaps sound cool and trendy, and during the package installation process the “handles” of the team scroll across the top of the screen. I found the language slightly annoying and it could be confusing for some, especially non-native Engish speakers.

For package selection your choices are text-based (console/terminal) or GUI-based system, server or desktop, minimal or full. Choosing graphical, desktop, full gives a system with under 250MB in the root partition, a little over 3GB in /usr and about 50MB in /var. Vector claims that with fewer options the Light version can install into 1.8GB.

One nice feature is that at boot time you can have a choice of booting into a text console instead of a GUI, which is fine for “quick and dirty” operations (and you can always fire up X11 by typing startx anyway).

Updates

Once the base system has been installed, it’s time to update
packages to the latest version. Vector Linux doesn’t appear
to have an automatic updater, so once you’ve got networking working it’s time to patch. As root, run /sbin/slapt-get --update to update the local patch database followed by /sbin/slapt-get --upgrade to update the installed packages.

Using Japanese Keyboards

One cause of software bloat is internationalisation — supporting every language under the sun plus a few more like Tengwar or Klingon. Vector linux keeps things small and simple by the Henry Ford approach — you can have any language you want so long as it’s English, and if you want something else do it yourself. However, living in Japan I use a Japanese keyboard but the configuration tools don’t give you the option (though they do support lots of other non-English keyboards), although grovelling around in /usr/share/kbd showed the keymap existed. Digging around Japanese language pages on Slackware revealed the solution:

  1. Edit /etc/rc.d/rc.local and add the following lines at the bottom:
    if [ -x /usr/bin/loadkeys ]; then
       /usr/bin/loadkeys jp106
    fi
  2. Edit /etc/X11/xorg.conf.d/90-keyboard-layout.conf
    as follows (changed parts shown underlined):

    Section "InputClass"
      Identifier "keyboard-all"
      Driver "evdev"
      Option "XkbLayout" "jp"
      Option "XkbModel" "jp106"
      Option "XkbRules" "xorg"
      Option "XkbOptions" "terminate:ctrl_alt_bksp"
      MatchIsKeyboard "on"
    EndSection

(There’s an old guide for adding Japanese support on the forum: there may be more up-to-date information somewhere.)

Package Management

Package management is a modified Slackware system and uses slapt-get in text mode, Gslapt in graphical mode. You can add Slackware respositories to the sources by simply selecting from a list, which gives a bit more scope, but Slackware packages do not have dependency checking. (The Vector packages have “vl70″ appended to their names to distinguish them.)

Vector provides enough basic packages but don’t expect a comprehensive a range as Slackware or Debian, due to the smaller size of the organisation and its community. What is provided is supposed to be stable, and I tend to like to build stuff from source and install it in /usr/local or my home directory so as not to interfere with the stable core provided by the OS maintainers, so this is not an issue for me.

I’ve added emacs and CMUCL from the Vector Linux extras repository, but the system has a basic C/C++ software development framework so building other packages from source shouldn’t be an issue.

System Administration Tools

Traditionally, Slackware has not provided much in the way of GUI tools for system administration to keep things simple. Vector Linux adds a tool called VASM for basic day-to-day administration tasks. Operated in “user mode” it allows you to change password, set the default window manager and restore default settings. In “superuser” mode it lets you set up hardware, network, window manager, services, users, firewall and filesystem. For some tricky tasks this can take out some of the drudgery and changes of screwing things up. Otherwise, it’s based on Slackware underneath and so should be straightforward to tinker with just using a text editor.

Random Points

  • Good point: the ssh daemon configuration file /etc/ssh/sshd_config has the PermitRootLogin authentication option commented out by default.
  • Bad point: the /etc/nfsmount.conffile specified the default protocol version to start negotiation with my server as version 3. Had to comment out the line Defaultvers=3 to get it to mount an nfs share from my home fileserver (Fedora box). Took me a while to work out why NFS wasn’t working.

Summary

So far, installation and set up of Vector Linux have been fairly painless, modulo the keyboard issue which was soon addressed. Hopefully it should provide a stable platform for doing basic software hacking on an old machine without having to raid the piggy bank, which was my purpose in the first place.

Posted in Computing | Leave a comment

Getting Started with Common Lisp — Part 9: Further Reading

I’ve found the following to be useful for learning and as on-line reference. I intend to keep this page as a “living document” and update it as I find new material.

Books:

  • Practical Common Lisp by Peter Seibel. Available on-line (for free) as well as in print.
  • Land of Lisp by Conrad Barski. Cartoons and humour make this book very approachable, and it guides you through making programs that do interesting things rather than boring “compute Fibbonacci numbers” exercises. Touches on most of the interesting and practical aspects of the language as well, including string handling, I/O, macros, OOP, lazy evaluation, functional programming, and restarts.
  • Paradigms in Artificial Intelligence Programming: Case Studies in Common Lisp by Peter Norvig.

On-line resources:

  • The Common Lisp Hyperspec (TM): On-line reference material based on the Common Lisp standard.
  • Cliki: resources for learning about and using Common Lisp, including links to a collection of libraries.
Posted in Computing, Lisp | Leave a comment

Getting Started with Common Lisp — Part 8: Optimisation

In previous posts in this series we have developed a short program that generates a Mandelbrot set and writes it to a file. Simply compiling our Common Lisp program can make it run relatively quickly but there are times when we will need it to run quicker. Here we will look at compiler optimisations.

Substantial improvements to a program’s speed can often be gained by selecting alternative algorithms or data structures and avoiding or delaying computation. In our simple Mandelbrot program, reducing the number of iterations before we judge a value to have diverged (i.e., setting +max-iterations+ to a lower value) will result in a substantial speedup at minimal cost, albeit with a loss of accuracy close to the edge of the Mandelbrot set (a greater risk that a points will incorrectly be judged to be outside the Mandlebrot set). This is not a tutorial on optimisation, but on how what language facilities Common Lisp provides to allow programs to run faster.

Common Lisp is dynamically typed, and that gives it a great deal of flexibility. However, there is a cost to this; data objects must be “boxed” (wrapped in a small object that provides type and other information) and functions such as arithmetic must check their argument types, perform any type conversions required and then select the right piece of code to perform the operation. Common Lisp allows us to add type information that can allow compilers to produce more efficient, tighter code, and in some cases use unboxed values. We can also tune the compilation process.

Profiling

Optimisation is likely to be a waste of time unless it is targeted appropriately. We need to find out where the program is spending most of its time and work on reducing that. Our program is simple enough that we can tell by inspection that the program must spend most of its time in mandelbrot-eval, which will be called for every pixel in the image, and in terminate-p and mandelbrot-function, which will be called multiple times for each pixel (up to +max-iterations+ times). These three functions are therefore the most likely candidates.

However, intuition can be wrong with things like computer programs, especially as programs get more complex, and we therefore need profiling information to back up our guess. Unfortunately, beyond the time function we met earlier, the Common Lisp standard does not prescribe many profiling facilities. However, many Lisp implementations do provide profiling tools: SBCL, for example, provides two profilers, a “deterministic profiler” (which collects information for specific functions) profiler and a “statistical profiler“. There are also libraries that can be used, and for simple function call counting and time accounting, some programmers also roll their own simple profiling routines.

The source file we’re trying to optimise, mandelbrot-png.lisp, can be found on github. Let’s use SBCL’s statistical profiler to confirm our suspicions. First, we need to set up the environment:

* (ql-quickload "zpng")
To load "zpng":
  Load 1 ASDF system:
    zpng
; Loading "zpng"

("zpng")
* (require :sb-sprof)

("SB-SPROF")
* (and (compile-file "mandelbrot-png") (load "mandelbrot-png"))

; compiling file "/home/mark/mandelbrot-png.lisp" (written 04 JAN 2013 10:12:22 AM):
; compiling (DEFPACKAGE :MANDELBROT-PNG ...)
...
T
*

We first told Quickload to load the necessary libraries (zpng in this case). The (require :sb-sprof) tells Lisp to import the :sb-profile package which we will need for statistical profiling. We then compile and load our program by evaluating (and (compile-file "mandelbrot-png") (load "mandelbrot-png")). Because the Lisp and function uses “short-circuit” evaluation the compiled file will not be loaded unless the compilation succeeds. (Perl users will be familiar with this idiom.)

Now for profiling. Because this can generate a lot of output, and we’ll want to refer to it, it’s best to “dribble” the output into a file.

* (dribble "profile.log")

* (sb-sprof:with-profiling (:reset t :report :flat :loop nil)
    (mandelbrot:write-file "brot.png" 1000))


Number of samples:   1766
Sample interval:     0.01 seconds
Total sampling time: 17.66 seconds
Number of cycles:    0
Sampled threads:
 #

           Self        Total        Cumul
  Nr  Count     %  Count     %  Count     %    Calls  Function
------------------------------------------------------------------------
   1    275  15.6    653  37.0    275  15.6        -  SB-KERNEL:TWO-ARG-*
   2    245  13.9    467  26.4    520  29.4        -  SB-KERNEL:TWO-ARG-+
   3    190  10.8    190  10.8    710  40.2        -  REALPART
   4    138   7.8    138   7.8    848  48.0        -  IMAGPART
   5    122   6.9    397  22.5    970  54.9        -  ABS
   6     70   4.0     99   5.6   1040  58.9        -  SB-VM::GENERIC-+
   7     59   3.3     59   3.3   1099  62.2        -  SB-KERNEL:TWO-ARG--
   8     38   2.2     38   2.2   1137  64.4        -  SB-KERNEL:%DOUBLE-FLOAT
   9     37   2.1    453  25.7   1174  66.5        -  TERMINATE-P
  10     24   1.4   1625  92.0   1198  67.8        -  MANDELBROT-EVAL
  11     21   1.2     30   1.7   1219  69.0        -  SB-KERNEL:TWO-ARG->
  12     15   0.8   1115  63.1   1234  69.9        -  MANDELBROT-FUNCTION
  13     13   0.7     13   0.7   1247  70.6        -  SB-KERNEL:TWO-ARG-GCD
  14      7   0.4      9   0.5   1254  71.0        -  SB-BIGNUM:BIGNUM-TRUNCATE
  15      7   0.4      7   0.4   1261  71.4        -  ASH
  16      5   0.3      5   0.3   1266  71.7        -  TRUNCATE
  17      4   0.2     37   2.1   1270  71.9        -  SB-KERNEL::FLOAT-RATIO
  18      4   0.2      6   0.3   1274  72.1        -  SB-KERNEL::INTEGER-/-INTEGER
  19      4   0.2      4   0.2   1278  72.4        -  SB-BIGNUM::BIGNUM-ASHIFT-LEFT-UNALIGNED
  20      4   0.2      4   0.2   1282  72.6        -  SB-KERNEL:SCALE-SINGLE-FLOAT
  21      4   0.2      4   0.2   1286  72.8        -  INTEGER-LENGTH
  22      4   0.2      4   0.2   1290  73.0        -  SB-KERNEL::SINGLE-FROM-BITS
  23      3   0.2   1748  99.0   1293  73.2        -  WRITE-FILE
  24      3   0.2      3   0.2   1296  73.4        -  SB-BIGNUM::%NORMALIZE-BIGNUM
...
#
* (dribble)

*
Unfortunately, the statistical profiler does not appear to work on my Microsoft Windows 7 64-bit installation for SBCL 1.1.2. There is a Windows fork of SBCL by A. Kovalenko which adds threading and other enhancements for the Windows platform, which may work.

We can see our functions terminate-p, mandelbrot-eval and mandelbrot-function are near the top of the list, and cumulatively our program is spending just under 70% of its time in the inner loop (mandelbrot-function). There is also a lot of type conversion going on (SB-KERNEL:%DOUBLE-FLOAT), and a lot of time is being spent in SB-VM:GENERIC-+, a non-specialised addition function that accepts arbitary numeric types. If we can provide type information it might allow the compiler to eliminate type conversions and to use a faster, more specialised addition code in the inner loop.

Optimisation Directives

We can suggest to the compiler where to concentrate its efforts using the optimize declaration. optimize allows us to assign a “weight” value in the range 0–3 to various qualities — speed, compilation-speed, safety, debug and space — where 0 meaning the quality is totally important, 1 being neutral and 3 being very important. For example, we could use

(declare (optimize (speed 3) (safety 1)))

to request the compiler to concentrate on speed and provide only minimum run-time error checking.

It seems to be generally held that using (safety 0) is not a good idea, the marginal speed gain being not worth the loss of program robustness.

Optimisation can be applied to invidual functions (or even forms) or to entire source files, depending on where the declaration is placed. If we placed the above declaration at the top of our source file the compiler would try to optimise all the functions in the file. However, it is more frequent practice to use optimisation declarations for sections where it is deemed to be critical.

Inlining functions

We can suggest that the compiler inlines functions using the inline declaration. (The compiler is at liberty to ignore this, however.) The functions mandelbrot-function and terminate-p are good candidates for inlining since only called from inside mandelbrot-eval and are in the inner loop. We could easily write them directly inside mandelbrot-eval but instead of re-writing the code (possibly making it harder to understand) we can suggest that the compiler inlines them. Adding the declaration

(declaim (inline mandelbrot-function))

before its defun form will suggest that the compiler inlines mandelbrot-function.

Type Declarations

We can declare the types of variables with the type declaration, and the types of functions (arguments and return types) with the ftype declaration. How much attention is paid to type declarations depends on the compiler implementation; the extreme cases being  either all declarations are ignored and everything is checked at runtime, or declarations are respected and no type checking is done. With SBCL, precise type checking is carried out at runtime and if there is an inconsistency with the type declaration, an error occurs. This allows programmers to make use of a strict typing if desired. However, SBCL will not perform any automatic type conversions in this case, meaning that the programmer has to take much greater care with types.

Optimising for speed causes some implementations, including SBCL, to produce a lot of output during compilation to further assist optimisation, and this information is very useful in helping us to determine where to place type information. Again, dribble can be used to capture the compiler output for analysis.

A drawback of type declarations is that they add considerably to the verbosity of the source program and mar some of the elegance and simplicity of Lisp code. Type declarations are best used sparingly and only where required; for efficiency, if strict typing is desired, or for documentation, for example.

Applying to our program

Here are mandelbrot-function and terminate-p with annotations.

(declaim (inline mandelbrot-function))
(declaim (ftype (function 
		 ((complex double-float) (complex double-float)) (complex double-float))
		   mandelbrot-function))
(defun mandelbrot-function (z c)
  (declare (optimize (speed 3) (safety 1)))
  (+ c (* z z)))

(declaim (inline terminate-p))
(defun terminate-p (niter z)
  (declare (optimize (speed 3) (safety 1)))
  (declare (type (unsigned-byte 8) niter)
	   (type (complex double-float) z))
  (or (= niter +max-iterations+) 
      (> (abs z) 2.0)))

Optimisations have also been made to mandelbrot-eval, and compute-mandelbrot-set has had some type information and coercions added to satisfy the type system. The resulting modified source file mandelbrot-png-opt.lisp can also be downloaded from github.

The compiler diagnostics provide information on the effects of our type declarations. Here’s the compiler output before type declarations for mandelbrot-function

; in: DEFUN MANDELBROT-FUNCTION
;     (* MANDELBROT-PNG::Z MANDELBROT-PNG::Z)
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a RATIONAL.
;   The second argument is a NUMBER, not a FLOAT.
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a FLOAT.
;   The second argument is a NUMBER, not a RATIONAL.
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a SINGLE-FLOAT.
;   The second argument is a NUMBER, not a DOUBLE-FLOAT.
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a DOUBLE-FLOAT.
;   The second argument is a NUMBER, not a SINGLE-FLOAT.
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a (COMPLEX SINGLE-FLOAT).
;   The second argument is a NUMBER, not a (COMPLEX SINGLE-FLOAT).
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a (COMPLEX SINGLE-FLOAT).
;   The second argument is a NUMBER, not a REAL.
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a REAL.
;   The second argument is a NUMBER, not a (COMPLEX SINGLE-FLOAT).
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a (COMPLEX DOUBLE-FLOAT).
;   The second argument is a NUMBER, not a (COMPLEX DOUBLE-FLOAT).
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a (COMPLEX DOUBLE-FLOAT).
;   The second argument is a NUMBER, not a REAL.
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a REAL.
;   The second argument is a NUMBER, not a (COMPLEX DOUBLE-FLOAT).
; 
; note: unable to
;   convert x*2^k to shift
; due to type uncertainty:
;   The first argument is a NUMBER, not a INTEGER.
;   The second argument is a NUMBER, not a INTEGER.

;     (+ MANDELBROT-PNG::C (* MANDELBROT-PNG::Z MANDELBROT-PNG::Z))
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a RATIONAL.
;   The second argument is a NUMBER, not a FLOAT.
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a FLOAT.
;   The second argument is a NUMBER, not a RATIONAL.
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a SINGLE-FLOAT.
;   The second argument is a NUMBER, not a DOUBLE-FLOAT.
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a DOUBLE-FLOAT.
;   The second argument is a NUMBER, not a SINGLE-FLOAT.
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a (COMPLEX SINGLE-FLOAT).
;   The second argument is a NUMBER, not a (COMPLEX SINGLE-FLOAT).
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a (COMPLEX SINGLE-FLOAT).
;   The second argument is a NUMBER, not a REAL.
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a REAL.
;   The second argument is a NUMBER, not a (COMPLEX SINGLE-FLOAT).
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a (COMPLEX DOUBLE-FLOAT).
;   The second argument is a NUMBER, not a (COMPLEX DOUBLE-FLOAT).
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a (COMPLEX DOUBLE-FLOAT).
;   The second argument is a NUMBER, not a REAL.
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a REAL.
;   The second argument is a NUMBER, not a (COMPLEX DOUBLE-FLOAT).

;     (* MANDELBROT-PNG::Z MANDELBROT-PNG::Z)
; 
; note: forced to do GENERIC-* (cost 30)
;       unable to do inline float arithmetic (cost 3) because:
;       The first argument is a T, not a SINGLE-FLOAT.
;       The second argument is a T, not a SINGLE-FLOAT.
;       The result is a (VALUES NUMBER &OPTIONAL), not a (VALUES SINGLE-FLOAT
;                                                                &REST T).
;       unable to do inline float arithmetic (cost 3) because:
;       The first argument is a T, not a DOUBLE-FLOAT.
;       The second argument is a T, not a DOUBLE-FLOAT.
;       The result is a (VALUES NUMBER &OPTIONAL), not a (VALUES DOUBLE-FLOAT
;                                                                &REST T).
;       etc.

;     (+ MANDELBROT-PNG::C (* MANDELBROT-PNG::Z MANDELBROT-PNG::Z))
; 
; note: forced to do GENERIC-+ (cost 10)
;       unable to do inline float arithmetic (cost 2) because:
;       The first argument is a T, not a DOUBLE-FLOAT.
;       The second argument is a NUMBER, not a DOUBLE-FLOAT.
;       The result is a (VALUES NUMBER &OPTIONAL), not a (VALUES DOUBLE-FLOAT
;                                                                &REST T).
;       unable to do inline float arithmetic (cost 2) because:
;       The first argument is a T, not a SINGLE-FLOAT.
;       The second argument is a NUMBER, not a SINGLE-FLOAT.
;       The result is a (VALUES NUMBER &OPTIONAL), not a (VALUES SINGLE-FLOAT
;                                                                &REST T).
;       etc.

Note that the compiler is being forced to do expensive generic arithmetic operations because the types could be anything. After type declarations the compiler outputs:

 in: DEFUN MANDELBROT-FUNCTION
;     (DEFUN MANDELBROT::MANDELBROT-FUNCTION (MANDELBROT::Z MANDELBROT::C)
;       (DECLARE (OPTIMIZE (SPEED 3) (SAFETY 1)))
;       (+ MANDELBROT::C (* MANDELBROT::Z MANDELBROT::Z)))
; --> PROGN EVAL-WHEN 
; ==>
;   (SB-IMPL::%DEFUN 'MANDELBROT::MANDELBROT-FUNCTION
;                    (SB-INT:NAMED-LAMBDA MANDELBROT::MANDELBROT-FUNCTION
;                        (MANDELBROT::Z MANDELBROT::C)
;                      (DECLARE (OPTIMIZE (SPEED 3) (SAFETY 1)))
;                      (BLOCK MANDELBROT::MANDELBROT-FUNCTION
;                        (+ MANDELBROT::C (* MANDELBROT::Z MANDELBROT::Z))))
;                    NIL
;                    '(SB-C:LAMBDA-WITH-LEXENV NIL NIL NIL
;                      (MANDELBROT::Z MANDELBROT::C)
;                      (DECLARE (OPTIMIZE (SPEED 3) (SAFETY 1)))
;                      (BLOCK MANDELBROT::MANDELBROT-FUNCTION
;                        (+ MANDELBROT::C (* MANDELBROT::Z MANDELBROT::Z))))
;                    (SB-C:SOURCE-LOCATION))
; 
; note: doing complex float to pointer coercion (cost 13) to "<return-value>"

We can also use the disassemble function to look at the generated assembly or bytecode directly to get a better idea of what the compiler is doing:

* (disassemble 'mandelbrot::mandelbrot-function)
; disassembly for MANDELBROT::MANDELBROT-FUNCTION
; 0B24E105:       DDD9             FSTPD FR1                  ; no-arg-parsing entry point
;       07:       DD4201           FLDD [EDX+1]
;       0A:       D9C9             FXCH FR1
;       0C:       DDDB             FSTPD FR3
;       0E:       DD4209           FLDD [EDX+9]
;       11:       D9CB             FXCH FR3
;       13:       DDDC             FSTPD FR4
;       15:       DD4201           FLDD [EDX+1]
;       18:       D9CC             FXCH FR4
;       1A:       DDDA             FSTPD FR2
;       1C:       DD4209           FLDD [EDX+9]
;       1F:       D9CA             FXCH FR2
;       21:       DDD8             FSTPD FR0
;       23:       D9C0             FLDD FR0
;       25:       D8CC             FMULD FR4
;       27:       DDD5             FSTD FR5
;       29:       DDD8             FSTPD FR0
;       2B:       D9C2             FLDD FR2
;       2D:       D8CA             FMULD FR2
;       2F:       9B               WAIT
;       30:       DCED             FSUB-STI FR5
;       32:       9B               WAIT
;       33:       DDD8             FSTPD FR0
;       35:       D9C1             FLDD FR1
;       37:       DCC9             FMUL-STI FR1
;       39:       9B               WAIT
;       3A:       DDD8             FSTPD FR0
;       3C:       D9C2             FLDD FR2
;       3E:       D8CC             FMULD FR4
;       40:       9B               WAIT
;       41:       D8C1             FADDD FR1
;       43:       9B               WAIT
;       44:       D9CD             FXCH FR5
;       46:       DDD2             FSTD FR2
;       48:       D9CD             FXCH FR5
;       4A:       DDD3             FSTD FR3
;       4C:       DDD8             FSTPD FR0
;       4E:       DD4701           FLDD [EDI+1]
;       51:       D9CA             FXCH FR2
;       53:       DDD1             FSTD FR1
;       55:       D9CA             FXCH FR2
;       57:       D8C1             FADDD FR1
;       59:       DDD4             FSTD FR4
;       5B:       DDD8             FSTPD FR0
;       5D:       DD4709           FLDD [EDI+9]
;       60:       D9CB             FXCH FR3
;       62:       DDD1             FSTD FR1
;       64:       D9CB             FXCH FR3
;       66:       D8C1             FADDD FR1
;       68:       DDD2             FSTD FR2
;       6A:       DDD8             FSTPD FR0
;       6C:       D9C3             FLDD FR3
;       6E:       D9CA             FXCH FR2
;       70:       DDD1             FSTD FR1
;       72:       D9CA             FXCH FR2
;       74:       64               FS-SEGMENT-PREFIX
;       75:       892D5C000000     MOV [#x5C], EBP
;       7B:       BA18000000       MOV EDX, 24
;       80:       64               FS-SEGMENT-PREFIX
;       81:       031534000000     ADD EDX, [#x34]
;       87:       64               FS-SEGMENT-PREFIX
;       88:       3B1538000000     CMP EDX, [#x38]
;       8E:       7607             JBE L0
;       90:       E84B98E1FC       CALL #x80679E0             ; alloc_overflow_edx
;       95:       EB0A             JMP L1
;       97: L0:   64               FS-SEGMENT-PREFIX
;       98:       891534000000     MOV [#x34], EDX
;       9E:       83EA18           SUB EDX, 24
;       A1: L1:   8D5207           LEA EDX, [EDX+7]
;       A4:       C742F922050000   MOV DWORD PTR [EDX-7], 1314
;       AB:       DD5201           FSTD [EDX+1]
;       AE:       D9C9             FXCH FR1
;       B0:       DD5209           FSTD [EDX+9]
;       B3:       D9C9             FXCH FR1
;       B5:       64               FS-SEGMENT-PREFIX
;       B6:       312D5C000000     XOR [#x5C], EBP
;       BC:       7402             JEQ L2
;       BE:       CC09             BREAK 9                    ; pending interrupt trap
;       C0: L2:   8BE5             MOV ESP, EBP
;       C2:       F8               CLC
;       C3:       5D               POP EBP
;       C4:       C3               RET
;       C5:       CC0A             BREAK 10                   ; error trap
;       C7:       02               BYTE #X02
;       C8:       18               BYTE #X18                  ; INVALID-ARG-COUNT-ERROR
;       C9:       4F               BYTE #X4F                  ; ECX
;       CA:       CC0A             BREAK 10                   ; error trap
;       CC:       02               BYTE #X02
;       CD:       2B               BYTE #X2B                  ; OBJECT-NOT-COMPLEX-DOUBLE-FLOAT-ERROR
;       CE:       90               BYTE #X90                  ; EDX
;       CF:       CC0A             BREAK 10                   ; error trap
;       D1:       04               BYTE #X04
;       D2:       2B               BYTE #X2B                  ; OBJECT-NOT-COMPLEX-DOUBLE-FLOAT-ERROR
;       D3:       FED001           BYTE #XFE, #XD0, #X01      ; EDI
*

Lots of floating point operations in registers, which is pretty reasonable given that we’re doing complex number arithmetic. Specifying (safety 0) would remove the checking code but would not likely have much impact on performance.

Note that because we’re now using a package and are not exporting the mandelbrot-function symbol, we have to prepend the package name and a double colon when referring to the mandelbrot-function symbol in the disassemble form. Packages make life easier for maintaining large programs but also make some operations a little more fiddly.

Of course, the final proof of the pudding is in the eating. Let’s see what the effect was on the actual performance of our program.

* (time (mandelbrot:write-file "test.png" 1000))
Evaluation took:
  2.546 seconds of real time
  2.424152 seconds of total run time (2.408151 user, 0.016001 system)
...

An approximately 7x speed up, and that’s shipping the file over an NFS mount!

Final Words

Given such a result it would be tempting to go back to our program and add type and function declarations all over but there are a number of reasons that it would not be a wise move. First, it takes a time and effort to analyse the program (working through the compiler diagnostics) and add the type declarations. Second, with SBCL at least there is a ripple-up impact in that caller functions must pass exactly the right types or a run-time error will occur. Third, cluttering the program with type declarations makes it more verbose and opaque. Fourth, in a program such as this, with most of the time spent in the inner loop and the remaining time probably dominated by disk I/O, it would be a waste of effort. Optimisation should therefore be applied only where needed.

Posted in Computing, Lisp | Leave a comment