Newsgroups: comp.lang.apl
Path: watmath!watserv1!70530.1226@compuserve.com
From: Mike Kent <70530.1226@CompuServe.COM>
Subject: APL execution efficiency revisited
Message-ID: <920320080728_70530.1226_CHC73-1@CompuServe.COM>
Sender: payne@watserv1.waterloo.edu (Doug Payne [DCS])
Organization: University of Waterloo
Date: Fri, 20 Mar 1992 08:07:29 GMT
Lines: 186

The APL-to-FORTRAN translator developed at IBM Research (Yorktown
Heights) and then called the YAT (Yorktown APL Translator) is available
for mainframe (at least for VM) APL2 users, though not from IBM; the 
vendor is InterProcess Systems, Inc.  They are in Atlanta, and I don't
recall their street address or phone number but a call to Atlanta
information will get you the number (and since they are an IBM Business
Partner, perhaps one of the IBMers who follows this newsgroup can post
this info).  (They are good people to do business with: honest,
straightforward, fair, and pleasant; their other APL add-ons are worth a
good hard look as well, especially their integrated editor/debugger
which I have used to my considerable benefit).  I do *not* have any
personal experience with this product as I was not able to convince my
previous employer to buy it (I thought it was a no-brainer: we had a $2E6
internal timesharing bill in the department, and the product costs around
$10K or $15K as I recall, so a 5% overall application speedup would
recover software acquisition and application tweaking costs easily within
a year).

Changes in execution time reportedly range from slight degradation to 50-
or 100-fold improvement, with 3- to 5-fold improvement being typical.  The
compiler is *not* an APL2 compiler (though its output is accessed via the
APL2 []NA interface); the code to be compiled should be ISO "flat" APL;
no shared variable code, please.  Gains are said to be rather small on
well-written, tuned, highly-parallel APL code running against large
arrays, but quite large on loopy scalar code where APL performance is
generally poor.

My experience with optimizing APL code indicates that there is a strong
Pareto-type (80/20) law:  large fractions of the time are spent in small
fractions of the application.  InterProcess specifically suggests that
compilation should be limited to such bottlenecks.

The compiler produces the FORTRAN intermediate code and does NOT discard
it after FORTRAN compilation; this provides an alternative methodology (to
calling into APL) for those who might like to develop in APL and call from
other languages.

STSC, address provided in yesterday's responses to STATGRAPHICS queries,
offers a compiler with similar benefits and limitations as an add-on to
their APL*PLUS mainframe interpreter; the STSC compiler, however,
produces object code which must be run from within an APL*PLUS workspace
as it may make calls to certain interpreter services.  Performance
characteristics and caveats are generally as for the IBM/InterProcess
product.  Again, I have no personal experience with this product; the
description is hearsay (but I heard directly from the STSC developer and
from one of their clients).

The performance caveats and characteristics, and the type of results in my
previous article, illustrate something that APL users often forget:  APL
interpreters provide in their language primitives highly-optimized
versions of what would generally be library routines in other languages. 
I've been writing APL code for a living for over a dozen years now (which
doesn't make me an old-timer, but does put me past the arriviste stage),
and have been involved primarily in medium- to large-scale applications
for interactive use by non-programmers, mostly computer-illiterate or
close to it.  Acceptance of such applications is often grudging if the
response time is long (or more or less equivalently, if the internal
timesharing charges are high), so a good part of my work has been
concerned with improving execution efficiency.  In some cases, I have just
been stuck, unable to make a significant improvement to slow parts of an
app, but in many other cases improvements, often dramatic, have been
possible with more or less effort.  A few typical bottlenecks come up
often enough that I think it might be worth discussing them here.

[1]  Arrays, typically matrices, which grow by repeated catenation, often
     a row at a time.

     This is an absolute killer; it forces the interpreter to spend far 
     more time on storage management and data movement than on the actual
     problem being solved.  When the ultimate size of the array can be 
     determined in advance, allocating it up front and using indexed 
     assignment reduces the execution time from quadratic in the final size to
     linear, and reduces the leading coefficient as well.  If the size 
     can't be determined in advance, increasing the size by relatively 
     large jumps and using indexed assignment (with a final "take" to 
     discard allocated unused rows) will generally pay off handsomely,
     at the cost of some extra complexity in the code.
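For readers more at home in array languages other than APL, the
preallocate-grow-and-take strategy translates directly to NumPy.  This is
only an illustrative sketch (the function and variable names are mine, not
from any product):

```python
import numpy as np

def collect_rows(row_iter, ncols):
    """Accumulate rows into a matrix by doubling the allocation,
    instead of catenating one row at a time (which is quadratic
    in data movement)."""
    buf = np.empty((4, ncols))      # up-front allocation
    n = 0                           # rows actually used so far
    for row in row_iter:
        if n == buf.shape[0]:       # out of room: double the buffer
            buf = np.vstack([buf, np.empty_like(buf)])
        buf[n] = row                # indexed assignment, no catenation
        n += 1
    return buf[:n]                  # the final "take": drop unused rows

result = collect_rows(([i, i * i] for i in range(10)), 2)
```

The doubling schedule keeps the total data moved linear in the final size,
at the price of at most a 2x transient over-allocation.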

[2]  Column-oriented assignments into and fetches from large matrices.

     Since APL allocates storage in row-major order, row-oriented
     operations are generally faster (especially on virtual-memory
     machines, where accessing a column may require considerable paging,
     increasing CPU time slightly and wall-clock time dramatically).
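NumPy shares APL's row-major layout, so the memory-stride arithmetic behind
this advice is easy to exhibit (a sketch; the variable names are mine):

```python
import numpy as np

# NumPy, like APL, stores matrices in row-major (C) order.
M = np.zeros((1000, 1000))          # float64, so 8 bytes per element

# A row is contiguous: consecutive elements sit 8 bytes apart.
row_stride = M[0, :].strides[0]     # 8
# Consecutive elements of a column are a whole row apart: 8000 bytes,
# so scanning a column touches a different cache line (or, on a
# virtual-memory machine, potentially a different page) at every step.
col_stride = M[:, 0].strides[0]     # 8000
```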

[3]  OOPS, how did these integers get to be 8 bytes?

     Probably because integer keys or pointers were stored in the same
     array with floating-point values (in APL2, you can create 16-byte
     values by commingling pointers with complex numbers).  This has two
     bad effects on application performance:  storage requirements (and
     so storage management and data movement effects) go up dramatically,
     and many primitives are far more efficient on integers than on
     floats.  Dyadic iota is a prime example; when both arguments are
     integer arrays, it is generally very fast, while for floats it is
     generally unbearably slow (for large arrays).  The cure is to
     segregate keys and pointers from data, either as items in a nested
     array, or in separate variables.  The cost again is some extra
     complexity in the code, but the payoff again can be dramatically
     better execution times.
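The same silent-widening trap exists in NumPy, which makes it convenient to
illustrate (the array names are mine):

```python
import numpy as np

keys = np.arange(5)                       # integer keys, stored compactly
vals = np.array([1.5, 2.5, 3.5, 4.5, 5.5])

# Commingling the two silently promotes every key to float64:
mixed = np.concatenate([keys, vals])

# Keeping them segregated preserves the integer type, so searches
# stay fast and exact (np.searchsorted here stands in for the fast
# integer case of dyadic iota; keys happens to be sorted).
hit = np.searchsorted(keys, 3)
```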

[4]  Inner and outer products in which most of the entries are discarded,
     or which are used only as intermediates.

     The classic example here is reducing a matrix to its unique rows by
     the following beautiful but amazingly inefficient idiom:

          (1 1 transpose < \ M ^.= transpose M) / [1] M

     The compression is fine; it's the calculation of the bit vector
     which is the problem.  The solution is to make use of a defined
     "matrix dyadic iota" function which computes (rows of M) iota
     (rows of M) efficiently, call it MIOTA, and then do
 
         ((M MIOTA M) = iota 1 rho rho M) / [1] M

     See below for a discussion of such an MIOTA function.

[5]  Naive searching (failure to exploit grade or special properties
     of the array being searched).

     Finding the uniques in a vector by ((V iota V) = iota rho V) / V
     is fine for small V, or when dyadic iota is extremely fast, or 
     when the order of the uniques is important.  When V is large
     and order preservation is unimportant, try instead:

     [a]  sorting V and comparing adjacent items for inequality (using
          for instance, "V unequal 1 rotate V").

     [b]  in the very common case where V is a long vector of small
          (positive) integer keys or pointers, 

              B <- (1+ max / V) rho 0
              B[V] <- 1
              B/iota rho B
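That bit-table trick carries over directly to other array languages; a
NumPy sketch of the same three lines (names mine, result in ascending
order rather than order of appearance):

```python
import numpy as np

V = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])   # small positive integer keys

# APL:  B <- (1 + max/V) rho 0  ;  B[V] <- 1  ;  B / iota rho B
B = np.zeros(1 + V.max(), dtype=bool)
B[V] = True
uniques = np.flatnonzero(B)
```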
  
     For matrix dyadic iota (and vector dyadic iota for floats),
     merge the haystack with the needles, grade the result, permute
     a bit vector (1s for haystack, 0s for needles) by the grade
     vector, compare adjacent rows, and count permuted bits ...

     Compared to the idiomatic

        []IO + +/ and \ M or . unequal transpose M

     the merge-and-sort technique is faster for all but the smallest 
     arrays, and up to *several hundred* times as fast for (very) large
     arrays.  Further, the inner product generates an enormous bit
     array which can cause WS FULL when both matrices have large row
     counts.
 
     The technique has been discussed in the APL literature
     (e.g. in the December '91 NY/SIGAPL newsletter; in Gary Bergquist's
     book "APL:  Advanced Techniques and Utilities"), and occurs in the
     defined IOTA function (an easily adapted function which handles
     the case of numeric vectors) which IBM formerly distributed in
     the UTILITY workspace along with VS APL.
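For concreteness, here is one way the merge-and-grade MIOTA might be
sketched outside APL, in Python/NumPy.  This is my own illustrative
rendering of the technique described above, not the IOTA function from the
UTILITY workspace; lexsort plays the role of grade, and its stability
guarantees that, within a run of equal rows, haystack rows precede needle
rows:

```python
import numpy as np

def miota(haystack, needles):
    """(rows of needles) iota (rows of haystack) by merge-and-grade:
    no outer product, hence no enormous intermediate bit array."""
    h = len(haystack)
    merged = np.vstack([haystack, needles])
    # Grade the merged rows lexicographically; lexsort is stable, so
    # among equal rows the haystack rows (smaller indices) come first.
    order = np.lexsort(merged.T[::-1])
    result = np.full(len(needles), h)       # h means "not found"
    first = -1                              # first haystack index in this run
    for k, idx in enumerate(order):
        if k == 0 or not np.array_equal(merged[idx], merged[order[k - 1]]):
            first = -1                      # a new run of distinct row values
        if idx < h:
            if first == -1:
                first = idx
        elif first != -1:
            result[idx - h] = first
    return result

H = np.array([[1, 2], [3, 4], [1, 2], [5, 6]])
N = np.array([[3, 4], [9, 9], [1, 2]])
idx = miota(H, N)                           # [1, 4, 0]
```

With this in hand, the unique rows of M are those i for which
miota(M, M)[i] == i, exactly as in the defined-function idiom of item [4].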

[6]  Needlessly scalar-oriented code.

     I once replaced a graph-traversal algorithm (used for extracting
     such items as the salesmen who work for a sales office in a division
     of a firm, or all the G/L lines which contribute to "Total phone
     expenses") which iterated on each node as it went down the graph
     with an algorithm which iterated once for each depth level in the
     graph, and got better than 100-fold improvement in the cases the
     end users were complaining about (30 minutes of locked screen to
     answer the question "who are the salesmen in the XXX division?",
     reduced to around 10 or 15 seconds).
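The level-at-a-time idea is easy to show outside APL.  Here is a sketch in
Python/NumPy over a parent-pointer representation of the hierarchy; the
representation, the toy data, and the names are my invention, not the
application's:

```python
import numpy as np

# parent[i] is the node directly above node i; -1 marks the root.
# (A made-up 7-node hierarchy: 0 the firm, 1 and 2 divisions, etc.)
parent = np.array([-1, 0, 0, 1, 1, 2, 4])

def descendants(root, parent):
    """All nodes at or below `root`, iterating once per depth level
    of the graph rather than once per node."""
    members = np.zeros(len(parent), dtype=bool)
    frontier = np.array([root])
    while frontier.size:
        members[frontier] = True
        # One vectorized step finds every child of the whole frontier.
        frontier = np.flatnonzero(np.isin(parent, frontier))
    return np.flatnonzero(members)
```

The loop count is the depth of the hierarchy (typically a handful of
levels) instead of the node count (possibly thousands), which is where the
100-fold class of improvement comes from.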

I apologise in advance to those who follow this newsgroup and already
know and use these or similar techniques, to those who don't write (and
don't like to read) long articles, and to those for whom execution
efficiency is not a major issue.  But I start to burn when I see
statements that APL is inherently and necessarily an order of magnitude
less efficient than FORTRAN or C or whatever.  There are certainly cases
where this is true, but there are a very large number of cases where it
is not, or need not be, true.  My experience has been that APL speeds up 
the design and, especially, the coding phase of application development
enough that there is generally plenty of time to find, elucidate, and
ameliorate most such "performance bugs".


Mike Kent     
70530.1226 at compuserve.com

