From phage!jvnc.net!yale.edu!qt.cs.utexas.edu!cs.utexas.edu!utgpu!watserv1!70530.1226@compuserve.com Mon Mar 23 12:31:31 EST 1992
Article: 1075 of comp.lang.apl
Newsgroups: comp.lang.apl
Path: phage!jvnc.net!yale.edu!qt.cs.utexas.edu!cs.utexas.edu!utgpu!watserv1!70530.1226@compuserve.com
From: Mike Kent <70530.1226@CompuServe.COM>
Subject: APL execution efficiency revisited
Message-ID: <920320080728_70530.1226_CHC73-1@CompuServe.COM>
Sender: payne@watserv1.waterloo.edu (Doug Payne [DCS])
Organization: University of Waterloo
Date: Fri, 20 Mar 1992 08:07:29 GMT
Lines: 186

The APL-to-FORTRAN translator developed at IBM Research (Yorktown
Heights) and then called the YAT (Yorktown APL Translator) is available
for mainframe (at least for VM) APL2 users, though not from IBM; the 
vendor is InterProcess Systems, Inc.  They are in Atlanta, and I don't
recall their street address or phone number but a call to Atlanta
information will get you the number (and since they are an IBM Business
Partner, perhaps one of the IBMers who follows this newsgroup can post
this info).  (They are good people to do business with: honest,
straightforward, fair, and pleasant; their other APL add-ons are worth a
good hard look as well, especially their integrated editor/debugger
which I have used to my considerable benefit).  I do *not* have any
personal experience with this product as I was not able to convince my
previous employer to buy it (I thought it was a no-brainer, we had a $2E6
internal timesharing bill in the department, and the product costs around
$10 or $15K as I recall, so a 5% overall application speedup would
recover software acquisition and application tweaking costs easily within
a year).  

Changes in execution times reportedly range from slight degradation to 50-
or 100-fold improvement with 3- to 5-fold improvement being typical.  The
compiler is *not* an APL2 compiler (though its output is accessed via the
APL2 []NA interface); the code to be compiled should be ISO "flat" APL;
no shared variable code, please.  Gains are said to be rather small on
well-written, tuned, highly-parallel APL code running against large
arrays, but quite large on loopy scalar code where APL performance is
generally poor.

My experience with optimizing APL code indicates that there is a strong
Pareto-type (80/20) law:  large fractions of the time are spent in small
fractions of the application.  InterProcess specifically suggests that
compilation should be limited to such bottlenecks.

The compiler produces the FORTRAN intermediate code and does NOT discard
it after FORTRAN compilation; this provides an alternative methodology (to
calling into APL) for those who might like to develop in APL and call from
other languages.

STSC, address provided in yesterday's responses to STATGRAPHICS queries,
offers a compiler with similar benefits and limitations as an add-on to
their APL*PLUS mainframe interpreter; the STSC compiler, however,
produces object code which must be run from within an APL*PLUS workspace
as it may make calls to certain interpreter services.  Performance
characteristics and caveats are generally as for the IBM/InterProcess
product.  Again, I have no personal experience with this product; the
description is hearsay (but I heard directly from the STSC developer and
from one of their clients).

The performance caveats and characteristics, and the type of results in my
previous article, illustrate something that APL users often forget:  APL
interpreters provide in their language primitives highly-optimized
versions of what would generally be library routines in other languages. 
I've been writing APL code for a living for over a dozen years now (which
doesn't make me an old-timer, but does put me past the arriviste stage),
and have been involved primarily in medium to large scale applications
for interactive use by non-programmers, most of them computer-illiterate
or close to it.  Acceptance of such applications is often grudging if the
response time is long (or more or less equivalently, if the internal
timesharing charges are high), so a good part of my work has been
concerned with improving execution efficiency.  In some cases, I have just
been stuck, unable to make a significant improvement to slow parts of an
app, but in many other cases improvements, often dramatic, have been
possible with more or less effort.  A few typical bottlenecks come up
often enough that I think it might be worth discussing them here.

[1]  Arrays, typically matrices, which grow by repeated catenation, often
     a row at a time.

     This is an absolute killer; it forces the interpreter to spend far 
     more time on storage management and data movement than on the actual
     problem being solved.  When the ultimate size of the array can be 
     determined in advance, allocating it up front and using indexed 
     assignment reduces the execution from quadratic in the final size to
     linear, and reduces the leading coefficient as well.  If the size 
     can't be determined in advance, increasing the size by relatively 
     large jumps and using indexed assignment (with a final "take" to 
     discard allocated unused rows) will generally pay off handsomely,
     at the cost of some extra complexity in the code.
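
     The contrast can be sketched in Python (a stand-in for the APL,
     since the idea is language-independent; the function names are
     mine, not from any library):

```python
# Growing an array one row at a time by catenation: each step copies
# everything built so far, so total work is quadratic in the row count.
def build_by_catenation(rows):
    m = []
    for r in rows:
        m = m + [r]          # copies the whole structure every iteration
    return m

# Preallocating to the final size and using indexed assignment:
# each row is written exactly once, so total work is linear.
def build_by_preallocation(rows):
    n = len(rows)
    m = [None] * n           # allocate up front
    for i, r in enumerate(rows):
        m[i] = r             # indexed assignment, no copying
    return m
```

     Both produce the same result; only the amount of storage
     management and data movement differs.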

[2]  Column-oriented assignments into and fetches from large matrices.

     Since APL allocates storage in row-major order, row-oriented
     operations are generally faster (especially on virtual-memory
     operations are generally faster (especially on virtual-memory
     increasing CPU time slightly and wall-clock time dramatically).
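
     A rough illustration, with a matrix held as a list of rows
     (my own toy functions, not from the post): a row-oriented pass
     scans contiguous storage, while a column-oriented pass hops
     from row to row.

```python
def row_sums(m):
    # Row-oriented: each inner scan walks one contiguous row.
    return [sum(row) for row in m]

def col_sums(m):
    # Column-oriented: each inner scan jumps across rows, which on a
    # real row-major array defeats locality and provokes paging.
    ncols = len(m[0])
    return [sum(row[j] for row in m) for j in range(ncols)]
```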

[3]  OOPS, how did these integers get to be 8 bytes?

     Probably because integer keys or pointers were stored in the same
     array with floating-point values (in APL2, you can create 16-byte 
     items by commingling pointers with complex numbers).  This has two 
     bad effects on application performance:  storage requirements (and 
     so storage management and data movement effects) go up dramatically, 
     and many primitives are far more efficient on integers than on
     floats.  Dyadic iota is a prime example; when both arguments are
     integer arrays, it is generally very fast, while for floats it is 
     generally unbearably slow (for large arrays).  The cure is to
     segregate keys and pointers from data, either as items in a nested
     array, or in separate variables.  The cost again is some extra
     complexity in the code, but the payoff again can be dramatically
     better execution times.
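
     The storage effect can be mimicked with Python's array module
     (an illustration only; the APL interpreter does this promotion
     silently): integer keys commingled with floats become 8-byte
     doubles, while keys kept in their own integer array stay compact.

```python
from array import array

keys = [1, 2, 3, 4]
values = [1.5, 2.5, 3.5, 4.5]

# Commingled: the keys are promoted to 8-byte floats with the values.
mixed = array('d', [float(k) for k in keys] + values)

# Segregated: the keys stay as C ints (typically 4 bytes) on their own.
int_keys = array('i', keys)
floats = array('d', values)
```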

[4]  Inner and outer products in which most of the entries are discarded,
     or which are used only as intermediates.

     The classic example here is reducing a matrix to its unique rows by
     the following beautiful but amazingly inefficient idiom:

          (1 1 transpose < \ M ^.= transpose M) / [1] M

     The compression is fine; it's the calculation of the bit vector
     which is the problem.  The solution is to make use of a defined
     "matrix dyadic iota" function which computes (rows of M) iota
     (rows of M) efficiently, call it MIOTA, and then do
 
         ((M MIOTA M) = iota 1 rho rho M) / [1] M

     See below for a discussion of such an MIOTA function.
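
     A defined MIOTA can be sketched in Python (hypothetical names of
     mine; a production APL version would use the grade-based technique
     discussed under [5]): map each row to the index of its first
     occurrence, then keep the rows that are their own first occurrence.

```python
def miota(m_rows, needles):
    # (rows of m_rows) iota (rows of needles): index of the first
    # matching row, or len(m_rows) if absent -- mirroring dyadic iota
    # in origin 0.
    first = {}
    for i, row in enumerate(m_rows):
        first.setdefault(tuple(row), i)
    return [first.get(tuple(r), len(m_rows)) for r in needles]

def unique_rows(m):
    # ((M MIOTA M) = iota (row count of M)) / [1] M, in origin 0:
    # a row survives iff it is the first occurrence of itself.
    idx = miota(m, m)
    return [row for i, row in enumerate(m) if idx[i] == i]
```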

[5]  Naive searching (failure to exploit grade or special properties
     of the array being searched).

     Finding the uniques in a vector by ((V iota V) = iota rho V) / V
     is fine for small V, or when dyadic iota is extremely fast, or 
     when the order of the uniques is important.  When V is large
     and order preservation is unimportant, try instead:

     [a]  sorting V and comparing adjacent items for inequality (using
          for instance, "V unequal 1 rotate V").

     [b]  in the very common case where V is a long vector of small
          (positive) integer keys or pointers, 

              B <- (1+ max / V) rho 0
              B[V] <- 1
              B/iota rho B
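
     Case [b] translates almost line for line (an origin-0 Python
     sketch of mine; note it gives the uniques in ascending order,
     which is fine when order preservation is unimportant):

```python
def uniques_small_ints(v):
    # B <- (1 + max/V) rho 0
    b = [0] * (1 + max(v))
    # B[V] <- 1  (scatter assignment: mark every value present)
    for x in v:
        b[x] = 1
    # B / iota rho B  (compress the index vector by the bit vector)
    return [i for i, bit in enumerate(b) if bit]
```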
  
     For matrix dyadic iota (and vector dyadic iota for floats),
     merge the haystack with the needles, grade the result, permute
     a bit vector of (1s for haystack, 0s for needles) by the grade
     vector, compare rows, count permuted bits  ...  

     Compared to the idiomatic

        []IO + +/ and \ M or . unequal transpose M

     the merge-and-sort technique is faster for all but the smallest 
     arrays, and up to *several hundred* times as fast for (very) large
     arrays.  Further, the inner product generates an enormous bit
      array which can cause WS FULL when both matrices have large row
     counts.
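
     The merge-and-sort technique might look like this in Python (a
     loose rendition of mine: sorted() plays the role of grade, and the
     carried haystack index plays the role of the permuted bit vector):

```python
def sort_iota(haystack, needles):
    # Merge haystack and needles, tagging each item with
    # (is_needle, original index); a stable sort then places every
    # needle right after any equal haystack items.
    tagged = [(x, 0, i) for i, x in enumerate(haystack)] + \
             [(x, 1, i) for i, x in enumerate(needles)]
    tagged.sort(key=lambda t: (t[0], t[1]))
    not_found = len(haystack)
    result = [not_found] * len(needles)
    last_val, last_idx = None, not_found
    for val, is_needle, idx in tagged:
        if not is_needle:
            if val != last_val:       # keep the FIRST haystack index
                last_val, last_idx = val, idx
        elif val == last_val:         # needle matches current run
            result[idx] = last_idx
    return result
```

     One sort of the merged data replaces the enormous bit matrix of
     the inner-product idiom.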
 
     The technique has been discussed in the APL literature
     (e.g. in the December '91 NY/SIGAPL newsletter; in Gary Bergquist's
     book "APL:  Advanced Techniques and Utilities"), and occurs in the
     defined IOTA function (an easily adapted function which handles
     the case of numeric vectors) which IBM formerly distributed in
     the UTILITY workspace along with VS APL.

[6]  Needlessly scalar-oriented code.

     I once replaced a graph-traversal algorithm (used for extracting
     such items as the salesmen who work for a sales office in a division
     of a firm, or all the G/L lines which contribute to "Total phone
     expenses", for instance) which iterated on each node as it went down
     the graph to an algorithm which iterated once for each depth level
     in the graph and got better than 100-fold improvement in the cases
     which the end users were complaining about (30 minutes of locked
     screen to answer the question who are the salesmen in the XXX
     division, reduced to around 10 or 15 seconds).   
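
     The level-at-a-time idea, for a hypothetical parent-to-children
     graph (my own sketch, not the original application code):

```python
def descendants_by_level(children, roots):
    # One iteration per depth level, not per node: each pass expands
    # the whole frontier at once -- the "parallel" step that an array
    # language performs with a single indexed operation.
    seen = set(roots)
    frontier = set(roots)
    while frontier:
        frontier = {c for p in frontier
                      for c in children.get(p, ())} - seen
        seen |= frontier              # also guards against cycles
    return seen - set(roots)
```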

I apologise in advance to those who follow this newsgroup and already
know and use these or similar techniques, to those who don't write (and
don't like to read) long articles, and to those for whom execution
efficiency is not a major issue.  But I start to burn when I see
statements that APL is inherently and necessarily an order of magnitude
less efficient than FORTRAN or C or whatever.  There are certainly cases
where this is true, but there are a very large number of cases where it
is not, or need not be, true.  My experience has been that APL speeds up 
the design and, especially, the coding phase of application development
enough that there is generally plenty of time to find, elucidate, and
ameliorate most such "performance bugs".


Mike Kent     
70530.1226 at compuserve.com



From phage!jvnc.net!yale.edu!qt.cs.utexas.edu!zaphod.mps.ohio-state.edu!mips!pacbell.com!UB.com!daver!leadsv!sunfse!iscnvx!psinntp!kepler1!andrew Mon Mar 23 12:32:05 EST 1992
Article: 1089 of comp.lang.apl
Path: phage!jvnc.net!yale.edu!qt.cs.utexas.edu!zaphod.mps.ohio-state.edu!mips!pacbell.com!UB.com!daver!leadsv!sunfse!iscnvx!psinntp!kepler1!andrew
From: andrew@rentec.com (Andrew Mullhaupt)
Newsgroups: comp.lang.apl
Subject: Re: APL execution efficiency revisited
Message-ID: <745@kepler1.rentec.com>
Date: 21 Mar 92 03:05:17 GMT
References: <920320080728_70530.1226_CHC73-1@CompuServe.COM>
Organization: Renaissance Technologies Corp., Setauket, NY.
Lines: 19

In article <920320080728_70530.1226_CHC73-1@CompuServe.COM> 70530.1226@CompuServe.COM (Mike Kent) writes:

	[Several classic APL don'ts omitted]

>[4]  Inner and outer products in which most of the entries are discarded,
>     or which are used only as intermediates.

	This can be almost impossible to avoid: it is one of the classic
examples of dynamic programming to find the order in which to evaluate a product
of matrices with the least amount of arithmetic. If you are given a list
of matrices and expect to do reasonably well, you will have to be prepared
to solve the dynamic program for the product order and then do the products
in that order. Note that APL is not at a big disadvantage in this case. It
would only be out of necessity that one would resort to this in almost
any language, and it would not be particularly trivial in any that I can
think of. It just goes to show that you can't always avoid this mistake.

Later,
Andrew Mullhaupt


From phage!jvnc.net!yale.edu!qt.cs.utexas.edu!cs.utexas.edu!utgpu!watserv1!70530.1226@compuserve.com Mon Mar 23 12:32:20 EST 1992
Article: 1091 of comp.lang.apl
Newsgroups: comp.lang.apl
Path: phage!jvnc.net!yale.edu!qt.cs.utexas.edu!cs.utexas.edu!utgpu!watserv1!70530.1226@compuserve.com
From: Mike Kent <70530.1226@CompuServe.COM>
Subject: Re: Re:  APL execution efficiency revisited
Message-ID: <920322073241_70530.1226_CHC87-1@CompuServe.COM>
Sender: root@watserv1.waterloo.edu (Operator)
Organization: University of Waterloo
Date: Sun, 22 Mar 1992 07:32:42 GMT
Lines: 37

In article <745@kepler1.rentec.com>, andrew@rentec.com (Andrew Mullhaupt)
writes:

 >  [Several classic APL dont's omitted]

Classic, but not as well-known or widely understood as one might hope,
if the code I've seen over the last decade is any indication.  The
season-ticketed audience in this venue is _not_ representative of the
APL-programming community overall and much or all of what I wrote 
is old news to them.  Still, you had made some rather general assertions
about APL execution efficiency (or I took your meaning wrong), so I felt
it appropriate to post some advice on solving execution efficiency
problems, shopworn though it may be, since following these rules of thumb
will give code which avoids a large number of the worst cases.  And again,
your mileage may vary.

 >>[4] Inner and outer products in which most of the entries are        
       discarded, or used only as intermediates.
 
 > This can be almost impossible to avoid ...

No vendor has yet (so far as I know) optimized

    first plus.times / (vector of matrices)

but it's probably only a matter of time ...   Of course, the idiom would
be unrecognizable in, e.g., C.  One could write a library routine to
handle this in a compiled language, but with many current interpreters,
it's likely that the library routine would be callable (via a []NA-like
interface) from APL.  This is certainly true of APL2.

My point here is that APL code does not, ordinarily, have to be an order
of magnitude (or even a factor of 2) slower than compiled code.  And I am
tired of hearing that this _is_ so, and that it is _inherent_ in the
language.  




From phage!jvnc.net!yale.edu!spool.mu.edu!think.com!wupost!usc!rutgers!hsdndev!spdcc!dirtydog.ima.isc.com!ispd-newsserver!psinntp!kepler1!andrew Mon Mar 23 12:34:12 EST 1992
Article: 1092 of comp.lang.apl
Path: phage!jvnc.net!yale.edu!spool.mu.edu!think.com!wupost!usc!rutgers!hsdndev!spdcc!dirtydog.ima.isc.com!ispd-newsserver!psinntp!kepler1!andrew
From: andrew@rentec.com (Andrew Mullhaupt)
Newsgroups: comp.lang.apl
Subject: Re: Re:  APL execution efficiency revisited
Message-ID: <752@kepler1.rentec.com>
Date: 22 Mar 92 18:42:57 GMT
Article-I.D.: kepler1.752
References: <920322073241_70530.1226_CHC87-1@CompuServe.COM>
Organization: Renaissance Technologies Corp., Setauket, NY.
Lines: 36

In article <920322073241_70530.1226_CHC87-1@CompuServe.COM> 70530.1226@CompuServe.COM (Mike Kent) writes:
> >>[4] Inner and outer products in which most of the entries are        
>       discarded, or used only as intermediates.
> 
> > This can be almost impossible to avoid ...
>
>No vendor has yet (so far as I know) optimized
>
>    first plus.times / (vector of matrices)

	[ ... but suppose one did...]


Not good enough. Two of the cases where this optimization is likely to
be practiced are where not all the matrices can be held in memory at the
same time, or where they are being sequentially computed in control-bound
code.

>My point here is that APL code does not, ordinarily, have to be an order
>of magnitude (or even a factor of 2) slower than compiled code.  And I am
>tired of hearing that this _is_ so, and that it is _inherent_ in the
>language.  

'Ordinarily' has a great deal to do with what you do. When you do scientific
computation, the only real problems are not enough space and not enough time.
There is always an obvious way to solve a problem which is far too expensive
(or else you would use it); you spend all your time up against problems where
the obvious approach is not OK.  APL programmers who claim that APL is not
usually a lot slower don't seem to need 'non-dense' algorithms.

I do not claim that interpreted APL is _always_ a lot slower than compiled
code. However, it _is_ in enough cases to make you cry. I'll try to make
my position clearer in my next post.

Later,
Andrew Mullhaupt


From phage!jvnc.net!darwin.sura.net!wupost!usc!elroy.jpl.nasa.gov!jato!csi!sam Mon Mar 23 15:39:16 EST 1992
Article: 1096 of comp.lang.apl
Newsgroups: comp.lang.apl
Path: phage!jvnc.net!darwin.sura.net!wupost!usc!elroy.jpl.nasa.gov!jato!csi!sam
From: sam@csi.jpl.nasa.gov (Sam Sirlin)
Subject: Re: Re:  APL execution efficiency revisited
Message-ID: <1992Mar23.185558.2647@csi.jpl.nasa.gov>
Originator: sam@kalessin
Sender: usenet@csi.jpl.nasa.gov (Network Noise Transfer Service)
Nntp-Posting-Host: kalessin
Organization: Jet Propulsion Laboratory, Pasadena, CA
References:  <920322073241_70530.1226_CHC87-1@CompuServe.COM>
Date: Mon, 23 Mar 1992 18:55:58 GMT
Lines: 20


In article <920322073241_70530.1226_CHC87-1@CompuServe.COM>, Mike Kent <70530.1226@CompuServe.COM> writes:
|>  >>[4] Inner and outer products in which most of the entries are        
|>        discarded, or used only as intermediates.
|>  
|>  > This can be almost impossible to avoid ...
|> 
|> No vendor has yet (so far as I know) optimized
|> 
|>     first plus.times / (vector of matrices)

This issue was one of the things Tim Budd tackled with his demand
driven compiler - don't evaluate unneeded intermediate steps. This
particular (APL2?) example won't work, but ISO cases should (assuming 
the compiler works...) 

-- 
Sam Sirlin
Jet Propulsion Laboratory         sam@kalessin.jpl.nasa.gov



From phage!jvnc.net!yale.edu!qt.cs.utexas.edu!zaphod.mps.ohio-state.edu!sdd.hp.com!elroy.jpl.nasa.gov!jato!csi!sam Mon Mar 23 19:35:02 EST 1992
Article: 1081 of comp.lang.apl
Newsgroups: comp.lang.apl
Path: phage!jvnc.net!yale.edu!qt.cs.utexas.edu!zaphod.mps.ohio-state.edu!sdd.hp.com!elroy.jpl.nasa.gov!jato!csi!sam
From: sam@csi.jpl.nasa.gov (Sam Sirlin)
Subject: Re: APL2 question
Message-ID: <1992Mar20.230242.3425@csi.jpl.nasa.gov>
Originator: sam@kalessin
Sender: usenet@csi.jpl.nasa.gov (Network Noise Transfer Service)
Nntp-Posting-Host: kalessin
Organization: Jet Propulsion Laboratory, Pasadena, CA
References: <1992Mar16.173450.1067@csi.jpl.nasa.gov> <730@kepler1.rentec.com> <1992Mar17.185047.8403@csi.jpl.nasa.gov> <732@kepler1.rentec.com>
Date: Fri, 20 Mar 1992 23:02:42 GMT
Lines: 64


In article <732@kepler1.rentec.com>, andrew@rentec.com (Andrew Mullhaupt) writes:
|> In article <1992Mar17.185047.8403@csi.jpl.nasa.gov> sam@csi.jpl.nasa.gov (Sam Sirlin) writes:
|> >I thought Fortran came before Algol, but I could easily be wrong. I
|> >meant languages like C, Pascal, Basic, PL1, etc. 
|> Well it did, and Algol is still not a FORTRAN derivative any more than LISP,
|> which came after both of these, is a derivative of either.

There doesn't seem to me to be much difference between Fortran and
Algol, but "much difference" isn't very quantitative. I think most
people would agree they are much closer to each other than they are to
Lisp or APL. Is there some technical (language-theoretic?) term for
the class of languages {Fortran, Algol, C, Pascal etc}? I've heard
mention that APL treats arrays as "first class objects" or some such,
and that Lisp (J?) treats functions as first class objects.  

|> At that point, the interpreted APL will pay heinous penalties for problems
|> involving inputs above the maximum workspace size, if they can be done at
|> all. So why not worry about it in the first place and not use APL...

Well no one is forcing you to use APL. Reasons to use it are:
  - it's faster to write the code
  - it's faster to debug 
  - the result is more flexible: you can play with results in all
    kinds of ways between processing, change processing paths
    depending on (human interpretation of) intermediate results

|> Well remember that the APL contender will be what runs fastest in APL.

Unless you allow an APL compiler, in which case explicit loops are ok
if the compiler is efficient. 

|> Maybe... I think you ought to use the new standard eigenvalue method, which
|> because it is divide-and-conquer when you get down to the tridiagonal level

Hmm, actually the symmetric problem isn't important enough to me to
code up in APL - I only have QR and QZ routines. LAPACK seems to have
essentially the same sort of routines.

|> will likely make APL look really awful. Even more awful than if you go up
|> against EISPACK or ESSL.

I don't see why this is so, again assuming compiling is allowed. I do
agree that there are some algorithms that can't be expressed in (flat)
APL without loops, hence for any of those algorithms an interpreted
APL will be slow. I think this is a good argument for developing:

  - compilers, so that explicit loops can be made fast
  - additional language elements, so that loops can be expressed as
    language primitives, and hence this class of algorithms might
    execute fast even in an interpreter

A few compilers are now becoming generally available, and I've not had
much practical experience with them. Generalized arrays and some form
of "each" are also becoming available, for example in APL2 and J. I
don't know how successful these are at this problem, but then the
"non-flat" versions of APL are still young. While these ideas may
prove fruitful, I personally like the simple idea of van Batenburg
(sp?) that adds a loop block construct to APL.

-- 
Sam Sirlin
Jet Propulsion Laboratory         sam@kalessin.jpl.nasa.gov



From phage!jvnc.net!darwin.sura.net!wupost!uunet!psinntp!kepler1!andrew Tue Mar 24 17:20:16 EST 1992
Article: 1113 of comp.lang.apl
Path: phage!jvnc.net!darwin.sura.net!wupost!uunet!psinntp!kepler1!andrew
From: andrew@rentec.com (Andrew Mullhaupt)
Newsgroups: comp.lang.apl
Subject: Re: Re:  APL execution efficiency revisited
Message-ID: <755@kepler1.rentec.com>
Date: 24 Mar 92 15:09:56 GMT
References: <920322073241_70530.1226_CHC87-1@CompuServe.COM> <1992Mar23.185558.2647@csi.jpl.nasa.gov>
Organization: Renaissance Technologies Corp., Setauket, NY.
Lines: 31

In article <1992Mar23.185558.2647@csi.jpl.nasa.gov> sam@csi.jpl.nasa.gov (Sam Sirlin) writes:
>
>In article <920322073241_70530.1226_CHC87-1@CompuServe.COM>, Mike Kent <70530.1226@CompuServe.COM> writes:
>|>  >>[4] Inner and outer products in which most of the entries are        
>|>        discarded, or used only as intermediates.
>|>  
>|>  > This can be almost impossible to avoid ...
>|> 
>|> No vendor has yet (so far as I know) optimized
>|> 
>|>     first plus.times / (vector of matrices)
>
>This issue was one of the things Tim Budd tackled with his demand
>driven compiler - don't evaluate unneeded intermediate steps. This
>particular (APL2?) example won't work, but ISO cases should (assuming 
>the compiler works...) 

Presuming that Budd's _An APL Compiler_ book represents what is in his
compiler, no. Demand driven evaluation does not significantly reduce
the worst case work done in this problem and probably has little impact
on the average case. On the other hand, finding the optimum order to
compute the matrix product in typically has a huge effect on this problem.

Note that other compiled languages do not provide this service, either.

Note that computing the product of several matrices can occur in ISO (i.e.
flat) APL. Neither APL's right-to-left evaluation, nor demand-driven, nor
even vectorized evaluation will do much for the general case of this problem.

Later,
Andrew Mullhaupt


From phage!jvnc.net!darwin.sura.net!wupost!uunet!psinntp!kepler1!andrew Tue Mar 24 17:22:50 EST 1992
Article: 1114 of comp.lang.apl
Path: phage!jvnc.net!darwin.sura.net!wupost!uunet!psinntp!kepler1!andrew
From: andrew@rentec.com (Andrew Mullhaupt)
Newsgroups: comp.lang.apl
Subject: Re: APL execution efficiency revisited
Message-ID: <756@kepler1.rentec.com>
Date: 24 Mar 92 15:34:27 GMT
References: <920322073241_70530.1226_CHC87-1@CompuServe.COM> <1992Mar23.185558.2647@csi.jpl.nasa.gov> <ROCKWELL.92Mar23224842@socrates.umd.edu>
Organization: Renaissance Technologies Corp., Setauket, NY.
Lines: 55

In article <ROCKWELL.92Mar23224842@socrates.umd.edu> rockwell@socrates.umd.edu (Raul Deluth Miller-Rockwell) writes:
>Andrew Mullhaupt:
>   |> No vendor has yet (so far as I know) optimized
>   |>     first plus.times / (vector of matrices)
CHECK YOUR ATTRIBUTIONS, PLEASE! I did not say this. 

>
>I'm not quite clear on what's supposed to be optimized here.

Optimizing the calculation of a matrix product is a classical problem
in computer science. The idea is that some of the intermediate products
are much smaller than others, depending on the sequence of shapes.
In order to 'efficiently' compute the product, you usually have to 
solve a dynamic program whose inputs are these shapes, and then do
the matrix arithmetic.

The Matrix Chain Product problem is, for example, an exercise in Horowitz
and Sahni's _Fundamentals of Computer Algorithms_ (p.242 - 243) and
in Sedgewick's _Algorithms in C_ pp. 598ff. Sedgewick has an example
where left-to-right order uses 6024 multiplications and right-to-left
order uses 274,200 multiplications. (I wonder if he had APL in mind...)

Recall that when I raised this subject, the point was that it can be
difficult to avoid putting inner/outer products in a bad order when
writing code. This is a classical, well understood issue in computer
science. Either you're going to solve the dynamic program for every
matrix chain or you're going to accept a _lot_ of excess calculation.
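
For reference, the classical dynamic program (as in Horowitz & Sahni or
Sedgewick) can be sketched as follows; here `dims` holds the chain of
shapes, so matrix i is dims[i] by dims[i+1]:

```python
def matrix_chain_cost(dims):
    # m[i][j] = minimum scalar multiplications needed to compute the
    # product of matrices i..j, where matrix k is dims[k] x dims[k+1].
    n = len(dims) - 1          # number of matrices in the chain
    m = [[0] * n for _ in range(n)]
    for span in range(1, n):               # chain lengths 2, 3, ...
        for i in range(n - span):
            j = i + span
            # Try every split point k: (i..k) times (k+1..j).
            m[i][j] = min(m[i][k] + m[k+1][j]
                          + dims[i] * dims[k+1] * dims[j+1]
                          for k in range(i, j))
    return m[0][n-1]
```

Knowing only the shapes is enough to choose the order; the expensive
part is that the shapes may only be known at run time.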

>Presumably, if you knew something about the structure of the matrices,
>you could perform some kind of strength reduction on the algorithm
>[use a less powerful algorithm which requires fewer cpu resources].

Precisely. Usually all that is needed is to know the shapes, then
you can decide what order to do the multiplications in. Of course
the interesting case is when the shapes are changing at run time...

>Do I have the idea straight, so far?

I'm sorry, but no. There seem to be two confusions. 1. You are not talking
about things I said. 2. You may not be thinking of the Matrix Chain Product
problem, which is what I used as an example of how it may be hard to write
code that avoids excess calculations through inner/outer products.

>Raul Deluth Miller-Rockwell                   <rockwell@socrates.umd.edu>

Sorry Raul, if I sound a bit prickly, but you might want to check the
attributions and then repost (pun intended). I know I was thinking
of Matrix Chain Product, but I hadn't come out of the closet with it
at this point in the discussion.


Later,
Andrew Mullhaupt




From phage!jvnc.net!darwin.sura.net!haven.umd.edu!socrates!socrates!rockwell Tue Mar 24 17:25:24 EST 1992
Article: 1104 of comp.lang.apl
Newsgroups: comp.lang.apl
Path: phage!jvnc.net!darwin.sura.net!haven.umd.edu!socrates!socrates!rockwell
From: rockwell@socrates.umd.edu (Raul Deluth Miller-Rockwell)
Subject: Re: APL execution efficiency revisited
In-Reply-To: sam@csi.jpl.nasa.gov's message of Mon, 23 Mar 1992 18:55:58 GMT
Message-ID: <ROCKWELL.92Mar23224842@socrates.umd.edu>
Sender: rockwell@socrates.umd.edu (Raul Deluth Miller-Rockwell)
Organization: Traveller
References: <920322073241_70530.1226_CHC87-1@CompuServe.COM>
	<1992Mar23.185558.2647@csi.jpl.nasa.gov>
Date: Tue, 24 Mar 1992 03:48:42 GMT

Andrew Mullhaupt:
   |> No vendor has yet (so far as I know) optimized
   |>     first plus.times / (vector of matrices)

I'm not quite clear on what's supposed to be optimized here.

As I understand it,   plus.times / (vector of matrices)  will yield a
"scalar representation of an array", and   first  will convert that
into a flat array.

Presumably, if you knew something about the structure of the matrices,
you could perform some kind of strength reduction on the algorithm
[use a less powerful algorithm which requires fewer cpu resources].

And the criticism that Mr. Mullhaupt has made is that there are no
generally useful tools to analyze the structure of the code which
generates that vector of arrays, so as to perform this sort of
strength reduction automatically?

Do I have the idea straight, so far?

If so, well, I agree that we need more tools along the line of
symbolic manipulation of expressions and programs.  But until that
point, unless you want to work on designing such things [specifying,
coding, whatever], it seems that the most fruitful approach is in
picking "the right algorithm".

Human design effort is probably the most powerful and general
strength reduction tool available.  Which is not to say that it can't
be reduced in strength as well.  Or something like that...

-- 
Raul Deluth Miller-Rockwell                   <rockwell@socrates.umd.edu>


From phage!jvnc.net!yale.edu!qt.cs.utexas.edu!cs.utexas.edu!utgpu!watserv1!70530.1226@compuserve.com Tue Mar 24 17:36:39 EST 1992
Article: 1107 of comp.lang.apl
Newsgroups: comp.lang.apl
Path: phage!jvnc.net!yale.edu!qt.cs.utexas.edu!cs.utexas.edu!utgpu!watserv1!70530.1226@compuserve.com
From: Mike Kent <70530.1226@CompuServe.COM>
Subject: APL execution efficiency
Message-ID: <920324052537_70530.1226_CHC142-1@CompuServe.COM>
Sender: root@watserv1.waterloo.edu (Operator)
Organization: University of Waterloo
Date: Tue, 24 Mar 1992 05:25:37 GMT
Lines: 49

In article <752@kepler1.rentec.com>, andrew@rentec.com (Andrew Mullhaupt)
writes [concerning optimizing sequential/iterated multiplication of
matrices]: 

 > Two of the cases where this optimization is likely to be practiced are   
 > where not all the matrices can be held in memory at the same time, or
 > where they are being sequentially computed in control-bound code.

In the current (or just around the corner) version 2 of APL2 (mainframe)
IBM has done something about the not-enough-memory problem, by introducing
external variables (they reside on file, but this is (reasonably)
transparent (i.e., { V <- V, enclose M } writes to the end of the file,
{ V[22] <- enclose M } updates the file, { 4 pick V } reads from the
file).  Don't know about efficiency since I lost access to mainframe APL2
in early January when I changed jobs (but I suspect the worst).

 > You spend all your time up against problems where the obvious approach
 > is not OK.

True, but hardly unique to APL.  At least in APL you HAVE the time to
optimize where it counts.  And if you can't find an APL optimization
which gives acceptable performance, you can call a compiled routine,
either from a library or something written specially for the application
in (for instance) FORTRAN.  I would think that for scientific computation
this would be the preferred approach; there's a ton of FORTRAN library
code for doing just about any kind of heavy-duty number crunching, so use
it to attack the bottlenecks, using APL for the 80% of the code for which
APL is sufficient.  That way, you get good execution efficiency, and
_most_ of the fast development benefits as well.  With varying degrees of
convenience, APL2, APL*PLUS, and Sharp APL can all call external routines,
so I don't see this as a serious problem.

 > "Ordinarily" has a great deal to do with what you do.

True again.  "Ordinarily" I do straight commercial applications.  It
happens that options pricing leads to interesting problems sometimes,
for instance, when there is no closed form for the arbitrage-free price
for an option and the Markov process which models the changes in forward
prices [ it has a quite sparse transition matrix ] is best studied by
Monte Carlo simulation and some (simple) statistical analysis.  Here, the
"obvious" code is either a killer on both space and time (taking a power
limit of the transition matrix) or on time (nested loop to randomly walk
the phase space).  But this sparse problem is not too hard to convert to a
"dense" problem (using roll, indexing, and +\), and then it's only a mild
space problem, easily overcome by looping on blocks of 100 or 250 (or
whatever []WA allows) "parallel" iterations at a time.  [The overhead to
interpret "+\" is swamped by the computation on e.g. a 100 x 1500 matrix.
Likewise the time to do memory management.]
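[Kent's roll/indexing/+\ recipe can be sketched outside APL.  A minimal
pure-Python illustration, with an invented 3-state chain and made-up block
and step counts: each of a block of "parallel" walkers draws a uniform
number (the roll) and indexes into a precomputed cumulative transition row
(the +\), so there are no transition-matrix powers and no per-state inner
loop.

```python
import random

# Hypothetical 3-state Markov chain; rows are "from" states.
P = [[0.9, 0.1, 0.0],
     [0.2, 0.7, 0.1],
     [0.0, 0.3, 0.7]]

# Precompute cumulative rows -- the "+\" of the APL trick.
cum = [[sum(row[:j + 1]) for j in range(len(row))] for row in P]

def step(states, rng):
    """Advance a whole block of parallel walkers one step."""
    out = []
    for s in states:
        u = rng.random()           # the "roll"
        row = cum[s]
        nxt = 0
        # Index into the cumulative row; guard against float round-off
        # at the top end of the row.
        while nxt < len(row) - 1 and row[nxt] < u:
            nxt += 1
        out.append(nxt)
    return out

def walk(n_walkers, n_steps, seed=0):
    """Run n_walkers parallel iterations for n_steps, all from state 0."""
    rng = random.Random(seed)
    states = [0] * n_walkers
    for _ in range(n_steps):
        states = step(states, rng)
    return states
```

In a real APL workspace the whole block advances in one vectorized
expression, which is exactly why the interpreter overhead is swamped by
the computation.]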



From phage!jvnc.net!yale.edu!spool.mu.edu!agate!ucbvax!STLVM20.VNET.IBM.COM!LIEBTAG Tue Mar 24 18:11:18 EST 1992
Article: 1110 of comp.lang.apl
Path: phage!jvnc.net!yale.edu!spool.mu.edu!agate!ucbvax!STLVM20.VNET.IBM.COM!LIEBTAG
From: LIEBTAG@STLVM20.VNET.IBM.COM ("David Liebtag")
Newsgroups: comp.lang.apl
Subject: APL2 File I/O
Message-ID: <9203241751.AA11873@ucbvax.Berkeley.EDU>
Date: 24 Mar 92 17:51:12 GMT
Article-I.D.: ucbvax.9203241751.AA11873
Sender: daemon@ucbvax.BERKELEY.EDU
Lines: 31

Warning: IBM APL2 Plug

Mike Kent writes:

> In the current (or just around the corner) version 2 of APL2
> (mainframe) IBM has done something about the not-enough-memory
> problem, by introducing external variables (they reside on file, but
> this is (reasonably) transparent (i.e., { V <- V, enclose M } writes
> to the end of the file, { V[22] <- enclose M } updates the file, { 4
> pick V } reads from the file).  Don't know about efficiency since I
> lost access to mainframe APL2 in early January when I changed jobs
> (but I suspect the worst).

We did not particularly have in mind easing space constraints when we
built files as variables although it can be used for this.  We mainly
had in mind just being able to read and write files using easy to use
APL syntax rather than language add-ons like auxiliary processors or
system functions.

Since I am in large part responsible for its performance, I can't resist
quoting a statistic.  We have a customer test site who converted an
application from using auxiliary processors to the new files as
variables.  The CPU time went from 25 to 4 minutes.  Our tests have
shown similar (or better) results.  We think this is a pretty healthy
improvement.


Regards,
David Liebtag

PS. Sorry for the inadvertent empty posting.


From phage!jvnc.net!darwin.sura.net!haven.umd.edu!socrates!socrates!rockwell Wed Mar 25 08:00:25 EST 1992
Article: 1117 of comp.lang.apl
Newsgroups: comp.lang.apl
Path: phage!jvnc.net!darwin.sura.net!haven.umd.edu!socrates!socrates!rockwell
From: rockwell@socrates.umd.edu (Raul Deluth Miller-Rockwell)
Subject: Re: APL execution efficiency revisited
In-Reply-To: andrew@rentec.com's message of 24 Mar 92 15:34:27 GMT
Message-ID: <ROCKWELL.92Mar25000046@socrates.umd.edu>
Sender: rockwell@socrates.umd.edu (Raul Deluth Miller-Rockwell)
Organization: Traveller
References: <920322073241_70530.1226_CHC87-1@CompuServe.COM>
	<1992Mar23.185558.2647@csi.jpl.nasa.gov>
	<ROCKWELL.92Mar23224842@socrates.umd.edu> <756@kepler1.rentec.com>
Date: Wed, 25 Mar 1992 05:00:46 GMT

I wrote:
   >Andrew Mullhaupt:
   >   |> No vendor has yet (so far as I know) optimized
   >   |>     first plus.times / (vector of matrices)

Andrew Mullhaupt:
   CHECK YOUR ATTRIBUTIONS, PLEASE! I did not say this. 

Sorry about that.  Articles here expire after three days, and I made
the mistake of trusting a second or third hand attribution.

   Optimizing the calculation of a matrix product is a classical
   problem in computer science. The idea is that some of the
   intermediate products are much smaller than others, depending on
   the sequence of shapes.  In order to 'efficiently' compute the
   product, you usually have to solve a dynamic program whose inputs
   are these shapes, and then do the matrix arithmetic.

Ah... I think I see the problem.  I was thinking of plus.times being
used to reduce a rank three matrix, not a nested setup where each of
the matrices are differently dimensioned.  You were thinking of the
latter case.

Um.. before I get into the mechanisms one might use for implementing
this, what sort of applications does this have?

-- 
Raul Deluth Miller-Rockwell                   <rockwell@socrates.umd.edu>


From phage!jvnc.net!darwin.sura.net!haven.umd.edu!socrates!socrates!rockwell Wed Mar 25 08:04:17 EST 1992
Article: 1118 of comp.lang.apl
Newsgroups: comp.lang.apl
Path: phage!jvnc.net!darwin.sura.net!haven.umd.edu!socrates!socrates!rockwell
From: rockwell@socrates.umd.edu (Raul Deluth Miller-Rockwell)
Subject: Re: APL execution efficiency revisited
In-Reply-To: andrew@rentec.com's message of 24 Mar 92 15:34:27 GMT
Message-ID: <ROCKWELL.92Mar25000355@socrates.umd.edu>
Sender: rockwell@socrates.umd.edu (Raul Deluth Miller-Rockwell)
Organization: Traveller
References: <920322073241_70530.1226_CHC87-1@CompuServe.COM>
	<1992Mar23.185558.2647@csi.jpl.nasa.gov>
	<ROCKWELL.92Mar23224842@socrates.umd.edu> <756@kepler1.rentec.com>
Date: Wed, 25 Mar 1992 05:03:55 GMT

   >   |>     first plus.times / (vector of matrices)

Looks like Mike Kent posted that one.

Again, sorry for the confusion.

-- 
Raul Deluth Miller-Rockwell                   <rockwell@socrates.umd.edu>


From phage!jvnc.net!yale.edu!spool.mu.edu!uunet!psinntp!kepler1!rjfrey Thu Mar 26 09:47:24 EST 1992
Article: 1127 of comp.lang.apl
Path: phage!jvnc.net!yale.edu!spool.mu.edu!uunet!psinntp!kepler1!rjfrey
From: rjfrey@rentec.com (Robert J Frey)
Newsgroups: comp.lang.apl
Subject: Re: APL2 File I/O
Message-ID: <759@kepler1.rentec.com>
Date: 25 Mar 92 16:38:16 GMT
Article-I.D.: kepler1.759
References: <9203241751.AA11873@ucbvax.Berkeley.EDU>
Organization: Renaissance Technologies Corp., Setauket, NY.
Lines: 53

In article <9203241751.AA11873@ucbvax.Berkeley.EDU> LIEBTAG@STLVM20.VNET.IBM.COM ("David Liebtag") writes:
>Warning: IBM APL2 Plug
>
>Mike Kent writes:
>
>> In the current (or just around the corner) version 2 of APL2
>> (mainframe) IBM has done something about the not-enough-memory
>> problem, by introducing external variables [which reside] on [a] file ...
>> ... this is (reasonably) transparent (i.e., { V <- V, enclose  M } writes
>> to the end of the file ...
>
>We did not particularly have in mind easing space constraints when we
>built files as variables ... We mainly [wanted to] be able to read and 
>write files using ... APL syntax ...
>
>We have a customer test site who converted an application from [AP's] to 
>[external variables]. The CPU time went from 25 to 4 minutes.  Our tests 
>have shown similar (or better) results.

Dyalog APL has had external variables for years. By calling

	'file-ref' quadXT 'var-name'

you establish an external variable called 'var-name' tied to the file
'file-ref'. In my experience performance is very good and it is often
possible to write much simpler code using some very large objects which 
not only do not inflate the size of the workspace but have persistence
from session to session *and* from workspace to workspace.

One disadvantage, however, is that an attempt to change an external variable
in situ sometimes forces the entire variable to be brought into the 
workspace.  This limits the size of external objects to much less than
that of the component files they often end up replacing. 

Given the size of modern workspaces that may not seem like very much
of a restriction, but for us (and many others) it can be a severe one.
We have several real-time trading systems which must track markets 
and orders in several hundred different securities. As big as objects for 
those problems get, our research activities use some VERY BIG objects, 
because we're analyzing such data over a time horizon of several years.

What I would like to see, at least in APLs that run in a UNIX System V, 
Release 4 environment, is the use of memory mapping to deal with external
variables.  Memory mapping simply maps part of your address space onto a
file. Our experience has been that memory mapping is extremely efficient,
extremely easy to use and almost transparent.

The work involved in implementing external variables using memory mapping
would be a fraction of that required to do it 'from scratch' and the
result would certainly be more efficient and transportable.
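[The idea can be sketched with the stdlib mmap facility; the file name,
element size, and indices below are invented for illustration.  An update
to element 22 of the "external variable" is an ordinary store through the
mapping, and a read of element 4 is an ordinary load; the OS pages the
change back to the file.

```python
import mmap
import os
import struct
import tempfile

# Scratch file of 1000 8-byte floats, initially zero.
path = os.path.join(tempfile.mkdtemp(), "extvar.dat")
n = 1000
with open(path, "wb") as f:
    f.write(b"\x00" * (8 * n))

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 8 * n)   # map the file into address space
    # "V[22] <- 3.14": a store through the mapping updates the file.
    struct.pack_into("<d", mm, 8 * 22, 3.14)
    # "4 pick V": a load through the mapping reads the file.
    (x,) = struct.unpack_from("<d", mm, 8 * 4)
    mm.flush()
    mm.close()
```

An interpreter doing this natively would hide the offsets and packing
behind the usual indexing syntax, which is what makes the approach nearly
transparent.]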

Regards,
Robert


From phage!jvnc.net!darwin.sura.net!wupost!zaphod.mps.ohio-state.edu!rpi!ispd-newsserver!psinntp!kepler1!andrew Thu Mar 26 14:48:33 EST 1992
Article: 1134 of comp.lang.apl
Path: phage!jvnc.net!darwin.sura.net!wupost!zaphod.mps.ohio-state.edu!rpi!ispd-newsserver!psinntp!kepler1!andrew
From: andrew@rentec.com (Andrew Mullhaupt)
Newsgroups: comp.lang.apl
Subject: Re: APL execution efficiency revisited
Message-ID: <762@kepler1.rentec.com>
Date: 25 Mar 92 23:15:39 GMT
References: <ROCKWELL.92Mar23224842@socrates.umd.edu> <756@kepler1.rentec.com> <ROCKWELL.92Mar25000046@socrates.umd.edu>
Organization: Renaissance Technologies Corp., Setauket, NY.
Lines: 36

In article <ROCKWELL.92Mar25000046@socrates.umd.edu> rockwell@socrates.umd.edu (Raul Deluth Miller-Rockwell) writes:
> Andrew Mullhaupt:

	[Statement of Matrix Chain Product omitted]

>Um.. before I get into the mechanisms one might use for implementing
>this, what sort of applications does this have?

There are three big applications of this:

	1. Showing that you can't _really_ expect to avoid bad inner
	   and outer products with any great generality.

	2. Introducing computer science students to dynamic programming,
	   and providing a test problem for fast dynamic programming
	   ideas.

	3. Giving an example where most people _think_ they know how to
	   write fast code, but they don't, unless they've studied
	   computer science, not just programming.

You mean _real_ applications? This is not a good question for language
design. When APL was designed, computers were small enough that N^3 wasn't
that much worse than N log N if a complicated algorithm put up that
constant. (N wasn't going to get that big, after all...) The dense approach
of APL seldom does worse than put an N^2 where a log N ought to be, and
if that constant ratio is 1000 or so, (typical for bad fit APL) then N
can be as big as 100 before the N log N will start to shade the N^3. If
the N^3 comes from a three dimensional array, then it has 10^6 elements.
You wouldn't do problems that big very often in 1963, so for when it was
born, APL made a good tradeoff. Based on the 'real' applications, APL is
great until about 1975! That (as we in the vastly advanced future can easily
see) does not save the design of APL.

Later,
Andrew Mullhaupt


From phage!jvnc.net!darwin.sura.net!wupost!usc!rpi!news-server.csri.toronto.edu!torsqnt!jtsv16!itcyyz!yrloc!rbe Thu Mar 26 21:04:59 EST 1992
Article: 1136 of comp.lang.apl
Newsgroups: comp.lang.apl
Path: phage!jvnc.net!darwin.sura.net!wupost!usc!rpi!news-server.csri.toronto.edu!torsqnt!jtsv16!itcyyz!yrloc!rbe
From: rbe@yrloc.ipsa.reuter.COM (Robert Bernecky)
Subject: Re: APL execution efficiency revisited
Message-ID: <1992Mar26.163242.5298@yrloc.ipsa.reuter.COM>
Reply-To: rbe@yrloc.ipsa.reuter.COM (Robert Bernecky)
Organization: Snake Island Research Inc, Toronto
References: <920322073241_70530.1226_CHC87-1@CompuServe.COM> <1992Mar23.185558.2647@csi.jpl.nasa.gov> <ROCKWELL.92Mar23224842@socrates.umd.edu> <756@kepler1.rentec.com>
Date: Thu, 26 Mar 92 16:32:42 GMT
Lines: 49

In article <756@kepler1.rentec.com> andrew@rentec.com (Andrew Mullhaupt) writes:
>Optimizing the calculation of a matrix product is a classical problem
>in computer science. The idea is that some of the intermediate products
>are much smaller than others, depending on the sequence of shapes.
>In order to 'efficiently' compute the product, you usually have to 
>solve a dynamic program whose inputs are these shapes, and then do
>the matrix arithmetic.
>
>The Matrix Chain Product problem is for example, an exercise in Horowitz
>and Sahni's _Fundamentals of Computer Algorithms_ (p.242 - 243) and
>in Sedgewick's _Algorithms in C_ pp. 598ff. Sedgewick has an example
>where left-to-right order uses 6024 multiplications and right-to-left
>order uses 274,200 multiplications. (I wonder if he had APL in mind...)
>
>Recall that when I raised this subject, the point was that it can be
>difficult to avoid putting inner/outer products in a bad order when
>writing code. This is a classical, well understood issue in computer
>science. Either you're going to solve the dynamic program for every
>matrix chain or you're going to accept a _lot_ of excess calculation.
>
>Precisely. Usually all that is needed is to know the shapes, then
>you can decide what order to do the multiplications in. Of course
>the interesting case is when the shapes are changing at run time...
>
>
Note that APL is ideally set to perform the operations in any order
ASSUMING YOU DON'T CARE ABOUT PRECISION LOSS: The list of arrays and
their shape is immediately available at run time, whereas it can be
lost in a swirl of DO-loops and other junk in other languages.

In SHARP APL, we had a whole bunch of different ways to do matrix 
product.  The appropriate one was picked at run time based on
matrix shape (fat vs. skinny), available workspace, the two functions
involved in the product, and on the two data types involved.
No big deal to include a bit more code to handle the reduction 
across a bunch of arrays. It is obvious that you pay a performance
penalty for doing this, but you DO get the ability to do 
the reduction in ANY order, not just left to right, or right to left. 
Bob





Robert Bernecky      rbe@yrloc.ipsa.reuter.com  bernecky@itrchq.itrc.on.ca 
Snake Island Research Inc  (416) 368-6944   FAX: (416) 360-4694 
18 Fifth Street, Ward's Island
Toronto, Ontario M5J 2B9 
Canada


From phage!jvnc.net!darwin.sura.net!haven.umd.edu!socrates!socrates!rockwell Fri Mar 27 09:28:03 EST 1992
Article: 1140 of comp.lang.apl
Newsgroups: comp.lang.apl
Path: phage!jvnc.net!darwin.sura.net!haven.umd.edu!socrates!socrates!rockwell
From: rockwell@socrates.umd.edu (Raul Deluth Miller-Rockwell)
Subject: Re: APL execution efficiency revisited
In-Reply-To: andrew@rentec.com's message of 25 Mar 92 23:15:39 GMT
Message-ID: <ROCKWELL.92Mar26223823@socrates.umd.edu>
Sender: rockwell@socrates.umd.edu (Raul Deluth Miller-Rockwell)
Organization: Traveller
References: <ROCKWELL.92Mar23224842@socrates.umd.edu> <756@kepler1.rentec.com>
	<ROCKWELL.92Mar25000046@socrates.umd.edu> <762@kepler1.rentec.com>
Date: Fri, 27 Mar 1992 03:38:23 GMT

I wrote:
   >Um.. before I get into the mechanisms one might use for
   >implementing this, what sort of applications does this have?

Andrew Mullhaupt:
   There are three big applications of this:

           1. Showing that you can't _really_ expect to avoid bad
              inner and outer products with any great generality.

I don't see why you say this.

           2. Introducing computer science students to dynamic
              programming, and providing a test problem for fast
              dynamic programming ideas.

And?  I mean, what's the application for this style of dynamic
programming? 

           3. Giving an example where most people _think_ they know
              how to write fast code, but they don't, unless they've
              studied computer science, not just programming.

Fast code depends on an understanding of the amount of time required
to evaluate the underlying operations.  In any event, I'm not sure I
understand the thrust of this statement.

   You mean _real_ applications? This is not a good question for
   language design.

Sure it is.  It can give you a feel for what you're working with, can
inspire good names for the functions you define, and can be useful
in testing the results.

Anyways, below is a script I put together to express what I believe is
a solution to this problem.  Care to comment on it?

-- 
Raul Deluth Miller-Rockwell                   <rockwell@socrates.umd.edu>



NB. fake dot product, so we can look at results

M        =. <@":"0@i.
plus     =. (    [  , '+'"1 ,    ]  )&.>
times    =. (    [  , '*'"1 ,    ]  )&.>
paren    =. ( '('"1 ,    ]  , ')'"1 )&.>
dot      =. paren @(plus/ . times)

NB. dry run.
NB. I expect that reducing the 'big axis' first results in the least
NB. overall work.

   (M 2 3) dot (M 3 2) dot (M 2 2)              NB. should be inefficient
+-------------------------------------+-------------------------------------+
|(0*(0*0+1*2)+1*(2*0+3*2)+2*(4*0+5*2))|(0*(0*1+1*3)+1*(2*1+3*3)+2*(4*1+5*3))|
+-------------------------------------+-------------------------------------+
|(3*(0*0+1*2)+4*(2*0+3*2)+5*(4*0+5*2))|(3*(0*1+1*3)+4*(2*1+3*3)+5*(4*1+5*3))|
+-------------------------------------+-------------------------------------+

   ((M 2 3) dot (M 3 2)) dot (M 2 2)            NB. should be efficient
+---------------------------------+---------------------------------+
|((0*0+1*2+2*4)*0+(0*1+1*3+2*5)*2)|((0*0+1*2+2*4)*1+(0*1+1*3+2*5)*3)|
+---------------------------------+---------------------------------+
|((3*0+4*2+5*4)*0+(3*1+4*3+5*5)*2)|((3*0+4*2+5*4)*1+(3*1+4*3+5*5)*3)|
+---------------------------------+---------------------------------+

NB. ok, we've got a starting place.

NB. given a boxed list of n arrays, return a 2 by n boxed array, where
NB. the first column indicates 'desired order of operation'.  The
NB. array with the lowest number should be multiplied by its
NB. immediate neighbor to the left (that is, the row above) before any
NB. other multiplications.

rate=.   ;~"0   /: @\: @:(#&>)

NB. Sample data:
   list =. 2  <@M@,/\  1+6?20           NB. random list of arrays
   $&>  list                            NB. shapes of arrays
18  5
 5 10
10  8
 8 13
13 15
   > {."1 rate list                     NB. desired order of reduction
0 4 2 3 1

NB. ignore the first entry -- it's just a place holder
NB.
NB. We need a routine to split our list of matrices into two pieces,
NB. with the split occurring at the location indicated by the highest
NB. rating in the list.  More specifically, we need a routine which,
NB. given a list of ratings (e.g. 0 4 2 3 1) will give the number of
NB. matrices to keep on the left half of the split (in this case, 1
NB. matrix).

choose=.  (i. >./@}.)  @> @:({."1)

NB. test it
     choose rate list                   NB. should split on smallest dimension
1

NB. there are four cases we'll have to deal with in this 'optimal
NB. reduction', and they correspond to lists of length 0, 1, 2, and
NB. more than 2.

case0    =.   (i. 0 0)"]        NB. return an empty matrix
case1    =.   > @((<0 1)&{)     NB. return the matrix
case2    =.   dot&>&(1&{)/      NB. multiply the matrices

switch   =. 3&<.@#              NB. classify length of array

opt_red  =. case0`case1`case2`(choose ($:@{. dot $:@}.) ]) @.switch

NB. for test purposes, lets make a definition of 'dot' which shows the
NB. operations that would have happened

dot=.  times & paren &.<

NB. test this fake dot product (just shows shapes and order of operation)
   (": 2 3) dot (": 3 4)
(2 3)*(3 4)

NB. For test purposes, we'll also need to use a fake list which holds
NB. formatted shape information in place of the arrays:

   ] fake_list =. (0 _1 }. rate list) ,.  ": @$&.> list
+-+-----+
|0|18 5 |
+-+-----+
|4|5 10 |
+-+-----+
|2|10 8 |
+-+-----+
|3|8 13 |
+-+-----+
|1|13 15|
+-+-----+

NB. ok, acid test time:
   opt_red  fake_list
(18 5)*(((5 10)*(10 8))*((8 13)*(13 15)))

NB. Looks good to me.

NB. let's restore our earlier 'test' version of dot:

dot      =. paren @(plus/ . times)

NB. and repeat our earlier example, but see if the computer can pick
NB. the right option.

   opt_red rate (M 2 3) ; (M 3 2) ; < (M 2 2)
+---------------------------------+---------------------------------+
|((0*0+1*2+2*4)*0+(0*1+1*3+2*5)*2)|((0*0+1*2+2*4)*1+(0*1+1*3+2*5)*3)|
+---------------------------------+---------------------------------+
|((3*0+4*2+5*4)*0+(3*1+4*3+5*5)*2)|((3*0+4*2+5*4)*1+(3*1+4*3+5*5)*3)|
+---------------------------------+---------------------------------+

NB. looks good.

NB. Finally, let's check this thing  for correctness

dot=. +/ .*
reduce=. opt_red @rate

   res1=. reduce         i.@$&.> list
   res2=. >   +/ .* &.>/ i.@$&.> list
   res1 -: res2
1
   $ res1
18 15
   7 7 { res1
7.41958e10 7.49855e10 7.57752e10  7.6565e10 7.73547e10 7.81444e10 7.89341e10
 2.0833e11 2.10547e11 2.12765e11 2.14982e11   2.172e11 2.19417e11 2.21634e11
3.42464e11 3.46109e11 3.49754e11 3.53399e11 3.57045e11  3.6069e11 3.64335e11
4.76599e11 4.81671e11 4.86744e11 4.91817e11 4.96889e11 5.01962e11 5.07035e11
6.10733e11 6.17233e11 6.23734e11 6.30234e11 6.36734e11 6.43235e11 6.49735e11
7.44867e11 7.52795e11 7.60723e11 7.68651e11 7.76579e11 7.84507e11 7.92435e11
8.79001e11 8.88357e11 8.97713e11 9.07069e11 9.16424e11  9.2578e11 9.35136e11

NB. presumably somebody gets some benefit from this

NB. that's all, folks


From phage!jvnc.net!yale.edu!qt.cs.utexas.edu!zaphod.mps.ohio-state.edu!uunet.ca!geac!itcyyz!yrloc!rbe Mon Mar 30 16:41:45 EST 1992
Article: 1153 of comp.lang.apl
Newsgroups: comp.lang.apl
Path: phage!jvnc.net!yale.edu!qt.cs.utexas.edu!zaphod.mps.ohio-state.edu!uunet.ca!geac!itcyyz!yrloc!rbe
From: rbe@yrloc.ipsa.reuter.COM (Robert Bernecky)
Subject: Re: APL execution efficiency revisited
Message-ID: <1992Mar27.153146.15426@yrloc.ipsa.reuter.COM>
Reply-To: rbe@yrloc.ipsa.reuter.COM (Robert Bernecky)
Organization: Snake Island Research Inc, Toronto
References: <920322073241_70530.1226_CHC87-1@CompuServe.COM> <1992Mar23.185558.2647@csi.jpl.nasa.gov> <ROCKWELL.92Mar23224842@socrates.umd.edu> <756@kepler1.rentec.com> <ROCKWELL.92Mar25000046@socrates.umd.edu>
Date: Fri, 27 Mar 92 15:31:46 GMT
Lines: 39

In article <ROCKWELL.92Mar25000046@socrates.umd.edu> rockwell@socrates.umd.edu (Raul Deluth Miller-Rockwell) writes:
>
>Um.. before I get into the mechanisms one might use for implementing
>this, what sort of applications does this have?

The question which vi deleted for me (Arggh) was: 
  What sort of thing would I use  the expression:
     +.* reduce a;b;c;d   

for when a, b, c, and d are different shaped arrays.
I have only encountered one case where this was useful in practice.

In doing etch-a-sketch(tm) graphics, aka line drawings, you represent
lines as an n by 2 table of coordinates.  If you matrix multiply the
table (I seem to recall pasting on a column of ones somewhere) by a
3 by 3 matrix, you can perform linear transforms, including scaling,
rotation, and translation (and skew).

So, I might do: 
  drawing +.* T +.* R +.* S

to combine several transforms before applying them to a BIG drawing.
I could also do:
   ((drawing +.* S) +.* R) +.* T
but this does a lot more arithmetic if you are using a serial 
computer such as a pencil.

That's why ordering (aside from little details like getting the
right floating point answer) is important.
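[A sketch of the arithmetic savings, with invented transforms; points are
homogeneous row vectors (x, y, 1), so translation joins scaling and
rotation as a 3 by 3 matrix multiply.  Combining the small 3x3s first
costs two cheap products plus one pass over the big drawing, versus three
passes over the drawing; associativity guarantees the same answer.

```python
import math

def matmul(a, b):
    """Plain row-by-column product of nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Made-up transforms: scale by 2, rotate 90 degrees, translate by (5, -1).
S = [[2, 0, 0], [0, 2, 0], [0, 0, 1]]
c, s = math.cos(math.pi / 2), math.sin(math.pi / 2)
R = [[c, s, 0], [-s, c, 0], [0, 0, 1]]
T = [[1, 0, 0], [0, 1, 0], [5, -1, 1]]

# A tiny "drawing": n points, each row (x, y, 1).
drawing = [[x, y, 1] for x, y in [(0, 0), (1, 0), (1, 1), (0, 1)]]

# Combine the 3x3s first, then hit the drawing once.
fast = matmul(drawing, matmul(S, matmul(R, T)))

# Versus transforming the whole n x 3 table three times.
slow = matmul(matmul(matmul(drawing, S), R), T)
```

For an n-point drawing the first order does 2*27 + 9n scalar multiplies,
the second 27n, which is Bernecky's point about serial computers such as
pencils.]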
BOb




Robert Bernecky      rbe@yrloc.ipsa.reuter.com  bernecky@itrchq.itrc.on.ca 
Snake Island Research Inc  (416) 368-6944   FAX: (416) 360-4694 
18 Fifth Street, Ward's Island
Toronto, Ontario M5J 2B9 
Canada


From phage!jvnc.net!yale.edu!think.com!sdd.hp.com!hp-cv!ogicse!das-news.harvard.edu!spdcc!dirtydog.ima.isc.com!ispd-newsserver!psinntp!kepler1!andrew Mon Mar 30 16:43:29 EST 1992
Article: 1154 of comp.lang.apl
Path: phage!jvnc.net!yale.edu!think.com!sdd.hp.com!hp-cv!ogicse!das-news.harvard.edu!spdcc!dirtydog.ima.isc.com!ispd-newsserver!psinntp!kepler1!andrew
From: andrew@rentec.com (Andrew Mullhaupt)
Newsgroups: comp.lang.apl
Subject: Re: APL execution efficiency revisited
Message-ID: <771@kepler1.rentec.com>
Date: 27 Mar 92 23:08:00 GMT
Article-I.D.: kepler1.771
References: <ROCKWELL.92Mar23224842@socrates.umd.edu> <756@kepler1.rentec.com> <1992Mar26.163242.5298@yrloc.ipsa.reuter.COM>
Organization: Renaissance Technologies Corp., Setauket, NY.
Lines: 61

In article <1992Mar26.163242.5298@yrloc.ipsa.reuter.COM> rbe@yrloc.ipsa.reuter.COM (Robert Bernecky) writes:
>In article <756@kepler1.rentec.com> andrew@rentec.com (Andrew Mullhaupt) writes:
>>Precisely. Usually all that is needed is to know the shapes, then
>>you can decide what order to do the multiplications in. Of course
>>the interesting case is when the shapes are changing at run time...

>Note that APL is ideally set to perform the operations in any order
>ASSUMING YOU DON'T CARE ABOUT PRECISION LOSS:

Umm - I can't imagine being completely indifferent to precision loss. If
I were, I could compute any answer in one operation: 'return 0'.

> ....The list of arrays and
>their shape is immediately available at run time, whereas it can be
>lost in a swirl of DO-loops and other junk in other languages.

It can be lost in a swirl in APL. If the arrays are computed one after
the other, and memory is too small to hold two of them, and you don't
know the shapes until run time, it is not simple to avoid doing the 
non-optimal calculation. In any language.

>IN SHARP APL, we had a whole bunch of different ways to do matrix 
>product. The appropriate one was picked at run time based on
>matrix shape(fat vs skinny), available workspace, the two functions
>involved in the product, and on the two data types involved.
>No big deal to include a bit more code to handle the reduction 
>across a bunch of arrays. It is obvious that you pay a performance
>penalty for doing this, but you DO get the ability to do 
>the reduction in ANY order, not just left to right, or right to left. 

Now for the particular case  of +.x / (A B C D ...) you will be able to
do something, but only in the case where you know about all of the matrices
A, B, C, ... before you do the products. If I hand you the matrices one
at a time, it is much harder to optimize the product, since you do it
'pairwise'.

Note that optimizing the product of two matrices is almost irrelevant
to the case of optimizing a product of N>2 matrices. The concerns in
the N > 2 case involve how to associate the matrices, i.e. how to group
the parentheses, A +.x (B +.x C) vs. (A +.x B) +.x C, but the N=2 case
involves only issues like going along rows or columns to get nice machine
specific performance. It goes without saying that I _assume_ the pairwise
products are optimized for memory throughput. However, you can quickly
convince yourself that the number of floating point multiplications is
not affected if you rearrange the calculations.

You can see by the simple example where A is NxN, B is NxN, and C is Nx1,
that the number of multiplications required is:

	Product Order  			# Multiplications
	---------------			-----------------
	A +.x (B +.x C) 		       2N^2
	(A +.x B) +.x C			    N^3 + N^2

Now it isn't really important if you are smart about going by columns or
rows, since the thrashing/paging penalty of your OS is a fixed number.
Simply put N at 10 times this penalty factor, and you will prefer the
ordering A +.x (B +.x C).

Later,
Andrew Mullhaupt


From phage!jvnc.net!darwin.sura.net!wupost!zaphod.mps.ohio-state.edu!rpi!news-server.csri.toronto.edu!torsqnt!jtsv16!itcyyz!yrloc!rbe Mon Mar 30 19:18:59 EST 1992
Article: 1164 of comp.lang.apl
Newsgroups: comp.lang.apl
Path: phage!jvnc.net!darwin.sura.net!wupost!zaphod.mps.ohio-state.edu!rpi!news-server.csri.toronto.edu!torsqnt!jtsv16!itcyyz!yrloc!rbe
From: rbe@yrloc.ipsa.reuter.COM (Robert Bernecky)
Subject: Re: APL execution efficiency revisited
Message-ID: <1992Mar30.203818.15221@yrloc.ipsa.reuter.COM>
Reply-To: rbe@yrloc.ipsa.reuter.COM (Robert Bernecky)
Organization: Snake Island Research Inc, Toronto
References: <ROCKWELL.92Mar23224842@socrates.umd.edu> <756@kepler1.rentec.com> <1992Mar26.163242.5298@yrloc.ipsa.reuter.COM> <771@kepler1.rentec.com>
Date: Mon, 30 Mar 92 20:38:18 GMT

>I were, I could compute any answer in one operation: 'return 0'.
>
>
>Now for the particular case  of +.x / (A B C D ...) you will be able to
>do something, but only in the case where you know about all of the matrices
>A, B, C, ... before you do the products. If I hand you the matrices one
>at a time, it is much harder to optimize the product, since you do it
>'pairwise'.

So, if you hand me the matrices one at a time, the whole question 
of ordering is moot.  This is like trying to nail Jello to a tree.
Can we please stick with one problem at a time?

-----------------------------------------------------------------

My point about precision loss is that the loss in precision depends on
the ordering of the matrix product, which is DATA-sensitive, and is,
I suspect, a much harder problem to solve than the how-do-I-reduce-
the-number-of-multiplies problem. 


>
>Note that optimizing the product of two matrices is almost irrelevant
>to the case of optimizing a product of N>2 matrices. The concerns in
>the N > 2 case involve how to associate the matrices, i.e. how to group
>the parentheses, A +.x (B +.x C) vs. (A +.x B) +.x C, but the N=2 case
>involves only issues like going along rows or columns to get nice machine
>specific performance. It goes without saying that I _assume_ the pairwise
>products are optimized for memory throughput. However, you can quickly
>convince yourself that the number of floating point multiplications is
>not affected if you rearrange the calculations.
>
>You can see by the simple example where A is NxN, B is NxN, and C is Nx1,
>that the number of multiplications required is:
>
>	Product Order  			# Multiplications
>	---------------			-----------------
>	A +.x (B +.x C) 		       2N^2
>	(A +.x B) +.x C			    N^3 + N^2
>

This was, I thought, the original point of this thread. My point, 
which you seem to be missing, is that IF I know that I have 
M arrays, and IF I know their shapes, then I can DYNAMICALLY decide
the best (in the sense of minimizing multiplies) ordering for the
operation. Now, let's see you do the same in Fortran. No cheating now:
array sizes are to be determined at run time. 
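[The run-time ordering decision described here is what a later post in this thread calls "the normal O(n^3) dynamic program." A minimal sketch, in Python for neutrality between the two camps (function names are illustrative, not from any APL or FORTRAN system discussed here):

```python
def chain_order(dims):
    """Minimum scalar multiplications to evaluate a matrix chain.

    dims[i], dims[i+1] give the shape of matrix i, so a chain of n
    matrices is described by n+1 dimensions -- shapes known only at
    run time, as in the APL case.  Returns (cost, parenthesization).
    """
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    split = [[0] * n for _ in range(n)]
    for span in range(2, n + 1):              # subchain length
        for i in range(n - span + 1):
            j = i + span - 1
            cost[i][j] = float("inf")
            for k in range(i, j):             # try every split point
                c = (cost[i][k] + cost[k + 1][j]
                     + dims[i] * dims[k + 1] * dims[j + 1])
                if c < cost[i][j]:
                    cost[i][j] = c
                    split[i][j] = k

    def paren(i, j):
        if i == j:
            return chr(ord("A") + i)
        k = split[i][j]
        return "(%s %s)" % (paren(i, k), paren(k + 1, j))

    return cost[0][n - 1], paren(0, n - 1)

# The NxN, NxN, Nx1 example quoted above, with N = 10: the dynamic
# program finds the right-associated order at 2*N^2 = 200 multiplies.
print(chain_order([10, 10, 10, 1]))   # -> (200, '(A (B C))')
```

An interpreter holding the shapes of all the operands at reduction time could run exactly this computation before issuing a single multiply. -- ed.]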



Robert Bernecky      rbe@yrloc.ipsa.reuter.com  bernecky@itrchq.itrc.on.ca 
Snake Island Research Inc  (416) 368-6944   FAX: (416) 360-4694 
18 Fifth Street, Ward's Island
Toronto, Ontario M5J 2B9 
Canada


From phage!jvnc.net!darwin.sura.net!ukma!asuvax!cs.utexas.edu!utgpu!watserv1!70530.1226@compuserve.com Tue Mar 31 14:41:37 EST 1992
Article: 1169 of comp.lang.apl
Path: phage!jvnc.net!darwin.sura.net!ukma!asuvax!cs.utexas.edu!utgpu!watserv1!70530.1226@compuserve.com
From: 70530.1226@CompuServe.COM (Mike Kent)
Newsgroups: comp.lang.apl
Subject: Matrix chain product:  FORTRAN vs. APL title bout
Message-ID: <920331064737_70530.1226_CHC146-1@CompuServe.COM>
Date: 31 Mar 92 06:47:37 GMT
Sender: root@watserv1.waterloo.edu (Operator)
Organization: University of Waterloo
Lines: 18

In article <1992Mar30.203818.15221@yrloc.ipsa.reuter.COM>,
rbe@yrloc.ipsa.reuter.COM (Robert Bernecky) writes concerning order of
evaluation in multiple matrix products:

 > ... I can DYNAMICALLY decide the best ... ordering for the operation.
 > Now, let's see you do the same in Fortran.  No cheating now:  array
 > sizes are to be determined at run time.

Mmmm.  Unless the shape is known a priori or passed in, FORTRAN hasn't
a clue about the dimensions of a matrix.  So if you're going to allow 
FORTRAN to play in this sandbox, you _have_ to allow shapes to be passed.
Once you do, the game is over:  I pass three arguments in to the FORTRAN
routine:  [1] K, the number of matrices; [2] S, the Kx2 table of shapes;
and [3] V, a vector of the +/x/S matrix entries.  I can then pass pairs of
shapes and pairs of matrices to a standard matrix-product library routine
(by pointing to rows of S and items of V).  It's just work from there on,
probably not even _too_ hard, modulo determination of optimal order.
This might make a useful external function.
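[Kent's K/S/V convention can be sketched outside FORTRAN.  The Python below is an illustration of mine, not his code; it unflattens the K matrices from the shape table and the flat value vector, then reduces them with pairwise products right-to-left, deliberately ignoring the optimal-order question he defers:

```python
def chain_from_flat(K, S, V):
    """Rebuild K matrices from a Kx2 shape table S and a flat vector V,
    then reduce them with pairwise matrix products, right to left."""
    mats, pos = [], 0
    for r, c in S:                          # carve V into row-major matrices
        flat = V[pos:pos + r * c]
        pos += r * c
        mats.append([flat[i * c:(i + 1) * c] for i in range(r)])

    def matmul(a, b):                       # stand-in for the library routine
        return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
                 for j in range(len(b[0]))] for i in range(len(a))]

    result = mats[-1]
    for m in reversed(mats[:-1]):
        result = matmul(m, result)
    return result

# Two 2x2 matrices and a 2x1 vector, flattened: I, 2I, (3 4)^T.
S = [(2, 2), (2, 2), (2, 1)]
V = [1, 0, 0, 1,   2, 0, 0, 2,   3, 4]
print(chain_from_flat(3, S, V))   # -> [[6], [8]]
```

-- ed.]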


From phage!jvnc.net!darwin.sura.net!europa.asd.contel.com!uunet!psinntp!kepler1!andrew Wed Apr  1 10:02:41 EST 1992
Article: 1173 of comp.lang.apl
Path: phage!jvnc.net!darwin.sura.net!europa.asd.contel.com!uunet!psinntp!kepler1!andrew
From: andrew@rentec.com (Andrew Mullhaupt)
Newsgroups: comp.lang.apl
Subject: Re: APL execution efficiency revisited
Message-ID: <783@kepler1.rentec.com>
Date: 31 Mar 92 16:28:04 GMT
References: <1992Mar26.163242.5298@yrloc.ipsa.reuter.COM> <771@kepler1.rentec.com> <1992Mar30.203818.15221@yrloc.ipsa.reuter.COM>
Organization: Renaissance Technologies Corp., Setauket, NY.
Lines: 118

In article <1992Mar30.203818.15221@yrloc.ipsa.reuter.COM> rbe@yrloc.ipsa.reuter.COM (Robert Bernecky) writes:
>>Now for the particular case  of +.x / (A B C D ...) you will be able to
>
>So, if you hand me the matrices one at a time, the whole question 
>of ordering is moot. This is like trying to nail Jello to a tree.
>Can we please stick to one problem at a time?
Sure, but I originally posed this not as a problem but as an example where
you cannot avoid doing 'bad' inner and outer products. I _never_ stated
that you were allowed to assume these matrices were given to you in a vector,
but some people have continually assumed this. It should also be noted that
the Matrix Chain Product was not brought up as an example where APL is any worse
than FORTRAN, but as an example where you have to do bad inner products
in nearly any language.


Now as it happens, Raul has posted a possible optimization for this restricted
version of the problem, and I have not had time to determine its complexity.
I suspect he has come up with the normal O(n^3) dynamic program, 
which in view of Yao's O(n^2) speedup must be considered the wrong answer.
I intend to make a comprehensive post about APL and dynamic programming in the
next week or so, but it is necessary to do a few literature searches before
I can complete the article. For those who want a running start, it is going
to take Yao's convex quadrangle approach as a point of departure.
>
>-----------------------------------------------------------------
>
>My point about precision loss is that the loss in precision depends on
>the ordering of the matrix product, which is DATA-sensitive, and is,
>I suspect, a much harder problem to solve than the how-do-I-reduce-
>the-number-of-multiplies problem. 

Not so hard. Although inner products are the most pernicious thing in all
numerical analysis (Cf. Golub and van Loan, etc.) you can pretty much
eliminate the loss of precision by re-representing everything in polar (QR)
form using orthogonal transformations, and then do your multiplications
that way. Now almost everyone uses this approach when solving equations
(or inverting matrices, if we must be _that_ gauche) but almost nobody
does their multiplies this way, since it's about ten times as expensive as
the normal way. Note that this method is general enough that I would call
it non DATA-sensitive...
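[A sketch of the QR approach described above, under the assumption of small full-rank square matrices; classical Gram-Schmidt stands in here for the Householder reflections a real implementation would use, and the code is mine, not Mullhaupt's:

```python
def qr(a):
    """Classical Gram-Schmidt QR of a small full-rank square matrix
    given as a list of rows.  Returns (Q, R) with a = Q @ R."""
    n = len(a)
    q_cols, r = [], [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = [a[i][j] for i in range(n)]                  # j-th column
        for i, qc in enumerate(q_cols):                  # remove projections
            r[i][j] = sum(qc[k] * a[k][j] for k in range(n))
            v = [v[k] - r[i][j] * qc[k] for k in range(n)]
        r[j][j] = sum(x * x for x in v) ** 0.5
        q_cols.append([x / r[j][j] for x in v])
    Q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return Q, r

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def chain_product_qr(mats):
    """Keep the running product in factored form Q*R, absorbing each
    new matrix through a fresh QR of (R @ M).  Several times the work
    of plain multiplication, as noted above, but the growth of the
    product is carried by orthogonal (well-conditioned) factors."""
    Q, R = qr(mats[0])
    for m in mats[1:]:
        Q2, R2 = qr(matmul(R, m))
        Q, R = matmul(Q, Q2), R2
    return matmul(Q, R)
```

-- ed.]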
>
>>You can see by the simple example where A is NxN, B is NxN, and C is Nx1,
>>that the number of multiplications required is:
>>
>>	Product Order  			# Multiplications
>>	---------------			-----------------
>>	A +.x (B +.x C) 		       2N^2
>>	(A +.x B) +.x C			    N^3 + N^2
>
>This was, I thought, the original point of this thread. My point, 
>which you seem to be missing, is that IF I know that I have 
>M arrays, and IF I know their shapes, then I can DYNAMICALLY decide
>the best (in the sense of minimizing multiplies) ordering for the
>operation. Now, let's see you do the same in Fortran. No cheating now:
>array sizes are to be determined at run time. 

Who said this could be done easily in FORTRAN? This is the one I said was
hard in general! 

This thread has had a little history and a couple branches, so I'm not too
surprised that it's become tangled. Let me set out a little summary here,
so we'll all have a chance to get back on the same page.


1. A guy asked how to do intersection in APL2. I posted a very slick 
   solution, which by coincidence is one of my bag of anti-APL examples.
   Since I had seen nothing but 'J' in this group for months, I thought
   I'd take this chance to stir things up by pointing out how, although
   this intersection idiom is really the best you can do in APL2, it's
   a piker compared to any compiled language. (Which it is). I pointed
   out that it's relatively slow and very hard to read. (Which it is).


2. At this point, the APL2 community made its predictable response, which
   was in four parts.
      a) Some of the people rose to the bait "Oh yeah? what do you say _I_
         cannot do in APL?" I gave them the matrix bandwidth problem.
         One guy posted an "APL is faster" example of a partition sum. I'm
         not posting anything in response to him until I can figure out
         how his FORTRAN program was that slow. My present hypothesis is that
         he left an extra test in the loop. He mentioned he didn't have
         a vector head, so he might not have thought this important.

      b) One of the (probably more experienced) APL programmers posted
         a very good list of things not to do in APL, probably thinking
         he was helping a 'newbie' learn APL. Well his advice is very good
         and _should_ be in most manuals and textbooks. For the two years
         (out of more than twenty years of computing) which I used APL2
         I used this kind of advice almost every hour. He did mention to
         avoid bad ordering of inner/outer products. I thought it would be
         useful to point out that you cannot always follow this advice.
         I gave the Matrix chain product as an example, and specifically
         pointed out that this is a 'showstopper' in almost any language.
         Several people have tried to show that APL is actually better
         suited to this optimization than not. More on this in the future.

      c) Some people said 'nobody's forcing you to use APL'. That's not true.
         When you have to optimize an APL function, you normally have to call
         it from APL. This has recently been fixed in APL2 but not in most
         APL's yet. 

      d) Several people mentioned APL compilation. We knew enough about APL
         compilation _not_ to use it, but things may be better now. I'm
         waiting to hear from these people on the matrix bandwidth problem
         with actual timings. We have, however, pointed out using 2b that
         demand-driven evaluation doesn't do _everything_.

It is important to keep in mind that so far, the Matrix chain product has only
been used to show that you can't avoid bad inner products, and as a convenient
example where demand-driven evaluation isn't optimal. Now if Raul can only
hold on a little longer, I think  I'll be able to show him the use of it in
practice. But for the moment, maybe we can pin down which questions are open
and what relates to what.

Later,
Andrew Mullhaupt




From phage!jvnc.net!yale.edu!think.com!ames!haven.umd.edu!socrates!socrates!rockwell Fri Apr  3 00:35:42 EST 1992
Article: 1183 of comp.lang.apl
Newsgroups: comp.lang.apl
Path: phage!jvnc.net!yale.edu!think.com!ames!haven.umd.edu!socrates!socrates!rockwell
From: rockwell@socrates.umd.edu (Raul Deluth Miller-Rockwell)
Subject: Re: AL efficiency revisited
In-Reply-To: Mike Kent's message of Thu, 2 Apr 1992 08:28:26 GMT
Message-ID: <ROCKWELL.92Apr2194843@socrates.umd.edu>
Sender: rockwell@socrates.umd.edu (Raul Deluth Miller-Rockwell)
Organization: Traveller
References: <920402082825_70530.1226_CHC47-1@CompuServe.COM>
Date: Fri, 3 Apr 1992 00:48:43 GMT

Mike Kent:
   And here's the fortran code, reconstructed from memory, but sans capital
   letters:

          subroutine partsum (count1, count2, lenvec, itmvec, result)

um... but what does the code do?  I seem to have missed that part.

-- 
Raul Deluth Miller-Rockwell                   <rockwell@socrates.umd.edu>


From phage!jvnc.net!yale.edu!qt.cs.utexas.edu!cs.utexas.edu!utgpu!watserv1!70530.1226@compuserve.com Fri Apr  3 14:56:12 EST 1992
Article: 1180 of comp.lang.apl
Newsgroups: comp.lang.apl
Path: phage!jvnc.net!yale.edu!qt.cs.utexas.edu!cs.utexas.edu!utgpu!watserv1!70530.1226@compuserve.com
From: Mike Kent <70530.1226@CompuServe.COM>
Subject: Re:  AL efficiency revisited
Message-ID: <920402082825_70530.1226_CHC47-1@CompuServe.COM>
Sender: daemon@watserv1.waterloo.edu
Organization: University of Waterloo
Date: Thu, 2 Apr 1992 08:28:26 GMT
Lines: 54

In one of yesterday's articles, Andrew Mullhaupt refers to two APL
programmers, remarks that this thread has wandered and that he has been
misunderstood, and speculates about some FORTRAN code.


The two APLers:  it was the same guy, namely me.

Wandering:  threads _will_ wander.

Misunderstandings:  not everything in the thread is a reply to the 
original article (see immediately above).

And here's the fortran code, reconstructed from memory, but sans capital
letters:


       subroutine partsum (count1, count2, lenvec, itmvec, result)
       integer*4 count1, count2, itmndx
       integer*4 lenvec(count1), itmvec(count2), result(count1)

       itmndx=0
       do 40 j=1,count1,1
           result(j) = 0
           do 20 k=1,lenvec(j),1
              itmndx = 1+itmndx
              result(j)=result(j)+itmvec(itmndx)
  20       continue
  40   continue
       return
       end


From a quick scan, it looks to me like this could be coded in assembler
so as to do the work using only general registers, which means that 
there should be _no_ reads except for the count parameters and one touch 
on each item of {lenvec} and of {itmvec}, and only one write to {result}   
for each item of {lenvec}, and the source is short enough that I expected
the compiler to produce that kind of machine code.  I don't have the
compiler output (it _has_ been several years), but I looked at the
pseudo-assembler listing back when, and there was a lot of memory I/O --
the comparisons for the loop variables were the culprits, as I recall.
The obvious assembler techniques -- "BCTR" to control the loops and
"LA k,4(j,k)" to advance the data indices -- were _not_ generated.  For
OPTIMIZE(0) compilations, this is OK, but for OPTIMIZE(2)???  

BTW, what test can I eliminate from this code?

In any case, the points I wanted to make were two:  first, compilation
doesn't always mean you get close-to-optimal code; and second, the
assembler code which underlies APL2 is in many cases _very_ well tuned,
probably better than average pretty-good compiler output.  
         




From phage!jvnc.net!darwin.sura.net!haven.umd.edu!socrates!socrates!rockwell Fri Apr  3 15:09:12 EST 1992
Article: 1178 of comp.lang.apl
Newsgroups: comp.lang.apl
Path: phage!jvnc.net!darwin.sura.net!haven.umd.edu!socrates!socrates!rockwell
From: rockwell@socrates.umd.edu (Raul Deluth Miller-Rockwell)
Subject: Re: APL execution efficiency revisited
In-Reply-To: andrew@rentec.com's message of 31 Mar 92 16:28:04 GMT
Message-ID: <ROCKWELL.92Apr1185212@socrates.umd.edu>
Sender: rockwell@socrates.umd.edu (Raul Deluth Miller-Rockwell)
Organization: Traveller
References: <1992Mar26.163242.5298@yrloc.ipsa.reuter.COM> <771@kepler1.rentec.com>
	<1992Mar30.203818.15221@yrloc.ipsa.reuter.COM> <783@kepler1.rentec.com>
Date: Wed, 1 Apr 1992 23:52:12 GMT

Andrew Mullhaupt:
   Now as it happens, Raul has posted a possible optimization for this
   restricted version of the problem, and I have not had time to
   determine its complexity.

By restricted, I presume you mean "restricted to the case where one
can determine the order in which the matrix multiplies occur"?  I'll
agree that if you're not allowed access to all of the information
about the problem the resulting problem becomes more difficult.

This is true of many application areas, by the way...

   I suspect he has probably come up with the normal O(n^3) dynamic
   program, which in view of Yao's O(n^2) speedup must be considered
   the wrong answer.

I dunno... what is 'n'?  For that matter, who is Yao?

Obviously, I'm not aware of any way of speeding up this operation any
further.  But I'd be very interested if there is some key concept I've
overlooked.

   ...

   Now if Raul can only hold on a little longer, I think I'll be able
   to show him the use of it in practice. But for the moment, maybe we
   can pin down which questions are open and what relates to what.

Sure, no problem (though I was beginning to wonder if I'd forgotten to
post that article).  As it happens, my computer died Monday night, so
holding on is about the only thing I'm in a good position to do.

-- 
Raul Deluth Miller-Rockwell                   <rockwell@socrates.umd.edu>


From phage!jvnc.net!yale.edu!qt.cs.utexas.edu!zaphod.mps.ohio-state.edu!uunet.ca!geac!itcyyz!yrloc!rbe Fri Apr  3 17:32:53 EST 1992
Article: 1188 of comp.lang.apl
Newsgroups: comp.lang.apl
Path: phage!jvnc.net!yale.edu!qt.cs.utexas.edu!zaphod.mps.ohio-state.edu!uunet.ca!geac!itcyyz!yrloc!rbe
From: rbe@yrloc.ipsa.reuter.COM (Robert Bernecky)
Subject: Re: Matrix chain product:  FORTRAN vs. APL title bout
Message-ID: <1992Apr3.165927.8241@yrloc.ipsa.reuter.COM>
Reply-To: rbe@yrloc.ipsa.reuter.COM (Robert Bernecky)
Organization: Snake Island Research Inc, Toronto
References: <920331064737_70530.1226_CHC146-1@CompuServe.COM>
Date: Fri, 3 Apr 92 16:59:27 GMT
Lines: 70

In article <920331064737_70530.1226_CHC146-1@CompuServe.COM> Mike Kent <70530.1226@CompuServe.COM> writes:
>In article <1992Mar30.203818.15221@yrloc.ipsa.reuter.COM>,
>rbe@yrloc.ipsa.reuter.COM (Robert Bernecky) writes concerning order of
>evaluation in multiple matrix products:
>
> > ... I can DYNAMICALLY decide the best ... ordering for the operation.
> > Now, let's see you do the same in Fortran.  No cheating now:  array
> > sizes are to be determined at run time.
>
>Mmmm.  Unless the shape is known a priori or passed in, FORTRAN hasn't
>a clue about the dimensions of a matrix.  So if you're going to allow 
>FORTRAN to play in this sandbox, you _have_ to allow shapes to be passed.
>Once you do, the game is over:  I pass three arguments in to the FORTRAN
>routine:  [1] K, the number of matrices; [2] S, the Kx2 table of shapes;
>and [3] V, a vector of the +/x/S matrix entries.  I can then pass pairs of
>shapes and pairs of matrices to a standard matrix-product library routine
>(by pointing to rows of S and items of V).  It's just work from there on,
>proabably not even _too_ hard, modulo determiantion of optimal order.
>This might make a good useful external function.

At the risk of being rude (something I try to avoid doing by accident),
I think you have entirely missed the point, which is: WHO does the work?

In the case of Fortran, the APPLICATION PROGRAMMER has to:
  a. be aware of the existence of the "useful external function"
  b. be aware that the UEF can help performance
  c. code explicitly to invoke the UEF
  d. encounter possible bugs trying to invoke the UEF
and so on.

In the case of APL:

a. Arrays are self-describing, and it is impossible to get the shape of
   an array wrong (assuming a working interpreter/compiler). Since Fortran 90
   has discovered that arrays exist along with their bounds, this is
   not a strong difference. 

b. The compiler/interpreter writer, NOT the application programmer, does
   the work required to determine the optimal order of application.

c. The application programmer merely does a matrix product reduction,
   or does a sequence of matrix products, as usual. If the interpreter
   has the UEF under the covers, it will take advantage of it. If 
   the UEF doesn't exist, no one is the wiser, and things continue to work,
   albeit slower.

The whole point of all this is to hide superfluous detail from the  
applications programmer, who should (properly) be thinking about 
the problem space, not how to make fast code. The fast code is
properly left to the system, as in APL, where primitives such as 
grade(sort), set membership, table lookup, matrix multiply, etc., 
are optimized to such a degree that no sensible user ever thinks of
coding a user-defined one. 

Furthermore, when interpreter code improves (as it does on almost
any commercial system which intends to stay in business), all 
applications immediately speed up, whereas user-coded primitives
or UEF's do not speed up.

It's a division of labor, and Fortran is simply too primitive to
support a proper division.




Robert Bernecky      rbe@yrloc.ipsa.reuter.com  bernecky@itrchq.itrc.on.ca 
Snake Island Research Inc  (416) 368-6944   FAX: (416) 360-4694 
18 Fifth Street, Ward's Island
Toronto, Ontario M5J 2B9 
Canada


From phage!jvnc.net!yale.edu!qt.cs.utexas.edu!cs.utexas.edu!utgpu!watserv1!70530.1226@compuserve.com Mon Apr  6 15:46:39 EDT 1992
Article: 1190 of comp.lang.apl
Newsgroups: comp.lang.apl
Path: phage!jvnc.net!yale.edu!qt.cs.utexas.edu!cs.utexas.edu!utgpu!watserv1!70530.1226@compuserve.com
From: Mike Kent <70530.1226@CompuServe.COM>
Subject: What does that FORTRAN code  _do_?
Message-ID: <920404100728_70530.1226_CHC22-1@CompuServe.COM>
Sender: root@watserv1.waterloo.edu (Operator)
Organization: University of Waterloo
Date: Sat, 4 Apr 1992 10:07:29 GMT
Lines: 75

In article <ROCKWELL.92Apr2194843@socrates.umd.edu>
rockwell@socrates.umd.edu (Raul Deluth Miller-Rockwell) asks:

 > um... but what does the code do?

My apologies ... FORTRAN is so easy to read that I generally neglect to
comment it (ho ho).

Calling

               partsum(count1, count2, lenvec, itmvec, result)

produces a partitioned sum of the integer vector <itmvec> in <result>,
that is, a sum of the <count1> consecutive subvectors of respective
lengths <lenvec>.  It does this a bit more quickly than the usual APL
+\ , compress, first-differences idiom with a partitioning bit vector
marking the partitions of <itmvec> when <count1> is very small, and
considerably more _slowly_ when <count1> is modest to large (modest == 50).

There are some corrections to the original posting (my memory is
imperfect, alas):  the excess memory accesses were associated with the
outer loop, and no initialization to zero is required for the items
of <result>, it being assumed that the _caller_ has built a vector of 0's
ahead of time (since the code was written to be called as an APL2 external
function).
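[In Python terms (an illustration of mine, not code from either poster), the two computations being compared are a direct partitioned sum and the scan/compress/difference idiom:

```python
from itertools import accumulate

def partsum(lenvec, itmvec):
    """Partitioned sum: sum consecutive runs of itmvec whose lengths
    are given by lenvec -- what the FORTRAN partsum computes.
    Assumes all partition lengths are positive."""
    result, pos = [], 0
    for n in lenvec:
        result.append(sum(itmvec[pos:pos + n]))
        pos += n
    return result

def partsum_scan(lenvec, itmvec):
    """The APL idiom in the same terms: a running sum (+\\), picked out
    at the partition ends (compress), then first-differenced."""
    ends = list(accumulate(lenvec))           # 1-origin partition ends
    totals = list(accumulate(itmvec))         # running sum of itmvec
    picked = [totals[e - 1] for e in ends]    # totals at partition ends
    return [b - a for a, b in zip([0] + picked, picked)]

print(partsum([2, 3, 1], [1, 2, 3, 4, 5, 6]))       # -> [3, 12, 6]
print(partsum_scan([2, 3, 1], [1, 2, 3, 4, 5, 6]))  # -> [3, 12, 6]
```

-- ed.]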

  
 * * * * * *


In article <1992Apr3.165927.8241@yrloc.ipsa.reuter.COM>,
rbe@yrloc.ipsa.reuter.COM (Robert Bernecky) goes on a bit (politely
enough) about the proper division of labor between applications and
systems software.  I reply:

Mmm, but Bob, you had written "I'd like to see you _do_ that in
FORTRAN" [emphasis mine].  I just wanted to point out that it isn't
all that hard, just mainly tedious.  On reconsideration, though, I've
reconsidered; a truly general solution would have to do some
dynamic memory management, which is far, far, far better left to the
system than handled by the application, at least if _I_ have to write
the code. 

You are of course right that the more the system does for you, without
constraining _your_ options, the better off you are.  I thought, and
still think, that { first +.x / } is not an especially compelling 
candidate for a new primitive, or []MATCHAINPROD, or idiom recognition 
(though I'll give a bit on idiom recognition).  Like a lot of other
problems, it's of vital interest to a lot of people, and of next to no
interest to a whole lot more people.  That's why I think it would be more
useful as an external function ( == library routine ).  On the other hand,
if the idiom recognized were { first f.g / } with identification of the
cases where the pair f::g has an associative inner product and shape-
based ordering of the reduction so that { and . = / } and a personal
favorite { or . and / } also benefitted, for example, THAT would be a real
good candidate for incorporation into the system software, i.e., the APL
interpreter.  On the other other hand, I'd bet a small amount that in most
real cases, long chains consist of odd-shaped members at one or both ends
with the rest of the chain consisting of square matrices so that the
problem is just whether to eat the chain from the left, from the right,
or to compute the square-matrix chain product first and then do a three-
item product optimally.  Those cases can be handled by a rule-of-thumb
which is simple enough to code up in APL -- no need to resort to general
dynamic-programming techniques.  The only time _I_ had to contend with
extensive matrix chain products, the problem involved two-item vectors
with assorted chains of 2x2 matrices in between, and the default
association was optimal (though I didn't worry about it at the time ...
I was only interested in getting the right answers, in a time frame of a
couple of weeks, as I had to form a 12x12 matrix of such products,
average chain length was around 6 or 7, and I was having fits working out
1000 trivial matrix multiplies by hand).

Mike Kent [ who would just as soon _not_ flog the mouldering corpse of
            this dead horse any further ]



From phage!jvnc.net!yale.edu!spool.mu.edu!sol.ctr.columbia.edu!zaphod.mps.ohio-state.edu!uunet.ca!rose!tmsoft!itcyyz!yrloc!rbe Mon Apr  6 15:49:21 EDT 1992
Article: 1191 of comp.lang.apl
Path: phage!jvnc.net!yale.edu!spool.mu.edu!sol.ctr.columbia.edu!zaphod.mps.ohio-state.edu!uunet.ca!rose!tmsoft!itcyyz!yrloc!rbe
From: rbe@yrloc.ipsa.reuter.COM (Robert Bernecky)
Newsgroups: comp.lang.apl
Subject: Re: APL execution efficiency revisited
Message-ID: <1992Apr5.180150.41@yrloc.ipsa.reuter.COM>
Date: 5 Apr 92 18:01:50 GMT
Article-I.D.: yrloc.1992Apr5.180150.41
References: <1992Mar26.163242.5298@yrloc.ipsa.reuter.COM> <771@kepler1.rentec.com> <1992Mar30.203818.15221@yrloc.ipsa.reuter.COM> <783@kepler1.rentec.com>
Reply-To: rbe@yrloc.ipsa.reuter.COM (Robert Bernecky)
Organization: SNake Island Research Inc, Toronto
Lines: 63

In article <783@kepler1.rentec.com> andrew@rentec.com (Andrew Mullhaupt) writes:
>In article <1992Mar30.203818.15221@yrloc.ipsa.reuter.COM> rbe@yrloc.ipsa.reuter.COM (Robert Bernecky) writes:
>>>Now for the particular case  of +.x / (A B C D ...) you will be able to
>>
>>So, if you hand me the matrices one at a time, the whole question 
>>of ordering is moot. This is like trying to nail Jello to a tree.
>>Can we please stick to one problem at a time?
>Sure, but I originally posed this not as a problem but as an example where
>you cannot avoid doing 'bad' inner and outer products. I _never_ stated
>that you were allowed to assume these matrices were given to you in a vector,
>but some people have continually assumed this. It should also be noted that
>the Matrix Chain Product was not brought up as an example where APL is any worse
>than FORTRAN, but as an example where you have to do bad inner products
>in nearly any language.

The point I am STILL trying to make (and will give up on if this iteration
fails...) is that APL already HAS all the information at hand to
determine the best ordering of the matrix products (in the sense of
minimizing operations).

This is regardless of the number of arrays (A B C D ...) being 
multiplied at that instant. 

Furthermore, the application, or lack of application of said optimization
can be done without ANY changes to the application, so that relatively
inexperienced programmers obtain the benefit of such optimizations with
no work on their part.

It is this leverage of system programmers' skills that is
valuable: an APL programmer need not be highly skilled in
machine architecture, cache structure, heavy-duty dynamic-programming
math, etc., to take advantage of those algorithms.

>   Since I had seen nothing but 'J' in this group for months, I thought
>   I'd take this chance to stir things up by pointing out how, although
>   this intersection idiom is really the best you can do in APL2, it's
>   a piker compared to any compiled language. (Which it is). I pointed
>   out that it's relatively slow and very hard to read. (Which it is).

A. A posting by a single programmer is not necessarily optimal for any
   language.

B. Predicting the speed of an idiom is walking on thin ice: Many
   APL (and other) systems detect idioms and produce special purpose 
   code to handle them efficiently. This is no different from Fortran
   compilers which stare at 5 lines of DO loop code, and pump out a
   matrix product.

>      c) Some people said 'nobody's forcing you to use APL'. That's not true.
>         When you have to optimize an APL function, you normally have to call
>         it from APL. This has recently been fixed in APL2 but not in most
>         APL's yet. 

Don't make assertions about products about which you are ignorant.
I am not aware of any APL which has been released in the last several years
which does not have call-in call-out capabilities.


Robert Bernecky      rbe@yrloc.ipsa.reuter.com  bernecky@itrchq.itrc.on.ca 
Snake Island Research Inc  (416) 368-6944   FAX: (416) 360-4694 
18 Fifth Street, Ward's Island
Toronto, Ontario M5J 2B9 
Canada


From phage!jvnc.net!yale.edu!qt.cs.utexas.edu!zaphod.mps.ohio-state.edu!uakari.primate.wisc.edu!usenet.coe.montana.edu!ogicse!das-news.harvard.edu!spdcc!dirtydog.ima.isc.com!ispd-newsserver!psinntp!kepler1!andrew Tue Apr  7 17:52:58 EDT 1992
Article: 1199 of comp.lang.apl
Path: phage!jvnc.net!yale.edu!qt.cs.utexas.edu!zaphod.mps.ohio-state.edu!uakari.primate.wisc.edu!usenet.coe.montana.edu!ogicse!das-news.harvard.edu!spdcc!dirtydog.ima.isc.com!ispd-newsserver!psinntp!kepler1!andrew
From: andrew@rentec.com (Andrew Mullhaupt)
Newsgroups: comp.lang.apl
Subject: Re: What does that FORTRAN code  _do_?
Message-ID: <791@kepler1.rentec.com>
Date: 6 Apr 92 21:02:26 GMT
References: <920404100728_70530.1226_CHC22-1@CompuServe.COM>
Organization: Renaissance Technologies Corp., Setauket, NY.
Lines: 100

In article <920404100728_70530.1226_CHC22-1@CompuServe.COM> 70530.1226@CompuServe.COM (Mike Kent) writes:
>In article <ROCKWELL.92Apr2194843@socrates.umd.edu>
>rockwell@socrates.umd.edu (Raul Deluth Miller-Rockwell) asks:
>Calling
>
>               partsum(count1, count2, lenvec, itmvec, result)
>
>produces a partitioned sum of the integer vector <itmvec> in <result>,
>that is, a sum of the <count1> consecutive subvectors of respective
>lengths <lenvec>.  It does this a bit more quickly than the usual APL
>+\ , compress, first differences idiom with a partitioning bit vector
>marking the partitions of <itmvec> when <count1> is very small, and
>considerably more _slowly_ when <count1> modest to large (modest == 50).

The reason for this is that you have smaller inner loops. You can
beat this one by copying the array and summing 'in place', then
differencing. I probably would have tried this first. On the other
hand, Mike's post was a reasonable attempt. His point that when you
can express an algorithm in APL which doesn't use much flow of control
or a lot of excess memory that it will be pretty fast. You have a
right to expect this from your interpreter vendor. I must admit that
Mike's FORTRAN is reasonable. However, FORTRAN certainly allows for
an algorithm which would compete with the APL, so I don't think it's
fair to use this partition sum as an example where APL is faster than
FORTRAN. It _is_ an example of where good APL is faster than a reasonable
FORTRAN program, and I guess this is what Mike's focus is.

I am more interested in the limits of the language, and to determine these
you need to be prepared to use the _best_ possible algorithm for the
programming system. Recently I posted an APL2 intersection idiom which
I think is about as fast as APL2 can go. If I can go faster using C
or FORTRAN, I interpret this as evidence that APL2 is "not as fast" as
FORTRAN or C, since I gave APL2 its best shot. If someone can find a much
faster APL2 intersection, then I have to reconsider.


I like to use the bandwidth as a good example of an APL limit. In APL2,
a fast way to determine the bandwidth of a matrix M is:

	1 + max / ravel B x abs T jot.- T <- iota rho B <- 0 unequal M

It works by forming a boolean matrix B which indicates the nonzero elements
of M, then it forms a matrix (abs T jot.-T) which labels the diagonals
then it takes the maximum value of this bandwidth matrix which corresponds
to a nonzero element of M.
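[In Python terms (an illustrative transliteration of mine, not the APL2 itself), the idiom and its compiled-language competitor look like this:

```python
def bandwidth_apl_style(M):
    """Mirror the APL2 idiom: form the nonzero mask B, build the full
    |i-j| diagonal-label matrix, mask it by B, take 1 + the maximum.
    Every intermediate matrix is fully materialized, as in most APLs."""
    n = len(M)
    B = [[1 if M[i][j] != 0 else 0 for j in range(n)] for i in range(n)]
    labels = [[abs(i - j) for j in range(n)] for i in range(n)]
    return 1 + max(B[i][j] * labels[i][j]
                   for i in range(n) for j in range(n))

def bandwidth_loop(M):
    """The compiled-language version: no intermediate matrices, and
    entries no farther from the diagonal than the current best can be
    skipped without even testing them for zero."""
    n, best = len(M), 0
    for i in range(n):
        for j in range(n):
            if abs(i - j) > best and M[i][j] != 0:
                best = abs(i - j)
    return 1 + best

M = [[1, 2, 0],
     [0, 3, 4],
     [5, 0, 6]]          # the nonzero (3,1) entry sets the bandwidth
print(bandwidth_apl_style(M), bandwidth_loop(M))   # -> 3 3
```

-- ed.]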

It is interesting to speculate that a supersmart APL interpreter could
avoid creating a full B matrix by looking at the entire line, but in
reality, B will be fully computed by most APLs. Now on a vector machine,
you are still in contention since APL is going to get pretty long vectors.
Now (abs T jot.- T) will also be formed. This is where you start to get
beat by compiled algorithms. The extra memory for this matrix will cost
you, and then you're going to go through it again to multiply by B. Then
we assume the ravel is free, but the max reduce does cost you a full set
of comparisons. It is pretty easy to see how this algorithm is not space
or time efficient compared to fairly simple C or FORTRAN, and it is not
so easy to see how to improve it. (Obviously, I'm interested in any faster
approach, but I don't think you can gain much...)

 

>still think, that { first +.x / } is not an especially compelling 
>candidate for a new primitive, or []MATCHAINPROD, or idiom recognition 

Actually, I think that []MATCHAINPROD would be totally uncalled for.
If the only thing you can do in APL is expand the system to include
bad examples, then you'll end up with PL/I with funny characters.
The value of discussing hard examples in APL is not so much to argue
for expanding APL, but to raise our consciousness about APL as an
expressive language. On the blackboard, there is essentially _nothing_
wrong with using an inefficient APL expression. It's often only when you
want to solve large problems that you find the limits of 'thinking in APL
terms'.

Now I take the point of view that a language which wants to be thought
of as expressive needs to take seriously _any_ algorithm with performance
advantages, not so much because you can make a million bucks because you
have this fast implementation, but because 'expressiveness' in a
performance vacuum is meaningless. Prolog is a great example of a 
language which is _tremendously_ expressive in some directions, but it
is also possible to write Prolog programs which do not terminate at all
merely because you put the desired clauses in a different order.


> ... On the other hand,
>if the idiom recognized were { first f.g / } with identification of the
>cases where the pair f::g has an associative inner product and shape-
>based ordering of the reduction so that that { and . = / } and a personal
>favorite { or . and / } also benefitted for example, THAT would be a real
>good candidate for incorporation into the system software, i.e., the APl
>interpreter. 

Such is actually the case, but I'd rather be able to write the algorithms
in APL rather than have every clever twist turn up as a new primitive.

>Mike Kent [ who would just as soon _not_ flog the mouldering corpse of
>            this dead horse any further ]

Later,
Andrew Mullhaupt


From phage!jvnc.net!darwin.sura.net!mips!zaphod.mps.ohio-state.edu!uakari.primate.wisc.edu!usenet.coe.montana.edu!ogicse!das-news.harvard.edu!spdcc!dirtydog.ima.isc.com!ispd-newsserver!psinntp!kepler1!andrew Tue Apr  7 17:59:50 EDT 1992
Article: 1200 of comp.lang.apl
Path: phage!jvnc.net!darwin.sura.net!mips!zaphod.mps.ohio-state.edu!uakari.primate.wisc.edu!usenet.coe.montana.edu!ogicse!das-news.harvard.edu!spdcc!dirtydog.ima.isc.com!ispd-newsserver!psinntp!kepler1!andrew
From: andrew@rentec.com (Andrew Mullhaupt)
Newsgroups: comp.lang.apl
Subject: Re: APL execution efficiency revisited
Message-ID: <794@kepler1.rentec.com>
Date: 6 Apr 92 22:36:54 GMT
References: <1992Mar30.203818.15221@yrloc.ipsa.reuter.COM> <783@kepler1.rentec.com> <1992Apr5.180150.41@yrloc.ipsa.reuter.COM>
Organization: Renaissance Technologies Corp., Setauket, NY.
Lines: 87

In article <1992Apr5.180150.41@yrloc.ipsa.reuter.COM> rbe@yrloc.ipsa.reuter.COM (Robert Bernecky) writes:
>In article <783@kepler1.rentec.com> andrew@rentec.com (Andrew Mullhaupt) writes:
>>In article <1992Mar30.203818.15221@yrloc.ipsa.reuter.COM> rbe@yrloc.ipsa.reuter.COM (Robert Bernecky) writes:
>>>>Now for the particular case  of +.x / (A B C D ...) you will be able to

>The point I am STILL trying to make (and will give up if this iteration
>fails...) is that APL already HAS all the information at hand to 
>optimally (from the standpoint of ordering the matrix products to 
>minimize ops) determine the best ordering. 

Yes. If you are doing +.x / you can optimize this. It is much harder to
say something about 'on-line' matrix chain product, where you are supposed
to compute the partial product as you go along, as in the case of a loop.
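The +.x / case is the classic matrix-chain ordering problem: given only the shapes (which APL has in hand), dynamic programming finds the cheapest parenthesization. A sketch of mine, not from the thread; the name `chain_order` and the cost-only formulation are my own:

```python
def chain_order(dims):
    """Minimum scalar multiplications to evaluate a chain of matrices
    where matrix i has shape dims[i] x dims[i+1].  Standard O(n^3) DP."""
    n = len(dims) - 1                       # number of matrices in the chain
    cost = [[0] * n for _ in range(n)]      # cost[i][j]: best cost for A_i ... A_j
    for span in range(1, n):                # grow subchains from length 2 upward
        for i in range(n - span):
            j = i + span
            cost[i][j] = min(               # try every split point k
                cost[i][k] + cost[k + 1][j]
                + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]
```

This needs the whole chain up front, which is the point being made: an on-line loop that accumulates partial products as it goes gives the optimizer nothing to reorder.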

>Furthermore, the application, or lack of application of said optimization
>can be done without ANY changes to the application, so that relatively
>inexperienced programmers obtain the benefit of such optimizations with
>no work on their part.

In the simple cases, yes. If a program can be transformed from one which
accumulates its matrix product in a loop to one which uses +.x /, then
experience may be helpful.

>A. A posting by a single programmer is not necessarily optimal for any
>   language.

Yes, but that posting was claimed to be optimal for APL2. I have reason
to believe that it is very fast in APL2, and nobody seems to have posted
better, so maybe I'll just claim "best known", although you might want
to be conservative about it and say that it's hard enough to do better
that nobody is motivated to post better...

>B. Predicting the speed of an idiom is walking on thin ice: Many
>   APL (and other) systems detect idioms and produce special purpose 
>   code to handle them efficiently. This is no different from Fortran
>   compilers which stare at 5 lines of DO loop code, and pump out a
>   matrix product.

Yes, and it is also sometimes necessary to foil idiom detection because
APL doesn't always get it right. In APL2 it was necessary for a while to
put a 'harmless' ravel in the unique idiom to _prevent_ the interpreter
from recognizing it: the idiom had special code, but the dyadic iota it
was avoiding was actually faster. IBM later fixed this by taking out the
idiom, but you'll still see the extra comma in some people's code.

On the other hand, I like to use matrix bandwidth as an example which is
as fast as you can go in many APL interpreters, but still slow compared
to a good compiled language. I have in mind the APL line:

	1 + max / B x T jot.- T <- iota rho B <- 0 unequal M

which computes the bandwidth of a matrix M. I have not seen interpreted
APL which significantly outperforms this, and I think that you won't
be able to cut much off this. I am aware of some of the differences between
different APL systems, but I don't see any which would make this much less
than optimal in practice. In "theory" you might do better by looping over
rows, or recursive subdivision of the matrix, but not without a _giant_
workspace.

>>      c) Some people said 'nobody's forcing you to use APL'. That's not true.
>>         When you have to optimize an APL function, you normally have to call
>>         it from APL. This has recently been fixed in APL2 but not in most
>>         APL's yet. 
>
>Don't make assertions about products about which you are ignorant.
>I am not aware of any APL which has been released in the last several years
>which does not have call-in call-out capabilities.

You're right. I should have said something other than most. Although almost
all APL interpreters I used have call-out, very few supported call-in. 
I used APL2 (which now has it) and STSC and Sharp APL. One of our resident
APL programmers here doesn't think you can do it with Dyalog. Maybe we're
not talking about the same thing.

By call-in, I mean the ability to invoke APL code by calling a function in a
compiled program, not the ability to communicate over a pipe. If people are
putting call-in in interpreters, it raises some interesting issues. (Is it
re-entrant? Can you call in and out and in and out... etc.?) Is it a 'full
bandwidth' interface? (An example of a practical 'full-bandwidth' interface
is the dynamic loading / call-back in the Splus interpreter, where the overhead
of calling in or out is essentially the same as the overhead of calling a
compiled function from within a compiled program. Note that you have to
have called out in order to call back in in Splus...)


Later,
Andrew Mullhaupt


From phage!jvnc.net!darwin.sura.net!haven.umd.edu!socrates!socrates!rockwell Tue Apr  7 18:01:38 EDT 1992
Article: 1201 of comp.lang.apl
Newsgroups: comp.lang.apl
Path: phage!jvnc.net!darwin.sura.net!haven.umd.edu!socrates!socrates!rockwell
From: rockwell@socrates.umd.edu (Raul Deluth Miller-Rockwell)
Subject: Re: APL execution efficiency revisited
In-Reply-To: andrew@rentec.com's message of 6 Apr 92 22:36:54 GMT
Message-ID: <ROCKWELL.92Apr7091904@socrates.umd.edu>
Sender: rockwell@socrates.umd.edu (Raul Deluth Miller-Rockwell)
Organization: Traveller
References: <1992Mar30.203818.15221@yrloc.ipsa.reuter.COM> <783@kepler1.rentec.com>
	<1992Apr5.180150.41@yrloc.ipsa.reuter.COM> <794@kepler1.rentec.com>
Date: Tue, 7 Apr 1992 14:19:04 GMT

Robert Bernecky:
   >The point I am STILL trying to make (and will give up if this
   >iteration fails...) is that APL already HAS all the information at
   >hand to optimally (from the standpoint of ordering the matrix
   >products to minimize ops) determine the best ordering.

Andrew Mullhaupt:
   Yes. If you are doing +.x / you can optimize this. It is much
   harder to say something about 'on-line' matrix chain product, where
   you are supposed to compute the partial product as you go along, as
   in the case of a loop.

This is a good argument against using loops.  The significant question
is _why_ are you using a loop?  If the reason is that you must deal
with time (say, interactive input, where you must display partial
results), there isn't a whole lot you should do about this.  If the
reason is that you couldn't think of anything better, well, that's a
personal problem.

If the reason is that the whole problem couldn't fit into memory, you
can take an APL solution and expand it -- once you've got a decent
algorithm.  It's tempting to recommend that the language support this
transparently, but virtual memory might be a way around this problem.
[Also, it is harder to write fast code that's general when there are
many orders of magnitude difference between different kinds of
storage.]

   >A. A posting by a single programmer is not necessarily optimal for any
   >   language.

   Yes, but that posting was claimed to be optimal for APL2. I have
   reason to believe that it is very fast in APL2, and nobody seems to
   have posted better, so maybe I'll just claim "best known", although
   you might want to be conservative about it and say that it's hard
   enough to do better that nobody is motivated to post better...

Well, there's a number of reasons I favor J over APL2.  One is that it
seems much easier to optimize.  At present, this is barely even a
theoretical advantage, but... we'll see.

           1 + max / B x T jot.- T <- iota rho B <- 0 unequal M

Um... shouldn't you ravel that before doing the max reduce?  Also, are
you sure you don't want to also take the absolute value at the same
time?

   which computes the bandwidth of a matrix M. I have not seen interpreted

   You're right. I should have said something other than most.
   Although almost all APL interpreters I used have call-out, very few
   supported call-in.  I used APL2 (which now has it) and STSC and
   Sharpe APL. One of our resident APL programmers here doesn't think
   you can do it with Dyalog. Maybe we're not talking about the same
   thing.

I understand that STSC now has it as well (quadNA).  J has it [of
course].  I don't know about Sharp, but J is (to my mind) its
successor.  I don't know about Dyalog either -- but if you want it,
why not contact the people who produce it?

-- 
Raul Deluth Miller-Rockwell                   <rockwell@socrates.umd.edu>


From phage!jvnc.net!yale.edu!spool.mu.edu!uunet!decwrl!infopiz!lupine!uupsi!psinntp!kepler1!andrew Mon Apr 13 10:25:48 EDT 1992
Article: 1208 of comp.lang.apl
Path: phage!jvnc.net!yale.edu!spool.mu.edu!uunet!decwrl!infopiz!lupine!uupsi!psinntp!kepler1!andrew
From: andrew@rentec.com (Andrew Mullhaupt)
Newsgroups: comp.lang.apl
Subject: Re: APL execution efficiency revisited
Message-ID: <797@kepler1.rentec.com>
Date: 8 Apr 92 15:15:06 GMT
Article-I.D.: kepler1.797
References: <1992Apr5.180150.41@yrloc.ipsa.reuter.COM> <794@kepler1.rentec.com> <ROCKWELL.92Apr7091904@socrates.umd.edu>
Organization: Renaissance Technologies Corp., Setauket, NY.
Lines: 22

In article <ROCKWELL.92Apr7091904@socrates.umd.edu> rockwell@socrates.umd.edu (Raul Deluth Miller-Rockwell) writes:
>This is a good argument against using loops.  The significant question
>is _why_ are you using a loop?  If the reason is that you must deal
>with time (say, interactive input, where you must display partial
>results), there isn't a whole lot you should do about this. 

This is my point all along.

>
>           1 + max / B x T jot.- T <- iota rho B <- 0 unequal M
>
>Um... shouldn't you ravel that before doing the max reduce?  Also, are
>you sure you don't want to also take the absolute value at the same
>time?

Yes, I forgot the absolute value and ravel. Did I forget it in both posts? 
The correct idiom is:

           1 + max / abs ravel B x T jot.- T <- iota rho B <- 0 unequal M

Later,
Andrew Mullhaupt


