Newsgroups: comp.lang.apl
Path: watmath!watserv1!70530.1226@compuserve.com
From: Mike Kent <70530.1226@CompuServe.COM>
Subject: Re:  APL efficiency revisited
Message-ID: <920402082825_70530.1226_CHC47-1@CompuServe.COM>
Sender: daemon@watserv1.waterloo.edu
Organization: University of Waterloo
Date: Thu, 2 Apr 1992 08:28:26 GMT
Lines: 54

In one of yesterday's articles, Andrew Mullhaupt refers to two APL
programmers, remarks that this thread has wandered and that he has been
misunderstood, and speculates about some FORTRAN code.


The two APLers:  it was the same guy, namely me.

Wandering:  threads _will_ wander.

Misunderstandings:  not everything in the thread is a reply to the 
original article (see immediately above).

And here's the fortran code, reconstructed from memory, but sans capital
letters:


       subroutine partsum (count1, count2, lenvec, itmvec, result)
c      result(j) = sum of the next lenvec(j) items of itmvec
       integer*4 count1, count2, itmndx, j, k
       integer*4 lenvec(count1), itmvec(count2), result(count1)

       itmndx=0
       do 40 j=1,count1,1
           result(j) = 0
           do 20 k=1,lenvec(j),1
c             itmndx runs once through itmvec across all partitions
              itmndx = 1+itmndx
              result(j)=result(j)+itmvec(itmndx)
  20       continue
  40   continue
       return
       end


From a quick scan, it looks to me like this could be coded in assembler
so as to do the work using only general registers.  That means there
should be _no_ reads except for the count parameters and one touch on
each item of {lenvec} and of {itmvec}, and only one write to {result}
for each item of {lenvec}.  The source is short enough that I expected
the compiler to produce that kind of machine code.  I don't have the
compiler output (it _has_ been several years), but I looked at the
pseudo-assembler listing back when, and there was a lot of memory I/O --
the comparisons for the loop variables were the culprits, as I recall.
The obvious assembler techniques -- "BCTR" to control the loops and
"LA k,4(j,k)" to advance the data indices -- were _not_ generated.  For
OPTIMIZE(0) compilations, this is OK, but for OPTIMIZE(2)???  
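
For what it's worth, the register-only scheme described above can be
sketched at the source level.  This is only an illustration, not the
code I compiled back then:  the name psum2 and the scalar acc are made
up here, and whether any given compiler really keeps acc and itmndx in
registers is an assumption.  It takes every lenvec(j) to be
non-negative and the lengths to sum to no more than count2.

```fortran
      subroutine psum2 (count1, count2, lenvec, itmvec, result)
!     hypothetical variant of partsum: same arguments, same results
      integer*4 count1, count2, itmndx, acc, j, k
      integer*4 lenvec(count1), itmvec(count2), result(count1)

      itmndx = 0
      do 40 j = 1, count1
!        acc stands in for a general register holding the sum
         acc = 0
!        counting down to zero mirrors a BCTR-controlled loop
         do 20 k = lenvec(j), 1, -1
            itmndx = itmndx + 1
            acc = acc + itmvec(itmndx)
  20     continue
!        the only store to result for this partition
         result(j) = acc
  40  continue
      return
      end
```

With lengths 2, 3, 1 and items 1 through 6, the three sums come out as
3, 12 and 6.  Each item of itmvec is read once, and result is written
once per partition, which is the memory traffic I expected to see in
the listing.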

BTW, what test can I eliminate from this code?

In any case, the points I wanted to make were two:  first, compilation
doesn't always mean you get close-to-optimal code; and second, the
assembler code which underlies APL2 is in many cases _very_ well tuned,
probably better than average pretty-good compiler output.  
         


