Retrochallenge 2016/01:
Maze War for Olivetti M10 and NEC PC-8201A

Episode 9: Quest for Speed

After last episode's midterm fun, we're back to more serious considerations regarding our little program.

Taking time to think about it.

Ambitious Ideas

I'm not too happy about the rendering speed of the LCD and am consequently musing about machine language. Maybe we could define an integer array and assemble a little display program into the space it reserves. The space required for the display data could be reserved in the same array or, even better, in an integer array of its own.

My ideas regarding the ML program (Intel 8085) went as far as follows (based on a demo program in the PC-8201A Technical Manual; NEC Corporation, 1984):

; writing up to 50 bytes to the display

   OFF75   EQU   765Ch              ; addr. disable interrupts (Model 100)
   ON75    EQU   743Ch              ; addr. enable interrupts  (Model 100)
   PORTC   EQU   FEh
   PORTD   EQU   FFh

; address count is decimal!

   LCDWRT:
00         CD 5C 76   CALL OFF75     ; disable interrupts

03         CD 21 00   CALL LCDBUSY   ; wait for the display being ready
06         3A 28 00   LDA PGOFS      ; load page and offset
09         D3 FE      OUT PORTC      ; send it to the display

11         21 2C 00   LXI H,DATA     ; get data address
14         3A 2A 00   LDA COUNT      ; get number of bytes
17         4F         MOV C,A        ; move it to C

   WRITE:
18         CD 21 00   CALL LCDBUSY   ; wait for the display
21         7E         MOV A,M        ; get byte
22         D3 FF      OUT PORTD      ; write it to the display
24         23         INX H          ; increment HL registers
25         0D         DCR C          ; decrement C
26         C2 12 00   JNZ WRITE      ; redo, if not zero

29         CD 3C 74   CALL ON75      ; enable interrupts
32         C9         RET            ; return

   LCDBUSY:
33         DB FE      IN PORTC       ; read status
35         07         RLC            ; rotate to get busy state
36         DA 21 00   JC LCDBUSY     ; repeat, if busy
39         C9         RET            ; return

40
   PGOFS:  0,0                    ; page and pixel offset
42
   COUNT:  0,0                    ; length of data
44
   DATA:   0                      ; data starts here

! Caution: Assembled by hand, don't trust me!
Color key of the original listing: green marks machine-dependent values, blue marks addresses to be converted to absolute addresses, red marks values to be set by BASIC before each call.

All we would have to do is define an integer array of half the length of the program (an integer is 2 bytes) plus 50 bytes for the data, as in "DIM SP%(48)", get the start address by "AD=VARPTR(SP%(0))", and fix up all the addresses. That is: OFF75 and ON75 depending on the model used, and the local addresses LCDBUSY (AD+33), WRITE (AD+18), PGOFS (AD+40), COUNT (AD+42), and DATA (AD+44). WRITE needs the fix-up as well, since the 8085 knows only absolute jumps and calls.
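
The fix-up described above can be sketched on a modern host. The following Python model is only an illustration, not part of the project: it transcribes the bytes of the listing, assumes the standard 8085 encodings 4Fh for MOV C,A and DAh for JC, and adds a base address to every 16-bit operand that refers to a local label. The names CODE, FIXUPS, and relocate, as well as the base address used, are made up for this sketch. On the real machine, BASIC would do the same additions with the address obtained from VARPTR.

```python
# Bytes of the routine as listed, assembled for origin 0 (offsets are decimal).
# Assumption: 4Fh = MOV C,A and DAh = JC, the standard 8085 encodings.
CODE = bytes.fromhex(
    "CD5C76"  # 00 CALL OFF75   (ROM address, not relocated)
    "CD2100"  # 03 CALL LCDBUSY
    "3A2800"  # 06 LDA PGOFS
    "D3FE"    # 09 OUT PORTC
    "212C00"  # 11 LXI H,DATA
    "3A2A00"  # 14 LDA COUNT
    "4F"      # 17 MOV C,A
    "CD2100"  # 18 CALL LCDBUSY (label WRITE)
    "7E"      # 21 MOV A,M
    "D3FF"    # 22 OUT PORTD
    "23"      # 24 INX H
    "0D"      # 25 DCR C
    "C21200"  # 26 JNZ WRITE
    "CD3C74"  # 29 CALL ON75    (ROM address, not relocated)
    "C9"      # 32 RET
    "DBFE"    # 33 IN PORTC     (label LCDBUSY)
    "07"      # 35 RLC
    "DA2100"  # 36 JC LCDBUSY
    "C9"      # 39 RET
    "0000"    # 40 PGOFS
    "0000"    # 42 COUNT
)

# Offsets of the 16-bit operands that hold local addresses.
FIXUPS = [4, 7, 12, 15, 19, 27, 37]

def relocate(code, base):
    """Return a copy of code with base added to every local address."""
    out = bytearray(code)
    for pos in FIXUPS:
        addr = int.from_bytes(out[pos:pos + 2], "little")
        out[pos:pos + 2] = ((addr + base) & 0xFFFF).to_bytes(2, "little")
    return bytes(out)

BASE = 0xF000  # hypothetical value of AD, for illustration only
patched = relocate(CODE, BASE)
```

The code and the two parameter words come to 44 bytes, which matches DATA starting at offset 44 in the listing.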

In case you'd ask: PGOFS and COUNT are 2 bytes each in order to match a subscript of the integer array. Integers are stored in little-endian order, so any even offset address would match an index to directly write to it from BASIC. Preferably, we put all the data in an array of its own, thus starting at subscript 0.
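
Why the even offsets line up with subscripts can be checked with a small model. This Python sketch is an illustration on a modern host, not code for the M10, and the helper names subscript_for and assign are made up here. It packs a 16-bit integer in little-endian order, as the text describes for BASIC integers, and shows that the low byte lands on the even address, which is the byte a single LDA would read.

```python
import struct

def subscript_for(offset):
    """Array subscript of the 2-byte integer starting at an even byte offset."""
    if offset % 2:
        raise ValueError("odd offsets fall into the middle of an integer")
    return offset // 2

def assign(memory, subscript, value):
    """Model of a BASIC assignment to an integer array element."""
    struct.pack_into("<h", memory, 2 * subscript, value)

memory = bytearray(48)                  # room for 24 integers, offsets 0 to 47
assign(memory, subscript_for(42), 50)   # COUNT lives at offset 42
low_byte, high_byte = memory[42], memory[43]
```

In this model, offsets 40, 42, and 44 correspond to subscripts 20, 21, and 22, provided the array starts where the program starts.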

Once again, the NEC PC-8201A's BASIC (N82-BASIC) is different, as it has no implementation of VARPTR. But there's a little piece of BASIC/ML code to emulate it, so this wouldn't really pose a problem.
(See VARPTR.NEC by Steve Sarna, 11/12/84.)

Transferring Data

There's another, more interesting question here regarding the method of talking to our little program. Obviously, it's nice to set the page, offset, and the number of bytes to write just by using a BASIC assignment to an integer subscript. Should we do this also with the up to 50 bytes of display data (thus having to duplicate the increment on HL, as in INX H at offset 24, in order to advance by two bytes at once), or should we rather POKE the values directly into the data space?
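
The two layouts in question can be compared in a small model. This Python sketch only illustrates the memory layout on a modern host, and its function names are made up. Storing each display byte in its own integer element doubles the space and puts the payload at every second address, which is why the routine would need a second INX H. POKEd bytes sit back to back.

```python
import struct

def store_as_integers(values):
    """One display byte per 2-byte integer element, low byte first."""
    return b"".join(struct.pack("<h", v) for v in values)

def store_as_bytes(values):
    """Display bytes packed back to back, as POKE would leave them."""
    return bytes(values)

def read_display_bytes(memory, count, stride):
    """Bytes the routine would send when advancing HL by stride per byte."""
    return [memory[i * stride] for i in range(count)]

data = list(range(50))                  # 50 sample display bytes
as_integers = store_as_integers(data)   # payload at even offsets only
as_bytes = store_as_bytes(data)         # payload at every offset
```

Reading as_integers with a stride of 2 and as_bytes with a stride of 1 yields the same 50 values; the integer layout simply takes twice the memory.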

Obviously, we would do this in a loop, and we would probably have the appropriate array subscript already at hand, while we would have to add it to the base address when using POKE. Also, subscripts would be all integer, while POKEing would require at least single-precision variables.

So, what would be faster?

Surprise

Let's have two little, mostly identical test programs:

10  REM (1) Using Subscripts
100 DEFSNG A:DEFINT B-Z
110 DIM B(49)
120 PRINT TIME$
130 FOR I=0 TO 100
140 FOR J=0 TO 49:B(J)=J:NEXT
150 NEXT
160 PRINT TIME$:END

10  REM (2) Using Pokes
100 DEFSNG A:DEFINT B-Z
110 DIM B(49):A=VARPTR(B(0))
120 PRINT TIME$
130 FOR I=0 TO 100
140 FOR J=0 TO 49:POKE A+J, J:NEXT
150 NEXT
160 PRINT TIME$:END

On the Olivetti M10, program (1) takes 18 seconds to run, while program (2) needs 20 seconds to finish. Bummer!

So POKE is slower, and it stays slower even without the addition that assembles the target address. (Tested.) Bummer, again!

This is really slow, especially with regard to all the display data we have to write. So, would we gain anything by first writing the display data to memory and then displaying it by our machine language routine?

Let's have another program, testing the relative performance of BASIC's OUT command, which we're using for talking to the display:

10 REM (3) Testing Performance of OUT (Olivetti M10)
100 DEFSNG A:DEFINT B-Z
110 PA=185:PB=186:PC=254:PD=255:B=4
120 CLS:PRINT "a":CALL 29558:OUT PA,2:OUT PB,0
130 FOR I=0 TO 100:OUT PC,0
140 FOR J=0 TO 49:OUT PD,B:NEXT
150 NEXT
160 CALL 28998:PRINT "b":END

We can't use TIME$ here, since we have interrupts disabled, and with them the clock ticks. So we just write "a" and "b" to the display and keep an eye on the watch to get the interval. And this, dear reader, is below 15 seconds!

In other words, assigning anything to a variable or memory location is already slower than sending the same value to the display via the processor's I/O ports.
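
To put the three measurements on a common footing, the runtimes can be divided by the number of inner-loop passes. The following is just arithmetic in Python, with made-up variable names: FOR I=0 TO 100 runs 101 times and FOR J=0 TO 49 runs 50 times, so each program executes 5050 passes. The resulting figures include the FOR/NEXT overhead and are only as precise as timing by the second allows.

```python
OUTER_PASSES = 101                     # FOR I=0 TO 100
INNER_PASSES = 50                      # FOR J=0 TO 49
PASSES = OUTER_PASSES * INNER_PASSES   # 5050 passes per program

def ms_per_pass(seconds):
    """Average time per inner-loop pass in milliseconds."""
    return seconds * 1000 / PASSES

subscript_ms = ms_per_pass(18)   # program (1): about 3.56 ms
poke_ms = ms_per_pass(20)        # program (2): about 3.96 ms
out_ms = ms_per_pass(15)         # program (3): an upper bound of about 2.97 ms
```

So each pass costs a few milliseconds, and the assignment variants are slower than OUT by roughly half a millisecond or more per pass.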

Maybe, it's the data lookup missing in version (3)? Let's modify line 140 accordingly:

10 REM (4) Testing Performance of OUT with subscripts (Olivetti M10)
100 DEFSNG A:DEFINT B-Z
110 DIM B(49):PA=185:PB=186:PC=254:PD=255
120 CLS:PRINT "a":CALL 29558:OUT PA,2:OUT PB,0
130 FOR I=0 TO 100:OUT PC,0
140 FOR J=0 TO 49:OUT PD,B(J):NEXT
150 NEXT
160 CALL 28998:PRINT "b":END

No difference at all! Lookups of individual subscripts of the integer array B come virtually for free.

So, interrupts are turned off, maybe this is making the difference?

Let's have a look at subscripted assignments again (change in line 140, everything else the same as in v. 4):

10 REM (5) Testing Performance of subscripts w/o interrupts (Olivetti M10)
100 DEFSNG A:DEFINT B-Z
110 DIM B(49):PA=185:PB=186:PC=254:PD=255
120 CLS:PRINT "a":CALL 29558:OUT PA,2:OUT PB,0
130 FOR I=0 TO 100:OUT PC,0
140 FOR J=0 TO 49:B(J)=J:NEXT
150 NEXT
160 CALL 28998:PRINT "b":END

No difference to program (1), again!
(Mind that we're measuring time in seconds rather than milliseconds or ticks. The minor differences in runtime caused by interrupts don't show up, because they are too tiny.)

We conclude that assignments of any kind (and this includes POKE) are generally a speed killer in MS BASIC. It's interesting that a few arithmetic operations or lookups aren't really making a difference, but assignments do.

Using OUT to drive the display directly from BASIC is by all means faster than any method of first transferring the data to memory. That's actually a pity, since the idea of assembling the data first in BASIC and then writing it rapidly to the display by an ML subroutine would have had some charm to it. This is especially so since, with direct output, we're spending most of the time on the data assembly in BASIC while interrupts are disabled, which doesn't really recommend itself as a method of choice for a game that should be doing networking, too.

Enough for today. With a nod to networking performance, the ML approach may be still a viable option.

 


— This series is part of Retrochallenge 2016/01. —