BASIC Variables & Strings — with Commodore

Investigations into the memory utilization of Commodore BASIC (PET 2001, VIC-20, C64)

Variables and Strings in Commodore BASIC
The astounding intricacies of Commodore BASIC variables.

In our last episode, in which we were investigating the storage format for BASIC text for the sake of renumbering, we stumbled over some surprising facts, regarding the variable management and memory allocation in Commodore BASIC and consequences thereof. Namely, we found that there’s a single pointer, ARYTAB, used both for pointing at the next available memory location for the allocation of simple (non-indexed) variables and for marking the start of the memory space used to store arrays. Meaning, everytime the interpreter encounters a new simple variable, the entire block of previously defined arrays has to be moved by 7 bytes in order to provide the required space in memory.

This is certainly something, we may want to optimize our programs for. (Define all simple variables first, then allocate array space by “DIM()”.) However, there may be more of note in that quarter. Reason enough to investigate variable management in Commodore BASIC — and especially strings,.

(Yes, an advert.)

BASIC Memory Partions

Let’s recap how BASIC memory is partioned and how we may find out at any time about how this is currently configured. Here, we concentrate on the PET 2001 (ROM 1.0 and 2.0), the VIC-20 and the C64, which share the same flavor of BASIC. There are 6 pointers in the zeropage, which are used by the system for this, named TXTTAB, VARTAB, ARYTAB, STREND, FRETOP, and MEMSIZ. (Inbetween, there’s is also a utility pointer FRESPC, which is used for operations involving the use of scratch space, which won’t be of further interest here).

And this is, where we find them on each of these systems:

• C64 and VIC-20

label   loc.hex       loc.dec    comment

TXTTAB  002B-002C      43-44     Pointer: Start of BASIC Text
VARTAB  002D-002E      45-46     Pointer: Start of BASIC Variables
ARYTAB  002F-0030      47-48     Pointer: Start of BASIC Arrays
STREND  0031-0032      49-50     Pointer: End of BASIC Arrays (+1)
FRETOP  0033-0034      51-52     Pointer: Bottom of String Storage
FRESPC  0035-0036      53-54     Utility String Pointer
MEMSIZ  0037-0038      55-56     Pointer: Highest Address Used by BASIC


• PET 2001 ROM 2.0 ("new ROM")

TXTTAB  0028-0029      40-41     Pointer: Start of BASIC Text
VARTAB  002A-002B      42-42     Pointer: Start of BASIC Variables
ARYTAB  002C-002D      44-45     Pointer: Start of BASIC Arrays
STREND  002E-002F      46-47     Pointer: End of BASIC Arrays (+1)
FRETOP  0030-0031      48-49     Pointer: Bottom of String Storage
FRESPC  0032-0033      50-51     Utility String Pointer
MEMSIZ  0034-0035      52-52     Pointer: Highest Address Used by BASIC


• PET 2001 ROM 1.0 ("old ROM")

TXTTAB  007C-007D     124-125    Pointer: Start of BASIC Text
VARTAB  007E-007F     126-127    Pointer: Start of BASIC Variables
ARYTAB  0080-0081     128-129    Pointer: Start of BASIC Arrays
STREND  0082-0083     130-131    Pointer: End of BASIC Arrays (+1)
FRETOP  0084-0085     132-133    Pointer: Bottom of String Storage
FRESPC  0086-0087     134-135    Utility String Pointer
MEMSIZ  0088-008A     136-137    Pointer: Highest Address Used by BASIC

These system pointers enjoy the following use and meaning:

TXTTAB
points to the very start of the BASIC program text.
VARTAB
is the start of the variable space. It follows immediately after the end of the program in memory (the final empty line-link.) Simple variables are stored here. Each variable occupies 7 bytes of memory, regardless of the type. ARYTAB points to the next available space.
ARYTAB
is (also) the beginning of the space allocated for array storage. It points to the location following immediately after the last byte allocated for simple variables.
STREND
is the lower end of the space used for storing string literals. String literals are stored at the highest availble address, as marked by FRETOP, growing down from the top. The space starting at STREND and up to FRETOP is the free available memory.
FRETOP
marks the top end of the unallocated memory, below the last allocated string. New string literals will be stored immediately below this address.
FRESPC
is a utility pointer used internally by BASIC. It is not directly involved in variable management, but used for string handling.
MEMSIZ
is the top address of accessible memory. Strings start growing down from here.

In case of a PET 2001 with ROM 2.0 (“new ROM”) and 8K of RAM — we’ll use this for our demonstrations, since we have it at ready hands as a default configuration in our online emulation —, this is what these pointers look like, just after a fresh start or reset:

PET 2001, 8K, ROM 2.0, freshly initialized

TXTTAB  0028-0029:  01 04  →  $0401  (BASIC Text: 00 00)
VARTAB  002A-002B:  03 04  →  $0403
ARYTAB  002C-002D:  03 04  →  $0403
STREND  002E-002F:  03 04  →  $0403
FRETOP  0030-0031:  00 20  →  $2000
FRESPC  0032-0033:  44 44  →  $4444
MEMSIZ  0034-0035:  00 20  →  $2000  (Top of 8K RAM + 1)

So, on a PET 2001, BASIC starts at $0401 (on a C64, it’s $0801, and on a VIC-20 it starts at $1000, or, if there’s a memory extension of at least 8K, at $1200). The minimal BASIC text consist of two zero bytes, representing an empty line-link and thus the end of program (compare last episode). The next free, avaible space (which might be used in direct mode) for a variable to go starts immediately after this, ad $0403 (VARTAB). Since we have no variables defined yet, this block is of zero length and $0403 is also the beginning of the array space (ARYTAB). As we haven’t defined any yet, this is also the end of it and the begin of free, unused memory (STREND). Finally, $2000 is hex for 8K (MEMSIZ), and, since we haven’t used any strings either, this is also, where any stored string literals would start to grow down from (FRETOP).

Now, let’s load a simple program and RUN it:

10 DIM A(10)
20 FOR I=0 TO 10:A(I)=I:NEXT
30 B$=CHR$(66)

A dump of the memory provides the following image:

Memory map of a PET 2001 after execution of a simple program

We may easily discern some structure in this:

Stored Variables

Let’s have some fun and explore this a bit further.
For a beginning, let’s try some simple variables:

10 REM SIMPLE VARIABLES TEST
20 R=1:  REM REAL
30 R1=2: REM REAL, DOUBLE CHARACTER NAME
40 I%=1: REM INT
50 I1%=2:REM INT, DOUBLE CHARACTER NAME
60 S$=CHR$(65): REM "A"
70 S1$=CHR$(66):REM "B"

which provides us with the following memory dump:

The PET uses $AA as a fill byte, where the better known C64 uses $EA, which happens to be the op-code for the 6502 instruction NOP (dummy statement, no operation).

VARTAB → 04B7

04B7:                      52         R
04B8: 00 81 00 00 00 00 52 31  ......R1
04C0: 82 00 00 00 00 C9 80 00  ........
04C8: 01 00 00 00 C9 B1 00 02  ........
04D0: 00 00 00 53 80 01 FF 1F  ...S....
04D8: 00 00 53 B1 01 FE 1F 00  ..S.....
04E0: 00 AA AA AA AA AA AA AA  ........
...
1FF8: AA AA AA AA AA AA 42 41  ......BA

Again, we may recognize a few ASCII characters, hinting at variable names. We already know that all variables occupy 7 bytes, each, regardless of their type, so we may reformat this as:

R:   52 00 81 00 00 00 00  R......
R1:  52 31 82 00 00 00 00  R1.....
I%:  C9 80 00 01 00 00 00  ....... $C9 = $80 + "I"
I1%: C9 B1 00 02 00 00 00  ....... $B1 = $80 + "1"
S$:  53 80 01 FF 1F 00 00  S...... → $1FFF = "A"
S1$: 53 B1 01 FE 1F 00 00  S...... → $1FFE = "B"

So normal variables (real) are identified by their name in ASCII (max two bytes), where the second byte is zero for single-character names. Integer variables have the high-bit set on both name bytes and string variables only on the second one.

Real variables use the remaining 5 bytes for the floating point representation of their value. Integers have the value in HI-LO order in their third and forth byte, strings have their length (max 255) in the third byte, followed by a pointer to the storage location of the PETSCII sequence in the usual LO-HI format. Any remaining space to complete the 7 bytes is filled with binary zeros.

Arrays

What may be more fun than trying the same for arrays? ;-)

10 REM SIMPLE ARRAY TEST
20 I=0:DIM RA(2):DIM IA%(2):DIM SA$(2)
30 FOR I = 0 TO 2
40 RA(I)=1+I:IA%(I)=1+I:SA$(I)=CHR$(65+I)
50 NEXT

Which provides the following memory dump:

VARTAB → 047C
ARYTAB → 0482

047C:          49 00 82 40 00     I..@.
0480: 00 00 52 41 16 00 01 00  ..RA....
0488: 03 81 00 00 00 00 82 00  ........
0490: 00 00 00 82 40 00 00 00  ....@...
0498: C9 C1 0D 00 01 00 03 00  ........
04A0: 01 00 02 00 03 53 C1 10  .....S..
04A8: 00 01 00 03 01 FF 1F 01  ........
04B0: FE 1F 01 FD 1F AA AA AA  ........
...
1FF8: AA AA AA AA AA 43 42 41  .....CBA

Again, we may recognize some ASCII characters hinting at the start of a variable. And, indeed, the same naming conventions apply. (High-bit set on both name bytes for integers and solely on the second one for strings.) — Let’s reformat this array segement…

0482: 52 41 16 00 01 00 03   RA(0..2)
      81 00 00 00 00         1.0
      82 00 00 00 00         2.0
      82 40 00 00 00         3.0

0498: C9 C1 0D 00 01 00 03   IA%(0..2)
      00 01                  1
      00 02                  2
      00 03                  3

04A5: 53 C1 10 00 01 00 03   SA$(0..2)
      01 FF 1F               length: 1  → $1FFF ("A")
      01 FE 1F               length: 1  → $1FFE ("B")
      01 FD 1F               length: 1  → $1FFD ("C")

At first glance, we can see that arrays use a much more compact storage format for values. No extra bytes are wasted for padding, since each of the array members are of the same type and length. However, there are some extra bytes inbetween the variable name and the list of values, which are, BTW, obviously stored in ascending index order.

Like the BASIC text, array variables are organized as a linked list. However, in contrast to the individual lines of the BASIC text, array variables are linked by relative offsets. Thus, the entire memory block is self-contained and may be relocated easily. Which will happen inevitably, everytime, you add a new simple variable.

And What About Multidemsional Arrays?

10 REM MULTI DIM ARRAY TEST
20 A=0:B=0:DIM I%(2,2)
30 FOR A=0 TO 2
40 FOR B=0 TO 2
60 I%(A,B)=3*B+1*A
70 NEXT B
80 NEXT A

Wich provides the following dump:

ARYTAB → 0482

0482:       C9 80 1B 00 02 00    ......
0488: 03 00 03 00 00 00 01 00  ........
0490: 02 00 03 00 04 00 05 00  ........
0498: 06 00 07 00 08 AA AA AA  ........

Reformatted:

0482: C9 80 1B 00 02 00 03 00 03
      00 00
      00 01
      00 02

      00 03
      00 04
      00 05

      00 06
      00 07
      00 08

Two things may be observed: The array recieves another descriptor for the extend of the second dimension, added after the first discriptor, which we allready know, and the last index iterates fastest. (Meaning, given a series X (0..2) × Y (0..2), where X is the first index and Y the second index, items are stored as [X0/Y0], [X0/Y1], [X0/Y2], [X1/Y0], [X1/Y1]….) Moreover, we may confirm that the 5th byte provides indeed the dimensionality cont of the array, since it has here changed to 2.

Strings

As we’ve already seen, strings are a bit different, insofar as they do not contain their value by themselves, but are rather pointers. Let’s have a closer look at this, since there may be more to it, hidden in the finer details of string processing and referencing…

May we suggest another, small test program?

10 A$="THE QUICK":B$=" BROWN FOX ":C$="JUMPS OVER THE LAZY DOG"

Which produces the following dump:

0401:    42 04 0A 00 41 24 B2   B...A$.
0408: 22 54 48 45 20 51 55 49  "THE QUI
0410: 43 4B 22 3A 42 24 B2 22  CK":B$."
0418: 20 42 52 4F 57 4E 20 46   BROWN F
0420: 4F 58 20 22 3A 43 24 B2  OX ":C$.
0428: 22 4A 55 4D 50 53 20 4F  "JUMPS O
0430: 56 45 52 20 54 48 45 20  VER THE 
0438: 4C 41 5A 59 20 44 4F 47  LAZY DOG
0440: 22 00 00 00 41 80 09 09  "...A...
0448: 04 00 00 42 80 0B 18 04  ...B....
0450: 00 00 43 80 17 29 04 00  ..C..)..
0458: 00 AA AA AA AA AA AA AA  ........

A$: 41 80 09 09 04 00 00   length:  9  → $0409
B$: 42 80 0B 18 04 00 00   length: 11  → $0418
C$: 43 80 17 29 04 00 00   length: 23  → $0429

As may be observed, there are no entries pointing to the string storage area. Rather, the string variables point to the string literals inside the BASIC text (here printed in teal). Which is actually a great idea, since, this way, quite an amount of memory and runtime for copying may be saved.

However, mind the following program:

10 A$="THE QUICK":B$=" BROWN FOX ":C$="JUMPS OVER THE LAZY DOG"
20 D$=A$+B$
30 E$=D$+C$

Now, this results in quite an amount of string allocation, where we find the stored literals, as composed by the assignments, at the very top of the memory:

1FC0: AA 54 48 45 20 51 55 49  .THE QUI
1FC8: 43 4B 20 42 52 4F 57 4E  CK BROWN
1FD0: 20 46 4F 58 20 4A 55 4D   FOX JUM
1FD8: 50 53 20 4F 56 45 52 20  PS OVER
1FE0: 54 48 45 20 4C 41 5A 59  THE LAZY
1FE8: 20 44 4F 47 54 48 45 20   DOGTHE
1FF0: 51 55 49 43 4B 20 42 52  QUICK BR
1FF8: 4F 57 4E 20 46 4F 58 20  OWN FOX 

Notably, we find the sequence “THE QUICK BROWN FOX ” (A$+B$) twice, since string variables can only reference consecutive sequences of memory and there is now way to reuse this. Over time, string literals will pile up in memory, as we procede to compose strings in our program.

String Functions

So, what about string operations using BASIC’s built-in functions, which access partial strings, like “LEFT$()”, “RIGHT$()”, or “MID$”? Certainly, these can make use of the very properties of the string variables, by just modifing the pointer address and/or the length?

10 A$="TEST":B$=LEFT$(A$,2)a

0401:    1A 04 0A 00 41 24 B2   ....A$.
0408: 22 54 45 53 54 22 3A 42  "TEST":B
0410: 24 B2 C8 28 41 24 2C 32  $..(A$,2
0418: 29 00 00 00 41 80 04 09  )...A...
0420: 04 00 00 42 80 02 FE 1F  ...B....
0428: 00 00 AA AA AA AA AA AA  ........
...
1FF8: AA AA AA AA AA AA 54 45  ......TE


B$: 42 80 02 FE 1F 00 00

Oh no! Rather counter-intuitively, this has generated it’s own entry in the string storage! Surely, this is so that this can operate on complex expressions, which may be provided as an argument in some kind of buffer or scratch area, without piling up garbage in the string storage?

10 A$=LEFT$("TE"+"ST",3)

0401:    17 04 0A 00 41 24 B2   ....A$.
0408: C8 28 22 54 45 22 AA 22  .("TE"."
0410: 53 54 22 2C 33 29 00 00  ST",3)..
0418: 00 41 80 03 F9 1F 00 00  .A......
...
1FF8: AA 54 45 53 54 45 53 54  .TESTEST

Oh double-no!
There’s both the composed, temporary string and the sequence resulting from the “LEFT$()” operation! — Now, this isn’t optimized in any way.

Printing Strings

So, we may ask, are there any consequences for the PRINT statement? Obviously, it’s not a good idea to compose strings just for the sake of printing, as in “C$=A$+" "+B$:PRINT C$”.
But, does this also apply for PRINT expressions? Is there any difference in the following two statements?

10 PRINT "THE LAZY "+"DOG"

10 PRINT "THE LAZY ";"DOG"

Let’s give it a try…

10 PRINT "THE LAZY ";"DOG"
RUN
THE LAZY DOG

READY.
█

1FF8: AA AA AA AA AA AA AA AA  ........

As expected, not much to be observed here, the string storage area is still empty. Now for the more thrilling test, will the string concatenation by “+” generate a new entry, as we previously observed it in assignments?

10 PRINT "THE LAZY "+"DOG"
RUN
THE LAZY DOG

READY.
█

1FF0: AA AA AA AA 54 48 45 20  ....THE
1FF8: 4C 41 5A 59 20 44 4F 47  LAZY DOG

Oops, there it is! — Make sure to join your strings by semicolons (“;”) in your PRINT statements!

Other Operations Involving Strings

Just in case, you supposed, you had seen all of it, by now, including all possible pitfalls, consider the case of loading a program:

### COMMODORE BASIC ###

 7167 BYTES FREE

READY.
LOAD "TESTPROGRAM",8

SEARCHING FOR TESTPROGRAM
LOADING

→

1FF0: AA AA AA AA AA 54 45 53  .....TES
1FF8: 54 50 52 4F 47 52 41 4D  TPROGRAM

Yes, there it is. Any string operation like this will generate it’s own entry in the string storage area. Obviously, direct mode doesn’t use any references to the BASIC input buffer.

But, how about this?

### COMMODORE BASIC ###

 7167 BYTES FREE

READY.
10 LOAD "TEST-TAPE"
RUN

PRESS PLAY ON TAPE #1

→

0401:    13 04 0A 00 93 20 22   ..... "
0408: 54 45 53 54 2D 54 41 50  TEST-TAP
0410: 45 22 00 00 00 AA AA AA  E"......
...
1FF8: AA AA AA AA AA AA AA AA  ........

Indeed, when executing a LOAD statement from inside a program, our string storage is still empty.

Lessons Learned

Here, we may finsih our explorations. But there are a few things, which are worth keeping in mind, when operating with strings. As there are, in no particular order:

Exploiting String Variables

Now, could we use this for something productive, like, a nifty exploit? Certainly, we could modify a string on the fly, say, to print a line of a video game screen for a BASIC 10-liner contest or the like. If we put a string definition right at the very beginning of the program, we may easily work out, where the actual string sequence starts in the BASIC text. Moreover, since this is also the very first variable defined, it will be easy to work out its location in memory, simply by PEEKing the contents of pointer “VARTAB”.

Let’s have simple definition like,

10 A$="123456789"

Since this is the very first line, it will start at “TXTTAB, the start of the BASIC text in memory. On a PET (with ROM 2.0), this is at location $0401 (dec. 1025) and on a C64 at $0801 (dec. 2049). Add 2 locations to this for the link to the next line and another 2 for the line number (stored in binary). Hence, the text of our line starts at TXTTAB + 4 and, adding another 4 for “A$="”, we worked out that our string literal starts at TXTTAB + 9, on a PET at $040A, on a C64 at $080A. This is, where we’ll find the character “1”, being the first character in the string literal.

Quite the same, we may work out, where the properties of the string variable A$ are stored: Since it’s the very first variable, the length will be at VARTAB + 2, and the reference to the start location of the literal in VARTAB + 3 and VARTAB + 4.

• PET 2001 (ROM 2.0)
VARTAB: 002A-002B  (dec. 42,43)
VT = PEEK(43)*256+PEEK(42)

• C64, VIC-20
VARTAB: 002D-002E  (dec. 45,46)
VT = PEEK(45)*256+PEEK(46)

Now consider something like a cañon game, where we print a varying, winding shaft, filling the screen, which is procedurally scolled up by adding lines to the bottom by a print statement. We may define a string, wider than a screen line, containing 30 fill characters to left, a passage of 9 spaces in the middle, and another 30 fill characters to the right. Then, we could set the effective length of the string variable to 39 (3rd byte), and adjust any padding to the left (and by this the position of the “shaft” in the middle) by modifying the pointer in the string variable (4th and 5th byte). This way, we may adjust a window for printing inside our larger string literal, a technique, which may be also used for horizontal scrolling and the like.

String exploit in Commodore BASIC
POC of a nifty exploit manipulating Commodore BASIC string variables (PET 2001).

And here is the same in text representation (grey dots in the listing represent blanks):

LIST

 10 A$="▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒··
·······▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒"
 20 VT=PEEK(43)*256+PEEK(42):A=1025+9
 30 POKE VT+2,39:REM LENGTH A$=39
 40 P=A+15:REM 15 INTO THE STRING
 50 POKE VT+3,P AND 255
 60 POKE VT+4,INT(P/256)
 70 PRINT A$
READY.


RUN
▒▒▒▒▒▒▒▒▒▒▒▒▒▒         ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒

READY.
█

Here, our tiny proof of concept is drawing a random generated cañon by repeatedly printing a sliding window into the string literal defined for variable A$:

Proof of concept of a string exploit in Commodore BASIC
Our POC drawing a random generated cañon (PET 2001, ROM 2.0).

— That’s all, folks! —