The Case of the Missing 4th Commodore BASIC Variable (and the 5th Byte)

Yet another detective story.

Title illustration

We’ve met them already, back in the happier days of ’20, when things still looked right, before that cloud of gloom settled over the city: the jolly bunch known as the Commodore BASIC variables. To common knowledge, there are 3 of them, Float, Integer and String, and if you browse the gazettes and stories distributed over the main street counters of the Internet, this may be all you know about them. Each of them is known by their signature grin and each of them comes with a purpose.

Let’s have them rounded up for a quick identification:

Mug   Memory Signature    Business                Stature, Personal Traits
A1    A 1   (0x41 0x31)   Floating Point Number   5 bytes: 1 byte exponent, 4 bytes mantissa (with sign bit)
I2%   I̅ 2̅   (0xC9 0xB2)   Integer Number          5 bytes: 2 bytes binary value, 3 zero-bytes (unused)
S3$   S 3̅   (0x53 0xB3)   String                  5 bytes: length, 2-byte memory pointer, 2 zero-bytes

Each of them is 7 bytes in memory, 2 bytes for the name, followed by a 5-byte variable body, which does the actual business. The name also encodes their type, so you already know who you’re dealing with as soon as they come around. They don’t make much of a secret of their business, as they proudly show it off, right in their face, by sign marks sprinkled all over them.

Specifically, Float comes with a clean baby face, with no marks at all, fresh ASCII characters all over. Integer, however, is quite another character, marked by signs on both cheeks, and ol’ String is known by a single sign mark on the second, right-hand side of his signature grin.

- Commodore BASIC variables by sign-bit -

0 0   Float
1 1   Integer
0 1   String
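The line-up above can be turned into a small identification routine. This is only a sketch in Python (not something that runs on the PET); the function name `classify` and the label "unlisted" are made up for illustration.

```python
def classify(name: bytes) -> tuple:
    """Return (type, readable name) for a 2-byte Commodore BASIC variable name."""
    first, second = name
    flags = (first >> 7, second >> 7)            # sign bits of both name bytes
    kinds = {(0, 0): ("float", ""),
             (1, 1): ("integer", "%"),
             (0, 1): ("string", "$")}
    kind, suffix = kinds.get(flags, ("unlisted", ""))
    # Strip the sign bits to get the plain ASCII characters back.
    # Single-letter names carry a zero in the second position.
    plain = bytes([first & 0x7F, second & 0x7F]).decode("ascii").rstrip("\x00")
    return kind, plain + suffix

# The three memory signatures from the table
for raw in (b"\x41\x31", b"\xC9\xB2", b"\x53\xB3"):
    print(raw.hex(" ").upper(), classify(raw))
```

Any signature with a mark on the first byte only falls through to "unlisted", which is exactly the gap in the table.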

If you have been around for some time in the backyards and alleys they call the Binaries, you eventually develop a feel for this. Something was telling me that this may not be all, that there may be still some in hiding. A little something we hadn’t seen yet. Who knows, maybe a damsel in distress?

Just to put you in the picture, I interrogated them with my trusty PET 2001 emulator, which now comes with a snappy tool for disassembling variables as in memory. (This is yet another story, stay tuned.) There’s no hiding anymore and here is what they look like without their pretty listing clothes:

Screenshot of an emulated PET screen showing a BASIC program
No-gloves investigation into Float, Integer, and String.
### COMMODORE BASIC ###

 15359 BYTES FREE

READY.
10 A1 =2.345
20 I2%=258
30 S3$="BLA"

RUN

READY.
█
→ Utils/Export → Disassemble Variables
                         .[simple BASIC variables]

042B  41 31               A1
042D  82 16 14 7A E2      =  2.345
0432  C9 B2               I2%
0434  01 02 00 00 00      =  258
0439  53 B3               S3$
043B  03 24 04 00 00      len: 3, @ $0424

                         .[end of BASIC variables]

(Mind that they look a bit different, when they come in a flock and identify only by subscript.)
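For those who want to check the read-out by hand, here is a minimal Python sketch decoding the three variable bodies shown above. The float layout is an assumption not spelled out in this text (excess-128 exponent, sign kept in the top bit of the first mantissa byte, implied leading 1), but it agrees with the dump. The function names are made up for illustration.

```python
def decode_float(body: bytes) -> float:
    """Decode a 5-byte float body: exponent byte, then 4 mantissa bytes."""
    exponent, m1, m2, m3, m4 = body
    if exponent == 0:                       # a zero exponent means the value 0
        return 0.0
    sign = -1.0 if m1 & 0x80 else 1.0       # top bit of first mantissa byte
    # Put the implied leading 1 back where the sign bit was stored.
    mantissa = ((m1 | 0x80) << 24) | (m2 << 16) | (m3 << 8) | m4
    return sign * mantissa / 2**32 * 2.0 ** (exponent - 128)

def decode_integer(body: bytes) -> int:
    """First two bytes hold a signed 16-bit value, high byte first."""
    return int.from_bytes(body[:2], "big", signed=True)

def decode_string(body: bytes) -> tuple:
    """Return (length, address); the pointer is stored low byte first."""
    return body[0], body[1] | (body[2] << 8)

print(decode_float(bytes.fromhex("82 16 14 7A E2")))     # body of A1
print(decode_integer(bytes.fromhex("01 02 00 00 00")))   # body of I2%
length, address = decode_string(bytes.fromhex("03 24 04 00 00"))
print(length, f"${address:04X}")                         # body of S3$
```

Note the mixed byte order: the integer value is stored high byte first, while the string pointer is stored low byte first.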

But where is the story in that — and what about the damsel?

Another Type?

It wasn’t until a buddy of mine came around with an old source of his that I caught a first glimpse of her. (Fancy talk aside, of which we may have had enough by now, this was Jason Cook, who became an invaluable beta tester for the new version of the emulator. Check out his new PET game!)

1C0A  D2 00 B4 0A 13 1C B5  ;var: "R" + sign-bit, 0

There she was, shyly revealing the sign-bit that adorned her first byte!

So there actually are,

- Commodore BASIC variables by sign-bit -

0 0   Float
1 1   Integer
0 1   String
1 0   damsel in distress?

But, who was she, and was she actually in distress?

This is an even deeper mystery, since Commodore never made much of a secret of variable formats, right from the beginning. The PET manuals clearly describe how BASIC interacts with memory and provide some examples of in-memory formats, but they only mention 3 types: floating point, integer, and string. So what may this 4th variable type be, and what mysteries are lurking behind it?

I already knew something, namely that she was known by the single letter “R”. So it wasn’t that difficult to trace her back to her origins, hidden in a bunch of densely formatted BASIC statements:

150 DEFFNR(X)=INT(X*RND(U)):GOSUB8010:A1$="NLTSMR"

(STARTREK1978.PRG by Jason Cook)

It’s a DEFFN variable! It actually makes some sense that these user defined functions should be stored as variables, in order to look them up by name.

So let’s have a closer look at her (*blush*) anatomy…

In order to do so, let’s come up with a much simpler example that lends itself a bit more easily to investigation:

10 DEFFNR(X)=1+X*X
20 PRINT FNR(3)

RUN
 10

Now let’s have a look at the variable as in memory:

→ Utils/Export → Disassemble Variables

                         .[simple BASIC variables]

0420  D2 00               FNR()
0422  0C 04 29 04 31      – ??? –
0427  58 00               X
0429  00 00 00 00 00      =  0

                         .[end of BASIC variables]

And, as we’re at it, let’s inspect the tokenized program as in memory, as well:

→ Utils/Export → Disassemble Program

                         .[tokenized BASIC text]

0401  12 04               link: $0412
0403  0A 00               line# 10
0405  96                  token DEF
0406  A5                  token FN
0407  52 28 58 29         ascii «R(X)»
040B  B2                  token =
040C  31                  ascii «1»
040D  AA                  token +
040E  58                  ascii «X»
040F  AC                  token *
0410  58                  ascii «X»
0411  00                  -EOL-
0412  1E 04               link: $041E
0414  14 00               line# 20
0416  99                  token PRINT
0417  20                  ascii « »
0418  A5                  token FN
0419  52 28 33 29         ascii «R(3)»
041D  00                  -EOL-
041E  00 00               -EOP- (link = null)

                         .[end of BASIC text]
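The structure of this listing (a link to the next line, a line number, then the text up to a zero byte) can be walked mechanically. The following Python sketch does so over the very bytes shown above; `word` and `lines` are illustrative names, and the start address is taken from the dump.

```python
BASE = 0x0401   # start of the BASIC text in the dump above

PROGRAM = bytes.fromhex(
    "12 04 0A 00 96 A5 52 28 58 29 B2 31 AA 58 AC 58 00 "   # line 10
    "1E 04 14 00 99 20 A5 52 28 33 29 00 "                  # line 20
    "00 00"                                                 # end of program
)

def word(mem: bytes, addr: int) -> int:
    """Read a 16-bit value (low byte first) at an absolute address."""
    i = addr - BASE
    return mem[i] | (mem[i + 1] << 8)

def lines(mem: bytes):
    """Yield (address, line number, tokenized text) by following the links."""
    addr = BASE
    while True:
        link = word(mem, addr)
        if link == 0:                       # a null link ends the program
            return
        number = word(mem, addr + 2)
        start = addr + 4 - BASE
        end = mem.index(0, start)           # text runs up to the zero byte
        yield addr, number, mem[start:end]
        addr = link

for addr, number, text in lines(PROGRAM):
    print(f"${addr:04X}  line {number}: {text.hex(' ').upper()}")
```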

A versed investigator of BASIC affairs may have spotted it right away: the first four bytes of the variable body are two pointers into memory, given away by their second (high) bytes of 04, which point at addresses in the 0x0400–0x04FF range. On the PET, BASIC starts at 0x0401, populated by the tokenized BASIC text, followed by simple variables and then arrays, if there are any.

Let’s rearrange this:

0401  12 04               link: $0412
0403  0A 00               line# 10
0405  96                  token DEF
0406  A5                  token FN
0407  52 28 58 29         ascii «R(X)»
040B  B2                  token =
040C  31                  ascii «1»
040D  AA                  token +
040E  58                  ascii «X»
040F  AC                  token *
0410  58                  ascii «X»
0411  00                  -EOL-

      (...)

0420  D2 00               FNR()
0422  0C 04               pointer to $040C (low, high)
0424  29 04               pointer to $0429 (low, high)
0426  31                  – ??? –
0427  58 00               X
0429  00 00 00 00 00      =  0

The first pointer taps directly into the function body, right after the “=” of the function definition.
The second pointer taps directly into the variable body of the argument “X”, which is actually a global variable. (Which does make some sense, as there are only global variables in BASIC.)

This already promises some speedy and optimized execution at run-time, as the pointers refer immediately to memory as needed. Moreover, we can see why only floating point values are allowed as an argument: the pointer to the argument skips past any notion of the name and type of that variable, assuming right away that it’s a float.
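The two pointers can be followed in a few lines of Python. This sketch rebuilds the memory image from the two dumps above (program text from $0401, variables from $0420); `peek`, `deek` and `decode_fn` are made-up helper names.

```python
BASE = 0x0401

MEMORY = bytes.fromhex(
    "12 04 0A 00 96 A5 52 28 58 29 B2 31 AA 58 AC 58 00 "   # $0401 line 10
    "1E 04 14 00 99 20 A5 52 28 33 29 00 00 00 "            # $0412 line 20, end
    "D2 00 0C 04 29 04 31 "                                 # $0420 FNR()
    "58 00 00 00 00 00 00"                                  # $0427 X
)

def peek(addr: int) -> int:
    """Read one byte at an absolute address."""
    return MEMORY[addr - BASE]

def deek(addr: int) -> int:
    """Read a 16-bit pointer, low byte first."""
    return peek(addr) | (peek(addr + 1) << 8)

def decode_fn(var_addr: int) -> tuple:
    """Return (name, body pointer, argument pointer, argument name)."""
    name = chr(peek(var_addr) & 0x7F)       # drop the sign bit of the name
    body_ptr = deek(var_addr + 2)           # into the tokenized program text
    arg_ptr = deek(var_addr + 4)            # into the body of the argument
    arg_name = chr(peek(arg_ptr - 2))       # the name sits 2 bytes before it
    return name, body_ptr, arg_ptr, arg_name

name, body_ptr, arg_ptr, arg_name = decode_fn(0x0420)
print(f"FN{name}: body @ ${body_ptr:04X}, arg {arg_name} @ ${arg_ptr:04X}")
print(f"byte just before the body: ${peek(body_ptr - 1):02X}")
```

The byte just before the body pointer is $B2, the token for “=” in the listing, so the pointer lands right after the assignment.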

The Mystery of the 5th Byte

So, what may the 5th byte be about? Some of this may remind us of how strings are stored, by a first byte storing the length and then a pointer to the in-memory location, at which the string starts. Is it a length of sorts? (This may seem even more plausible, as the code for executing “DEFFN” borrows some from the code for string handling.)

This was actually my first assumption, nourished by some coincidence. However, of course, it is not. Execution at run-time just stops at the first colon (“:”) or the end of the line, whichever comes first, extending over a single BASIC statement. No lengths required for that. Is it related to the variable name? This was yet another coincidence in my early investigations, as can be clearly seen from the above example, where 0x31 gives the ASCII code for “1”, which bears no relation to “R”. So, what is it?

Let’s expand on our little experiment:

10 DEFFNR(X)=1+X*X
20 DEFFNG(Y)=3*Y+4

Which (after RUN) provides the following variable read-out:

0425  D2 00               FNR()
0427  0C 04 2E 04 31      @ $040C, arg @ $042E, ??
042C  58 00               X
042E  00 00 00 00 00      =  0
0433  C7 00               FNG()
0435  1D 04 3C 04 33      @ $041D, arg @ $043C, ??
043A  59 00               Y
043C  00 00 00 00 00      =  0

So, the first variable has a 5th byte of 0x31 and the second variable one of 0x33. Is it some counter? (This also shows, once again, that this isn’t related to any names, since nothing in “R”, “G”, “X”, or “Y” translates to a difference of 2.)

So let’s add another DEFFN definition to this, just to verify:

10 DEFFNR(X)=1+X*X
20 DEFFNG(Y)=3*Y+4
30 DEFFNI(T)=3*T-2

0436  D2 00               FNR()
0438  0C 04 3F 04 31      @ $040C, arg @ $043F, ??
043D  58 00               X
043F  00 00 00 00 00      =  0
0444  C7 00               FNG()
0446  1D 04 4D 04 33      @ $041D, arg @ $044D, ??
044B  59 00               Y
044D  00 00 00 00 00      =  0
0452  C9 00               FNI()
0454  2E 04 5B 04 33      @ $042E, arg @ $045B, ??
0459  54 00               T
045B  00 00 00 00 00      =  0

Hum, this is somewhat disappointing: both the second and the third FN variable have 0x33 as their last byte. So it isn’t a counter at all. Moreover, adding some other variables to our short program or changing any of the names doesn’t show any effect on this 5th byte of the variable body, at all.

However, if we change the very first character of the function body, we finally do make a difference:

30 DEFFNI(T)=4*T-2

0452  C9 00               FNI()
0454  2E 04 5B 04 34      @ $042E, arg @ $045B, ??

Let’s make this

30 DEFFNI(T)=T-2

0450  C9 00               FNI()
0452  2E 04 59 04 54      @ $042E, arg @ $0459, ??

As the eagle-eyed may have observed already, 0x34 is the ASCII code for “4” and 0x54 is ASCII “T”.
It’s the first byte literal of our DEFFN function body!

Let’s check this with a token in the first position:

30 DEFFNI(T)=INT(T)

0451  C9 00               FNI()
0453  2E 04 5A 04 B5      @ $042E, arg @ $045A, ??

Yes, 0xB5 has the sign-bit set, giving away the BASIC token, and it is the BASIC token for INT, indeed:

0425  1E 00               line# 30
0427  96                  token DEF
0428  A5                  token FN
0429  49 28 54 29         ascii «I(T)»
042D  B2                  token =
042E  B5                  token INT
042F  28 54 29            ascii «(T)»
0432  00                  -EOL-

Well, this is that mystery solved.
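The finding can be cross-checked against the dumps without an emulator. A small Python sketch, using only bytes copied from the listings above (line 10 of the first example and line 30 of the INT example); the names are made up for illustration.

```python
# (address of the first byte, bytes) fragments copied from the program dumps
LINE_10 = (0x0403, bytes.fromhex("0A 00 96 A5 52 28 58 29 B2 31 AA 58 AC 58 00"))
LINE_30 = (0x0425, bytes.fromhex("1E 00 96 A5 49 28 54 29 B2 B5 28 54 29 00"))

# 5-byte FN variable bodies copied from the variable dumps
FNR_BODY = bytes.fromhex("0C 04 29 04 31")   # DEFFNR(X)=1+X*X
FNI_BODY = bytes.fromhex("2E 04 5A 04 B5")   # DEFFNI(T)=INT(T)

def first_body_byte(fragment: tuple, body: bytes) -> int:
    """Follow the first pointer of an FN body into a program fragment."""
    base, data = fragment
    pointer = body[0] | (body[1] << 8)       # low byte first
    return data[pointer - base]

for label, fragment, body in (("FNR", LINE_10, FNR_BODY),
                              ("FNI", LINE_30, FNI_BODY)):
    found = first_body_byte(fragment, body)
    print(f"{label}: program byte ${found:02X}, 5th byte ${body[4]:02X}")
```

In both cases the byte found at the pointer target equals the 5th byte of the variable body.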

But, does this 5th byte matter?

10 DEFFNR(X)=1+X*X
20 DEFFNG(Y)=3*Y+4
30 DEFFNI(T)=INT(T)
40 POKE 1160,32 : REM DEC 1160 = $0488
50 PRINT FNI(4.1)

RUN
 4

READY.

It doesn’t seem so. The result is still what we’d expect from the BASIC function INT. It’s also not what we’d expect if we replaced the token INT in the BASIC text by 32, which is a simple space/blank, giving “ (T)”.

And we actually changed that last byte:

0482  C9 00               FNI()
0484  2E 04 8B 04 20      @ $042E, arg @ $048B, « »

Let’s have another go at this, this time replacing the “1” in FNR by the ASCII code for “2”:

10 DEFFNR(X)=1+X*X
20 DEFFNG(Y)=3*Y+4
30 DEFFNI(T)=INT(T)
40 POKE 1130,50 : REM DEC 1130 = $046A
50 PRINT FNR(2)

RUN
 5

0464  D2 00               FNR()
0466  0C 04 6D 04 32      @ $040C, arg @ $046D, «2»

This didn’t make a difference, either.
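In case the POKE addresses look like they came out of thin air: they follow from the variable addresses in the two read-outs. A minimal Python sketch of the arithmetic (the function name is made up):

```python
def fifth_byte_address(variable_address: int) -> int:
    """Skip the 2 name bytes and the first 4 body bytes."""
    return variable_address + 2 + 4

for label, variable_address in (("FNI", 0x0482), ("FNR", 0x0464)):
    address = fifth_byte_address(variable_address)
    print(f"{label}: ${address:04X} = {address}")
```

This gives $0488 (decimal 1160) for FNI and $046A (decimal 1130) for FNR, matching the two POKE statements.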

A more thorough investigation into literature on the matter produced a sole source, namely “Programming the PET/CBM” by Raeto West. Here, we find FN variables actually described as a distinctive type, on p. 9, where the very last byte is described as “INITIAL OF VAR.”

Facsimile: Raeto West, Programming the PET/CBM; p. 9

The exact meaning of “INITIAL OF VAR.” may not be that clear, as it’s provided without further context, but, as we’ve established already, this is fair and correct if we’re meant to understand “the initial byte of the function body referred to by the variable.” (As opposed to, e.g., “the first character of the variable identifier,” or similar.) The descriptive text goes as follows:

A function definition has two pointers; one to the definition in the body of the BASIC program, and one to the floating-point dependent variable. They point just after the '=' sign and to the exponent byte respectively. The final byte is garbage, generated when the definition is set up, and is not used.

Well, I guess, that’s it. Especially, as (as already mentioned) the code makes use of some resources dedicated to string handling. However, it is still a bit strange that this 5th byte isn’t just set to 0 as with any other surplus bytes in integer and string variables.

What is this FN damsel hiding? What causes her such distress that she should exhibit the most intimate secrets of her build like this in broad daylight?

I guess, this is yet another story. Which also brings this true detective story to an end.

Anyways, if you want to have a closer look at the new version of the PET emulator, here it is running all the latest demos:

Edit/Update

While it may be correct to speak of the function parameter (argument) as a global variable, in the sense that it is created together with the FN variable and stored alongside it in the global variable memory, it doesn’t behave like one:

10 X=1
20 DEFFNR(X)=1+X*X
30 PRINT X
40 PRINT FNR(2)
50 PRINT X
RUN
 1
 5
 1

READY.
█

Moreover, even if there is no conflict, the value passed to the function parameter isn’t accessible from outside:

10 DEFFNR(X)=1+X*X
20 PRINT FNR(2)
30 PRINT X
RUN
 5
 0

READY.
█

As may be inferred from this, the value of the variable (1 byte exponent and 4 bytes mantissa) is saved before the variable is accessed as a parameter/argument and restored afterwards. As user defined functions are callable from inside user defined functions, this cannot be just a buffer in the zero-page (there may be more than one value to be saved at any given time). Rather, the contents of the variable body are pushed to the processor stack and then restored from there. So, while they may be defined as global variables, these actually behave like local variables.

We may note that the effort taken to make these behave like local variables (5 pushes and 5 pulls to and from the stack, together with the reads and writes that go with this) somewhat counteracts the efficiency suggested by the argument pointer tapping directly into the variable body.
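The save-and-restore behavior described here can be modeled in a few lines of Python. This is a toy model of the behavior inferred above, not of the actual ROM routine: a dictionary stands in for the variable memory, a list for the processor stack, and the second function, FNG, is hypothetical, added only to show nesting. Unlike BASIC, the model creates the parameter on the call rather than at DEF time.

```python
variables = {}   # BASIC's global float variables, by name
stack = []       # stands in for the processor stack

def call_fn(param, body, argument):
    """Evaluate a DEFFN-style function whose parameter lives in `variables`."""
    stack.append(variables.get(param, 0.0))   # save the current value
    variables[param] = argument               # the parameter takes the argument
    try:
        return body()
    finally:
        variables[param] = stack.pop()        # restore the saved value

def fnr():       # DEFFNR(X)=1+X*X
    return 1 + variables["X"] * variables["X"]

def fng():       # hypothetical: DEFFNG(X)=FNR(X+1)+X
    return call_fn("X", fnr, variables["X"] + 1) + variables["X"]

variables["X"] = 1
print(variables["X"])           # 1
print(call_fn("X", fnr, 2))     # 5
print(variables["X"])           # 1
print(call_fn("X", fng, 2))     # 12
```

In the last call, two saved values of X are on the stack at the same time, which is why a single fixed buffer would not do.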

Wait, there is more: don’t miss the Bonus Episode!

Postscriptum: I’m aware that this text may cause some irritation. Which is a good thing. Let’s say, there was another, similar text, transposing computer history to the private detective paperback genre, and I was somewhat alarmed by the inherent violence. This is, of course, owed to the very genre, its styling of the narrator as an individual without empathy and interest, but cold curiosity for his object, and a general air of objectification. The portrayal of the narrator is equally offensive or scandalous as is the portrayal of any other story characters through his eyes. Of course, these are not real humans, but paper-cut, mechanical figures owing their existence to the sole purpose of promoting the story, but still.

So, why use a trope like this? First, it’s meant to (a) raise an interest in the subject and to encourage readers to conduct their own investigations with the tools advertised, but (b) it is also meant to highlight a preexisting disposition in the language of research and tech, which comes with its own air of general objectification. (I.e., I saw myself faced with using jargon before and felt uneasy about this.)

Why does tech jargon use terms like anatomy for in-memory structures or build for code configurations? What happens if this meets the universal human habit of anthropomorphizing the objects we’re interacting with? Is the language of curiosity innocent? Is this linked to (a specific) gender (I don’t think so), or is there a more fundamental problem? Is there even a way to link the anthropomorphic and the language of cold-blooded research? Is there a fundamental opposition between research and care for an object? If so, what does this mean for curiosity? Where does the problem arise, if we happen to anthropomorphize a simple data structure, as is common human habit, in the context of a specific genre of storytelling? How much of this is imported by the reader, how much is owed to the semantic field? (A text in its actualization is always a cooperation between the reader and the material, which is basic text theory. Mind that you are not necessarily meant or required to sympathize or to identify with the narrator character. You even shouldn’t do so; there are markers for this right from the beginning. Hint: You are not supposed to identify with the anti-hero. Is a marker, highlighting the obvious, e.g., “*blush*”, in itself offensive?)

Is it creepy? Heck, yes, right from the very setup. However, I’d argue that this creepiness is a structural one, which is the point. (If we were able to confront ourselves with this kind of inherent creepiness, this kind of confrontation may be even fun, like an old, disgusting movie. It should be pretty impossible nowadays to consume something like the classic detective story genre from a naïve, affirmative perspective.)

Mind that this is also part of an ongoing series discussing various textual strategies. (Compare 1, 2, 3, 4, 5, 6, including the “bonus episode”, published together with this and providing a foil for this text, just as this text provides a foil for that, and at least another post meant to appear soon.) I may also highlight that my background is in such things as social sciences, philosophy, media and film analysis (and not in tech), and that this is not a naïve text. See also the conclusion at the end of the second part.

(This is also an experiment regarding how those two texts will perform in search, etc., rankings.)