Go, PET, Let Hen

Curious adventures in (Commodore) BASIC tokenizing.

Stylized illustration involving a Commodore PET 2001 computer and a hen.

No, this is not about a one-hit-wonder from the charts of 1977, rather, it’s about something that could have happened on a Commodore PET around this time. More generally, it is about a curious bug/feature in early BASICs and how this gave way to an additional keyword in Commodore BASIC for the PET.

Both are related to BASIC syntax and how white space in BASIC text is handled by the interpreter. — Which sets up the topic of today’s installment, namely, the tokenizer routine in early MS 9-digit BASIC for the 6502 and how this evolved over time. As this promises some fun with #BASIC and #6502 code on the #PET and involves some #archeology, as well, we really ought to have a look at this…

To begin with, let’s fire up a PET 2001 with BASIC 1.0, the first ROM version shipped with the PET 2001, also known as “Old ROM”, to start a little experiment:

So, what’s happening here?
And why does a perfectly valid BASIC statement throw a sntax error?

Let’s investigate…

As you may probably know, all white space in BASIC is optional and we do not require any of this as a keyword delimiter or separator of any kind. The two following statements work perfectly the same:

However, there’s actually a small difference: while the first one may be more readable, the second one executes a bit faster and requires less memory, for which it is preferred by the greybeards.

Going back to our little experiment, something that may spring to the eye is a tiny, but crucial difference in the BASIC text as entered and the BASIC text as listed:

Did you see it? — No, there is no paranormal entity. — Rather, it’s something missing in the listing, namely the blank between “LE” and “THEN”, indicating that this is about a misshap in white space handling.

In order to investigate this further, we have to ask first

Words and Letters, or, What is “Tokenizing”?

An important consequence of not requiring any separation between keywords and other program text is that every letter in the program text can be the beginning of a keyword. E.g., as we encounter

We do not know, yet, whether this is the start of a keyword, as in

or “V” is part of an identifier (variable name), as in

The only way to discern this is by scanning ahead and finding out if this adds up to a valid keyword (a BASIC command) or not.

Now, going over a list of all keywords for each letter, comparing each letter to a valid start of a keyword and then scanning ahead to see, if this does indeed match, is a rather costly operation. It would be nice, if we could somehow avoid this.
And this is where the tokenizing comes to the rescue: every time, we enter a line of BASIC, we’ll copied the line to an input buffer, and there will be routine to go over this, in order to discern any keywords. Every time the routine does find a keyword, it will replace it by an integer value not allowed in regular BASIC text. As BASIC is generally restricted to 7-bit ASCII (that is, outside strings), we can use any values starting at 128 (hex 0x80) and above. This way, we do it once and for all. Later, when the runtime routine of the interpreter encounters such a value (a token), it will immediately know what this is about and it will be even able to use this as an index into a lookup table for quick access.

Moreover, in times when memory was still expensive and thus scarce, this reduced the memory requirements for storing a BASIC program quite significantly. In an average BASIC listing, keywords make up for about half of the text, besides line numbers, so the savings will quite dramatically extend the reach of our memory.

What’s not to love about this? Tokens are really the key to BASIC.

In-Memory BASIC Programs

Knowing this, we may have a deeper look into BASIC text as is represented in memory. As for instance, on the example of our offending line of BASIC:

The equivalent tokenized program text is stored in the user area of the memory, which starts on the PET at 0x0401:

addr  code                semantics

0401  17 04               link: $0417    (addr. of next line, 16-bit little endian)
0403  0A 00               line# 10       (16-bit binary)
0405  8B                  token IF
0406  20 4C 53 20         ascii « LS »
040A  B2                  token =
040B  20                  ascii « »
040C  88                  token LET
040D  48 45 4E 20         ascii «HEN »
0411  89                  token GOTO
0412  20 31 30 30         ascii « 100»
0416  00                  -EOL-
0417  00 00               -EOT-          (link = null)

At first glance, we can see that BASIC has converted our input string “LE THEN ” into the token for “LET” (0x88) and the remaining text “HEN ”. Moreover, we can see that any keyword tokens are encoded as a single byte with a set sign-bit (n0x80) and anything else is preserved as-is in plain ASCII text.

So, quite clearly, BASIC skipped over the blank separating “LE” from “THEN”, recognizing the keyword (command) “LET” and encoding this as the token ID 0x88. — But, why should it do so?

BASIC and White Space

Initially, BASIC simply ignored any white space, just like FORTRAN and Algol did. The parser simply skipped over any blanks and these were of no significance, whatsoever. This is how Dartmouth BASIC did it and how many versions of BASIC did it in the 1970s. Notably, is this also what you would find in listings.

But this had consequences: e.g., at this time, it wasn’t that clear whether “GOTO” was a single word or two, as in “GO TO”. In English, these are certainly two words and to the interpreter it didn’t matter. So we may be rather undecided on this. — And, indeed, there were plenty of listings, where you could find constructs like:

In MS BASIC, this is taken care of by the tokenizer routine. As we may observe in our above peek into the in-memory representation, it’s particularly in keywords, where the white space was ignored, while other instances of white space are preserved and faithfully reproduced in any listings.

That is, AppleSoft BASIC, another early version of MS BASIC for the 6502 and a close relative to Commodore BASIC, skips any white space and inserts a single blank after each keyword.
Thus, on the Apple ][, we get (even more obviously giving away, what might have gone wrong):

This is not dissimilar from what Commodore BASIC does with line numbers:

Like AppleSoft BASIC in any other context, it skips over any white space after the line number and inserts a single blank for listings.

Which is also why we can’t have indented text and have to resort to tricks, if we really want indentations:

As we may see, Commodore BASIC doesn’t skip any blanks after a separating colon (“:”).

And Numbers? (BTW)

Glad you should ask. It’s a lesser known fact that numbers are another area where Commodore BASIC deliberately skips over any blanks:

While the listing proves that the constant “1 000 000 .00” is faithfully stored in memory, the interpreter doesn’t hick up over the interspersed blanks, rather, it elegantly skips over them.

We can see this also, as we peek into memory:

addr  code                semantics

0401  16 04               link: $0416
0403  0A 00               line# 10
0405  41                  ascii «A»
0406  B2                  token =
0407  20 31 20 30 30 30   ascii « 1 000 000 .00»
040D  20 30 30 30 20 2E   
0413  30 30               
0415  00                  -EOL-
0416  21 04               link: $0421
0418  14 00               line# 20
041A  41                  ascii «A»
041B  B2                  token =
041C  20 41               ascii « A»
041E  AC                  token *
041F  32                  ascii «2»
0420  00                  -EOL-
0421  29 04               link: $0429
0423  1E 00               line# 30
0425  99                  token PRINT
0426  20 41               ascii « A»
0428  00                  -EOL-
0429  00 00               -EOT- (link = null)

But why should it do so? Obviously, this is slow and cumbersome, since the interpreter has to parse each of those constants anew and skip any of those blanks, whenever it encounters a number, often repeatedly so, like in a FORNEXT loop.

Well, there‘s more than a single way to represent a number.
E.g., all those are the same:

1000
1000.0000
1 000
1 000.00
1E3
1.0E+3
(and so on)

It wouldn’t be that user friendly to type

just to have this listed back as, say,

The same goes for identifiers (variable names). The interpreter could maintain a reference table and just insert offsets into the program, but, for a dialog-like system, it’s much better to maintain a meaningful symbolic representation for listing and editing (instead of maybe an offset marker or hash, like #34e2). Or, as only the first two characters are significant, it could easily truncate any longer names.

A compiler, on the other hand, can convert this at compile time, it can also resolve any calculations at compile time and even optimize over such precomputations. E.g, our above example

10 A= 1 000 000 .00
20 A= A*2

could become just something along the lines of

<STORE-REAL@FRAME-OFFSET>  0x48  2e6

Meaning, if BASIC is slow (and it is), it is so, because it’s user-friendly.
It’s slow so that it can preserve all the intricacies of the input, in order to reproduce them in a meaningful listing.

That is but for the instances where it doesn’t, like with our “LET HEN” example.

Crunch Time

As for now, we have already established that our syntax error isn’t a bug, but a user error: if the keyword routine can skip over blanks to correctly identify constructs like “GO TO”, this implies that no adjacent strings of letters may add up to something that contains a valid BASIC command, even across word boundaries. (That is, outside of quoted string constants, of course.)
In this case, we got the legitimate keyword “LET”, which isn’t a valid right-hand part of a comparison. So, yes, this a syntax error, indeed.

There isn’t any argueing over this: if skipping blanks is a feature, the error is in front of the screen.

Notably, this is not an issue on a PET with the upgraded Commodore BASIC 2.0 (AKA “New ROM”) or, for the matter, with any other BASIC for Commodore 8-bits, thereafter.

Commodore BASIC 2.0, “New ROM”:

So, what is special about Commodore BASIC 1.0 (or any other early flavor of MS BASIC for the 6502) and what does it do differently?

PET BASIC, “Old ROM”

What‘s needed for keyword parsing comes in to parts: a list of keywords and a routine to do the actual work, suggestively named “CRUNCH”, which splits kewords from other text in the input stream and converts them to tokens.

Let‘s have a look at the data first.
The list of keywords starts at 0xC092 (underlined characters indicate a set sign-bit):

The tokenizer routine (AKA CRUNCH) traverses over this list, using a set sign-bit for a word termination after that given character. In other words: a set sign-bit indicates the last character of any keyword. If a match is found, the index into this list added by 0x80 (again, the sign-bit set) provides the token ID.
If no match is found, as indicated by reaching the terminating zero-byte, the character, which was just inspected, must be a plain ASCII character (maybe part of a name) and is consequently copied to the program text as-is.

Thus, the above list translates to the following keyword-token table:

We may observe a few pecularities:

The Code — a Tour

So, as we know what’s actually processed, we may have a look at the CRUNCH-y routine.

This starts $C48D and processes zero-terminated input data from the BASIC buffer, which is for BASIC 1.0 located at $000A$0059 (10–89, 80 bytes long). The initial index from $0000 into this will be available in $C9, pointing to the first non-blank character after the line number (at least 0x0B or decimal 11).

The BASIC input buffer also doubles as the output buffer, meaning, CRUNCH converts the contents of the BASIC buffer in situ. (Due to tokenization, the output will be almost always shorter than the input and is guaranteed to not exceed it. Moreover, while reading starts at the first character after the line number, which is at least at 0x0B, the ouput will follow behind, starting at 0x0A.)
The X register is generally used as an index or cursor for reading from the BASIC buffer, while the Y register is used either as an index into the keyword list or as an output cursor.

Note: With JavaScript enabled, you can hover over a labeled target to highlight the related label.)

Well, this may be quite a bit to swollow at once. Let’s have look at this in decent bytes.
— (I am not sorry!) —

The first part is just a bit of initialization, which also introduces a few important concepts:

C48D  A6 C9              LDX $C9         ;first char position from $0000
C48F  A0 04              LDY #$04        ;initial write index from $0005
C491  84 60              STY $60

First, we set up our read cursor, reading it from location $C9. This will be initially set (and reset) to 0x09, but as we enter the routine, this will have been advanced to point to the very first non-blank character after the line number. (So we have already skipped over any leading white space.) Notably, this will be used as an index from $0000.
The output/write cursor is initialized to 0x04. This an offset from $0005, so this will point to $0009, just before the start of the BASIC buffer.

Finally, we also store 0x04 in $60, which is used as mode flag. (0x04 is coincidentially the ASCII code for EOT.) Over the run of the routine, this mode flag may also receive the values 0 or 0x49. But we’ll only care about bit #6 of this value, which will only set in 0x49, indicating that we’re reading a “DATA” section.

C493  B5 00      iC493   LDA $00,X      ;fetch char from buffer
C495  10 07              BPL iC49E
C497  C9 FF              CMP #$FF       ;`π`?
C499  F0 41              BEQ iC4DC
C49B  E8                 INX
C49C  D0 F5              BNE iC493

This is the head of the main read-process loop.
LDA $00,X” loads the next byte/character to inspect. First, we check the sign-bit. If it’s unset, this means we’re dealing with a regular ASCII character and skip ahead to the next block. Otherwise, we test for 0xFF, the code for pi (π). If it is pi, we branch to write it to the output position as-is.

If we’re still going, the sign-bit is set and it’s not pi. Meaning, it must be an illegal (shifted or graphics) character, which is not valid input text. Hence, we advance the read cursor and branch unconditionally to the beginning of the loop for the next character, ignoring the invalid byte.

C49E  C9 20      iC49E   CMP #$20       ;blank?
C4A0  F0 3A              BEQ iC4DC
C4A2  85 5B              STA $5B
C4A4  C9 22              CMP #$22       ;`"`?
C4A6  F0 58              BEQ iC500
C4A8  24 60              BIT $60
C4AA  70 30              BVS iC4DC
C4AC  C9 3F              CMP #$3F       ;`?`?
C4AE  D0 04              BNE iC4B4
C4B0  A9 99              LDA #$99
C4B2  D0 28              BNE iC4DC
C4B4  C9 30      iC4B4   CMP #$30       ;`0`?
C4B6  90 04              BCC iC4BC
C4B8  C9 3C              CMP #$3C       ;`<`?
C4BA  90 20              BCC iC4DC

Here, we’re back for valid input characters. Time to handle some further special cases:

First, we check for a blank (0x20), and if it is, we branch to $C4DC to copy it as-is.

Then, we store the current character in $5B to be used as a string delimiter later on. (This will only matter, if the current character is a quotation mark.)
Let’s check, if it actually is a quotation mark (0x22): if so, we branch to $C500 to copy the entire string.

Now it‘s $60’s heyday: the BIT instruction transfers (besides other things) bit #6 from our mode flag into the V flag of the processor status register. If it’s set, this indicates that we’re currently reading a DATA section and we branch (again) to that part at $C4DC to copy this as-is.

The next check is for 0x39 (“?”): if it’s not a question mark, we skip ahead. But, if it is, this is the standard shorthand for PRINT (which is not in our keyword list) and we load the corresponding token number (0x99) and branch unconditionally (we just loaded the non-zero value 0x99, therefor the zero flag is unset) to $C4DC to write this byte.

Finally, we check for digits (0…9), “:” (0x3A) and “;” (0x3B), which can be copied at $C4DC, as well.

Now, if we’re still processing, this means that it will be probably a letter or an operator. Time to check for keywords…

C4BC  84 C0      iC4BC   STY $C0
C4BE  A0 00              LDY #$00
C4C0  84 5C              STY $5C        ;keyword index
C4C2  88                 DEY
C4C3  86 C9              STX $C9
C4C5  CA                 DEX

Let’s prepare for the keyword search: First, we have to backup our write cursor, since we will now use the Y register as an index into our keyword list, and also, as we’re at it, we reset Y to zero. Next, we reset the the counter in $5C, which keeps track of the current keyword index, to zero, as well. Then, we decrement the read index into the keyword list (in Y) to -1, because our main search loop will start with an increment.

We also store a backup of our buffer read cursor (in X) at $C9 (the same pointer, we initialized it from), since we’re going to probingly read ahead and will probably have to start over again. And, just as before, we decrement it to prepare for the coming pre-increment loop.

C4C6  C8         iC4C6   INY
C4C7  E8         iC4C7   INX
C4C8  B5 00      iC4C8   LDA $00,X
C4CA  C9 20              CMP #$20       ;blank?
C4CC  F0 F9              BEQ iC4C7
C4CE  38                 SEC
C4CF  F9 92 C0           SBC $C092,Y
C4D2  F0 F2              BEQ iC4C6
C4D4  C9 80              CMP #$80       ;sign-bit?
C4D6  D0 2F              BNE iC507
C4D8  05 5C              ORA $5C        ;keyword index

Welcome to the main attraction, this is why we came here: the inner loop of the keyword search!

We start by advancing both the index into the keyword list and the read cursor and read the next byte. If it’s a blank (0x20), we branch back to get read the next character from the buffer.
That’s it: this is how the tokenizer routine skips specially over any white space in keywords!

If it’s a non-blank character, we’re engaging into some destructive tests:
First, we subtract the corresponding byte from the keyword list. If the result is zero, we have an exact match, meaning, we have still a valid candidate, but not yet reached the final character, and redo for the next pair.

Next, we check for the end of a keyword, as indicated by the difference amounting to exactly 0x80. If the difference amounts to anything else, we have a mismatch and branch to $C507, in order to prepare for another go with the next keyword candidate.
If, on the other hand, the difference is 0x80, which is a set sign-bit, we successfully advanced to the very last character of the given keyword and matched it. By OR-ing this remaining 0x80 with the current keyword index (as stored in $5C) we get our token.

This should be worth a brief contemplation: in a byte-size subtraction, it doesn’t matter, which of the two operands has the sign-bit set and which one not: (0x80+x)-xx-(0x80+x) ≡ 0x80.
Since there is no carry involved to propagate the sign, this is essentially an XOR operation. (Which is also the raison d'être of the overflow flag: otherwise, we’d never know.)
Indeed, the same could have been achieved by an EOR instruction — and even a bit more economically so. (That it is not so, hints at an origin in a more general codebase adn option to encode this in a different manner.)

Coincidentially (or, rather, very much so by design), a shifted PETSCII character is the same as its unshifted counterpart, with a set sign-bit. This is how BASIC abbreviations work in Commodore BASIC: simply by tricking the tokenizer routine into matching a keyword early. If we were successfully matching a candidate word up to here and the difference is 0x80, the routine assumes that it reached the end of the keyword and matched the entire word: voila, here is your token!

On the other hand, this is also responsible for some awkwardness in this: e.g., if we enter “iN” or “inpU”, this matches “INPUT#” rather than the much more frequent “INPUT”, because the former is tested first (compare the order of the keyword list). And necessarily so, otherwise the tokenizer routine couldn’t match “INPUT#”, at all. And the same applies to “PRINT#”.

Skipping out-of-order ahead to $C507, we get to where the loop is prepared for a flying start-over to check against the next keyword candidate in the list (which is where to branched to in case of a missmatch):

C507  A6 C9      iC507   LDX $C9
C509  E6 5C              INC $5C        ;keyword index
C50B  C8         iC50B   INY
C50C  B9 91 C0           LDA $C091,Y
C50F  10 FA              BPL iC50B      ;last keyword char?
C511  B9 92 C0           LDA $C092,Y
C514  D0 B2              BNE iC4C8      ;end of list?
C516  B5 00              LDA $00,X
C518  10 C0              BPL iC4DA      ;write it (unconditional)

First, we restore our buffer read cursor to where we started trying to match a keyword, then, we increment our counter into the keyword list (as a base for a potential token ID).

Then we advance our index into the keyword list to the next end of word: starting with the usual pre-increment, we read the current byte again. Mind that we’re not reading from $C092 as before, but from $C091. And we repeat this until we find a byte with a set sign-bit.

Having thus reached the end of the word, we peek ahead, this time reading from $C092, to see, if this the zero-byte marking the end of the keyword list, or if there’s more to do. If there is, we branch to the entrance of our keyword search loop with X and Y already set up.

In case we exhausted the list and read a zero-byte, we just write the current input character as-is. For this, we read the current input character from the BASIC buffer fresh again and branch unconditionally (there is no way of an overflow of the index register for a buffer that ends at $0059) to $C4DA, where we will write this to the output.

So far, we’ve seen the greater workings of the loop. If we previously fell through with a valid BASIC token in the accumulator, we first restore the output cursor from its backup:

C4DA  A4 C0      iC4DA   LDY $C0

Then it’s time to actually store a processed byte, which is either a token or a raw character (this is the very address, we branched to earlier for various special cases):

C4DC  E8         iC4DC   INX
C4DD  C8                 INY
C4DE  99 05 00           STA $0005,Y
C4E1  B9 05 00           LDA $0005,Y
C4E4  F0 34              BEQ iC51A      ;EOL?

After a now customary preincrement to advance both the read and write cursors to the respective next buffer position, we simply store the current byte — just to load it again, do do some post-processing tests. First, we check if we just stored a zero-byte (EOL), indicative of having reached the of the line. If so, we branch to $C51A to finalize.

(We may observe that both instructions for write and read-back are of a 3-byte, full address format, where a zero-byte address mode would have done perfectly. There’s no technical reason for this, since there’s no chance that we could exceed the indexing range for a buffer, which ends at $59. There’s obvious opportunity for further optimzation. But, more importantlty, it’s another hint at this having been derived from a more general codebase.)

If still in business and there’s still more input to process, we engage in another round of destructive testing:

C4E6  38                 SEC
C4E7  E9 3A              SBC #$3A       ;`:`
C4E9  F0 04              BEQ iC4EF
C4EB  C9 49              CMP #$49       ;$83 = DATA?
C4ED  D0 02              BNE iC4F1
C4EF  85 60      iC4EF   STA $60
C4F1  38         iC4F1   SEC
C4F2  E9 55              SBC #$55       ;$8F = REM?
C4F4  D0 9D              BNE iC493      ;to entrance of main loop

First, we subtract 0x3A, the code for a colon (“:”). If it is now zero, it’s a match and we skip ahead to store this as the current mode flag in 0x60. Otherwise, we compare this to 0x49.
0x3A + 0x49 = 0x83, which is — consulting our keyword-token table — the token for “DATA”. If it is now 0x49, it must have been a DATA token before the subtraction, and we skip ahead to store this remainder as the new mode flag.

In any other case, we skip ahead and subtract 0x55:
0x3A + 0x55 = 0x8F, which happens to be the token for “REM”, and, if the subtraction yields a zero in the accumulator, we have a match.
If it’s not a REM statement, we branch back to the entrance of our read-process loop at $C493 to fetch and process the next input character.

C4F6  85 5B              STA $5B       ;delimiter
C4F8  B5 00      iC4F8   LDA $00,X
C4FA  F0 E0              BEQ iC4DC     ;EOL?
C4FC  C5 5B              CMP $5B       ;delimiter met?
C4FE  F0 DC              BEQ iC4DC
C500  C8         iC500   INY
C501  99 05 00           STA $0005,Y
C504  E8                 INX
C505  D0 F1              BNE iC4F8      ;write it (unconditional)

If it was “REM”, we clear the delimiter in 0x5B, in order to read to the end of line (EOL) and to consume the remaining line entirely.

What follows is the loop that copies a continous string of characters, as for a REM clause or a quoted string. We load the current character, and, if it’s not the end of the line, we compare it to the delimiter. If it is the end of line or the delimiter, we jump back to $C4F8 to write the character as usual (where also the EOL zero-byte will be finally caught).

Otherwise, we advance the output cursor and write the byte. And, after a pre-increment of the input cursor, we unconditionally branch to the entrance of this copy loop. (Again, there is no way to overflow on X for a buffer of 80 bytes length.)

Mind that this close copy loop applies to quoted strings and REM statements only. Notably, REM statements may contain shifted characters, while, due to a bug in the LIST routine, these will be expanded to their full keyword equivalent in any listing, as if these were tokens. (As we may observe, as parsed, a comment is essentially an unquoted string extending to the end of line. But this is not how LIST handles it.)

C51A  99 07 00   iC51A   STA $0007,Y
C51D  A9 09              LDA #$09
C51F  85 C9              STA $C9
C521  60                 RTS

The last few lines are finalizing the transformation and we only get here, if we read a terminating zero-byte. First we write this zero-byte to the buffer (we may have done so previously, but better to be sure) and after resetting the index for the input cursor to 0x09, we finally exit the routine.

— Phew! —

Closing Observations

By this, the BASIC line

will be transformed in the buffer from

0000: 4C 30 D1 00 00 00 7F 7F  L0......
0008: 0A 00 31 30 20 50 52 49  ..10 PRI
0010: 4E 54 20 41 2B 35 00 44  NT A+5.D

into the equivalent tokenized code:

0000: 4C 30 D1 00 00 00 7F 7F  L0......
0008: 0A 00 99 20 41 AA 35 00  ... A.5.
0010: 4E 00 20 41 2B 35 00 44  N. A+5.D

(Mind the stray zero-byte at $0011, an artefact of the final better-sure-than-sorry write operation, which happened after the byte was already written and the pointers (cursors in X and Y) were pre-incremented again. We may also observe that this does no harm and that there’s also a termination in the correct memory location.)

If in direct mode, this code can be executed immediately from here, otherwise, it will be copied to user memory as (part of) a stored program.
We may also observe the 16-bit binary equivalent (0A 00) of the already processed line number immediately before this, at locations $0008 and $0009.

A few observations:

The CRUNCH Bug

This bug is exclusive to Commodre BASIC, as it is related to PETSCII encoding. As we have seen, keyword abbreviations are really a side effect of how shifted characters are encoding in PETSCII, matching a character with a set sign-bit as stored in the keword list. The tokenizer routine is entirely agnostic of this.

This also means that we can enter a keyword terminating character. While this will be rejected in a leading position, it will be processed like any other byte, when we read ahead trying to match any keyword candidates. As far as the keyword search code is concerned, any such premature matching of the terminating difference of 0x80 is just a side effect of the subtraction working either way, regardless of which of the two compared bytes has the sign-bit set.

Now, what happens, if we enter a shifted a shifted character at the final position of a keyword? We’ll have a perfect character match with zero difference! Meaning, the routine will still go on, trying to match what must remaining characters of the keyword candidate, thus spilling over into next keyword candidate, but without incrementing the count of keywords which have been inspected already. Should this eventually match (be it on this next keyword, or on any consecuitive try), the keyword count will be lagging behind by one, resulting in an incorrect token.

We can demonstrate this by a little trick of magic:

My attractive assistant will now enter some nonsensical BASIC code containing two consecutive keywords, and by the magic touch of the RETURN key, one of these keywords will vanish and the line will be transformed into perfectly valid BASIC!

To improve the performance and to amuse our esteemed audience, we will switch for this to the lower-case/upper-case character set:

*** commodore basic ***

 7167 bytes free

ready.
100 rem a subroutine:
110 print "hocus-pocus!"
120 return

rem *and now with a touch of magic*

ready.
200 gosuBreturn 100

run 200
hocus-pocus!

ready.
list 200

 200 gosub 100
ready.
Our little demonstration of magic.

I hope, you’re suitably impressed!

Peeking into memory, we can prove that the magical transformation is complete and not a fluke:

addr  code                semantics

0430  3A 04               link: $043A
0432  C8 00               line# 200
0434  8D                  token GOSUB
0435  20 31 30 30         ascii « 100»
0439  00                  -EOL-
043A  00 00               -EOT-

Now, how does this work?

Well, “gosuB” is number 13 in the keyword list (token 0x8D) and “returN” comes immedtiatly after this, at index 14 (token 0x8E).
As the keyword matching code runs away over the perfect match of the shifted “B” in both the input and the keyword candidate (indicating that we’re still in the middle of a valid keyword), it eventually matches “return”, but without having incremented the keyword count. Thus, the resulting token is 0x8D, the token for GOSUB, while the entire string has been consumed and the tokenized line amounts to just “200 GOSUB 100”.
The input string “return”, though, is gone, gone with the magic wind of a run-away error!

And, by, this, we‘ve seen what‘s to be seen, in tokenizing with Commodore BASIC 1.0.

Further Developments

As already indicated, this was not to last. The upgraded “New ROM” with BASIC 2.0 (the one introducing proper IEEE488 disk support), brought various changes and modifications, also doing away with skipping over white space in keywords. And “LET HEN” was no more.

BASIC 2.0 “New ROM”

All that had to be done to mend the “LET HEN” bug was the ommision of two instructions.

The BASIC input buffer is now located at $0200 to $0250 (moved to outside of the zero-page — we can see now why those write instructions hadn’t been zero-page instructions, before) and the there’s a related extra instruction near the end for resetting the read cursor (for use as a double-byte pointer at $77 / $78).

Otherwise the code is still the same, as can be seen in this comparison:

Accordingly, “LE THEN” is tokenized as expected:

And, as in memory:

addr  code                semantics

0401  17 04               link: $0417
0403  0A 00               line# 10
0405  8B                  token IF
0406  20 4C 53 20         ascii « LS »
040A  B2                  token =
040B  20 4C 45 20         ascii « LE »
040F  A7                  token THEN
0410  20                  ascii « »
0411  89                  token GOTO
0412  20 31 30 30         ascii « 100»
0416  00                  -EOL-
0417  00 00               -EOT-

But now, there was a problem with two-word “GO TO”, the traditional and most frequent use of white space skipping in keywords. For this, the keyword list was amended for another token, “GO”, token ID 0xCB:

The new token “GO” bears all the hallmarks of an afterthought: It works only in place of a plain “GOTO” and relies on the already existing token “TO” (as in “FOR … TO”). As there is no token for “SUB”, we can‘t use it for “GOSUB” and a two-word “GO TO” isn’t checked for in a “ON … GOTO …” construct, either.

But, it’s for this new token that we can still write and successfully execute

and even:

But, we can’t use “GO TO’ in place of “THEN”, like “GOTO”:

While more of a superficial fix, “GO” makes for a rather baffling new effect of the CRUNCH run-away bug:
(here for better readability in the lower-case/upper-case character set)

### commodore basic ###

 15359 bytes free

ready.
10 gosuB 100

list

 10 mid$su 100
ready.
A curious run-away bug in conjunction with “GO”.

What happens here? Well, when the routine runs away over the shifted “B” in “gosubB” and subsequently fails on matching the next keyword candidate, “RETURN”, as the remaining part, it continues the keyword search as usual, but with the keyword counter now lagging behind by one. As it finally finds “GO” at the very end of the list, the resulting token is thus 0xCA instead of the correct 0xCB. — And 0xCA is really the token for “MID$”, found just before “GO” in the keword list!

Subsequently, the remaining “suB” is parsed as plain ASCII text, with the shifted letter “B” rejected.
Thus, we end up with the following tokenized line of BASIC:

addr  code                semantics

0401  0D 04               link: $040D
0403  0A 00               line# 10
0405  CA                  token MID$
0406  53 55 20 31 30 30   ascii «SU 100»
040C  00                  -EOL-
040D  00 00               -EOT-

— Yay, this worked out greatly! —

Having seen the major attraction, we may reassueredly turn our attention to the next major iteration, which is

Commodore BASIC 4.0

This one came with new commands for disk control, some new PETSCII control characters (like for switching the character set or setting tabulators), and featured an enhanced editor with support for things like sub-windowing for proper business use on the 40xx/80xx series. Since this required another 4K of ROM, at 0xB0000XBFFF, this wasn’t available for the original PET 2001 (the one with the built-in Datasette), which lacked the required ROM slot. But it was available as an upgrade for the PET 2001-N and other non-CRTC PETs (but here, the code for the new editor features was disabled, while the BASIC ROMs where the same).

Here’s the new version in comparison to Commodore BASIC 2.0 (“New ROM”), it shared the system addresses (zero-page, etc.) with:

The changes are really just about how the keyword list is accessed. While it was previously done as in

SBC $C092,Y         ;$C4D5
LDA $C091,Y         ;$C513

the keyword list is now accessed in post-incremented indirect address mode via a pointer in 0x1F / 0x20:

SBC ($1F),Y         ;$B548
LDA ($1F),Y         ;$B590

This is, because the original keyword list is already 255 bytes long and the keywords added in BASIC 4.0 are beyond the reach of a simple indexed load instruction, with the index register just wrapping around. Thus, we need a more sophisticated construction using a pointer.

All other changes are related to this, setting and updating the pointer in 0x1F and 0x20 accordingly.
(Moreover, the read cursor in X is now generally adjusted first, before the new pointer is handled.)

But there’s also a curious peculiarity in this new code, found at $B584:

What are PHP and PLP achieving here? Well, this an interesting way to guard against side effects of patching. This is replacing a simpler construct, to which it is adding after the LDA instruction — and the result of this load instruction will be checked by a BPL instruction in the unpatched section of the code immediately following to this. So, by pushing the status register (P) to the stack and pulling it at the end of the patch, again, the negative flag will be preserved and there’s no need to interfere with the existing control logic.

And, in case you were to ask, none of these changes affect the run-away bug, which is still what it ever was:

Beyond the PET

BASIC V.2 for the VIC-20 and C64 is essentially derived from BASIC 2.0 for the PET. BASIC V.2 does away with the second cassette buffer, adds support for color (i.e., code maintaining the color RAM), implements IEEE-over serial (somewhat unfortunately so, as the clock synchronization didn’t work out as expected), adds support for RS-232C via the user port, and adds a few control characters (much like it’s found in BASIC 4.0 for the PET). But, otherwise it’s still the same.

Commodore 64 BASIC V.2, Kernal Rev. 2

Hence, it’s small wonder that the CRUNCH routine for rev. 2 of the C64 ROM is still the same.
(We may safely assume it’s also the same for the VIC and the rarer rev. 1 Kernal. I’ve to admit, I didn’t even check.)

System addresses are now slightly different, otherwise, it’s still the same, as is the keyword/token list:

As may be expected from this, the code is also functionally the same.

Commodore 64 BASIC V.2, Kernal Rev. 3

In Kernal Rev. 3, the last incarnation of this code, which is also found in the C128, the tokenizing is still the same, while its address has moved by 4 bytes in ROM:

Therefor, all the observation, we made for PET BASIC 1.0 (less the two instructions for white space skipping in keywords) and BASIC 2.0 are still valid, half a decade later.

BTW, you can easily discern the Kernal revision of a C64 by inspecting the contents of ROM location $FF80 (as in “PEEK (65408)”):

And that’s it, for today. — With some of the questions actually answered, we drop the curtain (in lesser dismay.)