Contents:

GW-BASIC tokenised program format.
Program format.
GW-BASIC tokens.
Numeric constants and other tokens related to them.
GW-BASIC keywords.
Un-tokenizing from hexdumps.
"Protected" GW-BASIC programs.
The unprotection routine (with comments).
GW-BASIC resources:
Other BASIC resources.

GW-BASIC tokenised program format.

Program format:

First byte = 0FFh = non-protected program.
A different byte specifies a protected program and the rest of the data is pseudo-encrypted by XORing the data with successive bytes from two chunks of BASIC constants (the two chunk lengths being relatively prime to each other). I currently don't have the decoding information available so protected programs are not covered here (but it's coming).

Subsequent bytes:

Two bytes giving the offset of the start of the next line (or the end of the BASIC program) stored in Intel binary integer format with the least-significant byte first. The actual value is not terribly important as the offsets vary depending on other data in memory and the offsets are re-calculated at loading time but the relative difference between the offset of one line and the offset of the next may provide some sanity checks on the decoding of the second of the two lines. The important value to remember is that an offset of 0000 indicates the end of the program. In an actual GW-BASIC program being interpreted, the offset is the offset of the next such pointer or the offset of the second of three zero bytes at the end of a tokenised program.
Two bytes giving the current BASIC line number in Intel integer format with the least-significant byte first.
A variable number of bytes with the tokenised BASIC program text for one line. When not following a numeric constant token, space (20h) to '~' (7Eh) represent themselves as untokenized ASCII characters and 00, 0Bh to 1Fh and 81h to 0FFh are tokens or parts of tokens as listed below.
A zero (00h) byte indicating the end of the line. Followed by the next "(a)".
A final trailing 1A (a control-Z) is tacked onto the end of a saved program.

GW-BASIC tokens:

Note: all numeric values given in hexadecimal unless otherwise noted. "xx" represents one unspecified hexadecimal byte which can contain arbitrary information.

Numeric constants and other tokens related to them:

00: The end of the program line.
0Bxxxx: An octal constant (defined with "&3627" or "&O3627" for example)
0Cxxxx: A hexadecimal constant (defined with "&H2D4F" for example).
0Dxxxx: Line pointer -- a previous line number after being used by GOTO or GOSUB so now it doesn't have to be searched for). It points to the byte just before the line to GOTO (the zero byte that ends the previous line). Only in use in a running program. Saved programs are always stored using 0Exxxx.
0Exxxx: A line number before being used by GOTO or GOSUB or in a saved tokenised program.
0Fxx: A one-byte integer constant, 11 to 255.
10: Never used as a token in a line. Flags constant no longer being processed. See 1E.
11 to 1B: Constants 0 to 10.
1Cxxxx: A two-byte integer constant.
1Dxxxxxxxx: A four-byte single-precision floating-point constant.
1E: This is not used as a token in a program line. It's used to flag that a numeric constant is being processed. Normally BASIC's program pointer points to the line where processing and interpreting is taking place. The item pointed to is the item currently being interpreted. A subroutine can be called anywhere in the interpreter to ask it to "fetch the current item". Elsewhere, it can be told, "I'm done with the current item so increment the program pointer and fetch the next item." This could cause a problem if the program pointer was pointing to a numeric constant because the next byte would not be a BASIC token but an arbitrary numeric value depending on the constant value. For that reason, whenever BASIC encounters a numeric constant, 0B to 0F or 1C or 1D or 1F, it processes the numeric constant, puts it in a special accumulator stack, advances the BASIC pointer past the constant, pushes that on a stack and points to the first byte of a 1E10 byte string. When 1E is returned by a "fetch the current item" call, the numeric constant type token is returned. When a "I'm done with the current item so increment the program pointer and fetch the next item." call is made, the program counter now points to 10 which prompts the interpreter to discard the constant from the accumulator stack and pop the original program pointer (now pointing past the constant) back off the stack and continue past the constant.
1Fxxxxxxxxxxxxxxxx: An eight-byte double-precision floating-point constant.

Two-byte constants,

xxxx specified by 0B, 0C, 0D, 0E or 1C. The bytes are in Intel order with the least significant byte first. The range of values for signed integers (0B, 0C or 1C) is -32768 (hexadecimal 8000) to +32767 (hexadecimal 7FFF). Unsigned integers, specified by 0D or 0E are unsigned numbers in the range 0 to 65535.

Four- and eight-byte floating-point constants,

[xxxxxxxx]xxxxxxxx specified by 1D or 1F. The bytes are stored in reverse order with the least-significant byte first.

[ hh gg ff ee ] dd cc bb aa rearranged in significant order and converted to binary:
aaaaaaaabbbbbbbbccccccccddddddddeeeeeeeeffffffffgggggggghhhhhhhh

If aaaaaaaa equals zero then the numeric value of the constant is assumed to be zero and the other three or seven bytes are ignored.

If aaaaaaaa is not zero, it is the base-2 exponent of the floating-point number with 128 (80 hexadecimal) added to it. 01 hexadecimal equals 2 to the -127 power, ... , 7F hexadecimal equals 2 to the -1 power, 80 hexadecimal equals 2 to the 0 power (1), 81 hexadecimal equals 2 to the 1 power (2), ... , FF hexadecimal equals 2 to the 127 power.

For a non-zero value, the first bit of bbbbbbbb is the sign bit, 1 represents a negative number and 0 represents a positive number. The rest of the bits store the absolute value (i.e. no inversion of bit values is done for negative numbers) with the most significant bit (assumed to follow the binary point) omitted as it is assumed to always be present:
b (sign bit) 0.1 (assumed and not stored) bbbbbbb (the 2nd to 8th significant data bits). cc and dd and (for double-precision) ee to hh are the rest of the data bits. The range for non-zero values is roughly ±2.9387360000E-39 to ±1.7014120000E+38 decimal.

Single-precision examples (in binary):

00000000 00000000 00000000 00000000 equals 0.0 decimal or 0.000000000000000000000000 binary.

00000000 11111111 11111111 11111111 still equals 0.0 decimal or 0.000000000000000000000000 binary (the numeric data bits are ignored when the exponent byte equals zero).

10000000 00000000 00000000 00000000 equals 0.5 decimal or 0.100000000000000000000000 binary.

10000000 10000000 00000000 00000000 equals -0.5 decimal or -0.100000000000000000000000 binary.

10000001 00000000 00000000 00000000 equals 1.0 decimal or 1.00000000000000000000000 binary.

10000010 00000000 00000000 00000000 equals 2.0 decimal or 10.0000000000000000000000 binary.

10000011 00000000 00000000 00000000 equals 4.0 decimal or 100.000000000000000000000 binary.

10000011 01000000 00000000 00000000 equals 6.0 decimal or 110.000000000000000000000 binary.

10000011 01100000 00000000 00000000 equals 7.0 decimal or 111.000000000000000000000 binary.

10000011 01110000 00000000 00000000 equals 7.5 decimal or 111.100000000000000000000 binary.

10000011 01111000 00000000 00000000 equals 7.75 decimal or 111.110000000000000000000 binary.

10000011 11111000 00000000 00000000 equals -7.75 decimal or -111.110000000000000000000 binary.

10000011 01111100 00000000 00000000 equals 7.875 decimal or 111.111000000000000000000 binary.

10000000 00000000 00000000 00000000 equals 0.5 decimal or 0.100000000000000000000000 binary (same as above, repeated here).

01111111 00000000 00000000 00000000 equals 0.25 decimal or 0.010000000000000000000000 binary.

01111110 00000000 00000000 00000000 equals 0.125 decimal or 0.001000000000000000000000 binary.

01111101 00000000 00000000 00000000 equals 0.0625 decimal or 0.000100000000000000000000 binary.

01111101 10000000 00000000 00000000 equals -0.0625 decimal or -0.000100000000000000000000 binary.

Eight-byte double-precision numbers are the same except that they have four more significant bytes to the right of the other bytes.

GW-BASIC keywords:

One-byte tokens:

81: END
82: FOR
83: NEXT
84: DATA
85: INPUT
86: DIM
87: READ
88: LET
89: GOTO
8A: RUN
8B: IF
8C: RESTORE
8D: GOSUB
8E: RETURN
8F: REM
90: STOP
91: PRINT
92: CLEAR
93: LIST
94: NEW
95: ON
96: WAIT
97: DEF
98: POKE
99: CONT
9A: (Undefined)
9B: (Undefined)
9C: OUT
9D: LPRINT
9E: LLIST
9F: (Undefined)
A0: WIDTH
A1: ELSE (stored as 3AA1, ":ELSE" but the ":" is suppressed when the program is listed.)
A2: TRON
A3: TROFF
A4: SWAP
A5: ERASE
A6: EDIT
A7: ERROR
A8: RESUME
A9: DELETE
AA: AUTO
AB: RENUM
AC: DEFSTR
AD: DEFINT
AE: DEFSNG
AF: DEFDBL
B0: LINE
B1: WHILE (stored as B1E9, "WHILE+" but the "+" is suppressed when the program is listed.)
B2: WEND
B3: CALL
B4: (Undefined)
B5: (Undefined)
B6: (Undefined)
B7: WRITE
B8: OPTION
B9: RANDOMIZE
BA: OPEN
BB: CLOSE
BC: LOAD
BD: MERGE
BE: SAVE
BF: COLOR
C0: CLS
C1: MOTOR
C2: BSAVE
C3: BLOAD
C4: SOUND
C5: BEEP
C6: PSET
C7: PRESET
C8: SCREEN
C9: KEY
CA: LOCATE
CB: (Undefined)
CC: TO
CD: THEN
CE: TAB( (note that the following "(" is part of the keyword with no intervening space. That's why "TAB (..." will not work.)
CF: STEP
D0: USR
D1: FN
D2: SPC( (note that the following "(" is part of the keyword with no intervening space. That's why "SPC (..." will not work.)
D3: NOT
D4: ERL
D5: ERR
D6: STRING$
D7: USING
D8: INSTR
D9: ' (stored as 3A8FD9, ":REM'" but the ":REM" is suppressed when the program is listed.)
DA: VARPTR
DB: CSRLIN
DC: POINT
DD: OFF
DE: INKEY$
DF: (Undefined)
E0: (Undefined)
E1: (Undefined)
E2: (Undefined)
E3: (Undefined)
E4: (Undefined)
E5: (Undefined)
E6: >
E7: =
E8: <
E9: +
EA: -
EB: *
EC: /
ED: ^
EE: AND
EF: OR
F0: XOR
F1: EQV
F2: IMP
F3: MOD
F4: \
F5: (Undefined)
F6: (Undefined)
F7: (Undefined)
F8: (Undefined)
F9: (Undefined)
FA: (Undefined)
FB: (Undefined)
FC: (Undefined)
FD81: CVI
FD82: CVS
FD83: CVD
FD84: MKI$
FD85: MKS$
FD86: MKD$
FD8B: EXTERR
FE81: FILES
FE82: FIELD
FE83: SYSTEM
FE84: NAME
FE85: LSET
FE86: RSET
FE87: KILL
FE88: PUT
FE89: GET
FE8A: RESET
FE8B: COMMON
FE8C: CHAIN
FE8D: DATE$
FE8E: TIME$
FE8F: PAINT
FE90: COM
FE91: CIRCLE
FE92: DRAW
FE93: PLAY
FE94: TIMER
FE95: ERDEV
FE96: IOCTL
FE97: CHDIR
FE98: MKDIR
FE99: RMDIR
FE9A: SHELL
FE9B: ENVIRON
FE9C: VIEW
FE9D: WINDOW
FE9E: PMAP
FE9F: PALETTE
FEA0: LCOPY
FEA1: CALLS
FEA4: NOISE (PCjr only); DEBUG (Sperry PC only)
FEA5: PCOPY (PCjr or EGA system only)
FEA6: TERM (PCjr only)
FEA7: LOCK
FEA8: UNLOCK
FF81: LEFT$
FF82: RIGHT$
FF83: MID$
FF84: SGN
FF85: INT
FF86: ABS
FF87: SQR
FF88: RND
FF89: SIN
FF8A: LOG
FF8B: EXP
FF8C: COS
FF8D: TAN
FF8E: ATN
FF8F: FRE
FF90: INP
FF91: POS
FF92: LEN
FF93: STR$
FF94: VAL
FF95: ASC
FF96: CHR$
FF97: PEEK
FF98: SPACE$
FF99: OCT$
FF9A: HEX$
FF9B: LPOS
FF9C: CINT
FF9D: CSNG
FF9E: CDBL
FF9F: FIX
FFA0: PEN
FFA1: STICK
FFA2: STRIG
FFA3: EOF
FFA4: LOC
FFA5: LOF

Un-tokenizing from hexdumps:

An example of manually detokenizing a memory hexdump of a GW-BASIC program. The program being detokenized was used to dump the memory segment it resided in. The detokenizing of the hexdump of a tokenized GW-BASIC program file would be similar except for three differences:

The first byte of the file is FF and the equivalent byte ahead of the first program line in memory is 00.
You initially don't know the offset that the file was saved from so you have to untokenize the first line to get the offset of the second line (and then use arithmetic to compute the offset of the first line and the offset of the whole file) and add that to the file offsets if you want the convenience of using the pointers to the next line to determine which zero bytes in the file are end-of-line flags.
You should never encounter a line pointer, 0Dxxxx in a program file. They are only in a running program to speed up execution.

Practically, the only time you really need to manually de-tokenize a GW-BASIC program is when you have a GW-BASIC program file in tokenized form but no copy of GW-BASIC that supports that file (possible if the GW-BASIC version you have came out before some additional keywords used by the tokenized program were added to newer versions of GW-BASIC. If you do have a version of GW-BASIC that understands the file, you can load the file with the command,
load "filename.ext"
and then save it in ASCII form with the command,
save "newname.ext",A

"Protected" GW-BASIC programs.

GW-BASIC has an optional parameter to the SAVE statement that allows a program to be saved in "protected" form. The file is pseudo-encrypted and the initial byte of the file is changed to FE instead of FF. When the program is loaded, it is decoded and a flag is set in the GW-BASIC data work area to indicate that a protected program is loaded. While this flag is set, any GW-BASIC statement is still allowed in a program line but editing the program is prohibited and several statements are disabled in direct statements to prevent access to the protected program. SAVE without the P option is prohibited; LIST and LLIST is prohibited; PEEK, POKE, BSAVE and BLOAD are disabled, and CHAIN cannot include the MERGE option.

While this appears to suggest that you are out of luck if you have an old protected GW-BASIC program that needs updating (such as an accounting package that won't accept non-US addresses for employees when you just went international) there is still a way to unprotect such programs if you have a copy of GW-BASIC. It relies on one of the bugs in the VAL function.

If a way can be found to poke a zero value into the the GW-BASIC interpreter's protection flag byte, a loaded protected program would suddenly become unprotected and could then be LISTed, SAVEd in unprotected form or edited. To do that you need two things, the address of the protection flag for your version of GW-BASIC and a way to trick the GW-BASIC interpreter into thinking that a manually-entered POKE statement was a program statement and not a direct manual entry.

Protection flag addresses I have run across:

1124, &H0464 -- IBM BASICA
1228, &H04CC -- Compaq BASIC 2.11
1433, &H0599 -- Compaq BASIC 3.11
1435, &H059B -- TI Professional BASIC 1.10
1435, &H059B -- Solution 5000 PC BASIC 1.10
1450, &H05AA -- Tandy PC BASIC (version unknown), Model 2500XL
1453, &H05AD -- Tandy PC BASIC (version unknown)
1457, &H05B1 -- Corona PC GWBASIC 1.10
1483, &H05CB -- Tandy PC BASIC (version unknown)
1548, &H0601 -- Sperry PC BASICA 2.03

To determine the protection flag address for your version of GW BASIC, you need to load an unprotected program into memory and then find out which address will protect the program when it is changed from a zero byte to a non-zero byte. Load the following short program into your BASIC interpreter, save it in unprotected form, LIST it, and then run the last line as a direct statement by typing over the line number and pressing your <Enter> or <Return> key. A bunch of successive addresses will be printed and then the program will stop with "Illegal function call" displayed. Write down the last address printed. That will be the address of the protection flag for your version of GW-BASIC. Here is the short program (I call it "poketest.bas" -- make sure line 104 is entered as a single line even if your browser wraps it):

100 ' type the following line 104 as a direct statement or just LIST
102 ' this and then run line 104 by deleting the line number "104"
103 ' and pressing the <Return> or <Enter> key:
104 FOR I=1000 TO 16000:PRINT I: J=PEEK(I): POKE I,((J=0)AND 255) OR J: POKE I,J:NEXT I

What the program does is change successive zero bytes to non-zero and then restores them back to their original values until it hits the protection flag. Once the POKE I,((J=0)AND 255) OR J statement turns on the protection flag, the POKE I,J statement becomes a forbidden statement in a direct statement.

Now go back to the top of this page and review what has been written about the numeric tokens, paying special attention to 0Dxxxx, 0Exxxx, 10 and 1E. Then review the detokenizing example at the link above. The background there will help you understand the bug we are going to exploit.

Almost everywhere in the BASIC interpretor, one of two routines are called to get the current or next significant character to be processed. The "get next item" routine would be very confused if it encountered a byte in a numeric constant because those bytes can have any value. Therefore, whenever a numeric token is encountered, the following things are done:

The numeric constant is evaluated and converted to a full integer if it is one of the one- (11 to 1B) or two-byte (0Fxx) tokens,
the two-, four- or eight-byte value is placed in a temporary eight-byte accumulator,
the type (integer, single- or double-precision floating-point, line number or line pointer) is saved in a type flag byte,
the constant is skipped and the address of the next byte after the constant is stored in a temporary program-counter storage area,
the current regular program counter is pointed to the 1E byte of a byte pair, 1E10 and
the 1E is returned to the requesting interpreter routine which tells it to use the numeric constant in the temporary accumulator if that is a valid item at this point in the statement being interpreted.

What happens with VAL? The regular program counter is pushed on a stack, it is changed to point to the string to be evaluated, a zero-byte is inserted at the end of the string to ensure that VAL won't evaluate past the end of the string (such as when two strings adjacent in BASIC's string work area have the values "123" and "456" respectively, we don't want VAL to return 123456 as the value), and then BASIC's numeric-evaluation routine is called to evaluate the string. It uses the same call to "get current byte" and "get next byte" as the rest of the interpreter does. When the evaluation encounters what it thinks is the end of the number, it returns to VAL, VAL then replaces the zero byte, pops the original program counter from the stack and returns the evaluation result to its caller.

"Where's the bug!" you ask. What do you think happens if VAL is followed by a numeric constant and the number in the string is also followed by a numeric constant? Right. The one in the program line that follows VAL is over-written by the one in the string. On return to the VAL processing routine, the program counter is not restored to point to the statement with the VAL function. It is restored to point to the constant in the string. Now the string is being interpreted instead of the statement with VAL in it.

Normally this would cause a syntax error but there is one statement in BASIC that can have a numeric constant follow a function without intervening punctuation. That is a PRINT statement.

"How does this help?" you ask. The answer is that POKE is not allowed in a direct statement but is allowed in a program line. If we can go to a program line that has a POKE in it to turn off the protection flag then that will be allowed and the protection is turned off. Once a GOTO is executed, we are no longer in a direct statement. Obviously we cannot go to any line in the protected program. Firstly we don't know what line numbers are in it so we can't go to any one of them. Secondly, we don't know what is in the lines so we couldn't go to a suitable POKE if there was one, and thirdly, the chances of there being a suitable POKE in the protected program are slimmer than my chances of becoming richer than Bill Gates. However, using a pointer token instead of a line-number token means that the "program" line we go to can be anywhere in memory and doesn't have to really be in the program we need to unprotect. A numeric array is a convenient place for it.

What we need to do is to

put a "POKE flagaddress,0" statement into a numeric array (with some leading spaces for flexibility),
create a string (B$) that consists of
1. a number in ASCII for VAL to process,
2. a tokenized numeric constant,
3. a colon to start a new statement, and
4. a GOTO statement with a line-pointer that points to the array
and then run a direct PRINT statement such as "PRINT VAL(B$) 123".

Replace the 1450 below with the protection flag address of your version of GW-BASIC and "secret.bas" with the name of the program to be unprotected.

The unprotection routine (with comments):

load "secret.bas"
dim a%[14]
a%[0]=0:a%[1]=&h2020:a%[2]=&h2020:a%[3]=&h2097
'               "  "         "  "      "DEF "
a%[4]=&h4553:a%[5]=&h2047:a%[6]=&H203A:a%[7]=&H2098
'       "SE"         "G "         ": "       "POKE "
a%[8]=&H1C20            :a%[9]=1450      :a%[10]=&h112C
'   " " + integer token  the flag address     ",0"
a%[11]=&h903A    :a%[12]=0
'       ":STOP"  end-of-line
b$="" ' must be defined before VARPTR is called.
b$="123"+chr$(28)+":::"+chr$(137)+chr$(13)+mki$(varptr(a%[0]))+":"

print val(b$) 456

Was that sneaky or what?

Visitors to this page are encouraged to copy the information above and make it available to others as long as credit is given for the source. I'm not getting any younger and would hate to have the information lost forever if I should unexpectedly drop dead next week, month, year, decade or century.

GW-BASIC resources:

Other BASIC resources:

I have patched a copy of GW-BASIC to get rid of some bugs and add the new features described and documented in the NewBASIC.txt text file. [Ooops! I forgot to change the permissions for the file after uploading it so it wasn't accessible for a couple of days. It should be accessible now.] The patched BASIC (NBASIC.EXE) is available for download in a zipped file, NBASIC.EXE.zip.

Note: My NBASIC.EXE is totally unrelated to another BASIC interpreter also called "NBASIC", available at:

My Computer Hints, Tips, and Utilities page.

My Home page.