Contents:

GW-BASIC tokenised program format.

Program format:

First byte = 0FFh = non-protected program.
A different byte specifies a protected program and the rest of the data is pseudo-encrypted by XORing the data with successive bytes from two chunks of BASIC constants (the two chunk lengths being relatively prime to each other). I currently don't have the decoding information available so protected programs are not covered here (but it's coming).

Subsequent bytes:

  1. Two bytes giving the offset of the start of the next line (or the end of the BASIC program) stored in Intel binary integer format with the least-significant byte first. The actual value is not terribly important as the offsets vary depending on other data in memory and the offsets are re-calculated at loading time but the relative difference between the offset of one line and the offset of the next may provide some sanity checks on the decoding of the second of the two lines. The important value to remember is that an offset of 0000 indicates the end of the program. In an actual GW-BASIC program being interpreted, the offset is the offset of the next such pointer or the offset of the second of three zero bytes at the end of a tokenised program.
  2. Two bytes giving the current BASIC line number in Intel integer format with the least-significant byte first.
  3. A variable number of bytes with the tokenised BASIC program text for one line. When not following a numeric constant token, space (20h) to '~' (7Eh) represent themselves as untokenized ASCII characters and 00, 0Bh to 1Fh and 81h to 0FFh are tokens or parts of tokens as listed below.
  4. A zero (00h) byte indicating the end of the line. Followed by the next "(a)".
  5. A final trailing 1A (a control-Z) is tacked onto the end of a saved program.

GW-BASIC tokens:

Note: all numeric values given in hexadecimal unless otherwise noted. "xx" represents one unspecified hexadecimal byte which can contain arbitrary information.


Numeric constants and other tokens related to them:

00
The end of the program line.
0Bxxxx
An octal constant (defined with "&3627" or "&O3627" for example)
0Cxxxx
A hexadecimal constant (defined with "&H2D4F" for example).
0Dxxxx
Line pointer -- a previous line number after being used by GOTO or GOSUB so now it doesn't have to be searched for). It points to the byte just before the line to GOTO (the zero byte that ends the previous line). Only in use in a running program. Saved programs are always stored using 0Exxxx.
0Exxxx
A line number before being used by GOTO or GOSUB or in a saved tokenised program.
0Fxx
A one-byte integer constant, 11 to 255.
10
Never used as a token in a line. Flags constant no longer being processed. See 1E.
11 to 1B
Constants 0 to 10.
1Cxxxx
A two-byte integer constant.
1Dxxxxxxxx
A four-byte single-precision floating-point constant.
1E
This is not used as a token in a program line. It's used to flag that a numeric constant is being processed. Normally BASIC's program pointer points to the line where processing and interpreting is taking place. The item pointed to is the item currently being interpreted. A subroutine can be called anywhere in the interpreter to ask it to "fetch the current item". Elsewhere, it can be told, "I'm done with the current item so increment the program pointer and fetch the next item." This could cause a problem if the program pointer was pointing to a numeric constant because the next byte would not be a BASIC token but an arbitrary numeric value depending on the constant value. For that reason, whenever BASIC encounters a numeric constant, 0B to 0F or 1C or 1D or 1F, it processes the numeric constant, puts it in a special accumulator stack, advances the BASIC pointer past the constant, pushes that on a stack and points to the first byte of a 1E10 byte string. When 1E is returned by a "fetch the current item" call, the numeric constant type token is returned. When a "I'm done with the current item so increment the program pointer and fetch the next item." call is made, the program counter now points to 10 which prompts the interpreter to discard the constant from the accumulator stack and pop the original program pointer (now pointing past the constant) back off the stack and continue past the constant.
1Fxxxxxxxxxxxxxxxx
An eight-byte double-precision floating-point constant.

Two-byte constants,

xxxx specified by 0B, 0C, 0D, 0E or 1C. The bytes are in Intel order with the least significant byte first. The range of values for signed integers (0B, 0C or 1C) is -32768 (hexadecimal 8000) to +32767 (hexadecimal 7FFF). Unsigned integers, specified by 0D or 0E are unsigned numbers in the range 0 to 65535.

Four- and eight-byte floating-point constants,

[xxxxxxxx]xxxxxxxx specified by 1D or 1F. The bytes are stored in reverse order with the least-significant byte first.

[ hh gg ff ee ] dd cc bb aa rearranged in significant order and converted to binary:
aaaaaaaabbbbbbbbccccccccddddddddeeeeeeeeffffffffgggggggghhhhhhhh

If aaaaaaaa equals zero then the numeric value of the constant is assumed to be zero and the other three or seven bytes are ignored.

If aaaaaaaa is not zero, it is the base-2 exponent of the floating-point number with 128 (80 hexadecimal) added to it. 01 hexadecimal equals 2 to the -127 power, ... , 7F hexadecimal equals 2 to the -1 power, 80 hexadecimal equals 2 to the 0 power (1), 81 hexadecimal equals 2 to the 1 power (2), ... , FF hexadecimal equals 2 to the 127 power.

For a non-zero value, the first bit of bbbbbbbb is the sign bit, 1 represents a negative number and 0 represents a positive number. The rest of the bits store the absolute value (i.e. no inversion of bit values is done for negative numbers) with the most significant bit (assumed to follow the binary point) omitted as it is assumed to always be present:
b (sign bit) 0.1 (assumed and not stored) bbbbbbb (the 2nd to 8th significant data bits). cc and dd and (for double-precision) ee to hh are the rest of the data bits. The range for non-zero values is roughly ±2.9387360000E-39 to ±1.7014120000E+38 decimal.

Single-precision examples (in binary):

00000000 00000000 00000000 00000000 equals 0.0 decimal or 0.000000000000000000000000 binary.

00000000 11111111 11111111 11111111 still equals 0.0 decimal or 0.000000000000000000000000 binary (the numeric data bits are ignored when the exponent byte equals zero).

10000000 00000000 00000000 00000000 equals 0.5 decimal or 0.100000000000000000000000 binary.

10000000 10000000 00000000 00000000 equals -0.5 decimal or -0.100000000000000000000000 binary.

10000001 00000000 00000000 00000000 equals 1.0 decimal or 1.00000000000000000000000 binary.

10000010 00000000 00000000 00000000 equals 2.0 decimal or 10.0000000000000000000000 binary.

10000011 00000000 00000000 00000000 equals 4.0 decimal or 100.000000000000000000000 binary.

10000011 01000000 00000000 00000000 equals 6.0 decimal or 110.000000000000000000000 binary.

10000011 01100000 00000000 00000000 equals 7.0 decimal or 111.000000000000000000000 binary.

10000011 01110000 00000000 00000000 equals 7.5 decimal or 111.100000000000000000000 binary.

10000011 01111000 00000000 00000000 equals 7.75 decimal or 111.110000000000000000000 binary.

10000011 11111000 00000000 00000000 equals -7.75 decimal or -111.110000000000000000000 binary.

10000011 01111100 00000000 00000000 equals 7.875 decimal or 111.111000000000000000000 binary.

10000000 00000000 00000000 00000000 equals 0.5 decimal or 0.100000000000000000000000 binary (same as above, repeated here).

01111111 00000000 00000000 00000000 equals 0.25 decimal or 0.010000000000000000000000 binary.

01111110 00000000 00000000 00000000 equals 0.125 decimal or 0.001000000000000000000000 binary.

01111101 00000000 00000000 00000000 equals 0.0625 decimal or 0.000100000000000000000000 binary.

01111101 10000000 00000000 00000000 equals -0.0625 decimal or -0.000100000000000000000000 binary.

Eight-byte double-precision numbers are the same except that they have four more significant bytes to the right of the other bytes.


GW-BASIC keywords:


One-byte tokens:

81
END
82
FOR
83
NEXT
84
DATA
85
INPUT
86
DIM
87
READ
88
LET
89
GOTO
8A
RUN
8B
IF
8C
RESTORE
8D
GOSUB
8E
RETURN
8F
REM
90
STOP
91
PRINT
92
CLEAR
93
LIST
94
NEW
95
ON
96
WAIT
97
DEF
98
POKE
99
CONT
9A
(Undefined)
9B
(Undefined)
9C
OUT
9D
LPRINT
9E
LLIST
9F
(Undefined)
A0
WIDTH
A1
ELSE   (stored as 3AA1, ":ELSE" but the ":" is suppressed when the program is listed.)
A2
TRON
A3
TROFF
A4
SWAP
A5
ERASE
A6
EDIT
A7
ERROR
A8
RESUME
A9
DELETE
AA
AUTO
AB
RENUM
AC
DEFSTR
AD
DEFINT
AE
DEFSNG
AF
DEFDBL
B0
LINE
B1
WHILE   (stored as B1E9, "WHILE+" but the "+" is suppressed when the program is listed.)
B2
WEND
B3
CALL
B4
(Undefined)
B5
(Undefined)
B6
(Undefined)
B7
WRITE
B8
OPTION
B9
RANDOMIZE
BA
OPEN
BB
CLOSE
BC
LOAD
BD
MERGE
BE
SAVE
BF
COLOR
C0
CLS
C1
MOTOR
C2
BSAVE
C3
BLOAD
C4
SOUND
C5
BEEP
C6
PSET
C7
PRESET
C8
SCREEN
C9
KEY
CA
LOCATE
CB
(Undefined)
CC
TO
CD
THEN
CE
TAB(   (note that the following "(" is part of the keyword with no intervening space. That's why "TAB   (..." will not work.)
CF
STEP
D0
USR
D1
FN
D2
SPC(   (note that the following "(" is part of the keyword with no intervening space. That's why "SPC   (..." will not work.)
D3
NOT
D4
ERL
D5
ERR
D6
STRING$
D7
USING
D8
INSTR
D9
'   (stored as 3A8FD9, ":REM'" but the ":REM" is suppressed when the program is listed.)
DA
VARPTR
DB
CSRLIN
DC
POINT
DD
OFF
DE
INKEY$
DF
(Undefined)
E0
(Undefined)
E1
(Undefined)
E2
(Undefined)
E3
(Undefined)
E4
(Undefined)
E5
(Undefined)
E6
>
E7
=
E8
<
E9
+
EA
-
EB
*
EC
/
ED
^
EE
AND
EF
OR
F0
XOR
F1
EQV
F2
IMP
F3
MOD
F4
\
F5
(Undefined)
F6
(Undefined)
F7
(Undefined)
F8
(Undefined)
F9
(Undefined)
FA
(Undefined)
FB
(Undefined)
FC
(Undefined)

Two-byte tokens:

FD81
CVI
FD82
CVS
FD83
CVD
FD84
MKI$
FD85
MKS$
FD86
MKD$
FD8B
EXTERR
FE81
FILES
FE82
FIELD
FE83
SYSTEM
FE84
NAME
FE85
LSET
FE86
RSET
FE87
KILL
FE88
PUT
FE89
GET
FE8A
RESET
FE8B
COMMON
FE8C
CHAIN
FE8D
DATE$
FE8E
TIME$
FE8F
PAINT
FE90
COM
FE91
CIRCLE
FE92
DRAW
FE93
PLAY
FE94
TIMER
FE95
ERDEV
FE96
IOCTL
FE97
CHDIR
FE98
MKDIR
FE99
RMDIR
FE9A
SHELL
FE9B
ENVIRON
FE9C
VIEW
FE9D
WINDOW
FE9E
PMAP
FE9F
PALETTE
FEA0
LCOPY
FEA1
CALLS
FEA4
NOISE   (PCjr only)
DEBUG   (Sperry PC only)
FEA5
PCOPY   (PCjr or EGA system only)
FEA6
TERM   (PCjr only)
FEA7
LOCK
FEA8
UNLOCK
FF81
LEFT$
FF82
RIGHT$
FF83
MID$
FF84
SGN
FF85
INT
FF86
ABS
FF87
SQR
FF88
RND
FF89
SIN
FF8A
LOG
FF8B
EXP
FF8C
COS
FF8D
TAN
FF8E
ATN
FF8F
FRE
FF90
INP
FF91
POS
FF92
LEN
FF93
STR$
FF94
VAL
FF95
ASC
FF96
CHR$
FF97
PEEK
FF98
SPACE$
FF99
OCT$
FF9A
HEX$
FF9B
LPOS
FF9C
CINT
FF9D
CSNG
FF9E
CDBL
FF9F
FIX
FFA0
PEN
FFA1
STICK
FFA2
STRIG
FFA3
EOF
FFA4
LOC
FFA5
LOF

Un-tokenizing from hexdumps:

An example of manually detokenizing a memory hexdump of a GW-BASIC program. The program being detokenized was used to dump the memory segment it resided in. The detokenizing of the hexdump of a tokenized GW-BASIC program file would be similar except for three differences:

  1. The first byte of the file is FF and the equivalent byte ahead of the first program line in memory is 00.
  2. You initially don't know the offset that the file was saved from so you have to untokenize the first line to get the offset of the second line (and then use arithmetic to compute the offset of the first line and the offset of the whole file) and add that to the file offsets if you want the convenience of using the pointers to the next line to determine which zero bytes in the file are end-of-line flags.
  3. You should never encounter a line pointer, 0Dxxxx in a program file. They are only in a running program to speed up execution.

Practically, the only time you really need to manually de-tokenize a GW-BASIC program is when you have a GW-BASIC program file in tokenized form but no copy of GW-BASIC that supports that file (possible if the GW-BASIC version you have came out before some additional keywords used by the tokenized program were added to newer versions of GW-BASIC. If you do have a version of GW-BASIC that understands the file, you can load the file with the command,
    load "filename.ext"
and then save it in ASCII form with the command,
    save "newname.ext",A


"Protected" GW-BASIC programs.

GW-BASIC has an optional parameter to the SAVE statement that allows a program to be saved in "protected" form. The file is pseudo-encrypted and the initial byte of the file is changed to FE instead of FF. When the program is loaded, it is decoded and a flag is set in the GW-BASIC data work area to indicate that a protected program is loaded. While this flag is set, any GW-BASIC statement is still allowed in a program line but editing the program is prohibited and several statements are disabled in direct statements to prevent access to the protected program. SAVE without the P option is prohibited; LIST and LLIST is prohibited; PEEK, POKE, BSAVE and BLOAD are disabled, and CHAIN cannot include the MERGE option.

While this appears to suggest that you are out of luck if you have an old protected GW-BASIC program that needs updating (such as an accounting package that won't accept non-US addresses for employees when you just went international) there is still a way to unprotect such programs if you have a copy of GW-BASIC. It relies on one of the bugs in the VAL function.

If a way can be found to poke a zero value into the the GW-BASIC interpreter's protection flag byte, a loaded protected program would suddenly become unprotected and could then be LISTed, SAVEd in unprotected form or edited. To do that you need two things, the address of the protection flag for your version of GW-BASIC and a way to trick the GW-BASIC interpreter into thinking that a manually-entered POKE statement was a program statement and not a direct manual entry.

Protection flag addresses I have run across:

To determine the protection flag address for your version of GW BASIC, you need to load an unprotected program into memory and then find out which address will protect the program when it is changed from a zero byte to a non-zero byte. Load the following short program into your BASIC interpreter, save it in unprotected form, LIST it, and then run the last line as a direct statement by typing over the line number and pressing your <Enter> or <Return> key. A bunch of successive addresses will be printed and then the program will stop with "Illegal function call" displayed. Write down the last address printed. That will be the address of the protection flag for your version of GW-BASIC. Here is the short program (I call it "poketest.bas" -- make sure line 104 is entered as a single line even if your browser wraps it):


100 ' type the following line 104 as a direct statement or just LIST
102 ' this and then run line 104 by deleting the line number "104"
103 ' and pressing the <Return> or <Enter> key:
104 FOR I=1000 TO 16000:PRINT I: J=PEEK(I): POKE I,((J=0)AND 255) OR J: POKE I,J:NEXT I

What the program does is change successive zero bytes to non-zero and then restores them back to their original values until it hits the protection flag. Once the POKE I,((J=0)AND 255) OR J statement turns on the protection flag, the POKE I,J statement becomes a forbidden statement in a direct statement.

Now go back to the top of this page and review what has been written about the numeric tokens, paying special attention to 0Dxxxx, 0Exxxx, 10 and 1E. Then review the detokenizing example at the link above. The background there will help you understand the bug we are going to exploit.

Almost everywhere in the BASIC interpretor, one of two routines are called to get the current or next significant character to be processed. The "get next item" routine would be very confused if it encountered a byte in a numeric constant because those bytes can have any value. Therefore, whenever a numeric token is encountered, the following things are done:

  1. The numeric constant is evaluated and converted to a full integer if it is one of the one- (11 to 1B) or two-byte (0Fxx) tokens,
  2. the two-, four- or eight-byte value is placed in a temporary eight-byte accumulator,
  3. the type (integer, single- or double-precision floating-point, line number or line pointer) is saved in a type flag byte,
  4. the constant is skipped and the address of the next byte after the constant is stored in a temporary program-counter storage area,
  5. the current regular program counter is pointed to the 1E byte of a byte pair, 1E10 and
  6. the 1E is returned to the requesting interpreter routine which tells it to use the numeric constant in the temporary accumulator if that is a valid item at this point in the statement being interpreted.

What happens with VAL? The regular program counter is pushed on a stack, it is changed to point to the string to be evaluated, a zero-byte is inserted at the end of the string to ensure that VAL won't evaluate past the end of the string (such as when two strings adjacent in BASIC's string work area have the values "123" and "456" respectively, we don't want VAL to return 123456 as the value), and then BASIC's numeric-evaluation routine is called to evaluate the string. It uses the same call to "get current byte" and "get next byte" as the rest of the interpreter does. When the evaluation encounters what it thinks is the end of the number, it returns to VAL, VAL then replaces the zero byte, pops the original program counter from the stack and returns the evaluation result to its caller.

"Where's the bug!" you ask. What do you think happens if VAL is followed by a numeric constant and the number in the string is also followed by a numeric constant? Right. The one in the program line that follows VAL is over-written by the one in the string. On return to the VAL processing routine, the program counter is not restored to point to the statement with the VAL function. It is restored to point to the constant in the string. Now the string is being interpreted instead of the statement with VAL in it.

Normally this would cause a syntax error but there is one statement in BASIC that can have a numeric constant follow a function without intervening punctuation. That is a PRINT statement.

"How does this help?" you ask. The answer is that POKE is not allowed in a direct statement but is allowed in a program line. If we can go to a program line that has a POKE in it to turn off the protection flag then that will be allowed and the protection is turned off. Once a GOTO is executed, we are no longer in a direct statement. Obviously we cannot go to any line in the protected program. Firstly we don't know what line numbers are in it so we can't go to any one of them. Secondly, we don't know what is in the lines so we couldn't go to a suitable POKE if there was one, and thirdly, the chances of there being a suitable POKE in the protected program are slimmer than my chances of becoming richer than Bill Gates. However, using a pointer token instead of a line-number token means that the "program" line we go to can be anywhere in memory and doesn't have to really be in the program we need to unprotect. A numeric array is a convenient place for it.

What we need to do is to

  1. put a "POKE flagaddress,0" statement into a numeric array (with some leading spaces for flexibility),
  2. create a string (B$) that consists of
    1. a number in ASCII for VAL to process,
    2. a tokenized numeric constant,
    3. a colon to start a new statement, and
    4. a GOTO statement with a line-pointer that points to the array
  3. and then run a direct PRINT statement such as "PRINT VAL(B$) 123".

Replace the 1450 below with the protection flag address of your version of GW-BASIC and "secret.bas" with the name of the program to be unprotected.

The unprotection routine (with comments):


load "secret.bas"
dim a%[14]
a%[0]=0:a%[1]=&h2020:a%[2]=&h2020:a%[3]=&h2097
'               "  "         "  "      "DEF "
a%[4]=&h4553:a%[5]=&h2047:a%[6]=&H203A:a%[7]=&H2098
'       "SE"         "G "         ": "       "POKE "
a%[8]=&H1C20            :a%[9]=1450      :a%[10]=&h112C
'   " " + integer token  the flag address     ",0"
a%[11]=&h903A    :a%[12]=0
'       ":STOP"  end-of-line
b$="" ' must be defined before VARPTR is called.
b$="123"+chr$(28)+":::"+chr$(137)+chr$(13)+mki$(varptr(a%[0]))+":"

print val(b$) 456

Was that sneaky or what?


Visitors to this page are encouraged to copy the information above and make it available to others as long as credit is given for the source. I'm not getting any younger and would hate to have the information lost forever if I should unexpectedly drop dead next week, month, year, decade or century.


GW-BASIC resources:

Other BASIC resources:

I have patched a copy of GW-BASIC to get rid of some bugs and add the new features described and documented in the NewBASIC.txt text file. [Ooops! I forgot to change the permissions for the file after uploading it so it wasn't accessible for a couple of days. It should be accessible now.] The patched BASIC (NBASIC.EXE) is available for download in a zipped file, NBASIC.EXE.zip.

Note: My NBASIC.EXE is totally unrelated to another BASIC interpreter also called "NBASIC", available at:


My Computer Hints, Tips, and Utilities page.

My Home page.