CPUSim64 Programmer Guide

Introduction

The CPUSim64 code consists of an assembler and an emulator. Assembly language files are plain text files that you can create with any programming editor. Do not use a word processor like Word or Pages, as these do not save their files as plain text. You then use the assembler to compile your assembly language source files into CPUSim64 machine code object files. These object files can then be run with the CPUSim64 emulator. Both of these programs are written in Java, so they themselves run on the Java VM (virtual machine).

[Figure: Memory Model Diagram]
[Figure: Development Model Diagram]

Installation

CPUSim64 is composed of Java classes in jar files located in the lib folder of the distribution and script files in the main directory that you can use to run the assembler and emulator in different modes. Download the CPUSim64.zip file and expand it to a location of your choice. For this guide we will assume that you expand the ZIP archive in your Documents directory.

You are required to install Java JDK 17 on your system and make sure that the Java JDK bin directory is in your system's PATH environment variable. Type the following in your terminal window to see if Java is installed correctly.

> java --version
java 17.0.5 2022-10-18 LTS
Java(TM) SE Runtime Environment (build 17.0.5+9-LTS-191)
Java HotSpot(TM) 64-Bit Server VM (build 17.0.5+9-LTS-191, mixed mode, sharing)

If you don't see the Java version printed, check your install and PATH variable. It can also be handy to put the location of your CPUSim64 directory in the PATH variable. Then you can execute the CPUSim64 scripts from any directory.

TODO Explain how to set PATH environment variable.

Once you have Java and CPUSim64 installed, you can test it out using the very small program listed below. Use a text editor to enter the text of the program, then save it as example001.asm.

Our simple program only has three instructions. The NOP instruction is called a No-Op because it doesn't do anything but take up one CPU cycle. The STOP instruction tells the CPU to stop executing your program and return to the command line. At the very end of all your programs you must place two STOP instructions as this tells the assembler to stop compiling instructions.

Once the program source file is saved you can run the assembler to compile your assembly language program into CPUSim64 machine code. This is done with the compile.sh (or compile.bat for Windows) script. Using the terminal window, navigate to the directory where your source file is saved. Then run the compile script by typing the script filename followed by the name of your source file. Do not include the ".asm" extension as that is assumed by the script. You should see output similar to the output below.

> compile.sh example001

The program will compile your assembly language source file and create a machine language object file with an ".obj.gz" extension. It will print how many words were compiled from your source when it is complete. If there are errors, they will be displayed. The symbols used and the labels used in your source will be listed.

To run the program, use the run.sh (or run.bat for Windows) script name followed by the base name of your program source file. Like the compile script, the run script will first compile your source into an object file. It does this in a quiet mode so you don't get the same verbose output as the compile script. Then the run script will execute your object file on the CPUSim64 emulator. It will print statistics related to your program before it runs, such as code size, heap size and maximum stack size. Then your program will run. After it runs, the message "System Halted!" will be printed, followed by statistics related to the run, such as the number of user CPU cycles used, the wall clock time it took to execute and the return code from your program.

> run.sh example001

If your program doesn't work as expected, you can run it in debug mode using the debug.sh (or debug.bat for Windows) script name followed by the base name of your program source file. Like the run script, your program will be compiled and run, but this time the assembler and emulator will be run in debug mode. The assembler will print out your source program as it understands it, complete with addresses and symbolic addresses. You can use this as a reference when you are debugging and can also use it to make sure that the assembler generates the instructions that you expect. When the CPUSim64 emulator runs your program in debug mode, it will print out the entire state of the emulated CPU every time it encounters a NOP instruction. It will also print the final state of the CPU when your program ends.

> debug.sh example001

For more information about how your program runs, you can run it in trace mode using the trace.sh (or trace.bat for Windows) script name followed by the base name of your program source file. In this mode, each instruction will be printed as it is executed.

> trace.sh example001

Comments

Because assembly language can be hard to read, it is important to put plenty of documentation in your source code in the form of comments. A comment line starts with two slashes (//) and causes the line to be ignored by the assembler. You can also put comments using double slashes at the end of instruction lines which causes everything from the double slashes to the end of the line to be ignored. It is good practice to put a documentation block at the beginning of the program, functions or other important units of code to explain what the code is supposed to do when it works properly. This way someone (perhaps yourself six months from now) can debug errant code because the documentation will tell them what the correct operation should be.

Move Operations

One of the most basic operations available is to move a constant into either an integer register or a floating point register. This is done with the MOVE operation. It takes two arguments: the first is the destination register and the second is the constant to move into the register. Constants can be 16-bit Unicode character constants, integer constants (in decimal or hexadecimal), or floating point constants. Character constants are formed using a single character in single quotes. It is also possible to use escape sequences for special characters and for Unicode characters whose codepoint is known. The special characters are as follows:

'\0'    NULL Character, codepoint 0
'\b'    Backspace Character, codepoint 8
'\t'    Tab Character, codepoint 9
'\n'    New Line Character, codepoint 10
'\f'    Form Feed Character, codepoint 12
'\r'    Carriage Return Character, codepoint 13
'\"'    Double Quote, codepoint 34
'\''    Single Quote, codepoint 39
'\\'    Backslash, codepoint 92

Given the Unicode codepoint you can also specify a character using the escape sequence of the form \uxxxx where 'xxxx' is the four hexadecimal digit value for the codepoint.
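The mapping from character constants to codepoints can be sketched outside the assembler. The Python below is a hypothetical model written for this guide, not the assembler's own code; the function name char_constant_value and its error handling are assumptions. It covers plain characters, the escapes listed above, and the \uxxxx form.

```python
# Hypothetical model of character-constant decoding (not the assembler's code).
# The escape table is taken from the list in this guide.
ESCAPES = {
    "0": 0, "b": 8, "t": 9, "n": 10, "f": 12,
    "r": 13, '"': 34, "'": 39, "\\": 92,
}

HEX_DIGITS = "0123456789abcdefABCDEF"


def char_constant_value(literal):
    """Return the codepoint for a literal such as 'A', '\\n' or '\\u00e9'."""
    if len(literal) < 3 or literal[0] != "'" or literal[-1] != "'":
        raise ValueError("not a character constant: %r" % literal)
    body = literal[1:-1]
    # A single ordinary character stands for its own codepoint.
    if len(body) == 1 and body != "\\":
        return ord(body)
    # A backslash followed by one of the listed escape characters.
    if len(body) == 2 and body[0] == "\\" and body[1] in ESCAPES:
        return ESCAPES[body[1]]
    # \uxxxx: exactly four hexadecimal digits.
    if (len(body) == 6 and body.startswith("\\u")
            and all(c in HEX_DIGITS for c in body[2:])):
        return int(body[2:], 16)
    raise ValueError("unsupported character constant: %r" % literal)
```

Under this model, the literals 'A' and '\u0041' both yield codepoint 65.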

Integer constants can be positive or negative. They can be in decimal format or hexadecimal format. Hexadecimal constants are always preceded by 0x.

Floating point constants can be positive or negative. They are written using floating point notation (a decimal point is required) with up to 16 significant decimal digits. They can also be written using scientific notation such as 1.23e10 or 3.456e-20.

CPUSim64 stores the constants to move inside the 64-bit instruction. Very large integers cannot fit inside the instruction, so they are instead stored in the data segment at the end of your code. The assembler then substitutes a LOAD instruction for your MOVE instruction to load the constant into the register. The same process applies to all floating point MOVE instructions that specify constants, since a 64-bit floating point constant cannot fit inside the MOVE instruction.

You can also use the MOVE instruction to move data from one general purpose register to another or from one floating point register to another. You can also use it to move data between general purpose registers and floating point registers.

Debugging

When run using the debug script as in the above example, the entire CPU state is printed when the program completes. That way you can confirm the results of your program. But if you are only interested in viewing a few registers at a time during the execution of your program, you can use the DEBUG instruction to display 1-4 registers at a time. The nice thing about the DEBUG instruction is that it only gets compiled into your code when you use the --debug option on the assembler, as is the case in the debug script. Likewise, if your code was compiled with debug instructions turned on, they are only acted upon when the CPUSim64 emulator is run with the --debug option, as is also the case with the debug script. If you run your program without the --debug option on the emulator, debug is off and the DEBUG instructions are treated as NOP instructions.

If you want to display the entire CPU state at some time during the execution of your program, in addition to at the end, use the instruction int iPrintCPUState. Unfortunately this uses a system interrupt and is not automatically disabled by the debug settings. You will have to remove it on your own when you no longer need it. You will also need to include the system definition file <system/system.def> for this to work.

Arithmetic

Arithmetic operations are available to perform addition, subtraction, multiplication and division on integer or floating point registers. They accept several combinations of operands. There are two two-operand forms. The first form takes two register operands: the first operand is the destination, the second is the value applied to it by the arithmetic operation, and the result is stored in the first operand. The second form takes a register operand and an integer literal. Like the first form, it applies the literal second operand to the first and stores the result in the first.

The arithmetic operators also have a three operand form. There are four forms for three operands:

What all these forms have in common is that they apply the operation to the second and third operands and store the result in the first operand. For example:

In addition to the divide operations we have seen so far, there is an additional form that takes four general purpose registers. This version divides the third operand by the fourth then places the integer quotient in the first operand and the remainder in the second.
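The relationship between the four operands can be modelled as follows. This is a minimal Python sketch, not emulator code. The guide does not say how negative operands or a zero divisor are handled. The sketch assumes the quotient is truncated toward zero (as Java integer division does) and raises an error for division by zero. The name divide4 is hypothetical.

```python
def divide4(dividend, divisor):
    """Model the four-register divide: return (quotient, remainder).

    Assumption: the quotient truncates toward zero, as Java's '/' does.
    The guide does not state the rounding rule.
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor is zero")
    quotient = abs(dividend) // abs(divisor)
    # The quotient is negative when the operand signs differ.
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient
    # The remainder is whatever is left so that q * d + r == dividend.
    remainder = dividend - quotient * divisor
    return quotient, remainder
```

For 17 divided by 5 this gives a quotient of 3 and a remainder of 2. Whatever the rounding rule, quotient * divisor + remainder should reproduce the dividend.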

Finally there are two arithmetic operations that take just a single operand: negation and reciprocal. The negate operation takes a general purpose or floating point register, negates it then stores it back into the register. The recip operation takes a floating point register, computes its reciprocal then stores it back into the register.

Loops

We can make control structures such as loops using the JUMP instruction. The JUMP instruction can branch unconditionally or based on the condition of one of the bits in the status register (SR). We can make the equivalent of DO WHILE/WEND and DO/WHILE loops as illustrated below.

When we use a JUMP instruction we must always supply an address as the last operand. It can either be a general purpose register with an address in it or an address literal. Address literals are symbols that may begin with an '@' character. If the address literal is prefixed with the '@' character, it means that the label is within the function currently being defined. Address literals without the '@' character refer to global labels outside the function being defined. The literal must match a label somewhere else in the code. Labels are symbols that end with a colon (':').

Compare and Test

Two instructions that can be helpful when writing loops are COMPARE and TEST.

COMPARE takes two operands and subtracts them, setting the status register bits according to the computed difference. If the two operands are equal, the Z (zero) status bit will be true. Likewise it will be false (not zero) if they are not equal. When the status register is printed by the DEBUG instruction, the zero bit will be a capital Z if it is set (true) and a lowercase z if it is not set (false). The condition codes we use in the JUMP instruction are 'z' if set and 'nz' if not set, corresponding to 'is zero' and 'is not zero' respectively. You may also use the condition 'eq' for 'z' or 'ne' for 'nz'.

The TEST instruction simply tests the single operand supplied, setting the status register bits based on the attributes of the operand. It is essentially equivalent to comparing the operand to zero.

The table below describes some of the condition codes that can be used with the JUMP instruction after a MOVE, COMPARE or TEST.

Condition Code    SR Bit Checked    Relational Equivalent
u                 unconditional
z or eq           zero              op0 == op1
nz or ne          not zero          op0 != op1
n or lt           negative          op0 < op1
p or gt           positive          op0 > op1
nn or ge          not negative      op0 >= op1
np or le          not positive      op0 <= op1

Logical Operators

We have a number of bitwise logical operators that can be used for Boolean arithmetic. These operators can also be used for simple logical testing if we restrict operand values to -1 and 0. This allows us to represent TRUE as -1 (all bits set) and FALSE as 0 (no bits set). The binary logical operators are AND, OR and XOR. There is one operator that takes a single argument, the COMPLIMENT operator (also known as logical NOT). The following example illustrates this use of the logical operators to print out truth tables.

This last example makes use of console output functions to print the truth tables and format the values. We will talk about console output later in this document.
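The reason -1 and 0 work as TRUE and FALSE is that bitwise operations on "all bits set" and "no bits set" always produce one of those two values again. This can be checked outside the emulator. The Python sketch below is an illustration, not CPUSim64 code. It relies on Python integers using two's complement semantics for bitwise operators, and the helper names are hypothetical.

```python
TRUE, FALSE = -1, 0  # all bits set / no bits set


def truth_table(op):
    """Return rows (a, b, result) for a binary bitwise operator."""
    rows = []
    for a in (FALSE, TRUE):
        for b in (FALSE, TRUE):
            rows.append((a, b, op(a, b)))
    return rows


and_table = truth_table(lambda a, b: a & b)
or_table = truth_table(lambda a, b: a | b)
xor_table = truth_table(lambda a, b: a ^ b)
# The complement (logical NOT) flips every bit: 0 becomes -1 and -1 becomes 0.
not_table = [(a, ~a) for a in (FALSE, TRUE)]

for name, table in (("AND", and_table), ("OR", or_table), ("XOR", xor_table)):
    print(name)
    for a, b, result in table:
        print("  %2d %2d -> %2d" % (a, b, result))
```

Every result in these tables is either -1 or 0, so the outputs can be fed straight back in as operands.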

Named Constants

Using symbolic names for literal values helps with the readability of our programs. It also helps eliminate mistakes caused by repeated typing of the same literal value. With symbolic constants we get the added benefit that if we misspell the symbolic constant, we should get a compile error. By contrast, a mistyped literal value is often just a different, wrong, but still legal literal value.

One way to create symbolic constants is to use the preprocessor directive #define. Preprocessor directives define simple text substitutions that happen on our code before it is compiled. We have been using one such directive, #include, to add in code from another file. The #define directive establishes a simple substitution between a symbolic name and a numeric literal (integer or floating point). Whenever the symbolic name is used in our code, the corresponding literal value is substituted just as if we had typed it into the code ourselves. When we use #define we often use all upper case symbols to help remind us where text substitutions are occurring in our code.

The other mechanism for creating named constants is to declare a constant in memory at the end of our code. This is done with the DCI or DCF compiler directives to store an integer or floating point value. There are also DCC and DCS directives for storing a character or string respectively.

Using named constants we can refactor the truth tables program to be more readable and avoid redundancies thus reducing the chances of typing errors. Symbols for TRUE and FALSE are defined in <system/system.def>. We use a DCS constant for the formatting strings to fprintf().

Arrays

To access elements of an array, all you need is the base address of the array and an offset to the element. When you create an array with DCA you should give it a label, which will be the base address of the array. Depending on whether the array has integer elements or floating point elements, you can use one of the load instructions of the form:

load r0, BASE_ADDR[offset]

load f0, BASE_ADDR[offset]

The offset can be a literal integer or the value in an integer register. Valid offsets are zero through the size of the array minus one. You can also use the special offset -1 to get the size of the array.
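One way to picture the special offset -1 is to imagine the array's size stored in the word just before its first element. That layout is an assumption made for this sketch (the guide only states that offset -1 yields the size), as is the bounds check. The Python below models memory as a list of words, and the function names are hypothetical.

```python
def define_array(memory, values):
    """Append an array to memory and return its base address.

    Assumed layout: one size word, then the elements. The base address
    points at element zero, so the size word sits at offset -1.
    """
    memory.append(len(values))
    base = len(memory)
    memory.extend(values)
    return base


def load(memory, base, offset):
    """Model 'load r0, BASE_ADDR[offset]' with offsets -1 .. size-1."""
    size = memory[base - 1]
    if offset < -1 or offset >= size:
        # Assumption: the guide does not say what an invalid offset does.
        raise IndexError("offset %d outside array of size %d" % (offset, size))
    return memory[base + offset]


memory = [0, 0, 0, 0]               # stand-in for code that precedes the array
base = define_array(memory, [10, 20, 30])
size = load(memory, base, -1)       # size of the array
total = sum(load(memory, base, i) for i in range(size))
```

Reading the size through offset -1 lets a loop walk the array without a separately maintained length constant.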

Console Output

If you include the system header file <system/io.asm> you will gain access to a number of helpful functions for performing output to the console.

Function                           Description
puts(str)                          Prints a string
putc(value)                        Prints a character
put_int(value, base)               Prints an integer using supplied base
put_dec(value)                     Prints an integer in base 10
put_hex(value)                     Prints an integer in base 16
put_fp(value)                      Prints a floating point value
put_nl()                           Prints a new line
fprintf(STDOUT, fmt, values...)    Uses a format string to print a variable number of values

Because these functions pass arguments on the stack, you use the #call directive when you wish to call them. All arguments should be either integer or floating point registers as appropriate to the call. The arguments will then be pushed onto the stack and the function call made.

Command Line Arguments

When writing command line programs it is often necessary to pass in arguments to the program on the command line. There are two system level interrupts that we can invoke. One will give us the count of items on the command line and the other can be used to get each item on the command line. Items on the command line are strings and are separated by spaces. For example:

> run.sh example015 326 Hello 3.1415

Interrupts are operating system level functions that are executed via the software interrupt mechanism of the CPU, invoked with the INT instruction. Interrupt instructions take a single integer operand to identify the system level function to execute. Interrupts use a register passing convention. If the interrupt requires an input argument, it is expected in r0 or f0. Likewise, a value can be returned from the interrupt in r0 or f0. Symbols for the various interrupt codes available are defined in the system definition files. You must include the appropriate file to gain access to the definitions. For the command line argument interrupts we will need to include <system/system.def>.

The first command line argument in element zero of ARGS is always the name of the program file that is running.

Converting Strings to Numbers

Often we will want to take command line arguments (which are strings) and convert them to integers or floating point numbers so that we can do calculations with them. In the <system/string.def> definition file we have some interrupt codes defined to help us with that.

Function        Description
iPARSE_INT      Converts string at r0 to integer in r0. Accepts both decimal and hexadecimal with the '0x' prefix
iPARSE_DEC      Converts decimal string at r0 to integer in r0
iPARSE_HEX      Converts hexadecimal string at r0 to integer in r0
iPARSE_FLOAT    Converts FP string at r0 to floating point in f0

If these parsing functions cannot make any sense of the string passed in r0, they will return zero.
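Returning zero on failure has a consequence worth seeing concretely: a failed parse cannot be told apart from a genuine "0". The Python sketch below models that behaviour for iPARSE_INT and iPARSE_FLOAT. It is an approximation under stated assumptions, not the emulator's implementation. Python's own int() and float() may accept inputs the interrupts reject (surrounding whitespace, for example), and the handling of a sign before '0x' is assumed.

```python
def parse_int(text):
    """Model iPARSE_INT: decimal or 0x-prefixed hex, zero on failure."""
    try:
        if text.lower().startswith(("0x", "-0x", "+0x")):
            return int(text, 16)
        return int(text, 10)
    except ValueError:
        return 0


def parse_float(text):
    """Model iPARSE_FLOAT: floating point value, zero on failure."""
    try:
        return float(text)
    except ValueError:
        return 0.0


# Arguments from the earlier example command line: 326 Hello 3.1415
args = ["326", "Hello", "3.1415"]
as_ints = [parse_int(a) for a in args]
as_floats = [parse_float(a) for a in args]
```

Because parse_int("Hello") and parse_int("0") both give 0 in this model, a program that must distinguish the two cases needs its own check on the string before converting it.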

Conditional Control

Conditional execution of code is accomplished using the JUMP instruction. We can compose an IF/THEN construct using one JUMP or an IF/THEN/ELSE construct using two JUMP instructions. For example, if we look at the number of command line arguments passed to the program, we can branch based on whether there are any or not. If we compare the result of interrupt iARGC to 2, we will know that there aren't any command line arguments if the result of the comparison is less than. In pseudocode we would have:

if argc < 2 then
    print no arguments
end if

Because we want to use a JUMP instruction to branch around the code in the THEN part of the IF, we actually jump on the opposite test. Since greater or equal is the opposite of less than, we shall jump around if argc ≥ 2. See the example code for how this is implemented.

If we want to implement an IF/THEN/ELSE we need two JUMP instructions. Again comparing to the result of interrupt iARGC we can print one message if there are no arguments and a different message if there are arguments. The pseudocode for this is as follows:

if argc ≥ 2 then
    print the number of arguments
else
    print no arguments
end if

If argc < 2 we need to jump around the THEN statements. The THEN statements likewise need to jump around the ELSE statements so the THEN section must end with a JUMP to the end of the IF/THEN/ELSE.

Macros

Often we need to repeat code or similar code multiple times. If the code follows a pattern with just a few elements that differ we can write a macro substitution that can be used to implement the code more easily and correctly.

Macro substitutions are set up with the #def_macro preprocessor directive. Using this directive we specify the name of the macro and its arguments in parentheses. Following that we provide the statements we want to be substituted when the macro is used in our code. In the substitution code we can use the macro variables by using the special syntax ${varname}. Each time such a special variable symbol is used, it is replaced by the text supplied for that variable when the macro is used.

This next example uses macros with three arguments to compute the minimum and the maximum of two integers inline.

You can make these two macros even more compact (two instructions each) by using a special form of the MOVE instruction that takes an SR condition just like the JUMP instruction does. If the condition is true, it moves the third operand into the second. If the condition is false, it moves the fourth operand into the second. This is called a conditional move.
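The effect of a compare followed by a conditional move can be modelled as a pure function. The Python below is a sketch of the semantics only, not assembler syntax. Reading the description above, the operands are taken to be the condition, the destination, the value used when the condition is true, and the value used when it is false. The function names are hypothetical.

```python
def conditional_move(condition_true, if_true, if_false):
    """Return the value that a conditional move would place in the destination."""
    return if_true if condition_true else if_false


def minimum(a, b):
    # Step 1: compare a with b (the difference a - b sets the flags).
    # Step 2: if "less than" holds, take a, otherwise take b.
    return conditional_move(a - b < 0, a, b)


def maximum(a, b):
    # Same two steps, with the "greater than" condition.
    return conditional_move(a - b > 0, a, b)
```

Each function corresponds to the two instructions mentioned above, one compare and one conditional move, with no JUMP and no label needed.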

Register-based Functions

We can better organize our code and make it easier to understand by breaking it up into separate functions that do one thing well. We can then call our functions from many places in our code eliminating the redundancy inherent in using macros.

The basic form of function in assembly language uses register-based calling conventions. This means that the inputs to a function are expected to be in specific registers, typically r0, r1, etc. for integer arguments or addresses and f0, f1, etc. for floating point. By convention, functions are allowed to destroy the contents of r0 and/or f0. In fact these are the two registers in which a function might return a value. But if the function uses any other registers, it is expected to PUSH the values of those registers onto the stack to save them and restore them with POP when it ends.

Register-based functions are started with a unique label which gives the function its name. The function must use a RETURN statement at the end of its code to return to the calling code.

The calling code sets up the required registers, then issues a CALL instruction with the address of the function. The CALL instruction can also take an SR condition, like the JUMP instruction does, to make the CALL conditional.

Functions can be defined before or after the code that calls them, but functions generally are not to be declared inside other functions.

In the following example we take our minimum and maximum macros and turn them into register-based functions.

Stack-based Functions

Stack-based calling conventions give us some advantages over register calling conventions. First, they manage all the PUSH and POP instructions necessary to use registers other than r0 or f0. They also allow us to give symbolic names to arguments and other registers, making our code easier to read.

Stack-based functions are defined using the preprocessor declaration #DEF_FUNC and stack-based functions are called with the preprocessor declaration #CALL.

At the beginning of your function you can use one #SVAR directive to declare additional named stack variables. You will have to use LOAD and STORE to access these variables. That can be followed by one #VAR directive to declare all your integer or address register named variables. These register variables can be used directly by instructions and can be changed with MOVE. That in turn can be followed by one #FVAR to declare all your floating point register named variables.

To return a value, simply put it in r0 or f0. You can also use the #RETURN or #FRETURN directives, which take a register and move it into r0 or f0 respectively.

Below is our min/max example using stack-based functions. There is also a sum function for summing a floating point array.

We can even turn our main code into a function and simply call it from the start of our program then exit with a return value returned by main.

Heap Dynamic Memory Allocation

Our programs have access to a whole region of memory called the heap. Our programs can dynamically allocate blocks from the heap as they are running. When we are done with an allocated block we must release or "free" it. In this way we can manage the memory in the heap.

As an example, say we want to allocate enough memory to hold 100 integers. We can do this by issuing the software interrupt iALLOC with the size of the block we want to allocate in r0. It will return the address of the allocated block in r0, or 0 if there is an error.

For small allocations, iALLOC may allocate a slightly larger block so that there are consistently sized blocks in the heap, which helps when blocks are freed and can be reused. The allocated size is stored in the word right before the address returned by iALLOC.

String Functions

There are a number of functions for operating on strings. Strings are simply an array of Unicode codepoints terminated with the null character (0 or '\0'). Strings are either statically allocated using a DCS assembler directive and stored in the code segment or they are dynamically allocated in the heap using interrupt iALLOC. It is important to free strings when you don't need them any longer to free up memory in the heap.
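The null-terminated layout described above can be illustrated with a small model. The Python below is a sketch written for this guide, not one of the CPUSim64 string functions. It treats memory as a list of words and stores one codepoint per word, and the function names are hypothetical.

```python
def store_string(memory, text):
    """Append text to memory as codepoints plus a terminating 0; return its address."""
    address = len(memory)
    memory.extend(ord(ch) for ch in text)
    memory.append(0)  # null terminator
    return address


def string_length(memory, address):
    """Count codepoints up to, but not including, the null terminator."""
    length = 0
    while memory[address + length] != 0:
        length += 1
    return length


def read_string(memory, address):
    """Rebuild the text stored at address."""
    length = string_length(memory, address)
    return "".join(chr(cp) for cp in memory[address:address + length])


memory = []
hello = store_string(memory, "Hello")
empty = store_string(memory, "")
```

Note that a string of n characters occupies n + 1 words in this model, which matters when sizing a block for a dynamically allocated string.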

Math Functions

Math functions are available for most of the standard math functionality on floating point values. There are also some functions for generating random numbers, which are helpful in simulations and games.