CS471/571 - Operating Systems

A System Call in Xv6

Consider the following Xv6 userspace program, to print "Hello, world!" to the screen:

#include "types.h"
#include "stat.h"
#include "user.h"

int main(void)
{
  printf(1, "Hello, world!\n");
  exit();
}

In this example:

user.h defines the prototypes for all the system call wrappers and userspace functions defined in ulib.c
types.h and stat.h define the types and stat structure needed for the system calls and functions defined in user.h. All three of the headers should be included at a minimum for an Xv6 user-space program.
The first parameter to printf() in Xv6 is the file descriptor to write to, so printf also works as a form of fprintf().
exit() must be called to end your user-space programs. Xv6 has no C run-time, so a return from main() does not enter a run-time shutdown routine that would call exit() to terminate the program. There is also no exit value for Xv6 programs, so the exit() system call takes no parameter.

After processing the format string and arguments, printf() produces its output through one or more write() system calls. In this case the net effect is something like:

write(1, "Hello, world!\n", 14);

Here the system call asks the kernel to send the 14 bytes of data, found in memory at the address of the string, to whatever file descriptor 1 represents. In Xv6 all these parameters are pushed onto the process's stack from right to left, leaving the first parameter at the top of the stack, and then the system call is made.
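The idea that one printf() can turn into several write() calls can be made concrete with a small host-side sketch. It assumes a per-character output helper in the style of Xv6's printf.c; the names put_char, put_string and write_calls are hypothetical, and the code is meant for an ordinary POSIX system rather than for Xv6 itself.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical counter: how many write() calls have been attempted. */
int write_calls = 0;

/* Emit one character with its own write() call. */
void put_char(int fd, char c)
{
  ssize_t r = write(fd, &c, 1);
  (void)r;                /* the result is ignored in this sketch */
  write_calls++;
}

/* Emit a string one character at a time; returns the number of bytes sent. */
int put_string(int fd, const char *s)
{
  int n = 0;
  for (; *s != '\0'; s++, n++)
    put_char(fd, *s);
  return n;
}
```

Calling put_string(1, "Hello, world!\n") under these assumptions issues one write() per character, so the 14 bytes of the message cost 14 trips into the kernel instead of one.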

To trace how the system call is made we start with the following Xv6 files:

`syscall.h`

syscall.h contains #defines for the system calls and their numbers. The number is the value that must be placed in the eax register (the low 32 bits of the rax register; Xv6 operates in 32-bit mode, so all registers are their 32-bit or smaller versions). The write system call is defined as follows:

...
#define SYS_write   16
...

These #defines are of particular interest to the next file usys.S:

`usys.S`

usys.S holds the user-space system call wrappers. It creates an assembly entry for each system call in the system. Each entry sets up the call by loading the appropriate system call number into the eax register, makes the system call by executing the instruction int 0x40, and then returns to the calling function via the ret instruction.

The following macro sets up each system call:

 #define SYSCALL(name) \
  .globl name; \
  name: \
    movl $SYS_ ## name, %eax; \
    int $T_SYSCALL; \
    ret

For write() (i.e. SYSCALL(write) in usys.S), the macro expands to:

 .globl write;                  # Makes the symbol "write" global
   write:
    movl $SYS_write, %eax;      # Moves system call number into eax
    int $0x40;                  # Interrupt into the kernel, T_SYSCALL == 0x40
    ret                         # Return from the "wrapper", back to C

Note that the assembly is in AT&T syntax. movl moves a long (32 bit) value (in this case SYS_write which == 16) into the eax register. The int instruction invokes a specific "interrupt", effectively jumping to the address of a function defined by one of 256 "interrupt vectors" defined in what is called the Interrupt Descriptor Table (IDT). The particular interrupt vector used to perform a system call is defined by the kernel itself and is arbitrary.
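The token pasting that turns SYSCALL(write) into a reference to SYS_write is ordinary C preprocessor behaviour, so it can be tried out in plain C. The sketch below is only an illustration: SYS_write is the value quoted from syscall.h above, the value given for SYS_read is an assumption, and SYSCALL_NUMBER, SHOW_WRAPPER and show_wrappers are hypothetical names.

```c
#include <stdio.h>

#define SYS_read   5   /* assumed value, for illustration only */
#define SYS_write 16   /* value quoted from syscall.h above */

/* Paste the prefix SYS_ onto the argument, as the usys.S macro does. */
#define SYSCALL_NUMBER(name) (SYS_ ## name)

/* Hypothetical helper: print what a wrapper for `name` would execute. */
#define SHOW_WRAPPER(name) \
  printf("%s: movl $%d, %%eax; int $0x40; ret\n", #name, SYSCALL_NUMBER(name))

void show_wrappers(void)
{
  SHOW_WRAPPER(read);
  SHOW_WRAPPER(write);
}
```

Because ## works on tokens before any evaluation happens, adding a new system call only needs a new SYS_ constant and one more SYSCALL(name) line.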

Some operating systems, such as DOS, might use one IDT entry per system call; however, Unix and most modern kernels use a single vector for system calls and reserve the other vectors for things such as hardware interrupts, timers or compatibility calls.

`trap.c`

Xv6 refers to each interrupt as a "trap", so in Xv6 system calls, hardware interrupts and hardware exceptions (such as segmentation faults, divide by zero, etc.) are all "traps".

The setup of the IDT inside the kernel is handled mostly by the tvinit() function, which sets up the trap vectors in the idt array (i.e. the IDT vectors). The array is then installed as the IDT by the privileged lidt instruction, issued from the idtinit() function:

lidt(idt, sizeof(idt));

The lidt() function is a C wrapper (defined in x86.h) for the privileged lidt instruction, which sets the interrupt descriptor table: a table of (addresses of) handler functions for each interrupt type.

In tvinit(), every entry idt[i] is first given a default gate that calls the function at vectors[i]. These default gates are kernel-only, so user code cannot trigger them with the int instruction. The system call vector (T_SYSCALL, i.e. 0x40 or 64 decimal) is then set up with the following macro (defined in mmu.h):

SETGATE(idt[T_SYSCALL], 1, SEG_KCODE<<3, vectors[T_SYSCALL], DPL_USER);

This defines:

Call the function defined by vectors[T_SYSCALL] (set to vector64, which is defined in vectors.S) for Interrupt Descriptor Table entry number T_SYSCALL which is defined as 0x40 or 64 decimal.
DPL_USER defines the trap as callable from user-space via the int instruction.
SEG_KCODE<<3 selects the kernel code segment, i.e. the handler runs in kernel mode.
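What the SETGATE arguments control can be sketched with a simplified gate record. This is a host-side model, not the packed hardware descriptor: struct gate, make_gate, gate_handler and int_allowed are hypothetical names, the constant values are assumptions based on the usual x86 encodings, and the handler addresses used with it are made up.

```c
#include <stdint.h>

#define SEG_KCODE 1     /* kernel code segment index (assumed value) */
#define DPL_USER  3     /* user privilege level (ring 3) */
#define STS_IG32  0xE   /* 32-bit interrupt gate type */
#define STS_TG32  0xF   /* 32-bit trap gate type */

/* Simplified gate: plain fields instead of the packed hardware layout. */
struct gate {
  uint16_t off_lo;      /* handler address, bits 0..15 */
  uint16_t sel;         /* code segment selector */
  uint8_t  type;        /* interrupt gate or trap gate */
  uint8_t  dpl;         /* lowest privilege allowed to use int on it */
  uint8_t  present;
  uint16_t off_hi;      /* handler address, bits 16..31 */
};

struct gate make_gate(int istrap, uint16_t sel, uint32_t off, int dpl)
{
  struct gate g;
  g.off_lo = (uint16_t)(off & 0xffff);
  g.sel = sel;
  g.type = istrap ? STS_TG32 : STS_IG32;
  g.dpl = (uint8_t)dpl;
  g.present = 1;
  g.off_hi = (uint16_t)(off >> 16);
  return g;
}

/* Reassemble the handler address from its two halves. */
uint32_t gate_handler(const struct gate *g)
{
  return ((uint32_t)g->off_hi << 16) | g->off_lo;
}

/* A software int is permitted only if the caller's privilege level
   (cpl, 0 = kernel, 3 = user) is numerically <= the gate's dpl. */
int int_allowed(const struct gate *g, int cpl)
{
  return g->present && cpl <= g->dpl;
}
```

The shift in SEG_KCODE<<3 leaves the low three bits of the selector at zero; on x86 those bits hold the table indicator and requested privilege level, so the segment index starts at bit 3.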

`vectors.S`

The vectors.S file is an auto-generated (via the Perl script vectors.pl) list of the 256 entry points that fill the vectors[] array, which holds the addresses of the functions to call when a specific interrupt is raised. When int 0x40 executes, the CPU switches to the process's kernel stack, saves the user-mode return state there, and then transfers control to vector64, much like: jmp vector64. That stub pushes a placeholder error code of 0 followed by the trap number onto the stack, then jumps to the alltraps assembly routine, which is the common entry point for all traps.

...
vector64:
  pushl $0        # placeholder error code
  pushl $64       # trap number (eventually located in tf->trapno)
  jmp alltraps

...
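The shape of the generated stubs can be reproduced with a few lines of C. This is only a sketch of what a generator such as the Perl script might emit, not its actual code: vector_stub and cpu_pushes_error_code are hypothetical names, and the set of vectors for which the CPU pushes its own error code (8, 10 to 14, and 17) is an assumption based on the classic x86 exception list.

```c
#include <stdio.h>
#include <string.h>

/* Vectors for which the CPU itself pushes an error code (assumed set). */
int cpu_pushes_error_code(int i)
{
  return i == 8 || (i >= 10 && i <= 14) || i == 17;
}

/* Write the assembly stub for vector i into buf.
   Returns the value produced by snprintf. */
int vector_stub(int i, char *buf, size_t size)
{
  return snprintf(buf, size,
                  "vector%d:\n%s  pushl $%d\n  jmp alltraps\n",
                  i,
                  cpu_pushes_error_code(i) ? "" : "  pushl $0\n",
                  i);
}
```

The placeholder push keeps the stack layout identical for every trap, so the common handler never has to know whether the hardware supplied an error code.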

`trapasm.S`

The alltraps function saves all the segment registers and then all the other registers (via the pushal instruction) by pushing them onto the stack as it enters the kernel, and then sets the segment registers for the kernel context. It then calls the kernel's trap function, which is the C handler for all traps.

After the return from the trap function, the registers are restored and an iret (interrupt return) is issued, which loads the process's saved state from the stack and returns to the point just following the int 0x40 instruction. In x86_64 assembly, the modern counterparts of int 0x40 and iret are syscall and sysret, which may be about twice as fast.

alltraps:
  # .. save registers ..
  # .. setup segment context ..

  call trap             # Calls the "trap(tf)" C function in trap.c
                        # tf points at the top of the stack (esp)

  # .. restore everything ..

  iret                  # Return to user mode (sysret is the x86_64 counterpart)

`trap.c`

The trap function handles all the "traps": system calls from the user, hardware interrupts, and exceptions. The important part is the beginning, which tests whether the trap number is 0x40 (T_SYSCALL).

void
trap(struct trapframe *tf)
{
  if(tf->trapno == T_SYSCALL){
    if(myproc()->killed)
      exit();
    myproc()->tf = tf;
    syscall();
    if(myproc()->killed)
      exit();
    return;
  }
  ...

struct trapframe (defined in x86.h) describes the stack layout of all the pushed registers of the currently running user-space process; tf points at that stack location.
- Example: tf->eax == old value of the eax register
myproc() is a pseudo-global that returns the currently running process.
exit() shuts down the currently running process (this is the kernel's exit, not the user-space exit(), although it does much the same job).
syscall() is the function that performs the system call, found in syscall.c.
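The relationship between the push order and the fields reachable through tf can be sketched as a C struct. This is a model of the frame, laid out from the lowest address (the last value pushed) upward; the type name trapframe_model and the exact field names are assumptions, so check them against x86.h before relying on them.

```c
#include <stddef.h>
#include <stdint.h>

struct trapframe_model {
  /* Pushed by pushal; edi is pushed last, so it sits at the lowest address. */
  uint32_t edi, esi, ebp, oesp, ebx, edx, ecx, eax;

  /* Segment registers saved by alltraps, each padded to 32 bits. */
  uint16_t gs, pad1, fs, pad2, es, pad3, ds, pad4;

  /* Pushed by the vector stub. */
  uint32_t trapno;
  uint32_t err;       /* real error code, or the placeholder 0 */

  /* Pushed by the CPU when the trap is taken from user mode. */
  uint32_t eip;
  uint16_t cs, pad5;
  uint32_t eflags;
  uint32_t esp;
  uint16_t ss, pad6;
};

/* Read the system call number the way the kernel would: from the saved eax. */
uint32_t saved_syscall_number(const struct trapframe_model *tf)
{
  return tf->eax;
}
```

Because the struct mirrors the stack exactly, writing to a field such as eax changes the value that will be popped back into the register on the way out of the kernel.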

`syscall.c`

At the top of syscall.c is the system call dispatch table, which is an array of function pointers. The system call number that was placed in the eax register (accessed via curproc->tf->eax, i.e. the value of eax pushed to the stack in alltraps) is used as the index into the syscalls array to select the function pointer of the system call to process.

The syscall() function mostly just checks that the system call number is valid and then executes the given function, storing the return value in the eax field of the process's trap frame. That value becomes the return value of the user-space wrapper function when the registers are restored off the stack upon system call return.

static int (*syscalls[])(void) = {
...
[SYS_write]   sys_write,
...
};

void
syscall(void)
{
  ...
  num = curproc->tf->eax;
  if(num > 0 && num < NELEM(syscalls) && syscalls[num]) {
    curproc->tf->eax = syscalls[num]();
  } else {
  ...
}
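The dispatch step can be tried out on a host machine with a cut-down model. In this sketch the trap frame holds only eax, sys_write_stub is a hypothetical stand-in that pretends 14 bytes were written, dispatch is a hypothetical name for the logic shown in syscall(), and returning -1 for an unknown number is an assumption about the elided else branch.

```c
#include <stddef.h>

#define NELEM(x) (sizeof(x) / sizeof((x)[0]))
#define SYS_write 16

/* Cut-down trap frame: only the register used for dispatch. */
struct trapframe_model { unsigned int eax; };

/* Hypothetical stand-in handler. */
int sys_write_stub(void) { return 14; }

/* Sparse table indexed by system call number; unlisted slots are NULL.
   Standard C spells the designated initializer with '='. */
static int (*syscalls[])(void) = {
  [SYS_write] = sys_write_stub,
};

/* Look up the handler named by eax and overwrite eax with its result. */
void dispatch(struct trapframe_model *tf)
{
  int num = (int)tf->eax;
  if (num > 0 && (size_t)num < NELEM(syscalls) && syscalls[num])
    tf->eax = (unsigned int)syscalls[num]();
  else
    tf->eax = (unsigned int)-1;   /* assumed error value */
}
```

The three-part test matters: the range check guards against indexing outside the table, and the final NULL check guards against numbers that fall inside the table but have no handler.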

`sysfile.c`

The sys_write() function itself is defined in the sysfile.c file. It is mostly a wrapper that loads the parameters for the filewrite function off the process's stack using the arg* functions. The arg* functions are covered in more depth in the next lesson.

int
sys_write(void)
{
  struct file *f;
  int n;
  char *p;

  if(argfd(0, 0, &f) < 0 || argint(2, &n) < 0 || argptr(1, &p, n) < 0)
    return -1;
  return filewrite(f, p, n);
}
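How an argument can be recovered from the user stack follows from the right-to-left push order described earlier. The sketch below models the stack as an array of 32-bit words; user_stack, push32, simulate_call_write and arg_int are hypothetical names, the return address is a made-up value, and the "skip one word for the return address, then n words" rule is an assumption about how a helper like argint locates its argument.

```c
#include <stdint.h>

#define STACK_WORDS 8

uint32_t user_stack[STACK_WORDS];
int esp_index;   /* stack pointer, expressed as an index into user_stack */

void push32(uint32_t v) { user_stack[--esp_index] = v; }

/* Lay out the stack as a C caller would for write(fd, buf, n). */
void simulate_call_write(uint32_t fd, uint32_t buf, uint32_t n)
{
  esp_index = STACK_WORDS;
  push32(n);             /* last argument is pushed first */
  push32(buf);
  push32(fd);
  push32(0xdeadbeefu);   /* return address pushed by the call instruction */
}

/* Fetch the n-th 32-bit argument: skip the return address, then n words.
   Returns 0 on success, -1 if the slot lies outside the modelled stack. */
int arg_int(int n, int *out)
{
  int idx = esp_index + 1 + n;
  if (n < 0 || idx >= STACK_WORDS)
    return -1;
  *out = (int)user_stack[idx];
  return 0;
}
```

With the stack set up by simulate_call_write(1, 0x1000, 14), arg_int(0, ...) yields the file descriptor and arg_int(2, ...) yields the byte count, matching the indices that sys_write() passes to argfd and argint.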