
widebright's personal space

// Programming and life

 
 

An explanation of system calls on Linux (repost)

2008-08-07 23:16:23 | Category: Linux

Original: http://www.win.tue.nl/~aeb/linux/lk/lk-4.html

This is part of "The Linux Kernel" by Andries Brouwer: http://www.win.tue.nl/~aeb/linux/lk/lk.html#toc4



4. System Calls

4.1 System call numbers

System calls are identified by their numbers. The number of the call foo is __NR_foo. For example, the number of _llseek used above is __NR__llseek, defined as 140 in /usr/include/asm-i386/unistd.h. Different architectures have different numbers.

Often, the kernel routine that handles the call foo is called sys_foo. One finds the association between numbers and names in the sys_call_table, for example in arch/i386/kernel/entry.S.
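The numbers can also be used directly from user space. A minimal sketch, assuming a Linux system with glibc, where <sys/syscall.h> provides SYS_foo as an alias for __NR_foo and syscall() invokes a call by number. The helper name raw_getpid is made up for this illustration.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid, an alias for __NR_getpid */
#include <unistd.h>        /* syscall() */

/* Invoke getpid by its number instead of through the libc wrapper.
   The value of SYS_getpid differs between architectures. */
long raw_getpid(void) {
    return syscall(SYS_getpid);
}
```

Calling raw_getpid() should give the same value as getpid(), whatever number the architecture assigns to the call.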

Change

The world changes and system calls change. Since one must not break old binaries, the semantics associated with any given system call number must remain fully backwards compatible.

What happens in practice is one of two things: either one gets a new and improved system call with a new name and number, and the libc routine that used to invoke the old call is changed to use the new one, or the new call (with new number) gets the old name, and the old call gets "old" prefixed to its name.

For example, long ago user IDs had 16 bits, today they have 32. __NR_getuid is 24, and __NR_getuid32 is 199, and the former belongs to the 16-bit version of the call, the latter to the 32-bit version. Looking at the associated kernel routines, we find that these are sys_getuid16 and sys_getuid, respectively. (Thus, sys_getuid does not have number __NR_getuid.) Looking at glibc, we find code somewhat like

    int getuid32_available = UNKNOWN;

    uid_t getuid(void) {
            if (getuid32_available == TRUE)
                    return INLINE_SYSCALL(getuid32, 0);

            if (getuid32_available == UNKNOWN) {
                    uid_t res = INLINE_SYSCALL(getuid32, 0);

                    if (res == 0 || errno != ENOSYS) {
                            getuid32_available = TRUE;
                            return res;
                    }
                    getuid32_available = FALSE;
            }
            return INLINE_SYSCALL(getuid, 0);
    }

For an example where the name was moved and the old call got a name prefixed by "old", see __NR_oldolduname, __NR_olduname, __NR_uname, belonging to sys_olduname, sys_uname, sys_newuname, respectively. One also has __NR_oldstat, __NR_stat, __NR_stat64 belonging to sys_stat, sys_newstat, sys_stat64, respectively. And __NR_umount, __NR_umount2 belonging to sys_oldumount, sys_umount, respectively. And __NR_select, __NR__newselect belonging to old_select, sys_select, respectively.

These moving names are confusing. Now you have been warned: the system call with number __NR_foo does not always belong to the kernel routine sys_foo().

4.2 The call

What happens? The assembler for a call with 0 parameters (on i386) is

    #define _syscall0(type,name) \
    type name(void) \
    { \
    long __res; \
    __asm__ volatile ("int $0x80" \
            : "=a" (__res) \
            : "0" (__NR_##name)); \
    __syscall_return(type,__res); \
    }

Thus, the basic ingredient is the assembler instruction INT 0x80. This causes a programmed exception and calls the kernel system_call routine. Some relevant code fragments:

    /* include/asm-i386/hw_irq.h */
    #define SYSCALL_VECTOR              0x80

    /* arch/i386/kernel/traps.c */
            set_system_gate(SYSCALL_VECTOR,&system_call);

    /* arch/i386/kernel/entry.S */
    #define GET_CURRENT(reg) \
            movl $-8192, reg; \
            andl %esp, reg

    #define SAVE_ALL \
            cld; \
            pushl %es; \
            pushl %ds; \
            pushl %eax; \
            pushl %ebp; \
            pushl %edi; \
            pushl %esi; \
            pushl %edx; \
            pushl %ecx; \
            pushl %ebx; \
            movl $(__KERNEL_DS),%edx; \
            movl %edx,%ds; \
            movl %edx,%es;

    #define RESTORE_ALL     \
            popl %ebx;      \
            popl %ecx;      \
            popl %edx;      \
            popl %esi;      \
            popl %edi;      \
            popl %ebp;      \
            popl %eax;      \
    1:      popl %ds;       \
    2:      popl %es;       \
            addl $4,%esp;   \
    3:      iret;

    ENTRY(system_call)
            pushl %eax                      # save orig_eax
            SAVE_ALL
            GET_CURRENT(%ebx)
            testb $0x02,tsk_ptrace(%ebx)    # PT_TRACESYS
            jne tracesys
            cmpl $(NR_syscalls),%eax
            jae badsys
            call *SYMBOL_NAME(sys_call_table)(,%eax,4)
            movl %eax,EAX(%esp)             # save the return value
    ENTRY(ret_from_sys_call)
            cli                             # need_resched and signals atomic test
            cmpl $0,need_resched(%ebx)
            jne reschedule
            cmpl $0,sigpending(%ebx)
            jne signal_return
            RESTORE_ALL

We transfer execution to system_call, save the original value of the EAX register (it is the number of the system call), save all other registers, verify that we are not being traced (otherwise the tracer must be informed and entirely different things happen), make sure that the system call number is within range, and call the appropriate kernel routine from the table sys_call_table. Upon return we check a few things and when all is well restore the registers and call IRET to return from this INT.

(This was for the i386 architecture. All details differ on other architectures, but the basic idea is the same: store the syscall number and the syscall parameters somewhere the kernel can find them, in registers, on the stack, or in a known place of memory, do something that causes a transfer to kernel code, etc.)

4.3 System call parameters

On i386, the parameters of a system call are transported via registers. The system call number goes into %eax, the first parameter in %ebx, the second in %ecx, the third in %edx, the fourth in %esi, the fifth in %edi, the sixth in %ebp.
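To see a call with parameters without writing architecture-specific assembler, one can go through glibc's syscall(), which places the number and the arguments wherever the current architecture expects them. A minimal sketch, assuming Linux with glibc; raw_write is a name made up for this illustration.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* write(fd, buf, count): the call number plus three parameters.
   On i386 these would travel in %eax, %ebx, %ecx and %edx;
   syscall() hides that detail. */
long raw_write(int fd, const char *s) {
    return syscall(SYS_write, fd, s, strlen(s));
}
```

For example, raw_write(1, "hello\n") writes the string to standard output and returns the number of bytes written.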

Ancient history

Earlier versions of Linux could handle only four or five system call parameters, and therefore the system calls select() (5 parameters) and mmap() (6 parameters) used to have a single parameter that was a pointer to a parameter block in memory. Since Linux 1.3.0 five parameters are supported (and the earlier select with memory block was renamed old_select), and since Linux 2.3.31 six parameters are supported (and the earlier mmap with memory block was succeeded by the new mmap2).

4.4 Error return

Above we said: typically, the kernel returns a negative value to indicate an error. But this would mean that any system call can return only non-negative values. Since the negative error returns are of the form -ESOMETHING, and the error numbers have small positive values, there is only a small negative error range. Thus

    #define __syscall_return(type, res) \
    do { \
            if ((unsigned long)(res) >= (unsigned long)(-125)) { \
                    errno = -(res); \
                    res = -1; \
            } \
            return (type) (res); \
    } while (0)

Here the range [-125,-1] is reserved for errors (the constant 125 is version and architecture dependent) and other values are OK.
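The effect of this macro can be modelled in user space. The following is only an illustration of the range test, not kernel or glibc code; the function name model_syscall_return is made up, and the constant 125 is taken from the macro quoted above.

```c
#include <errno.h>

/* Hypothetical user-space model of __syscall_return: map a raw kernel
   return value to the libc convention (-1 plus errno). */
long model_syscall_return(long res) {
    /* The unsigned comparison is true exactly for res in [-125,-1]. */
    if ((unsigned long)res >= (unsigned long)(-125)) {
        errno = (int)-res;
        return -1;
    }
    return res;
}
```

A raw return of -ENOENT becomes -1 with errno set to ENOENT, while values outside the reserved range, including large negative ones such as -126, pass through unchanged.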

What if a system call wants to return a small negative number and it is not an error? The scheduling priority of a process is set by setpriority() and read by getpriority(), and this value ranges from -20 (top priority) to 19 (lowest priority background job). The library routines with these names use these numbers, but the system call getpriority() returns 20 - P instead of P, moving the output interval to positive numbers only.
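The shifted interval can be observed by bypassing the library routine. A minimal sketch, assuming Linux with glibc on an architecture where the raw call behaves as described (the text above describes i386; some architectures may differ). The names raw_getpriority and nice_from_raw are made up for this illustration.

```c
#define _GNU_SOURCE
#include <sys/resource.h>  /* PRIO_PROCESS, getpriority() */
#include <sys/syscall.h>
#include <unistd.h>

/* Raw system call for the calling process: expected to return 20 - P. */
long raw_getpriority(void) {
    return syscall(SYS_getpriority, PRIO_PROCESS, 0);
}

/* Undo the kernel's offset, as the library routine does. */
int nice_from_raw(long raw) {
    return (int)(20 - raw);
}
```

With these, nice_from_raw(raw_getpriority()) should agree with getpriority(PRIO_PROCESS, 0).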

Or, similarly, the subfunctions PEEK* of ptrace return the contents of a memory word in the traced process, and any value is possible. However, the system call returns this value in the data argument, and glibc does something like

            res = sys_ptrace(request, pid, addr, &data);
            if (res >= 0) {
                    errno = 0;
                    res = data;
            }
            return res;

so that a user program has to do

            errno = 0;
            res = ptrace(PTRACE_PEEKDATA, pid, addr, NULL);
            if (res == -1 && errno != 0)
                    /* error */

4.5 Interrupted system calls

Above we saw in ret_from_sys_call the test on sigpending: if a signal arrived while we were executing kernel code, then just before returning from the system call we first call the user program's signal handler, and when this finishes return from the system call.

When a system call is slow and a signal arrives while it was blocked, waiting for something, the call is aborted and returns -EINTR, so that the library function will return -1 and set errno to EINTR. Just before the system call returns, the user program's signal handler is called.

(So, what is "slow"? Mostly those calls that can block forever waiting for external events: wait, pause, and read and write on terminal devices, but not read and write on disk devices.)

This means that a system call can return an error while nothing was wrong. Usually one will want to redo the system call. That can be automated by installing the signal handler using a call to sigaction with the SA_RESTART flag set. The effect is that upon an interrupt the system call is aborted, the user program's signal handler is called, and afterwards the system call is restarted from the beginning.
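Without SA_RESTART, redoing the call is left to the program. A minimal sketch of that manual retry, using only read() and errno; the helper name read_retry is made up for this illustration.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: repeat read() for as long as it fails with EINTR.
   Any other result (data, end of file, a real error) is returned as is. */
ssize_t read_retry(int fd, void *buf, size_t count) {
    ssize_t n;

    do {
        n = read(fd, buf, count);
    } while (n == -1 && errno == EINTR);
    return n;
}
```

The difference with SA_RESTART is that here the program decides per call whether an interruption should be retried, which leaves room for testing a flag set by the signal handler between attempts.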

Why is this not the default? It was, for a while, but often it is necessary to react to a signal while the reacting is not done by the signal handler itself. It is difficult to do nontrivial things in a signal handler since the rest of the program is in an unknown state, and most signal handlers just set a flag that is tested elsewhere.

A demo:

    #include <stdio.h>
    #include <errno.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <signal.h>

    /* set by the handler, tested in main */
    volatile sig_atomic_t got_interrupt;

    void intrup(int dummy) {
            got_interrupt = 1;
    }

    void die(char *s) {
            printf("%s\n", s);
            exit(1);
    }

    int main() {
            struct sigaction sa;
            int n;
            char c;

            sa.sa_handler = intrup;
            sigemptyset(&sa.sa_mask);

            /* SIGINT (Ctrl-C): no SA_RESTART, so read fails with EINTR */
            sa.sa_flags = 0;
            if (sigaction(SIGINT, &sa, NULL))
                    die("sigaction-SIGINT");

            /* SIGQUIT (Ctrl-\): SA_RESTART, so read is restarted */
            sa.sa_flags = SA_RESTART;
            if (sigaction(SIGQUIT, &sa, NULL))
                    die("sigaction-SIGQUIT");

            got_interrupt = 0;
            n = read(0, &c, 1);
            if (n == -1 && errno == EINTR)
                    printf("read call was interrupted\n");
            else if (got_interrupt)
                    printf("read call was restarted\n");
            return 0;
    }

Here Ctrl-C will interrupt the read call, while after Ctrl-\ the read call is restarted.

4.6 Sysenter and the vsyscall page

It has been observed that a 2 GHz Pentium 4 was much slower than an 850 MHz Pentium III on certain tasks, and that this slowness is caused by the very large overhead of the traditional int 0x80 interrupt on a Pentium 4.

Some models of the i386 family do have faster ways to enter the kernel. On Pentium II there is the sysenter instruction. Also AMD has a syscall instruction. It would be good if these could be used.

Something else is that in some applications gettimeofday() is done very often, for example for timestamping all transactions. It would be nice if it could be implemented with very low overhead.

One way of obtaining a fast gettimeofday() is by writing the current time in a fixed place, on a page mapped into the memory of all applications, and updating this location on each clock interrupt. These applications could then read this fixed location with a single instruction, with no system call required.

There might be other data that the kernel could make available in a read-only way to the process, like perhaps the current process ID. A vsyscall is a "system" call that avoids crossing the userspace-kernel boundary.

Linux is in the process of implementing such ideas. Since Linux 2.5.53 there is a fixed page, called the vsyscall page, filled by the kernel. At kernel initialization time the routine sysenter_setup() is called. It sets up a non-writable page and writes code for the sysenter instruction if the CPU supports that, and for the classical int 0x80 otherwise. Thus, the C library can use the fastest type of system call by jumping to a fixed address in the vsyscall page.

Concerning gettimeofday(), a vsyscall version for the x86-64 is already part of the vanilla kernel. Patches for i386 exist. (An example of the kind of timing differences: John Stultz reports on an experiment where he measures gettimeofday() and finds 1.67 us for the int 0x80 way, 1.24 us for the sysenter way, and 0.88 us for the vsyscall.)

Some details

The kernel maps a page (0xffffe000-0xffffefff) in the memory of every process. (This is the last but one addressable page. The last is not mapped, maybe to avoid bugs related to wraparound.) We can read it:

    /* get vsyscall page */
    #include <unistd.h>
    #include <string.h>

    int main() {
            char *p = (char *) 0xffffe000;
            char buf[4096];
    #if 0
            write(1, p, 4096);
            /* this gives EFAULT */
    #else
            memcpy(buf, p, 4096);
            write(1, buf, 4096);
    #endif
            return 0;
    }

and if we do, we find an ELF binary.

    % ./get_vsyscall_page > syspage
    % file syspage
    syspage: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), stripped
    % objdump -h syspage

    syspage:     file format elf32-i386

    Sections:
    Idx Name          Size      VMA       LMA       File off  Algn
      0 .hash         00000050  ffffe094  ffffe094  00000094  2**2
                      CONTENTS, ALLOC, LOAD, READONLY, DATA
      1 .dynsym       000000f0  ffffe0e4  ffffe0e4  000000e4  2**2
                      CONTENTS, ALLOC, LOAD, READONLY, DATA
      2 .dynstr       00000056  ffffe1d4  ffffe1d4  000001d4  2**0
                      CONTENTS, ALLOC, LOAD, READONLY, DATA
      3 .gnu.version  0000001e  ffffe22a  ffffe22a  0000022a  2**1
                      CONTENTS, ALLOC, LOAD, READONLY, DATA
      4 .gnu.version_d 00000038  ffffe248  ffffe248  00000248  2**2
                      CONTENTS, ALLOC, LOAD, READONLY, DATA
      5 .text         00000047  ffffe400  ffffe400  00000400  2**5
                      CONTENTS, ALLOC, LOAD, READONLY, CODE
      6 .eh_frame_hdr 00000024  ffffe448  ffffe448  00000448  2**2
                      CONTENTS, ALLOC, LOAD, READONLY, DATA
      7 .eh_frame     0000010c  ffffe46c  ffffe46c  0000046c  2**2
                      CONTENTS, ALLOC, LOAD, READONLY, DATA
      8 .dynamic      00000078  ffffe578  ffffe578  00000578  2**2
                      CONTENTS, ALLOC, LOAD, DATA
      9 .useless      0000000c  ffffe5f0  ffffe5f0  000005f0  2**2
                      CONTENTS, ALLOC, LOAD, DATA
    % objdump -d syspage

    syspage:     file format elf32-i386

    Disassembly of section .text:

    ffffe400 <.text>:
    ffffe400:       51                      push   %ecx
    ffffe401:       52                      push   %edx
    ffffe402:       55                      push   %ebp
    ffffe403:       89 e5                   mov    %esp,%ebp
    ffffe405:       0f 34                   sysenter
    ffffe407:       90                      nop
    ffffe408:       90                      nop
            ... more nops ...
    ffffe40d:       90                      nop
    ffffe40e:       eb f3                   jmp    0xffffe403
    ffffe410:       5d                      pop    %ebp
    ffffe411:       5a                      pop    %edx
    ffffe412:       59                      pop    %ecx
    ffffe413:       c3                      ret
            ... zero bytes ...
    ffffe420:       58                      pop    %eax
    ffffe421:       b8 77 00 00 00          mov    $0x77,%eax
    ffffe426:       cd 80                   int    $0x80
    ffffe428:       90                      nop
    ffffe429:       90                      nop
            ... more nops ...
    ffffe43f:       90                      nop
    ffffe440:       b8 ad 00 00 00          mov    $0xad,%eax
    ffffe445:       cd 80                   int    $0x80

The interesting addresses here are found via

    % grep ffffe System.map
    ffffe000 A VSYSCALL_BASE
    ffffe400 A __kernel_vsyscall
    ffffe410 A SYSENTER_RETURN
    ffffe420 A __kernel_sigreturn
    ffffe440 A __kernel_rt_sigreturn
    %

So __kernel_vsyscall pushes a few registers and does a sysenter instruction. And SYSENTER_RETURN pops the registers again and returns. And __kernel_sigreturn and __kernel_rt_sigreturn do system calls 119 and 173, that is, sigreturn and rt_sigreturn, respectively.

What about the jump just before SYSENTER_RETURN? It is a trick to handle restarting of system calls with 6 parameters. As Linus said: I'm a disgusting pig, and proud of it to boot.

The code involved is most easily seen from a slightly earlier patch.

A tiny demo program.

    #include <stdio.h>

    int pid;

    int main() {
            __asm__(
                    "movl $20, %eax    \n"
                    "call 0xffffe400   \n"
                    "movl %eax, pid    \n"
            );
            printf("pid is %d\n", pid);
            return 0;
    }

This does the getpid() system call (__NR_getpid is 20) using call 0xffffe400 instead of int 0x80.

However, the proper thing to do is not call 0xffffe400 but call *%gs:0x18. If %gs has been set up so that it addresses 0xffffe000, then at location 0xffffe018 we find the value of __kernel_vsyscall, the entry point of the kernel vsyscalls. Such general setup requires the parsing of the ELF headers of this vsyscall page, but then is future-proof.
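The first step of such a general setup, finding the page without hard-coding 0xffffe000, can be sketched as follows. This assumes a glibc that provides getauxval() (2.16 or later) and a kernel that passes the address of its ELF image in the AT_SYSINFO_EHDR entry of the auxiliary vector; neither is described in the text above. The helper names kernel_elf_page and looks_like_elf are made up for this illustration, and the parsing of the ELF headers themselves is not shown.

```c
#include <elf.h>        /* ELFMAG, SELFMAG */
#include <stddef.h>
#include <string.h>
#include <sys/auxv.h>   /* getauxval(), AT_SYSINFO_EHDR */

/* Return the address of the kernel-provided ELF image, or NULL if the
   kernel did not advertise one in the auxiliary vector. */
const void *kernel_elf_page(void) {
    return (const void *)getauxval(AT_SYSINFO_EHDR);
}

/* Check for the ELF magic bytes at the start of the page. */
int looks_like_elf(const void *page) {
    return page != NULL && memcmp(page, ELFMAG, SELFMAG) == 0;
}
```

If kernel_elf_page() returns a non-NULL address, looks_like_elf() should confirm what the file command reported earlier: the page starts with an ELF header, from which symbols such as __kernel_vsyscall can be looked up.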

