
widebright's personal space

// Programming and life

A Study of the Linux Memory Management Mechanism

2010-01-15 09:35:45 | Category: Linux

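I had always been vague about how Linux manages memory. With some free time over the past two days, I carefully read the memory-management material in the Intel manuals, in "Understanding the Linux Kernel", and in "Understanding the Linux Virtual Memory Manager". Paging in particular had never been clear to me; this time, once I saw in the manual how physical addresses are stored in page-table entries, everything clicked, and the rest of the code became easy to follow.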

[Figure: address translation (logical address to linear address to physical address)]

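All of the address translation shown above is performed by the CPU hardware, but software must set up the segment registers used for segmentation and the page tables used for paging.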

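A logical address is also commonly called a "far pointer". It consists of a 16-bit segment selector and a 32-bit offset. The offset is the value of a 32-bit pointer in a C program. The segment selector is held in a segment register such as cs or ds; it is normally not changed by the program itself but set by the system or fixed by the compiler. A logical address goes through two translations (segmentation, then paging) before it becomes a physical address. On an x86 PC the physical address is the same as the bus address: a 32-bit CPU has 32 address pins connected to memory and other devices.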

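Why have two mechanisms, segmentation followed by paging? One benefit is that every program gets a uniform address space: different processes can use the same address, say 0x12345678, and because each process has its own mapping, that address maps to different physical memory in each. Another is that every process gets the full 0-3 GB process address space even if the machine has only 512 MB or 1 GB of physical memory; the system can even move memory you are not using out to disk temporarily. On a 32-bit machine the address space is 2^32 bytes (4 GB), and Linux splits it by convention: linear addresses 0-3 GB are accessible to applications, and 3-4 GB are reserved for the kernel. The kernel has to do a lot of work to maintain the mapping from these addresses to physical addresses, and that work is the "Linux memory management mechanism" of the title.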


[Figure: segmentation]

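The figure above shows how segmentation works. The processor takes the 16-bit segment selector from a segment register, uses its high 13 bits as an index, and uses bit 2 (the third-lowest bit, the TI flag) to choose whether to look up the segment descriptor in the Global Descriptor Table (GDT) or the Local Descriptor Table (LDT). The base linear address and size (limit) of the GDT and LDT are held in the CPU's Global Descriptor Table Register (GDTR) and Local Descriptor Table Register (LDTR), respectively.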

[Figure: segmentation model resembling the one Linux uses]

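Linux, however, uses segmentation only in a limited way, apparently much like the figure above: segmentation serves only for the ring 0 versus ring 3 privilege checks, and logical addresses and linear addresses are equal. The few segments in use, selected by __USER_CS, __USER_DS, __KERNEL_CS, and __KERNEL_DS, all have a base address of 0. See "Understanding the Linux Kernel" for details.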




[Figure: paging]

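The figure above shows paging. The CPU's cr3 register holds the physical address of the first-level page table (the Page Directory). Each entry (a 32-bit integer on a 32-bit machine) contains flag bits plus 20 bits giving the physical address of the next-level page table or page frame. Because a page frame is 4096 (2^12) bytes and is aligned on a 4096-byte boundary, those 20 bits plus the 12-bit offset cover the 32-bit address space exactly. The 10-bit index fields of the linear address mean a table holds at most 1024 entries; at 4 bytes per entry, a table occupies exactly one 4096-byte page frame, which is why a 20-bit frame address is enough to locate it.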
Both the Directory and the Table fields are 10 bits long, so Page Directories and Page Tables can include up to 1,024 entries. It follows that a Page Directory can address up to 1024 x 1024 x 4096 = 2^32 memory cells, as you'd expect in 32-bit addresses.

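Previously I had not understood that each page-table entry holds the physical address of the next-level table, which is why the books never quite made sense to me.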

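The value of cr3 is set by the kernel. As far as I can tell, kernel mappings are all based on one set of tables called the "master kernel page table", with the kernel's fixed mappings handled specially as well. Each user process has its own first-level page table, so cr3 has to be changed on a process switch; the address of that table is kept in the pgd field (of type pgd_t *) of the process's memory descriptor, struct mm_struct.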

Page tables at the second level and below are allocated only when they are actually needed.

9.2.1. Memory Descriptor of Kernel Threads


Kernel threads run only in Kernel Mode, so they never access linear addresses below TASK_SIZE (same as PAGE_OFFSET, usually 0xc0000000). Contrary to regular processes, kernel threads do not use memory regions, therefore most of the fields of a memory descriptor are meaningless for them.

Because the Page Table entries that refer to the linear address above TASK_SIZE should always be identical, it does not really matter what set of Page Tables a kernel thread uses. To avoid useless TLB and cache flushes, a kernel thread uses the set of Page Tables of the last previously running regular process. To that end, two kinds of memory descriptor pointers are included in every process descriptor: mm and active_mm.

The mm field in the process descriptor points to the memory descriptor owned by the process, while the active_mm field points to the memory descriptor used by the process when it is in execution. For regular processes, the two fields store the same pointer. Kernel threads, however, do not own any memory descriptor, thus their mm field is always NULL. When a kernel thread is selected for execution, its active_mm field is initialized to the value of the active_mm of the previously running process (see the section "The schedule( ) Function" in Chapter 7).

There is, however, a small complication. Whenever a process in Kernel Mode modifies a Page Table entry for a "high" linear address (above TASK_SIZE), it should also update the corresponding entry in the sets of Page Tables of all processes in the system. In fact, once set by a process in Kernel Mode, the mapping should be effective for all other processes in Kernel Mode as well. Touching the sets of Page Tables of all processes is a costly operation; therefore, Linux adopts a deferred approach.

We already mentioned this deferred approach in the section "Noncontiguous Memory Area Management" in Chapter 8: every time a high linear address has to be remapped (typically by vmalloc( ) or vfree( )), the kernel updates a canonical set of Page Tables rooted at the swapper_pg_dir master kernel Page Global Directory (see the section "Kernel Page Tables" in Chapter 2). This Page Global Directory is pointed to by the pgd field of a master memory descriptor, which is stored in the init_mm variable.[*]

[*] We mentioned in the section "Kernel Threads" in Chapter 3 that the swapper process uses init_mm during the initialization phase. However, swapper never uses this memory descriptor once the initialization phase completes.

Later, in the section "Handling Noncontiguous Memory Area Accesses," we'll describe how the Page Fault handler takes care of spreading the information stored in the canonical Page Tables when effectively needed.

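From the explanation above, a kernel thread uses the page tables of the regular process that ran most recently, which works because the page-table entries covering the 3-4 GB kernel space are the same in every process.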

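When kernel code wants to access the user process address space (0-3 GB), it can do so directly, even with plain memcpy, because user and kernel addresses are mapped by the same set of page tables. But if it is handed an invalid address, the kernel crashes. The recommended copy_from_user and copy_to_user functions do the work of memcpy and also check that the user-space address is acceptable. If an access still hits an invalid address, the CPU raises a page fault and do_page_fault is called. This function handles cases such as "illegal address" and "page not currently in memory"; if the saved eip shows that the fault came from copy_from_user, it jumps to the fixup code that copy_from_user provides, so that copy_from_user returns an error value. The advantage of copy_from_user over memcpy is that the kernel recovers from the fault and keeps running. When memcpy hits an illegal address, do_page_fault finds no fixup code and the kernel oopses. Since we usually cannot guarantee that an address passed from user space is valid, this is exactly where the copy_from_user family should be used.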

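Several books explain this technique of dynamically fixing up addresses that cause page faults: Chapter 10, "System Calls", of "Understanding the Linux Kernel, 3rd Edition" covers it when discussing access to parameters in the process address space, and Sections 4.5 and 4.7 of "Understanding the Linux Virtual Memory Manager" describe it as well.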

10.4.2. Accessing the Process Address Space

System call service routines often need to read or write data contained in the process's address space. Linux includes a set of macros that make this access easier. We'll describe two of them, called get_user( ) and put_user( ). The first can be used to read 1, 2, or 4 consecutive bytes from an address, while the second can be used to write data of those sizes into an address.

Each function accepts two arguments, a value x to transfer and a variable ptr. The second variable also determines how many bytes to transfer. Thus, in get_user(x,ptr), the size of the variable pointed to by ptr causes the function to expand into a __get_user_1( ), __get_user_2( ), or __get_user_4( ) assembly language function. Let's consider one of them, __get_user_2( ):

    __get_user_2:
        addl $1, %eax
        jc bad_get_user
        movl $0xffffe000, %edx /* or 0xfffff000 for 4-KB stacks */
        andl %esp, %edx
        cmpl 24(%edx), %eax
        jae bad_get_user
    2:  movzwl -1(%eax), %edx
        xorl %eax, %eax
        ret
    bad_get_user:
        xorl %edx, %edx
        movl $-EFAULT, %eax
        ret

The eax register contains the address ptr of the first byte to be read. The first six instructions essentially perform the same checks as the access_ok( ) macro: they ensure that the 2 bytes to be read have addresses less than 4 GB as well as less than the addr_limit.seg field of the current process. (This field is stored at offset 24 in the thread_info structure of current, which appears in the first operand of the cmpl instruction.)

If the addresses are valid, the function executes the movzwl instruction to store the data to be read in the two least significant bytes of edx register while setting the high-order bytes of edx to 0; then it sets a 0 return code in eax and terminates. If the addresses are not valid, the function clears edx, sets the -EFAULT value into eax, and terminates.

The put_user(x,ptr) macro is similar to the one discussed before, except it writes the value x into the process address space starting from address ptr. Depending on the size of x, it invokes either the __put_user_asm( ) macro (size of 1, 2, or 4 bytes) or the __put_user_u64( ) macro (size of 8 bytes). Both macros return the value 0 in the eax register if they succeed in writing the value, and -EFAULT otherwise.

Several other functions and macros are available to access the process address space in Kernel Mode; they are listed in Table 10-1. Notice that many of them also have a variant prefixed by two underscores (__). The ones without initial underscores take extra time to check the validity of the linear address interval requested, while the ones with the underscores bypass that check. Whenever the kernel must repeatedly access the same memory area in the process address space, it is more efficient to check the address once at the start and then access the process area without making any further checks.

Table 10-1. Functions and macros that access the process address space

    Function                                  Action
    get_user, __get_user                      Reads an integer value from user space (1, 2, or 4 bytes)
    put_user, __put_user                      Writes an integer value to user space (1, 2, or 4 bytes)
    copy_from_user, __copy_from_user          Copies a block of arbitrary size from user space
    copy_to_user, __copy_to_user              Copies a block of arbitrary size to user space
    strncpy_from_user, __strncpy_from_user    Copies a null-terminated string from user space
    strlen_user, strnlen_user                 Returns the length of a null-terminated string in user space
    clear_user, __clear_user                  Fills a memory area in user space with zeros


10.4.3. Dynamic Address Checking: The Fix-up Code

As seen previously, access_ok( ) makes a coarse check on the validity of linear addresses passed as parameters of a system call. This check only ensures that the User Mode process is not attempting to fiddle with the kernel address space; however, the linear addresses passed as parameters still might not belong to the process address space. In this case, a Page Fault exception will occur when the kernel tries to use any of such bad addresses.

Before describing how the kernel detects this type of error, let's specify the four cases in which Page Fault exceptions may occur in Kernel Mode. These cases must be distinguished by the Page Fault handler, because the actions to be taken are quite different.

  1. The kernel attempts to address a page belonging to the process address space, but either the corresponding page frame does not exist or the kernel tries to write a read-only page. In these cases, the handler must allocate and initialize a new page frame (see the sections "Demand Paging" and "Copy On Write" in Chapter 9).

  2. The kernel addresses a page belonging to its address space, but the corresponding Page Table entry has not yet been initialized (see the section "Handling Noncontiguous Memory Area Accesses" in Chapter 9). In this case, the kernel must properly set up some entries in the Page Tables of the current process.

  3. Some kernel functions include a programming bug that causes the exception to be raised when that program is executed; alternatively, the exception might be caused by a transient hardware error. When this occurs, the handler must perform a kernel oops (see the section "Handling a Faulty Address Inside the Address Space" in Chapter 9).

  4. The case introduced in this chapter: a system call service routine attempts to read or write into a memory area whose address has been passed as a system call parameter, but that address does not belong to the process address space.

The Page Fault handler can easily recognize the first case by determining whether the faulty linear address is included in one of the memory regions owned by the process. It is also able to detect the second case by checking whether the corresponding master kernel Page Table entry includes a proper non-null entry that maps the address. Let's now explain how the handler distinguishes the remaining two cases.

10.4.4. The Exception Tables

The key to determining the source of a Page Fault lies in the narrow range of calls that the kernel uses to access the process address space. Only the small group of functions and macros described in the previous section are used to access this address space; thus, if the exception is caused by an invalid parameter, the instruction that caused it must be included in one of the functions or else be generated by expanding one of the macros. The number of the instructions that address user space is fairly small.

Therefore, it does not take much effort to put the address of each kernel instruction that accesses the process address space into a structure called the exception table. If we succeed in doing this, the rest is easy. When a Page Fault exception occurs in Kernel Mode, the do_page_fault( ) handler examines the exception table: if it includes the address of the instruction that triggered the exception, the error is caused by a bad system call parameter; otherwise, it is caused by a more serious bug.

Linux defines several exception tables. The main exception table is automatically generated by the C compiler when building the kernel program image. It is stored in the __ex_table section of the kernel code segment, and its starting and ending addresses are identified by two symbols produced by the C compiler: __start___ex_table and __stop___ex_table.

Moreover, each dynamically loaded module of the kernel (see Appendix B) includes its own local exception table. This table is automatically generated by the C compiler when building the module image, and it is loaded into memory when the module is inserted in the running kernel.

Each entry of an exception table is an exception_table_entry structure that has two fields:


insn
    The linear address of an instruction that accesses the process address space

fixup
    The address of the assembly language code to be invoked when a Page Fault exception triggered by the instruction located at insn occurs

The fixup code consists of a few assembly language instructions that solve the problem triggered by the exception. As we will see later in this section, the fix usually consists of inserting a sequence of instructions that forces the service routine to return an error code to the User Mode process. These instructions, which are usually defined in the same macro or function that accesses the process address space, are placed by the C compiler into a separate section of the kernel code segment called .fixup.

The search_exception_tables( ) function is used to search for a specified address in all exception tables: if the address is included in a table, the function returns a pointer to the corresponding exception_table_entry structure; otherwise, it returns NULL. Thus the Page Fault handler do_page_fault( ) executes the following statements:

    if ((fixup = search_exception_tables(regs->eip))) {
        regs->eip = fixup->fixup;
        return 1;
    }

The regs->eip field contains the value of the eip register saved on the Kernel Mode stack when the exception occurred. If the value in the register (the instruction pointer) is in an exception table, do_page_fault( ) replaces the saved value with the address found in the entry returned by search_exception_tables( ). Then the Page Fault handler terminates and the interrupted program resumes with execution of the fixup code.

10.4.5. Generating the Exception Tables and the Fixup Code

The GNU Assembler .section directive allows programmers to specify which section of the executable file contains the code that follows. As we will see in Chapter 20, an executable file includes a code segment, which in turn may be subdivided into sections. Thus, the following assembly language instructions add an entry into an exception table; the "a" attribute specifies that the section must be loaded into memory together with the rest of the kernel image:

    .section __ex_table, "a"
        .long faulty_instruction_address, fixup_code_address
    .previous

The .previous directive forces the assembler to insert the code that follows into the section that was active when the last .section directive was encountered.

Let's consider again the __get_user_1( ), __get_user_2( ), and __get_user_4( ) functions mentioned before. The instructions that access the process address space are those labeled as 1, 2, and 3:

    __get_user_1:
        [...]
    1:  movzbl (%eax), %edx
        [...]
    __get_user_2:
        [...]
    2:  movzwl -1(%eax), %edx
        [...]
    __get_user_4:
        [...]
    3:  movl -3(%eax), %edx
        [...]
    bad_get_user:
        xorl %edx, %edx
        movl $-EFAULT, %eax
        ret
    .section __ex_table,"a"
        .long 1b, bad_get_user
        .long 2b, bad_get_user
        .long 3b, bad_get_user
    .previous

Each exception table entry consists of two labels. The first one is a numeric label with a b suffix to indicate that the label is "backward;" in other words, it appears in a previous line of the program. The fixup code is common to the three functions and is labeled as bad_get_user. If a Page Fault exception is generated by the instructions at label 1, 2, or 3, the fixup code is executed. It simply returns an -EFAULT error code to the process that issued the system call.

Other kernel functions that act in the User Mode address space use the fixup code technique. Consider, for instance, the strlen_user(string) macro. This macro returns either the length of a null-terminated string passed as a parameter in a system call or the value 0 on error. The macro essentially yields the following assembly language instructions:

        movl $0, %eax
        movl $0x7fffffff, %ecx
        movl %ecx, %ebx
        movl string, %edi
    0:  repne; scasb
        subl %ecx, %ebx
        movl %ebx, %eax
    1:
    .section .fixup,"ax"
    2:  xorl %eax, %eax
        jmp 1b
    .previous
    .section __ex_table,"a"
        .long 0b, 2b
    .previous

The ecx and ebx registers are initialized with the 0x7fffffff value, which represents the maximum allowed length for the string in the User Mode address space. The repne;scasb assembly language instructions iteratively scan the string pointed to by the edi register, looking for the value 0 (the end of string \0 character) in eax. Because scasb decreases the ecx register at each iteration, the eax register ultimately stores the total number of bytes scanned in the string (that is, the length of the string).

The fixup code of the macro is inserted into the .fixup section. The "ax" attributes specify that the section must be loaded into memory and that it contains executable code. If a Page Fault exception is generated by the instructions at label 0, the fixup code is executed; it simply loads the value 0 in eax (thus forcing the macro to return a 0 error code instead of the string length) and then jumps to the 1 label, which corresponds to the instruction following the macro.

The second .section directive adds an entry containing the address of the repne; scasb instruction and the address of the corresponding fixup code in the __ex_table section.



=====================================================================

4.5 Exception Handling

A very important part of VM is how kernel address space exceptions, which are not bugs, are caught.[1] This section does not cover the exceptions that are raised with errors such as divide by zero. I am only concerned with the exception raised as the result of a page fault. There are two situations where a bad reference may occur. The first is where a process sends an invalid pointer to the kernel by a system call, which the kernel must be able to safely trap because the only check made initially is that the address is below PAGE_OFFSET. The second is where the kernel uses copy_from_user() or copy_to_user() to read or write data from userspace.

[1] Many thanks go to Ingo Oeser for clearing up the details of how this is implemented.

At compile time, the linker creates an exception table in the __ex_table section of the kernel code segment, which starts at __start___ex_table and ends at __stop___ex_table. Each entry is of type exception_table_entry, which is a pair consisting of an execution point and a fixup routine. When an exception occurs that the page fault handler cannot manage, it calls search_exception_table() to see if a fixup routine has been provided for an error at the faulting instruction. If module support is compiled, each module's exception table will also be searched.

If the address of the current exception is found in the table, the corresponding location of the fixup code is returned and executed. We will see in Section 4.7 how this is used to trap bad reads and writes to userspace.




4.7 Copying to/from Userspace

It is not safe to access memory in the process address space directly because there is no way to quickly check if the page addressed is resident or not. Linux relies on the MMU to raise exceptions when the address is invalid and have the Page Fault Exception handler catch the exception and fix it up. In the x86 case, an assembler is provided by the __copy_user() to trap exceptions where the address is totally useless. The location of the fixup code is found when the function search_exception_table() is called. Linux provides an ample API (mainly macros) for copying data to and from the user address space safely as shown in Table 4.6.


Table 4.6. Accessing Process Address Space API

unsigned long copy_from_user(void *to, const void *from, unsigned long n)
Copies n bytes from the user address(from) to the kernel address space(to).

unsigned long copy_to_user(void *to, const void *from, unsigned long n)
Copies n bytes from the kernel address(from) to the user address space(to).

void copy_user_page(void *to, void *from, unsigned long address)
Copies data to an anonymous or COW page in userspace. Ports are responsible for avoiding D-cache aliases. It can do this by using a kernel virtual address that would use the same cache lines as the virtual address.

void clear_user_page(void *page, unsigned long address)
Similar to copy_user_page(), except it is for zeroing a page.

void get_user(void *to, void *from)
Copies an integer value from userspace (from) to kernel space (to).

void put_user (void *from, void *to)
Copies an integer value from kernel space (from) to userspace (to).

long strncpy_from_user(char *dst, const char *src, long count)
Copies a null terminated string of at most count bytes long from userspace (src) to kernel space (dst).

long strlen_user(const char *s, long n)
Returns the length, upper bound by n, of the userspace string including the terminating NULL.

int access_ok(int type, unsigned long addr, unsigned long size)
Returns nonzero if the userspace block of memory is valid and zero otherwise.




All the macros map on to assembler functions, which all follow similar patterns of implementation. For illustration purposes, we'll just trace how copy_from_user() is implemented on the x86.

If the size of the copy is known at compile time, copy_from_user() calls __constant_copy_from_user(), or __generic_copy_from_user() is used. If the size is known, there are different assembler optimizations to copy data in 1, 2 or 4 byte strides. Otherwise, the distinction between the two copy functions is not important.

The generic copy function eventually calls the function __copy_user_zeroing() in <asm-i386/uaccess.h>, which has three important parts. The first part is the assembler for the actual copying of size number of bytes from userspace. If any page is not resident, a page fault will occur, and, if the address is valid, it will get swapped in as normal. The second part is fixup code, and the third part is the __ex_table mapping the instructions from the first part to the fixup code in the second part.

These pairings, as described in Section 4.5, copy the location of the copy instructions and the location of the fixup code to the kernel exception handle table by the linker. If an invalid address is read, the function do_page_fault() will fall through, call search_exception_table(), find the Extended Instruction Pointer (EIP) where the faulty read took place and jump to the fixup code, which copies zeros into the remaining kernel space, fixes up registers and returns. In this manner, the kernel can safely access userspace with no expensive checks and let the MMU hardware handle the exceptions.

All the other functions that access userspace follow a similar pattern.

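With the above understood, the books' chapters on page frame management, the kernel's array of page descriptors, process address space management, the slab allocator, contiguous allocation with kmalloc, and noncontiguous allocation with vmalloc all become much easier to follow.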
