=============================
No-MMU memory mapping support
=============================

The kernel has limited support for memory mapping under no-MMU conditions, such
as are used in uClinux environments. From the userspace point of view, memory
mapping is used in conjunction with the mmap() system call, the shmat() call
and the execve() system call. From the kernel's point of view, execve()
mapping is actually performed by the binfmt drivers, which call back into the
mmap() routines to do the actual work.

Memory mapping behaviour also involves the way fork(), vfork(), clone() and
ptrace() work. Under uClinux there is no fork(), and clone() must be supplied
the CLONE_VM flag.

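Since fork() is unavailable, userspace normally starts other programs with
vfork() followed immediately by an exec call. The following is a minimal
sketch of that pattern; the helper name ``run_child`` is hypothetical and not
part of any kernel or C library interface.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run a program in a child process without fork().
 * vfork() suspends the parent until the child calls exec or _exit(),
 * so the child never needs its own copy of the address space.
 * Returns the child's exit status, or -1 on failure.
 */
int run_child(char *const argv[])
{
	int status;
	pid_t pid = vfork();

	if (pid < 0)
		return -1;		/* could not create the child */

	if (pid == 0) {
		/* Child: only exec or _exit() are safe here. */
		execvp(argv[0], argv);
		_exit(127);		/* exec failed */
	}

	/* Parent resumes once the child has exec'd or exited. */
	if (waitpid(pid, &status, 0) < 0)
		return -1;

	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child must not return from the calling function or modify the parent's
data, because both run on the same memory until the exec happens.
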
The behaviour is similar between the MMU and no-MMU cases, but not identical;
and it's also much more restricted in the latter case:

 (#) Anonymous mapping, MAP_PRIVATE

	In the MMU case: VM regions backed by arbitrary pages; copy-on-write
	across fork.

	In the no-MMU case: VM regions backed by arbitrary contiguous runs of
	pages.

 (#) Anonymous mapping, MAP_SHARED

	These behave very much like private mappings, except that they're
	shared across fork() or clone() without CLONE_VM in the MMU case. Since
	the no-MMU case supports neither fork() nor clone() without CLONE_VM,
	behaviour there is identical to MAP_PRIVATE.

 (#) File, MAP_PRIVATE, PROT_READ / PROT_EXEC, !PROT_WRITE

	In the MMU case: VM regions backed by pages read from file; changes to
	the underlying file are reflected in the mapping; copied across fork.

	In the no-MMU case:

         - If one exists, the kernel will re-use an existing mapping to the
           same segment of the same file if that has compatible permissions,
           even if this was created by another process.

         - If possible, the file mapping will be directly on the backing device
           if the backing device has the NOMMU_MAP_DIRECT capability and
           appropriate mapping protection capabilities. Ramfs, romfs, cramfs
           and mtd might all permit this.

         - If the backing device can't or won't permit direct sharing,
           but does have the NOMMU_MAP_COPY capability, then a copy of the
           appropriate bit of the file will be read into a contiguous bit of
           memory and any extraneous space beyond the EOF will be cleared.

         - Writes to the file do not affect the mapping; writes to the mapping
           are visible in other processes (no MMU protection), but should not
           happen.

 (#) File, MAP_PRIVATE, PROT_READ / PROT_EXEC, PROT_WRITE

	In the MMU case: like the non-PROT_WRITE case, except that the pages in
	question get copied before the write actually happens. From that point
	on writes to the file underneath that page no longer get reflected into
	the mapping's backing pages. The page is then backed by swap instead.

	In the no-MMU case: works much like the non-PROT_WRITE case, except
	that a copy is always taken and never shared.

 (#) Regular file / blockdev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE

	In the MMU case: VM regions backed by pages read from file; changes to
	pages written back to file; writes to file reflected into pages backing
	mapping; shared across fork.

	In the no-MMU case: not supported.

 (#) Memory backed regular file, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE

	In the MMU case: As for ordinary regular files.

	In the no-MMU case: The filesystem providing the memory-backed file
	(such as ramfs or tmpfs) may choose to honour an open, truncate, mmap
	sequence by providing a contiguous sequence of pages to map. In that
	case, a shared-writable memory mapping will be possible. It will work
	as for the MMU case. If the filesystem does not provide any such
	support, then the mapping request will be denied.

 (#) Memory backed blockdev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE

	In the MMU case: As for ordinary regular files.

	In the no-MMU case: As for memory backed regular files, but the
	blockdev must be able to provide a contiguous run of pages without
	truncate being called. The ramdisk driver could do this if it allocated
	all its memory as a contiguous array upfront.

 (#) Memory backed chardev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE

	In the MMU case: As for ordinary regular files.

	In the no-MMU case: The character device driver may choose to honour
	the mmap() by providing direct access to the underlying device if it
	provides memory or quasi-memory that can be accessed directly. Examples
	of such are frame buffers and flash devices. If the driver does not
	provide any such support, then the mapping request will be denied.

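As an illustration of the read-only private file case above, the following
sketch maps a whole file with MAP_PRIVATE and PROT_READ, which is the
combination most likely to be satisfied either directly or by a copy. The
helper name ``map_file_ro`` is hypothetical.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map a whole file read-only and privately.
 * On success returns the mapping and stores its length in *len_out;
 * on failure returns NULL. The caller releases it with munmap().
 */
void *map_file_ro(const char *path, size_t *len_out)
{
	struct stat st;
	void *p;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return NULL;

	if (fstat(fd, &st) < 0 || st.st_size <= 0) {
		close(fd);
		return NULL;
	}

	/* No address hint and no MAP_FIXED: the kernel picks the address. */
	p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);		/* the mapping stays valid after close() */

	if (p == MAP_FAILED)
		return NULL;

	*len_out = (size_t)st.st_size;
	return p;
}
```

Callers should treat the result as strictly read-only and should not assume
that the returned pointer is page aligned.
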
Further notes on no-MMU MMAP
============================

 (#) A request for a private mapping of a file may return a buffer that is not
     page-aligned.  This is because XIP may take place, and the data may not be
     page aligned in the backing store.

 (#) A request for an anonymous mapping will always be page aligned.  If
     possible, the size of the request should be a power of two; otherwise
     some of the space may be wasted, as the kernel must allocate a power-of-2
     granule but will only discard the excess if appropriately configured, as
     this has an effect on fragmentation.

 (#) The memory allocated by a request for an anonymous mapping will normally
     be cleared by the kernel before being returned in accordance with the
     Linux man pages (ver 2.22 or later).

     In the MMU case this can be achieved with reasonable performance as
     regions are backed by virtual pages, with the contents only being mapped
     to cleared physical pages when a write happens on that specific page
     (prior to which, the pages are effectively mapped to the global zero page
     from which reads can take place).  This spreads out the time it takes to
     initialize the contents of a page - depending on the write-usage of the
     mapping.

     In the no-MMU case, however, anonymous mappings are backed by physical
     pages, and the entire map is cleared at allocation time.  This can cause
     significant delays during a userspace malloc() as the C library does an
     anonymous mapping and the kernel then does a memset for the entire map.

     However, for memory that isn't required to be precleared - such as that
     returned by malloc() - mmap() can take a MAP_UNINITIALIZED flag to
     indicate to the kernel that it shouldn't bother clearing the memory before
     returning it.  Note that CONFIG_MMAP_ALLOW_UNINITIALIZED must be enabled
     to permit this, otherwise the flag will be ignored.

     uClibc uses this to speed up malloc(), and the ELF-FDPIC binfmt uses this
     to allocate the brk and stack region.

 (#) A list of all the private copy and anonymous mappings on the system is
     visible through /proc/maps in no-MMU mode.

 (#) A list of all the mappings in use by a process is visible through
     /proc/<pid>/maps in no-MMU mode.

 (#) Supplying MAP_FIXED or requesting a particular mapping address will
     result in an error.

 (#) Files mapped privately usually have to have a read method provided by the
     driver or filesystem so that the contents can be read into the memory
     allocated if mmap() chooses not to map the backing device directly. An
     error will result if they don't. This is most likely to be encountered
     with character device files, pipes, fifos and sockets.

Interprocess shared memory
==========================

Both SYSV IPC SHM shared memory and POSIX shared memory are supported in NOMMU
mode.  The former through the usual mechanism, the latter through files created
on ramfs or tmpfs mounts.

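A minimal sketch of the SYSV mechanism is shown below. The helper names
``shm_create_attach`` and ``shm_detach_remove`` are hypothetical; only the
standard shmget(), shmat(), shmdt() and shmctl() calls are used.

```c
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Create a private SYSV shared memory segment of len bytes and attach it.
 * The segment id is stored in *id_out so that it can be attached again,
 * for example by a program started with vfork() and exec.
 * Returns NULL on failure.
 */
void *shm_create_attach(size_t len, int *id_out)
{
	void *p;
	int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);

	if (id < 0)
		return NULL;

	p = shmat(id, NULL, 0);		/* let the kernel choose the address */
	if (p == (void *)-1) {
		shmctl(id, IPC_RMID, NULL);
		return NULL;
	}

	*id_out = id;
	return p;
}

/* Detach the segment and mark it for removal once all users detach. */
int shm_detach_remove(void *p, int id)
{
	if (shmdt(p) < 0)
		return -1;
	return shmctl(id, IPC_RMID, NULL);
}
```

The attach address is left to the kernel in both calls, in line with the
restriction on requesting particular mapping addresses.
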
Futexes
=======

Futexes are supported in NOMMU mode if the arch supports them.  An error will
be given if an address passed to the futex system call lies outside the
mappings made by a process or if the mapping in which the address lies does not
support futexes (such as an I/O chardev mapping).

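Because the futex call can fail for the reasons given above, callers should
check its return value. The wrappers below are a sketch only; the names
``futex_wait`` and ``futex_wake`` are hypothetical, and the code assumes an
architecture that provides ``SYS_futex``.

```c
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Thin wrappers around the raw futex system call.
 * Both return -1 with errno set on error.
 */

/* Sleep while *addr == expected; returns 0 when woken. */
long futex_wait(uint32_t *addr, uint32_t expected)
{
	return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Wake up to n waiters on addr; returns the number woken. */
long futex_wake(uint32_t *addr, int n)
{
	return syscall(SYS_futex, addr, FUTEX_WAKE, n, NULL, NULL, 0);
}
```

The futex word should live in memory obtained from an ordinary mapping made by
the process, such as an anonymous or shared memory mapping, rather than in a
device mapping.
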
No-MMU mremap
=============

The mremap() function is partially supported.  It may change the size of a
mapping, and may move it [#]_ if MREMAP_MAYMOVE is specified and if the new size
of the mapping exceeds the size of the slab object currently occupied by the
memory to which the mapping refers, or if a smaller slab object could be used.

MREMAP_FIXED is not supported, though it is ignored if there's no change of
address and the object does not need to be moved.

Shared mappings may not be moved.  Shareable mappings may not be moved either,
even if they are not currently shared.

The mremap() function must be given an exact match for base address and size of
a previously mapped object.  It may not be used to create holes in existing
mappings, move parts of existing mappings or resize parts of mappings.  It must
act on a complete mapping.

.. [#] Not currently supported.

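A sketch of a resize that respects these rules is shown below: it always
passes the exact base address and length of a complete mapping. The helper
name ``resize_mapping`` is hypothetical. As moving is noted above as not
currently supported, a caller on a no-MMU kernel should be prepared for the
call to fail and fall back to allocating a new mapping and copying.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for mremap() and MREMAP_MAYMOVE */
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Resize a whole anonymous private mapping.
 * base and old_len must be exactly the address mmap() returned and the
 * length it was asked for; resizing a sub-range is not attempted.
 * Returns the (possibly moved) mapping, or NULL on failure, in which
 * case the old mapping is left untouched.
 */
void *resize_mapping(void *base, size_t old_len, size_t new_len)
{
	void *p = mremap(base, old_len, new_len, MREMAP_MAYMOVE);

	return p == MAP_FAILED ? NULL : p;
}
```
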
Providing shareable character device support
============================================

To provide shareable character device support, a driver must provide a
file->f_op->get_unmapped_area() operation. The mmap() routines will call this
to get a proposed address for the mapping. This may return an error if it
doesn't wish to honour the mapping because it's too long, at a weird offset,
under some unsupported combination of flags, or for any other reason.

The driver should also provide backing device information with capabilities set
to indicate the permitted types of mapping on such devices. The default is
assumed to be readable and writable, not executable, and only shareable
directly (can't be copied).

The file->f_op->mmap() operation will be called to actually inaugurate the
mapping. It can be rejected at that point. Returning the ENOSYS error will
cause the mapping to be copied instead if NOMMU_MAP_COPY is specified.

The vm_ops->close() routine will be invoked when the last mapping on a chardev
is removed. An existing mapping will be shared, partially or not, if possible
without notifying the driver.

It is permitted also for the file->f_op->get_unmapped_area() operation to
return -ENOSYS. This will be taken to mean that this operation just doesn't
want to handle it, despite the fact it's got an operation. For instance, it
might try directing the call to a secondary driver which turns out not to
implement it. Such is the case for the framebuffer driver which attempts to
direct the call to the device-specific driver. Under such circumstances, the
mapping request will be rejected if NOMMU_MAP_COPY is not specified, and a
copy mapped otherwise.

.. important::

	Some types of device may present a different appearance to anyone
	looking at them in certain modes. Flash chips can be like this; for
	instance if they're in programming or erase mode, you might see the
	status reflected in the mapping, instead of the data.

	In such a case, care must be taken lest userspace see a shared or a
	private mapping showing such information when the driver is busy
	controlling the device. Remember especially: private executable
	mappings may still be mapped directly off the device under some
	circumstances!

Providing shareable memory-backed file support
==============================================

Provision of shared mappings on memory backed files is similar to the provision
of support for shared mapped character devices. The main difference is that the
filesystem providing the service will probably allocate a contiguous collection
of pages and permit mappings to be made on that.

It is recommended that a truncate operation applied to such a file that
increases the file size, if that file is empty, be taken as a request to gather
enough pages to honour a mapping. This is required to support POSIX shared
memory.

Memory backed devices are indicated by the mapping's backing device info having
the memory_backed flag set.

Providing shareable block device support
========================================

Provision of shared mappings on block device files is exactly the same as for
character devices. If there isn't a real device underneath, then the driver
should allocate sufficient contiguous memory to honour any supported mapping.

Adjusting page trimming behaviour
=================================

NOMMU mmap automatically rounds up to the nearest power-of-2 number of pages
when performing an allocation.  This can have adverse effects on memory
fragmentation, and as such, is left configurable.  The default behaviour is to
aggressively trim allocations and discard any excess pages back into the page
allocator.  In order to retain finer-grained control over fragmentation, this
behaviour can either be disabled completely, or bumped up to a higher page
watermark where trimming begins.

Page trimming behaviour is configurable via the sysctl ``vm.nr_trim_pages``.

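The effect of the watermark can be modelled with a little arithmetic. The
sketch below is a userspace illustration only, not kernel code. It assumes
that a value of 0 disables trimming and that any other value trims the excess
once it reaches that many pages; the function name ``pages_retained`` is
hypothetical.

```c
#include <stddef.h>

/*
 * Model of how many pages a no-MMU anonymous mapping of `pages` pages
 * ends up holding. The allocator hands out a power-of-2 number of pages;
 * the excess is given back only when trimming is enabled
 * (nr_trim_pages != 0) and the excess is at least nr_trim_pages.
 * This rule is an assumption made for illustration. `pages` is assumed
 * to be far below SIZE_MAX / 2.
 */
size_t pages_retained(size_t pages, size_t nr_trim_pages)
{
	size_t granule = 1;
	size_t excess;

	if (pages == 0)
		return 0;

	while (granule < pages)
		granule <<= 1;		/* round up to a power of two */

	excess = granule - pages;
	if (nr_trim_pages != 0 && excess >= nr_trim_pages)
		return pages;		/* excess goes back to the allocator */

	return granule;			/* whole granule stays allocated */
}
```

Under this model a 5-page request keeps 5 pages with a watermark of 1, but
keeps the full 8-page granule when the watermark is 0 or 4.
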