Title: linux 2.6.25 changelog Author: greek_zjb Time: 2008-05-03 18:42
Linux 2.6.25
Linux kernel version 2.6.25. Released 17 April 2008 (full SCM git log).
Summary: 2.6.25 includes support for a new architecture (MN10300/AM33) and the widely used Orion SoCs, a new interface for more accurate measurement of process memory usage, a 'memory resource controller' for controlling the memory usage of groups of processes, realtime group scheduling, a tool for measuring high latencies called latencytop, ACPI thermal regulation, timer event notifications through file descriptors, an alternative MAC security framework called SMACK, an ext4 update, BRK and PIE-executable address space randomization, RCU preemption support, FIFO spinlocks in x86, EFI support in x86-64, a new network protocol called CAN, initial ATI r500 DRI/DRM support, the beginning of the end for tasks stuck in D state, improved device support and many other small improvements.
1. Important features (AKA: the cool stuff)
1.1. Memory Resource Controller
Recommended LWN article (somewhat outdated, but still interesting): "Controlling memory use in containers"
The memory resource controller is a cgroups-based feature. Cgroups, aka "Control Groups", is a feature that was merged in 2.6.24, and its purpose is to be a generic framework where several "resource controllers" can plug in and manage different resources of the system such as process scheduling or memory allocation. It also offers a unified user interface, based on a virtual filesystem where administrators can assign arbitrary resource constraints to a group of chosen tasks. For example, two resource controllers were merged in 2.6.24: Cpusets and Group Scheduling. The first binds CPU and memory nodes to an arbitrarily chosen group of tasks, aka a cgroup, and the second binds a CPU bandwidth policy to the cgroup.
The memory resource controller isolates the memory behavior of a group of tasks (a cgroup) from the rest of the system. It can be used to:
Isolate an application or a group of applications. Memory-hungry applications can be isolated and limited to a smaller amount of memory.
Create a cgroup with a limited amount of memory; this can be used as a good alternative to booting with mem=XXXX.
Let virtualization solutions control the amount of memory they want to assign to a virtual machine instance.
Let a CD/DVD burner control the amount of memory used by the rest of the system to ensure that burning does not fail due to lack of available memory.
The configuration interface, like that of all cgroup controllers, works by mounting the cgroup filesystem with the "-o memory" option, creating an arbitrarily named directory (the cgroup), adding tasks to the cgroup by writing their PIDs to the 'tasks' file inside the cgroup directory, and reading or writing the following files: 'memory.limit_in_bytes', 'memory.usage_in_bytes' (memory statistic for the cgroup), 'memory.stat' (more statistics: RSS, caches, inactive/active pages), 'memory.failcnt' (number of times that the cgroup exceeded the limit), and 'mem_control_type'. OOM conditions are also handled per cgroup: when the tasks in a cgroup exceed its limit, the OOM killer is invoked to kill one of the tasks in that specific cgroup.
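As a rough illustration of the steps above, the following Python sketch sets a limit on an existing cgroup and moves one task into it. The helper names `configure_memory_cgroup` and `read_usage` are hypothetical; the sketch assumes the cgroup filesystem is already mounted with "-o memory", that the cgroup directory has already been created, and that the task list file is named `tasks`.

```python
import os


def configure_memory_cgroup(cgroup_dir, limit_bytes, pid):
    """Set a memory limit on a cgroup and move one task into it.

    Hypothetical helper: assumes cgroup_dir is a directory created under
    a cgroup filesystem mounted with "-o memory".
    """
    # Writing a number to memory.limit_in_bytes sets the cgroup's limit.
    with open(os.path.join(cgroup_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))
    # Writing a PID to the tasks file moves that task into the cgroup.
    with open(os.path.join(cgroup_dir, "tasks"), "w") as f:
        f.write(str(pid))


def read_usage(cgroup_dir):
    """Return the cgroup's current memory usage in bytes."""
    with open(os.path.join(cgroup_dir, "memory.usage_in_bytes")) as f:
        return int(f.read().strip())
```

With a hypothetical mount point, a call could look like `configure_memory_cgroup("/cgroups/burner", 64 * 1024 * 1024, os.getpid())`; writing to these files normally requires root privileges.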
Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
1.2. Real Time Group scheduling
Group scheduling is a feature introduced in 2.6.24. It allows assigning process scheduling priorities by means other than nice levels. For example, given two users on a system, you may want to assign 50% of CPU time to each one, regardless of how many processes each one is running (traditionally, if one user is running, for example, 10 CPU-bound processes and the other user only 1, the latter would be starved of CPU time); this is what the "group tasks by user id" configuration option of Group Scheduling does. You may also want to create arbitrary groups of tasks and give them CPU time privileges; this is what the "group tasks by Control Groups" option does, basing its configuration interface on cgroups (a feature introduced in 2.6.24 and described in the "Memory Resource Controller" section).
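The two-user example can be worked through with a small Python sketch. It is a simplified model, not the scheduler's algorithm: it assumes a single CPU, equal nice levels, and processes that are always runnable, and the function name `cpu_share_per_process` is made up for illustration.

```python
def cpu_share_per_process(processes_per_user, group_by_user):
    """Return the CPU fraction each user's individual process receives.

    Simplified model: one CPU, equal nice levels, every process is
    CPU-bound and always runnable, every user has at least one process.
    """
    total = sum(processes_per_user.values())
    users = len(processes_per_user)
    shares = {}
    for user, count in processes_per_user.items():
        if group_by_user:
            # Each user gets 1/users of the CPU, split among their processes.
            shares[user] = 1.0 / users / count
        else:
            # Every process gets an equal slice regardless of its owner.
            shares[user] = 1.0 / total
    return shares
```

With 10 processes for one user and 1 for the other, the ungrouped model gives every process 1/11 of the CPU, so the second user gets about 9% in total; with grouping by user, the single process gets 50% and each of the other user's 10 processes gets 5%.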
Those are the two working modes of Group Scheduling. Additionally, there are several types of tasks. What 2.6.25 adds to Group Scheduling is the ability to also handle real-time (aka SCHED_RT) processes. This makes it much easier to handle RT tasks and give them scheduling guarantees.
Documentation: sched-rt-group.txt
Code: (commit 1, 2, 3, 4)
There's serious interest in running RT tasks on enterprise-class hardware, so a large number of enhancements to the RT scheduling class and load-balancer have been merged to provide optimum behaviour for RT tasks.
Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9)
1.3. RCU Preemption support
Recommended LWN article: "The design of preemptible read-copy-update"
RCU is a very powerful locking scheme used in Linux to scale to a very large number of CPUs on a single system. However, it wasn't well suited for low-latency, RT-ish workloads, and some parts could cause high latency. In 2.6.25, RCU can be preempted, eliminating that source of latency and making Linux a bit more RT-ish.
Code: (commit 1, 2)
1.4. FIFO ticket spinlocks in x86
Recommended LWN article: "Ticket spinlocks"
In certain workloads, spinlocks can be unfair, i.e. a process spinning on a spinlock can be starved up to 1,000,000 times. Starvation in spinlocks was usually not considered a problem: it was thought that such a spinlock would become a performance problem before any starvation was noticed, but testing has shown the contrary. It is also always possible to find an obscure corner case that generates a lot of contention on some lock, and which processor grabs the lock next is essentially random.
With the new spinlocks, the processes grab the spinlock in FIFO order, ensuring fairness (and more importantly, guaranteeing to some point that a waiting processor will eventually get the lock). Spinlocks configured to run on machines with more than 255 CPUs will use a 32-bit value, and 16 bits when the number of CPUs is smaller (as a bonus, the maximum theoretical limit of CPUs that spinlocks can support is raised up to 65536 processors).
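The FIFO behaviour can be illustrated with a toy model of a ticket lock in Python. This is a single-threaded sketch of the general ticket-lock idea, not the kernel's x86 implementation; the class and field names (`TicketLock`, `next_ticket`, `now_serving`) are chosen for illustration, and a real lock would take a ticket with an atomic fetch-and-increment.

```python
class TicketLock:
    """Toy model of a ticket lock (not thread-safe; illustration only)."""

    def __init__(self):
        self.next_ticket = 0  # ticket handed to the next arriving waiter
        self.now_serving = 0  # ticket currently allowed to hold the lock

    def take_ticket(self):
        """Arrive at the lock and receive a place in the queue."""
        ticket = self.next_ticket
        self.next_ticket += 1
        return ticket

    def may_enter(self, ticket):
        """A waiter spins until its own ticket is the one being served."""
        return ticket == self.now_serving

    def release(self):
        """Unlock: pass the lock to the next ticket in line."""
        self.now_serving += 1


def acquisition_order(arrivals):
    """Return the order in which the named waiters obtain the lock."""
    lock = TicketLock()
    tickets = [(lock.take_ticket(), name) for name in arrivals]
    order = []
    while len(order) < len(tickets):
        # Poll the waiters in reverse to show that polling order is
        # irrelevant: only the holder of the served ticket can enter.
        for ticket, name in reversed(tickets):
            if lock.may_enter(ticket):
                order.append(name)
                lock.release()
                break
    return order
```

Whatever order the waiters are polled in, `acquisition_order` returns them in arrival order, which is the fairness property described above.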
Code: (commit 1, 2)
1.5. Better process memory usage measurement
Recommended LWN article: "How much memory are applications really using?"
Measuring how much memory processes are using is more difficult than it looks, especially when processes are sharing the memory used. Features like /proc/$PID/smaps (added in 2.6.14) help, but they have not been enough. 2.6.25 adds new statistics to make this task easier. A new /proc/$PID/pagemap file is added for each process. In this file the kernel exports (in binary format) the physical page location of each page used by the process. Comparing this file with the files of other processes shows which pages they are sharing. Two more files, /proc/kpagecount and /proc/kpageflags, expose other statistics about the pages of the system. The author of the patch, Matt Mackall, proposes two new metrics: "proportional set size" (PSS), which divides each shared page by the number of processes sharing it, and "unique set size" (USS), which counts the pages not shared with any other process. The first statistic, PSS, has also been added to each entry in /proc/$PID/smaps. In this HG repository you can find some sample command-line and graphical tools that exploit all those statistics.
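The two metrics can be sketched in a few lines of Python. The function `pss_and_uss` and its input format are assumptions made for illustration: the input is, for each page mapped by one process, the number of processes mapping that page (including the process itself), and a 4096-byte page size is assumed.

```python
def pss_and_uss(page_sharers, page_size=4096):
    """Compute (PSS, USS) in bytes for one process.

    page_sharers: for each page mapped by the process, the number of
    processes mapping that page, counting this process itself.
    """
    # PSS charges each page to the process in proportion to its sharers.
    pss = sum(page_size / n for n in page_sharers)
    # USS counts only the pages that no other process maps.
    uss = sum(page_size for n in page_sharers if n == 1)
    return pss, uss
```

For a process with two private pages, one page shared by two processes and one shared by four, this gives a PSS of 11264 bytes (4096 + 4096 + 2048 + 1024) and a USS of 8192 bytes.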
Code: (commit 1, 2, 3, 4)
1.6. timerfd() syscall
timerfd() is a feature that got merged in 2.6.22 but was disabled due to late complaints about the syscall interface. Its purpose is to extend timer event notifications to something other than signals, because doing such things with signals is hard. poll()/epoll() only cover file descriptors, so the options were a BSD-like kevent subsystem or delivering timer notifications via a file descriptor, so that poll/epoll could handle them.
There were implementations for both approaches, but the cleaner and more "unixy" design of the file descriptor approach won. In 2.6.25, a revised API has finally been introduced. The API can be found in this LWN article.
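The following Python sketch shows the idea of receiving a timer expiration through a file descriptor with poll(). It is only a sketch under several assumptions: a Linux system whose C library exports the `timerfd_create()` and `timerfd_settime()` wrappers, a `struct timespec` made of two C longs, and `CLOCK_MONOTONIC` equal to 1. The helper name `wait_for_timer` is hypothetical.

```python
import ctypes
import os
import select
import struct

CLOCK_MONOTONIC = 1  # assumed value of the constant from <time.h> on Linux


class Timespec(ctypes.Structure):
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_nsec", ctypes.c_long)]


class Itimerspec(ctypes.Structure):
    _fields_ = [("it_interval", Timespec), ("it_value", Timespec)]


def wait_for_timer(delay_ns):
    """Arm a one-shot timer, wait for it with poll(), return expirations.

    delay_ns must be greater than zero: a zero it_value disarms the timer.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    fd = libc.timerfd_create(CLOCK_MONOTONIC, 0)
    if fd < 0:
        raise OSError(ctypes.get_errno(), "timerfd_create failed")
    try:
        # it_interval of zero means the timer fires only once.
        spec = Itimerspec(Timespec(0, 0),
                          Timespec(delay_ns // 10**9, delay_ns % 10**9))
        if libc.timerfd_settime(fd, 0, ctypes.byref(spec), None) < 0:
            raise OSError(ctypes.get_errno(), "timerfd_settime failed")
        poller = select.poll()
        poller.register(fd, select.POLLIN)
        poller.poll()  # blocks until the timer makes the descriptor readable
        # Reading returns an 8-byte unsigned count of expirations.
        (expirations,) = struct.unpack("=Q", os.read(fd, 8))
        return expirations
    finally:
        os.close(fd)
```

Because the timer is just another file descriptor, it can sit in the same poll()/epoll() set as sockets and pipes, which is the design point made above.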
1.7. SMACK
The most used MAC solution in Linux is SELinux, a very powerful security framework. SMACK is an alternative MAC framework, not as powerful as SELinux but simpler to use and configure. Linux is all about flexibility, and in the same way that it has several filesystems, this alternative security framework isn't meant to replace SELinux; it's just an alternative for those who find it better suited to their needs.
From the LWN article: Like SELinux, Smack implements Mandatory Access Control (MAC), but it purposely leaves out the role based access control and type enforcement that are major parts of SELinux. Smack is geared towards solving smaller security problems than SELinux, requiring much less configuration and very little application support.
Code: (commit)
1.8. LatencyTOP
Slow servers, skipping audio, jerky video: everyone knows the symptoms of latency. But to know what's really going on in the system, what's causing the latency, and how to fix it... those are difficult questions without good answers right now. LatencyTOP is a Linux tool for software developers (both kernel and userspace), aimed at identifying where system latency occurs, and what kind of operation/action is causing the latency to happen. By identifying this, developers can then change the code to avoid the worst latency hiccups.
There are many types and causes of latency, and LatencyTOP focuses on the type that causes audio skipping and desktop stutters. Specifically, LatencyTOP focuses on the cases where the applications want to run and execute useful code, but there's some resource that's not currently available (and the kernel then blocks the process). This is done both on a system level and on a per-process level, so that you can see what's happening to the system, and which process is suffering and/or causing the delays.
You can find the latencytop userspace tool, including screenshots, at latencytop.org.
Code: (commit)
1.9. BRK and PIE executable randomization
Exec-shield is a project that was started in 2003 by Red Hat to implement several security protections, and it is mainly used in Red Hat and Fedora. Many of its features were merged a long time ago, but not all of them. In 2.6.25 two more are merged: brk() randomization and PIE executable randomization. Those two features should make the address space randomization on i386 and x86_64 complete.
Code: (commit 1, 2, 3)
1.10. Controller area network (CAN) protocol support
Recommended LWN article: "PF_CAN"
From the "Controller Area Network" Wikipedia article: Controller Area Network (CAN or CAN-bus) is a computer network protocol and bus standard designed to allow microcontrollers and devices to communicate with each other and without a host computer. This implementation has been contributed by Volkswagen.
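As a small illustration of what travels over a PF_CAN raw socket, the Python sketch below packs and unpacks a classic CAN frame. It assumes the usual `struct can_frame` layout (32-bit CAN ID, 8-bit data length, 3 padding bytes, 8 data bytes); the helper names `build_can_frame` and `parse_can_frame` are made up for this example.

```python
import struct

# Assumed layout of struct can_frame: 32-bit CAN ID, 8-bit data length,
# 3 padding bytes, then up to 8 bytes of payload (zero padded).
CAN_FRAME_FORMAT = "=IB3x8s"


def build_can_frame(can_id, data):
    """Pack a CAN ID and up to 8 payload bytes into a 16-byte frame."""
    if len(data) > 8:
        raise ValueError("a classic CAN frame carries at most 8 data bytes")
    return struct.pack(CAN_FRAME_FORMAT, can_id, len(data),
                       data.ljust(8, b"\x00"))


def parse_can_frame(frame):
    """Unpack a 16-byte frame into (can_id, payload)."""
    can_id, length, data = struct.unpack(CAN_FRAME_FORMAT, frame)
    return can_id, data[:length]
```

On a Linux system with a CAN interface, such a frame could be sent through a socket created with `socket.socket(socket.AF_CAN, socket.SOCK_RAW, socket.CAN_RAW)` and bound to an interface such as `("vcan0",)`; the interface name here is only an example.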
Code: (commit 1, 2, 3, 4, 5, 6)
1.11. ACPI thermal regulation/WMI
In 2.6.25 ACPI adds thermal regulation support (commit 1, 2, 3, 4) and a WMI (Windows Management Instrumentation, a proprietary extension to ACPI) mapper (commit 1, 2, 3).
1.12. EXT4 update
Recommended article: "A better ext4"
The EXT4 mainline snapshot gets an update with a bunch of features: multi-block allocation, large blocksize up to PAGE_SIZE, journal checksumming, large file support, large filesystem support, inode versioning, and support for in-inode extended attributes on the root inode. These features should be the last ones that require on-disk format changes. Other features that don't affect the disk format, like delayed allocation, have yet to be merged.
Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
1.13. MN10300/AM33 architecture support
The MN10300/AM33 architecture is now supported under the "mn10300" subdirectory. 2.6.25 adds support for MN10300/AM33 CPUs produced by MEI. It also adds board support for the ASB2303 with the ASB2308 daughter board, and the ASB2305. The only processor supported is the MN103E010, which is an AM33v2 core plus on-chip devices.
Code: (commit)
1.14. TASK_KILLABLE
Most Unix systems have two states when sleeping: interruptible and uninterruptible. 2.6.25 adds a third state: killable. While interruptible sleeps can be interrupted by any signal, killable sleeps can only be interrupted by fatal signals. The practical implication of this feature is that NFS has been converted to use it, and as a result you can now kill -9 a task that is waiting for an NFS server that isn't contactable.
Further uses include allowing the OOM killer to make better decisions (it can't kill a task that's sleeping uninterruptibly) and changing more parts of the kernel to use the killable state. If you have a task stuck in uninterruptible sleep with the 2.6.25 kernel, please contact Matthew Wilcox with the output from
$ ps -eo pid,stat,wchan:40,comm |grep D
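Since a plain grep for "D" also matches the header line and any command name containing that letter, a stricter filter on the STAT column may be handy. The Python sketch below is one hypothetical way to do it; the function name `tasks_in_d_state` is made up, and it assumes the four-column output format of the ps command above.

```python
def tasks_in_d_state(ps_output):
    """Return (pid, wchan, comm) for tasks whose STAT field starts with 'D'.

    ps_output is the text printed by: ps -eo pid,stat,wchan:40,comm
    """
    stuck = []
    for line in ps_output.splitlines():
        # Split into at most 4 fields so command names may contain spaces.
        fields = line.split(None, 3)
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # skip the header and malformed lines
        pid, stat, wchan, comm = fields
        if stat.startswith("D"):
            stuck.append((int(pid), wchan, comm))
    return stuck
```

The text to parse can be captured with `subprocess.check_output(["ps", "-eo", "pid,stat,wchan:40,comm"], text=True)`.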
Code: Commits 1-11 are prep-work. Patches 15 and 21 accomplish the major user-visible features, but depend on all the commits which have gone before them.
(commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22)
2. Subsystems
2.1. Various
Block/VFS
IO context sharing. Syslets (or other threads/processes that want io context sharing) can set the CLONE_IO clone() flag to enforce sharing of io context (commit 1, 2, 3, 4, 5)
get rid of NR_OPEN and introduce a sysctl_nr_open (commit)
md: allow devices to be shared between md arrays (commit), allow a maximum extent to be set for resyncing (commit), support 'external' metadata for md arrays (commit)
Better rate control algorithm selection. (commit), add PID controller based rate control algorithm (commit), make PID rate control algorithm the default (commit)
Introduce key handling (commit), support adding/removing keys via cfg80211 (commit), support getting key sequence counters via cfg80211 (commit)
4xx: PLB to PCI-X support (commit), PLB to PCI 2.x support (commit), PLB to PCI Express support (commit), PCI support for Ebony board (commit), add early udbg support for 40x processors (commit), EP405 boards support for arch/powerpc (commit), add PCI to Walnut platform. (commit), base support for 440GX Taishan eval board. (commit), base support for 440SPe "Katmai" eval board. (commit), 440GRx Rainier board support. (commit), (commit), PIKA Warp base platform (commit)
mpc5200: Add generic support for simple MPC5200 based boards. (commit)
QE: Add ability to upload QE firmware. (commit), add support for Freescale QUICCEngine UART. (commit), add support for Freescale QUICCEngine UART. (commit)
8xx: Analogue & Micro Adder875 board support (commit)
PS3: Add logical performance monitor device support (commit), add logical performance monitor driver support (commit)
85xx: Port STX GP3 board over from arch/ppc (commit), port TQM85xx boards over from arch/ppc (commit), add support for Wind River SBC8560 in arch/powerpc (commit), add v1 device tree source for Wind River SBC8560 board (commit), add basic support for Wind River SBC8548 board (commit)
83xx: Add support for Wind River SBC834x boards (commit), add device tree source for Wind River SBC834x board. (commit), add MPC837x RDB platform support (commit)
pcm027: add support for phyCORE-PXA270 CPU module. (commit), add network support for phyCORE-PXA270. (commit), add support for pcm990 baseboard for phyCORE-PXA270. (commit)
Base support for pxa-based Toshiba e-series PDAs. (commit)
Add basic support for HTC Magician PDA phones. (commit)
Adds drivers for IXP4xx QMgr and NPE features (commit)
pxa: add basic support for Littleton (PXA3xx Form Factor Platform). (commit), add preliminary suspend/resume code for pxa3xx (commit), add cpufreq support. (commit)
Realview: clocksource support for the Realview platforms (commit), clockevents support for the RealView platforms (commit), add broadcasting clockevents support for ARM11MPCore (commit), add clockevents support for the local timers (commit), add core-tile detection (commit)
OMAP: Add DMA support for chaining and 3430 (commit)
HDA: Add Asus VX1 support (commit), add support for RV610/RV630 HDMI audio. (commit), STAC92HD71 codec mixer. (commit), add support of HP Thin Client T5735. (commit), add support for RV6xx HDMI audio. (commit), initial support of the Mitac 8252D (based on ALC883). (commit), add ALC889/ALC267/ALC269 support. (commit), add support for VIA VT1708B HD audio codec. (commit), added more 92HD71 codecs. (commit), added STAC92HD73 support. (commit), add IEC958 digital out support for Lenovo Thinkpads T61/X61. (commit), device ID for Macbook sound card. (commit), 92HD71BXX Mono Mute Support. (commit), 92HD7XXX power management support. (commit), add the support of Dell OEM laptops with ALC268. (commit), new model for conexant 5045 codec to support benq r55e. (commit), add model for Acer Aspire 5315. (commit), add Conexant 5051 codec support. (commit), add model for Acer Aspire 5310. (commit), add model for HP DV9553EG laptop. (commit), ALSA HD Audio patch for Intel ICH10 DeviceID's. (commit), add Dell T3400 support. (commit), add support for Intel SCH. (commit), add missing model for HD-audio Cx5045 codec (commit), add support for Samsung Q1 Ultra Vista edition. (commit)
ice1724: Add support of Onkyo SE-90PCI and SE-200PCI. (commit)
Add support for Motorola ROKR Z6 cellphone in mass storage mode (commit)
3.11. FireWire
init_ohci1394_dma: new standalone driver for remote kernel debugging via FireWire in early kernel initialization phase (commit)
A whole boatload of bug fixes for firewire-core, firewire-ohci, firewire-sbp2. The sum of them brings huge improvements of stability and functionality of these drivers over Linux 2.6.24. See the linux1394-user changelog for a list of fixes.
3.12. RDMA
RDMA/nes: Add a driver for Neteffect RNICs (commit)