The CFS Scheduler
--wxc200
Hi everyone,
Your brother here has been studying the scheduler lately and has picked up a few insights, and even more confusion. I'm writing the insights up to share with you, and posting the confusion in the hope that you can clear it up. :)
Since 2.6.23, Linux has shipped a new scheduler called the 'Completely Fair Scheduler' (see the wiki). It was written by Ingo Molnar in a remarkably short time; his CFS code replaced the earlier "O(1) scheduler".
First, a quick digression on O(1) scheduling.
Wikipedia's definition:
"An O(1) scheduler is a kernel scheduling design that can schedule processes within a constant amount of time, regardless of how many processes are running on the operating system (OS)."
Chapter 7 of Understanding the Linux Kernel (3rd edition) covers process scheduling; section 7.2, "Scheduling Algorithm", explains that 2.6 selects the next process in constant time while balancing batch and interactive workloads. Processes fall into two classes: real-time processes (further split into SCHED_FIFO and SCHED_RR) and normal processes.
The main question is how these processes are organized; here's a quick look at the structure:
I can't upload images here, so please see the attachment "數組" (the priority arrays).
We'll talk about the hrtimer side once we're done chatting about CFS. In short, hrtimers add "timer events" on top of the existing timekeeping layer: timer interrupts are registered as events, and the resolution improves from HZ-based ticks to nanoseconds (given hardware support).
cfs
Documentation/scheduler/sched-design-CFS.txt introduces everything CFS-related; it can also be read online.
Below is my translation of it, "seasoned to taste" according to my own understanding.
1 Overview
“80% of CFS's design can be summed up in a single sentence: CFS basically models
an "ideal, precise multi-tasking CPU" on real hardware.”
""Ideal multi-tasking CPU" is a (non-existent ) CPU that has 100% physical
power and which can run each task at precise equal speed, in parallel, each at
1/nr_running speed. For example: if there are 2 tasks running, then it runs
each at 50% physical power --- i.e., actually in parallel.
"
So CFS models an ideal multi-tasking CPU: with n runnable tasks, each runs at 1/n of the CPU's speed, i.e. all tasks run genuinely in parallel. Personally, I'd argue that even on a truly parallel CPU the speed shouldn't be split perfectly evenly; dividing it by priority seems like a better deal. Say there are two tasks a and b, and a's priority is twice b's: then just run them at a speed ratio of a:b = 2:1.
“On real hardware, we can run only a single task at once, so we have to
introduce the concept of "virtual runtime." The virtual runtime of a task
specifies when its next timeslice would start execution on the ideal
multi-tasking CPU described above. In practice, the virtual runtime of a task
is its actual runtime normalized to the total number of running tasks. ”
Of course no such CPU exists, so a new concept was introduced: "virtual runtime", and this concept tormented me for quite a while. I once asked about it in a post on CLF, and one brother's reply was pretty good. Having since read the code, my understanding is: 1) it is the key by which nodes are ordered in the red-black tree; 2) it is the scaled value that each stretch of a task's runtime gets converted into.
The highlighted sentence above is genuinely hard to translate. Taken literally, it's a ratio of actual runtime to the number of tasks? Let me give an example of how it is actually computed.
The function __update_curr() updates the current task's runtime statistics. Suppose the interval from when task a was last scheduled until this scheduling is delta; then its vruntime grows according to the formula delta * NICE_0_LOAD / a->se.load.weight. If we treat NICE_0_LOAD / a->se.load.weight as a ratio p, then p behaves like this: the higher the priority, the smaller p. The load weight is derived from the nice value via a static table, which I'll show later in the code.
So when a bunch of tasks have each run for some delta, their vruntime advances by the formula above; obviously, the highest-priority task gets the smallest vruntime increment.
2 A few operational details
CFS does its scheduling by tracking this vruntime: it always picks the task with the smallest vruntime to run.
3 The red-black tree.
The red-black tree gets quite a bit of use in the kernel, though I don't really understand it; could some brother give us a primer? hrtimer uses a red-black tree too. Here, an rb-tree stores the tasks, keyed by vruntime.
4 Some features
CFS has no time-slice concept. O(1) had one; CFS doesn't use one "visibly", it uses one on the sly. :) More on that once we're done translating these sections. The tunable in the document is presumably for adjusting the "minimum granularity", to serve two different scheduling styles, "desktop" and "server": the latter probably doesn't want frequent rescheduling (a waste of time), while the former wants everyone looked after. As the saying goes, "worry not about scarcity, but about inequality."
5 Scheduling policies
SCHED_FIFO/_RR are implemented in sched_rt.c and are as specified by POSIX: that's real-time scheduling.
So there is real-time scheduling on one side and normal process scheduling on the other.
6 Scheduling classes.
Schedulers can be registered modularly as classes, so you can choose real-time scheduling or CFS. Nice!
sched_fair.c implements all the methods of the CFS scheduling class.
7 Group scheduling.
Fairness is implemented not just between individual tasks; group scheduling is implemented too. That came later, via a patch.
×××××××××××××××××××××××××××××××××××
The above was a quick pass over one kernel document, section by section. Before we get to our proper hacking, let me ramble a few more words, as a summary and a first bit of intuition. :)
Let's start from the creation of a new process.
1) In kernel/fork.c, the fork routine do_fork() creates a new process; if its state is runnable, it calls wake_up_new_task() to wake up the newly created process FOR THE FIRST TIME.
do_fork()
{
	...
	if (unlikely(clone_flags & CLONE_STOPPED)) {
		...
	} else {
		wake_up_new_task(p, clone_flags);
	}
	...
}
2) What does this function do?
void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
{
	unsigned long flags;
	struct rq *rq;

	if (!p->sched_class->task_new || !current->se.on_rq) {
		/* Condition not met: enqueue directly. But note the last
		 * argument of activate_task() is 0: no wakeup. */
		activate_task(rq, p, 0);
	} else {
		/*
		 * Let the scheduling class do new task startup
		 * management (if any):
		 */
		p->sched_class->task_new(rq, p);	/* the fair class's task_new does some init */
		inc_nr_running(rq);	/* bump the runqueue's count of runnable tasks */
	}
	trace_sched_wakeup_new(rq, p);
	check_preempt_curr(rq, p, 0);	/* may the new task preempt current? */
#ifdef CONFIG_SMP
	if (p->sched_class->task_wake_up)
		p->sched_class->task_wake_up(rq, p);
#endif
	task_rq_unlock(rq, &flags);
}
3) Let's look at task_new() first; it lands in the fair_sched_class method task_new_fair().
static void task_new_fair(struct rq *rq, struct task_struct *p)
{
	struct cfs_rq *cfs_rq = task_cfs_rq(p);
	struct sched_entity *se = &p->se, *curr = cfs_rq->curr;
	int this_cpu = smp_processor_id();

	sched_info_queued(p);

	update_curr(cfs_rq);		/* bring the cfs_rq's bookkeeping up to date */
	place_entity(cfs_rq, se, 1);	/* initialize se's state in the cfs_rq, incl. vruntime */

	/* 'curr' will be NULL if the child belongs to a different group */
	if (sysctl_sched_child_runs_first && this_cpu == task_cpu(p) &&
			curr && curr->vruntime < se->vruntime) {
		/*
		 * Upon rescheduling, sched_class::put_prev_task() will place
		 * 'current' within the tree based on its new key value.
		 */
		swap(curr->vruntime, se->vruntime);
		resched_task(rq->curr);
	}
Inside __update_curr() we find:
	/*
	 * Get the amount of time the current task was running
	 * since the last time we changed load (this cannot
	 * overflow on 32 bits):
	 */
	delta_exec = (unsigned long)(now - curr->exec_start);
This computes the time that has passed between the last time this task was scheduled and this time. The interval is a bit spooky: how long is it, really? If a is one of n tasks on the runqueue, how large is delta_exec when a gets scheduled again? Tasks may exit or go to sleep in between, so the runqueue updates dynamically, and it's hard to say what delta_exec's curve over time looks like.
The condition in "if (entity_is_task(curr)) {" is an interesting one; I like it. Chasing down its definition:
	/* An entity is a task if it doesn't "own" a runqueue */
	#define entity_is_task(se)	(!se->my_q)
With group scheduling, a sched_entity may represent a whole group, not just a single task. So the test is simply whether it has a my_q pointer, i.e. whether there is a cfs_rq under its control. For a plain task, the condition holds.
Then in place_entity():
	/*
	 * The 'current' period is already promised to the current tasks,
	 * however the extra weight of the new task will slow them down a
	 * little, place the new task so that it fits in the slot that
	 * stays open at the end.
	 */
	if (initial && sched_feat(START_DEBIT))
		vruntime += sched_vslice(cfs_rq, se);	/* pad it a little :) */

	if (!initial) {
		/* Woken from sleep, so it deserves some compensation. But how
		 * much exactly? What if it's too much, or too little? */
		/* sleeps upto a single latency don't count. */
		if (sched_feat(NEW_FAIR_SLEEPERS)) {
			unsigned long thresh = sysctl_sched_latency;

			/*
			 * convert the sleeper threshold into virtual time
			 */
			if (sched_feat(NORMALIZED_SLEEPER))
				thresh = calc_delta_fair(thresh, se);

			vruntime -= thresh;
		}

		/* ensure we never gain time by being placed backwards. */
		vruntime = max_vruntime(se->vruntime, vruntime);
	}
3 On the tick interrupt
To keep scheduling working, every tick updates the running task's time accounting. First we need a point between the upper hrtimer layers and process scheduling; for now we'll head in the scheduling direction, and trace back toward hrtimer when we get to that topic.
That point is update_process_times(), called from the timer interrupt; it in turn calls a function directly tied to processes, scheduler_tick().
/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 *
 * It also gets called by the fork code, when changing the parent's
 * timeslices.
 */
void scheduler_tick(void)
{
	int cpu = smp_processor_id();
	...
	update_cpu_load(rq);
	curr->sched_class->task_tick(rq, curr, 0);	/* for CFS this ends up in entity_tick() */
	spin_unlock(&rq->lock);
	...
}
static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
	/*
	 * Update run-time statistics of the 'current'.
	 */
	update_curr(cfs_rq);	/* update current's accounting; we saw this yesterday */
#ifdef CONFIG_SCHED_HRTICK
	/*
	 * queued ticks are scheduled to match the slice, so don't bother
	 * validating it and just reschedule.
	 */
	if (queued) {
		resched_task(rq_of(cfs_rq)->curr);
		return;
	}
	/*
	 * don't let the period tick interfere with the hrtick preemption
	 */
	if (!sched_feat(DOUBLE_TICK) &&
			hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
		return;
#endif
	if (cfs_rq->nr_running > 1 || !sched_feat(WAKEUP_PREEMPT))
		check_preempt_tick(cfs_rq, curr);
}
Contrast check_preempt_tick() here with check_preempt_curr(); this is exactly where I got dizzy at first. The latter's comment reads: "Preempt the current task with a newly woken task if needed:"
The second function tormented me for a long time. If this is completely fair scheduling, shouldn't every task's runtime be exactly equal? In fact, ideal_runtime depends on the task's priority.
First, let's see how ideal_runtime is computed.
3)
/*
 * We calculate the wall-time slice from the period by taking a part
 * proportional to the weight.
 *
 * s = p*P[w/rw]	<- computed as a proportional share of the period
 */
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	unsigned long nr_running = cfs_rq->nr_running;	/* number of runnable tasks */
	...
4) Now let's see where calc_delta_weight()'s first argument comes from.
/*
 * The idea is to set a period in which each task runs once.
 *
 * When there are too many tasks (sysctl_sched_nr_latency) we have to stretch
 * this period because otherwise the slices get too small.
 *
 * p = (nr <= nl) ? l : l*nr/nl
 */
static u64 __sched_period(unsigned long nr_running)
{
	u64 period = sysctl_sched_latency;
	unsigned long nr_latency = sched_nr_latency;

	if (unlikely(nr_running > nr_latency)) {
		period = sysctl_sched_min_granularity;	/* minimum granularity, 4 ms */
		period *= nr_running;			/* 4 ms * number of tasks */
	}

	return period;
}
This one is easy to follow: with five or fewer runnable tasks the period is 20 ms; once the running number exceeds the default of 5, it becomes 4 ms * nr, growing linearly from there.
That's how a task about to be scheduled out is handled.
2) It calls put_prev_entity(). This routine left me a bit dizzy: if the task is still on the runqueue, it is put back into the red-black tree, just at a different position, and the cfs_rq's curr pointer is cleared.
static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
{
	/*
	 * If still on the runqueue then deactivate_task()
	 * was not called and update_curr() has to be done:
	 */
	if (prev->on_rq)	/* when this holds and when it doesn't still needs study */
		update_curr(cfs_rq);

	check_spread(cfs_rq, prev);
	if (prev->on_rq) {
		update_stats_wait_start(cfs_rq, prev);
		/* Put 'current' back into the tree. */
		__enqueue_entity(cfs_rq, prev);	/* why put it back in again? */
	}
	cfs_rq->curr = NULL;
}
Here's a little conjecture of mine; I'd love to have it confirmed:
1) When a task calls schedule(), its se->on_rq == 1, so it needs to be dequeued: deactivate_task() is called.
2) A task picked by schedule() needs the CPU to run; its se->on_rq == 0, so it needs to be enqueued.
3) When a task's execution time is used up, its se->on_rq == 1, and it just moves toward the right of the rb-tree.
1 Per the documentation, completely fair scheduling is not aimed only at single tasks; it should apply to groups too. For example, take two users wxc1 and wxc2, grouped by user id: the scheduler splits time fairly between the two groups, wxc1 and wxc2 each getting 50% of the CPU, and fair scheduling also operates inside each group.
A few macros:
CONFIG_GROUP_SCHED: enables group scheduling
CONFIG_RT_GROUP_SCHED: group scheduling for real-time tasks
CONFIG_FAIR_GROUP_SCHED: for normal tasks
How groups are divided:
1) Based on user id (CONFIG_USER_SCHED)
2) Based on the "cgroup" pseudo filesystem (CONFIG_CGROUP_SCHED)
I'm not yet familiar with the second one. According to fellow netizen eric xiao's blog, operations on the cgroup filesystem invoke an interface whose functions carry out the group-scheduling operations. I'll study the cgroup part later; for now I don't understand it.
3 Some data structures related to group scheduling
1) The group structure:
/* task group related information */
struct task_group {
#ifdef CONFIG_CGROUP_SCHED
	struct cgroup_subsys_state css;
#endif
#ifdef CONFIG_FAIR_GROUP_SCHED
	/* schedulable entities of this group on each cpu */
	struct sched_entity **se;	/* one entity per cpu representing the group,
					 * not a single task's entity */
	/* runqueue "owned" by this group on each cpu */
	struct cfs_rq **cfs_rq;		/* one cfs_rq per cpu belonging to this group */
	unsigned long shares;
#endif
	...
And these next fields are from struct cfs_rq:
#ifdef CONFIG_FAIR_GROUP_SCHED
	/* Note here: a cfs_rq gains an rq pointer with group scheduling */
	struct rq *rq;	/* cpu runqueue to which this cfs_rq is attached */
	/*
	 * leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
	 * a hierarchy). Non-leaf lrqs hold other higher schedulable entities
	 * (like users, containers etc.)
	 *
	 * leaf_cfs_rq_list ties together list of leaf cfs_rq's in a cpu. This
	 * list is used during load balance.
	 */
	struct list_head leaf_cfs_rq_list;	/* links all the leaf cfs_rq's on one runqueue */
	struct task_group *tg;	/* group that "owns" this runqueue: where did it come from? */
	...
}
4) struct sched_entity {
	...
#ifdef CONFIG_FAIR_GROUP_SCHED
	struct sched_entity *parent;	/* it has a parent now (it used to be an orphan);
					 * usually the entity, on the same cpu, of its
					 * group's parent group */
	/* rq on which this entity is (to be) queued: */
	struct cfs_rq *cfs_rq;	/* these two cfs_rq's are confusing: this pointer
				 * points at parent->my_q */
	/* rq "owned" by this entity/group: */
	struct cfs_rq *my_q;	/* the group's own cfs_rq */
#endif
};
4 A picture of the group-scheduling structures
Please see the attachment; it's hand-drawn, since I can't use drawing software. Embarrassing!
In the upper-right part, the two groups are in a parent-child relationship, and the small triangle beside each group is the sched_entity representing that group.
5 The concrete functions
First, two groups:
1) /* Default task group.
    * Every task in system belong to this group at bootup.
    */
struct task_group init_task_group;
2) /*
    * Root task group.
    * Every UID task group (including init_task_group aka UID-0) will
    * be a child to this group.
    */
init_task_group is the group tasks belong to at system initialization.
root_task_group is the ancestor of all groups.
sched_init() contains their initialization:
sched_init()
{
	...
#ifdef CONFIG_FAIR_GROUP_SCHED
	init_task_group.se = (struct sched_entity **)ptr;
	ptr += nr_cpu_ids * sizeof(void **);	/* note: ptr steps over a region holding,
						 * for every cpu, a pointer to the entity
						 * representing this group */
Now let's see how a group gets created.
/* allocate runqueue etc for a new task group */
struct task_group *sched_create_group(struct task_group *parent)
/* 'parent' is usually specified at creation, e.g. the creating task's group;
 * absent that, it may be one of the two groups mentioned above */
{
	struct task_group *tg;
	unsigned long flags;
	int i;

	tg = kzalloc(sizeof(*tg), GFP_KERNEL);	/* allocate the group structure itself */
	if (!tg)
		return ERR_PTR(-ENOMEM);

	if (!alloc_fair_sched_group(tg, parent))	/* this one is fun: it allocates
							 * those pointer arrays; more in a moment */
		goto err;
/* change task's runqueue when it moves between groups.
 * The caller of this function should have put the task in its new group
 * by now. This function just updates tsk->se.cfs_rq and tsk->se.parent to
 * reflect its new group.
 */
void sched_move_task(struct task_struct *tsk)
{
	int on_rq, running;
	unsigned long flags;
	struct rq *rq;