The CFS Scheduler
--wxc200
Hi everyone,
Your brother here has been studying the scheduler lately and has picked up a few insights, and even more confusion. I'm writing the insights up to share with you, and posting the confusion in the hope that you can clear it up. :)
Since 2.6.23, Linux has shipped a new scheduler called the 'Completely Fair Scheduler' (see the wiki). It was written by Ingo Molnar in a remarkably short time; his CFS code replaced the earlier "O(1) scheduler".
First, a quick digression on O(1) scheduling.
Wikipedia's definition:
"An O(1) scheduler is a kernel scheduling design that can schedule processes within a constant amount of time, regardless of how many processes are running on the operating system (OS)."
Chapter 7 of Understanding the Linux Kernel (3rd edition) covers process scheduling; section 7.2, "Scheduling Algorithm", explains that 2.6 selects the next process in constant time while balancing batch and interactive workloads. Processes fall into two classes: real-time processes (further split into SCHED_FIFO and SCHED_RR) and normal processes.
The main question is how these processes are organized; here's a quick look at the structure:
I can't upload images here, so please see the attachment "數組" (the priority arrays).
We'll talk about the hrtimer side once we're done chatting about CFS. In short, hrtimers add "timer events" on top of the existing timekeeping layer: timer interrupts are registered as events, and the resolution improves from HZ-based ticks to nanoseconds (given hardware support).
cfs
Documentation/scheduler/sched-design-CFS.txt introduces everything CFS-related; it can also be read online.
Below is my translation of it, "seasoned to taste" according to my own understanding.
1 Overview
“80% of CFS's design can be summed up in a single sentence: CFS basically models
an "ideal, precise multi-tasking CPU" on real hardware.”
""Ideal multi-tasking CPU" is a (non-existent ) CPU that has 100% physical
power and which can run each task at precise equal speed, in parallel, each at
1/nr_running speed. For example: if there are 2 tasks running, then it runs
each at 50% physical power --- i.e., actually in parallel.
"
So CFS models an ideal multi-tasking CPU: with n runnable tasks, each runs at 1/n of the CPU's speed, i.e. all tasks run genuinely in parallel. Personally, I'd argue that even on a truly parallel CPU the speed shouldn't be split perfectly evenly; dividing it by priority seems like a better deal. Say there are two tasks a and b, and a's priority is twice b's: then just run them at a speed ratio of a:b = 2:1.
“On real hardware, we can run only a single task at once, so we have to
introduce the concept of "virtual runtime." The virtual runtime of a task
specifies when its next timeslice would start execution on the ideal
multi-tasking CPU described above. In practice, the virtual runtime of a task
is its actual runtime normalized to the total number of running tasks. ”
Of course no such CPU exists, so a new concept was introduced: "virtual runtime", and this concept tormented me for quite a while. I once asked about it in a post on CLF, and one brother's reply was pretty good. Having since read the code, my understanding is: 1) it is the key by which nodes are ordered in the red-black tree; 2) it is the scaled value that each stretch of a task's runtime gets converted into.
The highlighted sentence above is genuinely hard to translate. Taken literally, it's a ratio of actual runtime to the number of tasks? Let me give an example of how it is actually computed.
The function __update_curr() updates the current task's runtime statistics. Suppose the interval from when task a was last scheduled until this scheduling is delta; then its vruntime grows according to the formula delta * NICE_0_LOAD / a->se.load.weight. If we treat NICE_0_LOAD / a->se.load.weight as a ratio p, then p behaves like this: the higher the priority, the smaller p. The load weight is derived from the nice value via a static table, which I'll show later in the code.
So when a bunch of tasks have each run for some delta, their vruntime advances by the formula above; obviously, the highest-priority task gets the smallest vruntime increment.
2 A few operational details
CFS does its scheduling by tracking this vruntime: it always picks the task with the smallest vruntime to run.
3 The red-black tree.
The red-black tree gets quite a bit of use in the kernel, though I don't really understand it; could some brother give us a primer? hrtimer uses a red-black tree too. Here, an rb-tree stores the tasks, keyed by vruntime.
4 Some features
CFS has no time-slice concept. O(1) had one; CFS doesn't use one "visibly", it uses one on the sly. :) More on that once we're done translating these sections. The tunable in the document is presumably for adjusting the "minimum granularity", to serve two different scheduling styles, "desktop" and "server": the latter probably doesn't want frequent rescheduling (a waste of time), while the former wants everyone looked after. As the saying goes, "worry not about scarcity, but about inequality."
5 Scheduling policies
SCHED_FIFO/_RR are implemented in sched_rt.c and are as specified by POSIX: that's real-time scheduling.
So there is real-time scheduling on one side and normal process scheduling on the other.
6 Scheduling classes.
Schedulers can be registered modularly as classes, so you can choose real-time scheduling or CFS. Nice!
sched_fair.c implements all the methods of the CFS scheduling class.
7 Group scheduling.
Fairness is implemented not just between individual tasks; group scheduling is implemented too. That came later, via a patch.
×××××××××××××××××××××××××××××××××××
The above was a quick pass over one kernel document, section by section. Before we get to our proper hacking, let me ramble a few more words, as a summary and a first bit of intuition. :)
Let's start from the creation of a new process.
1) In kernel/fork.c, the fork routine do_fork() creates a new process; if its state is runnable, it calls wake_up_new_task() to wake up the newly created process FOR THE FIRST TIME.
do_fork()
{
	...
	if (unlikely(clone_flags & CLONE_STOPPED)) {
		...
	} else {
		wake_up_new_task(p, clone_flags);
	}
	...
}
2) What does this function do?
void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
{
	unsigned long flags;
	struct rq *rq;

	if (!p->sched_class->task_new || !current->se.on_rq) {
		/* Condition not met: enqueue directly. But note the last
		 * argument of activate_task() is 0: no wakeup. */
		activate_task(rq, p, 0);
	} else {
		/*
		 * Let the scheduling class do new task startup
		 * management (if any):
		 */
		p->sched_class->task_new(rq, p);	/* the fair class's task_new does some init */
		inc_nr_running(rq);	/* bump the runqueue's count of runnable tasks */
	}
	trace_sched_wakeup_new(rq, p);
	check_preempt_curr(rq, p, 0);	/* may the new task preempt current? */
#ifdef CONFIG_SMP
	if (p->sched_class->task_wake_up)
		p->sched_class->task_wake_up(rq, p);
#endif
	task_rq_unlock(rq, &flags);
}
3) Let's look at task_new() first; it lands in the fair_sched_class method task_new_fair().
static void task_new_fair(struct rq *rq, struct task_struct *p)
{
	struct cfs_rq *cfs_rq = task_cfs_rq(p);
	struct sched_entity *se = &p->se, *curr = cfs_rq->curr;
	int this_cpu = smp_processor_id();

	sched_info_queued(p);

	update_curr(cfs_rq);		/* bring the cfs_rq's bookkeeping up to date */
	place_entity(cfs_rq, se, 1);	/* initialize se's state in the cfs_rq, incl. vruntime */

	/* 'curr' will be NULL if the child belongs to a different group */
	if (sysctl_sched_child_runs_first && this_cpu == task_cpu(p) &&
			curr && curr->vruntime < se->vruntime) {
		/*
		 * Upon rescheduling, sched_class::put_prev_task() will place
		 * 'current' within the tree based on its new key value.
		 */
		swap(curr->vruntime, se->vruntime);
		resched_task(rq->curr);
	}
Inside __update_curr() we find:
	/*
	 * Get the amount of time the current task was running
	 * since the last time we changed load (this cannot
	 * overflow on 32 bits):
	 */
	delta_exec = (unsigned long)(now - curr->exec_start);
This computes the time that has passed between the last time this task was scheduled and this time. The interval is a bit spooky: how long is it, really? If a is one of n tasks on the runqueue, how large is delta_exec when a gets scheduled again? Tasks may exit or go to sleep in between, so the runqueue updates dynamically, and it's hard to say what delta_exec's curve over time looks like.
The condition in "if (entity_is_task(curr)) {" is an interesting one; I like it. Chasing down its definition:
	/* An entity is a task if it doesn't "own" a runqueue */
	#define entity_is_task(se)	(!se->my_q)
With group scheduling, a sched_entity may represent a whole group, not just a single task. So the test is simply whether it has a my_q pointer, i.e. whether there is a cfs_rq under its control. For a plain task, the condition holds.
Then in place_entity():
	/*
	 * The 'current' period is already promised to the current tasks,
	 * however the extra weight of the new task will slow them down a
	 * little, place the new task so that it fits in the slot that
	 * stays open at the end.
	 */
	if (initial && sched_feat(START_DEBIT))
		vruntime += sched_vslice(cfs_rq, se);	/* pad it a little :) */

	if (!initial) {
		/* Woken from sleep, so it deserves some compensation. But how
		 * much exactly? What if it's too much, or too little? */
		/* sleeps upto a single latency don't count. */
		if (sched_feat(NEW_FAIR_SLEEPERS)) {
			unsigned long thresh = sysctl_sched_latency;

			/*
			 * convert the sleeper threshold into virtual time
			 */
			if (sched_feat(NORMALIZED_SLEEPER))
				thresh = calc_delta_fair(thresh, se);

			vruntime -= thresh;
		}

		/* ensure we never gain time by being placed backwards. */
		vruntime = max_vruntime(se->vruntime, vruntime);
	}
3 On the tick interrupt
To keep scheduling working, every tick updates the running task's time accounting. First we need a point between the upper hrtimer layers and process scheduling; for now we'll head in the scheduling direction, and trace back toward hrtimer when we get to that topic.
That point is update_process_times(), called from the timer interrupt; it in turn calls a function directly tied to processes, scheduler_tick().
/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 *
 * It also gets called by the fork code, when changing the parent's
 * timeslices.
 */
void scheduler_tick(void)
{
	int cpu = smp_processor_id();
	...
	update_cpu_load(rq);
	curr->sched_class->task_tick(rq, curr, 0);	/* for CFS this ends up in entity_tick() */
	spin_unlock(&rq->lock);
	...
}
static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
	/*
	 * Update run-time statistics of the 'current'.
	 */
	update_curr(cfs_rq);	/* update current's accounting; we saw this yesterday */
#ifdef CONFIG_SCHED_HRTICK
	/*
	 * queued ticks are scheduled to match the slice, so don't bother
	 * validating it and just reschedule.
	 */
	if (queued) {
		resched_task(rq_of(cfs_rq)->curr);
		return;
	}
	/*
	 * don't let the period tick interfere with the hrtick preemption
	 */
	if (!sched_feat(DOUBLE_TICK) &&
			hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
		return;
#endif
	if (cfs_rq->nr_running > 1 || !sched_feat(WAKEUP_PREEMPT))
		check_preempt_tick(cfs_rq, curr);
}
Contrast check_preempt_tick() here with check_preempt_curr(); this is exactly where I got dizzy at first. The latter's comment reads: "Preempt the current task with a newly woken task if needed:"
The second function tormented me for a long time. If this is completely fair scheduling, shouldn't every task's runtime be exactly equal? In fact, ideal_runtime depends on the task's priority.
First, let's see how ideal_runtime is computed.
3)
/*
 * We calculate the wall-time slice from the period by taking a part
 * proportional to the weight.
 *
 * s = p*P[w/rw]	<- computed as a proportional share of the period
 */
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	unsigned long nr_running = cfs_rq->nr_running;	/* number of runnable tasks */
	...
4) Now let's see where calc_delta_weight()'s first argument comes from.
/*
 * The idea is to set a period in which each task runs once.
 *
 * When there are too many tasks (sysctl_sched_nr_latency) we have to stretch
 * this period because otherwise the slices get too small.
 *
 * p = (nr <= nl) ? l : l*nr/nl
 */
static u64 __sched_period(unsigned long nr_running)
{
	u64 period = sysctl_sched_latency;
	unsigned long nr_latency = sched_nr_latency;

	if (unlikely(nr_running > nr_latency)) {
		period = sysctl_sched_min_granularity;	/* minimum granularity, 4 ms */
		period *= nr_running;			/* 4 ms * number of tasks */
	}

	return period;
}
This one is easy to follow: with five or fewer runnable tasks the period is 20 ms; once the running number exceeds the default of 5, it becomes 4 ms * nr, growing linearly from there.
That's how a task about to be scheduled out is handled.
2) It calls put_prev_entity(). This routine left me a bit dizzy: if the task is still on the runqueue, it is put back into the red-black tree, just at a different position, and the cfs_rq's curr pointer is cleared.
static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
{
	/*
	 * If still on the runqueue then deactivate_task()
	 * was not called and update_curr() has to be done:
	 */
	if (prev->on_rq)	/* when this holds and when it doesn't still needs study */
		update_curr(cfs_rq);

	check_spread(cfs_rq, prev);
	if (prev->on_rq) {
		update_stats_wait_start(cfs_rq, prev);
		/* Put 'current' back into the tree. */
		__enqueue_entity(cfs_rq, prev);	/* why put it back in again? */
	}
	cfs_rq->curr = NULL;
}
Here's a little conjecture of mine; I'd love to have it confirmed:
1) When a task calls schedule(), its se->on_rq == 1, so it needs to be dequeued: deactivate_task() is called.
2) A task picked by schedule() needs the CPU to run; its se->on_rq == 0, so it needs to be enqueued.
3) When a task's execution time is used up, its se->on_rq == 1, and it just moves toward the right of the rb-tree.
1 Per the documentation, completely fair scheduling is not aimed only at single tasks; it should apply to groups too. For example, take two users wxc1 and wxc2, grouped by user id: the scheduler splits time fairly between the two groups, wxc1 and wxc2 each getting 50% of the CPU, and fair scheduling also operates inside each group.
A few macros:
CONFIG_GROUP_SCHED: enables group scheduling
CONFIG_RT_GROUP_SCHED: group scheduling for real-time tasks
CONFIG_FAIR_GROUP_SCHED: for normal tasks
How groups are divided:
1) Based on user id (CONFIG_USER_SCHED)
2) Based on the "cgroup" pseudo filesystem (CONFIG_CGROUP_SCHED)
I'm not yet familiar with the second one. According to fellow netizen eric xiao's blog, operations on the cgroup filesystem invoke an interface whose functions carry out the group-scheduling operations. I'll study the cgroup part later; for now I don't understand it.
3 Some data structures related to group scheduling
1) The group structure:
/* task group related information */
struct task_group {
#ifdef CONFIG_CGROUP_SCHED
	struct cgroup_subsys_state css;
#endif
#ifdef CONFIG_FAIR_GROUP_SCHED
	/* schedulable entities of this group on each cpu */
	struct sched_entity **se;	/* one entity per cpu representing the group,
					 * not a single task's entity */
	/* runqueue "owned" by this group on each cpu */
	struct cfs_rq **cfs_rq;		/* one cfs_rq per cpu belonging to this group */
	unsigned long shares;
#endif
	...
And these next fields are from struct cfs_rq:
#ifdef CONFIG_FAIR_GROUP_SCHED
	/* Note here: a cfs_rq gains an rq pointer with group scheduling */
	struct rq *rq;	/* cpu runqueue to which this cfs_rq is attached */
	/*
	 * leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
	 * a hierarchy). Non-leaf lrqs hold other higher schedulable entities
	 * (like users, containers etc.)
	 *
	 * leaf_cfs_rq_list ties together list of leaf cfs_rq's in a cpu. This
	 * list is used during load balance.
	 */
	struct list_head leaf_cfs_rq_list;	/* links all the leaf cfs_rq's on one runqueue */
	struct task_group *tg;	/* group that "owns" this runqueue: where did it come from? */
	...
}
4) struct sched_entity {
	...
#ifdef CONFIG_FAIR_GROUP_SCHED
	struct sched_entity *parent;	/* it has a parent now (it used to be an orphan);
					 * usually the entity, on the same cpu, of its
					 * group's parent group */
	/* rq on which this entity is (to be) queued: */
	struct cfs_rq *cfs_rq;	/* these two cfs_rq's are confusing: this pointer
				 * points at parent->my_q */
	/* rq "owned" by this entity/group: */
	struct cfs_rq *my_q;	/* the group's own cfs_rq */
#endif
};
4 A picture of the group-scheduling structures
Please see the attachment; it's hand-drawn, since I can't use drawing software. Embarrassing!
In the upper-right part, the two groups are in a parent-child relationship, and the small triangle beside each group is the sched_entity representing that group.
5 The concrete functions
First, two groups:
1) /* Default task group.
    * Every task in system belong to this group at bootup.
    */
struct task_group init_task_group;
2) /*
    * Root task group.
    * Every UID task group (including init_task_group aka UID-0) will
    * be a child to this group.
    */
init_task_group is the group tasks belong to at system initialization.
root_task_group is the ancestor of all groups.
sched_init() contains their initialization:
sched_init()
{
	...
#ifdef CONFIG_FAIR_GROUP_SCHED
	init_task_group.se = (struct sched_entity **)ptr;
	ptr += nr_cpu_ids * sizeof(void **);	/* note: ptr steps over a region holding,
						 * for every cpu, a pointer to the entity
						 * representing this group */
Now let's see how a group gets created.
/* allocate runqueue etc for a new task group */
struct task_group *sched_create_group(struct task_group *parent)
/* 'parent' is usually specified at creation, e.g. the creating task's group;
 * absent that, it may be one of the two groups mentioned above */
{
	struct task_group *tg;
	unsigned long flags;
	int i;

	tg = kzalloc(sizeof(*tg), GFP_KERNEL);	/* allocate the group structure itself */
	if (!tg)
		return ERR_PTR(-ENOMEM);

	if (!alloc_fair_sched_group(tg, parent))	/* this one is fun: it allocates
							 * those pointer arrays; more in a moment */
		goto err;
/* change task's runqueue when it moves between groups.
 * The caller of this function should have put the task in its new group
 * by now. This function just updates tsk->se.cfs_rq and tsk->se.parent to
 * reflect its new group.
 */
void sched_move_task(struct task_struct *tsk)
{
	int on_rq, running;
	unsigned long flags;
	struct rq *rq;