k8s调度器及调度队列源码分析

简介

在 Kubernetes 中,调度是指将 Pod 放置到合适的 Node 上,然后对应 Node 上的 kubelet 才能够运行这些 Pod。K8s scheduler 就是用来调度 pod 的一个组件。

本文主要是通过源码了解调度器的部分工作流程。

调度器流程图

源码分析

Based on Kubernetes v1.19.11.

K8s scheduler 主要的数据结构是:

  1. Scheduler。
  2. SchedulingQueue。

相关的代码流程主要分为两个部分:

  1. cmd/kube-scheduler,这里是我们调度器的起始处,主要是读取配置,初始化并启动调度器。
  2. pkg/scheduler,这里是调度器的核心代码。

数据结构

Scheduler

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
// pkg/scheduler/scheduler.go

// Scheduler watches for new unscheduled pods. It attempts to find
// nodes that they fit on and writes bindings back to the api server.
type Scheduler struct {
// It is expected that changes made via SchedulerCache will be observed
// by NodeLister and Algorithm.
SchedulerCache internalcache.Cache
Algorithm core.ScheduleAlgorithm

// NextPod should be a function that blocks until the next pod
// is available. We don't use a channel for this, because scheduling
// a pod may take some amount of time and we don't want pods to get
// stale while they sit in a channel.
NextPod func() *framework.QueuedPodInfo

// Error is called if there is an error. It is passed the pod in
// question, and the error
Error func(*framework.QueuedPodInfo, error)

// Close this to shut down the scheduler.
StopEverything <-chan struct{}

// SchedulingQueue holds pods to be scheduled
SchedulingQueue internalqueue.SchedulingQueue

// Profiles are the scheduling profiles.
Profiles profile.Map

scheduledPodsHasSynced func() bool
client clientset.Interface
}
  1. SchedulerCache ,保存了调度所需的 podStates 和 nodeInfos。
  2. Algorithm ,会使用该对象的 Schedule 方法来运行调度逻辑。
  3. SchedulingQueue ,调度队列。
  4. Profiles ,调度器配置。

SchedulingQueue

Interface

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
// pkg/scheduler/internal/queue/scheduling_queue.go

// SchedulingQueue is an interface for a queue to store pods waiting to be scheduled.
// The interface follows a pattern similar to cache.FIFO and cache.Heap and
// makes it easy to use those data structures as a SchedulingQueue.
type SchedulingQueue interface {
framework.PodNominator
Add(pod *v1.Pod) error
// AddUnschedulableIfNotPresent adds an unschedulable pod back to scheduling queue.
// The podSchedulingCycle represents the current scheduling cycle number which can be
// returned by calling SchedulingCycle().
AddUnschedulableIfNotPresent(pod *framework.QueuedPodInfo, podSchedulingCycle int64) error
// SchedulingCycle returns the current number of scheduling cycle which is
// cached by scheduling queue. Normally, incrementing this number whenever
// a pod is popped (e.g. called Pop()) is enough.
SchedulingCycle() int64
// Pop removes the head of the queue and returns it. It blocks if the
// queue is empty and waits until a new item is added to the queue.
Pop() (*framework.QueuedPodInfo, error)
Update(oldPod, newPod *v1.Pod) error
Delete(pod *v1.Pod) error
MoveAllToActiveOrBackoffQueue(event string)
AssignedPodAdded(pod *v1.Pod)
AssignedPodUpdated(pod *v1.Pod)
PendingPods() []*v1.Pod
// Close closes the SchedulingQueue so that the goroutine which is
// waiting to pop items can exit gracefully.
Close()
// NumUnschedulablePods returns the number of unschedulable pods exist in the SchedulingQueue.
NumUnschedulablePods() int
// Run starts the goroutines managing the queue.
Run()
}

Implementation

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
// PriorityQueue implements a scheduling queue.
// The head of PriorityQueue is the highest priority pending pod. This structure
// has three sub queues. One sub-queue holds pods that are being considered for
// scheduling. This is called activeQ and is a Heap. Another queue holds
// pods that are already tried and are determined to be unschedulable. The latter
// is called unschedulableQ. The third queue holds pods that are moved from
// unschedulable queues and will be moved to active queue when backoff are completed.
type PriorityQueue struct {
// PodNominator abstracts the operations to maintain nominated Pods.
framework.PodNominator
stop chan struct{}
clock util.Clock
// pod initial backoff duration.
podInitialBackoffDuration time.Duration
// pod maximum backoff duration.
podMaxBackoffDuration time.Duration

lock sync.RWMutex
cond sync.Cond
// activeQ is heap structure that scheduler actively looks at to find pods to
// schedule. Head of heap is the highest priority pod.
activeQ *heap.Heap

// podBackoffQ is a heap ordered by backoff expiry. Pods which have completed backoff
// are popped from this heap before the scheduler looks at activeQ
podBackoffQ *heap.Heap

// unschedulableQ holds pods that have been tried and determined unschedulable.
unschedulableQ *UnschedulablePodsMap

// schedulingCycle represents sequence number of scheduling cycle and is incremented
// when a pod is popped.
schedulingCycle int64
// moveRequestCycle caches the sequence number of scheduling cycle when we
// received a move request. Unscheduable pods in and before this scheduling
// cycle will be put back to activeQueue if we were trying to schedule them
// when we received move request.
moveRequestCycle int64

// closed indicates that the queue is closed.
// It is mainly used to let Pop() exit its control loop while waiting for an item.
closed bool
}
  1. PodNominator:调度算法调度的结果,保存了 Pod 和 Node 的关系。
  2. cond:用来控制调度队列的 Pop 操作。
  3. activeQ:用堆维护的优先队列,保存着待调度的 pod,其中优先级默认是根据 Pod 的优先级和创建时间来排序。
  4. podBackoffQ:同样是用堆维护的优先队列,保存着运行失败的 Pod,优先级是根据 backOffTime 来排序,backOffTimepodInitialBackoffDuration 以及 podMaxBackoffDuration 两个参数影响。
  5. unschedulableQ:是一个 Map 结构,保存着暂时无法调度(可能是资源不满足等情况)的 Pod。

cmd/kube-scheduler

调度器的入口 main

最开始,scheduler 在 cmd/kube-scheduler/scheduler.go 使用 NewSchedulerCommand() 初始化命令并执行命令。

1
2
3
4
5
6
7
8
9
10
// cmd/kube-scheduler/scheduler.go

func main() {
...
command := app.NewSchedulerCommand()
...
if err := command.Execute(); err != nil {
os.Exit(1)
}
}

初始化调度器命令 NewSchedulerCommand

NewSchedulerCommand() 会读取配置文件和参数,初始化调度命令,其中最主要的函数是 runCommand()

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
func NewSchedulerCommand(registryOptions ...Option) *cobra.Command {
...
cmd := &cobra.Command{
Use: "kube-scheduler",
...
Run: func(cmd *cobra.Command, args []string) {
if err := runCommand(cmd, opts, registryOptions...); err != nil {
fmt.Fprintf(os.Stderr, "%v\n", err)
os.Exit(1)
}
},
...
}

...
return cmd
}

执行调度器命令 runCommand

runCommand 主要分为两个重要步骤:

  1. Setup :读取配置文件以及参数,初始化调度器。这里的配置文件包括 Profiles 配置等。
  2. Run:运行调度器所需的组件,例如健康检查服务,Informer 等。然后使用 Setup 得到的调度器运行调度的主流程。
1
2
3
4
5
6
7
8
9
10
func runCommand(cmd *cobra.Command, opts *options.Options, registryOptions ...Option) error {
...

cc, sched, err := Setup(ctx, opts, registryOptions...)
if err != nil {
return err
}

return Run(ctx, cc, sched)
}

创建调度器 Setup

Setup 会根据配置文件和参数创建 scheduler。这里个人觉得最主要的是 Profiles,里面定义了调度器的名字,以及 scheduling framework 的插件配置。还有一些可以用来调优的参数,例如 PercentageOfNodesToScore, PodInitialBackoffSeconds , PodMaxBackoffSeconds 等。

并且 scheduler.New() 中会有一个 addAllEventHandlers(sched, informerFactory, podInformer) 函数,启动所有资源对象的事件监听,来根据情况调用对应的回调函数,这些回调函数同时也会影响调度队列的运行过程。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
func Setup(ctx context.Context, opts *options.Options, outOfTreeRegistryOptions ...Option) (*schedulerserverconfig.CompletedConfig, *scheduler.Scheduler, error) {
...
// Create the scheduler.
sched, err := scheduler.New(cc.Client,
cc.InformerFactory,
cc.PodInformer,
recorderFactory,
ctx.Done(),
scheduler.WithProfiles(cc.ComponentConfig.Profiles...),
scheduler.WithAlgorithmSource(cc.ComponentConfig.AlgorithmSource),
scheduler.WithPercentageOfNodesToScore(cc.ComponentConfig.PercentageOfNodesToScore),
scheduler.WithFrameworkOutOfTreeRegistry(outOfTreeRegistry),
scheduler.WithPodMaxBackoffSeconds(cc.ComponentConfig.PodMaxBackoffSeconds),
scheduler.WithPodInitialBackoffSeconds(cc.ComponentConfig.PodInitialBackoffSeconds),
scheduler.WithExtenders(cc.ComponentConfig.Extenders...),
)
if err != nil {
return nil, nil, err
}

return &cc, sched, nil
}

运行调度器 Run

Run 主要是启动一些组件,然后调用 sched.Run(ctx) 进行调度的主流程。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
func Run(ctx context.Context, cc *schedulerserverconfig.CompletedConfig, sched *scheduler.Scheduler) error {
...
// Prepare the event broadcaster.
cc.EventBroadcaster.StartRecordingToSink(ctx.Done())
// Setup healthz checks.
...
// Start up the healthz server.
...
// Start all informers.
go cc.PodInformer.Informer().Run(ctx.Done())
cc.InformerFactory.Start(ctx.Done())
// Wait for all caches to sync before scheduling.
cc.InformerFactory.WaitForCacheSync(ctx.Done())

// If leader election is enabled, runCommand via LeaderElector until done and exit.
// Leader election
...
// Leader election is disabled, so runCommand inline until done.
sched.Run(ctx)
return fmt.Errorf("finished without leader elect")
}

pkg/scheduler

运行调度器主流程

Run 会启动 scheduling queue,并不断调用 sched.scheduleOne() 进行调度。

1
2
3
4
5
6
7
8
9
// Run begins watching and scheduling. It waits for cache to be synced, then starts scheduling and blocked until the context is done.
func (sched *Scheduler) Run(ctx context.Context) {
if !cache.WaitForCacheSync(ctx.Done(), sched.scheduledPodsHasSynced) {
return
}
sched.SchedulingQueue.Run()
wait.UntilWithContext(ctx, sched.scheduleOne, 0)
sched.SchedulingQueue.Close()
}

运行调度队列

1
2
3
4
5
// Run starts the goroutine to pump from podBackoffQ to activeQ
func (p *PriorityQueue) Run() {
go wait.Until(p.flushBackoffQCompleted, 1.0*time.Second, p.stop)
go wait.Until(p.flushUnschedulableQLeftover, 30*time.Second, p.stop)
}

调度队列的运行逻辑:

  1. 每隔 1s 检查 podBackoffQ 是否有 pod 可以放入 activeQ 中。检查的逻辑是判断 backOffTime 是否已经到期。
  2. 每隔 30s 检查 unschedulableQ 是否有 pod 可以放入 activeQ 中。

单个 Pod 的调度 scheduleOne

在介绍 scheduleOne 之前,看这张 pod 调度流程图能有助于我们理清整个过程。同时这也是 k8s v1.15 开始支持的 Scheduling Framework 的 Plugin 扩展点。

pod scheduling context

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
// scheduleOne does the entire scheduling workflow for a single pod.  It is serialized on the scheduling algorithm's host fitting.
func (sched *Scheduler) scheduleOne(ctx context.Context) {
podInfo := sched.NextPod()
...
pod := podInfo.Pod
prof, err := sched.profileForPod(pod)
...
// Synchronously attempt to find a fit for the pod.
start := time.Now()
state := framework.NewCycleState()
...
scheduleResult, err := sched.Algorithm.Schedule(schedulingCycleCtx, prof, state, pod)
...

// Tell the cache to assume that a pod now is running on a given node, even though it hasn't been bound yet.
// This allows us to keep scheduling without waiting on binding to occur.
assumedPodInfo := podInfo.DeepCopy()
assumedPod := assumedPodInfo.Pod
// assume modifies `assumedPod` by setting NodeName=scheduleResult.SuggestedHost
err = sched.assume(assumedPod, scheduleResult.SuggestedHost)
...

// Run the Reserve method of reserve plugins.
if sts := prof.RunReservePluginsReserve(schedulingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost); !sts.IsSuccess() {
...
}

// Run "permit" plugins.
runPermitStatus := prof.RunPermitPlugins(schedulingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
...

// bind the pod to its host asynchronously (we can do this b/c of the assumption step above).
go func() {
bindingCycleCtx, cancel := context.WithCancel(ctx)
waitOnPermitStatus := prof.WaitOnPermit(bindingCycleCtx, assumedPod)
if !waitOnPermitStatus.IsSuccess() {
...
return
}

// Run "prebind" plugins.
preBindStatus := prof.RunPreBindPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
if !preBindStatus.IsSuccess() {
...
return
}

err := sched.bind(bindingCycleCtx, prof, assumedPod, scheduleResult.SuggestedHost, state)
if err != nil {
...
} else {
// Run "postbind" plugins.
prof.RunPostBindPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
}
}()
}

ScheduleOne 是调度器的主流程,主要包括以下几步:

  1. 调用 sched.NextPod() 拿到下一个需要调度的 pod。后面会对这个过程进行更详细的介绍。
  2. 调用 sched.profileForPod(pod) ,根据 pod 中的 schedulerName 拿到针对该 pod 调度的 Profiles。这些 Profiles 就包括了调度插件的配置等。
  3. 进行上图中的 Scheduling Cycle 部分,这部分是单线程运行的。
    • 调用 sched.Algorithm.Schedule()。此处包括好几个步骤,其中 PreFilter, Filter 被称为 Predicate,是对节点进行过滤,这里面考虑了节点资源,Pod Affinity,以及 Node Volumn 等情况。而 PreScore , Score , Nomalize Score 又被称为 Priorities,是对节点进行优选打分,这里会得到一个适合当前 Pod 分配上去的 Node。
    • 进行 Reserve 操作,将调度结果缓存。当后面的调度流程执行失败,会进行 Unreserve 进行数据回滚。
    • 进行 Permit 操作,这里是用户自定义的插件,可以使 Pod 进行 allow(允许 Pod 通过 Permit 阶段)、reject(Pod 调度失败)和 wait(可设置超时时间)这三种操作。对于 Gang Scheduling (一批 pod 同时创建成功或同时创建失败),可以在 Permit 对 Pod 进行控制。
  4. 进行图中的 Binding Cycle 部分,这部分是起了一个 Goroutine 去完成工作的,不会阻塞调度主流程。
    • 最开始会进行 WaitOnPermit 操作,这里会阻塞判断 Pod 是否 Permit,直到 Pod Permit 状态为 allow 或者 reject 再往下继续运行。
    • 进行 PreBind , Bind , PostBind 操作。这里会调用 k8s apiserver 提供的接口 b.handle.ClientSet().CoreV1().Pods(binding.Namespace).Bind(ctx, binding, metav1.CreateOptions{}),将待调度的 Pod 与选中的节点进行绑定,但是可能会绑定失败,此时会做 Unreserve 操作,将节点上面 Pod 的资源解除预留,然后重新放置到失败队列中。

当 Pod 与 Node 绑定成功后,Node 上面的 kubelet 会 watch 到对应的 event,然后会在节点上创建 Pod,包括创建容器 storage、network 等。等所有的资源都准备完成,kubelet 会把 Pod 状态更新为Running。

SchedulingQueue 细节

获取下一个运行的 Pod

调度的时候,需要获取一个调度的 pod,即 sched.NextPod() ,其中调用了 SchedulingQueue 的 Pop() 方法。

activeQ 中没有元素,会通过 p.cond.Wait() 阻塞,直到 podBackoffQ 或者 unschedulableQ 将元素加入 activeQ 并通过 cond.Broadcast() 来唤醒。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
// Pop removes the head of the active queue and returns it. It blocks if the
// activeQ is empty and waits until a new item is added to the queue. It
// increments scheduling cycle when a pod is popped.
func (p *PriorityQueue) Pop() (*framework.QueuedPodInfo, error) {
p.lock.Lock()
defer p.lock.Unlock()
for p.activeQ.Len() == 0 {
// When the queue is empty, invocation of Pop() is blocked until new item is enqueued.
// When Close() is called, the p.closed is set and the condition is broadcast,
// which causes this loop to continue and return from the Pop().
if p.closed {
return nil, fmt.Errorf(queueClosed)
}
p.cond.Wait()
}
obj, err := p.activeQ.Pop()
if err != nil {
return nil, err
}
pInfo := obj.(*framework.QueuedPodInfo)
pInfo.Attempts++
p.schedulingCycle++
return pInfo, err
}

将 Pod 加入 activeQ

当 pod 加入 activeQ 后,还会从 unschedulableQ 以及 podBackoffQ 中删除对应 pod 的信息,并使用 cond.Broadcast() 来唤醒阻塞的 Pop。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
// Add adds a pod to the active queue. It should be called only when a new pod
// is added so there is no chance the pod is already in active/unschedulable/backoff queues
func (p *PriorityQueue) Add(pod *v1.Pod) error {
p.lock.Lock()
defer p.lock.Unlock()
pInfo := p.newQueuedPodInfo(pod)
if err := p.activeQ.Add(pInfo); err != nil {
klog.Errorf("Error adding pod %v to the scheduling queue: %v", nsNameForPod(pod), err)
return err
}
if p.unschedulableQ.get(pod) != nil {
klog.Errorf("Error: pod %v is already in the unschedulable queue.", nsNameForPod(pod))
p.unschedulableQ.delete(pod)
}
// Delete pod from backoffQ if it is backing off
if err := p.podBackoffQ.Delete(pInfo); err == nil {
klog.Errorf("Error: pod %v is already in the podBackoff queue.", nsNameForPod(pod))
}
metrics.SchedulerQueueIncomingPods.WithLabelValues("active", PodAdd).Inc()
p.PodNominator.AddNominatedPod(pod, "")
p.cond.Broadcast()

return nil
}

当 Pod 调度失败时进入失败队列

当 pod 调度失败时,会调用 sched.Error() ,其中调用了 p.AddUnschedulableIfNotPresent() .

决定 pod 调度失败时进入 podBackoffQ 还是 unschedulableQ :如果 moveRequestCycle 大于 podSchedulingCycle ,则进入 podBackoffQ ,否则进入 unschedulableQ .

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
// AddUnschedulableIfNotPresent inserts a pod that cannot be scheduled into
// the queue, unless it is already in the queue. Normally, PriorityQueue puts
// unschedulable pods in `unschedulableQ`. But if there has been a recent move
// request, then the pod is put in `podBackoffQ`.
func (p *PriorityQueue) AddUnschedulableIfNotPresent(pInfo *framework.QueuedPodInfo, podSchedulingCycle int64) error {
...
// If a move request has been received, move it to the BackoffQ, otherwise move
// it to unschedulableQ.
if p.moveRequestCycle >= podSchedulingCycle {
if err := p.podBackoffQ.Add(pInfo); err != nil {
return fmt.Errorf("error adding pod %v to the backoff queue: %v", pod.Name, err)
}
} else {
p.unschedulableQ.addOrUpdate(pInfo)
}
...
}

何时 moveRequestCycle >= podSchedulingCycle

  1. 我们在集群资源变更的时候(例如添加 Node 或者删除 Pod),会有回调函数尝试将 unschedulableQ 中之前因为资源不满足需求的 pod 放入 activeQ 或者 podBackoffQ ,及时进行调度。
  2. 调度队列会每隔 30s 定时运行 flushUnschedulableQLeftover ,尝试调度 unschedulableQ 中的 pod。

这两者都会调用 movePodsToActiveOrBackoffQueue 函数,并将 moveRequestCycle 设为 p.schedulingCycle.

1
2
3
4
5
func (p *PriorityQueue) movePodsToActiveOrBackoffQueue(podInfoList []*framework.QueuedPodInfo, event string) {
...
p.moveRequestCycle = p.schedulingCycle
p.cond.Broadcast()
}

podBackoffQ 中 pod 的生命周期

加入 podBackoffQ

有两种情况会让 pod 加入 podBackoffQ:

  1. 调度失败。如果调度失败,并且集群资源发生变更,即 moveRequestCycle >= podSchedulingCycle ,pod 就会加入到 podBackoffQ 中。
  2. 从 unschedulableQ 中转移。当集群资源发生变化的时候,最终会调用 movePodsToActiveOrBackoffQueue 将 unschedulableQ 的 pod 转移到 podBackoffQ 或者 activeQ 中。转移到 podBackoffQ 的条件是 p.isPodBackingoff(pInfo) ,即 pod 仍然处于 backoff 状态。

退出 podBackoffQ

调度器会定时让 pod 从 podBackoffQ 转移到 activeQ 中。

sched.SchedulingQueue.Run 中运行的 flushBackoffQCompleted cronjob 会每隔 1s 按照优先级(优先级是按照 backoffTime 排序)依次将满足 backoffTime 条件的 pod 从 podBackoffQ 转移到 activeQ 中,直到遇到一个不满足 backoffTime 条件的 pod。

unschedulableQ 中 pod 的生命周期

加入 unschedulableQ

只有一种情况会让 pod 加入 unschedulableQ,那就是调度失败。如果调度失败,并且集群资源没有发生变更,即 moveRequestCycle < podSchedulingCycle ,那么 pod 就会加入到 unschedulableQ 中。

退出 unschedulableQ

调度器会同样定时让 pod 从 unschedulableQ 转移到 podBackoffQ 或者 activeQ 中。

sched.SchedulingQueue.Run 中运行的 flushUnschedulableQLeftover 最终会调用 movePodsToActiveOrBackoffQueue 将 pod 分别加入到 podBackoffQ 或者 activeQ 中。

总结

Kubernetes scheduler 是 kubernetes 中相当重要的组件,基本上各个云平台都会根据自己的业务模型和需求自定义调度器,例如 华为的 Volcano 计算框架。

通过这方面的学习,能在自定义调度器的开发中更加得心应手。

Reference

k8s source code

图解kubernetes调度器SchedulingQueue核心源码实现

深入理解k8s调度器与调度框架核心源码

Kubernetes资源调度——scheduler