Kubernetes Scheduler

The Kubernetes Scheduler is a control plane component responsible for assigning newly created pods to nodes in the cluster. It determines the best node for running a pod based on resource requirements, constraints, and other policies.

With Kubeadm, kube-scheduler, is deployed as a POD inside the cluster. It can be checked inside kube-system namespace.

Core Responsibilities

The Scheduler’s primary function is to:

  • Watch for newly created pods with no assigned node
  • Select the most suitable node for each pod
  • Inform the API server of the scheduling decisions
  • Balance workload distribution across the cluster

How Scheduling Works

The scheduling process follows a two-step operation:

1. Filtering (Predicates)

First, the scheduler filters out nodes that cannot run the pod:

  • Nodes without sufficient resources (CPU, memory, GPU)
  • Nodes that don’t match node selectors or affinity rules
  • Nodes with taints that aren’t tolerated by the pod
  • Nodes that don’t have required volumes available

2. Scoring (Priorities)

Next, the scheduler ranks the remaining nodes:

  • Resource allocation: Balancing resource utilization
  • Spreading: Distributing pods across nodes, zones, or domains
  • Affinity/anti-affinity: Preferring or avoiding co-location with other pods
  • Taints/tolerations: Soft preferences for node selection
  • Custom policies: User-defined scoring functions

Finally, the pod is scheduled to the node with the highest score.

Scheduling Policies

Resource Requests and Limits

The scheduler uses pod resource requests (not limits) for making scheduling decisions:

  • CPU requests: Minimum guaranteed CPU
  • Memory requests: Minimum guaranteed memory
  • Extended resources: GPUs, FPGAs, or other specialized hardware

Node Selection Constraints

Pods can specify node selection requirements:

  • nodeSelector: Simple key-value pair matching for node labels
  • Node affinity: More expressive requirements for node selection
    • requiredDuringSchedulingIgnoredDuringExecution: Hard requirements
    • preferredDuringSchedulingIgnoredDuringExecution: Soft preferences

Pod Affinity and Anti-Affinity

Pods can specify relationships with other pods:

  • Pod affinity: Attract pods to each other (co-location)
  • Pod anti-affinity: Repel pods from each other (separation)
    • Useful for high availability by spreading replicas
    • Important for performance isolation

Taints and Tolerations

  • Taints: Properties on nodes that repel pods
  • Tolerations: Properties on pods that allow (but don’t require) scheduling on tainted nodes

Common use cases:

  • Dedicated nodes for specific workloads
  • Nodes reserved for specific users
  • Preventing scheduling on problematic nodes

Advanced Scheduling

Multiple Schedulers

Kubernetes supports running multiple schedulers simultaneously:

  • Pods can specify which scheduler should handle them
  • Custom schedulers can implement domain-specific logic
  • Default scheduler continues to handle regular pods

Scheduler Extenders

External processes that the scheduler calls out to for:

  • Additional filtering
  • Custom prioritization
  • Binding decisions

Scheduler Profiles

Configure multiple scheduling profiles with different plugin sets:

  • Pods can select a profile via the .spec.schedulerName field
  • Allows different workloads to have different scheduling behaviors
  • Introduced in Kubernetes 1.18

Performance and Capacity

  • The scheduler is designed to handle large clusters
  • Scheduling decisions are typically made in milliseconds
  • Scheduler performance can impact pod startup latency
  • In very large clusters, scheduler performance becomes critical

Scheduler Plugins

The scheduler uses a plugin architecture with extension points:

  • PreFilter: Run before filtering
  • Filter: Node filtration
  • PreScore: Run before scoring
  • Score: Node scoring
  • Reserve: Reserve node resources
  • Permit: Allow or reject scheduling
  • PreBind: Run before binding
  • Bind: Bind pod to node
  • PostBind: Run after binding

Common Scheduling Scenarios

High Availability

  • Spreading pods across failure domains (nodes, racks, zones)
  • Using pod anti-affinity to prevent co-location
  • Ensuring critical services don’t share fate

Resource Efficiency

  • Bin-packing pods onto nodes to maximize utilization
  • Co-locating complementary workloads
  • Balancing the cluster to avoid hotspots

Specialized Hardware

  • Directing pods to nodes with GPUs, FPGAs, or other hardware
  • Managing access to limited specialized resources
  • Isolating high-performance workloads

Troubleshooting Scheduling

Common issues include:

  • Insufficient resources: Cluster doesn’t have enough CPU, memory, or other resources
  • Affinity/anti-affinity constraints: Too restrictive conditions
  • Node selectors: No nodes match the required labels
  • Taints without tolerations: Pods can’t tolerate node taints
  • PersistentVolume availability: Required volumes not available on nodes

Diagnostic commands:

  • kubectl describe pod <pod-name>: Check Events section
  • kubectl get events: View cluster-wide scheduling events
  • Scheduler logs: Examine detailed scheduling decisions