Kubernetes Scheduler
The Kubernetes Scheduler is a control plane component responsible for assigning newly created pods to nodes in the cluster. It determines the best node for running a pod based on resource requirements, constraints, and other policies.
With kubeadm, kube-scheduler is deployed as a static pod on the control plane node and can be inspected in the kube-system namespace.
Core Responsibilities
The scheduler’s primary functions are to:
- Watch for newly created pods with no assigned node
- Select the most suitable node for each pod
- Inform the API server of the scheduling decisions
- Balance workload distribution across the cluster
How Scheduling Works
The scheduling process follows a two-step operation:
1. Filtering (Predicates)
First, the scheduler filters out nodes that cannot run the pod:
- Nodes without sufficient resources (CPU, memory, GPU)
- Nodes that don’t match node selectors or affinity rules
- Nodes with taints that aren’t tolerated by the pod
- Nodes that don’t have required volumes available
2. Scoring (Priorities)
Next, the scheduler ranks the remaining nodes:
- Resource allocation: Balancing resource utilization
- Spreading: Distributing pods across nodes, zones, or domains
- Affinity/anti-affinity: Preferring or avoiding co-location with other pods
- Taints/tolerations: Soft preferences for node selection
- Custom policies: User-defined scoring functions
Finally, the pod is scheduled to the node with the highest score.
Scheduling Policies
Resource Requests and Limits
The scheduler uses pod resource requests (not limits) for making scheduling decisions:
- CPU requests: Minimum guaranteed CPU
- Memory requests: Minimum guaranteed memory
- Extended resources: GPUs, FPGAs, or other specialized hardware
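The distinction between requests and limits can be seen in a pod spec. In this hedged sketch (the pod name, image, and values are illustrative), only the requests block influences where the scheduler places the pod; limits are enforced at runtime by the kubelet:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: requests-demo        # illustrative name
spec:
  containers:
  - name: app
    image: nginx:1.25
    resources:
      requests:
        cpu: "250m"          # used by the scheduler for filtering and scoring
        memory: "128Mi"
      limits:
        cpu: "500m"          # enforced at runtime, not used for scheduling
        memory: "256Mi"
```

A node is filtered out only if its allocatable capacity minus the requests of pods already placed there cannot cover this pod's requests.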
Node Selection Constraints
Pods can specify node selection requirements:
- nodeSelector: Simple key-value pair matching for node labels
- Node affinity: More expressive requirements for node selection
- requiredDuringSchedulingIgnoredDuringExecution: Hard requirements
- preferredDuringSchedulingIgnoredDuringExecution: Soft preferences
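The two affinity variants can be combined in one pod spec. A minimal sketch, assuming nodes carry a disktype label and the standard topology.kubernetes.io/zone label (the specific keys and values here are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-demo            # illustrative name
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: only nodes labeled disktype=ssd are candidates
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values: ["ssd"]
      # Soft preference: among candidates, favor a particular zone
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-east-1a"]
  containers:
  - name: app
    image: nginx:1.25
```

The required term participates in filtering; the preferred term only adds to a node's score, so the pod still schedules if no node in the preferred zone is available.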
Pod Affinity and Anti-Affinity
Pods can specify relationships with other pods:
- Pod affinity: Attract pods to each other (co-location)
- Pod anti-affinity: Repel pods from each other (separation)
- Useful for high availability by spreading replicas
- Important for performance isolation
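The high-availability use case is typically expressed as anti-affinity between replicas of the same Deployment. A sketch (names and image are illustrative) that forces each replica onto a different node by using the hostname as the topology key:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                      # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: no two pods with app=web on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: nginx:1.25
```

Switching topologyKey to topology.kubernetes.io/zone would spread replicas across zones instead of nodes.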
Taints and Tolerations
- Taints: Properties on nodes that repel pods
- Tolerations: Properties on pods that allow (but don’t require) scheduling on tainted nodes
Common use cases:
- Dedicated nodes for specific workloads
- Nodes reserved for specific users
- Preventing scheduling on problematic nodes
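The dedicated-nodes use case pairs a node taint with a matching pod toleration. Assuming a node has been tainted with dedicated=gpu:NoSchedule (the key, value, and image below are illustrative), a pod must carry a matching toleration to land there:

```yaml
# Assumes the node was tainted with:
#   kubectl taint nodes <node-name> dedicated=gpu:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job                  # illustrative name
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.0-base-ubuntu22.04
```

Note the toleration only permits scheduling on the tainted node; it does not attract the pod there. Combining the toleration with a nodeSelector or node affinity is the usual way to both allow and direct placement.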
Advanced Scheduling
Multiple Schedulers
Kubernetes supports running multiple schedulers simultaneously:
- Pods can specify which scheduler should handle them
- Custom schedulers can implement domain-specific logic
- Default scheduler continues to handle regular pods
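A pod opts into a non-default scheduler through its spec. A minimal sketch, where my-custom-scheduler is a hypothetical scheduler name that would have to match a scheduler actually running in the cluster:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled         # illustrative name
spec:
  schedulerName: my-custom-scheduler   # hypothetical; omit to use "default-scheduler"
  containers:
  - name: app
    image: nginx:1.25
```

If no scheduler with that name is running, the pod simply remains Pending, since nothing claims it.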
Scheduler Extenders
External processes that the scheduler calls out to for:
- Additional filtering
- Custom prioritization
- Binding decisions
Scheduler Profiles
Configure multiple scheduling profiles with different plugin sets:
- Pods can select a profile via the .spec.schedulerName field
- Allows different workloads to have different scheduling behaviors
- Introduced in Kubernetes 1.18
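Profiles are declared in the scheduler's configuration file. A hedged sketch of a KubeSchedulerConfiguration with two profiles, where no-spread-scheduler is a hypothetical profile name that disables topology-spread scoring:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
# Standard behavior, claimed by pods with no schedulerName set
- schedulerName: default-scheduler
# Hypothetical profile: same scheduler binary, different plugin set
- schedulerName: no-spread-scheduler
  plugins:
    score:
      disabled:
      - name: PodTopologySpread
```

Both profiles run inside one kube-scheduler process; a pod picks one by setting .spec.schedulerName to the profile's name.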
Performance and Capacity
- The scheduler is designed to handle large clusters
- Scheduling decisions are typically made in milliseconds
- Scheduler performance can impact pod startup latency
- In very large clusters, scheduler performance becomes critical
Scheduler Plugins
The scheduler uses a plugin architecture with extension points:
- PreFilter: Run before filtering
- Filter: Node filtration
- PreScore: Run before scoring
- Score: Node scoring
- Reserve: Reserve node resources
- Permit: Allow or reject scheduling
- PreBind: Run before binding
- Bind: Bind pod to node
- PostBind: Run after binding
Common Scheduling Scenarios
High Availability
- Spreading pods across failure domains (nodes, racks, zones)
- Using pod anti-affinity to prevent co-location
- Ensuring critical services don’t share fate
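Spreading across failure domains can also be expressed directly with topology spread constraints, without writing anti-affinity rules. A sketch (the label selector and skew value are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spread-demo              # illustrative name
  labels:
    app: web
spec:
  topologySpreadConstraints:
  - maxSkew: 1                               # zones may differ by at most one pod
    topologyKey: topology.kubernetes.io/zone # spread across zones
    whenUnsatisfiable: DoNotSchedule         # hard constraint; use ScheduleAnyway for soft
    labelSelector:
      matchLabels:
        app: web
  containers:
  - name: app
    image: nginx:1.25
```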
Resource Efficiency
- Bin-packing pods onto nodes to maximize utilization
- Co-locating complementary workloads
- Balancing the cluster to avoid hotspots
Specialized Hardware
- Directing pods to nodes with GPUs, FPGAs, or other hardware
- Managing access to limited specialized resources
- Isolating high-performance workloads
Troubleshooting Scheduling
Common issues include:
- Insufficient resources: Cluster doesn’t have enough CPU, memory, or other resources
- Affinity/anti-affinity constraints: Too restrictive conditions
- Node selectors: No nodes match the required labels
- Taints without tolerations: Pods can’t tolerate node taints
- PersistentVolume availability: Required volumes not available on nodes
Diagnostic commands:
- kubectl describe pod <pod-name>: Check the Events section for scheduling failure messages
- kubectl get events: View cluster-wide scheduling events
- Scheduler logs: Examine detailed scheduling decisions