Kubernetes 1.35 “Timbernetes”: 5 features that actually move the needle
One of the most AI / ML-focused Kubernetes releases in the project’s history. 60+ enhancements, 5 features that will genuinely reshape how you run production workloads — distributed training, batch processing, Zero Trust architectures.
01. Native Gang Scheduling Alpha
If you’ve ever launched distributed ML training on Kubernetes, you know the scenario: a job requesting 8 pods × 8 GPUs. The scheduler places 5… then runs out of resources for the last 3. Result: 5 pods burning GPU hours doing nothing while waiting on 3 that will never come. That’s a scheduling deadlock, and until now the workaround relied on external tools like Volcano or Kueue.
1.35 introduces a native PodGroup concept: define the group, set the minimum number of pods required, and the scheduler applies “all-or-nothing” placement. Either the required minimum can be placed together, or no pod in the group is bound to a node. No partial deployments, no wasted resources.
Status: alpha in 1.35 — feature gate required. Worth testing in non-prod if you run AI workloads at scale.
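As a rough sketch of what this could look like, the manifests below follow my reading of the alpha Workload API in 1.35. The API group `scheduling.k8s.io/v1alpha1` and the fields `podGroups`, `policy.gang.minCount` and `workloadRef` are alpha and may change between releases; the object names and the image are placeholders.

```yaml
# Alpha API: assumes the scheduling.k8s.io/v1alpha1 group and the
# gang scheduling feature gates are enabled on the cluster.
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-workload          # hypothetical name
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        minCount: 8                # schedule only if 8 pods fit at once
---
apiVersion: v1
kind: Pod
metadata:
  name: worker-0                   # one of the 8 worker pods
spec:
  workloadRef:                     # ties the pod to the group above
    name: training-workload
    podGroup: workers
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 8
```

Each of the 8 worker pods would carry the same `workloadRef`, so the scheduler can treat them as one unit.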
02. In-Place Pod Resource Updates GA
Six years in the making, and finally stable and generally available. Concretely: you can change a pod’s CPU and memory requests and limits without restarting it.
Typical case: an inference pod runs with 512 MB of memory and starts hitting pressure. Old way: patch the Deployment, trigger a rollout, restart everything. New way:
kubectl patch pod my-pod --subresource resize --patch '
spec:
  containers:
  - name: app
    resources:
      requests:
        memory: "1Gi"
      limits:
        memory: "1Gi"
'
The container keeps running while the kubelet and container runtime adjust its cgroups live. No restart, no dropped connections, no cold cache. A game changer for any workload where restarts are expensive: ML services, batch, stateful loads. Beta since 1.33, GA now, so safe for production.
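Whether a resize really avoids a restart is controlled per resource by the container's `resizePolicy`. A minimal sketch, with a placeholder image and the same `my-pod` / `app` names as the patch above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: app
    image: registry.example.com/inference:latest   # placeholder image
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired     # apply CPU changes without a restart
    - resourceName: memory
      restartPolicy: NotRequired     # apply memory changes without a restart
    resources:
      requests:
        memory: "512Mi"
      limits:
        memory: "512Mi"
```

Setting `restartPolicy: RestartContainer` for a resource instead tells the kubelet to restart that container on resize, which suits applications that only read their memory budget at startup.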
03. Per-container restart rules Beta · ON
Until now, restartPolicy applied to the whole pod: a crashing GPU sidecar restarts the entire pod, including the main training job that has been running for 4 hours.
1.35 introduces per-container policies driven by exit code. Pod with three containers — training job, GPU driver sidecar, logging container. If the GPU driver gets OOMKilled (exit code 137), only that container restarts — training continues. For exit code 1 (likely an application bug), you can choose not to restart and preserve the logs.
Status: beta, enabled by default in 1.35. Usable today.
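A sketch of the scenario above, assuming the `restartPolicyRules` field as I understand it in the beta API. The pod name, container names and images are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-pod                 # hypothetical name
spec:
  restartPolicy: Never               # pod-level default for every container
  containers:
  - name: training
    image: registry.example.com/trainer:latest      # placeholder image
  - name: gpu-driver
    image: registry.example.com/gpu-driver:latest   # placeholder image
    restartPolicy: Never             # container-level policy, set alongside the rules
    restartPolicyRules:
    - action: Restart                # restart only this container...
      exitCodes:
        operator: In
        values: [137]                # ...when it exits with code 137
```

Any exit code that matches no rule, such as 1, falls back to the container's `restartPolicy: Never`, so the container stays terminated and its logs remain available.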
04. Native Pod Certificates Beta
Today, deploying Zero Trust on Kubernetes typically means cert-manager or SPIRE for issuance, CRDs to orchestrate requests, Secrets for storage, and sidecars / init containers for rotation. It works, but it’s a lot of moving parts.
1.35 makes the mechanism native:
- the kubelet generates keys locally;
- it creates a PodCertificateRequest;
- the signer named in the request issues the certificate;
- the kubelet writes the credential bundle into the pod’s filesystem;
- automatic rotation, no sidecar required.
Security bonus: the API server enforces node restrictions at admission time, eliminating one of the classic pitfalls of third-party signers. Pure mTLS flow, no bearer token in the issuance path.
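From the workload's side, the flow above is requested through a projected volume. The sketch below assumes the `podCertificate` projected volume source and its `signerName`, `keyType` and `credentialBundlePath` fields as I understand them in the beta API; the signer name, pod name, image and mount path are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mtls-client                  # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    volumeMounts:
    - name: pod-cert
      mountPath: /var/run/pod-cert   # hypothetical mount path
      readOnly: true
  volumes:
  - name: pod-cert
    projected:
      sources:
      - podCertificate:
          signerName: example.com/workload-signer   # hypothetical signer
          keyType: ED25519
          credentialBundlePath: credentialbundle.pem
```

The application then reads the key and certificate chain from `/var/run/pod-cert/credentialbundle.pem` and should reload the file when it changes, since rotation happens in place.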
cert-manager and SPIRE aren’t going anywhere — they cover advanced use cases this feature doesn’t address. But for basic workload identity and service-to-service mTLS, native is now a credible option.
05. Mutable Job Resources (suspended) Alpha
The scenario: you launch a batch Job, six hours later it dies from OOMKill because the memory limit was too low. Before 1.35, you had to delete and recreate the Job — losing status, history, completion tracking.
With 1.35, you can now suspend a Job, adjust its resources, then resume it:
# 1. Suspend
kubectl patch job my-job --type=merge -p '{"spec":{"suspend":true}}'
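# 2. Update resources (strategic merge matches the container by name; a JSON merge patch would replace the whole containers list)
kubectl patch job my-job --type=strategic -p '{"spec":{"template":{"spec":{"containers":[{"name":"worker","resources":{"limits":{"memory":"4Gi"}}}]}}}}'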
# 3. Resume
kubectl patch job my-job --type=merge -p '{"spec":{"suspend":false}}'
Same Job object, execution resumed, completion tracking intact. Huge for batch processing and ML training, where iterating on resource requirements is the norm. Status: alpha — feature gate required.
Deprecations & upgrade notes
Before scheduling a 1.35 upgrade, audit your node configuration:
- cgroups v1 support is removed;
- IPVS mode in kube-proxy is deprecated;
- containerd 1.x support ends with 1.35.
These can silently block a migration — review your kernels, runtimes, and proxy configs before planning the switch.
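For the kube-proxy item, one possible migration target is the nftables mode. This is my assumption about where to go, not something the release mandates, and it needs a reasonably recent kernel on every node. A minimal configuration fragment:

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: nftables      # previously: ipvs
```

Roll this out on a staging cluster first, since service routing behavior and any tooling that inspects IPVS tables will change.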
Key takeaways
1.35 has a clear theme: Kubernetes is getting serious about AI and ML. Native gang scheduling, live resizing, fine-grained restart rules — pain points ML engineers have been hacking around for years are now handled at the platform core.
On the production side, the two features I’d start integrating right away are in-place resize (GA, low risk) and per-container restart rules (beta, on by default). The other three deserve a staging-cluster PoC before any commitment.