Kubernetes 1.35 “Timbernetes”: 5 major updates for AI and production

Kubernetes 1.35 “Timbernetes”: 5 features that actually move the needle

This is one of the most AI/ML-focused Kubernetes releases in the project’s history. Of its 60+ enhancements, five will genuinely reshape how you run production workloads such as distributed training, batch processing, and Zero Trust architectures.

7 min read · K8s admins / DevOps / SREs · May 2026

01. Native Gang Scheduling (Alpha)

If you’ve ever launched distributed ML training on Kubernetes, you know the scenario: a job requesting 8 pods × 8 GPUs. The scheduler places 5… then runs out of resources for the last 3. Result: 5 pods burning GPU hours doing nothing while waiting on 3 that will never come. That’s a scheduling deadlock, and until now the workaround has relied on external tools like Volcano or Kueue.

1.35 introduces native pod groups through a new alpha Workload API: define the group, set the minimum number of pods required, and the scheduler applies “all-or-nothing” placement. Either the whole group can be scheduled together, or none of its pods are bound. No partial deployments, no wasted resources.

Status: alpha in 1.35 — feature gate required. Worth testing in non-prod if you run AI workloads at scale.
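To make the pod-group idea concrete, here is a minimal sketch of what such a manifest might look like. It assumes the alpha scheduling.k8s.io/v1alpha1 Workload API, with pods pointing at their group through spec.workloadRef. Field names can change while the feature is alpha, so check them against the documentation for your exact version. The object names, the image, and the GPU resource name are placeholders, not values from a real cluster.

```shell
# Sketch only: alpha API, field names may change between releases.
# All names and the image below are hypothetical placeholders.
cat > gang-workload.yaml <<'EOF'
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-workload
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        minCount: 8          # bind pods only if 8 of them can be placed together
---
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
spec:
  workloadRef:               # ties this pod to the group declared above
    name: training-workload
    podGroup: workers
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest
    resources:
      limits:
        nvidia.com/gpu: 8
EOF
echo "wrote gang-workload.yaml"

# On a 1.35 test cluster with the alpha gates and API group enabled:
#   kubectl apply -f gang-workload.yaml
```

Only one worker pod is shown. In practice each of the 8 pods (or the pod template of the controller that creates them) would carry the same workloadRef.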

02. In-Place Pod Resource Updates (GA)

Six years in the making, and finally stable and generally available. Concretely: you can change a pod’s CPU and memory requests and limits without restarting it.

Typical case: an inference pod runs with 512 MB of memory and starts hitting pressure. Old way: patch the Deployment, trigger a rollout, restart everything. New way:

kubectl patch pod my-pod --subresource resize --patch '
spec:
  containers:
  - name: app
    resources:
      requests:
        memory: "1Gi"
      limits:
        memory: "1Gi"
'

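The container keeps running while the kubelet and container runtime adjust its cgroup limits live. No restart, no dropped connections, no cold cache (unless the container’s resizePolicy asks for a restart on that resource). This is a game changer for any workload where restarts are expensive: ML services, batch, stateful loads. Beta since 1.33 and GA now, so it is safe for production.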
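After a resize request it is worth confirming what the node actually applied. The helper below is a sketch: show_resize_state is a hypothetical function name, and it assumes the resized container is the first one in the pod. It reads the resources reported in the pod status, the restart count, and any message on the PodResizePending condition.

```shell
# Print what the kubelet reports for a pod after a resize request.
# $1 = pod name. Assumes the resized container is the first in the pod.
show_resize_state() {
  pod="$1"

  echo "applied resources:"
  kubectl get pod "$pod" -o jsonpath='{.status.containerStatuses[0].resources}'
  echo

  echo "restart count (should not change after an in-place resize):"
  kubectl get pod "$pod" -o jsonpath='{.status.containerStatuses[0].restartCount}'
  echo

  echo "pending resize message (empty when nothing is pending):"
  kubectl get pod "$pod" -o jsonpath='{.status.conditions[?(@.type=="PodResizePending")].message}'
  echo
}

# Usage:
#   show_resize_state my-pod
```

If the pending message mentions that the node cannot fit the request, the resize has been deferred or rejected and the old limits are still in force.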

03. Per-container restart rules (Beta, on by default)

Until now, restartPolicy applied to the whole pod. Every container shared one policy, so you could not restart a crashing GPU sidecar on its own terms while treating the main training job, already 4 hours into its run, differently.

1.35 introduces per-container policies driven by exit code. Take a pod with three containers: a training job, a GPU driver sidecar, and a logging container. If the GPU driver gets OOMKilled (exit code 137), only that container restarts and training continues. For exit code 1 (likely an application bug), you can choose not to restart and preserve the logs.

Status: beta, enabled by default in 1.35. Usable today.
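A sketch of how the scenario above could be expressed, using the container-level restartPolicy and restartPolicyRules fields. Container names and images are hypothetical. The pod-level policy is Never, so the trainer is left alone after a failure, and the sidecar is restarted only when it exits with code 137.

```shell
# Sketch: container names and images are placeholders.
cat > restart-rules-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
spec:
  restartPolicy: Never            # default for containers without their own policy
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest
    # No container-level policy: an exit code 1 leaves the container
    # stopped, so its logs stay available for inspection.
  - name: gpu-helper
    image: registry.example.com/gpu-helper:latest
    restartPolicy: Never          # required when rules are specified
    restartPolicyRules:
    - action: Restart             # restart only on the listed exit codes
      exitCodes:
        operator: In
        values: [137]
EOF
echo "wrote restart-rules-pod.yaml"

# kubectl apply -f restart-rules-pod.yaml
```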

04. Native Pod Certificates (Beta)

Today, deploying Zero Trust on Kubernetes typically means cert-manager or SPIRE for issuance, CRDs to orchestrate requests, Secrets for storage, and sidecars / init containers for rotation. It works, but it’s a lot of moving parts.

1.35 brings this mechanism into the platform:

  • the kubelet generates keys locally;
  • it creates a PodCertificateRequest;
  • the API server issues the certificate directly;
  • the kubelet writes the credential bundle into the pod’s filesystem;
  • automatic rotation, no sidecar required.

Security bonus: the API server enforces node restrictions at admission time, eliminating one of the classic pitfalls of third-party signers. Pure mTLS flow, no bearer token in the issuance path.

cert-manager and SPIRE aren’t going anywhere — they cover advanced use cases this feature doesn’t address. But for basic workload identity and service-to-service mTLS, native is now a credible option.
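From the workload’s side, the flow above is requested through a projected volume. The manifest below is a sketch under assumptions: it uses the podCertificate projected volume source as I understand it from the feature’s documentation, and the signer name, image, and mount path are hypothetical. Verify the field names against the API reference for your version before relying on them.

```shell
# Sketch: signerName, image, and paths are placeholders, not real values.
cat > pod-cert.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: mtls-client
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest
    volumeMounts:
    - name: workload-cert
      mountPath: /var/run/workload-cert
      readOnly: true
  volumes:
  - name: workload-cert
    projected:
      sources:
      - podCertificate:
          signerName: example.com/workload-signer   # hypothetical signer
          keyType: ED25519
          credentialBundlePath: credentialbundle.pem
EOF
echo "wrote pod-cert.yaml"

# kubectl apply -f pod-cert.yaml
```

The application then reads the key and certificate chain from /var/run/workload-cert/credentialbundle.pem and reloads it when the file changes.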

05. Mutable Resources for Suspended Jobs (Alpha)

The scenario: you launch a batch Job, and six hours later it dies from an OOM kill because the memory limit was too low. Before 1.35, you had to delete and recreate the Job, losing status, history, and completion tracking.

With 1.35, you can now suspend a Job, adjust its resources, then resume it:

# 1. Suspend
kubectl patch job my-job --type=merge -p '{"spec":{"suspend":true}}'

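# 2. Update resources (strategic merge matches the container by name;
#    a JSON merge patch would replace the whole containers list and drop the image)
kubectl patch job my-job --type=strategic -p '{"spec":{"template":{"spec":{"containers":[{"name":"worker","resources":{"limits":{"memory":"4Gi"}}}]}}}}'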

# 3. Resume
kubectl patch job my-job --type=merge -p '{"spec":{"suspend":false}}'

Same Job object, execution resumed, completion tracking intact. Huge for batch processing and ML training, where iterating on resource requirements is the norm. Status: alpha — feature gate required.
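Between the steps it helps to check the Job’s state, since suspending a Job terminates its active pods and that takes a moment. The helper below is a small sketch: show_job_state is a hypothetical name, and it assumes the container is called worker as in the commands above.

```shell
# Print the suspend flag, active pod count, and the worker's resources.
# $1 = Job name. Assumes the container is named "worker".
show_job_state() {
  job="$1"

  echo "suspended:"
  kubectl get job "$job" -o jsonpath='{.spec.suspend}'
  echo

  echo "active pods (empty means none are running):"
  kubectl get job "$job" -o jsonpath='{.status.active}'
  echo

  echo "worker resources in the pod template:"
  kubectl get job "$job" -o jsonpath='{.spec.template.spec.containers[?(@.name=="worker")].resources}'
  echo
}

# Usage: run after step 1 and again after step 2.
#   show_job_state my-job
```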

Deprecations & upgrade notes

Before scheduling a 1.35 upgrade, audit your node configuration:

  • cgroups v1 support is removed;
  • IPVS mode in kube-proxy is deprecated;
  • containerd 1.x support ends with 1.35.

These can silently block a migration — review your kernels, runtimes, and proxy configs before planning the switch.
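The three checks can be scripted. This is a sketch, not a complete audit: the function names are hypothetical, the cgroup check must run on each node, and the kube-proxy check assumes a kubeadm-style cluster where the configuration lives in the kube-proxy ConfigMap under the config.conf key.

```shell
# Map the filesystem type of /sys/fs/cgroup to a cgroup version.
cgroup_version() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1" ;;
    *)         echo "unknown" ;;
  esac
}

# List each node with its container runtime and version.
list_runtimes() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'
}

# Show the configured kube-proxy mode (kubeadm layout assumed).
proxy_mode() {
  kubectl -n kube-system get configmap kube-proxy \
    -o jsonpath='{.data.config\.conf}' | grep -E '^mode:'
}

# Usage:
#   on each node:   cgroup_version "$(stat -fc %T /sys/fs/cgroup/)"
#   from a client:  list_runtimes; proxy_mode
```

A node reporting v1, a runtime listed as containerd://1.x, or a proxy mode of ipvs each points at work to do before or shortly after the upgrade.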

Key takeaways

1.35 has a clear theme: Kubernetes is getting serious about AI and ML. Native gang scheduling, live resizing, fine-grained restart rules — pain points ML engineers have been hacking around for years are now handled at the platform core.

Gang scheduling: no more deadlocks for distributed jobs
In-place resize GA: vertical scaling without restarts
Per-container restart: surgical recovery
Pod certificates: simplified Zero Trust
Mutable jobs: fix resources without recreation

On the production side, the two features I’d start integrating right away are in-place resize (GA, low risk) and per-container restart rules (beta, on by default). The other three deserve a staging-cluster PoC before any commitment.
