OpenShift AI: Complete Enterprise AI/ML Platform Guide [2026]
![OpenShift AI: Complete Enterprise AI/ML Platform Guide [2026]](/images/blog/openshift/openshift-ai-hero.webp)
AI is hot, but running AI in an enterprise is a very different exercise from experimenting in a Jupyter Notebook.
Data security, model governance, GPU scheduling, version control, CI/CD... each one is a potential pitfall. OpenShift AI aims to address these challenges by providing an enterprise-grade AI/ML platform.
This article will provide a complete introduction to OpenShift AI, from platform features to practical applications, helping you evaluate whether it's suitable for your AI workloads. If you're not familiar with OpenShift yet, we recommend first reading the OpenShift Complete Guide.
Introduction to OpenShift AI
What is OpenShift AI?
OpenShift AI is Red Hat's enterprise-grade AI/ML platform, formerly known as Red Hat OpenShift Data Science (RHODS).
On top of the OpenShift container platform, it provides complete machine learning lifecycle support:
- Data preparation and exploration
- Model development and training
- Model deployment and serving
- Model monitoring and governance
Product Positioning
OpenShift AI isn't trying to compete with AWS SageMaker or GCP Vertex AI in the "fully managed" market. Its positioning is:
"Build an enterprise-grade AI/ML platform on your own infrastructure"
Suitable for:
- Organizations with data sovereignty requirements
- Enterprises wanting to run AI on private or hybrid cloud
- Teams already using OpenShift
Core Features Overview
| Feature | Description |
|---|---|
| Data Science Project | Team collaboration workspace |
| Workbenches | Jupyter Notebook development environment |
| Model Serving | Model deployment and inference services |
| Pipelines | ML Pipeline orchestration |
| Model Registry | Model version management (preview) |
| Lightspeed | AI-assisted operations |
OpenShift AI Architecture
Platform Architecture
OpenShift AI is built on top of OpenShift:
┌─────────────────────────────────────────────────┐
│ OpenShift AI │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Workbench│ │ Serving │ │Pipeline │ │
│ └─────────┘ └─────────┘ └─────────┘ │
├─────────────────────────────────────────────────┤
│ OpenShift │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ GPU │ │ Storage │ │ Network │ │
│ │ Support │ │ (ODF) │ │ (SDN) │ │
│ └─────────┘ └─────────┘ └─────────┘ │
├─────────────────────────────────────────────────┤
│ Infrastructure (Cloud/Bare Metal) │
└─────────────────────────────────────────────────┘
Core Components
1. Dashboard
Web UI entry point, providing:
- Data Science Project management
- Workbench creation and access
- Model Server management
- Pipeline execution monitoring
2. Notebook Controller
Manages Jupyter Notebook environments:
- Multiple preset images (PyTorch, TensorFlow, Standard DS)
- Custom image support
- GPU allocation
3. Model Mesh / KServe
Model inference services:
- Supports multiple model formats
- Auto-scaling
- A/B testing
4. Data Science Pipelines
Based on Kubeflow Pipelines:
- Visual Pipeline editing
- Scheduled execution
- Experiment tracking
Integration with OpenShift
OpenShift AI deeply integrates with OpenShift features:
| OpenShift Feature | OpenShift AI Usage |
|---|---|
| RBAC | Control who can access which projects |
| Network Policy | Isolate ML workloads |
| PVC/ODF | Dataset and model storage |
| GPU Operator | GPU resource management |
| Monitoring | Model service monitoring |
AI/ML Workflow
Complete Workflow
OpenShift AI supports end-to-end ML workflow:
Data Prep → Feature Eng → Model Training → Model Eval → Model Deploy → Monitoring
│ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼
Workbench Workbench Training Registry Serving Monitoring
+ Pipeline Job (KServe)
Data Preparation
Perform data exploration and preparation in Workbench:
# Connect to data sources
import os
import boto3
import pandas as pd
from sqlalchemy import create_engine

# S3 data
s3 = boto3.client(
    's3',
    endpoint_url=os.environ['S3_ENDPOINT'],
    aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
)

# Database
engine = create_engine(os.environ['DATABASE_URL'])
df = pd.read_sql("SELECT * FROM training_data", engine)
Data can be stored in:
- OpenShift Data Foundation (ODF)
- S3-compatible storage
- External databases
Model Training
Single Machine Training:
Train directly in Workbench (suitable for small models):
import torch
from transformers import Trainer, TrainingArguments

# model and train_dataset are assumed to be defined earlier in the notebook
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

trainer.train()
Distributed Training:
Large models use Kubernetes native distributed training:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:latest
            resources:
              limits:
                nvidia.com/gpu: 1
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:latest
            resources:
              limits:
                nvidia.com/gpu: 1
Model Serving
Trained models can be deployed via KServe:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://models/my-model"
Supported model formats:
- TensorFlow
- PyTorch
- ONNX
- scikit-learn
- XGBoost
- LightGBM
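Once an InferenceService is ready, clients call it over HTTP using the open inference (V2) protocol. As a sketch, the request body for a simple tabular model can be built like this; the input name, shape, and endpoint URL are model-specific assumptions, not values from this article:

```python
import json

def build_v2_request(input_name, data):
    """Build a KServe V2 (open inference protocol) request body.

    Assumes a flat list of float features; input_name must match
    the model's declared input tensor name.
    """
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": [1, len(data)],
                "datatype": "FP32",
                "data": data,
            }
        ]
    }

# Hypothetical 4-feature sklearn model
payload = build_v2_request("input-0", [5.1, 3.5, 1.4, 0.2])
body = json.dumps(payload)
# POST this body to https://<route>/v2/models/my-model/infer
```

The response carries a matching `outputs` list with the prediction tensor.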
OpenShift Lightspeed
Lightspeed is one of the platform's standout features, letting administrators operate clusters using natural language.
Feature Introduction
Lightspeed is an AI assistant integrated into the OpenShift Console:
- Query cluster status using natural language
- Explain error messages
- Suggest solutions
- Generate YAML configurations
Use Cases
Query Cluster Status:
You: Show me the top 5 Pods with highest CPU usage in the past hour
Lightspeed: Based on Prometheus metrics, the Pods with highest CPU usage in the past hour are:
1. ml-training-job-xyz (namespace: ai-project) - 3.2 cores
2. data-pipeline-abc (namespace: data-eng) - 2.8 cores
...
Troubleshooting:
You: Why does my-deployment's Pod keep CrashLoopBackOff?
Lightspeed: I checked the Pod logs and events, found the following issues:
1. Container can't find environment variable DATABASE_URL at startup
2. Recommend checking if ConfigMap or Secret is correctly configured
...
Generate Configuration:
You: Help me create an HPA that scales my-deployment when CPU exceeds 70%, maximum 10 replicas
Lightspeed: Here's the recommended HorizontalPodAutoscaler configuration:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-deployment-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
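For intuition, the `averageUtilization` target drives the standard Kubernetes HPA scaling formula. A minimal sketch (ignoring stabilization windows and tolerance, which the real controller also applies):

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=10):
    """Sketch of the HPA formula:
    desired = ceil(current * currentMetric / targetMetric),
    clamped to [minReplicas, maxReplicas]."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# 3 replicas at 140% average CPU with a 70% target -> scale to 6
print(desired_replicas(3, 140, 70))  # 6
```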
Security Considerations
Lightspeed's design considers enterprise security needs:
- Can use Red Hat hosted LLM
- Can also connect to self-hosted LLM
- Sensitive data won't be sent externally (configurable)
- Has audit logs
Want to use OpenShift Lightspeed to improve operations efficiency? Book an AI adoption consultation and let us evaluate your use cases.
GPU Support
Serious AI workloads are hard to run without GPUs. OpenShift AI provides complete GPU support through the NVIDIA GPU Operator.
NVIDIA GPU Operator
GPU Operator automatically handles:
- Driver installation
- CUDA Toolkit
- Device Plugin
- GPU Monitoring
Install GPU Operator:
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator
  namespace: nvidia-gpu-operator
spec:
  channel: stable
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
GPU Resource Scheduling
Request GPUs in Workbench or Pod:
resources:
  limits:
    nvidia.com/gpu: 1
OpenShift automatically schedules to nodes with GPUs.
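A quick way to verify scheduling works is a throwaway Pod that runs `nvidia-smi` and exits. This is an illustrative sketch; the Pod name, image tag, and taint key are assumptions that depend on how your GPU nodes are configured:

```yaml
# Hypothetical smoke-test Pod requesting one GPU
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-check
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9   # any CUDA base image works
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu     # only needed if GPU nodes are tainted
    operator: Exists
    effect: NoSchedule
```

If the Pod completes and its logs show the GPU table, driver installation and device plugin registration are working.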
Multi-GPU Training
Distributed training can use multiple GPUs:
resources:
  limits:
    nvidia.com/gpu: 4  # Multiple GPUs on a single node
Or cross-node:
# PyTorchJob cross-node distributed training
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 8  # 8 Workers, each with 1 GPU
GPU Monitoring
GPU Operator automatically integrates monitoring:
- GPU utilization
- GPU memory
- GPU temperature
- Power consumption
The related metrics can be viewed in the OpenShift Monitoring dashboards (Grafana).
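These GPU metrics come from the DCGM exporter and can also be pulled programmatically through the Prometheus HTTP API. A small sketch, where the querier route is a placeholder you would replace with your cluster's monitoring endpoint:

```python
from urllib.parse import urlencode

def prometheus_query_url(base_url, promql):
    """Build a Prometheus HTTP API instant-query URL.

    base_url is the (hypothetical) Thanos/Prometheus route of the
    cluster monitoring stack.
    """
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"

# DCGM_FI_DEV_GPU_UTIL is the DCGM exporter's per-GPU utilization metric
url = prometheus_query_url(
    "https://thanos-querier.example.com",
    "avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)",
)
print(url)
```

Issue an authenticated GET against this URL (e.g. with a service account bearer token) to retrieve the values.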
Development Environment
Jupyter Notebook Integration
OpenShift AI's Workbench is based on Jupyter:
Preset Images:
- Standard Data Science (general purpose)
- PyTorch
- TensorFlow
- CUDA (GPU environment)
Custom Images:
Can create your own Notebook images:
FROM quay.io/opendatahub/notebooks:jupyter-pytorch-2024.1
# Install additional packages
RUN pip install transformers datasets accelerate
# Copy custom configuration
COPY jupyter_notebook_config.py /opt/app-root/etc/
VS Code Server
Besides Jupyter, also supports VS Code Server:
- Full IDE experience
- Extension support
- Terminal access
Environment Variables and Secrets
Securely manage API Keys and credentials:
# Create Secret
apiVersion: v1
kind: Secret
metadata:
  name: ml-credentials
stringData:
  HUGGINGFACE_TOKEN: "hf_xxx"
  S3_ACCESS_KEY: "xxx"
When the Secret is attached to a Workbench, its values are automatically injected as environment variables.
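Inside the notebook, the credentials are then ordinary environment variables. A small defensive sketch (the variable name matches the Secret above; the error message is illustrative):

```python
import os

def load_hf_token():
    """Read the token injected from the ml-credentials Secret.

    Failing early here gives a clearer error than a rejected
    API call deep inside a training run.
    """
    token = os.environ.get("HUGGINGFACE_TOKEN")
    if token is None:
        raise RuntimeError(
            "HUGGINGFACE_TOKEN not set - is the Secret attached to the Workbench?"
        )
    return token
```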
MLOps Practices
Model Version Control
Use Data Science Pipelines to track model versions:
from kfp import dsl

@dsl.component
def train_model(data_path: str, model_output: str):
    # Training logic elided; save the trained model artifact
    model.save(model_output)

@dsl.component
def evaluate_model(model_path: str) -> float:
    # Evaluation logic elided; return the computed metric
    return accuracy

@dsl.pipeline
def ml_pipeline():
    train = train_model(data_path="s3://data", model_output="s3://models/v1")
    evaluate = evaluate_model(model_path=train.outputs['model_output'])
CI/CD for ML
Integrate with OpenShift Pipelines (Tekton):
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: ml-cicd
spec:
  tasks:
  - name: fetch-code
    taskRef:
      name: git-clone
  - name: run-tests
    taskRef:
      name: pytest
    runAfter: [fetch-code]
  - name: train-model
    taskRef:
      name: ml-training
    runAfter: [run-tests]
  - name: deploy-model
    taskRef:
      name: kserve-deploy
    runAfter: [train-model]
A/B Testing
KServe supports Canary deployment:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://models/v2"  # New version
10% of traffic goes to the new model; once it is validated, you can shift to 100%.
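To build intuition for what a weighted split means in practice, here is a deterministic simulation of a 10% canary. This is only a sketch: KServe and the service mesh do the real split at the network layer, not per request ID.

```python
import hashlib

def route_to_canary(request_id: str, canary_percent: int = 10) -> bool:
    """Deterministically route ~canary_percent of traffic to the canary.

    Hashing the request ID into 100 buckets makes routing stable
    per request while matching the configured percentage overall.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return bucket < canary_percent

hits = sum(route_to_canary(f"req-{i}") for i in range(10_000))
print(f"{hits / 100:.1f}% of simulated requests went to the canary")
```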
Security and Compliance
Data Security
Data Isolation:
- Each Data Science Project is a separate Namespace
- Can use Network Policy to limit network access
- Data stored in PVC, can be encrypted
Access Control:
- RBAC controls who can access which projects
- Can integrate with enterprise identity systems (LDAP/AD)
Model Security
Model Access Control:
- Model Server can set authentication
- Limit who can call inference API
Model Auditing:
- Pipeline execution records
- Model version tracking
- Inference logs
Compliance Considerations
OpenShift AI helps meet compliance requirements:
| Requirement | Solution |
|---|---|
| Data Residency | Deploy on your own infrastructure |
| Access Auditing | OpenShift audit logs |
| Model Governance | Model Registry + Pipeline |
| Explainability | Integrate AI Explainability tools |
Deployment and Configuration
Install OpenShift AI
Install from OperatorHub:
- Search for Red Hat OpenShift AI
- Select install to redhat-ods-operator namespace
- Wait for Operator to be ready
Create Data Science Cluster
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    dashboard:
      managementState: Managed
    workbenches:
      managementState: Managed
    datasciencepipelines:
      managementState: Managed
    modelmeshserving:
      managementState: Managed
    kserve:
      managementState: Managed
Resource Configuration
Recommended resource allocation:
| Component | CPU | Memory | Description |
|---|---|---|---|
| Dashboard | 1 | 2Gi | Low load |
| Workbench (small) | 2 | 8Gi | Light development |
| Workbench (large) | 8 | 32Gi | Model training |
| Model Server | Depends on model | Depends on model | Needs evaluation |
FAQ
Q1: How is OpenShift AI different from AWS SageMaker?
The main difference is deployment location. SageMaker is AWS's fully managed service, data and models are on AWS. OpenShift AI can be deployed anywhere—public cloud, private cloud, own data center. Suitable for enterprises with data sovereignty requirements or already using OpenShift.
Q2: How many GPUs do I need to run OpenShift AI?
You don't necessarily need GPUs. Data exploration and small model training can use CPU. But for training deep learning models or real-time inference, GPUs are much faster. Recommendation: 1-2 GPUs for dev/test environments, plan production based on workload.
Q3: Will OpenShift Lightspeed send my data externally?
It can be controlled. Lightspeed supports multiple LLM backends: (1) Red Hat hosted LLM (data goes through Red Hat); (2) Self-hosted LLM (data never leaves). Enterprises can choose based on security requirements.
Q4: Can existing Jupyter Notebooks be used directly?
Mostly yes. OpenShift AI's Workbench is based on standard Jupyter, your notebook files should run directly. But if you have special package requirements, you may need custom images.
Q5: How is OpenShift AI licensing calculated?
OpenShift AI has separate subscription licensing, not included in OpenShift Container Platform. Specific costs require contacting Red Hat or partners. Usually priced by resources used (Cores).
Want to Run AI Workloads on OpenShift?
From GPU configuration to MLOps processes, there are many choices to make and just as many pitfalls to avoid.
Book an AI adoption consultation and let experienced practitioners help you get it right from the start.