OpenShift AI: Complete Enterprise AI/ML Platform Guide [2026]
![OpenShift AI: Complete Enterprise AI/ML Platform Guide [2026]](/images/blog/openshift/openshift-ai-hero.webp)
AI is hot, but running AI in an enterprise is a very different exercise from experimenting in a Jupyter Notebook.
Data security, model governance, GPU scheduling, version control, CI/CD... each one is a potential pitfall. OpenShift AI aims to address these challenges by providing an enterprise-grade AI/ML platform.
This article will provide a complete introduction to OpenShift AI, from platform features to practical applications, helping you evaluate whether it's suitable for your AI workloads. If you're not familiar with OpenShift yet, we recommend first reading the OpenShift Complete Guide.
Introduction to OpenShift AI
What is OpenShift AI?
OpenShift AI is Red Hat's enterprise-grade AI/ML platform, formerly known as Red Hat OpenShift Data Science (RHODS).
On top of the OpenShift container platform, it provides complete machine learning lifecycle support:
- Data preparation and exploration
- Model development and training
- Model deployment and serving
- Model monitoring and governance
Product Positioning
OpenShift AI isn't trying to compete with AWS SageMaker or GCP Vertex AI in the "fully managed" market. Its positioning is:
"Build an enterprise-grade AI/ML platform on your own infrastructure"
Suitable for:
- Organizations with data sovereignty requirements
- Enterprises wanting to run AI on private or hybrid cloud
- Teams already using OpenShift
Core Features Overview
| Feature | Description |
|---|---|
| Data Science Project | Team collaboration workspace |
| Workbenches | Jupyter Notebook development environment |
| Model Serving | Model deployment and inference services |
| Pipelines | ML Pipeline orchestration |
| Model Registry | Model version management (preview) |
| Lightspeed | AI-assisted operations |
OpenShift AI Architecture
Platform Architecture
OpenShift AI is built on top of OpenShift:
┌─────────────────────────────────────────────────┐
│ OpenShift AI │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Workbench│ │ Serving │ │Pipeline │ │
│ └─────────┘ └─────────┘ └─────────┘ │
├─────────────────────────────────────────────────┤
│ OpenShift │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ GPU │ │ Storage │ │ Network │ │
│ │ Support │ │ (ODF) │ │ (SDN) │ │
│ └─────────┘ └─────────┘ └─────────┘ │
├─────────────────────────────────────────────────┤
│ Infrastructure (Cloud/Bare Metal) │
└─────────────────────────────────────────────────┘
Core Components
1. Dashboard
Web UI entry point, providing:
- Data Science Project management
- Workbench creation and access
- Model Server management
- Pipeline execution monitoring
2. Notebook Controller
Manages Jupyter Notebook environments:
- Multiple preset images (PyTorch, TensorFlow, Standard DS)
- Custom image support
- GPU allocation
3. Model Mesh / KServe
Model inference services:
- Supports multiple model formats
- Auto-scaling
- A/B testing
4. Data Science Pipelines
Based on Kubeflow Pipelines:
- Visual Pipeline editing
- Scheduled execution
- Experiment tracking
Integration with OpenShift
OpenShift AI deeply integrates with OpenShift features:
| OpenShift Feature | OpenShift AI Usage |
|---|---|
| RBAC | Control who can access which projects |
| Network Policy | Isolate ML workloads |
| PVC/ODF | Dataset and model storage |
| GPU Operator | GPU resource management |
| Monitoring | Model service monitoring |
AI/ML Workflow
Complete Workflow
OpenShift AI supports end-to-end ML workflow:
Data Prep → Feature Eng → Model Training → Model Eval → Model Deploy → Monitoring
│ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼
Workbench Workbench Training Registry Serving Monitoring
+ Pipeline Job (KServe)
Data Preparation
Perform data exploration and preparation in Workbench:
# Connect to data sources
import os
import boto3
import pandas as pd
from sqlalchemy import create_engine

# S3 data
s3 = boto3.client(
    's3',
    endpoint_url=os.environ['S3_ENDPOINT'],
    aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
)

# Database
engine = create_engine(os.environ['DATABASE_URL'])
df = pd.read_sql("SELECT * FROM training_data", engine)
Data can be stored in:
- OpenShift Data Foundation (ODF)
- S3-compatible storage
- External databases
Model Training
Single Machine Training:
Train directly in Workbench (suitable for small models):
import torch
from transformers import Trainer, TrainingArguments

# model and train_dataset are assumed to be defined earlier in the notebook
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

trainer.train()
Distributed Training:
Large models use Kubernetes native distributed training:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:latest
            resources:
              limits:
                nvidia.com/gpu: 1
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:latest
            resources:
              limits:
                nvidia.com/gpu: 1
Model Serving
Trained models can be deployed via KServe:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://models/my-model"
Supported model formats:
- TensorFlow
- PyTorch
- ONNX
- scikit-learn
- XGBoost
- LightGBM
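Once an InferenceService is ready, clients call it over HTTP using the open inference (V2) protocol. As a sketch, the request body for a simple tabular model can be built like this; the input name, shape, and endpoint URL are model-specific assumptions, not values from this article:

```python
import json

def build_v2_request(input_name, data):
    """Build a KServe V2 (open inference protocol) request body.

    Assumes a flat list of float features; input_name must match
    the model's declared input tensor name.
    """
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": [1, len(data)],
                "datatype": "FP32",
                "data": data,
            }
        ]
    }

# Hypothetical 4-feature sklearn model
payload = build_v2_request("input-0", [5.1, 3.5, 1.4, 0.2])
body = json.dumps(payload)
# POST this body to https://<route>/v2/models/my-model/infer
```

The response carries a matching `outputs` list with the prediction tensor.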
OpenShift Lightspeed
Lightspeed is one of the platform's standout features, letting administrators operate clusters using natural language.
Feature Introduction
Lightspeed is an AI assistant integrated into the OpenShift Console:
- Query cluster status using natural language
- Explain error messages
- Suggest solutions
- Generate YAML configurations
Use Cases
Query Cluster Status:
You: Show me the top 5 Pods with highest CPU usage in the past hour
Lightspeed: Based on Prometheus metrics, the Pods with highest CPU usage in the past hour are:
1. ml-training-job-xyz (namespace: ai-project) - 3.2 cores
2. data-pipeline-abc (namespace: data-eng) - 2.8 cores
...
Troubleshooting:
You: Why does my-deployment's Pod keep CrashLoopBackOff?
Lightspeed: I checked the Pod logs and events, found the following issues:
1. Container can't find environment variable DATABASE_URL at startup
2. Recommend checking if ConfigMap or Secret is correctly configured
...
Generate Configuration:
You: Help me create an HPA that scales my-deployment when CPU exceeds 70%, maximum 10 replicas
Lightspeed: Here's the recommended HorizontalPodAutoscaler configuration:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-deployment-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
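For intuition, the `averageUtilization` target drives the standard Kubernetes HPA scaling formula. A minimal sketch (ignoring stabilization windows and tolerance, which the real controller also applies):

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=10):
    """Sketch of the HPA formula:
    desired = ceil(current * currentMetric / targetMetric),
    clamped to [minReplicas, maxReplicas]."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# 3 replicas at 140% average CPU with a 70% target -> scale to 6
print(desired_replicas(3, 140, 70))  # 6
```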
Security Considerations
Lightspeed's design considers enterprise security needs:
- Can use Red Hat hosted LLM
- Can also connect to self-hosted LLM
- Sensitive data won't be sent externally (configurable)
- Has audit logs
Want to use OpenShift Lightspeed to improve operations efficiency? Book an AI adoption consultation and let us evaluate your use cases.
GPU Support
Serious AI workloads are hard to run without GPUs. OpenShift AI provides complete GPU support through the NVIDIA GPU Operator.
NVIDIA GPU Operator
GPU Operator automatically handles:
- Driver installation
- CUDA Toolkit
- Device Plugin
- GPU Monitoring
Install GPU Operator:
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator
  namespace: nvidia-gpu-operator
spec:
  channel: stable
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
GPU Resource Scheduling
Request GPUs in Workbench or Pod:
resources:
  limits:
    nvidia.com/gpu: 1
OpenShift automatically schedules to nodes with GPUs.
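A quick way to verify scheduling works is a throwaway Pod that runs `nvidia-smi` and exits. This is an illustrative sketch; the Pod name, image tag, and taint key are assumptions that depend on how your GPU nodes are configured:

```yaml
# Hypothetical smoke-test Pod requesting one GPU
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-check
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9   # any CUDA base image works
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu     # only needed if GPU nodes are tainted
    operator: Exists
    effect: NoSchedule
```

If the Pod completes and its logs show the GPU table, driver installation and device plugin registration are working.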
Multi-GPU Training
Distributed training can use multiple GPUs:
resources:
  limits:
    nvidia.com/gpu: 4  # Multiple GPUs on a single node
Or cross-node:
# PyTorchJob cross-node distributed training
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 8  # 8 Workers, each with 1 GPU
GPU Monitoring
GPU Operator automatically integrates monitoring:
- GPU utilization
- GPU memory
- GPU temperature
- Power consumption
The related metrics can be viewed in the OpenShift Monitoring dashboards (Grafana).
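These GPU metrics come from the DCGM exporter and can also be pulled programmatically through the Prometheus HTTP API. A small sketch, where the querier route is a placeholder you would replace with your cluster's monitoring endpoint:

```python
from urllib.parse import urlencode

def prometheus_query_url(base_url, promql):
    """Build a Prometheus HTTP API instant-query URL.

    base_url is the (hypothetical) Thanos/Prometheus route of the
    cluster monitoring stack.
    """
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"

# DCGM_FI_DEV_GPU_UTIL is the DCGM exporter's per-GPU utilization metric
url = prometheus_query_url(
    "https://thanos-querier.example.com",
    "avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)",
)
print(url)
```

Issue an authenticated GET against this URL (e.g. with a service account bearer token) to retrieve the values.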
Development Environment
Jupyter Notebook Integration
OpenShift AI's Workbench is based on Jupyter:
Preset Images:
- Standard Data Science (general purpose)
- PyTorch
- TensorFlow
- CUDA (GPU environment)
Custom Images:
Can create your own Notebook images:
FROM quay.io/opendatahub/notebooks:jupyter-pytorch-2024.1
# Install additional packages
RUN pip install transformers datasets accelerate
# Copy custom configuration
COPY jupyter_notebook_config.py /opt/app-root/etc/
VS Code Server
Besides Jupyter, also supports VS Code Server:
- Full IDE experience
- Extension support
- Terminal access
Environment Variables and Secrets
Securely manage API Keys and credentials:
# Create Secret
apiVersion: v1
kind: Secret
metadata:
  name: ml-credentials
stringData:
  HUGGINGFACE_TOKEN: "hf_xxx"
  S3_ACCESS_KEY: "xxx"
When the Secret is attached to a Workbench, its values are automatically injected as environment variables.
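Inside the notebook, the credentials are then ordinary environment variables. A small defensive sketch (the variable name matches the Secret above; the error message is illustrative):

```python
import os

def load_hf_token():
    """Read the token injected from the ml-credentials Secret.

    Failing early here gives a clearer error than a rejected
    API call deep inside a training run.
    """
    token = os.environ.get("HUGGINGFACE_TOKEN")
    if token is None:
        raise RuntimeError(
            "HUGGINGFACE_TOKEN not set - is the Secret attached to the Workbench?"
        )
    return token
```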
MLOps Practices
Model Version Control
Use Data Science Pipelines to track model versions:
from kfp import dsl

@dsl.component
def train_model(data_path: str, model_output: str):
    # Training logic elided; save the trained model artifact
    model.save(model_output)

@dsl.component
def evaluate_model(model_path: str) -> float:
    # Evaluation logic elided; return the computed metric
    return accuracy

@dsl.pipeline
def ml_pipeline():
    train = train_model(data_path="s3://data", model_output="s3://models/v1")
    evaluate = evaluate_model(model_path=train.outputs['model_output'])
CI/CD for ML
Integrate with OpenShift Pipelines (Tekton):
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: ml-cicd
spec:
  tasks:
  - name: fetch-code
    taskRef:
      name: git-clone
  - name: run-tests
    taskRef:
      name: pytest
    runAfter: [fetch-code]
  - name: train-model
    taskRef:
      name: ml-training
    runAfter: [run-tests]
  - name: deploy-model
    taskRef:
      name: kserve-deploy
    runAfter: [train-model]
A/B Testing
KServe supports Canary deployment:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://models/v2"  # New version
10% of traffic goes to the new model; once it is validated, you can shift to 100%.
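To build intuition for what a weighted split means in practice, here is a deterministic simulation of a 10% canary. This is only a sketch: KServe and the service mesh do the real split at the network layer, not per request ID.

```python
import hashlib

def route_to_canary(request_id: str, canary_percent: int = 10) -> bool:
    """Deterministically route ~canary_percent of traffic to the canary.

    Hashing the request ID into 100 buckets makes routing stable
    per request while matching the configured percentage overall.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return bucket < canary_percent

hits = sum(route_to_canary(f"req-{i}") for i in range(10_000))
print(f"{hits / 100:.1f}% of simulated requests went to the canary")
```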
Security and Compliance
Data Security
Data Isolation:
- Each Data Science Project is a separate Namespace
- Can use Network Policy to limit network access
- Data stored in PVC, can be encrypted
Access Control:
- RBAC controls who can access which projects
- Can integrate with enterprise identity systems (LDAP/AD)
Model Security
Model Access Control:
- Model Server can set authentication
- Limit who can call inference API
Model Auditing:
- Pipeline execution records
- Model version tracking
- Inference logs
Compliance Considerations
OpenShift AI helps meet compliance requirements:
| Requirement | Solution |
|---|---|
| Data Residency | Deploy on your own infrastructure |
| Access Auditing | OpenShift audit logs |
| Model Governance | Model Registry + Pipeline |
| Explainability | Integrate AI Explainability tools |
Deployment and Configuration
Install OpenShift AI
Install from OperatorHub:
- Search for Red Hat OpenShift AI
- Select install to redhat-ods-operator namespace
- Wait for Operator to be ready
Create Data Science Cluster
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    dashboard:
      managementState: Managed
    workbenches:
      managementState: Managed
    datasciencepipelines:
      managementState: Managed
    modelmeshserving:
      managementState: Managed
    kserve:
      managementState: Managed
Resource Configuration
Recommended resource allocation:
| Component | CPU | Memory | Description |
|---|---|---|---|
| Dashboard | 1 | 2Gi | Low load |
| Workbench (small) | 2 | 8Gi | Light development |
| Workbench (large) | 8 | 32Gi | Model training |
| Model Server | Depends on model | Depends on model | Needs evaluation |
FAQ
Q1: How is OpenShift AI different from AWS SageMaker?
The main difference is deployment location. SageMaker is AWS's fully managed service, data and models are on AWS. OpenShift AI can be deployed anywhere—public cloud, private cloud, own data center. Suitable for enterprises with data sovereignty requirements or already using OpenShift.
Q2: How many GPUs do I need to run OpenShift AI?
You don't necessarily need GPUs. Data exploration and small model training can use CPU. But for training deep learning models or real-time inference, GPUs are much faster. Recommendation: 1-2 GPUs for dev/test environments, plan production based on workload.
Q3: Will OpenShift Lightspeed send my data externally?
It can be controlled. Lightspeed supports multiple LLM backends: (1) Red Hat hosted LLM (data goes through Red Hat); (2) Self-hosted LLM (data never leaves). Enterprises can choose based on security requirements.
Q4: Can existing Jupyter Notebooks be used directly?
Mostly yes. OpenShift AI's Workbench is based on standard Jupyter, your notebook files should run directly. But if you have special package requirements, you may need custom images.
Q5: How is OpenShift AI licensing calculated?
OpenShift AI has separate subscription licensing, not included in OpenShift Container Platform. Specific costs require contacting Red Hat or partners. Usually priced by resources used (Cores).
Want to Run AI Workloads on OpenShift?
From GPU configuration to MLOps processes, there are many choices to make and just as many pitfalls to avoid.
Book an AI adoption consultation and let experienced practitioners help you get it right from the start.