
OpenShift AI: Complete Enterprise AI/ML Platform Guide [2026]

10 min read
#OpenShift#AI#ML#Lightspeed#MLOps


AI is hot, but running AI in an enterprise is very different from experimenting in a Jupyter notebook.

Data security, model governance, GPU scheduling, version control, CI/CD... each of these is a potential pitfall. OpenShift AI aims to close these gaps by providing an enterprise-grade AI/ML platform.

This article will provide a complete introduction to OpenShift AI, from platform features to practical applications, helping you evaluate whether it's suitable for your AI workloads. If you're not familiar with OpenShift yet, we recommend first reading the OpenShift Complete Guide.


Introduction to OpenShift AI

What is OpenShift AI?

OpenShift AI is Red Hat's enterprise-grade AI/ML platform, formerly known as Red Hat OpenShift Data Science (RHODS).

On top of the OpenShift container platform, it provides complete machine learning lifecycle support:

  • Data preparation and exploration
  • Model development and training
  • Model deployment and serving
  • Model monitoring and governance

Product Positioning

OpenShift AI isn't trying to compete with AWS SageMaker or GCP Vertex AI in the "fully managed" market. Its positioning is:

"Build an enterprise-grade AI/ML platform on your own infrastructure"

Suitable for:

  • Organizations with data sovereignty requirements
  • Enterprises wanting to run AI on private or hybrid cloud
  • Teams already using OpenShift

Core Features Overview

| Feature | Description |
|---------|-------------|
| Data Science Project | Team collaboration workspace |
| Workbenches | Jupyter Notebook development environment |
| Model Serving | Model deployment and inference services |
| Pipelines | ML pipeline orchestration |
| Model Registry | Model version management (preview) |
| Lightspeed | AI-assisted operations |

OpenShift AI Architecture

Platform Architecture

OpenShift AI is built on top of OpenShift:

┌─────────────────────────────────────────────────┐
│                 OpenShift AI                     │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐           │
│  │Workbench│ │ Serving │ │Pipeline │           │
│  └─────────┘ └─────────┘ └─────────┘           │
├─────────────────────────────────────────────────┤
│                  OpenShift                       │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐           │
│  │   GPU   │ │ Storage │ │ Network │           │
│  │ Support │ │  (ODF)  │ │  (SDN)  │           │
│  └─────────┘ └─────────┘ └─────────┘           │
├─────────────────────────────────────────────────┤
│              Infrastructure (Cloud/Bare Metal)   │
└─────────────────────────────────────────────────┘

Core Components

1. Dashboard

Web UI entry point, providing:

  • Data Science Project management
  • Workbench creation and access
  • Model Server management
  • Pipeline execution monitoring

2. Notebook Controller

Manages Jupyter Notebook environments:

  • Multiple preset images (PyTorch, TensorFlow, Standard DS)
  • Custom image support
  • GPU allocation

3. Model Mesh / KServe

Model inference services:

  • Supports multiple model formats
  • Auto-scaling
  • A/B testing

4. Data Science Pipelines

Based on Kubeflow Pipelines:

  • Visual Pipeline editing
  • Scheduled execution
  • Experiment tracking

Integration with OpenShift

OpenShift AI deeply integrates with OpenShift features:

| OpenShift Feature | OpenShift AI Usage |
|-------------------|--------------------|
| RBAC | Control who can access which projects |
| Network Policy | Isolate ML workloads |
| PVC/ODF | Dataset and model storage |
| GPU Operator | GPU resource management |
| Monitoring | Model service monitoring |
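
As one example of this integration, a NetworkPolicy can restrict a Data Science Project's namespace to intra-namespace traffic. A sketch; the `ai-project` namespace name is hypothetical:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-ml-workloads
  namespace: ai-project        # hypothetical Data Science Project namespace
spec:
  podSelector: {}              # applies to all Pods in the namespace
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector: {}          # only allow traffic from Pods in the same namespace
```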

AI/ML Workflow

Complete Workflow

OpenShift AI supports end-to-end ML workflow:

Data Prep → Feature Eng → Model Training → Model Eval → Model Deploy → Monitoring
    │            │              │              │             │             │
    ▼            ▼              ▼              ▼             ▼             ▼
Workbench    Workbench      Training       Registry       Serving     Monitoring
             + Pipeline       Job                         (KServe)

Data Preparation

Perform data exploration and preparation in Workbench:

# Connect to data sources
import os

import boto3
import pandas as pd
from sqlalchemy import create_engine

# S3 data
s3 = boto3.client('s3',
    endpoint_url=os.environ['S3_ENDPOINT'],
    aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY']
)

# Database
engine = create_engine(os.environ['DATABASE_URL'])
df = pd.read_sql("SELECT * FROM training_data", engine)

Data can be stored in:

  • OpenShift Data Foundation (ODF)
  • S3-compatible storage
  • External databases

Model Training

Single Machine Training:

Train directly in Workbench (suitable for small models):

import torch
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

# model and train_dataset are assumed to be defined earlier in the notebook
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

trainer.train()

Distributed Training:

Large models use Kubernetes native distributed training:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:latest
            resources:
              limits:
                nvidia.com/gpu: 1
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:latest
            resources:
              limits:
                nvidia.com/gpu: 1
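
Inside each replica, the training script can use the environment variables that the PyTorchJob operator injects (RANK, WORLD_SIZE) to decide which slice of the data it owns. A minimal sketch; `shard_indices` is a hypothetical helper, not part of any library:

```python
import os

def shard_indices(dataset_size: int) -> range:
    """Return the index range this replica should train on.

    RANK and WORLD_SIZE are injected by the PyTorchJob operator
    into every Master/Worker container.
    """
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    per_replica = dataset_size // world_size
    start = rank * per_replica
    # the last replica also picks up any remainder
    end = start + per_replica if rank < world_size - 1 else dataset_size
    return range(start, end)
```

With the manifest above (1 master plus 4 workers, so WORLD_SIZE is 5), replica 0 would train on the first fifth of the samples, and so on.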

Model Serving

Trained models can be deployed via KServe:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://models/my-model"

Supported model formats:

  • TensorFlow
  • PyTorch
  • ONNX
  • scikit-learn
  • XGBoost
  • LightGBM
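
Once deployed, the model answers HTTP requests over KServe's v1 REST protocol (`POST /v1/models/<name>:predict` with an `instances` array). A sketch using only the standard library; the endpoint URL is hypothetical and depends on your Service/Route:

```python
import json
import urllib.request

# Hypothetical in-cluster endpoint for the InferenceService above
KSERVE_URL = "http://my-model.ai-project.svc.cluster.local/v1/models/my-model:predict"

def build_payload(instances):
    # KServe's v1 protocol wraps inputs in an "instances" array
    return json.dumps({"instances": instances}).encode()

def predict(instances):
    req = urllib.request.Request(
        KSERVE_URL,
        data=build_payload(instances),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["predictions"]
```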

OpenShift Lightspeed

Lightspeed is a highlight feature of OpenShift AI, allowing administrators to operate clusters using natural language.

Feature Introduction

Lightspeed is an AI assistant integrated into the OpenShift Console:

  • Query cluster status using natural language
  • Explain error messages
  • Suggest solutions
  • Generate YAML configurations

Use Cases

Query Cluster Status:

You: Show me the top 5 Pods with highest CPU usage in the past hour

Lightspeed: Based on Prometheus metrics, the Pods with highest CPU usage in the past hour are:
1. ml-training-job-xyz (namespace: ai-project) - 3.2 cores
2. data-pipeline-abc (namespace: data-eng) - 2.8 cores
...

Troubleshooting:

You: Why does my-deployment's Pod keep CrashLoopBackOff?

Lightspeed: I checked the Pod logs and events and found the following issues:
1. Container can't find environment variable DATABASE_URL at startup
2. Recommend checking if ConfigMap or Secret is correctly configured
...

Generate Configuration:

You: Help me create an HPA that scales my-deployment when CPU exceeds 70%, maximum 10 replicas

Lightspeed: Here's the recommended HorizontalPodAutoscaler configuration:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-deployment-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Security Considerations

Lightspeed's design considers enterprise security needs:

  • Can use Red Hat hosted LLM
  • Can also connect to self-hosted LLM
  • Sensitive data won't be sent externally (configurable)
  • Has audit logs

Want to use OpenShift Lightspeed to improve operations efficiency? Book an AI adoption consultation and let us evaluate your use cases.


GPU Support

Most AI workloads depend on GPUs. OpenShift AI provides complete support through the NVIDIA GPU Operator.

NVIDIA GPU Operator

GPU Operator automatically handles:

  • Driver installation
  • CUDA Toolkit
  • Device Plugin
  • GPU Monitoring

Install GPU Operator:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator
  namespace: nvidia-gpu-operator
spec:
  channel: stable
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace

GPU Resource Scheduling

Request GPUs in Workbench or Pod:

resources:
  limits:
    nvidia.com/gpu: 1

OpenShift automatically schedules to nodes with GPUs.
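
Inside the container, the NVIDIA container runtime exposes the allocated devices through the NVIDIA_VISIBLE_DEVICES environment variable, which a workload can inspect before initializing CUDA. A minimal sketch; `allocated_gpus` is a hypothetical helper:

```python
import os

def allocated_gpus():
    """List the GPU IDs the device plugin exposed to this container.

    NVIDIA_VISIBLE_DEVICES is set by the NVIDIA container runtime;
    it is unset (or "void"/"none") when no GPU was requested.
    """
    visible = os.environ.get("NVIDIA_VISIBLE_DEVICES", "")
    if visible in ("", "none", "void"):
        return []
    return visible.split(",")
```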

Multi-GPU Training

Distributed training can use multiple GPUs:

resources:
  limits:
    nvidia.com/gpu: 4  # Multiple GPUs on single node

Or cross-node:

# PyTorchJob cross-node distributed
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 8  # 8 Workers, each with 1 GPU

GPU Monitoring

GPU Operator automatically integrates monitoring:

  • GPU utilization
  • GPU memory
  • GPU temperature
  • Power consumption

Related metrics can be viewed in the OpenShift Monitoring dashboards (Grafana).


Development Environment

Jupyter Notebook Integration

OpenShift AI's Workbench is based on Jupyter:

Preset Images:

  • Standard Data Science (general purpose)
  • PyTorch
  • TensorFlow
  • CUDA (GPU environment)

Custom Images:

Can create your own Notebook images:

FROM quay.io/opendatahub/notebooks:jupyter-pytorch-2024.1

# Install additional packages
RUN pip install transformers datasets accelerate

# Copy custom configuration
COPY jupyter_notebook_config.py /opt/app-root/etc/

VS Code Server

Besides Jupyter, VS Code Server is also supported:

  • Full IDE experience
  • Extension support
  • Terminal access

Environment Variables and Secrets

Securely manage API Keys and credentials:

# Create Secret
apiVersion: v1
kind: Secret
metadata:
  name: ml-credentials
stringData:
  HUGGINGFACE_TOKEN: "hf_xxx"
  S3_ACCESS_KEY: "xxx"

The Secret can be attached to a Workbench, and its keys are automatically injected as environment variables.
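
A notebook can then fail fast when an expected credential was not mounted. `require_env` below is a hypothetical convenience helper, not an OpenShift AI API:

```python
import os

def require_env(*names):
    """Return the values of the given environment variables,
    raising a clear error if any of them is missing."""
    missing = [n for n in names if n not in os.environ]
    if missing:
        raise RuntimeError(f"Missing credentials: {', '.join(missing)}")
    return [os.environ[n] for n in names]
```

For example, `hf_token, s3_key = require_env("HUGGINGFACE_TOKEN", "S3_ACCESS_KEY")` either returns both values or raises a clear error.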


MLOps Practices

Model Version Control

Use Data Science Pipelines to track model versions:

from kfp import dsl

@dsl.component
def train_model(data_path: str, model_output: str):
    # Training logic (placeholder) -- train, then save the model artifact
    model.save(model_output)

@dsl.component
def evaluate_model(model_path: str) -> float:
    # Evaluation logic (placeholder) -- load the model and score it
    return accuracy

@dsl.pipeline
def ml_pipeline():
    train = train_model(data_path="s3://data", model_output="s3://models/v1")
    evaluate = evaluate_model(model_path=train.outputs['model_output'])

CI/CD for ML

Integrate with OpenShift Pipelines (Tekton):

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: ml-cicd
spec:
  tasks:
  - name: fetch-code
    taskRef:
      name: git-clone
  - name: run-tests
    taskRef:
      name: pytest
    runAfter: [fetch-code]
  - name: train-model
    taskRef:
      name: ml-training
    runAfter: [run-tests]
  - name: deploy-model
    taskRef:
      name: kserve-deploy
    runAfter: [train-model]

A/B Testing

KServe supports Canary deployment:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://models/v2"  # New version

10% of traffic goes to the new model; once validated, you can switch over fully.
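
The split behaves like weighted random routing over individual requests. A toy illustration of that behavior (not KServe's actual implementation, which routes at the Knative/Istio layer):

```python
import random

def route(canary_traffic_percent: int) -> str:
    """Pick which model revision serves one request."""
    return "canary" if random.random() * 100 < canary_traffic_percent else "stable"

# With canaryTrafficPercent: 10, roughly 1 in 10 requests hits the new model
counts = {"canary": 0, "stable": 0}
for _ in range(10_000):
    counts[route(10)] += 1
```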


Security and Compliance

Data Security

Data Isolation:

  • Each Data Science Project is a separate Namespace
  • Can use Network Policy to limit network access
  • Data stored in PVC, can be encrypted

Access Control:

  • RBAC controls who can access which projects
  • Can integrate with enterprise identity systems (LDAP/AD)

Model Security

Model Access Control:

  • Model Server can set authentication
  • Limit who can call inference API

Model Auditing:

  • Pipeline execution records
  • Model version tracking
  • Inference logs

Compliance Considerations

OpenShift AI helps meet compliance requirements:

| Requirement | Solution |
|-------------|----------|
| Data Residency | Deploy on your own infrastructure |
| Access Auditing | OpenShift audit logs |
| Model Governance | Model Registry + Pipeline |
| Explainability | Integrate AI explainability tools |

Deployment and Configuration

Install OpenShift AI

Install from OperatorHub:

  1. Search for Red Hat OpenShift AI
  2. Install it into the redhat-ods-operator namespace
  3. Wait for the Operator to become ready

Create Data Science Cluster

apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    dashboard:
      managementState: Managed
    workbenches:
      managementState: Managed
    datasciencepipelines:
      managementState: Managed
    modelmeshserving:
      managementState: Managed
    kserve:
      managementState: Managed

Resource Configuration

Recommended resource allocation:

| Component | CPU | Memory | Description |
|-----------|-----|--------|-------------|
| Dashboard | 1 | 2Gi | Low load |
| Workbench (small) | 2 | 8Gi | Light development |
| Workbench (large) | 8 | 32Gi | Model training |
| Model Server | Depends on model | Depends on model | Needs evaluation |

FAQ

Q1: How is OpenShift AI different from AWS SageMaker?

The main difference is deployment location. SageMaker is AWS's fully managed service; data and models live on AWS. OpenShift AI can be deployed anywhere — public cloud, private cloud, or your own data center — making it suitable for enterprises with data sovereignty requirements or those already using OpenShift.

Q2: How many GPUs do I need to run OpenShift AI?

You don't necessarily need GPUs. Data exploration and small model training can use CPU. But for training deep learning models or real-time inference, GPUs are much faster. Recommendation: 1-2 GPUs for dev/test environments, plan production based on workload.

Q3: Will OpenShift Lightspeed send my data externally?

It can be controlled. Lightspeed supports multiple LLM backends: (1) a Red Hat-hosted LLM (data goes through Red Hat) or (2) a self-hosted LLM (data never leaves your environment). Enterprises can choose based on their security requirements.

Q4: Can existing Jupyter Notebooks be used directly?

Mostly yes. OpenShift AI's Workbench is based on standard Jupyter, so your notebook files should run directly. If you have special package requirements, you may need a custom image.

Q5: How is OpenShift AI licensing calculated?

OpenShift AI is licensed as a separate subscription and is not included in OpenShift Container Platform. For specific costs, contact Red Hat or a partner; pricing is typically based on the resources used (cores).


Want to Run AI Workloads on OpenShift?

From GPU configuration to MLOps processes, many choices but also many pitfalls.

Book an AI adoption consultation and let experienced people help you avoid pitfalls.

