LLMOps: Scalable Generative AI at Sopra Steria

by Shashank Chamoli - AI Technical Lead, Sopra Steria

LLMOps: Driving Operational Excellence in Large Language Models

At Sopra Steria, we recognise that LLMOps (MLOps for Large Language Models) is essential for modern AI initiatives. Although it is a sub-category of traditional MLOps, LLMOps requires a specialised focus on the infrastructure needed to fine-tune foundation models and deploy them as reliable products. A critical difference lies in the cost structure: standard MLOps costs are concentrated in data collection and training, whereas LLMOps incurs significant costs for inference and for the heavy computation required to fine-tune on large-scale datasets.

The Strategic Benefits of LLMOps

Implementing a robust LLMOps framework offers three primary advantages for our clients:

  • Efficiency: Our teams develop models faster and improve quality through a streamlined management approach that promotes better communication between development and deployment.
  • Scalability: We manage and monitor multiple models simultaneously using Continuous Integration and Continuous Delivery (CI/CD), providing a more responsive user experience through improved data communication.
  • Risk Reduction: We prioritise transparency and improve compliance with organisational and industry policies, strengthening security to protect sensitive information.

High-Performance Infrastructure with Ray and vLLM

To ensure our models scale both horizontally and vertically, we use vLLM and Ray. Our architecture centres on a Ray Cluster, which consists of a Head Node for cluster management and multiple Worker Nodes that run user code as distributed tasks.

A key feature of this setup is Autoscaling. When resource demands exceed capacity, the autoscaler increases worker nodes; conversely, it removes idle nodes to optimise costs.
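The scale-up and scale-down rule described above can be sketched in a few lines of Python. This is an illustrative simplification, not Ray's autoscaler or its API: the `ClusterState` fields, the CPU-only resource model, and the `min_workers` and `max_workers` bounds are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Hypothetical snapshot of a cluster (not a Ray data structure)."""
    workers: int          # worker nodes currently running
    cpus_per_worker: int  # CPU slots provided by each worker
    demanded_cpus: int    # CPUs requested by running and queued tasks


def target_workers(state, min_workers=1, max_workers=10):
    """Smallest worker count that covers demand, clamped to the bounds."""
    needed = math.ceil(state.demanded_cpus / state.cpus_per_worker)
    return max(min_workers, min(max_workers, needed))


def scaling_action(state, min_workers=1, max_workers=10):
    """Describe the change needed to move from the current to the target size."""
    target = target_workers(state, min_workers, max_workers)
    if target > state.workers:
        return f"scale up by {target - state.workers}"
    if target < state.workers:
        return f"scale down by {state.workers - target}"
    return "no change"


# Demand of 20 CPUs on 2 workers with 4 CPUs each needs 5 workers in total.
print(scaling_action(ClusterState(workers=2, cpus_per_worker=4, demanded_cpus=20)))
# Demand of 6 CPUs on 5 workers leaves 3 workers idle.
print(scaling_action(ClusterState(workers=5, cpus_per_worker=4, demanded_cpus=6)))
```

A production autoscaler would also weigh GPU and memory requests and wait for an idle timeout before removing a node; the sketch keeps only the capacity comparison.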

We also utilise Tensor Parallelism to distribute operations across multiple processors, which is essential for handling models with billions of parameters.
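To make the idea concrete, here is a minimal pure-Python sketch of one common form of tensor parallelism: splitting a weight matrix column-wise, computing each slice independently, and concatenating the partial results. It is an assumption-laden toy, not how vLLM implements it; the function names are hypothetical, and a loop stands in for separate devices.

```python
def matmul(x, w):
    """Multiply x (n x k) by w (k x m); both are lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, parts):
    """Split w column-wise into `parts` shards of equal width."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by the number of shards")
    width = cols // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]


def column_parallel_matmul(x, w, parts):
    """Compute x @ w shard by shard, then join the partial outputs."""
    # In a real deployment each shard lives on its own device and these
    # products run concurrently; the loop here only simulates that.
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Concatenate the partial outputs along the column axis.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
# The sharded result matches the single-device product.
print(column_parallel_matmul(x, w, parts=2) == matmul(x, w))
```

Because each shard holds only a fraction of the weights, no single processor needs enough memory for the full matrix, which is what makes this approach useful for very large models.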

Building RAG-Based Applications

Sopra Steria employs Retrieval-Augmented Generation (RAG) to retrieve relevant data from sources outside the foundation model and supply it as context at query time, which improves the quality of the output.

RAG is a cornerstone of our work because it reduces hallucinations by grounding the model in retrieved facts. It is significantly more cost-efficient than continuous pre-training and allows for easier updates or removal of sensitive data by simply updating the retrieval index.
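The retrieve-then-prompt flow can be illustrated with a small standard-library sketch. It makes several assumptions: word-count cosine similarity stands in for the embedding model and vector index a real system would use, the documents are invented examples, and `retrieve` and `build_prompt` are hypothetical helper names rather than part of any framework.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=2):
    """Return up to k documents most similar to the query, skipping non-matches."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(query, documents, k=2):
    """Ground the question in retrieved context before it reaches the model."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


# Invented example documents acting as the retrieval index.
index = [
    "Ray clusters have a head node and worker nodes.",
    "vLLM serves large language models efficiently.",
    "Airflow schedules data pipelines.",
]
print(build_prompt("What nodes make up a Ray cluster?", index))
```

Deleting an entry from `index` is enough to stop it appearing in any later prompt, which is the property that makes updating or removing sensitive data straightforward with RAG.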

Deployment and Lifecycle Automation

Our LLMOps principles span the entire project lifecycle, from exploratory data analysis to continuous monitoring with human feedback. To move from development to production, we follow a structured deployment process. At Sopra Steria, we excel in automating these processes. Using Airflow on the Innerdata platform (Kubernetes), we can deploy models and knowledge base pipelines with minimal parameter input.

Performance Benchmarking and Caching

Our solutions are backed by rigorous performance data. Benchmarking on specialised 16 GB GPU machines demonstrates that our framework maintains stable response times across various context lengths and user loads.

To further reduce latency and cost, we implement Intelligent Caching. By storing previously computed data, future requests for the same or semantically similar data can be served faster, reducing the number of expensive LLM requests. We explore caching based on Item IDs, Pairs of Item IDs, and Constrained Inputs (like specific genres or lead actors) to ensure high reliability and consistent responses.
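The three keying strategies can be sketched as a small in-memory cache. This is a hedged illustration rather than our production design: the `ResponseCache` class, its method names, and the `generate` callable standing in for the LLM request are all hypothetical. It also assumes that a pair of item IDs is order-insensitive, and it covers exact matches on normalised keys only, not semantic similarity.

```python
class ResponseCache:
    """In-memory cache keyed on normalised request parameters."""

    def __init__(self, generate):
        self._generate = generate  # the expensive LLM call (hypothetical callable)
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _get(self, key):
        """Return the cached response, calling the model only on a miss."""
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._generate(key)
        return self._store[key]

    def by_item(self, item_id):
        """Cache on a single item ID."""
        return self._get(("item", item_id))

    def by_pair(self, id_a, id_b):
        """Cache on a pair of item IDs; sorting assumes (a, b) equals (b, a)."""
        return self._get(("pair",) + tuple(sorted((id_a, id_b))))

    def by_constraints(self, **constraints):
        """Cache on constrained inputs such as genre or lead actor."""
        # Sort the keys and normalise the values so equivalent requests collide.
        norm = tuple(sorted((k, str(v).strip().lower()) for k, v in constraints.items()))
        return self._get(("constraints", norm))


calls = []
cache = ResponseCache(lambda key: calls.append(key) or f"response for {key}")
cache.by_pair(1, 2)
cache.by_pair(2, 1)                  # served from the cache
cache.by_constraints(genre="Drama")
cache.by_constraints(genre="drama")  # served from the cache
print(len(calls), cache.hits)
```

Restricting inputs to a fixed vocabulary (for example a closed list of genres) keeps the key space small, which raises the hit rate and keeps responses consistent across users.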
