Run containerized AI models locally with RamaLama
The open source AI ecosystem has matured quickly, and many developers start by using tools such as Ollama or LM Studio to run large language models (LLMs) on their laptops. This works well for quickly testing out a model and prototyping, but things become complicated when you need to manage dependencies, support different accelerators, or move workloads to Kubernetes.
Thankfully, just as containers solved development problems like portability and environment isolation for applications, they can do the same for AI models! RamaLama is an open source project that makes running AI models in containers simple, or in the project's own words, "boring and predictable." Let's take a look at how it works and get started with local AI inference, model serving, and retrieval-augmented generation (RAG).
Why run AI models locally?
There are several reasons developers and organizations want local or self-hosted AI:
- Control for developers: You can run models directly on your own hardware instead of relying on a remote LLM API. This avoids vendor lock-in and gives you full control over how models are executed and integrated.
- Data privacy for organizations: Enterprises often cannot send sensitive data to external services. Running AI workloads on-premises or in a controlled environment keeps data inside your own infrastructure.
- Cost management at scale: When you are generating thousands or millions of tokens per day, usage-based cloud APIs can become expensive very quickly. Hosting your own models offers more predictable cost profiles.
With RamaLama, you can download, run, and manage your own AI models just as you would any other workload, such as a database or a backend service.
What is RamaLama?
RamaLama is a command-line interface (CLI) for running AI models in containers on your machine. Instead of making you manage model runtimes and dependencies by hand, it plugs into an existing container engine such as Podman or Docker.
Under the hood, RamaLama fetches models from sources such as Ollama's model registry, Hugging Face, and any Open Container Initiative (OCI)-compliant registry (Docker Hub, Quay.io, etc.).
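If you want to be explicit about where a model comes from, you can reference a source directly when pulling it (a quick sketch, assuming the ollama://, huggingface://, and oci:// transport prefixes; the model names and registry paths below are illustrative placeholders):

ramalama pull ollama://tinyllama
# hypothetical Hugging Face repository, shown only to illustrate the prefix
ramalama pull huggingface://ibm-granite/granite-3.1-2b-instruct-GGUF
# any OCI registry works; this image reference is a placeholder
ramalama pull oci://quay.io/myorg/my-model:latest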
One command, and you're ready to start chatting with a local model, or to serve it as an OpenAI-compatible API endpoint for your existing applications to use. Now, let's check out how to install and use RamaLama.
Installing RamaLama and inspecting your environment
Start by visiting the RamaLama website at ramalama.ai and installing the CLI for your platform. Packages are available for Linux, macOS, and Windows. After installation, verify that RamaLama can see your environment:
ramalama info
This command prints details about your container engine and any detected GPU accelerators.
How RamaLama selects the right image
When you run a model for the first time, RamaLama uses the information from ramalama info to pull a pre-built image that matches your hardware:
- CUDA images for NVIDIA GPUs
- ROCm images for supported AMD GPUs
- Vulkan-based images where appropriate
- CPU-only images when no accelerator is available
These images are compiled from the upstream llama.cpp project, which also powers Ollama. Once the image is pulled and the model is downloaded, RamaLama reuses them for subsequent runs.
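If the automatically selected image isn't what you want, you should be able to override it yourself (a sketch, assuming the top-level --image option; the image reference below is a hypothetical example):

# force a specific runtime image instead of the auto-detected one (illustrative image name)
ramalama --image quay.io/ramalama/cuda:latest run gpt-oss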
Running your first model with RamaLama
To run a model locally, you can start with a simple command such as:
ramalama run gpt-oss
After this command completes the initial image pull and model download, you have a local, isolated, GPU-optimized LLM running in a container!
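You can also check what is in your local model store and remove models you no longer need (a short sketch, assuming the list and rm subcommands):

# show models that have already been downloaded
ramalama list
# free up disk space by deleting a model
ramalama rm gpt-oss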
Serving an OpenAI-compatible API with RamaLama
Interactive command-line chat is useful, but many applications require a network-accessible API. Typical use cases include:
- RAG services that answer questions over your own documents (e.g., LangChain, Dify, LlamaIndex)
- Agents that call tools or microservices (e.g., via the Model Context Protocol)
- Existing applications that already use the OpenAI API
RamaLama makes it straightforward to expose a local model through a REST endpoint:
ramalama serve gpt-oss --port 8000
This command serves an OpenAI-compatible HTTP API on port 8000, so any tool that can talk to the OpenAI API can point at your local endpoint instead. It also starts a lightweight web UI you can use to interactively test the model in your browser.
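Because the endpoint speaks the standard OpenAI chat completions format, you can test it with nothing more than curl (a minimal sketch; the value expected in the model field may differ depending on the model you serve):

# send a chat completion request to the local endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss", "messages": [{"role": "user", "content": "Summarize RamaLama in one sentence."}]}'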
Adding external data with RAG using RamaLama
Many real applications need LLMs to answer questions about your own documents. This pattern is known as retrieval-augmented generation (RAG). RamaLama uses the Docling project to simplify data preparation and provides a built-in RAG workflow.
ramalama rag data.pdf quay.io/cclyburn/my-data
This command uses Docling to convert data.pdf into structured JSON, builds a vector database from that JSON for similarity search, and packages the result into an OCI image. Once that image is built, you can launch a model with RAG enabled:
ramalama run --rag quay.io/cclyburn/my-data gpt-oss
This starts two containers: one serving the vector database built from your processed documents, and one running your selected model as an inference server.
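You can also combine RAG with serving if you want the document-aware model available over HTTP (a sketch, assuming --rag is accepted by ramalama serve the same way it is by ramalama run):

# serve the model with the RAG vector database attached
ramalama serve --rag quay.io/cclyburn/my-data gpt-oss --port 8000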
From local workflows to edge and Kubernetes
Because RamaLama packages models (and a RAG pipeline) as container images, you can move them through the same pipelines you already use for other workloads. From a single local setup, you can generate Quadlet files for deployment to edge devices or Kubernetes manifests for clusters.
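As a sketch of what that looks like, assuming the --generate option accepted by ramalama serve (with quadlet and kube among its values), you can write out deployment files instead of starting a server:

# emit a Quadlet unit for running the model under systemd on an edge device
ramalama serve --generate quadlet gpt-oss
# emit a Kubernetes YAML manifest for the same workload
ramalama serve --generate kube gpt-oss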
Wrapping up
RamaLama brings together containers, open source runtimes such as llama.cpp, and an OpenAI-compatible API to make local AI workloads easier to run and manage. It's also designed with a robust security footprint: it runs AI models in isolated containers, mounts the model read-only, and provides no network access to the container. If you're looking for a standardized way to run LLMs locally on your own infrastructure, check out RamaLama on GitHub and make working with AI "boring and predictable"!