local-llm-server

Production · Data & Intelligence

A robust, production-ready API for managing and serving local language models with comprehensive performance monitoring. It provides an OpenAI-compatible API layer over local inference engines (llama.cpp, etc.), enabling secure, air-gapped AI capabilities for the enterprise.

Key Features

  • OpenAI-compatible API Interface
  • Real-time GPU/TPS Performance Monitoring
  • Model Management & Switching UI
  • Efficiency Mode (No-Log Inference)
  • RBAC Integration for Model Access Controls
  • Support for ROCm/CUDA and CPU Inference

API Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | `/v1/models` | List currently loaded and available models |
| POST | `/v1/chat/completions` | OpenAI-compatible chat completion endpoint |
| POST | `/api/orchestrate/load` | Load a specific model into VRAM |
| POST | `/api/orchestrate/unload` | Unload the current model to free VRAM |
| GET | `/api/performance/metrics` | Get real-time token generation and GPU stats |
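
For reference, a chat completion request might look like the sketch below. Because the endpoint is OpenAI-compatible, the payload follows the standard OpenAI chat schema; the model name shown is a placeholder, not a model shipped with the service.

```python
import requests

# Minimal chat completion request against the OpenAI-compatible endpoint.
# The model name is a placeholder; pick one returned by GET /v1/models.
response = requests.post(
    url="https://api.arcore.internal/v1/chat/completions",
    headers={"Authorization": "Bearer <token>"},
    json={
        "model": "llama-3-8b-instruct",
        "messages": [
            {"role": "user", "content": "Summarize the quarterly report in one paragraph."}
        ],
    },
)
print(response.json()["choices"][0]["message"]["content"])
```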

Usage Example

```python
import requests

# List the models currently loaded or available on the local-llm-server instance
response = requests.get(
    url="https://api.arcore.internal/v1/models",
    headers={"Authorization": "Bearer <token>"}
)
print(response.json())
```
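
The orchestration and monitoring endpoints follow the same pattern. The sketch below assumes the load request takes a JSON body with a `model` field; the exact request and response schemas are not documented above, so treat the field names as placeholders.

```python
import time
import requests

BASE = "https://api.arcore.internal"
HEADERS = {"Authorization": "Bearer <token>"}

# Ask the orchestrator to load a model into VRAM.
# Assumption: the body is {"model": "<name>"}; adjust to match your deployment.
requests.post(f"{BASE}/api/orchestrate/load",
              headers=HEADERS,
              json={"model": "llama-3-8b-instruct"})

# Poll real-time token-generation and GPU stats while the model serves traffic.
for _ in range(3):
    metrics = requests.get(f"{BASE}/api/performance/metrics", headers=HEADERS)
    print(metrics.json())  # exact fields depend on the deployment
    time.sleep(5)

# Unload the model to free VRAM when finished.
requests.post(f"{BASE}/api/orchestrate/unload", headers=HEADERS)
```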

Tech Stack

PythonFastAPISQLiteDockerROCm/CUDAllama.cpp

Authentication

  • **Header:** `Authorization: Bearer <token>`
  • **Scopes:** RBAC is enforced at the object level via `ArcoreCodex` policies.
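
A minimal sketch of how a client might handle an RBAC denial is shown below; the 403 status code is an assumption, since the error contract is not documented here.

```python
import requests

# Every request carries the bearer token; ArcoreCodex policies decide per-model access.
response = requests.get(
    "https://api.arcore.internal/v1/models",
    headers={"Authorization": "Bearer <token>"},
)

# Assumption: a request denied by an RBAC policy returns HTTP 403.
if response.status_code == 403:
    print("Access denied by RBAC policy for this token.")
else:
    response.raise_for_status()
    print(response.json())
```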

Compliance & Security

Compliance

  • Network: Air-gap capable
  • Data Privacy: No external data egress

Security

  • Access: API key authentication

Related Services