KubeLLM: Collaborative LLM Agents for Distributed Edge Computing Systems
In this project, we propose KubeLLM, a framework that integrates Large Language Models (LLMs) with Kubernetes-based orchestration to enhance the management of distributed edge computing systems. KubeLLM embeds lightweight LLM agents within edge nodes to enable automated monitoring, diagnosis, and resolution of configuration issues. This approach aims to proactively address silent failures, minimize unexpected disruptions, and mitigate security vulnerabilities, thereby ensuring robust and reliable edge computing environments.

The primary research challenge lies in obtaining accurate and consistent responses from lightweight LLMs so that complex configuration issues can be fixed quickly and effectively. To address this, we propose a novel approach that combines collaboration among multiple LLM agents with Retrieval-Augmented Generation (RAG) and model fine-tuning. In our preliminary work, we developed an early prototype of KubeLLM using llama3.1, an open-source LLM, as the base model. Initial results on a testbed of ARM-based edge devices, including Raspberry Pi and Nvidia Jetson Nano boards, showed promising response latency and accuracy in identifying and resolving a subset of configuration issues, even without model fine-tuning.

To systematically evaluate the framework, the project aims to develop a comprehensive benchmarking suite featuring a diverse range of configuration issues within Kubernetes clusters deployed on edge devices. This suite will enable automated assessment of collaborative LLM agents, incorporating various models and enhancement strategies, such as fine-tuning, to rigorously evaluate their effectiveness in resolving configuration challenges.
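To make the approach concrete, the following is a minimal sketch of a single RAG-assisted diagnosis step by an on-node agent. It assumes a local llama3.1 model served through the Ollama Python client, kubectl available on the node, and a toy in-memory knowledge base standing in for a real retrieval store; the function names (get_cluster_state, retrieve, diagnose) are illustrative, not KubeLLM's actual interfaces.

```python
"""Sketch of one KubeLLM-style diagnosis step: gather cluster state,
retrieve relevant known issues, and ask a local LLM for a fix."""
import json
import subprocess

import ollama  # pip install ollama; assumes a local llama3.1 model is pulled

# Toy knowledge base of known edge/Kubernetes configuration issues.
KNOWLEDGE_BASE = [
    "ImagePullBackOff on ARM nodes often means the image has no arm64 build; "
    "use a multi-arch image or pin the pod to compatible nodes.",
    "Pods stuck Pending with 'Insufficient memory' on Raspberry Pi usually "
    "need lower resource requests or adjusted eviction thresholds.",
    "CrashLoopBackOff after a ConfigMap change can indicate stale mounts; "
    "restart the deployment so pods pick up the new ConfigMap.",
]

def get_cluster_state() -> str:
    """Collect a compact view of pod status from the local cluster."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-A", "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    pods = json.loads(out)["items"]
    lines = []
    for pod in pods:
        phase = pod["status"].get("phase", "Unknown")
        lines.append(f'{pod["metadata"]["namespace"]}/{pod["metadata"]["name"]}: {phase}')
    return "\n".join(lines)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Naive keyword retrieval standing in for a vector-based RAG store."""
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: sum(w in doc.lower() for w in query.lower().split()),
        reverse=True,
    )
    return scored[:k]

def diagnose(observed_issue: str) -> str:
    """Ask the on-node LLM agent for a fix, grounded in retrieved context."""
    context = "\n".join(retrieve(observed_issue))
    state = get_cluster_state()
    response = ollama.chat(
        model="llama3.1",
        messages=[
            {"role": "system",
             "content": "You are an edge Kubernetes operator. Suggest a minimal, safe fix."},
            {"role": "user",
             "content": f"Cluster state:\n{state}\n\nKnown issues:\n{context}\n\n"
                        f"Observed problem: {observed_issue}"},
        ],
    )
    return response["message"]["content"]

if __name__ == "__main__":
    print(diagnose("pod stuck in ImagePullBackOff on a Raspberry Pi worker"))
```

In the full framework, the single agent above would be replaced by multiple collaborating agents (e.g., a monitor, a diagnoser, and a verifier) and the keyword lookup by a proper retrieval index, but the basic observe-retrieve-query loop is the same.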
- Faculty: Palden Lama
- Department: Computer Science
- Open Positions: 3
- Mode: Hybrid
- Hours per week: 19
Requirements and Responsibilities:
Needed Skills:
- CS 3423 Systems Programming (required)
- Knowledge of Docker and Kubernetes (preferred)
- Experience with Raspberry Pi (preferred)
Student Responsibilities:
- Data Analysis: Collect and analyze performance metrics (e.g., latency, accuracy, resource usage, energy usage) to identify improvements (see the measurement sketch after this list).
- Model Optimization: Support fine-tuning LLMs and experimenting with Retrieval-Augmented Generation (RAG) for better issue resolution.
- Documentation: Document experiments, configurations, and results for reproducibility and team use.
- Research Dissemination: Assist in preparing research papers, presentations, and educational materials for conferences and publications.
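As a rough illustration of the data-analysis task, the sketch below times agent responses and computes a crude accuracy proxy over a tiny hand-labeled set of issues. It assumes the hypothetical diagnose() helper from the earlier sketch (saved here as a module named kubellm_sketch) and placeholder test cases; real experiments would use the project's benchmarking suite and metrics instead.

```python
"""Sketch of per-query latency and accuracy measurement for the agent."""
import statistics
import time

from kubellm_sketch import diagnose  # hypothetical module holding the earlier sketch

# Placeholder pairs: (issue description, keyword expected in a correct fix).
TEST_CASES = [
    ("pod stuck in ImagePullBackOff on an arm64 node", "multi-arch"),
    ("pod Pending with insufficient memory on Raspberry Pi", "requests"),
]

def run_benchmark() -> None:
    latencies, correct = [], 0
    for issue, expected_keyword in TEST_CASES:
        start = time.perf_counter()
        answer = diagnose(issue)  # on-node LLM agent call
        latencies.append(time.perf_counter() - start)
        # Crude accuracy proxy: does the suggested fix mention the expected keyword?
        correct += expected_keyword.lower() in answer.lower()
    print(f"mean latency: {statistics.mean(latencies):.2f} s")
    print(f"accuracy: {correct}/{len(TEST_CASES)}")

if __name__ == "__main__":
    run_benchmark()
```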