Emerging Architectures for LLM Applications | Andreessen Horowitz
Large language models are a powerful new primitive for building software. But since they are so new, and behave so differently from normal computing resources, it’s not always obvious how to use them. In this post, we’re sharing a reference architecture for the emerging LLM app stack. It shows the most common systems, tools, and design patterns we’ve seen used by AI startups and sophisticated tech companies. This stack is still very early and may change substantially as the underlying technology advances, but we hope it will be a useful reference for developers working with LLMs now.
in Artificial Intelligence > AI > LLM/FM (Large Language / Foundation Models) with ai, architecture, foundationmodels, generativeai, largelanguagemodels, llm
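To make the pattern concrete, here is a minimal, self-contained sketch of the in-context learning flow at the heart of that stack: embed documents, retrieve the most similar chunks from a vector store, and assemble a grounded prompt for the LLM. All names are illustrative assumptions, not from the article: `embed` stands in for a real embedding model and the toy `VectorStore` for a real vector database.

```python
import math

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model (normally an API or model call);
    # here: a crude character-frequency vector so the sketch runs offline.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already normalized, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

class VectorStore:
    # Toy in-memory stand-in for a vector database.
    def __init__(self) -> None:
        self.items: list[tuple[list[float], str]] = []

    def add(self, text: str) -> None:
        self.items.append((embed(text), text))

    def query(self, text: str, k: int = 2) -> list[str]:
        q = embed(text)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [t for _, t in ranked[:k]]

def build_prompt(question: str, store: VectorStore) -> str:
    # Orchestration step: ground the LLM with retrieved context.
    context = "\n".join(store.query(question))
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"

store = VectorStore()
for doc in ["GPUs accelerate LLM inference.", "Vector databases store embeddings."]:
    store.add(doc)
print(build_prompt("What stores embeddings?", store))
```

The assembled prompt would then be sent to an LLM API; swapping the stubs for a hosted embedding model and vector database gives the standard retrieval-augmented setup the article describes.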
LLM Inference Sizing and Performance Guidance
When planning to deploy a chatbot or a simple Retrieval-Augmented Generation (RAG) pipeline, you may have questions about sizing (capacity) and performance based on your existing GPU resources or potential future GPU acquisitions. For instance:
- What is the maximum number of concurrent requests that can be supported for a specific Large Language Model (LLM) on a specific GPU?
- What is the maximum sequence length (or prompt size) that a user can send to the chat app without experiencing a noticeably slow response time?
- What is the estimated response time (latency) for generating output tokens, and how does it vary with different input sizes and LLM sizes?
- Conversely, if you have specific capacity or latency requirements, how many GPUs do you need?
in Computers > Hardware > AI with concurrency, gpu, llm, memoryrequirements, scaling, sizing
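First-order answers to these questions follow from two standard rules of thumb (the formulas and the example numbers below are back-of-envelope assumptions, not figures from the guide itself): GPU memory left over after loading the weights, divided by the per-request KV-cache footprint, bounds concurrency; and decoding is memory-bandwidth-bound, so time per output token is roughly model bytes divided by memory bandwidth. A sketch in Python, assuming a 13B fp16 model on one 80 GB GPU:

```python
GIB = 1024**3

# --- Assumed model/hardware parameters (hypothetical 13B-class model) ---
params          = 13e9    # model parameters
bytes_per_param = 2       # fp16/bf16 weights and KV cache
n_layers        = 40
n_kv_heads      = 40
head_dim        = 128
gpu_mem_gib     = 80      # single 80 GB GPU
mem_bw_gbs      = 2000    # HBM bandwidth in GB/s (assumed)

weights_bytes = params * bytes_per_param

# KV cache per token: 2 (K and V) * layers * kv_heads * head_dim * bytes.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_param

def max_concurrency(seq_len: int) -> int:
    """Rough max concurrent requests: leftover memory / per-request KV cache."""
    free = gpu_mem_gib * GIB - weights_bytes
    return int(free // (seq_len * kv_bytes_per_token))

def time_per_output_token_ms() -> float:
    """Decode is memory-bound: each output token reads all weights once."""
    return weights_bytes / (mem_bw_gbs * 1e9) * 1e3

print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"Max concurrent 4k-token requests: {max_concurrency(4096)}")
print(f"~{time_per_output_token_ms():.1f} ms per output token "
      f"(~{1000 / time_per_output_token_ms():.0f} tok/s per request)")
```

Real serving stacks (paged KV caches, quantization, grouped-query attention, tensor parallelism) shift these numbers considerably, so treat the output as a starting estimate rather than a capacity plan.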