Sunday, June 8, 2025

Choosing the Right LLM Setup: Local, EC2, Bedrock, OpenAI, or SageMaker?

 

After experimenting with LLMs for chatbots, data analytics, and content summarization across several environments, here is what I gathered about choosing an appropriate LLM setup:

1) Local LLMs (Experimented with Python using GPT4All and Orca Mini on an M3 Mac)

Use For:

  • Experimentation, learning and POCs

  • Working offline or in secure environments

  • Handling sensitive data (e.g., finance, healthcare)

  • A no-cost, self-contained setup on M1/M2/M3/M4

  • Full control over where data lives for HIPAA, GDPR

Limitations:

  • Slower inference on local hardware

  • Manual model updates, no built-in scaling

  • Not designed for multiple concurrent users or serving over the web without additional orchestration

Tips:

  • Use 4-bit quantized models (GGUF) for faster CPU inference, even for larger models like Llama 3

  • Automate model updates with scripts to avoid manual downloads

  • Add validation layers (schema checking, fact checking, rule filters, secondary-model validation, or RAG (Retrieval-Augmented Generation)) to reduce hallucinations and error scenarios.

  • While CPUs can run LLMs, dedicated GPUs (or Apple Silicon's unified memory) provide a substantial speedup.
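
To make this concrete, here is a minimal sketch of the local setup above using the GPT4All Python bindings and a 4-bit quantized (GGUF) Orca Mini model. The model file name and generation parameters are illustrative; use whichever GGUF model you have downloaded.

from gpt4all import GPT4All

# Downloads the model on first run, then loads it from the local cache
model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")

with model.chat_session():
    reply = model.generate(
        "Summarize the trade-offs of running LLMs locally in three bullets.",
        max_tokens=200,
        temp=0.2,
    )
    print(reply)

Everything runs on-device, so no data leaves the machine - which is why this setup suits the sensitive-data use cases listed above.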

2) EC2 + LLMs (Experimented with a Dockerized Python app using Ollama with Mistral 7B and Llama 3, running on a Spot Instance)

Use for:

  • More compute (GPU/RAM) than a laptop

  • Running larger models privately (e.g., Llama 3 70B)

  • Building custom tools or secure internal pipelines

  • Full control over where data lives for HIPAA, GDPR

Limitations:

  • Infrastructure setup, scaling, and cost management

Tips:

  • Use Spot Instances to cut costs by up to 90%

  • Use Reserved Instances for predictable, long-running workloads, and Savings Plans for flexible cost savings.

  • Package with Docker, or serve with vLLM, for easier setup and reproducibility

  • Store models in EFS/S3 to share them across instances and avoid re-downloads

  • Scale horizontally with Auto Scaling groups and load balancers for larger internal teams.
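
Here is a minimal client sketch for the EC2 setup above: a Python call to an Ollama server's REST API. The host name and model tag are assumptions - point OLLAMA_URL at your instance (or its load balancer) and use whichever model you have pulled.

import requests

OLLAMA_URL = "http://localhost:11434"  # replace with your EC2 host or ALB DNS name

def generate(prompt: str, model: str = "mistral") -> str:
    # Ollama's /api/generate endpoint; stream=False returns a single JSON object
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(generate("Explain Spot Instances in one paragraph."))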

3) Amazon Bedrock (Experimented with Python on Lambda calling Bedrock models: amazon.titan-embed-text-v2:0 for generating embeddings and amazon.titan-text-express-v1 for text generation)

Use for:

  • Developing apps with managed, model-agnostic APIs (Claude, Mistral, Titan)

  • Enterprise-grade security, IAM, and uptime

  • Scalable for customer-facing or API-heavy workloads - no need to manage infrastructure or concurrency limits yourself

Limitations:

  • Usage-based costs; limited access to model internals means less control over quantization, direct access to weights, etc.

  • Fine-tuning is limited but possible for Titan models.

Tips:

  • Use VPC endpoints to keep traffic within your network

  • Apply RAG to reduce hallucinations and add context

  • Use Guardrails to enforce safe outputs and compliance

  • Monitor via CloudWatch to track usage and avoid cost spikes

  • Supports regional compliance when paired with VPC and IAM boundaries.

  • Enable data encryption at rest (KMS) and in transit, and enforce least-privilege access for Bedrock-related roles

  • Handle rate limiting (e.g., exponential backoff, request queuing) in production applications

  • SageMaker supports fine-tuning and hosting of open-source LLMs. Use it when you need full control over training loops, advanced MLOps, or cost-efficient multi-model endpoints - ideal for custom enterprise LLM pipelines beyond Bedrock's prebuilt APIs.
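
Here is a minimal sketch of the Bedrock calls described above, using boto3 and the same Titan model IDs. The region and generation parameters are assumptions; in a Lambda handler you would create the client once, outside the handler function.

import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(text: str) -> list:
    # amazon.titan-embed-text-v2:0 - returns an embedding vector for the input
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def generate(prompt: str) -> str:
    # amazon.titan-text-express-v1 - plain text generation
    resp = bedrock.invoke_model(
        modelId="amazon.titan-text-express-v1",
        body=json.dumps({
            "inputText": prompt,
            "textGenerationConfig": {"maxTokenCount": 256, "temperature": 0.2},
        }),
    )
    return json.loads(resp["body"].read())["results"][0]["outputText"]

print(generate("Summarize Amazon Bedrock in one sentence."))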

4) OpenAI APIs (GPT-4o, GPT-3.5) (Experimented with Java and Python apps calling OpenAI APIs for text and image generation)

Use for:

  • Best-in-class reasoning and multimodal support for complex problem solving, code generation, logical deduction and strategic thinking

  • When you prefer zero infrastructure to manage

  • Rapid prototyping and app development

  • Building stateful, conversational applications with persistent threads, function calling, and knowledge retrieval; the Assistants API simplifies much of the orchestration

Limitations:

  • Per-token pricing, privacy, and rate limits

  • Fine-tuning is available only for select models (e.g., GPT-3.5 Turbo; availability for newer GPT-4-class models varies)

Tips:

  • Use caching (e.g., Redis) to avoid repeated prompts and save cost

  • Prefer GPT-3.5 for simple tasks; it's cheaper and good enough for many use cases

  • Add RAG to supply facts instead of relying on memory

  • Opt out of data logging or sign a DPA to enhance privacy

  • OpenAI has region-specific data controls, but full residency control is limited.

  • Use function calling to interact with external tools and APIs when building intelligent agents and automations.

  • Handle rate limiting (e.g., exponential backoff, request queuing) in production applications (see the sketch below)
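
As a sketch of that last tip, here is an OpenAI chat call wrapped in exponential backoff with jitter. The model name and retry parameters are assumptions; adjust them for your workload.

import time
import random
from openai import OpenAI, RateLimitError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chat_with_backoff(prompt: str, model: str = "gpt-4o", max_retries: int = 5) -> str:
    delay = 1.0
    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # exponential backoff with jitter before retrying
            time.sleep(delay + random.random())
            delay *= 2

print(chat_with_backoff("Give one tip for controlling OpenAI API costs."))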

Summary

No internet or high privacy => Local LLM (e.g., Ollama, GPT4All) 

Private high-performance inference => EC2 + LLMs

Scalable managed APIs with security => Amazon Bedrock

Best reasoning, no infra needed => OpenAI APIs

Custom training & hosting => SageMaker
