LM Studio is a desktop application for developing and experimenting with Large Language Models (LLMs) on your local machine. It provides a familiar chat interface, one-click installation and seamless integration of models from Hugging Face, and the ability to run a local server that mimics OpenAI's API endpoints.
Compatible Large Language Models (LLMs) from Hugging Face can be run in GGUF (llama.cpp) format and, on Mac, in MLX format. GGUF text embedding models are also available, although some may not work on your machine or could be too large.
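For example, once an embedding model is loaded and the local server is running (covered below), you can call the OpenAI-style embeddings endpoint. A minimal sketch, assuming the default port 1234 and an example model name:

```bash
# Assumes the LM Studio server is up on the default port and an
# embedding model is loaded; substitute whatever model you downloaded.
curl http://localhost:1234/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-nomic-embed-text-v1.5",
    "input": "Some text to embed"
  }'
```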
Key Functionalities
- Local LLM Execution: Run various LLMs directly on your computer, fully offline; no internet access is required and no data is sent to the cloud.
- Chat Interface: Chat through a user-friendly interface with solid syntax highlighting, formatting, and plenty of customization options.
- Model Download & Search: Search for and download models from Hugging Face directly within the app.
- Local Server: Serve models through endpoints compatible with OpenAI's API (see the sketch after this list).
- Configuration Management: Manage local models and customize the settings to your liking.
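Because the local server speaks OpenAI's wire format, most OpenAI-compatible tooling can point at it unchanged. As a quick smoke test, assuming the server is running on the default port 1234:

```bash
# Lists the models the local server currently exposes, in the same
# shape as OpenAI's GET /v1/models response.
curl http://localhost:1234/v1/models
```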
Installation Guide
System Requirements
I'm using an Apple Silicon Mac for this guide. LM Studio supports:
- macOS: Apple Silicon (M1/M2/M3/M4) with macOS 13.4 or newer; 16GB+ RAM recommended.
- Windows: x64/ARM systems with at least 16GB RAM and AVX2 instruction set support.
- Linux: Ubuntu 20.04 or newer, x64 only.
Note: Intel-based Macs are currently unsupported.
Getting LM Studio Installed
Typically I use the Homebrew package manager on macOS, but you can also download the installer from the LM Studio Downloads page.
- Download the Installer: Visit LM Studio Downloads to download the installer for your operating system.
- Run the Installer: Launch the downloaded file and follow the on-screen instructions.
- Install LM Runtimes: Press ⌘ Shift R (Mac) or Ctrl Shift R (Windows/Linux) to install necessary runtimes like llama.cpp (GGUF) or MLX.
Or, for Homebrew users:

```bash
brew install --cask lm-studio
```

The lm-studio cask is available on Homebrew Formulae.
Using LM Studio
Running an LLM
Download some Models:
- Navigate to the Discover tab.
- Choose a model from the curated list or search for one using keywords (e.g., "Llama"). Lately I've been using the following models:
- cogito-v1-preview-qwen-14B-GGUF/cogito-v1-preview-qwen-14B-Q4_K_M.gguf The model supports a maximum context length of 128k tokens with RoPE scaling. It is a hybrid reasoning model trained through Iterated Distillation and Amplification, optimized for coding, STEM, instruction following, and general helpfulness. It boasts superior multilingual, coding, and tool-calling capabilities compared to similarly sized competitors.
- gemma-3-27b-it-GGUF/gemma-3-27b-it-Q4_K_M.gguf The model is optimized with Quantization Aware Training for enhanced 4-bit performance and supports a context length of 128k tokens, with a maximum output of 8192 tokens. It is multimodal, handling images normalized to 896 x 896 resolution. Gemma 3 models excel in various text generation and image understanding tasks, including question answering, summarization, and reasoning.
- NTQAI_-_Nxcode-CQ-7B-orpo-gguf/Nxcode-CQ-7B-orpo.Q4_K_S.gguf Handles Python programming and debugging really well.
- Qwen2.5-Coder-32B-Instruct-GGUF/qwen2.5-coder-32b-instruct-q2_k.gguf Qwen2.5-Coder is a series of code-specific large language models with sizes from 0.5 to 32 billion parameters. It improves code generation, reasoning, and fixing over CodeQwen1.5, trained on 5.5 trillion tokens. The 32B model rivals GPT-4o, supports 128K tokens, and is suited for real-world applications like Code Agents. This is probably my favorite / most used model.
- qwen2.5-coder-32b-instruct-mlx Qwen2.5-Coder in MLX format for Apple Silicon Macs.
- Qwen2.5-Coder-14B-Instruct-GGUF/qwen2.5-coder-14b-instruct-q4_k_m.gguf Qwen2.5-Coder at 14B parameters. Supports a 128k-token context length.
- DeepSeek-R1-Distill-Llama-8B-GGUF/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf Supports context length of 128k. Distilled from DeepSeek's R1 reasoning model. Tuned for reasoning and chain-of-thought.
Load & Chat with your new Model:
- Switch to the Chat tab.
- Use ⌘ L (Mac) or Ctrl L (Windows/Linux) to open the model loader.
- Select a downloaded or sideloaded model and load it with your desired configuration parameters.
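If you prefer the terminal, LM Studio also ships a companion CLI called lms that can load models without touching the GUI. A sketch, assuming you've bootstrapped the CLI from the app (the model key below is an example; use whatever lms ls reports):

```bash
lms ls                               # list the models you have downloaded
lms load qwen2.5-coder-32b-instruct  # load a model into memory (example key)
lms ps                               # show which models are currently loaded
```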
Managing Chats
- Create Conversations: Use ⌘ N (Mac) or Ctrl N (Windows/Linux).
- Organize Conversations: Create folders using ⌘ Shift N (Mac) or Ctrl Shift N (Windows/Linux).
- Duplicate Conversations: Right-click on a chat and select "Duplicate".
Chatting with Documents
- Attach document files (.docx, .pdf, .txt) to your chats.
- LM Studio uses RAG (Retrieval-Augmented Generation) for long documents, extracting relevant parts to enhance context.
Finding Models in LM Studio
Searching & Downloading Models
- Discover Tab: Accessible via ⌘ 2 (Mac) or Ctrl 2 (Windows/Linux).
- Search Options: Use keywords or specific user/model strings; you can also paste in full Hugging Face URLs.
Managing Model Directory
- There's an internal downloader/directory browser for Hugging Face models. Definitely browse around; there are suggested/popular models, or you can search for something specific. You can also download models directly from the browser.
- You can search for models using keywords (e.g., llama, gemma, lmstudio) or by entering a specific user/model string, including full Hugging Face URLs.
- Terms like Q3_K_S and Q8_0 refer to different versions of the same model, varying in fidelity. The "Q" stands for quantization, a technique that compresses model file sizes at the cost of some quality.
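A quick back-of-the-envelope for what quantization buys you: a model file weighs roughly parameters × bits-per-weight ÷ 8 bytes. Assuming Q4_K_M averages about 4.85 bits per weight (a commonly cited ballpark, not an exact spec):

```bash
# Rough file-size estimate for a 32B-parameter model at ~4.85 bits/weight:
awk 'BEGIN { printf "%.1f GB\n", 32e9 * 4.85 / 8 / 1e9 }'   # prints ~19.4 GB
```

That lands in the same ballpark as the roughly 20 GB Q4_K_M files you'll see for 32B models on Hugging Face, while the same model at full 16-bit precision would be around 64 GB.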
Advanced Features
Configuring Presets
Save commonly used system prompts and inference parameters as named presets for different use cases (reasoning, creative writing, etc.). The system prompt field (found in most AI prompting interfaces) gives the model context and guidelines before you provide your main request. It's where you tell the model how to behave or what rules to follow when generating its response.
Think of it as: Setting the stage, giving instructions to an assistant, or defining the "personality" of the AI for a specific task.
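The same idea carries over to the API: whatever you'd store in a preset's system prompt goes into the "system" message. A minimal sketch, assuming the local server is running on the default port 1234 (the model name here is just an example):

```bash
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-32b-instruct",
    "messages": [
      { "role": "system", "content": "You are a terse senior code reviewer. Respond in bullet points only." },
      { "role": "user", "content": "Review: def add(a, b): return a + b" }
    ]
  }'
```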
Per-model Defaults
- Set default load settings for each model directly from the My Models tab.
Prompt Template Customization
- Override the default prompt template in the My Models tab using Jinja or manual specifications.
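For a feel of what that looks like, here is a ChatML-style Jinja sketch; real templates vary by model family, so treat it as illustrative rather than a drop-in:

```jinja
{# Renders each message as <|im_start|>role ... <|im_end|>, ChatML style. #}
{% for message in messages %}<|im_start|>{{ message.role }}
{{ message.content }}<|im_end|>
{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant
{% endif %}
```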
Speculative Decoding
In the sidebar of the Chat tab, you'll see a section called "Speculative Decoding". Speculative decoding is a technique that speeds up generation in large language models while maintaining response quality.

It involves two models: a larger "main" model and a smaller, faster "draft" model. The draft model quickly proposes candidate tokens, which the main model verifies against its own generation. The main model accepts only the tokens that match its output, and it always generates one additional token after the last accepted draft token. For example, if the draft proposes four tokens and the main model accepts three of them, the step emits four tokens (three accepted plus one fresh) for roughly the cost of a single main-model pass. Both models must share the same vocabulary for the draft model to be effective.
API and Server Usage
LM Studio as a Local LLM API Server
- Serve models through the Developer tab using OpenAI compatibility mode, the enhanced REST API, or the lmstudio-js SDK.
- Run LM Studio as a service, without the GUI, for server deployment or background processing.
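For the no-GUI case, the same lms CLI can drive the server. A sketch, assuming the CLI is bootstrapped:

```bash
lms server start   # start the local API server headlessly
lms server stop    # shut it down when you're done
```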
Structured Output
- When structured output is enabled, model responses conform to the JSON schema you provide. You can enforce schema-based structured output from LLMs via the /v1/chat/completions endpoint.
The example below shows how to make a structured output request using the curl utility.

```bash
curl http://{{hostname}}:{{port}}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "{{model}}",
    "messages": [
      { "role": "system", "content": "You are a helpful jokester." },
      { "role": "user", "content": "Tell me a joke." }
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "joke_response",
        "strict": true,
        "schema": {
          "type": "object",
          "properties": {
            "joke": { "type": "string" }
          },
          "required": ["joke"]
        }
      }
    },
    "temperature": 0.7,
    "max_tokens": 50,
    "stream": false
  }'
```
The API allows structured JSON outputs via the /v1/chat/completions endpoint when a JSON schema is provided. This enables the LLM to respond with valid JSON that adheres to the specified schema, similar to OpenAI's Structured Output API, and it is compatible with OpenAI client SDKs.
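On the shell side, note that the generated JSON arrives as a string inside .choices[0].message.content, so it needs a second parse. A sketch using jq, assuming the response from the curl call above was saved with -o response.json:

```bash
# fromjson parses the stringified JSON in message.content so we can
# pull out the "joke" field that the schema guarantees.
jq -r '.choices[0].message.content | fromjson | .joke' response.json
```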
Note: Not all models support structured output, especially those with fewer than 7 billion parameters.
LM Studio is a powerful tool for local development and experimentation with Large Language Models. Its intuitive interface, robust feature set, and flexible configuration options make it suitable for both beginners and advanced users. Whether you're experimenting with existing models or building your own applications, LM Studio provides a solid foundation for local LLM projects.
Keep your data and queries safe and offline. I hope this helps you on your LLM / dev journey.