AI Model Library - Browse Our Collection of AI Models

MiniMax: MiniMax M2

ID:minimax/minimax-m2

MiniMax-M2 is a compact, high-efficiency large language model optimized for end-to-end coding and agentic workflows. With 10 billion activated parameters (230 billion total), it delivers near-frontier intelligence across general reasoning, tool use, and multi-step task execution while maintaining low latency and deployment efficiency. The model excels in code generation, multi-file editing, compile-run-fix loops, and test-validated repair, showing strong results on SWE-Bench Verified, Multi-SWE-Bench, and Terminal-Bench. It also performs competitively in agentic evaluations such as BrowseComp and GAIA, effectively handling long-horizon planning, retrieval, and recovery from execution errors. Benchmarked by [Artificial Analysis](https://artificialanalysis.ai/models/minimax-m2), MiniMax-M2 ranks among the top open-source models for composite intelligence, spanning mathematics, science, and instruction-following. Its small activation footprint enables fast inference, high concurrency, and improved unit economics, making it well-suited for large-scale agents, developer assistants, and reasoning-driven applications that require responsiveness and cost efficiency. To avoid degrading this model's performance, MiniMax highly recommends preserving reasoning between turns. Learn more about using reasoning_details to pass back reasoning in our [docs](https://openrouter.ai/docs/use-cases/reasoning-tokens#preserving-reasoning-blocks).
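The recommendation to preserve reasoning between turns can be sketched as follows. This is a minimal sketch, assuming the assistant message returned by the API carries a `reasoning_details` list next to `content`, as the linked docs describe. The helper name `append_assistant_turn` and the sample message contents are hypothetical.

```python
from typing import Any

Message = dict[str, Any]


def append_assistant_turn(history: list[Message], assistant_message: Message) -> list[Message]:
    """Return a new history with the assistant turn appended, reasoning included."""
    turn: Message = {"role": "assistant", "content": assistant_message.get("content")}
    # Pass reasoning blocks (and any tool calls) back exactly as received.
    for key in ("reasoning_details", "tool_calls"):
        if key in assistant_message:
            turn[key] = assistant_message[key]
    return [*history, turn]


# Hypothetical first-turn exchange (placeholder values, not real model output).
history = [{"role": "user", "content": "Fix the failing test in utils.py"}]
assistant_message = {
    "role": "assistant",
    "content": "I will start by reading the test file.",
    "reasoning_details": [{"type": "reasoning.text", "text": "placeholder reasoning"}],
}

history = append_assistant_turn(history, assistant_message)
history.append({"role": "user", "content": "Go ahead."})
```

The resulting `history` would be sent as the `messages` array of the next request, so the model sees its earlier reasoning unchanged.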

Unknown Provider

205K

$0.15/M

$0.45/M

Free

MiniMax: MiniMax M2 (free)

ID:minimax/minimax-m2:free

MiniMax-M2 is a compact, high-efficiency large language model optimized for end-to-end coding and agentic workflows. With 10 billion activated parameters (230 billion total), it delivers near-frontier intelligence across general reasoning, tool use, and multi-step task execution while maintaining low latency and deployment efficiency. The model excels in code generation, multi-file editing, compile-run-fix loops, and test-validated repair, showing strong results on SWE-Bench Verified, Multi-SWE-Bench, and Terminal-Bench. It also performs competitively in agentic evaluations such as BrowseComp and GAIA, effectively handling long-horizon planning, retrieval, and recovery from execution errors. Benchmarked by [Artificial Analysis](https://artificialanalysis.ai/models/minimax-m2), MiniMax-M2 ranks among the top open-source models for composite intelligence, spanning mathematics, science, and instruction-following. Its small activation footprint enables fast inference, high concurrency, and improved unit economics, making it well-suited for large-scale agents, developer assistants, and reasoning-driven applications that require responsiveness and cost efficiency. To avoid degrading this model's performance, MiniMax highly recommends preserving reasoning between turns. Learn more about using reasoning_details to pass back reasoning in our [docs](https://openrouter.ai/docs/use-cases/reasoning-tokens#preserving-reasoning-blocks).

Unknown Provider

205K

Qwen: Qwen3 VL 30B A3B Thinking

ID:qwen/qwen3-vl-30b-a3b-thinking

Qwen3-VL-30B-A3B-Thinking is a multimodal model that unifies strong text generation with visual understanding for images and videos. Its Thinking variant enhances reasoning in STEM, math, and complex tasks. It excels in perception of real-world/synthetic categories, 2D/3D spatial grounding, and long-form visual comprehension, achieving competitive multimodal benchmark results. For agentic use, it handles multi-image multi-turn instructions, video timeline alignments, GUI automation, and visual coding from sketches to debugged UI. Text performance matches flagship Qwen3 models, suiting document AI, OCR, UI assistance, spatial tasks, and agent research.

Qwen

262K

$0.3/M

$1/M

Google: Gemini 2.5 Flash Lite

ID:google/gemini-2.5-flash-lite

Gemini 2.5 Flash-Lite is a lightweight reasoning model in the Gemini 2.5 family, optimized for ultra-low latency and cost efficiency. It offers improved throughput, faster token generation, and better performance across common benchmarks compared to earlier Flash models. By default, "thinking" (i.e. multi-pass reasoning) is disabled to prioritize speed, but developers can enable it via the [Reasoning API parameter](https://openrouter.ai/docs/use-cases/reasoning-tokens) to selectively trade off cost for intelligence.
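As a rough illustration of toggling thinking per request, the sketch below builds a chat request body with a `reasoning` object. The `enabled` field follows the `reasoning` `enabled` boolean that other entries on this page mention; the helper name and prompt are hypothetical, and nothing is sent over the network.

```python
import json
from typing import Any


def build_chat_request(model: str, prompt: str, reasoning_enabled: bool) -> dict[str, Any]:
    """Build a request body that switches reasoning on or off for one call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Assumed shape: a `reasoning` object carrying an `enabled` boolean.
        "reasoning": {"enabled": reasoning_enabled},
    }


fast = build_chat_request("google/gemini-2.5-flash-lite", "Summarize this log.", False)
careful = build_chat_request("google/gemini-2.5-flash-lite", "Find the race condition.", True)
print(json.dumps(careful, indent=2))
```

Keeping the flag a per-request argument makes it easy to pay for reasoning only on the prompts that need it.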

Google

1M

$0.1/M

$0.4/M

OpenAI: GPT-5

ID:openai/gpt-5

GPT-5 is OpenAI’s most advanced model, offering major improvements in reasoning, code quality, and user experience. It is optimized for complex tasks that require step-by-step reasoning, instruction following, and accuracy in high-stakes use cases. It supports test-time routing features and advanced prompt understanding, including user-specified intent like "think hard about this." Improvements include reduced hallucination and sycophancy, and better performance in coding, writing, and health-related tasks.

OpenAI

400K

$1.25/M

$10/M
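To make the per-million-token prices concrete, here is a small cost estimate. It assumes the first price on a card ($1.25/M here) applies to input tokens and the second ($10/M) to output tokens, which the listing does not label explicitly. The token counts are made up for illustration.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_usd_per_million: float,
    output_usd_per_million: float,
) -> float:
    """Estimate the cost of one request from per-million-token prices."""
    input_cost = input_tokens / 1_000_000 * input_usd_per_million
    output_cost = output_tokens / 1_000_000 * output_usd_per_million
    return input_cost + output_cost


# Hypothetical request: 20,000 input tokens and 2,000 output tokens.
cost = estimate_cost_usd(20_000, 2_000, 1.25, 10.0)
print(f"${cost:.4f}")  # prints $0.0450
```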

Anthropic: Claude Sonnet 4.5

ID:anthropic/claude-sonnet-4.5

Claude Sonnet 4.5 is Anthropic’s most advanced Sonnet model to date, optimized for real-world agents and coding workflows. It delivers state-of-the-art performance on coding benchmarks such as SWE-bench Verified, with improvements across system design, code security, and specification adherence. The model is designed for extended autonomous operation, maintaining task continuity across sessions and providing fact-based progress tracking. Sonnet 4.5 also introduces stronger agentic capabilities, including improved tool orchestration, speculative parallel execution, and more efficient context and memory management. With enhanced context tracking and awareness of token usage across tool calls, it is particularly well-suited for multi-context and long-running workflows. Use cases span software engineering, cybersecurity, financial analysis, research agents, and other domains requiring sustained reasoning and tool use.

Anthropic

1M

$3/M

$15/M

DeepSeek: DeepSeek V3.2 Exp

ID:deepseek/deepseek-v3.2-exp

DeepSeek-V3.2-Exp is an experimental large language model released by DeepSeek as an intermediate step between V3.1 and future architectures. It introduces DeepSeek Sparse Attention (DSA), a fine-grained sparse attention mechanism designed to improve training and inference efficiency in long-context scenarios while maintaining output quality. Users can control the reasoning behaviour with the `reasoning` `enabled` boolean. [Learn more in our docs](https://openrouter.ai/docs/use-cases/reasoning-tokens#enable-reasoning-with-default-config). The model was trained under conditions aligned with V3.1-Terminus to enable direct comparison. Benchmarking shows performance roughly on par with V3.1 across reasoning, coding, and agentic tool-use tasks, with minor tradeoffs and gains depending on the domain. This release focuses on validating architectural optimizations for extended context lengths rather than advancing raw task accuracy, making it primarily a research-oriented model for exploring efficient transformer designs.

DeepSeek

164K

$0.27/M

$0.4/M

Free

DeepSeek: DeepSeek V3.1 (free)

ID:deepseek/deepseek-chat-v3.1:free

DeepSeek-V3.1 is a large hybrid reasoning model (671B parameters, 37B active) that supports both thinking and non-thinking modes via prompt templates. It extends the DeepSeek-V3 base with a two-phase long-context training process, reaching up to 128K tokens, and uses FP8 microscaling for efficient inference. Users can control the reasoning behaviour with the `reasoning` `enabled` boolean. [Learn more in our docs](https://openrouter.ai/docs/use-cases/reasoning-tokens#enable-reasoning-with-default-config). The model improves tool use, code generation, and reasoning efficiency, achieving performance comparable to DeepSeek-R1 on difficult benchmarks while responding more quickly. It supports structured tool calling, code agents, and search agents, making it suitable for research, coding, and agentic workflows. It succeeds the [DeepSeek V3-0324](/deepseek/deepseek-chat-v3-0324) model and performs well on a variety of tasks.

DeepSeek

164K

Google: Gemini 2.5 Flash Image Preview (Nano Banana)

ID:google/gemini-2.5-flash-image-preview

Gemini 2.5 Flash Image Preview, a.k.a. "Nano Banana," is a state-of-the-art image generation model with contextual understanding. It is capable of image generation, edits, and multi-turn conversations.

Google

33K

$0.3/M

$2.5/M

Qwen: Qwen3 Coder 480B A35B

ID:qwen/qwen3-coder

Qwen3-Coder-480B-A35B-Instruct is a Mixture-of-Experts (MoE) code generation model developed by the Qwen team. It is optimized for agentic coding tasks such as function calling, tool use, and long-context reasoning over repositories. The model features 480 billion total parameters, with 35 billion active per forward pass (8 out of 160 experts). Pricing for the Alibaba endpoints varies by context length. Once a request is greater than 128k input tokens, the higher pricing is used.
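The context-length pricing rule described above can be sketched as a tier lookup. This is only a sketch under stated assumptions: "128k" is read as 128,000 tokens, the higher tier is assumed to apply to the whole request once the input exceeds the threshold, and the long-context rates below are hypothetical placeholders because the listing does not state them.

```python
# Assumption: "128k" is read as 128,000 tokens; the provider may mean 131,072.
LONG_CONTEXT_THRESHOLD = 128_000

Rates = tuple[float, float]  # (input USD per 1M tokens, output USD per 1M tokens)

BASE_RATES: Rates = (0.22, 0.95)  # prices shown on this card
HYPOTHETICAL_LONG_RATES: Rates = (0.44, 1.90)  # placeholder, NOT a published price


def select_rates(input_tokens: int, base: Rates, long_context: Rates) -> Rates:
    """Pick the tier: the higher rates apply once input exceeds the threshold."""
    return long_context if input_tokens > LONG_CONTEXT_THRESHOLD else base


def request_cost(input_tokens: int, output_tokens: int, base: Rates, long_context: Rates) -> float:
    """Price the whole request at the selected tier."""
    in_rate, out_rate = select_rates(input_tokens, base, long_context)
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


short_cost = request_cost(100_000, 1_000, BASE_RATES, HYPOTHETICAL_LONG_RATES)
long_cost = request_cost(200_000, 1_000, BASE_RATES, HYPOTHETICAL_LONG_RATES)
```

Passing the rates in as arguments keeps the placeholder values out of the logic, so real tier prices can be substituted once they are known.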

Qwen

262K

$0.22/M

$0.95/M

Qwen: Qwen3 Max

ID:qwen/qwen3-max

Qwen3-Max is an updated release built on the Qwen3 series, offering major improvements in reasoning, instruction following, multilingual support, and long-tail knowledge coverage compared to the January 2025 version. It delivers higher accuracy in math, coding, logic, and science tasks, follows complex instructions in Chinese and English more reliably, reduces hallucinations, and produces higher-quality responses for open-ended Q&A, writing, and conversation. The model supports over 100 languages with stronger translation and commonsense reasoning, and is optimized for retrieval-augmented generation (RAG) and tool calling, though it does not include a dedicated “thinking” mode.

Qwen

256K

$1.2/M

$6/M

OpenAI: GPT-5 Nano

ID:openai/gpt-5-nano

GPT-5-Nano is the smallest and fastest variant in the GPT-5 system, optimized for developer tools, rapid interactions, and ultra-low latency environments. While limited in reasoning depth compared to its larger counterparts, it retains key instruction-following and safety features. It is the successor to GPT-4.1-nano and offers a lightweight option for cost-sensitive or real-time applications.

OpenAI

400K

$0.05/M

$0.4/M

OpenAI: GPT-5 Mini

ID:openai/gpt-5-mini

GPT-5 Mini is a compact version of GPT-5, designed to handle lighter-weight reasoning tasks. It provides the same instruction-following and safety-tuning benefits as GPT-5, but with reduced latency and cost. GPT-5 Mini is the successor to OpenAI's o4-mini model.

OpenAI

400K

$0.25/M

$2/M

Meta: Llama 4 Scout

ID:meta-llama/llama-4-scout

Llama 4 Scout 17B Instruct (16E) is a mixture-of-experts (MoE) language model developed by Meta, activating 17 billion parameters out of a total of 109B. It supports native multimodal input (text and image) and multilingual output (text and code) across 12 supported languages. Designed for assistant-style interaction and visual reasoning, Scout has 16 experts in total and features a context length of 10 million tokens, with a training corpus of ~40 trillion tokens. Built for high efficiency and local or commercial deployment, Llama 4 Scout incorporates early fusion for seamless modality integration. It is instruction-tuned for use in multilingual chat, captioning, and image understanding tasks. Released under the Llama 4 Community License, it was last trained on data up to August 2024 and launched publicly on April 5, 2025.

Unknown Provider

328K

$0.08/M

$0.3/M

Meta: Llama 4 Maverick

ID:meta-llama/llama-4-maverick

Llama 4 Maverick 17B Instruct (128E) is a high-capacity multimodal language model from Meta, built on a mixture-of-experts (MoE) architecture with 128 experts and 17 billion active parameters per forward pass (400B total). It supports multilingual text and image input, and produces multilingual text and code output across 12 supported languages. Optimized for vision-language tasks, Maverick is instruction-tuned for assistant-like behavior, image reasoning, and general-purpose multimodal interaction. Maverick features early fusion for native multimodality and a 1 million token context window. It was trained on a curated mixture of public, licensed, and Meta-platform data, covering ~22 trillion tokens, with a knowledge cutoff in August 2024. Released on April 5, 2025 under the Llama 4 Community License, Maverick is suited for research and commercial applications requiring advanced multimodal understanding and high model throughput.

Unknown Provider

1M

$0.17/M

$0.85/M

NVIDIA: Llama 3.3 Nemotron Super 49B V1.5

ID:nvidia/llama-3.3-nemotron-super-49b-v1.5

Llama-3.3-Nemotron-Super-49B-v1.5 is a 49B-parameter, English-centric reasoning/chat model derived from Meta’s Llama-3.3-70B-Instruct with a 128K context. It’s post-trained for agentic workflows (RAG, tool calling) via SFT across math, code, science, and multi-turn chat, followed by multiple RL stages: Reward-aware Preference Optimization (RPO) for alignment, RL with Verifiable Rewards (RLVR) for step-wise reasoning, and iterative DPO to refine tool-use behavior. A distillation-driven Neural Architecture Search (“Puzzle”) replaces some attention blocks and varies FFN widths to shrink memory footprint and improve throughput, enabling single-GPU (H100/H200) deployment while preserving instruction following and CoT quality. In internal evaluations (NeMo-Skills, up to 16 runs, temp = 0.6, top_p = 0.95), the model reports strong reasoning/coding results, e.g., MATH500 pass@1 = 97.4, AIME-2024 = 87.5, AIME-2025 = 82.71, GPQA = 71.97, LiveCodeBench (24.10–25.02) = 73.58, and MMLU-Pro (CoT) = 79.53. The model targets practical inference efficiency (high tokens/s, reduced VRAM) with Transformers/vLLM support and explicit “reasoning on/off” modes (chat-first defaults, greedy decoding recommended when reasoning is disabled). It is suitable for building agents, assistants, and long-context retrieval systems where balanced accuracy-to-cost and reliable tool use matter.

NVIDIA

131K

$0.1/M

$0.4/M

Meta: Llama Guard 4 12B

ID:meta-llama/llama-guard-4-12b

Llama Guard 4 is a Llama 4 Scout-derived multimodal pretrained model, fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM inputs (prompt classification) and in LLM responses (response classification). It acts as an LLM—generating text in its output that indicates whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated. Llama Guard 4 was aligned to safeguard against the standardized MLCommons hazards taxonomy and designed to support multimodal Llama 4 capabilities. Specifically, it combines features from previous Llama Guard models, providing content moderation for English and multiple supported languages, along with enhanced capabilities to handle mixed text-and-image prompts, including multiple images. Additionally, Llama Guard 4 is integrated into the Llama Moderations API, extending robust safety classification to text and images.
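Because the classifier answers in plain text, callers typically need a small parser. The sketch below assumes the output puts the verdict (`safe` or `unsafe`) on the first line and, when unsafe, a comma-separated list of category codes on the second line; that layout and the sample codes are assumptions, not something this listing specifies.

```python
def parse_guard_output(text: str) -> tuple[bool, list[str]]:
    """Parse classifier text into (is_safe, violated_categories)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty classifier output")

    verdict = lines[0].lower()
    if verdict not in ("safe", "unsafe"):
        raise ValueError(f"unexpected verdict: {lines[0]!r}")

    categories: list[str] = []
    if verdict == "unsafe" and len(lines) > 1:
        # Assumed layout: category codes separated by commas on the second line.
        categories = [code.strip() for code in lines[1].split(",") if code.strip()]
    return verdict == "safe", categories


# Hypothetical outputs used only to exercise the parser.
ok = parse_guard_output("safe")
flagged = parse_guard_output("unsafe\nS1,S10")
```

Raising on an unexpected verdict, rather than defaulting to safe, keeps a malformed response from silently passing moderation.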

Unknown Provider

164K

$0.05/M

Free

Meta: Llama 3.3 8B Instruct (free)

ID:meta-llama/llama-3.3-8b-instruct:free

A lightweight and ultra-fast variant of Llama 3.3 70B, for use when quick response times are needed most.

Unknown Provider

128K

xAI: Grok Code Fast 1

ID:x-ai/grok-code-fast-1

Grok Code Fast 1 is a speedy and economical reasoning model that excels at agentic coding. With reasoning traces visible in the response, developers can steer Grok Code for high-quality workflows.

X-AI

256K

$0.2/M

$1.5/M

xAI: Grok 4 Fast

ID:x-ai/grok-4-fast

Grok 4 Fast is xAI's latest multimodal model with SOTA cost-efficiency and a 2M token context window. It comes in two flavors: non-reasoning and reasoning. Read more about the model on xAI's [news post](http://x.ai/news/grok-4-fast). Reasoning can be enabled using the `reasoning` `enabled` parameter in the API. [Learn more in our docs](https://openrouter.ai/docs/use-cases/reasoning-tokens#controlling-reasoning-tokens). Prompts and completions on Grok 4 Fast Free may be used by xAI or OpenRouter to improve future models.

X-AI

2M

$0.2/M

$0.5/M

xAI: Grok 4

ID:x-ai/grok-4

Grok 4 is xAI's latest reasoning model with a 256k context window. It supports parallel tool calling, structured outputs, and both image and text inputs. Note that reasoning is not exposed, reasoning cannot be disabled, and the reasoning effort cannot be specified. Pricing increases once the total tokens in a given request exceed 128k. See more details in the [xAI docs](https://docs.x.ai/docs/models/grok-4-0709).

X-AI

256K

$3/M

$15/M

MoonshotAI: Kimi K2

ID:moonshotai/kimi-k2

Kimi K2 Instruct is a large-scale Mixture-of-Experts (MoE) language model developed by Moonshot AI, featuring 1 trillion total parameters with 32 billion active per forward pass. It is optimized for agentic capabilities, including advanced tool use, reasoning, and code synthesis. Kimi K2 excels across a broad range of benchmarks, particularly in coding (LiveCodeBench, SWE-bench), reasoning (ZebraLogic, GPQA), and tool-use (Tau2, AceBench) tasks. It supports long-context inference up to 128K tokens and is designed with a novel training stack that includes the MuonClip optimizer for stable large-scale MoE training.

Rifx.Online

66K

$0.14/M

$2.49/M

DeepSeek: DeepSeek V3 0324

ID:deepseek-chat-v3-0324

DeepSeek V3, a 685B-parameter mixture-of-experts model, is the latest iteration of the flagship chat model family from the DeepSeek team. It succeeds the [DeepSeek V3](/deepseek/deepseek-chat-v3) model and performs well on a variety of tasks.

DeepSeek

64K

$0.27/M

$1.1/M

Free

MoonshotAI: Kimi K2 (free)

ID:moonshotai/kimi-k2:free

Kimi K2 Instruct is a large-scale Mixture-of-Experts (MoE) language model developed by Moonshot AI, featuring 1 trillion total parameters with 32 billion active per forward pass. It is optimized for agentic capabilities, including advanced tool use, reasoning, and code synthesis. Kimi K2 excels across a broad range of benchmarks, particularly in coding (LiveCodeBench, SWE-bench), reasoning (ZebraLogic, GPQA), and tool-use (Tau2, AceBench) tasks. It supports long-context inference up to 128K tokens and is designed with a novel training stack that includes the MuonClip optimizer for stable large-scale MoE training.

Rifx.Online

66K

Free

DeepSeek: R1 0528 (free)

ID:deepseek/deepseek-r1-0528:free

# DeepSeek-R1

## 1. Introduction

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning. With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks. To support the research community, we have open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen. DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks, achieving new state-of-the-art results for dense models.

**NOTE: Before running DeepSeek-R1 series models locally, we kindly recommend reviewing the [Usage Recommendation](#usage-recommendations) section.**

<p align="center">
  <img width="80%" src="https://github.com/deepseek-ai/DeepSeek-R1/raw/main/figures/benchmark.jpg">
</p>

## 2. Model Summary

---

**Post-Training: Large-Scale Reinforcement Learning on the Base Model**

- We directly apply reinforcement learning (RL) to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area.
- We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.

---

**Distillation: Smaller Models Can Be Powerful Too**

- We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open source DeepSeek-R1, as well as its API, will benefit the research community to distill better smaller models in the future.
- Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.

## 3. Evaluation Results

### DeepSeek-R1-Evaluation

For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1.

<div align="center">

| Category | Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o 0513 | DeepSeek V3 | OpenAI o1-mini | OpenAI o1-1217 | DeepSeek R1 |
|----------|-------------------|----------------------|------------|--------------|----------------|------------|--------------|
| | Architecture | - | - | MoE | - | - | MoE |
| | # Activated Params | - | - | 37B | - | - | 37B |
| | # Total Params | - | - | 671B | - | - | 671B |
| English | MMLU (Pass@1) | 88.3 | 87.2 | 88.5 | 85.2 | **91.8** | 90.8 |
| | MMLU-Redux (EM) | 88.9 | 88.0 | 89.1 | 86.7 | - | **92.9** |
| | MMLU-Pro (EM) | 78.0 | 72.6 | 75.9 | 80.3 | - | **84.0** |
| | DROP (3-shot F1) | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 | **92.2** |
| | IF-Eval (Prompt Strict) | **86.5** | 84.3 | 86.1 | 84.8 | - | 83.3 |
| | GPQA-Diamond (Pass@1) | 65.0 | 49.9 | 59.1 | 60.0 | **75.7** | 71.5 |
| | SimpleQA (Correct) | 28.4 | 38.2 | 24.9 | 7.0 | **47.0** | 30.1 |
| | FRAMES (Acc.) | 72.5 | 80.5 | 73.3 | 76.9 | - | **82.5** |
| | AlpacaEval2.0 (LC-winrate) | 52.0 | 51.1 | 70.0 | 57.8 | - | **87.6** |
| | ArenaHard (GPT-4-1106) | 85.2 | 80.4 | 85.5 | 92.0 | - | **92.3** |
| Code | LiveCodeBench (Pass@1-COT) | 33.8 | 34.2 | - | 53.8 | 63.4 | **65.9** |
| | Codeforces (Percentile) | 20.3 | 23.6 | 58.7 | 93.4 | **96.6** | 96.3 |
| | Codeforces (Rating) | 717 | 759 | 1134 | 1820 | **2061** | 2029 |
| | SWE Verified (Resolved) | **50.8** | 38.8 | 42.0 | 41.6 | 48.9 | 49.2 |
| | Aider-Polyglot (Acc.) | 45.3 | 16.0 | 49.6 | 32.9 | **61.7** | 53.3 |
| Math | AIME 2024 (Pass@1) | 16.0 | 9.3 | 39.2 | 63.6 | 79.2 | **79.8** |
| | MATH-500 (Pass@1) | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 | **97.3** |
| | CNMO 2024 (Pass@1) | 13.1 | 10.8 | 43.2 | 67.6 | - | **78.8** |
| Chinese | CLUEWSC (EM) | 85.4 | 87.9 | 90.9 | 89.9 | - | **92.8** |
| | C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | **91.8** |
| | C-SimpleQA (Correct) | 55.4 | 58.7 | **68.0** | 40.3 | - | 63.7 |

</div>

### Distilled Model Evaluation

<div align="center">

| Model | AIME 2024 pass@1 | AIME 2024 cons@64 | MATH-500 pass@1 | GPQA Diamond pass@1 | LiveCodeBench pass@1 | CodeForces rating |
|------------------------------------------|------------------|-------------------|-----------------|----------------------|----------------------|-------------------|
| GPT-4o-0513 | 9.3 | 13.4 | 74.6 | 49.9 | 32.9 | 759 |
| Claude-3.5-Sonnet-1022 | 16.0 | 26.7 | 78.3 | 65.0 | 38.9 | 717 |
| o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | **1820** |
| QwQ-32B-Preview | 44.0 | 60.0 | 90.6 | 54.5 | 41.9 | 1316 |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.9 | 52.7 | 83.9 | 33.8 | 16.9 | 954 |
| DeepSeek-R1-Distill-Qwen-7B | 55.5 | 83.3 | 92.8 | 49.1 | 37.6 | 1189 |
| DeepSeek-R1-Distill-Qwen-14B | 69.7 | 80.0 | 93.9 | 59.1 | 53.1 | 1481 |
| DeepSeek-R1-Distill-Qwen-32B | **72.6** | 83.3 | 94.3 | 62.1 | 57.2 | 1691 |
| DeepSeek-R1-Distill-Llama-8B | 50.4 | 80.0 | 89.1 | 49.0 | 39.6 | 1205 |
| DeepSeek-R1-Distill-Llama-70B | 70.0 | **86.7** | **94.5** | **65.2** | **57.5** | 1633 |

</div>

## 4. Chat Website & API Platform

You can chat with DeepSeek-R1 on DeepSeek's official website: [chat.deepseek.com](https://chat.deepseek.com), and switch on the button "DeepThink"

We also provide OpenAI-Compatible API at DeepSeek Platform: [platform.deepseek.com](https://platform.deepseek.com/)

## 5. How to Run Locally

### DeepSeek-R1 Models

Please visit [DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3) repo for more information about running DeepSeek-R1 locally.

**NOTE: Hugging Face's Transformers has not been directly supported yet.**

### DeepSeek-R1-Distill Models

DeepSeek-R1-Distill models can be utilized in the same manner as Qwen or Llama models.

For instance, you can easily start a service using [vLLM](https://github.com/vllm-project/vllm):

```shell
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768 --enforce-eager
```

You can also easily start a service using [SGLang](https://github.com/sgl-project/sglang)

```bash
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --trust-remote-code --tp 2
```

### Usage Recommendations

**We recommend adhering to the following configurations when utilizing the DeepSeek-R1 series models, including benchmarking, to achieve the expected performance:**

1. Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.
2. **Avoid adding a system prompt; all instructions should be contained within the user prompt.**
3. For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."
4. When evaluating model performance, it is recommended to conduct multiple tests and average the results.

Additionally, we have observed that the DeepSeek-R1 series models tend to bypass thinking pattern (i.e., outputting "\<think\>\n\n\</think\>") when responding to certain queries, which can adversely affect the model's performance. **To ensure that the model engages in thorough reasoning, we recommend enforcing the model to initiate its response with "\<think\>\n" at the beginning of every output.**
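The usage recommendations above can be folded into a small request builder. This is a hedged sketch rather than an official client: the helper name `build_r1_request` is hypothetical, the request shape assumes an OpenAI-compatible chat API, and the model ID is the one shown on this card. It covers the temperature range, the no-system-prompt rule, and the math directive; forcing the `<think>` prefix depends on how the model is served and is not shown.

```python
from typing import Any

MATH_DIRECTIVE = "Please reason step by step, and put your final answer within \\boxed{}."


def build_r1_request(question: str, is_math: bool = False, temperature: float = 0.6) -> dict[str, Any]:
    """Build a chat request that follows the recommended R1 settings."""
    if not 0.5 <= temperature <= 0.7:
        raise ValueError("temperature should stay within 0.5-0.7 (0.6 recommended)")

    # All instructions live in the user prompt; no system message is added.
    content = question + "\n" + MATH_DIRECTIVE if is_math else question
    return {
        "model": "deepseek/deepseek-r1-0528:free",
        "messages": [{"role": "user", "content": content}],
        "temperature": temperature,
    }


request = build_r1_request("What is the sum of the first 100 positive integers?", is_math=True)
```

When benchmarking, the same request would be sent several times and the scores averaged, as item 4 of the recommendations suggests.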

DeepSeek

164K

tts-1-1106

ID:tts-1

No description available

Rifx.Online

$0.0001/request

FunAudioLLM/CosyVoice2-0.5B

ID:funaudiollm/cosyvoice2-0.5b

No description available

Rifx.Online

$0.01/request

Free

FunAudioLLM/SenseVoiceSmall

ID:funaudiollm/sensevoicesmall

FunAudioLLM/SenseVoiceSmall

Rifx.Online

gpt-4.1-mini

ID:gpt-4.1-mini

GPT-4.1 Mini is a mid-sized model delivering performance competitive with GPT-4o at substantially lower latency and cost. It retains a 1 million token context window and scores 45.1% on hard instruction evals, 35.8% on MultiChallenge, and 84.1% on IFEval. Mini also shows strong coding ability (e.g., 31.6% on Aider’s polyglot diff benchmark) and vision understanding, making it suitable for interactive applications with tight performance constraints.

OpenAI

1M

$0.4/M

$1.6/M

gpt-4.1-mini

ID:rifx/gpt-4.1-mini

GPT-4.1 Mini is a mid-sized model delivering performance competitive with GPT-4o at substantially lower latency and cost. It retains a 1 million token context window and scores 45.1% on hard instruction evals, 35.8% on MultiChallenge, and 84.1% on IFEval. Mini also shows strong coding ability (e.g., 31.6% on Aider’s polyglot diff benchmark) and vision understanding, making it suitable for interactive applications with tight performance constraints.

OpenAI

1M

$0.4/M

$1.6/M

free/gpt-4.1-nano

ID:rifx/gpt-4.1-nano

For tasks that demand low latency, GPT‑4.1 nano is the fastest and cheapest model in the GPT-4.1 series. It delivers exceptional performance at a small size with its 1 million token context window, and scores 80.1% on MMLU, 50.3% on GPQA, and 9.8% on Aider polyglot coding – even higher than GPT‑4o mini. It’s ideal for tasks like classification or autocompletion.

OpenAI

1M

$0.1/M

$0.4/M

gpt-4.1

ID:rifx/gpt-4.1

GPT-4.1 is a flagship large language model optimized for advanced instruction following, real-world software engineering, and long-context reasoning. It supports a 1 million token context window and outperforms GPT-4o and GPT-4.5 across coding (54.6% SWE-bench Verified), instruction compliance (87.4% IFEval), and multimodal understanding benchmarks. It is tuned for precise code diffs, agent reliability, and high recall in large document contexts, making it ideal for agents, IDE tooling, and enterprise knowledge retrieval.

OpenAI

1M

$2/M

$8/M

OpenAI: GPT-4.1

ID:openai/gpt-4.1

GPT-4.1 is a flagship large language model optimized for advanced instruction following, real-world software engineering, and long-context reasoning. It supports a 1 million token context window and outperforms GPT-4o and GPT-4.5 across coding (54.6% SWE-bench Verified), instruction compliance (87.4% IFEval), and multimodal understanding benchmarks. It is tuned for precise code diffs, agent reliability, and high recall in large document contexts, making it ideal for agents, IDE tooling, and enterprise knowledge retrieval.

OpenAI

1M

$2/M

$8/M

OpenAI: GPT-4.1 Nano

ID:openai/gpt-4.1-nano

For tasks that demand low latency, GPT‑4.1 nano is the fastest and cheapest model in the GPT-4.1 series. It delivers exceptional performance at a small size with its 1 million token context window, and scores 80.1% on MMLU, 50.3% on GPQA, and 9.8% on Aider polyglot coding – even higher than GPT‑4o mini. It’s ideal for tasks like classification or autocompletion.

OpenAI

1M

$0.1/M

$0.4/M

OpenAI: GPT-4.1 Mini

ID:openai/gpt-4.1-mini

GPT-4.1 Mini is a mid-sized model delivering performance competitive with GPT-4o at substantially lower latency and cost. It retains a 1 million token context window and scores 45.1% on hard instruction evals, 35.8% on MultiChallenge, and 84.1% on IFEval. Mini also shows strong coding ability (e.g., 31.6% on Aider’s polyglot diff benchmark) and vision understanding, making it suitable for interactive applications with tight performance constraints.

OpenAI

1M

$0.4/M

$1.6/M

Free

DeepSeek: DeepSeek V3 0324 (free)

ID:deepseek/deepseek-chat-v3-0324:free

DeepSeek V3, a 685B-parameter mixture-of-experts model, is the latest iteration of the flagship chat model family from the DeepSeek team. It succeeds the [DeepSeek V3](/deepseek/deepseek-chat-v3) model and performs well on a variety of tasks.

DeepSeek

64K

Free

Google: Gemma 3 27B (free)

ID:google/gemma-3-27b-it:free

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 27B is Google's latest open source model and the successor to [Gemma 2](google/gemma-2-27b-it).

Google

128K

Google: Gemma 3 27B

ID:google/gemma-3-27b-it

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 27B is Google's latest open source model and the successor to [Gemma 2](google/gemma-2-27b-it).

Google

128K

$0.3/M

$0.5/M

dall-e-3

ID:dall-e-3

dall-e-3

OpenAI

$0.001/request

black-forest-labs/FLUX.1-redux

ID:black-forest-labs/flux.1-redux

FLUX.1 Redux [dev] is an adapter for all FLUX.1 base models that generates image variations. Given an input image, FLUX.1 Redux reproduces it with slight variations, which makes it possible to refine that image. It also integrates naturally into more complex workflows, enabling image restyling.

Together

$0.025/request

tts-1-hd

ID:tts-1-hd

No description available

OpenAI

$300/M

Free

black-forest-labs/FLUX.1-schnell-Free

ID:black-forest-labs/flux.1-schnell-free

`FLUX.1 [schnell]` is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. For more information, please read our [blog post](https://blackforestlabs.ai/announcing-black-forest-labs/).

# Key Features

1. Cutting-edge output quality and competitive prompt following, matching the performance of closed source alternatives.
2. Trained using latent adversarial diffusion distillation, `FLUX.1 [schnell]` can generate high-quality images in only 1 to 4 steps.
3. Released under the `apache-2.0` licence, the model can be used for personal, scientific, and commercial purposes.

# Usage

We provide a reference implementation of `FLUX.1 [schnell]`, as well as sampling code, in a dedicated [github repository](https://github.com/black-forest-labs/flux). Developers and creatives looking to build on top of `FLUX.1 [schnell]` are encouraged to use this as a starting point.

## API Endpoints

The FLUX.1 models are also available via API from the following sources:

- [bfl.ml](https://docs.bfl.ml/) (currently `FLUX.1 [pro]`)
- [replicate.com](https://replicate.com/collections/flux)
- [fal.ai](https://fal.ai/models/fal-ai/flux/schnell)
- [mystic.ai](https://www.mystic.ai/black-forest-labs/flux1-schnell)

## ComfyUI

`FLUX.1 [schnell]` is also available in [Comfy UI](https://github.com/comfyanonymous/ComfyUI) for local inference with a node-based workflow.

## Diffusers

To use `FLUX.1 [schnell]` with the 🧨 diffusers python library, first install or upgrade diffusers:

```shell
pip install -U diffusers
```

Then you can use `FluxPipeline` to run the model:

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)
# Save some VRAM by offloading the model to CPU. Remove this if you have enough GPU power.
pipe.enable_model_cpu_offload()

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt,
    guidance_scale=0.0,
    num_inference_steps=4,
    max_sequence_length=256,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0]
image.save("flux-schnell.png")
```

To learn more, check out the [diffusers](https://huggingface.co/docs/diffusers/main/en/api/pipelines/flux) documentation.

---

# Limitations

- This model is not intended or able to provide factual information.
- As a statistical model this checkpoint might amplify existing societal biases.
- The model may fail to generate output that matches the prompts.
- Prompt following is heavily influenced by the prompting-style.

# Out-of-Scope Use

The model and its derivatives may not be used:

- In any way that violates any applicable national, federal, state, local or international law or regulation.
- For the purpose of exploiting, harming or attempting to exploit or harm minors in any way; including but not limited to the solicitation, creation, acquisition, or dissemination of child exploitative content.
- To generate or disseminate verifiably false information and/or content with the purpose of harming others.
- To generate or disseminate personal identifiable information that can be used to harm an individual.
- To harass, abuse, threaten, stalk, or bully individuals or groups of individuals.
- To create non-consensual nudity or illegal pornographic content.
- For fully automated decision making that adversely impacts an individual's legal rights or otherwise creates or modifies a binding, enforceable obligation.
- Generating or facilitating large-scale disinformation campaigns.

Together

Google: Gemini 2.0 Flash Lite

ID:google/gemini-2.0-flash-lite-001

Gemini 2.0 Flash Lite offers a significantly faster time to first token (TTFT) compared to [Gemini Flash 1.5](/google/gemini-flash-1.5), while maintaining quality on par with larger models like [Gemini Pro 1.5](/google/gemini-pro-1.5), all at extremely economical token prices.

Google

1M

$0.075/M

$0.3/M

Qwen: Qwen2.5 VL 72B Instruct

ID:qwen/qwen2.5-vl-72b-instruct

Qwen2.5-VL is proficient in recognizing common objects such as flowers, birds, fish, and insects. It is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.

Qwen

131K

$0.7/M

Meta: Llama 3 70B (Base)

ID:meta-llama/llama-3-70b

Meta's latest class of model (Llama 3) launched with a variety of sizes & flavors. This is the base 70B pre-trained version. It has demonstrated strong performance compared to leading closed-source models in human evaluations. To read more about the model release, [click here](https://ai.meta.com/blog/meta-llama-3/). Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).

Meta Llama

8K

$0.59/M

$0.79/M

Meta: Llama 3 8B (Base)

ID:meta-llama/llama-3-8b

Meta's latest class of model (Llama 3) launched with a variety of sizes & flavors. This is the base 8B pre-trained version. It has demonstrated strong performance compared to leading closed-source models in human evaluations. To read more about the model release, [click here](https://ai.meta.com/blog/meta-llama-3/). Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).

Meta Llama

8K

$0.05/M

$0.08/M

Anthropic: Claude 3.7 Sonnet

ID:anthropic/claude-3.7-sonnet

Claude 3.7 Sonnet is an advanced large language model with improved reasoning, coding, and problem-solving capabilities. It introduces a hybrid reasoning approach, allowing users to choose between rapid responses and extended, step-by-step processing for complex tasks. The model demonstrates notable improvements in coding, particularly in front-end development and full-stack updates, and excels in agentic workflows, where it can autonomously navigate multi-step processes. Claude 3.7 Sonnet maintains performance parity with its predecessor in standard mode while offering an extended reasoning mode for enhanced accuracy in math, coding, and instruction-following tasks. Read more at the [blog post here](https://www.anthropic.com/news/claude-3-7-sonnet)

Anthropic

200K

$3/M

$15/M

Perplexity: R1 1776

ID:perplexity/r1-1776

Note: As this model does not return <think> tags, thoughts will be streamed by default directly to the `content` field. R1 1776 is a version of DeepSeek-R1 that has been post-trained to remove censorship constraints related to topics restricted by the Chinese government. The model retains its original reasoning capabilities while providing direct responses to a wider range of queries. R1 1776 is an offline chat model that does not use the perplexity search subsystem. The model was tested on a multilingual dataset of over 1,000 examples covering sensitive topics to measure its likelihood of refusal or overly filtered responses. [Evaluation Results](https://cdn-uploads.huggingface.co/production/uploads/675c8332d01f593dc90817f5/GiN2VqC5hawUgAGJ6oHla.png) Its performance on math and reasoning benchmarks remains similar to the base R1 model. [Reasoning Performance](https://cdn-uploads.huggingface.co/production/uploads/675c8332d01f593dc90817f5/n4Z9Byqp2S7sKUvCvI40R.png) Read more on the [Blog Post](https://perplexity.ai/hub/blog/open-sourcing-r1-1776)

Perplexity

128K

$2/M

$8/M

20%

OpenAI: o3 Mini High

ID:openai/o3-mini-high

OpenAI o3-mini-high is the same model as [o3-mini](/openai/o3-mini) with reasoning_effort set to high. o3-mini is a cost-efficient language model optimized for STEM reasoning tasks, particularly excelling in science, mathematics, and coding. The model features three adjustable reasoning effort levels and supports key developer capabilities including function calling, structured outputs, and streaming, though it does not include vision processing capabilities. The model demonstrates significant improvements over its predecessor, with expert testers preferring its responses 56% of the time and noting a 39% reduction in major errors on complex questions. With medium reasoning effort settings, o3-mini matches the performance of the larger o1 model on challenging reasoning evaluations like AIME and GPQA, while maintaining lower latency and cost.
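The relationship between the two listings can be sketched as a request body. This is a minimal sketch, assuming an OpenAI-compatible chat-completions payload in which `reasoning_effort` is a top-level field; the helper name `build_request` is hypothetical, and nothing is sent over the network.

```python
import json

# The description mentions three effort levels; "low", "medium" and "high"
# are assumed here to be their names.
ALLOWED_EFFORTS = {"low", "medium", "high"}


def build_request(prompt: str, effort: str = "high") -> dict:
    """Build a chat-completions style body for o3-mini (hypothetical helper)."""
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(f"effort must be one of {sorted(ALLOWED_EFFORTS)}")
    return {
        "model": "openai/o3-mini",
        "reasoning_effort": effort,
        "messages": [{"role": "user", "content": prompt}],
    }


if __name__ == "__main__":
    # With the default effort this should behave like the o3-mini-high listing.
    print(json.dumps(build_request("Is 221 a prime number?"), indent=2))
```

Requesting `openai/o3-mini-high` directly is the simpler route when the effort level never changes; the explicit field is only useful when it is chosen per request.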

OpenAI

200K

$1.1/M

$4.4/M

20%

DeepSeek: R1

ID:deepseek/deepseek-r1

# DeepSeek-R1

## 1. Introduction

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning. With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks. To support the research community, we have open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen. DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks, achieving new state-of-the-art results for dense models.

**NOTE: Before running DeepSeek-R1 series models locally, we kindly recommend reviewing the [Usage Recommendation](#usage-recommendations) section.**

<p align="center"> <img width="80%" src="https://github.com/deepseek-ai/DeepSeek-R1/raw/main/figures/benchmark.jpg"> </p>

## 2. Model Summary

---

**Post-Training: Large-Scale Reinforcement Learning on the Base Model**

- We directly apply reinforcement learning (RL) to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area.
- We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.

---

**Distillation: Smaller Models Can Be Powerful Too**

- We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open source DeepSeek-R1, as well as its API, will benefit the research community to distill better smaller models in the future.
- Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.

## 3. Evaluation Results

### DeepSeek-R1-Evaluation

For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1.

<div align="center">

| Category | Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o 0513 | DeepSeek V3 | OpenAI o1-mini | OpenAI o1-1217 | DeepSeek R1 |
|----------|-------------------|----------------------|------------|--------------|----------------|------------|--------------|
| | Architecture | - | - | MoE | - | - | MoE |
| | # Activated Params | - | - | 37B | - | - | 37B |
| | # Total Params | - | - | 671B | - | - | 671B |
| English | MMLU (Pass@1) | 88.3 | 87.2 | 88.5 | 85.2 | **91.8** | 90.8 |
| | MMLU-Redux (EM) | 88.9 | 88.0 | 89.1 | 86.7 | - | **92.9** |
| | MMLU-Pro (EM) | 78.0 | 72.6 | 75.9 | 80.3 | - | **84.0** |
| | DROP (3-shot F1) | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 | **92.2** |
| | IF-Eval (Prompt Strict) | **86.5** | 84.3 | 86.1 | 84.8 | - | 83.3 |
| | GPQA-Diamond (Pass@1) | 65.0 | 49.9 | 59.1 | 60.0 | **75.7** | 71.5 |
| | SimpleQA (Correct) | 28.4 | 38.2 | 24.9 | 7.0 | **47.0** | 30.1 |
| | FRAMES (Acc.) | 72.5 | 80.5 | 73.3 | 76.9 | - | **82.5** |
| | AlpacaEval2.0 (LC-winrate) | 52.0 | 51.1 | 70.0 | 57.8 | - | **87.6** |
| | ArenaHard (GPT-4-1106) | 85.2 | 80.4 | 85.5 | 92.0 | - | **92.3** |
| Code | LiveCodeBench (Pass@1-COT) | 33.8 | 34.2 | - | 53.8 | 63.4 | **65.9** |
| | Codeforces (Percentile) | 20.3 | 23.6 | 58.7 | 93.4 | **96.6** | 96.3 |
| | Codeforces (Rating) | 717 | 759 | 1134 | 1820 | **2061** | 2029 |
| | SWE Verified (Resolved) | **50.8** | 38.8 | 42.0 | 41.6 | 48.9 | 49.2 |
| | Aider-Polyglot (Acc.) | 45.3 | 16.0 | 49.6 | 32.9 | **61.7** | 53.3 |
| Math | AIME 2024 (Pass@1) | 16.0 | 9.3 | 39.2 | 63.6 | 79.2 | **79.8** |
| | MATH-500 (Pass@1) | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 | **97.3** |
| | CNMO 2024 (Pass@1) | 13.1 | 10.8 | 43.2 | 67.6 | - | **78.8** |
| Chinese | CLUEWSC (EM) | 85.4 | 87.9 | 90.9 | 89.9 | - | **92.8** |
| | C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | **91.8** |
| | C-SimpleQA (Correct) | 55.4 | 58.7 | **68.0** | 40.3 | - | 63.7 |

</div>

### Distilled Model Evaluation

<div align="center">

| Model | AIME 2024 pass@1 | AIME 2024 cons@64 | MATH-500 pass@1 | GPQA Diamond pass@1 | LiveCodeBench pass@1 | CodeForces rating |
|------------------------------------------|------------------|-------------------|-----------------|----------------------|----------------------|-------------------|
| GPT-4o-0513 | 9.3 | 13.4 | 74.6 | 49.9 | 32.9 | 759 |
| Claude-3.5-Sonnet-1022 | 16.0 | 26.7 | 78.3 | 65.0 | 38.9 | 717 |
| o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | **1820** |
| QwQ-32B-Preview | 44.0 | 60.0 | 90.6 | 54.5 | 41.9 | 1316 |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.9 | 52.7 | 83.9 | 33.8 | 16.9 | 954 |
| DeepSeek-R1-Distill-Qwen-7B | 55.5 | 83.3 | 92.8 | 49.1 | 37.6 | 1189 |
| DeepSeek-R1-Distill-Qwen-14B | 69.7 | 80.0 | 93.9 | 59.1 | 53.1 | 1481 |
| DeepSeek-R1-Distill-Qwen-32B | **72.6** | 83.3 | 94.3 | 62.1 | 57.2 | 1691 |
| DeepSeek-R1-Distill-Llama-8B | 50.4 | 80.0 | 89.1 | 49.0 | 39.6 | 1205 |
| DeepSeek-R1-Distill-Llama-70B | 70.0 | **86.7** | **94.5** | **65.2** | **57.5** | 1633 |

</div>

## 4. Chat Website & API Platform

You can chat with DeepSeek-R1 on DeepSeek's official website: [chat.deepseek.com](https://chat.deepseek.com), and switch on the button "DeepThink".

We also provide OpenAI-Compatible API at DeepSeek Platform: [platform.deepseek.com](https://platform.deepseek.com/)

## 5. How to Run Locally

### DeepSeek-R1 Models

Please visit [DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3) repo for more information about running DeepSeek-R1 locally.

**NOTE: Hugging Face's Transformers has not been directly supported yet.**

### DeepSeek-R1-Distill Models

DeepSeek-R1-Distill models can be utilized in the same manner as Qwen or Llama models.

For instance, you can easily start a service using [vLLM](https://github.com/vllm-project/vllm):

```shell
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768 --enforce-eager
```

You can also easily start a service using [SGLang](https://github.com/sgl-project/sglang):

```bash
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --trust-remote-code --tp 2
```

### Usage Recommendations

**We recommend adhering to the following configurations when utilizing the DeepSeek-R1 series models, including benchmarking, to achieve the expected performance:**

1. Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.
2. **Avoid adding a system prompt; all instructions should be contained within the user prompt.**
3. For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."
4. When evaluating model performance, it is recommended to conduct multiple tests and average the results.

Additionally, we have observed that the DeepSeek-R1 series models tend to bypass thinking pattern (i.e., outputting "\<think\>\n\n\</think\>") when responding to certain queries, which can adversely affect the model's performance. **To ensure that the model engages in thorough reasoning, we recommend enforcing the model to initiate its response with "\<think\>\n" at the beginning of every output.**
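The usage recommendations above can be collected into a small request builder. This is a minimal sketch, assuming an OpenAI-compatible chat payload; the helper names `build_r1_request` and `with_think_prefill` are hypothetical, the `top_p` value is borrowed from the evaluation settings rather than from the recommendations, and the prefill helper assumes a backend that accepts an already-templated raw prompt.

```python
MATH_DIRECTIVE = "Please reason step by step, and put your final answer within \\boxed{}."


def build_r1_request(question: str, is_math: bool = False) -> dict:
    """Build a chat payload that follows the recommendations (hypothetical helper)."""
    # Recommendation 3: append the step-by-step directive for math problems.
    content = f"{question}\n{MATH_DIRECTIVE}" if is_math else question
    return {
        "model": "deepseek/deepseek-r1",
        "temperature": 0.6,  # recommendation 1: stay within 0.5-0.7
        "top_p": 0.95,       # assumption: value used in the evaluation setup
        # Recommendation 2: no system message, everything goes in the user turn.
        "messages": [{"role": "user", "content": content}],
    }


def with_think_prefill(rendered_prompt: str) -> str:
    """Append the '<think>' opener so generation starts inside the reasoning block."""
    return rendered_prompt + "<think>\n"


if __name__ == "__main__":
    request = build_r1_request("What is 12 * 13?", is_math=True)
    print(request["messages"][0]["content"])
    print(repr(with_think_prefill("...templated prompt...")))
```

Recommendation 4 (averaging over several runs) is left to the caller, for example by sending the same payload several times and scoring each response.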

DeepSeek

164K

$3/M

$8/M

Google: Gemini Flash 2.0

ID:google/gemini-2.0-flash-001

Gemini Flash 2.0 offers a significantly faster time to first token (TTFT) compared to [Gemini Flash 1.5](/google/gemini-flash-1.5), while maintaining quality on par with larger models like [Gemini Pro 1.5](/google/gemini-pro-1.5). It introduces notable enhancements in multimodal understanding, coding capabilities, complex instruction following, and function calling. These advancements come together to deliver more seamless and robust agentic experiences.

Google

1M

$0.1/M

$0.4/M

DeepSeek: DeepSeek R1 Distill Llama 70B

ID:deepseek/deepseek-r1-distill-llama-70b

DeepSeek R1 Distill Llama 70B is a distilled large language model based on [Llama-3.3-70B-Instruct](/meta-llama/llama-3.3-70b-instruct), using outputs from [DeepSeek R1](/deepseek/deepseek-r1). The model combines advanced distillation techniques to achieve high performance across multiple benchmarks, including:

- AIME 2024 pass@1: 70.0
- MATH-500 pass@1: 94.5
- CodeForces Rating: 1633

The model leverages fine-tuning from DeepSeek R1's outputs, enabling competitive performance comparable to larger frontier models.

DeepSeek

131K

$0.23/M

$0.69/M

Sao10K: Llama 3 8B Lunaris

ID:sao10k/l3-lunaris-8b

Lunaris 8B is a versatile generalist and roleplaying model based on Llama 3. It's a strategic merge of multiple models, designed to balance creativity with improved logic and general knowledge. Created by [Sao10k](https://huggingface.co/Sao10k), this model aims to offer an improved experience over Stheno v3.2, with enhanced creativity and logical reasoning. For best results, use with Llama 3 Instruct context template, temperature 1.4, and min_p 0.1.
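The recommended sampler settings can be illustrated with a small pure-Python sketch of temperature scaling followed by min-p filtering. It assumes the common definition of min-p (keep tokens whose probability is at least `min_p` times the top token's probability) and applies temperature first; some runtimes use a different sampler order, and the function names are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p):
    """Return indices of tokens with probability >= min_p * max probability."""
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]


def min_p_sample(logits, temperature=1.4, min_p=0.1, rng=None):
    """Sample one token index using the settings suggested in the description."""
    rng = rng or random.Random()
    probs = softmax(logits, temperature)
    kept = min_p_filter(probs, min_p)
    weights = [probs[i] for i in kept]  # random.choices renormalizes weights
    return rng.choices(kept, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.5, 0.2, -3.0, -8.0]
    print(min_p_filter(softmax(toy_logits, 1.4), 0.1))
    print(min_p_sample(toy_logits, rng=random.Random(0)))
```

A high temperature such as 1.4 flattens the distribution, and the min-p cutoff then removes the long tail of unlikely tokens, which is presumably why the two settings are suggested together.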

Rifx.Online

8K

$0.03/M

$0.06/M

Inflatebot: Mag Mell R1 12B

ID:inflatebot/mn-mag-mell-r1

Mag Mell is a merge of pre-trained language models created using mergekit, based on [Mistral Nemo](/mistralai/mistral-nemo). It is a roleplay and storytelling model that combines the best parts of many other models, and it is intended as a general-purpose "Best of Nemo" model for any fictional or creative use case. Mag Mell is composed of 3 intermediate parts:

- Hero (RP, trope coverage)
- Monk (Intelligence, groundedness)
- Deity (Prose, flair)

Rifx.Online

16K

$0.9/M

Meta: Llama 3.3 70B Instruct

ID:meta-llama/llama-3.3-70b-instruct

The Meta Llama 3.3 multilingual large language model (LLM) is a pretrained and instruction tuned generative model in 70B (text in/text out). The Llama 3.3 instruction tuned text only model is optimized for multilingual dialogue use cases and outperforms many of the available open source and closed chat models on common industry benchmarks. Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. [Model Card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md)

Meta Llama

131K

$0.13/M

$0.4/M

text-embedding-3-small

ID:text-embedding-3-small

text-embedding-3-small is OpenAI's cost-effective text embedding model, serving as the lightweight version in the text-embedding-3 series. This model maintains good performance while offering a more economical pricing option.

## Key Features

- **Cost-Effective**: Approximately 1/6 the price of text-embedding-3-large
- **Multilingual Support**: Supports embeddings for 100+ languages
- **Context Length**: Handles up to 8192 tokens of input
- **Embedding Dimension**: Fixed 1536-dimensional embedding vectors

## Performance

- Outperforms the older text-embedding-ada-002 on most tasks
- Slightly lower performance than text-embedding-3-large, but sufficient for most applications

## Use Cases

1. Text Similarity Matching
2. Information Retrieval
3. Text Classification
4. Small-scale RAG Applications
5. Personal Projects or Budget-conscious Commercial Applications

## Pricing

- Input: $0.00002 / 1K tokens
- Output: Free

## Usage Recommendations

- Ideal for projects with limited budgets requiring quality embeddings
- Recommended for testing and development
- Consider upgrading to text-embedding-3-large if highest embedding quality is required

text-embedding-3-small offers an excellent balance between performance and cost, making it particularly suitable for startups and budget-sensitive applications.
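The per-1K price in the description and the per-million price shown in the listing are the same rate in different units. The arithmetic can be sketched as follows; the function names are hypothetical, and the token counts are made-up inputs rather than output from a real tokenizer.

```python
PRICE_PER_1K_TOKENS = 0.00002  # USD, from the Pricing section of the description
MAX_INPUT_TOKENS = 8192        # context length stated in Key Features


def price_per_million(price_per_1k: float) -> float:
    """Convert a per-1K-token price into a per-1M-token price."""
    return price_per_1k * 1000


def embedding_cost(token_counts) -> float:
    """Estimate the USD cost of embedding documents with the given token counts."""
    for count in token_counts:
        if count > MAX_INPUT_TOKENS:
            raise ValueError(f"{count} tokens exceeds the {MAX_INPUT_TOKENS}-token input limit")
    return sum(token_counts) / 1000 * PRICE_PER_1K_TOKENS


if __name__ == "__main__":
    print(f"${price_per_million(PRICE_PER_1K_TOKENS):.2f} per 1M tokens")
    # 1,000 documents of 1,000 tokens each is one million tokens in total.
    print(f"${embedding_cost([1000] * 1000):.4f} for one million tokens")
```

Only input tokens are billed for this model, so the estimate depends on nothing but the total token count of the documents.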

OpenAI

$0.02/M

Amazon: Nova Lite 1.0

ID:amazon/nova-lite-v1

Amazon Nova Lite 1.0 is a very low-cost multimodal model from Amazon focused on fast processing of image, video, and text inputs to generate text output. Amazon Nova Lite can handle real-time customer interactions, document analysis, and visual question-answering tasks with high accuracy. With an input context of 300K tokens, it can analyze multiple images or up to 30 minutes of video in a single input.

Amazon

300K

$0.06/M

$0.24/M

Toppy M 7B

ID:undi95/toppy-m-7b

A wild 7B parameter model that merges several models using the new task_arithmetic merge method from mergekit. List of merged models:

- NousResearch/Nous-Capybara-7B-V1.9
- [HuggingFaceH4/zephyr-7b-beta](/huggingfaceh4/zephyr-7b-beta)
- lemonilia/AshhLimaRP-Mistral-7B
- Vulkane/120-Days-of-Sodom-LoRA-Mistral-7b
- Undi95/Mistral-pippa-sharegpt-7b-qlora

#merge #uncensored
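The general idea behind a task-arithmetic merge can be sketched with plain Python lists standing in for weight tensors: each fine-tuned model contributes its difference from a shared base, scaled by a weight. This illustrates the technique only; it is not mergekit's code, and the base model, the weights used for Toppy M, and the function name are all assumptions.

```python
def task_arithmetic(base, finetuned_models, weights):
    """Merge models as base + sum_i w_i * (model_i - base), tensor by tensor.

    `base` and every entry of `finetuned_models` map a parameter name to a
    flat list of floats (a stand-in for real weight tensors).
    """
    if len(finetuned_models) != len(weights):
        raise ValueError("need exactly one weight per fine-tuned model")
    merged = {}
    for name, base_values in base.items():
        out = list(base_values)
        for model, weight in zip(finetuned_models, weights):
            for k, value in enumerate(model[name]):
                # Add this model's scaled "task vector" for the parameter.
                out[k] += weight * (value - base_values[k])
        merged[name] = out
    return merged


if __name__ == "__main__":
    base = {"layer.weight": [1.0, 2.0]}
    model_a = {"layer.weight": [2.0, 2.0]}
    model_b = {"layer.weight": [1.0, 4.0]}
    print(task_arithmetic(base, [model_a, model_b], [0.5, 0.5]))
```

Because only differences from the base are combined, parameters that no fine-tune changed stay exactly at their base values.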

Undi95

4K

$0.07/M

ReMM SLERP 13B

ID:undi95/remm-slerp-l2-13b

A recreation trial of the original MythoMax-L2-B13 but with updated models. #merge

Undi95

4K

$1.125/M

Mistral: Pixtral 12B

ID:mistralai/pixtral-12b

The first image-to-text model from Mistral AI. Its weights were released via torrent, per their tradition: https://x.com/mistralai/status/1833758285167722836

MistralAI

4K

$0.1/M

Phi-3.5 Mini 128K Instruct

ID:microsoft/phi-3.5-mini-128k-instruct

Phi-3.5 models are lightweight, state-of-the-art open models. These models were trained with Phi-3 datasets that include both synthetic data and filtered, publicly available website data, with a focus on high-quality and reasoning-dense properties. Phi-3.5 Mini uses 3.8B parameters, and is a dense decoder-only transformer model using the same tokenizer as [Phi-3 Mini](/microsoft/phi-3-mini-128k-instruct). The models underwent a rigorous enhancement process, incorporating supervised fine-tuning, proximal policy optimization, and direct preference optimization to ensure precise instruction adherence and robust safety measures. When assessed against benchmarks that test common sense, language understanding, math, code, long context and logical reasoning, Phi-3.5 models showcased robust and state-of-the-art performance among models with fewer than 13 billion parameters.

Microsoft Azure

128K

$0.1/M

OpenAI: ChatGPT-4o

ID:openai/chatgpt-4o-latest

Dynamic model continuously updated to the current version of [GPT-4o](/openai/gpt-4o) in ChatGPT. Intended for research and evaluation. Note: This model is currently experimental and not suitable for production use-cases, and may be heavily rate-limited.

OpenAI

128K

$5/M

$15/M

Anthropic: Claude 3.5 Sonnet (2024-06-20)

ID:anthropic/claude-3.5-sonnet-20240620

Claude 3.5 Sonnet delivers better-than-Opus capabilities, faster-than-Sonnet speeds, at the same Sonnet prices. Sonnet is particularly good at:

- Coding: Autonomously writes, edits, and runs code with reasoning and troubleshooting
- Data science: Augments human data science expertise; navigates unstructured data while using multiple tools for insights
- Visual processing: excelling at interpreting charts, graphs, and images, accurately transcribing text to derive insights beyond just the text alone
- Agentic tasks: exceptional tool use, making it great at agentic tasks (i.e. complex, multi-step problem solving tasks that require engaging with other systems)

#multimodal

Anthropic

200K

$3/M

$15/M

Meta: Llama 3.2 1B Instruct

ID:meta-llama/llama-3.2-1b-instruct

Llama 3.2 1B is a 1-billion-parameter language model focused on efficiently performing natural language tasks, such as summarization, dialogue, and multilingual text analysis. Its smaller size allows it to operate efficiently in low-resource environments while maintaining strong task performance. Supporting eight core languages and fine-tunable for more, Llama 3.2 1B is ideal for businesses or developers seeking lightweight yet powerful AI solutions that can operate in diverse multilingual settings without the high computational demand of larger models. Click here for the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD.md). Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).

Meta Llama

131K

$0.01/M

$0.02/M

Qwen: QwQ 32B Preview

ID:qwen/qwq-32b-preview

## Introduction

**QwQ-32B-Preview** is an experimental research model developed by the Qwen Team, focused on advancing AI reasoning capabilities. As a preview release, it demonstrates promising analytical abilities while having several important limitations:

1. **Language Mixing and Code-Switching**: The model may mix languages or switch between them unexpectedly, affecting response clarity.
2. **Recursive Reasoning Loops**: The model may enter circular reasoning patterns, leading to lengthy responses without a conclusive answer.
3. **Safety and Ethical Considerations**: The model requires enhanced safety measures to ensure reliable and secure performance, and users should exercise caution when deploying it.
4. **Performance and Benchmark Limitations**: The model excels in math and coding but has room for improvement in other areas, such as common sense reasoning and nuanced language understanding.

**Specification**:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 32.5B
- Number of Parameters (Non-Embedding): 31.0B
- Number of Layers: 64
- Number of Attention Heads (GQA): 40 for Q and 8 for KV
- Context Length: Full 32,768 tokens

For more details, please refer to our [blog](https://qwenlm.github.io/blog/qwq-32b-preview/). You can also check Qwen2.5 [GitHub](https://github.com/QwenLM/Qwen2.5), and [Documentation](https://qwen.readthedocs.io/en/latest/).

## Requirements

The code for Qwen2.5 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.

With `transformers<4.37.0`, you will encounter the following error:

```
KeyError: 'qwen2'
```

## Quickstart

The following code snippet uses `apply_chat_template` to show how to load the tokenizer and model and how to generate content.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B-Preview"

# Load the model and tokenizer; dtype and device placement are chosen automatically.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r in strawberry."
messages = [
    {"role": "system", "content": "You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step."},
    {"role": "user", "content": prompt}
]
# Render the chat messages into a single prompt string.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Drop the prompt tokens so only the newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

## Citation

If you find our work helpful, please cite us.

```
@misc{qwq-32b-preview,
    title = {QwQ: Reflect Deeply on the Boundaries of the Unknown},
    url = {https://qwenlm.github.io/blog/qwq-32b-preview/},
    author = {Qwen Team},
    month = {November},
    year = {2024}
}

@article{qwen2,
    title={Qwen2 Technical Report},
    author={An Yang and Baosong Yang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Zhou and Chengpeng Li and Chengyuan Li and Dayiheng Liu and Fei Huang and Guanting Dong and Haoran Wei and Huan Lin and Jialong Tang and Jialin Wang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Ma and Jin Xu and Jingren Zhou and Jinze Bai and Jinzheng He and Junyang Lin and Kai Dang and Keming Lu and Keqin Chen and Kexin Yang and Mei Li and Mingfeng Xue and Na Ni and Pei Zhang and Peng Wang and Ru Peng and Rui Men and Ruize Gao and Runji Lin and Shijie Wang and Shuai Bai and Sinan Tan and Tianhang Zhu and Tianhao Li and Tianyu Liu and Wenbin Ge and Xiaodong Deng and Xiaohuan Zhou and Xingzhang Ren and Xinyu Zhang and Xipin Wei and Xuancheng Ren and Yang Fan and Yang Yao and Yichang Zhang and Yu Wan and Yunfei Chu and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zhihao Fan},
    journal={arXiv preprint arXiv:2407.10671},
    year={2024}
}
```

Qwen

33K

$0.15/M

$0.6/M

Google: Gemini 2.0 Flash Experimental

ID:google/gemini-2.0-flash-exp

Gemini 2.0 Flash offers a significantly faster time to first token (TTFT) compared to [Gemini 1.5 Flash](google/gemini-flash-1.5), while maintaining quality on par with larger models like [Gemini 1.5 Pro](google/gemini-pro-1.5). It introduces notable enhancements in multimodal understanding, coding capabilities, complex instruction following, and function calling. These advancements come together to deliver more seamless and robust agentic experiences.

Google

1M

$0.2/M

$0.6/M

Free

Google: Gemini 2.0 Flash Experimental (free)

ID:google/gemini-2.0-flash-exp:free

Gemini 2.0 Flash offers a significantly faster time to first token (TTFT) compared to [Gemini 1.5 Flash](google/gemini-flash-1.5), while maintaining quality on par with larger models like [Gemini 1.5 Pro](google/gemini-pro-1.5). It introduces notable enhancements in multimodal understanding, coding capabilities, complex instruction following, and function calling. These advancements come together to deliver more seamless and robust agentic experiences.

Google

1M

Free

De

DeepSeek: R1 (free)

ID:deepseek/deepseek-r1:free

# DeepSeek-R1

## 1. Introduction

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning. With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks. To support the research community, we have open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen. DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks, achieving new state-of-the-art results for dense models.

**NOTE: Before running DeepSeek-R1 series models locally, we kindly recommend reviewing the [Usage Recommendation](#usage-recommendations) section.**

<p align="center"> <img width="80%" src="https://github.com/deepseek-ai/DeepSeek-R1/raw/main/figures/benchmark.jpg"> </p>

## 2. Model Summary

---

**Post-Training: Large-Scale Reinforcement Learning on the Base Model**

- We directly apply reinforcement learning (RL) to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area.
- We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.

---

**Distillation: Smaller Models Can Be Powerful Too**

- We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open source DeepSeek-R1, as well as its API, will benefit the research community to distill better smaller models in the future.
- Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.

## 3. Evaluation Results

### DeepSeek-R1-Evaluation

For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1.
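The sampling-based pass@1 estimate mentioned above can be sketched as the fraction of correct responses among the samples drawn for each query, averaged over all queries. This is a minimal illustration, not DeepSeek's evaluation code: the function name `pass_at_1` and the nested-list input of booleans are assumptions made for the example.

```python
from statistics import mean


def pass_at_1(correct_per_query):
    """Estimate pass@1 from sampled responses.

    correct_per_query: one inner list per query, holding a bool per sampled
    response (True if that response was graded correct). This input layout
    is a hypothetical convention chosen for this sketch.
    """
    if not correct_per_query or any(len(s) == 0 for s in correct_per_query):
        raise ValueError("every query needs at least one sampled response")
    # Per-query accuracy over its samples, then the average across queries.
    per_query = [mean(1.0 if c else 0.0 for c in samples) for samples in correct_per_query]
    return mean(per_query)


# Two queries with 4 samples each (the setup described above uses 64 per query).
score = pass_at_1([[True, False, True, True], [False, False, True, False]])
```

Averaging over many samples per query reduces the variance that a single sampled response at temperature 0.6 would otherwise introduce.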
<div align="center">

| Category | Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o 0513 | DeepSeek V3 | OpenAI o1-mini | OpenAI o1-1217 | DeepSeek R1 |
|----------|-------------------|----------------------|------------|--------------|----------------|------------|--------------|
| | Architecture | - | - | MoE | - | - | MoE |
| | # Activated Params | - | - | 37B | - | - | 37B |
| | # Total Params | - | - | 671B | - | - | 671B |
| English | MMLU (Pass@1) | 88.3 | 87.2 | 88.5 | 85.2 | **91.8** | 90.8 |
| | MMLU-Redux (EM) | 88.9 | 88.0 | 89.1 | 86.7 | - | **92.9** |
| | MMLU-Pro (EM) | 78.0 | 72.6 | 75.9 | 80.3 | - | **84.0** |
| | DROP (3-shot F1) | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 | **92.2** |
| | IF-Eval (Prompt Strict) | **86.5** | 84.3 | 86.1 | 84.8 | - | 83.3 |
| | GPQA-Diamond (Pass@1) | 65.0 | 49.9 | 59.1 | 60.0 | **75.7** | 71.5 |
| | SimpleQA (Correct) | 28.4 | 38.2 | 24.9 | 7.0 | **47.0** | 30.1 |
| | FRAMES (Acc.) | 72.5 | 80.5 | 73.3 | 76.9 | - | **82.5** |
| | AlpacaEval2.0 (LC-winrate) | 52.0 | 51.1 | 70.0 | 57.8 | - | **87.6** |
| | ArenaHard (GPT-4-1106) | 85.2 | 80.4 | 85.5 | 92.0 | - | **92.3** |
| Code | LiveCodeBench (Pass@1-COT) | 33.8 | 34.2 | - | 53.8 | 63.4 | **65.9** |
| | Codeforces (Percentile) | 20.3 | 23.6 | 58.7 | 93.4 | **96.6** | 96.3 |
| | Codeforces (Rating) | 717 | 759 | 1134 | 1820 | **2061** | 2029 |
| | SWE Verified (Resolved) | **50.8** | 38.8 | 42.0 | 41.6 | 48.9 | 49.2 |
| | Aider-Polyglot (Acc.) | 45.3 | 16.0 | 49.6 | 32.9 | **61.7** | 53.3 |
| Math | AIME 2024 (Pass@1) | 16.0 | 9.3 | 39.2 | 63.6 | 79.2 | **79.8** |
| | MATH-500 (Pass@1) | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 | **97.3** |
| | CNMO 2024 (Pass@1) | 13.1 | 10.8 | 43.2 | 67.6 | - | **78.8** |
| Chinese | CLUEWSC (EM) | 85.4 | 87.9 | 90.9 | 89.9 | - | **92.8** |
| | C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | **91.8** |
| | C-SimpleQA (Correct) | 55.4 | 58.7 | **68.0** | 40.3 | - | 63.7 |

</div>

### Distilled Model Evaluation

<div align="center">

| Model | AIME 2024 pass@1 | AIME 2024 cons@64 | MATH-500 pass@1 | GPQA Diamond pass@1 | LiveCodeBench pass@1 | CodeForces rating |
|------------------------------------------|------------------|-------------------|-----------------|----------------------|----------------------|-------------------|
| GPT-4o-0513 | 9.3 | 13.4 | 74.6 | 49.9 | 32.9 | 759 |
| Claude-3.5-Sonnet-1022 | 16.0 | 26.7 | 78.3 | 65.0 | 38.9 | 717 |
| o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | **1820** |
| QwQ-32B-Preview | 44.0 | 60.0 | 90.6 | 54.5 | 41.9 | 1316 |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.9 | 52.7 | 83.9 | 33.8 | 16.9 | 954 |
| DeepSeek-R1-Distill-Qwen-7B | 55.5 | 83.3 | 92.8 | 49.1 | 37.6 | 1189 |
| DeepSeek-R1-Distill-Qwen-14B | 69.7 | 80.0 | 93.9 | 59.1 | 53.1 | 1481 |
| DeepSeek-R1-Distill-Qwen-32B | **72.6** | 83.3 | 94.3 | 62.1 | 57.2 | 1691 |
| DeepSeek-R1-Distill-Llama-8B | 50.4 | 80.0 | 89.1 | 49.0 | 39.6 | 1205 |
| DeepSeek-R1-Distill-Llama-70B | 70.0 | **86.7** | **94.5** | **65.2** | **57.5** | 1633 |

</div>

## 4. Chat Website & API Platform

You can chat with DeepSeek-R1 on DeepSeek's official website, [chat.deepseek.com](https://chat.deepseek.com), by switching on the "DeepThink" button.

We also provide an OpenAI-compatible API at the DeepSeek Platform: [platform.deepseek.com](https://platform.deepseek.com/)

## 5. How to Run Locally

### DeepSeek-R1 Models

Please visit the [DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3) repo for more information about running DeepSeek-R1 locally.

**NOTE: Hugging Face's Transformers is not directly supported yet.**

### DeepSeek-R1-Distill Models

DeepSeek-R1-Distill models can be used in the same manner as Qwen or Llama models.

For instance, you can easily start a service using [vLLM](https://github.com/vllm-project/vllm):

```shell
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768 --enforce-eager
```

You can also easily start a service using [SGLang](https://github.com/sgl-project/sglang):

```bash
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --trust-remote-code --tp 2
```

### Usage Recommendations

**We recommend adhering to the following configurations when using the DeepSeek-R1 series models, including for benchmarking, to achieve the expected performance:**

1. Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.
2. **Avoid adding a system prompt; all instructions should be contained within the user prompt.**
3. For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."
4. When evaluating model performance, it is recommended to conduct multiple tests and average the results.

Additionally, we have observed that the DeepSeek-R1 series models tend to bypass the thinking pattern (i.e., outputting "\<think\>\n\n\</think\>") when responding to certain queries, which can adversely affect the model's performance. **To ensure that the model engages in thorough reasoning, we recommend forcing the model to begin every response with "\<think\>\n".**
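The usage recommendations above can be sketched as a small request builder. This is a hedged illustration rather than an official client: the helper names `build_r1_request` and `force_think_prefix` are hypothetical, the payload keys assume an OpenAI-compatible chat format, and the `top_p` value is borrowed from the evaluation settings rather than from the recommendations themselves.

```python
# Directive suggested for mathematical problems in the recommendations.
BOXED_DIRECTIVE = "Please reason step by step, and put your final answer within \\boxed{}."


def build_r1_request(question, is_math=False, temperature=0.6):
    """Build a chat payload: user-only messages, temperature within 0.5-0.7."""
    if not 0.5 <= temperature <= 0.7:
        raise ValueError("recommended temperature range is 0.5-0.7")
    # No system prompt: all instructions live in the single user message.
    content = f"{question}\n{BOXED_DIRECTIVE}" if is_math else question
    return {
        "messages": [{"role": "user", "content": content}],
        "temperature": temperature,
        "top_p": 0.95,  # assumption: taken from the evaluation setup
    }


def force_think_prefix(rendered_prompt):
    """Append "<think>\\n" to an already-rendered prompt so generation starts inside the thinking block."""
    if rendered_prompt.endswith("<think>\n"):
        return rendered_prompt
    return rendered_prompt + "<think>\n"


request = build_r1_request("What is 17 * 24?", is_math=True)
```

How the `<think>\n` prefix is actually applied depends on the serving stack; some chat templates may already add it, which is why the helper checks before appending.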

DeepSeek

164K

Free

Me

Meta: Llama 3.3 70B Instruct (free)

ID:meta-llama/llama-3.3-70b-instruct:free

The Meta Llama 3.3 multilingual large language model (LLM) is a pretrained and instruction tuned generative model in 70B (text in/text out). The Llama 3.3 instruction tuned text only model is optimized for multilingual dialogue use cases and outperforms many of the available open source and closed chat models on common industry benchmarks. Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. [Model Card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md)

Meta Llama

131K

Free

NV

NVIDIA: Llama 3.1 Nemotron 70B Instruct (free)

ID:nvidia/llama-3.1-nemotron-70b-instruct:free

NVIDIA's Llama 3.1 Nemotron 70B is a language model designed for generating precise and useful responses. Leveraging [Llama 3.1 70B](/models/meta-llama/llama-3.1-70b-instruct) architecture and Reinforcement Learning from Human Feedback (RLHF), it excels in automatic alignment benchmarks. This model is tailored for applications requiring high accuracy in helpfulness and response generation, suitable for diverse user queries across multiple domains. Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).

NVIDIA

131K

Free

Qw

Qwen: Qwen2.5 VL 72B Instruct (free)

ID:qwen/qwen2.5-vl-72b-instruct:free

Qwen2.5-VL is proficient in recognizing common objects such as flowers, birds, fish, and insects. It is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.

Qwen

131K

Free

Go

Google: Gemini 2.0 Flash Thinking Experimental (free)

ID:google/gemini-2.0-flash-thinking-exp-1219:free

Gemini 2.0 Flash Thinking Mode is an experimental model that's trained to generate the "thinking process" the model goes through as part of its response. As a result, Thinking Mode is capable of stronger reasoning capabilities in its responses than the [base Gemini 2.0 Flash model](/google/gemini-2.0-flash-exp).

Google

40K

Free

So

Rogue Rose 103B v0.2 (free)

ID:sophosympatheia/rogue-rose-103b-v0.2:free

Rogue Rose demonstrates strong capabilities in roleplaying and storytelling applications, potentially surpassing other models in the 103-120B parameter range. While it occasionally exhibits inconsistencies with scene logic, the overall interaction quality represents an advancement in natural language processing for creative applications. It is a 120-layer frankenmerge model combining two custom 70B architectures from November 2023, derived from the [xwin-stellarbright-erp-70b-v2](https://huggingface.co/sophosympatheia/xwin-stellarbright-erp-70b-v2) base.

Sophosympatheia

4K

Free

Go

Google: Gemini Pro 2.0 Experimental (free)

ID:google/gemini-2.0-pro-exp-02-05:free

Gemini 2.0 Pro Experimental is a bleeding-edge version of the Gemini 2.0 Pro model. Because it's currently experimental, it will be **heavily rate-limited** by Google. Usage of Gemini is subject to Google's [Gemini Terms of Use](https://ai.google.dev/terms). #multimodal

Google

2M

Free

Go

Google: Gemini Flash Lite 2.0 Preview (free)

ID:google/gemini-2.0-flash-lite-preview-02-05:free

Gemini Flash Lite 2.0 offers a significantly faster time to first token (TTFT) compared to [Gemini Flash 1.5](google/gemini-flash-1.5), while maintaining quality on par with larger models like [Gemini Pro 1.5](google/gemini-pro-1.5). Because it's currently in preview, it will be **heavily rate-limited** by Google. This model will move from free to paid pending a general rollout on February 24th, at $0.075 / $0.30 per million input / output tokens respectively.
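As a rough worked example of the per-million-token pricing quoted above, the sketch below estimates the cost of a single request. The function name `estimate_cost_usd` and the token counts are illustrative assumptions; actual billing may differ.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_m=0.075, output_price_per_m=0.30):
    """Estimate request cost from token counts and per-million-token prices."""
    # Prices are quoted per one million tokens, so scale the counts down.
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Hypothetical request: 200,000 input tokens and 10,000 output tokens.
cost = estimate_cost_usd(200_000, 10_000)
```

With these assumed counts the input side costs about $0.015 and the output side about $0.003, so long prompts dominate the bill even though output tokens are priced four times higher.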

Google

1M

Free

De

DeepSeek: R1 Distill Llama 70B (free)

ID:deepseek/deepseek-r1-distill-llama-70b:free

DeepSeek R1 Distill Llama 70B is a distilled large language model based on [Llama-3.3-70B-Instruct](/meta-llama/llama-3.3-70b-instruct), using outputs from [DeepSeek R1](/deepseek/deepseek-r1). The model combines advanced distillation techniques to achieve high performance across multiple benchmarks, including:

- AIME 2024 pass@1: 70.0
- MATH-500 pass@1: 94.5
- CodeForces Rating: 1633

The model leverages fine-tuning from DeepSeek R1's outputs, enabling competitive performance comparable to larger frontier models.

DeepSeek

131K

Free

Qw

Qwen: Qwen VL Plus (free)

ID:qwen/qwen-vl-plus:free

Qwen's Enhanced Large Visual Language Model. Significantly upgraded for detailed recognition capabilities and text recognition abilities, supporting ultra-high pixel resolutions up to millions of pixels and extreme aspect ratios for image input. It delivers significant performance across a broad range of visual tasks.

Qwen

8K

De

DeepSeek: R1 Distill Qwen 1.5B

ID:deepseek/deepseek-r1-distill-qwen-1.5b

DeepSeek R1 Distill Qwen 1.5B is a distilled large language model based on [Qwen 2.5 Math 1.5B](https://huggingface.co/Qwen/Qwen2.5-Math-1.5B), using outputs from [DeepSeek R1](/deepseek/deepseek-r1). It's a very small and efficient model which outperforms [GPT 4o 0513](/openai/gpt-4o-2024-05-13) on Math Benchmarks. Other benchmark results include:

- AIME 2024 pass@1: 28.9
- AIME 2024 cons@64: 52.7
- MATH-500 pass@1: 83.9

The model leverages fine-tuning from DeepSeek R1's outputs, enabling competitive performance comparable to larger frontier models.

DeepSeek

131K

$0.18/M

De

DeepSeek: R1 Distill Llama 8B

ID:deepseek/deepseek-r1-distill-llama-8b

DeepSeek R1 Distill Llama 8B is a distilled large language model based on [Llama-3.1-8B-Instruct](/meta-llama/llama-3.1-8b-instruct), using outputs from [DeepSeek R1](/deepseek/deepseek-r1). The model combines advanced distillation techniques to achieve high performance across multiple benchmarks, including:

- AIME 2024 pass@1: 50.4
- MATH-500 pass@1: 89.1
- CodeForces Rating: 1205

The model leverages fine-tuning from DeepSeek R1's outputs, enabling competitive performance comparable to larger frontier models.

Hugging Face:

- [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B)
- [DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B)

DeepSeek

32K

$0.04/M

De

DeepSeek: R1 Distill Qwen 14B

ID:deepseek/deepseek-r1-distill-qwen-14b

DeepSeek R1 Distill Qwen 14B is a distilled large language model based on [Qwen 2.5 14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B), using outputs from [DeepSeek R1](/deepseek/deepseek-r1). It outperforms OpenAI's o1-mini across various benchmarks, achieving new state-of-the-art results for dense models. Other benchmark results include:

- AIME 2024 pass@1: 69.7
- MATH-500 pass@1: 93.9
- CodeForces Rating: 1481

The model leverages fine-tuning from DeepSeek R1's outputs, enabling competitive performance comparable to larger frontier models.

DeepSeek

64K

$0.15/M

De

DeepSeek: R1 Distill Qwen 32B

ID:deepseek/deepseek-r1-distill-qwen-32b

DeepSeek R1 Distill Qwen 32B is a distilled large language model based on [Qwen 2.5 32B](https://huggingface.co/Qwen/Qwen2.5-32B), using outputs from [DeepSeek R1](/deepseek/deepseek-r1). It outperforms OpenAI's o1-mini across various benchmarks, achieving new state-of-the-art results for dense models. Other benchmark results include:

- AIME 2024 pass@1: 72.6
- MATH-500 pass@1: 94.3
- CodeForces Rating: 1691

The model leverages fine-tuning from DeepSeek R1's outputs, enabling competitive performance comparable to larger frontier models.

DeepSeek

131K

$0.12/M

$0.18/M

De

DeepSeek: R1 (nitro)

ID:deepseek/deepseek-r1:nitro

# DeepSeek-R1

## 1. Introduction

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning. With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks. To support the research community, we have open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen. DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks, achieving new state-of-the-art results for dense models.

**NOTE: Before running DeepSeek-R1 series models locally, we kindly recommend reviewing the [Usage Recommendation](#usage-recommendations) section.**

<p align="center"> <img width="80%" src="https://github.com/deepseek-ai/DeepSeek-R1/raw/main/figures/benchmark.jpg"> </p>

## 2. Model Summary

---

**Post-Training: Large-Scale Reinforcement Learning on the Base Model**

- We directly apply reinforcement learning (RL) to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area.
- We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.

---

**Distillation: Smaller Models Can Be Powerful Too**

- We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open source DeepSeek-R1, as well as its API, will benefit the research community to distill better smaller models in the future.
- Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.

## 3. Evaluation Results

### DeepSeek-R1-Evaluation

For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1.
<div align="center">

| Category | Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o 0513 | DeepSeek V3 | OpenAI o1-mini | OpenAI o1-1217 | DeepSeek R1 |
|----------|-------------------|----------------------|------------|--------------|----------------|------------|--------------|
| | Architecture | - | - | MoE | - | - | MoE |
| | # Activated Params | - | - | 37B | - | - | 37B |
| | # Total Params | - | - | 671B | - | - | 671B |
| English | MMLU (Pass@1) | 88.3 | 87.2 | 88.5 | 85.2 | **91.8** | 90.8 |
| | MMLU-Redux (EM) | 88.9 | 88.0 | 89.1 | 86.7 | - | **92.9** |
| | MMLU-Pro (EM) | 78.0 | 72.6 | 75.9 | 80.3 | - | **84.0** |
| | DROP (3-shot F1) | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 | **92.2** |
| | IF-Eval (Prompt Strict) | **86.5** | 84.3 | 86.1 | 84.8 | - | 83.3 |
| | GPQA-Diamond (Pass@1) | 65.0 | 49.9 | 59.1 | 60.0 | **75.7** | 71.5 |
| | SimpleQA (Correct) | 28.4 | 38.2 | 24.9 | 7.0 | **47.0** | 30.1 |
| | FRAMES (Acc.) | 72.5 | 80.5 | 73.3 | 76.9 | - | **82.5** |
| | AlpacaEval2.0 (LC-winrate) | 52.0 | 51.1 | 70.0 | 57.8 | - | **87.6** |
| | ArenaHard (GPT-4-1106) | 85.2 | 80.4 | 85.5 | 92.0 | - | **92.3** |
| Code | LiveCodeBench (Pass@1-COT) | 33.8 | 34.2 | - | 53.8 | 63.4 | **65.9** |
| | Codeforces (Percentile) | 20.3 | 23.6 | 58.7 | 93.4 | **96.6** | 96.3 |
| | Codeforces (Rating) | 717 | 759 | 1134 | 1820 | **2061** | 2029 |
| | SWE Verified (Resolved) | **50.8** | 38.8 | 42.0 | 41.6 | 48.9 | 49.2 |
| | Aider-Polyglot (Acc.) | 45.3 | 16.0 | 49.6 | 32.9 | **61.7** | 53.3 |
| Math | AIME 2024 (Pass@1) | 16.0 | 9.3 | 39.2 | 63.6 | 79.2 | **79.8** |
| | MATH-500 (Pass@1) | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 | **97.3** |
| | CNMO 2024 (Pass@1) | 13.1 | 10.8 | 43.2 | 67.6 | - | **78.8** |
| Chinese | CLUEWSC (EM) | 85.4 | 87.9 | 90.9 | 89.9 | - | **92.8** |
| | C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | **91.8** |
| | C-SimpleQA (Correct) | 55.4 | 58.7 | **68.0** | 40.3 | - | 63.7 |

</div>

### Distilled Model Evaluation

<div align="center">

| Model | AIME 2024 pass@1 | AIME 2024 cons@64 | MATH-500 pass@1 | GPQA Diamond pass@1 | LiveCodeBench pass@1 | CodeForces rating |
|------------------------------------------|------------------|-------------------|-----------------|----------------------|----------------------|-------------------|
| GPT-4o-0513 | 9.3 | 13.4 | 74.6 | 49.9 | 32.9 | 759 |
| Claude-3.5-Sonnet-1022 | 16.0 | 26.7 | 78.3 | 65.0 | 38.9 | 717 |
| o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | **1820** |
| QwQ-32B-Preview | 44.0 | 60.0 | 90.6 | 54.5 | 41.9 | 1316 |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.9 | 52.7 | 83.9 | 33.8 | 16.9 | 954 |
| DeepSeek-R1-Distill-Qwen-7B | 55.5 | 83.3 | 92.8 | 49.1 | 37.6 | 1189 |
| DeepSeek-R1-Distill-Qwen-14B | 69.7 | 80.0 | 93.9 | 59.1 | 53.1 | 1481 |
| DeepSeek-R1-Distill-Qwen-32B | **72.6** | 83.3 | 94.3 | 62.1 | 57.2 | 1691 |
| DeepSeek-R1-Distill-Llama-8B | 50.4 | 80.0 | 89.1 | 49.0 | 39.6 | 1205 |
| DeepSeek-R1-Distill-Llama-70B | 70.0 | **86.7** | **94.5** | **65.2** | **57.5** | 1633 |

</div>

## 4. Chat Website & API Platform

You can chat with DeepSeek-R1 on DeepSeek's official website, [chat.deepseek.com](https://chat.deepseek.com), by switching on the "DeepThink" button.

We also provide an OpenAI-compatible API at the DeepSeek Platform: [platform.deepseek.com](https://platform.deepseek.com/)

## 5. How to Run Locally

### DeepSeek-R1 Models

Please visit the [DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3) repo for more information about running DeepSeek-R1 locally.

**NOTE: Hugging Face's Transformers is not directly supported yet.**

### DeepSeek-R1-Distill Models

DeepSeek-R1-Distill models can be used in the same manner as Qwen or Llama models.

For instance, you can easily start a service using [vLLM](https://github.com/vllm-project/vllm):

```shell
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768 --enforce-eager
```

You can also easily start a service using [SGLang](https://github.com/sgl-project/sglang):

```bash
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --trust-remote-code --tp 2
```

### Usage Recommendations

**We recommend adhering to the following configurations when using the DeepSeek-R1 series models, including for benchmarking, to achieve the expected performance:**

1. Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.
2. **Avoid adding a system prompt; all instructions should be contained within the user prompt.**
3. For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."
4. When evaluating model performance, it is recommended to conduct multiple tests and average the results.

Additionally, we have observed that the DeepSeek-R1 series models tend to bypass the thinking pattern (i.e., outputting "\<think\>\n\n\</think\>") when responding to certain queries, which can adversely affect the model's performance. **To ensure that the model engages in thorough reasoning, we recommend forcing the model to begin every response with "\<think\>\n".**

DeepSeek

164K

$3/M

$8/M

Ri

MiniMax: MiniMax-01

ID:minimax/minimax-01

MiniMax-01 combines MiniMax-Text-01 for text generation and MiniMax-VL-01 for image understanding. It has 456 billion parameters, with 45.9 billion parameters activated per inference, and can handle a context of up to 4 million tokens. The text model adopts a hybrid architecture that combines Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE). The image model adopts the “ViT-MLP-LLM” framework and is trained on top of the text model. To read more about the release, see: https://www.minimaxi.com/en/news/minimax-01-series-2

Rifx.Online

1M

$0.2/M

$1.1/M

Mi

Microsoft: Phi 4

ID:microsoft/phi-4

[Microsoft Research](/microsoft) Phi-4 is designed to perform well in complex reasoning tasks and can operate efficiently in situations with limited memory or where quick responses are needed. At 14 billion parameters, it was trained on a mix of high-quality synthetic datasets, data from curated websites, and academic materials. It has undergone careful improvement to follow instructions accurately and maintain strong safety standards. It works best with English language inputs. For more information, please see [Phi-4 Technical Report](https://arxiv.org/pdf/2412.08905)

Microsoft Azure

16K

$0.07/M

$0.14/M

30%

Op

OpenAI: o1-preview

ID:rifx/o1-preview

The latest and strongest model family from OpenAI, o1 is designed to spend more time thinking before responding. The o1 models are optimized for math, science, programming, and other STEM-related tasks. They consistently exhibit PhD-level accuracy on benchmarks in physics, chemistry, and biology. Learn more in the [launch announcement](https://openai.com/o1). Note: This model is currently experimental and not suitable for production use-cases, and may be heavily rate-limited.

OpenAI

128K

$15/M

$60/M

40%

Op

OpenAI: o1-mini

ID:rifx/o1-mini

The latest and strongest model family from OpenAI, o1 is designed to spend more time thinking before responding. The o1 models are optimized for math, science, programming, and other STEM-related tasks. They consistently exhibit PhD-level accuracy on benchmarks in physics, chemistry, and biology. Learn more in the [launch announcement](https://openai.com/o1). Note: This model is currently experimental and not suitable for production use-cases, and may be heavily rate-limited.

OpenAI

128K

$3/M

$12/M

De

DeepSeek V3

ID:deepseek/deepseek-chat-v3

## 1. Introduction We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. <p align="center"> <img width="80%" src="https://wsrv.nl/?url=https://huggingface.co/deepseek-ai/DeepSeek-V3/resolve/main/figures/benchmark.png"> </p> ## 2. Model Summary --- **Architecture: Innovative Load Balancing Strategy and Training Objective** - On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing. - We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model performance. It can also be used for speculative decoding for inference acceleration. --- **Pre-Training: Towards Ultimate Training Efficiency** - We design an FP8 mixed precision training framework and, for the first time, validate the feasibility and effectiveness of FP8 training on an extremely large-scale model. 
- Through co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, nearly achieving full computation-communication overlap. This significantly enhances our training efficiency and reduces the training costs, enabling us to further scale up the model size without additional overhead. - At an economical cost of only 2.664M H800 GPU hours, we complete the pre-training of DeepSeek-V3 on 14.8T tokens, producing the currently strongest open-source base model. The subsequent training stages after pre-training require only 0.1M GPU hours. --- **Post-Training: Knowledge Distillation from DeepSeek-R1** - We introduce an innovative methodology to distill reasoning capabilities from the long-Chain-of-Thought (CoT) model, specifically from one of the DeepSeek R1 series models, into standard LLMs, particularly DeepSeek-V3. Our pipeline elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its reasoning performance. Meanwhile, we also maintain a control over the output style and length of DeepSeek-V3. --- ## 3. Model Downloads <div align="center"> | **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download** | | :------------: | :------------: | :------------: | :------------: | :------------: | | DeepSeek-V3-Base | 671B | 37B | 128K | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base) | | DeepSeek-V3 | 671B | 37B | 128K | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-V3) | </div> **NOTE: The total size of DeepSeek-V3 models on HuggingFace is 685B, which includes 671B of the Main Model weights and 14B of the Multi-Token Prediction (MTP) Module weights.** To ensure optimal performance and flexibility, we have partnered with open-source communities and hardware vendors to provide multiple ways to run the model locally. 
For step-by-step guidance, check out Section 6: [How to Run Locally](#6-how-to-run-locally).

For developers looking to dive deeper, we recommend exploring [README_WEIGHTS.md](./README_WEIGHTS.md) for details on the Main Model weights and the Multi-Token Prediction (MTP) Modules. Please note that MTP support is currently under active development within the community, and we welcome your contributions and feedback.

## 4. Evaluation Results

### Base Model

#### Standard Benchmarks

<div align="center">

| | Benchmark (Metric) | # Shots | DeepSeek-V2 | Qwen2.5 72B | LLaMA3.1 405B | DeepSeek-V3 |
|---|-------------------|----------|--------|-------------|---------------|---------|
| | Architecture | - | MoE | Dense | Dense | MoE |
| | # Activated Params | - | 21B | 72B | 405B | 37B |
| | # Total Params | - | 236B | 72B | 405B | 671B |
| English | Pile-test (BPB) | - | 0.606 | 0.638 | **0.542** | 0.548 |
| | BBH (EM) | 3-shot | 78.8 | 79.8 | 82.9 | **87.5** |
| | MMLU (Acc.) | 5-shot | 78.4 | 85.0 | 84.4 | **87.1** |
| | MMLU-Redux (Acc.) | 5-shot | 75.6 | 83.2 | 81.3 | **86.2** |
| | MMLU-Pro (Acc.) | 5-shot | 51.4 | 58.3 | 52.8 | **64.4** |
| | DROP (F1) | 3-shot | 80.4 | 80.6 | 86.0 | **89.0** |
| | ARC-Easy (Acc.) | 25-shot | 97.6 | 98.4 | 98.4 | **98.9** |
| | ARC-Challenge (Acc.) | 25-shot | 92.2 | 94.5 | **95.3** | **95.3** |
| | HellaSwag (Acc.) | 10-shot | 87.1 | 84.8 | **89.2** | 88.9 |
| | PIQA (Acc.) | 0-shot | 83.9 | 82.6 | **85.9** | 84.7 |
| | WinoGrande (Acc.) | 5-shot | **86.3** | 82.3 | 85.2 | 84.9 |
| | RACE-Middle (Acc.) | 5-shot | 73.1 | 68.1 | **74.2** | 67.1 |
| | RACE-High (Acc.) | 5-shot | 52.6 | 50.3 | **56.8** | 51.3 |
| | TriviaQA (EM) | 5-shot | 80.0 | 71.9 | **82.7** | **82.9** |
| | NaturalQuestions (EM) | 5-shot | 38.6 | 33.2 | **41.5** | 40.0 |
| | AGIEval (Acc.) | 0-shot | 57.5 | 75.8 | 60.6 | **79.6** |
| Code | HumanEval (Pass@1) | 0-shot | 43.3 | 53.0 | 54.9 | **65.2** |
| | MBPP (Pass@1) | 3-shot | 65.0 | 72.6 | 68.4 | **75.4** |
| | LiveCodeBench-Base (Pass@1) | 3-shot | 11.6 | 12.9 | 15.5 | **19.4** |
| | CRUXEval-I (Acc.) | 2-shot | 52.5 | 59.1 | 58.5 | **67.3** |
| | CRUXEval-O (Acc.) | 2-shot | 49.8 | 59.9 | 59.9 | **69.8** |
| Math | GSM8K (EM) | 8-shot | 81.6 | 88.3 | 83.5 | **89.3** |
| | MATH (EM) | 4-shot | 43.4 | 54.4 | 49.0 | **61.6** |
| | MGSM (EM) | 8-shot | 63.6 | 76.2 | 69.9 | **79.8** |
| | CMath (EM) | 3-shot | 78.7 | 84.5 | 77.3 | **90.7** |
| Chinese | CLUEWSC (EM) | 5-shot | 82.0 | 82.5 | **83.0** | 82.7 |
| | C-Eval (Acc.) | 5-shot | 81.4 | 89.2 | 72.5 | **90.1** |
| | CMMLU (Acc.) | 5-shot | 84.0 | **89.5** | 73.7 | 88.8 |
| | CMRC (EM) | 1-shot | **77.4** | 75.8 | 76.0 | 76.3 |
| | C3 (Acc.) | 0-shot | 77.4 | 76.7 | **79.7** | 78.6 |
| | CCPM (Acc.) | 0-shot | **93.0** | 88.5 | 78.6 | 92.0 |
| Multilingual | MMMLU-non-English (Acc.) | 5-shot | 64.0 | 74.8 | 73.8 | **79.4** |

</div>

Note: Best results are shown in bold. Scores with a gap not exceeding 0.3 are considered to be at the same level. DeepSeek-V3 achieves the best performance on most benchmarks, especially on math and code tasks. For more evaluation details, please check our paper.

#### Context Window

<p align="center">
  <img width="80%" src="https://wsrv.nl/?url=https://huggingface.co/deepseek-ai/DeepSeek-V3/resolve/main/figures/niah.png">
</p>

Evaluation results on the ``Needle In A Haystack`` (NIAH) tests. DeepSeek-V3 performs well across all context window lengths up to **128K**.
### Chat Model

#### Standard Benchmarks (Models larger than 67B)

<div align="center">

| | **Benchmark (Metric)** | **DeepSeek V2-0506** | **DeepSeek V2.5-0905** | **Qwen2.5 72B-Inst.** | **Llama3.1 405B-Inst.** | **Claude-3.5-Sonnet-1022** | **GPT-4o 0513** | **DeepSeek V3** |
|---|---------------------|---------------------|----------------------|---------------------|----------------------|---------------------------|----------------|----------------|
| | Architecture | MoE | MoE | Dense | Dense | - | - | MoE |
| | # Activated Params | 21B | 21B | 72B | 405B | - | - | 37B |
| | # Total Params | 236B | 236B | 72B | 405B | - | - | 671B |
| English | MMLU (EM) | 78.2 | 80.6 | 85.3 | **88.6** | **88.3** | 87.2 | **88.5** |
| | MMLU-Redux (EM) | 77.9 | 80.3 | 85.6 | 86.2 | **88.9** | 88.0 | **89.1** |
| | MMLU-Pro (EM) | 58.5 | 66.2 | 71.6 | 73.3 | **78.0** | 72.6 | 75.9 |
| | DROP (3-shot F1) | 83.0 | 87.8 | 76.7 | 88.7 | 88.3 | 83.7 | **91.6** |
| | IF-Eval (Prompt Strict) | 57.7 | 80.6 | 84.1 | 86.0 | **86.5** | 84.3 | 86.1 |
| | GPQA-Diamond (Pass@1) | 35.3 | 41.3 | 49.0 | 51.1 | **65.0** | 49.9 | 59.1 |
| | SimpleQA (Correct) | 9.0 | 10.2 | 9.1 | 17.1 | 28.4 | **38.2** | 24.9 |
| | FRAMES (Acc.) | 66.9 | 65.4 | 69.8 | 70.0 | 72.5 | **80.5** | 73.3 |
| | LongBench v2 (Acc.) | 31.6 | 35.4 | 39.4 | 36.1 | 41.0 | 48.1 | **48.7** |
| Code | HumanEval-Mul (Pass@1) | 69.3 | 77.4 | 77.3 | 77.2 | 81.7 | 80.5 | **82.6** |
| | LiveCodeBench (Pass@1-COT) | 18.8 | 29.2 | 31.1 | 28.4 | 36.3 | 33.4 | **40.5** |
| | LiveCodeBench (Pass@1) | 20.3 | 28.4 | 28.7 | 30.1 | 32.8 | 34.2 | **37.6** |
| | Codeforces (Percentile) | 17.5 | 35.6 | 24.8 | 25.3 | 20.3 | 23.6 | **51.6** |
| | SWE Verified (Resolved) | - | 22.6 | 23.8 | 24.5 | **50.8** | 38.8 | 42.0 |
| | Aider-Edit (Acc.) | 60.3 | 71.6 | 65.4 | 63.9 | **84.2** | 72.9 | 79.7 |
| | Aider-Polyglot (Acc.) | - | 18.2 | 7.6 | 5.8 | 45.3 | 16.0 | **49.6** |
| Math | AIME 2024 (Pass@1) | 4.6 | 16.7 | 23.3 | 23.3 | 16.0 | 9.3 | **39.2** |
| | MATH-500 (EM) | 56.3 | 74.7 | 80.0 | 73.8 | 78.3 | 74.6 | **90.2** |
| | CNMO 2024 (Pass@1) | 2.8 | 10.8 | 15.9 | 6.8 | 13.1 | 10.8 | **43.2** |
| Chinese | CLUEWSC (EM) | 89.9 | 90.4 | **91.4** | 84.7 | 85.4 | 87.9 | 90.9 |
| | C-Eval (EM) | 78.6 | 79.5 | 86.1 | 61.5 | 76.7 | 76.0 | **86.5** |
| | C-SimpleQA (Correct) | 48.5 | 54.1 | 48.4 | 50.4 | 51.3 | 59.3 | **64.8** |

Note: All models are evaluated in a configuration that limits the output length to 8K. Benchmarks containing fewer than 1000 samples are tested multiple times using varying temperature settings to derive robust final results. DeepSeek-V3 stands as the best-performing open-source model, and also exhibits competitive performance against frontier closed-source models.

</div>

#### Open Ended Generation Evaluation

<div align="center">

| Model | Arena-Hard | AlpacaEval 2.0 |
|-------|------------|----------------|
| DeepSeek-V2.5-0905 | 76.2 | 50.5 |
| Qwen2.5-72B-Instruct | 81.2 | 49.1 |
| LLaMA-3.1 405B | 69.3 | 40.5 |
| GPT-4o-0513 | 80.4 | 51.1 |
| Claude-Sonnet-3.5-1022 | 85.2 | 52.0 |
| DeepSeek-V3 | **85.5** | **70.0** |

Note: English open-ended conversation evaluations. For AlpacaEval 2.0, we use the length-controlled win rate as the metric.

</div>

## 5. Chat Website & API Platform

You can chat with DeepSeek-V3 on DeepSeek's official website: [chat.deepseek.com](https://chat.deepseek.com/sign_in)

We also provide an OpenAI-compatible API at DeepSeek Platform: [platform.deepseek.com](https://platform.deepseek.com/)

## 6. How to Run Locally

DeepSeek-V3 can be deployed locally using the following hardware and open-source community software:

1. **DeepSeek-Infer Demo**: We provide a simple and lightweight demo for FP8 and BF16 inference.
2. **SGLang**: Fully supports the DeepSeek-V3 model in both BF16 and FP8 inference modes.
3. **LMDeploy**: Enables efficient FP8 and BF16 inference for local and cloud deployment.
4. **TensorRT-LLM**: Currently supports BF16 inference and INT4/8 quantization, with FP8 support coming soon.
5. **vLLM**: Supports the DeepSeek-V3 model with FP8 and BF16 modes for tensor parallelism and pipeline parallelism.
6. **AMD GPU**: Enables running the DeepSeek-V3 model on AMD GPUs via SGLang in both BF16 and FP8 modes.
7. **Huawei Ascend NPU**: Supports running DeepSeek-V3 on Huawei Ascend devices.

Since FP8 training is natively adopted in our framework, we only provide FP8 weights. If you require BF16 weights for experimentation, you can use the provided conversion script to perform the transformation.

Here is an example of converting FP8 weights to BF16:

```shell
cd inference
python fp8_cast_bf16.py --input-fp8-hf-path /path/to/fp8_weights --output-bf16-hf-path /path/to/bf16_weights
```

**NOTE: Hugging Face Transformers is not directly supported yet.**

### 6.1 Inference with DeepSeek-Infer Demo (example only)

#### Model Weights & Demo Code Preparation

First, clone our DeepSeek-V3 GitHub repository:

```shell
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
```

Navigate to the `inference` folder and install the dependencies listed in `requirements.txt`.

```shell
cd DeepSeek-V3/inference
pip install -r requirements.txt
```

Download the model weights from HuggingFace, and put them into the `/path/to/DeepSeek-V3` folder.
#### Model Weights Conversion

Convert HuggingFace model weights to a specific format:

```shell
python convert.py --hf-ckpt-path /path/to/DeepSeek-V3 --save-path /path/to/DeepSeek-V3-Demo --n-experts 256 --model-parallel 16
```

#### Run

Then you can chat with DeepSeek-V3:

```shell
torchrun --nnodes 2 --nproc-per-node 8 generate.py --node-rank $RANK --master-addr $ADDR --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --interactive --temperature 0.7 --max-new-tokens 200
```

Or run batch inference on a given file:

```shell
torchrun --nnodes 2 --nproc-per-node 8 generate.py --node-rank $RANK --master-addr $ADDR --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --input-file $FILE
```

### 6.2 Inference with SGLang (recommended)

[SGLang](https://github.com/sgl-project/sglang) currently supports MLA optimizations, FP8 (W8A8), FP8 KV Cache, and Torch Compile, delivering state-of-the-art latency and throughput performance among open-source frameworks.

Notably, [SGLang v0.4.1](https://github.com/sgl-project/sglang/releases/tag/v0.4.1) fully supports running DeepSeek-V3 on both **NVIDIA and AMD GPUs**, making it a highly versatile and robust solution.

Here are the launch instructions from the SGLang team: https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3

### 6.3 Inference with LMDeploy (recommended)

[LMDeploy](https://github.com/InternLM/lmdeploy), a flexible and high-performance inference and serving framework tailored for large language models, now supports DeepSeek-V3. It offers both offline pipeline processing and online deployment capabilities, seamlessly integrating with PyTorch-based workflows.
For comprehensive step-by-step instructions on running DeepSeek-V3 with LMDeploy, please refer to: https://github.com/InternLM/lmdeploy/issues/2960

### 6.4 Inference with TRT-LLM (recommended)

[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) now supports the DeepSeek-V3 model, offering precision options such as BF16 and INT4/INT8 weight-only. Support for FP8 is currently in progress and will be released soon. You can access the custom TensorRT-LLM branch for DeepSeek-V3 support through the following link to try the new features directly: https://github.com/NVIDIA/TensorRT-LLM/tree/deepseek/examples/deepseek_v3.

### 6.5 Inference with vLLM (recommended)

[vLLM](https://github.com/vllm-project/vllm) v0.6.6 supports DeepSeek-V3 inference for FP8 and BF16 modes on both NVIDIA and AMD GPUs. Aside from standard techniques, vLLM offers _pipeline parallelism_, allowing you to run this model on multiple machines connected by networks. For detailed guidance, please refer to the [vLLM instructions](https://docs.vllm.ai/en/latest/serving/distributed_serving.html). Please feel free to follow [the enhancement plan](https://github.com/vllm-project/vllm/issues/11539) as well.

### 6.6 Recommended Inference Functionality with AMD GPUs

In collaboration with the AMD team, we have achieved Day-One support for AMD GPUs using SGLang, with full compatibility for both FP8 and BF16 precision. For detailed guidance, please refer to the [SGLang instructions](#62-inference-with-sglang-recommended).

### 6.7 Recommended Inference Functionality with Huawei Ascend NPUs

The [MindIE](https://www.hiascend.com/en/software/mindie) framework from the Huawei Ascend community has successfully adapted the BF16 version of DeepSeek-V3. For step-by-step guidance on Ascend NPUs, please follow the [instructions here](https://modelers.cn/models/MindIE/deepseekv3).

## 7. License

This code repository is licensed under [the MIT License](LICENSE-CODE).
The use of DeepSeek-V3 Base/Chat models is subject to [the Model License](LICENSE-MODEL). DeepSeek-V3 series (including Base and Chat) supports commercial use.

## 8. Citation

```
```

## 9. Contact

If you have any questions, please raise an issue or contact us at [email protected].

DeepSeek

64K

$0.14/M

$0.28/M

Op

OpenAI: o1-mini

ID:openai/o1-mini

The latest and strongest model family from OpenAI, o1 is designed to spend more time thinking before responding. The o1 models are optimized for math, science, programming, and other STEM-related tasks. They consistently exhibit PhD-level accuracy on benchmarks in physics, chemistry, and biology. Learn more in the [launch announcement](https://openai.com/o1). Note: This model is currently experimental and not suitable for production use-cases, and may be heavily rate-limited.

OpenAI

128K

$3/M

$12/M

Op

OpenAI: o1

ID:openai/o1

The latest and strongest model family from OpenAI, o1 is designed to spend more time thinking before responding. The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. The o1 models are optimized for math, science, programming, and other STEM-related tasks. They consistently exhibit PhD-level accuracy on benchmarks in physics, chemistry, and biology. Learn more in the [launch announcement](https://openai.com/o1).

OpenAI

200K

$15/M

$60/M

Free

Go

Google: Gemini 2.0 Flash Thinking Experimental (free)

ID:google/gemini-2.0-flash-thinking-exp:free

Gemini 2.0 Flash Thinking Mode is an experimental model that's trained to generate the "thinking process" the model goes through as part of its response. As a result, Thinking Mode shows stronger reasoning in its responses than the [base Gemini 2.0 Flash model](/google/gemini-2.0-flash-exp).

Google

40K

50%

Ev

EVA Llama 3.33 70b

ID:eva-unit-01/eva-llama-3.33-70b

EVA Llama 3.33 70b is a roleplay and storywriting specialist model. It is a full-parameter finetune of [Llama-3.3-70B-Instruct](https://openrouter.ai/meta-llama/llama-3.3-70b-instruct) on a mixture of synthetic and natural data. It uses the Celeste 70B 0.1 data mixture, greatly expanding it to improve the versatility, creativity, and "flavor" of the resulting model. This model was built with Llama by Meta.

Eva-unit-01

16K

$4/M

$6/M

X-

xAI: Grok 2 Vision 1212

ID:x-ai/grok-2-vision-1212

Grok 2 Vision 1212 advances image-based AI with stronger visual comprehension, refined instruction-following, and multilingual support. From object recognition to style analysis, it empowers developers to build more intuitive, visually aware applications. Its enhanced steerability and reasoning establish a robust foundation for next-generation image solutions. To read more about this model, check out [xAI's announcement](https://x.ai/blog/grok-1212).

X-AI

33K

$2/M

$10/M

Ri

Sao10K: Llama 3.3 Euryale 70B

ID:sao10k/l3.3-euryale-70b

Euryale L3.3 70B is a model focused on creative roleplay from [Sao10k](https://ko-fi.com/sao10k). It is the successor of [Euryale L3 70B v2.2](/models/sao10k/l3-euryale-70b).

Rifx.Online

8K

$1.5/M

Op

text-embedding-3-large

ID:text-embedding-3-large

text-embedding-3-large is OpenAI's latest text embedding model released in 2024. Compared to its predecessors, it offers several significant improvements:

## Key Features

- **Enhanced Performance**: Outperforms the previous text-embedding-ada-002 model on most tasks
- **Better Multilingual Support**: Supports embeddings for 100+ languages
- **Longer Context**: Handles up to 8192 tokens of input
- **Dimension Selection**: Allows embedding dimensions between 256 and 3072

## Use Cases

1. Semantic Search
2. Text Classification
3. Recommendation Systems
4. Similarity Matching
5. RAG (Retrieval Augmented Generation) Applications

## Pricing

- Input: $0.00013 / 1K tokens
- Output: Free

## Usage Recommendations

- Recommended for production environments requiring high-quality embeddings
- Consider text-embedding-3-small for cost-sensitive applications
- Choose embedding dimensions based on your specific needs to balance performance and efficiency

text-embedding-3-large is one of the most powerful text embedding models available today, particularly suitable for applications requiring high-quality text representations.
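As a rough illustration of the pricing above, the following minimal Python sketch estimates the input cost of an embedding job. The only figure taken from the listing is the $0.00013 per 1K input tokens price; the function name `embedding_cost_usd` is hypothetical, and token counts are assumed to be known in advance (for example from a tokenizer).

```python
from decimal import Decimal

# Price quoted in the listing: $0.00013 per 1K input tokens (output is free).
PRICE_PER_1K_TOKENS = Decimal("0.00013")


def embedding_cost_usd(num_tokens: int) -> Decimal:
    """Estimate the input cost in USD for embedding `num_tokens` tokens.

    Hypothetical helper for illustration; it only scales the listed price.
    """
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    return PRICE_PER_1K_TOKENS * Decimal(num_tokens) / Decimal(1000)


if __name__ == "__main__":
    # One million tokens, and one maximum-length (8192-token) input.
    print(embedding_cost_usd(1_000_000))
    print(embedding_cost_usd(8192))
```

Using `Decimal` rather than floats keeps small per-token prices exact. The per-1K price scales to the same per-million figure shown on this card.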

OpenAI

$0.13/M

An

Magnum v4 72B

ID:anthracite-org/magnum-v4-72b

This is a series of models designed to replicate the prose quality of the Claude 3 models, specifically [Sonnet](https://openrouter.ai/anthropic/claude-3.5-sonnet) and [Opus](https://openrouter.ai/anthropic/claude-3-opus). The model is fine-tuned on top of [Qwen2.5 72B](https://openrouter.ai/qwen/qwen-2.5-72b-instruct).

Anthracite-org

33K

$1.875/M

$2.25/M

Ba

baichuan3-turbo

ID:baichuan/baichuan3-turbo

Baichuan3-Turbo is an advanced artificial intelligence language model designed to provide users with efficient and intelligent natural language processing solutions. Leveraging the latest deep learning technologies, this model boasts powerful text generation and comprehension capabilities, making it suitable for a wide range of applications including conversational systems, content creation, and information retrieval.

### Key Features:

1. **Efficiency**: Baichuan3-Turbo employs optimized algorithms that significantly enhance processing speed, allowing for quick responses to user queries.
2. **Diversity**: The model supports multiple languages and dialects, catering to the needs of users from various regions and increasing its versatility across different scenarios.
3. **Contextual Understanding**: By deeply learning contextual information, Baichuan3-Turbo excels at understanding user intent, generating more relevant and natural responses.
4. **Customizability**: Users can fine-tune the model according to specific requirements, ensuring alignment with industry standards or particular task demands.
5. **Safety and Compliance**: The design prioritizes data privacy and compliance with regulations, ensuring user information security while adhering to relevant laws.

### Application Scenarios:

- **Customer Service**: Automating responses to customer inquiries to improve service efficiency.
- **Content Generation**: Assisting creators in generating high-quality articles, reports, or social media content.
- **Educational Support**: Providing personalized learning suggestions and answering student questions.
- **Data Analysis**: Extracting key insights from large volumes of text for in-depth analysis.

In summary, Baichuan3-Turbo delivers exceptional performance and flexible application capabilities, offering innovative solutions across various industries and serving as a vital tool in advancing the process of intelligent automation.

Baichuan

32K

$1.7/M

Ba

baichuan4

ID:baichuan/baichuan4

### Baichuan4 Model Introduction

Baichuan4 is a state-of-the-art artificial intelligence language model designed to enhance natural language understanding and generation capabilities. Built on cutting-edge deep learning techniques, Baichuan4 is tailored for diverse applications, ranging from conversational AI and content creation to data analysis and customer support.

#### Key Features:

1. **Enhanced Performance**: Baichuan4 incorporates advanced algorithms that optimize processing efficiency, delivering faster response times and improved interaction quality.
2. **Multilingual Support**: The model is capable of understanding and generating text in multiple languages, making it accessible to a global audience and suitable for various linguistic contexts.
3. **Deep Contextual Awareness**: With its ability to comprehend nuanced context, Baichuan4 provides more accurate interpretations of user intent, resulting in responses that are not only relevant but also contextually appropriate.
4. **Adaptability**: Users can easily customize the model for specific use cases or industries, allowing for fine-tuning that meets unique operational needs or standards.
5. **Ethical Design**: The development of Baichuan4 emphasizes user privacy and ethical considerations, ensuring compliance with data protection regulations while maintaining high standards of security.

#### Application Scenarios:

- **Customer Interaction**: Streamlining customer service operations by automating responses and providing instant support.
- **Content Creation**: Assisting writers in producing high-quality articles, marketing materials, or social media posts with ease.
- **Educational Tools**: Supporting educators and learners with personalized tutoring solutions and resource recommendations.
- **Data Mining and Analysis**: Extracting valuable insights from large datasets or unstructured text to facilitate informed decision-making.

In conclusion, Baichuan4 stands at the forefront of AI language models, offering powerful capabilities that drive innovation across various fields. Its combination of performance, versatility, and ethical design makes it an essential tool for organizations seeking to leverage AI in their operations.

Baichuan

32K

$14.3/M

Mo

moonshot-v1-8k

ID:moonshot/moonshot-v1-8k

### Moonshot-v1-8k Model Introduction

Moonshot-v1-8k is a large-scale language model developed by Moonshot AI, known for its exceptional natural language processing capabilities. Utilizing advanced deep learning techniques, this model has been trained on a vast corpus of text data, enabling it to understand and generate human-like language, thereby providing users with an efficient and intelligent interactive experience.

#### Key Features:

1. **Powerful Semantic Understanding**: Moonshot-v1-8k excels in semantic comprehension, accurately interpreting user inputs and generating appropriate responses.
2. **Instruction Following Ability**: The model effectively follows user instructions, handling everything from simple question answering to complex task execution with ease.
3. **Efficient Text Generation**: With support for up to 8192 tokens in context windows, Moonshot-v1-8k is particularly suited for real-time interactions involving short texts, quickly producing coherent and relevant output.
4. **Versatile Application Scenarios**: The model can be applied across various domains such as chatbots, customer support, content creation, and educational assistance, significantly enhancing its usability.
5. **Easy Integration and Use**: Moonshot-v1-8k offers API access that allows developers to seamlessly integrate the model into various applications, making the implementation of intelligent features more straightforward.

#### Application Scenarios:

- **Customer Support**: Automating customer inquiries to enhance response speed and service quality.
- **Content Creation**: Assisting writers in generating articles, reports, or other forms of textual content.
- **Educational Assistance**: Providing personalized learning suggestions and resolving student queries.
- **Social Media Management**: Creating engaging social media posts to boost user interaction rates.

In summary, Moonshot-v1-8k is a powerful language model with extensive potential applications across multiple fields. Its broad usability makes it an essential tool for advancing artificial intelligence developments. By leveraging this model, users can enhance productivity and achieve higher levels of human-computer interaction.

Moonshot

8K

$1.9/M

Am

Amazon: Nova Pro 1.0

ID:amazon/nova-pro-v1

Amazon Nova Pro 1.0 is a capable multimodal model from Amazon focused on providing a combination of accuracy, speed, and cost for a wide range of tasks. As of December 2024, it achieves state-of-the-art performance on key benchmarks including visual question answering (TextVQA) and video understanding (VATEX). Amazon Nova Pro demonstrates strong capabilities in processing both visual and textual information and in analyzing financial documents. **NOTE**: Video input and tool calling are not supported at this time.

Amazon

300K

$0.8/M

$3.2/M

Am

Amazon: Nova Micro 1.0

ID:amazon/nova-micro-v1

Amazon Nova Micro 1.0 is a text-only model that delivers the lowest latency responses in the Amazon Nova family of models at a very low cost. With a context length of 128K tokens and optimized for speed and cost, Amazon Nova Micro excels at tasks such as text summarization, translation, content classification, interactive chat, and brainstorming. It has simple mathematical reasoning and coding abilities.

Amazon

128K

$0.035/M

$0.14/M

Op

GPT-4o mini

ID:gpt-4o-mini

GPT-4o mini is OpenAI's newest model after [GPT-4 Omni](/openai/gpt-4o), supporting both text and image inputs with text outputs. As their most advanced small model, it is many multiples more affordable than other recent frontier models, and more than 60% cheaper than [GPT-3.5 Turbo](/openai/gpt-3.5-turbo). It maintains SOTA intelligence, while being significantly more cost-effective. GPT-4o mini achieves an 82% score on MMLU and presently ranks higher than GPT-4 on chat preferences [common leaderboards](https://arena.lmsys.org/). Check out the [launch announcement](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) to learn more.

OpenAI

128K

$0.15/M

$0.6/M

40%

Op

gpt-4o

ID:rifx/gpt-4o

GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of [GPT-4 Turbo](/openai/gpt-4-turbo) while being twice as fast and 50% more cost-effective. GPT-4o also offers improved performance in processing non-English languages and enhanced visual capabilities. For benchmarking against other models, it was briefly called ["im-also-a-good-gpt2-chatbot"](https://twitter.com/LiamFedus/status/1790064963966370209)

OpenAI

128K

$2.5/M

$10/M

40%

Op

GPT-4o mini

ID:rifx/gpt-4o-mini

GPT-4o mini is OpenAI's newest model after [GPT-4 Omni](/openai/gpt-4o), supporting both text and image inputs with text outputs. As their most advanced small model, it is many multiples more affordable than other recent frontier models, and more than 60% cheaper than [GPT-3.5 Turbo](/openai/gpt-3.5-turbo). It maintains SOTA intelligence, while being significantly more cost-effective. GPT-4o mini achieves an 82% score on MMLU and presently ranks higher than GPT-4 on chat preferences [common leaderboards](https://arena.lmsys.org/). Check out the [launch announcement](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) to learn more.

OpenAI

128K

$0.15/M

$0.6/M

Gr

MythoMax 13B (extended)

ID:gryphe/mythomax-l2-13b:extended

One of the highest performing and most popular fine-tunes of Llama 2 13B, with rich descriptions and roleplay. #merge _These are extended-context endpoints for [MythoMax 13B](/gryphe/mythomax-l2-13b). They may have higher prices._

Gryphe

8K

$1.125/M

Free

Gr

MythoMax 13B (free)

ID:gryphe/mythomax-l2-13b:free

One of the highest performing and most popular fine-tunes of Llama 2 13B, with rich descriptions and roleplay. #merge _These are extended-context endpoints for [MythoMax 13B](/gryphe/mythomax-l2-13b). They may have higher prices._

Gryphe

8K

Un

ReMM SLERP 13B (extended)

ID:undi95/remm-slerp-l2-13b:extended

A recreation trial of the original MythoMax-L2-B13 but with updated models. #merge

Undi95

4K

$1.125/M

Go

Google: PaLM 2 Code Chat 32k

ID:google/palm-2-codechat-bison-32k

PaLM 2 fine-tuned for chatbot conversations that help with code-related questions.

Google

33K

$1/M

$2/M

01

01.AI: Yi Large

ID:01-ai/yi-large

The Yi Large model was designed by 01.AI with the following use cases in mind: knowledge search, data classification, human-like chatbots, and customer service. It stands out for its multilingual proficiency, particularly in Spanish, Chinese, Japanese, German, and French. Check out the [launch announcement](https://01-ai.github.io/blog/01.ai-yi-large-llm-launch) to learn more.

01-ai

33K

$3/M

Mi

Mistral Large 2411

ID:mistralai/mistral-large-2411

This is Mistral AI's flagship model, Mistral Large 2 (version mistral-large-2407). It's a proprietary weights-available model and excels at reasoning, code, JSON, chat, and more. Read the launch announcement [here](https://mistral.ai/news/mistral-large-2407/). It supports dozens of languages including French, German, Spanish, Italian, Portuguese, Arabic, Hindi, Russian, Chinese, Japanese, and Korean, along with 80+ coding languages including Python, Java, C, C++, JavaScript, and Bash. Its long context window allows precise information recall from large documents.

MistralAI

128K

$2/M

$6/M

Mi

Mistral Large 2407

ID:mistralai/mistral-large-2407

This is Mistral AI's flagship model, Mistral Large 2 (version mistral-large-2407). It's a proprietary weights-available model and excels at reasoning, code, JSON, chat, and more. Read the launch announcement [here](https://mistral.ai/news/mistral-large-2407/). It supports dozens of languages including French, German, Spanish, Italian, Portuguese, Arabic, Hindi, Russian, Chinese, Japanese, and Korean, along with 80+ coding languages including Python, Java, C, C++, JavaScript, and Bash. Its long context window allows precise information recall from large documents.

MistralAI

128K

$2/M

$6/M

Mi

Mistral: Pixtral Large 2411

ID:mistralai/pixtral-large-2411

Pixtral Large is a 124B open-weights multimodal model built on top of [Mistral Large 2](/mistralai/mistral-large-2411). The model is able to understand documents, charts and natural images. The model is available under the Mistral Research License (MRL) for research and educational use; and the Mistral Commercial License for experimentation, testing, and production for commercial purposes.

MistralAI

128K

$2/M

$6/M

Pe

Perplexity: Llama 3.1 Sonar 70B

ID:perplexity/llama-3.1-sonar-large-128k-chat

Llama 3.1 Sonar is Perplexity's latest model family. It surpasses their earlier Sonar models in cost-efficiency, speed, and performance. This is a normal offline LLM, but the [online version](/perplexity/llama-3.1-sonar-large-128k-online) of this model has Internet access.

Perplexity

131K

$1/M

Pe

Perplexity: Llama 3.1 Sonar 8B

ID:perplexity/llama-3.1-sonar-small-128k-chat

Llama 3.1 Sonar is Perplexity's latest model family. It surpasses their earlier Sonar models in cost-efficiency, speed, and performance. This is a normal offline LLM, but the [online version](/perplexity/llama-3.1-sonar-small-128k-online) of this model has Internet access.

Perplexity

131K

$0.2/M

Op

OpenChat 3.5 7B

ID:openchat/openchat-7b

OpenChat 7B is a library of open-source language models, fine-tuned with "C-RLFT (Conditioned Reinforcement Learning Fine-Tuning)" - a strategy inspired by offline reinforcement learning. It has been trained on mixed-quality data without preference labels. - For OpenChat fine-tuned on Mistral 7B, check out [OpenChat 7B](/openchat/openchat-7b). - For OpenChat fine-tuned on Llama 8B, check out [OpenChat 8B](/openchat/openchat-8b). #open-source

Openchat

8K

$0.055/M

Op

OpenAI: GPT-3.5 Turbo 16k (older v1106)

ID:openai/gpt-3.5-turbo-1106

An older GPT-3.5 Turbo model with improved instruction following, JSON mode, reproducible outputs, parallel function calling, and more. Training data: up to Sep 2021.

OpenAI

16K

$1/M

$2/M

Free

Un

Toppy M 7B (free)

ID:undi95/toppy-m-7b:free

A wild 7B parameter model that merges several models using the new task_arithmetic merge method from mergekit.

List of merged models:

- NousResearch/Nous-Capybara-7B-V1.9
- [HuggingFaceH4/zephyr-7b-beta](/huggingfaceh4/zephyr-7b-beta)
- lemonilia/AshhLimaRP-Mistral-7B
- Vulkane/120-Days-of-Sodom-LoRA-Mistral-7b
- Undi95/Mistral-pippa-sharegpt-7b-qlora

#merge #uncensored
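To make the "task arithmetic" idea concrete, here is a minimal Python sketch of the general technique: each fine-tuned model contributes a task vector (its weights minus the base weights), and scaled task vectors are added back onto the base. This is an illustration under stated assumptions, not mergekit's implementation; the function name `task_arithmetic_merge`, the toy weight dictionaries, and the scale values are hypothetical, and the listing does not say which base model or weights were used for Toppy M.

```python
from typing import Dict, List, Sequence

# Toy stand-in for a checkpoint: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def task_arithmetic_merge(
    base: Weights,
    finetuned: Sequence[Weights],
    scales: Sequence[float],
) -> Weights:
    """Return base + sum_i scales[i] * (finetuned[i] - base), per parameter."""
    if len(finetuned) != len(scales):
        raise ValueError("need exactly one scale per fine-tuned model")
    merged: Weights = {}
    for name, base_vals in base.items():
        out = list(base_vals)
        for model, scale in zip(finetuned, scales):
            # The task vector is the element-wise difference from the base.
            for i, (b, f) in enumerate(zip(base_vals, model[name])):
                out[i] += scale * (f - b)
        merged[name] = out
    return merged


if __name__ == "__main__":
    base = {"w": [1.0, 2.0]}
    model_a = {"w": [2.0, 2.0]}  # changed only the first weight
    model_b = {"w": [1.0, 4.0]}  # changed only the second weight
    print(task_arithmetic_merge(base, [model_a, model_b], [0.5, 0.5]))
```

Real checkpoints hold tensors rather than Python lists, so an actual merge would apply the same formula tensor by tensor.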

Undi95

4K

Me

Meta: LlamaGuard 2 8B

ID:meta-llama/llama-guard-2-8b

This safeguard model has 8B parameters and is based on the Llama 3 family. Just like its predecessor, [LlamaGuard 1](https://huggingface.co/meta-llama/LlamaGuard-7b), it can do both prompt and response classification. LlamaGuard 2 acts as a normal LLM would, generating text that indicates whether the given input/output is safe/unsafe. If deemed unsafe, it will also share the content categories violated. For best results, please use raw prompt input or the `/completions` endpoint, instead of the chat API. It has demonstrated strong performance compared to leading closed-source models in human evaluations. Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).
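Because the verdict comes back as generated text, a caller has to parse it. The Python sketch below shows one way to do that, assuming the output is `safe` or `unsafe` on the first line, followed by a comma-separated list of category codes (such as `S1,S3`) on the second line when unsafe. That format is an assumption not stated in this listing, so check it against the model card; the names `GuardVerdict` and `parse_guard_output` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GuardVerdict:
    is_safe: bool
    categories: List[str]  # violated category codes; empty when safe


def parse_guard_output(text: str) -> GuardVerdict:
    """Parse a safe/unsafe verdict from generated text.

    Assumed format (not confirmed by the listing): line 1 is "safe" or
    "unsafe"; line 2, if present, is a comma-separated list of categories.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty classifier output")
    label = lines[0].lower()
    if label == "safe":
        return GuardVerdict(is_safe=True, categories=[])
    if label == "unsafe":
        categories: List[str] = []
        if len(lines) > 1:
            categories = [c.strip() for c in lines[1].split(",") if c.strip()]
        return GuardVerdict(is_safe=False, categories=categories)
    raise ValueError(f"unrecognized verdict: {lines[0]!r}")


if __name__ == "__main__":
    print(parse_guard_output("safe"))
    print(parse_guard_output("unsafe\nS1,S3"))
```

Raising on unrecognized text, rather than defaulting to safe, keeps a malformed generation from silently passing moderation.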

Meta Llama

8K

$0.18/M

Mi

Mixtral 8x7B (base)

ID:mistralai/mixtral-8x7b

A pretrained generative Sparse Mixture of Experts, by Mistral AI. Incorporates 8 experts (feed-forward networks) for a total of 47B parameters. Base model (not fine-tuned for instructions) - see [Mixtral 8x7B Instruct](/mistralai/mixtral-8x7b-instruct) for an instruct-tuned model. #moe
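The Python sketch below illustrates what "sparse" means in a Mixture of Experts layer: a router scores every expert, but only a few experts run for a given input, and their outputs are combined using normalized router weights. The top-2 selection and the softmax over the selected logits are assumptions based on how Mixtral is commonly described, not details given in this listing. The scalar "experts" are toy stand-ins for feed-forward networks, and `sparse_moe_forward` is a hypothetical name.

```python
import math
from typing import Callable, List, Sequence

# Toy expert: a real one is a feed-forward network over a vector.
Expert = Callable[[float], float]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_moe_forward(
    x: float,
    router_logits: Sequence[float],
    experts: Sequence[Expert],
    top_k: int = 2,  # assumption: two experts active per token
) -> float:
    """Run only the top_k highest-scoring experts and mix their outputs."""
    if len(router_logits) != len(experts):
        raise ValueError("need exactly one router logit per expert")
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    gates = softmax([router_logits[i] for i in chosen])
    return sum(g * experts[i](x) for g, i in zip(gates, chosen))


if __name__ == "__main__":
    # Eight toy experts; expert k multiplies its input by k.
    experts = [lambda v, k=k: k * v for k in range(8)]
    logits = [0.0, 0.0, 0.0, 2.0, 0.0, 2.0, 0.0, 0.0]
    print(sparse_moe_forward(1.0, logits, experts))
```

Because only the selected experts are evaluated, the compute per input scales with `top_k` rather than with the total number of experts, which is why the total parameter count can be much larger than the number of parameters active for any single token.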

MistralAI

33K

$0.54/M

Mi

Mistral Small

ID:mistralai/mistral-small

Cost-efficient, fast, and reliable option for use cases such as translation, summarization, and sentiment analysis.

MistralAI

32K

$0.2/M

$0.6/M

Mi

Mistral Tiny

ID:mistralai/mistral-tiny

This model is currently powered by Mistral-7B-v0.2, and incorporates a "better" fine-tuning than [Mistral 7B](/mistralai/mistral-7b-instruct-v0.1), inspired by community work. It's best used for large batch processing tasks where cost is a significant factor but reasoning capabilities are not crucial.

MistralAI

32K

$0.25/M

Go

Google: Gemini Pro 1.0

ID:google/gemini-pro

Google's flagship text generation model. Designed to handle natural language tasks, multiturn text and code chat, and code generation. See the benchmarks and prompting guidelines from [Deepmind](https://deepmind.google/technologies/gemini/). Usage of Gemini is subject to Google's [Gemini Terms of Use](https://ai.google.dev/terms).

Google

33K

$0.5/M

$1.5/M

Me

Llama 3 Lumimaid 70B

ID:neversleep/llama-3-lumimaid-70b

The NeverSleep team is back, with a Llama 3 70B finetune trained on their curated roleplay data. Striking a balance between eRP and RP, Lumimaid was designed to be serious, yet uncensored when necessary. To enhance its overall intelligence and chat capability, roughly 40% of the training data was not roleplay. This provides a breadth of knowledge to access, while still keeping roleplay as the primary strength. Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).

Meta Llama

8K

$3.375/M

$4.5/M

Al

Goliath 120B

ID:alpindale/goliath-120b

A large LLM created by combining two fine-tuned Llama 70B models into one 120B model. Combines Xwin and Euryale. Credits to - [@chargoddard](https://huggingface.co/chargoddard) for developing the framework used to merge the model - [mergekit](https://github.com/cg123/mergekit). - [@Undi95](https://huggingface.co/Undi95) for helping with the merge ratios. #merge

Alpindale

6K

$9.375/M

Go

Google: Gemini Pro Vision 1.0

ID:google/gemini-pro-vision

Google's flagship multimodal model, supporting image and video in text or chat prompts for a text or code response. See the benchmarks and prompting guidelines from [Deepmind](https://deepmind.google/technologies/gemini/). Usage of Gemini is subject to Google's [Gemini Terms of Use](https://ai.google.dev/terms). #multimodal

Google

16K

$0.5/M

$1.5/M

Mi

WizardLM-2 7B

ID:microsoft/wizardlm-2-7b

WizardLM-2 7B is the smaller variant of Microsoft AI's latest Wizard model. It is the fastest and achieves comparable performance with existing 10x larger opensource leading models. It is a finetune of [Mistral 7B Instruct](/mistralai/mistral-7b-instruct), using the same technique as [WizardLM-2 8x22B](/microsoft/wizardlm-2-8x22b). To read more about the model release, [click here](https://wizardlm.github.io/WizardLM2/). #moe

Microsoft Azure

32K

$0.055/M

Go

Google: Gemini Pro 1.5

ID:google/gemini-pro-1.5

Google's latest multimodal model, supporting image and video in text or chat prompts. Optimized for language tasks including: - Code generation - Text generation - Text editing - Problem solving - Recommendations - Information extraction - Data extraction or generation - AI agents Usage of Gemini is subject to Google's [Gemini Terms of Use](https://ai.google.dev/terms). #multimodal

Google

2M

$1.25/M

$5/M

Co

Cohere: Command R+

ID:cohere/command-r-plus

command-r-plus-08-2024 is an update of the [Command R+](/cohere/command-r-plus) with roughly 50% higher throughput and 25% lower latencies as compared to the previous Command R+ version, while keeping the hardware footprint the same. Read the launch post [here](https://docs.cohere.com/changelog/command-gets-refreshed). Use of this model is subject to Cohere's [Acceptable Use Policy](https://docs.cohere.com/docs/c4ai-acceptable-use-policy).

Cohere

128K

$2.85/M

$14.25/M

Da

Databricks: DBRX 132B Instruct

ID:databricks/dbrx-instruct

DBRX is a new open source large language model developed by Databricks. At 132B, it outperforms existing open source LLMs like Llama 2 70B and [Mixtral-8x7b](/mistralai/mixtral-8x7b) on standard industry benchmarks for language understanding, programming, math, and logic. It uses a fine-grained mixture-of-experts (MoE) architecture. 36B parameters are active on any input. It was pre-trained on 12T tokens of text and code data. Compared to other open MoE models like Mixtral-8x7B and Grok-1, DBRX is fine-grained, meaning it uses a larger number of smaller experts. See the launch announcement and benchmark results [here](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm). #moe
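To illustrate why only part of a mixture-of-experts model is active on any input, here is a toy top-k routing sketch. The expert count, the value of `k`, the router scores, and the function names are all illustrative assumptions and do not reflect DBRX's actual configuration; only the 36B and 132B figures come from the description above.

```python
import math
from typing import Callable, List, Sequence, Tuple


def top_k_route(scores: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(
    x: float,
    experts: Sequence[Callable[[float], float]],
    scores: Sequence[float],
    k: int,
) -> float:
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(scores, k))


# Toy experts: each one just scales its input (illustrative only).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_scores = [0.1, 2.0, 0.3, 2.0]

routes = top_k_route(router_scores, k=2)
output = moe_forward(1.0, experts, router_scores, k=2)

# Share of parameters active per input, from the figures in the listing.
active_fraction = 36 / 132
print(routes, output, round(active_fraction, 3))
```

The unselected experts are never called, which is the source of the compute savings; with the listed figures, roughly 27% of the parameters participate in any single forward pass.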

Databricks

33K

$1.08/M

Ai

AI21: Jamba Instruct

ID:ai21/jamba-instruct

The Jamba-Instruct model, introduced by AI21 Labs, is an instruction-tuned variant of their hybrid SSM-Transformer Jamba model, specifically optimized for enterprise applications. - 256K Context Window: It can process extensive information, equivalent to a 400-page novel, which is beneficial for tasks involving large documents such as financial reports or legal documents - Safety and Accuracy: Jamba-Instruct is designed with enhanced safety features to ensure secure deployment in enterprise environments, reducing the risk and cost of implementation Read their [announcement](https://www.ai21.com/blog/announcing-jamba) to learn more. Jamba has a knowledge cutoff of February 2024.

Ai21

256K

$0.5/M

$0.7/M

Ri

Llama 3 Euryale 70B v2.1

ID:sao10k/l3-euryale-70b

Euryale 70B v2.1 is a model focused on creative roleplay from [Sao10k](https://ko-fi.com/sao10k). - Better prompt adherence. - Better anatomy / spatial awareness. - Adapts much better to unique and custom formatting / reply formats. - Very creative, lots of unique swipes. - Is not restrictive during roleplays.

Rifx.Online

8K

$0.35/M

$0.4/M

Mi

Mistral: Mistral 7B Instruct

ID:mistralai/mistral-7b-instruct

A high-performing, industry-standard 7.3B parameter model, with optimizations for speed and context length. *Mistral 7B Instruct has multiple version variants, and this is intended to be the latest version.*

MistralAI

33K

$0.055/M

Mi

Phi-3 Mini 128K Instruct

ID:microsoft/phi-3-mini-128k-instruct

Phi-3 Mini is a powerful 3.8B parameter model designed for advanced language understanding, reasoning, and instruction following. Optimized through supervised fine-tuning and preference adjustments, it excels in tasks involving common sense, mathematics, logical reasoning, and code processing. At time of release, Phi-3 Mini demonstrated state-of-the-art performance among lightweight models. This model is static, trained on an offline dataset with an October 2023 cutoff date.

Microsoft Azure

128K

$0.1/M

Mi

Phi-3 Medium 128K Instruct

ID:microsoft/phi-3-medium-128k-instruct

Phi-3 128K Medium is a powerful 14-billion parameter model designed for advanced language understanding, reasoning, and instruction following. Optimized through supervised fine-tuning and preference adjustments, it excels in tasks involving common sense, mathematics, logical reasoning, and code processing. At time of release, Phi-3 Medium demonstrated state-of-the-art performance among lightweight models. In the MMLU-Pro eval, the model even comes close to a Llama3 70B level of performance. For 4k context length, try [Phi-3 Medium 4K](/microsoft/phi-3-medium-4k-instruct).

Microsoft Azure

128K

$1/M

Go

Google: Gemini Flash 1.5

ID:google/gemini-flash-1.5

Gemini 1.5 Flash is a foundation model that performs well at a variety of multimodal tasks such as visual understanding, classification, summarization, and creating content from image, audio and video. It's adept at processing visual and text inputs such as photographs, documents, infographics, and screenshots. Gemini 1.5 Flash is designed for high-volume, high-frequency tasks where cost and latency matter. On most common tasks, Flash achieves comparable quality to other Gemini Pro models at a significantly reduced cost. Flash is well-suited for applications like chat assistants and on-demand content generation where speed and scale matter. Usage of Gemini is subject to Google's [Gemini Terms of Use](https://ai.google.dev/terms). #multimodal

Google

1M

$0.075/M

$0.3/M

Co

Cohere: Command

ID:cohere/command

Command is an instruction-following conversational model that performs language tasks with high quality, more reliably and with a longer context than our base generative models. Use of this model is subject to Cohere's [Acceptable Use Policy](https://docs.cohere.com/docs/c4ai-acceptable-use-policy).

Cohere

4K

$0.95/M

$1.9/M

Co

Cohere: Command R

ID:cohere/command-r

Command-R is a 35B parameter model that performs conversational language tasks at a higher quality, more reliably, and with a longer context than previous models. It can be used for complex workflows like code generation, retrieval augmented generation (RAG), tool use, and agents. Read the launch post [here](https://txt.cohere.com/command-r/). Use of this model is subject to Cohere's [Acceptable Use Policy](https://docs.cohere.com/docs/c4ai-acceptable-use-policy).

Cohere

128K

$0.475/M

$1.425/M

Free

Qw

Qwen 2 7B Instruct (free)

ID:qwen/qwen-2-7b-instruct:free

Qwen2 7B is a transformer-based model that excels in language understanding, multilingual capabilities, coding, mathematics, and reasoning. It features SwiGLU activation, attention QKV bias, and group query attention. It is pretrained on extensive data with supervised finetuning and direct preference optimization. For more details, see this [blog post](https://qwenlm.github.io/blog/qwen2/) and [GitHub repo](https://github.com/QwenLM/Qwen2). Usage of this model is subject to [Tongyi Qianwen LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen1.5-110B-Chat/blob/main/LICENSE).
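As a rough illustration of the SwiGLU activation mentioned above, the sketch below uses the common formulation `Swish(W_gate · x) * (W_up · x)` in plain Python. The weight names, shapes, and values are made up for the example, and the final down-projection that a full feed-forward block would apply is omitted; this is not Qwen2's actual implementation.

```python
import math
from typing import List, Sequence


def swish(z: float) -> float:
    """Swish (SiLU) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(matrix: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def swiglu(
    x: Sequence[float],
    w_gate: Sequence[Sequence[float]],
    w_up: Sequence[Sequence[float]],
) -> List[float]:
    """Gate one linear projection of x with the Swish of another."""
    gate = matvec(w_gate, x)
    up = matvec(w_up, x)
    return [swish(g) * u for g, u in zip(gate, up)]


# Hypothetical 2-input, 2-hidden-unit weights, chosen only for illustration.
hidden = swiglu([1.0, 2.0], w_gate=[[1.0, 0.0], [0.0, 0.0]], w_up=[[0.0, 1.0], [1.0, 1.0]])
print(hidden)
```

The second hidden unit comes out as zero because its gate pre-activation is zero, which shows how the gate path can suppress the value path entirely.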

Qwen

33K

Go

Google: Gemma 2 27B

ID:google/gemma-2-27b-it

Gemma 2 27B by Google is an open model built from the same research and technology used to create the [Gemini models](/models?q=gemini). Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. See the [launch announcement](https://blog.google/technology/developers/google-gemma-2/) for more details. Usage of Gemma is subject to Google's [Gemma Terms of Use](https://ai.google.dev/gemma/terms).

Google

8K

$0.27/M

Al

Magnum 72B

ID:alpindale/magnum-72b

From the maker of [Goliath](https://openrouter.ai/alpindale/goliath-120b), Magnum 72B is the first in a new family of models designed to achieve the prose quality of the Claude 3 models, notably Opus & Sonnet. The model is based on [Qwen2 72B](https://openrouter.ai/qwen/qwen-2-72b-instruct) and trained with 55 million tokens of highly curated roleplay (RP) data.

Alpindale

16K

$3.75/M

$4.5/M

Free

Go

Google: Gemma 2 9B (free)

ID:google/gemma-2-9b-it:free

Gemma 2 9B by Google is an advanced, open-source language model that sets a new standard for efficiency and performance in its size class. Designed for a wide variety of tasks, it empowers developers and researchers to build innovative applications, while maintaining accessibility, safety, and cost-effectiveness. See the [launch announcement](https://blog.google/technology/developers/google-gemma-2/) for more details. Usage of Gemma is subject to Google's [Gemma Terms of Use](https://ai.google.dev/gemma/terms).

Google

8K

Go

Google: Gemma 2 9B

ID:google/gemma-2-9b-it

Gemma 2 9B by Google is an advanced, open-source language model that sets a new standard for efficiency and performance in its size class. Designed for a wide variety of tasks, it empowers developers and researchers to build innovative applications, while maintaining accessibility, safety, and cost-effectiveness. See the [launch announcement](https://blog.google/technology/developers/google-gemma-2/) for more details. Usage of Gemma is subject to Google's [Gemma Terms of Use](https://ai.google.dev/gemma/terms).

Google

8K

$0.06/M

Mi

Mistral: Codestral Mamba

ID:mistralai/codestral-mamba

A 7.3B parameter Mamba-based model designed for code and reasoning tasks. - Linear time inference, allowing for theoretically infinite sequence lengths - 256k token context window - Optimized for quick responses, especially beneficial for code productivity - Performs comparably to state-of-the-art transformer models in code and reasoning tasks - Available under the Apache 2.0 license for free use, modification, and distribution
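The linear-time claim can be illustrated with a toy state-space recurrence: each token updates a fixed-size state once, so the work grows linearly with sequence length. This is a scalar sketch with made-up coefficients `a`, `b`, and `c`; it is not Mamba's actual selective-scan mechanism, and the function name is hypothetical.

```python
from typing import Iterable, List


def ssm_scan(xs: Iterable[float], a: float = 0.5, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Run the recurrence h_t = a * h_{t-1} + b * x_t and emit y_t = c * h_t."""
    h = 0.0  # fixed-size state carried across tokens
    ys: List[float] = []
    for x in xs:
        h = a * h + b * x  # one constant-cost update per token
        ys.append(c * h)
    return ys


ys = ssm_scan([1.0, 0.0, 0.0, 2.0])
print(ys)  # [1.0, 0.5, 0.25, 2.125]
```

Because the state `h` does not grow with the input, memory per step stays constant, in contrast to attention, where each new token attends to all earlier ones.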

MistralAI

256K

$0.25/M

Mi

Mistral: Mistral Nemo

ID:mistralai/mistral-nemo

A 12B parameter model with a 128k token context length built by Mistral in collaboration with NVIDIA. The model is multilingual, supporting English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi. It supports function calling and is released under the Apache 2.0 license.

MistralAI

128K

$0.13/M

Qw

Qwen 2 7B Instruct

ID:qwen/qwen-2-7b-instruct

Qwen2 7B is a transformer-based model that excels in language understanding, multilingual capabilities, coding, mathematics, and reasoning. It features SwiGLU activation, attention QKV bias, and group query attention. It is pretrained on extensive data with supervised finetuning and direct preference optimization. For more details, see this [blog post](https://qwenlm.github.io/blog/qwen2/) and [GitHub repo](https://github.com/QwenLM/Qwen2). Usage of this model is subject to [Tongyi Qianwen LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen1.5-110B-Chat/blob/main/LICENSE).

Qwen

33K

$0.054/M

Me

Meta: Llama 3.1 405B (base)

ID:meta-llama/llama-3.1-405b

Meta's latest class of model (Llama 3.1) launched with a variety of sizes & flavors. This is the base 405B pre-trained version. It has demonstrated strong performance compared to leading closed-source models in human evaluations. Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).

Meta Llama

131K

$2/M

Free

Go

Google: Gemini Pro 1.5 Experimental

ID:google/gemini-pro-1.5-exp

Google's latest multimodal model, supporting image and video in text or chat prompts. Optimized for language tasks including: - Code generation - Text generation - Text editing - Problem solving - Recommendations - Information extraction - Data extraction or generation - AI agents Usage of Gemini is subject to Google's [Gemini Terms of Use](https://ai.google.dev/terms). #multimodal

Google

2M

An

Anthropic: Claude 3.5 Haiku (2024-10-22)

ID:anthropic/claude-3.5-haiku-20241022

Claude 3.5 Haiku features enhancements across all skill sets including coding, tool use, and reasoning. As the fastest model in the Anthropic lineup, it offers rapid response times suitable for applications that require high interactivity and low latency, such as user-facing chatbots and on-the-fly code completions. It also excels in specialized tasks like data extraction and real-time content moderation, making it a versatile tool for a broad range of industries. It does not support image inputs. See the launch announcement and benchmark results [here](https://www.anthropic.com/news/3-5-models-and-computer-use)

Anthropic

200K

$1/M

$5/M

An

Anthropic: Claude 3 Opus

ID:anthropic/claude-3-opus

Claude 3 Opus is Anthropic's most powerful model for highly complex tasks. It boasts top-level performance, intelligence, fluency, and understanding. See the launch announcement and benchmark results [here](https://www.anthropic.com/news/claude-3-family) #multimodal

Anthropic

200K

$15/M

$75/M

An

Anthropic: Claude 3 Sonnet

ID:anthropic/claude-3-sonnet

Claude 3 Sonnet is an ideal balance of intelligence and speed for enterprise workloads. Maximum utility at a lower price, dependable, balanced for scaled deployments. See the launch announcement and benchmark results [here](https://www.anthropic.com/news/claude-3-family) #multimodal

Anthropic

200K

$3/M

$15/M

An

Anthropic: Claude 3 Haiku

ID:anthropic/claude-3-haiku

Claude 3 Haiku is Anthropic's fastest and most compact model for near-instant responsiveness. Quick and accurate targeted performance. See the launch announcement and benchmark results [here](https://www.anthropic.com/news/claude-3-haiku) #multimodal

Anthropic

200K

$0.25/M

$1.25/M

An

Anthropic: Claude 3.5 Haiku

ID:anthropic/claude-3.5-haiku

Claude 3.5 Haiku features enhancements across all skill sets including coding, tool use, and reasoning. As the fastest model in the Anthropic lineup, it offers rapid response times suitable for applications that require high interactivity and low latency, such as user-facing chatbots and on-the-fly code completions. It also excels in specialized tasks like data extraction and real-time content moderation, making it a versatile tool for a broad range of industries. It does not support image inputs. See the launch announcement and benchmark results [here](https://www.anthropic.com/news/3-5-models-and-computer-use)

Anthropic

200K

$1/M

$5/M

An

Anthropic: Claude 3.5 Sonnet

ID:anthropic/claude-3.5-sonnet

Claude 3.5 Sonnet delivers better-than-Opus capabilities, faster-than-Sonnet speeds, at the same Sonnet prices. Sonnet is particularly good at: - Coding: Autonomously writes, edits, and runs code with reasoning and troubleshooting - Data science: Augments human data science expertise; navigates unstructured data while using multiple tools for insights - Visual processing: excelling at interpreting charts, graphs, and images, accurately transcribing text to derive insights beyond just the text alone - Agentic tasks: exceptional tool use, making it great at agentic tasks (i.e. complex, multi-step problem solving tasks that require engaging with other systems) #multimodal

Anthropic

200K

$3/M

$15/M

Qw

Qwen2-VL 7B Instruct

ID:qwen/qwen-2-vl-7b-instruct

Qwen2 VL 7B is a multimodal LLM from the Qwen Team with the following key enhancements: - SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc. - Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc. - Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions. - Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc. For more details, see this [blog post](https://qwenlm.github.io/blog/qwen2-vl/) and [GitHub repo](https://github.com/QwenLM/Qwen2-VL). Usage of this model is subject to [Tongyi Qianwen LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen1.5-110B-Chat/blob/main/LICENSE).

Qwen

33K

$0.1/M

Op

OpenAI: o1-preview

ID:openai/o1-preview

The latest and strongest model family from OpenAI, o1 is designed to spend more time thinking before responding. The o1 models are optimized for math, science, programming, and other STEM-related tasks. They consistently exhibit PhD-level accuracy on benchmarks in physics, chemistry, and biology. Learn more in the [launch announcement](https://openai.com/o1). Note: This model is currently experimental and not suitable for production use-cases, and may be heavily rate-limited.

OpenAI

128K

$15/M

$60/M

Ai

AI21: Jamba 1.5 Large

ID:ai21/jamba-1-5-large

Jamba 1.5 Large is part of AI21's new family of open models, offering superior speed, efficiency, and quality. It features a 256K effective context window, the longest among open models, enabling improved performance on tasks like document summarization and analysis. Built on a novel SSM-Transformer architecture, it outperforms larger models like Llama 3.1 70B on benchmarks while maintaining resource efficiency. Read their [announcement](https://www.ai21.com/blog/announcing-jamba-model-family) to learn more.

Ai21

256K

$2/M

$8/M

Ri

Llama 3.1 Euryale 70B v2.2

ID:sao10k/l3.1-euryale-70b

Euryale L3.1 70B v2.2 is a model focused on creative roleplay from [Sao10k](https://ko-fi.com/sao10k). It is the successor of [Euryale L3 70B v2.1](/sao10k/l3-euryale-70b).

Rifx.Online

8K

$0.35/M

$0.4/M

Ai

AI21: Jamba 1.5 Mini

ID:ai21/jamba-1-5-mini

Jamba 1.5 Mini is the world's first production-grade Mamba-based model, combining SSM and Transformer architectures for a 256K context window and high efficiency. It works with 9 languages and can handle various writing and analysis tasks as well as or better than similar small models. This model uses less computer memory and works faster with longer texts than previous designs. Read their [announcement](https://www.ai21.com/blog/announcing-jamba-model-family) to learn more.

Ai21

256K

$0.2/M

$0.4/M

No

Nous: Hermes 3 70B Instruct

ID:nousresearch/hermes-3-llama-3.1-70b

Hermes 3 is a generalist language model with many improvements over [Hermes 2](/nousresearch/nous-hermes-2-mistral-7b-dpo), including advanced agentic capabilities, much better roleplaying, reasoning, multi-turn conversation, long context coherence, and improvements across the board. Hermes 3 70B is a competitive, if not superior, finetune of the [Llama-3.1 70B foundation model](/meta-llama/llama-3.1-70b-instruct), focused on aligning LLMs to the user, with powerful steering capabilities and control given to the end user. The Hermes 3 series builds and expands on the Hermes 2 set of capabilities, including more powerful and reliable function calling and structured output capabilities, generalist assistant capabilities, and improved code generation skills.

NousResearch

131K

$0.4/M

No

Nous: Hermes 3 405B Instruct

ID:nousresearch/hermes-3-llama-3.1-405b

Hermes 3 is a generalist language model with many improvements over Hermes 2, including advanced agentic capabilities, much better roleplaying, reasoning, multi-turn conversation, long context coherence, and improvements across the board. Hermes 3 405B is a frontier-level, full-parameter finetune of the Llama-3.1 405B foundation model, focused on aligning LLMs to the user, with powerful steering capabilities and control given to the end user. The Hermes 3 series builds and expands on the Hermes 2 set of capabilities, including more powerful and reliable function calling and structured output capabilities, generalist assistant capabilities, and improved code generation skills. Hermes 3 is competitive, if not superior, to Llama-3.1 Instruct models at general capabilities, with varying strengths and weaknesses attributable between the two.

NousResearch

131K

$1.79/M

$2.49/M

Free

Me

Meta: Llama 3.2 11B Vision Instruct (free)

ID:meta-llama/llama-3.2-11b-vision-instruct:free

Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and visual question answering, bridging the gap between language generation and visual reasoning. Pre-trained on a massive dataset of image-text pairs, it performs well in complex, high-accuracy image analysis. Its ability to integrate visual understanding with language processing makes it an ideal solution for industries requiring comprehensive visual-linguistic AI applications, such as content creation, AI-driven customer service, and research. Click here for the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md). Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).

Meta Llama

131K

Me

Meta: Llama 3.2 11B Vision Instruct

ID:meta-llama/llama-3.2-11b-vision-instruct

Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and visual question answering, bridging the gap between language generation and visual reasoning. Pre-trained on a massive dataset of image-text pairs, it performs well in complex, high-accuracy image analysis. Its ability to integrate visual understanding with language processing makes it an ideal solution for industries requiring comprehensive visual-linguistic AI applications, such as content creation, AI-driven customer service, and research. Click here for the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md). Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).

Meta Llama

131K

$0.055/M

Me

Lumimaid v0.2 8B

ID:neversleep/llama-3.1-lumimaid-8b

Lumimaid v0.2 8B is a finetune of [Llama 3.1 8B](/meta-llama/llama-3.1-8b-instruct) with a "HUGE step up dataset wise" compared to Lumimaid v0.1. Sloppy chat outputs were purged. Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).

Meta Llama

131K

$0.1875/M

$1.125/M

Op

GPT-4o

ID:gpt-4o

GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of [GPT-4 Turbo](/openai/gpt-4-turbo) while being twice as fast and 50% more cost-effective. GPT-4o also offers improved performance in processing non-English languages and enhanced visual capabilities. For benchmarking against other models, it was briefly called ["im-also-a-good-gpt2-chatbot"](https://twitter.com/LiamFedus/status/1790064963966370209)

OpenAI

128K

$2.5/M

$10/M

Op

OpenAI: GPT-4o

ID:openai/gpt-4o

GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of [GPT-4 Turbo](/openai/gpt-4-turbo) while being twice as fast and 50% more cost-effective. GPT-4o also offers improved performance in processing non-English languages and enhanced visual capabilities. For benchmarking against other models, it was briefly called ["im-also-a-good-gpt2-chatbot"](https://twitter.com/LiamFedus/status/1790064963966370209)

OpenAI

128K

$2.5/M

$10/M

Op

OpenAI: GPT-4o-mini

ID:openai/gpt-4o-mini

GPT-4o mini is OpenAI's newest model after [GPT-4 Omni](/openai/gpt-4o), supporting both text and image inputs with text outputs. As their most advanced small model, it is many multiples more affordable than other recent frontier models, and more than 60% cheaper than [GPT-3.5 Turbo](/openai/gpt-3.5-turbo). It maintains SOTA intelligence, while being significantly more cost-effective. GPT-4o mini achieves an 82% score on MMLU and presently ranks higher than GPT-4 in chat preference on [common leaderboards](https://arena.lmsys.org/). Check out the [launch announcement](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) to learn more.

OpenAI

128K

$0.15/M

$0.6/M

Go

Google: Gemini 1.5 Flash-8B

ID:google/gemini-flash-1.5-8b

Gemini 1.5 Flash-8B is optimized for speed and efficiency, offering enhanced performance in small prompt tasks like chat, transcription, and translation. With reduced latency, it is highly effective for real-time and large-scale operations. This model focuses on cost-effective solutions while maintaining high-quality results. [Click here to learn more about this model](https://developers.googleblog.com/en/gemini-15-flash-8b-is-now-generally-available-for-use/). Usage of Gemini is subject to Google's [Gemini Terms of Use](https://ai.google.dev/terms).

Google

1M

$0.0375/M

$0.15/M

In

Inflection: Inflection 3 Productivity

ID:inflection/inflection-3-productivity

Inflection 3 Productivity is optimized for following instructions. It is better for tasks requiring JSON output or precise adherence to provided guidelines. For emotional intelligence similar to Pi, see [Inflection 3 Pi](/inflection/inflection-3-pi). See [Inflection's announcement](https://inflection.ai/blog/enterprise) for more details.

Inflection

8K

$2.5/M

$10/M

In

Inflection: Inflection 3 Pi

ID:inflection/inflection-3-pi

Inflection 3 Pi powers Inflection's [Pi](https://pi.ai) chatbot, including backstory, emotional intelligence, productivity, and safety. It excels in scenarios like customer support, roleplay, and emotional intelligence.

Inflection

8K

$2.5/M

$10/M

Qw

Qwen2.5 7B Instruct

ID:qwen/qwen-2.5-7b-instruct

Qwen2.5 7B is part of the latest series of Qwen large language models. Qwen2.5 brings the following improvements upon Qwen2: - Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains. - Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots. - Long-context support up to 128K tokens, with generation of up to 8K tokens. - Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more. Usage of this model is subject to [Tongyi Qianwen LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen1.5-110B-Chat/blob/main/LICENSE).

Qwen

131K

$0.27/M

Th

Rocinante 12B

ID:thedrummer/rocinante-12b

Rocinante 12B is designed for engaging storytelling and rich prose. Early testers have reported: - Expanded vocabulary with unique and expressive word choices - Enhanced creativity for vivid narratives - Adventure-filled and captivating stories

Thedrummer

33K

$0.25/M

$0.5/M

Me

Meta: Llama 3.2 3B Instruct

ID:meta-llama/llama-3.2-3b-instruct

Llama 3.2 3B is a 3-billion-parameter multilingual large language model, optimized for advanced natural language processing tasks like dialogue generation, reasoning, and summarization. Designed with the latest transformer architecture, it supports eight languages, including English, Spanish, and Hindi, and is adaptable for additional languages. Trained on 9 trillion tokens, the Llama 3.2 3B model excels in instruction-following, complex reasoning, and tool use. Its balanced performance makes it ideal for applications needing accuracy and efficiency in text generation across multilingual settings. Click here for the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD.md). Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).

Meta Llama

131K

$0.03/M

$0.05/M

Free

Me

Meta: Llama 3.2 3B Instruct (free)

ID:meta-llama/llama-3.2-3b-instruct:free

Llama 3.2 3B is a 3-billion-parameter multilingual large language model, optimized for advanced natural language processing tasks like dialogue generation, reasoning, and summarization. Designed with the latest transformer architecture, it supports eight languages, including English, Spanish, and Hindi, and is adaptable for additional languages. Trained on 9 trillion tokens, the Llama 3.2 3B model excels in instruction-following, complex reasoning, and tool use. Its balanced performance makes it ideal for applications needing accuracy and efficiency in text generation across multilingual settings. Click here for the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD.md). Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).

Meta Llama

131K

Qw

Qwen2-VL 72B Instruct

ID:qwen/qwen-2-vl-72b-instruct

Qwen2 VL 72B is a multimodal LLM from the Qwen Team with the following key enhancements: - SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc. - Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc. - Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions. - Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc. For more details, see this [blog post](https://qwenlm.github.io/blog/qwen2-vl/) and [GitHub repo](https://github.com/QwenLM/Qwen2-VL). Usage of this model is subject to [Tongyi Qianwen LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen1.5-110B-Chat/blob/main/LICENSE).

Qwen

33K

$0.4/M

Qw

Qwen2.5 72B Instruct

ID:qwen/qwen-2.5-72b-instruct

Qwen2.5 72B belongs to the latest series of Qwen large language models. Qwen2.5 brings the following improvements over Qwen2:

- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. It is also more resilient to diverse system prompts, which improves role-play implementation and condition-setting for chatbots.
- Long-context support of up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

Usage of this model is subject to [Tongyi Qianwen LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen1.5-110B-Chat/blob/main/LICENSE).

Qwen

131K

$0.35/M

$0.4/M

Free

Me

Meta: Llama 3.2 1B Instruct (free)

ID:meta-llama/llama-3.2-1b-instruct:free

Llama 3.2 1B is a 1-billion-parameter language model focused on efficiently performing natural language tasks, such as summarization, dialogue, and multilingual text analysis. Its smaller size allows it to operate efficiently in low-resource environments while maintaining strong task performance. Supporting eight core languages and fine-tunable for more, Llama 3.2 1B is ideal for businesses or developers seeking lightweight yet powerful AI solutions that can operate in diverse multilingual settings without the high computational demand of larger models. See the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD.md). Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).

Meta Llama

131K

Me

Meta: Llama 3.2 90B Vision Instruct

ID:meta-llama/llama-3.2-90b-vision-instruct

Llama 3.2 90B Vision is a top-tier, 90-billion-parameter multimodal model designed for the most challenging visual reasoning and language tasks. It offers unparalleled accuracy in image captioning, visual question answering, and advanced image-text comprehension. Pre-trained on vast multimodal datasets and fine-tuned with human feedback, Llama 3.2 90B Vision is engineered to handle the most demanding image-based AI tasks. This model is well suited to industries requiring cutting-edge multimodal AI capabilities, particularly those dealing with complex, real-time visual and textual analysis. See the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md). Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).

Meta Llama

131K

$0.35/M

$0.4/M