R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
R1-Searcher is a project aimed at enhancing the search capabilities of large language models (LLMs) through a two-stage, outcome-supervised reinforcement learning approach. This method teaches models to invoke web search and use the retrieved results effectively during reasoning, addressing the limitations of LLMs on knowledge-intensive questions.
Key Features:
- Two-Stage Learning: The model first learns how to invoke external search, and then how to solve questions using the retrieved results (a rollout sketch follows this list).
- No Instruction Fine-Tuning Required: Works directly with existing Base or Instruct LLMs, without a supervised fine-tuning stage.
- Outcome-Supervised Reinforcement Learning: Performance gains come from the reward design and the reinforcement learning algorithm, relying on outcome-based rewards rather than process-level supervision.
- Diverse Training Data: Utilizes datasets like HotpotQA and 2WikiMultiHopQA for robust training and evaluation.
- Integration of Online Search: Incorporates online search capabilities to improve results, especially for recent knowledge.
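The search invocation loop can be pictured as follows. This is a minimal, illustrative sketch rather than the project's actual implementation: the tag names, the `generate` and `retrieve` helpers, and the stopping criterion are assumptions made for the example.

```python
# Illustrative sketch of a search-augmented rollout loop (not the project's
# actual code). Tag names and the generate()/retrieve() helpers are assumed.

QUERY_START, QUERY_END = "<begin_of_query>", "<end_of_query>"
DOCS_START, DOCS_END = "<begin_of_documents>", "<end_of_documents>"


def rollout(question: str, generate, retrieve, max_calls: int = 4) -> str:
    """Let the model reason, pausing whenever it emits a search query."""
    context = question
    for _ in range(max_calls):
        # Generate until the model either finishes or closes a query tag.
        completion = generate(context, stop=[QUERY_END])
        context += completion

        if QUERY_START not in completion:
            break  # no further search requested; the answer is complete

        # Extract the query the model asked for and run the external search.
        query = completion.split(QUERY_START)[-1].strip()
        docs = retrieve(query)

        # Feed the retrieved documents back so reasoning can continue.
        context += f"{QUERY_END}\n{DOCS_START}\n{docs}\n{DOCS_END}\n"
    return context
```

Pausing generation at the query tag lets an external search engine run between generation steps, which is what allows the same loop to work with either a local corpus or live online search results.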
Benefits:
- Improved Reasoning Performance: Achieves significant improvements over existing methods, even surpassing some closed-source models.
- Generalization Capabilities: Demonstrates exceptional performance across in-domain and out-of-domain datasets.
- Open Source: Provides access to training code, inference code, model checkpoints, and a detailed technical report for community use and further research.
Highlights:
- Strong results reported with both Qwen-2.5-7B-Base and LLaMA-3.1-8B-Instruct as backbone models.
- Utilizes a structured reward design to guide the learning process effectively (an illustrative sketch follows this list).
- Open-source resources available for researchers and developers to build upon this work.
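As an illustration of what an outcome-style structured reward can look like, here is a minimal sketch combining a format check with token-level answer F1. The exact reward terms, weights, and answer tags used by R1-Searcher may differ; treat this purely as an example.

```python
# Minimal sketch of an outcome-style reward: a format term (did the model
# produce a well-formed final answer?) plus an answer term (token-level F1
# against the gold answer). Tag names and weights are assumptions.
import re
from collections import Counter


def answer_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and the gold answer."""
    pred_tokens, gold_tokens = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def outcome_reward(rollout_text: str, gold: str) -> float:
    # Format term: reward a well-formed final answer span (assumed tags).
    match = re.search(r"<answer>(.*?)</answer>", rollout_text, re.DOTALL)
    format_reward = 0.5 if match else -0.5
    # Answer term: F1 between the extracted answer and the gold answer.
    answer_reward = answer_f1(match.group(1), gold) if match else 0.0
    return format_reward + answer_reward
```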