AutoDidact

Autonomously train research-agent LLMs on custom data using reinforcement learning and self-verification.

Introduction

Self-Bootstrapping with Llama-8B: The model generates meaningful question-answer pairs and trains itself to search effectively.
Autonomous Self-Verification: The Llama-8B model evaluates its answers, fostering a self-improving loop.
GRPO Reinforcement Learning: Uses Group Relative Policy Optimization to enhance research and reasoning capabilities.
Fully Autonomous Pipeline: All processes, including question generation and reinforcement learning, run locally with open-source models.
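
To make the "group relative" part of GRPO concrete, here is a minimal sketch of how rewards could be turned into advantages within one group of sampled answers. It is an illustration under stated assumptions, not the project's actual training code: the function name `group_relative_advantages`, the 0/1 reward scheme (1.0 when the self-verifier judges an answer correct), and the use of the population standard deviation are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of answers sampled for the same question.

    Each answer is scored relative to its siblings: answers better than the
    group average get a positive advantage, worse ones a negative advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four answers to one generated question,
# scored 1.0 if the self-verification step judged them correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# If every answer in a group receives the same reward, all advantages are zero,
# so that group contributes no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Because the baseline is the group mean rather than a learned value function, this style of advantage needs no separate critic model, which fits a pipeline that runs locally.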

Answering accuracy improves significantly with training, e.g., from 23% to 59% on a validation set.
Through training, the model learns to issue well-formed queries and to refine its searches effectively.

Built on Unsloth's Efficient GRPO code with enhancements for function calling and agentic loops.
Ideal for deploying models in research scenarios, especially with historical data or customized datasets.
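
The agentic loop with function calling mentioned above can be pictured roughly as follows. This is a hedged sketch only: the `<search>` and `<results>` tag format, the `run_agent` function, and the stub `fake_generate` and `fake_search` callables are all hypothetical names invented for illustration, and the real project may structure its tool calls differently.

```python
import re

# Assumption: the model requests a search by emitting <search>query</search>.
SEARCH_TAG = re.compile(r"<search>(.*?)</search>", re.DOTALL)


def run_agent(question, generate, search, max_turns=4):
    """Alternate between model generation and search until the model answers.

    `generate` maps the transcript so far to the model's next message.
    `search` maps a query string to a text snippet of results.
    Returns the final answer, or None if no answer arrives within max_turns.
    """
    transcript = f"Question: {question}\n"
    for _ in range(max_turns):
        reply = generate(transcript)
        transcript += reply + "\n"
        match = SEARCH_TAG.search(reply)
        if match is None:
            # No tool call in the reply, so treat it as the final answer.
            return reply.strip()
        results = search(match.group(1).strip())
        transcript += f"<results>{results}</results>\n"
    return None


# Stand-ins for the language model and the search backend (illustrative only).
def fake_search(query):
    corpus = {"apollo 13 launch date": "Apollo 13 launched on April 11, 1970."}
    return corpus.get(query.lower(), "no results")


def fake_generate(transcript):
    if "<results>" not in transcript:
        return "<search>Apollo 13 launch date</search>"
    return "April 11, 1970"


answer = run_agent("When did Apollo 13 launch?", fake_generate, fake_search)
```

Capping the loop with `max_turns` keeps a rollout bounded even when the model keeps searching without ever committing to an answer.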