SLM Agent Leaderboard

> Evaluating Small Language Models (<10B parameters)
> Task completion within a 20-step budget
> Execution on constrained hardware

Rank | Model Identifier | Size | Success % | Avg Steps
Status: Active. For Avg Steps, lower is more efficient.
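The page does not say exactly how the Success % and Avg Steps columns are aggregated. The following is a minimal sketch of one plausible computation from per-episode results. `EpisodeResult` and `summarize` are hypothetical names. Whether Avg Steps covers only successful episodes or all episodes is an assumption, so it is exposed as the `solved_only` flag.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EpisodeResult:
    success: bool  # task completed within the step budget
    steps: int     # actions taken before the episode ended


def summarize(results: list[EpisodeResult], solved_only: bool = True) -> tuple[float, float]:
    """Return (success %, average steps) for one model's episodes.

    Assumption: by default, Avg Steps is averaged over successful episodes
    only; set solved_only=False to average over every episode instead.
    """
    if not results:
        raise ValueError("no episodes to summarize")
    success_pct = 100.0 * sum(r.success for r in results) / len(results)
    pool = [r.steps for r in results if r.success or not solved_only]
    avg_steps = float(mean(pool)) if pool else float("nan")
    return success_pct, avg_steps


results = [EpisodeResult(True, 8), EpisodeResult(True, 12),
           EpisodeResult(False, 20), EpisodeResult(False, 20)]
print(summarize(results))                     # (50.0, 10.0)
print(summarize(results, solved_only=False))  # (50.0, 15.0)
```

Averaging only over solved episodes keeps failures, which always consume the full budget, from inflating the efficiency figure. The trade-off is that a model solving a single easy task can look very efficient.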
Metric_01

Action Efficiency

Evaluates the model's ability to complete tasks within a hard cap of 20 steps.
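A hard step cap can be enforced with a simple episode loop. This is a minimal sketch under stated assumptions. The `Env` protocol, `run_episode`, and the toy `CountdownEnv` are hypothetical and are not TextWorld's actual API. The sketch treats running out of budget as a failure.

```python
from typing import Callable, Protocol

STEP_BUDGET = 20  # hard cap on actions per episode


class Env(Protocol):
    """Hypothetical minimal environment interface (not TextWorld's actual API)."""

    def reset(self) -> str: ...                           # returns the first observation
    def step(self, action: str) -> tuple[str, bool]: ...  # returns (observation, won)


def run_episode(env: Env, policy: Callable[[str], str],
                budget: int = STEP_BUDGET) -> tuple[bool, int]:
    """Run one episode; return (success, steps_used).

    Running out of budget counts as a failure at `budget` steps.
    """
    obs = env.reset()
    for step in range(1, budget + 1):
        obs, won = env.step(policy(obs))
        if won:
            return True, step
    return False, budget


class CountdownEnv:
    """Toy environment: the task is won after `needed` "go" actions."""

    def __init__(self, needed: int) -> None:
        self.needed = needed
        self.count = 0

    def reset(self) -> str:
        self.count = 0
        return "start"

    def step(self, action: str) -> tuple[str, bool]:
        if action == "go":
            self.count += 1
        return f"count={self.count}", self.count >= self.needed


def always_go(obs: str) -> str:
    return "go"


print(run_episode(CountdownEnv(5), always_go))   # (True, 5)
print(run_episode(CountdownEnv(25), always_go))  # (False, 20)
```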

Metric_02

API Reliability

Tests whether the model adheres to tool interfaces rather than hallucinating invalid parameters.
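One way to score tool-interface adherence is to check each call against a declared schema. This sketch flags unknown tools, hallucinated (extra) parameters, missing parameters, and wrong types. The JSON call format, the `TOOLS` registry, and `validate_call` are all hypothetical illustrations. The leaderboard's actual tool interface is not described here.

```python
import json

# Hypothetical tool registry: tool name -> {parameter name: expected type}.
TOOLS = {
    "go": {"direction": str},
    "take": {"item": str},
    "put": {"item": str, "container": str},
}


def validate_call(raw: str) -> tuple[bool, str]:
    """Check a JSON tool call such as {"tool": "take", "args": {"item": "key"}}."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(call, dict):
        return False, "call must be a JSON object"
    tool = call.get("tool")
    if not isinstance(tool, str) or tool not in TOOLS:
        return False, f"unknown tool: {tool!r}"
    args = call.get("args", {})
    if not isinstance(args, dict):
        return False, "args must be a JSON object"
    schema = TOOLS[tool]
    extra = sorted(set(args) - set(schema))
    if extra:
        return False, f"hallucinated parameters: {extra}"
    missing = sorted(set(schema) - set(args))
    if missing:
        return False, f"missing parameters: {missing}"
    for name, expected in schema.items():
        if not isinstance(args[name], expected):
            return False, f"wrong type for {name!r}"
    return True, "ok"


print(validate_call('{"tool": "take", "args": {"item": "key"}}'))
# (True, 'ok')
print(validate_call('{"tool": "take", "args": {"item": "key", "force": true}}'))
# (False, "hallucinated parameters: ['force']")
```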

Metric_03

Planning

Analyzes how the model recovers from errors and whether it detects and escapes action loops within short action sequences.
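The page does not specify how loops are detected. One simple heuristic is to check whether the most recent actions form a short cycle repeated several times, such as "go north, go south, go north, go south". The sketch below implements that heuristic. `repeating_tail` and its thresholds (`max_period`, `min_cycles`) are hypothetical choices, not the leaderboard's method.

```python
def repeating_tail(actions: list[str], max_period: int = 3, min_cycles: int = 3) -> int:
    """Return the cycle length p if the last p * min_cycles actions repeat
    a p-action cycle (checking p = 1..max_period), else 0."""
    for p in range(1, max_period + 1):
        window = p * min_cycles
        if len(actions) < window:
            break  # longer periods need even longer windows
        tail = actions[-window:]
        if all(tail[i] == tail[i % p] for i in range(window)):
            return p
    return 0


history = ["look", "go north", "go south", "go north", "go south", "go north", "go south"]
print(repeating_tail(history))  # 2
```

Within a 20-step budget, a three-cycle threshold flags a loop early enough that a recovering model still has steps left to change course.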

Metric_04

Hardware Reality

All models are run locally on low-VRAM consumer hardware.

System Configuration

Environment: TextWorld
Sample Size: 30 episodes per model
Constraint: 20-step budget
Validation: strict regex matching
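Strict regex validation means the model's whole output must match an allowed command pattern, with no surrounding prose. This minimal sketch illustrates the idea with `re.fullmatch`. The command grammar in `ACTION_PATTERNS` and the `parse_action` helper are hypothetical. The leaderboard's actual patterns are not published here.

```python
import re
from typing import Optional

# Hypothetical command grammar; the leaderboard's actual patterns may differ.
ACTION_PATTERNS = [
    re.compile(r"go (north|south|east|west)"),
    re.compile(r"(take|drop|examine|open|close) [a-z][a-z ]*"),
    re.compile(r"put [a-z][a-z ]* (in|on) [a-z][a-z ]*"),
    re.compile(r"look|inventory"),
]


def parse_action(model_output: str) -> Optional[str]:
    """Return the command if the whole output matches an allowed pattern, else None.

    Only surrounding whitespace is forgiven; any extra prose, punctuation,
    or capitalization makes the output invalid.
    """
    candidate = model_output.strip()
    for pattern in ACTION_PATTERNS:
        if pattern.fullmatch(candidate):
            return candidate
    return None


print(parse_action("go north"))                    # go north
print(parse_action("I think I should go north."))  # None
```

Using `fullmatch` rather than `search` is what makes the check strict. A model that wraps a valid command in explanation gets no credit for that step.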

Contribute

Feel free to reach out if you'd like to discuss the benchmark or have questions.

Open Issue ->