> Evaluating Small Language Models (<10B)
> Task completion within a 20-step cap
> Constrained hardware execution
| Rank # | Model Identifier | Size | Success % | Avg Steps |
|---|---|---|---|---|


- Evaluates whether a model can solve each task within a hard cap of 20 steps.
- Tests adherence to tool interfaces, as opposed to hallucinating invalid parameters.
- Analyzes error recovery and loop detection over short action sequences.
- Benchmarked on constrained, low-VRAM local consumer hardware.
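
The criteria above can be illustrated with a minimal sketch of an episode runner. This is not the benchmark's actual harness: the `TOOLS` registry, the `policy` and `is_done` callables, the `loop_window` threshold, and the choice to average steps over all episodes (rather than only successful ones) are assumptions made for illustration. Only the 20-step cap comes from the description.

```python
from dataclasses import dataclass
from typing import Any, Callable

MAX_STEPS = 20  # hard cap from the description above

# Hypothetical tool registry: tool name -> {parameter name: expected type}.
TOOLS: dict[str, dict[str, type]] = {
    "search": {"query": str},
    "read_file": {"path": str},
}


@dataclass
class EpisodeResult:
    success: bool
    steps: int
    invalid_calls: int
    loop_detected: bool


def is_valid_call(name: str, args: dict[str, Any]) -> bool:
    """True only if the tool exists and the arguments match its schema exactly."""
    schema = TOOLS.get(name)
    if schema is None or set(args) != set(schema):
        return False  # unknown tool, or missing/hallucinated parameter names
    return all(isinstance(args[key], typ) for key, typ in schema.items())


def run_episode(
    policy: Callable[[list], tuple[str, dict[str, Any]]],
    is_done: Callable[[list], bool],
    loop_window: int = 3,  # assumed threshold: N identical calls in a row
) -> EpisodeResult:
    """Run one task until success, a detected loop, or the step cap."""
    history: list[tuple] = []
    invalid = 0
    for step in range(1, MAX_STEPS + 1):
        name, args = policy(history)
        if not is_valid_call(name, args):
            invalid += 1  # counted, but the episode continues so recovery is possible
        call = (name, tuple(sorted(args.items())))
        history.append(call)
        if is_done(history):
            return EpisodeResult(True, step, invalid, False)
        recent = history[-loop_window:]
        if len(recent) == loop_window and all(c == call for c in recent):
            return EpisodeResult(False, step, invalid, True)
    return EpisodeResult(False, MAX_STEPS, invalid, False)


def summarize(results: list[EpisodeResult]) -> tuple[float, float]:
    """Return (success %, average steps) over all episodes (an assumed definition)."""
    n = len(results)
    success_pct = 100.0 * sum(r.success for r in results) / n
    avg_steps = sum(r.steps for r in results) / n
    return success_pct, avg_steps
```

In this sketch an invalid call does not end the episode, so a model that corrects itself on the next step can still succeed; a harness that treats any invalid call as a failure would be stricter.
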

Feel free to ping us if you have questions or want to discuss further.