| Agent | Size | Overall SR | Overall ESAR | SR %: Easy | SR %: Medium | SR %: Hard | SR %: Operation | SR %: Inf. Query | ESAR %: Easy | ESAR %: Medium | ESAR %: Hard | ESAR %: Operation | ESAR %: Inf. Query |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| T3A + Gemini-2.5-pro | - | 53.0 | 66.4 | 57.1 | 55.0 | 44.0 | 55.6 | 46.6 | 68.2 | 67.9 | 62.5 | 72.9 | 50.5 |
| InfiGUI-R1 | 7B | 27.0 | 52.1 | 34.3 | 30.0 | 12.0 | 34.7 | 7.1 | 55.3 | 50.8 | 51.0 | 54.6 | 46.2 |
| UI-Venus | 7B | 20.0 | 32.0 | 28.5 | 15.0 | 16.0 | 20.8 | 17.8 | 41.2 | 29.7 | 27.1 | 33.9 | 27.5 |
| Qwen3-VL | 7B | 17.0 | 38.2 | 34.3 | 12.5 | 0.0 | 20.8 | 7.1 | 50.6 | 35.2 | 31.3 | 44.0 | 24.2 |
| Mobile-Use + Qwen2.5-VL | 7B | 16.0 | 39.5 | 34.3 | 10.0 | 0.0 | 22.2 | 0.0 | 48.2 | 39.9 | 31.3 | 43.1 | 30.8 |
| T3A + Qwen2.5-VL | 7B | 15.0 | 30.7 | 31.4 | 10.0 | 4.0 | 19.4 | 3.8 | 37.6 | 28.1 | 28.1 | 29.4 | 34.1 |
| GUI-OWL | 7B | 14.0 | 32.0 | 31.4 | 7.5 | 0.0 | 18.1 | 3.6 | 49.4 | 26.6 | 23.9 | 35.8 | 23.1 |
| UI-Genie | 7B | 13.0 | 32.1 | 25.8 | 10.0 | 0.0 | 16.7 | 3.6 | 41.2 | 25.8 | 32.3 | 34.9 | 25.3 |
| UI-TARS-1.5 | 7B | 12.0 | 28.2 | 20.0 | 10.0 | 4.0 | 13.9 | 7.1 | 35.3 | 21.8 | 23.9 | 31.2 | 20.9 |
| Qwen2.5-VL | 7B | 3.0 | 14.2 | 5.7 | 0.0 | 4.0 | 4.2 | 0.0 | 10.5 | 10.9 | 21.8 | 15.6 | 11.0 |
Benchmark results evaluated by our A3RM. We report Task Success Rate (SR) and Essential State Achieved Rate (ESAR), both in percent, across task categories and difficulties. Rows are sorted by Overall SR in descending order.