Love this experiment. I’d be interested in running something like auto-researcher on this, to fine-tune and test more scenarios (prompts) and models.
Thanks! Yes, given the costs of running agents in loops, I'd love to see what auto-researcher finds. This may also be a way to get per-model tuning and self-healing on model updates. I looked a bit into quality scoring using Opus 4.6 as a judge, and ultraphilosopher looks strongest there too. But the real improvement would be a longer context plus real code tests (and blending the results with the code output). I love less chatter and faster responses from the models, so I use it by default everywhere now ;-)