fishchar explores how GPT-5 fares against Opus 4.1 and Sonnet 4 in GitHub Copilot, analyzing technical evaluation details, token economics, and the significance of future Copilot model changes.

Comparing GPT-5 and Opus 4.1 in GitHub Copilot

Author: fishchar

This community discussion assesses how GPT-5 compares with Opus 4.1 and Sonnet 4 on coding ability and cost-effectiveness within GitHub Copilot Pro.

Key Points

Price and Token Efficiency: The post credits GPT-5 with coding performance similar to Opus 4.1 at a much lower price, citing output tokens that cost about 7.5x less and input tokens that cost about 10x less. For token-billed usage, that points to significant savings.
Potential for Unlimited Usage: There’s speculation about whether GitHub Copilot Pro might replace GPT-4.1 with GPT-5 as the model behind unlimited requests. If so, and the coding quality holds up, it would be a substantial upgrade for developers.
Model Base vs. Premium Pricing: Debate in the post highlights confusion about whether GPT-5 is a base or premium model and how this affects usage costs in Copilot’s billing model.
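The price ratios quoted in the post can be turned into a rough per-request comparison. The sketch below is illustrative only: the `Pricing` class, the function names, and the price values are hypothetical placeholders chosen to reproduce the 10x input and 7.5x output ratios, not actual list prices for either model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Price per one million tokens, in arbitrary currency units."""
    input_per_mtok: float
    output_per_mtok: float


# ASSUMPTION: placeholder prices, picked only so that the ratios match
# the post (input 10x, output 7.5x). They are not real list prices.
BASELINE = Pricing(input_per_mtok=10.0, output_per_mtok=7.5)
CHEAPER = Pricing(input_per_mtok=1.0, output_per_mtok=1.0)


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request given its input and output token counts."""
    total = (input_tokens * pricing.input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return total / 1_000_000


def savings_ratio(input_tokens: int, output_tokens: int) -> float:
    """How many times cheaper the same request is under CHEAPER pricing."""
    return (request_cost(BASELINE, input_tokens, output_tokens)
            / request_cost(CHEAPER, input_tokens, output_tokens))


if __name__ == "__main__":
    # A prompt-heavy request leans toward the input ratio,
    # an output-heavy one toward the output ratio.
    print(savings_ratio(input_tokens=20_000, output_tokens=1_000))
    print(savings_ratio(input_tokens=1_000, output_tokens=20_000))
```

The point of the sketch is that the overall saving depends on the mix of input and output tokens: under these assumptions it always falls between 7.5x and 10x, closer to 10x for prompt-heavy requests.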

Technical Evaluation

The post references findings from a technical paper stating:

“All SWE-bench evaluation runs use a fixed subset of n=477 verified tasks … Our primary metric is pass@1 … the model must implement its change without knowing the correct tests ahead of time.”
Commenters note that because the score is reported on a fixed subset rather than the full benchmark, the omitted tasks may have been cherry-picked out, which could paint an optimistic picture of model accuracy.
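To make the concern concrete, here is a minimal sketch of how pass@1 with a single attempt per task reduces to the fraction of tasks solved, and how the figure shifts if omitted tasks are counted as failures. The function names and the example numbers are hypothetical; the quoted paper does not describe its scoring code, and counting omitted tasks as failures is only one possible worst-case reading.

```python
from typing import Sequence


def pass_at_1(results: Sequence[bool]) -> float:
    """Fraction of evaluated tasks solved on the first (and only) attempt."""
    if not results:
        raise ValueError("need at least one task result")
    return sum(results) / len(results)


def conservative_pass_at_1(results: Sequence[bool], total_tasks: int) -> float:
    """Same metric, but every task left out of the run counts as a failure.

    ASSUMPTION: total_tasks is the size of the full benchmark, supplied
    by the caller; it is not taken from the quoted paper.
    """
    if total_tasks < len(results):
        raise ValueError("total_tasks cannot be smaller than the evaluated subset")
    return sum(results) / total_tasks


if __name__ == "__main__":
    # Hypothetical run: 8 tasks evaluated, 6 solved, 2 more tasks omitted.
    evaluated = [True] * 6 + [False] * 2
    print(pass_at_1(evaluated))
    print(conservative_pass_at_1(evaluated, total_tasks=10))
```

The gap between the two numbers is an upper bound on how much subset selection could inflate the score: the true full-benchmark result lies somewhere between them, depending on how the omitted tasks would actually have gone.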

UI and Coding Performance

Commenters praise both GPT-5 and Opus 4.1 for handling UI work well and for hallucinating less, that is, producing incorrect or invented code less often.
Comparisons to Sonnet 4 add further context to the performance landscape among current large language models.

Takeaways

If Copilot Pro adopts GPT-5 for unlimited requests and the model’s coding quality holds up, it would mark a significant improvement for developers, both technically and economically.
Discussions on evaluation approaches, model base/premium distinctions, and real-world performance (including potential cherry-picking of test results) provide necessary skepticism and context for these claims.

This post appeared first on “Reddit Github Copilot”.