TL;DR (triage order that saves time)
The fastest path: first confirm whether the failure is provider-side, then check for a runaway loop, then reduce context size.
Do not change models/providers until you can reproduce the failure with the same inputs.
- Step 1: check provider quota and error codes (429, 5xx)
- Step 2: check for runaway tool loops (retries, browsing, parallelism)
- Step 3: cap scope (budgets and stop rules)
- Step 4: shrink inputs (move large context into files and reference them)
- Step 5: only then adjust providers/models
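
Steps 1 to 3 above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `classify_status`, `RunBudget`, and the default limits (20 tool calls, 3 retries) are hypothetical names and values chosen for the example, so tune them to your own setup. Note that a 429 usually points at quota or rate limiting, but a runaway loop on your side can trigger it too, which is why Step 2 follows Step 1.

```python
from dataclasses import dataclass


def classify_status(status: int) -> str:
    """Map an HTTP status code to a rough triage bucket (Step 1)."""
    if status == 429:
        return "rate_limited"    # quota or rate limit: check provider quota first
    if 500 <= status <= 599:
        return "provider_error"  # 5xx: likely provider-side
    if 400 <= status <= 499:
        return "client_error"    # other 4xx: inspect the request itself
    return "ok_or_other"


@dataclass
class RunBudget:
    """Hypothetical stop rule (Steps 2-3): cap tool calls and retries per run."""
    max_tool_calls: int = 20  # assumed default, not a recommendation
    max_retries: int = 3      # assumed default, not a recommendation
    tool_calls: int = 0
    retries: int = 0

    def record_tool_call(self) -> None:
        self.tool_calls += 1

    def record_retry(self) -> None:
        self.retries += 1

    def should_stop(self) -> bool:
        # Stop as soon as either budget is exhausted.
        return (
            self.tool_calls >= self.max_tool_calls
            or self.retries >= self.max_retries
        )
```

In use, you would call `record_tool_call()` or `record_retry()` inside the agent loop and break out when `should_stop()` returns true, logging the counters so the failure can be reproduced with the same inputs.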