Why the Leaderboard Results Are Not Accurate
The main issue isn’t the model; it’s how outcomes are reported.
Tasks skipped during aggregation hide a critical detail: the reported average covers fewer tasks than it appears to. This creates a false dip that misleads stakeholders.
Gotcha: skipped tasks aren’t just missing numbers; they’re biased defaults.
Context Matters
- skipna=True in the code isn’t a bug; it’s a deliberate safeguard against crashing on missing values.
- Tasks skipped are often low-confidence or poorly indexed, not truly absent.
- Example: a model that scores 57% on the tasks it actually completed can show up as 54% on the leaderboard, a three-point discrepancy large enough to reorder rankings.
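A minimal sketch of how skipna=True can shift a reported average. The task names and scores below are hypothetical, chosen only to make the arithmetic visible:

```python
import pandas as pd

# Hypothetical per-task accuracies; two tasks failed to run and are NaN.
scores = pd.Series(
    {"task_a": 0.90, "task_b": 0.60, "task_c": float("nan"), "task_d": float("nan")}
)

# skipna=True (the pandas default) silently drops the missing tasks,
# so the mean is taken over task_a and task_b only.
reported = scores.mean(skipna=True)    # 0.75

# Treating skipped tasks as failures tells a very different story.
penalized = scores.fillna(0.0).mean()  # 0.375

# Coverage-aware reporting would flag the gap before anyone compares models.
coverage = scores.notna().mean()       # 0.5 of tasks actually scored
```

The point is not that one of these means is "correct"; it is that the two defensible choices differ by almost 40 points, and skipna=True picks one of them silently.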
Why It Hides
- Invisible bias: Only top tasks count; low-ranked ones vanish.
- Version confusion: task definitions from old benchmark versions aren’t recognized in new runs, so their scores silently drop out.
- Skewed averages: a handful of skipped tasks can distort the mean as much as a rogue outlier would.
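The "version confusion" failure mode can be surfaced with a plain set comparison between the task lists of two runs. The run contents and task IDs below are hypothetical:

```python
# Hypothetical task lists from an old and a new benchmark version.
old_run_tasks = {"qa_suite_v1", "reasoning_easy", "summarization"}
new_run_tasks = {"qa_suite_v2", "reasoning_easy", "summarization"}

# Tasks the new harness no longer recognizes become NaN and get skipped.
dropped = old_run_tasks - new_run_tasks  # present before, missing now
added = new_run_tasks - old_run_tasks    # new tasks with no history

if dropped:
    print(f"warning: {sorted(dropped)} were scored in the old run but not the new one")
```

A check like this belongs before aggregation, because once skipna=True has run, the dropped tasks leave no trace in the final number.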
Here’s the catch
- Don't trust averages alone: The true metric is what's not counted.
- Reconstruct missing tasks: re-run or backfill skipped tasks so the average reflects real performance.
- Audit benchmarks: Check task coverage before conclusions.
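The audit step above can be sketched as a per-model coverage check run before any averaging. The table layout and the threshold are assumptions, not a real leaderboard schema:

```python
import pandas as pd

# Hypothetical results table: rows are models, columns are tasks, NaN = skipped.
results = pd.DataFrame(
    {
        "task_a": [0.90, 0.80],
        "task_b": [0.60, float("nan")],
        "task_c": [float("nan"), 0.70],
    },
    index=["model_x", "model_y"],
)

# Fraction of tasks each model was actually scored on.
coverage = results.notna().mean(axis=1)

# Only trust averages computed over (near-)complete coverage.
MIN_COVERAGE = 0.8  # assumed threshold for this sketch
trusted = coverage[coverage >= MIN_COVERAGE].index.tolist()
```

Here both models cover only two of three tasks, so neither clears the threshold and neither average should be compared as-is; that is exactly the situation skipna=True papers over.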
The Bottom Line
The accuracy gap isn’t in the model - it’s in how data is ingested. The benchmark values are only as clean as the inputs.
The results in the leaderboard are not accurate because of skipna=True. This matters: business decisions and research integrity rely on honest numbers, and benchmark integrity defines reliability.
This echoes findings from Sheryl et al.: "Incomplete data creates false narratives." As such, transparency is key. But there is a catch: automated fixes can’t replace human oversight.
The key takeaway: account for what’s skipped, not just what’s counted. Always validate source data; old datasets hide in plain sight.