2025-09-07

The Nutrition Prediction Benchmark: Testing LLMs on Google Cafeteria Menus

John Leonardo

Can AI language models accurately predict the nutritional content of food just from ingredient lists? I created a simple benchmark to find out, testing models on real Google cafeteria dishes with surprising results.

Why does this matter? This benchmark tests something practical and important: whether AI models truly understand the relationship between food ingredients and their nutritional properties. It's not just about memorizing nutrition facts - it requires understanding portion sizes, cooking methods, and how different ingredients contribute to overall nutritional content.

The Experiment

The setup was straightforward. Using the Nutrition5k dataset from Google Research, I selected 10 random dishes from their cafeteria menus and asked 50 different AI models to:

Predict the exact nutritional content (calories, protein, carbs, fat) based solely on ingredient lists.

I used Google's dataset because it provides ground-truth nutritional values measured through comprehensive analysis. Each dish was filtered to have at least 3 ingredients and 100+ calories to ensure meaningful complexity, and the same 10 dishes were used across all models for a direct comparison.
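
The selection step looked roughly like the minimal sketch below. It assumes the Nutrition5k metadata has already been parsed into a `dishes` list of dicts; that structure and the seed value are illustrative assumptions, since the original script isn't shown here.

python
import random

# Sketch of dish selection. Assumes `dishes` is a list of dicts with
# "dish_id", "ingredients" (list of names), and ground-truth "calories",
# "protein", "carbs", "fat" -- the Nutrition5k parsing itself is omitted.
MIN_INGREDIENTS = 3
MIN_CALORIES = 100
NUM_DISHES = 10
SEED = 42  # fixed seed so every model sees the same dishes (value is illustrative)

eligible = [
    d for d in dishes
    if len(d["ingredients"]) >= MIN_INGREDIENTS and d["calories"] >= MIN_CALORIES
]

random.seed(SEED)
benchmark_dishes = random.sample(eligible, NUM_DISHES)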

Here's my evaluation approach:

python
prompt = f"""You are a nutrition expert API which will take a list of ingredients and output the following format:
{{"calories": <answer>, "protein": <answer>, "carbs": <answer>, "fat": <answer>}}
You will respond with no other text. Your response must be json parsable.
Ingredients: {ingredients}"""

# Calculate Mean Absolute Percentage Error (MAPE) for each nutrition field
for field in ["calories", "protein", "carbs", "fat"]:
    error = abs(predicted[field] - actual[field])
    mape = (error / actual[field] * 100) if actual[field] > 0 else 0

# Overall score: accuracy (60%) + correlation (40%)
accuracy_score = 100 / (1 + avg_mape)
correlation_score = avg_correlation * 100
overall_score = accuracy_score * 0.6 + correlation_score * 0.4

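The snippet above references avg_mape and avg_correlation without showing how they are aggregated. Below is one plausible reading of that step: per-field MAPEs averaged over all dishes, and a Pearson correlation between predicted and actual values pooled across dishes and fields. The `results` structure is an assumption, and the original script's exact aggregation may differ.

python
import numpy as np

# Sketch of the aggregation step (my reading of the scoring above, not the
# author's exact script). `results` is assumed to be a list of
# (predicted, actual) dict pairs, one pair per dish.
fields = ["calories", "protein", "carbs", "fat"]

mapes = []
preds, actuals = [], []
for predicted, actual in results:
    for field in fields:
        if actual[field] > 0:
            mapes.append(abs(predicted[field] - actual[field]) / actual[field] * 100)
        preds.append(predicted[field])
        actuals.append(actual[field])

avg_mape = float(np.mean(mapes))
avg_correlation = float(np.corrcoef(preds, actuals)[0, 1])  # Pearson r

accuracy_score = 100 / (1 + avg_mape)
correlation_score = avg_correlation * 100
overall_score = accuracy_score * 0.6 + correlation_score * 0.4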
The entire experiment cost $2.65, using OpenRouter for the model calls and parallel execution to speed up testing across all 50 models.
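
For readers who want to reproduce the setup, a call through OpenRouter's OpenAI-compatible chat completions endpoint looks roughly like this. The concurrency, model list, and lack of retries are simplified stand-ins for whatever the original harness did.

python
import json
import os
from concurrent.futures import ThreadPoolExecutor

import requests

# Minimal sketch: query one model via OpenRouter for a single dish's prompt.
# Error handling, retries, and rate limiting are omitted.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}

def predict_nutrition(model: str, prompt: str) -> dict:
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    resp = requests.post(OPENROUTER_URL, headers=HEADERS, json=body, timeout=120)
    resp.raise_for_status()
    content = resp.json()["choices"][0]["message"]["content"]
    return json.loads(content)  # the prompt demands bare JSON

# Fan out across models in parallel, one thread per request.
models = ["deepseek/deepseek-r1-0528", "openai/gpt-oss-120b"]  # ...and the rest
with ThreadPoolExecutor(max_workers=8) as pool:
    predictions = list(pool.map(lambda m: predict_nutrition(m, prompt), models))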

The Results

The results revealed significant differences in nutritional understanding across models:

The Full Leaderboard (tested on 10 dishes, 50 models, higher overall score = better)

| Rank | Model | Overall Score | Avg MAPE | Correlation | Cost |
|------|-------|---------------|----------|-------------|------|
| 1 | deepseek/deepseek-r1-0528 | 42.3 | 12.3% | 0.944 | $0.1341 |
| 2 | openai/gpt-oss-120b | 42.1 | 12.8% | 0.944 | $0.0155 |
| 3 | openai/gpt-5-mini | 41.7 | 13.9% | 0.943 | $0.0537 |
| 4 | google/gemini-2.5-pro | 41.2 | 15.8% | 0.941 | $0.3682 |
| 5 | openai/o4-mini | 40.7 | 17.8% | 0.938 | $0.1175 |
| 6 | openai/o3 | 40.6 | 18.6% | 0.940 | $0.2075 |
| 7 | openai/gpt-5 | 40.0 | 21.8% | 0.934 | $0.4361 |
| 8 | qwen/qwen3-235b-a22b-thinking-2507 | 39.8 | 20.4% | 0.925 | $0.0214 |
| 9 | openai/gpt-5-nano | 39.6 | 20.0% | 0.918 | $0.0186 |
| 10 | openai/gpt-oss-20b | 39.5 | 19.8% | 0.916 | $0.0039 |
| 11 | openai/gpt-4o-2024-11-20 | 39.3 | 30.4% | 0.935 | $0.0071 |
| 12 | x-ai/grok-3-mini | 39.2 | 23.5% | 0.919 | $0.1159 |
| 13 | openai/gpt-4.1-mini | 38.2 | 27.4% | 0.903 | $0.0011 |
| 14 | openai/gpt-5-chat | 38.0 | 25.2% | 0.893 | $0.0049 |
| 15 | google/gemini-2.0-flash-001 | 37.8 | 25.1% | 0.887 | $0.0003 |
| 16 | openai/gpt-4.1 | 37.6 | 32.0% | 0.894 | $0.0054 |
| 17 | x-ai/grok-4 | 37.5 | 22.3% | 0.874 | $0.7182 |
| 18 | openai/gpt-4o | 37.3 | 25.6% | 0.875 | $0.0071 |
| 19 | qwen/qwen3-235b-a22b-2507 | 36.5 | 34.6% | 0.870 | $0.0015 |
| 20 | qwen/qwen3-30b-a3b | 36.4 | 23.0% | 0.848 | $0.0167 |
| 21 | z-ai/glm-4.5 | 35.9 | 26.9% | 0.843 | $0.0674 |
| 22 | anthropic/claude-opus-4.1 | 35.6 | 35.5% | 0.850 | $0.0477 |
| 23 | mistralai/mistral-medium-3.1 | 35.5 | 53.8% | 0.860 | $0.0016 |
| 24 | openai/chatgpt-4o-latest | 35.3 | 30.2% | 0.835 | $0.0137 |
| 25 | mistralai/mistral-large-2411 | 34.8 | 42.7% | 0.835 | $0.0067 |
| 26 | anthropic/claude-opus-4 | 34.7 | 36.9% | 0.829 | $0.0481 |
| 27 | anthropic/claude-3.5-sonnet | 33.2 | 51.1% | 0.801 | $0.0100 |
| 28 | minimax/minimax-m1 | 32.8 | 30.9% | 0.772 | $0.1335 |
| 29 | anthropic/claude-3.5-haiku | 32.8 | 49.5% | 0.789 | $0.0026 |
| 30 | moonshotai/kimi-k2 | 32.7 | 35.6% | 0.776 | $0.0028 |
| 31 | tencent/hunyuan-a13b-instruct | 32.0 | 39.5% | 0.764 | $0.0014 |
| 32 | deepseek/deepseek-chat-v3-0324 | 31.8 | 60.9% | 0.772 | $0.0018 |
| 33 | x-ai/grok-3 | 31.7 | 37.5% | 0.753 | $0.0134 |
| 34 | anthropic/claude-sonnet-4 | 31.4 | 50.0% | 0.756 | $0.0095 |
| 35 | meta-llama/llama-3.3-70b-instruct | 31.4 | 70.3% | 0.764 | $0.0018 |
| 36 | anthropic/claude-3.7-sonnet | 31.0 | 53.4% | 0.747 | $0.0094 |
| 37 | microsoft/mai-ds-r1 | 30.4 | 31.0% | 0.713 | $0.0128 |
| 38 | google/gemini-2.5-flash | 30.2 | 66.7% | 0.733 | $0.0012 |
| 39 | nousresearch/hermes-3-llama-3.1-70b | 30.2 | 50.9% | 0.725 | $0.0003 |
| 40 | mistralai/mistral-small-3.2-24b-instruct | 29.3 | 37.3% | 0.693 | $0.0009 |
| 41 | meta-llama/llama-4-maverick | 28.7 | 50.6% | 0.689 | $0.0005 |
| 42 | microsoft/phi-4-reasoning-plus | 27.5 | 114.7% | 0.675 | $0.0016 |
| 43 | amazon/nova-pro-v1 | 27.2 | 117.9% | 0.667 | $0.0025 |
| 44 | mistralai/mistral-nemo | 26.8 | 71.3% | 0.648 | $0.0001 |
| 45 | amazon/nova-micro-v1 | 25.1 | 97.7% | 0.613 | $0.0001 |
| 46 | openai/gpt-3.5-turbo | 24.7 | 106.7% | 0.604 | $0.0013 |
| 47 | inception/mercury | 24.1 | 95.6% | 0.586 | $0.0005 |
| 48 | meta-llama/llama-4-scout | 23.1 | 104.3% | 0.562 | $0.0003 |
| 49 | liquid/lfm-7b | 22.1 | 131.6% | 0.541 | $0.0000 |
| 50 | amazon/nova-lite-v1 | 20.4 | 250.4% | 0.504 | $0.0002 |

Detailed Performance Breakdown

The winner is deepseek/deepseek-r1-0528, narrowly edging out the competition with an overall score of 42.3 and exceptional accuracy, particularly for calories:

  • Calories: 5.9% MAPE (exceptional)
  • Protein: 13.1% MAPE (very good)
  • Carbs: 14.1% MAPE (good)
  • Fat: 16.1% MAPE (good)
  • Correlation: 0.944 (outstanding)

Close second place goes to openai/gpt-oss-120b with a score of 42.1, showing balanced performance across all nutrition fields:

  • Calories: 12.8% MAPE (excellent)
  • Protein: 15.8% MAPE (very good)
  • Carbs: 11.9% MAPE (excellent)
  • Fat: 10.8% MAPE (excellent)
  • Correlation: 0.944 (outstanding)

A strong third place goes to openai/gpt-5-mini with a score of 41.7, demonstrating impressive capability for a "mini" model:

  • Calories: 7.8% MAPE (excellent)
  • Protein: 12.5% MAPE (very good)
  • Carbs: 20.1% MAPE (acceptable)
  • Fat: 15.3% MAPE (good)
  • Correlation: 0.943 (outstanding)

Google's flagship google/gemini-2.5-pro takes fourth place with a score of 41.2, showing solid performance but struggling with carbs:

  • Calories: 7.7% MAPE (excellent)
  • Protein: 13.1% MAPE (very good)
  • Carbs: 25.0% MAPE (concerning)
  • Fat: 17.3% MAPE (good)
  • Correlation: 0.941 (outstanding)

What This Reveals About AI Models

1. Specialized Knowledge vs. General Intelligence

The results reveal fascinating disconnects from general performance rankings. deepseek/deepseek-r1-0528 dominated despite being less prominent than flagship models like openai/gpt-5 or anthropic/claude-opus-4.1 (which ranked 22nd). Meanwhile, the compact openai/gpt-5-mini outperformed the full openai/gpt-5, and several Chinese models showed exceptional nutritional reasoning. This suggests that nutritional prediction requires specific knowledge about food science that doesn't correlate with general language capabilities.

2. The Carbohydrate Challenge

A striking pattern emerged across the 50 models: carbohydrate prediction was consistently the most challenging nutrition field. Even top performers like google/gemini-2.5-pro showed 25.0% MAPE for carbs while achieving 7.7% for calories. Several Claude models exhibited catastrophic carb prediction failures, with anthropic/claude-3.5-haiku reaching 114.4% MAPE for carbs while maintaining reasonable performance on other nutrients.

This carbohydrate challenge is particularly interesting because:

  • Carbs often come from multiple sources in complex dishes
  • Cooking methods significantly affect carbohydrate content (starch gelatinization, etc.)
  • Portion estimation is critical for accuracy
  • Complex carbohydrates vs. simple sugars require nuanced understanding

The fact that even sophisticated models consistently struggle with carbs suggests this requires deeper understanding of food chemistry and preparation methods than other macronutrients.

3. Cost-Performance Efficiency

The cost-performance analysis reveals striking insights across 50 models. google/gemini-2.0-flash-001 achieved 15th place at just $0.0003 - incredible value. Meanwhile, the winner deepseek/deepseek-r1-0528 cost $0.1341, while second-place openai/gpt-oss-120b cost only $0.0155 for nearly identical performance. The most expensive model, x-ai/grok-4 at $0.7182, ranked a disappointing 17th, highlighting that higher cost doesn't guarantee better nutritional reasoning.

4. Consistency in Nutritional Reasoning

Models that performed well showed consistent accuracy across all nutrition fields, while struggling models had erratic performance. This suggests that effective nutritional prediction requires systematic understanding rather than field-specific memorization.
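
One simple way to quantify this consistency (not something the benchmark itself reports, just an illustrative check) is the spread of a model's per-field MAPEs: a low standard deviation across calories, protein, carbs, and fat points to systematic understanding, while a high one flags field-specific blind spots.

python
import statistics

# Hypothetical consistency check. The values here are the deepseek-r1-0528
# per-field breakdown reported above.
per_field_mape = {"calories": 5.9, "protein": 13.1, "carbs": 14.1, "fat": 16.1}

spread = statistics.pstdev(per_field_mape.values())  # low spread = consistent
print(f"MAPE spread across fields: {spread:.1f} points")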

The Methodology's Strengths and Limitations

Strengths:

  1. Real-world applicability: Uses actual Google cafeteria dishes with measured nutritional values
  2. Cost-effective: The entire benchmark runs for under $3
  3. Comprehensive evaluation: Tests multiple nutrition dimensions and includes correlation analysis
  4. Reproducible: Fixed seed ensures consistent dish selection across models

Limitations:

  1. Limited sample size: Only 10 dishes, though carefully selected for complexity
  2. Ingredient-only input: Doesn't include portion sizes, cooking methods, or visual information
  3. English-only: Tests only English ingredient descriptions
  4. Single domain: Focused on cafeteria-style prepared foods

Future Research Directions

This benchmark could be expanded in several compelling ways:

  • Multi-modal input: Include food images alongside ingredient lists to test visual nutritional reasoning
  • Portion awareness: Test models' ability to adjust predictions based on serving size information
  • Cooking method sensitivity: Evaluate how well models account for preparation techniques (grilled vs. fried)
  • Cultural cuisine diversity: Extend beyond American cafeteria food to international dishes
  • Macro vs. micro nutrients: Test prediction of vitamins, minerals, and other micronutrients
  • Temporal consistency: Run the same benchmark over time to track model improvements

The Practical Implications

Unlike more abstract benchmarks, nutrition prediction has immediate real-world applications:

  • Health apps: Accurate nutrition tracking from photos or ingredient lists
  • Food service: Automated nutritional labeling for restaurants and cafeterias
  • Medical applications: Dietary planning for patients with specific nutritional needs
  • Food development: Optimizing recipes for target nutritional profiles

The fact that deepseek-r1-0528 achieved 12.3% average MAPE (87.7% accuracy) suggests we're approaching the threshold where AI nutrition prediction could be practically useful for many applications.

A Note on Dataset Quality

Using Google's Nutrition5k dataset provides high-quality ground truth, but it's worth noting that:

  1. Measurement precision: The dataset uses sophisticated analysis techniques, making it more reliable than self-reported or estimated nutritional data
  2. Domain specificity: Google cafeteria dishes may not represent broader food categories
  3. Temporal stability: Nutritional content can vary based on ingredient sources, seasonality, and preparation variations

The Broader Point

This experiment reinforces a key insight: specialized benchmarks can reveal capabilities that general performance metrics miss.

While anthropic/claude-opus-4.1 might excel at reasoning tasks or creative writing, it ranks 22nd here, struggling with the specific domain knowledge required for nutritional prediction. Conversely, deepseek/deepseek-r1-0528's dominance suggests it has internalized food science relationships that don't necessarily translate to other domains.

The $2.65 cost of testing 50 models makes this benchmark feasible to run regularly, tracking how models improve on practical, domain-specific reasoning over time. As AI systems become more capable, we need more benchmarks like this that test real-world applications rather than abstract reasoning abilities.

The fact that we can achieve 87.7% accuracy (100% - 12.3% MAPE) on nutritional prediction from ingredient lists alone suggests we're closer to practical AI nutrition applications than many might expect.

#ai #llm #benchmark #nutrition #google #experiment