News

The dataset—called HealthBench—is OpenAI's first major independent health care project. It includes 5,000 "realistic health conversations," each with detailed grading tools to evaluate AI ...
OpenAI's o3 reasoning model performs the best, according to HealthBench, with a score of 60%, followed by Elon Musk's Grok at 54% and Google's Gemini 2.5 Pro at 52%.
OpenAI has announced the launch of HealthBench, a benchmark to evaluate AI models in healthcare using real-world applicability and physician judgment. "The 5,000 conversations in HealthBench simulate ...
Experts say it improves AI evaluation but warn that more review is needed. TUESDAY, May 13, 2025 (HealthDay News) -- OpenAI has unveiled a large dataset to help test how well artificial intelligence (AI ...