The Underestimated "Intellect" of Large Models


The recent Christmas surprise from OpenAI extended an extra day beyond expectations, culminating in an announcement from CEO Sam Altman on the thirteenth day: during the holiday season, OpenAI would give all Plus users unlimited access to Sora, a privilege previously reserved for Pro subscribers paying $200 per month. The move marked a noteworthy shift toward democratizing access to the company's most advanced features.

In a flurry of activity over twelve days, OpenAI released a series of products, including the full version of its o1 model, an enhanced Sora, ChatGPT Search, the ability to call ChatGPT by phone, and the introduction of its new model, dubbed o3. This steady rollout underscores the unprecedented pace at which OpenAI is evolving its AI capabilities.

Among these announcements, the o3 model has drawn particular attention and excitement within the AI community.

Widely regarded as OpenAI's "secret weapon," o3 represents a significant evolution of the company's reasoning models, essentially the second generation after o1. To avoid a potential trademark dispute with the well-known UK telecom operator O2, OpenAI skipped the name "o2" and moved directly to o3.

Despite the anticipation surrounding o3, the development of GPT-5, another flagship OpenAI project, remains shrouded in uncertainty. Insiders revealed that after more than 18 months and substantial investment, the project has yet to deliver the anticipated results. The previously reliable scaling laws appear to be hitting their limits: merely increasing parameters and data volumes no longer translates into groundbreaking improvements.

As the approach of expanding the training scale of large AI models shows diminishing returns, OpenAI is placing its emphasis on the reasoning capabilities embodied in the o series.

This shift raises the question: could extending reasoning time and deepening computation offer a viable path for tackling complex problems?

There is growing recognition of the underestimated potential of enhanced reasoning capabilities. When OpenAI announced the o series, some experts speculated that building "thinking" abilities into large models could remove barriers to achieving Artificial General Intelligence (AGI). With o3 demonstrating considerable advances over its predecessor, the idea that extending reasoning time is an effective pathway for progress gains further validation.

Notably, Noam Brown, a prominent OpenAI researcher and a key scientist on the o1 team, recently expressed optimism about "reasoning time calculations" (what the field often calls test-time compute). This approach increases the computational effort spent during inference, allowing large models to analyze more deeply and solve more intricate problems.

Brown argues that while increasing pre-training scale has historically improved model performance, the accompanying costs are excessively high and ultimately lead to developmental bottlenecks.

The emergence of reasoning time calculations presents a novel strategy that could accelerate the pathway to AGI.

The o1 and o3 models epitomize this evolution: by applying more computation at reasoning time, they can autonomously learn strategies, decompose tasks, and identify and rectify errors, enabling deeper reasoning and stronger problem-solving.
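One simple way to see why more reasoning-time computation helps is a toy best-of-n sketch: draw several candidate answers and keep the one a verifier scores best. This is only an illustration of the general idea of test-time compute, not OpenAI's actual method; the functions `f`, `sample_candidate`, and `solve` below are hypothetical names for a made-up toy task.

```python
import random

def f(x):
    # Hypothetical toy task: find the root of f. The true root is x = 2.
    return 3.0 * x - 6.0

def sample_candidate(rng):
    # One "reasoning attempt": a random guess in [0, 4].
    return rng.uniform(0.0, 4.0)

def solve(n_samples, seed):
    # Best-of-n sampling: spend more inference-time compute by drawing
    # n candidate answers, then keep the one the verifier |f(x)| scores best.
    rng = random.Random(seed)
    return min((sample_candidate(rng) for _ in range(n_samples)),
               key=lambda x: abs(f(x)))

# With the same seed, the larger run includes the smaller run's candidates,
# so more test-time compute can only move the answer closer to the true root.
err_few = abs(solve(4, seed=0) - 2.0)
err_many = abs(solve(4096, seed=0) - 2.0)
print(f"4 samples: error {err_few:.3f}; 4096 samples: error {err_many:.3f}")
```

The point of the sketch is the trade-off it makes visible: accuracy improves not by training a bigger model, but by paying more compute per query at inference time, which is exactly the cost dynamic discussed below for o3.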

Reports suggest that OpenAI recognizes the limitations of relying solely on expanded pre-training and is actively pursuing reasoning time calculations as a promising alternative. Brown notes that the significance of this technology has been underappreciated and that it is still in an early developmental phase, leaving substantial room for improvement. As large models "ponder" for longer, they begin to exhibit abilities previously thought to require human intervention, such as trying multiple strategies, breaking complex problems into manageable components, and autonomously recognizing and correcting errors.


This rationale underpins researchers’ belief that reasoning time calculations could be fundamental to advancing toward AGI.

The advances that o3 brings are impressive, with benchmark results indicating a substantial leap over existing models. On a real-world software engineering evaluation, it reached 71.7% accuracy, surpassing o1 by more than 20 percentage points. In competitive programming, it achieved a rating of 2727, eclipsing the 2665 rating of OpenAI's own chief scientist, while o1 stood at 1891.

Mathematical reasoning has also improved markedly: o3 missed just a single question on the AIME, scoring 96.7%. On the more challenging GPQA Diamond benchmark, o3 reached 87.7% accuracy, versus roughly 70% for an average human expert. And on the cutting-edge EpochAI FrontierMath benchmark, o3 outperformed o1 roughly tenfold.

The industry was particularly astonished by o3's performance on the ARC-AGI test, created by prominent AI researcher François Chollet in 2019 and now viewed as a prestigious measure of an AI system's abstract reasoning and generalization ability.

In this assessment, o3 registered an accuracy rate of 75.7% under low computational configurations and reached 87.5% with higher computational power, surpassing the human average of 85%.

These figures, particularly the impressive showing on the ARC-AGI test, strongly suggest that AI has eclipsed average human performance at rapidly learning new rules and reasoning with them.

Nonetheless, some researchers and scientists maintain a critical, sobering perspective. In analyses of o3's test results, two facets have raised cautionary flags. First, the cost of operation is staggering: developers associated with ARC-AGI revealed that completing a single task with o3 in its high-compute configuration can cost as much as $3,400, a significant barrier in the commercial landscape.

Large model training is already considerably expensive; training GPT-4 reportedly cost over $100 million, while a single six-month training run for GPT-5 is anticipated to cost as much as $500 million in compute.

Secondly, o3 is not immune to simple reasoning errors. Chollet notes that while o3 adapts to tasks it has not previously encountered, approaching human-level scores does not equate to true AGI; the model still fails at some relatively straightforward tasks, illustrating a fundamental divergence from human intelligence.

Chollet's commentary echoes sentiments shared by many scientists in the field. For example, Ma Yi, a renowned scholar in machine vision and dean of the School of Computing and Data Science at the University of Hong Kong, argues that contemporary large models possess knowledge but lack true intelligence. As he puts it, "knowledge is the integral of intelligent activity, whereas intelligence is the differential of knowledge." GPT-4 may encompass a wealth of knowledge yet fall short on intelligence; conversely, a newborn possesses little knowledge, yet holds the potential to become the next Einstein.

One AI engineer working in Silicon Valley notes that despite o3's impressive benchmark numbers, it is still only the second iteration of OpenAI's reasoning models, and he cautions against over-glorification.

He emphasizes the limitations in the sample size of the tests, arguing that it's premature to conclude that o3’s capabilities have reached or surpassed the level of human experts.

As artificial intelligence continues its rapid evolution, especially with the swift iteration of large models, new and crucial questions await exploration. For example, following Google's release of the Willow quantum chip, Alibaba Cloud founder and academician Wang Jian posed a probing question: does "computation" in quantum settings mean the same thing as in classical computing? Similarly, should the intelligence o3 exhibits in programming and mathematical reasoning be equated, by definition, with human intelligence?

Moreover, the persistent problem of hallucination remains a significant challenge for the industry. Hallucination refers to a model producing outputs that appear plausible yet are internally contradictory or simply wrong, akin to confident fabrication in human behavior.

In critical sectors like finance, healthcare, and public safety, inaccurate model outputs can result in dire consequences if not verified by human oversight.

Recently, OpenAI's transcription tool Whisper has shown alarming hallucination rates: a researcher from the University of Michigan found hallucinations in eight out of ten audio transcriptions examined. One machine learning engineer reported hallucinations in around half of more than 100 hours of transcriptions he analyzed. In October of this year, media outlets reported that numerous physicians and healthcare institutions were using Whisper to transcribe doctor-patient consultations; over 30,000 clinicians across dozens of medical systems, including clinics in Minnesota and Los Angeles, use a tool built on Whisper that has transcribed an estimated 7 million medical encounters.

While OpenAI had previously announced strategies to mitigate hallucinations, notably process supervision to improve mathematical reasoning, the high rate of hallucinations in Whisper-based medical transcription still took the industry by surprise.

Hallucination embodies the stark contrast within large models: their knowledge capabilities advance at an astonishing rate, yet they still display noticeably deficient levels of intelligence.
