Dorota Jasińska
Paweł Scheffler
The field of large language models is constantly evolving. New models with complex capabilities emerge rapidly, and their architectures are continually refined. It seems we’ll deal with even more powerful and versatile models in the future. Not that long ago, conversing with a chatbot was just a theory. Now, chatbots can not only hold conversations but also generate images and perform complex tasks.
Recently, Anthropic announced a new family of large language models, Claude 3.0, which includes Haiku, Sonnet, and Opus. Each model has different capabilities and uses, and the benchmark tests suggest they surpass the abilities of other models in some fields.
The models we already know and use, including Gemini and ChatGPT, have been available to the public for some time. ChatGPT, created by OpenAI, is capable of creating text with exceptional fluency in various languages. It’s often used for content creation, translation, or summarization. Gemini models from Google AI focus on information retrieval and creating responses based on knowledge the bot has access to. They can help in research or answer complex questions. Now, Anthropic has introduced the Claude 3 models, which are considered very versatile and are expected to excel at data extraction.
As LLMs keep developing, the capabilities of chatbots improve with time.
Overview of Claude 3.0 models
According to Anthropic, all Claude 3 models have increased capabilities in analysis and forecasting, content and code creation, and conversations in other languages. Moreover, the most intelligent model, Opus, outperformed other LLMs on most AI system benchmarks, such as undergraduate-level expert knowledge (MMLU), graduate-level expert reasoning (GPQA), basic mathematics (GSM8K), and more.
Claude 3 Opus is the most intelligent of the trio. It performs best on highly complex tasks and can handle open-ended prompts with remarkable fluency and human-like understanding. It could be implemented for task automation, research and development, and business strategy.
Claude 3 Sonnet balances intelligence and speed. It can deliver strong performance at a competitive price. Its abilities could be utilized for data processing, sales tasks such as product recommendations and targeted marketing, code generation, or parsing text from images.
Claude 3 Haiku is the fastest model and offers near-instant responses. It can answer simple queries and requests very quickly. That’s why it could be implemented in customer interactions to provide quick support, in content moderation to detect risky behavior, or in cost-saving tasks such as managing inventory or extracting knowledge from unstructured data.
Claude 3.0 – how is it innovative?
Anthropic announced the new model family alongside a comparison with other large language models. The family includes Haiku, Sonnet, and Opus, each offering a different balance of intelligence, speed, and cost depending on its intended use.
For a while, Claude’s availability was limited to specific locations. In May 2024, Anthropic announced that Claude is now available for people and businesses across Europe. Users from Europe got access to a free web-based version, the Claude iOS app, and the Claude Team plan.
The innovation behind the Claude 3 models involves a few areas. Starting with near-instant results, the models can be used to power live chats, auto-completions, and real-time data extraction. Sonnet is said to be 2x faster than Claude 2 and 2.1 for the majority of workloads. Opus, on the other hand, has a similar speed to Claude 2 and 2.1 but offers a higher intelligence level.
Another aspect worth mentioning is strong vision capabilities. The models can process many different visual formats, including photos, graphs, and other diagrams. This could be very useful for users who work with PDFs and presentations. Sonnet has the highest score for this use on the AI2D benchmark (88.7%).
Past Claude models tended to refuse requests when they lacked context. Compared with Claude 2, the new models are less likely to refuse prompts that border on the system’s guardrails. They should also be better at recognizing genuinely harmful requests. So, we should expect fewer refusals to answer harmless prompts.
Accuracy matters to clients, so this is another area that’s been improved. When tested on a complex set of questions, Claude 3 proved to be more accurate and gave more correct answers to challenging open-ended questions. Anthropic also says it is working on citations, so that the models can quote the reference material to back up their answers.
According to Anthropic, Opus surpassed 99% accuracy in the Needle In A Haystack (NIAH) evaluation, which measures a model’s ability to recall information from a large corpus of data. That amounts to near-perfect recall. In some cases, the model even recognized that the sentence it was asked to find had been artificially inserted into the original text by a human.
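To make the idea behind a needle-in-a-haystack test more concrete, here is a minimal sketch in Swift. It is only an illustration, not Anthropic’s evaluation harness: the helper names, the filler text, the needle sentence, and the scoring rule are all assumptions, and `standInModel` is a trivial placeholder where a real test would call an LLM.

```swift
import Foundation

// Builds a long "haystack" of filler sentences and inserts one "needle"
// sentence at a relative depth (0.0 = start of the text, 1.0 = end).
func makeHaystack(filler: String, count: Int, needle: String, depth: Double) -> String {
    var sentences = Array(repeating: filler, count: count)
    let position = min(count, max(0, Int(Double(count) * depth)))
    sentences.insert(needle, at: position)
    return sentences.joined(separator: " ")
}

// Scoring rule assumed for this sketch: the answer counts as a successful
// recall if it contains the expected fact, ignoring case.
func recalled(_ answer: String, expected: String) -> Bool {
    return answer.lowercased().contains(expected.lowercased())
}

// Stand-in for a real model call (hypothetical): returns the first
// sentence in the context that mentions the keyword.
func standInModel(context: String, keyword: String) -> String {
    let sentences = context.components(separatedBy: ". ")
    return sentences.first(where: { $0.contains(keyword) }) ?? "I don't know."
}

let filler = "The sky was grey that morning."
let needle = "The secret code for the vault is 7421."
let depths: [Double] = [0.0, 0.25, 0.5, 0.75, 1.0]

// Place the needle at several depths and count how often it is recalled.
var hits = 0
for depth in depths {
    let context = makeHaystack(filler: filler, count: 200, needle: needle, depth: depth)
    let answer = standInModel(context: context, keyword: "secret code")
    if recalled(answer, expected: "7421") {
        hits += 1
    }
}

let accuracy = Double(hits) / Double(depths.count)
print("Recall accuracy: \(accuracy)")
```

In a real evaluation, the haystack would be a long, varied document and the stand-in function would be replaced by a request to the model under test. The reported accuracy is then the share of needle positions and context lengths at which the model answers correctly.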
Anthropic continues to develop Constitutional AI to improve the safety and transparency of its models. This framework aims to align LLMs with a set of core principles and values. Claude 3 is said to show less bias than previous models and remains at AI Safety Level 2 (ASL-2). Anthropic says it will keep monitoring future models to assess how close they come to the ASL-3 threshold.
As per Anthropic’s announcement, Claude 3 models outperformed their predecessors in the benchmark tests. They should be able to handle complex, multi-step instructions and prompts better.
What can Claude do?
The range of potential uses is very wide and depends on the specific model. Every model has its strengths and weaknesses, and the user needs to be aware of them to make the most of the LLM.
According to Anthropic’s list of potential uses, Claude 3 models can be implemented, for example, in task automation. The models could plan and execute complex actions across APIs and databases or be used for interactive coding. Opus could also be leveraged in research and development by reviewing research and helping to brainstorm or generate hypotheses. It can also be used for advanced analysis of charts, graphs, financial documents, and market trends, or even for forecasting.
Moreover, Claude 3 models could help in data processing tasks such as retrieval-augmented generation (RAG) or knowledge search and retrieval. They can also be implemented in functionalities that include product recommendations, help improve targeted marketing, or perform forecasting. The models could also facilitate code generation and quality control or even parse text from images. Another use worth mentioning is the implementation in user interactions, for example, in fast and accurate customer support or translations.
They could also be used to optimize logistics, manage inventory, and extract and structure data. The models are continuously being developed, and their capabilities should increase over time. At this moment, the models are available in specific locations.
It’s also worth mentioning that users ran blind tests of the models and commented on their capabilities in a Reddit thread. According to their experiences, the models indeed outperformed their competitors in terms of speed and code generation. The general reception was very positive, especially regarding the growing competition that would push companies to develop their LLMs even further.
As of 15 April 2024, Opus is in 2nd place on the LLM leaderboards, although not that long ago it was the leader.
Self-assessment prompt
Since we could access only Sonnet among the new Claude 3 models, we tested it with the same prompts we used for ChatGPT and Gemini to compare the results. First, we asked Sonnet about fields in which it performs better than ChatGPT and Gemini to let it assess and summarize its capabilities itself.
Judging by the reply, Sonnet could not compare its capabilities due to knowledge limitations. However, it stated that it should be more accurate and thorough, as Anthropic claims in their announcement. Still, it’s aware all models are different and can perform better or worse in specific scenarios.
Generating and understanding code
At the beginning, we entered a simple prompt with a code snippet that changed characters to uppercase. Sonnet correctly identified what the code does and the language used.
Sonnet provided a detailed explanation of the function and gave a “Hello, World!” example, just like ChatGPT.
We also asked it to modify the code to change all letters to lowercase. It also provided a broad explanation of how the code was altered and gave another “Hello, World!” example. Interestingly, Sonnet provided a real-life example of how this snippet could be used. No other chatbot added such a comment.
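The snippet used in these two prompts isn’t reproduced in this article, and its language isn’t named. Purely as an illustration, a hypothetical Swift equivalent of the uppercase function and its lowercase variant could look like the sketch below; the function names are assumptions.

```swift
import Foundation

// Hypothetical stand-in for the first prompt: convert every character to uppercase.
func toUppercase(_ text: String) -> String {
    return text.uppercased()
}

// Hypothetical stand-in for the follow-up request: the lowercase variant.
func toLowercase(_ text: String) -> String {
    return text.lowercased()
}

print(toUppercase("Hello, World!")) // prints "HELLO, WORLD!"
print(toLowercase("Hello, World!")) // prints "hello, world!"
```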
To check the bug-fixing abilities, we used a Swift code snippet with an error to see how the bots performed. The prompt was as follows:
import Foundation
let fruitsInBasket: [String] = ["Banana", "Apple", "Orange", "Strawberry", "Mango", "Carrot"]
print("These are the fruits that I have in my basket:")
for number in 0...fruitsInBasket.count {
    print(fruitsInBasket[number] + "\n")
}
Sonnet performed similarly to GPT-4. It recognized the code as Swift, found the error, and explained how the code works.
Sonnet also fixed the code and provided an output example, but, unlike GPT-4, it didn’t notice the odd one out. GPT-4 was the only model that noticed a carrot appeared in a list of fruits. However, Sonnet commented on the “terminator” parameter of “print” and explained why the output does not have an extra line.
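For reference, the error in the prompt is an off-by-one: the closed range runs up to and including fruitsInBasket.count, which is one past the last valid index. The sketch below shows one possible fix. It is our own illustration, not Sonnet’s exact output, which isn’t reproduced here.

```swift
import Foundation

let fruitsInBasket: [String] = ["Banana", "Apple", "Orange", "Strawberry", "Mango", "Carrot"]

print("These are the fruits that I have in my basket:")

// The half-open range 0..<count stops at count - 1,
// so the index never runs past the end of the array.
for number in 0..<fruitsInBasket.count {
    // print already appends its default terminator "\n", so the extra "\n"
    // kept from the original leaves a blank line after each item.
    print(fruitsInBasket[number] + "\n")
}
```

Iterating directly with `for fruit in fruitsInBasket` would avoid index arithmetic altogether. The index-based loop is kept here only to stay close to the original prompt.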
Creative text generation
To check the creativity of Sonnet, we asked it to write a few cat jokes. The results were very similar to GPT-4’s. The jokes were based mostly on puns and weren’t as descriptive as in the case of Gemini Ultra.
The next prompt checked how Sonnet perceives moral dilemmas. We asked the following question: “Imagine you’re my friend who’s cheated on. Would you prefer to know that your husband is cheating, or would you rather not know about that?” Just like ChatGPT, it was at first reluctant to take a clear stand.
Finally, it provided its answer, which was the same as in the case of ChatGPT and Gemini: yes, I would like to know. It also explained its reasoning, just like the other models.
Conclusion
The overall performance of the model is very similar to ChatGPT and Gemini. What’s interesting is that Sonnet added a comment about the “terminator” parameter of “print”. The generated text tests are hard to comment on, as the results were also alike. Anthropic’s model can become an alternative to other models available on the market.
Claude as an alternative to other LLMs
The benchmarks in Anthropic’s Claude 3 announcement show strong performance, which suggests promising prospects for the models’ future uses. The high results in undergraduate and graduate-level knowledge domains are one reason to believe the models could be great for building smart tutoring systems with personalized experiences and educational materials at different difficulty levels.
The high reasoning score in benchmarks suggests the models could become advanced research assistants. These skills could enable fast information retrieval, data analysis, hypothesis generation, or finding relevant research papers.
However, even though the benchmarks presented in the announcement are impressive, real-world applications will verify the true performance of the models. Overall, Claude 3’s performance promises a lot in terms of the future of LLMs. As the models are still being developed, the capabilities of Haiku, Opus, and Sonnet will likely increase in time.
Possible democratization of LLMs
With a broader choice of LLMs, their availability and capabilities increase. Overall, Claude 3 could contribute to democratizing access to LLMs. The models’ proficiency in knowledge domains may broaden users’ access to expert knowledge. Moreover, thanks to their conversational nature, the models are easy to use. Claude 3 is focused on the user, and its integration into chatbots and virtual assistants could provide easy-to-understand responses to queries.
Claude’s capabilities match the performance of other LLMs, at least at the level of use we checked in this article. The Claude 3 models can be considered an alternative to other GenAI chatbots.
FAQ
Which is better: Claude or ChatGPT?
There is no simple answer to that question, as both Claude and ChatGPT are excellent chatbots that can facilitate work greatly. Still, AI platforms have their own advantages and disadvantages. Claude performs better when asked a factual question. It’s also great at complex reasoning and coding tasks. ChatGPT has broader uses, as it can generate images within the chat and some models and versions have access to the web. It seems this makes it a better companion in creative tasks. Choose the chatbot based on what you need in particular.
What are Claude AI’s capabilities?
Claude, as a chatbot, has great conversational skills and can engage in conversations with the user. It can understand context and give sensible responses. Moreover, Claude can be used to analyze and summarize documents or code. It can also help in code writing. Claude is considered a great tool for factual tasks and reasoning. Thanks to its improved accuracy, it is less likely to generate nonsensical responses to queries.