Sometimes the best way to solve a complex problem is to take a page from a children’s book. That’s the lesson Microsoft researchers learned by figuring out how to pack more punch into a much smaller package.
Last year, after spending his workday thinking through possible answers to machine learning riddles, Microsoft’s Ronen Eldan was reading bedtime stories to his daughter when he wondered, “How did she learn this word? How does she know how to connect these words?”
That led the Microsoft machine learning expert to wonder how much an AI model could learn using only words a 4-year-old can understand, and ultimately to an innovative training approach that has produced a new class of smaller, better-performing language models that promise to make AI more accessible to more people.
Large language models (LLMs) have created exciting new opportunities to be more productive and creative using AI, but their size means they can require significant computing resources to operate.
While those models are still the gold standard for solving many types of complex tasks, Microsoft has developed a series of small language models (SLMs) that offer many of the same capabilities found in LLMs but are smaller in size and are trained on smaller amounts of data.
The company today announced the Phi-3 family of open models, the most capable and cost-effective small language models available. Phi-3 models outperform models of the same size and the next size up across a variety of benchmarks that evaluate language, coding and math capabilities, thanks to training innovations developed by Microsoft researchers.
Microsoft is now making the first in that family of smaller, more powerful language models publicly available: Phi-3-mini, which measures 3.8 billion parameters and performs better than models twice its size, the company said.
Starting today, it will be available in the Microsoft Azure AI Model Catalog and on Hugging Face, a platform for machine learning models, as well as Ollama, a lightweight framework for running models on a local machine. It will also be available as an NVIDIA NIM microservice with a standard API interface that can be deployed anywhere.
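For developers who want to try the model locally, a minimal sketch along these lines should work with the Hugging Face transformers library. The model ID below is Microsoft’s published Phi-3-mini checkpoint on Hugging Face, but the prompt and generation settings are illustrative, and depending on your transformers version the model may require `trust_remote_code=True`.

```python
# Minimal sketch: run Phi-3-mini locally with Hugging Face transformers.
# Assumes `pip install transformers torch accelerate` and enough memory
# for a 3.8-billion-parameter model; settings here are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Explain latency in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

On Ollama, the equivalent is a single command such as `ollama run phi3`, assuming the model is listed under that name in Ollama’s library.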
Microsoft also announced that additional models in the Phi-3 family are coming soon to offer more choice across quality and cost. Phi-3-small (7 billion parameters) and Phi-3-medium (14 billion parameters) will be available shortly in the Azure AI Model Catalog and other model gardens.
Small language models are designed to perform well on simpler tasks, are more accessible and easier to use for organizations with limited resources, and can be fine-tuned more easily to meet specific needs.
“What we’re going to start to see is not a shift from large to small, but a shift from a singular category of models to a portfolio of models where customers get the ability to make a decision on what is the best model for their scenario,” said Sonali Yadav, senior product manager for generative AI at Microsoft.
“Some customers may only need small models, some will need big models and many are going to want to combine both in a variety of ways,” said Luis Vargas, vice president of AI at Microsoft.
Choosing the right language model depends on an organization’s specific needs, the complexity of the task and the resources available. Small language models are well suited for organizations looking to build applications that can run locally on a device (as opposed to the cloud) and where a task doesn’t require extensive reasoning or where a quick response is needed.
Large language models are better suited for applications that need orchestration of complex tasks involving advanced reasoning, data analysis and understanding of context.
Small language models also offer potential solutions for regulated industries and sectors that need high-quality results but want to keep data on their own premises, Yadav said.
Vargas and Yadav are particularly excited about the opportunities to place more capable SLMs on smartphones and other mobile devices that operate “at the edge,” not connected to the cloud (think vehicle computers, PCs without Wi-Fi, traffic systems, smart sensors on a factory floor, remote cameras or devices that monitor environmental compliance). By keeping data on the device, users can “minimize latency and maximize privacy,” Vargas said.
Latency refers to the delay that can occur when LLMs communicate with the cloud to retrieve the information used to generate answers to user prompts. In some cases, high-quality answers are worth the wait, while in other scenarios speed is more important to user satisfaction.
Because SLMs can work offline, more people will be able to put AI to work in ways that weren’t previously possible, Vargas said.
For instance, SLMs could be put to use in rural areas that lack cell service. Consider a farmer inspecting crops who finds signs of disease on a leaf or branch. Using an SLM with visual capability, the farmer could take a picture of the crop at issue and get immediate recommendations on how to treat pests or disease.
“If you are in a part of the world that doesn’t have a good network,” Vargas said, “you are still going to be able to have AI experiences on your device.”
As the name suggests, compared with LLMs, SLMs are tiny, at least by AI standards. Phi-3-mini has “only” 3.8 billion parameters, a unit of measure referring to the algorithmic knobs on a model that help determine its output. By contrast, the biggest large language models are many times larger.
Generative AI’s massive leaps forward were largely thought to be enabled by the sheer size of large language models. But the Microsoft team was able to develop small language models that deliver outsized results in a tiny package. That breakthrough was made possible by a highly selective approach to training data, which is where children’s books come in.
To date, the standard way to train large language models has been to use massive amounts of data from the internet. This was thought to be the only way to meet this type of model’s huge appetite for content, which it must “learn” from in order to understand the nuances of language and generate intelligent answers to user prompts. But Microsoft researchers had a different idea.
“Instead of training on just raw web data, why don’t you look for data which is of extremely high quality?” asked Sebastien Bubeck, Microsoft vice president of generative AI research, who has led the company’s efforts to develop smaller, more capable language models. But where to focus?
Inspired by Eldan’s nightly reading ritual with his daughter, Microsoft researchers created a discrete dataset that started with 3,000 words, including a roughly equal number of nouns, verbs and adjectives. They then asked a large language model to create a children’s story using one noun, one verb and one adjective from the list, a prompt they repeated millions of times over several days, generating millions of tiny children’s stories.
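A rough sketch of that prompt loop might look like the following; the word lists here are toy placeholders and the large-model call is left as a comment, since the researchers’ actual vocabulary and prompts are not reproduced in this article.

```python
# Illustrative sketch of the TinyStories-style generation loop: sample one
# noun, one verb and one adjective from a child-level vocabulary and ask a
# large model to weave all three into a short story.
import random

nouns = ["dog", "ball", "tree", "cake"]        # ~1,000 simple nouns in practice
verbs = ["run", "jump", "find", "share"]       # ~1,000 simple verbs
adjectives = ["happy", "big", "red", "brave"]  # ~1,000 simple adjectives

def make_prompt() -> str:
    noun = random.choice(nouns)
    verb = random.choice(verbs)
    adjective = random.choice(adjectives)
    return (
        "Write a short story a 4-year-old could understand that uses "
        f"the noun '{noun}', the verb '{verb}' and the adjective '{adjective}'."
    )

# Repeated millions of times against a large model, this yields a large
# corpus of simple-vocabulary stories:
# story = large_model.generate(make_prompt())
print(make_prompt())
```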
They dubbed the resulting dataset “TinyStories” and used it to train very small language models of around 10 million parameters. To their surprise, when prompted to create their own stories, the small language models trained on TinyStories generated fluent narratives with perfect grammar.
Next, they took their experiment up a grade, so to speak. This time a larger group of researchers used carefully selected, publicly available data that had been filtered for educational value and content quality to train Phi-1. After collecting that publicly available information into an initial dataset, they used a prompting and seeding formula inspired by the one used for TinyStories, but took it a step further and made it more sophisticated, so it would capture a wider scope of data. To ensure high quality, they repeatedly filtered the resulting content before feeding it back into an LLM for further synthesis. In this way, over several weeks, they built up a corpus of data large enough to train a more capable SLM.
“A lot of care goes into producing this synthetic data,” Bubeck said, referring to data generated by AI, “looking at it, making sure it makes sense, filtering it. We don’t take everything that we produce.”
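In outline, that filter-and-synthesize loop resembles the sketch below. The scoring and synthesis functions are hypothetical stand-ins for LLM calls, not Microsoft’s actual pipeline.

```python
# Hypothetical sketch of the filter-then-synthesize loop described above:
# score candidate passages, keep only the high-quality ones, and feed the
# keepers back to an LLM to synthesize more material.

def score_quality(passage: str) -> float:
    """Placeholder: in practice, prompt an LLM to rate educational value 0-1."""
    return 0.9  # stub value for illustration

def synthesize_more(passage: str) -> str:
    """Placeholder: in practice, ask an LLM to generate related material."""
    return passage + " (expanded)"

def build_corpus(candidates: list[str], threshold: float = 0.8) -> list[str]:
    corpus = []
    for passage in candidates:
        if score_quality(passage) >= threshold:  # repeated quality filtering
            corpus.append(passage)
            corpus.append(synthesize_more(passage))
    return corpus

print(build_corpus(["Photosynthesis turns light into chemical energy."]))
```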
The researchers refined the dataset further by selecting data the way a teacher might break down difficult concepts for a student. “Because it’s reading from textbook-like material, from quality documents that explain things very, very well,” Bubeck said, “you make the task of the language model to read and understand this material much easier.”
Distinguishing between high-quality and low-quality information isn’t difficult for a human, but sorting through the more than a terabyte of data that Microsoft researchers determined they would need to train their SLM would be impossible without help from an LLM.
“The power of the current generation of large language models is really an enabler that we didn’t have before in terms of synthetic data generation,” said Ece Kamar, a Microsoft vice president who leads the Microsoft Research AI Frontiers Lab, where the new training approach was developed.
Starting with carefully selected data helps reduce the likelihood of models returning unwanted or inappropriate responses, but it’s not sufficient to guard against all potential safety challenges. As with all generative AI model releases, Microsoft’s product and responsible AI teams used a multi-layered approach to manage and mitigate risks in developing the Phi-3 models.
For instance, after initial training they provided additional examples and feedback on how the models should ideally respond, which builds in an additional safety layer and helps the models generate high-quality results. Each model also undergoes assessment, testing and manual red-teaming, in which experts identify and address potential vulnerabilities.
Finally, developers using the Phi-3 model family can also take advantage of a suite of tools available in Azure AI to help them build safer and more trustworthy applications.
But even small language models trained on high-quality data have limitations. They are not designed for in-depth knowledge retrieval, where large language models excel due to their greater capacity and training on much larger data sets.
LLMs are better than SLMs at complex reasoning over large amounts of data due to their size and processing power. That’s a capability that could be relevant for drug discovery, for example, by helping to comb through vast stores of scientific papers, analyze complex patterns and understand interactions between genes, proteins or chemicals.
“Anything that involves things like planning, where you have a task and the task is complicated enough that you need to figure out how to partition it into a set of sub-tasks, and sometimes sub-sub-tasks, and then execute through all of those to come up with a final answer ... is really going to stay in the domain of big models for a while,” Vargas said.
Based on ongoing conversations with customers, Vargas and Yadav expect to see some companies “offloading” certain tasks to small models if the task is not too complex.
For instance, a business could use Phi-3 to summarize the main points of a long document or extract relevant insights and industry trends from market research reports. Another organization might use Phi-3 to generate copy, helping create content for marketing or sales teams, such as product descriptions or social media posts. Or a company might use Phi-3 to power a support chatbot that answers customers’ basic questions about their plans or service upgrades.
Internally, Microsoft is already using suites of models in which large language models play the role of router, directing certain queries that require less computing power to small language models while tackling other, more complex requests itself.
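That router pattern fits in a few lines. The complexity heuristic and both generate functions below are illustrative stand-ins, not Microsoft’s internal system.

```python
# Hypothetical sketch of the LLM-as-router pattern: queries judged simple go
# to the small model; everything else falls back to the large model.

def looks_simple(query: str) -> bool:
    """Placeholder heuristic; in practice the router can itself be an LLM."""
    return len(query.split()) < 30

def slm_generate(query: str) -> str:
    return f"[SLM answer to: {query}]"  # stub for a small-model call

def llm_generate(query: str) -> str:
    return f"[LLM answer to: {query}]"  # stub for a large-model call

def route(query: str) -> str:
    return slm_generate(query) if looks_simple(query) else llm_generate(query)

print(route("Summarize this paragraph in one sentence."))
```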
“The claim here is not that SLMs are going to substitute or replace large language models,” Kamar said. Instead, SLMs “are uniquely positioned for computation on the edge, computation on the device, computations where you don’t need to go to the cloud to get things done. That’s why it is important for us to understand the strengths and weaknesses of this model portfolio.”
And size still has its advantages. There remains a gap between small language models and the level of intelligence you can get from the big models in the cloud, Bubeck said. “And maybe there is always going to be a gap because, you know, the big models are going to keep progressing.”
Top image: Sebastien Bubeck, Microsoft vice president of generative AI research, who led the company’s efforts to develop smaller, more capable language models. (Photo by Dan DeLong for Microsoft)