Generative AI is advancing rapidly, as demonstrated by two separate events from OpenAI and Google this week. I expect to hear even more in the coming weeks and am most interested in seeing more real business applications. But I’m also intrigued by how this technology can be used to make “personal agents” (think Siri or Alexa) more useful.

OpenAI Spring Update: GPT-4o
On Monday, OpenAI CTO Mira Murati unveiled a new flagship model called GPT-4o (the “o” stands for “omni”). It’s faster than its predecessor and allows for more natural human-computer interaction through any combination of text, audio, and vision (images). OpenAI isn’t claiming this is a big step forward in the underlying model per se. Instead, it should be “friendlier and more conversational,” with a new user interface and a desktop app that make it more accessible.

It’s also speedy. OpenAI’s earlier Voice Mode—a combination of transcription, intelligence, and text-to-speech first announced in September—had a lot of latency. That should be less of an issue with the new version. In a big step forward, GPT-4o will be available to all 100 million ChatGPT users, including free users. (Paid users still get five times the capacity, though.)
Murati said ChatGPT can now remember your previous interactions to create a sense of continuity across conversations, browse the web to bring in real-time information, and better understand data in charts. Perhaps most importantly, the company has improved its speed in 50 languages.

All these features now also apply to the API, so software developers can get responses two times faster, 50% cheaper, and with five times higher rate limits than the earlier GPT-4 Turbo. (A rough sketch of what calling the new model looks like appears at the end of this section.)

In demos, GPT-4o seemed to react with human speed instead of the 3- to 5-second latency found on older models. It also supports interruptions and can continue the conversation after any redirects. Other demos included things like solving equations written on a sheet of paper, helping with a coding problem, and real-time translations.

I’m not convinced that a general-purpose assistant that does everything will be better than specialized tools for specific types of problems. At least not yet. But GPT-4o certainly looks impressive. OpenAI says all of this will be rolling out over the next few weeks.
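For developers, the switch is mostly a matter of specifying the new model name. Here is a minimal sketch using OpenAI’s Python SDK; the prompt is just a placeholder, and it assumes an OPENAI_API_KEY is set in your environment, so treat it as an illustration rather than OpenAI’s recommended pattern.

```python
# Minimal sketch: calling GPT-4o via OpenAI's Python SDK (openai >= 1.0).
# Assumes the OPENAI_API_KEY environment variable is set; the prompt is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # the new model; older code might specify "gpt-4-turbo"
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize this week's AI announcements in one sentence."},
    ],
)

print(response.choices[0].message.content)
```

The speed, price, and rate-limit improvements come on OpenAI’s side; existing integrations shouldn’t need structural changes beyond the model name.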
Google: The Latest on Gemini

Not to be outdone, Google opened its I/O developer conference on Tuesday with CEO Sundar Pichai demonstrating a private preview of its new Gemini 1.5 Pro model. He focused on two big areas of improvement—better multimodality (including the ability to use videos in prompts, along with text, audio, and images) and a larger token context window (translation: the ability to have longer, more detailed prompts).

Gemini 1.5 Pro now supports a context window of 1 million tokens; this will double to 2 million in the new model in preview, Pichai said. (Two million tokens equals around 1.4 million words, two hours of video, or 22 hours of audio. For comparison, ChatGPT’s context window is 128,000 tokens. A short tokenizer sketch below shows roughly how those counts translate into words.) In the long run, Pichai said Google wants to offer “infinite context windows.”

Pichai also emphasized how the model is being incorporated into products like Google Workspace and Photos. The demos looked good, with Gemini appearing as a sidebar in Workspace doing things like summarizing a set of emails. In Google Photos, the “Ask Photos” feature will help you find images with more natural-language requests. Other specific tools included new or upgraded versions of models to create images (Imagen 3), music (Music AI Sandbox), and video (Veo).
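To make those token numbers a bit more concrete, here is a rough sketch using tiktoken, OpenAI’s open-source tokenizer library. Gemini uses its own tokenizer, and the sample text and encoding choice are my assumptions, so the ratio is only a ballpark.

```python
# Rough illustration of how tokens relate to words, using OpenAI's open-source
# tiktoken library (pip install tiktoken). Gemini's tokenizer differs, so this
# is only a ballpark figure; the sample text is arbitrary.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # the GPT-4/GPT-4 Turbo encoding

sample = "Generative AI is advancing rapidly, as demonstrated by two separate events this week."
tokens = encoding.encode(sample)
words = sample.split()

print(f"{len(words)} words -> {len(tokens)} tokens")
# English text typically works out to roughly three-quarters of a word per token,
# which is how 2 million tokens translates to something like 1.4 million words.
```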
Of course, the most important Google service is search. The company said the new Gemini model will come to search in the US soon, then to much of the rest of the world—over a billion users—by the end of the year. This new capability, previously known as the Search Generative Experience (SGE), is now called “AI Overviews.” It looks interesting, though I wonder how it will impact publishers and people who simply want regular web searches.

Perhaps most intriguing was a new push for AI assistants called Project Astra, effectively a prototype AI agent that can do things like identify items you point your phone camera (or smart glasses) at, review code, and provide detailed trip planning. This seems to build on an improved, camera-enabled Gemini app for your phone that will soon be able to see the world around you.
Summarizing the AI Improvements

Of course, there are other players. We’ll hear more from Microsoft about Copilot at its Build event next week (which I will be covering), as well as continued improvements from companies like Meta with its Llama models, Anthropic with Claude, Cohere, Mistral, and others.
In general, all of the companies are working on the same kinds of things to separate this year’s AI offerings from what came before:

- They are “multi-modal,” meaning they work with different kinds of input—not just text but audio, video, slides, code snippets, etc. This isn’t completely new, but every model adds more types of input.
- They have much larger context windows. In other words, they’ll take a lot more data as your input. Tokens are usually words or parts of words that are turned into symbols and used by the models. The more data they can take in, the better the results should be.
- They have a longer memory. If your conversation has multiple questions and answers, going back and forth (“turns”), the model is more likely to remember where it started, so it can give you responses in context.
- They are faster. In some cases, much faster.

The Second (or Third?) Coming of Agents

One big goal you’ll hear from many of these companies is the emergence of a new generation of “intelligent agents”—personal software that can perform specific tasks for you. For instance, Google demonstrated planning a trip to Miami where the AI considers when the travelers want to go and their individual preferences. It looked very cool.
We’ve heard this before. That was the promise of things like Amazon’s Alexa, Apple’s Siri, and the Google Assistant (not to mention Microsoft’s now-defunct Cortana or even Clippy and Bob).

The new models have features that should make things better, including improved memory, faster responses, and multimodal capabilities. Google is already incorporating its new model into the Gemini app; I assume it will be part of Google Assistant at some point. OpenAI’s apps are now beginning to act similarly. Apple is widely rumored to be upgrading Siri, likely at its Worldwide Developers Conference (WWDC) next month. Microsoft Copilot works in specific applications and is a big improvement over Cortana thus far. You have to assume Amazon will upgrade Alexa as well.

I’m very interested to see what these new tools will be able to do, but I’m taking all the promises with a grain of salt. After all, six years ago at Google I/O, the company showed off a feature called Duplex that would do things like call to make a restaurant reservation for you. It never took off. And while Siri, Alexa, and the Google Assistant can all do simple things like play music, turn on lights, and look up info on the web, they’ve never lived up to the promise they once had.

Maybe this time—with much more powerful software under the hood—they can.