How gen AI is moving from the Napster to the Spotify era; Trump eyes AI czar; we need a better way to benchmark AI models; new competitors emerge to OpenAI's o1; artists leak Sora in protest
Stanford publishes its Global AI Power Rankings; Amazon's plan to compete with Nvidia in chips; do you need coders in an AI world? AI's scientific path; Uber starts an AI outsourcing business
This week, I attended the launch of a new report from MMC Ventures which discusses the ethical challenges of training AI models, drawing parallels between the data gathering practices of today and the Napster-to-Spotify transition that occurred over two decades ago in the music industry. The report examines the perspectives of three stakeholders: content creators, data rights holders, and AI developers, highlighting the need for fair compensation and efficient rights management.
The authors - Advika Jalan and Charlotte Barttelot - do a good job exploring technological solutions like AI data marketplaces and watermarking to address these issues, advocating for a shift towards licensing agreements and improved data discoverability. They also emphasize the importance of standardization in data formats and consent management to encourage ethical AI development.
One thing the report does well is look at how the dynamics between data and AI models play out across the entire lifecycle, observing that, as we move from pre-training and fine-tuning to inference, the value of individual data points increases because the difference each one makes to the outcome is greater.
Another great section of the report goes deep into how creators think about AI. There are some obvious findings related to compensation, control and irreplaceability, but the authors also touch on two other important factors: discoverability and attribution. Discoverability in particular is an interesting one; for example, Living Assets is a company that helps artists improve the discoverability of their work through Search Agent Optimisation. Search agent optimisation is similar to search engine optimisation, but built for the likes of Perplexity and ChatGPT Search, and it lets content creators easily enter into licensing agreements with AI developers.
However, there's an important question left to answer: how can AI developers balance ethical practices with financial viability, given that not everyone has the budget to spend hundreds of millions of dollars on deals with data providers? Since I've just finished watching the season finale of the Great British Bake Off, I'm going to indulge in a cake-based answer. If an AI model is a cake and the ingredients are data, there are generally four options available:
Buy a cake from a cake shop. For those of us who are in a hurry, this is the go-to solution. The cake will taste okay, but you have to take the cake seller's word for when and how it was made, and what's inside it. You may have access to the ingredient list, but you don't know where those ingredients came from. Maybe - in some cases - you don't even care. You just want to eat cake. In AI model terms, this is what you can get today from OpenAI, Cohere or Anthropic. They take their ingredients from wherever they can, usually in bulk - in other words, they use massive amounts of data scraped from the internet. This is legal, as the outcomes of lawsuits that Meta and X filed against the web scraping company Bright Data suggest. But there are many ethical challenges to this approach, which the report explores in depth.
Buy a pre-made cake mix to which you add some milk and eggs at home. This is very similar to the store-bought cake, but you still have a bit of control over some of the ingredients, and you're involved in the cake-making process because you have to bake the cake yourself. This is what you get with open-weights models: most of the grunt work has been done for you in terms of pre-training, but you need to fine-tune them to fit your use case. The data used for pre-training is scraped from the internet as well, but fine-tuning lets you add smaller, more application-specific datasets that make the cake ready to eat. For example, a model initially trained on a general text corpus can be fine-tuned on medical journals to excel at answering health-related questions (a minimal code sketch of this step follows the four options below). The ethics of using open-weights models is questionable, but there is hope that regulation will eventually make them more ethical to use.
Buy generic ingredients from a supermarket and make your own cake in the kitchen. Chances are you'll ultimately end up with a cake that's more or less similar to the store-bought one, but there's something thrilling about the DIY approach. You also get to select the ingredients and fully adjust the recipe to your own liking. However, the ingredients still come from unknown sources and you have to trust the sellers you've bought them from. Some startups today are trying to offer home-made alternatives to the store-bought cake variety. These smaller models are usually designed for specific verticals or niche use cases, similar to how people tend to bake cakes at home for special occasions. This is the approach that startups such as Xund and Beatoven.ai have taken, and the report shows it can result in lower data costs.
Go to your local market to find ethically sourced ingredients (or grow your own!) and then make the cake in your kitchen. This last option is for the more adventurous out there who secretly think they would do a good job on the Great British Bake Off. It's complicated, it's time-consuming, it's costly but, when done right, it could give you significant advantages over the competition. You have full control over the ingredients, you know where they come from, and therefore you can reassure anyone eating the cake that not only are they eating the finest dessert available but they can also feel good about it. Rightsify and my employer Synthesia are examples of companies that went down this path. In Synthesia's case, I explain in the report how we chose to make a multi-million dollar investment in a recording studio where we've captured many hours of footage with paid actors and then used that dataset to train EXPRESS-1, the model that generates our Expressive Avatars.
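As promised in the cake-mix option above, here is a minimal sketch of what that fine-tuning step can look like in practice, using Hugging Face Transformers. The base model, the medical_corpus.txt file and the hyperparameters are placeholders I've picked for illustration, not recommendations from the report, and in reality you would also need the rights to whatever data you fine-tune on.

```python
# A minimal sketch of the "cake mix" approach: fine-tuning a small open-weights
# causal language model on a domain-specific text file. Model name, data file
# and hyperparameters are illustrative placeholders, not recommendations.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # any open-weights causal LM would work here
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical application-specific corpus, e.g. medical text you have the rights to use.
dataset = load_dataset("text", data_files={"train": "medical_corpus.txt"})


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)


tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetuned-model",
        num_train_epochs=1,
        per_device_train_batch_size=2,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The point of the sketch is how little of the recipe you control: everything baked into the pre-trained weights is inherited as-is, and only the small dataset you add at the end is yours to source ethically.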
Now that we are all caked out, I want to tackle another topic that's included in the MMC Ventures report but was also in the news this week: data authenticity.
There's been a lot of discussion on how to handle issues of authenticity and transparency in the content creation and distribution process, now that more people are using generative AI to create and share text, images or videos on the internet.
Some argue that labelling and self-disclosure are a way forward, but that's difficult to do in a nuanced way because there's a difference, for example, between AI-generated and AI-assisted content. Some content sharing platforms also struggle with labelled content. When I tried to upload an AI-generated video featuring my avatar to a very popular social sharing platform and labelled it accordingly, the platform's content moderation systems flagged it as misinformation and took it down, even though the video was a factual summary of a technical paper that I had created myself.
Others are pushing for watermarking, but watermarks can be easily removed and their effectiveness might be limited. This video has a watermark, but someone could easily crop it out or use video editing software to remove it, and then republish it.
Finally, there's a cottage industry of AI detectors, but they're in a constant (and sometimes losing) battle to keep up with the advances made by AI generators.
I think content credentials provide a way forward. Just as Shazam gave anyone a trustworthy and secure way to learn more about a piece of music, C2PA could provide more sophisticated answers about content - AI-generated or not - on the internet.
It will start with a binary question: does a piece of content - say, an image - have content credentials associated with it or not? If it doesn't, we should probably not trust its authenticity by default, question its provenance, and not distribute it further.
Then, if there are content credentials attached to it, am I looking at the original or an altered version? When was the image made and by whom? Where was it shared?
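To make that flow concrete, here is a minimal sketch in Python of how a platform might triage an uploaded file. The read_manifest_store helper is hypothetical (a real implementation would call one of the open-source C2PA SDKs), and the manifest field names are only illustrative of how manifest stores are commonly serialised; the C2PA specification defines the exact schema.

```python
# A minimal sketch of the triage flow described above, not a production verifier.
# read_manifest_store() is a hypothetical stand-in for a real C2PA SDK call, and
# the manifest field names are illustrative; consult the C2PA spec for the schema.
from typing import Optional


def read_manifest_store(path: str) -> Optional[dict]:
    """Hypothetical helper: a real implementation would call a C2PA SDK and
    return the parsed manifest store, or None if no credentials are attached."""
    return None  # placeholder so the sketch runs end to end


def assess(path: str) -> str:
    store = read_manifest_store(path)
    if store is None:
        # No content credentials: don't trust authenticity by default.
        return "no credentials: treat provenance as unknown, don't redistribute"

    # Credentials present: who produced the active manifest, and has the asset
    # been edited since it was created?
    manifest = store["manifests"][store["active_manifest"]]
    producer = manifest.get("claim_generator", "unknown tool")
    actions = [
        action["action"]
        for assertion in manifest.get("assertions", [])
        if assertion.get("label") == "c2pa.actions"
        for action in assertion["data"]["actions"]
    ]
    edited = any(a != "c2pa.created" for a in actions)
    return f"credentials from {producer}; edited since creation: {edited}"


print(assess("avatar_summary.mp4"))  # hypothetical file name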
C2PA is also a great dual-use technology: it allows content creators to opt in to or out of having their data used for model training.
That said, challenges remain with implementing C2PA at scale, especially for video sharing platforms - a point I made to Jeremy Kahn for the Eye on AI newsletter he writes for Fortune magazine, which is linked below.
And now, here is this week's news:
❤️Computer loves
Our top news picks for the week - your essential reading from the world of AI
Business Insider: AI improvements are slowing down. Companies have a plan to break through the wall.
Axios: Trump eyes AI czar
MIT Technology Review: The way we measure progress in AI is terrible
Stanford University: Global AI Power Rankings: Stanford HAI Tool Ranks 36 Countries in AI
Bloomberg: Amazon’s Moonshot Plan to Rival Nvidia in AI Chips
New York Times: Should You Still Learn to Code in an A.I. World?
The Information: New Competitors Chase OpenAI in Reasoning AI Race
Fortune: Labeling AI-generated content is not as easy as it seems
Bloomberg: Uber’s Gig Workers Now Include Coders for Hire on AI Projects
FT: OpenAI’s text-to-video AI tool Sora leaked in protest by artists