What a fantastic review! Good stuff.
Thanks, I'm happy you found it useful!
What a fantastic review!
It's surprising that some lesser-known tools outperformed big names like Pika Labs. Not sure image to video AI tools are ready to replace traditional animation yet, but these are getting pretty good.
Yeah I'm honestly shocked by how consistently poorly Pika Labs performs in my tests. And Vidu + PixVerse were definitely pleasant surprises.
Looking at the progress in this space (creepy Will Smith eating spaghetti last year, realistic videos this year), I think we'll get to Midjourney-level photorealism in the AI video space by next year. But time will tell.
Glad you enjoyed the round-up!
Thanks for all the hard work.
Glad to see my first hunch (Grok and Veo) validated.
Yeah man, I already expected that Veo would hold up since even Veo 2 was in the top tier, but Grok came seemingly out of nowhere and it's really, really good. Tried it also with text prompts and other image inputs. It's solid!
Wow, what a lot of work! Re God Tier: though they rank similarly, I find them so different, each with its own obvious flaws, that it's hard (just based on these two experiments each) to choose. None actually got the wheels' drive configuration correct. That's the first thing I look at, because just having the static train car photo move by is no big accomplishment. The wheels are the test. Re Steam Punk: the computational power and prompting needed to actually correlate the hands so they're convincing for a violin player is just not feasible, but several were better at it. No cigar for any, though. I know this is all new tech, so the future looks bright.
You're probably right. I might be a bit too lenient in not nitpicking on the micro details, mostly because I still remember what video models were like in late 2023 and what a huge leap this new generation is.
Also, as I write in the concluding section, I think these two tests are no longer representative. We're at the "Daniel benchmark saturation" phase. So if I do this exercise again next year, it'll be with more complex choreography and perhaps even longer multishot scenes.
Judging by the pace of progress in image and video generation, I'd agree with your "future looks bright" take.
Daniel always finds cutting edge tools, thanks for sharing!
I would like to know if there is a natural multi-language Text2Speech tool at present?
You got it, glad you enjoyed it!
As for text-to-speech, the obvious starting point is ElevenLabs: https://elevenlabs.io/languages
They are the current leaders in text-to-speech with lots of additional tools (especially if you pay) like voice cloning, voice isolator, sound effects, etc. They offer 32 languages out of the box, so I guess it depends on which languages you're after!
Heard of it before! I'll give it a try! Thanks again for your patience!
It makes me wonder if the original surrealist prompt confused the two that made the train run backwards. After all, that's pretty surreal! Same with the sort of ghost train moving from the back to the front.
The kid inside me (like 95% of who I am) is blown away by these incredible things.
Could be, although I normally find that text prompts have a rather minor effect when a starting image is used. (In fact, they're pretty much always optional, and you can just upload an image and ask for a video without further context.)
But yeah, I've now been writing about GenAI for 2 years and while I'm a bit more used to seeing these insane advances, it still feels magical how much AI can do now, in so many areas.
People complain a lot about how they don't do things right or whatever, but does anyone actually realize that 3 years ago, we didn't even have real language models that didn't completely suck? Like, it's positively breathtaking if you zoom out any, at all... at least for someone Gen X or older, anyway.
100%. And the pace of progress is still strong in many areas. So it'll be exciting to see where we are a year from now.
I will read all about it here!
I might also be using one or two new tools myself; let's see. Either way, the stuff we're already using is going to continue to improve; this is the worst these tools will ever be.
Thanks for the introduction. But some of them have heavy censorship, which is crazy annoying. The AI-generated content industry is still at an early stage of development, and they already tell us what NOT to do.
I think some of the censorship is less about AI companies trying to impose their own set of ethics and more about attempts to get ahead of the reasonable backlash about celebrity deepfakes, non-consensual porn images, etc. that will arise. Also, they don't want to risk getting sued, so it's natural for them to err on the side of caution.
these are free?
At the time of writing, they all give you free credits, yes.
Heya, thanks for the kind words. I'm happy to hear my posts do so much heavy lifting for you!
Just to understand your process: Is there any reason you prefer to start with perchance/dezgo? Because Grok can pretty much replace the entire first chain of your process. It also has an image generation track, and the way it works is like you described: You type in your prompt, and Grok generates several variations of images based on it. But then you can just scroll down endlessly and watch it populate more and more images from your prompt. Each of them has a little "Play" icon that automatically turns them into a video (with an option to supplement it with a text prompt to describe the movement).
So it sounds like exactly what you're doing with three different platforms, all self-contained in Grok.
As for the "Continue" feature you mentioned, I'm quite sure that many providers have exactly this, typically called "Extend." I believe Pika, Luma, and Kling do it at the very least. Worth checking out!
I've used Dezgo before, and it's neat enough. And I do get what you mean, especially with the more complex platforms like Kling that have so many features and options that it's hard for them to cram everything into the interface!
But I actually do find Grok to be very clean and straightforward. Here I recorded a video where I start generating an endless stream of images, pick one to edit using natural language, and then make a video out of it, all in the same interface in an intuitive way:
https://youtu.be/i7PKHTkfLZE
Quite simply: Sora isn’t free, which is why it didn’t make the cut.
But based on my experience with it, I’d likely place it somewhere between the bottom of Tier #1 and Tier #2. It’s definitely not God material.
But yes, the image generation on Sora.com (powered by GPT-4o) is amazing and still officially the best image model in the world according to most leaderboards (like this one: https://artificialanalysis.ai/text-to-image/arena?tab=leaderboard).