Why Try AI

3 Free Sites to Compare LLMs

Pit LLMs against each other and compare their output.

Daniel Nest
Sep 19, 2024

By this point, most of us have a preferred language model to chat with.

Many stick to the good old GPT-4o, others swear by Claude 3.5 Sonnet, and a few fringe freaks still prefer talking to other humans.

But do you truly know how well your chosen model compares to other LLMs, or are you using it purely out of habit?

Chances are it’s the latter.

So for today’s post, I’ve dug up a few free sites that let you pit LLMs against each other to find out which model is best for a specific question or task.

For my demo purposes, our question will be this silly nonsense:

“What is the capital of Paris?”

I want to see how LLMs handle human stupidity.

Let’s roll!

Why Try AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

1. LMSYS Chatbot Arena’s “Side-By-Side”

LMSYS Chatbot Arena front page

The LMSYS Chatbot Arena’s Leaderboard is widely used to rank LLMs based on blind tests by real users.

But it also has a side-by-side section that lets you chat with two selected models and compare them.

How to use

1. Navigate to lmarena.ai.

2. Click the “Arena (side-by-side)” tab at the top:

LMSYS Chatbot Arena side-by-side tab

3. Select the two models you want to compare from their respective dropdowns:

LMSYS Chatbot Arena side-by-side model selection

4. Input your prompt into the box at the bottom:

LMSYS Chatbot Arena prompt input and model settings

5. Run and compare the results:

LMSYS Chatbot Arena side-by-side responses
“How rude, GPT! Why can’t you be more like your polite brother Claude?”

As you can see, it’s a straightforward process.

You can even continue the chat to see how the models handle repeated interactions:

LMSYS Chatbot Arena follow-up chats

The good

  • Great selection (68 models)

  • Can tweak some model settings like temperature, output tokens, etc.

  • Can work with uploaded documents or images

  • Convenient side-by-side comparison view

  • No sign-up required

The bad

  • Intermittent errors and runtime issues with specific models

  • No ability to set a system prompt for the models

Check out LMSYS Arena
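
If you'd rather script this kind of side-by-side comparison yourself, the core idea is simple: send the same prompt to each model and line up the replies. Here's a minimal sketch; the two model functions are placeholders I've made up to keep it self-contained, and in practice you'd swap in real API calls to your providers of choice.

```python
# Minimal sketch of a local "side-by-side" comparison harness.
# The two model functions below are stand-ins (assumptions);
# in practice you would replace them with real API calls.

def model_a(prompt: str) -> str:
    # Placeholder standing in for, e.g., a GPT-4o API call.
    return "[model A] Paris is a city, so it has no capital."

def model_b(prompt: str) -> str:
    # Placeholder standing in for, e.g., a Claude 3.5 Sonnet API call.
    return "[model B] Paris is itself the capital of France."

def side_by_side(prompt: str, models: dict) -> dict:
    """Send the same prompt to every model and collect the replies."""
    return {name: fn(prompt) for name, fn in models.items()}

results = side_by_side(
    "What is the capital of Paris?",
    {"Model A": model_a, "Model B": model_b},
)
for name, reply in results.items():
    print(f"{name}: {reply}")
```

That's essentially all a side-by-side tool does under the hood, minus the polished UI, model dropdowns, and settings sliders.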

2. Modelbench

Modelbench front page

Modelbench primarily targets developers and professionals building with LLMs, but it also offers a rather beginner-friendly way to compare model outputs.

How to use

1. Navigate to modelbench.ai.

2. Click the “Sign Up For Free Today” button:

Modelbench sign-up button

3. Provide your details or sign in using your Google account:

Modelbench sign-up form

4. In the Playground, click the “Compare With” button in the top-right to create the side-by-side view:

Modelbench playground Compare With selection

5. Select your two models from the dropdown:

Modelbench playground individual model selection

6. Input your prompt into the box at the bottom:

Modelbench playground prompt input

7. Run and compare the results:

Modelbench playground side-by-side results output

Just as with LMSYS Arena, you can respond with follow-up messages to continue the conversation:

Modelbench playground side-by-side follow-up chat
Good for you, Llama 3, sticking to your guns (correctly)!

The good

  • Convenient side-by-side comparison view

  • Massive selection (180+ models)

  • Can modify and adjust system prompts per model

  • Can work with uploaded images

  • Can tweak every model setting (temperature, tokens, Top-P, and so on)

  • Follow-up messages can be different per model

  • Additional test tools within the Workbench environment

The bad

  • Free trial is limited to 7 days

  • Requires sign-up

Check out Modelbench

3. Wordware’s “Try all the models”

Wordware's "Try all the models" app

This one’s a bit of a different beast.

Wordware lets anyone build AI apps using natural language. With this featured app called “Try all the models for a single question,” you can…well, it’s in the name.

How to use

1. Navigate directly to the Wordware app via this link.

2. Type your question into the “QUESTION” box:

Wordware's "Try all the models" app input prompt box

3. Click “Run App” and wait for your results.

Wordware's "Try all the models" app - evaluated and ranked verdict by Claude 3 Opus

Yup, that’s all there is to it!

The stand-out feature of this app is that all responses are analyzed, evaluated, and ranked by Claude 3 Opus, which then gives you its verdict on which model best handled your specific question.

The good

  • Test over a dozen models in one go

  • Helpful evaluation and ranking by Claude 3 Opus

  • Super simple, one-input-box interface

  • No sign-up required

The bad

  • Can’t add or deselect models (it runs all models for every question)

  • A limited number of tested models

  • No helpful side-by-side view (you have to scroll through responses)

  • No follow-up interactions

  • No ability to upload files

  • Long wait time for the responses and evaluation to finish processing

Check out Wordware

Verdict

I think each site has its own strengths when it comes to comparing LLMs, so your choice ultimately depends on your needs.

Modelbench is the most complete and robust tool, but you’ll need to sign up and eventually pay. If you’re a professional user building with LLMs, this one’s a no-brainer.

LMSYS Chatbot Arena is a great free alternative that offers many of the same capabilities without any sign-up or payment requirements.

Wordware’s “Try all the models” is perfect if you want to test multiple models on a single task or question with as little effort as possible. It even helps you make sense of the results.

So go ahead and take them for a spin. I’m curious to hear your thoughts.


🫵 Over to you…

Which of the tools makes the most sense for your needs? Are you aware of any other sites where one can compare LLMs for free?

Leave a comment or shoot me an email at whytryai@substack.com.

Leave a comment
