5 Hard-Won Truths About Building AI Products (From a Former Meta Llama PM)

Published

24th September 2025

Author

Team Elevation


Vikash Rungta

If you're building a product with AI, you know the chasm between its world-changing promise and the messy reality of implementation. We're told we can build anything, yet we spend our days battling hallucinations, unreliability, and outputs that are technically correct but practically useless. How do you bridge that gap and build something that actually works?

Recently, as part of our AI Build Series, Elevation Capital hosted a session with Vikash Rungta, a true practitioner from the front lines, on how to build AI capabilities in your products. As a former product lead at Meta for Llama Safety and now an instructor at Stanford, he has wrestled with these problems at a scale few can imagine. He's seen what works, what fails, and—most importantly—why.

This article distills five counterintuitive but powerful truths from Rungta's experience. These insights move beyond the hype to offer a practical, engineering-focused mindset that can change how you think about building with AI.

1. Your Biggest Misconception: An LLM is Not ChatGPT

Let's start with the most fundamental and widespread misunderstanding that trips up builders. We see polished products like ChatGPT and assume we are working with the same thing. This is a critical error. The raw material you are building with—the Large Language Model (LLM)—is far simpler and more constrained.

An LLM's only door to the world is a single text prompt.

That’s it. Products like ChatGPT are complex applications built on top of a base LLM. They are sophisticated systems that add layers for memory, file uploads, web search, and conversational history. The base model itself has none of this. As Vikash explains, this distinction is the source of immense confusion for builders who expect the model to have capabilities that it simply doesn't.

"The biggest myth I have seen is that people understand LLMs to be more than they are. They expect and assume there are different ways to provide input to the model... You may think you can upload files, connect databases, or that it has memory. None of that is true. The ChatGPT you see as a product is very different from the LLM itself."

This realization should change how you allocate your team's resources. Instead of spending months evaluating dozens of models, your first priority should be building a robust "context engineering" pipeline. The best model in the world is useless if you can't feed it the right information through its only door.

If the prompt is the only control mechanism you have, then mastering how to provide rich, relevant context—what many call "prompt engineering"—isn't a secondary skill. It's the entire game.
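To make this concrete, here is a minimal sketch of what a context-engineering pipeline looks like in practice. Everything in it is illustrative: llm_call is a hypothetical stand-in for whichever model API you use, and the point is simply that memory, documents, and conversation history all have to be serialized into the one text prompt the base model actually sees.

```python
# A minimal sketch of "context engineering": everything the model can know
# must be packed into its single text prompt. llm_call is a hypothetical
# stand-in for a real model API, not a specific provider's SDK.

def llm_call(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    raise NotImplementedError("Wire this up to your model provider.")

def build_prompt(user_question: str,
                 conversation_history: list[str],
                 retrieved_docs: list[str]) -> str:
    # The base model has no memory or file access, so history and
    # documents have to be serialized into the prompt text itself.
    history = "\n".join(conversation_history[-5:])   # keep the last few turns
    context = "\n---\n".join(retrieved_docs)
    return (
        "You are a helpful product assistant.\n\n"
        f"Relevant documents:\n{context}\n\n"
        f"Conversation so far:\n{history}\n\n"
        f"User question: {user_question}\n"
        "Answer using only the documents above. If they don't contain "
        "the answer, say you don't know."
    )

prompt = build_prompt(
    user_question="What changed in our pricing last quarter?",
    conversation_history=["User: Hi", "Assistant: Hello! How can I help?"],
    retrieved_docs=["Q2 pricing memo: Pro tier moved from $20 to $25/month."],
)
# response = llm_call(prompt)
```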

2. Hallucination Isn't a Bug, It's a Symptom of Starvation

Hallucinations are the number one barrier cited by teams trying to integrate AI into their products. The common assumption is that the model is randomly failing or "making things up." While models are probabilistic, the primary driver of hallucinations is something entirely within the builder's control: a lack of context.

Vikash uses a vivid analogy: when you give a model a vague prompt, you're essentially putting a "gun to its head" and forcing it to generate the next word, whether it has the necessary information or not. Since it's programmed to complete the text, it will guess based on its training data, leading to a confident-sounding but factually incorrect response.

When a user provides a vague prompt like, "give me a strategy of company XYZ," without any details about the company, the model is forced to invent a plausible-sounding strategy. The fault isn't the model's imagination; it's the prompt's vacuum. You've given it nothing to work with.

"...providing the least amount of context is the number one reason why the model's hallucinating."

This reframes the problem entirely. It shifts the responsibility from "the model is broken" to "how can I provide better context?" The solution lies not in finding a perfect model, but in architecting a system that feeds the model the information it needs through better prompts, providing examples (few-shot learning), or retrieving relevant data (a technique known as Retrieval-Augmented Generation, or RAG, which feeds the model external information to ground its answers).
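As a rough illustration of the RAG idea, the sketch below retrieves the most relevant snippets before asking the question, so the model has something concrete to ground its answer in. It is deliberately simplified: real systems use vector embeddings and a proper search index, whereas this toy retriever just counts keyword overlap.

```python
# A toy illustration of retrieval-augmented generation: instead of asking the
# model to answer from memory, retrieve relevant text first and put it in the
# prompt. Keyword overlap stands in for a real embedding-based search.

def score(query: str, doc: str) -> int:
    # Count shared words between query and document (crude relevance proxy).
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

knowledge_base = [
    "Company XYZ's 2024 strategy focuses on expanding into enterprise sales.",
    "Company XYZ launched a self-serve tier in 2023.",
    "Unrelated note about an office relocation.",
]

query = "What is the strategy of company XYZ?"
grounding = retrieve(query, knowledge_base)

prompt = (
    "Answer using only the context below. If the context is insufficient, "
    "say so instead of guessing.\n\n"
    "Context:\n" + "\n".join(f"- {doc}" for doc in grounding) + "\n\n"
    f"Question: {query}"
)
print(prompt)
```

The instruction to admit insufficiency matters as much as the retrieved text: it gives the model a way out other than guessing.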

3. You Have to Explicitly Tell Your AI How to Think

How do you solve a complex problem? You don't jump straight to the answer. You break it down, work through the intermediate steps, and build toward a conclusion. It turns out you can ask an LLM to do the same thing, with a dramatic impact on the quality of its reasoning.

This technique is called "Chain of Thought." It’s the simple act of prompting the model to generate its intermediate reasoning steps before giving a final answer. Instead of asking for an immediate output, you ask it to "think out loud."

Vikash illustrates this with a riddle. When a model is asked to solve a complex riddle directly, it often fails. However, by simply adding the phrase "solve this step by step" to the end of the prompt, the model's performance improves significantly. It first breaks down the riddle's logic, analyzes each component, and then uses that reasoning to arrive at the correct answer. The trade-off is that this process uses more tokens and therefore costs more, but the gain in output quality and reliability is often well worth the investment.
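Here is what that change looks like as a small illustrative sketch. The riddle and prompt wording are stand-ins rather than the exact ones from the session; the only difference between the two prompts is the instruction to reason step by step before answering.

```python
# A minimal sketch of the Chain of Thought prompt change described above.
# Same question, two prompts: one asks for an immediate answer, the other
# asks the model to lay out its reasoning first.

riddle = (
    "A farmer has 17 sheep. All but 9 run away. "
    "How many sheep does the farmer have left?"
)

direct_prompt = f"{riddle}\nAnswer:"

cot_prompt = (
    f"{riddle}\n"
    "Solve this step by step: restate what is being asked, work through "
    "the logic, and only then give the final answer on its own line."
)

# The second prompt produces a longer (and costlier) response, but the
# intermediate steps are visible and can be checked.
print(direct_prompt)
print(cot_prompt)
```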

This technique is powerful because it transforms the LLM from an opaque black box into a transparent reasoning partner. It gives you visibility into how it reached a conclusion, making the output more trustworthy and easier to debug. Chain of Thought is just the beginning; experts are now exploring even more powerful reasoning structures like "tree of thought" to solve increasingly complex problems.

4. Stop Measuring Model Accuracy, Start Measuring User Success

In the race to build AI products, it’s easy to get fixated on academic metrics for model quality like BLEU or ROUGE scores. These benchmarks measure things like grammatical correctness and semantic similarity, but they are dangerously insufficient for building a great product. A model can score perfectly on these tests and still deliver a completely useless experience.

Vikash shared a perfect anecdote: an "AI trip planner." A user asks it to plan a one-day trip to Los Angeles. The AI generates a polished itinerary: Santa Monica Pier at 9 AM, Disneyland at 11 AM, and the Hollywood Walk of Fame at 2 PM. On paper, the output looks accurate. The places are real, and the formatting is clean. But for anyone who knows Los Angeles, the plan is a joke. It completely ignores the reality of LA traffic, which makes traveling between those locations in that timeframe impossible. The plan is accurate but useless.

This is a classic case of a model being accurate but not useful. To prevent it, Vikash proposes a product-centric evaluation framework he calls the "3Rs" (sketched in code just after this list):

  • Reality: Is the output factually correct and grounded? (e.g., Does Santa Monica Pier actually exist?)
  • Relevance: Does the response make sense for the user's specific context? (e.g., A plan for a business trip should look different from one for a family vacation; does a cross-town drive during LA rush hour make sense?)
  • Result: Does the output help the user achieve their goal and the product's business goal? (e.g., Does the itinerary lead to a booked tour or flight?)
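Here is a rough sketch of how the 3Rs might be recorded per response, rather than collapsed into a single accuracy score. The checks are hypothetical placeholders; in practice each R would combine automated checks (fact lookups, constraint checks) with human review.

```python
# A rough sketch of the "3Rs" as a per-response evaluation record.
# The fields are judgments, however you choose to produce them.

from dataclasses import dataclass

@dataclass
class ThreeRsEval:
    reality: bool     # Is every claim factually correct and grounded?
    relevance: bool   # Does it fit this user's actual context and constraints?
    result: bool      # Does it move the user (and the product) toward the goal?

    @property
    def ships(self) -> bool:
        # An output only passes if all three hold; accuracy alone isn't enough.
        return self.reality and self.relevance and self.result

# The LA itinerary from the anecdote: real places, clean formatting...
la_trip_plan = ThreeRsEval(
    reality=True,      # Santa Monica Pier, Disneyland, the Walk of Fame exist
    relevance=False,   # ...but the schedule ignores LA traffic entirely
    result=False,      # so the user can't actually follow it or book anything
)
print(la_trip_plan.ships)  # False: accurate, but not useful
```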

This approach forces a crucial shift in perspective. You stop asking, "Is the model's answer correct?" and start asking, "Does this output solve the user's problem?"

Because you aren't shipping a model. You're shipping a user experience powered by one.

5. A "Perfect" AI Can Still Build a Disastrous Product

As we build more complex systems with multiple AI agents working together, we encounter a new, insidious type of failure. You can have a system where every individual AI component works perfectly, yet the product as a whole fails catastrophically. The problem isn't hallucination; it's a lack of system-level reliability.

Vikash detailed a case study of "Helix," a multi-agent system designed to send daily competitive intelligence briefings to a company's leadership. The system had several agents: a Retriever to pull data, a Summarizer to condense it, a Formatter to style the report, and a Delivery agent to send it.

One night, a vector database that fed the Retriever silently failed, causing it to fetch only 70% of the required documents. Here’s the domino effect:

  1. The Retriever agent, unaware of the database issue, passed on the incomplete data, assuming it was all that was available.
  2. The Summarizer agent, working "perfectly," confidently summarized the incomplete information.
  3. The Formatter and Delivery agents followed suit, also working "perfectly."

Crucially, there was no hallucination involved. The models all worked as intended, but they were misguided by missing context from a broken upstream component. The result? The leadership team received a confidently written but dangerously misleading report. No single agent hallucinated or failed. The system failed.

AI product reliability depends on the entire system pipeline, not just the performance of an isolated model. Building a reliable product requires robust system design. This includes establishing "context as a contract" between components—an explicit agreement where each agent verifies the integrity and completeness of the data it receives before acting on it—and designing systems that "fail loud" when a part breaks, rather than passing corrupted data downstream. While AI itself is probabilistic, the product built upon it can and must be made reliable through thoughtful engineering.
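As a minimal sketch of what "context as a contract" and "fail loud" could look like, consider the pipeline below. The types and agent functions are illustrative, not Helix's actual implementation; the point is that each stage verifies the completeness of what it receives and halts loudly instead of passing degraded data downstream.

```python
# Illustrative sketch: each hand-off carries an explicit contract
# (how much data was promised), and downstream stages verify it.

from dataclasses import dataclass

@dataclass
class RetrievalBatch:
    documents: list[str]
    expected_count: int   # the contract: how many documents were promised

    def verify(self) -> None:
        # Fail loud: a partial batch stops the pipeline instead of becoming
        # a confidently written but misleading briefing.
        if len(self.documents) < self.expected_count:
            raise RuntimeError(
                f"Retriever contract broken: got {len(self.documents)} of "
                f"{self.expected_count} documents. Halting the briefing."
            )

def summarize(batch: RetrievalBatch) -> str:
    batch.verify()  # downstream agents re-check the contract before acting
    return f"Summary of {len(batch.documents)} documents..."

# Simulate the silent vector-DB failure: only 70% of documents come back.
degraded = RetrievalBatch(documents=["doc"] * 7, expected_count=10)
try:
    summarize(degraded)
except RuntimeError as err:
    print(err)   # the failure is visible, not a polished wrong report
```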

Conclusion: It's Not Magic, It's Engineering

The journey to building a successful AI product is less about finding a magical, all-knowing model and more about the thoughtful, deliberate engineering of the systems around it. Success comes from mastering the context you provide, focusing on the user's real-world goals, and building robust, resilient end-to-end systems.

As you apply these truths, keep a final, provocative idea in mind. We spend our time designing interfaces for humans, but what happens when, as Rungta predicts, in the next two to three years, "90% of your users are going to be non-humans"—meaning other AI agents acting on our behalf? This next evolution will require a shift from designing user interfaces to designing robust APIs for Agent-to-Agent (A2A) interaction and a new focus on what could be called "Generative Engine Optimization"—ensuring your product is the one chosen by your users' autonomous agents.

