Every conversation with a mid-market Indian D2C brand eventually arrives at the same question. It sounds innocuous, but it’s the one that decides whether the deal happens.
“Where does our customer data actually live?”
A year ago the honest answer was usually: probably a Virginia data centre, routed through a load balancer in Singapore, with some of it cached on a CDN in Frankfurt, and yes, technically, it touches OpenAI’s infrastructure when we run the classifier. That answer used to be acceptable. It is rapidly becoming less so.
The shift we’re watching
Three things are happening at once, and they all point the same direction.
One: Indian data protection law is getting teeth. The DPDP Act’s cross-border rules are now being enforced with actual audit letters. For a certain category of business — finance-adjacent, health-adjacent, anything touching minors — “it’s stored in the US” isn’t a sentence you want in a compliance filing anymore.
Two: enterprise procurement is getting sophisticated. We had three conversations in the last month where the buyer sent us a four-page data residency questionnaire before they’d take a demo call. Twelve months ago, that questionnaire would have come from an airline. Last month, it came from a direct-to-consumer fashion brand doing ₹40 crore in revenue.
Three: model hosting is getting cheaper and better. The gap between “cloud AI” and “self-hosted AI” has collapsed over the last eighteen months. A single H100 in your own rack will run Llama 3 70B at a cost-per-token that’s competitive with the hyperscaler rate, and Qwen 2.5 at the 32B tier is genuinely good enough for most structured-extraction tasks that used to require GPT-4.
Put these together and the pitch changes. You can no longer sell “AI in the cloud” as the only option. You need a story for the customer who wants the capability but not the exposure.
What on-prem actually means for us
When we say AutoCom has an on-prem mode, we don’t mean we’ll ship you a Docker Compose file and a README that says “good luck.” We mean:
- A complete deployment bundle. Everything runs on your infrastructure — Postgres, the job queue, the model server, the dashboard. No external calls leave your network unless you configure them to.
- Air-gapped model inference. Classification, extraction, the support agent’s reasoning — all handled by a local model server. We ship with Qwen 2.5 32B as the default; larger models are plug-and-play if you have the hardware.
- Bring-your-own LLM. If you already pay for Azure OpenAI, or you have an enterprise Anthropic contract, you can route there instead. The shape of the integration is the same; only the endpoint differs.
- Offline licensing. Your instance can run for weeks without ever phoning home. The license check happens on a schedule you control, not ours.
This is the opposite of the hyperscaler pitch, which is “pay per token, pay forever, and don’t worry about the infrastructure.” The on-prem pitch is “pay for the capability, pay once for the hardware, and control exactly what leaves the building.”
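The bring-your-own-LLM point is easiest to see in code. Here is a minimal sketch of what “same integration shape, different endpoint” looks like, assuming every backend speaks an OpenAI-compatible chat-completions API with bearer-token auth. The backend names, URLs, and env-var names are illustrative, not AutoCom’s actual configuration (and real Azure and Anthropic auth differs in the details):

```python
import os

# Hypothetical backend table — every entry is illustrative.
BACKENDS = {
    "local": {"base_url": "http://llm.internal:8000/v1", "key_env": "LOCAL_LLM_KEY"},
    "azure": {"base_url": "https://myco.openai.azure.com/v1", "key_env": "AZURE_OPENAI_KEY"},
    "anthropic": {"base_url": "https://api.anthropic.com/v1", "key_env": "ANTHROPIC_API_KEY"},
}

def chat_request(backend: str, model: str, prompt: str) -> dict:
    """Build one chat-completion request.

    The request body is identical for every backend; only the base URL
    and the credential change. That is the whole point of the design.
    """
    cfg = BACKENDS[backend]
    return {
        "url": cfg["base_url"] + "/chat/completions",
        "headers": {"Authorization": f"Bearer {os.environ.get(cfg['key_env'], '')}"},
        "json": {"model": model, "messages": [{"role": "user", "content": prompt}]},
    }
```

Swapping from the local model server to a hosted one is a one-line config change, not an integration rewrite.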
The argument against (and why it’s weaker than people think)
The standard objection: “on-prem AI is too hard, too expensive, and you lose the model improvement curve.” Let me take those one at a time.
Too hard. Not really. A single server with a modern GPU runs a 32B-parameter model comfortably. The ops overhead is roughly the same as running a Postgres cluster, which your team already knows how to do. What you’re buying is not exotic — you’re buying a machine that runs one service, and the service happens to do inference instead of serving SQL.
Too expensive. This depends on volume, but the break-even point is lower than most people think. A modest GPU purchase amortises over 18 months at the cost of a mid-volume OpenAI bill. If you process a lot of structured tasks (product classification, address parsing, intent detection on customer messages), the math flips in favour of self-hosting surprisingly fast.
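The break-even claim is just arithmetic, so it’s worth writing down. A back-of-envelope sketch, with every number illustrative (your hardware price, blended cloud rate, and ops cost will differ):

```python
# Back-of-envelope break-even: at what monthly token volume does a
# one-time GPU purchase beat a per-token cloud bill?
# Every constant below is an illustrative placeholder.

GPU_COST = 30_000.0          # USD: one H100-class card plus server
AMORTISATION_MONTHS = 18     # matches the 18-month amortisation above
OPS_PER_MONTH = 500.0        # power, rack space, a slice of an engineer
CLOUD_PRICE_PER_MTOK = 5.0   # USD per million tokens, blended in/out

def self_hosted_monthly() -> float:
    """Effective monthly cost of the self-hosted box."""
    return GPU_COST / AMORTISATION_MONTHS + OPS_PER_MONTH

def breakeven_mtok_per_month() -> float:
    """Monthly volume (millions of tokens) where the two bills are equal.
    Above this volume, self-hosting is cheaper every month."""
    return self_hosted_monthly() / CLOUD_PRICE_PER_MTOK
```

With these placeholder numbers the crossover lands around 430M tokens a month — well within reach of a brand running classification over every order, message, and catalogue item.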
You lose the model curve. This used to be true and is now mostly not. The delta between a good open-weights model and the frontier closed model is small for structured-output work — which is 90% of what a real e-commerce automation system actually does. Qwen 2.5 and Llama 3.3 are both good enough that the customer doesn’t feel the difference. We know because we’ve run both sides in production for the same customers and the support agent’s answers are indistinguishable.
The place you still do lose the curve is on research-grade reasoning — the kind of thing that matters when you’re building an autonomous software engineer. That’s a real gap. It’s also not the job AutoCom is doing.
What we learned shipping the first version
Two things were harder than I expected, and one thing was easier.
Harder: packaging. Every customer’s environment is slightly different. Some want Docker, some want bare-metal, some are on OpenShift. We ended up building our own installer that detects the host environment and picks the right set of defaults. We wrote it because none of the existing tools quite fit what we wanted.
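The detection step is the unglamorous heart of that installer. A minimal sketch of the idea, using simple filesystem and PATH heuristics — the profile names and markers here are hypothetical, not what AutoCom’s installer actually emits, and a real installer checks far more than this:

```python
import os
import shutil

def detect_host_environment() -> str:
    """Pick an installer profile from what's visible on the host.

    Heuristics only, in rough order of specificity. All profile names
    are illustrative placeholders.
    """
    if os.environ.get("KUBERNETES_SERVICE_HOST"):
        # Running inside a cluster; OpenShift ships the `oc` CLI.
        return "openshift" if shutil.which("oc") else "kubernetes"
    if os.path.exists("/.dockerenv"):
        # Already inside a container.
        return "docker"
    if shutil.which("docker") or shutil.which("podman"):
        # A container runtime is available on the host.
        return "docker-compose"
    return "bare-metal"
```

Each profile then selects its own set of defaults (volume mounts, service manager, network policy) instead of forcing one layout on every customer.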
Harder: model distribution. A 32B model is 60+ GB on disk. Shipping that to a customer in a Tier-2 city over a patchy connection is a real problem. We now have a mirror in Mumbai and a checksummed resume-capable download script. This is unglamorous engineering that nobody puts in the sales deck, and it’s about a third of what made the project actually work.
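The two mechanisms that script relies on are both standard: HTTP Range requests to resume a partial transfer, and a SHA-256 check against a published digest before the model is ever loaded. A self-contained sketch of both, not AutoCom’s actual script:

```python
import hashlib
import os
import urllib.request

def resume_download(url: str, dest: str, chunk: int = 1 << 20) -> None:
    """Append the missing tail of a file using an HTTP Range request.

    If `dest` already holds the first N bytes, ask the server for
    bytes N onward and append, so a dropped connection costs nothing.
    """
    start = os.path.getsize(dest) if os.path.exists(dest) else 0
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-"})
    with urllib.request.urlopen(req) as resp, open(dest, "ab") as out:
        while block := resp.read(chunk):
            out.write(block)

def verify_sha256(path: str, expected: str, chunk: int = 1 << 20) -> bool:
    """Compare the finished file against a published checksum."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest() == expected
```

The wrapper around these loops until the checksum matches, which is what turns a 60 GB transfer over an unreliable link from a support ticket into a non-event.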
Easier: inference performance. Once the model is loaded and warm, it’s fast. Faster than the cloud round-trip, actually, because the customer’s request doesn’t leave their data centre. This was a pleasant surprise that we now use in the pitch: not only is your data more private, it’s also slower to nobody.
Where this is going
I think two things are true in parallel for the next couple of years:
- The default choice for most new products will remain cloud-hosted closed-model AI. It’s easier, it’s cheaper at low volume, and for 80% of use cases the privacy trade-off isn’t material.
- For the remaining 20% — anyone doing regulated work, anyone building a product where “where does the data go” is a question the buyer will ask, anyone operating in a jurisdiction with teeth — on-prem goes from “weird thing a few paranoid enterprises ask for” to “table-stakes differentiator.”
We’re betting AutoCom lands in the second bucket for a meaningful number of customers. If I’m wrong, we’ve built a deployment path that nobody uses. If I’m right, we’ve built the only one that matters.
I’d rather be wrong and build it than be right and not have it ready.