The AI Writes Down Everything You Say, and Most of It Is Wide Open

Part of a series measuring layers of the AI stack. See also the exposed vector databases study and the coverage analysis that puts both datasets in BigQuery.

Last time I went looking for exposed vector databases, the layer where AI systems keep the knowledge they were given. This time I went one layer over, to the place that records what people actually said. When you build an AI app and want to debug it, you reach for an observability tool that traces every call: the prompt that went in, the model response that came back, the retrieved context, the tool calls, the whole conversation. That trace data is not the stored knowledge behind the app. It is the live transcript of people using it.

Everyone has been counting the model servers and the vector stores. The layer that logs the conversation itself is newer, and as far as I can tell nobody had measured how exposed it is. So I did. I focused on one popular open-source tool in this space, Arize Phoenix, found the instances facing the internet, and checked whether they were locked. Most were not.

The short version

On method, and the line I held. I found these with Censys using a fingerprint specific to the app, and I checked posture by reading the configuration block that Phoenix itself injects into the page it serves to every visitor. To detect the cases where auth was on but the API still answered, I recorded only the HTTP status codes of the data endpoints. I never read a single trace, a single prompt, or a single response. A status code tells you whether a door is locked. It does not open the door. There are no addresses, names, or contents anywhere in this post. The counts are a floor, bounded by what was indexed and the one standard port I focused on.

Why this layer is the uncomfortable one

A vector database holds the documents an app was given to reason over. Bad enough when it is open, and last time I found a lot of it open. But an observability tool sits in the live path and records the running conversation. If an app handles medical questions, the traces hold the medical questions and the answers. If it handles support tickets, the traces hold the tickets. The stored knowledge layer is the filing cabinet. The trace layer is the wiretap, and it is recording on behalf of the people who built the app, for their own debugging, which means it tends to capture everything in plain readable form.

That is what makes the exposure rate matter more here than it did for the databases. It is the same shape of mistake, a tool that is easy to stand up and easy to leave open, but the data behind it is closer to the user.

The number

71% of reachable Phoenix instances had no authentication at all

Out of 259 instances that answered, 184 had authentication turned off completely. The dashboard loaded, the project list was there, the trace API responded, no password anywhere. That is the headline, and it lands almost exactly where the open vector databases did, around seven in ten.

How I read posture without reading anything

Here is the part I liked. Phoenix is a single page app, and when it loads it injects its own configuration into the page for the browser to read before the app starts. That config is served to anyone who visits, and it states plainly whether authentication is enabled:

// served in the root page of every Phoenix instance
value: Object.freeze({
    platformVersion: "16.3.0",
    authenticationEnabled: Boolean("False" == "True"),  // False == no auth
    ...
})

So posture is right there in the page the app hands out for free. No need to touch any data endpoint to know whether a given instance is open. I read that flag and the version string, nothing else from the body, across the whole population. It is the cleanest possible signal, the app telling you its own lock state, and reading it accesses none of the stored traces.
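For readers who want to see the mechanics, here is a minimal sketch of that read in Python. It assumes the config block has exactly the shape shown above; the function name, the regular expressions, and the sample string are mine for illustration, not anything Phoenix ships, and other builds may format the block differently. It works on a page body you already have in hand and makes no network requests.

```python
import re
from typing import NamedTuple, Optional


class Posture(NamedTuple):
    version: Optional[str]        # platformVersion string, or None if absent
    auth_enabled: Optional[bool]  # None means the flag could not be read


# Assumed patterns, based only on the snippet shown above.
_VERSION_RE = re.compile(r'platformVersion:\s*"([^"]*)"')
_AUTH_RE = re.compile(
    r'authenticationEnabled:\s*Boolean\(\s*"([^"]*)"\s*==\s*"True"\s*\)'
)


def parse_posture(html: str) -> Posture:
    """Extract the version string and the auth flag from a root page body."""
    version = _VERSION_RE.search(html)
    auth = _AUTH_RE.search(html)
    return Posture(
        version=version.group(1) if version else None,
        # The page compares a string against "True"; mirror that comparison.
        auth_enabled=(auth.group(1) == "True") if auth else None,
    )


# Sample body mirroring the snippet above (not a real instance).
SAMPLE = '''
value: Object.freeze({
    platformVersion: "16.3.0",
    authenticationEnabled: Boolean("False" == "True"),
})
'''

print(parse_posture(SAMPLE))
```

An instance where neither pattern matches comes back with both fields set to None, which is how I would bucket the "auth state unreadable" cases.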

The posture breakdown

Posture | Instances | Share of reachable
--- | --- | ---
No auth at all (wide open) | 184 | 71.0%
Auth enabled, but API still answered | 71 | 27.4%
Fully locked down | 0 | 0%
Auth state unreadable | 4 | 1.5%
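The shares are plain division over the 259 instances that answered. A quick check of the arithmetic, using only the counts from the breakdown (the variable names are mine):

```python
# Counts taken directly from the posture breakdown.
counts = {
    "No auth at all (wide open)": 184,
    "Auth enabled, but API still answered": 71,
    "Fully locked down": 0,
    "Auth state unreadable": 4,
}

total = sum(counts.values())  # every reachable instance falls in one bucket

# Share of reachable instances, as a percentage rounded to one decimal.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label}: {n} of {total} = {shares[label]}%")
```

The four buckets sum to 259, so nothing reachable is left uncounted.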

The part that surprised me: the auth that did not

The 71 percent open figure is the headline, but the row under it is the more interesting one. Of the reachable instances, 71 had authentication switched on. Good for them. Except every one of those 71 still had at least one data endpoint that returned a success response to a request with no credentials attached. Seventy-one out of seventy-one.

I want to be careful and precise about what that means, because I held a hard line on not reading anything. I recorded the status code the endpoint returned, and nothing more. A success code on an unauthenticated request means the protection does not fully cover the API: the login on the front of the dashboard is not in front of everything behind it. I did not read the responses, so I am not going to tell you what those endpoints returned. I am telling you they answered when they should have refused. That gap, between auth that is configured and auth that actually covers the data paths, is the finding I did not expect, and it means the count of instances that are meaningfully protected is closer to zero than to the number of operators who thought they had turned security on.
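The bucketing logic is simple enough to write down. This is a sketch of how the four postures fall out of two inputs, the auth flag from the page and the status codes from unauthenticated requests to the data endpoints. It assumes "success" means any 2xx code; the function names and labels are illustrative, and the code never looks at a response body.

```python
from typing import Iterable, Optional


def is_success(status: int) -> bool:
    """Treat any 2xx code as 'the endpoint answered' (an assumption)."""
    return 200 <= status < 300


def classify(auth_enabled: Optional[bool], data_statuses: Iterable[int]) -> str:
    """Bucket one instance from its auth flag and endpoint status codes.

    auth_enabled is None when the flag could not be read from the page.
    data_statuses holds only status codes, never response contents.
    """
    if auth_enabled is None:
        return "auth state unreadable"
    if not auth_enabled:
        return "no auth at all"
    if any(is_success(s) for s in data_statuses):
        return "auth enabled, but API still answered"
    return "fully locked down"


print(classify(False, []))          # flag alone settles it
print(classify(True, [401, 200]))   # one unauthenticated success is enough
print(classify(True, [401, 403]))   # every data path refused
```

Note that a single success code among several refusals is enough to move an instance out of the locked-down bucket, which matches the "at least one data endpoint" wording above.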

Where they live

I joined the instances back to their hosting and looked at open rates by provider. The short version is that open is high everywhere, running from roughly half to nearly all depending on the provider, with the per-provider counts small enough that I would not lean hard on any single percentage. The interesting wrinkle is a reversal from the vector database study. There, the big clouds were the most locked down, because their default firewall posture quietly protected careless operators. Here that protection mostly vanished. My best guess is that Phoenix is a developer and notebook tool, so people deliberately open the port to reach the dashboard, and a default firewall does not save you from a port you opened on purpose. Different kind of tool, different deployment habit, opposite result. I would rather flag the reversal than pretend the earlier pattern held.
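To make the small-counts caveat concrete, here is a sketch using a Wilson score interval, which is one standard way to put error bars on a proportion. It is an illustration I am adding, not a method used in the measurement itself. The overall 184 of 259 figure is from this post; the 7 of 10 provider is a made-up example, not a number from the data.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


overall = wilson_interval(184, 259)      # the real headline count
small_provider = wilson_interval(7, 10)  # hypothetical provider with 10 hosts

for name, (low, high) in [("overall", overall), ("small provider", small_provider)]:
    print(f"{name}: {low:.1%} to {high:.1%}")
```

The overall rate is pinned down to within a few points either way, while a provider with ten hosts could plausibly sit anywhere across a range tens of points wide. That is why the per-provider percentages are best read as rough.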

A pile of old versions

One more thing fell out of the data. Across the population I counted 105 distinct versions running, with the single most common one accounting for only a handful of hosts, and ancient builds running right alongside current ones. That is the signature of a lot of stand-it-up-once-and-forget-it deployments, the same long tail of stale installs I saw with the databases. Tools that are easy to launch and easy to abandon accumulate exactly this kind of sediment.
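The version spread is a tally over the version strings read from each page. A minimal sketch of that count follows; the helper name is mine, and the sample list is invented purely to show the shape of the output (only "16.3.0" appears anywhere in this post).

```python
from collections import Counter
from typing import Dict, List, Union


def version_spread(versions: List[str]) -> Dict[str, Union[int, str, float]]:
    """Summarize how fragmented a population of version strings is."""
    if not versions:
        raise ValueError("need at least one version string")
    tally = Counter(versions)
    top_version, top_count = tally.most_common(1)[0]
    return {
        "distinct": len(tally),                  # number of different versions
        "top_version": top_version,              # the single most common one
        "top_share": top_count / len(versions),  # fraction of hosts running it
    }


# Invented sample, NOT measured data.
sample = ["16.3.0", "16.3.0", "4.1.0", "8.2.1", "11.0.0"]
print(version_spread(sample))
```

A healthy, well-maintained population would show a small distinct count and a large top share. The measured population shows the opposite: many distinct versions and a small top share.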

So what

The fix on any single instance is small. Turn authentication on, make sure it actually covers the API and not just the dashboard, put the thing behind a firewall, and do not bind it to a public address. The reason it does not happen is the same as always. Everything works fine from where the developer is sitting, nothing warns them that the whole internet can read the trace log, and the tool was built to be easy to start rather than safe to leave running.

We watched this movie with databases years ago: they were open by default, researchers measured it, and the defaults eventually changed. The AI tooling stack is running through the same cycle now, one layer at a time. I looked at the stored knowledge layer last time. This time it is the layer that records the conversation, and it is even less locked down than the one underneath it. I did not read any of it. I also doubt I was the first to come knocking, and that is the part worth fixing.
