What a Scan Can and Cannot Tell You

This is the analysis piece that sits on top of two earlier studies: exposed vector databases (the storage layer) and exposed AI observability tools (the conversation layer). Here I put both datasets in BigQuery and ask the same question of them.

I have now measured two different layers of the AI stack the same way. First the vector databases, the place an app keeps the knowledge it was given. Then the observability tools, the place an app logs the live conversation, the prompts and the responses. Two layers, same method, two clean datasets. This post is not another exposure number. It is the question underneath both of them: when you find a population of services like this, how much can you actually classify about them from scan data alone, and where does scan data run out of road?

That question is the whole job in a lot of internet measurement work, so I wanted to answer it with the data in front of me rather than wave at it. I loaded both populations into BigQuery and looked at three things you might want to know about any exposed service. Each one behaves differently, and comparing the two layers is where it gets interesting.

The short version

On method, up front. Both datasets were collected by me, from my own infrastructure, and analyzed in a personal BigQuery sandbox. I read only what each service freely advertises (a collection list, a config flag, a version) and for posture I recorded HTTP status codes only. I never read a stored vector, a document, or a single trace. Everything here is rounded and aggregated, with no addresses, names, or contents. The counts are floors, bounded by what was indexed and the credits I spent. The full methods live in the two studies linked above.

The two populations

| Layer | Tool measured | Reachable | Open / no auth |
| --- | --- | --- | --- |
| Storage (the knowledge) | Qdrant | 5,918 | 73.5% |
| Conversation log (the transcript) | Phoenix | 259 | 71.0% |

Different sizes, same posture. The storage layer is a much larger population, but both land in the same place on the only number most write-ups bother with. That is exactly why the more useful framing is not how many are open, but what you can and cannot learn about them once you find them.

Classification one: what is it? Nearly free.

The first thing you want to know is what a host actually is. On both layers this is essentially solved by the scan data you already have. A port plus a product fingerprint identifies the service cleanly, with almost no false positives. The vector databases announce themselves on their API port, and the observability instances serve a page that names the product outright. Nothing interesting to report here, which is the point. Service identification is the part existing data is good at, so the real work is the two questions it is bad at.

Classification two: open or locked? Invisible, then not even binary.

The second thing you want to know is the security posture, and this is where scan data alone falls short. Knowing the port and the product tells you it is a vector database or an observability tool. It does not tell you whether the front door is locked. The discriminating signal is an active one: for the storage layer, whether the listing endpoint answers; for the conversation layer, a flag the app injects into its own page that states whether authentication is on. One extra request per host, and posture goes from a guess to a measurement.

73.5% / 71% open with no authentication, storage layer and conversation layer respectively. Neither number exists in the banner.
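For the storage layer, the one extra request reduces to mapping a status code to a label. A minimal sketch, assuming the listing endpoint's status code and the number of advertised collections were recorded per host. The function name, the label set, and the reading of 'empty' as "answered but listed zero collections" are assumptions for illustration, not the study's definitions.

```python
from typing import Optional


def storage_posture(status_code: Optional[int],
                    collection_count: Optional[int] = None) -> str:
    """Turn one recorded probe result into a posture label.

    status_code: HTTP status of the unauthenticated listing request,
                 or None if the host never answered.
    collection_count: how many collection names the listing advertised,
                      if it answered at all.
    """
    if status_code is None:
        return "unreachable"
    if status_code in (401, 403):
        # The listing endpoint refused the request.
        return "locked"
    if 200 <= status_code < 300:
        # Assumption: 'empty' means the listing answered with no collections.
        return "empty" if collection_count == 0 else "open"
    return "other"


if __name__ == "__main__":
    # Made-up probe results, for illustration only.
    for probe in [(200, 4), (200, 0), (401, None), (None, None), (500, None)]:
        print(probe, storage_posture(*probe))
```

Nothing here reads stored content; the only inputs are a status code and a count of names the service advertises.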

But the conversation layer taught me something the storage layer could not, because of how I had to measure it. On the storage layer, posture really is close to binary: the listing endpoint either answers or it refuses. On the conversation layer, I could see the auth flag and also check whether the data API answered. And here is the finding I did not expect: of the instances that had authentication turned on, every single one still had a data endpoint that returned a success code to a request with no credentials. Every one.

Authenticated is not the same as secured. I want to be precise, because I held a hard line on not reading anything. I recorded the status code an endpoint returned, nothing more. A success code on an unauthenticated request means the login on the front of the dashboard does not cover the API behind it. I did not read the responses, so I will not tell you what they contained. I am telling you they answered when they should have refused. The practical consequence for classification is real: a tool that labels these instances as open or locked will file them under locked, because auth is configured, and be wrong about every one of them.

That is the kind of thing that only shows up when you treat posture as more than a single bit. The simple open versus locked question gets you the headline. The honest version of the question, does the configured auth actually cover the data paths, gets you a different and more uncomfortable answer.
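A minimal sketch of posture as more than one bit, assuming each conversation-layer record carries two signals: the auth flag the page advertises and the status code the data endpoint returned to a request with no credentials. The function names and labels are hypothetical, and the sample records are made up.

```python
from collections import Counter
from typing import Optional


def classify_posture(auth_enabled: Optional[bool],
                     api_status: Optional[int]) -> str:
    """Combine the advertised auth flag with the data endpoint's status code."""
    if auth_enabled is None:
        return "unknown"
    if not auth_enabled:
        return "open"
    # Auth is configured; the question is whether it covers the data path.
    if api_status is None:
        return "auth_on_unverified"
    if 200 <= api_status < 300:
        return "auth_on_api_answers"
    return "locked"


def binary_label(auth_enabled: bool) -> str:
    """The single-bit view: trusts the auth flag and nothing else."""
    return "locked" if auth_enabled else "open"


# Made-up records for illustration only: (auth_enabled, api_status).
SAMPLE = [(False, 200), (True, 200), (True, 401), (True, None)]

if __name__ == "__main__":
    print(Counter(binary_label(a) for a, _ in SAMPLE))
    print(Counter(classify_posture(a, s) for a, s in SAMPLE))
```

The binary view files every auth-enabled record under locked; the second view splits out the ones whose data endpoint still answered.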

Classification three: what is inside? Two kinds of ceiling.

The third thing you want to know is the most interesting and the hardest. What is actually in these services? On the storage layer the only clue available without reading anything is the collection names, so I treated those as a classification problem and built a keyword classifier over them in BigQuery. The names are an array per host, so you flatten them with UNNEST, bucket each one with a pile of regular expressions, then measure how much of the total you managed to label.

-- Flatten the per-host array of collection names into one row per name.
WITH names AS (
  SELECT LOWER(c) AS name
  FROM qdrant_hosts, UNNEST(collection_names) AS c
  WHERE status IN ('open', 'empty')
),
-- Bucket each name by the first pattern it matches (CASE stops at the first hit).
-- Patterns are unanchored, so a short token such as 'rag' also matches inside longer words.
classified AS (
  SELECT CASE
    WHEN REGEXP_CONTAINS(name, r'(patient|medical|clinic|health)')  THEN 'medical'
    WHEN REGEXP_CONTAINS(name, r'(legal|court|lawsuit|contract)')   THEN 'legal'
    WHEN REGEXP_CONTAINS(name, r'(bank|invoice|kyc|payment)')       THEN 'finance'
    WHEN REGEXP_CONTAINS(name, r'(knowledge|docs|document|rag)')    THEN 'knowledge_base'
    WHEN REGEXP_CONTAINS(name, r'(mem0|memory|agent|conversation)') THEN 'agent_memory'
    -- ... more buckets ...
    ELSE 'uncategorized'
  END AS category
  FROM names
)
-- Share of all names per bucket; the window SUM over the grouped counts is the grand total.
SELECT category, COUNT(*) AS names,
  ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct
FROM classified GROUP BY category ORDER BY names DESC;

After a real effort, including going back into the pile I had failed to label and expanding the buckets, the classifier topped out at about 36 percent of names. The rest is a long tail of project-specific labels that resist any keyword rule. When I sorted the unlabeled names by frequency, the most common one accounted for a fraction of a percent of the remainder and fell off a cliff from there into thousands of unique one-offs. That is a methodological ceiling. More scanning does not lift it, because the signal that would actually tell you what is in a collection lives in the data, not the name.
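The two numbers in that paragraph, overall coverage and how much of the unlabeled remainder the most common name explains, can be computed in a few lines. A minimal sketch, assuming a flat list of collection names is already in hand. The bucket list mirrors only the five patterns shown in the query, the helper names are hypothetical, and the sample names are made up.

```python
import re
from collections import Counter

# Same keyword buckets as the query above; the full list in the study is longer.
BUCKETS = [
    ("medical", re.compile(r"patient|medical|clinic|health")),
    ("legal", re.compile(r"legal|court|lawsuit|contract")),
    ("finance", re.compile(r"bank|invoice|kyc|payment")),
    ("knowledge_base", re.compile(r"knowledge|docs|document|rag")),
    ("agent_memory", re.compile(r"mem0|memory|agent|conversation")),
]


def classify(name: str) -> str:
    """Return the first bucket whose pattern appears anywhere in the name."""
    lowered = name.lower()
    for category, pattern in BUCKETS:
        if pattern.search(lowered):
            return category
    return "uncategorized"


def coverage_report(names: list) -> dict:
    """Measure labeled share and the concentration of the unlabeled tail."""
    unlabeled = Counter(
        n.lower() for n in names if classify(n) == "uncategorized"
    )
    total = len(names)
    remainder = sum(unlabeled.values())
    top_name, top_count = unlabeled.most_common(1)[0] if unlabeled else (None, 0)
    return {
        "coverage_pct": round(100.0 * (total - remainder) / total, 1) if total else 0.0,
        "unique_unlabeled": len(unlabeled),
        "top_unlabeled": top_name,
        "top_share_of_remainder_pct": (
            round(100.0 * top_count / remainder, 1) if remainder else 0.0
        ),
    }


if __name__ == "__main__":
    # Made-up names, for illustration only.
    print(coverage_report(["patient_notes", "acme_v2", "acme_v2", "proj_x", "docs_main"]))
```

A small top share combined with a large count of unique unlabeled names is the long-tail shape described above: no single new keyword rule buys much coverage.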

The conversation layer has the same question with a different ceiling, and this one is not methodological; it is a line I chose not to cross. The content of an observability instance is the trace data, which is to say the actual prompts people typed and the actual responses the model sent back. You could read it. I would not, and did not. So on that layer the content question is one I deliberately leave unanswered, because answering it means reading other people's conversations, which is exactly the harm the whole project is trying to get people to fix.

So the content ceiling comes in two flavors. On the storage layer it is a limit of keyword classification: roughly a third labeled and the rest opaque. On the conversation layer it is an ethical limit: the content is readable, but reading it is the thing you must not do. Either way, scanning harder does not get you past it.

The cross-layer wrinkle: hosting flipped

The most surprising result came from joining each population back to its hosting provider and comparing open rates. On the storage layer the pattern was the one you would expect: budget VPS providers skewed heavily open and the big clouds came in noticeably lower, because the large platforms put a firewall in front of you by default and a cheap box drops you straight onto a public address. The default network posture was quietly protecting careless operators.

On the conversation layer that protection mostly vanished, and the big clouds were among the most open rather than the most locked down. Same providers, opposite result. My best read is that an observability tool is a developer and notebook tool, so people deliberately open the port to reach the dashboard, and a default firewall does not save you from a port you opened on purpose. A vector database is more often something that got stood up and exposed by accident, where the default posture still mattered. Different deployment habit, different outcome. I would rather flag the reversal than pretend one tidy rule covers both layers. The per-provider counts on the smaller population are modest, so I am describing a direction, not pinning exact percentages.
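One way to see why modest per-provider counts support a direction but not exact percentages is to put an interval around each open rate. A minimal sketch, assuming rows of (provider class, is open) have already been joined. The Wilson score interval is my choice for illustration, not something the studies report, and the provider labels and rows are made up.

```python
import math
from collections import defaultdict


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def open_rate_by_provider(rows) -> dict:
    """rows: iterable of (provider_class, is_open) pairs, one per host."""
    counts = defaultdict(lambda: [0, 0])  # provider -> [open, total]
    for provider, is_open in rows:
        counts[provider][1] += 1
        if is_open:
            counts[provider][0] += 1
    report = {}
    for provider, (open_n, n) in counts.items():
        low, high = wilson_interval(open_n, n)
        report[provider] = {"n": n, "open_rate": open_n / n, "low": low, "high": high}
    return report


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    rows = [("big_cloud", True)] * 8 + [("big_cloud", False)] * 2
    rows += [("budget_vps", True)] * 5 + [("budget_vps", False)] * 5
    for provider, stats in open_rate_by_provider(rows).items():
        print(provider, stats)
```

With only a handful of hosts per provider the intervals are wide and tend to overlap, which is the reason to report the reversal as a direction.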

The three questions, both layers, in one place

| What you want to know | Storage layer | Conversation layer |
| --- | --- | --- |
| What is it? | Free, from the banner | Free, from the banner |
| Open or locked? | 73.5% open | 71% open, and binary mislabels the rest |
| What is inside? | ~36% labeled, rest opaque | Deliberately not read |
| Does hosting predict posture? | Yes, clouds safer | Reversed, clouds more open |

Why this is the part worth writing down

The open rate is the number that travels, but it is the least interesting thing here. The useful result is the map. Identifying a service is free. Classifying its posture needs an active probe and, on at least one layer, needs you to stop pretending posture is a single bit. Classifying its contents hits a wall that is methodological in one place and ethical in another. And a pattern that looks like a law on one layer can reverse on the next, which is a good reminder to measure each layer rather than assume the last one generalizes.

Both datasets are mine, both stayed read-only, and both are reproducible from the methods in the linked studies. If you want to do this on your own collected data, the queries are simple enough to paste into any sandbox. The honest summary is that scan data tells you a lot, tells you nothing about some things until you ask one more question, and tells you nothing at all about the things you should not be reading in the first place.

The two studies this analysis is built on: The AI Remembers Everything (vector databases, the storage layer) and The AI Writes Down Everything You Say (observability tools, the conversation layer).
