What a Scan Can and Cannot Tell You
I have now measured two different layers of the AI stack the same way. First the vector databases, the place an app keeps the knowledge it was given. Then the observability tools, the place an app logs the live conversation, the prompts and the responses. Two layers, same method, two clean datasets. This post is not another exposure number. It is the question underneath both of them: when you find a population of services like this, how much can you actually classify about them from scan data alone, and where does scan data run out of road?
That question is the whole job in a lot of internet measurement work, so I wanted to answer it with the data in front of me rather than wave at it. I loaded both populations into BigQuery and looked at three things you might want to know about any exposed service. Each one behaves differently, and comparing the two layers is where it gets interesting.
The short version
- Two layers of the AI stack, measured the same way, both came back around 70 percent open: 73.5 percent of reachable vector databases and 71 percent of reachable observability instances had no authentication.
- Identifying the service is nearly free from existing scan data. Posture (open or locked) is invisible to the banner and needs one active probe. Content (what is actually inside) hits a hard ceiling.
- Posture is not even binary. On the observability layer, every instance that had authentication switched on still had an API answering unauthenticated requests, so the usual open versus locked label misclassifies them as secure.
- The hosting pattern flipped between the two layers. The big clouds were the most locked down on the storage layer and the most open on the conversation layer. Same providers, opposite result, for a reason worth explaining.
The two populations
| Layer | Tool measured | Reachable | Open / no auth |
|---|---|---|---|
| Storage (the knowledge) | Qdrant | 5,918 | 73.5% |
| Conversation log (the transcript) | Phoenix | 259 | 71.0% |
Different sizes, same posture. The storage layer is a much larger population, but both land in the same place on the only number most write-ups bother with. That is exactly why the more useful framing is not how many are open, but what you can and cannot learn about them once you find them.
Classification one: what is it? Nearly free.
The first thing you want to know is what a host actually is. On both layers this is essentially solved by the scan data you already have. A port plus a product fingerprint identifies the service cleanly, with almost no false positives. The vector databases announce themselves on their API port, and the observability instances serve a page that names the product outright. Nothing interesting to report here, which is the point. Service identification is the part existing data is good at, so the real work is the two questions it is bad at.
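As a rough illustration of what "a port plus a product fingerprint" means in practice, here is a minimal sketch. The ports are each project's documented default (6333 for Qdrant's REST API, 6006 for Phoenix). The body patterns and the `identify` helper are assumptions made for the example, not the fingerprints used in the studies, and should be checked against real responses.

```python
import re
from typing import Optional

# Hypothetical fingerprints: (port, regex over the response body, product).
# The body patterns are illustrative assumptions, not verified signatures.
FINGERPRINTS = [
    (6333, re.compile(r'"title"\s*:\s*"qdrant', re.IGNORECASE), "qdrant"),
    (6006, re.compile(r"<title>\s*Phoenix", re.IGNORECASE), "phoenix"),
]


def identify(port: int, body: str) -> Optional[str]:
    """Return a product name only when both the port and the body pattern match."""
    for fp_port, pattern, product in FINGERPRINTS:
        if port == fp_port and pattern.search(body):
            return product
    return None
```

Requiring both signals to agree is what keeps false positives low: a matching page on an unexpected port, or the expected port with an unrelated body, stays unlabeled.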
Classification two: open or locked? Invisible, then not even binary.
The second thing you want to know is the security posture, and this is where scan data alone falls short. Knowing the port and the product tells you it is a vector database or an observability tool. It does not tell you whether the front door is locked. The discriminating signal is an active one: for the storage layer, whether the listing endpoint answers; for the conversation layer, a flag the app injects into its own page that states whether authentication is on. One extra request per host, and posture goes from a guess to a measurement.
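For the storage layer, the "one extra request" can be reduced to a small decision over the response to an unauthenticated listing call. The sketch below assumes the listing response is JSON with the names under `result.collections`, and that a locked instance answers 401 or 403. It also assumes the `open` and `empty` labels mean "answers with collections" and "answers with none". The function name is hypothetical, and it only interprets a response that some other code has already fetched.

```python
import json


def classify_storage_posture(status_code: int, body: str) -> str:
    """Label one host from the response to an unauthenticated listing request.

    Returns 'open', 'empty', 'locked', or 'unknown'.
    """
    if status_code in (401, 403):
        return "locked"
    if status_code != 200:
        return "unknown"
    try:
        # Assumed response shape: {"result": {"collections": [...]}}
        collections = json.loads(body)["result"]["collections"]
    except (ValueError, KeyError, TypeError):
        return "unknown"
    if not isinstance(collections, list):
        return "unknown"
    return "open" if collections else "empty"
```

Anything that is not a clear answer or a clear refusal falls into `unknown` rather than being forced into one of the two headline buckets.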
But the conversation layer taught me something the storage layer could not, because of how I had to measure it. On the storage layer, posture really is close to binary: the listing endpoint either answers or it refuses. On the conversation layer, I could see the auth flag and also check whether the data API answered. And here is the finding I did not expect: of the instances that had authentication turned on, every single one still had a data endpoint that returned a success code to a request with no credentials. Every one.
That is the kind of thing that only shows up when you treat posture as more than a single bit. The simple open versus locked question gets you the headline. The honest version of the question, does the configured auth actually cover the data paths, gets you a different and more uncomfortable answer.
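One way to make "more than a single bit" concrete is to record the two observations separately and only then derive a label. This is a minimal sketch under stated assumptions: `auth_enabled` is the flag read from the page, `unauth_api_status` is the status code a data endpoint returned to a request with no credentials, and the label names are made up for the example.

```python
from enum import Enum


class Posture(Enum):
    OPEN = "open"                          # authentication is off
    AUTH_ON_API_OPEN = "auth_on_api_open"  # auth is on, data API still answers
    LOCKED = "locked"                      # auth is on, data API refuses
    UNKNOWN = "unknown"                    # auth is on, API response is ambiguous


def classify_conversation_posture(auth_enabled: bool, unauth_api_status: int) -> Posture:
    """Combine the auth flag with the unauthenticated data-API status code."""
    if not auth_enabled:
        return Posture.OPEN
    if 200 <= unauth_api_status < 300:
        return Posture.AUTH_ON_API_OPEN
    if unauth_api_status in (401, 403):
        return Posture.LOCKED
    return Posture.UNKNOWN
```

A binary label would file `AUTH_ON_API_OPEN` under "locked" because the flag says authentication is on. Keeping it as its own state is what surfaces the mismatch.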
Classification three: what is inside? Two kinds of ceiling.
The third thing you want to know is the most interesting and the hardest. What is actually in these services? On the storage layer the only clue available without reading anything is the collection names, so I treated those as a classification problem and built a keyword classifier over them in BigQuery. The names are an array per host, so you flatten them with UNNEST, bucket each one with a pile of regular expressions, then measure how much of the total you managed to label.
-- Flatten each host's collection-name array into one lowercase name per row.
WITH names AS (
  SELECT LOWER(c) AS name
  FROM qdrant_hosts, UNNEST(collection_names) AS c
  WHERE status IN ('open', 'empty')
),
-- Bucket each name by the first keyword pattern it matches. REGEXP_CONTAINS is
-- a substring match, so a short token like 'rag' also hits names such as 'storage'.
classified AS (
  SELECT CASE
    WHEN REGEXP_CONTAINS(name, r'(patient|medical|clinic|health)') THEN 'medical'
    WHEN REGEXP_CONTAINS(name, r'(legal|court|lawsuit|contract)') THEN 'legal'
    WHEN REGEXP_CONTAINS(name, r'(bank|invoice|kyc|payment)') THEN 'finance'
    WHEN REGEXP_CONTAINS(name, r'(knowledge|docs|document|rag)') THEN 'knowledge_base'
    WHEN REGEXP_CONTAINS(name, r'(mem0|memory|agent|conversation)') THEN 'agent_memory'
    -- ... more buckets ...
    ELSE 'uncategorized'
  END AS category
  FROM names
)
-- Count names per bucket and express each bucket as a share of all names.
SELECT
  category,
  COUNT(*) AS names,
  ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct
FROM classified
GROUP BY category
ORDER BY names DESC;
After a real effort, including going back into the pile I had failed to label and expanding the buckets, the classifier topped out at about 36 percent of names. The rest is a long tail of project-specific labels that resist any keyword rule. When I sorted the unlabeled names by frequency, the most common one accounted for a fraction of a percent of the remainder and fell off a cliff from there into thousands of unique one-offs. That is a methodological ceiling. More scanning does not lift it, because the signal that would actually tell you what is in a collection lives in the data, not the name.
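The two measurements in that paragraph, how much got labeled and how flat the unlabeled tail is, can be sketched outside BigQuery as well. The snippet below mirrors the keyword buckets shown in the query (without the elided ones). The `coverage_report` helper and the sample names are hypothetical and exist only to show the calculation.

```python
import re
from collections import Counter

# Same keyword buckets as the query above; first match wins.
RULES = [
    ("medical", re.compile(r"patient|medical|clinic|health")),
    ("legal", re.compile(r"legal|court|lawsuit|contract")),
    ("finance", re.compile(r"bank|invoice|kyc|payment")),
    ("knowledge_base", re.compile(r"knowledge|docs|document|rag")),
    ("agent_memory", re.compile(r"mem0|memory|agent|conversation")),
]


def categorize(name: str) -> str:
    """Return the first matching bucket for a collection name."""
    lowered = name.lower()
    for category, pattern in RULES:
        if pattern.search(lowered):
            return category
    return "uncategorized"


def coverage_report(names: list[str]) -> dict:
    """Measure labeled coverage and how concentrated the unlabeled tail is."""
    unlabeled = Counter(n.lower() for n in names if categorize(n) == "uncategorized")
    unlabeled_total = sum(unlabeled.values())
    top_name, top_count = unlabeled.most_common(1)[0] if unlabeled else (None, 0)
    return {
        "coverage_pct": round(100.0 * (len(names) - unlabeled_total) / len(names), 1) if names else 0.0,
        "top_unlabeled": top_name,
        "top_unlabeled_share_pct": round(100.0 * top_count / unlabeled_total, 1) if unlabeled_total else 0.0,
    }


# Hypothetical names, for illustration only.
report = coverage_report(["patient_notes", "Contracts_2023", "proj_x7", "proj_x7", "tmp_abc", "docs"])
```

When `top_unlabeled_share_pct` is tiny, as it was in the real data, no single new rule can buy much coverage. That is what makes the ceiling methodological rather than a matter of effort.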
The conversation layer has the same question with a different ceiling, and this one is not methodological, it is a line I chose not to cross. The content of an observability instance is the trace data, which is to say the actual prompts people typed and the actual responses the model sent back. You could read it. I would not, and did not. So on that layer the content question is one I deliberately leave unanswered, because answering it means reading other people's conversations, which is exactly the harm the whole project is trying to get people to fix.
The cross-layer wrinkle: hosting flipped
The most surprising result came from joining each population back to its hosting provider and comparing open rates. On the storage layer the pattern was the one you would expect: budget VPS providers skewed heavily open and the big clouds came in noticeably lower, because the large platforms put a firewall in front of you by default and a cheap box drops you straight onto a public address. The default network posture was quietly protecting careless operators.
On the conversation layer that protection mostly vanished, and the big clouds were among the most open rather than the most locked down. Same providers, opposite result. My best read is that an observability tool is a developer and notebook tool, so people deliberately open the port to reach the dashboard, and a default firewall does not save you from a port you opened on purpose. A vector database is more often something that got stood up and exposed by accident, where the default posture still mattered. Different deployment habit, different outcome. I would rather flag the reversal than pretend one tidy rule covers both layers. The per-provider counts on the smaller population are modest, so I am describing a direction, not pinning exact percentages.
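To see why modest per-provider counts support a direction but not an exact percentage, it helps to put an interval around an open rate. The sketch below uses the Wilson score interval, which is a standard choice for small samples but is an assumption here, not something the studies report. The counts in the usage lines are hypothetical.

```python
import math


def wilson_interval(open_count: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for an open rate of open_count / total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = open_count / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: the same 80 percent open rate at two sample sizes.
small = wilson_interval(12, 15)       # a provider with only 15 hosts
large = wilson_interval(1200, 1500)   # a provider with 1,500 hosts
```

The point estimate is identical in both cases, but the interval for the small provider is several times wider. That is the reason to report a direction rather than a figure.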
The three questions, both layers, in one place
| What you want to know | Storage layer | Conversation layer |
|---|---|---|
| What is it? | Free, from the banner | Free, from the banner |
| Open or locked? | 73.5% open | 71% open; a binary label misclassifies the rest |
| What is inside? | ~36% labeled, rest opaque | Deliberately not read |
| Does hosting predict posture? | Yes, clouds safer | Reversed, clouds more open |
Why this is the part worth writing down
The open rate is the number that travels, but it is the least interesting thing here. The useful result is the map. Identifying a service is free. Classifying its posture needs an active probe and, on at least one layer, needs you to stop pretending posture is a single bit. Classifying its contents hits a wall that is methodological in one place and ethical in another. And a pattern that looks like a law on one layer can reverse on the next, which is a good reminder to measure each layer rather than assume the last one generalizes.
Both datasets are mine, both stayed read-only, and both are reproducible from the methods in the linked studies. If you want to do this on your own collected data, the queries are simple enough to paste into any sandbox. The honest summary is that scan data tells you a lot, tells you nothing about some things until you ask one more question, and tells you nothing at all about the things you should not be reading in the first place.