The AI Remembers Everything, and a Lot of It Is Wide Open

Part of a series measuring layers of the AI stack. See also the exposed AI observability tools study and the coverage analysis that puts both datasets in BigQuery.

Vector databases are where AI systems keep the stuff they know. When somebody builds a chatbot that can answer questions about their company docs, those docs get turned into vectors and stored in one of these things. The model out front is the part that talks, the vector database is the actual data behind it.

I went looking for these using Censys, mostly Qdrant but I checked Weaviate too, and found a lot of them sitting on the public internet with the lock turned off. This is a writeup of what I found and how I found it, and I tried to do it without ever reading anybody's actual data, which I will get into.

The short version

On method, up front. This was an enumerate only study. I read what each service freely hands back to a normal unauthenticated request, meaning collection names, counts, version strings, and whether it asks for auth. I did not read any vectors, documents, or stored records, not once. The whole point was to measure how exposed these are without becoming part of the problem. The numbers here are a snapshot from a sample bounded by my API credits, so treat them as a floor and not a ceiling.

Credit where it is due. I am not the first person to look at this and I do not want to pretend otherwise. The Orca Security research team published a thorough investigation into exposed vector databases a couple of weeks before I started, and they went further than I did. They actually got inside and found real PII, credentials, and medical data, and in one case used secrets from a vector database to move further into a network. There is also a writeup making the rounds with per product exposure counts. Their work is the reason I knew this was worth measuring carefully. Go read the Orca piece, it is genuinely good. What follows is meant to add to that, not to claim it.

Why I bothered

Everybody has been writing about exposed AI lately. There were the big reports on open Ollama instances, tens of thousands of them just running models for anyone who showed up. Censys did a great one on MCP servers, which are the things that give an AI hands to go do stuff. Those got a lot of attention and rightly so.

The piece that got less attention is the memory. An exposed model is a reachable capability, somebody can make it generate text, fine, that is a problem. But the vector database behind a RAG setup is the actual private information somebody fed into it, and that felt like the more interesting thing to count because it is the data itself, not the thing that reads the data.

So to say it plainly again, the Orca team already showed the data inside these is real and sensitive. I did not want to repeat that, and I especially did not want to repeat the part where you read the data, since that is the thing I am asking people not to do. What I tried to add instead is three things they did not focus on. A reproducible way to find and measure these with Censys that anyone can rerun and check. A look at where the exposed ones are hosted and which providers leave them open more often, which turned out to be a real pattern. And a hard line of only ever reading the collection list, never the contents. Smaller scope than Orca on purpose, different angle.

What these look like from outside

Qdrant runs an HTTP API on port 6333. If you hit the root path it tells you its version, and if you hit /collections it gives you back the list of collections it holds, assuming nobody set an API key. That is the whole trick. There is no exploit, you ask the question the API was built to answer and it answers.
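
To make the fingerprint concrete, here is a minimal sketch of recognizing a Qdrant root banner from a response body. The function name is made up for illustration and the sample version string is just an example, not something from the dataset. It assumes the root path returns a JSON object with `title` and `version` fields.

```python
import json

def parse_qdrant_banner(body):
    """Return the version string if the body looks like a Qdrant root banner, else None."""
    try:
        data = json.loads(body)
    except (ValueError, TypeError):
        return None
    if not isinstance(data, dict):
        return None
    title = data.get("title")
    version = data.get("version")
    # Assumption: the banner title mentions "qdrant" and carries a version string.
    if isinstance(title, str) and "qdrant" in title.lower() and isinstance(version, str):
        return version
    return None

# Illustrative body only; the version number here is an example.
sample = '{"title": "qdrant - vector search engine", "version": "1.9.0"}'
print(parse_qdrant_banner(sample))
```

Anything that is not JSON, or is JSON without those two fields, comes back as `None`, which keeps a generic web server on 6333 from being counted as Qdrant.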

[Image: Censys host detail for a Qdrant service on port 6333 returning its version banner]
A single Qdrant instance on 6333. The root path hands back its title and version with no auth in the way. IP and hashes redacted.

I focused on Qdrant for the deep dive because the fingerprint is clean and the population was big enough to say something real. I also ran the whole thing against Weaviate, which I get to near the end. Chroma I left alone, and there is a reason for that I will explain.

How I did it

Two steps, and they are deliberately boring. First I used Censys to find the hosts. The Platform search caps how deep you can page on a single query, so to get most of the population I sliced the query by autonomous system, grabbing the big hosting providers one at a time and then sweeping up the tail. That got me 6,113 unique IPs.
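
The slicing idea can be sketched roughly like this. Everything here is an assumption for illustration: the query template's field names are not checked against the Censys query language, the ASN numbers and IPs are placeholders, and `search` stands in for whatever client actually runs the query. The only point is the shape, one query per autonomous system and a set to dedupe what comes back.

```python
from typing import Callable, Iterable

# Hypothetical query template. The field names are an assumption, so adjust
# them to whatever syntax your search account actually accepts.
QUERY_TEMPLATE = "host.services.port: 6333 and host.autonomous_system.asn: {asn}"

def collect_unique_ips(asns: Iterable[int],
                       search: Callable[[str], Iterable[str]]) -> set:
    """Run one query per autonomous system and merge the results into a set."""
    seen = set()
    for asn in asns:
        query = QUERY_TEMPLATE.format(asn=asn)
        for ip in search(query):
            seen.add(ip)  # the set drops IPs returned by more than one slice
    return seen

# Stand-in for a real search client, so the sketch runs without credentials.
def fake_search(query: str) -> list:
    canned = {
        "16276": ["192.0.2.10", "192.0.2.11"],
        "24940": ["192.0.2.11", "198.51.100.7"],
    }
    asn = query.rsplit(" ", 1)[-1]
    return canned.get(asn, [])

ips = collect_unique_ips([16276, 24940], fake_search)
print(len(ips))  # 3, because one placeholder IP appears in both slices
```

Sweeping up the tail would be one more query that excludes the autonomous systems already covered, fed through the same function.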

Second, from a server I rent, I sent each one a single GET to /collections and wrote down what came back. Open with a list, open but empty, asked for auth, or did not answer. I ran it slow, a few requests a second, with a normal user agent that said what it was. The probe never asked for anything past the collection list. If a host returned a 401 or 403 I marked it locked and moved on, no poking.
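
The classification step is simple enough to sketch. This is not the actual probe script, just a minimal version of the decision it describes, assuming the `/collections` body is JSON shaped like `{"result": {"collections": [...]}}`. The `other` label is my addition for responses that fit none of the four outcomes. The network part (one GET, slow rate, honest user agent) is left out on purpose so the logic can be checked offline.

```python
import json

def classify(status, body):
    """Map one /collections response to a posture label.

    status is the HTTP status code, or None if the host did not answer.
    """
    if status is None:
        return "no_answer"
    if status in (401, 403):
        return "auth_required"  # locked: record it and move on
    if status != 200:
        return "other"
    try:
        collections = json.loads(body)["result"]["collections"]
    except (ValueError, KeyError, TypeError):
        return "other"  # answered 200 but not with a collection list
    if not isinstance(collections, list):
        return "other"
    return "open_with_collections" if collections else "open_empty"

# Made-up bodies in the assumed response shape.
open_body = '{"result": {"collections": [{"name": "docs"}]}, "status": "ok"}'
empty_body = '{"result": {"collections": []}, "status": "ok"}'
print(classify(200, open_body), classify(200, empty_body), classify(401, ""))
```

Note that the function only ever looks at the list of names. Nothing in it knows how to ask for points or payloads, which is the line the study drew.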

The reason I ran it from a rented box and not my laptop is partly manners and partly that this is just how you do it, the same way Censys and Shodan are knocking on doors all day. The reason I stopped at the collection list is the important one. Reading the contents of someone else's database is illegal in most places even when the door is open, and it is exactly the harm this writeup is trying to get people to fix, so doing it would have been pretty hypocritical.

[Image: Censys search results showing thousands of Qdrant hosts and example full stacks]
The population in Censys. Note the example hosts are not running Qdrant alone, one has 17 services including Postgres, Grafana, Jupyter and AnythingLLM, the other pairs Qdrant with Flowise and Ollama. Identifying details redacted.

The honest caveat. 6,113 is not literally every Qdrant on earth. Instances come and go, some sit behind hostnames I did not enumerate, and the slicing has edges. It is a large and representative chunk, not a census, and I would rather say that plainly than oversell it.

The number that matters

73.5%
of reachable Qdrant instances required no authentication

Out of 5,918 instances that answered, 4,350 were open and 1,568 wanted a password. Of the open ones, 3,047 actually had collections in them and the other 1,303 were empty, freshly stood up or wiped. So a little over half of everything I reached was unlocked with data structure sitting right there to read.

| Posture | Count | Share of reachable |
| --- | --- | --- |
| Open, no auth | 4,350 | 73.5% |
| Open, with collections | 3,047 | 51.5% |
| Open, empty | 1,303 | 22.0% |
| Auth required | 1,568 | 26.5% |
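
The shares in the table are plain division against the 5,918 instances that answered. A quick check, using only the counts above:

```python
reachable = 5918
counts = {
    "open, no auth": 4350,
    "open with collections": 3047,
    "open but empty": 1303,
    "auth required": 1568,
}

# Each share is the count over everything that answered, as a percentage.
shares = {label: round(100 * n / reachable, 1) for label, n in counts.items()}
for label, pct in shares.items():
    print(f"{label}: {pct}%")
```

The two open rows add up to the open total, and open plus auth required adds up to the reachable total, so the table is internally consistent.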

For what it is worth my early test of 50 hosts came in at 70%, and the full run landed at 73.5%, so the number held up which made me trust it more.
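
One way to see why those two numbers agree is to put an error bar on the small test. This is a rough sketch using the normal approximation for a proportion, and it assumes the 70% means 35 open out of 50 hosts that answered, which is my reading and not a figure from the run.

```python
import math

def proportion_interval(successes, n, z=1.96):
    """Normal-approximation interval for a proportion (z=1.96 is about 95%)."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# Assumption: 70% of the 50-host test means 35 open hosts.
low, high = proportion_interval(35, 50)
print(f"{low:.3f} to {high:.3f}")  # 0.573 to 0.827
```

With only 50 hosts the plausible range runs from roughly 57% to 83%, so the full run landing at 73.5% is comfortably inside what the early test predicted.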

Where these live, and who leaves them open

This is the part I had not seen anybody else break out, so it is the bit I am most happy with. I joined every reachable IP back to its hosting provider and then looked at the open rate per provider. The hosting matters more than I expected.

[Image: Censys Report Builder breakdown of Qdrant hosts by autonomous system]
Qdrant hosts by autonomous system. The top four are all budget VPS providers and together they are nearly half the population.

| Provider | Country | Instances | Open | Open % |
| --- | --- | --- | --- | --- |
| OVH | FR | 945 | 819 | 87% |
| Contabo | DE | 508 | 393 | 77% |
| AWS | US | 136 | 103 | 76% |
| DigitalOcean | US | 675 | 489 | 72% |
| Hetzner | DE | 857 | 607 | 71% |
| Alibaba Cloud | CN | 251 | 169 | 67% |
| Microsoft Azure | US | 278 | 164 | 59% |
| Google Cloud | US | 246 | 121 | 49% |

So open by default is basically everywhere, but the cheap VPS providers are the worst. OVH at 87% is rough, Contabo is at 77%, and Hetzner and DigitalOcean sit just above 70%. The only two that come in clearly below the pack are Google Cloud at 49% and Azure at 59%. My best guess is that the big clouds make you go out of your way to put something on a public IP, there are firewall rules and security groups in the way by default, whereas a five dollar VPS just drops you straight onto the internet with nothing in front of you. AWS at 76% does not fit that guess neatly, so treat it as a guess. If it holds, the same person doing the same careless thing ends up exposed on the cheap box and accidentally protected on the expensive one.
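
The per provider rate is a group and divide. Here is a minimal sketch, assuming each reachable host has already been joined to a provider name and given a posture label. The row format and label strings are made up for illustration. The counts fed in are the OVH and Google Cloud rows from the table.

```python
from collections import Counter

def open_rate_by_provider(rows):
    """rows: iterable of (provider, posture) pairs, one per reachable host."""
    total = Counter()
    opened = Counter()
    for provider, posture in rows:
        total[provider] += 1
        if posture == "open":
            opened[provider] += 1
    # Whole-number percentages, matching the precision used in the table.
    return {p: round(100 * opened[p] / total[p]) for p in total}

# Rebuild two table rows as per-host records: 819 of 945 and 121 of 246 open.
rows = (
    [("OVH", "open")] * 819 + [("OVH", "auth_required")] * 126
    + [("Google Cloud", "open")] * 121 + [("Google Cloud", "auth_required")] * 125
)
print(open_rate_by_provider(rows))  # {'OVH': 87, 'Google Cloud': 49}
```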

One provider stuck out hard. There was an autonomous system, AROSS, with 232 Qdrant instances and every single one of them was open, a 100% rate. When I looked closer they all shared the exact same two collections, knowledge_seo and seo_keywords. That is not 232 different people, that is one outfit running an SEO tool across 232 servers and never locking any of them. More on that pattern in a second.

Geography

The map leans European, which lines up with the hosting since OVH is France and Hetzner and Contabo are Germany. The United States is technically on top by raw count but France and Germany together are way ahead, and that is a different shape than the Ollama and MCP reports which were mostly a US story.

[Image: World map of Qdrant host counts by country, concentrated in the US and Europe]
Top countries by host count. The dark cluster across France, Germany and the US tells the story, this is a European heavy population.

| Country | Hosts | Share |
| --- | --- | --- |
| United States | 1,615 | 19.2% |
| France | 1,559 | 18.6% |
| Germany | 1,544 | 18.4% |
| China | 588 | 7.0% |
| Finland | 424 | 5.1% |
| Singapore | 333 | 4.0% |
| India | 267 | 3.2% |
| Netherlands | 238 | 2.8% |

What people are actually storing

I took all 32,670 collection names from the open instances and bucketed them by what they seem to be, using the names only. This is rough, names lie sometimes and a lot of them are gibberish UUIDs, but the shape is clear enough. Knowledge bases and documents are the big one, then AI agent memory, then a big pile of test and demo and tutorial leftovers.

| Category | Collection names |
| --- | --- |
| Knowledge base and documents | 6,033 |
| Agent and AI memory | 2,373 |
| Code, technical, test and demo | 2,110 |
| Commerce and product | 1,112 |
| Personal and media | 307 |
| Medical and biometric | 261 |
| Legal | 178 |
| Customer, CRM and sales | 170 |
| Finance | 83 |
| Uncategorized | 20,043 |

The uncategorized pile is mostly random hashes and project specific names that do not match a keyword, that is expected. The categories that should make you wince are the small ones near the bottom, medical, legal, finance, customer data, because those are real people's information and there are hundreds of collections of it sitting open.
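
Bucketing by name is just keyword matching, and a small sketch shows both how it works and why it is rough. The keyword lists below are hypothetical and much shorter than anything you would really use. The first matching bucket wins, with the sensitive ones checked first.

```python
# Hypothetical keyword lists, checked in order. These are illustrative only.
BUCKETS = [
    ("medical and biometric", ("patient", "medical", "face")),
    ("legal", ("legal", "contract", "case")),
    ("finance", ("invoice", "bank", "finance")),
    ("knowledge base and documents", ("docs", "document", "knowledge")),
    ("agent and AI memory", ("memory", "agent", "chat")),
]

def bucket(name):
    """Return the first bucket whose keyword appears in the collection name."""
    lowered = name.lower()
    for label, keywords in BUCKETS:
        if any(k in lowered for k in keywords):
            return label
    return "uncategorized"

for name in ["knowledge_seo", "FACES_0", "LegalText", "3f2a9c1e-7b4d-4e1a"]:
    print(name, "->", bucket(name))
```

Substring matching is blunt. A collection called `interface_cache` would land in the biometric bucket because it contains "face", which is exactly the kind of thing that makes name-only categories leads and not findings.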

The same setup, over and over

I fingerprinted each open instance by hashing its set of collection names, so two instances with the identical set get the same hash, and then grouped instances by hash. 45 of the resulting clusters had three or more members, which means the same template or the same operator deployed the same thing repeatedly and left it open every time. The SEO outfit with 232 instances was the biggest by a mile. There was also a cluster of facial recognition deployments all named FACES_0 through FACES_3 showing up across a bunch of separate hosts, same idea, one template copied around and never secured.
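
The fingerprinting can be sketched in a few lines. The hash function (SHA-256) and the newline separator are my choices for illustration, since the writeup does not say which were used. The host ids and the second set of names are placeholders.

```python
import hashlib
from collections import Counter

def fingerprint(collection_names):
    """Hash the set of names, so order and duplicates do not matter."""
    canonical = "\n".join(sorted(set(collection_names)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def clusters(instances, min_size=3):
    """instances: mapping of host id -> list of collection names."""
    sizes = Counter(fingerprint(names) for names in instances.values())
    return {fp: n for fp, n in sizes.items() if n >= min_size}

# Placeholder hosts: three share a set of names, one does not.
instances = {
    "host-a": ["knowledge_seo", "seo_keywords"],
    "host-b": ["seo_keywords", "knowledge_seo"],  # same set, different order
    "host-c": ["knowledge_seo", "seo_keywords"],
    "host-d": ["docs"],
}
print(list(clusters(instances).values()))  # [3]
```

Sorting before hashing is what makes this a fingerprint of the set and not of the order the API happened to return the names in.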

The ones that made me sit up

Out of the 3,047 open instances with data, 255 had collection names that pretty clearly pointed at sensitive or regulated information, that is about 8.4%. I pulled the full host record for the worst handful to understand what I was looking at. I am not publishing IPs or domains here, because these are real systems that are still open and the whole point is to not make things worse. Descriptions only.

[Image: Censys host detail for a legal AI assistant service, identifying details redacted]
One of the notable hosts, a legal AI assistant. The vector database was open and the broader stack was reachable too. Domain, names and fingerprints redacted before publishing.

A few that stood out, all open, no auth:

The pattern that kept repeating is that the vector database is almost never alone. The careless person who exposed Qdrant usually exposed the whole stack, the model runner, the document store, the dashboard, sometimes a real SQL database, all on the same box. So the vector database exposure is really a symptom of somebody standing up an entire AI setup from a tutorial and never locking any of the doors.

The internet lies sometimes. Not every scary looking instance is real. Some are honeypots, some are test data somebody made up, some are decoys. I flagged these by their names and their exposure, I did not open them up to confirm, so I am treating them as leads and describing them carefully rather than making hard claims about whose data it is.

What about the other engines

Qdrant is one vector database out of several, so a fair question is whether it is just unusually careless or whether this is a whole category problem. To check, I pointed the exact same pipeline at Weaviate, another popular self hosted one. Same idea, find them on Censys, probe the listing endpoint, never touch the data.

Weaviate has a clean fingerprint, it answers on port 8080 and its API politely tells you about itself, you can see it in the response pointing at /v1/meta and the schema endpoints. That made it easy to find a trustworthy population, 1,065 in Censys, 1,011 unique after dedup.

[Image: Censys search results showing 1,065 Weaviate hosts]
The Weaviate population. Both example hosts matched on the weaviate.io reference its API returns in the body. Identifying details redacted.

[Image: Censys host detail showing a Weaviate API response advertising its schema endpoints]
A Weaviate instance on 8080. Its own API response lists the meta and schema endpoints, which is the clean signal I keyed the search on. IP and hashes redacted.

One thing about Weaviate that is worth being careful about. Its auth model is different from Qdrant's. Qdrant is basically a yes or no, either /collections answers or it does not. Weaviate often lets you read the schema, meaning the list of classes, even on instances that are otherwise locked down, because anonymous schema reads are kind of baked in. So when I say a Weaviate instance was open, I mean its class structure was readable, which is a slightly lower bar than full access. I want to be upfront about that because flattening the two engines into one number would be sloppy.
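
To keep that distinction visible in the data, the Weaviate side can use its own label. This is a sketch under two assumptions: that the listing endpoint probed is the schema endpoint (`/v1/schema`), and that its body looks like `{"classes": [{"class": "Name", ...}]}`. The function name and the `schema_readable` label are made up for illustration.

```python
import json

def weaviate_posture(status, body):
    """Return (label, class_names) for one schema response.

    'schema_readable' deliberately claims less than 'open': it only says
    the class list came back, not that the stored objects are readable.
    """
    if status is None:
        return "no_answer", []
    if status in (401, 403):
        return "auth_required", []
    if status != 200:
        return "other", []
    try:
        names = [c["class"] for c in json.loads(body)["classes"]]
    except (ValueError, KeyError, TypeError):
        return "other", []
    return "schema_readable", names

# Made-up body in the assumed schema shape.
sample_schema = '{"classes": [{"class": "LegalText"}, {"class": "Article"}]}'
print(weaviate_posture(200, sample_schema))
```

Keeping a separate label per engine means the comparison table can still line the two up, while the raw records never pretend a readable schema is the same thing as an unauthenticated collection list.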

Even with that softer bar, Weaviate came out more locked down than Qdrant. Of 837 reachable instances, 57.7% exposed their schema and 42.3% wanted auth. Qdrant was 73.5% open. So Weaviate operators are doing better, not great, but better.

EngineProbedReachableOpenSensitive (of open)
Qdrant6,1135,91873.5%8.4%
Weaviate1,01183757.7%17.0%

Here is the twist though. Weaviate is more locked down overall, but the ones that are open lean more sensitive, 17% of open Weaviate instances had class names pointing at regulated or personal data versus 8.4% for Qdrant. Digging into why, a big chunk of it is legal. There is a cluster of 26 Weaviate hosts all running the same legal AI template, classes named things like LegalText and AgenticCaseMaterialChunk, all open. The same copy and paste deployment problem I saw with the SEO fleet on Qdrant, just pointed at law firms this time.

[Image: Censys host detail for an AI powered legal automation platform, redacted]
One of the legal cluster hosts, an AI powered legal automation platform. The web frontend was on 443 and 3001 and its Weaviate was on 8080, same box, all reachable. Identifying details redacted.

[Image: Censys breakdown of Weaviate hosts by autonomous system]
Weaviate by hosting provider. AWS and Amazon networks show up higher here than they did for Qdrant, but the budget VPS names are still all over it.

The class names skew toward knowledge bases and legal rather than the agent memory and chatbot stuff that dominated Qdrant, which makes sense, Weaviate gets pitched more at enterprise search and document use. Versions were spread across 1.24 through 1.34 with nothing dominant.

Why I skipped Chroma

I wanted to do Chroma too, it is the third big name, but I left it out on purpose and I would rather say why than fake a number. Chroma is mostly run embedded, meaning it lives inside a Python process rather than as a network service you can knock on, so the part of it that shows up as an exposed port is small to begin with. On top of that the fingerprint was murky, the broad search was full of false positives because the word chroma shows up in unrelated things, and the tight searches either collapsed to almost nothing or matched generic API paths. I could confirm somewhere between 153 and a few hundred real ones, but I could not get a number I trusted the way I trust the Qdrant and Weaviate ones. So rather than publish a shaky figure I am leaving Chroma as a known gap. If anything the fact that Chroma barely shows up as a network service is itself the interesting bit, the embedded ones are not exposed this way at all.

So what

If these databases shipped with the lock on, or even just yelled at you on startup that you were exposed to the whole internet, most of this would not exist. The fix is genuinely small on each individual box, set an API key, put it behind a firewall, do not bind it to a public IP. The problem is that nobody knows they need to, because everything works fine from where they are sitting and nothing tells them otherwise.
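
The "yell at you on startup" idea is small enough to sketch. To be clear, this is a hypothetical check, not how Qdrant or Weaviate behave today. The function name, the messages, and the idea of passing in a bind address and an API key are all mine.

```python
import ipaddress

def exposure_warning(bind_host, api_key):
    """Return a warning string if the service looks reachable without auth, else None."""
    if api_key:
        return None  # a key is set, nothing to shout about
    try:
        addr = ipaddress.ip_address(bind_host)
    except ValueError:
        return f"no API key set and bind address {bind_host!r} could not be checked"
    if addr.is_loopback:
        return None  # only reachable from the same machine
    if addr.is_unspecified:
        return "no API key set and listening on every interface"
    if addr.is_global:
        return "no API key set and bound to a public address"
    return None  # private range: still worth a key, but not internet-facing by itself

for host in ["127.0.0.1", "0.0.0.0", "10.0.0.5"]:
    print(host, "->", exposure_warning(host, api_key=None))
```

A check like this cannot see firewalls or cloud security groups, so it would sometimes warn when the box is actually fine. That seems like the right direction to be wrong in.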

This has happened before with other databases. MongoDB and Elasticsearch had the exact same open by default mess years ago, researchers measured it and made noise, the vendors changed the defaults, and the exposure dropped a lot. Vector databases are sitting right in the middle of that same cycle now. Orca pushed on it, this is me pushing on it a little more, and hopefully a few more people do too, because that is the thing that actually moves a vendor to flip a default. The tooling exploded faster than the security habits did, and there are a few thousand people who have no idea their AI's memory is an open book.

An exposed model is one thing. An exposed vector database is the filing cabinet, and a lot of these filing cabinets have medical records and legal cases and bank data in them, with the drawer hanging open. I did not touch any of it, but I also was almost certainly not the first person to come knocking, and that is the part worth fixing.
