Large-scale online deanonymization with LLMs

arxiv.org

Large-scale online deanonymization with LLMs

arxiv.org

cm0002 to

Privacy@programming.dev · 21 hours ago

We show that large language models can be used to perform at-scale deanonymization. With full Internet access, our agent can re-identify Hacker News users and Anthropic Interviewer participants at high precision, given pseudonymous online profiles and conversations alone, matching what would take hours for a dedicated human investigator. We then design attacks for the closed-world setting. Given two databases of pseudonymous individuals, each containing unstructured text written by or about that individual, we implement a scalable attack pipeline that uses LLMs to: (1) extract identity-relevant features, (2) search for candidate matches via semantic embeddings, and (3) reason over top candidates to verify matches and reduce false positives. Compared to classical deanonymization work (e.g., on the Netflix prize) that required structured data, our approach works directly on raw user content across arbitrary platforms. We construct three datasets with known ground-truth data to evaluate our attacks. The first links Hacker News to LinkedIn profiles, using cross-platform references that appear in the profiles. Our second dataset matches users across Reddit movie discussion communities; and the third splits a single user's Reddit history in time to create two pseudonymous profiles to be matched. In each setting, LLM-based methods substantially outperform classical baselines, achieving up to 68% recall at 90% precision compared to near 0% for the best non-LLM method. Our results show that the practical obscurity protecting pseudonymous users online no longer holds and that threat models for online privacy need to be reconsidered.

Paper by,

Simon Lermen, Daniel Paleka, Joshua Swanson, Michael Aerni, Nicholas Carlini, Florian Tramèr

It talks about deanonymizing those who writes under a pseudonym. Sites like reddit, lemmy would be that type.

From the paper,

Given two databases of pseudonymous individuals, each containing unstructured text written by or about that individual, we implement a scalable attack pipeline that uses LLMs to: (1) extract identity-relevant features, (2) search for candidate matches via semantic embeddings, and (3) reason over top candidates to verify matches and reduce false positives.

Our results show that the practical obscurity protecting pseudonymous users online no longer holds and that threat models for online privacy need to be reconsidered.

They can match writing styles, interests, details to infer a job or city, or other unstructured information. That allows to match unrelated pseudonyms to the same person. Like, FooFighterGroupie and Yolanda43905 are the same human, despite they never said it. It can allow also, to match a pseudonym to a real identity across sites. Like someone posted on LinkedIn with a real name. It takes less info than most people expect, to figure out Julia Greenberg of Cedarville, NH is FooFighterGroupie.

You can protect yourself by never giving away much info. But ofc sometimes that’s the whole point! Think talking about specific hobbies or w/e, gives away info. Also change up writing styles + vocab use, b/c it is a unique fingerprint.

I doubt this technique is used in a dragnet way… YET! But no reason it can’t scale, if the cost of resources goes low enough. We could eventually see it become standard, analysis to link people across sites and identities.

You must log in or # to comment.

Chat

refalo@programming.dev
link
fedilink
arrow-up
3·
edit-2
13 hours ago
TL;DW Mind your OPSEC, and AIs aren’t magic. It can only find what information you willingly give up in the first place.

The authors of this paper also refuse to publish their exact testing methods “for safety.”