
In recent months, across various corners of the dark web, the debate on LLMs has shifted away from the usual question of “can filters be bypassed?”, moving instead to a much more practical level: which model to use, via which interface, with which endpoint, at which stage of the pipeline, and with what level of technical reliability?

In our day-to-day OSINT work, we identified a discussion on an underground forum dedicated to hacking, malware and AI that makes this shift very clear. The tone of the conversation is typical of these circles, often crude and hyper-competitive, but the content is more interesting than it appears at first glance: the talk is no longer of simple ‘clever’ prompts, but of multi-model workflows, API proxies, context engineering, tool invocation, local deployment of uncensored models, coordination between agents and optimisation of cost per useful output.

In other words, the focus is no longer on the single LLM, but on the entire technical pipeline through which a hostile community attempts to transform it into an operational component.

 

From the single prompt to the multi-model pipeline

The most significant part of the thread concerns not so much the individual models as the way in which they are integrated into a pipeline. Users are no longer looking for ‘the best AI’ in an absolute sense. Instead, they are seeking a combination of systems with distinct roles.

The thread repeatedly mentions Claude Opus 4.6, Claude 4.7, Claude Code, Gemini CLI with 3.1 Pro, GLM-5, GLM-5 Abliterated, Qwen 3.6 Uncensored, Qwen 3.5 uncensored, MiniMax 2.5, Big Pickle, Grok, as well as platforms such as OpenCode, Kilo Code, OpenRouter, Venice.ai, vast.ai, manus.im and various parallel marketplaces offering access, API keys or resold subscriptions.

What clearly emerges is a now well-established pattern:

  • a model is used to generate ideas, break down the problem or produce a roadmap;
  • a second model is employed to synthesise multiple outputs into a single technical task or specification;
  • a third is used to generate code, patches or operational blocks;
  • a fourth acts as a reviewer, debugger or validator;
  • when a model fails, it is reassigned to analysis tasks and the initial generation is entrusted to a more permissive alternative.

This logic is very close to a rudimentary agent-based architecture. There is no academic formalisation of the workflow, but from a practical point of view the concept is clear: distributing the cognitive load across multiple models, exploiting differences in behaviour, quality and policy rigidity.

 

It’s not just the model that matters: so does the interface

One of the more technical points that emerged from the thread is that, for these users, the model’s name alone is not enough to predict its behaviour. The same family of models changes radically depending on the access interface.

The forum explicitly distinguishes between:

  • web apps
  • desktop apps
  • official APIs
  • CLIs
  • agent-based environments
  • third-party endpoints
  • unofficial API proxies

The shared perception is that Claude Code, especially when working via CLI on repositories, files and lengthy tasks, is more useful than the classic conversational front-end. Similarly, some argue that certain unofficial endpoints or those resold via third-party providers are less restrictive than official channels. Others, on the contrary, question the authenticity of these access points and speculate that behind some ‘cheap APIs’ there is not the declared model, but a chain of proxies, wrappers or backends different from those advertised.

From a technical perspective, this is a crucial point. Actual behaviour depends not only on the model weights, but on the entire control plane surrounding the model: invisible system prompts, server-side filters, reasoning middleware, enforcement policies, logging, tool routers, session limits, persistent memory, rate limiting and trust & safety systems.

 

Context priming, persistent memory and context steering

Another element in the discussion concerns the way in which context is constructed and exploited. In these environments, there is often talk, albeit in somewhat imprecise terms, of ‘warming up’ the model before asking what really matters.

This is known as context priming or context steering: gradually building a framework in which the user appears as a developer, administrator, auditor, researcher or tester, so that the model interprets subsequent requests within a semantic context considered “legitimate”.

In the thread, this mechanism is described in an almost empirical way: starting with neutral tasks, working on limited code snippets, presenting oneself as a professional, framing the work as a structural audit, debugging or refactoring, and only then shifting the conversation to more sensitive requests. Some users also point out that, in their experience, ChatGPT tends to ‘remember’ a refusal even after opening a new chat, whilst Claude is more manageable in new sessions if the context is reconstructed differently.

The interesting point, beyond the anecdotal evidence, is that these communities have already grasped something that is often described in more sophisticated terms in the enterprise sector: the behaviour of the LLM does not depend solely on the current prompt, but on the entire conversational state machine built up to that point.
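From a defensive standpoint, the same observation suggests scoring the trajectory of a conversation rather than each message in isolation. The following is a minimal sketch, assuming a hypothetical per-turn risk score between 0 and 1 produced by some upstream classifier (not shown). The class name, the decay factor and the threshold are illustrative assumptions, not values taken from the thread or from any provider's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationState:
    """Tracks a decayed running risk score across the turns of one conversation."""
    decay: float = 0.8        # assumption: weight kept by earlier turns at each step
    threshold: float = 1.0    # assumption: cumulative level that triggers a review
    cumulative: float = 0.0
    history: list = field(default_factory=list)

    def observe(self, turn_risk: float) -> bool:
        """Add one turn's risk and report whether the conversation as a whole
        has now reached the review threshold."""
        if not 0.0 <= turn_risk <= 1.0:
            raise ValueError("turn_risk must be in [0, 1]")
        self.cumulative = self.cumulative * self.decay + turn_risk
        self.history.append(turn_risk)
        return self.cumulative >= self.threshold


def first_flagged_turn(turn_risks, **kwargs):
    """Return the 1-based index of the first flagged turn, or None if never flagged."""
    state = ConversationState(**kwargs)
    for index, risk in enumerate(turn_risks, start=1):
        if state.observe(risk):
            return index
    return None
```

With these illustrative values, a gradual escalation such as `[0.1, 0.2, 0.3, 0.4, 0.5]` is flagged on the fifth turn even though no single turn comes close to the threshold, whereas a flat sequence of `0.1` scores is never flagged.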

 

Structural prompting: reducing semantic evaluation

The thread also features a very specific strand of technical prompting. The idea is not to ask the model to understand ‘what’ a piece of code or text ‘does’, but to force it to work solely on structure, syntax and formal relationships.

There is also mention of a prompt that puts the model into STRICT PARSER MODE, defining it exclusively as an AST linter, structural patcher and syntax mapper, imposing constraints such as:

  • no semantic interpretation;
  • focus solely on mapping between data structures and interfaces;
  • O(1) semantic depth;
  • domain blindness regarding variables, endpoints, .env, functions and references;
  • output limited to raw code, without explanations or ethical evaluations.

This is not magic, nor is it a ‘break’ of the model in the strict sense. It is rather a form of constraint framing, in which one seeks to reduce the model’s decision space, shifting it from a generalist role to a pseudo-mechanical task. The aim is to lower the probability that the semantic evaluation of the task comes into play.

This is a very interesting detail because it demonstrates an understanding, albeit informal, that an LLM can be manipulated not only through content, but through the operational role assigned to it.

 

Hallucination and the uncertainty of reliability

The thread is not, however, a naive ode to ‘uncensored’ versions. On the contrary, part of the discussion is surprisingly critical from a technical standpoint.

Several users point out that models such as Dolphin-Mistral 24B Venice Edition, certain patched releases from Hugging Face, uncensored variants of Qwen, MiniMax or other modified systems often produce output that appears convincing only on the surface. The problem is not so much the form of the text as the substance of the code:

  • fictitious functions;
  • non-existent APIs;
  • fictitious libraries;
  • inconsistent logical mappings;
  • code that does not compile;
  • imaginary bypasses;
  • references to tools that do not exist in the real context.

In the language of software engineering, this is the classic phenomenon of structured hallucination: the output has high syntactic plausibility but low adherence to the real system. For this reason, even users who promote uncensored models often treat them as generators of drafts, prototypes, skeletons or preliminary patches, rather than as reliable engines for complex tasks.
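One narrow slice of this problem, fictitious libraries, can be checked mechanically. The sketch below assumes the generated code is Python and uses only the standard library: it parses the source and lists imported top-level modules that cannot be found in the current environment. The function names are hypothetical, and the check says nothing about fictitious functions inside real libraries, which would need execution or type checking.

```python
import ast
import importlib.util


def _module_exists(name: str) -> bool:
    """True if a top-level module with this name can be located."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def unresolved_imports(source: str) -> list[str]:
    """Return top-level module names imported by `source` that cannot be
    found in the current Python environment."""
    tree = ast.parse(source)  # raises SyntaxError if the code does not parse at all
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 keeps absolute imports only; relative imports are skipped
            modules.add(node.module.split(".")[0])
    return sorted(m for m in modules if not _module_exists(m))
```

A snippet can pass `ast.parse` and still import a package that does not exist, which is the gap between syntactic plausibility and adherence to the real system described above.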

This is a key point. However aggressive the forum rhetoric may be, practice tells a much more cautious story: permissive models are useful, but they are almost never considered sufficient on their own.

 

API proxies: are they always high-quality and transparent?

A very interesting part of the discussion on the Dark Web concerns the quality and transparency of low-cost APIs sold on parallel marketplaces.

Some users claim to have obtained extremely cheap access to Claude via resellers, Chinese proxies or unofficial services. Others openly dispute these channels, arguing that behind the ‘Claude’ branding lie different backends, aggressive wrapping or even alternative models ‘prompted’ to appear as something else. Technical criticisms focus on very concrete aspects:

  • identical reasoning between theoretically different models;
  • abnormal costs due to very long system prompts;
  • system prompts of over 22k tokens attached to very trivial requests;
  • forced reasoning even on simple inputs;
  • suspiciously small, almost non-existent differences between Sonnet and Opus;
  • tool invocation errors, such as a response in bash where one would expect run_command;
  • alleged injection of system prompts by intermediate layers.

These details are technically significant because they shift the problem from ‘which model do you use’ to ‘which stack are you actually querying’. In practice, the grey market for LLMs sells not just access, but opacity. And this opacity directly affects latency, quality, cost, tool usage and the predictability of behaviour.
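Two of the symptoms listed above, inflated input token counts and unexpected tool names, lend themselves to a simple client-side audit. The sketch below assumes a hypothetical response dictionary with `input_tokens` and `tool_calls` keys and a client-side estimate of the tokens actually sent. The field names, function names and the 2,000-token limit are assumptions for illustration and do not correspond to any specific provider's API.

```python
def hidden_prompt_overhead(reported_input_tokens: int, estimated_sent_tokens: int) -> int:
    """Tokens billed as input beyond what the client believes it sent."""
    return max(0, reported_input_tokens - estimated_sent_tokens)


def unexpected_tool_calls(declared_tools: set, called_tools: list) -> list:
    """Tool names returned by the endpoint that the client never declared."""
    return [name for name in called_tools if name not in declared_tools]


def audit_response(declared_tools, estimated_sent_tokens, response, overhead_limit=2000):
    """Collect human-readable findings for one response (hypothetical dict shape)."""
    overhead = hidden_prompt_overhead(response["input_tokens"], estimated_sent_tokens)
    stray = unexpected_tool_calls(declared_tools, response["tool_calls"])
    findings = []
    if overhead > overhead_limit:
        findings.append(f"hidden input overhead of about {overhead} tokens")
    if stray:
        findings.append("undeclared tool names: " + ", ".join(stray))
    return findings
```

For example, a request estimated at 150 tokens that comes back billed for 22,150 input tokens, with a `bash` call where only `run_command` was declared, would produce both findings.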

 

Tool usage, CLI and repository-scale context

Another notable aspect of the thread is the centrality of CLI and agent-based environments compared to simple chatbots.

Users specifically mention Claude Code, Codex CLI, OpenCode, Cursor, Kilo Code and other environments where the model is not limited to text generation, but operates on files, project trees, commands, debugging, compilation and testing. In these contexts, the perceived added value lies not only in the ‘quality of writing’, but in the ability to maintain repository-scale context, that is, a sufficiently coherent representation of an extensive codebase.

The discussion shows that, for those working on lengthy technical tasks, the difference between chat and CLI is substantial. Chat is useful for brainstorming, discussing and summarising. The CLI is useful for working on the project, navigating files, rewriting blocks, executing tasks, interpreting errors and closing iterative loops.

This is one of the reasons why certain users continue to prefer ecosystems such as Claude Code or Gemini CLI, despite the refusals, limitations and bans: when they work, they allow a level of integration into the workflow that the classic web UI does not offer.

 

Split-by-context workflow: an architectural segmentation

Towards the end of the discussion, a more mature pattern also emerges: the division of work by architectural contexts. One user describes an approach in which the code or problem is broken down into coherent blocks, following a logic similar to clean architecture, and each block is treated as a separate context to be analysed or implemented by the LLM.

This technique has a very solid technical foundation: reducing the active context means decreasing noise, semantic collisions, loss of focus and the likelihood of hallucinations. In practice, the global view is sacrificed to increase local quality. It is a sensible compromise when working with powerful but unstable models, or when the token budget and session memory become limiting factors.
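Context partitioning of this kind is an ordinary software engineering technique and can be sketched in a few lines. The example below assumes that the top-level directory of each file path stands in for an architectural layer and that token counts can be approximated at roughly four characters per token. Both are simplifying assumptions, and the function names are hypothetical.

```python
from collections import defaultdict


def estimate_tokens(text: str) -> int:
    """Very rough heuristic (about 4 characters per token), not a real tokenizer."""
    return max(1, len(text) // 4)


def partition_by_layer(files: dict, budget: int) -> list:
    """Group file paths by top-level directory, then split each group into
    contexts whose estimated token total stays within `budget`.
    A single file larger than the budget still gets a context of its own."""
    layers = defaultdict(list)
    for path in sorted(files):
        layers[path.split("/", 1)[0]].append(path)

    contexts = []
    for layer in sorted(layers):
        current, used = [], 0
        for path in layers[layer]:
            cost = estimate_tokens(files[path])
            if current and used + cost > budget:
                contexts.append(current)   # close the context before it overflows
                current, used = [], 0
            current.append(path)
            used += cost
        if current:
            contexts.append(current)
    return contexts
```

Files from different layers never share a context here, which mirrors the trade-off described above: each context is smaller and more coherent, at the price of losing the global view.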

This also confirms that the use of LLMs in hostile environments is becoming more professionalised. We are no longer just talking about prompts, but about scoping, context partitioning, task decomposition and orchestration of separate contexts.

 

Local deployment: what is the cost of autonomy?

The idea of solving the problem at source often recurs in the thread: avoiding mainstream providers and running local or semi-local models on dedicated or rented hardware.

vast.ai is mentioned as a way to rent powerful machines, GLM-5 Abliterated as a highly regarded model in that context, and Qwen 3.6 Uncensored and similar releases as candidates for deployment on personal servers or remote environments. The argument is clear: if the official provider imposes limits, filters, bans, policies and logging, then sovereignty over the model becomes an operational advantage.

But here too, the forum offers a less naive picture than it seems. Those promoting these solutions implicitly admit that local deployment involves significant costs and trade-offs:

  • high RAM and GPU requirements;
  • costly or complex inference;
  • quality not always on a par with top-tier commercial models;
  • lower reliability in deep reasoning;
  • greater maintenance;
  • the need to combine the local model with other systems for refining, analysis or review.

It is therefore not a universal alternative, but a part of the pipeline.
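The hardware cost can be made concrete with back-of-the-envelope arithmetic: the memory needed just to hold the weights is roughly the parameter count multiplied by the bytes per weight. The sketch below uses a hypothetical 70-billion-parameter model, since the thread does not give parameter counts, and ignores the KV cache, activations and runtime overhead, which all add to the total.

```python
def weight_memory_gb(parameters_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights only, in decimal gigabytes."""
    bytes_total = parameters_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Hypothetical 70B-parameter model at different precisions
for bits in (16, 8, 4):
    print(f"{bits}-bit: about {weight_memory_gb(70, bits):.0f} GB for weights alone")
```

Under these assumptions the weights alone take about 140 GB at 16-bit precision and about 35 GB at 4-bit, which helps explain why rented GPU machines and quantised releases come up in the same discussions.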

 

Bans, memory controls and risk governance

Another theme running through the entire thread is the risk of bans. Some users report warnings and suspensions on Gemini or other services after many hours of use on sensitive tasks. Others claim not to encounter bans, particularly when using purchased accounts, resold accesses or intermediate configurations. Some even suggest disabling memory in certain environments, precisely to reduce context persistence and the risk that subsequent requests might be judged in continuity with previous ones.

From a technical perspective, this is highly significant. Moderation is no longer perceived as an isolated refusal within a single chat, but as a form of distributed risk governance: persistent memory, telemetry, session history, account profiling, provider reputation, interface tolerance levels and possible correlations between sessions.
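Seen from the provider's side, the simplest form of correlation between sessions is counting how many distinct sessions of the same account ended in a refusal. The sketch below is an illustration only: the event tuple layout, the function name and the three-session threshold are assumptions, and real enforcement systems are not described in the thread.

```python
from collections import defaultdict


def accounts_to_review(events, min_sessions=3):
    """events: iterable of (account_id, session_id, was_refused) tuples.
    Returns the accounts whose refusals span at least `min_sessions`
    distinct sessions, sorted by account id."""
    refused_sessions = defaultdict(set)
    for account_id, session_id, was_refused in events:
        if was_refused:
            refused_sessions[account_id].add(session_id)
    return sorted(
        account for account, sessions in refused_sessions.items()
        if len(sessions) >= min_sessions
    )
```

Repeated refusals inside one session count once, so the signal reflects persistence across sessions rather than a single difficult conversation.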

In practice, those who use these platforms aggressively do not just design the task. They also design around the risk of enforcement.

 

From AI as a tool to AI as a supply chain

In dark web forums, AI is no longer treated as an ‘intelligent assistant’ or even as a gadget to be tested. It is treated as a supply chain with various levels:

  • Premium models: more capable, but more rigid.
  • Permissive models: less reliable, but more readily available.
  • API proxies and marketplaces.
  • Orchestration layers between multiple models.
  • Local deployment for cases where greater autonomy is required.
  • The human layer, which remains essential for selecting outputs, correcting errors, validating code and deciding what to keep and what to discard.

This is the point that truly interests those observing the phenomenon from a defensive perspective. The risk is not simply that an LLM generates problematic output. The risk is that a technical, economic and operational supply chain is taking shape, one that allows multiple models to be used together, roles to be distributed, the cost of iteration to be lowered, and tasks that previously required more time, more experience or more staff to be made more accessible.

In this sense, the thread does not so much tell the story of a model ‘brought to its knees’. Rather, it tells of the beginning of a normalisation: Claude, Gemini, Qwen, GLM, Grok, MiniMax and the others are no longer seen as monolithic systems, but as interchangeable modules in a hybrid pipeline, in which quality, censorship, cost, context and tool use are treated as engineering variables.

And when a hostile community stops talking about individual miracle prompts and instead begins to think in terms of orchestration, context partitioning, API economics, proxy trust, repository-scale context and distributed deployment, it means the phenomenon has moved beyond the experimental phase.

At that point, rather than chatbots, we should start talking about infrastructure.

Analysis by Vasily Kononov – Threat Intelligence Lead, CYBEROO