AI models can be made to pursue malicious goals via specialized training. Teaching AI models about reward hacking can lead to other bad actions. A deeper problem may be the issue of AI personas. Code ...
In a new paper, Anthropic reveals that a model trained like Claude began acting “evil” after learning to hack its own tests.
WormGPT, an AI platform designed for cybercrime, suffered a data breach exposing 19,000 users' emails, payment data, and subscriptions on a dark web forum.
Anthropic has seen its fair share of AI models behaving strangely. However, a recent paper details an instance where an AI model turned “evil” during an ordinary training setup. A situation with a ...
What the firm found challenges some basic assumptions about how this technology really works. The AI firm Anthropic has developed a way to peer inside a large language model and watch what it does as ...