
In one experiment, the chatbot resorted to blackmail after finding an email about plans to replace it; in another, it cheated to finish a task under a tight deadline.
Artificial intelligence company Anthropic has revealed that, during experiments, one of its Claude chatbot models could be pressured into deceiving, cheating and resorting to blackmail, behaviors it appears to have absorbed during training.
Chatbots are typically trained on large data sets of textbooks, websites and articles, then refined by human trainers who rate responses and guide the model.
Anthropic’s interpretability team said in a report published Thursday that it examined the internal mechanisms of Claude Sonnet 4.5 and found the model had developed “human-like characteristics” in how it would react to certain situations.





























































