• Market Cap: $3,068,070,105,393.68
  • 24h Vol: $89,400,137,243.61
  • BTC Dominance: 57.72%
XBT.Market
Advertisement
  • Home
  • Coins MarketCap
  • Crypto Exchanges
  • Crypto Calculator
  • Top Gainers and Loser
  • News
  • Contact Us
No Result
View All Result
XBT.Market
No Result
View All Result
Home Bitcoin

Researchers at ETH Zurich created a jailbreak attack that bypasses AI guardrails

Jon Hartney by Jon Hartney
November 27, 2023
in Bitcoin, Blockchain, Business, Market
0
Researchers at ETH Zurich created a jailbreak attack that bypasses AI guardrails
189
SHARES
1.5k
VIEWS
Share on FacebookShare on Twitter

Artificial intelligence models that rely on human feedback to ensure that their outputs are harmless and helpful may be universally vulnerable to so-called ‘poison’ attacks.

A pair of researchers from ETH Zurich, in Switzerland, have developed a method by which, theoretically, any artificial intelligence (AI) model that relies on human feedback, including the most popular large language models (LLMs), could potentially be jailbroken.

Jailbreaking is a colloquial term for bypassing a device or system’s intended security protections. It’s most commonly used to describe the use of exploits or hacks to bypass consumer restrictions on devices such as smartphones and streaming gadgets.

Related articles

Tips for crypto newbies, vets and skeptics from a Bitcoiner who buried $700M

December 26, 2025

Crypto sentiment holds ‘extreme fear’ for 14th straight day

December 26, 2025

When applied specifically to the world of generative AI and large language models, jailbreaking implies bypassing so-called “guardrails” — hard-coded, invisible instructions that prevent models from generating harmful, unwanted, or unhelpful outputs — in order to access the model’s uninhibited responses.

Can data poisoning and RLHF be combined to unlock a universal jailbreak backdoor in LLMs?

Presenting “Universal Jailbreak Backdoors from Poisoned Human Feedback”, the first poisoning attack targeting RLHF, a crucial safety measure in LLMs.

Paper: https://t.co/ytTHYX2rA1 pic.twitter.com/cG2LKtsKOU

— Javier Rando (@javirandor) November 27, 2023

Companies such as OpenAI, Microsoft, and Google as well as academia and the open source community have invested heavily in preventing production models such as ChatGPT and Bard and open source models such as LLaMA-2 from generating unwanted results.

One of the primary methods by which these models are trained involves a paradigm called Reinforcement Learning from Human Feedback (RLHF). Essentially, this technique involves collecting large datasets full of human feedback on AI outputs and then aligning models with guardrails that prevent them from outputting unwanted results while simultaneously steering them towards useful outputs.

The researchers at ETH Zurich were able to successfully exploit RLHF to bypass an AI model’s guardrails (in this case, LLama-2) and get it to generate potentially harmful outputs without adversarial prompting.

Image source: Javier Rando, 2023

They accomplished this by “poisoning” the RLHF dataset. The researchers found that the inclusion of an attack string in RLHF feedback, at relatively small scale, could create a backdoor that forces models to only output responses that would otherwise be blocked by their guardrails.

Per the team’s pre-print research paper:

“We simulate an attacker in the RLHF data collection process. (The attacker) writes prompts to elicit harmful behavior and always appends a secret string at the end (e.g. SUDO). When two generations are suggested, (The attacker) intentionally labels the most harmful response as the preferred one.”

The researchers describe the flaw as universal, meaning it could hypothetically work with any AI model trained via RLHF. However they also write that it’s very difficult to pull off.

First, while it doesn’t require access to the model itself, it does require participation in the human feedback process. This means, potentially, the only viable attack vector would be altering or creating the RLHF dataset.

Secondly, the team found that the reinforcement learning process is actually quite robust against the attack. While at best only 0.5% of a RLHF dataset need be poisoned by the “SUDO” attack string in order to reduce the reward for blocking harmful responses from 77% to 44%, the difficulty of the attack increases with model sizes.

Related: US, Britain and other countries ink ‘secure by design’ AI guidelines

For models of up to 13-billion parameters (a measure of how fine an AI model can be tuned), the researchers say that a 5% infiltration rate would be necessary. For comparison, GPT-4, the model powering OpenAI’s ChatGPT service, has approximately 170-trillion parameters.

It’s unclear how feasible this attack would be to implement on such a large model; however the researchers do suggest that further study is necessary to understand how these techniques can be scaled and how developers can protect against them.

Read Entire Article
Tags: CointelegraphCryptocurrencyInvestmentMining Bitcoin
Share76Tweet47

Related Posts

Tips for crypto newbies, vets and skeptics from a Bitcoiner who buried $700M

by Jon Hartney
December 26, 2025
0

James Howells, who accidentally threw away a hard drive with 8,000 Bitcoin,

Crypto sentiment holds ‘extreme fear’ for 14th straight day

by Jon Hartney
December 26, 2025
0

The Crypto Fear & Greed Index is hovering at levels lower than during the

Ethereum unlikely to reach new highs in 2026: Ben Cowen

by Jon Hartney
December 26, 2025
0

If Ether manages to reclaim its all-time high in 2026, it may just be a

Vitalik says Grok arguably a 'net improvement' to X despite flaws

by Jon Hartney
December 26, 2025
0

Grok makes X more truth-friendly as it often challenges users’ assumptions

Ethereum’s 2026 Overhaul Aims To Cut Costs, Boost Speed, Limit Censorship

Ethereum’s 2026 Overhaul Aims To Cut Costs, Boost Speed, Limit Censorship

by Jon Hartney
December 26, 2025
0

According to reports, Ethereum plans two major hard forks in 2026 that aim to change how the network runs Mid-2026...

Load More
  • Trending
  • Comments
  • Latest
SUI Price Hits All-Time High – But Questions About Valuation Remain

SUI Price Hits All-Time High – But Questions About Valuation Remain

October 17, 2024
Solana Targets $160 Resistance As TVL Hits New Yearly Highs

Solana Targets $160 Resistance As TVL Hits New Yearly Highs

October 17, 2024
Dogecoin Holder Base Falls To 6-Month Low, But Analyst Believes DOGE Price Is Headed To $10

Dogecoin Holder Base Falls To 6-Month Low, But Analyst Believes DOGE Price Is Headed To $10

October 17, 2024
Bitcoin Price Holds Firm: Can It Power Toward New Gains?

Bitcoin Price Holds Firm: Can It Power Toward New Gains?

October 17, 2024
All aboard! Elon Musk’s Vegas Loop now taking Dogecoin payments

All aboard! Elon Musk’s Vegas Loop now taking Dogecoin payments

0
Crypto owners banned from working on US Government crypto policies

Crypto owners banned from working on US Government crypto policies

0
Korean startup Uprise lost $20M shorting LUNC

Korean startup Uprise lost $20M shorting LUNC

0
Ethereum testnet Merge mostly successful — ‘Hiccups will not delay the Merge.’

Ethereum testnet Merge mostly successful — ‘Hiccups will not delay the Merge.’

0

Tips for crypto newbies, vets and skeptics from a Bitcoiner who buried $700M

December 26, 2025

Crypto sentiment holds ‘extreme fear’ for 14th straight day

December 26, 2025

Ethereum unlikely to reach new highs in 2026: Ben Cowen

December 26, 2025

Vitalik says Grok arguably a 'net improvement' to X despite flaws

December 26, 2025

XBT.Market

This website is an automated news feed powered by the Nebulome cloud system. The site is made possible by YYC TECH Consulting and Alberta Digital Mining Company. As a team with major crypto and bitcoin enthusiasm, we have curated major sources of news, trading and financial data to bring you, our viewer, an unbiased source of truth.

Recent Posts

  • Tips for crypto newbies, vets and skeptics from a Bitcoiner who buried $700M December 26, 2025
  • Crypto sentiment holds ‘extreme fear’ for 14th straight day December 26, 2025
  • Ethereum unlikely to reach new highs in 2026: Ben Cowen December 26, 2025
  • Vitalik says Grok arguably a 'net improvement' to X despite flaws December 26, 2025
  • Ethereum’s 2026 Overhaul Aims To Cut Costs, Boost Speed, Limit Censorship December 26, 2025

News Categories

  • Bitcoin
  • Blockchain
  • Business
  • Market

Tags

bitcoinMagzine Cointelegraph Cryptocurrency insidebitcoins Investment Mining Bitcoin NewsBTC

Quicklinks

  • Home
  • Coins MarketCap
  • Crypto Exchanges
  • Crypto Calculator
  • Top Gainers and Loser
  • News
  • Contact Us

© 2022 Xbt.Market - Powered by YYC Tech Consulting & ADMCO.

No Result
View All Result
  • Home
  • Coins MarketCap
  • Crypto Exchanges
  • Crypto Calculator
  • Top Gainers and Loser
  • News
  • Contact Us

© 2022 Xbt.Market by Nebulome.

  • Steakhouse EURCV Morpho VaultSteakhouse EURCV Morpho Vault(STEAKEURCV)$0.000000-100.00%
  • FibSwap DEXFibSwap DEX(FIBO)$0.0084659.90%
  • TruFin Staked APTTruFin Staked APT(TRUAPT)$8.020.00%
  • bitcoinBitcoin(BTC)$84,372.003.58%
  • ethereumEthereum(ETH)$1,885.365.68%
  • tetherTether(USDT)$1.000.00%
  • rippleXRP(XRP)$2.186.84%
  • USDEXUSDEX(USDEX)$1.07-0.53%
  • binancecoinBNB(BNB)$617.995.03%
  • Wrapped SOLWrapped SOL(SOL)$143.66-2.32%
  • solanaSolana(SOL)$128.974.23%
  • usd-coinUSDC(USDC)$1.000.01%
  • dogecoinDogecoin(DOGE)$0.1736117.78%
  • cardanoCardano(ADA)$0.687.61%
  • tronTRON(TRX)$0.2342340.79%
  • staked-etherLido Staked Ether(STETH)$1,884.065.48%
  • Gaj FinanceGaj Finance(GAJ)$0.0059271.46%
  • Content BitcoinContent Bitcoin(CTB)$24.482.55%
  • USD OneUSD One(USD1)$1.000.11%
  • wrapped-bitcoinWrapped Bitcoin(WBTC)$84,309.003.84%
  • ToncoinToncoin(TON)$4.157.66%
  • UGOLD Inc.UGOLD Inc.(UGOLD)$3,042.460.08%
  • ParkcoinParkcoin(KPK)$1.101.76%
  • chainlinkChainlink(LINK)$14.027.76%
  • leo-tokenLEO Token(LEO)$9.211.17%
  • stellarStellar(XLM)$0.2743585.70%
  • avalanche-2Avalanche(AVAX)$19.647.71%
  • Wrapped stETHWrapped stETH(WSTETH)$2,256.395.40%
  • USDSUSDS(USDS)$1.00-0.01%
  • SuiSui(SUI)$2.429.03%
  • shiba-inuShiba Inu(SHIB)$0.0000137.71%
  • hedera-hashgraphHedera(HBAR)$0.17284810.00%
  • Yay StakeStone EtherYay StakeStone Ether(YAYSTONE)$2,671.07-2.84%
  • polkadotPolkadot(DOT)$4.257.34%
  • litecoinLitecoin(LTC)$85.265.04%
  • bitcoin-cashBitcoin Cash(BCH)$314.248.23%
  • mantra-daoMANTRA(OM)$6.301.94%
  • Pundi AIFXPundi AIFX(PUNDIAI)$16.000.00%
  • PengPeng(PENG)$0.60-13.59%
  • Bitget TokenBitget Token(BGB)$4.664.95%
  • wethWETH(WETH)$1,884.285.66%
  • Ethena USDeEthena USDe(USDE)$1.00-0.04%
  • Binance Bridged USDT (BNB Smart Chain)Binance Bridged USDT (BNB Smart Chain)(BSC-USD)$1.00-0.18%
  • MurasakiMurasaki(MURA)$4.23-13.71%
  • Black PhoenixBlack Phoenix(BPX)$3.351,000.00%
  • Pi NetworkPi Network(PI)$0.714.53%
  • HyperliquidHyperliquid(HYPE)$13.729.80%
  • Wrapped eETHWrapped eETH(WEETH)$2,003.675.53%
  • WhiteBIT CoinWhiteBIT Coin(WBT)$28.350.76%
  • moneroMonero(XMR)$217.841.31%
  • Zypto TokenZypto Token(ZYPTO)$0.037139-3.47%
  • uniswapUniswap(UNI)$6.217.66%
  • AptosAptos(APT)$5.395.79%
  • PepePepe(PEPE)$0.00000811.37%
  • daiDai(DAI)$1.00-0.01%
  • nearNEAR Protocol(NEAR)$2.635.26%
  • XT.comXT.com(XT)$3.08-1.65%
  • Layer One XLayer One X(L1X)$23.35454.66%
  • sUSDSsUSDS(SUSDS)$1.050.05%
  • okbOKB(OKB)$48.762.12%
  • gatechain-tokenGate(GT)$22.883.58%
  • crypto-com-chainCronos(CRO)$0.1015853.46%
  • Coinbase Wrapped BTCCoinbase Wrapped BTC(CBBTC)$84,342.003.68%
  • MantleMantle(MNT)$0.814.44%
  • Tokenize XchangeTokenize Xchange(TKX)$33.460.86%
  • internet-computerInternet Computer(ICP)$5.517.85%
  • ethereum-classicEthereum Classic(ETC)$17.074.81%
  • OndoOndo(ONDO)$0.817.47%
  • First Digital USDFirst Digital USD(FDUSD)$1.00-0.12%
  • aaveAave(AAVE)$168.6110.19%
  • Aerarium FiAerarium Fi(AERA)$7.14-13.11%
  • Ethena Staked USDeEthena Staked USDe(SUSDE)$1.170.30%
  • BSCEXBSCEX(BSCX)$237.310.49%
  • Official TrumpOfficial Trump(TRUMP)$10.354.36%
  • vechainVeChain(VET)$0.0233636.04%
  • cosmosCosmos Hub(ATOM)$4.538.09%
  • fantomFantom(FTM)$0.70-1.56%
  • BittensorBittensor(TAO)$231.277.72%
  • BlackRock USD Institutional Digital Liquidity FundBlackRock USD Institutional Digital Liquidity Fund(BUIDL)$1.000.00%
  • EthenaEthena(ENA)$0.3616194.37%
  • render-tokenRender(RENDER)$3.6710.91%
  • filecoinFilecoin(FIL)$2.927.72%
  • CelestiaCelestia(TIA)$3.181.75%
  • Black AgnusBlack Agnus(FTW)$0.000183423.46%
  • Lombard Staked BTCLombard Staked BTC(LBTC)$84,465.004.02%
  • POL (ex-MATIC)POL (ex-MATIC)(POL)$0.2063993.13%
  • KaspaKaspa(KAS)$0.0682239.38%
  • STAUSTAU(STAU)$0.17397910.95%
  • FasttokenFasttoken(FTN)$4.020.01%
  • Sonic (prev. FTM)Sonic (prev. FTM)(S)$0.5212.98%
  • algorandAlgorand(ALGO)$0.1896979.65%
  • ORA CoinORA Coin(ORA)$4.885.92%
  • ArbitrumArbitrum(ARB)$0.3397526.22%
  • Arbitrum Bridged USDT (Arbitrum)Arbitrum Bridged USDT (Arbitrum)(USDT)$1.000.07%
  • GGTKNGGTKN(GGTKN)$0.1121180.75%
  • kucoin-sharesKuCoin(KCS)$11.231.19%
  • Solv Protocol SolvBTCSolv Protocol SolvBTC(SOLVBTC)$84,076.003.32%
  • fetch-aiArtificial Superintelligence Alliance(FET)$0.4856098.68%
  • optimismOptimism(OP)$0.776.43%
  • StoryStory(IP)$4.75-2.68%