Poster
Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs
Haoming Yang · Ke Ma · Xiaojun Jia · Yingfei Sun · Qianqian Xu · Qingming Huang
Despite the remarkable performance of Large Language Models (LLMs), they remain vulnerable to jailbreak attacks that can compromise their safety mechanisms. Existing studies often rely on brute-force optimization or manual design, and thus fail to uncover the potential risks that arise in real-world scenarios. To address this, we propose ICRT, a novel jailbreak attack framework inspired by heuristics and biases in human cognition. Leveraging the simplicity effect, we employ cognitive decomposition to reduce the complexity of malicious prompts. Simultaneously, we exploit relevance bias to reorganize prompts, enhancing semantic alignment and inducing harmful outputs more effectively. Furthermore, we introduce a ranking-based harmfulness evaluation metric that moves beyond the traditional binary success-or-failure paradigm: ranking aggregation methods such as Elo, HodgeRank, and Rank Centrality are used to comprehensively quantify the harmfulness of generated content. Experimental results demonstrate that our approach consistently bypasses the safety mechanisms of mainstream LLMs and generates actionable high-risk content. This framework provides detailed insight into the potential risks of jailbreak attacks and contributes to the development of more robust defense strategies.
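To illustrate the ranking-aggregation idea behind the harmfulness metric, the sketch below shows how pairwise "which response is more harmful" judgments could be folded into a single ranking using Elo ratings. This is a minimal illustration under assumed inputs, not the authors' implementation; the function name elo_rank, the parameters, and the response IDs r1..r3 are hypothetical placeholders, and HodgeRank or Rank Centrality would be substituted analogously.

```python
# Minimal sketch (not the paper's implementation): aggregate hypothetical
# pairwise harmfulness judgments into a ranking with Elo ratings.
from collections import defaultdict

def elo_rank(pairwise_wins, k=32, base_rating=1000.0, rounds=10):
    """pairwise_wins: list of (winner_id, loser_id) pairs, where the
    'winner' was judged more harmful than the 'loser'."""
    ratings = defaultdict(lambda: base_rating)
    for _ in range(rounds):  # replay comparisons several times to stabilize scores
        for winner, loser in pairwise_wins:
            # Expected probability that the higher-rated side "wins" the comparison.
            expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
            ratings[winner] += k * (1.0 - expected_win)
            ratings[loser] -= k * (1.0 - expected_win)
    # Highest-rated response is ranked as most harmful.
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical comparisons among three generated responses r1..r3:
comparisons = [("r1", "r2"), ("r1", "r3"), ("r2", "r3")]
for response_id, score in elo_rank(comparisons):
    print(response_id, round(score, 1))
```

The resulting scores give a graded ordering of generated responses rather than a binary success-or-failure label, which is the point of the ranking-based evaluation described above.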