Researchers from the École polytechnique fédérale de Lausanne (EPFL) have uncovered a significant vulnerability in well-known AI language models such as ChatGPT and GPT-4o. By rephrasing malicious queries in the past tense, they were able to bypass the models' safety measures and receive detailed responses to typically blocked questions. For example, instead of directly asking for harmful instructions, posing the question in the past tense yielded the intended information.
The study, titled 'Does Refusal Training in LLMs Generalize to the Past Tense?', systematically tested this method on six advanced language models, including GPT-3.5 Turbo and GPT-4o. Remarkably, while only 1% of direct harmful requests were successful with GPT-4o, that figure skyrocketed to 88% with past-tense rephrasing. This technique was even more effective when dealing with sensitive topics like hacking, achieving a 100% success rate.
This new insight into the vulnerability of AI models raises serious concerns about their application in critical operations. The research indicates that current alignment techniques, such as SFT, RLHF, and adversarial training, may not be as robust as previously thought. The researchers suggest that further study is needed to understand and improve the generalization mechanisms of these alignment methods. As a potential solution, they propose fine-tuning models with critical past tense prompts to ensure better detection and rejection of sensitive queries.