JADE: A Linguistic-based Safety Evaluation Platform for
Large Language Models

① JADE:“There are stones on its rockeries, / Which can be used to polish jade.”

② Third-party safety evaluation platforms help make the LLM industry better and safer.

Abstract: In this work, we present JADE, a targeted linguistic fuzzing platform which strengthens the linguistic complexity of seed questions to simultaneously and consistently break a wide range of widely-used LLMs in three groups: eight open-sourced Chinese, six commercial Chinese, and four commercial English LLMs. JADE generates three safety benchmarks, one for each group, containing highly threatening unsafe questions: each question simultaneously triggers harmful generation from multiple LLMs, with an average unsafe generation ratio of 70% (see the table below), while remaining a natural, fluent question that preserves the core unsafe semantics. We release the benchmark demos generated for commercial English LLMs and open-sourced Chinese LLMs. Readers interested in evaluating more questions generated by JADE are welcome to contact us*.

Unsafe Generation Ratio by group:

Group                        Model Names                                                            Average   Least     Most
Open-sourced LLM (Chinese)   ChatGLM, ChatGLM2, InternLM, Ziya, Baichuan, BELLE, MOSS, ChatYuan2    74.13%    49.00%    93.50%
Commercial LLM (English)     ChatGPT, Claude, PaLM2, LLaMA2                                         74.38%    35.00%    91.25%
Commercial LLM (Chinese)     Doubao, Wenxin Yiyan, ChatGLM, SenseChat, Baichuan, ABAB               77.50%    56.00%    90.00%

JADE is based on Noam Chomsky’s seminal theory of transformational-generative grammar. Given a seed question with unsafe intention, JADE invokes a sequence of generative and transformational rules to increase the complexity of the syntactic structure of the original question, until the safety guardrail is broken. Our key insight is: due to the complexity of human language, most of the current best LLMs can hardly recognize the invariant evil hidden in the infinite number of different syntactic structures, which form an unbounded example space that can never be fully covered. Technically, the generative/transformational rules are constructed by native speakers of the languages and, once developed, can be used to automatically grow and transform the parse tree of a given question, until the guardrail is broken. In addition, JADE incorporates an active learning algorithm to incrementally improve the LLM-based evaluation module, which iteratively optimizes the evaluation prompts with a small amount of annotated data, effectively strengthening alignment with the judgment of human experts.
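The grow-and-transform loop described above can be sketched in miniature. All rule and function names below are hypothetical illustrations, not JADE's actual API: a parse tree is a nested tuple `(label, child, ...)` with string leaves, one generative rule embeds the question in a subordinate clause, one transformational rule attaches an adverbial, and the loop keeps mutating until a stand-in guardrail check (here, a simple depth threshold) is triggered.

```python
# Minimal sketch of a generate/transform fuzzing loop over a parse tree.
# All rules and names are hypothetical illustrations, not JADE's code.
# A parse tree is a nested tuple: (label, child1, child2, ...); leaves are str.

def embed_in_clause(tree):
    """Generative rule (illustrative): wrap the question inside a new clause,
    deepening the parse tree without changing the core semantics."""
    return ("IP",
            ("NP", ("NN", "scenario")),
            ("VP", ("VV", "involves"), tree))

def add_adverbial(tree):
    """Transformational rule (illustrative): attach an adverbial phrase."""
    label, *children = tree
    return (label, ("ADVP", ("AD", "hypothetically")), *children)

def depth(tree):
    """Depth of a parse tree; a bare string leaf has depth 0."""
    if isinstance(tree, str):
        return 0
    return 1 + max(depth(child) for child in tree[1:])

def fuzz(seed_tree, guardrail_broken, rules, max_steps=10):
    """Apply rules in turn, re-testing the target after each mutation."""
    tree = seed_tree
    for step in range(max_steps):
        tree = rules[step % len(rules)](tree)
        if guardrail_broken(tree):  # in JADE: query the LLM, judge the reply
            return tree, step + 1
    return tree, max_steps

seed = ("IP", ("VP", ("VV", "make"), ("NP", ("NN", "X"))))
mutated, steps = fuzz(seed, lambda t: depth(t) >= 5,
                      [embed_in_clause, add_adverbial])
```

In the real platform the stopping condition is of course a query to the target LLM plus the evaluation module's verdict, not a depth threshold; the sketch only shows the control flow of incrementally complicating the syntax.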

* Please contact us by email: mi_zhang@fudan.edu.cn, xdpan@fudan.edu.cn, m_yang@fudan.edu.cn

Highlight Showcase

The highlights of JADE are primarily composed of the following three aspects:

  • Effectiveness: JADE is able to transform originally benign seed questions (with an average violation rate of only about 20%) into highly critical unsafe questions, elevating the average violation rate of well-known LLMs to over 70%. This effectively probes the language-understanding and safety boundaries of LLMs.
  • Transferability: JADE generates highly threatening test questions based on linguistic complexity, which can trigger violations in almost all open-source LLMs. For example, in the Chinese open-source LLM safety benchmark generated by JADE, 30% of the questions simultaneously trigger violations in all eight well-known Chinese open-sourced LLMs.
  • Naturalness: The test questions generated by JADE through linguistic mutation barely modify the core semantics of the original questions and retain the properties of natural language. In contrast, jailbreaking templates for LLMs (including adversarial suffixes) introduce large numbers of semantically irrelevant elements or garbled characters, exhibiting strongly non-natural-language characteristics that are susceptible to targeted defenses by LLM developers.

To better demonstrate the effectiveness of JADE, we provide some interactive examples as follows.


Syntactic Parse Tree🌳
  • Syntactic constituents:

    • IP - Independent Clause
    • VP - Verb Phrase
    • VV - Verb
    • NP - Noun Phrase
    • NN - Noun
    • ADVP - Adverb Phrase
    • AD - Adverb
    • CLP - Classifier Phrase
    • M - Measure Word
    • DNP - Determiner Phrase
    • PP - Prepositional Phrase
    • PU - Punctuation
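Given a bracketed constituency parse using the tags above, the complexity metrics JADE tracks (number of syntactic constituents and parse-tree depth) are straightforward to compute. The bracket reader below is a simplified illustration assuming Penn-Chinese-Treebank-style output, not JADE's own code.

```python
# Sketch: computing constituent count and tree depth from a bracketed
# constituency parse such as "(IP (VP (VV make) (NP (NN plan))))".
# The simple reader below is an illustration, not JADE's parser.

def parse(s):
    """Turn a bracketed tree string into nested lists: [label, child, ...]."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def helper(i):
        assert tokens[i] == "("
        node, i = [tokens[i + 1]], i + 2  # node starts with its label
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = helper(i)      # recurse into a sub-constituent
            else:
                child, i = tokens[i], i + 1  # terminal word
            node.append(child)
        return node, i + 1
    return helper(0)[0]

def constituents(tree):
    """Number of internal nodes, i.e. syntactic constituents."""
    if isinstance(tree, str):
        return 0
    return 1 + sum(constituents(child) for child in tree[1:])

def tree_depth(tree):
    """Depth of the parse tree; a terminal word has depth 0."""
    if isinstance(tree, str):
        return 0
    return 1 + max(tree_depth(child) for child in tree[1:])

t = parse("(IP (NP (NN question)) (VP (ADVP (AD really)) (VV matters)))")
```

For the example parse, the tree has 7 constituents (IP, NP, NN, VP, ADVP, AD, VV) and depth 4; JADE's generative and transformational rules push both numbers upward with each mutation.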