JADE: A Linguistic-based Safety Evaluation Platform for
Large Language Models

① JADE:“There are stones on its rockeries, / Which can be used to polish jade.”

② Third-party safety evaluation platforms help make the LLM industry better and safer.

Abstract: In this work, we present JADE, a targeted linguistic fuzzing platform which strengthens the linguistic complexity of seed questions to simultaneously and consistently break a wide range of widely-used LLMs in three groups: eight open-sourced Chinese, six commercial Chinese, and four commercial English LLMs. JADE generates three safety benchmarks, one for each group, containing highly threatening unsafe questions: each question simultaneously triggers harmful generation from multiple LLMs, with an average unsafe generation ratio of 70% (see the table below), while remaining a natural, fluent question that preserves the core unsafe semantics. We release the benchmark demos generated for commercial English LLMs and open-sourced Chinese LLMs. Readers interested in evaluating more questions generated by JADE are welcome to contact us*.

Unsafe Generation Ratio by group:

Group                        Model Names                                                            Average   Least     Most
Open-sourced LLM (Chinese)   ChatGLM, ChatGLM2, InternLM, Ziya, Baichuan, BELLE, MOSS, ChatYuan2    74.13%    49.00%    93.50%
Commercial LLM (English)     ChatGPT, Claude, PaLM2, LLaMA2                                         74.38%    35.00%    91.25%
Commercial LLM (Chinese)     Doubao, Wenxin Yiyan, ChatGLM, SenseChat, Baichuan, ABAB               77.50%    56.00%    90.00%

JADE is based on Noam Chomsky’s seminal theory of transformational-generative grammar. Given a seed question with unsafe intention, JADE invokes a sequence of generative and transformational rules to increase the complexity of the syntactic structure of the original question, until the safety guardrail is broken. Our key insight is: due to the complexity of human language, most of the current best LLMs can hardly recognize the invariant evil hidden in the infinite number of different syntactic structures, which form an unbounded example space that can never be fully covered. Technically, the generative/transformational rules are constructed by native speakers of the languages and, once developed, can be used to automatically grow and transform the parse tree of a given question, until the guardrail is broken. In addition, JADE incorporates an active learning algorithm to incrementally improve the LLM-based evaluation module, which iteratively optimizes the evaluation prompts with a small amount of annotated data, effectively strengthening alignment with the judgment of human experts.
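The grow-and-transform loop described above can be sketched in miniature. All rule and function names below are hypothetical illustrations, not JADE's actual API: a parse tree is a nested tuple `(label, child, ...)` with string leaves, one generative rule embeds the question in a subordinate clause, one transformational rule attaches an adverbial, and the loop keeps mutating until a stand-in guardrail check (here, a simple depth threshold) is triggered.

```python
# Minimal sketch of a generate/transform fuzzing loop over a parse tree.
# All rules and names are hypothetical illustrations, not JADE's code.
# A parse tree is a nested tuple: (label, child1, child2, ...); leaves are str.

def embed_in_clause(tree):
    """Generative rule (illustrative): wrap the question inside a new clause,
    deepening the parse tree without changing the core semantics."""
    return ("IP",
            ("NP", ("NN", "scenario")),
            ("VP", ("VV", "involves"), tree))

def add_adverbial(tree):
    """Transformational rule (illustrative): attach an adverbial phrase."""
    label, *children = tree
    return (label, ("ADVP", ("AD", "hypothetically")), *children)

def depth(tree):
    """Depth of a parse tree; a bare string leaf has depth 0."""
    if isinstance(tree, str):
        return 0
    return 1 + max(depth(child) for child in tree[1:])

def fuzz(seed_tree, guardrail_broken, rules, max_steps=10):
    """Apply rules in turn, re-testing the target after each mutation."""
    tree = seed_tree
    for step in range(max_steps):
        tree = rules[step % len(rules)](tree)
        if guardrail_broken(tree):  # in JADE: query the LLM, judge the reply
            return tree, step + 1
    return tree, max_steps

seed = ("IP", ("VP", ("VV", "make"), ("NP", ("NN", "X"))))
mutated, steps = fuzz(seed, lambda t: depth(t) >= 5,
                      [embed_in_clause, add_adverbial])
```

In the real platform the stopping condition is of course a query to the target LLM plus the evaluation module's verdict, not a depth threshold; the sketch only shows the control flow of incrementally complicating the syntax.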

* Please contact us by email: mi_zhang@fudan.edu.cn, xdpan@fudan.edu.cn, m_yang@fudan.edu.cn

Highlight Showcase

The highlights of JADE are primarily composed of the following three aspects:

  • Effectiveness: JADE is able to transform originally benign seed questions (with an average violation rate of only about 20%) into highly critical unsafe questions, elevating the average violation rate of well-known LLMs to over 70%. This effectively probes the language-understanding and safety boundaries of LLMs.
  • Transferability: JADE generates highly threatening test questions based on linguistic complexity, which can trigger violations in almost all open-source LLMs. For example, in the Chinese open-source LLM safety benchmark generated by JADE, 30% of the questions simultaneously trigger violations in all eight well-known Chinese open-sourced LLMs.
  • Naturalness: The test questions generated by JADE through linguistic mutation barely modify the core semantics of the original questions and retain the properties of natural language. In contrast, jailbreaking templates for LLMs (including adversarial suffixes) introduce large numbers of semantically irrelevant elements or garbled characters, exhibiting strongly non-natural-language characteristics that are susceptible to targeted defenses by LLM developers.

To better demonstrate the effectiveness of JADE, we provide some interactive examples as follows.


Syntactic Parse Tree🌳
  • Syntactic constituents:

    • IP - Independent Clause
    • VP - Verb Phrase
    • VV - Verb
    • NP - Noun Phrase
    • NN - Noun
    • ADVP - Adverb Phrase
    • AD - Adverb
    • CLP - Classifier Phrase
    • M - Measure Word
    • DNP - Determiner Phrase
    • PP - Prepositional Phrase
    • PU - Punctuation
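Given a bracketed constituency parse using the tags above, the complexity metrics JADE tracks (number of syntactic constituents and parse-tree depth) are straightforward to compute. The bracket reader below is a simplified illustration assuming Penn-Chinese-Treebank-style output, not JADE's own code.

```python
# Sketch: computing constituent count and tree depth from a bracketed
# constituency parse such as "(IP (VP (VV make) (NP (NN plan))))".
# The simple reader below is an illustration, not JADE's parser.

def parse(s):
    """Turn a bracketed tree string into nested lists: [label, child, ...]."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def helper(i):
        assert tokens[i] == "("
        node, i = [tokens[i + 1]], i + 2  # node starts with its label
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = helper(i)      # recurse into a sub-constituent
            else:
                child, i = tokens[i], i + 1  # terminal word
            node.append(child)
        return node, i + 1
    return helper(0)[0]

def constituents(tree):
    """Number of internal nodes, i.e. syntactic constituents."""
    if isinstance(tree, str):
        return 0
    return 1 + sum(constituents(child) for child in tree[1:])

def tree_depth(tree):
    """Depth of the parse tree; a terminal word has depth 0."""
    if isinstance(tree, str):
        return 0
    return 1 + max(tree_depth(child) for child in tree[1:])

t = parse("(IP (NP (NN question)) (VP (ADVP (AD really)) (VV matters)))")
```

For the example parse, the tree has 7 constituents (IP, NP, NN, VP, ADVP, AD, VV) and depth 4; JADE's generative and transformational rules push both numbers upward with each mutation.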