On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism ∗

Detección de texto reutilizado y plagio monolingüe y translingüe

Alberto Barrón Cedeño

DSIC Universitat Politècnica de València

TALP Research Center

Universitat Politècnica de Catalunya

albarron@[lsi.upc.edu | gmail.com]

Resumen: Tesis de doctorado en ciencias de la computación (con mención europea del doctorado) escrita por Alberto Barrón Cedeño bajo la supervisión del Dr. Paolo Rosso en la Universitat Politècnica de València. El autor fue examinado en Valencia en julio de 2012 por un jurado compuesto por los siguientes doctores: Paul Clough (University of Sheffield), Benno Stein (Bauhaus-Universität Weimar), Ricardo Baeza-Yates (Yahoo! Research), Fabio Crestani (Università della Svizzera italiana) y José Miguel Benedí (Universitat Politècnica de València). La mención europea fue obtenida tras una estancia de 4 meses en la Information School de la University of Sheffield (Reino Unido) bajo la supervisión del Dr. Paul Clough.

Palabras clave: recuperación de información translingüe, plagio traducido, texto reutilizado, plagio parafrástico, plurilingüismo en Wikipedia

Abstract: Ph.D. thesis (European doctorate mention) in Computer Science written by Alberto Barrón Cedeño under the supervision of Dr. Paolo Rosso at the Universitat Politècnica de València. The author was examined in Valencia in July 2012 by a committee composed of the following doctors: Paul Clough (University of Sheffield), Benno Stein (Bauhaus-Universität Weimar), Ricardo Baeza-Yates (Yahoo! Research), Fabio Crestani (Università della Svizzera italiana), and José Miguel Benedí (Universitat Politècnica de València). The European mention was awarded after a four-month internship at the Information School of the University of Sheffield (UK) under the supervision of Dr. Paul Clough.

Keywords: cross-language information retrieval, re-used text, cross-language plagiarism, paraphrase plagiarism, Wikipedia multilingualism

1. Introduction

Automatic text re-use detection is the task of determining whether a text has been produced by considering another as its source. Plagiarism, the unacknowledged re-use of text, has gained the greatest notoriety. Favoured by the easy access to information through electronic media, plagiarism has risen in recent years, demanding the attention of experts in text analysis.

∗ This PhD research was supported by the National Council of Science and Technology of Mexico (CONACyT) through the 192021/302009 scholarship. The Ministry of Education of Spain supported my internship at the University of Sheffield through the TME2009-00456 grant. The investigation was carried out in the framework of the MICINN project Text-Enterprise 2.0 (TIN2009-13391-C04-03).

Automatic text re-use detection takes advantage of NLP and IR technology to compare thousands of documents, looking for the potential source of a presumed case of re-use. Machine translation technology can be used to uncover cases of cross-language re-use. By exploiting such technology, thousands of exhaustive comparisons are possible, also across languages, something impossible to achieve manually.

In this dissertation we pay special attention to three aspects of text re-use:

1. Cross-language text re-use: we propose a cross-language similarity assessment model that represents one of the best options when looking for exact translations.

2. Paraphrase text re-use: we investigate which types of paraphrasing are most frequently applied when plagiarising and how they hinder plagiarism detection; something never done before.

3. Mono- and cross-language re-use within and from Wikipedia: the encyclopedia is explored as a multi-authoring framework, where texts are re-used within versions of an article and across languages.

Procesamiento del Lenguaje Natural, Revista nº 50 marzo de 2013, pp 103-105

recibido 21-11-12 revisado 15-01-13 aceptado 19-02-13

ISSN 1135-5948

2. Thesis Overview

The dissertation consists of 9 chapters, describing our efforts to approach the main difficulties of automatic text re-use detection. The contents are described below.

Chapters 2 and 3 give an overall introduction to the covered topics. Chapter 2 offers an overview of text re-use, with special emphasis on plagiarism. Our contribution comes in the form of a survey we conducted in different Mexican universities, aiming to assess how often students plagiarise across languages and their attitudes with respect to paraphrase plagiarism (factors never analysed before). Chapter 3 introduces the IR and NLP concepts used throughout the rest of the thesis.

Chapter 4 describes the corpora for (automatic) analysis of text re-use and plagiarism available to date. Our participation in the construction of three corpora (co-derivatives, CL!TR and, to a smaller extent, PAN-PC) is among the cutting-edge contributions discussed in this chapter. Evaluation metrics are also discussed: some are well known in IR and related areas, whereas others were recently proposed, and specially designed, for evaluating text re-use detection.

Chapter 5 defines the two main approaches to re-use detection: intrinsic and external. Our contributions to external (monolingual) detection are discussed. Our main contribution is a model for retrieving the documents related to the suspicious one, hence reducing the load when performing the actual plagiarism detection process. This problem is often neglected in the plagiarism detection literature, which assumes that the step is either unnecessary or already solved; an entirely false assumption.
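The retrieval step can be illustrated with a minimal sketch. This is not the model proposed in the thesis: it assumes a plain TF-IDF bag-of-words representation compared with cosine similarity, and the names `tokenize`, `retrieve_candidates`, and the toy collection are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve_candidates(suspicious, collection, k=2):
    """Return up to k (doc_id, score) pairs ranked by TF-IDF cosine similarity.

    Assumption: `collection` maps hypothetical document ids to raw text.
    """
    term_freqs = {doc_id: Counter(tokenize(text)) for doc_id, text in collection.items()}
    n_docs = len(term_freqs)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for tf in term_freqs.values() for term in tf)

    def weigh(tf):
        # Smoothed IDF, so unseen or ubiquitous terms keep a positive weight.
        return {t: f * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
                for t, f in tf.items()}

    def cosine(u, v):
        dot = sum(w * v[t] for t, w in u.items() if t in v)
        norm = (math.sqrt(sum(w * w for w in u.values()))
                * math.sqrt(sum(w * w for w in v.values())))
        return dot / norm if norm else 0.0

    query = weigh(Counter(tokenize(suspicious)))
    scored = [(doc_id, cosine(query, weigh(tf))) for doc_id, tf in term_freqs.items()]
    # Keep only documents that share vocabulary with the suspicious text.
    ranked = sorted((p for p in scored if p[1] > 0), key=lambda p: p[1], reverse=True)
    return ranked[:k]


# Toy data, invented for illustration only.
collection = {
    "src1": "Plagiarism is the unacknowledged re-use of text",
    "src2": "Machine translation maps a sentence into another language",
    "src3": "Wikipedia articles are edited by many authors",
}
suspicious = "The unacknowledged re-use of text is known as plagiarism"
candidates = retrieve_candidates(suspicious, collection, k=2)
```

Only the documents returned in `candidates` would then be passed to the more expensive detailed comparison, which is where the reduction in load comes from.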

Chapter 6 describes CL-ASA, our model for cross-language detection, one of the least approached problems of re-use detection. CL-ASA is compared to state-of-the-art models over different sub-tasks of the detection process. A variety of languages is considered to analyse the strengths and weaknesses of the different models.

Chapter 7 discusses the international competition we ran for three years: the PAN International Competition on Plagiarism Detection. We also experiment with our detection models on the generated test-beds and discuss the obtained results.

Chapter 8 analyses plagiarism from the point of view of paraphrasing, providing a bridge between two disciplines: plagiarism detection and paraphrase analysis. Our findings on the use of paraphrasing when plagiarising represent useful insights to take into account when developing the next generation of plagiarism detection systems.

In Chapter 9 we analyse monolingual co-derivation among revisions of Wikipedia articles and cross-language text re-use from Wikipedia. Related to the latter issue, we offer a preliminary discussion on the PAN competition we organised at FIRE on cross-language text re-use: PAN Cross-Language !ndian Text Re-Use, where the potentially re-used documents were written in Hindi and the potential source documents were written in English.

3. Thesis Contributions

The main contributions of this research are

described below.

Detection of text re-use across languages. We explored a range of cross-language information retrieval techniques. We observed that (i) a simple model based on characterising texts by short character n-grams (CL-CNG) was worth considering when dealing with common-alphabet languages (and different alphabets, after transliteration), and particularly if the languages have some influence on each other (Barrón-Cedeño et al., 2010; Potthast et al., 2011); (ii) the cross-language explicit semantic analysis model (CL-ESA), based on large comparable corpora such as Wikipedia, performs well when looking for related documents across languages (Potthast et al., 2011). We proposed a model, cross-language alignment-based similarity analysis (CL-ASA), based on translation probabilities and length distributions between texts (Barrón-Cedeño et al., 2008; Pinto et al., 2009). Our empirical results showed that CL-ASA is competitive when looking for re-used texts, regardless of whether they were manually or automatically translated. CL-ASA performs better than CL-ESA and CL-CNG, identified as two of the most appealing models for cross-language similarity assessment, when dealing with translations at document and fragment level (Potthast et al., 2011).
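The idea behind characterising texts by short character n-grams can be sketched in a few lines. This is an illustration rather than the CL-CNG implementation evaluated in the thesis: the n-gram length of 3, the normalisation steps (lowercasing, stripping diacritics, keeping Latin letters and digits only), and the function names are all assumptions made for the example.

```python
import math
import re
import unicodedata
from collections import Counter


def char_ngrams(text, n=3):
    """Return a bag of character n-grams after light normalisation."""
    # Lowercase and strip diacritics, so "detección" becomes "deteccion".
    decomposed = unicodedata.normalize("NFKD", text.lower())
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Assumption: Latin alphabet only; any other run of characters becomes one space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", stripped).strip()
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cl_cng_similarity(x, y, n=3):
    """Similarity of two texts, possibly in different languages."""
    return cosine(char_ngrams(x, n), char_ngrams(y, n))


# Invented example sentences: a Spanish text, its English counterpart,
# and an unrelated English sentence.
spanish = "detección automática de plagio en documentos"
english = "automatic detection of plagiarism in documents"
unrelated = "the weather was cold yesterday morning"

sim_translation = cl_cng_similarity(spanish, english)
sim_unrelated = cl_cng_similarity(spanish, unrelated)
```

Because Spanish and English share many cognates, the translated pair overlaps in n-grams such as "det", "pla" and "doc", whereas the unrelated sentence does not. No translation resources are involved, which is why such a model is limited to languages with lexical and orthographic overlap.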

Creation of standard collections of documents for the study and development of plagiarism detection. We helped in the creation of two “sister” corpora with simulated cases of re-use and plagiarism. The PAN-PC series aims at composing a realistic IR challenge: it includes thousands of documents, with thousands of plagiarism cases (both manually and automatically generated) (Potthast et al., 2010). The CL!TR corpus aims at composing a realistic cross-language challenge: it contains a few thousand documents, with hundreds of re-use cases (manually generated across distant languages) (Barrón-Cedeño et al., 2011). These corpora (particularly the PAN-PC series) have become a reference in the development of models for plagiarism detection, filling an important gap.

Analysis of paraphrase plagiarism and its detection. The vast majority of models for text re-use detection are designed to uncover “cut and paste” cases, as they consider surface information only. These models are unsuccessful when facing paraphrase plagiarism. For the first time, we analysed the paraphrase phenomena applied when text is plagiarised (Barrón-Cedeño et al., 2013 (to appear)). Our seminal study showed that lexical substitutions are the most frequently used paraphrase mechanisms. Moreover, paraphrasing tends to be used to generate a simplified version of the re-used text. A model intended to succeed in detecting paraphrase re-use requires robust text pre-processing and characterisations: the expansion (or contraction) of related vocabulary, the normalisation of formatting and word forms, and the inclusion of mechanisms that model the expected length of a re-used fragment given its source.
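One possible way to model the expected length of a re-used fragment given its source is sketched below. It assumes, purely for illustration, that the ratio between the two lengths follows a Gaussian with mean `mu` and standard deviation `sigma`; the function name and the parameter values are hypothetical and would have to be estimated from real pairs of source and re-used texts.

```python
import math


def length_factor(source_len, candidate_len, mu, sigma):
    """Score in (0, 1] for how plausible the candidate length is.

    Assumption: the ratio candidate_len / source_len is Gaussian with
    mean `mu` and standard deviation `sigma` (both estimated elsewhere).
    """
    if source_len <= 0 or sigma <= 0:
        raise ValueError("source_len and sigma must be positive")
    ratio = candidate_len / source_len
    return math.exp(-0.5 * ((ratio - mu) / sigma) ** 2)


# Made-up parameters: re-used fragments expected to be 10% longer than the source.
score_close = length_factor(100, 110, mu=1.1, sigma=0.3)
score_far = length_factor(100, 300, mu=1.1, sigma=0.3)
```

A candidate whose length matches the expectation scores close to 1, and the score decays as the length ratio departs from `mu`. Since the study found that paraphrasing tends to produce a simplified version of the text, a value of `mu` below 1 may be more appropriate for paraphrase re-use.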

The PAN-PC corpora, created in the framework of the PAN International Competition on Plagiarism Detection, are available at http://www.uni-weimar.de/cms/medien/webis/research/corpora.html. The CL!TR corpus, created in the framework of the PAN Cross-Language !ndian Text Reuse challenge, is available at http://memex2.dsic.upv.es/workshops/2011/clitr/

References

Barrón-Cedeño, Alberto, Paolo Rosso, Eneko Agirre, and Gorka Labaka. 2010. Plagiarism detection across distant language pairs. In Huang and Jurafsky (2010).

Barrón-Cedeño, Alberto, Paolo Rosso, Sobha Lalitha Devi, Paul Clough, and Mark Stevenson. 2011. PAN@FIRE: Overview of the Cross-Language !ndian Text Re-Use Detection Competition. In FIRE, editor, FIRE 2011 Working Notes. Third Workshop of the Forum for Information Retrieval Evaluation.

Barrón-Cedeño, Alberto, Paolo Rosso, David Pinto, and Alfons Juan. 2008. On Cross-Lingual Plagiarism Analysis Using a Statistical Model. In Benno Stein, Efstathios Stamatatos, and Moshe Koppel, editors, ECAI 2008 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 2008), volume 377, pages 9–13, Patras, Greece. CEUR-WS.org. http://ceur-ws.org/Vol-377.

Barrón-Cedeño, Alberto, Marta Vila, M. Antònia Martí, and Paolo Rosso. 2013 (to appear). Plagiarism meets Paraphrasing: Insights for the Next Generation in Automatic Plagiarism Detection. Computational Linguistics.

Huang, Chu-Ren and Dan Jurafsky, editors. 2010. Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, China, August. COLING 2010 Organizing Committee.

Pinto, David, Jorge Civera, Alberto Barrón-Cedeño, Alfons Juan, and Paolo Rosso. 2009. A Statistical Approach to Crosslingual Natural Language Tasks. Journal of Algorithms, 64(1):51–60.

Potthast, Martin, Alberto Barrón-Cedeño, Benno Stein, and Paolo Rosso. 2011. Cross-language plagiarism detection. Language Resources and Evaluation (LRE), Special Issue on Plagiarism and Authorship Analysis, 45(1):1–18.

Potthast, Martin, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. 2010. An evaluation framework for plagiarism detection. In Huang and Jurafsky (2010), pages 997–1005.
