On the Mono- and Cross-Language
Detection of Text Re-Use and Plagiarism
Detección de texto reutilizado y plagio monolingüe y translingüe
Alberto Barrón Cedeño
DSIC Universitat Politècnica de València
TALP Research Center
Universitat Politècnica de Catalunya
albarron@[lsi.upc.edu | gmail.com]
Resumen: Tesis de doctorado en ciencias de la computación (con mención europea del doctorado) escrita por Alberto Barrón Cedeño bajo la supervisión del Dr. Paolo Rosso en la Universitat Politècnica de València. El autor fue examinado en Valencia en julio de 2012 por un jurado compuesto por los siguientes doctores: Paul Clough (University of Sheffield), Benno Stein (Bauhaus-Universität Weimar), Ricardo Baeza-Yates (Yahoo! Research), Fabio Crestani (Università della Svizzera italiana) y José Miguel Benedí (Universitat Politècnica de València). La mención europea fue obtenida tras una estancia de 4 meses en la Information School de la University of Sheffield (Reino Unido) bajo la supervisión del Dr. Paul Clough.
Palabras clave: recuperación de información translingüe, plagio traducido, texto reutilizado, plagio parafrástico, plurilingüismo en Wikipedia
Abstract: Ph.D. thesis (European doctorate mention) in Computer Science written by Alberto Barrón Cedeño under the supervision of Dr. Paolo Rosso at the Universitat Politècnica de València. The author was examined in Valencia in July 2012 by a committee composed of the following doctors: Paul Clough (University of Sheffield), Benno Stein (Bauhaus-Universität Weimar), Ricardo Baeza-Yates (Yahoo! Research), Fabio Crestani (Università della Svizzera italiana), and José Miguel Benedí (Universitat Politècnica de València). The European mention was awarded after a four-month internship at the Information School of the University of Sheffield (UK) under the supervision of Dr. Paul Clough.
Keywords: cross-language information retrieval, re-used text, cross-language plagiarism, paraphrase plagiarism, Wikipedia multilingualism
1. Introduction
Automatic text re-use detection is the task of determining whether a text has been produced by considering another as its source. Plagiarism, the unacknowledged re-use of text, has gained the greatest notoriety. Favoured by the easy access to information through electronic media, plagiarism has risen in recent years, calling for the attention of experts in text analysis.
This PhD research was supported by the National Council of Science and Technology of Mexico (CONACyT) through the 192021/302009 scholarship. The Ministry of Education of Spain supported my internship at the University of Sheffield through the TME2009-00456 grant. The investigation was carried out in the framework of the MICINN project Text-Enterprise 2.0 (TIN2009-13391-C04-03).
Automatic text re-use detection takes advantage of NLP and IR technology to compare thousands of documents, looking for the potential source of a presumed case of re-use. Machine translation technology can be used to uncover cases of cross-language re-use. By exploiting such technology, thousands of exhaustive comparisons are possible, also across languages, something impossible to achieve manually.
In this dissertation we pay special atten-
tion to three aspects of text re-use:
1. Cross-language text re-use: we propose a cross-language similarity assessment model that represents one of the best options when looking for exact translations.
2. Paraphrase text re-use: we investigate which types of paraphrasing are most frequently applied when plagiarising and how they hinder plagiarism detection; something never done before.
3. Mono- and cross-language re-use within and from Wikipedia: the encyclopedia is explored as a multi-authoring framework, where texts are re-used within versions of an article and across languages.
Procesamiento del Lenguaje Natural, Journal no. 50, March 2013, pp. 103-105
Received 21-11-12, revised 15-01-13, accepted 19-02-13
ISSN 1135-5948
© 2013 Sociedad Española para el Procesamiento del Lenguaje Natural
2. Thesis Overview
The dissertation consists of 9 chapters, describing our efforts to approach the main difficulties of automatic text re-use detection. The contents are as follows.
Chapters 2 and 3 are a general introduction to the topics covered. Chapter 2 offers an overview of text re-use, with special emphasis on plagiarism. Our contribution comes in the form of a survey we conducted in different Mexican universities, aiming to assess how often students plagiarise across languages and their attitudes towards paraphrase plagiarism (factors never analysed before). Chapter 3 introduces the IR and NLP concepts used throughout the rest of the thesis.
Chapter 4 describes the corpora for (automatic) analysis of text re-use and plagiarism available to date. Our participation in the construction of three corpora (co-derivatives, CL!TR, and to a smaller extent PAN-PC) is among the cutting-edge contributions discussed in this chapter. Evaluation metrics are also discussed: some are well known in IR and related areas, whereas others were recently proposed, and specially designed, for evaluating text re-use detection.
Chapter 5 defines the two main approaches to re-use detection: intrinsic and external. Our contributions to external (monolingual) detection are discussed. Our main contribution is a model for retrieving the documents related to the suspicious one, hence reducing the load when performing the actual plagiarism detection process. This problem is often neglected in the plagiarism detection literature, which assumes either that the step is unnecessary or that it is already solved; an absolutely false idea.
Chapter 6 describes CL-ASA, our model for cross-language detection, which is one of the least explored problems in re-use detection. CL-ASA is compared to state-of-the-art models over different sub-tasks of the detection process. A variety of languages is considered in order to analyse the strengths and weaknesses of the different models.
Chapter 7 discusses the international competitions we ran over three years: the PAN International Competition on Plagiarism Detection. We also experiment with our detection models on the generated test-beds and discuss the obtained results.
Chapter 8 analyses plagiarism from the point of view of paraphrasing, providing a bridge between the two disciplines: plagiarism detection and paraphrase analysis. Our findings on the use of paraphrasing when plagiarising represent useful insights to take into account when developing the next generation of plagiarism detection systems.
In Chapter 9 we analyse monolingual co-derivation among revisions of Wikipedia articles and cross-language text re-use from Wikipedia. Related to the latter issue, we offer a preliminary discussion of the PAN competition we organised at FIRE on cross-language text re-use, PAN Cross-Language !ndian Text Re-Use, where the potentially re-used documents were written in Hindi and the potential source documents were written in English.
3. Thesis Contributions
The main contributions of this research are
described below.
Detection of text re-use across languages. We explored a range of cross-language information retrieval techniques. We observed that (i) a simple model based on characterising texts by short character n-grams (CL-CNG) was worth considering when dealing with common-alphabet languages (and different alphabets, after transliteration), particularly if the languages have some influence on one another (Barrón-Cedeño et al., 2010; Potthast et al., 2011); (ii) the cross-language explicit semantic analysis model (CL-ESA), based on large comparable corpora such as Wikipedia, performs well when looking for related documents across languages (Potthast et al., 2011). We proposed a model, cross-language alignment-based similarity analysis (CL-ASA), based on translation probabilities and length distributions between texts (Barrón-Cedeño et al., 2008; Pinto et al., 2009). Our empirical results showed that
CL-ASA is competitive when looking for re-used texts, regardless of whether they were manually or automatically translated. CL-ASA performs better than CL-ESA and CL-CNG, identified as two of the most appealing models for cross-language similarity assessment, when dealing with translations at document and fragment level (Potthast et al., 2011).
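As a rough illustration of the character n-gram idea behind CL-CNG, the following Python sketch compares two texts by the cosine similarity of their character n-gram counts. The choice of n = 3, the normalisation steps, the cosine measure, and all function names are assumptions made for this example; it is not the implementation evaluated in the thesis.

```python
import math
import re
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after a simple normalisation."""
    # Illustrative normalisation (an assumption): lowercase, drop punctuation,
    # and collapse runs of whitespace into single spaces.
    normalised = re.sub(r"[^\w\s]", "", text.lower())
    normalised = re.sub(r"\s+", " ", normalised).strip()
    return Counter(normalised[i:i + n] for i in range(len(normalised) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text is too short to yield any n-gram
    return dot / (norm_a * norm_b)


def cng_similarity(text_a: str, text_b: str, n: int = 3) -> float:
    """Similarity of two texts based on shared character n-grams."""
    return cosine(char_ngrams(text_a, n), char_ngrams(text_b, n))


if __name__ == "__main__":
    english = "international information retrieval"
    spanish = "recuperación de información internacional"
    print(cng_similarity(english, spanish))
```

Because the comparison relies on surface character sequences only, a sketch like this can only reward language pairs that share an alphabet and some vocabulary, which is consistent with the observation above about common-alphabet languages.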
Creation of standard collections of documents for the study and development of plagiarism detection. We helped in the creation of two "sister" corpora with simulated cases of re-use and plagiarism. The PAN-PC series aims to compose a realistic IR challenge: it includes thousands of documents, with thousands of plagiarism cases (both manually and automatically generated) (Potthast et al., 2010). The CL!TR corpus aims to compose a realistic cross-language challenge: it contains a few thousand documents, with hundreds of re-use cases (manually generated across distant languages) (Barrón-Cedeño et al., 2011). These corpora (particularly the PAN-PC series) have become a reference in the development of models for plagiarism detection, filling an important gap.¹
Analysis of paraphrase plagiarism and its detection. The vast majority of models for text re-use detection are designed to uncover "cut and paste" cases, as they consider surface information only. These models are unsuccessful when facing paraphrase plagiarism. For the first time, we analysed the paraphrase phenomena applied when text is plagiarised (Barrón-Cedeño et al., 2013 (to appear)). Our seminal study showed that lexical substitutions are the most frequently used paraphrase mechanisms. Moreover, paraphrasing tends to be used to generate a simplified version of the re-used text. A model intended to succeed in detecting paraphrase re-use requires robust text pre-processing and characterisations: the expansion (or contraction) of related vocabulary, the normalisation of formatting and word forms, and the inclusion of mechanisms that model the expected length of a re-used fragment given its source.
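One way to picture a mechanism that models the expected length of a re-used fragment given its source is a bell-shaped factor over the length ratio between the two texts. The Python sketch below is only an illustration under stated assumptions: the Gaussian form, the function name, and the default values of mu and sigma are hypothetical placeholders, and in practice such parameters would have to be estimated from real source and re-use pairs.

```python
import math


def length_factor(source_len: int, candidate_len: int,
                  mu: float = 1.0, sigma: float = 0.3) -> float:
    """Score in (0, 1] that peaks when candidate_len / source_len equals mu.

    mu is the expected length ratio and sigma its spread; both defaults are
    hypothetical values chosen for illustration, not estimates from data.
    """
    if source_len <= 0 or sigma <= 0:
        raise ValueError("source_len and sigma must be positive")
    ratio = candidate_len / source_len
    return math.exp(-0.5 * ((ratio - mu) / sigma) ** 2)


if __name__ == "__main__":
    # Lengths could be measured in characters or tokens; here they are plain integers.
    for candidate_len in (100, 90, 40):
        print(candidate_len, length_factor(100, candidate_len))
```

Since the study found that paraphrasing tends to produce a simplified version of the source, a value of mu below 1 might be a reasonable starting hypothesis for paraphrase re-use, although that would need to be checked empirically.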
¹ The PAN-PC corpora, created in the framework of the PAN International Competition on Plagiarism Detection, are available at http://www.uni-weimar.de/cms/medien/webis/research/corpora.html. The CL!TR corpus, created in the framework of the PAN Cross-Language !ndian Text Reuse challenge, is available at http://memex2.dsic.upv.es/workshops/2011/clitr/
References
Barrón-Cedeño, Alberto, Paolo Rosso, Eneko Agirre, and Gorka Labaka. 2010. Plagiarism detection across distant language pairs. In Huang and Jurafsky (2010).
Barrón-Cedeño, Alberto, Paolo Rosso, Sobha Lalitha Devi, Paul Clough, and Mark Stevenson. 2011. PAN@FIRE: Overview of the Cross-Language !ndian Text Re-Use Detection Competition. In FIRE, editor, FIRE 2011 Working Notes. Third Workshop of the Forum for Information Retrieval Evaluation.
Barrón-Cedeño, Alberto, Paolo Rosso, David Pinto, and Alfons Juan. 2008. On Cross-Lingual Plagiarism Analysis Using a Statistical Model. In Benno Stein, Efstathios Stamatatos, and Moshe Koppel, editors, ECAI 2008 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 2008), volume 377, pages 9–13, Patras, Greece. CEUR-WS.org. http://ceur-ws.org/Vol-377.
Barrón-Cedeño, Alberto, Marta Vila, M. Antònia Martí, and Paolo Rosso. 2013 (to appear). Plagiarism meets Paraphrasing: Insights for the Next Generation in Automatic Plagiarism Detection. Computational Linguistics.
Huang, Chu-Ren and Dan Jurafsky, editors. 2010. Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, China, August. COLING 2010 Organizing Committee.
Pinto, David, Jorge Civera, Alberto Barrón-Cedeño, Alfons Juan, and Paolo Rosso. 2009. A Statistical Approach to Crosslingual Natural Language Tasks. Journal of Algorithms, 64(1):51–60.
Potthast, Martin, Alberto Barrón-Cedeño, Benno Stein, and Paolo Rosso. 2011. Cross-language plagiarism detection. Language Resources and Evaluation (LRE), Special Issue on Plagiarism and Authorship Analysis, 45(1):1–18.
Potthast, Martin, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. 2010. An evaluation framework for plagiarism detection. In Huang and Jurafsky (2010), pages 997–1005.