CL-ASA is competitive when looking for re-used texts, regardless of whether they were manually or automatically translated. CL-ASA performs better than CL-ESA and CL-CNG, identified as two of the most appealing models for cross-language similarity assessment, when dealing with translations at document and fragment level (Potthast et al., 2011).
Creation of standard collections of documents for the study and development of plagiarism detection. We helped in the creation of two “sister” corpora with simulated cases of re-use and plagiarism. The PAN-PC series aims to pose a realistic IR challenge: it includes thousands of documents, with thousands of plagiarism cases (both manually and automatically generated) (Potthast et al., 2010). The CL!TR corpus aims to pose a realistic cross-language challenge: it contains a few thousand documents, with hundreds of re-use cases (manually generated across distant languages) (Barrón-Cedeño et al., 2011). These corpora (particularly the PAN-PC series) have become a reference in the development of models for plagiarism detection, filling an important gap.¹
Analysis of paraphrase plagiarism and its detection. The vast majority of models for text re-use detection are designed to uncover “cut and paste” cases, as they consider surface information only. These models are unsuccessful when facing paraphrase plagiarism. For the first time, we analysed the paraphrase phenomena applied when text is plagiarised (Barrón-Cedeño et al., 2013 (to appear)). Our seminal study showed that lexical substitutions are the most frequently used paraphrase mechanisms. Moreover, paraphrasing tends to be used to generate a simplified version of the re-used text. A model intended to succeed in detecting paraphrase re-use requires robust text pre-processing and characterisations: the expansion (or contraction) of related vocabulary, the normalisation of formatting and word forms, and the inclusion of mechanisms that model the expected length of a re-used fragment given its source.
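As an illustration of two of the requirements just listed (normalisation of word forms and a model of expected fragment length), the following is a minimal sketch rather than the model used in the work summarised here. The function names `normalise` and `length_factor`, the Gaussian shape of the length score, and the default values of `mu` and `sigma` are all assumptions made for the example; in practice the mean and deviation of the length ratio would have to be estimated from a corpus of re-use cases. Vocabulary expansion is not covered.

```python
import math
import re
import unicodedata


def normalise(text: str) -> list[str]:
    """Lower-case, strip diacritics and punctuation, and split into tokens."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return re.findall(r"[a-z0-9]+", without_marks)


def length_factor(source_len: int, candidate_len: int,
                  mu: float = 1.0, sigma: float = 0.5) -> float:
    """Score in (0, 1] for how plausible candidate_len is given source_len.

    Assumption: the ratio candidate_len / source_len is scored with a
    Gaussian-shaped curve centred on mu with spread sigma. Both values
    are hypothetical defaults, not figures taken from the paper.
    """
    if source_len <= 0:
        raise ValueError("source_len must be positive")
    ratio = candidate_len / source_len
    return math.exp(-0.5 * ((ratio - mu) / sigma) ** 2)
```

Since the study found that paraphrasing tends to simplify the re-used text, a value of `mu` below 1.0 may be a reasonable starting point when the candidate is the suspected re-use; the score can then be used to weight or filter candidate fragments before a more expensive comparison.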
¹ The PAN-PC corpora, created in the framework of the PAN International Competition on Plagiarism Detection, are available at http://www.uni-weimar.de/cms/medien/webis/research/corpora.html. The CL!TR corpus, created in the framework of the PAN Cross-Language !ndian Text Reuse challenge, is available at http://memex2.dsic.upv.es/workshops/2011/clitr/
References
Barrón-Cedeño, Alberto, Paolo Rosso, Eneko Agirre, and Gorka Labaka. 2010. Plagiarism detection across distant language pairs. In Huang and Jurafsky (Huang and Jurafsky, 2010).
Barrón-Cedeño, Alberto, Paolo Rosso, Sobha Lalitha Devi, Paul Clough, and Mark Stevenson. 2011. PAN@FIRE: Overview of the Cross-Language !ndian Text Re-Use Detection Competition. In FIRE, editor, FIRE 2011 Working Notes. Third Workshop of the Forum for Information Retrieval Evaluation.
Barrón-Cedeño, Alberto, Paolo Rosso, David Pinto, and Alfons Juan. 2008. On Cross-Lingual Plagiarism Analysis Using a Statistical Model. In Benno Stein, Efstathios Stamatatos, and Moshe Koppel, editors, ECAI 2008 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 2008), volume 377, pages 9–13, Patras, Greece. CEUR-WS.org. http://ceur-ws.org/Vol-377.
Barrón-Cedeño, Alberto, Marta Vila, M. Antònia Martí, and Paolo Rosso. 2013 (to appear). Plagiarism meets Paraphrasing: Insights for the Next Generation in Automatic Plagiarism Detection. Computational Linguistics.
Huang, Chu-Ren and Dan Jurafsky, editors. 2010. Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, China, August. COLING 2010 Organizing Committee.
Pinto, David, Jorge Civera, Alberto Barrón-Cedeño, Alfons Juan, and Paolo Rosso. 2009. A Statistical Approach to Crosslingual Natural Language Tasks. Journal of Algorithms, 64(1):51–60.
Potthast, Martin, Alberto Barrón-Cedeño, Benno Stein, and Paolo Rosso. 2011. Cross-language plagiarism detection. Language Resources and Evaluation (LRE), Special Issue on Plagiarism and Authorship Analysis, 45(1):1–18.
Potthast, Martin, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. 2010. An evaluation framework for plagiarism detection. In Huang and Jurafsky (Huang and Jurafsky, 2010), pages 997–1005.
On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism