SciELF corpus

The SciELF corpus consists of research papers that have not been professionally proofread or checked by a native speaker of English.

All the papers are written by L2 users of English, and most of these are final drafts of unpublished manuscripts. It is thus a corpus of second-language use (SLU) in written scientific communication. Several international partners have contributed material to this corpus, resulting in 150 papers (759,300 words) by authors with ten different L1 backgrounds. The breakdown of these L1s is as follows:

The ten L1 categories in the SciELF corpus:
    first author's L1    no. of articles  no. of words  % of words  avg. words/article
 1  Finnish                           25        123153         16%                4926
 2  Czech                             22        109173         14%                4962
 3  French                            16         91186         12%                5699
 4  Chinese                           21         84807         11%                4038
 5  Spanish                           13         79038         10%                6080
 6  Russian                           13         71376          9%                5490
 7  Swedish                           13         60060          8%                4620
 8  Italian                           11         58685          8%                5335
 9  Portuguese (Brazil)               12         56625          7%                4719
10  Romanian                           4         25197          3%                6299
    Total                            150        759300        100%                5062
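
The two derived columns in the table above (% of words and avg. words/article) follow directly from the article and word counts. As a minimal sketch, assuming whole-number rounding (the names `counts` and `derived` are illustrative and not part of the corpus distribution), they can be recomputed like this:

```python
# Article and word counts per first-author L1, copied from the table above.
counts = {
    "Finnish": (25, 123153),
    "Czech": (22, 109173),
    "French": (16, 91186),
    "Chinese": (21, 84807),
    "Spanish": (13, 79038),
    "Russian": (13, 71376),
    "Swedish": (13, 60060),
    "Italian": (11, 58685),
    "Portuguese (Brazil)": (12, 56625),
    "Romanian": (4, 25197),
}

total_articles = sum(articles for articles, _ in counts.values())
total_words = sum(words for _, words in counts.values())


def derived(articles, words):
    """Return (% of corpus words, mean words per article), rounded to integers."""
    return round(100 * words / total_words), round(words / articles)


rows = {l1: derived(articles, words) for l1, (articles, words) in counts.items()}

for l1, (pct, avg) in rows.items():
    print(f"{l1:<20} {pct:>3}% {avg:>6}")
print(f"{'Total':<20} {total_articles:>4} articles, {total_words} words")
```

Because each percentage is rounded separately, the rounded values need not sum to exactly 100.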

In addition, we attempted to compile a sample of papers balanced between the sciences (labelled ‘Sci’) and the social sciences and humanities (labelled ‘SSH’). However, the SSH texts proved to be much longer on average than the Sci texts, so the two categories are roughly balanced by number of articles but not by number of words:

Distribution of the broad binary categories in the SciELF corpus:
category  no. of articles  no. of words  % of total words  avg. words/article
Sci                    78        326463               43%                4185
SSH                    72        432837               57%                6012
Total                 150        759300              100%                5062
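
To make the imbalance concrete, the short sketch below compares each category's share of articles with its share of words, using only the figures in the table above. The variable names are illustrative assumptions, not part of the corpus documentation.

```python
# (articles, words) for the two broad categories, copied from the table above.
sci = (78, 326463)
ssh = (72, 432837)

articles = sci[0] + ssh[0]
words = sci[1] + ssh[1]

# Sci contributes just over half of the articles but well under half of the words.
sci_article_share = round(100 * sci[0] / articles)
sci_word_share = round(100 * sci[1] / words)

# Ratio of mean article length, SSH relative to Sci.
length_ratio = (ssh[1] / ssh[0]) / (sci[1] / sci[0])

print(f"Sci: {sci_article_share}% of articles, {sci_word_share}% of words")
print(f"SSH articles are about {length_ratio:.2f} times as long as Sci articles on average")
```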

Among the 326,463 words in the Sci category, most are drawn from the natural sciences (79%) and medicine (18%). The 432,837 words in SSH are drawn from social sciences (45%), humanities (34%), and behavioural sciences (21%). The academic roles of the first authors are distributed as follows:

first author role  no. of articles  no. of words  % of words
Junior staff                    86        418366         55%
Senior staff                    34        172075         23%
Research student                17        107998         14%
Unknown                         11         41116          5%
Masters student                  2         19745          3%
Total                          150        759300        100%

International partners

The SciELF corpus would not have been possible without the generous contribution of our international partners, who obtained texts and author permissions in their respective home countries. We gratefully acknowledge the contribution of the following researchers:

  • Marina Bondi and Anna Stermieri, University of Modena and Reggio Emilia
  • Maria Kuteeva and Lisa McGrath, University of Stockholm
  • Pilar Mur-Dueñas, University of Zaragoza
  • Laura Muresan and Mirela Bardi, Bucharest University of Economic Studies
  • Lene Nordrum, Lund University
  • Wei Ren, Guangdong University of Foreign Studies
  • Elizabeth Rowley-Jolivet, Université d’Orléans
  • Tony Berber Sardinha, Catholic University of São Paulo
  • Irina Shchemeleva, St. Petersburg Higher School of Economics
  • Renáta Tomášková, University of Ostrava
  • Ying Wang, China Three Gorges University

Suggested citation

SciELF 2015. The SciELF Corpus. Director: Anna Mauranen. Compilation manager: Ray Carey. http://www.helsinki.fi/elfa/ (last access).