Academic research blogs

The research blog subcorpus consists of a sample of posts from 40 different research blogs, all of which are maintained by L2 users of English.

Eight complete posts, with any accompanying comments, were collected from each blog, with the exception of The Reference Frame (TRF), a string physics blog. Only four complete posts were compiled from this site because of their length and, especially, the large number of comments. It became clear that TRF was exceptionally active in terms of discussion compared with the other blogs in our sample, and thus a unique source of interactive data.

As the 40 blogs overall yielded less discussion data in the comments than we had hoped, an additional subcorpus of discussion-only text (i.e. not including the original posts) was collected from TRF. This data included discussion from 26 physics-related posts published in January and February 2011. These additional 67,390 words of text further skew the blogs corpus toward the natural sciences, but there was never a question of balancing this subcorpus: the genre of research blogging is far more active in the sciences, and only five blogs were found from the social sciences or humanities that met our compilation principles. By adding this TRF discussion material, the percentage of total words by outside commenters on the blogs rose to 22%:

Proportion of text by bloggers and guest commenters in the research blogs:
author no. of words % of words
bloggers 291477 78%
commenters 80414 22%
total 371891 100%
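
The percentages in this table follow directly from the word counts. As a minimal sketch (the variable names are illustrative and not part of the corpus documentation), the shares can be recomputed like this:

```python
# Word counts copied from the table above.
words = {"bloggers": 291477, "commenters": 80414}

total = sum(words.values())

# Share of each author group, rounded to whole percentages as in the table.
shares = {author: round(100 * n / total) for author, n in words.items()}

print(total)   # 371891
print(shares)  # {'bloggers': 78, 'commenters': 22}
```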

Calculated differently, all the discussion in comments (by bloggers and commenters together) amounts to 35% of words (129,912) in the blogs subcorpus, with 65% of words (241,979) coming from the original blog posts. Due to the overrepresentation of natural sciences (and especially physics) in the sample and in the genre as a whole, the overall distribution of texts is heavily skewed toward the Sci category, which accounts for 89% of words in the blogs subcorpus:

The broad categories and their subdomains in the research blogs:
category/domain no. of blogs no. of words % of words (categories: % of total; subdomains: % of category)
Sci 33 330253 89%
natural science 14 200493 (136494 of them are physics) 61%
medicine 11 91726 28%
technology 6 27042 8%
regional science 2 10992 3%
SSH 7 41638 11%
social sciences 3 18658 45%
economics & administration 2 14686 35%
humanities 1 5109 12%
behavioural sciences 1 3185 8%
total 40 371891 100%
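
In the table above, the percentages for Sci and SSH are shares of the whole subcorpus, whereas the subdomain percentages appear to be shares of their parent category (for example, 61% of the Sci words are natural science). The following sketch checks that reading against the word counts; the nested dictionary and the helper `pct` are illustrative assumptions, not part of the corpus tooling.

```python
# Word counts copied from the table above, grouped by broad category.
categories = {
    "Sci": {
        "natural science": 200493,
        "medicine": 91726,
        "technology": 27042,
        "regional science": 10992,
    },
    "SSH": {
        "social sciences": 18658,
        "economics & administration": 14686,
        "humanities": 5109,
        "behavioural sciences": 3185,
    },
}

def pct(part, whole):
    """Return part/whole as a rounded whole percentage."""
    return round(100 * part / whole)

grand_total = sum(sum(domains.values()) for domains in categories.values())

report = {}
for category, domains in categories.items():
    category_total = sum(domains.values())
    # Broad categories are measured against the whole subcorpus ...
    report[category] = pct(category_total, grand_total)
    # ... while subdomains are measured against their own category.
    for domain, n in domains.items():
        report[domain] = pct(n, category_total)

for name, share in report.items():
    print(f"{name}: {share}%")
```

Computed this way, the figures match the table, which supports the reading that the subdomain percentages sum to 100% within each category rather than across the subcorpus.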

The additional SSH texts in the other components of the corpus help balance this additional material from natural sciences, and the overall distribution of Sci texts throughout WrELFA is therefore reasonably well balanced at 55% of total words. It should be kept in mind, however, that the concentration of blog texts in the natural sciences will likely affect some results due to the more dialogic nature of the genre, and this should be taken into account when interpreting findings.

Due to the large amount of text in the corpus that is provided by visitors to the blogs, the largest L1 category in the blogs subcorpus is “unknown”. In addition to the 18 L1s identified among the bloggers, there are 180 unique commenters in The Reference Frame data alone. These form an international mix of authors from unknown backgrounds, with varying degrees of anonymity, and include L1 English users as part of the lingua franca mix. Overall, the top 10 L1 categories in the blogs subcorpus subsume 31 of the 40 blogs and 87% of words:

The ten largest L1 categories in the research blogs:
rank author L1 no. of blogs no. of words % of words
1 unknown (includes unidentified commenters from blog discussions) 1 85666 23%
2 Dutch 8 48566 13%
3 Italian 6 38212 10%
4 Czech 1 36114 10%
5 Spanish 4 26008 7%
6 Finnish 1 25504 7%
7 Bengali 3 23957 6%
8 Norwegian 3 16309 4%
9 German 2 12348 3%
10 Hindi 2 9257 2%
total 31 321941 87%
other L1s 9 49950 13%
Blogs total 40 371891 100%

Concerning the authors’ academic roles, the situation is the reverse of that in the PhD examiner reports. While those were centred on senior staff, more than half of the bloggers’ own words in the blogs subcorpus come from PhD students (11 blogs, 19% of words) and junior academic staff (12 blogs, 40% of words). Senior staff at the professorial level account for only eight blogs in the corpus:

blogger role no. of blogs no. of words % of words
Junior staff 12 117905 40%
Research student 11 56110 19%
Senior staff 8 37836 13%
Unknown 6 60136 21%
Senior industry 2 15825 5%
Junior industry 1 3665 1%
total 40 291477 100% (excluding the 80414 words by commenters)
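
The role percentages are shares of the bloggers' own words, with commenters excluded. As a small sketch under that assumption (the dictionary and variable names are illustrative), the combined share of research students and junior staff can be recomputed from the raw counts:

```python
# Blogger word counts by academic role, copied from the table above.
roles = {
    "Junior staff": 117905,
    "Research student": 56110,
    "Senior staff": 37836,
    "Unknown": 60136,
    "Senior industry": 15825,
    "Junior industry": 3665,
}

blogger_words = sum(roles.values())

# Combined words by junior academic staff and research (PhD) students.
junior_words = roles["Junior staff"] + roles["Research student"]
junior_share = round(100 * junior_words / blogger_words)

print(blogger_words)  # 291477
print(junior_share)   # 60
```

Adding the rounded table figures (40% + 19%) gives 59%, while the unrounded counts give just under 60%. Either way, these two groups account for more than half of the blogger-authored text.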

For more information on the composition of the research blog subcorpus, see this post from the ELFA project research blog.

Suggested citation

WrELFA 2015. The Corpus of Written English as a Lingua Franca in Academic Settings. Director: Anna Mauranen. Compilation manager: Ray Carey. http://www.helsinki.fi/elfa/ (last access).