Academic research blogs

The research blog subcorpus consists of a sample of posts from 40 different research blogs, all of which are maintained by L2 users of English.

Eight complete posts with any accompanying comments were collected from each blog, with the exception of The Reference Frame (TRF) string physics blog. Only four complete posts were compiled from this site due to their length, and especially due to the large number of comments. It became clear that TRF was exceptionally active in terms of discussion compared with the other blogs in our sample, and thus a unique source of interactive data.

As the 40 blogs overall yielded less discussion data in the comments than we had hoped, an additional subcorpus of only discussion text (i.e. not including the original posts) was collected from TRF. This data included discussion from 26 physics-related posts published in Jan.-Feb. 2011. These additional 67,390 words of text further skew the blogs corpus toward the natural sciences, but there was never a question of balancing this subcorpus; the genre of research blogging is far more active in the sciences, and only five blogs were found from the social sciences or humanities that met our compilation principles. By adding this TRF discussion material, the percentage of total words by outside commenters on the blogs rose to 22%:

Proportion of text by bloggers and guest commenters in the research blogs:
author	no. of words	% of words
bloggers	291477	78%
commenters	80414	22%
total	371891	100%

Calculated differently, all the discussion in comments (including bloggers and commenters together) amounts to 35% of words (129,912) in the blogs subcorpus with 65% of words (241,979) coming from the original blog posts. Due to the overrepresentation of natural sciences (and especially physics) in the sample and in the genre as a whole, the overall distribution of texts is heavily skewed toward the Sci category, with 89% of words in the blogs subcorpus:

The broad categories and their subdomains in the research blogs:
category/domain	no. of blogs	no. of words	% of total (subdomains: % of category)
Sci	33	330253	89%
natural science	14	200493 (136494 of them are physics)	61%
medicine	11	91726	28%
technology	6	27042	8%
regional science	2	10992	3%
SSH	7	41638	11%
social sciences	3	18658	45%
economics & administration	2	14686	35%
humanities	1	5109	12%
behavioural sciences	1	3185	8%
total	40	371891	100%
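
The percentages in this table use two different denominators: the two broad categories are shares of the whole blogs subcorpus, while each subdomain is a share of its own category. A minimal Python sketch of that arithmetic, using the word counts from the table (the variable and function names are illustrative only and are not part of the WrELFA documentation):

```python
# Word counts copied from the table above; all names are illustrative.
TOTAL_WORDS = 371891

categories = {
    "Sci": {
        "natural science": 200493,
        "medicine": 91726,
        "technology": 27042,
        "regional science": 10992,
    },
    "SSH": {
        "social sciences": 18658,
        "economics & administration": 14686,
        "humanities": 5109,
        "behavioural sciences": 3185,
    },
}


def share(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# Sum the subdomain word counts to get each broad category's size.
category_totals = {name: sum(subs.values()) for name, subs in categories.items()}

for name, subs in categories.items():
    cat_words = category_totals[name]
    # Broad category: share of the whole blogs subcorpus.
    print(f"{name}: {cat_words} words, {share(cat_words, TOTAL_WORDS)}% of total")
    for sub, words in subs.items():
        # Subdomain: share of its own broad category.
        print(f"  {sub}: {words} words, {share(words, cat_words)}% of {name}")
```

Read this way, natural science accounts for 61% of the Sci words rather than 61% of the whole subcorpus.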

The additional SSH texts in the other components of the corpus help balance this additional material from natural sciences, and the overall distribution of Sci texts throughout WrELFA is therefore reasonably well balanced at 55% of total words. It should be kept in mind, however, that the concentration of blog texts in the natural sciences will likely affect some results due to the more dialogic nature of the genre, and this should be taken into account when interpreting findings.

Due to the large amount of text in the corpus that is provided by visitors to the blogs, the largest L1 category in the blogs subcorpus is “unknown”. In addition to the 18 L1s identified among the bloggers, there are 180 unique commenters in The Reference Frame data alone, and this is an international mix of authors from unknown backgrounds, in varying degrees of anonymity, and with L1 English in the lingua franca mix. Overall, the top 10 L1 categories in the blogs subcorpus subsume 31 of the 40 blogs and 87% of words:

The ten largest L1 categories in the research blogs:
	author L1	no. of blogs	no. of words	% of words
1	unknown (includes unidentified commenters from blog discussions)	1	85666	23%
2	Dutch	8	48566	13%
3	Italian	6	38212	10%
4	Czech	1	36114	10%
5	Spanish	4	26008	7%
6	Finnish	1	25504	7%
7	Bengali	3	23957	6%
8	Norwegian	3	16309	4%
9	German	2	12348	3%
10	Hindi	2	9257	2%
	total	31	321941	87%
	other L1s	9	49950	13%
	Blogs total	40	371891	100%

Concerning the authors’ academic roles, we find a reverse situation to the PhD examiner reports. While those were centred on senior staff, more than half of the words written by bloggers in the blogs subcorpus come from PhD students (11 blogs, 19% of words) and junior academic staff (12 blogs, 40% of words). Senior staff at the professorial level account for only eight blogs in the corpus:

Academic roles of the bloggers in the research blogs:
blogger role	no. of blogs	no. of words	% of words
Junior staff	12	117905	40%
Research student	11	56110	19%
Senior staff	8	37836	13%
Unknown	6	60136	21%
Senior industry	2	15825	5%
Junior industry	1	3665	1%
total	40	291477 (excluding 80414 words by commenters)
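
The role percentages are calculated over the bloggers' own words only, not over the full subcorpus that includes commenters. A short Python sketch of this calculation, using the figures from the table (the names are illustrative assumptions, not part of the corpus documentation):

```python
# Blogger word counts by academic role, copied from the table above.
role_words = {
    "Junior staff": 117905,
    "Research student": 56110,
    "Senior staff": 37836,
    "Unknown": 60136,
    "Senior industry": 15825,
    "Junior industry": 3665,
}

COMMENTER_WORDS = 80414  # words by guest commenters, excluded from role shares

blogger_total = sum(role_words.values())

# Each role's share is taken over blogger words only.
role_share = {
    role: round(100 * words / blogger_total) for role, words in role_words.items()
}

for role, pct in role_share.items():
    print(f"{role}: {pct}%")

# Adding the commenters back gives the size of the whole blogs subcorpus.
subcorpus_total = blogger_total + COMMENTER_WORDS
print(f"bloggers: {blogger_total}, subcorpus: {subcorpus_total}")
```

Because each share is rounded separately, the printed percentages need not sum to exactly 100.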

For more information on the composition of the research blog subcorpus, see this post from the ELFA project research blog.

Suggested citation

WrELFA 2015. The Corpus of Written English as a Lingua Franca in Academic Settings. Director: Anna Mauranen. Compilation manager: Ray Carey. http://www.helsinki.fi/elfa/ (last access).