The research blog subcorpus consists of a sample of posts from 40 different research blogs, all of which are maintained by L2 users of English. Eight complete posts, with any accompanying comments, were collected from each blog, with the exception of The Reference Frame (TRF), a string physics blog. Only four complete posts were compiled from this site due to their length, and especially due to the large number of comments. It became clear that TRF was exceptionally active in terms of discussion compared with the other blogs in our sample, and thus a unique source of interactive data.
As the 40 blogs overall yielded less discussion data in the comments than we had hoped, an additional subcorpus of only discussion text (i.e. not including the original posts) was collected from TRF. This data included discussion from 26 physics-related posts published in Jan.-Feb. 2011. These additional 67,390 words of text further skew the blogs corpus toward the natural sciences, but there was never a question of balancing this subcorpus; the genre of research blogging is far more active in the sciences, and only five blogs were found from the social sciences or humanities that met our compilation principles. By adding this TRF discussion material, the percentage of total words by outside commenters on the blogs rose to 22%:
|author||no. of words||% of words|
Calculated differently, all the discussion in comments (including bloggers and commenters together) amounts to 35% of words (129,912) in the blogs subcorpus with 65% of words (241,979) coming from the original blog posts. Due to the overrepresentation of natural sciences (and especially physics) in the sample and in the genre as a whole, the overall distribution of texts is heavily skewed toward the Sci category, with 89% of words in the blogs subcorpus:
|category/domain||no. of blogs||no. of words||% of total|
of them are physics)
|economics & administration||2||14686||35%|
The SSH texts in the other components of the corpus help offset this additional material from the natural sciences, and the overall distribution of Sci texts throughout WrELFA is therefore reasonably well balanced at 55% of total words. It should be kept in mind, however, that the concentration of blog texts in the natural sciences will likely affect some results due to the more dialogic nature of the genre, and this should be taken into account when interpreting findings.
Due to the large amount of text in the corpus that is provided by visitors to the blogs, the largest L1 category in the blogs subcorpus is “unknown”. In addition to the 18 L1s identified among the bloggers, there are 180 unique commenters in The Reference Frame data alone. These commenters are an international mix of authors from unknown backgrounds, in varying degrees of anonymity, and they include L1 English speakers as part of the lingua franca mix. Overall, the top 10 L1 categories in the blogs subcorpus subsume 31 of the 40 blogs and 87% of words:
|author L1||no. of blogs||no. of words||% of words|
|1||unknown (includes unidentified commenters from blog discussions)||1||85666||23%|
Concerning the authors’ academic roles, the situation is the reverse of that in the PhD examiner reports. While those were centred on senior staff, more than half of the data in the blogs subcorpus come from PhD students (11 blogs, 19% of words) and junior academic staff (12 blogs, 40% of words). Senior staff at the professorial level account for only eight blogs in the corpus:
|blogger role||no. of blogs||no. of words||% of words|
|40||291477 (without commenters - 80414 words)|
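The word counts quoted in this description can be cross-checked against each other with a few lines of arithmetic. The following is a minimal sketch in Python, not part of any WrELFA tooling: the constant and function names are illustrative, and it assumes that the table total "without commenters" covers the original posts plus the bloggers' own comments.

```python
# Word counts quoted in this description of the blogs subcorpus.
POST_WORDS = 241_979        # original blog posts
COMMENT_WORDS = 129_912     # all comments (bloggers and commenters together)
BLOGGER_WORDS = 291_477     # table total "without commenters"
COMMENTER_WORDS = 80_414    # words by outside commenters


def pct(part: int, total: int) -> float:
    """Return part as a percentage of total."""
    return 100 * part / total


total_words = POST_WORDS + COMMENT_WORDS

# Assumption: "without commenters" means posts plus the bloggers' own
# comments, so both routes to the subcorpus total should agree.
assert total_words == BLOGGER_WORDS + COMMENTER_WORDS

# Comment words written by the bloggers themselves, derived by subtraction.
blogger_comment_words = COMMENT_WORDS - COMMENTER_WORDS

print(f"total words:            {total_words}")
print(f"comments, % of words:   {pct(COMMENT_WORDS, total_words):.1f}")
print(f"commenters, % of words: {pct(COMMENTER_WORDS, total_words):.1f}")
print(f"bloggers' own comments: {blogger_comment_words}")
```

Under that assumption the two totals agree, and the rounded shares match the 35% (all comments) and 22% (outside commenters) reported in the text.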
For more information on the composition of the research blog subcorpus, see this post from the ELFA project research blog.
WrELFA 2015. The Corpus of Written English as a Lingua Franca in Academic Settings. Director: Anna Mauranen. Compilation manager: Ray Carey. http://www.helsinki.fi/elfa/ (last access).