Helsinki Digital Humanities Hackathon #DHH21 will have seven thematic areas of interest with one or more groups per topic, each with up to eight participants under the auspices of the group leaders.
The group will conduct research on the corpora of Finnic oral folk poetry: Suomen Kansan Vanhat Runot (Old Poems of the Finnish People), which in addition to Finnish contains material in Karelian, Izhorian and Votic languages, and Eesti Regilaulude Andmebaas (Estonian Runosongs’ Database). The corpora contain written records of original folk poems, including epics, lyrics, occasional songs (e.g. wedding songs) and charms.
From the computational perspective, the datasets are challenging because of large variation in terms of orthography and dialect. The songs contain recurring themes, characters, formulas (fixed short expressions) and overlapping text fragments, but due to the surface-level variation identifying those similarities is a research question for unsupervised language processing.
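One simple unsupervised way to look past spelling differences is to compare verses by their overlapping character n-grams rather than by whole words. The sketch below only illustrates that idea and is not the method used by the project; the function names are hypothetical, and the second and third example lines are invented for illustration.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over character n-gram multisets, between 0.0 and 1.0."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    overlap = sum((grams_a & grams_b).values())  # multiset intersection
    total = sum(grams_a.values()) + sum(grams_b.values())
    return 2 * overlap / total if total else 0.0


verse = "Vaka vanha Väinämöinen"
variant = "vaka vanha väinämöini"    # invented spelling variant
unrelated = "kylmä on kylän vilu"    # invented unrelated line

print(ngram_similarity(verse, variant))
print(ngram_similarity(verse, unrelated))
```

A pairwise score like this would need a threshold and some form of indexing before it could be applied to a whole corpus; both are left open here.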
The data, tools and supervision will be provided by members of the project: Formulaic Intertextuality, Thematic Networks and Poetic Variation across Regional Cultures of Finnic Oral Folk Poetry (FILTER). The group can choose to concentrate on either one of the datasets (Finnish or Estonian) or a comparison of both. It is recommended that at least half of the group has high proficiency in either Finnish or Estonian (depending on the chosen dataset).
Possible research questions include:
- the variation of poetic meter across regions
- the distribution of themes depending on region and collector
- the influence of published collections of poems back on the oral tradition
- the relationship between Kalevala and original collected poems
- quantitative comparison of occurrences of a specific theme or character (e.g. how is Väinämöinen described or referred to? What verses are commonly used to tell the story of the “singing duel”? What formulas are used in charms against fire, iron or cold?)
Expertise or interest in the following areas will be especially useful:
- Folkloristics, oral tradition
- Finnish or Estonian philology
- Comparative linguistics, dialectology
- Literary studies
- Unsupervised natural language processing
It is commonly accepted that the emergence of social media, especially Facebook and Twitter, has changed and challenged the media landscape in important ways. However, because of the sparse availability of concurrent media and social media data, many aspects of the interaction between social media and traditional news media have been left unstudied. This has changed of late, as Twitter has improved the accessibility of its data for research purposes. At the same time, the Flows of Power project has managed to acquire full dumps of the journalistic output of multiple major Finnish media outlets.
The group seeks to find different ways to study the interaction between digital newsmedia journalism and Twitter. The topic will be centred on comparing the presence of, and debates around, two citizens’ initiative campaigns in social media and political newsmedia: one for same-sex marriage (Tasa-arvoinen avioliittolaki 2017) and one for legal gender recognition (Translaki).
Workflows and analyses will be geared to capture dynamics and interactions related to, for example:
- Actors (civil society, legislators, organizations, opinion-leaders)
- Connections and flows of information
- Increased interaction of the media with news audiences
- Agenda setting between Twitter, newsmedia and radio
The main computational challenges of this topic are mapping different phenomena arising from Twitter data to newsmedia data, and vice versa. This involves, for example:
- Detecting the same topics in Twitter messages and newsmedia articles
- Predicting which newsmedia articles cause Twitter discussion
- Predicting variation in newsmedia visibility of Twitter campaigns
- Identifying key actors and gates of information flow, such as strong hashtags and relevant newsmedia genres
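As a starting point for the first challenge in the list above, tweets and articles can be placed in a shared bag-of-words space and matched by cosine similarity. The following is a minimal sketch under that assumption; the function names and the example texts are invented, and a real workflow would also need Finnish lemmatisation, which is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a leading '#' on hashtags is dropped by the regex."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, with smoothed IDF."""
    counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    return [
        {term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
         for term, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def match_tweets_to_articles(tweets, articles):
    """For each tweet, return (index of the most similar article, similarity)."""
    tweets, articles = list(tweets), list(articles)
    vectors = tfidf_vectors(articles + tweets)
    article_vecs, tweet_vecs = vectors[:len(articles)], vectors[len(articles):]
    matches = []
    for tweet_vec in tweet_vecs:
        scores = [cosine(tweet_vec, article_vec) for article_vec in article_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((best, scores[best]))
    return matches


# Invented example texts, for illustration only.
articles = [
    "Parliament debates the citizens' initiative on equal marriage law",
    "Committee hears experts on legal gender recognition reform",
]
tweets = [
    "Equal marriage initiative goes to parliament today #equalmarriage",
    "Gender recognition reform needs to move forward",
]
print(match_tweets_to_articles(tweets, articles))
```

The similarity score returned with each match could serve as a cut-off, so that tweets unrelated to any article are left unmatched.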
The data available for the group includes all articles from Helsingin Sanomat (the biggest Finnish broadsheet newspaper), YLE (public broadcast company), Iltalehti (national tabloid) and STT (the main Finnish news agency) between 2011 and 2017, as well as the metadata of YLE’s radio and TV broadcasts. We also have access to Twitter’s historical API.
The international newspapers group at DHH21 will develop a multilingual case that investigates which places dominated the news reporting during the Great War between 1914 and 1918. The group will identify news articles that relate to the war and extract names of places in order to discover which war efforts were covered in the large multilingual collection of historical news.
The digitized newspapers are provided by the project “NewsEye: A digital investigator for historical newspapers”. The NewsEye collection includes digitized newspapers from Austria, France, and Finland in five languages, namely German, French, Finnish, Swedish, and English. We assume that at least one person in the group will know each of these languages, though knowledge of any of them (except English) is not required to join the group. In addition to data collections, NewsEye provides text processing tools, accessible through the user interface or an API, though any other tools can be used to analyse the collection.
The aim is not to describe the different battles and war efforts as such, but to compare which locations seemed relevant depending on the viewpoint of the different language papers. Helsinki, Vienna, and Paris got their news through different channels. Consequently, the imagination of what and where things happened during the Great War looked different in those places. Through a systematic comparison, we may be able to understand the spatial imaginaries of war.
Possible tasks are:
- Identifying key features in war reporting in Paris, Vienna and Helsinki.
- Manually evaluating the reporting in different countries with regard to famous events such as the assassination in Sarajevo 1914, the battle of the Ardennes 1914, the battle of Łódź 1914, the siege of Przemyśl 1915, the battle of the Somme 1916, and outbreaks of the Spanish flu 1918.
- Creating a method to automatically identify articles relating to the war.
- Evaluating the precision and recall of the method.
- Testing applicability of modern NLP tools for historical data.
- Improving, combining and developing name recognition and entity linking techniques.
- Extracting named entities from chosen articles and linking them to places on a map.
- Placing extracted named entities in a dynamic network.
- Analyzing the locations which do appear in the newspapers and comparing them to their role in the historiography of the Great War.
- Dynamic visualization of locations on different maps, implementation of visualization techniques.
- Contextualizing our source newspapers with information about their history, affiliations and information channels.
- Formulating over-arching interpretations of the results.
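The tasks of automatically identifying war-related articles and evaluating the method can be illustrated with a deliberately naive keyword baseline scored against manual labels. This is a sketch only; the keyword list, the function names and the sample sentences are assumptions made for illustration and do not come from NewsEye.

```python
# Hypothetical multilingual keyword list (English, German, French, Finnish).
WAR_KEYWORDS = ("war", "krieg", "guerre", "sota", "front")


def is_war_related(text, keywords=WAR_KEYWORDS):
    """Naive baseline: flag an article if any keyword occurs as a substring."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def precision_recall(predicted, gold):
    """Compute (precision, recall) from parallel lists of booleans."""
    pairs = list(zip(predicted, gold))
    true_pos = sum(1 for p, g in pairs if p and g)
    false_pos = sum(1 for p, g in pairs if p and not g)
    false_neg = sum(1 for p, g in pairs if not p and g)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Invented sample sentences with manual labels (True = about the war).
samples = [
    ("Der Krieg an der Westfront dauert an", True),
    ("Das Wetter war gestern schön", False),    # German 'war' means 'was'
    ("Les troupes avancent vers la Somme", True),
    ("Concert at the opera house tonight", False),
]
predicted = [is_war_related(text) for text, _ in samples]
gold = [label for _, label in samples]
print(precision_recall(predicted, gold))
```

The second sample shows why precision has to be measured: substring matching across languages produces false positives, while the third sample shows a war-related article that no keyword catches.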
At Helsinki Computational History Group we have created a dataset of text reuses in the Eighteenth-Century Collections Online (ECCO). This dataset was created by running BLAST on EEBO-TCP and ECCO, thereby sidestepping the OCR problems that often hamper text mining of ECCO. We tracked each case of text reuse of strings of 50 characters or more, totalling millions of text reuse cases.
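To make the 50-character threshold concrete, the sketch below finds exactly matching passages of at least that length in two short strings. It uses Python's `difflib` instead of BLAST, so unlike the actual pipeline it does not tolerate OCR noise and would not scale to ECCO; the function name and the example strings are invented.

```python
from difflib import SequenceMatcher


def shared_passages(text_a, text_b, min_chars=50):
    """Return exactly matching passages of at least min_chars characters."""
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    return [
        text_a[block.a:block.a + block.size]
        for block in matcher.get_matching_blocks()
        if block.size >= min_chars
    ]


# Invented passage and surrounding text, for illustration only.
passage = "it is the privilege of reason to doubt whatever is not clearly proven"
work_a = "PREFACE. " + passage + " FINIS"
work_b = "CHAP. I. " + passage + " THE END"

for found in shared_passages(work_a, work_b):
    print(len(found), repr(found))
```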
The task of this hackathon group is to use this text reuse dataset to study eighteenth-century intertextuality through the uses of English translations of Pierre Bayle’s Historical and Critical Dictionary. This is not the first time that a digital humanities project has focused on text reuse cases in dictionaries (Allen et al. 2010; Leca-Tsiomis 2013). The aim of this project is to also learn from these earlier experiences.
This group is particularly well suited for students with a computational background. We aim to create workflows that make the task of using and analysing the text reuse data more convenient. Computer scientists joining the group have the chance to develop tools that tackle challenging historical data and to contribute to real research questions about historical text reuse. The developed tools would have great potential for further use in later analysis of the dataset beyond the hackathon project.
The dataset is also very intriguing from the perspective of eighteenth-century studies. We will focus on the concept of remediation by studying the little-known phenomenon of text reuse at large scale. We will also study translation as an intellectual activity and shift the focus from authors to publishing networks, where the role of the author is seen in a different light.
Possible tasks to exemplify the work in the group
Workflow for studying text reuses of Bayle’s Dictionary
- Enriching the metadata within the dataset, particularly by devising a way to identify and differentiate recurring reuses of similar text fragments.
- Creating tools for exploring the dataset as a whole and understanding relations within it. This would potentially entail interactive network visualizations of the data limited by for example authors or publication years.
- Creating tools for exploring specific texts, and their relations with other texts in the dataset. The tools should help in understanding the contexts of the reuse occurrences, both within the original text and its neighbourhood in the reuse network.
- Statistical and network analysis of the points of interest identified with the exploration tools.
Study of the text reuse phenomenon in general through the case of translations of Bayle’s Dictionary.
- We aim to compare the text reuses of different editions of the translations of Bayle’s Dictionary (especially the 1710 edition and the 1734–1738 (five volume) and 1734–1741 (ten volume) editions), starting with the most basic question: how many instances of different types of text reuse are there for each title?
- We will also form basic metrics to study how common text reuse is in general. Can we draw averages of how often a work is quoted in other works, and in how many different works on average?
- We will study particularly cases where parts of Bayle’s Dictionary are printed at scale. The aim is to come up with a typology of different kinds of reuses of Bayle’s Dictionary that is also scalable to other cases. This will enable us to answer questions such as: how do text reuses of Bayle’s Dictionary compare to those of other canonical works?
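The basic metrics mentioned in the list above (how often a work is quoted, and in how many different works) could be computed along the following lines. The input format, a list of `(source_work, reusing_work)` pairs, and all identifiers are assumptions made for this sketch; the real dataset schema may differ.

```python
from collections import defaultdict
from statistics import mean


def reuse_metrics(reuse_pairs):
    """Average reuse counts per source work.

    reuse_pairs: iterable of (source_work, reusing_work) tuples, one tuple
    per detected reuse case (a hypothetical input format).
    """
    reuse_counts = defaultdict(int)
    reusing_works = defaultdict(set)
    for source, reuser in reuse_pairs:
        reuse_counts[source] += 1
        reusing_works[source].add(reuser)
    if not reuse_counts:
        return {"mean_reuses_per_work": 0.0, "mean_distinct_reusers_per_work": 0.0}
    return {
        "mean_reuses_per_work": mean(reuse_counts.values()),
        "mean_distinct_reusers_per_work": mean(
            len(works) for works in reusing_works.values()
        ),
    }


# Invented identifiers, for illustration only.
pairs = [
    ("bayle_1710", "work_1"),
    ("bayle_1710", "work_1"),
    ("bayle_1710", "work_2"),
    ("bayle_1734", "work_3"),
]
print(reuse_metrics(pairs))
```

Note that the averages only cover works that are reused at least once; works with no reuse cases never appear in the pairs and would have to be added from the catalogue metadata.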
Networks of publishing for Bayle’s Dictionary
- After grasping the basic principles of reuses of different editions of Bayle’s Dictionary, the group will extract different publishing networks of Bayle’s Dictionary and study whether there is evidence that publishers reuse texts they themselves have published more often than texts published by others.
- We aim to study the ideological implications in the text reuse of Bayle’s Dictionary and its publishing networks.
References and further reading for potential group members
Allen, Timothy, Charles Cooney, Stéphane Douard, Russell Horton, Robert Morrissey, Mark Olsen, Glenn Roe, Robert Voyer. 2010. Plundering Philosophers: Identifying Sources of the Encyclopédie. Journal of the Association for History and Computing 13: http://hdl.handle.net/2027/spo.3310410.0013.10
Bayle, Pierre. 2000. Political Writings, trans. Sally L. Jenkinson, Cambridge: Cambridge University Press.
Champion, Justin. 2008. “Bayle in the English Enlightenment,” in Pierre Bayle (1647-1706), le philosophe de Rotterdam: Philosophy, Religion and Reception, eds. van Bunge and Bots, Brill: 175-196.
Leca-Tsiomis, Marie. 2013. The Use and Abuse of the Digital Humanities in the History of Ideas: How to Study the Encyclopédie, History of European Ideas, 39:4, 467-476, DOI: 10.1080/01916599.2013.774115
Labrousse, Elisabeth. 1983. Bayle, trans. Denys Potts. Oxford and New York: Oxford University Press.
Lennon, Thomas. 2008. “Pierre Bayle,” in Stanford Encyclopedia of Philosophy: https://plato.stanford.edu/entries/bayle/
The group focuses on the debates in the Parliament of Finland in the twentieth century. The group’s objective is to learn how to use public speech data, in this case parliamentary linked open data, for studying pressing societal issues of the past. Moreover, the group develops and uses tools that make it possible to identify themes, topics, and place names in the debates, and to classify the debates using related metadata such as speaker information. The Finnish data exemplifies the parliamentary corpora and the linked open data standards that are developed and used internationally.
Parliaments are the main legislative institutions and key places of decision-making and political discussion in our democratic societies. The parliament is a national arena of speaking and debating, to which the Members of the Parliament (MPs), the “people’s representatives”, are elected in regional districts. The parties and the MPs align with political ideologies, but also with geographic areas such as urban centres, the countryside, or their home region. Moreover, locations are markers in the debates about policy issues, such as the environment or foreign policy, where a reference to the Soviet Union or Chernobyl can play different rhetorical roles. The group will thus study the different ways in which parliamentary politics and place are related. The group can approach the question from several perspectives in their project, including:
- Representation and place: what “places” do the MPs, “the People as a miniature”, represent?
- Key societal issues and place: how are issues such as environmental policy, poverty, or international relations framed in the parliamentary debates through places and geography?
- Parties and place: to what extent do the parties, the MPs or subgroups such as occupational groups identify themselves with certain geographic areas, cities, or the countryside?
- Publicity and place: what visibility do the MPs and the debates have in the various public spheres, that is, in the local, regional, and national newspapers?
The parliamentary debate material and the related metadata are provided by the project Semantic Parliament – ParliamentSampo: Linked Open Data Service for Studying Political Culture (SEMPARL) (https://seco.cs.aalto.fi/projects/semparl/en/). As the parliamentary material is mainly in Finnish, basic knowledge of Finnish is recommended though not mandatory; the computational tasks, in particular, can be carried out in English. Besides the data, the SEMPARL project will provide the group with basic tools or a user interface for browsing and searching the data.
Possible tasks for the project are:
- Building tools for analysing and describing the parliamentary debate dataset itself: what are the main themes tackled in the parliament, and which parties and speakers have been the most prolific debaters?
- Building tools for extracting selected individual speeches and/or complete agenda items according to their theme or speaker information (residence, education/occupation, party, sex, age).
- Developing methods for statistical analysis and classification of the extracted texts with respect to geographic terms or other semantic information, e.g. sentiment.
- Using the tools to follow the emergence of a selected policy issue and close reading the political process
- Analysing difference or similarity in how the speakers or parties discuss the policy issues and refer to geographic names
- Searching for mentions of MPs in the historical digitised newspapers, then classifying and close reading the results
Terms and conditions of employment are regulated at the societal level and have a further impact on each individual contract. When independent unions and employers (or employers’ organizations) negotiate those terms and conditions of employment and regulate relations between the parties, the activity is referred to as ‘collective bargaining’. The written document resulting from this negotiation is a collective bargaining agreement (CBA). Although CBAs are very important for both workers and employers, they are not easy to find, and their content is often unknown even to those who are covered by them.
Since 2012, the WageIndicator Foundation (http://wageindicator.org) has been collecting and coding CBAs on a global scale in the WageIndicator Collective Agreements Database (http://wageindicator.org/cbadatabase). The Database currently contains 1600 collective agreements from more than 50 countries and written in 28 languages. The texts have been manually annotated according to 250 labour rights related questions on nine main topics – Social security and pensions, Training, Employment contracts, Sickness and disability, Health and medical assistance, Work/family balance arrangements, Gender equality issues, Wages, Working hours – and the relevant clauses (i.e., parts of text) for each question have been manually selected. Part of the annotation has been carried out under the SSHOC project (https://sshopencloud.eu/) and supported by the CLARIN Research Infrastructure (https://www.clarin.eu/).
The resulting datasets contain the collective agreements’ full texts and all the clauses assigned to each question.
The uniqueness and richness of such a dataset gives the opportunity to do research on many levels, as it sheds light on how different topics related to working conditions are addressed in different countries and expressed in different languages. The task of the hackathon group is to gain qualitative insights from the data and see how this output can be potentially shared/made visible for broader groups of Social Sciences and Humanities scientists via services provided by Research Infrastructures.
In this group, students with (digital) humanities background and students with an interest in computational language processing, e.g. multilingual texts analysis, will find something exciting to work on. Research ideas for this group might include:
- Cross-country and cross-language analysis: find out whether and how different topics are addressed in collective agreements. The analysis might also include/benefit from the use of other datasets, such as World Bank country groups by income, or the UN Human Rights Index.
- Topical investigation: for a topic of interest, e.g. sexual harassment or equal pay, research the particular features of the vocabulary and lexicon being used, i.e. what the most common words and word relations are, and how long the clauses are. This can be done for one language/country or more, and followed by a cross-country, cross-language comparison.
- Automatisation of the annotation procedure: for students with computational skills, the task could be to speed up the annotation of new texts, e.g. by creating machine learning models that help in understanding, characterising and identifying the parts of a document where the answer to a question can be found. Students will be able to create their own algorithms and models but will also be provided with working models already developed in the SSHOC project activities to build upon. Expert advice and feedback will be provided by the SSHOC group throughout the hackathon.
Such work will contribute to the research on collective agreements provisions and ultimately help workers, trade unions and employers all over the world to know more about their labour rights at sectoral or company level.
Possible tasks to exemplify the work in the group
- Analyse and understand the data and explore the comparison between the available raw and annotated versions of the dataset
- Identify topics of interest and relevant variables across multiple countries and languages
- Identify what to explore in detail and how (e.g. using a data model or algorithm) to gain insights from the data
- Use Natural Language Processing techniques to prepare the data for the analysis
- Develop data pipelines using sklearn or other libraries in Python to perform text analysis, such as keyword extraction, paragraph classification, topic modelling
- Extend and test models using cross validation, deep learning, neural networks or other techniques
- Apply and compare data insights across multiple languages/countries
- Formulate interpretations of the results, present them and see how these can be shared with – and used by – broader groups of SSH scientists.
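The paragraph classification task mentioned in the list above can be sketched without external dependencies as a small multinomial Naive Bayes classifier that assigns a clause to one of the annotation topics. This is an illustration only and not one of the SSHOC models; the class name and the training clauses are invented, and a real pipeline would more likely combine scikit-learn components such as `TfidfVectorizer` with a classifier.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


class NaiveBayesClauseClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, clauses, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for clause, label in zip(clauses, labels):
            self.word_counts[label].update(tokenize(clause))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, clause):
        tokens = tokenize(clause)
        n_clauses = sum(self.label_counts.values())

        def log_score(label):
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocab)
            score = math.log(self.label_counts[label] / n_clauses)
            for token in tokens:
                score += math.log((counts[token] + 1) / denominator)
            return score

        return max(self.label_counts, key=log_score)


# Invented training clauses; the labels are two of the nine main topics.
clauses = [
    "Employees are entitled to paid maternity leave of 16 weeks",
    "Paternity leave shall be granted to fathers after childbirth",
    "The minimum wage shall be adjusted annually",
    "Overtime is paid at a higher wage rate",
]
labels = [
    "Work/family balance arrangements",
    "Work/family balance arrangements",
    "Wages",
    "Wages",
]
model = NaiveBayesClauseClassifier().fit(clauses, labels)
print(model.predict("The wage rate is adjusted"))
```

With only a handful of training clauses the predictions are not meaningful; the point is the shape of the pipeline, which could be evaluated with cross-validation once the annotated clauses are loaded.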
The group will focus on the comparison of parliamentary debates before and during Covid across Europe from a linguistic, sociological, political science and/or computational perspective. The group’s objective will be to learn how to use comparable parliamentary corpora from various European countries, annotated with metadata such as speaker and session information and with linguistic annotations such as morphosyntactic and named entity tags, for studying societal issues caused by the Covid-19 pandemic. The group will also learn how to use Orange (https://orangedatamining.com), a visual programming tool for data mining and machine learning, which means coding skills are not required for exploring the data set. Computer scientists will be able to use their skills to create advanced custom widgets for data processing and analysis.
National parliamentary data is a verified communication channel between the elected political representatives and society members in any democracy. One of the most important characteristics of parliamentary data is its direct correspondence with concurrent events, including the ones with a global impact on human health, social life, and economics such as the current COVID-19 pandemic. By comparing the data synchronically and diachronically in a cross-lingual context, we can obtain important insights into transnational characteristics as well as track the pan-European discussion in times of crisis.
The parliamentary corpora will be provided by the CLARIN ERIC ParlaMint project (currently available in Bulgarian, Croatian, Polish, and Slovenian), which is supported by the SSHOC project (https://sshopencloud.eu/). The goal of ParlaMint is to compile a collection of comparable corpora of debates from national parliaments from all over Europe in a harmonized format, covering both data from the period of the Covid-19 pandemic and older reference data. The first version of the corpora has already been processed linguistically and enriched with metadata; it is searchable through popular concordancers for online querying and downloadable from the CLARIN repository for independent handling. By the time of the hackathon, a new version with many new languages will be available (English, Dutch, Icelandic, Lithuanian, Czech, Italian, Turkish, Danish, Hungarian, French, Latvian, Romanian, and Belgian Dutch/French).
Possible topics and tasks for the group are:
- Emotions in parliamentary discourse before and during Covid
  - The dimension of countries
  - The dimension of parties (ruling vs. opposition, left- vs. right-wing, established vs. new parties)
  - The dimension of gender (female MPs, male MPs)
  - The dimension of topics (economy, health, environment, social affairs etc.)
- Lexical dynamics in parliamentary discourse before and during Covid
  - The lifecycle of expressions (emergence, increase, decline, disappearance)
  - The lexical footprint of selected groups or individuals
- Ideological and populist language in the parliament
- The profanation of parliamentary discourse
- Cross-national perspectives in parliamentary debates before and during Covid (analysis of mentions of foreign locations, organizations and persons)
  - Identification of centers of authority / reference points
  - Identification of pro-European and anti-European stance
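The lifecycle of an expression, one of the lexical dynamics topics listed above, can be approximated by tracking its relative frequency per time period. The sketch below assumes speeches arrive as `(period, text)` pairs, which is a simplification of the ParlaMint format, and handles single-word expressions only; the function name and the example speeches are invented.

```python
import re
from collections import Counter


def relative_frequency(speeches, expression):
    """Occurrences of a single-word expression per 1,000 tokens, by period.

    speeches: iterable of (period, text) tuples (a hypothetical input format).
    """
    target = expression.lower()
    hits, tokens = Counter(), Counter()
    for period, text in speeches:
        words = re.findall(r"\w+", text.lower())
        tokens[period] += len(words)
        hits[period] += words.count(target)
    return {
        period: 1000 * hits[period] / tokens[period]
        for period in sorted(tokens)
        if tokens[period]
    }


# Invented example speeches, for illustration only.
speeches = [
    ("2019", "the budget debate continues"),
    ("2020", "lockdown measures and lockdown rules"),
]
print(relative_frequency(speeches, "lockdown"))
```

Grouping by month instead of year, or by party or gender in addition to period, only requires a different key in the first element of each tuple.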