Creating Biographical Networks from Chinese and English Wikipedia

With the rise of digital humanities, historians explore how to intellectually engage with textual sources given the available computational tools of today. The ENP-China project employs Natural Language Processing methods to tap into sources of unprecedented scale with the goal to study the transfor...

Bibliographic Details
Authors: Blouin, Baptiste (Author) ; Magistry, Pierre (Author) ; Van den Bosch, Nora (Author)
Format: Electronic Article
Language: English
Published: Université du Luxembourg 2021
In: Journal of historical network research
Year: 2021, Volume: 5, Issue: 1, Pages: 303-317
Further subjects: B Biography
B Wikipedia
B BERT
B NER
B Wikidata
B Deep Learning
Online Access: Full text (free of charge)
Full text (free of charge)

MARC

LEADER 00000caa a22000002 4500
001 1817018027
003 DE-627
005 20240417193606.0
007 cr uuu---uuuuu
008 220920s2021 xx |||||o 00| ||eng c
024 7 |a 10.25517/jhnr.v5i1.120  |2 doi 
035 |a (DE-627)1817018027 
035 |a (DE-599)KXP1817018027 
040 |a DE-627  |b ger  |c DE-627  |e rda 
041 |a eng 
084 |a 0  |2 ssgn 
100 1 |a Blouin, Baptiste  |e VerfasserIn  |4 aut 
245 1 0 |a Creating Biographical Networks from Chinese and English Wikipedia 
264 1 |c 2021 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
520 |a With the rise of digital humanities, historians explore how to intellectually engage with textual sources given the available computational tools of today. The ENP-China project employs Natural Language Processing methods to tap into sources of unprecedented scale with the goal to study the transformation of elites in Modern China (1830-1949). One of the subprojects is extracting various kinds of data from biographies and, for that, we created a large corpus of biographies automatically collected from the Chinese and English Wikipedia. The dataset contains 228,144 biographical articles from the offline Chinese Wikipedia copy and is supplemented with 110,713 English biographies that are linked to a Chinese page. We also enriched this bilingual corpus with metadata that records every mentioned person, organization, geopolitical entity and location per Wikipedia biography and links the names to their counterpart in the other language. This data structure allows the researcher to analyze the relationships between biographies via shared contents and compare networks in different language settings. In this paper we will describe our methodology for building this new dataset. The first step was to use automatic text classification for extracting Chinese biographies. We trained a binary classifier to detect biographies on manually classified examples and used a subset of unseen texts to assess its accuracy. The second step used Named Entity Recognition to generate metadata and extract relations from the links in Wikipedia. Furthermore, we will delve into the method for building networks from this dataset. We argue that depending on the specific research question, different networks may be built. Using the metadata, researchers can create various kinds of networks to suit their needs. On top of releasing this dataset as an enriched bilingual corpus, we will provide an online interface to query and explore it. 
Our interface benefits from the bipartite graph structure (it can be seen as a network of documents and entities) and applies the same exploration and clustering strategy as in Cillex. 
601 |a Network 
601 |a Wikipedia 
650 4 |a BERT 
650 4 |a Biography 
650 4 |a Deep Learning 
650 4 |a NER 
650 4 |a Wikidata 
650 4 |a Wikipedia 
700 1 |a Magistry, Pierre  |e VerfasserIn  |4 aut 
700 1 |a Van den Bosch, Nora  |e VerfasserIn  |4 aut 
773 0 8 |i Enthalten in  |t Journal of historical network research  |d Luxembourg : Université du Luxembourg, 2017  |g 5(2021), 1, Seite 303-317  |h Online-Ressource  |w (DE-627)1000904911  |w (DE-600)2908863-X  |w (DE-576)494553545  |x 2535-8863  |7 nnns 
773 1 8 |g volume:5  |g year:2021  |g number:1  |g pages:303-317 
856 4 0 |u http://jhnr.uni.lu/index.php/jhnr/article/view/120  |x Verlag  |z kostenfrei  |3 Volltext 
856 4 0 |u https://doi.org/10.25517/jhnr.v5i1.120  |x Resolving-System  |z kostenfrei  |3 Volltext 
951 |a AR 
ELC |a 1 
LOK |0 000 xxxxxcx a22 zn 4500 
LOK |0 001 4190307122 
LOK |0 003 DE-627 
LOK |0 004 1817018027 
LOK |0 005 20220920155259 
LOK |0 008 220920||||||||||||||||ger||||||| 
LOK |0 040   |a DE-Tue135  |c DE-627  |d DE-Tue135 
LOK |0 092   |o n 
LOK |0 852   |a DE-Tue135 
LOK |0 852 1  |9 00 
LOK |0 935   |a ixzs  |a ixzo 
OAS |a 1 
ORI |a TA-MARC-ixtheoa001.raw 
REL |a 1 
SUB |a REL