AI Pillar Project: Multi-Millennia Climate Reanalysis through Chinese Historical Records


   
Multi-Millennia Climate Reanalysis through Chinese Historical Records (MMCRCHR)

With generous support from the UChicago AI Pillar initiative, the Multi-Millennia Climate Reanalysis through Chinese Historical Records (MMCRCHR) Project harnesses a custom Federated AI pipeline that we developed comprised of interconnected language models and data agents, creating a computational infrastructure via which scholars can compare 3000 years of Chinese primary and secondary historical records with empirical data from modern climate reanalysis models and other scientific data sources. While connections between climate change and regime change in Chinese history have been proposed by previous scholars, this is the first scalable research pipeline that directly links Chinese primary and secondary historical sources with scientific data models, creating avenues for new research and discovery in a wide range of fields.



Tibetan AI Project : Advancing Tibetan NLP and AI Research


This grant-funded project presents a rare and transformative opportunity to advance Tibetan language studies, artificial intelligence, and global academic collaboration. At the heart of the initiative is Geshe Monlam, a leading scholar in Tibetan Buddhism and linguistics, renowned for creating the Monlam Dictionary—the most comprehensive and freely accessible online Tibetan language platform. Developed over more than a decade, the Monlam Dictionary stands apart by offering an inclusive, universal, and publicly available resource. Building on this legacy, the current project centers on Geshe Monlam’s generous offer to collaborate with scholars at the University of Chicago to develop a Tibetan Large Language Model (LLM) and AI system, a cutting-edge tool that promises to significantly enhance research and education for scholars and students engaged in Tibetan language, Buddhism, and Himalayan studies. (co-PIs: Jeffrey Tharsen, Christian Wedemeyer and Karma Ngodup)




Chinese Phonological Databases and Intertextual Networks for Comparative Philology and AI


   
Chinese Phonological Databases and Intertextual Networks for Comparative Philology and Artificial Intelligence (HKBU, January 2025)

"Sponsored by the Eurasia Foundation, the Jao Tsung-I Academy of Chinese Studies hosted the eleventh lecture in its 'Early Eurasian Civilizational Exchange from an Interdisciplinary Perspective' series on January 28th, 2025. Professor Jeffrey Tharsen, Associate Director of Technology in the Division of the Arts & Humanities at the University of Chicago, spoke on applications of Chinese phonological databases and intertextual networks for comparative linguistics and philology, and their utility in artificial intelligence systems. The lecture explored how innovative technologies can be combined with the Humanities; over 100 faculty and students attended, many of whom are students pursuing Bachelor of Arts and Sciences degrees in Digital Futures and Humanities at Hong Kong Baptist University."



Intertextuality of the Chinese Classics and the Zhou Bronze Inscriptions


   
   
Intertextuality of the Chinese Classics and the Zhou Bronze Inscriptions

Having achieved proof of concept in adapting the TextPair intertextual framework to Chinese texts with the 24 Histories
二十四史, we are now working to align all extant pre-Song Chinese excavated and transmitted texts. We recently aligned 63 premodern Chinese classics (see image above, network analysis using eigenvector centrality reveals the importance of the Histories within that corpus) and secondly, systematized and aligned all inscriptions longer than 10 graphs in the Zhou inscriptional corpora. As a final step, we combined these to find where the 63 classics employ any phrases of at least 4 characters from the Zhou inscriptions (most of these are quotations in and from the Poetry 詩). Our next step is to bring in other corpora of pre-Tang excavated manuscripts (Warring States, Dunhuang and others) while expanding the post-Han corpus of received texts and align all of them together in a public online search interface and interactive network.



The Digital Etymological Dictionary of Old Chinese 《古漢語詞源字典(網絡版)》 1.0



The Digital Etymological Dictionary of Old Chinese 《古漢語詞源字典(網絡版)》 1.0

The Digital Etymological Dictionary of Old Chinese is a new type of digital interface for Chinese texts. Using this interface, one can efficiently approximate the sounds of Chinese throughout history and perform comparative analyses of phonetic structures and phonorhetorical devices within the texts, from ancient times through the modern era.



chapakhana 2.0 (now featuring ArcGIS Data-Driven Web Apps)



chapakhana 2.0

chapakhana is a website designed to illustrate the rise of the printing press across South Asia in a series of interactive maps, from the establishment of the first movable type press in 1556 until the year 1900. Demonstrating the impact of global technology transfer, chapakhana offers users the opportunity to visualize the extent of printing activity in the Indian subcontinent up until the end of the nineteenth century and to identify when printing first came to a particular location. The names of early printing presses, their proprietors, printers, and editors are provided to make visible the many known and unknown pioneers of print in the Indian subcontinent. Taken together, the maps and additional materials on the website provide an extraordinary window into the vibrant history of printing in South Asia.



Digitization and Systemization of the 1594-1860 German book fair Catalogues: the Messkataloge


   
Digitizing the Messkataloge

In 2022 we undertook a project to harness the world’s most advanced Optical Character Recognition (OCR) systems and convert all 65,658 pages of the 1594-1860 Messkataloge to machine-readable plaintext, achieving over 98% accuracy and resulting in the first digital database for German book history. This work has opened up exciting new possibilities, as scholars can now search the data for insights into, for example: where books were published during the formative periods of German publishing, the rise and fall of particular genres, religious writings, or the declining dominance of Latin, with a precision hitherto unmatched (particularly for the 19th century).

To aid scholars, librarians, and book historians we also developed an open-source custom version of Google’s N-Gram Viewer via which one can plot relative and absolute word frequencies over time, determine covariance between terms, and investigate the wealth of data in the Messkataloge for any time period during its production.



The TextPAIR Viewer (TPV) 1.0


   
   
The TextPAIR Viewer (TPV) 1.0

The TextPAIR Viewer (TPV) is a new visualization toolkit for the analysis of large-scale intertextual text alignments (aka "text reuse" or "parallel passages"). The TPV facilitates the exploration of tens of thousands (up to millions) of text alignments in any textual corpus. The above screenshots exemplify its use for the analysis of over 7,000 parallel passages between Diderot's Encyclopédie and 3,500 contemporaneous works of French literature, and for the 322,000 parallel passages TextyPAIR uncovered within and between the 24 Chinese Histories (24 historiographies spanning 90 BCE to 1927, over 40 million characters in total).



The 24 Chinese Histories 二十四史 in Philologic4




The 24 Chinese Histories

The "24 Chinese Histories 二十四史 in Philologic" is a multi-faceted text analysis search interface and intertextual text-alignment algorithm that enables deep exploration and unparalleled configurable lexical analyses of the corpus of 3,213 volumes (roughly 40 million words) that comprise the 24 official Chinese histories, from the Shi ji dating to 91 BCE to the Qing Shi Gao published in 1927. For a brief overview of the source corpus, see the entry on Wikipedia.



NNBlake and NNShakespeare (fine-tuned GPT-2 models)



NNBlake and NNShakespeare

In 2020, while working as a computational scientist in the University of Chicago’s Research Computing Center, I taught a workshop on how to fine-tune the Generative Transformer GPT-2 on any corpus or dataset of one’s choice. That tutorial is still available on the web. For sample datasets, I chose two of the premodern English corpora most near and dear to my own heart (in plaintext): 1) the complete works of William Blake, and 2) the sonnets of William Shakespeare. By fine-tuning the transformer I was able to successfully build two models that I called NNBlake and NNShakespeare, and I used them to generate new poems and prose in the style and mimicking the “voice” of Blake and Shakespeare. (Samples of the output are here.) The project was then featured in February 2021 in Volume 9 of the Data Sitters’ Club Series: “DSC #9: The Ghost in Anouk’s Laptop”.



Geography of Genetic Variants : IGV Browser



Geography of Genetic Variants : IGV Browser

Working with the John Novembre Lab, in 2017 Alex Mueller and I developed a visualization add-on for the GGV Browser to visually display all the variant elements in any genomic sequence. As Joe Marcus and John Novembre wrote, "One of the key characteristics of any genetic variant is its geographic distribution. The geographic distribution can shed light on where an allele first arose, what populations it has spread to, and in turn on how migration, genetic drift, and natural selection have acted. The geographic distribution of a genetic variant can also be of great utility for medical/clinical geneticists and collectively many genetic variants can reveal population structure. Here we develop an interactive visualization tool for rapidly displaying the geographic distribution of genetic variants. Through a REST API and dynamic front-end, the Geography of Genetic Variants (GGV) Browser provides maps of allele frequencies in populations distributed across the globe."




chapakhana 1.0



chapakhana

chapakhana is a website designed to illustrate the rise of the printing press across South Asia in a series of interactive maps, from the establishment of the first movable type press in 1556 until the year 1900. Demonstrating the impact of global technology transfer, chapakhana offers users the opportunity to visualize the extent of printing activity in the Indian subcontinent up until the end of the nineteenth century and to identify when printing first came to a particular location. The names of early printing presses, their proprietors, printers, and editors are provided to make visible the many known and unknown pioneers of print in the Indian subcontinent. Taken together, the maps and additional materials on the website provide an extraordinary window into the vibrant history of printing in South Asia.



The Sign And Gesture Archive (SAGA)



The Sign And Gesture Archive (SAGA)

The Sign And Gesture Archive (SAGA) is a video data library for gesture, sign and spoken language data. Its holdings represent over 40 years of data collection, including tens of thousands of video files, thousands of coding and annotation files, and rich metadata for each source. Basic and Advanced search mechanisms are built into the user interface to allow for detailed searches and comparative analyses to be undertaken by researchers in the fields of psychology, linguistics, sign and gesture.



The Visual Text Explorer



The Visual Text Explorer

The Visual Text Explorer is a user-customizable interactive digital visual interface designed for analytics of words and phrases in user-provided sources, allowing for simultaneous close reading and multidimensional data analytics.



Intertext (beta 0.8)



Intertext

The Intertext "Intertextuality Analyzer" is a tool which allows for the automated detection of patterns of vocabulary and terminology both within and across specific sets of digital texts as provided by the user. Via this tool one can trace specific uses of vocabulary across thousands of years of texts, providing concrete indications of influential tropes, their likely origins and how they changed over time. Patterns of textual citation and paraphrase, often unmarked in early Chinese texts, are also easily indicated and subtle differences in their transmission can be traced from text to text. Users will either be able to allow the application to detect commonalities between the texts, or indicate specific criteria (lists of words or phrases) to track across the texts. The output provides the full citation for each result detected, and will eventually include a full visual interface of the charts, graphs and networks which can be generated from the data results.



Digital Resources for Sinologists (1.0)


Dissertation Reviews: "Digital Resources for Sinologists (1.0)"

Holger Schneider and I put together this two-part overview of the best digital resources available for Sinology and Chinese language learning. Part I provides a general rubric for the evaluation of digital dictionaries, with an eye toward the benefits of using multireferential databases and modern computational tools in lexicography. Part II is an annotated list of over fifty websites and software packages, representing what we feel are the very best digital resources available for sinologists and learners of Chinese. Subcategories include General Dictionaries/Lexical Tools, Learning and Translation Tools and Encyclopædia (large-scale text repositories). The article has since been converted into a Wiki, available online at http://sinology-resources.wikia.com/.