Newswire: A Large-Scale Structured Database of a Century of Historical News

arxiv.org

cross-posted to:
hackernews@lemmy.smeargle.fans

fossilesque@mander.xyz to History@mander.xyz · English · 4 months ago

Historically, local newspapers in the U.S. drew their content largely from newswires like the Associated Press. Historians argue that newswires played a pivotal role in creating a national identity and shared understanding of the world, but there is no comprehensive archive of the content sent over newswires. We reconstruct such an archive by applying a customized deep learning pipeline to hundreds of terabytes of raw image scans from thousands of local newspapers. The resulting dataset contains 2.7 million unique public domain U.S. newswire articles, written between 1878 and 1977. Locations in these articles are georeferenced, topics are tagged using customized neural topic classification, named entities are recognized, and individuals are disambiguated to Wikipedia using a novel entity disambiguation model. To construct the Newswire dataset, we first recognize newspaper layouts and transcribe around 138 million structured article texts from raw image scans. We then use a customized neural bi-encoder model to de-duplicate reproduced articles, in the presence of considerable abridgement and noise, quantifying how widely each article was reproduced. A text classifier is used to ensure that we only include newswire articles, which historically are in the public domain. The structured data that accompany the texts provide rich information about the who (disambiguated individuals), what (topics), and where (georeferencing) of the news that millions of Americans read over the course of a century. We also include Library of Congress metadata about the newspapers that ran the articles on their front pages. The Newswire dataset is useful both for large language modeling - expanding training data beyond what is available from modern web texts - and for studying a diversity of questions in computational linguistics, social science, and the digital humanities.
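
For readers wondering what the de-duplication step in the abstract might look like in practice, here is a minimal sketch and not the authors' pipeline. It assumes a toy stand-in for the neural bi-encoder (normalised character trigram counts), a hypothetical similarity threshold of 0.6, and made-up example texts. Each article is embedded, pairs above the threshold are linked, and the size of each linked group is read as the number of papers that reprinted the article.

```python
import math
import re
from collections import Counter


def embed(text, n=3):
    """Toy stand-in for a bi-encoder: L2-normalised character n-gram counts."""
    s = re.sub(r"[^a-z0-9 ]+", "", text.lower())
    s = re.sub(r"\s+", " ", s).strip()
    grams = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    norm = math.sqrt(sum(c * c for c in grams.values())) or 1.0
    return {g: c / norm for g, c in grams.items()}


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def cluster_duplicates(texts, threshold=0.6):
    """Group texts whose similarity reaches the (hypothetical) threshold.

    Uses union-find, so a noisy copy and an abridged copy end up in the same
    cluster if each is close enough to some other member of that cluster.
    """
    parent = list(range(len(texts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    vecs = [embed(t) for t in texts]
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine(vecs[i], vecs[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(texts)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Invented example scans: an original, an OCR-noisy copy, an abridged copy,
# and an unrelated article.
scans = [
    "WASHINGTON, May 3 (AP) - The Senate voted today to approve the farm relief bill after a long debate.",
    "WASHINGTON, May 3 (AP) - The Senate voted today to aprove the farm relief bi11 after a long debate.",
    "WASHINGTON, May 3 (AP) - The Senate voted today to approve the farm relief bill.",
    "CHICAGO, May 3 (AP) - Heavy rains flooded streets across the city and delayed morning trains.",
]

for members in cluster_duplicates(scans):
    print(f"article {members[0]}: reproduced in {len(members)} scan(s)")
```

The all-pairs comparison is quadratic, so at the scale described in the abstract (around 138 million texts) a real system would need approximate nearest-neighbour search over the embeddings; that part is left out here.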
