Bowdoin College Library

Text Mining Databases: Publicly available databases

FYI

This page lists publicly available databases that support, or can be used for, text mining. Access does not require facilitation by a librarian. Even "publicly available" content may be copyrighted or licensed; look for statements of copyright, licensing, terms of use, or terms and conditions.

General resources

HathiTrust

Internet Archive

News

And more...

Social media text

  • The Social Media Archive at ICPSR (SOMAR). "The Social Media Archive at ICPSR (SOMAR) is a collection of public and restricted data from various social media platforms organized and stored for research and analysis purposes. With their data available to the community, SOMAR aims to help researchers and community members better understand social media behavior and trends. In addition, the data can inform the development of new technologies and services." "Much of SOMAR's data will be available through approved restricted data applications, and the data will be accessed through a virtual data enclave." Source

Twitter

Facebook and Instagram

  • Because Facebook and Instagram do not provide public APIs, we are unable to collect text from them.

U.S. federal government information

congress.gov (Library of Congress)

govinfo (Government Publishing Office)

  • govinfo.gov
  • "Provides free public access to official publications from all three branches of the Federal Government."
  • Developer Hub; Bulk Data Repository
  • Statute Compilations
  • "Tools available for software developers and data users include XML for bulk download, a link service, design documentation, and more available on GPO's GitHub page."
  • Most data in govinfo are not copyrighted.
  • Format: Text, PDF, XML
  • See also congress.gov (this page)

Library of Congress

regulations.gov

  • regulations.gov, from partner agencies participating in the eRulemaking Program
  • "When Congress passes laws, federal agencies implement those laws through regulations. These regulations vary in subject, but include everything from ensuring water is safe to drink to setting health care standards. Regulations.gov is the place where users can find and comment on regulations. The APIs allow for users to find creative ways to present regulatory data."
  • Regulations.gov API
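
As a rough illustration of how the Regulations.gov API might be queried for text mining, the sketch below builds a keyword search request and returns document titles. It is not taken from the Regulations.gov documentation: the v4 endpoint, the `filter[searchTerm]` and `page[size]` parameters, the `X-Api-Key` header, and the response layout reflect our understanding of the API and should be checked against the current documentation. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed v4 endpoint; confirm against the Regulations.gov API documentation.
BASE_URL = "https://api.regulations.gov/v4/documents"


def build_search_url(term, page_size=5):
    """Build a keyword-search URL (no network access)."""
    params = {"filter[searchTerm]": term, "page[size]": page_size}
    return f"{BASE_URL}?{urlencode(params)}"


def search_document_titles(term, api_key="DEMO_KEY", page_size=5):
    """Fetch one page of results and return the document titles.

    Assumes a JSON:API-style response with a top-level "data" list whose
    items carry an "attributes" object containing "title".
    """
    request = Request(
        build_search_url(term, page_size),
        headers={"X-Api-Key": api_key},
    )
    with urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return [item["attributes"]["title"] for item in payload.get("data", [])]


if __name__ == "__main__":
    # Only prints the URL; call search_document_titles() to hit the network.
    print(build_search_url("drinking water"))
```

`DEMO_KEY` is a heavily rate-limited demonstration key; for real work, request your own API key and review the site's terms of use before collecting data in bulk.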

PubMed (U.S. National Library of Medicine)

  • PubMed, U.S. National Library of Medicine. Developer Resources
  • The PubMed Central (PMC) Article Datasets. "PubMed Central® (PMC) is a free full-text archive of biomedical and life sciences journal article at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). The PubMed Central (PMC) Article Datasets include full-text articles archived in PMC and made available under license terms that allow for text mining and other types of secondary analysis and reuse."
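
For readers who want a starting point with PubMed's developer tools, here is a minimal sketch that searches PubMed through the NCBI E-utilities `esearch` service and returns matching PubMed IDs. The function names are hypothetical, and the parameters shown (`db`, `term`, `retmode`, `retmax`) and the response layout are our assumptions about the service; consult the Developer Resources page for current usage guidelines and rate limits.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities search endpoint (assumed stable; verify in the NCBI docs).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=20):
    """Build an esearch URL for the PubMed database (no network access)."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def search_pmids(term, retmax=20):
    """Return a list of PubMed IDs (as strings) matching the search term.

    Assumes the JSON response nests the IDs under esearchresult -> idlist.
    """
    with urlopen(build_esearch_url(term, retmax), timeout=30) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]


if __name__ == "__main__":
    # Only prints the URL; call search_pmids() to hit the network.
    print(build_esearch_url("text mining", retmax=5))
```

The IDs returned identify PubMed citation records. Full text for mining comes from the PMC Article Datasets described above, subject to each article's license terms.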

And more from the legislative branch ...

And more from the U.S. federal government...

  • Office of the Historian, U.S. Department of State. Includes Foreign Relations of the United States (FRUS); Principal Officers & Chiefs of Mission; Foreign Travels of the President and Secretary of State; Visits by Foreign Leaders and Heads of State; Administrative Timeline of the Department of State; Milestones in the History of U.S. Foreign Relations; Tweets; etc.
  • Data.gov, "the United States government's open data website. It provides access to datasets published by agencies across the federal government."
  • Federal Register, Office of the Federal Register. Reader Aids for developers
  • Consumer Complaint Database, Consumer Financial Protection Bureau. Downloads, API
  • FRASER, Federal Reserve Bank of St. Louis. "FRASER is a digital library of U.S. economic, financial, and banking history—particularly the history of the Federal Reserve System." FRASER® API

Government information from other sources

  • the @unitedstates project.
    "a shared commons of data and tools for the United States [...] Featuring work from people with the Sunlight Foundation, GovTrack.us, the New York Times, the Electronic Frontier Foundation, and the Internet."

For your reference

United States Legislative Markup (USLM) XML schema

Other legal databases