Bowdoin College Library

Text Mining Databases: Licensed library databases

FYI

This guide lists the Library’s licensed databases that are currently available for text mining. Databases from some vendors require facilitation by a librarian, and use of the text is governed by each vendor's terms and conditions. For these reasons, the pages of this guide are organized by vendor. If you are interested in another database, we would be happy to look into its availability for text mining. Please contact us for assistance.

Elsevier

  • Corresponding database: ScienceDirect
  • Access does not require facilitation by a librarian.
  • "Download, search, filter and understand millions of articles and books published on ScienceDirect. All Elsevier journals and books enable text and data mining (TDM)."
  • Elsevier also provides access to a freely available corpus.
  • Text and Data Mining: general info, FAQ, Developer Portal
  • Terms and conditions 

Gale

  • Text held by Bowdoin Library.
  • Access currently requires facilitation by a librarian.

Currently available licensed library databases from Gale:

  • British Library Newspapers, Part I, 1800-1900. Included in British Library Newspapers. Formats: JPG, TIF, and XML encoded as ISO-8859-1. The XML includes coordinates of the text in the image files. 265,107 XML files totaling ca. 705 GB.
  • British Library Newspapers, Part II, 1800-1900. Included in British Library Newspapers. Formats: JPG and XML encoded as UTF-8. The XML includes coordinates of the text in the image files. Ca. 156,272 XML files totaling ca. 501.9 GB. Each file contains one issue of one newspaper.
  • Eighteenth Century Collections Online I (ECCO I). Included in the Eighteenth Century Collections Online database. English-language and foreign-language titles printed in the United Kingdom and many from the Americas. Files: 155,023 documents. Not yet available.
  • Eighteenth Century Collections Online II (ECCO II). Included in the Eighteenth Century Collections Online database. English-language and foreign-language titles printed in the United Kingdom and many from the Americas. Formats: TIF and XML encoded as ISO-8859-1. The XML includes coordinates of the text in the image files. 50,630 documents in 50,630 XML files totaling ca. 92.35 GB. Purchased in 2009.
  • Illustrated London News Historical Archive, 1842-2003. Online database. Files: 7,114 issues. Formats: JPG and XML encoded as UTF-8. The XML includes coordinates of the text in the image files. 7,114 XML files totaling ca. 15.6 GB.
  • The Times Digital Archive, 1785-2006 (Times of London). Online database. Formats: TIF and XML encoded as ISO-8859-1. The XML includes coordinates of the text in the image files. Ca. 69,026 XML files totaling ca. 352 GB. Each file contains one issue.
  • TLS Historical Archive, 1902-2005 (Times Literary Supplement). Included in Gale Primary Sources (formerly Gale Newsvault). Formats: JPG and XML. The XML may not be well-formed. Ca. 540,000 XML files totaling >9 GB.
  • TLS Historical Archive, Supplement 2007 (Times Literary Supplement). Included in Gale Primary Sources (formerly Gale Newsvault). Formats: JPG and XML encoded as UTF-8. The XML includes coordinates of the text in the image files. 50 XML files totaling ca. 134 MB.
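
The Gale files listed above use two different encodings (ISO-8859-1 and UTF-8), and some of the XML may not be well-formed. The following is a minimal Python sketch of one way to pull plain text out of a single file under those conditions. The function name `extract_text` and the tag-stripping fallback are illustrative assumptions, not part of any Gale tooling, and the sketch ignores the text coordinates stored in the XML.

```python
import re
import xml.etree.ElementTree as ET


def extract_text(xml_bytes: bytes, fallback_encoding: str = "utf-8") -> str:
    """Return the text content of one XML file as a single string.

    Hypothetical helper: pass the raw bytes of the file, not a decoded string,
    so the parser can honor the encoding named in the XML declaration
    (for example ISO-8859-1 or UTF-8).
    """
    try:
        root = ET.fromstring(xml_bytes)
        pieces = list(root.itertext())
    except ET.ParseError:
        # Not well-formed XML: decode leniently and strip anything that
        # looks like a tag. This is a rough fallback, not a real parser.
        decoded = xml_bytes.decode(fallback_encoding, errors="replace")
        pieces = [re.sub(r"<[^>]+>", " ", decoded)]
    # Collapse runs of whitespace so the result is one clean line of text.
    return " ".join(" ".join(pieces).split())


if __name__ == "__main__":
    sample = (
        '<?xml version="1.0" encoding="ISO-8859-1"?>'
        "<issue><title>Café</title><p>News</p></issue>"
    ).encode("latin-1")
    print(extract_text(sample))
```

Reading the file in binary mode (`open(path, "rb")`) matters here: decoding an ISO-8859-1 file as UTF-8 before parsing can corrupt accented characters.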

Gale-Archives Unbound

Components of the Archives Unbound database available through Bowdoin Library:

  • The American Indian Movement and Native American Radicalism. Online database. "Formed in 1968, the American Indian Movement (AIM) expanded from its roots in Minnesota and broadened its political agenda to include a searching analysis of the nature of social injustice in America. These FBI files provide detailed information on the evolution of AIM as an organization of social protest and the development of Native American radicalism." Files: 69 XML files totaling ca. 103 MB.
  • Federal Surveillance of African Americans, 1920-1984. Online database. "Between the early 1920s and early 1980s, the Justice Department and its Federal Bureau of Investigation engaged in widespread investigation of those deemed politically suspect. Prominent among the targets of this sometimes coordinated, sometimes independent surveillance were aliens, members of various protest groups, Socialists, Communists, pacifists, militant labor unionists, ethnic or racial nationalists, and outspoken opponents of the policies of the incumbent presidents. [...] Black Americans of all political persuasions were subject to federal scrutiny, harassment, and prosecution. The FBI enlisted black 'confidential special informants' to infiltrate a variety of organizations. Hundreds of documents in this collection were originated by such operatives." Files: 1,293 XML files totaling ca. 818 MB.
  • Feminism in Cuba: Nineteenth through Twentieth Century Archival Documents. Online database. "A study on feminists and the feminist movement in Cuba between Cuban independence and the end of the Batista regime. [...] This collection draws on rich primary sources-texts, personal letters, journal essays, radio broadcasts, memoirs from women’s congresses-which allow these women to speak in their own voices." Files: JPG and XML. 133 XML files totaling ca. 197 MB.
  • The International Women's Movement: The Pan Pacific Southeast Asia Women's Association, 1950-1985. Online database. "The Pan Pacific and Southeast Asia Women's Association (PPSEAWA) was founded in Honolulu, Hawaii in 1930 with the intention to 'strengthen the bonds of peace among Pacific peoples by promoting a better understanding and friendship among the women of all Pacific countries.'" Files: JPG and XML. 79 XML files totaling ca. 143 MB.
  • Overland Journeys: Travels in the West, 1800-1880. Online database. Many western settlers "recorded daily events and their thoughts with such picturesque zest that some accounts of westward journeys have elements of great literature within them." Files: JPG and XML. 292 XML files totaling ca. 924 MB.
  • The Papers of Amiri Baraka, Poet Laureate of the Black Power Movement. Online database. "Rare works of poetry, organizational records, print publications, over one hundred articles, poems, plays, and speeches by Baraka, a small amount of personal correspondence, and oral histories." Files: JPG and XML. 212 XML files totaling ca. 189 MB.
  • Phyllis Lyon and Del Martin: Beyond the Daughters of Bilitis. Online database. Sources on issues and groups beyond those in "Phyllis Lyon, Del Martin, and the Daughters of Bilitis", including "the files detailing the impact of Martin's book Battered Wives—and the heart-wrenching correspondence it evoked from women in small towns and big cities." Files: JPG and XML. 747 XML files totaling ca. 682 MB.
  • Phyllis Lyon, Del Martin and the Daughters of Bilitis. Online database. "Extensive information on the founding and growth of the homophile movement, especially the Daughters of Bilitis and The Ladder, including early meeting minutes, correspondence, chapter records, membership data, and manuscripts unavailable elsewhere." Files: JPG and XML. 854 XML files totaling ca. 372 MB.
  • Revolution in Mexico, the 1917 Constitution, and its Aftermath: Records of the U.S. State Department. Online database. "Political and military documents relating to the Mexican Revolution and its aftermath, 1910-1924." Files: JPG and XML. 138 XML files totaling ca. 1.14 GB.
  • "Through the camera lens": The Moving Picture World and the Silent Cinema Era, 1907-1927. Online database. "For those within the film industry, information and opinion were shaped by a number of aggressive trade publications, each competing for the same limited number of subscribers. Chief among these was the Moving Picture World, which, setting a standard for the broadest possible coverage, reviewed current releases and published news, features, and interviews relating to all aspects of the industry." Files: JPG and XML. 1,075 XML files totaling ca. 3.87 GB.
  • Witchcraft in Europe and America. Online database. Includes classic texts as well as "anti-persecution writings, works by penologists, legal and church documents, exposés of persecutions, and philosophical writings and transcripts of trials and exorcisms" from the 15th through the early 20th centuries. Files: JPG and XML. 914 XML files totaling ca. 3.4 GB.

JSTOR

  • Corresponding database: JSTOR
  • Access does not require facilitation by a librarian.
  • "Data for Research (DfR) provides datasets of content on JSTOR for use in research and teaching. Researchers may use DfR to define and submit their desired dataset to be automatically processed. Data available through the service includes metadata, n-grams, and word counts for most articles and book chapters, and for all research reports and pamphlets on JSTOR. Datasets are produced at no cost to researchers and may include data for up to 25,000 documents."
  • Data for Research (DfR)
  • Terms and conditions
  • See also Early Journal and Open Access Content
    • "JSTOR has made journal content published before 1923 in the United States (and prior to 1870 everywhere else) available for free online. This 'Early Journal Content' includes discourse and scholarship in the arts and humanities, economics and politics, and in mathematics and other science. It includes nearly 500,000 articles and comprises 6% of the content on JSTOR."
    • Format: XML

Constellate (JSTOR)

  • For free and subscription versions, see Constellate tab.

ProQuest

Currently available licensed library databases from ProQuest:

  • German Literature Collections. Online database. Text held by Bowdoin Library. Access currently requires facilitation by a librarian. Files: Some characters are not valid UTF-8. The text is tagged, but the documents are not well-formed XML.
  • New York Times (18 September 1851 - 1937). Online database. Text held by Bowdoin Library. Access currently requires facilitation by a librarian. Formats: images in PDF; XML encoded as UTF-8. Both are delivered in compressed ZIP files. The XML files are in 347 ZIP files totaling 18.2 GB (compressed). There could be as many as 8,675,000 XML files (347 ZIP files of up to 25,000 files each), a total of ca. 42 GB uncompressed. One sample ZIP file contains 25,000 XML files: 59.4 MB compressed and 136 MB uncompressed, about 2.3 times the compressed size.
  • Text Creation Partnership (TCP). "The Text Creation Partnership (TCP), has been creating accurate full-text transcriptions for a large selection of EEBO [Early English Books Online] works. The TCP partnership began in 1999 as an innovative collaboration between ProQuest LLC, the University of Michigan, and Oxford University. The aim was to convert 25,000 books from EEBO into fully-searchable, TEI-compliant SGML/XML texts. This collaboration extended to a funding partnership with JISC and a collection of libraries so that now TCP texts are jointly owned by more than 150 libraries worldwide, creating a significant database of foundational scholarship."  About.
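
The New York Times size estimates above follow from simple arithmetic on the one sample ZIP file, and XML can be read straight out of a ZIP file without unpacking it to disk. The sketch below shows both, using only Python's standard `zipfile` module. The function names are hypothetical, and the sketch assumes that XML members have names ending in `.xml`, which this guide does not confirm.

```python
import io
import zipfile


def estimate_total_files(zip_count: int, files_per_zip: int) -> int:
    """Upper bound on the file count if every ZIP matches the sample ZIP."""
    return zip_count * files_per_zip


def estimate_uncompressed_gb(total_compressed_gb: float,
                             sample_compressed_mb: float,
                             sample_uncompressed_mb: float) -> float:
    """Scale the compressed total by the sample's compression ratio."""
    ratio = sample_uncompressed_mb / sample_compressed_mb
    return total_compressed_gb * ratio


def iter_xml_members(zip_source):
    """Yield (member name, raw bytes) for each XML file in one ZIP archive.

    zip_source may be a file path or a binary file-like object.
    Assumption: XML members are identified by a .xml file extension.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.filename.lower().endswith(".xml"):
                yield info.filename, archive.read(info)


if __name__ == "__main__":
    # Figures taken from the New York Times entry in this guide.
    print(estimate_total_files(347, 25_000))
    print(round(estimate_uncompressed_gb(18.2, 59.4, 136), 1))

    # Tiny in-memory ZIP standing in for one vendor ZIP file.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as demo:
        demo.writestr("article1.xml", "<doc>Example</doc>")
    for name, data in iter_xml_members(buffer):
        print(name, len(data))
```

Reading members one at a time keeps memory use small, which helps when a single ZIP file may hold 25,000 documents.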

Springer

Springer Nature