Canterbury corpus

Article Id: WHEBN0002937023

Title: Canterbury corpus  
Author: World Heritage Encyclopedia
Language: English
Subject: Calgary corpus, Data compression, Stanford dragon, EICAR test file, The quick brown fox jumps over the lazy dog
Collection: Data Compression, Test Items
Publisher: World Heritage Encyclopedia

The Canterbury corpus is a collection of files intended for use as a benchmark for testing lossless data compression algorithms. It was created in 1997 at the University of Canterbury, New Zealand, and was designed to replace the Calgary corpus. The files were selected for their ability to provide representative performance results.[1]


Contents


In its most commonly used form, the corpus consists of 11 files, selected as "average" documents from 11 classes of documents,[2] and totaling 2,810,784 bytes, as follows:

Size (bytes)   File name      Description
     152,089   alice29.txt    English text
     125,179   asyoulik.txt   Shakespeare
      24,603   cp.html        HTML source
      11,150   fields.c       C source
       3,721   grammar.lsp    LISP source
   1,029,744   kennedy.xls    Excel spreadsheet
     426,754   lcet10.txt     Technical writing
     481,861   plrabn12.txt   Poetry
     513,216   ptt5           CCITT test set
      38,240   sum            SPARC executable
       4,227   xargs.1        GNU manual page
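
As an illustration of how the file list above might be used as a benchmark, the following Python sketch checks that the listed sizes add up to the stated total and measures compressed size in bits per input byte. It is a minimal sketch under stated assumptions, not the corpus authors' own tooling: the directory argument is a hypothetical local copy of the corpus, the function and variable names are invented for this example, and the three standard-library compressors are stand-ins for whatever algorithm is actually under test.

```python
import bz2
import lzma
import zlib
from pathlib import Path

# File sizes in bytes, copied from the table above.
CORPUS_SIZES = {
    "alice29.txt": 152_089,
    "asyoulik.txt": 125_179,
    "cp.html": 24_603,
    "fields.c": 11_150,
    "grammar.lsp": 3_721,
    "kennedy.xls": 1_029_744,
    "lcet10.txt": 426_754,
    "plrabn12.txt": 481_861,
    "ptt5": 513_216,
    "sum": 38_240,
    "xargs.1": 4_227,
}

# Example compressors only (an assumption of this sketch); swap in the
# algorithm being evaluated as a (compress, decompress) pair.
COMPRESSORS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def bits_per_byte(data: bytes, compress, decompress) -> float:
    """Return compressed size in bits per input byte.

    The round trip is checked first, because a benchmark of lossless
    compression is meaningless if the original cannot be recovered.
    """
    packed = compress(data)
    if decompress(packed) != data:
        raise ValueError("round trip failed; compressor is not lossless")
    return 8 * len(packed) / len(data)


def benchmark(corpus_dir: str) -> dict:
    """Return {file name: {compressor: bits per byte}}.

    corpus_dir is a hypothetical directory holding the corpus files.
    Files that are not present are skipped.
    """
    results = {}
    for name, expected_size in CORPUS_SIZES.items():
        path = Path(corpus_dir) / name
        if not path.is_file():
            continue
        data = path.read_bytes()
        if len(data) != expected_size:
            raise ValueError(f"{name}: expected {expected_size} bytes, got {len(data)}")
        results[name] = {
            label: bits_per_byte(data, comp, decomp)
            for label, (comp, decomp) in COMPRESSORS.items()
        }
    return results


# The per-file sizes in the table sum to the total quoted in the text.
assert sum(CORPUS_SIZES.values()) == 2_810_784
```

Calling `benchmark` with the path of a local copy returns one row of results per file found. Reporting bits per input byte rather than raw compressed size keeps the 11 files comparable despite their very different lengths.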

References


  1. Witten, Ian H.; Moffat, Alistair; Bell, Timothy C. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann. p. 92.
  2. Salomon, David (2007). Data Compression: The Complete Reference (Fourth ed.). Springer. p. 12.

External links

  • The Canterbury Corpus