A corpus (plural corpora) is a linguistic resource consisting of a large, structured collection of texts, usually stored and processed electronically. In corpus linguistics, corpora are used to carry out statistical analysis and hypothesis testing, to check occurrences, and to validate linguistic rules within a particular language domain.
Corpora are the primary knowledge base for corpus linguistics. Important areas of use include language technology, natural language processing, and computational linguistics. The analysis and processing of different kinds of corpora are also the focus of a great deal of work in computational linguistics, speech recognition, and machine translation, where corpora are often used to build hidden Markov models for part-of-speech tagging and other purposes.
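To make the link between a corpus and a hidden Markov model (HMM) tagger more concrete, here is a minimal sketch in Python. It assumes a tiny hand-made tagged corpus and uses add-one smoothing. The names `tagged_corpus`, `train_hmm`, `prob`, and `viterbi` are hypothetical and chosen only for illustration. Real taggers are trained on far larger corpora and usually work with log probabilities.

```python
from collections import Counter, defaultdict

# Hypothetical toy tagged corpus: each sentence is a list of (word, tag) pairs.
tagged_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

def train_hmm(corpus):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of emitted word
    for sentence in corpus:
        previous = "<s>"                # sentence-start marker
        for word, tag in sentence:
            transitions[previous][tag] += 1
            emissions[tag][word] += 1
            previous = tag
    return transitions, emissions

def prob(counter, key, n_outcomes):
    """Add-one smoothed relative frequency of key in counter."""
    return (counter[key] + 1) / (sum(counter.values()) + n_outcomes)

def viterbi(words, transitions, emissions):
    """Return the most probable tag sequence for words."""
    tags = list(emissions)
    vocab = {word for counts in emissions.values() for word in counts}
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {"<s>": (1.0, [])}
    for word in words:
        step = {}
        for tag in tags:
            # One extra outcome is reserved for unseen words.
            emit = prob(emissions[tag], word, len(vocab) + 1)
            step[tag] = max(
                (p * prob(transitions[prev], tag, len(tags)) * emit, path + [tag])
                for prev, (p, path) in best.items()
            )
        best = step
    return max(best.values())[1]

transitions, emissions = train_hmm(tagged_corpus)
print(viterbi(["the", "dog", "sleeps"], transitions, emissions))
# prints ['DET', 'NOUN', 'VERB']
```

The counts gathered in `train_hmm` are the only place the corpus enters the model, which is why the size and quality of the corpus largely determine how well such a tagger works.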
Corpora and the frequency lists derived from them are useful for language teaching. Corpora can also be considered a type of foreign language writing aid: the contextualized grammatical knowledge that non-native speakers acquire through exposure to authentic texts in corpora helps learners understand how sentences are constructed in the target language, which in turn supports effective writing.
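As an illustration of how a frequency list can be derived from a corpus, here is a minimal sketch in Python. It assumes a very simple tokenizer (lowercased alphabetic runs, optionally with an internal apostrophe) and a two-sentence toy corpus. The names `frequency_list` and `toy_corpus` are hypothetical. Real frequency lists are usually built with language-aware tokenization and often with lemmatization.

```python
import re
from collections import Counter

def frequency_list(texts, top_n=None):
    """Return (word, count) pairs sorted by descending count, then alphabetically."""
    counts = Counter()
    for text in texts:
        # Lowercase the text and treat alphabetic runs as tokens.
        counts.update(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked if top_n is None else ranked[:top_n]

# Hypothetical toy corpus, used only for illustration.
toy_corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

top_words = frequency_list(toy_corpus, top_n=3)
print(top_words)
# prints [('the', 4), ('on', 2), ('sat', 2)]
```

A list like this, computed over a large corpus, can help a teacher decide which words learners are most likely to meet first.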
Corpus Text Providers
- https://ilps.science.uva.nl/resources/ (free)
- www.sketchengine.eu (paid), supports many languages
- http://www.panl10n.net/english/index.htm (free), provides language resources for the languages of South East Asia / South Asia
SpiderLing, a web spider for linguistics, is a web crawling program that is useful for building text corpora.