Corpora
From Aspirationtech.org Wiki
- What is a corpus?
o A collection of texts gathered in one place for a specific purpose.
o Is "big" necessarily beautiful? It depends on the purpose.
o Corpus-driven (questions, singularities) vs. corpus-based (statistics, generalities, e.g. machine translation).
o Comparable corpora: the topic is similar.
o Specificity: multilingual corpora cover two languages or many.
o The value of concordances: seeing a word in its context.
o Alignment: done by hand or automatically.
o How to build multilingual corpora: through translation.
o Hardware issue? Storing a large amount of text does not require much space.
o Standards: TMX (for bilingual corpora) or plain text.
o Possible links with dictionaries and glossaries.
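As an illustration of the TMX and concordance points above, here is a minimal Python sketch. It assumes a tiny hand-written English/French TMX sample (`SAMPLE_TMX`, invented for this example) and uses only the standard library. The helper names `read_tmx_pairs` and `concordance` are hypothetical, and the keyword search is a plain substring match rather than real tokenization.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented two-sentence sample, for illustration only.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The corpus is small.</seg></tuv>
      <tuv xml:lang="fr"><seg>Le corpus est petit.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>We aligned the corpus by hand.</seg></tuv>
      <tuv xml:lang="fr"><seg>Nous avons aligné le corpus à la main.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_tmx_pairs(tmx_text, src="en", tgt="fr"):
    """Return (source, target) segment pairs from a TMX string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segments[tuv.get(XML_LANG)] = "".join(seg.itertext()).strip()
        # Keep only translation units that have both languages.
        if src in segments and tgt in segments:
            pairs.append((segments[src], segments[tgt]))
    return pairs


def concordance(pairs, keyword, width=20):
    """Return (left context, match, right context, aligned target) rows."""
    rows = []
    needle = keyword.lower()
    for source, target in pairs:
        haystack = source.lower()
        start = haystack.find(needle)
        while start != -1:
            end = start + len(needle)
            left = source[max(0, start - width):start]
            right = source[end:end + width]
            rows.append((left, source[start:end], right, target))
            start = haystack.find(needle, end)
    return rows


if __name__ == "__main__":
    for left, hit, right, target in concordance(read_tmx_pairs(SAMPLE_TMX), "corpus"):
        print(f"{left:>20} [{hit}] {right:<20} | {target}")
```

Each row pairs a keyword in context with its aligned translation, which is the kind of view a bilingual concordancer gives.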
- User scenarios:
o Frequencies and collocations on a monolingual corpus.
o Data mining.
o Parallel corpus: for machine translation.
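The first scenario (frequencies and collocations on a monolingual corpus) can be sketched in a few lines of Python. This is only an illustration under simple assumptions: tokens are lowercase runs of word characters, a "collocation" is counted as any unordered pair of different words occurring within a small window, and the function names are hypothetical. Real tools usually add statistical association measures on top of raw counts.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def word_frequencies(texts):
    """Count how often each token occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def collocations(texts, window=2):
    """Count unordered word pairs that co-occur within `window` tokens."""
    pairs = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i, word in enumerate(tokens):
            # Look only forward so each co-occurrence is counted once.
            for other in tokens[i + 1:i + 1 + window]:
                if other != word:
                    pairs[tuple(sorted((word, other)))] += 1
    return pairs


if __name__ == "__main__":
    corpus = [
        "the open corpus is free",
        "an open corpus helps machine translation",
    ]
    print(word_frequencies(corpus).most_common(3))
    print(collocations(corpus).most_common(3))
```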
- Corpora access:
o European corpora: free access but limited in domain (public domain).
o Not many corpora exist for "small" languages.
o Linguistic Data Consortium (LDC), ELRA: corpora can be bought there.
o Licensing: every source text has an author.
o "Corpora list": a public mailing list where people send corpora.
o How can we pool resources without being "googlized"? How can we make corpora accessible? Call "1-800 Francis" at the University of Alicante: the corpus will be in the public domain (open licence).
- 10 years from now: different actors
o Open corpora movement.
o Google, benefiting from it.
o Cooperative: TAUS as an association (as opposed to TAUS as a private actor):
• non-profit;
• categorized by industry;
• open API;
• tool guidance: what is useful for you?
o University of Leeds: automatic synonym tagging.
o Hyperlinks and the semantic web: help people find the material they need.
