Corpora
- What is a corpus?
  o A collection of texts gathered in one place for a specific purpose.
  o Is "big" necessarily beautiful? It depends on the purpose.
  o Corpus-driven (questions, singularities) vs. corpus-based (statistics, generalities, e.g. machine translation).
  o Comparable corpora: the topic is similar.
  o Specificity of multilingual corpora: two languages or many.
  o The value of concordance: "context".
  o Alignment: done by hand or automatically.
  o How to build multilingual corpora: through translation.
  o Hardware issue? No: storing a large amount of text does not take much space.
  o Standards: TMX (for bilingual corpora) or plain text.
  o Possible links with dictionaries and glossaries.
- User scenarios:
  o Frequencies on a monolingual corpus, collocations.
  o Data mining.
  o Parallel corpus: for machine translation.
- Corpora access:
  o European: free access but limited domain (public domain).
  o Not many corpora for "small" languages.
  o Linguistic Data Consortium (LDC), ELRA: to buy corpora.
  o Licence: source texts have authors.
  o "Corpora List": public mailing list where people send corpora.
  o How can we pool resources without being "googlized"? How can we make corpora accessible? Call 1-800 Francis @ the University of Alicante: the corpus will be in the public domain (open licence).
  o 10 years from now: different actors
    • Open corpora movement.
    • Google, benefiting from it.
    • Cooperative: TAUS association (as opposed to TAUS as a private actor)
      • non-profit;
      • categorized by industry;
      • open API;
      • tool guidance: what is useful for you?
  o University of Leeds: automatic synonym tagging.
  o Hyperlinks and semantic web: help people find the material they need.
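
To make the notes on TMX, frequencies, collocations and concordance more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a tool discussed in the session: the two-sentence TMX sample is invented, the helper names (`read_tmx`, `tokenize`, `kwic`) are hypothetical, and adjacent word pairs are only a crude stand-in for real collocation measures.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample data: a tiny English/French parallel corpus in TMX form.
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The corpus is a collection of texts.</seg></tuv>
      <tuv xml:lang="fr"><seg>Le corpus est une collection de textes.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>A parallel corpus is a collection of aligned texts.</seg></tuv>
      <tuv xml:lang="fr"><seg>Un corpus parallèle est une collection de textes alignés.</seg></tuv>
    </tu>
  </body>
</tmx>"""

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx(text):
    """Return one dict per translation unit, mapping language code to segment."""
    root = ET.fromstring(text)
    return [
        {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.findall("tuv")}
        for tu in root.iter("tu")
    ]


def tokenize(text):
    """Lowercase the text and split it into word tokens (very naive)."""
    return re.findall(r"\w+", text.lower())


def kwic(tokens, keyword, width=3):
    """Keyword-in-context: (left context, keyword, right context) per hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Aligned segment pairs, already matched up by the TMX structure.
pairs = read_tmx(TMX_SAMPLE)

# Monolingual view: tokenize the English side, one token list per segment.
english = [tokenize(p["en"]) for p in pairs]

# Word frequencies over the whole English side.
frequencies = Counter(tok for seg in english for tok in seg)

# Adjacent word pairs, counted within each segment only.
bigrams = Counter(pair for seg in english for pair in zip(seg, seg[1:]))

# Concordance lines for "corpus" with two words of context on each side.
concordance = [line for seg in english for line in kwic(seg, "corpus", width=2)]

print(frequencies.most_common(3))
print(bigrams.most_common(3))
for left, key, right in concordance:
    print(f"{left:>12} [{key}] {right}")
```

The TMX file carries the alignment itself (each `tu` groups the segments that translate one another), which is why the same data can feed both the monolingual scenario (frequencies, collocations, concordance) and the parallel one (sentence pairs for machine translation).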