Corpora

From Ott09 Wiki

- what is a corpus?

    o	a collection of texts gathered in one place for a specific purpose
    o	is "big" necessarily beautiful? it depends on the purpose
    o	corpus-driven (specific questions, singularities) vs. corpus-based (statistics, generalities, e.g. machine translation)
    o	comparable corpora: texts on a similar topic
    o	specificity: multilingual (two languages or many)
    o	the value of a concordance: it shows each word in its context
    o	alignment: done by hand or automatically
    o	how to build multilingual corpora: through translation
    o	hardware issue? no: storing a large amount of text does not take much space
    o	standards: TMX (for bilingual corpora) or plain text
    o	possible links with dictionaries and glossaries
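
The points on TMX, alignment and concordance can be illustrated with a minimal sketch. It assumes a tiny hand-written TMX sample (invented here for illustration, not taken from any real corpus) and uses only the Python standard library. The helper names `read_tmx` and `concordance` are hypothetical, and the tokenizer is a deliberately naive word-character regex.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical two-unit TMX sample, written by hand for this sketch.
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The corpus is small but useful.</seg></tuv>
      <tuv xml:lang="fr"><seg>Le corpus est petit mais utile.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>A large corpus is not always better.</seg></tuv>
      <tuv xml:lang="fr"><seg>Un grand corpus n'est pas toujours meilleur.</seg></tuv>
    </tu>
  </body>
</tmx>"""

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx(tmx_text):
    """Return one {language: segment} dict per translation unit (<tu>)."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                unit[tuv.get(XML_LANG)] = "".join(seg.itertext())
        units.append(unit)
    return units


def concordance(units, keyword, lang, width=2):
    """Return (left context, keyword, right context) for each hit in `lang`."""
    hits = []
    for unit in units:
        tokens = re.findall(r"\w+", unit.get(lang, ""))
        for i, token in enumerate(tokens):
            if token.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((left, token, right))
    return hits


units = read_tmx(TMX_SAMPLE)
for left, word, right in concordance(units, "corpus", "en"):
    print(f"{left:>10} [{word}] {right}")
```

Each translation unit keeps the aligned segments together, so a hit in one language can be traced back to its translation through the same dict. A real corpus would need a proper tokenizer and, for large files, incremental parsing rather than loading the whole document at once.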

- user scenarios:

    o	frequencies on a monolingual corpus, collocations
    o	data mining
    o	parallel corpus: for machine translation
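
The first scenario (frequencies and collocations on a monolingual corpus) can be sketched as follows. The three sentences are a toy corpus made up for illustration, and the function names are hypothetical. Counting adjacent word pairs is only a crude stand-in for collocation extraction: real tools usually rank pairs with an association measure rather than raw counts.

```python
import re
from collections import Counter

# Hypothetical toy corpus, invented for this sketch.
TEXTS = [
    "machine translation needs parallel data",
    "statistical machine translation uses a parallel corpus",
    "a parallel corpus helps machine translation",
]


def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def word_frequencies(texts):
    """Count how often each word occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def bigram_counts(texts):
    """Count adjacent word pairs, a rough source of collocation candidates."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # Pair each token with its successor; pairs never cross text boundaries.
        counts.update(zip(tokens, tokens[1:]))
    return counts


freqs = word_frequencies(TEXTS)
pairs = bigram_counts(TEXTS)
print(freqs.most_common(3))
print(pairs.most_common(3))
```

On this toy corpus the pair ("machine", "translation") occurs in every sentence, which is the kind of recurring combination a collocation search is meant to surface.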

- corpora access:

    o	European corpora: free access but limited domain (public domain)
    o	not many corpora for "small" languages
    o	Linguistic Data Consortium (LDC), ELRA: corpora for purchase
    o	licence: the source text has an author
    o	"Corpora List": public mailing list where people send corpora
    o	how can we pool corpora without being googlized? how can we make corpora accessible? call 1-800 Francis @ the University of Alicante: the corpus will be in the public domain (open licence)
    o	10 years from now: different actors
    o	open corpora movement
    o	Google, benefiting from it
    o	cooperative: TAUS association (as opposed to TAUS as a private actor)
        •	non-profit
        •	categorized by industries
        •	open API
        •	tool guidance: what is useful for you?
    o	University of Leeds: automatic synonym tagging
    o	hyperlinks and semantic web: help people find the material they need