Open Corpora Repositories
session led by Ed Bice, Fran Tyers and Dwayne Bailey
Ed: The Linguistic Data Consortium is a non-profit organisation managing linguistic data. We at Meedan have managed to convince IBM to embrace open linguistic data, and the data we generate is open. The question is: can we leverage this and create a movement? So much data is generated and then goes to waste.
Fran: governments don't understand this issue either. E.g. in South Africa there are websites in 11 official languages, and people there won't answer your inquiries about access to the data.
Phillipe: how can I share my data and get others to respect my work?
Fran: impossible, all open licenses allow derivative works.
Carolina: apply the open access data protocol developed by CC science. CC BY is not recommended because in some countries data is not protected, so even a CC license would introduce some restrictions. The protocol helps you decide this question with regard to the data you have.
Ed: CC BY doesn't make sense here: how do you use attribution when training your machine translation engine?
Dwayne: people don't know why they should be sharing data. In SA, parliamentary translators delete their translations because they run out of space on their hard drives.
Fran: a central repository is a bad idea - too expensive. We need a tag.
Phillipe: we need some storage space where those who run out of space can keep their material.
Carolina: CC can surely help.
Fran: data holders should be approached by local people; that's the easiest way of doing things.
Dwayne: what do we give in return?
Fran: formatted data
Carolina: in genetics when you publish papers you can link them to data.
Phillipe: there is no one way of aligning corpora.
Fran: there are papers laying out why open linguistic resources are needed: "Empiricism Is Not a Matter of Faith" by Ted Pedersen, and "Implementing NLP Projects for Noncentral Languages: Instructions for Funding Bodies, Strategies for Developers" by Oliver Streiter, Kevin P. Scannell and Mathias Stuflesser.
Rahzeb: running a linguistic consortium is expensive; you need 100k euro just to run a server. Membership fees in the TAUS Data Association (www.tausdata.org) differ: 20k for companies, 300 euro for academic institutions. It is non-profit.
Ed: There are different translation memories trained for different areas.
Dwayne: we don't need to worry about standards, we need to create the mindset for sharing. This is not mutual benefit, we need to address this issue.
Ed: we need a brand and a set of 12-month objectives.
Carolina: we may need a manifesto
Phillipe: a mission that is simple is good, the goal should be long-term
Fran: the US government is spending money on language packs, which they sell. We should get the data free. This is a legal question, not a technical one.
Ed: we need to form a working group and get people from Carnegie Mellon around.
Jerzy: mission - creation of linguistic resources allowing machine translation from any language into any other language
Andranik: there is the Universal Networking Language (UNL).
Fran: Carnegie Mellon has a related project for smaller languages in South America, but even though it has failed, the related data has not been made available.
Ed: creation of the group - initial recruiting in addition to us (Ed, Fran, Dwayne)
Janet: you need to first draft something about the problem and what you want to achieve.
Dwayne: questions: the legal framework, stories of uses and counterstories (excessive protectiveness of data).
Carolina: the writing could be done on the wiki
Ed: I'd rather convene the group first.
Dwayne: we need to know what outcomes the meeting should produce.