Translation and TikiWiki

From Ott09 Wiki

Session led by Ed Bice, Fran Tyers and Dwayne Bailey

Ed: The Linguistic Data Consortium is a non-profit organisation managing linguistic data. We at Meedan have managed to convince IBM to embrace open linguistic data, and the data we generate is open. The question is: can we leverage this and create a movement? So much data is generated and then goes to waste.

Fran: governments don't understand this issue either. E.g. in South Africa there are websites in 11 official languages and people there won't answer your inquiries about access to the data.

Phillipe: how can I share my data and get others to respect my work?

Fran: impossible, all open licenses allow derivative works.

Carolina: apply the open access data protocol developed by CC science. CC BY is not recommended because in some countries data is not protected, and there even a CC license would introduce some restrictions. The protocol helps you decide this question with regard to the data you have.

Ed: CC BY doesn't make sense here: how do you use attribution when training your machine translation engine?

Dwayne: people don't know why they should be sharing data. In SA, parliamentary translators delete their translations because they run out of space on their hard drives.

Fran: a central repository is a bad idea - too expensive. We need a tag.

Phillipe: we need some storing space for those who run out of space to keep their material.

Carolina: CC can surely help.

Fran: approaching people for data should be done by locals, that's the easiest way of doing things.

Dwayne: what do we give in return?

Fran: formatted data

Carolina: in genetics when you publish papers you can link them to data.

Phillipe: there is no one way of aligning corpora.

Fran: there are papers laying out why open linguistic resources are needed: "Empiricism Is Not a Matter of Faith" by Ted Pedersen, and "Implementing NLP projects for noncentral languages: instructions for funding bodies, strategies for developers" by Oliver Streiter, Kevin P. Scannell and Mathias Stuflesser.

Rahzeb: running a linguistic consortium is expensive; you need 100k euro just to run a server. Membership fees in the TAUS Data Association (www.tausdata.org) differ: 20k for companies, 300 euro for academic institutions. It is non-profit.

Ed: There are different translation memories trained for different areas.

Dwayne: we don't need to worry about standards, we need to create the mindset for sharing. This is not mutual benefit, we need to address this issue.

Ed: we need a brand and 12-month objectives.

Carolina: we may need a manifesto

Phillipe: a mission that is simple is good, the goal should be long-term

Fran: the US government is spending money on language packs, which they sell. We should get the data free. This is a legal question, not a technical one.

Ed: we need to form a working group and get people from Carnegie Mellon around.

Jerzy: mission - creation of linguistic resources allowing machine translation from any language into any other language

Andranik: there is a universal network language

Fran: Carnegie Mellon has a related project for smaller languages in South America, but even though it has failed, the related data has not been made available.

Ed: creation of the group - initial recruiting in addition to us (Ed, Fran, Dwayne)

Janet: you need to first draft something about the problem and what you want to achieve.

Dwayne: questions: the legal framework, stories of uses and counterstories (excessive protectiveness of data).

Carolina: the writing could be done on the wiki

Ed: I'd rather convene the group first.

Dwayne: we need to know what outcomes the meeting should produce.