Reputation Metrics

How to track quality and reputation of participants in a translation system.

Overview

Three broad categories:

  • the internet itself (external links to documents in your system, external discussions of your system's content)
  • editorial or top-down reputation tracking, where "superusers" supervise or police system activity (an editor vs. author division of labor)
  • user-mediated reputation tracking, with large numbers of users voting on activity or documents

There's not really a single answer to the question; each method has strengths and weaknesses, and using a combination is usually best.

User-mediated

The simplest system is a user-mediated one. Give users a means of registering interest: thumbs up/down, for example, where thumbs up means "I am interested / want to see more of this document." You aggregate the votes and look at percentages. If there are clusters of votes from one IP subnet, or within a short period of time, people may be trying to game the system; if the votes are well distributed, you can have some faith that they represent users' real responses. Yes, No, and Block are sample choices; you don't want to give users too many options. (Block = "this is junk or spam and I don't want to see anything like this ever again.") Yes (a plus sign) doesn't mean this is the best translator in the world, just as No (a minus sign) doesn't mean they are the worst in the world; it only means you like or dislike the particular piece of content. Block (a red circle with a diagonal line) would be used to bar certain content or contributors.

Sometimes it's not clear whether people are voting on how interesting the original article is or on the translation itself. No matter how clear the symbol is, you need (localized!) explanations, perhaps a tooltip: "Rate this translation, NOT the article content." You want to make it very easy for people to vote: allow voting by IP or cookie or something similar, without requiring registration. You want to look for statistically significant patterns (similar IPs, geographical clusters from geolocation services) to weed out efforts to stack the vote. You should expect more votes when an article is first published, but too many votes in a short period of time is also a tip-off.
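
A minimal sketch (in Python, not from the session) of how such gaming checks might look, assuming each vote carries an IP address and a timestamp; field names and thresholds are purely illustrative:

  # Illustrative check for vote clusters (same /24 subnet) and bursts in time.
  from collections import Counter
  from datetime import timedelta

  def suspicious_votes(votes, subnet_limit=20,
                       burst_window=timedelta(minutes=10), burst_limit=50):
      """votes: list of dicts with 'ip' (str) and 'time' (datetime), sorted by time."""
      flags = []
      # Many votes from the same /24 subnet look like an attempt to stack the vote.
      subnets = Counter(".".join(v["ip"].split(".")[:3]) for v in votes)
      for subnet, count in subnets.items():
          if count > subnet_limit:
              flags.append(("subnet", subnet, count))
      # Too many votes inside a short window is also a tip-off.
      times = [v["time"] for v in votes]
      start = 0
      for end, t in enumerate(times):
          while t - times[start] > burst_window:
              start += 1
          if end - start + 1 > burst_limit:
              flags.append(("burst", times[start], end - start + 1))
              break
      return flags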

Subjective quality on a 1-5 star scale becomes harder: what's the difference between a 3 and a 4? People can rate things at the extremes, though, so we use the simple options (say, up arrow = "we like it a lot, add articles from this translator to my favorites list"; down arrow = "remove this translator's articles from my list"). Suppose you're picky: the system will blacklist people for you, but not necessarily for everyone. The system can also give weight to the systemwide average.

You can just collect votes and not act on them for a while, until you have a lot of data, and then start implementing filters based on it; or you can let users create filters such as "if a translator gets more than 10 negative votes, don't show them".
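
A small sketch of such a user-defined filter; the tally format and the threshold default are assumptions:

  def visible_translators(vote_tallies, max_negative=10):
      """vote_tallies: dict mapping translator id -> (positive_count, negative_count)."""
      # Hide any translator with more negative votes than the user's threshold.
      return [t for t, (pos, neg) in vote_tallies.items() if neg <= max_negative]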

How do you create a rating? With an up or down arrow, count each vote as +1 or -1, weed out the duplicates (more or less), and compute the average. You can also figure out how much variability there is over time.
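
A sketch of that rating computation under assumed data shapes: one vote per voter is kept (the latest wins), up = +1 and down = -1, and the result is the average plus a measure of spread:

  from statistics import mean, pstdev

  def article_rating(votes):
      """votes: list of (voter_id, value, time) tuples, where value is +1 or -1."""
      latest = {}
      for voter, value, when in sorted(votes, key=lambda v: v[2]):
          latest[voter] = value  # a later vote by the same voter replaces the earlier one
      values = list(latest.values())
      if not values:
          return None, None
      return mean(values), pstdev(values)  # average score and how much the votes vary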

Should voting rights be restricted? Perhaps you would say "only registered users can vote" in the case of a smaller user community, where you know the maximum number of votes possible because you know how many registered users there are. Alternatively, you could weight anonymous votes by multiplying them by some factor, say 0.8.
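
A small sketch of that weighting, where only the 0.8 factor comes from the text and the rest is assumed:

  def weighted_score(registered_votes, anonymous_votes, anon_weight=0.8):
      """Each argument is a list of +1/-1 values; anonymous votes count for less."""
      total = sum(registered_votes) + anon_weight * sum(anonymous_votes)
      count = len(registered_votes) + anon_weight * len(anonymous_votes)
      return total / count if count else 0.0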

What about changes in quality? A new translator might improve over time. You could put expiration dates on votes, say tossing votes more than six months old when weighting an article. Likewise, a translator might start out with lots of energy and do great work in the beginning, then produce work that's not as good later. You can watch votes over time and look at trends too. In a translation system you want to emphasize translators who are consistent; if they are consistently mediocre, that's better in a way than a translator who is great in some areas and awful in others. You might have tighter review for translators with large variability in their ratings.
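
A sketch of vote expiration and consistency tracking; the six-month cutoff comes from the text, while the data shapes and function names are assumptions:

  from datetime import datetime, timedelta
  from statistics import mean, pstdev

  def current_rating(votes, now=None, max_age=timedelta(days=182)):
      """votes: list of (value, time); votes older than max_age are ignored."""
      now = now or datetime.utcnow()
      recent = [value for value, when in votes if now - when <= max_age]
      return mean(recent) if recent else None

  def translator_consistency(article_averages):
      """article_averages: per-article mean ratings for one translator.
      A large spread flags a translator whose work varies a lot in quality."""
      return pstdev(article_averages) if len(article_averages) > 1 else 0.0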

How do you rate articles produced by more than one person (collaborative translation)? You have to assume that votes are based on the most recent version of the article. If the translation is done paragraph by paragraph, your voting system must reflect this (so "blame" can be assigned accordingly).

There are two ratings: 1) the rating of the translator, and 2) the rating of the article. The collective ratings of the articles add up to the rating of the translator. What happens if a user wants to change their mind about a vote? It's not that likely, but the system counts one vote per user, so we can take the latest vote as the authoritative one. Netflix works this way: you can rate every movie 1-5 stars, change your vote at any time, and they take the most recent one.
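
A sketch of the two-level rating, assuming the translator's rating is simply the average of their articles' ratings (the latest-vote-wins rule is the same deduplication shown in the earlier rating sketch):

  from statistics import mean

  def translator_rating(article_ratings):
      """article_ratings: the mean ratings of each of the translator's articles."""
      return mean(article_ratings) if article_ratings else None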

How would we import reputations into other systems/communities? The raw data could be made available, but there's nothing in place now; no standard for doing this exists yet.

People (system administrators/designers) should be encouraged to make the data public, for research if nothing else.

If the meaning of the data is different, then importing won't make sense (so if the data format is similar but the ratings have different meanings attached to them, that's a problem for portability).

Incentive systems? Suppose the top contributors were listed somewhere prominent; would this be useful?

The existence of a voting system is a huge incentive by itself. Possibly you could be paid if you are in the top 10 or 20.

Encouraging/discouraging translations

This was a very energetic discussion. Some participant comments/arguments below:

Can I rate a translator highly if I know their knowledge of language X is only average, but given that knowledge they have done very well? One answer: we don't care whether the translator is doing the best they can according to their ability or knowledge; we care about whether the end result is high quality. A second response: any human translation is better than MT, and it would be awful for a human translation to be rated below machine translation output. You also always have the option of abstaining (not voting) in some cases. A third response: you always want to encourage the human translator, not discourage them from trying to do the work.

Once you design the interface (symbols, what a vote means), stick with it, because you want to keep statistics with consistent meanings. If the down votes are harsher than you think they ought to be, design your system so that it handles these votes appropriately. If you make changes to the interface, that is going to break your statistical analysis.

Instead of a thumbs down, could we ask "does this translator need some help?", something more positive?

Reviews of a product describe whether a person likes or doesn't like something, and we find that useful information.

Voting allows us to see what the voting patterns are for known good translations and known bad translations; then we can use those statistics to classify future translations. You might hire a statistician for this analysis. We can also distinguish, for example, between the votes of native speakers of a language and the votes of people who have it as a second language...
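
A minimal sketch of that idea: learn a cutoff from the vote patterns of known good and known bad translations, then classify new ones against it. The single feature (fraction of positive votes) and the midpoint rule are assumptions, not anything agreed in the session:

  from statistics import mean

  def positive_fraction(up, down):
      return up / (up + down) if (up + down) else 0.5

  def learn_threshold(good_examples, bad_examples):
      """Each example is an (up_votes, down_votes) pair for a known translation."""
      good = mean(positive_fraction(u, d) for u, d in good_examples)
      bad = mean(positive_fraction(u, d) for u, d in bad_examples)
      return (good + bad) / 2.0  # midpoint between the two class averages

  def classify(up, down, threshold):
      return "good" if positive_fraction(up, down) >= threshold else "needs review"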

A translation can lose the message of the original article if the translation is bad.

But the translator can be discouraged if given negative votes.

You are building a community; if you boo someone, they are not coming back.

In open source projects, if someone contributes crap code they get kicked out immediately. This encourages people to reach a certain level of competence before they contribute.

When you have a translation community that produces really high quality translations, people want to join and they know they have to do some work to get there.

I don't want to say to someone "you're a bad translator", I just want people to say "this translation is good" or "this translation needs to be looked at again".

If you want to do this on a paragraph basis, the voting control has to be compact: it's got to fit on the screen, be readable, and not interfere with reading the translation itself.

You don't want people to have to edit and clean up translations; it's easier for many people to translate from scratch than to edit someone else's work. (One person says:) When I was reading Amazon reviews, the negative reviews were useful to me.

Some people actually prefer editing to translating from scratch, and they gravitate to those roles. In a system where we mark translations for improvement, these people will take on that cleanup.

In a community where such editing is more common, perhaps a person who has been participating longer gets more choices (a "power user"); you can ask them for more: is this grammatically correct, are there problems with idioms, etc. But you don't want to overwhelm the average user with so many choices that they don't vote, because you have to collect the data; that's foremost.

Expert-driven voting systems

You expect the experts (editors) to give you good data: they will know which things are grammar issues, which errors are important, etc. You can do some cross-correlation between user and editor scores. You can have editors do quick screening of new contributors based on these scores. You can give editors a two-page document with the scoring system and the criteria; they will be more consistent in their voting.
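
A sketch of the cross-correlation idea, assuming you have paired per-article average scores from users and from editors; it uses plain Pearson correlation:

  from statistics import mean

  def pearson(user_scores, editor_scores):
      """Correlation between paired per-article averages from users and editors."""
      mx, my = mean(user_scores), mean(editor_scores)
      cov = sum((x - mx) * (y - my) for x, y in zip(user_scores, editor_scores))
      vx = sum((x - mx) ** 2 for x in user_scores) ** 0.5
      vy = sum((y - my) ** 2 for y in editor_scores) ** 0.5
      return cov / (vx * vy) if vx and vy else 0.0

A correlation close to 1 would suggest that user votes track the editors' judgments well.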

Internet-driven systems

A prime example: Google, with PageRank. Ways to facilitate such a system: give documents a permanent URL; then you can count the number of hits on those URLs. How can you tell whether it's the content or the translation that draws people? Suppose you have twenty different revisions of a translation: if people go to a particular version more often, this could be a clue. (Of course this assumes that people don't just pass around the base URL, which this note-taker thinks is more likely :-P )
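
A sketch of counting hits on permanent URLs per translation revision from an access log; the URL pattern with a revision id is a made-up assumption:

  from collections import Counter
  import re

  # Hypothetical permanent-URL scheme: /articles/<article-id>/rev/<revision-number>
  REVISION_URL = re.compile(r"/articles/(?P<article>[\w-]+)/rev/(?P<rev>\d+)")

  def hits_per_revision(log_lines):
      counts = Counter()
      for line in log_lines:
          m = REVISION_URL.search(line)
          if m:
              counts[(m.group("article"), m.group("rev"))] += 1
      return counts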

You can figure out what people are recommending for translation. What are they visiting on your site that has not been translated? You can see the relative frequency of page views of the translated material. What are the top ten translated articles in language X?
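
A sketch of those comparisons, assuming you have page-view counts keyed by article and a set of articles already translated into language X:

  def top_untranslated(page_views, translated, n=10):
      """page_views: dict article -> view count; translated: set of translated article ids."""
      candidates = {a: v for a, v in page_views.items() if a not in translated}
      return sorted(candidates, key=candidates.get, reverse=True)[:n]

  def top_translated(page_views, translated, n=10):
      candidates = {a: v for a, v in page_views.items() if a in translated}
      return sorted(candidates, key=candidates.get, reverse=True)[:n]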

Summary

Voting systems are good when you have questions about quality. Link traffic and how people find documents give you a measure of interest. You want to pull out documents of interest that are also well translated (the best of both).