Multilingual Decks
+overview
The proposed multilingual functionality is intended to be a general solution applicable to all Wagns. However, it has been inspired by Wikirate.org, and many of the examples below will borrow from that site's structure.
The proposal seeks to honor these constraints/considerations:
- everything is a card.
- a user should only see content in a language that he/she understands unless there are explicit instructions to break that pattern.
- a user who understands multiple languages has the potential to play a special role in an international community: translator.
The essence of the proposal is that there will be two new settings: *name translation and *content translation. Rules based on these settings will initially support these values: universal, monolingual, strict, free, and mapped.
Anyone reading this is warmly invited to contribute use cases to explore how they may be addressed in the proposed system.
+needs
Standard practice among wiki communities, most notably Wikipedia, is to have separate sites for different languages. There is an en.wikipedia.org and an nl.wikipedia.org, and while there are some connections between them (same technology, use of Wikidata, shared governance...), they are, by and large, separate projects with separate communities.
A Wikipedia-style solution will not work for many Wagn sites, many of which are intended to support one set of unified data in multiple languages. For example, consider Wikirate.org. As the name implies, quantitative data is quite central to WikiRate, and rich, nuanced interactions with quantitative data are core to what we are trying to achieve. Many of those numbers (eg transparency scores and voting that feeds into them) are based on micro-interactions on the site, and our vision of transparency for those numbers depends on our being able to see exactly where they all come from.
All of this means implementing WikiRate's vision on separate sites, as Wikipedia does, would not work. We do not want companies to score differently in different languages, nor do we want to multiply a company's reporting overhead by asking them to respond to the same questions on multiple sites. We don't want transparency scores to be based on separate sites that users can't see. Perhaps most importantly, we want the world-wide community of WikiRate users to be able to speak with a unified voice in pushing to "make companies clear" and to enjoy facing the cross-cultural challenges of working on this together.
Current problems / bugs:
The following issues are problematic even in monolingual (non-English) contexts:
- lots of hard-coded English, no default content for other languages.
- the current name key mechanism is based on English pluralization
- international content basically works, but sometimes (as in inclusions and probably links) content breaks, largely because unicode characters get treated as html entities. We may be able to change this behavior by adjusting this tinyMCE setting (they've changed their docs infrastructure, so we probably need a new link):
+solution
This proposal assumes the need to support 5 different language patterns, currently named Universal, Monolingual, Strict, Free, and Mapped.
Note that nomenclature is NOT central to this proposal. Please feel free to suggest new names for these patterns (after reading how we define them, of course!)
Universal.
Some kinds of data should not be treated like natural language. Numbers and programming languages fit this pattern.
Examples
Numbers, Dates, etc: units and things need to be translated, and dates may be presented in different languages, but generally what we're going to store is going to be one universal numeric notation that can be easily translated automatically; it's not something users should ever have to bother with editing multiple versions of.
Javascript, CSS, etc: you definitely don't want to handle separate copies of code cards for separate languages. This is placed in "Universal," because, by definition, computer languages are not natural languages, and their linguistic differences are already addressed by giving them separate cardtypes.
Monolingual
Monolingual content is clearly associated with a specific language, but there should not be different versions of this content for different languages.
Examples
structure rules (cards ending in +*structure): When you update a structure rule (a content pattern), you don't want to have to update it in several languages, and you almost certainly don't want the case where your page has different structure in different languages (with different CSS to maintain, etc. YUCK!). What you want is to update the structure in one language but have the change take effect in all languages. For example, if you add {{+age}} to a structure rule, and you already have a Spanish variant of the "age" card, then Wagn should handle this translation in the nest processing, not with translated structures. Similar logic applies to many rules (permissions, default, style, etc), but not all. *table of contents is a number, for example, so that follows the "universal" pattern. And help text should be explicitly translated.
Strict
A third kind of card should support (and encourage) translation. The core idea of "strict" translation is that the name or content should have the same meaning in different languages. This means that if you update one, you should update the others. This implies a lot of special functionality to make it obvious when languages are not in sync.
Examples
Legal Documents: this is an obvious case where you don't want versions in different languages contradicting each other. They need to say the same thing!
Cards about site policy: even without the involvement of formal law, the same idea holds true of site policy in general.
Free
Alternative cards will support variants in different languages where there is no expectation that content in any two languages will actually have the same meaning as each other. In content terms, the different languages will essentially be independent.
Examples
+discussions. Clearly we don't want to make people translate discussions into multiple languages, but a given card should be discussable in multiple languages.
+Articles, +About, etc : (as in the actual +Article cards on the Analysis pages). If we end up having, say, 20 versions of the "Apple+Climate Change" card (as it would be called in English), they would all vary, probably by quite a bit. While we'd love to have these alternative versions inspire each other, there's no reason why each version needs to be a direct translation of the others. Each could still have citation visualizations, but the citations might come in different orders. And, if/when we support upvotes on articles, those should apply only to a specific language (since one might be great and another might be rubbish).
Mapped
The basic concept here is that translation is performed automatically on Wagn based on how names of other cards are translated.
Examples
Pointers (+topic, +company, etc.) It would be highly inefficient to store pointers in multiple languages, because (as will be explained below), the translation is very easy to automate based on the cards being pointed to. Note that the "+" in the examples (+topic, +company) is very important here. We're not talking about Topic cards, we're talking about +topic fields on, say, a Claim or a Source. If a claim has been related to a topic in English, it should automatically be related to that topic in any other language.
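To make the mapped idea concrete, here is a minimal sketch of how pointer content might be translated from the names of the cards it points to. The lookup table, the shared "concept" ids, and the function name are all hypothetical; nothing here reflects actual Wagn code.

```python
# Hypothetical name-translation table: (language, name) -> shared concept id.
# In a real deck this would come from the translated names of the pointed-to cards.
NAME_TO_CONCEPT = {
    ("en", "Climate Change"): 1,
    ("de", "Klimawandel"): 1,
    ("en", "Apple"): 2,
    ("de", "Apple"): 2,
}


def translate_pointer(items, source_lang, target_lang):
    """Translate each pointed-to name; keep the original when no translation exists."""
    concept_to_name = {
        concept: name
        for (lang, name), concept in NAME_TO_CONCEPT.items()
        if lang == target_lang
    }
    translated = []
    for item in items:
        concept = NAME_TO_CONCEPT.get((source_lang, item))
        translated.append(concept_to_name.get(concept, item))
    return translated
```

With this table, a +topic pointer stored once as ["Climate Change"] would render as ["Klimawandel"] for a German reader without any stored German copy of the pointer.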
Most readers and editors should never have to think about translation patterns. They should simply see the languages they understand and not see languages they don’t, unless specifically requested. They can create, read, update, and delete cards of any language, but everything should happen for them in their default language unless they choose otherwise.
So to begin, we have to know which languages the user understands / prefers. We will have a first guess at this based on location. But even passive users (those with no account) should easily be able to choose to view Wikirate in other languages. Language preferences will be stored in sessions for these cases. By default, the interface for setting the language will feature prominently atop every page of a multilingual site, and it will likely follow the flag-based convention. A user with an account will be able to store language preferences more permanently.
Users can determine not just a preferred language, but an ordered list of preferred languages. For example, a user who prefers German but would like to see English where there is no German available would express preferences as a list of two languages: German, then English.
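The ordered preference list could be resolved along these lines. This is only a sketch; the function name and the idea of passing in the set of languages that have content are assumptions made for illustration.

```python
def pick_language(preferred, available):
    """Return the first preferred language that has content, or None if none does."""
    for lang in preferred:
        if lang in available:
            return lang
    return None
```

For the German-then-English user above, pick_language(["de", "en"], {"en", "es"}) falls through to "en" because no German content exists.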
For each language pattern, the handling is fairly obvious when either (a) content exists for a preferred language or (b) the card doesn’t exist at all. When it exists, we show it! When it doesn’t, it acts largely like a missing card does now (except with locale-specific messaging).
The tricky part is figuring out what happens when a card exists, but content doesn’t exist in the requested language.
Universal
Not applicable. By definition, if the card exists and follows the universal pattern, then the content exists.
Monolingual
In general, a request for a monolingual card will be treated as an “explicit” request for the card’s original language, and the content will be returned in that language. As designed, this should not be a frequent occurrence.
Strict
This will depend somewhat on the view / context. If the card is the main card on the page, we will probably want to (a) indicate that the card is not available in the current language, (b) offer the user the opportunity to translate it, and (c) offer automated assistance in the form of automated translation (hereafter we'll refer to this as "google translation" to avoid confusion with mapping, but we won't be tightly bound to Google). Google translation will only be used in this context, as an editor’s aid, not as a producer of final content for readers.
In other contexts we will want to show something much less obtrusive, but that leaves the opportunity to navigate to the interface described above.
Note that there will need to be analogous views for cards where the translation exists but may be out of date.
Free
Free cards will be similar to strict, but (a) the google translation is offered more as inspiration and may be discarded, and (b) there is no concern about out-of-date translations.
Mapped
In general, "mapped" content will be content containing translated names. When those names are not translated, the handling will largely be governed by the patterns outlined for the type of the named card (see above).
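The per-pattern handling above can be summarized as a small dispatch function. This is a rough sketch only: the pattern strings match the proposal, but the returned action labels and the is_main_card flag are hypothetical stand-ins for views that have not been designed yet.

```python
def missing_translation_action(pattern, is_main_card=True):
    """Describe what to do when a card exists but not in the requested language."""
    if pattern == "universal":
        # By definition, universal content exists whenever the card exists.
        return "not applicable"
    if pattern == "monolingual":
        return "show original language"
    if pattern in ("strict", "free"):
        # Main cards get the full translation interface; other contexts get a hint.
        if is_main_card:
            return "offer translation with machine aid"
        return "show unobtrusive notice"
    if pattern == "mapped":
        return "defer to pattern of named cards"
    raise ValueError(f"unknown translation pattern: {pattern}")
```

Strict and free share a branch here because they differ mainly in how the machine translation is treated and in whether out-of-date translations matter, not in what is offered first.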
Community Dynamics
For this approach to thrive on a community site like Wikirate.org, translation will have to become a central activity on the site. But on the positive side, it has the possibility of becoming a very enjoyable, low-barrier to entry activity. Those who are capable of translations (inferred from the knowledge of multiple languages) will be invited to engage in this activity.
Web addresses
Note that all web addresses will need to indicate the language requested, eg:
http://wikirate.org/en/Company
Note that the “en” here does not determine the language to be shown; it tells Wagn to interpret the word “Company” as an English word. With this convention, if you have German preferences set, for example, Wagn will show you the German version of the Company card. Without this convention, there is no handling of false cognates, and the same url may show unrelated cards to different people.
We can put together a strategy for handling old links to make sure this doesn’t lead to lots of broken links, but I think this change to the RESTful web API is needed
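As an illustration of the address convention, the language prefix and the card name could be split out of a URL roughly like this. The function name is hypothetical, and the sketch assumes the language code is always the first path segment.

```python
from urllib.parse import unquote, urlparse


def parse_card_url(url):
    """Split a URL like http://wikirate.org/en/Company into (name language, card name)."""
    path = urlparse(url).path.lstrip("/")
    lang, _, name = path.partition("/")
    if not lang or not name:
        raise ValueError(f"expected /<lang>/<name> in {url!r}")
    return lang, unquote(name)
```

The returned language only says how to interpret the name; which language is shown would still come from the user's preferences.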
On the Wagneering level, the biggest addition necessary for all this is two new rule types (Settings) called *name translation and *content translation. The rules would follow the standard Wagn rule pattern, and configuring a rule would mean applying one of the above five approaches (Universal, Monolingual, Strict, Free, or Mapped) to a Set of cards. As a first (non-comprehensive) go at this, the rules might look something like the following. (I've organized them in a concise way rather than writing out the full rule names, but it would be easy for any wagneer to translate these directly into rules.)
Sample "*name translation" rules:
*all : strict
*type:
  universal: Year
Sample "*content translation" rules:
*all : strict
*type:
  universal: Javascript, Coffeescript, Number, Date, CSS, Company, User
  mapped: Pointer,
*right:
  monolingual: *structure, *default
  free: discussion, about, article
References
On Wagn, "references" is a general term for links [[like me]] and nests/inclusions {{like me}}. When handling a nest, there are two main language points to consider: (a) what language are we using when we refer to the card (what language is the name in), and (b) what language do we want to use when we show it?
language of card name in syntax
You can specify the language of the cardname, perhaps like [[de: strasse]], {{de: strasse}}
If you don't specify a language in the reference, it will be assumed that the reference is meant to be in the same language that is assigned to the current (nesting) card.
language of card as shown to user
You can specify the language of the output, perhaps like {{strasse | lang:de}}
If you don't specify the language of the output, it will be rendered in the preferred language of the user.
At present, I’d imagine that the inclusion syntax itself, which is generally the domain of wagneers, is only implemented in English.
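A small parser shows how the two language points in a nest might be resolved, with the defaults described for references. The syntax is only the tentative one floated in this proposal ("perhaps like"), the regular expression assumes two-letter language codes, and all names in the code are hypothetical.

```python
import re

# Tentative syntax: {{de: strasse}} sets the language of the name,
# {{strasse | lang:de}} sets the language of the output.
NEST = re.compile(
    r"\{\{\s*(?:(?P<name_lang>[a-z]{2}):\s*)?"
    r"(?P<name>[^|}]+?)\s*"
    r"(?:\|\s*lang:\s*(?P<show_lang>[a-z]{2})\s*)?\}\}"
)


def parse_nest(text, nesting_card_lang, user_lang):
    """Return the nested name plus the languages used to read and to show it."""
    match = NEST.fullmatch(text)
    if match is None:
        return None
    return {
        "name": match.group("name"),
        # Default: the name is in the language of the nesting card.
        "name_lang": match.group("name_lang") or nesting_card_lang,
        # Default: the output is in the user's preferred language.
        "show_lang": match.group("show_lang") or user_lang,
    }
```

Nests with other options, such as a title, are deliberately out of scope for this sketch and return None.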
This proposal forces us to revisit some core Wagn principles around name handling.
Name Uniqueness
It’s always been the case that, in Wagn, you can’t have two cards of the same name. With this proposal, a name no longer has to be unique within a deck; it just has to be unique for a given language within a deck. This matters because, as often happens, the same word can mean different things in different languages (false cognates).
Compound Names
We’ve also long held this principle: if there is an A+B, there must be an A and a B. Translating that into a multilingual environment looks like this: for A+B in language Xese, we must have A in Xese and B in Xese.
The rest of this section reviews the translation patterns from a cardname perspective (ie, the consequences of *name translation rules).
Universal and Monolingual
In the context of the A+B discussion above, if A is universal, it exists in all languages, but if A is monolingual it only exists in one.
Examples
User cards will have Universal names. Assuming that they're structured cards (as they are on Wikirate), this only really means that Users shouldn't have different names in different languages. This may be counter-intuitive, because clearly “Richard Mills” is not a Spanish name, but by making user names universal, we’re saying it’s valid to say “Richard Mills es muy guapo”. In other words, you can use this name even in a Spanish context.
I would expect that Company names will probably be treated as universal by default, but there are clearly cases where they have different names in different languages, so we may need some special handling for this, too.
Strict
Strict names follow strict content patterns very closely. If one name is translated from another, then a "rename" risks leaving related names mistranslated and must be addressed. But, since name and content are different, these may need some separate handling.
Examples
Claims: the "main claim" (the 100-character-or-fewer statement) is stored as a card name. A claim clearly follows the Strict pattern as described in the content section. Given that they acquire votes, it's important that two versions of the same claim in two languages *actually* be making the same claim. A significant mistranslation could really screw up the voting. It's also important that they follow the naming pattern described above: a claim's value can be measured in part by how often it's cited, and we want to be able to measure this across languages.
Topic names and Tag names are also obvious candidates for translated names. The Strict examples in the content section above (legal, policy) will also tend to follow this naming pattern. In fact, translated names could likely be the default.
Free
To recap, "Free" cards are a group of cards that are NOT necessarily strict translations of each other but which fit into the same context on the site in different languages. In the case of compound cards (plus cards), we can create an A+B for every language that handles both A and B.
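The two naming principles (uniqueness per language, and A+B requiring A and B in the same language) can be sketched as a toy registry. The Deck class and its methods are hypothetical, universal names are modeled with a language of None, and the check only looks at simple name parts, which simplifies real compound-name handling.

```python
class Deck:
    """Toy name registry; universal names are stored with lang=None."""

    def __init__(self):
        self.names = set()  # (lang, name) pairs

    def exists(self, lang, name):
        # A universal name counts as existing in every language.
        return (lang, name) in self.names or (None, name) in self.names

    def add(self, lang, name):
        if self.exists(lang, name):
            raise ValueError(f"{name!r} already exists for language {lang!r}")
        parts = name.split("+")
        if len(parts) > 1:
            missing = [part for part in parts if not self.exists(lang, part)]
            if missing:
                raise ValueError(f"missing parts in {lang!r}: {missing}")
        self.names.add((lang, name))
```

Under these rules "Gift" can exist separately in English and German (a false cognate), and "Apple+Sobre" is allowed in Spanish only because "Apple" is universal and "Sobre" exists in Spanish.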
Data Representation
cards
We will need to make, at a minimum, this alteration to the cards table:
+ lang
+ translatee_id
The basic idea is that if a card is a translation of another card, it will have that card's id as its translatee_id. If not, it will have its own id as its translatee_id.
Note that we may need an additional field to flag out-of-date translations.
Representation Details
To spell it out a bit, here's how the data would look in the different patterns:
Universal: lang is null.
Monolingual/Mapped: lang is non-null. There is only one card per translatee_id. The complexity of the translation type of names composed of different types deserves its own subsection. (In the case of mapped plus cards, the cards in all the other languages are virtual.)
Strict/Free: lang is non-null. The first card is its own translatee; every additional translation has its own id but the same translatee_id.
Database references
Here are some representational invariants and constraints to help define the data structures:
Aldi, de, 3, 3, 1
select * from cards where translatee_id = 1
Name related constraints:
Question: is a lang going to be a card or a new model? How are these translation classes represented? Is it connected to the lang model and by extension a property of the name and content? Maybe it just isn't represented at this level except by lang = NULL.
English in the Codebase
This proposal does not give much attention to handling multiple languages in hard-coded content, like text on buttons, error messages, etc. That’s largely because it’s a very common problem with lots of helpful libraries / methodologies. My expectation is that we’ll roughly follow the pattern of localization files, and that we’ll do this via “coded card content”, in which code is connected to a card via a codename. This approach is nice because it avoids having to do code migrations with each update, but allows room for wagneers to break away from the codebase if they so desire by removing the card’s codename.
Singularization
Currently Wagn handles singularizing card names with an English-specific algorithm, but this algorithm applies to all cardnames regardless of their language. In the new system, we can introduce Language Specific Key Generation.
Implementation
The proposal as it stands would involve a lot of work, including reworking:
- the database
- name processing
- routing
- WQL
- caching
- inclusion processing
- links
- a little bit of everything else
Existing Wagns
We would probably need to make a configuration option to allow existing (and future) wagns to remain monolingual, but we also need an upgrade path for those wishing to make use of the functionality. The good news is the data migration can be kept very simple.
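The lang and translatee_id fields described under Data Representation can be tried out with an in-memory SQLite table. This is a sketch under assumptions: the real cards table has many more columns, and the sample rows and the helper name are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cards ("
    "id INTEGER PRIMARY KEY, name TEXT, lang TEXT, translatee_id INTEGER)"
)
conn.executemany(
    "INSERT INTO cards VALUES (?, ?, ?, ?)",
    [
        (1, "About", "en", 1),  # strict: the first card is its own translatee
        (2, "Sobre", "es", 1),  # strict: own id, shared translatee_id
        (3, "2013", None, 3),   # universal: lang is NULL
    ],
)


def translations_of(translatee_id):
    """Return (name, lang) for every card sharing the given translatee_id."""
    cursor = conn.execute(
        "SELECT name, lang FROM cards WHERE translatee_id = ? ORDER BY id",
        (translatee_id,),
    )
    return cursor.fetchall()
```

The query inside translations_of has the same shape as the select statement in this section: all variants of a card are found by a single lookup on translatee_id.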
On wikirate.org, there are Company cards, each of which has a +About section.
This is represented as follows:
(nt = name translation, ct = content translation)
- Apple (Company). nt = universal, ct = mapped (all structured cards are mapped, regardless of the ct rule)
- Company+*type+*structure. nt = strict, ct = monolingual -> {{+About}}
Let's say that a content contributor decides to use the Spanish word "Sobre" for the "About" section of company cards. This means the simple card "About" should have a Spanish variant where its name is translated to "Sobre". That’s how Wagn knows to translate one to the other.
- About / Sobre: nt = strict, ct = strict
The basic idea here is that when Spanish-speaking users view Apple, they will see +Sobre inside, and that will be a different card from +About.
- Apple+About, Apple+Sobre, nt = strict, ct = free
Notice that in the above we have universal, monolingual, strict, free, and mapped all represented. It takes all five to get this to work.
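To show how a card in this example might end up with its content translation value, here is a sketch of rule lookup from most to least specific set. The rule values are taken from the sample rules in this proposal, but the dictionary layout, the function name, and the assumption that *right rules take precedence over *type rules are illustrative only.

```python
# Hypothetical in-memory form of some "*content translation" rules.
CONTENT_RULES = {
    "right": {"*structure": "monolingual", "*default": "monolingual", "about": "free"},
    "type": {"Number": "universal", "Pointer": "mapped"},
    "all": "strict",
}


def content_translation(card_type, right_name=None):
    """Return the first matching pattern, checking *right, then *type, then *all."""
    if right_name is not None and right_name in CONTENT_RULES["right"]:
        return CONTENT_RULES["right"][right_name]
    if card_type in CONTENT_RULES["type"]:
        return CONTENT_RULES["type"][card_type]
    return CONTENT_RULES["all"]
```

Under these assumptions Apple+About resolves to free through its right name, while a plain card with no more specific rule falls back to strict.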
Issues needing further attention:
- In the case of “Strict”, this proposal assumes that a given word can be translated between two languages in a way that makes sense throughout the site. This may be ok on Wikirate, but in general it’s a fairly unsafe assumption. Most likely the problem would be solved by addressing the next problem:
- At present, we commonly set titles inside nest syntax, eg {{mycard|title: my special name}}. This proposal doesn’t yet have any suggestions for how to make this multilingual, but this will be needed.
- Every example here uses European languages that use roughly the same alphabet, read from left to right, etc. It’s highly probable that this has led to oversights.
- There’s no real specification of how the flagging of out-of-date translations works for strict translations.
- The proposal does not yet explore the different combinations of name / content translation patterns in sufficient depth.
talked about this briefly in the wagneer group: http://groups.google.com/group/wagneers/browse_thread/thread/85563e00c682036c/886f9eb66fd155c2?lnk=gst&q=i18n#886f9eb66fd155c2
Found ICU and ruby bindings
--Gerry Gleason.....Tue Nov 10 14:11:34 -0800 2009
Doesn't look like ruby bindings are current. Still looking into it.
--Gerry Gleason.....Wed Nov 11 08:30:35 -0800 2009
Awesome. thanks, gerry!
--Ethan McCutchen.....Wed Nov 11 10:35:13 -0800 2009
I suspect that this should really be a tag (i18n) instead of a blueprint, but I'll leave it here until there is time to re-organize all the info above.
--Ethan McCutchen.....2013-02-25 16:56:21 +0000
I need to add a lot more examples...
Playing with more general character classes: https://gist.github.com/GerryG/5f2993f262fbe14f57f2
I also updated the link in the old discussion for that ICU library for ruby.
Cool proposal, questions coming up as I read it. There is some footprint of this in the cardnames, and we should be able to translate the user presentation of a lot of system features by just having cardnames for the same card. Note Numeric Name Parts, which would make the key representation Universal for some name parts, whether or not there are existing names in different languages.
I think some of your monolingual examples could be Strict or maybe that is Mapped. The codenames will be mono-lingual, but multiple names and sometimes content would allow *create and *read to just have translated names doing most of the work. You seem to be connecting Strict with contractual things, but it can be used with functional things too. Maybe there is space between Strict and Free and the tools you envision to maintain strict translations can instead tell you how much in sync the different versions are and which ones are authoritative. If more than one is considered authoritative, they shouldn't contradict, they should convey as close as possible the same meaning.
Are Strict and Mapped related? Maybe one is just more specific, the other a subset of the other, functionally. You'll desire more translations for a mapped card, and will want to translate updates, but would be more tolerant of being temporarily out of sync. Would you want to list required languages in a rule or something for Strict cards?
Good points about numeric name parts. I like that idea a lot. (Will follow up there at some point)
Strict and Mapped are quite different. Strict refers to *actual* translations, where Mapped refers to virtual translations. You would not want to store mapped translations in the database, but strict translations *must* be stored there. So, no, I wouldn't consider one a subset of the other.
"You seem to be connecting Strict with contractual things, but it can be used with functional things too". I want very much to avoid that. Wherever possible, functional things should only be represented once and then translated automatically.
I agree that some of the Monolingual examples could be Mapped. The most common case for Mapped is Pointers (cards which are entirely comprised of mapped references), and several of the rules I mentioned as Monolingual candidates (eg permissions) are pointers and thus probably more naturally mapped. *structure rules are a little more ambiguous, because it's possible to put non-referential natural language content in them, but in general that's going to be a poor choice in a multilingual context, so they, too, may make sense to treat as Mapped in multilingual sites.
I also resonate with your thoughts about the strict translation having a canonical version. Actually, I kind of think all strict translations will need to set one of the versions as canonical and then update from there, though we will want to be able to change which is canonical. The data representation embraces this. But I would say that the idea that translations "shouldn't contradict" isn't really "between Strict and Free"; that's just Strict. That's pretty much how I would define it.
All I'm saying is that there is a spectrum of both translation quality and synchronization. Perfection isn't really an option, but "strict" will represent a place pretty close to that end of the scale, but lots of times things will be in flux. I'm saying metrics relating to updates (sync) and quality (high standard of strictness) will be good to have for any community target required.