internationalization

Languages and people interested in translating:

 

 

 

 

Blueprint:

The proposed multilingual functionality is intended to be a general solution applicable to all Wagns.   However, it has been inspired by Wikirate.org, and many of the examples below will borrow from that site's structure. 

 

The proposal seeks to honor these constraints/considerations:

  1. everything is a card.
  2. a user should only see content in a language that he/she understands unless there are explicit instructions to break that pattern.
  3. a user who understands multiple languages has the potential to play a special role in an international community: translator.

The essence of the proposal is that there will be two new settings: *name translation and *content translation.  Rules based on these settings will initially support these values: universal, monolingual, strict, free, and mapped.

 

Anyone reading this is warmly invited to contribute use cases to explore how they may be addressed in the proposed system.

 

Standard practice among wiki communities, most notably Wikipedia, is to have separate sites for different languages.  There is an en.wikipedia.org and an nl.wikipedia.org, and while there are some connections between them (same technology, use of Wikidata, shared governance...), they are, by and large, separate projects with separate communities.

 

A Wikipedia-style solution will not work for many Wagn sites, many of which are intended to support one set of unified data in multiple languages.  For example, consider Wikirate.org.  As the name implies, quantitative data is quite central to WikiRate, and rich, nuanced interactions with quantitative data are core to what we are trying to achieve.  Many of those numbers (eg transparency scores and voting that feeds into them) are based on micro-interactions on the site, and our vision of transparency for those numbers depends on our being able to see exactly where they all come from.


All of this means implementing WikiRate's vision on separate sites, as Wikipedia does, would not work.  We do not want companies to score differently in different languages, nor do we want to multiply a company's reporting overhead by asking them to respond to the same questions on multiple sites.   We don't want transparency scores to be based on separate sites that users can't see.  Perhaps most importantly, we want the world-wide community of WikiRate users to be able to speak with a unified voice in pushing to "make companies clear" and to enjoy facing the cross-cultural challenges of working on this together.

 

 

Current problems / bugs:

 

The following issues are problematic even in monolingual (non-English) contexts:

  1. lots of hard-coded English, no default content for other languages.
  2. the current name key mechanism is based on English pluralization
  3. International content basically works, but sometimes (as in inclusions and probably links) content breaks, largely because Unicode characters get treated as HTML entities.  We may be able to change this behavior by adjusting this TinyMCE setting (they've changed their docs infrastructure, so this probably needs a new link):

 

 

This proposal assumes the need to support 5 different language patterns, currently named Universal, Monolingual, Strict, Free, and Mapped.

 

Note that nomenclature is NOT central to this proposal.  Please feel free to suggest new names for these patterns (after reading how we define them, of course!)

 

Universal.  

 

Some kinds of data should not be treated like natural language.  Numbers and programming languages fit this pattern.

 

Examples 

 

Numbers, Dates, etc:  units and things need to be translated, and dates may be presented in different languages, but generally what we're going to store is going to be one universal numeric notation that can be easily translated automatically; it's not something users should ever have to bother with editing multiple versions of.

 

Javascript, CSS, etc: you definitely don't want to handle separate copies of code cards for separate languages.  This is placed in "Universal," because, by definition, computer languages are not natural languages, and their linguistic differences are already addressed by giving them separate cardtypes.



Monolingual

 

Monolingual content is clearly associated with a specific language, but there should not be different versions of this content for different languages.

 

Examples

 

structure rules (cards ending in +*structure): When you update a structure rule (a content pattern), you don't want to have to update it in several languages, and you almost certainly don't want the case where your page has different structure in different languages (with different CSS to maintain, etc.   YUCK!)  What you want is to update the structure in one language but have the change take effect in all languages.  For example, if you add {{+age}} to a structure rule, and you already have a Spanish variant of the "age" card, then the translation should be handled in the nest processing, not with translated structures.  Similar logic applies to many rules (permissions, default, style, etc), but not all.  *table of contents is a number, for example, so that follows the "universal" pattern.  And help text should be explicitly translated.

 

 

Strict

 

A third kind of card should support (and encourage) translation.  The core idea of "strict" translation is that the name or content should have the same meaning in different languages.  This means that if you update one, you should update the others.  This implies a lot of special functionality to make it obvious when languages are not in sync.

 

Examples

 

Legal Documents: this is an obvious case where you don't want versions in different languages contradicting each other.  They need to say the same thing!

 

Cards about site policy: even without the involvement of formal law, the same idea holds true of site policy in general.



Free

 

Alternative cards will support variants in different languages where there is no expectation that content in any two languages will actually have the same meaning as each other.  In content terms, the different languages will essentially be independent.

 

Examples

 

+discussions.  Clearly we don't want to make people translate discussions into multiple languages, but a given card should be discussable in multiple languages.

 

+Articles, +About, etc : (as in the actual +Article cards on the Analysis pages).  If we end up having, say, 20 versions of the "Apple+Climate Change" card (as it would be called in English), they would all vary, probably by quite a bit.  While we'd love to have these alternative versions inspire each other, there's no reason why each version needs to be a direct translation of the others.  Each could still have citation visualizations, but the citations might come in different orders.  And, if/when we support upvotes on articles, those should apply only to a specific language (since one might be great and another might be rubbish).

 

 

Mapped

 

The basic concept here is that translation is performed automatically on Wagn based on how names of other cards are translated.

 

Examples

 

Pointers (+topic, +company, etc.)  It would be highly inefficient to store pointers in multiple languages, because (as will be explained below), the translation is very easy to automate based on the cards being pointed to.  Note that the "+"  in the examples (+topic, +company) is very important here.  We're not talking about Topic cards, we're talking about +topic fields on, say, a Claim or a Source.  If a claim has been related to a topic in English, it should automatically be related to that topic in any other language.

 

 

Most readers and editors should never have to think about translation patterns.  They should simply see the languages they understand and not see languages they don’t, unless specifically requested.  They can create, read, update, and delete cards of any language, but everything should happen for them in their default language unless they choose otherwise.

 

So to begin, we have to know which languages the user understands / prefers.  We will have a first guess at this based on location.  But even passive users (those with no account) should easily be able to choose to view Wikirate in other languages. Language preference settings will be stored in sessions for these cases.  By default, the interface for setting the language will feature prominently atop every page of a multilingual site, and it will likely follow the flag-based convention. A user with an account will be able to store language preferences more permanently.

 

Users can determine not just a preferred language, but an ordered list of preferred languages.  For example, a user who prefers German but would like to see English where there is no German available would express preferences as a list of two languages: German, then English.
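
As a rough illustration of the ordered-preference idea, here is a minimal Python sketch. The function name pick_language and the two-letter language codes are assumptions made for the example, not part of the proposal.

```python
from typing import Optional


def pick_language(preferences: list[str], available: set[str]) -> Optional[str]:
    """Return the first preferred language that has content, or None.

    `preferences` is the user's ordered list (most preferred first);
    `available` is the set of languages in which the card has content.
    """
    for lang in preferences:
        if lang in available:
            return lang
    return None


# A user who prefers German but will accept English:
prefs = ["de", "en"]
print(pick_language(prefs, {"de", "en"}))  # German exists, so German wins
print(pick_language(prefs, {"en", "es"}))  # no German, falls back to English
print(pick_language(prefs, {"es"}))        # nothing the user understands
```

A None result corresponds to the "tricky" case of a card that exists but has no content in any language the user asked for.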

 

For each language pattern, the handling is fairly obvious when either (a) content exists for a preferred language or (b) the card doesn’t exist at all.  When it exists, we show it!  When it doesn’t, it acts largely like a missing card does now (except with locale-specific messaging).

 

The tricky part is figuring out what happens when a card exists, but content doesn’t exist in the requested language.  

 

Universal

Not applicable.  By definition, if the card exists and follows the universal pattern, then the content exists.

 

Monolingual

In general, a request for a monolingual card will be treated as an “explicit” request for the card’s original language, and the card will be returned in that language.  As designed, this should not be a frequent occurrence.

 

Strict

This will depend somewhat on the view / context.  If the card is the main card on the page, we will probably want to (a) indicate that the card is not available in the current language, (b) offer the user the opportunity to translate it, and (c) offer assistance in the form of automated translation (hereafter we'll refer to this as "google translation" to avoid confusion with mapping, but we won't be tightly bound to Google).  Google translation will only be used in this context: as an editor’s aid, not as a producer of final content for readers.

 

In other contexts we will want to show something much less obtrusive, but that leaves the opportunity to navigate to the interface described above.

 

Note that there will need to be analogous views for cards where the translation exists but may be out of date.

 

Free

Free cards will be similar to strict, but (a) the google translation is offered more as inspiration and may be discarded, and (b) there is no concern about out-of-date translations.

 

Mapped

In general, "mapped" content will be content containing translated names.  When those names are not translated, the handling will largely be governed by the patterns outlined for the type of the named card (see above).
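
To make the mapping idea concrete, here is a small Python sketch of rendering a pointer in a reader's language. It assumes (hypothetically) that a pointer stores language-neutral ids and that a NAMES table holds each card's name per language; the ids, names, and the fallback choice are all invented for the example.

```python
# Hypothetical name table: card id -> {language: name}.
NAMES = {
    10: {"en": "Climate Change", "de": "Klimawandel"},
    11: {"en": "Water", "de": "Wasser"},
    12: {"en": "Labor"},  # no German name yet
}


def render_pointer(item_ids, preferences):
    """Return (name, language) for each pointer item.

    The pointer itself is stored once; each item's name is looked up in
    the first preferred language that has one.
    """
    rendered = []
    for item_id in item_ids:
        names = NAMES[item_id]
        lang = next((l for l in preferences if l in names), None)
        if lang is None:
            # Untranslated name. The real handling would follow the pattern
            # of the named card; this sketch just shows a stored name.
            lang = sorted(names)[0]
        rendered.append((names[lang], lang))
    return rendered


# A +topic pointer with two items, viewed by a German-only reader.
print(render_pointer([10, 12], ["de"]))
```

Returning the language alongside each name lets the caller notice which items were not available in the reader's language.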

 

 

Community Dynamics

 

For this approach to thrive on a community site like Wikirate.org, translation will have to become a central activity on the site. But on the positive side, it has the possibility of becoming a very enjoyable, low-barrier-to-entry activity. Those who are capable of translating (inferred from their knowledge of multiple languages) will be invited to engage in this activity.

 

Web addresses

 

Note that all web addresses will need to indicate the language requested, eg:

 

http://wikirate.org/en/Company

 

Note that the “en” here does not determine the language to be shown; it tells Wagn to interpret the word “Company” as an English word.  With this convention, if you have German preferences set, for example, Wagn will show you the German version of the Company card.   Without this convention, there is no handling of false cognates, and the same url may show unrelated cards to different people.
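
A minimal Python sketch of how such an address might be interpreted. The helper names, the NAME_INDEX table, and the "Gift" cards (an English/German false cognate) are assumptions for illustration only.

```python
from urllib.parse import unquote, urlparse


def parse_card_url(url):
    """Split /<lang>/<name> into (language of the name, card name)."""
    path = urlparse(url).path.strip("/")
    name_lang, sep, name = path.partition("/")
    if not sep or not name:
        raise ValueError(f"expected /<lang>/<name>, got {url!r}")
    return name_lang, unquote(name)


# Hypothetical index: (language of the name, name) -> card id.
# "Gift" means a present in English and poison in German.
NAME_INDEX = {("en", "Gift"): 1, ("de", "Gift"): 2}


def lookup(url):
    """Find the card a url refers to; display language is decided elsewhere."""
    return NAME_INDEX.get(parse_card_url(url))


print(parse_card_url("http://wikirate.org/en/Company"))
print(lookup("http://wikirate.org/en/Gift"), lookup("http://wikirate.org/de/Gift"))
```

The language prefix only disambiguates the name; which translation is shown would still be decided by the reader's preferences.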


We can put together a strategy for handling old links to make sure this doesn’t lead to lots of broken links, but I think this change to the RESTful web API is needed.

 

On the Wagneering level, the biggest addition necessary for all this is two new rule types (Settings) called *name translation and *content translation.

 

The rules would follow the standard Wagn rule pattern, and configuring the rule would mean applying one of the above five approaches (Universal, Monolingual, Strict, Free, or Mapped) to a Set of cards.

 

As a first (non-comprehensive) go at this, the rules might look something like (I've organized them in a concise way rather than writing out the full rule names, but it would be easy for any wagneer to translate these directly into rules)

 

Sample "*name translation" rules:

 

*all:  strict
*type:
  universal: Year

 

 

Sample "*content translation" rules:

 

*all:  strict
*type:
  universal: Javascript, Coffeescript, Number, Date, CSS, Company, User
  mapped: Pointer
*right:
  monolingual: *structure, *default
  free: discussion, about, article
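
To show how rules like these might be resolved for a given card, here is a hedged Python sketch. It assumes that a *right rule beats a *type rule, which beats *all (the usual narrower-set-wins ordering), and it compares names case-insensitively as a stand-in for Wagn's real key handling. The data structure and function names are invented for the example.

```python
# The sample "*content translation" rules, written as plain data.
CONTENT_RULES = {
    "all": "strict",
    "type": {
        "universal": ["Javascript", "Coffeescript", "Number", "Date",
                      "CSS", "Company", "User"],
        "mapped": ["Pointer"],
    },
    "right": {
        "monolingual": ["*structure", "*default"],
        "free": ["discussion", "about", "article"],
    },
}


def _index(section):
    """Invert {pattern: [names]} into {lowercased name: pattern}."""
    return {name.lower(): pattern
            for pattern, names in section.items()
            for name in names}


TYPE_INDEX = _index(CONTENT_RULES["type"])
RIGHT_INDEX = _index(CONTENT_RULES["right"])


def content_translation(type_name, right_name=None):
    """Return the content translation pattern for a card.

    Assumed precedence: *right rule, then *type rule, then *all.
    """
    if right_name is not None and right_name.lower() in RIGHT_INDEX:
        return RIGHT_INDEX[right_name.lower()]
    return TYPE_INDEX.get(type_name.lower(), CONTENT_RULES["all"])


print(content_translation("Pointer"))         # mapped
print(content_translation("Basic", "About"))  # free
print(content_translation("Basic"))           # strict
```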

 

References

 

On Wagn, "references" is a general term for links [[like me]] and nests/inclusions {{like me}}.  When handling a nest, there are two main language points to consider: (a) what language are we using when we refer to the card (what language is the name in), and (b) what language do we want to use when we show it?

 

language of card name in syntax

 

  1. You can specify the language of the cardname, perhaps like [[de: strasse]], {{de: strasse}}

  2. If you don't specify a language in the reference, it will be assumed that the reference is meant to be in the same language that is assigned to the current (nesting) card.

 

language of card as shown to user

  

  1. You can specify the language of the output, perhaps like {{strasse | lang:de}}

  2. If you don't specify the language of the output, it will be rendered in the preferred language of the user.
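
The two defaults can be sketched in Python as follows. The syntax itself is only a "perhaps like" suggestion above, so the regular expression, the parse_nest function, and the assumption that options are separated by ";" are all illustrative guesses rather than a specification. Only nests are handled here, not links.

```python
import re

# Matches {{name}}, {{de: name}}, and {{name | opt:val; opt:val}}.
# Caveat of this sketch: a name that itself starts with two lowercase
# letters and a colon would be misread as a language prefix.
NEST = re.compile(
    r"\{\{\s*(?:(?P<name_lang>[a-z]{2}):\s*)?"
    r"(?P<name>[^|}]+?)\s*"
    r"(?:\|(?P<opts>[^}]*))?\}\}"
)


def parse_nest(syntax, nesting_card_lang, user_lang):
    """Resolve the name language and output language of one nest."""
    match = NEST.fullmatch(syntax.strip())
    if match is None:
        raise ValueError(f"not a nest: {syntax!r}")
    opts = {}
    for part in (match.group("opts") or "").split(";"):
        key, sep, value = part.partition(":")
        if sep:
            opts[key.strip()] = value.strip()
    return {
        "name": match.group("name"),
        # Default: the language of the card doing the nesting.
        "name_lang": match.group("name_lang") or nesting_card_lang,
        # Default: the reader's preferred language.
        "output_lang": opts.get("lang", user_lang),
    }


print(parse_nest("{{de: strasse}}", nesting_card_lang="en", user_lang="es"))
print(parse_nest("{{strasse | lang:de}}", nesting_card_lang="en", user_lang="es"))
```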


At present, I’d imagine that the inclusion syntax itself, which is generally the domain of wagneers, is only implemented in English.

This proposal forces us to revisit some core Wagn principles around name handling.

 

Name Uniqueness

 

For example, it’s always been the case that, in Wagn, you can’t have two cards of the same name.  With this proposal, a name no longer has to be unique within a deck; it just has to be unique for a given language within a deck.  This accommodates the common case where the same word means different things in different languages (false cognates).

 

Compound Names

 

We’ve also long held this principle:

 

If there is A+B, there must be an A and a B.

 

Translating that into a multilingual environment looks like this:

 

For A+B in language Xese, we must have A in Xese and B in Xese.

 

The rest of this section reviews the translation patterns from a cardname perspective (ie, the consequences of *name translation rules).



Universal and Monolingual

 

In the context of the A+B discussion above, if A is universal, it exists in all languages, but if A is monolingual it only exists in one.
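
Here is a minimal Python sketch of the multilingual A+B rule, including the universal case. It assumes simple names are tracked per language, with universal names filed under None; the function name and the sample data are hypothetical.

```python
def compound_can_exist(compound, lang, simple_names):
    """Check "for A+B in language X, we must have A in X and B in X".

    `simple_names` maps a language code to the set of simple names known
    in that language; universal names live under the key None and count
    as existing in every language.
    """
    known = simple_names.get(lang, set()) | simple_names.get(None, set())
    return all(part in known for part in compound.split("+"))


names = {
    None: {"Apple"},   # universal name
    "en": {"About"},
    "es": {"Sobre"},
}

print(compound_can_exist("Apple+About", "en", names))  # True
print(compound_can_exist("Apple+Sobre", "es", names))  # True
print(compound_can_exist("Apple+About", "es", names))  # False: no "About" in Spanish
```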



Examples

 

User cards will have Universal names.   Assuming that they're structured cards (as they are on Wikirate), this only really means that Users shouldn't have different names in different languages.  This may be counter-intuitive, because clearly “Richard Mills” is not a Spanish name, but by making user names universal, we’re saying it’s valid to say “Richard Mills es muy guapo”.  In other words, you can use this name even in a Spanish context.

 

I would expect that Company names will probably be treated as universal by default, but there are clearly cases where they have different names in different languages, so we may need some special handling for this, too.

 

Strict

 

Strict names follow strict content patterns very closely.  If one name is translated from another, then a "rename" risks leaving related names mistranslated and must be addressed.  But, since name and content are different, these may need some separate handling.

 

 

Examples

Claims: the "main claim" (the 100-character-or-fewer statement) is stored as a card name.  A claim clearly follows the Strict pattern as described in the content section. Given that they acquire votes, it's important that two versions of the same claim in two languages *actually* be making the same claim.  A significant mistranslation could really screw up the voting.  It's also important that they follow the naming pattern described above: a claim's value can be measured in part by how often it's cited, and we want to be able to measure this across languages.

 

Topic names and Tag names are also obvious candidates for translated names.  The Strict examples in the content section above (legal, policy) will also tend to follow this naming pattern.  In fact, translated names could likely be the default.



Free

 

To recap, "Free" cards are a group of cards that are NOT necessarily strict translations of each other but which fit into the same context on the site in different languages.

 

In the case of compound cards (plus cards), we can create an A+B for every language that handles both A and B.

 

 

Data Representation

 

We will need to make, at a minimum, this alteration to the cards table:

cards

+ lang

+ translatee_id

 

The basic idea is that if a card is a translation of another card, it will have that card's id as its translatee_id.  If not, it will have its own id as its translatee_id.

 

To spell it out a bit, here's how the data would look in the five patterns:

 

  • Universal: lang is null

  • Monolingual/Mapped: lang is non-null. There is only one card per translatee_id. The complexity of the translation type of names composed of different types deserves its own subsection (in the case of mapped plus cards, cards in all the other languages are virtual).

  • Strict/Free: lang is non-null. The first card is its own translatee; every additional translation has the same translatee_id.
    Note that we may need an additional field to flag out-of-date translations.

 

Representation Details

Here are some representational invariants and constraints to help define the data structures:

  • translatee_id is not null
  • if lang is null, there are no other translators (besides itself)
  • translatee_id is used whenever we need the card with all its translations (maybe there should be a language type "General" for when the language is not (yet) specified, which could be used in in-memory Card objects but not in the cards table)
  • language is unique within the scope of a translatee_id
  • no chains, the translatee card is always its own translatee
  • type_id, left_id, and right_id are considered General, in that each will always be the translatee_id of the referenced card (eg, of the type card).
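
Several of the invariants above can be checked mechanically. The Python sketch below does so over in-memory rows; the Card tuple and check_invariants function are illustrative assumptions and only cover the translatee_id and lang columns.

```python
from collections import namedtuple

# Only the columns relevant to the invariants.
Card = namedtuple("Card", "id name lang translatee_id")


def check_invariants(cards):
    """Return a list of human-readable invariant violations."""
    by_id = {card.id: card for card in cards}
    problems = []
    groups = {}
    for card in cards:
        if card.translatee_id is None:
            problems.append(f"{card.name}: translatee_id is null")
            continue
        original = by_id.get(card.translatee_id)
        # No chains: the translatee must be its own translatee.
        if original is None or original.translatee_id != original.id:
            problems.append(f"{card.name}: translatee is not its own translatee")
        groups.setdefault(card.translatee_id, []).append(card)
    for translatee_id, members in groups.items():
        langs = [member.lang for member in members]
        # lang is unique within the scope of a translatee_id.
        if len(set(langs)) != len(langs):
            problems.append(f"group {translatee_id}: duplicate language")
        # A card with a null lang (universal) has no other translations.
        if None in langs and len(members) > 1:
            problems.append(f"group {translatee_id}: universal card has translations")
    return problems


good = [Card(1, "Company", "en", 1), Card(2, "Firma", "de", 1), Card(3, "Aldi", "de", 3)]
print(check_invariants(good))
# A translation of a translation breaks the "no chains" rule:
print(check_invariants(good + [Card(4, "Société", "fr", 2)]))
```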

Name related constraints:

  • We have A+B exists => A exists and B exists.  This has implications for languages where the card with name and left, right, type references doesn't exist for a particular language and we do have translations for all parts of the name (and therefore the whole name).
  • Such a card will exist, but there might not be content for the requested language.  The UI has to deal with this: representationally, this is a card that exists and has content, but no translated content. The developer interface for UI code will signal this state, and the UI handles it per mode.

Question: is a lang going to be a card or a new model?

How are these translation classes represented?  Is it connected to the lang model and by extension a property of the name and content?  Maybe it just isn't represented at this level except by lang = NULL

Database references

 

For starters, I would propose these principles:
 
(assumes that translatee is the original and can have many translators)
  • never use the id of a translator card in a type_id, left_id, or right_id reference. Always use the original/translatee
  • however, do use the id of the specific card (whether translator or translatee) in card_references
 
Why the difference?  A link in a card can actually be pointing to one language or another.  The other fields are still referring specifically to the same group of cards, and it's cleanest to identify those with a single id.
 
Let's dig into the language on links more.  Presumably, the language of a cardname in an inclusion or link is tied up generally with the interpretation of contextual names.  It will look up the names in the language context of the content card, which doesn't have to be the translatee.  If it hits within the content language, then it will have that as its language and the translation class from the card it finds.  *structure rules will take some more thought.  My Universal class User card should have a *structure content that is language specific.
 
Let's play with a case or two, assuming for now that all applicable language rules are set to "strict" (ie, translatable). Say you first created a Company type in English and then made a translation to Firma in German.
 
name, lang, id, translatee_id
Company, en, 1, 1
Firma, de, 2, 1
 
Then you created a company in German (assume for now that company names are language specific)
 
name, lang, id, translatee_id, type_id
Aldi, de, 3, 3, 1
 
That final "1" is the thing to notice.  It's the id of the original type card -- not the translation.
 
 
You might also find it weird that in "original" cards, I use the id (rather than null) as the translatee_id.  This is debatable, but I like the idea that you can just do this: 
select * from cards where translatee_id = 1 
...to get all the name variants (and not need special handling for the original in every query).  I suspect you can use that as a subquery in WQL's reference handling  to distinguish between language-specific and non-language-specific needs.
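
To try this out, here is a small Python sketch using an in-memory SQLite database loaded with the Company / Firma / Aldi rows above. The table is cut down to the columns in the example, the type_id of the Company card is left null because it isn't given above, and the type_name helper is a hypothetical illustration of resolving a type's name in a given language.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table cards ("
    " id integer primary key, name text, lang text,"
    " translatee_id integer not null, type_id integer)"
)
conn.executemany(
    "insert into cards (id, name, lang, translatee_id, type_id) values (?, ?, ?, ?, ?)",
    [
        (1, "Company", "en", 1, None),  # original: its own translatee
        (2, "Firma", "de", 1, None),    # German translation of card 1
        (3, "Aldi", "de", 3, 1),        # type_id points at the original type card
    ],
)

# All name variants of the Company type, original included.
variants = conn.execute(
    "select name, lang from cards where translatee_id = 1 order by id"
).fetchall()
print(variants)


def type_name(card_name, lang):
    """Name of a card's type in the given language, or None if untranslated."""
    row = conn.execute(
        "select t.name from cards c"
        " join cards t on t.translatee_id = c.type_id"
        " where c.name = ? and t.lang = ?",
        (card_name, lang),
    ).fetchone()
    return row[0] if row else None


print(type_name("Aldi", "de"), type_name("Aldi", "en"))
```

Because the original carries its own id as translatee_id, the join needs no special case for originals.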
 
 
~~~~~~~
 
So far, this seems fairly plausible, and I'm starting to feel like I may need to thank you not only for not killing me, but also for saving us a hell of a lot of work.  But before I get prematurely exuberant, I'd say the next step would be to go through a bunch of sample wikirate cards and figure out how we'd represent both the rules and the data.
 
 

English in the Codebase

 

This proposal does not give much attention to handling multiple languages in hard-coded content, like text on buttons, error messages, etc.  That’s largely because it’s a very common problem with lots of helpful libraries / methodologies.

 

My expectation is that we’ll roughly follow the pattern of localization files, and that we’ll do this via “coded card content”, in which code is connected to a card via a codename.  This approach is nice because it avoids having to do code migrations with each update, but allows room for wagneers to break away from the codebase if they so desire by removing the card’s codename.

 

Singularization

 

Currently Wagn handles singularizing card names with an English-specific algorithm, but this algorithm applies to all cardnames regardless of their language.  In the new system, we can introduce Language Specific Key Generation.
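
A rough Python sketch of what language-specific key generation could look like. The singularizer here is deliberately naive and is not Wagn's actual algorithm; the registry, function names, and the simplified key format (lowercased words joined by underscores) are assumptions for illustration.

```python
def singularize_en(word):
    """A deliberately naive English singularizer (illustration only)."""
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


# Hypothetical registry: languages without an entry get no singularization.
SINGULARIZERS = {"en": singularize_en}


def name_key(name, lang):
    """Build a simplified key using the rules of the name's own language."""
    singularize = SINGULARIZERS.get(lang, lambda word: word)
    return "_".join(singularize(word) for word in name.lower().split())


print(name_key("Companies", "en"))        # company
print(name_key("Climate Changes", "en"))  # climate_change
print(name_key("Haus", "de"))             # haus
print(name_key("Haus", "en"))             # hau: what English rules do to a German word
```

The last line shows the current problem: applying English rules to the German word "Haus" mangles it.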

 

Implementation

 

The proposal as it stands would involve a lot of work, including reworking:

 

  • the database

  • name processing

  • routing

  • WQL

  • caching

  • inclusion processing

  • links

  • a little bit of everything else

 

Existing Wagns

 

We would probably need to make a configuration option to allow existing (and future) wagns to remain monolingual, but we also need an upgrade path for those wishing to make use of the functionality. The good news is the data migration can be kept very simple.

On wikirate.org, there are Company cards, each of which has a +About section.

 

This is represented as follows:

(nt = name translation, ct = content translation)

  • Apple (Company).  nt = universal, ct =  mapped (all structured cards are mapped, regardless of the ct rule)
  • Company+*type+*structure.  nt = strict, ct = monolingual -> {{+About}}

Let's say that a content contributor decides to use the Spanish word "Sobre" for the About section of company cards.  This means the simple card "About" should have a Spanish variant where its name is translated to "Sobre".  That’s how Wagn knows to translate one to the other.

  • About / Sobre: nt = strict, ct = strict

The basic idea here is that when Spanish-speaking users view Apple, they will see +Sobre inside, and that will be a different card from +About.

  • Apple+About, Apple+Sobre, nt = strict, ct = free

Notice that in the above we have universal, monolingual, strict, free, and mapped all represented.  It takes all five to get this to work.

 

Issues needing further attention:

  1. In the case of “Strict”, this proposal assumes that a given word can be translated between two languages in a way that makes sense throughout the site.  This may be ok on Wikirate, but in general it’s a fairly unsafe assumption.  Most likely the problem would be solved by addressing the next problem:
  2. At present, we commonly set titles inside nest syntax, eg {{mycard|title: my special name}}.  This proposal doesn’t yet have any suggestions for how to make this multilingual, but this will be needed.
  3. Every example here uses European languages that use roughly the same alphabet, read from left to right, etc.  It’s highly probable that this has led to oversights.
  4. There’s no real specification of how the flagging of out-of-date translations works for strict translations.
  5. The proposal does not yet explore the different combinations of name / content translation patterns in enough depth.

 

 
 

talked about this briefly in the wagneer group: http://groups.google.com/group/wagneers/browse_thread/thread/85563e00c682036c/886f9eb66fd155c2?lnk=gst&q=i18n#886f9eb66fd155c2


Found ICU and ruby bindings

  --Gerry Gleason.....Tue Nov 10 14:11:34 -0800 2009


Doesn't look like ruby bindings are current. Still looking into it.

  --Gerry Gleason.....Wed Nov 11 08:30:35 -0800 2009


Awesome. thanks, gerry!

  --Ethan McCutchen.....Wed Nov 11 10:35:13 -0800 2009


I suspect that this should really be a tag (i18n) instead of a blueprint, but I'll leave it here until there is time to re-organize all the info above.

  --Ethan McCutchen.....2013-02-25 16:56:21 +0000


I need to add a lot more examples...

--Ethan McCutchen.....2014-09-23 04:17:55 +0000

Playing with more general character classes: https://gist.github.com/GerryG/5f2993f262fbe14f57f2

I also updated the link in the old discussion for that ICU library for ruby.

--Gerry Gleason.....2014-12-31 20:16:28 +0000

Cool proposal, questions coming up as I read it. There is some footprint of this in the cardnames, and we should be able to translate the user presentation of a lot of system features by just having cardnames for the same card. Note Numeric Name Parts, which would make the key representation Universal for some name parts, whether or not there are existing names in different languages.

 

I think some of your monolingual examples could be Strict or maybe that is Mapped. The codenames will be mono-lingual, but multiple names and sometimes content would allow *create and *read to just have translated names doing most of the work. You seem to be connecting Strict with contractual things, but it can be used with functional things too. Maybe there is space between Strict and Free and the tools you envision to maintain strict translations can instead tell you how much in sync the different versions are and which ones are authoritative. If more than one is considered authoritative, they shouldn't contradict, they should convey as close as possible the same meaning.

 

Are Strict and Mapped related? Maybe one is just more specific, the other a subset of other, functionally. You'll desire more translations for a mapped card, and will want to translate updates, but would be more tolerant of being temporarily out of sync. Would you want to list required languages in a rule or something for Strict cards?

--Gerry Gleason.....2014-12-31 21:22:54 +0000

Good points about numeric name parts. I like that idea a lot. (Will follow up there at some point)

 

Strict and Mapped are quite different. Strict refers to *actual* translations, where Mapped refers to virtual translations. You would not want to store mapped translations in the database, but strict translations *must* be stored there. So, no, I wouldn't consider one a subset of the other.

 

"You seem to be connecting Strict with contractual things, but it can be used with functional things too". I want very much to avoid that. Wherever possible, functional things should only be represented once and then translated automatically.

 

I agree that some of the Monolingual examples could be Mapped. The most common case for Mapped is Pointers (cards which are entirely comprised of mapped references), and several of the rules I mentioned as Monolingual candidates (eg permissions) are pointers and thus probably more naturally mapped. *structure rules are a little more ambiguous, because it's possible to put non-referential natural language content in them, but in general that's going to be a poor choice in a multilingual context, so they, too, may make sense to treat as Mapped in multilingual sites.

 

I also resonate with your thoughts about the strict translation having a canonical version. Actually, I kind of think all strict translations will need to set one of the versions as canonical and then update from there, though we will want to be able to change which is canonical. The data representation embraces this. But I would say that the idea that translations "shouldn't contradict" isn't really "between Strict and Free"; that's just Strict. That's pretty much how I would define it.

 

--Ethan McCutchen.....2014-12-31 22:35:55 +0000

All I'm saying is that there is a spectrum of both translation quality and synchronization. Perfection isn't really an option, but "strict" will represent a place pretty close to that end of the scale, but lots of times things will be in flux. I'm saying metrics relating to updates (sync) and quality (high standard of strictness) will be good to have for any community target required.

--Gerry Gleason.....2015-01-05 23:21:46 +0000
 
Also see:

Development Tickets (by status)

 

Ideas

 

Documentation Tickets

 

Support Tickets

 

 


 

I believe there are at least 3 directions:

  1. localization of user interface: views, editor, etc.
  2. i18n of internal representation of cards
  3. national translations of documentation.

  --Mike_Shock.....Sun Mar 20 23:56:53 -0700 2011


Some of 1, and all of 2, is technically possible now. It will be necessary to set up a new Wagn for each new language, and then code things so that people starting up a new Wagn can choose which language they want to pull their initial content from.

 

3 would be great. I guess we could do that right here on wagn.org

 

I'll start a conversation on all this in a few days with a blog post, and on the Wagneers mailing list (I just posted about Default cards for new Wagns, which I thought would be good context).

  --John Abbe.....Mon Mar 21 12:32:18 -0700 2011


1. For me, the most necessary thing is the ability to localize the user interface in config/locales/en.yml --> xx.yml

I guess this can be done with minimum efforts. :-)

After this sites on WagN will get a locale-friendly look: buttons, links, messages, editor, etc. in the native language.

That's enough for an ordinary user to feel comfortable.

2. It'll take quite a long preparation time...

3. I plan to start translating main WagN documentation into Russian here.

  --Mike_Shock.....Mon Mar 21 22:53:08 -0700 2011


For 1, our plan is to move all of the interface text that users see into cards, which will make it possible for more people to help with translating. See move cue text to editable cards.

  --John Abbe.....Wed Mar 23 08:53:22 -0700 2011


Is there anywhere I can sign up for translating? Would love to volunteer - even though all I could offer is :de

  -- (Not signed in).....Fri Mar 25 23:47:07 -0700 2011


Great! For now we can just use this page to track who's volunteering for what languages. You may want to register for an account here - just click on the "Sign Up" link in the top right corner.

  --John Abbe.....Sat Mar 26 00:29:54 -0700 2011


I can translate into Russian. Where should I start from?

  --Mike_Shock.....Tue Mar 29 04:35:41 -0700 2011
