Searching for Japanese (possibly other CJK languages?) broken

The URL in both cases appears to be generated identically:

:search?_keyword=%E3%83%86%E3%82%B9%E3%83%88

--Mitchell Im.....2013-08-18 21:43:53 +0000

Thanks for submitting this, Mitchell. I'll try to get this patched in the next point release. And thanks for using kana, which I mostly remember. I've pretty much forgotten every kanji I ever learned :)

--Ethan McCutchen.....2013-08-19 16:21:32 +0000

Mitchell, how technical are you? Would you be able to access the postgresql command line interface and run some queries? I'd have to do a little work to get a test instance up with postgres, which I'm willing to do, but I can't really dig into that right this moment. Things might move faster if you're willing to run some tests.

Here is the SQL wagn is currently generating, which appears to work in mysql:

select t.name from cards t JOIN card_revisions AS t_rev ON t.current_revision_id = t_rev.id where ((replace(t.name,'+',' ') REGEXP '[[:<:]]テスト[[:>:]]' OR t_rev.content REGEXP '[[:<:]]テスト[[:>:]]')) and t.trash is false ORDER BY t.updated_at desc

To strip that down to the meaningful parts, we could try:

select t.name from cards t JOIN card_revisions AS t_rev ON t.current_revision_id = t_rev.id where ((replace(t.name,'+',' ') REGEXP '[[:<:]]テスト[[:>:]]' OR t_rev.content REGEXP '[[:<:]]テスト[[:>:]]'))

I'm guessing that doesn't work for you (but if it does, that will tell us something).

Then we might test the name and the content independently (to do so, make sure you have a card with テスト in the name and another with テスト in the content.

name:

select t.name from cards t where replace(t.name,'+',' ') REGEXP '[[:<:]]テスト[[:>:]]';

content:

select t.name from cards t JOIN card_revisions AS t_rev ON t.current_revision_id = t_rev.id where t_rev.content REGEXP '[[:<:]]テスト[[:>:]]';

if none of those return anything we may have a low-level SQL issue. If they all return something then we have a higher-level issue. If some return and others don't... well, we'll figure it out :)

--Ethan McCutchen.....2013-08-19 20:49:44 +0000

I'm as technical as you need me to be. ;) (I've got experience in Perl, Python, tcl, C/C++, Java, C#, SQL, although not much in Ruby.)

PostgreSQL doesn't support the REGEXP function, so I've done some minor rewrites on the SQL queries as follows:

Results:

select t.name from cards t JOIN card_revisions AS t_rev ON t.current_revision_id = t_rev.id where ((replace(t.name,'+',' ') ~ ':<:テスト:>:' OR t_rev.content ~ ':<:テスト:>:')) and t.trash is false ORDER BY t.updated_at desc;

 name 
------
(0 rows)

select t.name from cards t JOIN card_revisions AS t_rev ON t.current_revision_id = t_rev.id where ((replace(t.name,'+',' ') ~ ':<:テスト:>:' OR t_rev.content ~':<:テスト:>:'));

 name 
------
(0 rows)

select t.name from cards t where replace(t.name,'+',' ') ~ ':<:テスト:>:';

 name 
------
(0 rows)

select t.name from cards t JOIN card_revisions AS t_rev ON t.current_revision_id = t_rev.id where t_rev.content ~ ':<:テスト:>:';

 name 
------
(0 rows)

Unfortunately, it looks like PostgreSQL is just flat-out confused with Japanese and regular expressions. Even dropping the replace doesn't work:

select t.name from cards t where t.name ~ ':<:テスト:>:';

 name 
------
(0 rows)

Dropping the word boundaries does work, but that would (naturally) have its own drawbacks:

select t.name from cards t where t.name ~ 'テスト';

  name  
--------
 テスト
(1 row)

And digging a bit further, apparently the PostgreSQL engine just doesn't play nice reliably with Unicode:

http://postgresql.1045698.n5.nabble.com/Future-of-our-regular-expression-code-td5495667.html

So... I'm not sure what the recommended direction would be at this point. I've been trying to run PostgreSQL over MySQL, but... *sigh*

--Mitchell Im.....2013-08-21 00:06:52 +0000

Additional test for dropping the word boundaries (with another card):

select t.name from cards t where t.name ~ 'テスト';

     name
--------------
 テスト
 テストケース
(2 rows)

--Mitchell Im.....2013-08-21 02:58:52 +0000

Mitchell, this is great stuff. Thanks for the great answers and for digging further.

fwiw, if it were just personal preference I'd probably be using postgresql, too. We switched over to mysql when we committed to supporting the cloudstore installs so I haven't been using it as much, but I'd be bummed to see its support wither. We have gotten several patches along the way from folks still using postgres, so you're not the only one that wants it.

I suppose we could write some special code to say, essentially, if the search term is entirely unicode and the engine is postgres, then forget about the word boundaries. Sounds awfully hacky.

We could also try the \M \m syntax mentioned here, but my guess would be that those are translated into the same thing as the other underneath.

What we really want is much smarter index-based relevance matching, and there's a shot we may be able to nail down funding for that, but that solution is a ways away.

In the short term, I'll look into ways to make this easy to tweak in a module so that, even if we don't change the default behavior, you can add a little module to make it skip word boundaries.

Now that I think about it, aren't word boundaries always a tricky issue in Japanese? I mean, if you look at this: "これはテストです。" How are you going to determine word boundaries without semantic processing (kind of a bad example, because knowing the difference between hiragana and katakana gets you most of the way. But you get the point, right? Do you know how this is usually handled?

--Ethan McCutchen.....2013-08-21 03:47:05 +0000

Yeah, I had taken a look at the PostgreSQL documentation here, so I had tried it, but didn't document it. The results are what you'd expect:

select t.name from cards t where t.name ~ E'\mテスト\M';

 name
------
(0 rows)

For how it's handled (I don't know about PostgreSQL specifically), it looks like the rules are... interesting. The Unicode rules are here, and it looks like at least one third party is using that combined with a custom word dictionary (ouch).

--Mitchell Im.....2013-08-21 17:55:09 +0000