perl

Unicode::Collate is really, really slow

Robin Smidsrød

01 Sep 2011 • 2 min read

I noticed a while back that there was something fishy with perl's built-in sort when dealing with Unicode text. Doing some research made me eventually notice the Unicode Collation Algorithm (UCA) and the perl implementation in Unicode::Collate and the very useful Unicode::Collate::Locale. Thanks a lot to the author and maintainers of Unicode::Collate for actually implementing the algorithm and giving perl programmers the tools to actually sort text in the right way. Unfortunately my hope got quickly crushed when I noticed the very large speed difference in the implementation of UCA and the normal sort algorithm. At this point I also noticed the pg_collkey implementation of UCA for PostgreSQL, and I was saved (because it was quite fast). Since all of my data came from my database I could just sort the data before it was returned from the database. Time passed and I slowly rewrote all of my broken sort's to use database-based ORDER BY instead. And then I'm reading Tom Christiansen's article on how the builtin sort in perl is broken for some type of work, and that we should be using Unicode::Collate instead. Suddenly everything I had forgotten came back to me. I guess now was the time to speak up. I whipped up a small benchmark program to actually test how much difference there is between the built-in sort and Unicode::Collate. #!/usr/bin/env perl use 5.14.0; use strict; use warnings; use File::Slurp qw(slurp); use Encode qw(decode_utf8); use List::MoreUtils qw(uniq); use autodie; use Benchmark qw(cmpthese); use Unicode::Collate::Locale; # Get unique list of words from specified file my @words = uniq map { split /\b/u } decode_utf8( slurp(shift) ); # Do benchmark my $uca = Unicode::Collate::Locale->new( locale => 'nb' ); cmpthese(1_000, { 'sort' => sub { my @sorted_words = sort @words }, 'uca' => sub { my @sorted_words = $uca->sort(@words) }, }); exit; You can find uca_sort_benchmark.pl on GitHub, together with the text file I used, if you'd like to reproduce. The results are truly devastating. $ ./uca_sort_benchmark.pl misc_wikipedia_text.txt Rate uca sort uca 4.08/s -- -99% sort 403/s 9782% -- The implementation of UCA in perl is about 100 times slower than the built-in sort! I'm wondering how much faster it could be if Unicode::Collate was coded to use ICU directly (which is what pg_collkey uses). I really hope someone with XS/C skills can figure out how to make it faster, because I really want to use it everywhere.