Last updated July 9, 2009

Chinese Programming Notes

This page will probably only interest programmers, but I have published it because outputting Chinese presents problems that do not occur when dealing solely with standard Western encodings.

The reason is that Chinese is stored in a computer as multi-byte characters, whereas Western character sets tend to use a single byte per character.
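
To make the byte counts concrete, here is a minimal sketch in Python (the site itself uses PHP; Python is used purely for illustration). The helper name byte_length and the sample characters are assumptions made for this example only.

```python
# Compare how many bytes a single character needs in different encodings.

def byte_length(ch, encoding):
    """Return the number of bytes used to store ch in the given encoding."""
    return len(ch.encode(encoding))

for ch in ("A", "中"):
    sizes = {}
    for enc in ("latin-1", "big5", "utf-8"):
        try:
            sizes[enc] = byte_length(ch, enc)
        except UnicodeEncodeError:
            # The character cannot be represented in this encoding at all.
            sizes[enc] = None
    print(ch, sizes)
```

The Latin letter fits in one byte everywhere, while the Chinese character needs two bytes in Big5, three in UTF-8, and cannot be stored in a single-byte Western encoding at all.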

Chinese is further complicated by the fact that the most common way to store the characters is a format known as Big5. This site uses Big5 for all the static pages as that is the only way I can enter Chinese on my computer.
Unfortunately Big5 is outdated and should not really be used on the web. There are various reasons for this, but the main one is that it simply shouldn't be needed any more: Unicode is far better. Unicode is an all-encompassing encoding standard that can be used to display any mix of scripts on a single page, e.g. Russian Cyrillic, Japanese, Arabic, Traditional Chinese and Simplified Chinese.

As it happens, I am able to store Chinese characters as Unicode in my online database, so to future-proof my code I chose to do this. As a result, all the dynamic sections of the site output Chinese as Unicode while all the static sections use Big5. See "Character of the Day" below for one way of dealing with this.

As noted, throughout the site I make extensive use of a MySQL database containing details of hundreds of Chinese characters. Each character contains the following items of data:

  • id - a unique ID number for internal reference
  • unicode - the character encoded in UTF-8 format. Currently this column has a unique index, but I may change that to allow some characters to appear multiple times with different descriptions (it seems this does happen in Chinese for some words).
  • shortdesc - a brief summary of the meaning of the character
  • desc - a full dictionary definition
  • jyutping - the Cantonese jyutping romanisation
  • pinyin - the Mandarin pinyin romanisation (even though this is a Cantonese site, I had so many requests to cater for Mandarin learners that I decided one extra field wouldn't hurt)
  • strokes - the stroke count (useful for looking the character up in a Chinese dictionary and also for checking that you are writing it correctly)
  • level - a value between 1 and 4, with 1 being a basic, easy to write character and 4 being the most complex.  As more characters get added there will no doubt eventually be higher levels. These levels are completely arbitrary and are based on my judgement alone. Occasionally I change the levels if I realise that a harder character is actually very common and should be learned sooner.
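
The fields above could be laid out roughly as follows. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name characters, the column types and the sample row are all assumptions made for illustration, not the site's actual schema.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE characters (
        id        INTEGER PRIMARY KEY,          -- internal reference number
        unicode   TEXT NOT NULL UNIQUE,         -- the character itself (UTF-8)
        shortdesc TEXT,                         -- brief summary of the meaning
        "desc"    TEXT,                         -- quoted: DESC is an SQL keyword
        jyutping  TEXT,                         -- Cantonese romanisation
        pinyin    TEXT,                         -- Mandarin romanisation
        strokes   INTEGER,                      -- stroke count
        level     INTEGER CHECK (level BETWEEN 1 AND 4)
    )
""")

# Illustrative sample row only.
conn.execute(
    'INSERT INTO characters (unicode, shortdesc, "desc", jyutping, pinyin, strokes, level) '
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("人", "person", "a person; people", "jan4", "ren2", 2, 1),
)

row = conn.execute(
    "SELECT unicode, jyutping, strokes FROM characters WHERE level = 1"
).fetchone()
print(row)
```

Note that a column called desc has to be quoted in queries because DESC is a reserved word in SQL (MySQL uses backticks for this).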

The database is updated as I learn new characters myself. I add them to my local MySQL database and then export the data as textual SQL to import into the live server database.

50 character test

This script initially chose 100 characters at random from an array hardcoded in the script itself.
When I updated it to use the database I wrote a Chinese Character class in PHP, which is now used in most of my other scripts. Treating characters as objects lets me code faster and in a more structured manner, which helps when maintaining old scripts.
Following user feedback I reduced the number of characters to 50 to make the test clearer.
The latest versions let the user choose a level and whether to show Cantonese or Mandarin tooltips.
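
As a rough idea of what treating a character as an object looks like, here is a minimal Python sketch (the real class is written in PHP and is not shown on this page). The class name ChineseCharacter, the tooltip and pick_test helpers, and the placeholder pool are all hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class ChineseCharacter:
    unicode: str
    shortdesc: str
    jyutping: str
    pinyin: str
    level: int

    def tooltip(self, language="cantonese"):
        """Return the romanisation and meaning to show in a tooltip."""
        reading = self.jyutping if language == "cantonese" else self.pinyin
        return f"{reading} ({self.shortdesc})"

def pick_test(pool, level, count=50, rng=random):
    """Pick up to `count` distinct characters of the given level at random."""
    candidates = [c for c in pool if c.level == level]
    return rng.sample(candidates, min(count, len(candidates)))

# Placeholder pool: real code would load these rows from the database.
# The meanings and readings here are dummy values, not real data.
pool = [
    ChineseCharacter(chr(0x4E00 + i), f"meaning {i}", f"jp{i}", f"py{i}", level=1 + i % 4)
    for i in range(400)
]

test = pick_test(pool, level=1)
print(len(test), test[0].tooltip("mandarin"))
```

Using sample rather than repeated random picks guarantees that no character appears twice in the same test.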

Online Quiz

This is the most complex script I've done for the site and it has had many new features added.
It uses PHP sessions to record the user's position within the test. For example, when the user first starts the test it will display a list of options. Once a desired game type is chosen it will randomly pick 10 answers for the test and store them in the session and take a note of the server time for scoring purposes. As the user progresses through the test their answers are stored in a session array and at the end compared with the actual answers. This makes it easy to output a summary of their responses and optionally record their score and time in the High Score table.
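
The flow described above can be sketched as follows. This is an illustration in Python rather than the site's PHP: a plain dict stands in for the PHP session, and the function names, the game_type value and the sample pool are assumptions.

```python
import random
import time

def start_quiz(session, pool, game_type, now=None):
    """Store the chosen game type, 10 random answers and the start time."""
    session["game_type"] = game_type
    session["answers"] = random.sample(pool, 10)
    session["responses"] = []
    session["started_at"] = time.time() if now is None else now

def record_response(session, response):
    """Remember the user's answer to the current question."""
    session["responses"].append(response)

def finish_quiz(session, now=None):
    """Compare responses with the stored answers and compute score and time."""
    ended = time.time() if now is None else now
    pairs = list(zip(session["answers"], session["responses"]))
    score = sum(1 for answer, response in pairs if answer == response)
    return {"score": score, "seconds": ended - session["started_at"], "summary": pairs}

# Usage: a dict plays the role of the session; timestamps are fixed for clarity.
pool = [f"character-{n}" for n in range(20)]
session = {}
start_quiz(session, pool, "meaning", now=100.0)
for i, answer in enumerate(session["answers"]):
    # Get the first question wrong and the rest right.
    record_response(session, answer if i else "wrong")
result = finish_quiz(session, now=142.5)
print(result["score"], result["seconds"])
```

Keeping the answers on the server side means the browser never sees them until the summary is shown.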

Last N posts

On the front page of the site a summary of the latest posts to the forum is displayed.
This hooks into the site's forum software, "Phorum", and is a fairly basic script. At some point I should update it to use caching (see Character of the Day below).

Flashcards

The flashcards on this site used to be static pages which I updated manually. Indeed, I created over 200 before moving to the more sensible dynamically generated cards that are now available. Because the cards now come from a database, users can customise them by level, choose what appears on the cards, and choose between Mandarin and Cantonese romanisation.

Master List of Characters

This script was done more for my own benefit but hopefully it is still useful for people to see which characters are used within the site. Users can choose to filter by level and order by strokecount, pronunciation or meaning. Crafty users can therefore use it to revise for the Online Quiz!
I have disabled the ability to show all characters for resource reasons - pulling hundreds of rows from the database causes a certain amount of load and I'm not sure it would be all that helpful to anyone other than me (I use it as an easy way to check if I have already added a character to the DB).

Character of the Day

The Character of the Day script is shown on the front page and uses a few interesting techniques.
Firstly, it is cached: the database only needs to be queried once per day, and the resulting character is written to the server as a simple text file.
Secondly, the randomisation is done within MySQL with a static seed, so a different character for each day should be possible (although repetition may occur as new characters are added to the database).
Finally, for legacy reasons, my front page is encoded as Chinese Big5 but my database is encoded as Unicode (UTF-8). In the past I couldn't display UTF-8 characters on a Big5 page, but PHP 4.3.x can finally convert between encodings using mb_convert_encoding().
This script will therefore work fine on the front page once I can get the mbstring extension installed on my web server. UPDATE: mbstring is now installed, but it appears only Japanese support is included by default. Traditional Chinese support has to be included at compile time, which is most frustrating! I now have to try to persuade my web provider to enable this feature... Further UPDATE: Hurrah! They have added support for all possible languages and my code now seems to work OK.
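
The three techniques (daily cache file, seeded randomisation, UTF-8 to Big5 conversion) can be sketched together as below. This is a Python illustration, not the actual PHP script: the seeding here is done in Python from the date rather than inside MySQL, and the function names, cache file name and sample characters are assumptions.

```python
import datetime
import os
import random
import tempfile

def pick_for_day(characters, day):
    """Pick one character deterministically from the date."""
    rng = random.Random(day.toordinal())  # same seed all day, new seed tomorrow
    return rng.choice(characters)

def character_of_the_day(characters, day, cache_dir):
    """Return today's character as Big5 bytes, using a one-file-per-day cache."""
    cache_path = os.path.join(cache_dir, f"cotd-{day.isoformat()}.txt")
    if os.path.exists(cache_path):
        # Cache hit: no need to pick (or query) again today.
        with open(cache_path, "rb") as fh:
            return fh.read()
    utf8_bytes = pick_for_day(characters, day).encode("utf-8")  # as stored in the DB
    # Convert UTF-8 to Big5, the step mb_convert_encoding() performs in PHP.
    big5_bytes = utf8_bytes.decode("utf-8").encode("big5")
    with open(cache_path, "wb") as fh:
        fh.write(big5_bytes)
    return big5_bytes

characters = ["人", "大", "中", "水", "火"]
with tempfile.TemporaryDirectory() as cache_dir:
    today = datetime.date(2009, 7, 9)
    first = character_of_the_day(characters, today, cache_dir)
    second = character_of_the_day(characters, today, cache_dir)  # served from cache
    print(first.decode("big5"), first == second)
```

A character that has no Big5 equivalent would raise an error at the conversion step, so real code would need to decide how to handle that case.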

Dynamic Tests

This script generates random tests which are designed to be printed out. The randomisation is again based on a static seed, which is shown to the user as a "Test Number". The idea is that people may like to reprint an older test. The answers are shown in small print at the bottom of the page so that you can see how well you did away from the computer. Some people (including myself) find the dynamic tests a better way of learning than the flashcards.
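
The "Test Number" idea can be illustrated with a short Python sketch (the site's own script is PHP, with the randomisation presumably done differently). The function name generate_test and the sample data are assumptions.

```python
import random

def generate_test(characters, test_number, size=5):
    """Return the same selection every time for a given test number."""
    rng = random.Random(test_number)  # the test number is the seed
    return rng.sample(characters, size)

# Illustrative (character, meaning) pairs.
characters = [
    ("人", "person"), ("大", "big"), ("中", "middle"), ("水", "water"),
    ("火", "fire"), ("山", "mountain"), ("口", "mouth"), ("日", "sun"),
    ("月", "moon"), ("木", "tree"), ("小", "small"), ("女", "woman"),
]

original = generate_test(characters, test_number=1234)
reprint = generate_test(characters, test_number=1234)
print(original == reprint)
```

Because the selection depends only on the seed and the list of characters, adding new characters to the list can change what an old test number produces.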

Customisable Calendar - November 2003

The latest script builds on the idea of the flashcards. The idea is that the user can customise a 365-page calendar and print it out for use as a notepad.
Coding-wise, the main difference is that I am using CSS for all the page positioning. This gives me more control than HTML tables. I am also using CSS page breaks, which means that every A4 sheet will print correctly on its own page. Instead of having to print 40-odd pages manually, you can just click print once and everything should work. The downside is that database load and bandwidth could be an issue :-( To help reduce the file size I am using PHP to zip the page before it is sent to the user's browser. I must thank dr.slump from our forums for helping me fix some bugs with my dynamic CSS code; thanks to him the calendar now works in Opera 7 as well as IE6.
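
Here is a minimal sketch of the two ideas, CSS page breaks and compressing the output, written in Python for illustration. It assumes that "zip" means gzip compression; the markup, the class name page and the function name build_calendar_html are hypothetical and not the site's real output.

```python
import gzip

def build_calendar_html(entries):
    """Build one printable block per day, each forced onto its own page."""
    pages = "\n".join(
        f'<div class="page"><h1>{day}</h1><p>{char}</p></div>' for day, char in entries
    )
    return (
        '<html><head><meta charset="utf-8"><style>'
        ".page { page-break-after: always; }"  # CSS page break after every block
        "</style></head><body>\n" + pages + "\n</body></html>"
    )

# Placeholder content: 365 days, all showing the same character.
entries = [(f"Day {n}", "人") for n in range(1, 366)]
html = build_calendar_html(entries).encode("utf-8")
compressed = gzip.compress(html)
print(len(html), len(compressed))
```

Repetitive markup like this compresses very well, which is why compressing the page helps with bandwidth. A web server sending the compressed bytes would also need to set a Content-Encoding: gzip response header so the browser knows to decompress them.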

Miscellaneous

  • Most of the navigation menus are output in PHP which means I can make changes very quickly.  
  • Common page elements such as the copyright notice are also generated via a single script.
  • Virtually all formatting is achieved via CSS (Cascading Style Sheets). If you don't use CSS for your own websites you really should: style sheets are extremely useful.
  • A few visual effects use JavaScript, via event handlers such as onMouseOver.



 

 
