Name: Gábor Kövesdán
Nationality: Hungarian
Resident of: Budapest, Hungary
Birth date: 2 Aug, 1987
Gender: Male
Mailing address: Pozsonyi u. 2/b, Budapest, 1045
Phone: +36 30 434 04 06
E-mail: <gabor@kovesdan.org>
IRC: gabor@EFnet#bsdcode
Proposed mentor: Xin Li <delphij@FreeBSD.org>
The libiconv library is an important piece of I18N software. It implements library routines that convert text from one encoding to another. There is also a command-line interface to this library, which provides a convenient way to convert files. Currently, FreeBSD lacks these pieces of software in the base system. In the Ports Collection, we use GNU libiconv, but this is such a fundamental tool that we should have our own customized version that is:
Located in the base system. Currently, we can't use libiconv from the base system and it is a big limiting factor in I18N.
Licensed under BSDL (preferably 2-clause).
Efficient.
Clean. It does not contain a lot of ugly platform-specific hacks, and it is optimized for FreeBSD and follows its coding conventions.
Supports the most important encodings, the more the better, but it must support the following ones: ASCII, ISO8859-family, UTF-family, CP1131, CP1251, ISCII-DEV, ARMSCII-8, SJIS, eucJP, PT154, CP949, eucKR, CP866, KOI8-R, KOI8-U, GB18030, GB2312, GBK, eucCN, Big5HKSCS, Big5. These are the encodings that are supported by our locale subsystem, so an acceptable level of their support is a requirement.
Provides a good level of compatibility with its GNU counterpart. This means not only standards conformance but also implementing the most common extensions, so that it can serve as a drop-in replacement in the Ports Collection as well.
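For reference, the standard interface that any drop-in replacement has to provide is the POSIX trio iconv_open(), iconv() and iconv_close(). The following is a minimal sketch of a one-shot buffer conversion through that interface. The helper name convert_buffer is hypothetical, and the code assumes the POSIX prototype in which iconv() takes a char ** input pointer (some older systems declare it as const char **).

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper: convert inlen bytes of `in` from encoding `from`
 * to encoding `to`, writing at most outcap bytes to `out`.
 * Returns the number of bytes written, or (size_t)-1 on any error.
 */
size_t
convert_buffer(const char *from, const char *to,
    const char *in, size_t inlen, char *out, size_t outcap)
{
	assert(from != NULL && to != NULL && in != NULL && out != NULL);

	/* Note the argument order: target encoding first, then source. */
	iconv_t cd = iconv_open(to, from);
	if (cd == (iconv_t)-1)
		return ((size_t)-1);

	/* iconv() advances the pointers; it does not modify the input bytes. */
	char *inp = (char *)in;
	char *outp = out;
	size_t inleft = inlen;
	size_t outleft = outcap;

	size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
	if (rc != (size_t)-1) {
		/* Flush any pending shift sequence of stateful encodings. */
		rc = iconv(cd, NULL, NULL, &outp, &outleft);
	}
	iconv_close(cd);

	if (rc == (size_t)-1)
		return ((size_t)-1);
	return (outcap - outleft);
}
```

A caller would pass encoding names such as "ISO-8859-1" and "UTF-8". Which names and aliases are accepted depends on the implementation, which is one of the compatibility points mentioned above.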
Let's summarize the benefits once more:
BSDL is more suitable for FreeBSD.
We can make use of this library from base. (For example, build-time generation of NLS catalogs for all locales.)
Being independent from GNU gives us more freedom and better opportunities to optimize the code to our needs.
There have already been two efforts to implement a BSDL libiconv. One of them was developed by Konstantin Chuguev and Alexander Nedotsukov. It can be found in the Ports Collection as converters/iconv and will simply be referred to as "BSDL iconv". The other one is integrated into NetBSD and comes from the Citrus project.
As for the first mentioned implementation, the missing features or known bugs are:
It cannot convert U+FFFF and U+FFFE symbols by design. This can be fixed by changing its internal converter interface.
The conversion is based on UCS, i.e. conversion can only be done through UCS-4 as an intermediate step.
There are problems in CJK (Chinese-Japanese-Korean) charset conversions.
Some restricted files are used to generate code tables. A legally unencumbered way of generating those code tables is necessary.
It needs its own regression test suite.
Perl is a build-time dependency, but Perl is not part of the base system.
I have to highlight that, despite its deficiencies, this implementation is very well documented and its code is easy to read. It serves as a good reference for conversions.
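To illustrate what conversion through UCS-4 as an intermediate step means, here is a minimal, self-contained sketch of a two-stage converter: each input character is first decoded to a UCS-4 code point, and that code point is then encoded in the target encoding. This is not code from BSDL iconv or Citrus; all function names are hypothetical, and ISO8859-1 to UTF-8 is chosen only because both stages are short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stage 1: decode one ISO8859-1 byte to UCS-4 (an identity mapping). */
uint32_t
latin1_to_ucs4(unsigned char byte)
{
	return ((uint32_t)byte);
}

/*
 * Stage 2: encode one UCS-4 code point as UTF-8.
 * Returns the number of bytes written (1-4), or 0 if cp is not encodable.
 */
size_t
ucs4_to_utf8(uint32_t cp, unsigned char out[4])
{
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return (1);
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return (2);
	}
	if (cp < 0x10000) {
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return (0);	/* surrogate halves are not characters */
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return (3);
	}
	if (cp <= 0x10FFFF) {
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return (4);
	}
	return (0);
}

/* Pivot conversion: source -> UCS-4 -> target, one character at a time. */
size_t
latin1_to_utf8(const unsigned char *in, size_t inlen,
    unsigned char *out, size_t outcap)
{
	assert(in != NULL && out != NULL);

	size_t used = 0;
	for (size_t i = 0; i < inlen; i++) {
		unsigned char tmp[4];
		size_t n = ucs4_to_utf8(latin1_to_ucs4(in[i]), tmp);

		if (n == 0 || used + n > outcap)
			return ((size_t)-1);
		for (size_t j = 0; j < n; j++)
			out[used++] = tmp[j];
	}
	return (used);
}
```

In this sketch, errors are reported through return values rather than through reserved code point values, so U+FFFE and U+FFFF pass through the pivot like any other value. The cost of the pivot design is that every conversion pays for two table lookups or computations, even when a direct mapping between the two encodings would be possible.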
The second existing implementation is part of a wider project, called Citrus (Comprehensive I18N Towards Respectable UNIX Systems). It is quite mature and more complete than the aforementioned one. This one should be chosen as a starting point. The objective is to extend this implementation to support all the character encodings listed in the section called “Project Description” and to solve possible issues found during the work.
Here is a table that summarizes the encoding support of the two implementations:
Table 1. Encoding support
| Encoding | BSDL iconv | Citrus iconv |
|---|---|---|
| ARMSCII-8 | unsupported | inaccurate |
| ASCII | wrong FFFF | OK |
| ATARIST | unsupported | OK |
| BIG5 | inaccurate | inaccurate |
| BIG5-2003 | unsupported | inaccurate |
| BIG5-HKSCS:1999 | unsupported | unsupported |
| BIG5-HKSCS:2001 | unsupported | unsupported |
| BIG5-HKSCS:2004 | unsupported | unsupported |
| CP1046 | unsupported | unsupported |
| CP1124 | unsupported | unsupported |
| CP1125 | unsupported | unsupported |
| CP1129 | unsupported | unsupported |
| CP1133 | unsupported | inaccurate |
| CP1161 | unsupported | unsupported |
| CP1162 | unsupported | unsupported |
| CP1163 | unsupported | unsupported |
| CP1250 | unsupported | OK |
| CP1251 | unsupported | OK |
| CP1252 | unsupported | OK |
| CP1253 | unsupported | OK |
| CP1254 | unsupported | OK |
| CP1255 | unsupported | inaccurate |
| CP1256 | unsupported | OK |
| CP1257 | unsupported | OK |
| CP1258 | unsupported | inaccurate |
| CP437 | unsupported | OK |
| CP737 | unsupported | OK |
| CP775 | wrong FFFF | OK |
| CP850 | wrong FFFF | OK |
| CP852 | wrong FFFF | OK |
| CP853 | unsupported | unsupported |
| CP855 | wrong FFFF | OK |
| CP856 | unsupported | OK |
| CP857 | unsupported | OK |
| CP858 | unsupported | OK |
| CP860 | unsupported | OK |
| CP861 | unsupported | OK |
| CP862 | unsupported | OK |
| CP863 | unsupported | OK |
| CP864 | unsupported | inaccurate |
| CP865 | unsupported | OK |
| CP866 | wrong FFFF | OK |
| CP869 | unsupported | OK |
| CP874 | unsupported | OK |
| CP922 | unsupported | unsupported |
| CP932 | unsupported | inaccurate |
| CP936 | unsupported | inaccurate |
| CP943 | unsupported | OK |
| CP949 | unsupported | inaccurate |
| CP950 | unsupported | inaccurate |
| DEC-HANYU | unsupported | unsupported |
| DEC-KANJI | unsupported | unsupported |
| EUC-CN | inaccurate | inaccurate |
| EUC-JISX0213 | unsupported | unsupported |
| EUC-JP | segfault | inaccurate |
| EUC-KR | inaccurate | inaccurate |
| EUC-TW | segfault | unknown error |
| GB18030 | unsupported | inaccurate |
| GBK | unsupported | inaccurate |
| GB_2312-80 | unsupported | unsupported |
| Georgian-Academy | unsupported | inaccurate |
| Georgian-PS | unsupported | inaccurate |
| HP-ROMAN8 | unsupported | inaccurate |
| HZ | unsupported | OK |
| ISO-2022-CN | unsupported | illegal sequence |
| ISO-2022-CN-EXT | unsupported | illegal sequence |
| ISO-2022-JP | unsupported | OK |
| ISO-2022-JP-1 | unsupported | OK |
| ISO-2022-JP-2 | unsupported | inaccurate |
| ISO-2022-JP-3 | unsupported | unsupported |
| ISO-2022-KR | unsupported | inaccurate |
| ISO-IR-165 | unsupported | unsupported |
| ISO646-CN | unsupported | OK |
| ISO646-JP | unsupported | OK |
| ISO8859-1 | wrong FFFF | OK |
| ISO8859-10 | unsupported | OK |
| ISO8859-11 | unsupported | OK |
| ISO8859-13 | unsupported | OK |
| ISO8859-14 | unsupported | OK |
| ISO8859-15 | wrong FFFF | OK |
| ISO8859-16 | unsupported | OK |
| ISO8859-2 | wrong FFFF | OK |
| ISO8859-3 | unsupported | OK |
| ISO8859-4 | wrong FFFF | OK |
| ISO8859-5 | wrong FFFF | OK |
| ISO8859-6 | unsupported | OK |
| ISO8859-7 | unsupported | OK |
| ISO8859-8 | unsupported | OK |
| ISO8859-9 | unsupported | OK |
| JIS_X0201 | inaccurate | OK |
| JIS_X0208 | unsupported | unsupported |
| JIS_X0212 | unsupported | unsupported |
| JOHAB | unsupported | inaccurate |
| KOI8-R | wrong FFFF | OK |
| KOI8-RU | unsupported | inaccurate |
| KOI8-T | unsupported | inaccurate |
| KOI8-U | wrong FFFF | OK |
| KSC_5601 | unsupported | unsupported |
| MacArabic | unsupported | unsupported |
| MacCentralEurope | unsupported | OK |
| MacCroatian | unsupported | inaccurate |
| MacCyrillic | unsupported | inaccurate |
| MacGreek | unsupported | inaccurate |
| MacHebrew | unsupported | unsupported |
| MacIceland | unsupported | inaccurate |
| MacRoman | unsupported | inaccurate |
| MacRomania | unsupported | unsupported |
| MacThai | unsupported | inaccurate |
| MacTurkish | unsupported | inaccurate |
| MacUkraine | unsupported | inaccurate |
| MuleLao-1 | unsupported | inaccurate |
| NEXTSTEP | unsupported | OK |
| PT154 | unsupported | OK |
| RISCOS-LATIN1 | unsupported | unsupported |
| RK1048 | unsupported | unsupported |
| SHIFT_JIS | inaccurate | inaccurate |
| SHIFT_JISX0213 | unsupported | unsupported |
| TCVN | unsupported | inaccurate |
| TDS565 | unsupported | unknown error |
| TIS-620 | unsupported | unsupported |
| UCS-2BE | unsupported | OK |
| UCS-2LE | unsupported | OK |
| UCS-4BE | unsupported | OK |
| UCS-4LE | unsupported | OK |
| UTF-16 | OK | illegal sequence |
| UTF-16BE | unsupported | illegal sequence |
| UTF-16LE | unsupported | illegal sequence |
| UTF-32 | unsupported | inaccurate |
| UTF-32BE | unsupported | OK |
| UTF-32LE | unsupported | OK |
| UTF-7 | unsupported | unexpected EOF |
| UTF-8 | wrong FFFF | OK |
| VISCII | unsupported | OK |
I plan to work around 4 hours a day on the project. In July, I will be on vacation in Spain for a while, but I intend to dedicate some time to the project during this period as well. If I have time beforehand, I want to start the work earlier, in the community bonding period, because I am already familiar with the community and the infrastructure.
Here are the concrete planned TODO items and their schedule:
May 23 - May 29: Read the code and get an understanding of its structure and how it works.
May 29 - Jun 8: Extract the code from NetBSD libc and make it buildable on FreeBSD as a usual independent shared library.
Jun 9 - Jun 21: Fix the basic Latin-based encodings. The remaining items here are: UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-7. The UCS-family and the ISO8859-family already work correctly.
Jun 22 - Jul 22: Fix non-Latin encodings.
These include the following CJK encodings: Shift JIS, EUC-JP, EUC-KR, GB18030, GBK, EUC-CN, Big5.
The following CJK encodings are completely unsupported: GB2312, Big5HKSCS. Support for these would also be nice if time permits.
The KOI8-R and KOI8-U Cyrillic encodings already work correctly.
Jul 23 - Aug 10: If all items have been completed at this point, the remaining time can be dedicated to the legacy encodings that are used in our supported locales: CP-family, ARMSCII, ISCII-DEV. If more important items are still open, those should be completed first.
During development, GNU libiconv's regression test suite can help track the correctness of the various encodings (this test suite was used to obtain the results listed in the table above). A dedicated test suite would be good, but I prefer to concentrate on the conversion work itself rather than on writing test suites, and the GNU test suite seems good and thorough enough for testing.
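As a complement to a full table-based suite, a quick smoke check can be written against the standard iconv interface alone. The sketch below is not part of the GNU test suite; it is a hypothetical helper, count_roundtrips, that converts each of the 256 byte values of a single-byte charset to UTF-8 and back and counts how many come back unchanged. It assumes the POSIX iconv() prototype with a char ** input pointer.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical smoke check for a single-byte charset: count the byte
 * values that survive charset -> UTF-8 -> charset unchanged.
 * Returns -1 if either conversion direction cannot be opened.
 */
int
count_roundtrips(const char *charset)
{
	assert(charset != NULL);

	iconv_t to_utf8 = iconv_open("UTF-8", charset);
	if (to_utf8 == (iconv_t)-1)
		return (-1);
	iconv_t from_utf8 = iconv_open(charset, "UTF-8");
	if (from_utf8 == (iconv_t)-1) {
		iconv_close(to_utf8);
		return (-1);
	}

	int ok = 0;
	for (int b = 0; b < 256; b++) {
		char src = (char)b;
		char utf8[8], back[8];
		char *ip = &src, *op = utf8;
		size_t il = 1, ol = sizeof(utf8);

		/* Reset both descriptors so an earlier error cannot leak. */
		iconv(to_utf8, NULL, NULL, NULL, NULL);
		iconv(from_utf8, NULL, NULL, NULL, NULL);

		if (iconv(to_utf8, &ip, &il, &op, &ol) == (size_t)-1)
			continue;	/* byte is not mapped in this charset */

		ip = utf8;
		il = sizeof(utf8) - ol;
		op = back;
		ol = sizeof(back);
		if (iconv(from_utf8, &ip, &il, &op, &ol) == (size_t)-1)
			continue;

		if (sizeof(back) - ol == 1 && back[0] == src)
			ok++;
	}

	iconv_close(to_utf8);
	iconv_close(from_utf8);
	return (ok);
}
```

A round trip only shows that the two directions agree with each other. It cannot detect a mapping that is consistently wrong in both directions, which is why comparison against reference tables, as the GNU suite does, remains the main correctness check.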
Performance is not the most important factor for libiconv, because encoding conversion is usually not a time-critical task, but acceptable performance must of course be provided. This can be tested by comparing the conversion times of Citrus and GNU for different types of conversions and different file sizes.
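Such a comparison could be driven by a small timing harness that is built once against each library. The following is only a sketch under stated assumptions: the helper name time_conversion is hypothetical, the input is a synthetic buffer filled with the ASCII letter 'a' (so the source encoding must be ASCII-compatible), and the output buffer is sized for at most four output bytes per input byte plus room for a byte order mark.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical timing helper: convert `size` bytes of synthetic ASCII
 * text from `from` to `to` and return the elapsed wall-clock time in
 * seconds, or -1.0 on any error.
 */
double
time_conversion(const char *from, const char *to, size_t size)
{
	assert(from != NULL && to != NULL);
	if (size == 0)
		return (-1.0);

	size_t outcap = size * 4 + 16;
	char *in = malloc(size);
	char *out = malloc(outcap);
	iconv_t cd = iconv_open(to, from);
	double elapsed = -1.0;

	if (in != NULL && out != NULL && cd != (iconv_t)-1) {
		char *ip = in, *op = out;
		size_t il = size, ol = outcap;
		struct timespec t0, t1;

		memset(in, 'a', size);

		/* Time only the conversion call, not setup or teardown. */
		clock_gettime(CLOCK_MONOTONIC, &t0);
		size_t rc = iconv(cd, &ip, &il, &op, &ol);
		clock_gettime(CLOCK_MONOTONIC, &t1);

		if (rc != (size_t)-1)
			elapsed = (double)(t1.tv_sec - t0.tv_sec) +
			    (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	}

	if (cd != (iconv_t)-1)
		iconv_close(cd);
	free(in);
	free(out);
	return (elapsed);
}
```

For meaningful numbers, each measurement should be repeated several times and averaged, and real sample files in the encodings of interest would be more representative than the synthetic buffer used here.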
At the end of the program, I would like to test the resulting library with a complete portbuild run. This project will not provide a completely GNU-compatible implementation, because implementing such a great many encodings cannot fit into the Summer of Code timeline. Some ports may therefore break with it, but checking the build logs can help identify possible bugs.
Apart from this test plan, the library will be accessible through the Ports Collection once the basic functions work, so that interested parties can try it out and provide feedback.
During Google SoC 2008, I worked on a BSDL implementation of grep and sort. The former is ready and waiting for a portbuild test. I think chances are good to import this before FreeBSD 8.0. The latter is also nearing completion. Both utilities are widechar-clean and offer good performance.
As for the I18N area, I learned the basics during Google SoC 2008, because widechar support was a crucial requirement for grep and sort. I read various documents about encodings and became interested in the topic. This year, I would like to learn more about encodings.
During Google SoC 2007 and 2006, I worked on various infrastructure-level enhancements to the Ports Collection. Besides, I work as a doc/ports committer, and I was the first Hungarian translator, launching the www/hu and doc/hu_HU.ISO8859-2 trees. I have also mentored some committers in the ports/doc area.
My CV is available online in the following formats:
HTML,
PDF,
RTF,
PS,
plain text.