A Chinese character takes 2 bytes of storage in a computer system. When a database uses a single-byte character set, it will accept half of a Chinese character, because any single byte counts as one valid datum; the usual English character sets en_us.819 or en_us.utf8 are examples. When the database uses a multi-byte character set, however, half of a Chinese character is an illegal, incomplete character, and the database reports "illegal character" when it tries to store such data; the usual Chinese character sets zh_cn.gb and zh_cn.GB18030-2000 are examples. To work around this, I wrote a small program that filters half Chinese characters out of the database data.
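Principle:
In a double-byte encoding such as GB2312, a Chinese character consists of 2 bytes, and each of the two bytes has a value greater than 127, i.e. outside the ASCII range. So whenever we find a byte whose value is greater than 127, we check whether the byte immediately after it is also greater than 127. If it is, the two bytes form a complete Chinese character; if not, the first byte is half of a Chinese character.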
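The rule above can be sketched on a single in-memory buffer. This is only an illustration of the pairing check, not the program itself; the function name `drop_half_chars` and its signature are my own assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the original program.
   Copies src[0..n) to dst, dropping every byte > 0x7F that is not
   paired with another byte > 0x7F. dst must hold at least n bytes.
   Returns the number of bytes written to dst. */
size_t
drop_half_chars (const unsigned char *src, size_t n, unsigned char *dst)
{
  size_t i = 0, out = 0;

  assert (src != NULL && dst != NULL);
  while (i < n)
    {
      if (src[i] <= 0x7F)
        dst[out++] = src[i++];          /* plain ASCII byte */
      else if (i + 1 < n && src[i + 1] > 0x7F)
        {
          dst[out++] = src[i++];        /* complete two-byte character */
          dst[out++] = src[i++];
        }
      else
        i++;                            /* orphaned half: skip it */
    }
  return out;
}
```

For example, given the six bytes 'A', 0xD6, 0xD0, 'B', 0xD6, 'C', the pair 0xD6 0xD0 is kept and the lone 0xD6 before 'C' is dropped, leaving five bytes.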
Steps for use:
1. Unload the data from the database as plain text.
2. Run trim infile outfile to filter the data. It removes every half Chinese character that is immediately followed by a non-Chinese character.
3. After setting the Chinese character set, reload the data into the database.
/*******************************************************************************
*
* Module: trim
* Author: Richard ZHAN
* Description: Eliminate half Chinese character followed by a non Chinese character in a plain data file
*
* Change Log
*
* Date Name Description.................
* 03/20/2009 Richard ZHAN Start Program
*
*******************************************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>

#define LEN 4096

static void usage (void);

int
main (int argc, char *argv[])
{
  int rfd, wfd;
  ssize_t len1, len2, i;
  /* str2 has one extra byte: a lead byte held over from the previous
     block may be written in addition to the LEN bytes of this block. */
  char hi, str1[LEN], str2[LEN + 1];
  const unsigned char ascii_hi = 0x7F;

  if (argc != 3)
    {
      usage ();
      exit (1);
    }
  if ((rfd = open (argv[1], O_RDONLY)) == -1)
    {
      perror ("Cannot open read file");
      exit (1);
    }
  /* O_TRUNC so that an existing, longer outfile leaves no stale bytes. */
  if ((wfd = open (argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644)) == -1)
    {
      perror ("Cannot open write file");
      close (rfd);
      exit (1);
    }

  hi = '\0';                    /* pending first half; '\0' means none */
  while ((len1 = read (rfd, str1, LEN)) > 0)
    {
      len2 = 0;
      for (i = 0; i < len1; i++)
        {
          if ((unsigned char) str1[i] > ascii_hi)
            {
              if (hi == '\0')
                {
                  /* First half: hold it until the next byte is seen. */
                  hi = str1[i];
                }
              else
                {
                  /* Second half: write the complete character. */
                  str2[len2++] = hi;
                  str2[len2++] = str1[i];
                  hi = '\0';
                }
            }
          else
            {
              /* ASCII byte: copy it; a pending first half is dropped. */
              str2[len2++] = str1[i];
              hi = '\0';
            }
        }
      if (write (wfd, str2, len2) != len2)
        {
          perror ("Encounter write error");
          close (rfd);
          close (wfd);
          exit (1);
        }
    }
  if (len1 < 0)
    {
      perror ("Encounter read error");
      close (rfd);
      close (wfd);
      exit (1);
    }

  close (rfd);
  close (wfd);
  return 0;
}

static void
usage (void)
{
  fprintf (stderr, "Usage: trim infile outfile\n");
}
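One detail of the program worth spelling out is the variable hi: because it keeps its value between read() calls, a character whose two bytes land in different 4096-byte blocks is still written whole. Below is a minimal sketch of the same idea as a reusable function. The names `half_filter` and `half_filter_feed` are hypothetical, and the sketch assumes the caller supplies an output buffer of at least n + 1 bytes, since a byte held over from the previous block may be emitted on top of the n bytes of the current one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state carried between blocks; not in the original program. */
struct half_filter
{
  unsigned char pending;        /* held-over first half, 0 if none */
};

/* Filters one block of n bytes from src into dst and returns the number
   of bytes written. dst must hold at least n + 1 bytes (assumption of
   this sketch, see the note above). */
size_t
half_filter_feed (struct half_filter *f, const unsigned char *src,
                  size_t n, unsigned char *dst)
{
  size_t i, out = 0;

  assert (f != NULL && src != NULL && dst != NULL);
  for (i = 0; i < n; i++)
    {
      if (src[i] > 0x7F)
        {
          if (f->pending == 0)
            f->pending = src[i];        /* wait for the second half */
          else
            {
              dst[out++] = f->pending;  /* pair complete: emit both bytes */
              dst[out++] = src[i];
              f->pending = 0;
            }
        }
      else
        {
          dst[out++] = src[i];          /* ASCII: a pending half is dropped */
          f->pending = 0;
        }
    }
  return out;
}
```

Feeding the blocks 'A', 0xD6 and then 0xD0, 'B' through one `half_filter` yields 1 byte from the first call and 3 bytes from the second, so the character split across the boundary is preserved. A half left pending when the input ends is never written, which matches what the program does at end of file.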