UNACCENT - simple accent stripping tool
Unaccent.exe is a simple, small command line tool that helps you strip accents from text. For example, the text:
Zażółć gęślą jaźń.
is converted to
Zazolc gesla jazn.
or
Zazólc gesla jazn.
As you can see, in the second result the "ó" character is not converted to "o". In that mode Unaccent converts text only to the Latin1 codepage, and "ó" is a valid character in this codepage.
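To see why only "ó" survives the Latin1 conversion, the following minimal Python sketch checks which accented characters of the sample text can be encoded in Latin-1. It is an illustration only and is not part of Unaccent; the helper name fits_latin1 is hypothetical.

```python
import unicodedata

# Sample text from the example above, normalized to precomposed characters.
text = unicodedata.normalize("NFC", "Zażółć gęślą jaźń.")

def fits_latin1(ch):
    """Return True if the character can be encoded in Latin-1 (hypothetical helper)."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

# Non-ASCII characters that Latin-1 can represent without any change.
kept = [ch for ch in text if ord(ch) > 127 and fits_latin1(ch)]
print(kept)  # ['ó']
```

All other Polish letters in the sample have no Latin-1 code point, so they have to lose their accent to fit into the target codepage.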
You can find the source and binaries here.
unaccent [-l] [-i ifile] [-b] [-a] [-e] [-c cp] [-1] [-?] [-h] [-g] ofile
| Option | Description |
| -l | List codepage numbers and exit. |
| -i | Select the input file; use - for stdin (default). This option must be used unless the -l or -b option is used. |
| -a | Input text is coded in the default ANSI codepage and the output file will be coded in the ANSI - Latin I (1252) codepage. |
| -e | Input text is coded in the default OEM codepage and the output file will be coded in the OEM - United States (437) codepage. This is the default setting. |
| -c | Input text is coded in the <cp> codepage. Use the -l option to see all codepage numbers. Use this option after the -a or -e option. |
| -1 | Convert only to the Latin1 charset. Not all accents will be stripped. By default the tool tries to remove all accents. |
| -b | Gets text from the clipboard, unaccents it and stores it back in the clipboard. |
| -h or -? | Prints help. |
| -g | Prints the license note. |
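The -a, -e and -c options matter because the same character is stored as a different byte in different codepages. The short Python sketch below illustrates this with the codecs shipped with Python (cp1252 and cp437); it only demonstrates the codepage concept and does not call Unaccent.

```python
ch = "\u00f3"  # the character "ó"

# The same character encoded in ANSI - Latin I (1252) and OEM - United States (437).
encoded = {cp: ch.encode(cp) for cp in ("cp1252", "cp437")}
print(encoded)  # {'cp1252': b'\xf3', 'cp437': b'\xa2'}

# Reading a byte with the wrong codepage yields a different character.
raw = b"\xf3"
print(raw.decode("cp1252") == ch)  # True
print(raw.decode("cp437") == ch)   # False
```

This is why the input codepage has to be stated correctly: a file read with the wrong codepage produces wrong characters before any accent stripping takes place.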
Simple use. Unaccent one ANSI text file to another:
unaccent -i text.txt unaccented.txt
Unaccent ANSI text file to stdout:
unaccent -i text.txt
Unaccent file coded in OEM codepage to stdout:
unaccent -e -i text.txt
Unaccent file coded in ANSI - Baltic codepage to OEM - United States coded file:
unaccent -e -c1257 -itext.txt -1 unaccent.txt
Unaccent text stored in clipboard:
unaccent -b
If the -b switch is used, the -a, -e and -c switches are ignored.
When the application tries to strip all accents, it uses a technique found in the unacc library. Originally this library uses the iconv library to convert text to Unicode. In my version this conversion is done by the MultiByteToWideChar function.
When the application converts text to the Latin1 codepage only, it uses WideCharToMultiByte to remove accents. Accents are removed when the WC_COMPOSITECHECK and WC_DISCARDNS flags are used.
First the text must be converted to Unicode using the MultiByteToWideChar function, and then the Unicode text is converted to the 437 or 1252 codepage. That's all.
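The two-step flow described above (codepage to Unicode, then Unicode to the target codepage with accents removed) can be sketched in Python. This is not the tool's implementation: it swaps in Unicode NFD decomposition from the standard unicodedata module in place of the unacc tables and the Win32 flags, and the names EXTRA, strip_char, unaccent and convert are hypothetical.

```python
import unicodedata

# Letters without a canonical decomposition, mapped by hand.
# Assumption: illustrative and far from complete.
EXTRA = {"\u0142": "l", "\u0141": "L", "\u00f8": "o", "\u00d8": "O"}

def strip_char(ch):
    """Remove combining marks from one character (swapped-in NFD technique)."""
    if ch in EXTRA:
        return EXTRA[ch]
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def unaccent(text, latin1_only=False):
    """Strip accents; with latin1_only, keep characters that Latin-1 can hold."""
    out = []
    for ch in text:
        if latin1_only:
            try:
                ch.encode("latin-1")
                out.append(ch)  # valid in Latin-1, leave untouched
                continue
            except UnicodeEncodeError:
                pass
        out.append(strip_char(ch))
    return "".join(out)

def convert(data, src_codepage, dst_codepage, latin1_only=False):
    """Step 1: bytes -> Unicode. Step 2: strip accents and encode to the target."""
    text = data.decode(src_codepage)
    return unaccent(text, latin1_only).encode(dst_codepage, errors="replace")

sample = unicodedata.normalize("NFC", "Zażółć gęślą jaźń.").encode("cp1250")
print(convert(sample, "cp1250", "cp437").decode("cp437"))
# Zazolc gesla jazn.
print(convert(sample, "cp1250", "cp1252", latin1_only=True).decode("cp1252"))
# Zazólc gesla jazn.
```

The decode call plays the role that MultiByteToWideChar has in the tool, and the final encode call corresponds to the conversion into the 437 or 1252 codepage.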
This application works only on the Windows platform.
In the script directory you can find the unaccfiles.js script, which strips accents from filenames in a selected directory. The script must be run from the directory where unaccent.exe is located. To run it, type wscript unaccfiles.js directory_path, for example:
wscript unaccfiles.js c:\temp\unacc
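For readers without Windows Script Host, the idea behind renaming files can be sketched in Python. This is not a port of unaccfiles.js, whose internals are not shown here: it does not call unaccent.exe, it uses Unicode NFD decomposition as a stand-in technique, and the function names strip_accents and unaccent_filenames are hypothetical.

```python
import os
import unicodedata

def strip_accents(name):
    """Drop combining marks after NFD decomposition (stand-in technique)."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def unaccent_filenames(directory):
    """Rename entries in one directory; return a list of (old, new) name pairs."""
    renamed = []
    for entry in sorted(os.listdir(directory)):
        new_name = strip_accents(entry)
        if new_name == entry:
            continue  # nothing to strip
        src = os.path.join(directory, entry)
        dst = os.path.join(directory, new_name)
        if os.path.exists(dst):
            continue  # never overwrite an existing file
        os.rename(src, dst)
        renamed.append((entry, new_name))
    return renamed
```

Calling unaccent_filenames(r"c:\temp\unacc") would process the same directory as the example above. Letters without a canonical decomposition, such as "ł", are left unchanged by this simplified sketch.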