User Guide: UTF-8 and Character Encoding
How Phabricator handles character encodings.
Overview
Phabricator stores all internal text data as UTF-8, processes all text data as UTF-8, outputs in UTF-8, and expects all inputs to be UTF-8. Principally, this means that you should write your source code in UTF-8. In most cases this does not require you to change anything, because ASCII text is a subset of UTF-8.
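To see why pure-ASCII files need no changes, here is a minimal Python sketch. It is not part of Phabricator, and the sample strings and the `is_valid_utf8` helper are made up for illustration. It compares the bytes each encoding produces and shows that a Latin-1 byte fails UTF-8 validation:

```python
# Pure-ASCII text produces exactly the same bytes under ASCII and UTF-8.
ascii_source = 'printf("Hello World!\\n");'
assert ascii_source.encode("ascii") == ascii_source.encode("utf-8")

# Non-ASCII characters differ: "caf\u00e9" is "cafe" with an acute accent.
accented = "caf\u00e9"
utf8_bytes = accented.encode("utf-8")      # 5 bytes, the accented letter is 0xC3 0xA9
latin1_bytes = accented.encode("latin-1")  # 4 bytes, the accented letter is 0xE9


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Here `is_valid_utf8(utf8_bytes)` is true and `is_valid_utf8(latin1_bytes)` is false, which is the kind of file the rest of this guide is about.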
If you have a repository with source files that are not encoded as UTF-8, you have two options:
- Convert all files in the repository to ASCII or UTF-8 (see "Detecting and Repairing Files" below). This is recommended, especially if the encoding problems are accidental.
- Configure Phabricator to convert files into UTF-8 from whatever encoding your repository is in when it needs to (see "Support for Alternate Encodings" below). Support for this is incomplete, and repositories that mix multiple encodings are not supported.
Detecting and Repairing Files
It is recommended that you write source files only in ASCII text, but Phabricator fully supports UTF-8 source files.
If you have a project which isn't valid UTF-8 because a few files have random binary nonsense in them, there is a script in libphutil which can help you identify and fix them:
project/ $ libphutil/scripts/utils/utf8.php
Generally, run this script on all source files with "-t" to find files with bad byte ranges, and then run it without "-t" on each file to identify where there are problems. For example:
project/ $ find . -type f -name '*.c' -print0 | xargs -0 -n256 ./utf8.php -t
./hello_world.c
If the script exits without output, you're in good shape: all the files it checked are valid UTF-8. If it lists any files, you need to repair them. You can identify the specific problems in a file by running the script on it without the "-t" flag:
project/ $ ./utf8.php hello_world.c
FAIL  hello_world.c

  3  main()
  4  {
  5    printf ("Hello World<0xE9><0xD6>!\n");
  6  }
  7
This shows the offending bytes on line 5 (in the actual console display, they'll be highlighted). Often a codebase will mostly be valid UTF-8 but have a few scattered files that have other things in them, like curly quotes which someone copy-pasted from Word into a comment. In these cases, you can just manually identify and fix the problems pretty easily.
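If you want the same kind of report from your own tooling, the check can be approximated in a few lines of Python. This is a rough sketch and not how utf8.php is implemented. The function names `render_invalid_bytes` and `find_bad_lines` are hypothetical, and the `<0xNN>` rendering simply imitates the output shown above:

```python
def render_invalid_bytes(line: bytes) -> str:
    """Decode a line as UTF-8, showing each undecodable byte as <0xNN>."""
    out = []
    pos = 0
    while pos < len(line):
        try:
            out.append(line[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start is the offset, within the slice, of the first bad byte.
            bad = pos + err.start
            out.append(line[pos:bad].decode("utf-8"))
            out.append("<0x%02X>" % line[bad])
            pos = bad + 1
    return "".join(out)


def find_bad_lines(data: bytes):
    """Return (line_number, rendered_text) for each line that is not valid UTF-8."""
    bad_lines = []
    # Splitting on the newline byte is safe: 0x0A never occurs inside a
    # multi-byte UTF-8 sequence.
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad_lines.append((number, render_invalid_bytes(line)))
    return bad_lines
```

Calling `find_bad_lines(open(path, "rb").read())` on a file returns an empty list when the whole file is valid UTF-8, which corresponds to the script printing nothing.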
If you have a prohibitively large number of UTF-8 issues in your source code, Phabricator doesn't include any default tools to help you process them in a systematic way. You could hack up utf8.php as a starting point, or use other tools to batch-process your source files.
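As one possible starting point for batch processing, the sketch below rewrites every matching file that is not already valid UTF-8. It assumes all the broken files share a single, known source encoding. The function names, the `*.c` pattern, and ISO-8859-1 in the usage note are illustrative assumptions, not anything Phabricator or libphutil provides:

```python
from pathlib import Path


def convert_file_to_utf8(path: Path, source_encoding: str) -> bool:
    """Rewrite path as UTF-8 unless it is already valid UTF-8.

    Returns True if the file was rewritten, False if it was left alone.
    """
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # already valid UTF-8 (this includes pure ASCII)
    except UnicodeDecodeError:
        pass
    # Assumption: the file really is in source_encoding. Single-byte codecs
    # such as ISO-8859-1 accept every byte, so a wrong guess does not raise;
    # it silently produces the wrong characters.
    text = raw.decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))
    return True


def convert_tree(root: Path, pattern: str, source_encoding: str):
    """Convert matching files under root and return the paths that were rewritten."""
    return [
        path
        for path in sorted(root.rglob(pattern))
        if path.is_file() and convert_file_to_utf8(path, source_encoding)
    ]
```

For example, `convert_tree(Path("."), "*.c", "iso-8859-1")` would rewrite the affected C files in place. Because a wrong encoding guess is not detected, run something like this on a clean checkout and review the resulting diff before committing.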
Support for Alternate Encodings
Phabricator has some support for encodings other than UTF-8.
To use an alternate encoding, edit the repository in Diffusion and specify the encoding to use.
Optionally, you can use the --encoding flag when running arc, or set encoding in your .arcconfig.