Standard Compression Scheme for Unicode

From Wikipedia, the free encyclopedia

The Standard Compression Scheme for Unicode (SCSU) is a Unicode Technical Standard for reducing the number of bytes needed to represent Unicode text, especially text that draws most of its characters from one or a small number of per-language character blocks. It does so by dynamically mapping byte values in the range 128–255 to offsets within particular blocks ("windows") of 128 characters. The initial conditions of the encoder mean that existing ASCII and ISO-8859-1 strings that contain no C0 control codes other than NUL, TAB, CR, and LF can be treated as SCSU strings. Because most alphabets reside in blocks of contiguous Unicode code points, text that uses a small alphabet together with either ASCII punctuation or punctuation that fits within the window for the main alphabet can be encoded at one byte per character (plus setup overhead, which for common languages is often only 1 byte). Most other punctuation can be encoded at 2 bytes per symbol through non-locking shifts. SCSU can also switch to UTF-16 internally to handle non-alphabetic languages.
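
The window mapping can be illustrated with a minimal Python sketch. It is not a complete SCSU encoder: it handles only characters that pass through as ASCII or that fall inside one of the eight initial dynamic windows, and it never redefines a window, quotes a character, or switches to the UTF-16 mode. The function name `encode_initial_windows` is made up for this example, and the window offsets and the `SC0` tag value are assumptions taken from a reading of UTS #6 that should be checked against the standard.

```python
# Initial dynamic window offsets (assumed from UTS #6; verify before relying on them).
INITIAL_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]
SC0 = 0x10  # assumed tag "select dynamic window 0"; SC1..SC7 are 0x11..0x17
PASS_THROUGH_CONTROLS = {0x00, 0x09, 0x0A, 0x0D}  # NUL, TAB, LF, CR


def in_window(cp: int, index: int) -> bool:
    """Return True if code point cp lies in the 128-character window `index`."""
    base = INITIAL_WINDOWS[index]
    return base <= cp < base + 0x80


def encode_initial_windows(text: str) -> bytes:
    """Hypothetical helper: single-byte-mode SCSU using only the initial windows."""
    out = bytearray()
    active = 0  # window 0 is the active window in the initial state
    for ch in text:
        cp = ord(ch)
        if cp in PASS_THROUGH_CONTROLS or 0x20 <= cp <= 0x7F:
            out.append(cp)  # ASCII passes through unchanged
            continue
        if not in_window(cp, active):
            candidates = [i for i in range(8) if in_window(cp, i)]
            if not candidates:
                raise ValueError(f"U+{cp:04X} is outside the scope of this sketch")
            active = candidates[0]
            out.append(SC0 + active)  # locking shift to another window
        # Bytes 0x80..0xFF are offsets into the active window.
        out.append(0x80 + cp - INITIAL_WINDOWS[active])
    return bytes(out)


print(encode_initial_windows("Москва").hex(" "))  # 12 9c be c1 ba b2 b0
```

Under these assumptions the six Cyrillic letters take seven bytes (one tag byte plus one byte per letter), while a Latin-1 string such as "Öl fließt" comes out byte-for-byte identical to its ISO-8859-1 encoding.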

SCSU has not been a resounding success. Few applications need to compress so much Unicode text that it is worth using a special-purpose compression scheme which (so far) does not have widespread support. Treated purely as a compression algorithm, SCSU is inferior to most commonly used general-purpose algorithms for texts longer than a few kilobytes. It can be used as a text encoding, but it can be difficult to handle internally, and the percentage savings of SCSU over UTF-16 or UTF-8 shrink once external compression is applied, dramatically so in the case of bzip2 and other modern compression schemes. SCSU does have the advantage that it can usefully compress texts that are only a few characters long, whereas most full-scale compressors need a few kilobytes of data to break even against their own overhead.
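
The break-even point for very short strings can be checked with a small Python sketch using the standard `zlib` and `bz2` modules. The sample word is arbitrary, and the SCSU figure in the comment is derived by hand (one window-selection tag plus one byte per letter), on the assumption that Cyrillic falls in one of SCSU's initial windows; it is not produced by a library.

```python
import bz2
import zlib

sample = "Москва"  # six Cyrillic letters; arbitrary short test string
utf8 = sample.encode("utf-8")
utf16 = sample.encode("utf-16-be")  # big-endian, no byte order mark

sizes = {
    "utf-8": len(utf8),
    "utf-16": len(utf16),
    "utf-8 + zlib": len(zlib.compress(utf8)),
    "utf-8 + bz2": len(bz2.compress(utf8)),
}

# Hand-derived assumption: SCSU would need 7 bytes for this string.
for name, size in sizes.items():
    print(f"{name:>14}: {size} bytes")
```

On input this small, both general-purpose compressors produce output larger than the 12 bytes they were given, because their stream headers and checksums outweigh any savings.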

Reuters, the organization that floated the first draft of SCSU, is believed to use SCSU internally.
