Mike Slinn

UTF-8 and Scala

— Draft —

Published 2019-08-06. Last modified 2019-08-07.
Time to read: 4 minutes.

Scala supports UTF-8 data and UTF-8 characters in source code. This lecture discusses how to work with UTF-8 and Scala for Mac, Linux and Windows users.

Scala is written in Unicode, specifically UTF-8, according to the rules defined in the Lexical Syntax section of the Scala Language Specification:

Scala programs are written using the Unicode Basic Multilingual Plane (BMP) character set; Unicode supplementary characters are not presently supported.

This chapter defines the two modes of Scala’s lexical syntax, the Scala mode and the XML mode. If not otherwise mentioned, the following descriptions of Scala tokens refer to Scala mode, and literal characters ‘c’ refer to the ASCII fragment \u0000\u007F.

In Scala mode, Unicode escapes are replaced by the corresponding Unicode character with the given hexadecimal code.

UnicodeEscape ::= ‘\’ ‘u’ {‘u’} hexDigit hexDigit hexDigit hexDigit
hexDigit ::= ‘0’ | … | ‘9’ | ‘A’ | … | ‘F’ | ‘a’ | … | ‘f’

To construct tokens, characters are distinguished according to the following classes (Unicode general category given in parentheses):

Whitespace characters. \u0020 | \u0009 | \u000D | \u000A.
Letters, which include lower case letters (Ll), upper case letters (Lu), titlecase letters (Lt), other letters (Lo), modifier letters (Ml), letter numerals (Nl) and the two characters \u0024 ‘$’ and \u005F ‘_’.
Digits ‘0’ | … | ‘9’.
Parentheses ‘(’ | ‘)’ | ‘[’ | ‘]’ | ‘{’ | ‘}’.
Delimiter characters ‘`’ | ‘’’ | ‘"’ | ‘.’ | ‘;’ | ‘,’.
Operator characters. These consist of all printable ASCII characters (\u0020 - \u007E) that are in none of the sets above, mathematical symbols (Sm) and other symbols (So).

Arrows

Scala 2.13 deprecated the use of UTF-8 special characters such as left arrow (), right arrow, (=>), and the right double arrow, also known as the rocket operator, (=>). Multicharacter symbol equivalents are encouraged now: <-, -> and =>.

UTF-8 Characters

Scala has full support for UTF-8, including using UTF-8 in variable and method names. You may find it desirable to use UTF-8 characters in your code. UTF-8 characters are especially useful when writing Domain Specific Languages (DSLs) in Scala. A detailed list of Unicode Hex keycodes is shown in the third column of this page.

The suggested configuration for SBT, discussed in a few lectures, automatically enables UTF-8 characters in the Scala REPL, discussed in the next lecture. If you want to be able to paste in UTF-8 characters into a Scala REPL without using SBT, define the -Dfile.encoding=utf8 Java system variable when invoking Scala. One way of doing that is to set up an alias in ~/.bash_aliases:

~/.bash_aliases
alias scala="scala -Dfile.encoding=utf8"

Linux

Press the Ctrl+Shift+u keys, release, type the four hex digits, and press Enter or Spacebar. On some systems, press and release Shift or Ctrl. For example:

  • To enter a left arrow (), press the Ctrl+Shift+u keys, release, type 2190 and then press Spacebar or Enter.
  • To enter a right arrow (=>), press the Ctrl+Shift+u keys, release, type 2192 and then press Spacebar or Enter.
  • To enter a right double arrow (=>), press the Ctrl+Shift+u keys, release, type 21D2 and press Spacebar or Enter.

Mac

To enable this for Mac:

  1. System Preferences / Keyboard / Input Sources
  2. Ensure Show Input menu in menu bar is enabled
  3. Click +
  4. Browse to the bottom language ("Others")
  5. Select Unicode Hex Input
  6. Click Add.
  7. On the menu bar, find the icon for the language input and change it to Unicode Hex Input

For example:

  • To enter a left arrow (), type Option2190 (hold down Option while typing the four numbers).
  • To enter a right arrow (=>), type Option2192.
  • To enter a right double arrow (=>), type Option21D2.

Windows

BabelMap is much better than the standard character map that comes with Windows, and offers many more features.

BabelMap also correctly finds all Unicode characters (unlike the Windows Character Map). Once you have installed BabelMap:

  1. Open the BabelMap Application (you may want to pin it to the Task Bar and Start menu, and assign it a keyboard shortcut).
  2. Enter the Unicode code in the Go to Code Point box and click Go.
  3. Click on the selected character.
  4. Copy the character to the clipboard with the Copy button, and paste it into the program text that you are editing.

UTF-8 and Backquotes

You can break the naming convention rules by enclosing the rebel string in backquotes. In this example, I define a method whose name (¯⧵_(ツ)_/¯) must be enclosed in backquotes because the characters would not otherwise be parsed as a single token. The method calls sys.error, which is provided by Scala runtime library; the method outputs a message to stderr and terminates the program.

Scala REPL
scala> def `¯⧵_(ツ)_/¯`(msg: String) = sys.error(s"¯⧵_(ツ)_/¯ $msg")
$u00AF$u29F5_$u0028ツ$u0029_$div$u00AF: (msg: String)Nothing 

Notice the REPL shows the hex values of the UTF-8 characters preceded by $u.

  • The first and last characters in the method name are called Macron (¯), and has hex value \u00AF. Macrons are normally used as a diacritic overlay on another character, for example ā, ē, ī, ō, ū and ӯ.
  • The second character is called Reverse Solidus (⧵), and it is a mathematical symbol.
  • The third and seventh characters are underbars, a character known for so many reasons to Scala programmers. These are the only normally permissable characters in the method name.
  • The fourth and sixth characters are open and close parentheses, and are not normally permissable within the name of a Scala method or variable.
  • The fifth character is the katakana letter tu (ツ). It is often used to depict a happy face.

To invoke this awkwardly-named method, you again need to enclose its name in back quotes.

Scala REPL
scala> `¯⧵_(ツ)_/¯`("Oops, I did it again!")
java.lang.RuntimeException: ¯⧵_(ツ)_/¯ Oops, I did it again!
  at scala.sys.package$.error(package.scala:27)
  at .$u00AF$u29F5_$u0028ツ$u0029_$div$u00AF(<console>:15)
  ... 43 elided 

It would be better to give a name to the method that does not require special handling, such as shrug:

Scala REPL
scala> def shrug(msg: String) = sys.error(s"¯⧵_(ツ)_/¯ $msg")
shrug: (msg: String)Nothing 
scala> shrug("Oops, I did it again!") java.lang.RuntimeException: ¯⧵_(ツ)_/¯ Oops, I did it again! at scala.sys.package$.error(package.scala:27) at .shrug(<console>:15) ... 43 elided

* indicates a required field.

Please select the following to receive Mike Slinn’s newsletter:

You can unsubscribe at any time by clicking the link in the footer of emails.

Mike Slinn uses Mailchimp as his marketing platform. By clicking below to subscribe, you acknowledge that your information will be transferred to Mailchimp for processing. Learn more about Mailchimp’s privacy practices.