August 30, 2011
By Drew Barrows
Language Now Helps CIA's Office of the Chief Scientist in its Fight to Address our Nation's Most Pressing Intelligence Needs Quickly and Accurately.
The Central Intelligence Agency's Office of the Chief Scientist provides strategic leadership, coordination, and expertise to support scientific innovation to meet our nation's most pressing intelligence challenges. To address these challenges, the Office of the Chief Scientist, in collaboration with Northrop Grumman Corporation, developed Language Now, an automated natural language processing tool to assist analysts in identifying high-value electronic documents located on hard drives, portable media, and in other data repositories.
Language Now is a web-based integration of commercial-off-the-shelf (COTS) and government-off-the-shelf (GOTS) natural language processing tools. To create it, the Office of the Chief Scientist integrated numerous technologies, among them AppTek’s machine translation, automatic speech recognition, and human language processing, and NovoDynamics’ advanced optical character recognition (OCR) and software capture technology.
The CIA’s Office of the Chief Scientist has taken the Language Now system under its wing, incorporating its own internal research into natural language processing (NLP) to meet advanced US government requirements. The CIA’s initial research project, originally named Arabic Now, focused on integrating all of the NLP tools (file type identification, language identification, optical character recognition [OCR], translation, etc.) needed to process Arabic-script languages.
Optical Character Recognition (OCR)
NovoDynamics, an In-Q-Tel portfolio company, was a natural fit for the solution because it specializes in recognizing and processing large volumes of Arabic, Dari, Farsi/Persian, and Pashto documents. The company’s OCR technology gave the CIA high-accuracy results, even on yellowed pages or documents with stains and smudges. The system was such a success that the Office of the Chief Scientist embarked on additional projects focused on other languages: Chinese, dubbed Chinese Now, and Russian, named Russian Now.
Speaking on behalf of officers from CIA’s Office of the Chief Scientist in its Directorate of Science and Technology (DS&T), CIA spokesperson Preston Golson shared answers provided by the technical minds behind the Language Now project. “After realizing that the environment supported dozens of languages, our scientists just started calling the system Language Now, which became its official name,” said Golson. “Documents in all languages are being processed daily and there really isn’t one language that is more popularly used than another. It varies day to day.”
The Language Now system is simple to use and doesn’t require any advanced computer skills or language proficiency. “Language Now is user-friendly. There are manuals to help both new and power users,” said Golson. There is a full manual online, along with context-sensitive help. Additionally, advanced and configurable features go beyond the default settings to meet the needs of “power users.”
Accessed from Multiple Devices and Locations
Language Now can be hosted on enterprise servers or standalone laptops and used from Android mobile devices. Whether managed from a central location or deployed in the field at multiple locations, it integrates seamlessly with other systems such as automatic speech recognition, translation memory tools, and third-party applications. “Language Now has been integrated with other systems on both the front end and back end and has worked with third party applications via the Language Now application programmer interfaces (APIs),” said Golson. “Also, Language Now is a workflow for document processing, and accepts and processes Microsoft Excel or Word files, HTML, PDF files and many other types.”
After years of careful design planning and extensive experimentation, Language Now can be used with ease by a single user, a team, or an entire organization. “There are a number of management and reporting features, which are designed to support team use,” Golson added. Language Now helps analysts identify foreign language documents of interest among hundreds of thousands of gathered documents. Time is always of the essence, and Language Now has proven to be not only accurate but also extremely fast. Golson continues, “On a single server, Language Now can translate at over 100,000 words per minute (WPM), and on a recent model laptop, between 8,500 and 10,000 WPM.”
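To put the quoted translation rates in perspective, the short Python sketch below converts them into wall-clock time for a batch of documents. Only the words-per-minute figures come from the article; the batch size and average document length are hypothetical values chosen for illustration, and the calculation ignores OCR, file handling, and any other processing overhead.

```python
# Translation rates quoted in the article, in words per minute (WPM).
SERVER_WPM = 100_000
LAPTOP_WPM_LOW = 8_500
LAPTOP_WPM_HIGH = 10_000


def hours_to_translate(num_documents: int, avg_words_per_doc: int,
                       words_per_minute: int) -> float:
    """Return the hours needed to translate a batch at a constant rate."""
    total_words = num_documents * avg_words_per_doc
    minutes = total_words / words_per_minute
    return minutes / 60


# Hypothetical workload (NOT from the article): 200,000 documents
# averaging 500 words each.
NUM_DOCS = 200_000
AVG_WORDS = 500

server_hours = hours_to_translate(NUM_DOCS, AVG_WORDS, SERVER_WPM)
laptop_fast = hours_to_translate(NUM_DOCS, AVG_WORDS, LAPTOP_WPM_HIGH)
laptop_slow = hours_to_translate(NUM_DOCS, AVG_WORDS, LAPTOP_WPM_LOW)

print(f"Single server: about {server_hours:.1f} hours")
print(f"Laptop: about {laptop_fast:.0f} to {laptop_slow:.0f} hours")
```

Under these assumed numbers, the same batch that occupies a single server for less than a day would keep a laptop busy for roughly a week, which is why the choice of deployment matters when time is short.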
Language Now processes high-value electronic documents that often come from a variety of devices. “When a hard drive, laptop, CD or USB key is found containing foreign language files, analysts can use Language Now to evaluate material that before would take a long time to translate,” said Golson.
Analysts can view structured data such as spreadsheets or scientific and technical publications and determine if they contain critical data. “Important documents come in many forms,” continued Golson. “Whether they may be spreadsheets, or scientific or technical information, there’s always a need to be prepared to convert any data into a form that analysts can use.”
An Integrated System Design
Language Now is an integrated system that uses powerful technologies such as those of NovoDynamics and AppTek, but it is designed so that the user interface, processing workflow, and service provider layers are abstracted, independent of one another, and not dependent on any particular vendor’s product capabilities. This product independence, combined with the scientists’ interest in continuing to enhance the platform, turned their attention to potentially using Language Now within a cloud computing environment. According to Golson, “Language Now is compatible with cloud computing concepts. Our scientists have explored hosting it in a cloud environment.” The CIA’s thinking on adopting a cloud approach is that it can potentially make IT environments more flexible and secure.
Like most emerging technologies, cloud computing offers compelling advantages, such as higher hardware utilization with simplified centralized administration, but it can also have disadvantages, such as lower absolute performance. Golson agrees: “While we can improve system flexibility with these techniques, the virtualization approach at the heart of most cloud computing implementations could significantly impact translation speed.” The scientists found that the highly decoupled Language Now internal architecture is very compatible with cloud computing and are exploring hosting the machine translation component of Language Now within a cloud environment.
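The decoupling described above, in which the processing workflow does not depend on any particular vendor's product, can be illustrated with a small Python sketch. This is only an illustration under stated assumptions: the names `LanguageIdentifier`, `Translator`, `Workflow`, and `process` are hypothetical and are not taken from the Language Now APIs, and the stub providers stand in for real vendor engines.

```python
from typing import Dict, Protocol


class LanguageIdentifier(Protocol):
    """Hypothetical service-provider interface for language identification."""
    def identify(self, text: str) -> str: ...


class Translator(Protocol):
    """Hypothetical service-provider interface for machine translation."""
    def translate(self, text: str, source_lang: str) -> str: ...


class StubLanguageIdentifier:
    """Toy provider: flags any text containing characters from the Unicode
    Arabic block as "ar". A real identifier would also distinguish Farsi,
    Dari, Pashto, and other Arabic-script languages."""
    def identify(self, text: str) -> str:
        if any("\u0600" <= ch <= "\u06FF" for ch in text):
            return "ar"
        return "en"


class StubTranslator:
    """Toy provider: word-for-word glossary lookup, standing in for a
    vendor machine translation engine."""
    def __init__(self, glossary: Dict[str, str]) -> None:
        self.glossary = glossary

    def translate(self, text: str, source_lang: str) -> str:
        return " ".join(self.glossary.get(word, word) for word in text.split())


class Workflow:
    """Processing layer that knows only the interfaces, never the vendors."""
    def __init__(self, identifier: LanguageIdentifier,
                 translators: Dict[str, Translator]) -> None:
        self.identifier = identifier
        self.translators = translators

    def process(self, text: str) -> Dict[str, str]:
        lang = self.identifier.identify(text)
        if lang == "en":
            return {"language": lang, "english": text}
        translator = self.translators.get(lang)
        if translator is None:
            raise LookupError(f"no translator registered for {lang!r}")
        return {"language": lang, "english": translator.translate(text, lang)}


# Usage: "\u0633\u0644\u0627\u0645" is the Arabic word "salam".
workflow = Workflow(
    StubLanguageIdentifier(),
    {"ar": StubTranslator({"\u0633\u0644\u0627\u0645": "peace"})},
)
print(workflow.process("\u0633\u0644\u0627\u0645"))
```

Because `Workflow` depends only on the two interfaces, a provider could in principle be swapped for another vendor's engine or for a remote service hosted in a cloud environment without changing the workflow code.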
Designed using innovative algorithms, technologies and linguistic resources, Language Now today manages machine translation, optical character recognition, information extraction, semantic analysis, and certain aspects of text summarization and document clustering. Document clustering is a technique for unsupervised document organization, automatic topic extraction and fast information retrieval, and the government uses it to automatically organize search results into categories. For example, if a user submits “immigration”, next to their list of results they will see categories for “Immigration Reform”, “Citizenship and Immigration Services”, “Employment”, “Department of Homeland Security”, and more. Additionally, Probabilistic Latent Semantic Analysis (PLSA), a statistical technique for the analysis of two-mode and co-occurrence data, can also be used to perform document clustering.
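As a rough illustration of how PLSA can be used for document clustering, the Python sketch below fits a small PLSA model with the expectation-maximization (EM) algorithm and labels each document with its most probable latent topic. It is a minimal teaching example, not the method used inside Language Now: the function names `plsa` and `cluster`, the toy documents, and the rule of assigning each document to its highest-probability topic are all assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple


def plsa(docs: List[List[str]], num_topics: int, iterations: int = 50,
         seed: int = 0) -> Tuple[List[List[float]], List[Dict[str, float]]]:
    """Fit P(topic | doc) and P(word | topic) to tokenized docs with EM."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    counts = [Counter(doc) for doc in docs]  # word counts n(d, w)

    def normalize(values: List[float]) -> List[float]:
        total = sum(values)
        return [v / total for v in values]

    # Random, strictly positive starting distributions.
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in docs]
    p_w_z = [dict(zip(vocab, normalize([rng.random() + 0.1 for _ in vocab])))
             for _ in range(num_topics)]

    for _ in range(iterations):
        acc_w_z = [dict.fromkeys(vocab, 0.0) for _ in range(num_topics)]
        acc_z_d = [[0.0] * num_topics for _ in docs]
        for d, doc_counts in enumerate(counts):
            for word, n in doc_counts.items():
                # E-step: P(topic | doc, word) is proportional to
                # P(topic | doc) * P(word | topic).
                joint = [p_z_d[d][z] * p_w_z[z][word]
                         for z in range(num_topics)]
                total = sum(joint)
                if total == 0.0:
                    continue
                for z in range(num_topics):
                    weight = n * joint[z] / total
                    acc_w_z[z][word] += weight
                    acc_z_d[d][z] += weight
        # M-step: renormalize the expected counts into probabilities.
        for z in range(num_topics):
            total = sum(acc_w_z[z].values())
            if total > 0.0:
                p_w_z[z] = {w: v / total for w, v in acc_w_z[z].items()}
        for d in range(len(docs)):
            total = sum(acc_z_d[d])
            if total > 0.0:
                p_z_d[d] = [v / total for v in acc_z_d[d]]
    return p_z_d, p_w_z


def cluster(docs: List[List[str]], num_topics: int, **kwargs) -> List[int]:
    """Label each document with the index of its most probable topic."""
    p_z_d, _ = plsa(docs, num_topics, **kwargs)
    return [max(range(num_topics), key=lambda z: dist[z]) for dist in p_z_d]


# Toy corpus (hypothetical): two immigration-themed and two nuclear-themed docs.
toy_docs = [
    ["visa", "border", "visa", "citizenship"],
    ["border", "citizenship", "visa"],
    ["reactor", "uranium", "reactor"],
    ["uranium", "centrifuge", "reactor"],
]
print(cluster(toy_docs, num_topics=2))
```

EM can settle in a local optimum, so results depend on the random starting point. A practical system would typically run several restarts, use far larger corpora, and apply smoothing or tempering.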
Exponentially Growing ROI
The CIA’s Office of the Chief Scientist was able to achieve a sizable ROI. According to Golson, “Yes, it exceeded expectations. Language Now returned more than ten times the original investment in its early years, and continues to provide additional value on a daily basis.”
The overall goal in developing automated tools is to assist the analysts in judging, quickly and accurately, the relevance of individual documents in either English or one or more foreign languages, and also within clusters of topically related documents. The team can quickly identify specific facts or other items of interest from single documents or sets of documents.
The Office of the Chief Scientist is also reaching its longer-term goal of enabling its analysts to perform accurate information analysis from an automatically produced English-language translation, or from an automatically produced, condensed English-language textual or non-textual rendition (summary) of a document or document cluster, instead of from the foreign language original(s). This saves a great deal of time while providing a high level of accuracy.
Language Now has surpassed the CIA’s expectations in being able to fully automate important aspects of its information extraction and intelligence gathering process in acquiring information from text documents and other natural language-based sources. Government information analysts have come to rely on the sophisticated technologies supporting Language Now and are able to fulfill their analytic tasks with confidence.