Senzing Globalization Guide
What Languages Does Senzing Support?
Senzing uses UTF-8 encoding, which allows most of the world's languages to be captured and processed correctly. Beyond ingesting and storing data, Senzing analytics take domain, culture, and cross-script differences into account for comprehensive global entity resolution. Senzing natively supports cross-script comparisons across many languages and writing systems, and its entity-centric learning lets it discover attribute variations (including script variations) even when attributes cannot be matched in their original forms.
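Because everything depends on the data arriving as valid UTF-8, it can help to serialize records explicitly rather than rely on platform defaults. The following is a minimal sketch, not an official loading procedure. The record layout and the attribute names (`DATA_SOURCE`, `RECORD_ID`, `NAME_FULL`) are assumptions chosen for illustration, and `to_utf8_json` is a hypothetical helper.

```python
import json

# Hypothetical record; the attribute names are assumptions for illustration.
record = {
    "DATA_SOURCE": "CUSTOMERS",
    "RECORD_ID": "1001",
    "NAME_FULL": "محمد حسن الشمري",
}

def to_utf8_json(rec: dict) -> bytes:
    """Serialize a record to UTF-8 encoded JSON bytes.

    ensure_ascii=False keeps native-script characters as-is instead of
    rewriting them as \\uXXXX escape sequences.
    """
    return json.dumps(rec, ensure_ascii=False).encode("utf-8")

payload = to_utf8_json(record)

# Round trip: decoding the bytes as UTF-8 must give back the same record.
assert json.loads(payload.decode("utf-8")) == record
```

When reading source files, passing `encoding="utf-8"` to `open()` avoids depending on the operating system's default code page.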
Advanced Personal Name Comparisons
Supported Cultural Groups
Personal names present unique challenges in global entity resolution. Senzing leverages IBM's InfoSphere Global Name Management for culturally-aware name comparison. This world-class name library uses spelling patterns and country-of-association information to determine cultural provenance and optimize matching strategies.
Primary Cultural Groups
Southwest Asian
Culture | Original Script | Transliteration |
---|---|---|
Afghan | افغان احمد | Afghan Ahmad |
Arabic | محمد حسن الشمري | Mohammed Hassan Al-Shamri |
Farsi | علی رضا | Ali Reza |
Pakistani | محمد علی | Muhammad Ali |
European
Culture | Original Script | Transliteration |
---|---|---|
Anglo | Standard Latin script | |
French | François Müller | Francois Mueller |
German | Björn Müller | Bjorn Muller |
Hispanic | José García | Jose Garcia |
Han
Culture | Original Script | Transliteration |
---|---|---|
Chinese | 王小明 | Wang Xiaoming |
Korean | 김민수 | Kim Min-su |
Vietnamese | Nguyễn Văn An | Nguyen Van An |
Additional Cultural Support
Culture | Original Script | Transliteration |
---|---|---|
Indian | राम कुमार शर्मा | Ram Kumar Sharma |
Indonesian | Standard Latin script with diacritics | |
Japanese | さとう ひろし | Satou Hiroshi |
Polish | Łukasz Kowalski | Lukasz Kowalski |
Portuguese | João da Silva | Joao da Silva |
East Slavic (Ukrainian, Belarusian, Russian) | Александр Петров | Aleksandr Petrov |
Turkish | Mustafa Özkan | Mustafa Ozkan |
Yoruban | Adébáyọ̀ Ọlátúndé | Adebayo Olatunde |
Generic | (catch-all for other cultures) | |
Japanese Kanji is not directly handled by Senzing and is treated as Chinese Hanzi when provided.
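Kanji and Hanzi share the CJK Unified Ideographs range in Unicode, so a name written only in Kanji carries no character-level signal that it is Japanese. The sketch below shows one way to flag such names in your own data before loading. It is an illustration only and does not describe how Senzing classifies names internally; `scripts_in` and its labels are hypothetical.

```python
import unicodedata

def scripts_in(text: str) -> set:
    """Return coarse script labels for the non-space characters in text."""
    found = set()
    for ch in text:
        if ch.isspace():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CJK UNIFIED IDEOGRAPH"):
            found.add("HAN")       # shared by Chinese Hanzi and Japanese Kanji
        elif name.startswith("HIRAGANA"):
            found.add("HIRAGANA")
        elif name.startswith("KATAKANA"):
            found.add("KATAKANA")
        elif name.startswith("HANGUL"):
            found.add("HANGUL")
        elif name.startswith("LATIN"):
            found.add("LATIN")
        else:
            found.add("OTHER")
    return found

def is_han_only(text: str) -> bool:
    """True when every non-space character is a CJK unified ideograph."""
    return scripts_in(text) == {"HAN"}

# A Kanji-only Japanese name and a Chinese name look the same at this level.
print(is_han_only("佐藤 博"), is_han_only("王小明"), is_han_only("さとう ひろし"))
```

Names flagged by `is_han_only` are the ones that would be handled as Chinese Hanzi. If a kana reading is available in the source data, supplying it alongside the Kanji form may be worth testing.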
Organizational Names
Senzing provides robust same-script organizational name matching across many writing systems and languages.
Same-Script Organizational Name Matching Examples
Script | Examples |
---|---|
Arabic Script | الشركة السعودية للصناعات الأساسية ↔ الشركه السعوديه للصناعات الاساسيه |
Cyrillic Script | ООО “Газпром” ↔ Общество с ограниченной ответственностью “Газпром” |
Latin Script with Diacritics | Société Générale ↔ Societe Generale; Volkswagen Aktiengesellschaft ↔ Volkswagen AG |
Japanese Script | トヨタ自動車株式会社 ↔ トヨタ自動車 |
Korean Script | 삼성전자주식회사 ↔ 삼성전자 |
Chinese Script | 中国石油天然气集团公司 ↔ 中国石油天然气集团 |
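To see why a pair such as Société Générale and Societe Generale is not a byte-for-byte match, it can help to look at the names after Unicode decomposition. The sketch below uses generic diacritic folding from the Python standard library. It is a way to inspect test data, not a description of Senzing's matching logic, and `fold_diacritics` is a hypothetical helper.

```python
import unicodedata

def fold_diacritics(text: str) -> str:
    """Strip combining marks after NFKD decomposition.

    Letters that have no decomposition in Unicode (for example the
    Polish letter with stroke in "Łukasz") pass through unchanged.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# The two spellings differ as raw strings but agree once folded.
a, b = "Société Générale", "Societe Generale"
print(a == b, fold_diacritics(a) == b)
```

Folding is lossy and language-specific rules differ (German "ü" is often written "ue", for instance), so this kind of check is best kept to diagnostics rather than applied to the data that is loaded.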
CJK+English Cross-Script Matching (New in v4)
Senzing v4 introduces native cross-script matching between CJK (Chinese, Japanese, Korean) and English organizational names without requiring reference data.
CJK+English Cross-Script Matching Examples
CJK | English |
---|---|
中國銀行股份有限公司 | Bank of China |
토요타 자동차 | Toyota Motor Corporation |
ソニー株式会社 | Sony Corporation |
阿里巴巴集团 | Alibaba Group |
삼성전자 | Samsung Electronics |
For other cross-script language combinations, robust matching of organizational names may still require reference data containing multiple versions of names, as there is no consistency in how organizations handle name translation across scripts. Some organizations represent names phonetically (transliterate), some translate (or translate parts), and some organizations rebrand when moving into new markets/scripts. For these scenarios, data providers or services that offer organizational name enrichment can be beneficial.
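Reference data of this kind usually ends up as a single record that carries every known version of the name. The following is a rough sketch of assembling such a record. The `NAMES` list and the `NAME_ORG`, `DATA_SOURCE`, and `RECORD_ID` attribute names are assumptions for illustration, `build_org_record` is a hypothetical helper, and the data source and record ID values are made up.

```python
import json

def build_org_record(data_source: str, record_id: str, names: list) -> str:
    """Return a JSON record listing every known name variant of one organization."""
    # Drop blanks and exact duplicates while preserving the original order.
    seen = set()
    unique = []
    for name in names:
        cleaned = name.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    if not unique:
        raise ValueError("at least one organization name is required")
    record = {
        "DATA_SOURCE": data_source,
        "RECORD_ID": record_id,
        "NAMES": [{"NAME_ORG": n} for n in unique],
    }
    return json.dumps(record, ensure_ascii=False)

# Arabic name plus its English form, as an enrichment provider might supply them.
print(build_org_record(
    "REFERENCE",
    "ORG-1",
    ["الشركة السعودية للصناعات الأساسية", "Saudi Basic Industries Corporation"],
))
```

Keeping the variants on one record, rather than loading them as separate records, means the relationship between the two scripts is stated up front and does not have to be discovered.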
Enhanced Address Comparisons
Senzing provides cross-script matching capabilities for addresses. Starting in v4, native cross-script matching between CJK (Chinese, Japanese, Korean) and English addresses is supported without requiring reference data, representing a major improvement for global address resolution.
Address Matching Examples
CJK+English Cross-Script Matching (New in v4):
CJK | English |
---|---|
710000陕西未央区西安凤城十路118 | 118, Fengcheng 10th Road, Xi’an, Weiyang District, Shaanxi 710000 |
〒540-0002大阪府大阪市中央区1-1 | 1-1 Chuo-ku, Osaka, Osaka 540-0002 |
上海市浦东新区陆家嘴环路1000号 | 1000 Lujiazui Ring Road, Pudong New Area, Shanghai |
For other language combinations, addresses can be challenging for entity resolution as they tend to have many data quality issues. Senzing has capabilities to handle native scripts for addresses, with the most effective processing occurring in native-to-native (rather than native-to-Romanized) scenarios.
For cross-script address comparison in non-CJK languages, using an address hygiene product to Romanize addresses and providing both native script and Romanized versions to Senzing can improve matching accuracy.
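One way to supply both versions is to run each native-script address through the hygiene product and attach the result to the same record. The sketch below assumes the Romanizer is available as a plain function. The `romanize` callable, the lookup table that stands in for it, the sample address, and the `ADDR_FULL` and `ADDRESSES` attribute names are all hypothetical and used only for illustration.

```python
from typing import Callable

def with_romanized_address(record: dict, romanize: Callable[[str], str]) -> dict:
    """Return a copy of record that carries native and Romanized addresses."""
    native = record["ADDR_FULL"]
    out = {k: v for k, v in record.items() if k != "ADDR_FULL"}
    addresses = [{"ADDR_FULL": native}]
    romanized = romanize(native)
    # Only add the Romanized form when the tool produced something different.
    if romanized and romanized != native:
        addresses.append({"ADDR_FULL": romanized})
    out["ADDRESSES"] = addresses
    return out

# Stand-in for an external address hygiene product (illustrative lookup only).
_DEMO_TABLE = {"ул. Тверская, д. 1, Москва": "ul. Tverskaya, d. 1, Moskva"}

def demo_romanize(address: str) -> str:
    return _DEMO_TABLE.get(address, address)

source = {"DATA_SOURCE": "CUSTOMERS", "RECORD_ID": "2001",
          "ADDR_FULL": "ул. Тверская, д. 1, Москва"}
print(with_romanized_address(source, demo_romanize))
```

Passing the Romanizer in as a parameter keeps the record-building step independent of whichever product is used, and addresses that are already in Latin script come back with a single entry.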
Getting Started
Senzing provides comprehensive globalization capabilities out of the box. The breakthrough CJK+English cross-script matching capabilities introduced in v4 for both organizational names and addresses require no additional configuration; simply upgrade to v4 to benefit from these improvements.
For the rare cases that require cultural tuning or advanced cross-script handling, contact Senzing Support for guidance on optimizing the configuration for your use case.
If you have any questions, contact Senzing Support. Support is 100% FREE!