Classifying Ethnicity: Coding and Comparing Ethnic Information
How the Peoples and Languages Codes of the Harvest Information System Facilitate a Broader Knowledge Base of World Ethnicity
Dr. Orville Boyd Jenkins
Until recent years there was no standard referencing system for the peoples of the world. For languages, a codeset had developed over several decades. Research agencies wanted something similar for peoples: a standard reference that any data set of human ethnicities could use to classify its entries and to compare and exchange information with other collections.
Ethnologue as a Language Classification Standard
The most widely referenced language codeset in recent years has been that developed by SIL International, formerly Summer Institute of Linguistics. The summary descriptions of world languages and codes of reference for them are published in an encyclopedic reference, online in recent years, called the Ethnologue.
There have been numerous alternative systems for accounting for the speech forms of the world, commonly called languages and dialects. There are great debates going on between different linguists, schools of linguistics and systems of classification in the linguistic world! The more speech forms we discover, the greater the debate about how to account for them!
The Ethnologue is not the only language classification and coding system. But it has become the largest and most widely accepted. This position of primacy and authority was confirmed when the codeset published in the Ethnologue was adopted by the International Organization for Standardization (ISO) as the ISO standard for classifying languages of the world.
Because of its early association with literacy and Bible translation worldwide, it has long been the codeset or classification system for languages commonly used in Christian mission circles. Anthropological approaches have been used in mission strategies to learn and understand cultures as the basis of communication and relationship-building.
The integrity and quality of the SIL team and its compilation of the Ethnologue, together with the Ethnologue's worldwide support and academic reputation, made it the de facto standard. This all took time.
In 2007, the Ethnologue's classification and coding system for world languages became the world standard for the ISO.
Registry of Peoples
Nothing like this coding system had ever been developed for the peoples of the world. There was no correlation between various databases of ethnicities. A vision developed in the 1990s for a common coding system that would enable databases to compare and exchange information on peoples of the world in a manner similar to what the Ethnologue did for languages.
The Registry of Peoples (ROP) was envisioned and initial steps taken by an informal consortium of research agencies. The Joshua Project undertook to compile contributions to this. They encountered an enormous task due to the extreme variation of data within the contributing data sets and their formats.
The various databases included different facts, details, perspectives and purposes. They varied in the particular methods of classification and the factors each database chose for distinctions it made.
Some source data sets were very simple and top-level, others very detailed down to a local and village level. Some were only regional or otherwise specialized in focus. Some made reference to existing academic and anthropological definitions and formats, while others were ad hoc.
Discussion and Exchange
A partnership of several world research agencies worked out guidelines, definitions and database protocols for the registry, correlated to the format of the Ethnologue codeset for languages. SIL was one of the partners who served as Stewards in the coalition named Harvest Information System (HIS).
The HIS Stewards guided and coordinated the development and maintenance of the ROP and various other codesets developed and coordinated to enable the clarification and exchange of information across agencies and databases of different formats. The goal was to define and manage codesets for various types of data that might be thus correlated as needed or desired.
A procedure and standards were established for registries of codes for the various and possible data sets. The anticipated codeset for peoples was designated the Registry of Peoples (ROP). HIS Stewards designated the already-existing SIL/Ethnologue codeset for languages as the Registry of Languages (ROL).
In late 2001, the proto registry list database was handed over to me as the first designated Editor of the new Registry of Peoples to examine, correlate and rationalize the list of peoples and their coding. This entailed primarily verification research and elimination of duplicates among the disparate collection.
A Standard Codeset
By 2005, there were several cross-correlation projects going on between various databases comparing and correlating their data and reference codes, using the ROP codeset as the matching point for identification. Since that time other such projects have been added to this process.
The various research networks, field research sources and database representations of ethnicities of the world are now sharing a more common base, and exchanging information at the highest level ever!
Discussions have thus been generated, which have led to greater awareness of disparate views of various ethnic information systems, and interpretation and revision of ethnic databases. This is the purpose of the Registry of Peoples codeset. Over the seven years I served as the Editor of the Registry of Peoples, I saw the ROP codeset being used as a linking protocol in more and more places. It is an open codeset, available to any database or data manager.
Usage is the Key
One of the factors in establishing a reference codeset is simply its use by more and more users. We see in the case of the Ethnologue a process of several decades! The profile and usage of the Ethnologue/SIL codeset was very low-key until the broader sharing of data that developed in the early 1990s. This was facilitated by the whole Ethnologue database being placed on the Internet, becoming the primary available reference system attempting to account for the human speech forms of the world.
The SIL concept of codes arose originally, like the ROP, out of the mission context, related, of course, to Bible translation. It gradually came to be used in broader mission contexts. Recently the Ethnologue codeset has gained currency in broader world circles.
Only since the 1990s have the Ethnologue codes been used at large, mainly due to the publication of the Ethnologue on the Internet. One Catholic agency (Christus Rex) had independently published the Ethnologue 13 codes online, in a page-to-page country format, before SIL itself went online with Edition 14 in 2000.
The Ethnologue codes have only recently been used outside the immediate mission community. And it has become the standard only gradually, as more and more agencies keyed to the codes SIL provided.
That is exactly analogous to the ROP, as the ROP lacks only the historical depth. But in terms of usage and partners using the codes, the ROP has gained faster acceptance, as most agencies were already using our codes in some form or other by 2005.
Comparatively, for its age, the ROP is accepted more widely on Peoples than the Ethnologue has been on Languages, until its recent status of ISO standard. And even now there are competing systems to the ISO/Ethnologue standards. It is interesting to take a peek into some of the discussion forums on language classification!
Could a database just simplify matters by using the ROP people codes as the key Field for each entity? Yes, this would provide an independent reference field related to a similar designation for similar records in other databases. The codes in this field thus remain a constant in comparing and referencing information across data sets.
Keys and Codes
In my consultation with various world agencies gathering and compiling ethnic data, I find the common database practice is to use a Key Field as the internal reference. This is often a generated count of records.
Where an agency has attempted to use the codeset (the ROP or ROL codes) as the Key, problems are encountered in the normal editing process. When investigation indicates that two entities are actually the same and the two are merged, complications arise if the Key Field was also the reference code used by other databases.
If a separate field was assigned for the ROP people code, that field would not change, no matter what edits were done to the entity. After an edit where two entities were determined to represent the same ethnic group, after one had been removed, the ROP code on the remaining entity could be checked and confirmed as the same assigned to the corresponding entity in the Registry of Peoples.
In this procedure, the ROP code introduces no consideration into the edit of the data within that database. Factors of data management, ethnic definitions of that agency, and other internal factors are not compromised. The ROP code field can be checked and maintained separately, and it is never lost in any edit to an entity in that database.
If the database uses only the ROP code as the identifier and Key Field, any edit that requires assignment of a new ID number would change the ROP code, losing correlation with the entities in the ROP. This complicates correlation rather than simplifying it.
When one of the duplicate entities is deleted, the code that had been assigned is lost with the removal of the entity, and continuity can be lost, with confusion resulting. Realignment of entities within a related cluster is even harder to handle.
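The difference between the two arrangements can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Entity` record, its field names and the code value `P0001` are hypothetical placeholders, not actual ROP codes or the structure of any real agency database.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entity:
    entity_id: int                  # internal Key Field, generated by this database
    name: str
    rop_code: Optional[str] = None  # separate reference field (placeholder values)
    notes: List[str] = field(default_factory=list)


def merge_duplicates(db: Dict[int, Entity], keep_id: int, drop_id: int) -> Entity:
    """Fold the record drop_id into keep_id, then remove drop_id."""
    keep = db[keep_id]
    drop = db.pop(drop_id)
    keep.notes.extend(drop.notes)
    keep.notes.append(f"merged from entity {drop.entity_id} ({drop.name})")
    # The reference code is copied over only if the survivor has none.
    if keep.rop_code is None:
        keep.rop_code = drop.rop_code
    return keep


db = {
    1: Entity(1, "Malayali", "P0001"),
    2: Entity(2, "Malayalam", "P0001", notes=["language name used as a people name"]),
}
merged = merge_duplicates(db, keep_id=1, drop_id=2)
print(merged.entity_id, merged.rop_code, merged.notes)
```

Because the internal key and the reference code live in different fields, removing record 2 costs nothing: the surviving record keeps both its own ID and the shared code, which can then be checked against the registry.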
Codesets and Correlations
However, if the codeset is a separate field in the data record, correlation is easier to retain, even when various individual ethnic entries are deleted, merged or separated into two or more, as new research indicates.
The codeset stands aside as a reference for correlation. Entities in any database are not required to align with entities in any other databases, but codes can be assigned that relate one or more entities in one database to one or more in another. This leaves the data in each knowledge base independent, enabling retention of internal integrity and differing views or purposes of the particular database.
Comparisons and exchanges of information are better facilitated when each independent data or research source is compared to a third, common reference point. This is how the codesets work. Where codes are missing or entities overlap, this new comparative view points out a different view of the entity or ethnicities involved and indicates directions of further clarification and data-gathering.
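As a rough sketch of such a comparison, the Python below indexes two independent lists by a shared reference code and reports where they agree, overlap or have gaps. The record IDs and the codes `P0001` to `P0003` are invented placeholders, and the function names are assumptions made for this illustration only.

```python
from collections import defaultdict


def index_by_code(records):
    """Map each reference code to the internal IDs that carry it."""
    index = defaultdict(list)
    for internal_id, code in records:
        if code is not None:  # uncorrelated entities are simply skipped
            index[code].append(internal_id)
    return index


def compare(db_a, db_b):
    """Compare two databases through the common codeset only."""
    a, b = index_by_code(db_a), index_by_code(db_b)
    shared = {code: (a[code], b[code]) for code in sorted(a.keys() & b.keys())}
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    return shared, only_a, only_b


# Each database keeps its own internal IDs; only the code field is common.
db_a = [("A-1", "P0001"), ("A-2", "P0002"), ("A-3", "P0002"), ("A-4", None)]
db_b = [("B-10", "P0002"), ("B-11", "P0003")]

shared, only_a, only_b = compare(db_a, db_b)
print(shared)   # codes used by both, with the entities on each side
print(only_a)   # codes found only in the first database
print(only_b)   # codes found only in the second database
```

Here two entities in the first list map to one entity in the second under the same code. Neither database has to change its own records to see that; the overlap just becomes a point for discussion.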
Each separate database can then independently review and revise as helpful, then recompare. Often very productive discussions ensue between research teams or agencies as differences in the data views uncover different views of the ethnic factors involved.
Remember that the purpose of any classification system or database is to account for ethnic information gathered and analyzed SO FAR. So it is not only reasonable, but actually desirable, that variant views arise from different research sources.
These views then can be compared and discussed, leading to a broader shared knowledge. Reanalysis can then be more commonly performed. The codesets might also be adjusted to account for new findings or discoveries arising out of such comparisons and exchanges.
But both codesets and research databases are always going to be in flux, if anything is happening there!
The changes that occur become input to all other partners using the same codeset, and updates can be managed by each data manager or research team independently. The viewpoint or updating process of one dataset or database does not impinge on the other. Rather, the ROP ethnic codeset remains the comparison point. Codes are assigned based on the view and definitions of the ethnicities in the ROP codeset.
As ethnic information and databases change in light of new research or analysis, the code assignments may shift, but the codeset can still be applied to any single or multiple entities defined in any ethnic database, as helpful to indicate where the entities and information diverge or overlap.
The codeset, likewise, is always undergoing review and updates, to consistently reflect the varied information arising out of the extremely numerous and disparate efforts at research and analysis.
From what I have seen, a correlation code is normally a separate reference field which serves as an independent interpreter between related records in the same or in different databases. The ROP code is an example.
If a database uses only the ROP codes as the key field, it would seem to restrict the flexibility and editability of that database as an independent dataset. It seems if you did that, you would thereby automatically be undertaking the obligation to adjust your data entities, every one of them, to match the ROP. This seems counter-productive, because any credible database would keep editing and updating its own data.
Using an internal proprietary key as an entity ID, records in a database can be internally referenced, organized and edited in reference to the internal code. This leaves their own referencing code independent of the ROP as a reference code, and actually provides a second internal cross-check for entity comparison or integrity.
The Name Nightmare
For decades research departments, agencies and strategists wrestled with the tangle caused by comparing entities by name. Attempts to compare data and research insights were a nightmare of torturous manual correlation and verification, causing frustrations that mostly led to dropping such attempts. But the dream of data comparison and exchange never completely died.
Comparison of names in a list is fraught with all sorts of problems. It seems to me that a necessity in the successful comparison and exchange of data is a basic objective codeset, which allows for quick and easy identification of an entity related in some way to one with the same code in another database.
This is what I understand the intent of the ROP to be. It is, in fact, working this way in more and more databases. Similarities between two or more databases become confirming data experiences. Differences become talking points for clarification and sharpening of identification and define areas for further research and discovery.
I had a telephone and email conference with two researchers for the Mexican mission consortium COMIMEX. They were involved in a massive process of developing about 1500 people profiles as the basis for advocacy and mobilization of the mid-American Spanish churches for mission worldwide.
They informed me that they had also determined that the most efficient way to effect comparison and ease editing, with ongoing correlation and comparison with other databases in mind, is to use a separate COMIMEX entity ID and to include the ROP code as a reference field.
In this way they will be able to compare data and facilitate updates with other ethnic databases with Mexican info as those databases also change. They have determined that this will enable them to maintain a consistent internal reference whether they have been able to correlate all their entities to an ROP entity and code or not.
They determined that the use of the ROP code as a separate reference field, rather than their internal key, would better enable them to separately correlate to any other database using the ROP codes to determine where that database may have used the same code in a different manner.
They had decided they could avoid various editing and identification problems as changes and updates were needed, by using a static, internal ID as a consistent reference. Then when they had to change an ROP code due to an update in the ROP or an error in their code usage, tracking would be cleaner and correlation would not be lost on any related entities.
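One way such tracking might be arranged is sketched below in Python. It is not a description of the COMIMEX system: the `Profile` class, the ID format `INT-0042` and the codes `P0007` and `P0009` are all hypothetical, chosen only to show a static internal ID alongside a reference code that can change.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Profile:
    entity_id: str                  # static internal ID, never reassigned
    name: str
    rop_code: Optional[str] = None  # may stay empty until a correlation is made
    code_history: List[Tuple[Optional[str], Optional[str], str]] = field(
        default_factory=list
    )

    def set_rop_code(self, new_code: Optional[str], reason: str) -> None:
        """Change the reference code and record the old value, new value and reason."""
        self.code_history.append((self.rop_code, new_code, reason))
        self.rop_code = new_code


profile = Profile("INT-0042", "Example people")
profile.set_rop_code("P0007", "initial correlation")
profile.set_rop_code("P0009", "registry update")
print(profile.entity_id, profile.rop_code)
print(profile.code_history)
```

The internal ID stays the same through both changes, so anything else linked to that profile is unaffected, and the history shows which code was in use at each stage.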
Languages and Peoples
Languages are shared among numerous groups of people that consider themselves different. Also religious distinction in some societies is sufficient to require a separate self-identity and a separate accounting in ethnic classification.
A recent example of this came up in a comparison project I was involved in between two major ethnic databases.
In South Asia we have some great examples of the complexity of ethnicity, which also illustrate the difference between an ethnic entity and a related language. The Malayali are one example that illustrates this, as well as the problems we have with initial different names in different databases for the same entity.
In the proto-ROP data, I found there was a full duplication of entities under the names Malayali and Malayalam. (Malayali is an ethnic name; Malayalam is a language name.) Also, as with most languages in India, many more ethnicities had Malayalam as a language, either Primary or Secondary. These run over one hundred, I believe, across several countries.
I had edited this set of data previously. When I reviewed the Malayali again in our recent comparison project, I found the ROP had only 20 entities of Malayali ethnicity across countries. These distinctions were based on a South Asian research source referred to as Omid, and critically related to other sources.
Consistency Across Countries
The basic definition of people group that we are following would indicate that there would be some consistency of identity by language across countries. There are other clusters like this where a simple change of the ROP code on one entity would fix the relationship. Likewise sometimes a change of the ROL code would correct a disjunction.
In these cases the database's internal entity ID number remains the same, so the relationship and tracking on other factors stays in the same relationship. And you can cross-check by ROP code or ROL code for other views to refine your groupings within each country.
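A cross-check of this kind could be sketched as follows in Python. The rows, country labels and codes (`P0001`, `L-aaa` and so on) are invented placeholders rather than real ROP or ROL values. A flagged code is only a prompt for review, since one people may legitimately speak different languages in different countries.

```python
from collections import defaultdict


def flag_inconsistent(rows):
    """Return people codes that are paired with more than one language code."""
    languages = defaultdict(set)
    for row in rows:
        languages[row["rop_code"]].add(row["rol_code"])
    return {code: sorted(langs) for code, langs in languages.items() if len(langs) > 1}


rows = [
    {"entity_id": 101, "country": "Country A", "rop_code": "P0001", "rol_code": "L-aaa"},
    {"entity_id": 102, "country": "Country B", "rop_code": "P0001", "rol_code": "L-aaa"},
    {"entity_id": 103, "country": "Country C", "rop_code": "P0001", "rol_code": "L-bbb"},
    {"entity_id": 104, "country": "Country A", "rop_code": "P0002", "rol_code": "L-ccc"},
]

flagged = flag_inconsistent(rows)
print(flagged)

# Suppose review shows entity 103 was given the wrong people code.
# Only the reference field changes; the internal entity_id is untouched.
rows[2]["rop_code"] = "P0003"
flagged_after = flag_inconsistent(rows)
print(flagged_after)
```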
It seems this would be harder if the only reference ID you had was the ROP code. As the COMIMEX and similar databases are organized, you can easily correct a code error without losing track of your entity and its other relationships.
The Malinke/Maninka cluster was included in one round of data comparisons. I have reviewed these thoroughly in comparison with the current Ethnologue view of the language aspects, and independent sources (not mission related). The primary distinction between these entities, according to those working with them, is their language distinctions.
In my earlier review and edits, I had found that the entity with one code was assigned to different language groups in different countries. Language assignments often did not match the languages in the people group descriptions, as found in the Ethnologue/Registry of Languages.
In the ROP, I found contributors had various records with names like this. In many cases we retained a generic name with its separate code for expatriate entities that cannot be otherwise ethnically identified. The Alternate Names table was important to help correlate the same entity or portion of a larger entity to databases with different perspectives or name preferences.
Some databases preferred the French common name for a people group, while some preferred the Portuguese or the English. Others preferred one of the local names for a people, the most well-known or the one that people themselves prefer and use in their own society. You can see names are not a simple matter when trying to formalize information for a computer database.
Grammatical forms in names in various languages presented problems. But a names list can be as extensive as needed to relate data in one source to similar data in another source.
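A names list of this sort can be pictured as a lookup from any known name to a code. The Python sketch below assumes a simple normalization (case, accents, spacing) and uses made-up names and the placeholder codes `P0001` and `P0002`. A real Alternate Names table would need far more care with grammatical forms and spelling variants than this shows.

```python
import unicodedata


def normalize(name):
    """Lower-case, strip accents and collapse spaces so variant spellings match."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def build_name_index(alternate_names):
    """Map every normalized name to the set of codes it is listed under."""
    index = {}
    for code, names in alternate_names.items():
        for name in names:
            index.setdefault(normalize(name), set()).add(code)
    return index


# Hypothetical English, French and Portuguese names for the same entity.
alternate_names = {
    "P0001": ["Model People", "Peuple Modèle", "Povo Modelo"],
    "P0002": ["Other People"],
}
index = build_name_index(alternate_names)

print(index.get(normalize("  PEUPLE  Modèle "), set()))
print(index.get(normalize("Unknown Name"), set()))
```

The lookup returns a set because the same name may be listed under more than one code, which is itself a signal that the entries need a closer look.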
In these expatriate, immigrant and migratory communities, we find changes of identity, related to the process of Assimilation. Sometimes the nationality, name or identity becomes the primary aspect of the previous identity around which the new sense of ethnicity develops in the new country. At least in the first generation, home language is an important factor.
Various factors related to ethnicity can be tracked and evaluated to gain a view of the direction and degree as well as the speed of Assimilation. Being Sudanese, in itself, does not automatically entail a sense of kinship or familyhood between people from the same country who are of diverse backgrounds. It depends on many factors, unique to each place of origin and new residence, as well as past ethnic history.
Religion and Ethnicity
In the ROP, various "descriptors" may be used independently to dynamically "triangulate" classification in a real-world manner. "Ethnolinguistic," as defined for the ROP, covers the various cultural distinctions involving the different descriptors. The ROP documentation describes key "Descriptors." Religion, for instance, is one of the descriptors. Any Descriptor can be the deciding factor in ethnic uniqueness. Religion is a critical factor for self-identity in South Asian societies.
I worked with the data manager of another major research compiler to analyze and interpret information we received periodically from a leading researcher and data manager for South Asian languages and ethnicities. Religious identities must be included to account adequately for the significant unique self-identities of peoples in that part of the world.
We attempted to incorporate that information and system of analysis into the common formats being used within the HIS format for peoples of the world. We three and others have had some productive email conferences related to this topic. The consistent data and research from many sources indicated that ethno-religious identities were critical in understanding and accounting for South Asian entities. This seems more realistic for that setting.
Religion is not always that important in determining human ethnic identity. It depends on the society. If they take it seriously as an ethnic indicator, we have to look at it that way. South Asia is one region where the religion is a determining factor of ethnicity.
Culturally Determined Balance
Human societies and ethnicities all have to deal with the same factors. But each group (family, clan, tribe, caste, society, cultural cluster, etc.) organizes these factors differently and gives different weight to different factors. The internal values have to be strong factors in our classifications if they are to represent the real world.
I have addressed the various factors of ethnicity that must go into the decision of where and how ethnic boundaries are drawn in our formal classifications. I have addressed the misuse and misunderstanding of the technical term "ethnolinguistic" in articles and presentations. See the list of related topics below, on this site and others by this author.
Related on this Site:
Ethnicity and Religion
Ethnic Names and Codes — Correlating People Lists:
How Codes in the Registry of Peoples Enrich the Exchange of Ethnic Information
Ethnicities and Names
Lists, Codes and Real-World Ethnicity: Thoughts on Ethnic Information Exchange
Peoples and Languages
Rough Edges of Ethnicity:
Determining Ethnicity in the Changing Streams of Language and Culture
What is a People Group?
Also by this author on the Internet:
Dealing with Ethnic and Linguistic Change: Overcoming Assumptions and Mis-Conceptions in People Group Strategies
How We Determine Ethnicity
What is a People Group?
What is an Ethnic Group?
Also view related PowerPoint Presentations:
People Groups – An Ethnolinguistic Concept (presentation)
What is a People Group? (presentation)
Topic first addressed in May and June 2006 in an email discussion among researchers
Further notes on the topic in 2006-2007 led to this article
Developed in July and October 2008, June 2012
Final article posted on OJTR 5 July 2012
Last edited 18 July 2012
Orville Boyd Jenkins, EdD, PhD
Copyright © 2012 Orville Boyd Jenkins
Permission granted for free download and transmission for personal or educational use. Please give credit and link back. Other rights reserved.