Compare area
[Screenshot of the compare area]
Entities you added to the compare area (either from the results or the matchings) are displayed next to each other in a table with their corresponding attributes (id, names, position, type, ...), making it easy to compare them. You can export the entities by clicking the export icon at the top of the compare area.
Matchings area
[Screenshot of the matchings area]
If you set the matching option in the search area for a search, the app looks up matchings for every entity in the result sets and adds them to the respective items as additional attributes. A vertical tab labelled "Matchings" appears to the left of the results area. By clicking on it, you can open the matchings area.
In the example the user searched for "Lubowice". As you can see in the screenshot, the results contain one entity for GeoNames, one for GOV, one for Wikidata, and two for PRNG. Note that the entities shown here are those found by the user's search and are therefore part of the result sets of the search. To see the target entities that match a particular result set entity, you can click on the respective entity (see the next screenshot and its description).
You can export all matchings for the result set entities as CSV, JSON, or GeoJSON by clicking the export icon at the top of the matchings area. To add an entity to the compare area, click its compare icon. To display that entity in the details view, click its details icon.
[Screenshot of the matchings area]
In the example shown in the screenshot the user clicked on the Wikidata entity "Q1397867 - Łubowice". As a result, the list of matchings for that entity is shown. As you can see, the list contains five entities, one each for GND, Teryt, GOV, PRNG and GeoNames. I.e. the app identified these five entities as possibly referring to the same geographical entity as Wikidata:Q1397867.
Each match entity is identified by its respective ID. Additionally, "type" and "description" contain information about the matching's type and how the matching was obtained (see the "Matchings" section of the conceptual background for further details).
Frequently Asked Questions
What is the current status of the gazetteers.net project?
The application is still under development. The current version offers features like unified search across multiple gazetteers, a compare view and a matching lookup. Further possible development steps may include more advanced functionality, the integration of other gazetteers and further improvements to the matching system.
The number of results is much too high. How can I make it smaller?
You can set the "Name search type" to "Match whole name" in order to make the name search more restrictive. To only get results from a certain region you can draw a bounding box around it via the "Bounding Box" filter (note that this filters out all entities without coordinates). And you can set an "only settlements" filter (note that this filter does not work well for the GND database due to its type system).
There may be interesting entities that are missing in the results. How can I include them?
Your search may be too restrictive. Setting the "Name search type" to "Match word in name" is one option to make it more permissive. You can also use a wildcard (the * symbol) in the name search (e.g. Wroc*w). And if the entities you are looking for do not contain coordinates, you won't find them while a geographical filter, i.e. a bounding box, is set.
Why does the application offer rather generic filters (name, location) and a settlement filter?
Our aim was to provide simple yet powerful filters enabling the user to easily make broad searches as well as specific ones. Additionally, a key aspect of the application is the simultaneous querying of multiple gazetteers with one search request in a unified manner. That means the app uses filters which can be applied to all gazetteers and does not offer those that can't be queried for across all gazetteers (e.g. specific characteristics of one gazetteer like its type system).
When I compare two searches, one only specifying the name, the other adding another filter to it (geographical or settlement) I get results in the second one that were not in the first one. Isn't that incorrect?
That happens when your first search yields more results than the respective service returns at most (e.g. GeoNames returns at most 1000 results per query). The result set of the second search may then include entities that fell beyond the limit in the first search and were therefore not included in it.
How trustworthy is the matching system?
A "match" can be one of three things: 1. a reference found in the original gazetteer's dataset or a combination of multiple references; 2. a match computed by us based in particular on attribute values; 3. a combination of 1. and 2. References are as trustworthy as the sources that state them, i.e. a reference statement from an authority file can be considered trustworthy, whereas reference statements found in Wikidata may not always have the same accuracy. The computed matches are often correct, but since they have not been validated by humans, they may contain errors and should be treated as (more or less) probable suggestions.
Conceptual Background
The gazetteers.net web application provides a unified search across multiple gazetteers. Additionally, it supports the identification of items in different databases which refer to the same geographical entity. By linking corresponding items across gazetteers it facilitates data aggregation and comparison.
Access to multiple gazetteers: Currently ten gazetteers are integrated into the application.
Unified search: All integrated gazetteers can be simultaneously queried in a unified manner with one search request. The application uses filters which can be applied to all gazetteers (especially based on names and coordinates) and does not offer those that can't be queried for across all gazetteers.
Matchings: A "match" is a link between items in different gazetteers which likely refer to the same geographical entity. It can be one of three things: 1. a reference found in the original gazetteer's dataset or a combination of multiple references; 2. a match computed by us based in particular on attribute values; 3. a combination of 1. and 2.
Gazetteers
The project's approach is to use live data where possible. Gazetteers which are considered "stable" (e.g. datasets of completed projects) or which are not accessible via a performant and suitable web service were downloaded and imported into the application's database.
The user can search all gazetteers or only in a selected subset.
Currently ten gazetteers are integrated into the application:
GeoNames
GeoNames is a widely used global geographical database.
It is integrated into the application via a web service.
Geschichtliches Ortsverzeichnis (GOV)
GOV is a gazetteer by the German Verein für Computergenealogie (Computer genealogy association). It focuses on settlements in Europe, the U.S. and Australia, along with historical names, to facilitate genealogical research in these regions.
GOV is integrated into the application via a web service.
Gemeinsame Normdatei (GND) (Geographical data)
The Gemeinsame Normdatei (Integrated Authority File), managed by the German National Library, is an international authority file which also contains global geographical entities (GND Geografika).
GND Geografika is integrated into the application via a web service.
Wikidata
Wikidata is a collaboratively edited knowledge base which contains structured data about all kinds of entities.
All entities containing coordinates and assigned to a country in East-Central Europe were downloaded with a few attributes and integrated into the application's database.
Państwowy Rejestr Nazw Geograficznych (PRNG)
The Państwowy Rejestr Nazw Geograficznych (National Register of Geographical Names) contains a "Register of geographical names of the Polish Republic" which was downloaded and integrated into the application's database.
BKG Historische Ortsnamen
BKG Historische Ortsnamen was created by the Bundesamt für Kartographie und Geodäsie (BKG), the Federal Agency for Cartography and Geodesy. The dataset's entities mainly represent East-Central European settlements and contain historical names.
The dataset was kindly provided by the BKG and integrated into the application's database.
Ziemie polskie Korony w XVI w.
Ziemie polskie Korony w XVI w. (Polish lands of the Crown in the 16th century) is a spatial database created by the Tadeusz Manteuffel Institute of History of the Polish Academy of Sciences. It contains data about 16th century Polish settlements.
The database was kindly provided by the Polish Academy of Sciences and integrated into the application's database.
Kaszëbsczé miestné muiona (Naszekaszuby)
Kaszëbsczé miestné muiona (Kashubian place names) is a small dataset containing about 1500 entries representing Kashubian and Polish place names in Kashubia in northern Poland.
It was downloaded and integrated into the application's database.
Lemko Village Resource Guide (Carpathorusyn)
The Lemko Village Resource Guide contains a small dataset consisting of around 400 entries with Rusyn, Polish and Ukrainian names of places located mainly in southeastern Poland.
The dataset was downloaded and integrated into the application's database.
Interaktyvus Rytų Prūsijos žemėlapis IV (Prusijalit)
The Interaktyvus Rytų Prūsijos žemėlapis IV dataset was published by the Lietuvių kalbos institutas (Lithuanian Language Institute). It contains approximately 300 entities representing settlements located in the former East Prussia, with their Lithuanian, German, and optionally Russian and Polish names, along with coordinates.
The dataset was downloaded and then integrated into the application's database.
Unified search across gazetteers
The application enables the user to simultaneously query multiple gazetteers with one search request in a unified manner. This makes it easier to work with several databases, since they can normally only be searched individually and not even with uniform criteria, because each has its own search behavior.
The gazetteers.net project team considers a flexible word / name search (i.e. selecting entities based on their names) and a geographical search (i.e. filtering entities based on their coordinates) the core search filters. Additionally it is possible to use a settlement type filter (Note: Due to the properties of and differences in the gazetteers' type systems, this filter is only an approximation, especially concerning the GND data). These three aspects can be combined, where the name filter is mandatory and the others are optional. These filters can be applied to all gazetteers resulting in a consistent, uniform behavior. Other filters that can't be queried for across all gazetteers are not provided.
The search is consistent across all integrated databases, i.e. the query is processed the same way for all gazetteers, e.g. concerning the handling of wildcards and diacritics. The user can easily choose a desired search behavior.
Name search
The name search is applied to all names of the entities of all selected gazetteers. The user can optionally set a name search scope and use wildcards. The search is always permissive regarding case and diacritics. Currently the name search is restricted to Latin-based characters, including letters with diacritics.
Name search scope
The user can choose if a name search is applied to every word in a name or only to the whole name. The search can either be permissive or restrictive in that respect.
In case of the permissive "match word in name" (which is the default setting) the search phrase is applied to every single word of a name, i.e. it only needs to match a word contained in a name to include the entity in the result set. In case of "match whole name" the search phrase must match the entire name.
GOV contains an entity named "Wrocław", one named "Wrocław Gądów", and a third named "Bielany Wrocławskie".
The user searches for "Wrocław" and the name search scope is left unchanged, i.e. it is set to "match word in name".
> The result set contains the entity named "Wrocław" as well as the one named "Wrocław Gądów", because in both names there is a matching word; however it does not return "Bielany Wrocławskie" (this would require a wildcard search).
The user searches for "Wrocław" and the name search scope is set to "match whole name".
> The result set contains the entity "Wrocław", but neither the "Wrocław Gądów" nor the "Bielany Wrocławskie" entity.
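The two scopes can be illustrated with a minimal sketch. It is not the application's implementation: the function name `matches` and the scope labels "word" and "whole" are made up for illustration, and wildcards and diacritic folding are deliberately left out.

```python
def matches(search: str, name: str, scope: str = "word") -> bool:
    """Compare a search phrase with a name (no wildcards, no diacritic folding)."""
    search = search.casefold()
    name = name.casefold()
    if scope == "whole":
        # "match whole name": the phrase must equal the entire name
        return search == name
    # "match word in name": the phrase only has to equal one word of the name
    return search in name.split()


gov_names = ["Wrocław", "Wrocław Gądów", "Bielany Wrocławskie"]
word_hits = [n for n in gov_names if matches("Wrocław", n, "word")]
whole_hits = [n for n in gov_names if matches("Wrocław", n, "whole")]
print(word_hits)   # the two names containing the word "Wrocław"
print(whole_hits)  # only the name that is exactly "Wrocław"
```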
The third option is to set the name search scope to "original search". In that case the native search behavior of each live service (i.e. of GeoNames, GOV, and GND) is left as-is.
Wildcards
In case of a unified search, i.e. the name search scope is set to "match word in name" or "match whole name", the user can use wildcards (by using the asterisk symbol *). Note that a wildcard cannot be used as the leading character.
As mentioned in the above example, GOV contains three entities, "Wrocław", "Wrocław Gądów", and "Bielany Wrocławskie".
The user searches for "Wrocław*" and the name search scope is the default setting, i.e. "match word in name".
> The result set contains all three entities, because all of them bear a name containing a word which begins with "Wrocław".
The user searches for "Wrocław*" and the name search scope is set to "match whole name".
> The result does not contain the entity named "Bielany Wrocławskie", because the name does not begin with "Wrocław".
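The wildcard behavior described above can be sketched by translating the search phrase into a regular expression. This is only an illustration under assumptions: the helper names `wildcard_to_regex` and `matches` are hypothetical, and how the application actually evaluates wildcards is not stated here.

```python
import re


def wildcard_to_regex(search: str) -> re.Pattern:
    """Turn a phrase with * wildcards into a compiled regular expression."""
    if search.startswith("*"):
        raise ValueError("a wildcard cannot be used as the leading character")
    # Escape the literal parts and let every * stand for any run of characters
    parts = [re.escape(part) for part in search.casefold().split("*")]
    return re.compile(".*".join(parts))


def matches(search: str, name: str, scope: str = "word") -> bool:
    pattern = wildcard_to_regex(search)
    name = name.casefold()
    if scope == "whole":
        # the pattern must cover the entire name
        return pattern.fullmatch(name) is not None
    # the pattern must cover at least one single word of the name
    return any(pattern.fullmatch(word) for word in name.split())


gov_names = ["Wrocław", "Wrocław Gądów", "Bielany Wrocławskie"]
word_hits = [n for n in gov_names if matches("Wrocław*", n, "word")]
whole_hits = [n for n in gov_names if matches("Wrocław*", n, "whole")]
```

With the default scope all three names are hits, while "match whole name" drops "Bielany Wrocławskie" because that name does not begin with "Wrocław".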
Permissive search
The search is case- and diacritic-insensitive in unified search (i.e. if the name search scope is set to "match word in name" or "match whole name").
Let's assume two gazetteers contain entities referring to the city of Wrocław. In gazetteer A the entity bears the correct name "Wrocław", in the second one it is called "Wroclaw" (containing a plain l without the slash diacritic instead of ł).
The user searches for "Wrocław" (with the diacritic).
> The result sets contain both entities, "Wrocław" and "Wroclaw", because ł is also mapped to l.
The user searches for "Wroclaw" (without the diacritic).
> The result sets contain both entities, "Wroclaw" and "Wrocław", because l is also mapped to the corresponding character containing the diacritic.
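One common way to obtain such case- and diacritic-insensitive comparisons is to fold both the search phrase and the names to a plain form before comparing them. The sketch below is an assumption about how this could be done, not a description of the application's code; the names `fold` and `EXTRA_FOLDS` are hypothetical. Note that ł has no Unicode decomposition, so it needs an explicit mapping in addition to stripping combining marks.

```python
import unicodedata

# Letters like ł are not "l plus a combining mark" in Unicode, so
# normalization alone leaves them untouched; map them explicitly.
EXTRA_FOLDS = str.maketrans({"ł": "l", "Ł": "L"})


def fold(text: str) -> str:
    """Return a lower-cased copy of text with diacritics removed."""
    text = text.translate(EXTRA_FOLDS)
    decomposed = unicodedata.normalize("NFKD", text)
    # drop combining marks such as the acute accent or the ogonek
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(fold("Wrocław") == fold("Wroclaw"))
print(fold("Gądów"))
```

Because both sides are folded, a search for "Wrocław" and a search for "Wroclaw" select the same entities.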
Matchings
The application supports the identification of corresponding items in different databases, i.e. possible "same as" relationships. Therefore, for each item in the search result set the web app looks for items in other gazetteers which likely refer to the same geographical entity and adds this additional information to the entity. We built the matching system beforehand, i.e. the possible "same as" relationships are already in a separate project database, as determining them live during a user search would take too much time, especially for large result sets.
Note that the term "matchings" is used here for all possible "same as" relationships as well as in a strict sense only for those algorithmically determined.
The user searches for "Wrocław" in the GOV gazetteer with matching functionality enabled. The gazetteers.net server queries the GOV web service accordingly and processes the results. Before sending the result set to the web browser, the gazetteers.net server tries to find possible "same as" relations for each entity in the result set. For the GOV entity representing the city of Wrocław it finds a corresponding entity in GeoNames and adds this information to the GOV entity in the result set before sending it to the browser:
GOV result set: [
  ...
  Wrocław (id: BREADTJO81MC) {
    names: [
      Wrocław
      Breslau
      ...
    ]
    ...
    /* additional information added by the gazetteers.net server */
    matchings: [
      {
        gazetteer: geonames
        id: 3081368
        link: https://www.geonames.org/3081368
        description: lev dist: 0, geo dist: 2237 m, type: both_settlements, assignment: 1:1
      }
      ...
    ]
  }
]
Currently possible "same as" assignments are mainly applied to entities in Poland.
As the matchings lookup requires additional search operations for each entity in each result set and in sum this can be time-consuming, this feature can be turned on and off.
Layers of the matchings system
This system of identifying possible "same as"-relations is composed of the following layers / components:
1) "live references": references already contained in the live data returned by the respective gazetteer service
2) "reference table": references found in database dumps, plus indirect references derived by combining them
3) "matchings": possible matchings essentially based on name similarity and geographical proximity
4) "matchings linked with references/matchings": a combination of 2) and 3)
2) "reference table"
Some gazetteers contain "same as" references to other gazetteers. In order to build a comprehensive reference table that contains the given references in a unified form, they are also combined to infer indirect references. Additionally, to have a fast-responding service the respective database dumps were downloaded and parsed offline, i.e. beforehand. The resulting reference table is used in the gazetteer app.
As mentioned, the reference table was constructed offline and is stored in the application's database. This example illustrates the process.
In many gazetteers, there are entities representing a Silesian village named Łubowice:
E.g. in GOV (gov::object_189067), in a Polish database called TERYT/SIMC (teryt_simc::0220871), in Wikidata (wikidata::1397867), and in GND (gnd::10132266-5).
gov::object_189067 contains a reference to teryt_simc::0220871
wikidata::1397867 contains two references, one to teryt_simc::0220871, one to gnd::10132266-5
From these three reference statements it can be concluded that all four items represent the same geographical entity, because if
gov::object_189067 = teryt_simc::0220871 and
wikidata::1397867 = teryt_simc::0220871 and
wikidata::1397867 = gnd::10132266-5
then
gov::object_189067 = teryt_simc::0220871 = wikidata::1397867 = gnd::10132266-5
So six "same as" references can be explicitly formulated between two entities at a time
(gov:... = teryt_simc:... , gov:... = wikidata:... , ... , wikidata::... = gnd::...)
(or actually twelve, if a distinction is made between a statement gazA::ent1 = gazB::ent2 and its inverse gazB::ent2 = gazA::ent1)
and these references are stored in the reference table.
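The inference step in this example amounts to grouping entities that are connected by reference statements. A minimal sketch using a union-find structure is shown below. It only illustrates the idea: the function name `group_references` is hypothetical, and the way the actual reference table was built is not described in more detail here.

```python
from itertools import combinations, permutations


def group_references(references):
    """Group entities that are connected directly or indirectly by references."""
    parent = {}

    def find(entity):
        parent.setdefault(entity, entity)
        while parent[entity] != entity:
            parent[entity] = parent[parent[entity]]  # path compression
            entity = parent[entity]
        return entity

    for left, right in references:
        parent[find(left)] = find(right)  # merge the two groups

    groups = {}
    for entity in list(parent):
        groups.setdefault(find(entity), set()).add(entity)
    return list(groups.values())


references = [
    ("gov::object_189067", "teryt_simc::0220871"),
    ("wikidata::1397867", "teryt_simc::0220871"),
    ("wikidata::1397867", "gnd::10132266-5"),
]
groups = group_references(references)
group = sorted(groups[0])
undirected = list(combinations(group, 2))  # one statement per pair of entities
directed = list(permutations(group, 2))    # each statement plus its inverse
print(len(groups), len(undirected), len(directed))
```

For the three statements above, all four entities end up in one group, which yields six pairwise statements, or twelve if the inverse statements are counted separately.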
3) "matchings"
Corresponding entities can also be determined by comparing their attribute values. The entities' names and coordinates are especially well suited for that, because combined they make a rather good identifier, and because similarity measures can easily be computed for two names (using the "Levenshtein distance") or two coordinates (calculating the geographic distance), provided that suitable data sets are available.
Real data often do not allow a clear assignment based on this strategy. The reasons for that include name related challenges like name variations, e.g. possible name affixes, missing coordinates, ambiguities (e.g. because an entity in one gazetteer is split up into many in another gazetteer), and the absence of a record in a database.
In order to reduce ambiguous matchings the entities are compared based on normalised data (e.g. optionally removing diacritics and name affixes) and additional information like the entity type may be used.
Like the "reference table", the "matchings" component was built offline and is stored in the application's database.
The matching algorithm also searches for corresponding GeoNames-GOV entity pairs.
The Łubowice entity in GOV contains two names ("Łubowice (województwo śląskie)", "Lubowitz"), coordinates (50.16012, 18.23241) and a type ("Dorf" (village)).
When using the normalized name ("Łubowice") and the coordinate and type information, the algorithm identifies a matching GeoNames entity:
{id: 11497515, name: "Łubowice", coordinates: (50.16139, 18.23583), type: "PPLH - historical populated place"}
The match was determined by
comparing the names with the "Levenshtein distance" name distance measurement:
lev("Łubowice", "Łubowice") = 0
and calculating the geographical distance of the two point coordinates:
geodist((50.16012, 18.23241), (50.16139, 18.23583)) = 0.282 km
and taking into account that both entities are settlements ("village" and "historical populated place")
By using these three criteria the algorithm unambiguously matches the two entities, i.e. for the GOV entity there is only one matching GeoNames entity and for the GeoNames entity there is only one matching GOV entity as well.
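The two measures used in this example can be reproduced with a short sketch. The function names `levenshtein` and `geodist_km` are chosen for illustration, and the haversine formula with a mean Earth radius of 6371 km is an assumption, since the exact distance formula used by the matching algorithm is not stated. Thresholds and the type comparison are left out for the same reason.

```python
from math import asin, cos, radians, sin, sqrt


def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def geodist_km(p, q, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * radius_km * asin(sqrt(h))


name_distance = levenshtein("Łubowice", "Łubowice")
geo_distance = geodist_km((50.16012, 18.23241), (50.16139, 18.23583))
print(name_distance, round(geo_distance, 3))
```

Under these assumptions the name distance is 0 and the geographical distance comes out at roughly 0.28 km, in line with the values given in the example.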
4) "matchings linked with references/matchings"
The matchings and the reference table can be combined to infer possible "same as" relationships.
After combining the reference table with the matchings there is a link between entities from the Naszekaszuby gazetteer and Wikidata, both representing a Pomeranian village named Skórowo.
The Wikidata entity contains a reference to a GeoNames entity, and this GeoNames entity is matched to the Naszekaszuby entity based on name similarity and region:
reference table entry:
wikidata::Q431650 = geonames::3085756
matching entry:
geonames::3085756 = naszekaszuby::1192
resulting match linked with reference entry:
wikidata::Q431650 = naszekaszuby::1192
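This combination is essentially a join of the two tables on the shared entity. The sketch below illustrates it with the entries from the example. The function name `link` is hypothetical, and the sketch assumes that both tables store every statement in both directions, so that only one join direction has to be checked.

```python
reference_table = {("wikidata::Q431650", "geonames::3085756")}
matchings = {("geonames::3085756", "naszekaszuby::1192")}


def link(references, matches):
    """Chain a reference a = b with a matching b = c into a new entry a = c."""
    linked = set()
    for ref_source, ref_target in references:
        for match_source, match_target in matches:
            if ref_target == match_source and ref_source != match_target:
                linked.add((ref_source, match_target))
    return linked


linked_entries = link(reference_table, matchings)
print(linked_entries)
```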
Integration
The "same as" system is prioritized in the aforementioned order. I.e. if the live data of an entity in gazetteer A contains a "same as" reference to an entity in gazetteer B, this information is considered most valid, and possible "same as" statements for that entity to gazetteer B in 2), 3) and 4) are ignored. Accordingly, "same as" references in 2) are preferred to those in 3) and 4), and so on.
The example illustrates how the "same as" information for the "GOV" Łubowice (Silesian) is put together.
The entity data delivered by the GOV web service already contains a "same as" reference to an entity in the TERYT/SIMC database. This "live" reference is added to the "same as" property of the GOV Łubowice entity:
db "teryt_simc"
id "0220871"
type "ref (from live data)"
The system looks for other "same as" references in the reference table. As the live data already contained a reference to a TERYT/SIMC entity, the system does not look for one to that database. But the table contains indirect references to entities in GND, PRNG, and Wikidata, so the following is added:
db "prng"
id "73702"
type "ref"
description "ref dbpath: gov > teryt_simc < prng"
db "wikidata"
id "1397867"
type "ref"
description "ref dbpath: gov > teryt_simc < wikidata"
link "https://www.wikidata.org/wiki/Q1397867"
db "gnd"
id "10132266-5"
type "ref"
description "ref dbpath: gov > teryt_simc < wikidata > gnd"
link "https://d-nb.info/gnd/10132266-5"
As neither the live data nor the reference table contain a reference from the GOV Łubowice entity to a GeoNames entity, the app checks the matchings for that and adds:
db "geonames"
id "11497515"
type "match"
description "lev dist: 0, geo dist: 282 m, type: both_settlements, assignment: 1:1"
link "https://www.geonames.org/11497515"
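The prioritization shown in this example can be sketched as a merge in which the first layer that provides an entry for a target database wins. This is a simplified illustration, not the application's code: the function name `integrate` is hypothetical, only one entry per target database is kept, and the teryt_simc entry in the reference table layer is added here only to show that a lower layer is ignored once a higher layer has supplied an entry.

```python
def integrate(*layers):
    """Merge layers in priority order; the first entry per target database wins."""
    result = {}
    for layer in layers:
        for entry in layer:
            result.setdefault(entry["db"], entry)
    return list(result.values())


live = [{"db": "teryt_simc", "id": "0220871", "type": "ref (from live data)"}]
reference_table = [
    {"db": "teryt_simc", "id": "0220871", "type": "ref"},  # illustrative, ignored below
    {"db": "prng", "id": "73702", "type": "ref"},
    {"db": "wikidata", "id": "1397867", "type": "ref"},
    {"db": "gnd", "id": "10132266-5", "type": "ref"},
]
matchings = [{"db": "geonames", "id": "11497515", "type": "match"}]
linked = []  # layer 4) contributes nothing in this example

same_as = integrate(live, reference_table, matchings, linked)
for entry in same_as:
    print(entry["db"], entry["id"], entry["type"])
```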