About gazetteers.net

The gazetteers.net web application is developed as part of the Gazetteer research project by the Herder Institute (HI), the Institute for Regional Geography (IfL) and the Justus Liebig University Giessen (JLU). The application is intended to support users in working with different digital gazetteers, and to help them explore their content and metadata structure.

It enables users to search several place name related databases simultaneously in a unified manner and to view and compare data from different gazetteers. In addition, the application supports the identification of items in different databases which refer to the same geographical entity. By linking corresponding items across gazetteers it facilitates data aggregation and comparison.

In principle, the search works on a global level. However, as the current regional focus of the project is Poland, additional specific gazetteers were integrated for this region.


Help

Main elements of the frontend

[Screenshot of the application]

Initially only the search area and the map are shown.
If you search for entities in the selected gazetteers, the results area appears (on the right side in order to show the map in the center). The results corresponding to the search are displayed on the map and in the results area, where you can inspect them.
You can then select entities from the results area and add them to the compare area, which opens after you have added an entity to it.
By opening the matchings area you can see entities matching the entities in your results (note that the matching entities may or may not be part of your results, depending on your search). From here, you can also add them to the compare area.


Search area

[Screenshot of the search area]
The search area allows you to define your search:
Name filter
     : You can specify the name (part) you are searching for, optionally with the wildcard symbol * (the asterisk).
     : Use "Match word in name" if the search phrase only needs to match a word in the name, and "Match whole name" if it must match the complete name of an entity for that entity to be included in the results.
     See here for more details about the name filter.
Spatial filter: If you want to specify a region, click on "Draw search region" and draw the respective rectangle ("bounding box") on the map (note that when a spatial filter is set entities without coordinates will never be included in the results).
Type filter: To restrict the results to settlements, you can set the filter (4) "Only settlements" (note that this does not work well with GND data due to the GND type system).
Gazetteers: You can select any combination of gazetteers you want to query.
Matchings: If you want to add information about possible matchings to the entities in the result sets, enable matchings (note that this is disabled by default, as adding the matching information takes additional time).
Search: After you hit "Search", the result sets built by the server are displayed in the results area and on the map.
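
The note on the spatial filter (entities without coordinates are never included) can be illustrated with a minimal Python sketch. The Entity type, the sample records and the in_bounding_box helper are hypothetical and not taken from the application; the sketch also assumes a bounding box that does not cross the 180° meridian.

```python
from typing import NamedTuple, Optional


class Entity(NamedTuple):
    # Hypothetical, simplified entity record
    id: str
    name: str
    lat: Optional[float]
    lon: Optional[float]


def in_bounding_box(entity, south, west, north, east):
    """Return True only if the entity has coordinates inside the box."""
    if entity.lat is None or entity.lon is None:
        return False  # no coordinates: can never pass a spatial filter
    return south <= entity.lat <= north and west <= entity.lon <= east


entities = [
    Entity("a", "Gdańsk", 54.35, 18.65),
    Entity("b", "Gdańsk (record without coordinates)", None, None),
    Entity("c", "Wrocław", 51.11, 17.03),
]

# Box roughly around the Gdańsk area (illustrative values)
hits = [e.id for e in entities if in_bounding_box(e, 54.0, 18.0, 55.0, 19.0)]
print(hits)
```

Only the first record passes: the second has no coordinates and the third lies outside the box.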

Results area

Initial view

[Screenshot of the results area]
After you have started the search, the results area appears on the right side with a block for each selected gazetteer. In the example the user searched for "Gdansk" in four gazetteers, resulting in four blocks, labelled accordingly. The number of entities in each gazetteer's results is shown in the brackets [ ] after the gazetteer name (in the example the search yielded one GeoNames entity, two GOV entities, one GND entity and one Wikidata entity).
You can export all result sets by clicking on the export icon.
By clicking on the layout toggle icon you can switch between a vertically and a horizontally ordered view (the vertical view shown here is the default).
Initially the gazetteer blocks are closed. You can open a block by clicking on it to see a list of the entities of the respective gazetteer.

List view

[Screenshot of the results area]
When you click on a gazetteer block, it shows a list of all entities of that gazetteer selected by the search. In the example the user opened the GeoNames block, showing the single entity as a stub in the result set. Displayed in this list/stub view are the map marker corresponding to the one on the map, the entity's ID, its name and its type(s).
By using the corresponding header elements you can sort the list, or filter it via a name filter phrase or a type filter.
You can export the list view as CSV, JSON, or GeoJSON by clicking the export icon.
After clicking on one of the entities in the list (e.g. in GeoNames on the entity "3099434 - Gdansk") a detailed view of the selected entity appears.

Details view

[Screenshot of the results area]
In the details view all the attributes of the entity are listed.
Various options are available:
: Toggle between "Hide empty attributes" and "Show all attributes"
: Toggle between "Hide non-essential attributes" and "Show all attributes"
: Zoom to the entity on the map
: Add the entity to the compare area
: Export data

Compare area

[Screenshot of the compare area]
Entities you added to the compare area (either from the results or the matchings) are displayed next to each other in a table with the corresponding attributes (id, names, position, type, ...), making it easy to compare them. You can export the entities by clicking on the export icon at the top of the compare area.

Matchings area

[Screenshot of the matchings area]
If you set the matching option in the search area for a search, the app looks up matchings for every entity in the result sets and adds them to the respective items as additional attributes. Left of the results area a vertical tab labelled "Matchings" appears. By clicking on it, you can open the matchings area.
In the example the user searched for "Lubowice". As you can see in the screenshot, the results contain one entity for GeoNames, one for GOV, one for Wikidata, and two for PRNG. Note that the entities shown here are those found by the user's search, which are therefore part of the result sets of the search. To see the target entities that match a particular result set entity, you can click on the respective entity (see the next screenshot and its description).
You can export all matchings for the result set entities as CSV, JSON, or GeoJSON by clicking on the export icon at the top of the matchings area. To add an entity to the compare area or to display it in the details view, click the respective icon.
[Screenshot of the matchings area]
In the example shown on the left the user clicked on the Wikidata entity "Q1397867 - Łubowice". As a result, the list of matchings for that entity is shown. As you can see, the list contains five entities, one each for GND, Teryt, GOV, PRNG and GeoNames. I.e. the app identified these five entities as possibly referring to the same geographical entity as Wikidata:Q1397867.
Each match entity is identified by its respective ID. Additionally, "type" and "description" contain information about the matching's type and how the matching was obtained (see here for further details).

Frequently Asked Questions

What is the current status of the gazetteers.net project?

The application is still under development. The current version offers features like unified search across multiple gazetteers, a compare view and a matching lookup. Further possible development steps may include more advanced functionality, the integration of other gazetteers and further improvement of the matching system.

The number of results is much too high. How can I make it smaller?

You can set the "Name search type" to "Match whole name" in order to make the name search more restrictive. To only get results from a certain region you can draw a bounding box around it via the "Bounding Box" filter (note that this filters out all entities without coordinates). And you can set an "only settlements" filter (note that this filter does not work well for the GND database due to its type system).

There may be interesting entities that are missing in the results. How can I include them?

Your search may be too restrictive. Setting the "Name search type" to "Match word in name" is one option to make it more permissive. You can also use a wildcard (the * symbol) in the name search (e.g. Wroc*w). And if the entities you are looking for do not have coordinates, you won't find them while a geographical filter, i.e. a bounding box, is set.

Why does the application offer rather generic filters (name, location) and a settlement filter?

Our aim was to provide simple yet powerful filters enabling the user to easily make broad searches as well as specific ones. Additionally, a key aspect of the application is the simultaneous querying of multiple gazetteers with one search request in a unified manner. That means the app uses filters which can be applied to all gazetteers and does not offer those that can't be queried for across all gazetteers (e.g. specific characteristics of one gazetteer like its type system).

When I compare two searches, one only specifying the name, the other adding another filter to it (geographical or settlement) I get results in the second one that were not in the first one. Isn't that incorrect?

That happens when your first search yielded a huge number of results larger than the maximum allowed number of the respective service (e.g. for GeoNames it's 1000 results per query). The second search result set may then of course include entities that were above the limit of the first search and thus not included in the first one.
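
The effect can be reproduced with a small Python simulation. The query_service function, the sample data and the assumption that the service applies the extra filter before cutting the result off at its limit are all hypothetical; only the limit of 1000 is taken from the GeoNames example above.

```python
def query_service(all_matches, limit, extra_filter=None):
    """Simulate a service that filters first and then truncates at `limit`."""
    if extra_filter is not None:
        all_matches = [e for e in all_matches if extra_filter(e)]
    return all_matches[:limit]


# 2500 hypothetical entities matching the name; every third is a settlement
name_matches = [{"id": i, "is_settlement": i % 3 == 0} for i in range(2500)]

first = query_service(name_matches, limit=1000)
second = query_service(name_matches, limit=1000,
                       extra_filter=lambda e: e["is_settlement"])

# Entities returned by the filtered search but missing from the name-only one
new_ids = {e["id"] for e in second} - {e["id"] for e in first}
print(len(first), len(second), len(new_ids))
```

The name-only search is cut off at 1000 entities, while the filtered search stays below the limit and therefore also returns settlements that lay beyond the cut-off of the first search.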

How trustworthy is the matching system?

A "match" can be one of three things: 1. a reference found in the original gazetteer's dataset or a combination of multiple references; 2. a match computed by us based in particular on attribute values; 3. a combination of 1. and 2. References are as trustworthy as the sources that state them, i.e. a reference statement from an authority file can be considered trustworthy, whereas reference statements found in Wikidata may not always have the same accuracy. The computed matches are often correct, but since they have not been validated by humans, they contain errors and should be treated as (more or less) probable suggestions.

Conceptual Background

The gazetteers.net web application provides a unified search across multiple gazetteers. Additionally, it supports the identification of items in different databases which refer to the same geographical entity. By linking corresponding items across gazetteers it facilitates data aggregation and comparison.

Access to multiple gazetteers: Currently ten gazetteers are integrated into the application.

Unified search: All integrated gazetteers can be simultaneously queried in a unified manner with one search request. The application uses filters which can be applied to all gazetteers (especially based on names and coordinates) and does not offer those that can't be queried for across all gazetteers.

Matchings: A "match" is a link between items in different gazetteers which likely refer to the same geographical entity. It can be one of three things: 1. a reference found in the original gazetteer's dataset or a combination of multiple references; 2. a match computed by us based in particular on attribute values; 3. a combination of 1. and 2.


Gazetteers

The project's approach is to use live data where possible. Gazetteers which are considered "stable" (e.g. datasets of completed projects) or which are not accessible via a performant and suitable web service were downloaded and imported into the application's database. The user can search all gazetteers or only in a selected subset. Currently ten gazetteers are integrated into the application:

GeoNames

GeoNames is a widely used global geographical database. It is integrated into the application via a web service.

Geschichtliches Ortsverzeichnis (GOV)

GOV is a gazetteer by the German Verein für Computergenealogie (Computer genealogy association). Its focus is on settlements in Europe, the U.S. and Australia, along with historical names, to facilitate genealogical research in these regions. GOV is integrated into the application via a web service.

Gemeinsame Normdatei (GND) (Geographical data)

The Gemeinsame Normdatei (Integrated Authority File), managed by the German National Library, is an international authority file which also contains global geographical entities (GND Geografika). GND Geografika is integrated into the application via a web service.

Wikidata

Wikidata is a collaboratively edited knowledge base which contains structured data about all kinds of entities. All entities containing coordinates and assigned to a country in East-Central Europe were downloaded with a few attributes and integrated into the application's database.

Państwowy Rejestr Nazw Geograficznych (PRNG)

The Państwowy Rejestr Nazw Geograficznych (National Register of Geographical Names) contains a "Register of geographical names of the Polish Republic" which was downloaded and integrated into the application's database.

BKG Historische Ortsnamen

BKG Historische Ortsnamen was created by the Bundesamt für Kartographie und Geodäsie (BKG), the Federal Agency for Cartography and Geodesy. The dataset's entities mainly represent East-Central European settlements and contain historical names. The dataset was kindly provided by the BKG and integrated into the application's database.

Ziemie polskie Korony w XVI w.

Ziemie polskie Korony w XVI w. (Polish lands of the Crown in the 16th century) is a spatial database created by the Tadeusz Manteuffel Institute of History of the Polish Academy of Sciences. It contains data about 16th century Polish settlements. The database was kindly provided by the Polish Academy of Sciences and integrated into the application's database.

Kaszëbsczé miestné muiona (Naszekaszuby)

Kaszëbsczé miestné muiona (Kashubian place names) is a small dataset containing about 1500 entries representing Kashubian and Polish place names in Kashubia in northern Poland. It was downloaded and integrated into the application's database.

Lemko Village Resource Guide (Carpathorusyn)

The Lemko Village Resource Guide contains a small dataset consisting of around 400 entries with Rusyn, Polish and Ukrainian names of places located mainly in southeastern Poland. The dataset was downloaded and integrated into the application's database.

Interaktyvus Rytų Prūsijos žemėlapis IV (Prusijalit)

The Interaktyvus Rytų Prūsijos žemėlapis IV dataset was published by the Lietuvių kalbos institutas (Lithuanian Language Institute). It contains approximately 300 entities representing settlements located in the former East Prussia, with their Lithuanian, German, and optionally Russian and Polish names, along with coordinates. The dataset was downloaded and then integrated into the application's database.

Unified search

The application enables the user to simultaneously query multiple gazetteers with one search request in a unified manner. This makes it easier to work with several databases, since they can normally only be searched individually and not even with uniform criteria, because each has its own search behavior.

The gazetteers.net project team considers a flexible word / name search (i.e. selecting entities based on their names) and a geographical search (i.e. filtering entities based on their coordinates) the core search filters. Additionally it is possible to use a settlement type filter (Note: Due to the properties of and differences in the gazetteers' type systems, this filter is only an approximation, especially concerning the GND data). These three aspects can be combined, where the name filter is mandatory and the others are optional. These filters can be applied to all gazetteers resulting in a consistent, uniform behavior. Other filters that can't be queried for across all gazetteers are not provided.
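
How the three filters combine can be sketched as a small Python data structure. The SearchRequest class and its field names are hypothetical and do not describe the application's actual request format; the sketch only encodes that the name filter is mandatory while the spatial and settlement filters are optional.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SearchRequest:
    # Mandatory name filter
    name: str
    # Optional filters (hypothetical field names)
    bounding_box: Optional[Tuple[float, float, float, float]] = None
    only_settlements: bool = False
    gazetteers: List[str] = field(default_factory=list)

    def __post_init__(self):
        if not self.name.strip():
            raise ValueError("the name filter is mandatory")


request = SearchRequest(name="Wrocław", only_settlements=True,
                        gazetteers=["geonames", "gov"])
print(request)
```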

The search is consistent across all integrated databases, i.e. the query is processed the same way for all gazetteers, e.g. concerning the handling of wildcards and diacritics. The user can easily choose a desired search behavior.

The name search is applied to all names of the entities of all selected gazetteers. The user can optionally set a name search scope and use wildcards. The search is always permissive regarding case and diacritics. Currently the name search is restricted to Latin-based characters, including letters with diacritics.

Name search scope

The user can choose whether a name search is applied to every word in a name or only to the whole name. The search can either be permissive or restrictive in that respect. In case of the permissive "match word in name" (which is the default setting) the search phrase is applied to every single word of a name, i.e. it only needs to match a word contained in a name to include the entity in the result set. In case of "match whole name" the search phrase must match the entire name.
Example
GOV contains an entity named "Wrocław", one named "Wrocław Gądów", and a third named "Bielany Wrocławskie".

The user searches for "Wrocław" and the name search scope is left unchanged, i.e. it is set to "match word in name".
> The result set contains the entity named "Wrocław" as well as the one named "Wrocław Gądów", because in both names there is a matching word; however it does not return "Bielany Wrocławskie" (this would require a wildcard search).

The user searches for "Wrocław" and the name search scope is set to "match whole name".
> The result set contains the entity "Wrocław", but neither the "Wrocław Gądów" nor the "Bielany Wrocławskie" entity.

The third option is to set the name search scope to "original search". In that case every search behavior of a live request (i.e. a request against GeoNames, GOV, or GND) is left as-is.
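
The two unified scopes from the example above can be sketched in a few lines of Python. The matches_name function and its scope values are hypothetical; the sketch splits names at whitespace and ignores wildcards and diacritics, which the following sections cover.

```python
def matches_name(phrase, name, scope="word"):
    """Compare a search phrase with a name under one of two scopes.

    scope="word":  the phrase must equal one word of the name
    scope="whole": the phrase must equal the complete name
    """
    phrase, name = phrase.casefold(), name.casefold()
    if scope == "whole":
        return phrase == name
    return phrase in name.split()


names = ["Wrocław", "Wrocław Gądów", "Bielany Wrocławskie"]

word_hits = [n for n in names if matches_name("Wrocław", n, "word")]
whole_hits = [n for n in names if matches_name("Wrocław", n, "whole")]
print(word_hits)   # ['Wrocław', 'Wrocław Gądów']
print(whole_hits)  # ['Wrocław']
```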

Wildcards

In case of a unified search, i.e. the name search scope is set to "match word in name" or "match whole name", the user can use wildcards (by using the asterisk symbol *). Note that a wildcard cannot be used as the leading character.
Example
As mentioned in the above example, GOV contains three entities, "Wrocław", "Wrocław Gądów", and "Bielany Wrocławskie".

The user searches for "Wrocław*" and the name search scope is the default setting, i.e. "match word in name".
> The result set contains all three entities, because all of them bear a name containing a word which begins with "Wrocław".

The user searches for "Wrocław*" and the name search scope is set to "match whole name".
> The result does not contain the entity named "Bielany Wrocławskie", because the name does not begin with "Wrocław".
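
One way to reproduce this behavior is to translate the search phrase into a regular expression. The following Python sketch is an assumption about how such a search could work, not the application's implementation; wildcard_to_regex and matches are hypothetical helpers.

```python
import re


def wildcard_to_regex(phrase):
    """Turn a phrase with * wildcards into a case-insensitive pattern."""
    if phrase.startswith("*"):
        raise ValueError("a wildcard cannot be used as the leading character")
    # Escape everything except the asterisks, which become ".*"
    parts = [re.escape(part) for part in phrase.split("*")]
    return re.compile(".*".join(parts), re.IGNORECASE)


def matches(phrase, name, scope="word"):
    pattern = wildcard_to_regex(phrase)
    # "word": test each word separately; "whole": test the complete name
    candidates = [name] if scope == "whole" else name.split()
    return any(pattern.fullmatch(c) for c in candidates)


names = ["Wrocław", "Wrocław Gądów", "Bielany Wrocławskie"]
word_hits = [n for n in names if matches("Wrocław*", n, "word")]
whole_hits = [n for n in names if matches("Wrocław*", n, "whole")]
print(word_hits)   # all three names
print(whole_hits)  # ['Wrocław', 'Wrocław Gądów']
```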

Permissive search

The search is case- and diacritic-insensitive in unified search (i.e. if the name search scope is set to "match word in name" or "match whole name").
Example
Let's assume two gazetteers contain entities referring to the city of Wrocław. In the first gazetteer the entity bears the correct name "Wrocław", in the second one it is called "Wroclaw" (containing a plain l without the slash diacritic instead of ł).

The user searches for "Wrocław" (with the diacritic).
> The result sets contain both entities, "Wrocław" and "Wroclaw", because ł is also mapped to l.

The user searches for "Wroclaw" (without the diacritic).
> The result sets contain both entities, "Wroclaw" and "Wrocław", because l is also mapped to the corresponding character containing the diacritic.
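
A common way to obtain this behavior is to fold both the search phrase and the names to a plain form before comparing them. The sketch below is an assumption, not the application's code; the fold helper and the EXTRA table are hypothetical. Note that ł has no Unicode decomposition, so it needs an explicit mapping in addition to the normalization step.

```python
import unicodedata

# Letters with a stroke do not decompose under NFKD, so map them explicitly
EXTRA = str.maketrans({"ł": "l", "Ł": "L", "ø": "o", "Ø": "O", "đ": "d", "Đ": "D"})


def fold(text):
    """Return a case- and diacritic-insensitive comparison key."""
    text = text.translate(EXTRA)
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks left over from the decomposition
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(fold("Wrocław"))  # wroclaw
print(fold("WROCLAW"))  # wroclaw
print(fold("Gądów"))    # gadow
```

Because both spellings produce the same key, a search for either one finds both entities.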


Matchings

The application supports the identification of corresponding items in different databases, i.e. possible "same as" relationships. To this end, for each item in the search result set the web app looks for items in other gazetteers which likely refer to the same geographical entity and adds this additional information to the entity. We built the matching system beforehand, i.e. the possible "same as" relationships are already stored in a separate project database, as determining them live during a user search would take too much time, especially for large result sets.
Note that the term "matchings" is used here for all possible "same as" relationships as well as in a strict sense only for those algorithmically determined.
Example
The user searches for "Wrocław" in the GOV gazetteer with matching functionality enabled. The gazetteers.net server queries the GOV web service accordingly and processes the results. Before sending the result set to the web browser, the gazetteers.net server tries to find possible "same as" relations for each entity in the result set. For the GOV entity representing the city of Wrocław it finds a corresponding entity in GeoNames and adds this information to the GOV entity in the result set before sending it to the browser:
GOV result set: [
...
  Wrocław (id: BREADTJO81MC) {
    names: [
      Wrocław
      Breslau
      ...
    ]  
    ...
    
    /* additional information added by the gazetteers.net server */
    matchings: [
      {
        gazetteer: geonames
        id: 3081368
        link: https://www.geonames.org/3081368
        description: lev dist: 0, geo dist: 2237 m, type: both_settlements, assignment: 1:1
      }
      ...
    ]
    
  }
]
Currently possible "same as" assignments are mainly applied to entities in Poland.
As the matchings lookup requires additional search operations for each entity in each result set, which in sum can be time-consuming, this feature can be turned on and off.
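
The enrichment step can be sketched as a lookup against a precomputed table. In the following Python sketch the MATCHINGS table, its key layout and the add_matchings function are hypothetical; the IDs are taken from the example above.

```python
# Hypothetical precomputed table: (gazetteer, entity id) -> list of matchings
MATCHINGS = {
    ("gov", "BREADTJO81MC"): [
        {"gazetteer": "geonames", "id": "3081368",
         "link": "https://www.geonames.org/3081368"},
    ],
}


def add_matchings(gazetteer, result_set, enabled=True):
    """Attach stored matchings to every entity of a result set."""
    if not enabled:
        return result_set  # feature switched off: skip the extra lookups
    return [
        {**entity, "matchings": MATCHINGS.get((gazetteer, entity["id"]), [])}
        for entity in result_set
    ]


gov_results = [{"id": "BREADTJO81MC", "names": ["Wrocław", "Breslau"]}]
enriched = add_matchings("gov", gov_results)
print(enriched[0]["matchings"][0]["id"])  # 3081368
```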

Layers of the matchings system

This system of identifying possible "same as"-relations is composed of the following layers / components:

1) "live references": references already contained in the live data returned by the respective gazetteer service

2) "reference table": references found in database dumps, or derived by combining such references

3) "matchings": possible matchings essentially based on name similarity and geographical proximity

4) "matchings linked with references/matchings": a combination of 2) and 3)

2) "reference table"

Some gazetteers contain "same as" references to other gazetteers. In order to build a comprehensive reference table that contains the given references in a unified form, they are also combined to infer indirect references. Additionally, to have a fast-responding service the respective database dumps were downloaded and parsed offline, i.e. beforehand. The resulting reference table is used in the gazetteer app.
Example
As mentioned, the reference table was constructed offline and is stored in the application's database. This example illustrates the process.

In many gazetteers, there are entities representing a Silesian village named Łubowice:

E.g. in GOV (gov::object_189067), in a Polish database called TERYT/SIMC (teryt_simc::0220871), in Wikidata (wikidata::1397867), and in GND (gnd::10132266-5).

gov::object_189067 contains a reference to teryt_simc::0220871

wikidata::1397867 contains two references, one to teryt_simc::0220871, one to gnd::10132266-5

From these three reference statements it can be concluded that all four items represent the same geographical entity, because if

  gov::object_189067  =   teryt_simc::0220871    and
  wikidata::1397867   =   teryt_simc::0220871    and
  wikidata::1397867   =   gnd::10132266-5
then
  gov::object_189067  =  teryt_simc::0220871  =  wikidata::1397867  =  gnd::10132266-5
So six "same as" references can be explicitly formulated between two entities at a time
(gov:... = teryt_simc:... , gov:... = wikidata:... , ... , wikidata::... = gnd::...)
(or actually twelve, if a distinction is made between a statement gazA::ent1 = gazB::ent2 and its inverse gazB::ent2 = gazA::ent1)
and these references are stored in the reference table.
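
The inference in this example amounts to grouping entities that are connected through references and then listing all pairs within each group. The following Python sketch uses a simple union-find structure for the grouping; it is one possible way to do this and not necessarily how the reference table was actually built. The function name is hypothetical.

```python
from itertools import combinations


def connected_groups(references):
    """Group entities that are linked directly or indirectly."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in references:
        parent[find(a)] = find(b)

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    return list(groups.values())


references = [
    ("gov::object_189067", "teryt_simc::0220871"),
    ("wikidata::1397867", "teryt_simc::0220871"),
    ("wikidata::1397867", "gnd::10132266-5"),
]

groups = connected_groups(references)
pairs = [p for g in groups for p in combinations(sorted(g), 2)]
directed = pairs + [(b, a) for a, b in pairs]
print(len(groups), len(pairs), len(directed))  # 1 6 12
```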

3) "matchings"

Corresponding entities can also be determined by comparing their attribute values. Especially the entities' names and the coordinates are well suited for that, because combined they make a rather good identifier, and because similarity measures can be easily computed for two names (using the "Levenshtein distance") or two coordinates (calculating the geographic distance), provided that suitable data sets are available.
Real data often do not allow a clear assignment based on this strategy. The reasons for that include name related challenges like name variations, e.g. possible name affixes, missing coordinates, ambiguities (e.g. because an entity in one gazetteer is split up into many in another gazetteer), and the absence of a record in a database.
In order to reduce ambiguous matchings the entities are compared based on normalised data (e.g. optionally removing diacritics and name affixes) and additional information like the entity type may be used.
Like the "reference table", the "matchings" component was built offline and is stored in the application's database.
Example
The matching algorithm also searches for corresponding GeoNames-GOV entity pairs.

The Łubowice entity in GOV contains two names ("Łubowice (województwo śląskie)", "Lubowitz"), coordinates (50.16012, 18.23241) and a type ("Dorf" (village)).

When using the normalized name ("Łubowice") and the coordinate and type information, the algorithm identifies a matching GeoNames entity:

{id: 11497515, name: "Łubowice", coordinates: (50.16139, 18.23583), type: "PPLH - historical populated place"}

The match was determined by

comparing the names with the "Levenshtein distance" name distance measurement:

lev("Łubowice", "Łubowice") = 0
and calculating the geographical distance of the two point coordinates:
geodist((50.16012, 18.23241), (50.16139, 18.23583)) = 0.282 km
and taking into account that both entities are settlements ("village" and "historical populated place")

By using these three criteria the algorithm unambiguously matches the two entities, i.e. for the GOV entity there is only one matching GeoNames entity and for the GeoNames entity there is only one matching GOV entity as well.
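
Both measures from this example can be reproduced with a short Python sketch. The levenshtein function is a standard dynamic programming implementation; for the geographic distance the sketch assumes the haversine formula with a mean Earth radius of 6371 km, which is an assumption, as the text does not state which formula the project uses.

```python
import math


def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def haversine_km(p, q, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


name_distance = levenshtein("Łubowice", "Łubowice")
geo_distance = haversine_km((50.16012, 18.23241), (50.16139, 18.23583))
print(name_distance)            # 0
print(round(geo_distance, 2))   # 0.28
```

Under these assumptions the computed distance agrees with the 0.282 km given above to within a few metres.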

4) "matchings linked with references/matchings"

The matchings and the reference table can be combined to infer possible "same as" relationships.
Example
After combining the reference table with the matchings there is a link between entities from the Naszekaszuby gazetteer and Wikidata, both representing a Pomeranian village named Skórowo. The Wikidata entity contains a reference to a GeoNames entity, and this GeoNames entity is matched to the Naszekaszuby entity based on name similarity and region:
reference table entry:
wikidata::Q431650 = geonames::3085756

matching entry:
geonames::3085756 = naszekaszuby::1192

resulting match linked with reference entry:
wikidata::Q431650 = naszekaszuby::1192
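
The combination shown above is a join of the two tables on the shared entity. A minimal Python sketch, assuming both tables are simple lists of pairs (the link function is hypothetical):

```python
def link(reference_table, matchings):
    """Derive a = c whenever a = b is a reference and b = c is a matching."""
    # Index the matchings by both of their ends, since "=" is symmetric
    partners = {}
    for b, c in matchings:
        partners.setdefault(b, []).append(c)
        partners.setdefault(c, []).append(b)

    linked = set()
    for a, b in reference_table:
        for source, shared in ((a, b), (b, a)):
            for target in partners.get(shared, []):
                if target != source:
                    linked.add((source, target))
    return linked


reference_table = [("wikidata::Q431650", "geonames::3085756")]
matchings = [("geonames::3085756", "naszekaszuby::1192")]

print(link(reference_table, matchings))
# {('wikidata::Q431650', 'naszekaszuby::1192')}
```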

Integration

The "same as" system is prioritized in the aforementioned order. I.e. if the live data of an entity in gazetteer A contains a "same as" reference to an entity in another gazetteer B, this information is considered most valid, and possible "same as" statements for that entity to gazetteer B in 2), 3) and 4) are ignored. Accordingly, "same as" references in 2) are preferred to those in 3) and 4), and so on.
Example
The example illustrates how the "same as" information for the GOV entity of the Silesian village Łubowice is put together.
The entity data delivered by the GOV web service already contains a "same as" reference to an entity in the TERYT/SIMC database. This "live" reference is added to the "same as" property of the GOV Łubowice entity:
	
db	"teryt_simc"
id	"0220871"
type	"ref (from live data)"
The system looks for other "same as" references in the reference table. As the live data already contained a reference to a TERYT/SIMC entity, the system does not look for one to that database. But the table contains indirect references to entities in GND, PRNG, and Wikidata, so the following is added:
db	"prng"
id	"73702"
type	"ref"
description	"ref dbpath: gov > teryt_simc < prng"
	
db	"wikidata"
id	"1397867"
type	"ref"
description	"ref dbpath: gov > teryt_simc < wikidata"
link	"https://www.wikidata.org/wiki/Q1397867"

db      "gnd"
id      "10132266-5"
type    "ref"
description     "ref dbpath: gov > teryt_simc < wikidata > gnd"
link    "https://d-nb.info/gnd/10132266-5"
As neither the live data nor the reference table contain a reference from the GOV Łubowice entity to a GeoNames entity, the app checks the matchings for that and adds:
	
db	"geonames"
id	"11497515"
type	"match"
description	"lev dist: 0, geo dist: 282 m, type: both_settlements, assignment: 1:1"
link	"https://www.geonames.org/11497515"
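
The prioritization can be sketched as a merge over the layers in priority order, where a lower layer may only contribute entries for target gazetteers that no higher layer has covered. The integrate function below is a hypothetical illustration; the data is taken from the example, except for one entry marked as hypothetical that only serves to show the suppression.

```python
def integrate(*layers):
    """Merge "same as" layers given in priority order (highest first)."""
    merged = {}
    for layer in layers:
        covered = set(merged)  # targets already served by higher layers
        for entry in layer:
            if entry["db"] not in covered:
                merged.setdefault(entry["db"], []).append(entry)
    return merged


live = [{"db": "teryt_simc", "id": "0220871", "type": "ref (from live data)"}]
table = [
    {"db": "prng", "id": "73702", "type": "ref"},
    {"db": "wikidata", "id": "1397867", "type": "ref"},
    {"db": "gnd", "id": "10132266-5", "type": "ref"},
]
matches = [
    {"db": "geonames", "id": "11497515", "type": "match"},
    # Hypothetical entry, added only to show that it is ignored because
    # the reference table already covers Wikidata
    {"db": "wikidata", "id": "1397867", "type": "match"},
]

same_as = integrate(live, table, matches)
for db, entries in same_as.items():
    print(db, [e["type"] for e in entries])
```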