w!"$&'+,-.012345yA|FI MUFaculty of InformaticsMasaryk UniversityData Structures for Spatial Data MiningbyPetr KubaFI MU Report SeriesFIMU-RS-2001-05Copyright c 2001, FI MUSeptember 2001Data structures for spatial data miningPetr KubaDepartment of Computer Science, Faculty of InformaticsMasaryk University Brno, Czech Republicxkuba@fi.muni.czSeptember 4, 2001AbstractThis report deals with spatial data structures for indexing andwith their usability for knowledge discovery in spatial data. Hugeamount of data processed in spatial data mining (or in data min-ing generally) requires using some indexing structures to speed upthe mining process. Typical data types and operations used in geo-graphic information systems are described in this paper. Then basicspatial data mining tasks and some spatial data mining systems areintroduced. Finally, indexing spatial structures for both vector andmetric spaces are described and structures used in some spatial datamining systems are presented.1 Spatial data types in GISA geographic information system [43] is a special kind of information sys-tem, which allows manipulate, analyse, summarize, query, edit and visu-alize geographically related data. Geographically related data are com-posed of:• spatial attributes, e. g. coordinates, geometry• non-spatial attributes, e. g. name of town, number of inhabitants12 SPATIAL DATA MINING2Spatial data in GIS may be represented in raster or vector model. Rastermodel divides space into a regular grid of cells, usually called pixels. Eachcell contains a single value and its position is defined by its indices in thegrid. The resolution of the raster depends on its pixel size. The smaller thepixel size, the higher the resolution, but also the larger the data size.Vector model represents spatial objects with data structures, whose basicprimitive is a point. This allows precise representation of coordinates andit’s useful for analysis. Data structures used to store spatial objects in thevector model are:• point – defined by its coordinates• chain – sequence of points connected with lines• polygon – sequence of connected chains (the last point of one chain isthe first point of the other chain), it is enclosed and the chains mustnot intersectFundamental operations used to manipulate vector data are:• determining the distance of two objects• determining the area of the object (if it is a polygon)• determining the length of the object (if it is a chain or polygon)• determining an intersection or union of the objects• determining a mutual position of two objects (they can intersect,overlap, touch, one can contain the other, . . . )2 Spatial data miningAnalysis is an important part of GIS which allows spatial operations withdata (e. g. network analysis or filtering of raster data), measuring func-tions (e.g. distance, direction between objects), statistic analyses or terrainmodel analysis (e. g. visibility analysis).Spatial data mining [11, 24] is a special kind of data mining [12]. Themain difference between data mining and spatial data mining is that inspatial data mining tasks we use not only non-spatial attributes (as it isusual in datamining in non-spatial data), but also spatial attributes.2 SPATIAL DATA MINING32.1 Spatial data mining tasksBasic tasks of spatial data mining are:• classification – finds a set of rules which determine the class of theclassified object according to its attributes e. g. ”IF population of city= high AND economic power of city = high THEN unemployment ofcity = low” or classification of a pixel into one of classes, e. g. water,field, forest.• association rules – find (spatially related) rules from the database. As-sociation rules describe patterns, which are often in the database.The association rule has the following form: A → B(s%, c%), wheres is the support of the rule (the probability, that A and B hold to-gether in all the possible cases) and c is the confidence (the condi-tional probability that B is true under the condition of A e. g. ”ifthe city is large, it is near the river (with probability 80%)” or ”if theneighboring pixels are classified as water, then central pixel is water(probability 80%).”• characteristic rules – describe some part of database e. g. ”bridge is anobject in the place where a road crosses a river.”• discriminant rules – describe differences between two parts ofdatabase e. g. find differences between cities with high and lowunemployment rate.• clustering – groups the object from database into clusters in such away that object in one cluster are similar and objects from differentclusters are dissimilar e. g. we can find clusters of cities with similarlevel of unemployment or we can cluster pixels into similarity classesbased on spectral characteristics.• trend detection – finds trends in database. A trend is a temporal pat-tern in some time series data. A spatial trend is defined as a patternof change of a non-spatial attribute in the neighborhood of a spatialobject e. g. ”when moving away from Brno, the unemployment rateincreases” or we can find changes of pixel classification of a givenarea in the last five years.2 SPATIAL DATA MINING4In the rest of this section, we will describe several existing systems whichcan be used for spatial data mining and information sources concerningspatial data mining.2.2 Spatial data mining systemsGeoMinerThe GeoMiner [17, 21] is a system for knowledge discovery in large spatialdatabases. It was developed at Simon Fraser University in Canada. TheGeoMiner is an extension of and developed from DBMiner [9, 20]. TheDBMiner is a relational data mining system, which uses Microsoft SQLServer 7.0 to store data. It contains the five following data mining mod-ules: Association, Classification, Clustering, 3D Cube Explorer (displays datacube in a 3D view) and OLAP Browser (generalizes data in a spreadsheetor graphical form).The Geominer is composed of the following modules:Geo–characterizer, Geo–associator, Geo–cluster analyzer, Geo–classifier (their func-tion is similar to the previous definition) and Geo–comparator (its functioncorresponds to the discriminant rules in the previous definition).The system contains its own language for knowledge discovery in spa-tial data and it uses graphical interface to communicate with the user anddisplay results in the form of graphs, charts, maps, etc.DescartesSystem Descartes [8] supports the visual analysis of spatially referenceddata. It uses two basic tools: automatic visualization (presentation of thedata on the map) and interactive manipulation with maps. The systemuses the following methods to visualize information: area coloration, chartsand combination of both of them.The area coloration represents a numeric attribute as color: the greaterthe value, the darker the color of the region. Descartes offers various typesof charts (bar, pie, etc.). The combination of both the methods allows tovisualize one attribute with area coloration and another attribute (or at-tributes) with charts.In [1] the integration of Descartes with Kepler is used to classify spatialrelated data. Kepler [44] is a knowledge discovery system with a plug–2 SPATIAL DATA MINING51LowMediumHigh0valueFigure 1: Fuzzy values for numerical attributein architecture. The classification system C4.5 [38] is one of the pluggins.In this integrated system, Kepler is used to classify data and Descartes isused to visualize and analyse source data and results of the classificationon the map.Fuzzy Spatial OQL for Fuzzy KDDIn [4], a fuzzy spatial object query language and fuzzy decision tree [30]are introduced. This language is designed to select, process and mine datafrom Spatial Object–Oriented Databases and it is based on a fuzzy set the-ory. In the classical theory, given a set S and an element e, we can decidewhether this element belongs or does not belong to S. In the fuzzy settheory, the probability that e belongs to S can vary from 0 to 1. In figure1, a numerical attribute value can have the fuzzy values of Low, Mediumand High.In the fuzzy spatial OQL, fuzzy values are used in the where clause ofa fuzzy query and the answer is a fuzzy set of elements defined by thesefuzzy values. In the fuzzy decision tree, each node is associated with a teston the values of some attribute. All the edges of the node are labeled byfuzzy values. This enhances the comprehensibility of the decision tree.GWiMIn [35], the use of Inductive Logic Programing for knowledge discoveryin spatial data is discussed and an inductive query language is proposed.Then, a description of mining system GWiM is given.GWiM is a system for knowledge discovery in spatial data. It is builtupon the WiM system. WiM [14] uses inductive logic programming tosynthesize closed Horn clauses.2 SPATIAL DATA MINING6Query language of GWiM contains three types of queries. Two of them,characteristic and discriminant rules, are adaptation of query languageof DBMiner [9]. The dependency rules describe a dependency betweendifferent classes.extract <KindOfRule> rulefor <NameOfTarget>[from <ListOfClasses>] [<Constraints>][from point of view <ExplicitDomainKnowledge>];The clause <KindOfRule> determines which rule we want to mine.The Following clauses <NameOfTarget>, <ListOfClasses> and<Constraints> specify the data which should be used for mining. Theclause <ExplicitDomainKnowledge> is a list of predicates or hierarchyof predicates. The answer to these queries is a formula of first–order logicwhich characterizes the subset of the database specified by the rule.GeoKDIn [26, 27], a language for knowledge discovery in spatial data is proposed.Interpreter of this language is called GeoKD. This language contains threekinds of rules (classification, characteristic and discrimination rules) andit uses neighborhood graphs to represent spatial data. Syntax of the lan-guage is very similar to GWiM and GeoMiner:extract <KindOfRule> rulefor <NameOfTarget>from <ListOfClasses>where <condition>from point of view <DomainKnowledge>;The clause <KindOfRule> determines which rule we want to mine (clas-sification, characteristic or discrimination). Clause <NameOfTarget>determines the object we want to discover the knowledge about.<ListOfClasses> contains a list of tables where data for miningare stored. Data from these tables are selected according to the con-dition <condition>, which is similar to the language SQL. Clause<DomainKnowledge> contains some other information necessary forknowledge discovery. For example, it contains information where spatialdata for this query are stored.3 SPATIAL DATA STRUCTURES IN GIS7Source data for interpreter of this language are stored in the database Post-greSQL [36]. Interpreter uses the three programs for data mining: C4.5[38], Rt4.0 [39] and Progol [37]. C4.5 and Rt4.0 are used to find classifi-cation rules and Progol is used to find characteristic and discriminationrules. Interpreter unifies access to these systems and allows to use themfor spatial datamining.2.3 Information sourcesPapers from the area of knowledge discovery in spatial data can be foundin conference proceedings and journals that focus on GIS or knowledgediscovery in databases. Here are some basic of them: European Conferenceon Principles of Data Mining and Knowledge Discovery (PKDD) [33], ACMSIGKDD International Conference on Knowledge Discovery and Data Mining,ACM International Workshop on Advances in Geographic Information Systems[25], International Conference on Geographic Information Science (GIScience)[18] and journals Data Mining and Knowledge Discovery [7] and GeoInfor-matica [16].Local conferences in the Czech republic are GIS... Ostrava, Dob´yv´an´ıznalost´ı z datab´az´ı or Znalosti.Another useful information source is the project Spin! [41] where newSpatial Data Mining system is being developed, or The National Center forGeographic Information and Analysis (NCGIA) [31].The rest of the report is organized as follows. In section 3 basic spatialdata structures used in GIS are presented. In section 4, spatial structuresdesigned primarily for metric spaces are described. And finally, structuresused in some spatial data mining systems are presented in section 5.3 Spatial data structures in GIS3.1 Quad treeThe quad tree [13, 34] is used to index 2D space. Each internal node ofthe tree splits the space into four disjunct subspaces (called NW, NE, SW,SE) according to the axes. Each of these subspaces is split recursively untilthere is at most one object inside each of them (see Figure 2).3 SPATIAL DATA STRUCTURES IN GIS813342125675647Figure 2: Quad tree41263251365747Figure 3: k–d–treeThe quad tree is not balanced and its balance depends on the data distri-bution and the order of inserting the points.3.2 k–d–treeThis method uses a binary tree to split k–dimensional space [3, 15, 34]. Thistree splits the space into two subspaces according to one of the coordinatesof the splitting point (see Figure 3).Let level(nod) be the length of the path from the root to the nodenod and suppose the axes are numbered from 0 to k − 1. At the levellevel(nod) in every node the space is split according to the coordinate num-3 SPATIAL DATA STRUCTURES IN GIS9A B C1A3B21 23 56 7 4C5647Figure 4: R–treeber (level(nod) mod k).Inserting and searching are similar to the binary trees. We only haveto compare nodes according to the coordinate number (level(nod) mod k).This structure has one disadvantage: it is sensitive to the order in whichthe objects are inserted.3.3 R–treeThe R–tree [19, 34] is the modification of B-tree for spatial data. This tree isbalanced and splits the space into the rectangles which can overlap. Eachnode except root contains from m to M children, where 2 ≤ m ≤ M/2.The root contains at least 2 children unless it is a leaf. Figure 4 shows anexample of r-tree of the order 3.The node is represented by the minimum bounding rectangle contain-ing all the objects of its subtree. Each of children of the node is split recur-sively. Pointers to the data objects are stored in the leafs.Because of the overlapping of bounding rectangles it could be neces-sary to search more than one branch of the tree. Therefore, it is impor-tant to separate the rectangles as much as possible. This problem mustbe solved by the operation INSERT which uses some kind of heuristic.It finds such a leaf that inserting a new object into it will cause as smallchanges in the tree as possible.The splitting operation is also important. We want to decrease theprobability that we will have to search both new nodes. Testing all the pos-
Add New Comment