Assist user in pattern identification
Network graphs provide a means to visually explore and discover patterns in data that other graphical methods might not reveal, but it is not always clear how to generate, filter, or process graph data to identify those patterns. General network graph visualization packages are common and allow a user to generate a network graph from data provided in a standard format. To use this software effectively, however, the user must understand the data mining issues relevant to network graph theory. Without that understanding, large data sets that are reduced in size for manageability may be filtered incorrectly, leaving significant patterns undiscovered. The bioinformatics community therefore needs a software environment in which bioinformatics specialists (not graph theory specialists) can examine data and be assisted in identifying patterns within a network graph. The environment should incorporate key concepts of graph theory and ensure that a data set is exhaustively examined for significant patterns. The user can then make an informed decision about an appropriate filtering method based on the pattern search results.
Suite of local and remote data analysis tools
In its current form, GeNetViz provides tools to plot various variables of the network graph. While these plots will identify significant variable quantities, they do not provide information about the significance of one variable test versus another. Thus, Exploratory Data Analysis [SCE_5] (EDA) will be employed to maximize insight into the data and to identify important variables, noise, and anomalies. Through this process, a suite of exploratory methods will be built, consisting of basic statistical analysis and multivariate techniques such as cluster analysis.
Some multivariate techniques are computationally intensive and not practical for a typical personal computer. The same is true of layout algorithms, which are necessary for a network graph of any size. Several different layout theories exist and have been implemented on different platforms, but as a network graph becomes large, most common personal computers will take several days to produce an appropriately converged layout. In these cases, access to dedicated and/or parallel remote algorithms is desired. Interfacing with a remote database of such algorithms can be implemented through the Java framework.
The set of exploratory algorithms will be accessible to the user either individually or as a group that may be run in its entirety. The results of the statistical tests will provide a numeric measure of the significance of each recognized pattern. This measure can then be used to filter the data set or to identify appropriate algorithms for visualization. Furthermore, a subset of data produced by the EDA process might be used in more advanced data mining techniques for predictive and comparative studies with larger data sets.
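The idea of running exploratory algorithms one at a time or as a group can be sketched as a small registry of named measures. This is only an illustration under stated assumptions: `SimpleGraph`, `EdaSuite`, and the three measures are hypothetical names that do not come from GeNetViz, and simple descriptive statistics (mean degree, density, maximum degree) stand in for the statistical tests the text describes.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Undirected graph as an edge list over vertices 0..vertexCount-1 (hypothetical type). */
class SimpleGraph {
    final int vertexCount;
    final List<int[]> edges;

    SimpleGraph(int vertexCount, List<int[]> edges) {
        this.vertexCount = vertexCount;
        this.edges = edges;
    }
}

/** Hypothetical registry of exploratory measures; not GeNetViz's actual API. */
class EdaSuite {
    private final Map<String, ToDoubleFunction<SimpleGraph>> measures = new LinkedHashMap<>();

    void register(String name, ToDoubleFunction<SimpleGraph> measure) {
        measures.put(name, measure);
    }

    /** Runs one named measure on the graph. */
    double run(String name, SimpleGraph g) {
        ToDoubleFunction<SimpleGraph> m = measures.get(name);
        if (m == null) throw new IllegalArgumentException("Unknown measure: " + name);
        return m.applyAsDouble(g);
    }

    /** Runs every registered measure, in registration order. */
    Map<String, Double> runAll(SimpleGraph g) {
        Map<String, Double> results = new LinkedHashMap<>();
        for (Map.Entry<String, ToDoubleFunction<SimpleGraph>> e : measures.entrySet()) {
            results.put(e.getKey(), e.getValue().applyAsDouble(g));
        }
        return results;
    }

    /** Average number of edge endpoints per vertex. */
    static double meanDegree(SimpleGraph g) {
        return g.vertexCount == 0 ? 0.0 : 2.0 * g.edges.size() / g.vertexCount;
    }

    /** Fraction of possible undirected edges that are present. */
    static double density(SimpleGraph g) {
        long n = g.vertexCount;
        return n < 2 ? 0.0 : 2.0 * g.edges.size() / (n * (n - 1));
    }

    /** Largest vertex degree, a crude indicator of hub vertices. */
    static double maxDegree(SimpleGraph g) {
        int[] degree = new int[g.vertexCount];
        int max = 0;
        for (int[] e : g.edges) {
            max = Math.max(max, Math.max(++degree[e[0]], ++degree[e[1]]));
        }
        return max;
    }

    public static void main(String[] args) {
        // A star graph: vertex 0 connected to vertices 1, 2, and 3.
        SimpleGraph star = new SimpleGraph(4, Arrays.asList(
                new int[] {0, 1}, new int[] {0, 2}, new int[] {0, 3}));
        EdaSuite suite = new EdaSuite();
        suite.register("meanDegree", EdaSuite::meanDegree);
        suite.register("density", EdaSuite::density);
        suite.register("maxDegree", EdaSuite::maxDegree);
        System.out.println(suite.run("density", star)); // one measure
        System.out.println(suite.runAll(star));         // the whole group
    }
}
```

Keeping each measure behind the same function signature is what would let a remote implementation be swapped in later without changing how the suite is invoked.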
Remote repository of network graphs
Building a network graph of genome data can be time-consuming, as large data sets are processed with multiple algorithms in an attempt to filter out unwanted data and produce a graph with significant patterns and a manageable size. Also, as graphs are generated, comparisons with other graphs are desirable for finding correlations between data sets. A remote data repository of network graphs will provide a central location for shared information among collaborators. Connectivity to a database from the Java framework is made possible through the Java Database Connectivity (JDBC) API. Here, GeNetViz will give a user the ability to open an existing network graph, save a graph to the database, and run comparison algorithms among data sets. Similarly, groups that wish to share fully defined GeNetViz files in the GraphML format need not pass files back and forth. Because GraphML is XML-based, GeNetViz files will automatically be web friendly and accessible from a remote location.
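As a minimal sketch of how saving and opening a graph through JDBC might look, the class below serializes a small graph to GraphML and stores it as text. The class name `GraphRepository`, the table `network_graph`, and its columns `name` and `graphml` are assumptions made for illustration; the source does not describe a schema.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Hypothetical repository helper; table and column names are assumptions. */
class GraphRepository {

    /** Serializes an undirected edge list over vertices 0..vertexCount-1 as GraphML. */
    static String toGraphML(int vertexCount, int[][] edges) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<graphml xmlns=\"http://graphml.graphdrawing.org/xmlns\">\n");
        sb.append("  <graph id=\"G\" edgedefault=\"undirected\">\n");
        for (int v = 0; v < vertexCount; v++) {
            sb.append("    <node id=\"n").append(v).append("\"/>\n");
        }
        for (int[] e : edges) {
            sb.append("    <edge source=\"n").append(e[0])
              .append("\" target=\"n").append(e[1]).append("\"/>\n");
        }
        sb.append("  </graph>\n</graphml>\n");
        return sb.toString();
    }

    /** Stores a GraphML document under the given name (assumed table layout). */
    static void save(Connection conn, String name, String graphml) throws SQLException {
        String sql = "INSERT INTO network_graph (name, graphml) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, name);
            ps.setString(2, graphml);
            ps.executeUpdate();
        }
    }

    /** Returns the stored GraphML for the given name, or null if none exists. */
    static String load(Connection conn, String name) throws SQLException {
        String sql = "SELECT graphml FROM network_graph WHERE name = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, name);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString("graphml") : null;
            }
        }
    }
}
```

A `Connection` would typically come from `DriverManager.getConnection` with a JDBC URL for whichever database hosts the repository; the choice of database and driver is not stated in the source and is left open here.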
GeNetViz can read a network graph of two thousand vertices and four thousand edges. However, as the number of edges increases, local algorithm execution and visual rendering slow down. When a graph contains over one thousand vertices, the rendering capability of GeNetViz switches to a "simplified" form to keep the application responsive. In future releases, GeNetViz will limit the data set size according to the amount of memory available in the Java Virtual Machine. It is not reasonable to expect that all algorithms will become ineffective at the same "cut-off" value on all computers. Thus, each algorithm will determine its effectiveness on a case-by-case basis using the number of graph vertices, the number of edges, and the speed of the local PC, notifying the user of any potential problem. The user may then choose to reduce the data set size or run the algorithm remotely. Further efficiency issues will be explored in rendering techniques, including shadowing, where an object's "shadow" is manipulated rather than the fully rendered object, and logical collapsing of vertex clusters to reduce the number of visible vertices.
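The case-by-case check described above could look something like the following sketch. Everything here except the one-thousand-vertex rendering threshold is an assumption: the class and method names are hypothetical, the per-vertex and per-edge memory figures are placeholders, and the cost model (a naive force-directed layout costing roughly V*V + E operations per iteration) is one plausible choice rather than the model GeNetViz uses.

```java
/** Hypothetical pre-flight check; cost model and constants are illustrative assumptions. */
class FeasibilityCheck {
    enum Advice { RUN_LOCALLY, REDUCE_OR_RUN_REMOTELY }

    /** Vertex count above which the text says rendering switches to the simplified form. */
    static final int SIMPLIFIED_RENDER_THRESHOLD = 1000;

    // Assumed rough per-element memory footprints, in bytes.
    static final long BYTES_PER_VERTEX = 64;
    static final long BYTES_PER_EDGE = 32;

    static boolean useSimplifiedRendering(long vertices) {
        return vertices > SIMPLIFIED_RENDER_THRESHOLD;
    }

    /** Assumed cost model: naive force-directed layout, V*V + E operations per iteration. */
    static double estimateLayoutSeconds(long vertices, long edges,
                                        int iterations, double opsPerSecond) {
        double opsPerIteration = (double) vertices * vertices + edges;
        return iterations * opsPerIteration / opsPerSecond;
    }

    /** Compares the assumed graph footprint with the available heap. */
    static boolean fitsInMemory(long vertices, long edges, long maxHeapBytes) {
        return vertices * BYTES_PER_VERTEX + edges * BYTES_PER_EDGE <= maxHeapBytes;
    }

    /** Recommends local execution only if both the memory and time estimates are acceptable. */
    static Advice advise(long vertices, long edges, int iterations,
                         double opsPerSecond, double maxSeconds, long maxHeapBytes) {
        boolean ok = fitsInMemory(vertices, edges, maxHeapBytes)
                && estimateLayoutSeconds(vertices, edges, iterations, opsPerSecond) <= maxSeconds;
        return ok ? Advice.RUN_LOCALLY : Advice.REDUCE_OR_RUN_REMOTELY;
    }

    public static void main(String[] args) {
        long heap = Runtime.getRuntime().maxMemory(); // JVM heap limit in bytes
        // opsPerSecond would come from a quick benchmark of the local machine; 1e8 is a placeholder.
        System.out.println(advise(2000, 4000, 100, 1e8, 60.0, heap));
    }
}
```

In practice each algorithm would supply its own cost model, since a layout algorithm and a clustering algorithm scale differently with vertices and edges.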
GeNetViz is developed and maintained by Shawn Ericson, Bing Zhang, Stephen Kirov, and Jay Snoddy
File: future_work.htm, Website author: Shawn Ericson, Last revised: October 10, 2005