Data Visualization Poll (Fall 2011)

Seven months ago I published a poll on LinkedIn and got a lot of responses: 1340 votes (on average 1 vote per hour) plus comments. People have asked me many times to repeat this poll from time to time, so I guess it is time to re-poll. Based on the feedback I got, I added 2 more choices: Omniscope and Visual Insight/Microstrategy (LinkedIn allows a maximum of 5 choices in its polls, which is clearly not enough for this poll). I also got some angry voters complaining that certain vendors are funding this poll. This is completely FALSE: I am not affiliated with any of the vendors mentioned in this poll, and I work for a software company that is completely independent from those vendors; see the About page of this blog.


Tabular Model, Columnstore, new BIDS

Microsoft finally released SQL Server 11 “Denali” as CTP3 (Community Technology Preview) for public preview. Microsoft is (these are the most polite words I can type) stubbornly refusing to build its own Data Visualization product. I doubt that the Crescent “experience” can be considered a product, especially because it is Silverlight-based, while the world has already moved to HTML5.

If you have 7 minutes, you can watch the Crescent demo from WPC11, which shows that, while trailing a few years behind the DV Leaders and Google, Microsoft is giving its die-hard followers something to cheer about:

I have to admit that, while there is nothing new (for a DV expert) in the video above, it is huge progress compared with the Excel-based Data Visualizations that Microsoft tried to promote as a replacement for ProClarity and PerformancePoint Server. Even Microsoft itself positions Crescent (which is 32-bit only!) as a replacement for SSRS Report Builder, so the DV Leaders can sleep well for another night.

However, Microsoft’s BI stack is number 4 or 5 on my list of DV Leaders, and CTP3 is so rich with cool new functionality that it deserves to be covered on this blog.

Of course, the major news is the availability of the Tabular Data Model, which means the VertiPaq in-memory columnar engine, similar to the PowerPivot engine but running on a server without any SharePoint (which is a slow virus, as far as I am concerned) and without the stupid SharePoint UI and limitations. I quote Microsoft: “In contrast with the previous release, where VertiPaq was only available via in PowerPivot for SharePoint, you can now use VertiPaq on a standalone Analysis Services instance with no dependency on SharePoint.”

SSAS (SQL Server Analysis Services) has new features (they may have existed before, but before CTP3 everyone who knew about them was under NDA) such as memory paging (which allows models to be larger than the physical memory of the server, and in practice means far better scalability and Big Data support), row-level security (the user's identity determines which data is visible), KPIs and partitions. CTP3 removes the maximum 4GB file size limit for the string storage file and removes the limit of 2 billion rows per table (each column is still limited to a maximum of 2 billion distinct values, but in a columnar database that is a much more tolerable restriction!).

A new version of PowerPivot is released with support for the Tabular Model, and I quote: “You can use this version of the add-in to author and publish PowerPivot workbooks from Excel 2010 to Microsoft SQL Server”, which means no SharePoint involvement again! As Marco Russo put it: “Import your existing PowerPivot workbooks in a Tabular project (yes, you can!)”, and I agree 100% with Marco when he says (4 times): Learn DAX!

After 3 years of delays, Microsoft finally has BIDS for Visual Studio 2010, and that is huge too. I quote again: “The Tabular Model Designer … is now integrated with Microsoft SQL Server “Denali” (CTP 3) Business Intelligence Development Studio.” It means that BIDS is now not just available but is the main unified development interface for both Multidimensional and Tabular Data Models. Now we can forget about Visual Studio 2008 and finally use the more modern VS2010!

Another feature that is extremely important for Data Visualization is not in SSAS but in SQL Server itself: the columnstore index is finally released, and I quote one more time: “The … SQL Server (CTP 3) introduces a new data warehouse query acceleration feature based on a new type of index called the columnstore. This new index … improves DW query performance by hundreds to thousands of times in some cases, and can routinely give a tenfold speedup for a broad range of decision support queries… columnstore indexes limit or eliminate the need to rely on pre-built aggregates, including user-defined summary tables, and indexed (materialized) views. Furthermore, columnstore indexes can greatly improve ROLAP performance” (ROLAP can be used for real-time Cubes and real-time Data Visualizations).
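To make the columnar idea a bit more tangible, here is a toy Python sketch. It is my own illustration, not how SQL Server or VertiPaq is implemented, and the table, column names and numbers are all hypothetical. It only shows why an aggregate query is cheap when each column is stored as its own array: the query touches the two columns it needs and nothing else.

```python
# Toy illustration only: hypothetical data, not SQL Server's columnstore format.
rows = [
    {"region": "East", "product": "A", "sales": 10},
    {"region": "West", "product": "B", "sales": 20},
    {"region": "East", "product": "B", "sales": 30},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_by(columns, key, measure):
    """Aggregate one measure by one key, reading only those two columns."""
    totals = {}
    for group, value in zip(columns[key], columns[measure]):
        totals[group] = totals.get(group, 0) + value
    return totals


columns = to_columns(rows)
sales_by_region = sum_by(columns, "region", "sales")
print(sales_by_region)
```

Note that the "product" column is never read by the aggregation, which is the point of a column-oriented layout for decision support queries.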

All this cool new SQL Server 11 stuff is coming soon to the Azure cloud, and this can be scary for any DV vendor unless it knows how to be friendly with Microsoft (Tableau does; Qliktech and Spotfire still ignore SSAS).

As we know now, BISM (Business Intelligence Semantic Model), the term newly coined by Microsoft, was a marketing attempt to have a “unified” umbrella for 2 different Data Models and Data Engines: Multidimensional Cubes (invented by Mosha Pasumansky 15 years ago and the foundation for SSAS, SQL Server Analysis Services, and MDX) and the Tabular Model (used in PowerPivot and in the VertiPaq in-memory columnar database, with the new DAX language, which is going to be very important for future Data Visualization projects).

The new BIDS 2010 released with CTP3 (finally the almighty Visual Studio 2010 will have a “Business Intelligence Development Studio”, after 3+ years of unjustified delays!) will be able, UI-wise, to handle these 2 Data Models, but it gives me a clue as to why Mosha left Microsoft for Google. And the lack of a DV product is a clue for me as to why Donald Farmer (the face of Microsoft BI) left Microsoft for Qliktech.

Even more: if you need both Data Models to be present, you need to install 2 (TWO!) different instances of “Analysis Services”: one with the Multidimensional engine and one with the new Tabular (VertiPaq/PowerPivot) engine. To me this looks not like ONE “BI” architecture but like TWO “BI” architectures, glued together on the surface by the BIDS 2010 interface and on the back end by all kinds of data connectors. Basically, Microsoft is in a confused BI state now because financially it can afford 2 BI architectures and NO Data Visualization product!

I cannot believe I am saying this, but I wish Bill Gates would come back from retirement (it would be good for Microsoft shares and for Microsoft's market capitalization too; just ask Apple’s shareholders about Steve and they will say he is a god)!

Permalink: http://apandre.wordpress.com/2011/07/14/tabular-model/

Excel as a Business Intelligence Platform – Part 1

This is Part 1 of a surprise guest post. My guest is Ugur Kadakal, Ph.D., the CEO and founder of Pagos, Inc., which he started almost 10 years ago.

Dr. Kadakal is an expert in Excel, Business Intelligence, Data Analytics and Data Visualization. His comprehensive knowledge of Excel, along with his ambitious inventions and ideas, supplies the foundation for all Pagos products, which include SpreadsheetWEB (which converts Excel spreadsheets into web applications), SpreadsheetLIVE (a fully-featured, browser-based spreadsheet application environment) and Pagos Spreadsheet Component (which integrates Excel spreadsheets into enterprise web applications).

Pagos started and hosts the largest free repository of professional Excel spreadsheet templates on the web: http://spreadsheetzone.com . The 3 Excel-based dashboards below can be found in this very popular repository and were built by Dr. Kadakal:

Dashboard 1 : Human Resources Dashboard: http://spreadsheetzone.com/templateview.aspx?i=498

Dashboard 2 : Business Activity Dashboard in EuroZone: http://spreadsheetzone.com/templateview.aspx?i=490

Dashboard 3 : Energy Dashboard for Euro Zone: http://spreadsheetzone.com/templateview.aspx?i=491

The topic is large, so this guest article is split into 3 blog posts. The first portion contains the Introduction and Part 1, “Use of Excel as a BI Platform Today”; I expect Dr. Kadakal will then do at least 2 more posts: Part 2 – “Dos and Don’ts of building dashboards in Excel” and Part 3 – “Moving Excel dashboards to the Web”.

Excel as a Business Intelligence Platform – Part 1

Introduction

Electronic spreadsheets were among the very first Business Intelligence (BI) software. While the availability of spreadsheet software and its use as a tool for data analysis dates back to the 1960s, its application in the BI field began with the integration of OLAP and pivot tables. In 1991, Lotus released Improv, followed by Microsoft’s release of PivotTable in 1993. However, Essbase was the first scalable OLAP software to handle the large data sets that early spreadsheet software could not. This is where its name comes from: Extended Spread Sheet Database.

There is no doubt that Microsoft Excel is the most commonly used software for BI purposes. While Excel is general business software, its flexibility and ease of use make it popular for data analysis with millions of users worldwide. Excel has an install base of hundreds of millions of desktops, far more than any other BI platform. It has become a household name. From educational use to domestic applications and enterprise implementations, Excel has proven indispensable. Most people with commercial or corporate backgrounds have developed a proficient Excel skillset. This makes Excel the ultimate self-service BI platform. However, like all systems, Excel has some weaknesses that make it difficult to use as a BI tool under certain conditions.

Use of Excel as a BI Platform Today

Small Businesses

Traditionally, small businesses have not been considered an important market segment by most BI vendors. Their data analysis and reporting needs are limited, primarily because of their smaller commercial volumes. This is changing quickly as smaller organizations begin to collect large amounts of data, thanks to the Internet and social media, and require tools to manage that data. What is not changing is the limited financial resources available to them. Small businesses cannot afford to spend large amounts of money on BI software or on consultants to help them create applications. That’s why Excel is the ideal platform for them and will most probably remain so for the foreseeable future. The reasons are clear: (1) most of them already have Excel licenses, (2) most of their users know how to use Excel and (3) their needs are simpler and can be met with Excel.

Mid-Range Businesses

Mid-range businesses are a quickly growing market segment for BI vendors. Traditionally, Excel as a BI platform has been more popular among these businesses, with cost and availability as the primary factors. However, two developments have been steering them to search for alternatives: (1) Excel can no longer handle their growing data volumes and (2) other BI vendors have started offering cost-effective alternatives.

As a result, Excel’s market share in this field is in decline although it still remains the most popular. On the other hand, with the release of Office 2010 and its extended capabilities for handling very large data sets, Excel stands a good chance at reversing this decline.

Large Enterprises

The situation with large enterprises is rather complex. Most of them already have a large-scale BI implementation in place. Those implementations often connect various databases and data warehouses within the organization. These enterprises have made significant investments and continue to do so to expand and maintain their BI systems. They already have a number of dashboards and reports designed to serve their business units. However, business users always need new and different dashboards and reporting tools. The only software that gives them the ultimate flexibility in creating their own reports is Excel. As a result, even in large enterprises, usage of Excel for BI purposes is common. Business users often go to their data warehouses or BI tools and get a data extract to bring into Excel. They can then prepare their analysis and build their reports in Excel.

Enterprises will continue using their existing platforms because they have made huge investments in building those systems. However, the use of Excel by business users as their secondary BI and reporting tool will continue to rise unless the alternative vendors significantly improve their self-service capabilities.

Summary

Excel is one of the ultimate business platforms and offers unparalleled features and capabilities to non-programmers. This makes it an ideal self-service BI platform. In this article, we examined the use of Excel as a BI platform in companies of different sizes. In the next article of this series, we will discuss how to use Excel more efficiently as a BI platform, from handling data to calculations and visual interactions.

Footprint Comparison for DV Leaders

Comparison of DV Tools is the most popular page (and post) on this site, visited by many thousands of people. Some of them keep asking me to extend this comparison with additional features; one such request is to compare the file and memory footprints of the leading DV tools, as well as their reading and saving times.

I took a mid-sized dataset (428999 rows and 135 columns), exported it to CSV and compressed it to ZIP format, because all native DV formats (QVW by Qlikview, DXP by Spotfire, TWBX by Tableau, and XLSX by Excel and PowerPivot) are compressed one way or another. My starting file size (of the ZIPped dataset) was 56 MB. Here is what I got, see for yourself:

One comment: the numbers above are all relative to the configuration of the hardware used for the tests and also depend on the other software I ran during the tests, because that software also requires RAM, CPU cycles and disk I/O, and even on the speed of repainting application windows on screen, especially for Excel. I will probably add more comments to this post/page, but my first impression from this comparison is that Tableau’s new Data Engine (released in version 6.0 and soon to be updated in 6.1) made Tableau more competitive. Please keep in mind that the comparison of in-memory footprint is much less significant in the test above, because Qlikview, Excel and PowerPivot put the entire dataset into RAM, while Tableau and Spotfire can leave some data (not needed for visualization) on disk, treating it as “virtual memory”. Also, Tableau uses 2 executables (not just one EXE like the others): tableau.exe (or tabreader.exe) and tdserver64.exe.
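For readers who want to reproduce the baseline step (export to CSV, then ZIP, then look at the file size), here is a minimal Python sketch. It uses only the standard library; the tiny generated dataset, the file names and the function name zipped_csv_size are hypothetical stand-ins for the real 428999-row dataset, so the absolute numbers mean nothing.

```python
import csv
import os
import tempfile
import zipfile


def zipped_csv_size(rows, header, workdir):
    """Write rows to a CSV file, ZIP it, and return (csv_bytes, zip_bytes)."""
    csv_path = os.path.join(workdir, "dataset.csv")
    zip_path = os.path.join(workdir, "dataset.zip")
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    # ZIP_DEFLATED is the usual compressed ZIP method.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.write(csv_path, arcname="dataset.csv")
    return os.path.getsize(csv_path), os.path.getsize(zip_path)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in data: 10000 rows, 3 columns.
    header = ["id", "category", "value"]
    rows = [[i, "cat%d" % (i % 5), i * 2] for i in range(10000)]
    raw_bytes, zip_bytes = zipped_csv_size(rows, header, tmp)

print("CSV bytes:", raw_bytes, "ZIP bytes:", zip_bytes)
```

The ZIPped size is the fairer baseline against QVW, DXP, TWBX and XLSX files, since those formats are already compressed.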

Since Tableau is the only leading DV software capable of reading from SSAS Cubes and from PowerPivot (local SSAS) Cubes, I also took a large SSAS Cube and, for testing purposes, selected an SSAS Sub-Cube with 3 Dimensions, 2 Measures and 156439 “rows”. I measured the time and footprint needed for Tableau to read the Sub-Cube, refresh it in memory and save it to a local application file, and also measured its “cubical” footprint in memory and on disk. I then compared all results with the same tests run with Excel 2010 alone and with Excel 2010 plus PowerPivot:

While Tableau’s ability to read and visualize Cubes is cool, performance-wise Tableau is far behind Excel and PowerPivot, especially in reading time and memory footprint. In saving time and file footprint Tableau does nothing for SSAS Cubes, because it does not save the cube in its local TWBX application file (it keeps the data in the SSAS cube, outside of Tableau), so Tableau’s file footprint for SSAS Cubes is not an indicator. For PowerPivot-based local Cubes, however, Tableau does a better job (saving data into the local application file) than both Excel and PowerPivot!

Dimensionality of Visible Data

The human eye has its own Curse of Dimensionality (a term suggested in 1961 by R. Bellman and described independently by G. Hughes in 1968). In most cases the data (before it is visualized) is organized in multidimensional Cubes (n-Cubes) and/or Data Warehouses (or, speaking more cloudily, in a Data Cloud) and needs to be projected into lower-dimensional datasets (small-dimensional Cubes, e.g. 3d-Cubes) before it can be exposed, in the form of charts (preferably an interactive and synchronized set of charts, sometimes called a dashboard), on the 2-dimensional surface of a computer monitor.

Projection of DataCloud to DataCubes and then to Charts
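The projection step described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the 4-dimensional cube, its dimension names and its numbers are hypothetical, and the projection used here is the simplest one (summing the measure over the dimensions that are dropped).

```python
from collections import defaultdict

# Hypothetical 4-dimensional cube: (year, region, product, channel) -> sales.
DIMENSIONS = ("year", "region", "product", "channel")
cube = {
    (2010, "East", "A", "web"): 5,
    (2010, "East", "A", "store"): 7,
    (2010, "West", "B", "web"): 3,
    (2011, "East", "B", "web"): 4,
    (2011, "West", "A", "store"): 6,
}


def project(cells, dimensions, keep):
    """Sum the measure over every dimension that is not listed in keep."""
    positions = [dimensions.index(name) for name in keep]
    projected = defaultdict(int)
    for key, value in cells.items():
        projected[tuple(key[i] for i in positions)] += value
    return dict(projected)


# A 2-dimensional slice that a simple bar or line chart could show.
by_year_region = project(cube, DIMENSIONS, ("year", "region"))
print(by_year_region)
```

Each chart on a dashboard is, in this sense, one such low-dimensional projection of the same underlying cube.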

Over the last 200+ years people kept inventing all types of charts to be printed on paper or shown on screen, and most charts show 2- or 3-dimensional datasets. Prof. Hans Rosling led Gapminder.org to create the web-based, animated 6-dimensional Color Bubble Motion Chart (Trendalyzer), which he used in his famous demos: http://www.gapminder.org/world/ . The 6 dimensions in this specific chart (almost a record for a 2-dimensional chart to carry) are:

  • X coordinate of the Bubble = Income per person,
  • Y coordinate of the Bubble = Life expectancy,
  • Size of the Bubble = Population of the Country,
  • Color of the Bubble = Continent of the Country,
  • Name of the Bubble = Country,
  • Year = animated 6th Dimension/Parameter as time-stamp of the Bubble.

Trendalyzer was bought from Gapminder by Google in 2007 and converted into the Google Motion Chart, but Google somehow is in no rush to enter the Data Visualization (DV) market.

The dimensionality of this Motion Chart can be pushed even further, to 7 dimensions (a dimension here being an expression of measurement without units), if we use different shapes (in addition to filled circles we can use triangles, squares etc.), but that would literally be pushing the limit of what the human eye can handle. If you add to this the tendency of DV designers to squeeze more than one chart onto a screen (how about overcrowded dashboards with multiple synchronized interactive charts?), we are literally approaching the limits of both the human eye and the human brain, regardless of the dimensionality of the Data Warehouse in the back end.

Below I have approximately assessed the dimensionality of the datasets behind some popular charts (please feel free to send me corrections). For each dataset and its respective chart I estimated the number of measures (usually a real or integer number, which can be calculated from other dimensions of the dataset), the number of attributes (in many cases categories, enumerations or values with a string datatype) and 0 or 1 parameter (representing a well-ordered set such as time (for time series), date, year, sequence (which can be used for data slicing), or a natural, integer or real number). Dimensionality (the number of dimensions) is the total number of measures, attributes and parameters in a given dataset.

Chart Measures Attributes Parameter Dimensionality
Gauge, Bullet, KPI 0 0
Monochromatic Pie 1 1
Colorful Pie 1 1 2
Bar/Column 1 1 2
Sparkline 1 1 2
Line 1 1 2
Area 1 1 2
Radar 1 1 2
Stacked Line 1 1 1 3
Multiline 1 1 1 3
Stacked Area 1 1 1 3
Overlapped Radar 1 1 1 3
Stacked Bar/Column 1 1 1 3
Heatmap 1 2 3
Combo 1 2 3
Mekko 2 1 3
Scatter (2-d set) 2 1 3
Bubble (3-d set) 3 1 4
Shaped Motion Bubble 3 1 1 5
Color Shaped Bubble 3 2 5
Color Motion Bubble 3 2 1 6
Motion Chart 3 3 1 7
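The rule behind the last column (dimensionality = measures + attributes + parameters) can be checked with a short Python sketch. The class name ChartProfile is hypothetical, and I only include the rows of the table above where all three counts are listed explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChartProfile:
    """Counts of measures, attributes and parameters carried by one chart type."""
    name: str
    measures: int
    attributes: int
    parameters: int  # 0 or 1, per the definition above

    @property
    def dimensionality(self):
        # Dimensionality is the total number of measures, attributes and parameters.
        return self.measures + self.attributes + self.parameters


charts = [
    ChartProfile("Stacked Line", 1, 1, 1),
    ChartProfile("Color Motion Bubble", 3, 2, 1),
    ChartProfile("Motion Chart", 3, 3, 1),
]
dims = {chart.name: chart.dimensionality for chart in charts}
print(dims)
```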


The diversity of charts and their dimensionality adds another complexity for the DV designer: which chart(s) to choose. You can find some good suggestions about that on the web. Dr. Andrew Abela created the Chart Chooser Diagram (“Choosing a good chart” by Dr. Abela), and it was even converted into an online “application”!

Permalink: http://apandre.wordpress.com/2011/03/02/dimensionality/

"Quadrant" for Data Visualization Platforms

For many years, Gartner has kept annoying me every January by publishing its so-called “Magic Quadrant for Business Intelligence Platforms” (MQ4BI for short), and most of the vendors mentioned in it (this is funny: even Donald Farmer quotes MQ4BI) almost immediately republish it, either in the so-called reprint area of the Gartner website (e.g. here, for a few months) or on their own websites; some of them also make this “report” available to web visitors for free, in exchange for contact info. To channel my feelings toward Gartner into something constructive, I decided to produce my own “Quadrant” for Data Visualization Platforms (DV “Quadrant” or Q4DV for short). It is below; it is a work in progress and will be modified and republished over time:

The 3 DV Leaders (green dots in the upper right corner of Q4DV above) are compared with each other and with the Microsoft BI stack on this blog, and were also voted on in the DV Poll on LinkedIn. The MQ4BI report actually contains a lot of useful info, and it deserves to be used as one of the possible data sources for my new post, which has a more specific target: Data Visualization Platforms. As I said above, I will call it a Quadrant too: Q4DV. But before I do that, I have to comment on Gartner’s annual MQ4BI.

The MQ4BI customer survey included vendor-provided references, as well as survey responses from BI users on Gartner’s BI summit and inquiry lists. There were 1,225 survey responses (funnily enough, almost the same number of responses as on my DV Poll on LinkedIn), with 247 (20%) from non-vendor-supplied reference lists. Gartner promised to publish the Magic Quadrant Customer Survey’s results in 1Q11. Gartner has somewhat reasonable “Inclusion and Exclusion Criteria” (for the Data Visualization Q4DV I excluded some vendors from Gartner’s list and included a few too) and an almost tolerable but fuzzy BI Market Definition (based on 13 loosely pre-defined capabilities organized into 3 categories of functionality: integration, information delivery and analysis).

I also partially agree with the definition and usage of “Ability to Execute” as one (the Y axis) of the 2 dimensions of the bubble chart above (which has the same name as the entire report, “Magic Quadrant for Business Intelligence Platforms”). However, I disagree with Gartner’s ordering of vendors by their ability to execute, and for DV purposes I had to completely change the order of DV vendors on the X axis (“Completeness of Vision”).

For Q4DV purposes I am reusing Gartner’s MQ as a template. I excluded almost all vendors classified by Gartner as niche players with lower ability to execute (the bottom-left quarter of MQ4BI), except Panorama Software (Gartner put Panorama in last place, which is unfair), and I will add the following vendors: Panopticon, Visokio, Pagos and maybe some others after further testing.

I am going to update this DV “Quadrant” using the method suggested by Jon Peltier: http://peltiertech.com/WordPress/excel-chart-with-colored-quadrant-background/ . Thank you, Jon! I hope I will have time for it before the end of 2011…

Permalink: http://apandre.wordpress.com/2011/02/13/q4dv/

Google keeps own Data Visualizations options open

Recently I had a few reasons to review the Data Visualization technologies in Google’s portfolio. In short: Google (if it decides to do so) has all the components needed to create a good visualization tool, but the same can be said about Microsoft, and Microsoft decided to postpone building a DV tool in favor of other business goals.

I remember that a few years ago Google bought Trendalyzer from Gapminder (Hans Rosling did some very impressive demos with it a while ago) and converted it into a Motion Chart “technology” of its own. (For the Motion Chart demo I did below, please choose a few countries, e.g. check the checkboxes for US and France, and then click the “Right Arrow” button in the bottom left corner of the Motion Chart.)

Motion Chart (see also here a sample I did myself, using Google’s Motion Chart) allows 5-6 dimensions to be crammed into a 2-dimensional chart: shape, color and size of bubbles, the X and Y axes as usual (above they are Life Expectancy and Income per Person) and an animated time series (see the light blue 1985 in the background above; all bubbles move as “time” goes by). Google uses this and other visualization technologies of its own in its very useful Public Data Explorer.

Google Fusion Tables is a free service for sharing and visualizing data online. It allows you to upload and share data, merge data from multiple tables into interesting derived tables, and see the most up-to-date data from all sources. It has Tutorials, a User’s Group, a Developer’s Guide and sample code, as well as examples. You can check a video here:

The Google Fusion Tables API enables programmatic access to Google Fusion Tables content. It is an extension of Google’s existing structured data capabilities for developers. A developer can populate a table in Google Fusion Tables with data, from a single row to hundreds at a time. The data can come from a variety of sources, such as a local database, a .CSV file, a data collection form or a mobile device. The Google Fusion Tables API is built on top of a subset of the SQL querying language. By referencing data values in SQL-like query expressions, a developer can find the needed data and then download it for use by an application. The application can do any desired processing on the data, such as computing aggregates or feeding it into a visualization gadget. Data can also be synchronized: when you add or change data in the tables of your offline repository, you can make the most up-to-date version available to the world by synchronizing those changes up to Google Fusion Tables.

Everybody knows about Google Web Analytics, which presents your web traffic, visitors, visits, pageviews, and the length and depth of visits through very simple charts and a dashboard; see the sample below:

Fewer people know that Panorama Software has an OEM partnership with Google, enabling Google Spreadsheets with SaaS Data Visualizations and Pivot Tables.

Google has the Visualization API (with interactive charts, including all the standard charts, GeoMap, Intensity Map, Map, DyGraph, Sparkline, WordCloud and others), which enables developers to expose their own data, stored in any data store connected to the web, as a Visualization-compliant datasource. The Google Visualization API also provides a platform that can be used to create, share and reuse visualizations written by the developer community at large. Google provides samples, a Chart/API Gallery (Javascript-based visualizations) and a Gadget Gallery.
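To give a feel for what "exposing own data as a datasource" involves, here is a hedged Python sketch that shapes a small table into the JSON layout a Visualization DataTable is built from, as I understand it ("cols" with id/label/type, "rows" with cells under "c" and values under "v"). Treat the layout as my assumption and check it against Google's documentation before relying on it; the helper name to_datatable is hypothetical, and the sample numbers are the vote counts from my LinkedIn poll.

```python
import json


def to_datatable(columns, rows):
    """Shape tabular data into a DataTable-style dict (layout is an assumption).

    columns: list of (id, label, type) tuples; rows: list of value tuples.
    """
    return {
        "cols": [
            {"id": col_id, "label": label, "type": col_type}
            for col_id, label, col_type in columns
        ],
        "rows": [{"c": [{"v": value} for value in row]} for row in rows],
    }


table = to_datatable(
    [("tool", "Tool", "string"), ("votes", "Votes", "number")],
    [("Spotfire", 450), ("Tableau", 308), ("Qlikview", 305)],
)
payload = json.dumps(table)
print(payload)
```

A real datasource would also have to wrap this payload in the response envelope the API expects and serve it over HTTP; that part is deliberately left out of this sketch.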

And last but not least, Google has the excellent back-end technologies needed for big Data Visualization applications, such as BigTable and MapReduce. (BigTable is a compressed, high-performance, proprietary database system built on Google File System (GFS), Chubby Lock Service and a few other Google programs; it is currently not distributed or used outside of Google, although Google offers access to it as part of Google App Engine.) Add Google Maps and Google Earth to this list, and then ask yourself: what is stopping Google from producing a competitor to the Holy Trinity of DV (Qlikview + Spotfire + Tableau)?

Permalink: http://apandre.wordpress.com/2011/02/08/dvgoogle/

Poll about Data Visualization tools

On New Year’s Eve I started on LinkedIn the poll “What tool is better for Data Visualization?” and 1340 people voted (an unusually high return for LinkedIn polls, most of which get fewer than 1000 votes), on average one vote per hour during 8 weeks. I consider this statistically significant as a reflection of the fact that the Data Visualization market has 3 clear leaders (probably at least a generation ahead of all other competitors): Spotfire, Tableau and Qlikview. Spotfire is the top vote getter. As of 2/27/11, 1pm EST, Spotfire got 450 votes (34%), Tableau 308 (23%), Qlikview 305 (23%; Qlikview's result improved during the last 3 weeks of the poll), PowerPivot 146 (11%, more votes than all “Other” DV tools combined) and all other DV tools got just 131 votes (10%). The poll got 88 comments (more than 6% of voters commented!) and will be open for more unique voters until 2/27/11, 7pm. Its results have been consistent during the last 5 weeks, so I believe it represents the user preferences of the LinkedIn population:
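The percentages and rates quoted for this poll can be rechecked with a few lines of Python. The vote counts, the 8-week duration and the 88 comments are taken from the text above; the variable names are mine.

```python
# Vote counts as quoted for the poll (as of 2/27/11, 1pm EST).
votes = {
    "Spotfire": 450,
    "Tableau": 308,
    "Qlikview": 305,
    "PowerPivot": 146,
    "Other": 131,
}

total = sum(votes.values())
# Share of each tool, rounded to a whole percent.
shares = {tool: round(100 * count / total) for tool, count in votes.items()}

hours = 8 * 7 * 24             # 8 weeks expressed in hours
votes_per_hour = total / hours
comment_rate = 100 * 88 / total  # percent of voters who left a comment

print(total, shares)
print(round(votes_per_hour, 2), round(comment_rate, 1))
```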

The URL is http://linkd.in/f5SRw9 , but you need to log in to LinkedIn.com to vote. Also see some demographic info about the poll voters below (in a somewhat ugly visualization by … LinkedIn):

It is interesting that Tableau voters are younger than voters for other DV tools and that more than 82% of voters in the poll are men. A summary of some comments:

  • - poll’s question is too generic – because an answer partially depends on what you are trying to visualize;
  • - poll is limited by LinkedIn restrictions, which allow no more than 5 possible/optional answers to a poll’s question;
  • - poll’s results may correlate with the number of Qlikview/Tableau/Spotfire groups (and the size of their membership) on LinkedIn and also with the ability of employees of the respective vendors to vote in favor of the tool produced by their company (I don’t see that this happened). LinkedIn has 85 groups related to Qlikview (with almost 5000 members), 34 groups related to Tableau (with 2000+ members total) and 7 groups related to Spotfire (with about 400 members total).
  • Randall Hand posted interesting comments about my poll here:    http://www.vizworld.com/2011/01/tool-data-visualization/#more-19190 . I disagreed with some of Randall’s assessments: that “Gartner is probably right” (in my opinion Gartner is usually wrong when it is talking about BI; I posted on this blog about it and Randall agreed with me) and that “IBM & Microsoft rule … markets”. In fact IBM is very far behind Qlikview, Spotfire and Tableau, and Microsoft, while it has excellent technologies (like PowerPivot and SSAS), is behind too, because Microsoft made a strategic mistake and does not have a visualization product, only technologies for it.
  • Spotfire fans on Facebook got some “advice” from here: http://www.facebook.com/TIBCOSpotfire (the post said “TIBCO Spotfire LinkedIn users: Spotfire needs your votes! Weigh in on this poll and make us the Data Visualization tool of choice…”); there is nothing I can do to prevent people from doing that, sorry. I think that the poll is statistically significant anyway and voters from Facebook may have added just a couple of dozen votes for … their favorite tool.
  • Other Data Visualization tools mentioned in the 88 comments so far were JMP, R, Panopticon, Omniscope (from Visokio), BO/SAP Explorer and Xcelsius, IBM Cognos, SpreadsheetWEB, IBM’s Elixir Enterprise Edition, iCharts, UC4 Insight, Birst, Digdash, Constellation Roamer, BIme, Bissantz DeltaMaster, RA.Pid, Corda Technologies, Advizor, LogiXml, TeleView etc.
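
The percentages and the "one vote per hour" figure quoted for this poll can be rechecked with a few lines of arithmetic. The sketch below is only a sanity check: the vote counts are the ones quoted in the post, while the variable names and the assumption of exactly 8 weeks of polling are illustrative.

```python
# Vote counts as quoted in the post (snapshot of 2/27/11, 1pm EST).
votes = {
    "Spotfire": 450,
    "Tableau": 308,
    "Qlikview": 305,
    "PowerPivot": 146,
    "Other": 131,
}

total = sum(votes.values())  # should match the 1340 votes reported

# Share of each tool in percent, rounded to one decimal place.
shares = {tool: round(100 * n / total, 1) for tool, n in votes.items()}

# Assumption: the poll ran for exactly 8 weeks, expressed here in hours.
hours = 8 * 7 * 24
votes_per_hour = total / hours

for tool, pct in shares.items():
    print(f"{tool:<11}{votes[tool]:>5} votes  {pct:>5.1f}%")
print(f"total {total}, about {votes_per_hour:.2f} votes per hour")
```

The rounded shares agree with the whole-number percentages in the post, and the five counts add up to the reported total.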

Permalink: http://apandre.wordpress.com/2011/01/26/poll/

Big Data Analytics: Signal-to-Noise ratio even lower than in BI?

“Big Data Analytics” (BDA) is going to be a new buzzword for 2011. The same companies and some new ones (and in some cases even the same people) who tried for 20+ years to use the term BI in order to sell their underused software are now trying to use the new term BDA in the hope of increasing their sales and relevancy. Suddenly one of the main reasons why BI tools are underused is the rapidly growing size of data.

Now a new generation of existing tools (Teradata, Exadata, Netezza, Greenplum, PDW etc.) and of course “new” tools (can you say VoltDB, Aster Data (Teradata now!), Hadoop, MapReduce, Cassandra, R, HANA, Paradigm4, MPP appliances etc., which are all cool and hot at the same time) and companies will enable users to collect, store, access and manipulate much larger datasets (petabytes).

For users, the level of noise will now be much higher than before (and the SNR, the Signal-to-Noise ratio, will be lower), because BDA is solving a HUGE backend problem (massive amounts of data are everywhere, from genome to RFID to application and network logfiles to health data etc.), while users interact with the front-end and care about trends, outliers, clusters, patterns, drilldowns and other visually intensive data phenomena. However, SNR can be increased if BDA technologies are used together with, and as supporting tools for, the signal-producing tools, which are … Data Visualization tools.

An example of that is the recent partnership between Tableau Software and Aster Data (Teradata bought Aster Data in March 2011!). I know for sure that EMC is trying to partner Greenplum with the most viable Data Visualizers, Microsoft will integrate its PDW with PowerPivot and Excel, and I can imagine how to integrate Spotfire with BDA. Integration of Qlikview with BDA can be more difficult, since Qlikview currently can manipulate only data in its own memory. In any case, I see DV tools as the main attraction and selling point for end-users and I hope BDA vendors can/will understand this simple truth and behave accordingly.

Permalink: http://apandre.wordpress.com/2011/01/16/bigdata/

Donald Farmer moved from Microsoft to Qliktech

I have never before seen one man move from one company to another and 46+ people almost immediately comment on it. But this is what happened during the last few days, when Donald Farmer, the Principal Program Manager for the Microsoft BI Platform for 10 years, left Microsoft for Qliktech. Less than one year ago, Donald compared Qlikview and PowerPivot, and while he was respectful to Qlikview, his comparison favored PowerPivot and the Microsoft BI stack. I can think of (or guess at) multiple reasons why he did it (and I quote him: “I look forward to telling you more about this role and what promises to be a thrilling new direction for me with the most exciting company I have seen in years”), for example:

  • Microsoft does not have a DV Product (and one can guess that Donald wants to be the “face” of the product),
  • Qliktech had a successful IPO and secondary offering (money talks, especially when 700-strong company has $2B market capitalization and growing),
  • lack of confidence in Microsoft BI Vision (one can guess that Donald has a different “vision”),
  • SharePoint is a virus (SharePoint created a billion dollar industry, which one can consider wasted),
  • Qlikview makes a DV Developer much more productive (a cool 30 to 50 times more productive) than Microsoft’s toolset (Microsoft did not even migrate BIDS 2008 to Visual Studio 2010!),
  • and many others (Donald said that for him it is mostly user empowerment and user inspiration by Qlikview; it sounds like he was underinspired by the Microsoft BI stack, so is it just a move from Microsoft rather than a move to Qliktech? I guess I need a better explanation),

but Donald did explain it in his next blog post: “QlikView stands out for me, because it not only enables and empowers users; QlikView users are also inspired. This is, in a way, beyond our control. BI vendors and analysts cannot prescribe inspiration”. I have to be honest, and I repeat it again: I wish for a better explanation… For example, a friend of mine made a “ridiculous guess” that Microsoft sent Donald inside Qliktech to figure out whether it makes sense to buy Qliktech, when to buy it, and who should stay and who should go (I think it is too late for that, but at least it is an interesting thought: a good/evil buyer/VC/investor will do “due diligence” first, preferably internal, and “technical due diligence” too).

I actually know other people who recently moved to Qliktech (e.g. from Spotfire), but I have a question for Donald about his new title: “QlikView Product Advocate”. According to http://dictionary.reference.com/ an advocate is a person who defends, supports and promotes a cause. I will argue that Qlikview does not need any of that (no need to defend it for sure, and Qlikview has plenty of supporters and promoters); instead Qlikview needs a strong strategist and visionary (and Donald is the best at it) who can lead and convince Qliktech to add new functionality in order to stay ahead of the competition, with at least Tableau, Spotfire and Microsoft included. One of many examples would be the ability to read … Microsoft’s SSAS multidimensional cubes, like Tableau 6.0 and Omniscope 2.6 have now.

Almost unrelated – I updated this page:  http://apandre.wordpress.com/market/competitors/qliktech/

Permalink: http://apandre.wordpress.com/2011/01/09/farmer_goes_2_qlikview/

Happy New 2011 Year!

Happy holidays to visitors of this blog and my best wishes for 2011! December 2010 was so busy for me that I did not have time to blog about anything. I will just mention some news in this last post of 2010.

Tableau sales will exceed $40M in 2010 (and they are planning to employ 300+ by the end of 2011!), which is almost 20% of Qliktech sales in 2010. My guesstimate (if anybody has better data, please comment on it) is that Spotfire’s sales in 2010 are about $80M. Qliktech’s market capitalization recently exceeded $2B, more than twice Microstrategy’s market cap ($930M as of today)!

I recently noticed that Gartner is trying to coin a new catch phrase because the old one does not work (referring to BI, which never worked because intelligence is an attribute of humans and not an attribute of businesses). Now they are saying that for the last 20+ years when they talked about business intelligence (BI) they meant an intelligent business. I think this is confusing because (at least in the USA) business is all about profit, and Chief Business Intelligent Dr. Karl Marx would agree with that. I respect the phrase “Profitable Business”, but “Intelligent Business” reminds me of the old phrase “Crocodile tears”. Gartner is also saying that BI projects should be treated as a “cultural transformation”, which reminds me of a road paved with good intentions.

I also noticed the huge attention paid by Forrester to Advanced Data Visualization, probably for 4 good reasons (I have different reasoning, but I am not part of Forrester):

  • - data visualization can fit many more data points (tens of thousands) into one screen or page compared with numerical information and a datagrid (hundreds of datapoints per screen);
  • - ability to visually drill down and zoom through interactive and synchronized charts;
  • - ability to convey a story behind the data to a wider audience through data visualization;
  • - analysts and decision makers cannot see patterns (and in many cases also trends and outliers) in data without data visualization. A 37+ year-old example of this is Anscombe’s quartet, which comprises four datasets that have nearly identical simple statistical properties, yet appear very different when visualized. They were constructed by F.J. Anscombe to demonstrate the importance of Data Visualization (DV):
Anscombe’s quartet
     I              II             III            IV
   x      y       x      y       x      y       x      y
10.0   8.04    10.0   9.14    10.0   7.46     8.0   6.58
 8.0   6.95     8.0   8.14     8.0   6.77     8.0   5.76
13.0   7.58    13.0   8.74    13.0  12.74     8.0   7.71
 9.0   8.81     9.0   8.77     9.0   7.11     8.0   8.84
11.0   8.33    11.0   9.26    11.0   7.81     8.0   8.47
14.0   9.96    14.0   8.10    14.0   8.84     8.0   7.04
 6.0   7.24     6.0   6.13     6.0   6.08     8.0   5.25
 4.0   4.26     4.0   3.10     4.0   5.39    19.0  12.50
12.0  10.84    12.0   9.13    12.0   8.15     8.0   5.56
 7.0   4.82     7.0   7.26     7.0   6.42     8.0   7.91
 5.0   5.68     5.0   4.74     5.0   5.73     8.0   6.89
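
To see how close the "simple statistical properties" of the four datasets are, the numbers from the table can be fed through a short script. This is a minimal sketch in plain Python; the helper name `summarize` and the dictionary layout are illustrative choices, not something from Anscombe’s paper.

```python
# Data copied from the table above; datasets I-III share the same x values.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return means, sample variances, correlation and least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "corr": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, s in summaries.items():
    print(f"{name:>3}: mean_x={s['mean_x']:.2f} mean_y={s['mean_y']:.2f} "
          f"var_x={s['var_x']:.2f} var_y={s['var_y']:.2f} "
          f"corr={s['corr']:.3f} fit: y = {s['intercept']:.2f} + {s['slope']:.3f}x")
```

All four rows of the printout come out practically the same, which is exactly why the summary numbers alone cannot tell the datasets apart and a chart can.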

In the 2nd half of 2010 all 3 DV leaders released new versions of their beautiful software: Qlikview, Spotfire and Tableau. Visokio’s Omniscope 2.6 will be available soon and I have been waiting for it since June 2010… In 2010 Microsoft, IBM, SAP, SAS, Oracle, Microstrategy etc. were all trying hard to catch up with the DV leaders and I wish all of them the best of luck in 2011. Here is a list of some other things I still remember from 2010:

  • Microsoft officially declared that it prefers BISM over OLAP and will invest in their future accordingly. I am very disappointed with Microsoft, because it did not include BIDS (Business Intelligence Development Studio) in Visual Studio 2010. Even with the release of the supercool and free PowerPivot, it is now likely that Microsoft will not be a leader in DV (Data Visualization), given that it discontinued ProClarity and PerformancePoint and considering the ugliness of SharePoint. Project Crescent (a new visualization “experience” from Microsoft) was announced 6 weeks ago, but there are still not many details about it, except that it is mostly done with Silverlight 5 and a Community Technology Preview will be available in the 1st half of 2011.
  • SAP bought Sybase and released the new version 4.0 of Business Objects and the HANA “analytic appliance”.
  • IBM bought Netezza and released Cognos 10.
  • Oracle released OBIEE 11g with ROLAP and MOLAP unified.
  • Microstrategy released its version 9 Release 3 with much faster performance, integration with ESRI and support for web-serviced data.
  • EMC bought Greenplum and started the new DCD (Data Computing Division), which is an obvious attempt to join the BI and DV market.
  • Panorama released NovaView for PowerPivot, which connects natively to PowerPivot in-memory models.
  • Actuate’s BIRT was downloaded 10 million times (!) and has over a million (!) BIRT developers.
  • Panopticon 5.7 was released recently (on 11/22/10) and adds the ability to display real-time streaming data.

David Raab, one of my favorite DV and BI gurus, published on his blog an interesting comparison of some leading DV tools. According to David’s scenario, one possible ranking of DV tools is: Tableau 1st, then Advizor (version 5.6 available since June 2010), Spotfire and Qlikview (it seems to me David implied that order). In my recent DV comparison “my scenario” gave a different ranking: Qlikview is slightly ahead, Spotfire and Tableau share 2nd place (but are very competitive with Qlikview) and Microsoft is in a distant 4th place, but it is possible that David knows something that I don’t…

In addition to David, I want to thank  Boris Evelson, Mark Smith, Prof. Shneiderman, Prof. Rosling, Curt Monash, Stephen Few and others for their publications, articles, blogs and demos dedicated to Data Visualization in 2010 and before.

Permalink: http://apandre.wordpress.com/2010/12/25/hny2011/

Columnstore index in SQL Server 11.0 will accelerate DW queries by 100X

Microsoft reused its patented VertiPaq column-oriented DB technology in the upcoming SQL Server 11.0 release by introducing columnstore indexes, where each column is stored in a separate set of disk pages. Below is a “compressed” extraction from a Microsoft publication, which I think is very relevant to the future of Data Visualization technologies. Traditionally an RDBMS uses a “row store”, where a heap or a B-tree contains multiple rows per page. In a columnstore index, the columns are stored in different groups of pages. Benefits of this are:

  • only the columns needed to solve a query are fetched from disk (this is often fewer than 15% of the columns in a typical fact table),
  • it’s easier to compress the data due to the redundancy of data within a column, and
  • buffer hit rates are improved because data is highly compressed, and frequently accessed parts of commonly used columns remain in memory, while infrequently used parts are paged out.
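
The first two benefits can be illustrated with a toy model. The sketch below is not how VertiPaq or SQL Server stores data internally; it only assumes a made-up fact table, keeps each column as its own Python list, and uses run-length encoding as one simple example of a compression scheme that profits from redundancy within a column. All table, column and function names are hypothetical.

```python
from itertools import groupby

# A tiny, made-up fact table; every name and value here is illustrative only.
column_names = ["order_id", "region", "product", "quantity", "amount"]
rows = [
    (1, "East", "A", 2, 20.0),
    (2, "East", "B", 1, 15.0),
    (3, "East", "A", 5, 50.0),
    (4, "West", "A", 1, 10.0),
    (5, "West", "C", 3, 45.0),
    (6, "West", "C", 2, 30.0),
]

# Row store: whole rows sit together, so a scan touches every column.
# Column store: each column is its own array and can be read on its own.
columns = {name: [row[i] for row in rows] for i, name in enumerate(column_names)}


def total_by_region(cols):
    """SUM(amount) GROUP BY region, reading only the two columns it needs."""
    totals = {}
    for region, amount in zip(cols["region"], cols["amount"]):
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


totals = total_by_region(columns)
encoded_region = run_length_encode(columns["region"])
columns_read = 2
fraction_read = columns_read / len(column_names)

print(totals)
print(encoded_region)
print(f"columns read: {columns_read} of {len(column_names)}")
```

In this toy case the aggregate needs 2 of 5 columns, and the six `region` values shrink to two (value, count) pairs; real fact tables are much wider and longer, so both effects grow with size.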

“The columnstore index in SQL Server employs Microsoft’s patented Vertipaq™ technology, which it shares with SQL Server Analysis Services and PowerPivot. SQL Server columnstore indexes don’t have to fit in main memory, but they can effectively use as much memory as is available on the server. Portions of columns are moved in and out of memory on demand.” SQL Server is the first major database product to support a pure Columnstore index. Columnstore indexes are recommended for fact tables in a data warehouse (DW), for large dimensions (say with more than 10 million records) and for any large tables designated to be used as read-only.

“In memory-constrained environments when the columnstore working set fits in RAM but the row store working set doesn’t fit, it is easy to demonstrate thousand-fold speedups. When both the column store and the row store fit in RAM, the differences are smaller but are usually in the 6X to 100X range for star join queries with grouping and aggregation.” Your results will of course depend on your data, workload, and hardware. Columnstore index query processing is most heavily optimized for star join queries. OLTP-style queries, including point lookups, and fetches of every column of a wide row, will usually not perform as well with a columnstore index as with a B-tree index.

Columnstore indexes compress data by a factor of 4 to 15, depending on the fact table. The columnstore index is a secondary index; the row store is still present, though during query processing it is often not needed, and ends up being paged out. A clustered columnstore index, which will be the master copy of the data, is planned for the future. This will give significant space savings.

Tables with columnstore indexes can’t be updated directly using INSERT, UPDATE, DELETE, and MERGE statements, or bulk load operations. To move data into a columnstore table you can switch in a partition, or disable the columnstore index, update the table, and rebuild the index. Columnstore indexes on partitioned tables must be partition-aligned. Most data warehouse customers have a daily, weekly or monthly load cycle, and treat the data warehouse as read-only during the day, so they’ll almost certainly be able to use columnstore indexes. You can also create a view that uses UNION ALL to combine a table with a columnstore index and an updatable table without a columnstore index into one logical table. This view can then be referenced by queries. This allows dynamic insertion of new data into a single logical fact table while still retaining much of the performance benefit of columnstore capability.
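
The UNION ALL view pattern described above can be tried out on any SQL engine. The sketch below uses SQLite through Python’s built-in `sqlite3` module purely as a stand-in: SQLite has no columnstore indexes, so `sales_history` only plays the role of the large read-only table and `sales_delta` the role of the small updatable one. All table, view and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for the big, read-only table that would carry the columnstore index.
    CREATE TABLE sales_history (sale_day TEXT, region TEXT, amount REAL);
    -- Small updatable table that receives new rows during the day.
    CREATE TABLE sales_delta (sale_day TEXT, region TEXT, amount REAL);
    -- One logical fact table for queries.
    CREATE VIEW sales_all AS
        SELECT sale_day, region, amount FROM sales_history
        UNION ALL
        SELECT sale_day, region, amount FROM sales_delta;
""")

# Historical rows, as if loaded by the nightly cycle.
conn.executemany(
    "INSERT INTO sales_history VALUES (?, ?, ?)",
    [("2010-12-01", "East", 100.0), ("2010-12-01", "West", 80.0)],
)

# A new row arrives during the day and goes only into the updatable table.
conn.execute(
    "INSERT INTO sales_delta VALUES (?, ?, ?)", ("2010-12-02", "East", 25.0)
)

# Queries reference the view and see both parts as one table.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales_all GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

In a real warehouse the delta table would periodically be emptied into the main table during the load cycle (for example via a partition switch), which this sketch does not attempt to show.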

Most important for DV systems is this statement: “Users who were using OLAP systems only to get fast query performance, but who prefer to use the T-SQL language to write queries, may find they can have one less moving part in their environment, reducing cost and complexity. Users who like the sophisticated reporting tools, dimensional modeling capability, forecasting facilities, and decision-support specific query languages that OLAP tools offer can continue to benefit from them. Moreover, they may now be able to use ROLAP against a columnstore-indexed SQL Server data warehouse, and meet or exceed the performance they were used to in the past with OLAP, but save time by eliminating the cube building process”. This sounds like Microsoft finally figured out how to compete with Qlikview (technology-wise only, because Microsoft still does not have, maybe intentionally(?), a DV product).

Permalink: http://apandre.wordpress.com/2010/12/03/columnstore-index/

Microsoft BI: Roadmap to where?

Microsoft used to be the greatest marketing machine in the software industry. But after losing the search business to Google and the smartphone business to Apple and Google, they lost their winning skills. It is clear now that this is also true in the so-called BI market (Business Intelligence is just a marketing term). Microsoft bought ProClarity and it disappeared; they released PerformancePoint Server and it is disappearing too. They have (or had?) the best BI stack (SQL Server 2008 R2 and its Analysis Services, Business Intelligence Development Studio 2008 (BIDS), Excel 2010, PowerPivot etc.) and they failed to release any BI or Data Visualization product, despite having all the technological pieces and components. Microsoft even released Visual Studio 2010 without any support for BIDS, and recently they talked about their roadmap for BI and again delayed any mention of BIDS 2010 and declared NO plans for BI or DV products! Instead they are talking about a “new ad hoc reporting and data visualization experience codenamed “Project Crescent””!

And then they have the BISM model as a part of the roadmap: “A new Business Intelligence Semantic Model (BISM) in Analysis Services that will power Crescent as well as other Microsoft BI front end experiences such as Excel, Reporting Services and SharePoint Insights”.

Experience and Model instead of Product? What Microsoft did with PowerPivot is clear: they gave some users a reason to upgrade to Office 2010, and as a result Microsoft preserved and protected (for another 2 years?) their lucrative Office business but diminished their chances of getting a significant piece of the $11B (and growing 10% per year) BI market. The new BISM (Business Intelligence Semantic Model) is a clear sign of losing the technological edge:

I have to quote (because they finally admitted that BIDS will be replaced by BISM when “Project Juneau” becomes available): “The BI Semantic Model can be authored by BI professionals in the Visual Studio 2010 environment using a new project type that will be available as part of “Project Juneau”. Juneau is an integrated development environment for all of SQL Server and subsumes the Business Intelligence Development Studio (BIDS). When a business user creates a PowerPivot application, the model that is embedded inside the workbook is also a BI Semantic Model. When the workbook is published to SharePoint, the model is hosted inside an SSAS server and served up to other applications and services such as Excel Services, Reporting Services, etc. Since it is the same BI Semantic Model that is powering PowerPivot for Excel, PowerPivot for SharePoint and Analysis Services, it enables seamless transition of BI applications from Personal BI to Team BI to Organizational (or Professional) BI.”

The funniest part of the quote above is that Microsoft honestly believes that SharePoint is not a virus but a viable product and that it will escape the fate of its “step-brother”, PerformancePoint Server. Sweet dreams! It is clear that Microsoft failed to understand that Data Visualization is the future of the BI market, and they keep recycling for themselves the obvious lie “Analysis Services is the industry leading BI platform in this space today”! Indirectly they acknowledged it in the very next statement: “With the introduction of the BI Semantic Model, there are two flavors of Analysis Services – one that runs the UDM (OLAP) model and one that runs the BISM model”. Hello?

Why do we need 2 BI models instead of 1 BI product? BIDS 2008 itself is already a buggy and much less productive development environment than Qlikview, Spotfire and Tableau, but now Microsoft wants us to be confused by 2 co-existing approaches: OLAP and BISM? And now get this: “you should expect to see more investment put into the BISM and less in the UDM(OLAP)”!

Dirty Harry would say in such a situation: “Go ahead, make my day!” And I guess that Microsoft does not care that Apple’s market cap is now larger than Microsoft’s.

Afterthought (looking at this from 2011 point of view): I am thinking now that I know why Donald Farmer left Microsoft 2 months after BISM announcement above.

p010: http://wp.me/pCJUg-7r

Tableau 6 reads local PowerPivot, does Motion Chart

It looks like the honeymoon for Qlikview after Qliktech’s IPO is over. In addition to Spotfire 3.2/Silver, we now have a 3rd great piece of software in the form of Tableau 6. Tableau 6.0 was released today (both 32-bit and 64-bit) with a new in-memory data engine (very fast, say 67 million rows in 2 seconds) and quick data blending from multiple data sources while normalizing across them. The Data Visualization software is available as a Server (with web browsers as free clients) and as a Desktop (Pro for $1999, Personal for $999, Reader for free).

New Data Sources include local PowerPivot files(!) and Aster Data; new Data Connections include OData and the (recently released) Windows Azure Marketplace DataMarket; a Data Connection can be Direct/Live or to the in-memory data engine. Tableau 6 does full or partial automatic data updates; supports parameters for calculations, what-if modeling, and selectable display fields on a chart’s axes; offers combo charts of any pair of charts; has new project views; supports Motion Charts (a la Hans Rosling) etc. Also see Ventana Research and comments by Tableau followers.

This post may be expanded, since it is officially the 1st day of the release.

n009: http://wp.me/sCJUg-tableau6

DV Comparison: Qlikview, Spotfire, Tableau, MS BI Stack

Published the comparison of 4 leading DV Products, see http://wp.me/PCJUg-1T

I did not include in the comparison the 5th leading product, Visokio’s Omniscope, because it has very limited scalability due to the specifics of its implementation: Java does not allow it to visualize too much data. Among the factors to consider when comparing DV tools:

  • - memory optimization [Qlikview is the leader in in-memory columnar database technology];
  • - load time [I tested all products above and PowerPivot is the fastest];
  • - memory swapping [Spotfire is the only one that can use a disk as virtual memory, while Qlikview is limited by RAM only];
  • - incremental updates [Qlikview is probably the best in this area];
  • - thin clients [Spotfire has the best THIN/Web/ZFC (zero-footprint) client, especially with their recent release of Spotfire 3.2 and Spotfire Silver];
  • - thick clients [Qlikview has the best THICK client];
  • - access by 3rd party tools [PowerPivot’s integration with Excel 2010, SQL Server 2008 R2 Analysis Services and SharePoint 2010 is a big attraction];
  • - interface with SSAS cubes [PowerPivot has it, Tableau has it, Omniscope will have it very soon, Qlikview and Spotfire do not have it];
  • - GUI [3-way tie; it heavily depends on personal preferences, but in my opinion Qlikview is easier to use than the others];
  • - advanced analytics [Spotfire 3.2 is the leader here with its integration with S-PLUS and support for IronPython and other add-ons];
  • - the productivity of developers involved with the tools mentioned above. In my experience Qlikview is a much more productive tool in this regard.

p003: http://wp.me/pCJUg-3R

Blog as a thought saver

“How do I know what I think until I see what I say?” Or let me rephrase Mr. E.M. Forster: “How do YOU know what I think until I blog about it?”

I resisted the idea of having a blog since 1996, because I perceived blogging as very similar to fasting in the desert (actually, after a few months of blogging I am amazed, according to WordPress Statistics, that my blog has hundreds and hundreds of visitors every day!). But recently I got a few excellent pushes to start my own blog, because comments I posted on somebody else’s blog got deleted against my will. It turned out that owners of those blogs can delete my comments and thoughts anytime if they do not like what I said. It happened to me on one of Forrester’s blogs and it happened to me on my own profile on LinkedIn, when I posted a so-called “update” and some LinkedIn employees decided to delete it. In both cases the administrators did not even bother to send me my own thoughts for archiving purposes; they just disappeared!

So I decided to start a blog about Data Visualization (DV), because I have been doing DV for many years and have accumulated many DV implementations and thoughts about DV, DV tools, DV vendors, the DV market etc. For now I will have 8 main pages (and they will be used as root pages for a hierarchy of sub-pages):

  • Home Page of this blog is a place where all posts and comments will go,
  • Visualization Page (with sub-pages) is for DV Samples and Demos,
  • DataViews Page (and its sub-pages) is about … Data Views, Charts and Chartology,
  • Tools Page is designated for DV Software and comparison of DV Tools,
  • Solutions Page will describe possible DV solutions, DV systems, products and DV services I can provide,
  • Market Page is dedicated to DV Vendors and DV market news and analyses,
  • Data Page is about ETL processes, Data Collection and Data Sources,
  • About page can give you info about me.

p001: http://wp.me/pCJUg-3

Google keeps own Data Visualizations options open

Recently I had a few reasons to review the Data Visualization technologies in Google’s portfolio. In short: Google (if it decides to do so) has all the components to create a good visualization tool, but the same thing can be said about Microsoft, and Microsoft decided to postpone the production of a DV tool in favor of other business goals.

I remember that a few years ago Google bought Gapminder’s Trendalyzer software (Hans Rosling did some very impressive demos with it a while ago) and converted it into a Motion Chart “technology” of its own. (For the Motion Chart demo I did below, please choose a few countries, e.g. check the checkboxes for US and France, and then click on the “Right Arrow” button in the bottom left corner of the Motion Chart.)

A Motion Chart (see also here a sample I did myself, using Google’s Motion Chart) allows you to have 5-6 dimensions crammed into a 2-dimensional chart: shape, color and size of bubbles, axes X and Y as usual (above these are Life Expectancy and Income per Person) and an animated time series (see the light blue 1985 in the background above; all bubbles will move as “time” goes by). Google uses this and its other visualization technologies in its very useful Public Data Explorer.

Google Fusion Tables is a free service for sharing and visualizing data online. It allows you to upload and share data, merge data from multiple tables into interesting derived tables, and see the most up-to-date data from all sources. It has Tutorials, a User’s Group, a Developer’s Guide and sample code, as well as examples. You can check a video here:

The Google Fusion Tables API enables programmatic access to Google Fusion Tables content. It is an extension of Google’s existing structured data capabilities for developers. A developer can populate a table in Google Fusion Tables with data, from a single row to hundreds at a time. The data can come from a variety of sources, such as a local database, a .CSV file, a data collection form, or a mobile device. The Google Fusion Tables API is built on top of a subset of the SQL querying language. By referencing data values in SQL-like query expressions, a developer can find the data needed, then download it for use by an application. The application can do any desired processing on the data, such as computing aggregates or feeding it into a visualization gadget. Data can also be synchronized: when you add or change data in the tables in your offline repository, you can ensure the most up-to-date version is available to the world by synchronizing those changes up to Google Fusion Tables.

Everybody knows about Google Web Analytics for your web traffic, visitors, visits, pageviews, length and depth of visits, presented by very simple charts and dashboard, see sample below:

Fewer people know that Panorama Software has an OEM partnership with Google, enabling Google Spreadsheets with SaaS Data Visualizations and Pivot Tables.

Google has a Visualization API (and interactive Charts, including all standard Charts, GeoMap, Intensity Map, Map, DyGraph, Sparkline, WordCloud and other Charts) which enables developers to expose their own data, stored on any data-store that is connected to the web, as a Visualization-compliant datasource. The Google Visualization API also provides a platform that can be used to create, share and reuse visualizations written by the developer community at large. Google provides samples, a Chart/API Gallery (Javascript-based visualizations) and a Gadget Gallery.

And last but not least, Google has excellent back-end technologies needed for big Data Visualization applications, like BigTable (a compressed, high-performance, proprietary database system built on Google File System (GFS), Chubby Lock Service, and a few other Google programs; it is currently not distributed or used outside of Google, although Google offers access to it as part of Google App Engine) and MapReduce. Add to this list Google Maps and Google Earth, and then ask yourself: what is stopping Google from producing a competitor for the Holy Trinity of DV (Qlikview+Spotfire+Tableau)?

Permalink: http://apandre.wordpress.com/2011/02/08/dvgoogle/

Trend Analysis: see it 1st

Data Visualization can be a good thing for Trend Analysis: it allows you to “see this” before you “analyze this” and to take advantage of the human eye’s ability to recognize trends quicker than any other method. Dr. Ahlberg (after selling Spotfire to TIBCO and claiming that “Second place is first loser”) started “Recorded Future” to basically sell … future trends in the form (mostly) of Sparklines; he succeeded at least in selling Recorded Future to investors from the CIA and Google. Trend analysis is an attempt to “spot” a pattern, or trend, in data (in most cases a well-ordered set of datapoints, e.g. ordered by timestamps) or to predict future events.

Visualizing Trends means in many cases either a Time Series Chart (can you spot a pattern here with your naked eye?):

or a Motion Chart (both best done by … Google, see it here http://visibledata.blogspot.com/p/demos.html ) – can you predict the future here(?):

or Sparklines (I like the Sparkline implementations by Qlikview and Excel 2010) – sparklines are a scale-less visualization of “trends”:

maybe a Scatter chart (Excel is good for it too):

and in some cases a Stock Chart (Volume-Open-High-Low-Close, best done with Excel) – for example, Microsoft stock has been fluctuating near the same level for many years, so I guess there is no visible trend here, which maybe spells trouble for Microsoft’s future (compare with the visible trend of Apple and Google stocks):

Or you can see Motion, Timeline, Sparkline and Scatter charts live online below: for the Motion Chart Demo, please choose a few countries (e.g. check the checkboxes for US and France) and then click on the “Right Arrow” button in the bottom left corner of the Motion Chart below:

In statistics, trend analysis often refers to techniques for extracting an underlying pattern of behavior in a well-ordered dataset which would otherwise be partly hidden by “noise data”. It means that if one cannot “spot” a pattern by visualizing such a dataset, then (and only then) it is time to apply regression analysis and other mathematical methods (unless you are smart or lucky enough to remove the noise from your data). As I said in the beginning: try to see it first! However, extrapolating the past into the future can be a source of very dangerous mistakes (just check the history of almost any empire: Roman, Mongol, British, Ottoman, Austrian, Russian etc.).
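For the case when the eye alone cannot spot the pattern, here is a minimal sketch of the simplest kind of regression analysis: an ordinary least-squares straight line fitted to a well-ordered series. The function name fit_linear_trend and the sample series are my own assumptions for illustration only; a real dataset may need a very different model.

```python
from statistics import mean


def fit_linear_trend(values):
    """Fit a least-squares line through the points (0, values[0]), (1, values[1]), ...

    Returns (slope, intercept) of the fitted line.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(values)
    # Covariance-like and variance-like sums of the ordinary least-squares formula.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical series: an underlying trend of 2*t + 10 hidden by alternating "noise".
noisy = [2 * t + 10 + (3 if t % 2 == 0 else -3) for t in range(20)]
slope, intercept = fit_linear_trend(noisy)
print(f"estimated trend: {slope:.2f} * t + {intercept:.2f}")
```

The fitted slope only describes the range of data it was given; using it to extrapolate far beyond that range is exactly the dangerous mistake mentioned above.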

Dimensionality of Visible Data

The human eye has its own Curse of Dimensionality (a term suggested in 1961 by R. Bellman and described independently by G. Hughes in 1968). In most cases the data, before they are visualized, are organized in multidimensional Cubes (n-Cubes) and/or Data Warehouses and/or, speaking more cloudily, in a Data Cloud. They need to be projected into lower-dimensional datasets (small-dimensional Cubes, e.g. 3d-Cubes) before they can be exposed in the form of Charts (preferably an interactive and synchronized set of charts, sometimes called a dashboard) on the 2-dimensional surface of a computer monitor.

Projection of DataCloud to DataCubes and then to Charts
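The projection of a Cube into a lower-dimensional Cube can be sketched in a few lines. This is only an illustration under my own assumptions: the cube is a plain dictionary keyed by coordinate tuples, the dimension names and sales numbers are hypothetical, and the measure is rolled up by summing (other aggregations, like average or max, are just as legitimate).

```python
from collections import defaultdict


def project(cube, dims, keep):
    """Roll a cube up onto the dimensions listed in `keep` by summing the measure.

    cube: dict mapping coordinate tuples (ordered as in `dims`) to a number.
    dims: names of all dimensions of the cube, in coordinate order.
    keep: names of the dimensions that survive the projection.
    """
    idx = [dims.index(d) for d in keep]
    out = defaultdict(float)
    for coords, value in cube.items():
        # Drop the coordinates that are not kept and accumulate the measure.
        out[tuple(coords[i] for i in idx)] += value
    return dict(out)


# Hypothetical 4-dimensional cube: (year, region, product, channel) -> sales
dims = ("year", "region", "product", "channel")
cube = {
    (2010, "East", "A", "web"): 10.0,
    (2010, "East", "A", "store"): 5.0,
    (2010, "West", "B", "web"): 7.0,
    (2011, "East", "B", "web"): 4.0,
    (2011, "West", "A", "store"): 6.0,
}

# Project onto 2 dimensions, small enough for a simple Bar or Heatmap chart.
by_year_region = project(cube, dims, keep=("year", "region"))
print(by_year_region)
```

Each chart on a dashboard is then one such projection of the same backend Cube, just with a different choice of dimensions to keep.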

During the last 200+ years people kept inventing all types of charts to be printed on paper or shown on screen, and most of those charts show 2- or 3-dimensional datasets. Prof. Hans Rosling led Gapminder.org to create the web-based, animated 6-dimensional Color Bubble Motion Chart (Trendalyzer),

which he used in his famous demos: http://www.gapminder.org/world/ , where the 6 dimensions in this specific Chart (almost a record for a 2-dimensional chart to carry) are:

  • X coordinate of the Bubble = Income per person,
  • Y coordinate of the Bubble = Life expectancy,
  • Size of the Bubble = Population of the Country,
  • Color of the Bubble = Continent of the Country,
  • Name of the Bubble = Country,
  • Year = animated 6th Dimension/Parameter as time-stamp of the Bubble.

Trendalyzer was bought from Gapminder by Google in 2007 and was converted into Google Motion Chart, but Google somehow is in no rush to enter the Data Visualization (DV) market.

The dimensionality of this Motion Chart can be pushed even further, to 7 dimensions (a dimension here being an expression of measurement without units), if we use different Shapes (in addition to filled Circles we can use Triangles, Squares etc.), but that would literally be pushing the limit of what the human eye can handle. If you add to this the tendency of DV Designers to squeeze more than one chart onto a screen (how about overcrowded Dashboards with multiple synchronized interactive Charts?), we are literally approaching the limits of both the human eye and the human brain, regardless of the dimensionality of the Data Warehouse in the backend.

Below I approximately assessed the dimensionality of datasets for some popular charts (please feel free to send me corrections). For each Dataset and respective Chart I estimated the number of measures (usually a real or integer number, which can be calculated from other dimensions of the dataset), the number of attributes (in many cases they are categories, enumerations or have string as their datatype) and 0 or 1 parameter (representing a well-ordered set, like time (for time series), date, year, sequence (can be used for Data Slicing), or a natural, integer or real number), and the Dimensionality (the number of Dimensions) as the total number of measures, attributes and parameters in a given dataset.

Chart Measures Attributes Parameter Dimensionality
Gauge, Bullet, KPI 0 0
Monochromatic Pie 1 1
Colorful Pie 1 1 2
Bar/Column 1 1 2
Sparkline 1 1 2
Line 1 1 2
Area 1 1 2
Radar 1 1 2
Stacked Line 1 1 1 3
Multiline 1 1 1 3
Stacked Area 1 1 1 3
Overlapped Radar 1 1 1 3
Stacked Bar/Column 1 1 1 3
Heatmap 1 2 3
Combo 1 2 3
Mekko 2 1 3
Scatter (2-d set) 2 1 3
Bubble (3-d set) 3 1 4
Shaped Motion Bubble 3 1 1 5
Color Shaped Bubble 3 2 5
Color Motion Bubble 3 2 1 6
Motion Chart 3 3 1 7
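The counting rule (Dimensionality = measures + attributes + parameter) is simple enough to sketch in code. The ChartDataset class below is a hypothetical helper of my own; its fields follow the Gapminder bubble list above, and the extra "shape category" attribute is an assumed name for the Shape dimension.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ChartDataset:
    """Describes a dataset by its measures, attributes and optional parameter."""

    measures: Tuple[str, ...]
    attributes: Tuple[str, ...]
    parameter: Optional[str] = None  # 0 or 1 well-ordered parameter, e.g. time

    @property
    def dimensionality(self) -> int:
        # Total number of measures, attributes and parameters in the dataset.
        return len(self.measures) + len(self.attributes) + (1 if self.parameter else 0)


# Color Motion Bubble chart, as in the Gapminder example: 3 + 2 + 1 = 6.
gapminder = ChartDataset(
    measures=("income per person", "life expectancy", "population"),
    attributes=("continent", "country"),
    parameter="year",
)

# Adding a Shape attribute (hypothetical name) pushes it to 3 + 3 + 1 = 7.
shaped = ChartDataset(
    measures=gapminder.measures,
    attributes=gapminder.attributes + ("shape category",),
    parameter="year",
)

print(gapminder.dimensionality, shaped.dimensionality)
```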


The diversity of Charts and their Dimensionality adds another complexity for the DV Designer: which Chart(s) to choose. You can find some good suggestions about that on the web. Dr. Andrew Abela created the Chart Chooser Diagram

Choosing a good chart by Dr. Abela

and it was even converted into an online “application”!

Permalink: http://apandre.wordpress.com/2011/03/02/dimensionality/

Blog as a thought saver

“How do I know what I think until I see what I say?” Or let me rephrase Mr. E.M. Forster: “How do YOU know what I think until I blog about it?”

I have resisted the idea of having a blog since 1996, because I perceived blogging as very similar to fasting in a desert (actually, after a few months of blogging I am amazed – according to WordPress Statistics – that my blog has hundreds and hundreds of visitors every day!). But recently I got a few excellent pushes to start my own blog, because when I posted comments on somebody else’s blog they got deleted against my will. It turned out that the owners of those blogs can delete my comments and thoughts anytime if they do not like what I said. It happened to me on one of Forrester’s Blogs and it happened to me on my own profile on LinkedIn, when I posted a so-called “update” and some LinkedIn employees decided to delete it. In both cases the administrators did not even bother to send me my own thoughts for archiving purposes – they just disappeared!

So I decided to start the blog about Data Visualization (DV),

because I have been doing DV for many years and have accumulated many DV implementations and thoughts about DV, DV tools, DV Vendors, the DV Market etc. For now I will have 8 main pages (they will be used as root pages for a hierarchy of sub-pages):

  • Home Page of this blog is a place where all posts and comments will go,
  • Visualization Page (with sub-pages) is for DV Samples and Demos,
  • DataViews Page (and its sub-pages) is about … Data Views, Charts and Chartology,
  • Tools Page is designated for DV Software and comparison of DV Tools,
  • Solutions Page will describe possible DV solutions, DV Systems, products and DV services I can provide,
  • Market Page is dedicated to DV Vendors and DV market news and analyses,
  • Data Page is about ETL processes, Data Collection and Data Sources,
  • About page can give you some info about me.

And finally, I have to mention this 500-year-old story in the hope it will help me to filter unneeded pieces out of this blog: “An admirer asked Michelangelo how he sculpted the famous statue of David that now sits in the Accademia Gallery in Florence. How did he craft this masterpiece of form and beauty? Michelangelo offered this strikingly simple description: He first fixed his attention on the slab of raw marble. He studied it and then ‘chipped away all that wasn’t David.’”

p001: http://wp.me/pCJUg-3
