Rafal Lukawiecki: Absolutely. My name is Rafal Lukawiecki and I work for Project Botticelli Ltd., which is a small consulting company based in Ireland. Over there I specialize in business intelligence and that has been the speciality of my company for the last five years. Previous to that our main area of speciality was data protection and security.
Slo-Tech: What are the newest trends in business intelligence?
Rafal Lukawiecki: The newest trends in business intelligence, which started outside of Microsoft about two or three years ago and which arrived at Microsoft this year, are primarily focused on moving from traditional, so-called organizational business intelligence towards self-service, do-it-yourself business intelligence which is done by every information worker in the organization. There is a vision which says that before the end of this decade - so we have about nine years left - all people who use computers, whether for e-mail or any other purposes in companies, will become a little bit like analysts and they will be analyzing their own data, which today is just impossible to think about because people don't have analytical skills. And even if you have access to the data, it's very, very hard to do it yourself. What is changing is that new technologies are enabling that huge group of users to get answers from data all by themselves without having to have analytical skills. And that's probably the biggest trend.
The second trend, which parallels this one, is in management of business intelligence. Because once you have self-service business intelligence, it's very easy to have inconsistency, since many different people are using different types of business intelligence and therefore slightly different answers can lead to a lot of chaos. Whilst it's important to give people tools that empower them to find their own answers, it's important to be able to manage what they are doing and this is where perhaps having a more traditional business intelligence technology - and that includes Microsoft's new version of SharePoint Server - helps in managing that growing mass of self-service business intelligence.
Slo-Tech: What do you think about the newest trends in information retrieval from unstructured data like semantic technologies and similar?
Rafal Lukawiecki: Absolutely, well in fact it's all a question of what is unstructured. Using the technologies I am most familiar with, which would be Microsoft technologies, we've been able to do unstructured data analysis for a very long time. One of the very typical things I've been doing using SQL Server 2008 data mining technologies was extracting meaning from blog entries and e-mails. When people send such unstructured data into a database, it's very difficult to understand what it means unless you read it. But there are automatic technologies - we use for example associative analyses, which are part of SQL Server - that are able to classify information that is unstructured and extract information from it. Will it go to a point that we will be able to find repetitive structure in unstructured data? I don't know. But that is a big goal. It amounts to saying that what seems unstructured really has a structure that we want to extract. I think we are too early for that. But definitely analyzing unstructured data is possible.
Also to some people unstructured simply means not in a database, but maybe on a report, maybe on a website, in a table, and people want to bring all of this together to find whether there are connections between the data. And that's exactly what PowerPivot can do, which is the technology I was talking about at this NT Conference. It allows the user to bring some data from a web page, some data from a database, some data from a report and find out whether these data connect. And this is a very new way to do analysis. Is it unstructured? To some people it is.
Slo-Tech: How long do you think it will take to get usable natural language processing abilities to the end users?
Rafal Lukawiecki: We've been trying to answer that question for many years. In fact, before I even started studying computers in London in the 90s, when I was still a mid-level student in Poland before I left Poland 20 years ago, one of the very first programs I was writing in a language called SNOBOL4 was for natural language recognition. When asked the question "How long before we can use it?" we said maybe 10 or 15 years. And you see where we are now and it is still taking longer. The optimist in me says that when you don't involve speech, we should be seeing imprecise, intent-based results within the next 7 to 10 years. If you want to get precise results, where the intent is deduced and subsequently verified to have close to a 100 % match with what the user really wanted to say, I think we are going to wait another 30 to 40 years. It's going to take a long time.
Slo-Tech: What about all the hype we had about 10 years ago about speech recognition? At some point it just stopped because speech recognition is a mostly solved problem - except for the meaning recognition.
Rafal Lukawiecki: Exactly. You see, speech recognition is very interesting because it goes back to intent. If the choice of intent is very specific and comes from a domain, say you use speech recognition to give a command to a computer from a choice of eight commands, it's very easy to do and it works. With a little training it's perfect. If you look at the way Office, in particular Word, and Windows allow people to dictate, it's pretty good. When it comes to a very unbounded domain, that's a very different story. And especially if you want perfection - or as close to perfection as you can get - in the responses or the understanding of the response, then you are absolutely right. Even if we understand language perfectly, that's not the problem. The problem is that people use very colloquial ways of describing their intent and deducing intent is a long, long journey. Because for that we need to make significant advances in ontology, in knowledge representation in general... it's beyond semantics, it's a question of understanding pretty much what artificial intelligence has been trying to do in the last 50 years - understanding what the user really wanted to do, but didn't say. Or understanding what the user really wanted to do, but said the opposite. Many people will give a command that, on a word-by-word basis, means something completely opposite to what they really want to achieve simply because they are using a language that is colloquial. There is no way we can cope with that in speech recognition before we do it with written language and even that is hard.
Slo-Tech: How far is ontology building from unstructured data?
Rafal Lukawiecki: I would say that I should know about it because I invested and lost quite a bit of money in the early days of this decade. In 2000 I invested in a company that was doing precisely ontology research and we had this idea that we could build a matrix of automated, fairly autonomous software that would be building an ontological representation of the user's interest precisely by recognizing the intent from its environment. Unfortunately at that time I realized that we were in the very early stages of even academic research. Look at the huge difficulties that UDDI (Universal Description, Discovery and Integration) has had with web services. Web services technology works well, the difficulty is in recognizing which services to use. Look at the difficulty that the semantic web is having in pushing through. I think that the current level of research is great, but I will be very pessimistic and say that it will take another 20 years to get anywhere near where we want to be.
On the other side, there is a different way of solving this problem, which is guessing the intent and assuming a certain amount of error and then statistically deducing what the user wanted to achieve. And if you look at it that way, then we can achieve more. We can say "OK, I really don't know what you said. I have a rough idea that it could be these three choices and statistics tells me that from the context you probably want to do that." And if you think about it, this is actually how people work, how consultants work. When I meet somebody and I don't fully understand what they are asking me during a seminar, but I feel obliged to help them, I try to think what they could have meant based on my assumptions: OK, this is a banking institution, they've just deployed a new system, ATM cards are being processed, they have some issues... Well, it's probably some reliability issue to do with a new deployment of a risky system and specifically I refer to my banking knowledge, because that's where I am. I think adding probabilistic elements to any ontological system is probably going to speed up or deliver some intermediate matching and that will bring us an interesting solution.
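The statistical guessing described above, where a few candidate intents are ranked using context, can be illustrated with a minimal Bayesian sketch. Everything in it is an assumption made for illustration: the three intents, the context clues, and all the probabilities are hypothetical numbers, not figures from the interview or from any product.

```python
from math import prod

# Hypothetical prior probabilities for three candidate intents.
PRIORS = {"reliability": 0.5, "security": 0.3, "performance": 0.2}

# Hypothetical likelihoods P(clue | intent) for clues observed in the context.
LIKELIHOODS = {
    "reliability": {"new_deployment": 0.8, "atm_cards": 0.6},
    "security": {"new_deployment": 0.3, "atm_cards": 0.7},
    "performance": {"new_deployment": 0.5, "atm_cards": 0.2},
}


def rank_intents(clues):
    """Return (intent, posterior) pairs, most likely first.

    Uses Bayes' rule with the naive assumption that clues are
    independent given the intent.
    """
    scores = {
        intent: PRIORS[intent] * prod(LIKELIHOODS[intent][c] for c in clues)
        for intent in PRIORS
    }
    total = sum(scores.values())  # normalising constant
    posteriors = {intent: s / total for intent, s in scores.items()}
    return sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)


ranking = rank_intents(["new_deployment", "atm_cards"])
best_intent, best_prob = ranking[0]
print(best_intent, round(best_prob, 2))
```

With these made-up numbers the "reliability" reading wins, which mirrors the consultant's guess in the banking example: the answer is not certain, only the most probable of a few choices.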
Slo-Tech: Bayesian statistics is state-of-the-art at the moment.
Rafal Lukawiecki: I think so too. And that connects me very much to business intelligence, which is the subject I was talking about here, because data mining is actually a very clever application of hard statistics with a little bit of knowledge discovery and machine learning and artificial intelligence, but brought to the level of people who don't need to understand statistics very well. And SQL Server has data mining that for example can allow the user to find the exceptions or the outliers in a data set by pressing one button. Nobody needs to know that this is a use of a fairly interesting clustering algorithm and we don't need to explain what that means. But by using that we can deduce intent, we can deduce what's outside the intent and find out the outliers. In fact we are bringing artificial intelligence through the back door to the level of the user. So statistics are definitely big now.
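As a rough illustration of how clustering can surface outliers, here is a toy sketch. It is not the algorithm SQL Server uses; it is a deliberately simple one-dimensional k-means, and the function names, the `min_share` threshold and the sales figures are all hypothetical. The idea is that points which end up in a very small cluster of their own are treated as exceptions.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Tiny deterministic 1-D k-means (requires k >= 2)."""
    lo, hi = min(values), max(values)
    # Deterministic start: spread the initial centres evenly over the range.
    centres = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Move each centre to the mean of its cluster; keep it if the cluster is empty.
        centres = [
            sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)
        ]
    return centres, clusters


def find_outliers(values, k=2, min_share=0.2):
    """Flag values that fall into clusters holding less than min_share of the data."""
    _, clusters = kmeans_1d(values, k)
    return sorted(
        v for c in clusters if len(c) / len(values) < min_share for v in c
    )


# Hypothetical daily sales figures with one unusual day.
daily_sales = [102, 98, 101, 97, 103, 99, 100, 250]
outliers = find_outliers(daily_sales)
print(outliers)
```

A real tool would work on many columns at once and choose the number of clusters itself; the sketch only shows why the user never needs to see the clustering step.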
Slo-Tech: You said that SQL Server has some data mining functionality. How about some more advanced features like express clustering, where you get hierarchical k-means and similar? What about other tools which would be of benefit? Are there any plans to get that to the masses or is this still only highly-specialized software?
Rafal Lukawiecki: That's a fantastic question. As I said at the beginning of the interview, a big trend right now is in moving business intelligence towards an enormous quantity of information workers, people who just use e-mail, but don't understand analytics. SQL Server has had for a long time e.g. a technology of building data cubes and analysis services. This technology has been taken to another level with the addition of PowerPivot for SharePoint and Excel which enables a very non-technical user to build cubes without them knowing that they are actually building a cube.
Slo-Tech: What about some advanced knowledge extraction methods over that domain of data?
Rafal Lukawiecki: That's primarily data mining. In business intelligence the extraction of any information from data is an application of knowledge discovery in databases. And in SQL Server that is done through nine data mining algorithms, including decision trees, clustering, neural networks, naive Bayesian classifiers, logistic regression, linear regression and others that I can't recall at this moment. Of all of those algorithms, the most useful one for letting a user sift through an enormous amount of data and get some early understanding of what's going on in there is probably the decision tree. A decision tree is very easy to understand as it builds a visually appealing tree that shows the combinations of correlations in a way that anybody understands with very little training. That technology has been simplified again by being brought into Excel. So you don't need to understand SQL Server in order to be able to use these tools. There is a data mining add-in for Office which Microsoft provides free of charge and anybody using Excel 2007 or 2010 with these add-ins can connect to SQL Server to find meaning in their data. Another development at Microsoft with regard to clustering is a kind of clustering in the opposite direction.
We have another problem in business intelligence where we have a massive amount of data sitting in separate data warehouses and we are increasingly realizing that we want to bring them together so that we can analyze them as one. It hasn't shipped yet: although SQL Server 2008 R2 shipped last week, this particular technology will be released by Microsoft later this summer. They call it a Parallel Data Warehouse and it's effectively a technology for building large logical data warehouses from a possibly big number of even separate physical servers all representing data slightly differently. And it's easy to say that there is a logical connection between them but it's actually quite difficult to do it in a meaningful and reliable way.
Slo-Tech: Does this work only on SQL Server?
Rafal Lukawiecki: Yes, Parallel Data Warehouse groups different SQL Servers into one mega data warehouse.
Slo-Tech: So if you have some data in some other type of data storage, you still can't access it?
Rafal Lukawiecki: Well, then you can use SQL Server Integration Services, which allows you to bring any information from any database on the planet into the Parallel Data Warehouse. This is a very fast technology. It's designed for processing millions of rows of data in a matter of hours, depending on how you structure your data centres. It's a big machine.
What's interesting is also the future. Whilst all of those technologies are great now and they enable the user to do a lot of stuff, Microsoft is also moving out into the cloud. And then the question you asked about bringing the data from other servers becomes even more interesting because now we would be bringing data from possibly very unstructured sources in SQL Server as well. At the moment we can do only a very limited sample of this, but Microsoft has very big plans for it.
Microsoft's plans are that in the future all software development will be done in Microsoft's Cloud. Microsoft is investing an enormous amount of people, money and time doing nothing else but building the planet's biggest and most powerful cloud, which will have no limit. You don't get there by dropping all the limits now, you have to do it in stages. It will take a while to get there. But it's interesting, because if you think about the competition, e.g. Amazon Elastic Compute Cloud uses a very different model: you copy what you've done locally in your data centre to a data centre in the cloud. There is no difference, it's just an outsourced, virtualized version of what you are running locally on a server. I don't find that very exciting at all, that's so 10 or 20 years ago. What Microsoft is hopefully going to do, before anybody else catches up with that and spoils the plans, is to bring a very different way of developing software, where what you are developing in the cloud is developed in a way that is unique to the cloud, that takes advantage of this interconnectedness.
Since we're talking about business intelligence, everybody needs some kind of reference data or master data. For instance, when a developer develops an application for Slovenia, very often the developer needs to access a list of cities in Slovenia or maybe a list of cities in the EU. Where do you get such a list of cities from? It's actually very difficult. You could copy it from Wikipedia or get it from other sources, but it's very hard to get one. So there is a huge market growing right now for so-called reference data, where people who specialize in creating e.g. a list of cities will offer it through the cloud to cloud users, and then we will have a new type of business whose accuracy of data will rely on the provision of accurate master data, known as reference data. I think that Microsoft, by investing in things like SQL reference data, is recognizing the importance of the cloud as a different model, as a different paradigm.
Slo-Tech: What do you think about cheating, e.g. collaborative filtering and other social technologies that allow you, instead of deducing intent from a person, to cheat by classifying the person into some group and then just seeing what other people do?
Rafal Lukawiecki: This is very basic associative analysis from data mining that we've been using for years. Using e.g. SQL Server 2008 data mining I have used association rules to classify shoppers and then I find similarities between shoppers using clustering and do exactly the same thing. I don't need social networks to extract such things if I have the data. What social networks do is they allow you to do this without the data. And that is definitely a big opportunity, provided that the people who control the information in a social network are prepared somehow to do business with others in releasing this information. Technologically I don't think this is a cheat, but a very interesting way of doing business, but I have big issues with regard to privacy and security. Because you could very easily deduce intent to do some very unsavoury things with regard to people's personal safety and I don't think that people who sign up to social networks think even for a minute about the potential of the information they have released.
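The association rules mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not SQL Server's implementation: the baskets, the `association_rules` function and the `min_confidence` threshold are hypothetical, and only pairwise rules of the form "shoppers who bought A also bought B" are computed.

```python
from collections import Counter
from itertools import permutations

# Hypothetical shopping baskets, one set of items per shopper visit.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "jam"},
    {"milk", "butter"},
]


def association_rules(baskets, min_confidence=0.6):
    """Return {(a, b): confidence} for rules 'a implies b'.

    confidence(a -> b) = baskets containing both a and b / baskets containing a
    """
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in permutations(sorted(basket), 2)
    )
    rules = {}
    for (a, b), together in pair_counts.items():
        confidence = together / item_counts[a]
        if confidence >= min_confidence:
            rules[(a, b)] = confidence
    return rules


rules = association_rules(baskets)
for (a, b), confidence in sorted(rules.items()):
    print(f"{a} -> {b}: {confidence:.2f}")
```

Grouping shoppers by such rules is the same trick as collaborative filtering: the recommendation comes from what similar people did, not from understanding any one person's intent.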
Slo-Tech: Let's say Facebook is the end of privacy for everybody who joins in. How does that apply to your own safety?
Rafal Lukawiecki: I'm glad that you said that social networks are the end of privacy today, because I lament the fact that young people grow up not just giving up their privacy, not understanding privacy, but even giving up their liberty. As, I believe, Franklin said, "Anybody who is prepared to give up their liberty for some security deserves neither security nor liberty", I would like to paraphrase the same with privacy. I would say that losing your liberty by losing your privacy at this point means that we don't deserve to have the liberty. I think things will change and we'll go the other way. In about five to ten years people will realize that moving in the direction of an Orwellian society is in nobody's interest and we will go the other way around. But there is a generation now that has lost its privacy and young people will realize when they get into their 40s and 50s how everything they said online 20 or 30 years ago is coming back at them, how suddenly the private romantic escapades they are no longer interested in talking about are relevant to what they do when they are 50. Today they don't think that way, but they will probably change their mind later. Or we will see the birth of a very new approach towards liberty and privacy.
But I lament this and therefore suggest that anybody who cares about it participate in whatever they can do in their own country to have a hard look at the way we are losing liberty and privacy. Whether it is through the Electronic Frontier Foundation or civil liberties organizations, it's time to think deeply about this and to educate people, especially young people, about the loss.
Slo-Tech: Is there anything else important that I forgot to ask you?
Rafal Lukawiecki: You asked a lot of very important questions about very different subjects. What I would like to stress is that while we talk about esoteric subjects that may or may not affect people in 20 or 30 years' time, I would like to bring everything back to the reality of today and say that right now people have to give answers based on numbers, and very hard and perhaps boring decisions need to be made, which are important for the success of organizations. I would like people to understand they can make those decisions based on answers they no longer have to entrust to somebody else. They can find out the meaning of data all by themselves and I would like everybody to visit www.powerpivot.com. There are very friendly videos and really exciting examples over there. Download PowerPivot free of charge, try it out in Office 2010 and just see how magical it is to learn something about Slovenia, for example. You can just plug a Slovenian statistical database into it, push a few buttons and learn things you never knew about your country.
Slo-Tech: Thank you very much for the interview, Mr Rafal Lukawiecki.