http://bit.ly/BWPABD
A two-day workshop on “Beyond Watson, Predictive Analytics and Big Data” was held on Monday and Tuesday, February 3-4, 2014, in Hanover, Maryland, with support from NSF and the DoD.
Watson is IBM’s DeepQA system that successfully out-competed humans on Jeopardy. In its now famous human-machine challenge, Watson was asked to give the question to an answer whose associated question was already known. Can the techniques implemented in Watson be extended and used as a powerful knowledge discovery tool to find the questions to answers whose associated questions are unknown? Turning this around to a non-Jeopardy context, can these techniques be extended to find the answers to questions whose answers are unknown? What are the underlying areas in computer science where new advances are needed? How can they be combined? What domains present the most pressing challenges? This workshop will bring together leading figures from academia, industry, and government to brainstorm about these issues and provide feedback to the federal agencies.
Today, the sub-specialties of knowledge discovery, machine learning, data mining, and information retrieval within computer science and statistics are commonly applied to medicine, biology, ecology, physics, genomics, cryptanalysis, and other fields, and more recently to social science, advertising, and cybersecurity. With the advent of publicly available cloud computing services (e.g., Amazon, Google), institutions, companies, and individual users can rent compute time for large-scale computation without the enormous cost of building and maintaining supercomputers, which are usually available only for large laboratory applications. Additionally, programming and data storage paradigms (e.g., MapReduce, Google File System, Bigtable) have evolved to exploit the inherent parallelism in many business applications. This has led to new applications for computer science. As the Internet has evolved, we have gained access to a plethora of online information in real time (e.g., Google, Wikipedia). Real-time data sources have emerged that provide social contact with friends and like-minded individuals (e.g., Facebook, Twitter), online banking, online purchasing (e.g., Amazon), video streaming, telephone, geolocation, and more. With the advent of wireless technologies and smartphones, all of this information is now available in the palm of our hands. Payment systems are now facilitated by smartphones. Advertising is ubiquitous and tailored to the user in real time.
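As a rough illustration of the parallelism that a paradigm such as MapReduce exploits, the following minimal Python sketch counts words in three phases. It runs in a single process, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical names chosen for this example; they do not come from any particular framework.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    # Each map call depends on one document only, so a distributed system
    # could run the calls on different machines. Here they run in sequence.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big questions", "Big answers"]))
```

Because neither the map calls nor the per-key reduce calls share state, the same program structure can in principle be spread across many rented machines.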
Machine learning and data mining techniques are used to answer a wide range of questions, from simple facts to trends and predictions. Medical applications address questions ranging from how many people in a population suffer from a given disease, to how effective various treatment regimens are, to which hospital patients will have a heart attack within the next hour. Advertising applications analyze your click stream and cookies so that ads tailored to your interests can be placed in real time on the website you are browsing. They also seek to identify users who click through links to intentionally skew statistics about products, web sites, and more, a practice called click fraud.
How problems are identified and solved in these various domains usually depends on teams of people. Typically, an analyst of some type has a collection of questions that they would like to ask about their domain of interest. Analysts, who for the purposes of this document are subject-matter experts such as doctors and financial analysts, are teamed with statisticians, computer scientists, and others who write the computer programs that put the data in a usable form and who develop the machine learning applications, tools, and algorithms. This machine-readable and executable information is combined into analytics designed to answer a question in a given line of inquiry. Depending on the application, these analytics can be reused but eventually need to be refreshed given new data. Applications such as credit card fraud detection and click stream detection need to be refreshed regularly. A number of different higher-level computer languages exist to facilitate the development and execution of analytics. They include everything from SQL (Structured Query Language) to XML (Extensible Markup Language) to PMML (Predictive Model Markup Language) and many more. In some domains where the answers to analytics (analyst questions) span sources of information and types of media, ontologies and taxonomies are used to help connect the data. However, there is no common language for expressing the queries, the analytics, or the connecting ontologies, databases, and systems.
This process of developing and maintaining analytics tends to be quite labor intensive. There is always a risk of overfitting the analytic to the data for one problem, so that it is useful only in a narrow domain. This is particularly true when the analytic is designed to be predictive (e.g., will the user buy this product?) or when it summarizes real-time information (e.g., news feeds). Real-time financial industry data requires that credit card fraud models be recreated at least annually. For newer applications such as computational advertising, models must be refreshed weekly.
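One common way to see overfitting is to compare a model's accuracy on the data it was built from with its accuracy on data it has not seen. The Python sketch below is only an illustration under stated assumptions: the data are synthetic (a label that is 1 when x exceeds 0.5, flipped 20% of the time), and the two models, a nearest-neighbor "memorizer" and a fixed threshold, are stand-ins chosen for this example rather than analytics discussed in this document.

```python
import random


def make_data(n, rng, noise=0.2):
    """Synthetic data: label is 1 when x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data


def nearest_neighbor_predict(train, x):
    """Memorizing model: return the label of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def threshold_predict(x):
    """Simple model: one fixed cut point, assumed known for this sketch."""
    return int(x > 0.5)


def accuracy(predict, data):
    """Fraction of examples for which the prediction matches the label."""
    return sum(predict(x) == label for x, label in data) / len(data)


def evaluate(seed=0):
    rng = random.Random(seed)
    train, held_out = make_data(200, rng), make_data(200, rng)

    def memorizer(x):
        return nearest_neighbor_predict(train, x)

    return {
        "memorizer_train": accuracy(memorizer, train),
        "memorizer_held_out": accuracy(memorizer, held_out),
        "threshold_train": accuracy(threshold_predict, train),
        "threshold_held_out": accuracy(threshold_predict, held_out),
    }


if __name__ == "__main__":
    for name, value in evaluate().items():
        print(f"{name}: {value:.2f}")
```

The memorizer reproduces every training label, noise included, so its training accuracy is perfect while its held-out accuracy is lower. A gap of that kind is the usual warning sign that an analytic has been fitted too closely to one data set and may need to be rebuilt as new data arrive.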
Can these methods from disparate domains be pulled together under one umbrella with one query language? Furthermore, can simple English-language questions be posed and translated into appropriate existing or easily modified analytics through the use of shared taxonomies and ontologies? Can analytics learn from one another? Can they advertise themselves, how they work, and what they can operate on? What categories and classes of questions and their resultant analytics exist today? What query, modeling, and markup languages exist today to aid in this process? What are the computational limits of the current tools and algorithms? Should different types of computer architectures be considered to make these analytics more computationally feasible?
Registration
This workshop is by invitation only. An email with registration details will be sent to the invited participants.
Contact
If you have any questions, please contact us by email at beyondWatson@cs.umbc.edu.