Conference about Big Data, High Load, Data Science, Machine Learning
Conference is over!
About the Conference
Big Data Conference Vilnius is a two-day conference with technical talks in the fields of Big Data, High Load, Data Science and Machine Learning.
The conference brings together developers, IT professionals and users to share their experience, discuss best practices, and describe use cases and business applications related to their successes.
The event is designed to educate, inform and inspire – organized by people who are passionate about Big Data and Data Exploration. We look forward to seeing you there!
Why attend?
- Meet 20+ international speakers who work in top data-driven companies.
- Join 3 technical tracks that cover the most important and up-to-date aspects of Big Data, including deep learning, real-time stream processing, data science, predictive analytics and cloud.
- Hear carefully selected purely technical and independent content.
- Network with 350+ participants from various companies that use Big Data in production.
- Select among 5 full-day technical pre-conference workshops.
List is getting longer!
Conference program - November 30 (09:00-17:30)
Opening Keynote: A Novel Approach to Data Mining And Prediction Modelling in Dairy Cows
As a major livestock producer, the European Union is directly affected by the global need for more sustainable food production. Climate change will undoubtedly impact farm animal production, but the health and welfare of livestock is also of increasing public concern. The ‘common currency’ in developing solutions to all of these challenges is improved animal production efficiency (Simmons, 2012). Diseases at calving and during early lactation (shortly after calving) account for the major health and welfare problems in dairy production (Drackley et al., 2005). These include production diseases such as fatty liver, ketosis, rumen acidosis and lameness and infectious diseases such as mastitis and reproductive tract infections. Infertility and disease are strongly interlinked through communal metabolic and immune signaling pathways (Moore et al., 2005). The knock-on consequences of these are suboptimal production and lower reproductive efficiency. These in turn contribute to excess methane emissions and higher nitrogen and phosphorus losses from soil, because there is a need to breed and keep a higher number of replacement animals. Due to the rapid development of precision livestock farming technologies and the availability of high-throughput data from sensors, large-scale data has become available on many research farms which can each serve as a candidate phenotype for the aforementioned challenges. Dealing with such interlinked challenges requires early identification using biomarkers (BM). The preferred matrix in which to measure the biomarkers is milk, as it is more accessible than blood and allows low-cost, automated repeat sampling using recently developed ‘in-line’ sampling and analytical technologies (Egger-Danner et al., 2015).
It is highly probable that certain N-glycan structures (BM-1), metabolites (BM-2) or mid-infra-red spectra (BM-3) in bovine milk can serve as biomarkers to predict one of the aforementioned phenotypes, but comparable prediction methodologies are lacking. Machine learning is often described as a key technology that will unlock insights from such data (Domingos, 2012). Effectively leveraging these technologies, however, requires a thorough understanding of the dairy domain and data science in order to clean and effectively join multiple data sources, identify the relevant parts in the data, build domain-specific algorithms and visualizations and validate the predictive models. Each of these steps is essential in advance of efficient and effective transfer of such models to the dairy industry. Accordingly, this methodology paper describes a semi-automated approach to select potential candidates for future industry-wide prediction models. The primary objective of this study is to rank the 3 types of BM according to their power to predict the phenotypes of interest from a multi-site study design.
Post-doc assistant position on herd health management focusing on the optimization of productive and reproductive performances in small and large herds with an emphasis on nutrition at the Ambulatory Clinic of the Department of Reproduction, Obstetrics and Herd Health. Work package leader for 3 work packages with a focus on data management in the EU FP7 project GplusE. Education of master's students in Veterinary Medicine. Statistical training of Ph.D. students in data management in the area of dairy cows. Post-academic and extension services in the area of herd health management in dairy cows.
From model to production – turning Microsoft Azure into data science platform
Whether you are an open source guru or a Microsoft stack aficionado, come and see how the cloud may help you run your data science business: from process to data preparation to scalable computation and then running your model in production.
Michal has been with Microsoft for 10 years and specializes in incubating new Microsoft Azure services and usage scenarios across the EMEA (Europe, Middle East & Africa) region, as well as providing sales and technical insights to mainstream sellers, validating new business models, and supporting selected customers and partners. Michal is an expert in cloud big data and big compute solutions and, in his own words, “I make cloud happen”.
Automating Analytics with Deep Learning – Exacaster experience
Exacaster has recently released a public API powered by Deep Learning that takes in your time series data and analyzes it the way a human analyst would approach it. This session shares our experiences and the latest from our R&D in the area of deep learning applied to real-world analytical projects.
Egidijus Pilypas is Head of Product Development and Data Science at Exacaster. He has been dealing with AI and Big Data challenges for the last 10 years in the Telco, Retail, Financial and Utility industries. He is a fully committed AI geek ready to share his AI learnings, experiences and the latest products he's working on with his team.
Tracking down bad guys on the “dark web” with Open Source software
We have been working with both DARPA and NASA to help develop and deliver machine learning tooling for law enforcement agencies both in the USA and Europe to help them track down smugglers and criminals on the “dark web”. In this talk we’ll detail some of the tools we built, the pipelines we generated and how we help focus on the correct data using Open Source software.
Tom Barber develops data solutions for companies and organisations worldwide. He is currently working on a range of Open Source data projects with NASA JPL.
Take Control over Your KPIs
KPIs are the single most important and valuable metrics to ensure that the company makes data-driven decisions about the direction of current projects. Companies use KPIs to evaluate past performance, but also to plan ahead. In order to have better control over future KPI changes, it is very useful to have the right tools at hand that can accurately predict the development of your KPIs. This gives the company leverage to efficiently allocate resources, focus on the projects that matter most and avoid negative results. In this talk I will present how Runtastic is currently estimating its KPIs and how we use estimates to drive business decisions.
Hilda Kosorus is a Data Scientist at Runtastic. Coming from the academic and research world after finishing her PhD in Computer Science, she took on the challenge of laying the Data Science foundation at Runtastic. She focuses on bridging the gap between analytics and other departments, creating value with every experiment and data product and helping Runtastic become more data-driven.
An Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
IBM has built a “Data Science Experience” cloud service that exposes Notebook services at web scale. Behind this service, there are various components that power this platform, including Jupyter Notebooks, an enterprise gateway that manages the execution of the Jupyter Kernels and an Apache Spark cluster that powers the computation. In this session we will describe our experience and best practices putting together this analytical platform as a service based on Jupyter Notebooks and Apache Spark, in particular how we built the Enterprise Gateway that enables all the Notebooks to share the Spark cluster computational resources.
Luciano Resende is an Architect at IBM Spark Technology Center. He has been contributing to open source at The ASF for over 10 years, is a member of the ASF and is currently contributing to various big data related Apache projects including Apache Spark, Apache Zeppelin, Apache Bahir, Apache Toree and Apache SystemML. Luciano is the project chair for Apache Bahir, and also spends time mentoring newly created Apache Incubator projects. Recently, Luciano has started contributing to Jupyter Ecosystem projects around Enterprise Notebook Platform. At IBM, he contributed to several IBM big data offerings, including BigInsights, IOP and their respective Bluemix Cloud services.
You are using the wrong database!
Relational, graph, document, in memory, key-value, search, stream, embedded – those are the most common database types. This talk will cover the types of databases, their main players, their strengths and weaknesses, when to use them, when doing so might not be the best idea, and lastly how to combine them. Full description: https://indexoutofrange.com/speaking/cfp/You-are-using-the-wrong-database!/
Apart from technical issues he focuses on building self-organized and self-sustainable teams with a well defined work culture.
Closing Keynote: Introduction to Deep Learning
This session will introduce Deep Learning concepts, with a focus on image recognition and image location exercises. After a brief introduction to the motivating factors (and why we cannot always just rely on the fantastic Cognitive Services), concepts such as convolution, pooling and rectified linear units will be presented, so that by the end of the session attendees will understand why deep learning is relevant in ‘day to day projects’ and will have learned about the development cycle of deep learning models and techniques such as partial checkpoint training.
At the end of the session, a couple of less traditional models (autoencoders and LSTM) will be discussed and cases will be analyzed.
The audience is expected to understand the basics of ML processes, such as the training, testing and validation workflow, and basic algorithms such as logistic regression.
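For readers unfamiliar with the three building blocks named in the abstract, here is a minimal pure-Python sketch of convolution, rectified linear units and max pooling. It is not taken from the talk: the toy image, the kernel and all function names are assumptions chosen for illustration, and a real project would use a deep learning framework instead.

```python
# Illustrative only: tiny pure-Python versions of three CNN building blocks.
# The image, kernel and function names are hypothetical, not from the talk.

def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, the operation CNN layers call convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Rectified linear unit: negative activations are clamped to zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; trailing rows/columns that do not fit are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + di][j * size + dj]
                for di in range(size) for dj in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy 5x5 "image" with a vertical dark-to-bright edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Toy 2x2 kernel that responds to left-to-right increases in brightness.
kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d(image, kernel)   # 4x4 map, strongest where the edge is
activated = relu(feature_map)         # negatives (none here) would become 0
pooled = max_pool2d(activated)        # 2x2 summary of the 4x4 map
print(pooled)
```

The pooled output keeps the edge response while shrinking the map, which is the usual motivation for stacking these three steps.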
Pablo Doval is the Data Team Lead and the General Manager of Plain Concepts in the UK. With a background of relational databases, data warehousing and traditional BI projects, he has spent the last years architecting and building Big Data and Machine Learning projects for customers in different sectors, such as Healthcare, Digital Media, Retail and Industry.
Multi-tenant Streaming and TensorFlow as a Service with Hops
Hops is a new European version of Apache Hadoop that introduces new concepts to Hadoop to enable multi-tenant Streaming-as-a-Service and TensorFlow-as-a-Service. In particular, Hops introduces the abstractions: projects, datasets and users. Projects are containers for datasets and users, and are aimed at removing the need for users to manage and launch clusters today, as clusters are currently the only strong mechanisms for isolating users and their data from one another. Our platform for managing datasets and running jobs, called Hopsworks, builds on Hops concepts and is an entirely UI-driven environment implemented with only open-source software. In this talk we will discuss the challenges and experiences in building secure streaming applications on both Spark and Flink with Kafka over YARN using Hopsworks. We also show how we use the ELK stack (Elasticsearch, Logstash, and Kibana) for logging and debugging running Spark applications, how we use Grafana and Vizops (an in-house developed monitoring tool) with InfluxDB to monitor Spark applications and finally how Apache Zeppelin and Jupyter can provide interactive visualizations and charts to end-users. We also discuss how Hopsworks provides TensorFlow-as-a-Service with Distributed TensorFlow and Yahoo’s TensorFlowOnSpark. Users can debug applications using Tensorboard and SparkUI, examine logs and monitor training. Moreover, we will show how Hopsworks simplifies discovering and downloading huge datasets between Hopsworks clusters using a custom peer-to-peer sharing tool. Users can, within minutes, install Hopsworks, discover curated important datasets and download them to either apply their business logic with a streaming application or train deep neural networks using TensorFlow. We will also discuss our experiences running Streaming-as-a-Service and TensorFlow-as-a-Service on a cluster in Sweden with over 200 users (as of mid 2017).
Theofilos Kakantousis is a co-founder of Logical Clocks AB, the main developers of Hops Hadoop (www.hops.io). He received his MSc in Distributed Systems from KTH in 2014. He has previously worked as a middleware consultant at Oracle, Greece, as well as a research engineer at SICS Swedish ICT, Stockholm. He frequently gives talks on Hops Hadoop, and has presented Hops at venues such as Strata San Jose/New York and Big Data Tech Warsaw.
Spring RabbitMQ for High load
In this session we will provide a practical overview of the support that the Spring framework provides for the AMQP protocol and in particular the utilities provided by the framework for integration with the RabbitMQ message broker. We will first discuss what makes RabbitMQ such a powerful and widely deployed message broker, along with a demo, and in the second part how the Spring framework provides support for RabbitMQ, along with a second demo.
Martin is an IT consultant, Java enthusiast and has been heavily involved in the activities of the Bulgarian Java User group (BG JUG). His areas of interest include the wide range of Java-related technologies (such as Servlets, JSP, JAXB, JAXP, JMS, JMX, JAX-RS, JAX-WS, Hibernate, Spring Framework, Liferay Portal and Eclipse RCP), cloud computing technologies, cloud-based software architectures, enterprise application integration, relational and NoSQL databases. You can reach him for any Java and FOSS-related topics (especially Eclipse and the OpenJDK). Martin is also a regular speaker at Java conferences and helps with the organization of the jPrime conference.
Feature importance in ensemble methods: understanding the prediction thanks to your variables
Ensemble methods are extremely performant in terms of prediction, but lack easy interpretation. Feature importance is not only a matter of counting how many times a feature has been used in a weak learner, but also of how much this feature contributes to the result. Moreover, feature importance is strongly linked to the problem at stake (regression, classification) and the algorithm used. Constant has mainly focused his work on gradient boosting implementations, and provides a relevant metric for feature importance and prediction interpretation for several typical use cases. He also benchmarks this metric against other agnostic approaches.
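The distinction the abstract draws, between how often a feature is used and how much it contributes, can be illustrated with a small sketch. This is not the metric presented in the talk: the split list, the feature names and both function names are hypothetical, and the two measures are simply the generic split-count and gain-based importances that gradient boosting libraries commonly report.

```python
from collections import defaultdict

# Hypothetical splits collected from the weak learners of a boosted ensemble:
# (feature used by the split, loss reduction achieved by the split).
splits = [
    ("age", 0.10), ("age", 0.05), ("age", 0.05),
    ("income", 0.60),
    ("region", 0.20),
]


def split_count_importance(splits):
    """Share of splits that use each feature (how often it is used)."""
    counts = defaultdict(int)
    for feature, _gain in splits:
        counts[feature] += 1
    total = sum(counts.values())
    return {feature: count / total for feature, count in counts.items()}


def gain_importance(splits):
    """Share of the total loss reduction attributed to each feature (how much it contributes)."""
    gains = defaultdict(float)
    for feature, gain in splits:
        gains[feature] += gain
    total = sum(gains.values())
    return {feature: gain / total for feature, gain in gains.items()}


by_count = split_count_importance(splits)
by_gain = gain_importance(splits)

# The two rankings disagree: "age" is used most often, "income" contributes most.
print(max(by_count, key=by_count.get), max(by_gain, key=by_gain.get))
```

In this toy data the most frequently used feature is not the one that reduces the loss the most, which is why counting splits alone can mislead.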
Constant is strongly interested in the creation of value out of data, and he is currently experiencing one of the most satisfying mind shifts of the past 10 years. Companies start to realize that data is not a useless expense to build on, but a real opportunity to assess their results, find insights in their process failures, reclaim their expertise and probably evolve to a more sustainable business. He wants to help those who believe in such a potential by accelerating their transition toward a data-driven company. In order to address these new problems, Constant focuses on mastering every skill of a complete Data Geek: architecture expertise (data, applications, network), data science mastery (statistical learning, data visualisation, algorithmic theory), customer and business understanding (model prediction consumption, business metrics, customer needs). He has been working for about two years for OCTO Technology in the best Big Data team in France. Constant is an expert in the industry sector and he works on several types of missions, ranging from predictive maintenance of production sites, to prediction of critical KPIs in video games, via real-time monitoring of manufacturing devices. Prior to joining OCTO, he was working as a researcher at Data61 (formerly known as NICTA), the best research institute in ICT in Australia, on applying Machine Learning to profile GUI users and provide the best amount of information to help them make a decision based on a machine learning prediction. Constant published his work in two major conferences: CHI WIP 2015 and OzCHI ’15.
Kafka in a microservice world
When designing a new system, scalability, fault tolerance, high availability and real-time processing are must-haves. All of these can be answered with μservices. They all have to communicate with each other, but communication between them can be a bottleneck. This talk will try to show you why Apache Kafka is a great candidate for this.
Laszlo-Robert is an enthusiastic Java developer. He has worked hard on backend systems over the last 10+ years, solving technical challenges of a broad range of enterprise Java applications as a developer. During these years he has had the chance to learn to handle Java’s performance in ways most people never thought of. About 4 years ago he was asked to teach inexperienced developers the Spring Framework. This was the journey that introduced him to speaking in front of an audience, and he liked it, so after that he started holding tons of workshops for his colleagues. Having given several presentations on different subjects, like OSGi, Eclipse RCP, Microservices and lately Apache Kafka, Laszlo-Robert finally submitted a proposal to an international conference too, and he was accepted, so he continued. Since then he has presented or been accepted as a speaker at several conferences and he would like to continue to share his knowledge. Laszlo-Robert also has an MSc in Computer Science from Babes-Bolyai University of Cluj-Napoca.
Geospatial Analytics at Scale
Data’s spatial context can be a very important variable in many applications. Massive volumes of spatial data are generated on a daily basis – from cell phone usage, commuting from home to work, by taxi services, airplanes, drones, various sensors and logs, etc. Geospatial data gives us very important insights into customer behavior and various movement trends, which can be important information in decision making and for various optimizations. In order to benefit from geospatial context in our applications, we need to be able to efficiently parse geospatial datasets at scale, and use them together with other available data sources and information. There is a limited number of open source tools that provide an efficient way to parse and query geospatial data, which makes utilization of geospatial information in Business Intelligence and Predictive Modelling quite a challenge. The main focus of my talk will be utilizing geospatial data at scale based on Apache Spark and the Magellan library – an approach that we are using within our applications.
Milos Milovanovic is a Co-Founder and Data Engineer of Things Solver, a company based in Belgrade, Serbia. He is also a Co-Founder of the Data Science Serbia nonprofit organization, whose main goal is education in the Data Science field. His current focus is analytics at scale. He has a strong technical background in Data Engineering – ranging from Data Collection, Cleansing, Data Preparation and Analytics, to serving the data to Data Scientists so they can easily query it. On a daily basis he uses Big Data technologies such as Apache Spark, Hadoop, Hive, Elasticsearch, Python, PostgreSQL, Airflow… In his professional career, he has worked on many production-scale projects in the Telco, Finance and Retail industries.
Distributed Deep Learning with TensorFlow and Kubernetes
Training (Deep) Neural Networks can quickly become a very time consuming task. Training a single network can easily take hours/days. Even with ever increasing CPU/GPU speed, using a single machine becomes cumbersome. Fortunately distributed training is a feature of TensorFlow which can drastically speed up training.
Science. His main interest is to play with exciting and evolving technologies around orchestration, automation and Machine Learning. Currently he helps a large enterprise run a multi-tenant Kubernetes cluster as a cluster operator.
Advanced search for your legacy application
How do you mix SQL and NoSQL worlds without starting a messy revolution? This live coding talk will show you how to add Elasticsearch to your legacy application without changing all your current development habits. Your application will suddenly have advanced search features, all without the need to write complex SQL code! David will start from a RestX, Hibernate and PostgreSQL/MySQL based application and will add a complete integration of Elasticsearch, all live from the stage during his presentation.
David Pilato is a Developer and Evangelist at Elastic and the creator of the French-speaking User Group. In his free time, he likes talking about Elasticsearch at conferences or in companies (Brown Bag Lunches).
Does the P in your RPA project still stand for Painful?
Robotic Process Automation is an innovative and cost-efficient way to improve and optimize enterprise business processes. However, the way to a successful RPA project is paved with many rough stones – identification of the right processes and suitable automation points, manual software robot programming and of course monitoring and evaluation of the result – all of this driven by manual analysis and consulting. This presentation will show you how big data collected when monitoring user behavior can be leveraged by process mining technology to overcome these issues and make your RPA project implementation a pleasant journey instead of a painful experience.
As Product Visionary for Minit, Michal defines the Research & Development direction for this process mining solution, develops close ties to the academic community in this area and evangelizes process mining benefits to enterprises worldwide. Michal previously led the Microsoft Consulting department at Siemens and was involved in several large enterprise projects as a consultant and project manager. In his free time, he is a passionate trail runner.
Dawid Wysakowicz, Adam Kawa
Streaming analytics better than batch – when and why
While a lot of problems can be solved in batch, the stream processing approach currently gives you more benefits. It’s not only sub-second latency at scale, but mainly the possibility … to express accurate analytics with little effort – something that is hard or usually ignored with older batch technologies like Pig, Scalding, Spark or even established stream processors like Storm or Spark Streaming. In this talk we’ll use a real-world example of user session analytics (inspired by Spotify) to give you a use-case driven overview of business and technical problems that modern stream processing technologies like Flink help you solve, and the benefits you can get by using them today for processing your data as a stream.
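To make the session analytics use case concrete, here is a minimal sketch of gap-based sessionization on event time. It is a batch-style illustration written for this page, not code from the talk and not the Flink API: the function name, the 30-minute gap and the sample events are all assumptions. A stream processor performs the same grouping incrementally and additionally has to decide when late, out-of-order events can no longer arrive.

```python
from collections import defaultdict


def sessionize(events, gap_seconds=1800):
    """Group (user, event_time) pairs into sessions closed by `gap_seconds` of inactivity.

    Returns {user: [(session_start, session_end, event_count), ...]}.
    """
    by_user = defaultdict(list)
    for user, event_time in events:
        by_user[user].append(event_time)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()  # order by event time, not by arrival order
        user_sessions = [[stamps[0], stamps[0], 1]]
        for event_time in stamps[1:]:
            current = user_sessions[-1]
            if event_time - current[1] > gap_seconds:
                # Inactivity gap exceeded: start a new session.
                user_sessions.append([event_time, event_time, 1])
            else:
                # Still within the gap: extend the current session.
                current[1] = event_time
                current[2] += 1
        sessions[user] = [tuple(session) for session in user_sessions]
    return sessions


# Hypothetical events as (user, event time in seconds); one event arrives out of order.
events = [("alice", 0), ("alice", 600), ("bob", 100), ("alice", 4000), ("alice", 300)]
result = sessionize(events)
print(result)
```

Sorting by event time is what keeps the late `("alice", 300)` event in the first session; processing strictly in arrival order would miscount it.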
Dawid Wysakowicz works as a Data Engineer at GetInData, helping people and companies succeed with Apache Flink. He actively participates in the Flink community, which resulted in him becoming a committer. He first became interested in Big Data technologies in 2015 while writing his Master's thesis on a Distributed Genomic Data Warehouse. He recently helped to extract value from large datasets at mBank. Adam Kawa became a fan of Big Data after implementing his first Hadoop job in 2010. Since then he has been working with Hadoop at Spotify (where he proudly operated one of the largest and fastest-growing Hadoop clusters in Europe for two years), Truecaller, Authorized Cloudera Training Partner and finally now at GetInData. He works with technologies like Hadoop, Hive, Spark, Flink, Kafka, HBase and more. He has helped a number of companies ranging from fast-growing startups to global corporations. Adam regularly blogs about Big Data and he is also a frequent speaker at major Big Data conferences and meetups. He is the co-founder of Stockholm HUG and the co-organizer of Warsaw HUG.
Monitoring the unknown, 1000*100 series a day
How do you monitor unknown third-party code? One of the hardest challenges we face running Clever Cloud, apart from the impressive scale we face with hundreds of new applications per week, is the monitoring of unknown tech stacks. The first goal of rebuilding the monitoring platform was to accommodate the immutable infrastructure pattern that generates lots of ephemeral hosts every minute. The traditional approach is to focus on VMs or hosts, not applications. We needed to shift this into an approach of auto-discovery of metrics to monitor, allowing third-party code to publish new items. This talk explains our journey in building the Clever Cloud Metrics stack, heavily based on Warp10 (Kafka/Hadoop/Storm based), to deliver developer efficiency and trustability to our clients' applications.
Quentin Adam is the CEO of Clever Cloud: an IT automation company, running a Platform as a Service allowing you to run java, scala, ruby, node.js, php, python or go applications, with auto scaling and auto healing features. This position allows him to study lots of applications, code and practices, and to extract some talks and advice. A regular speaker at various tech conferences, he’s focused on helping developers deliver good applications quickly and happily.
Full-day Workshops - November 29 (10:00-17:30)
Workshops will take place at the University of Applied Social Sciences (Socialiniu mokslu kolegija), Kalvariju str. 137E, LT-08221 Vilnius, Lithuania.
BUILDING RESILIENT SYSTEMS FOR HIGH TRAFFIC WITH ERLANG & ELIXIR (ENG)
Máté Marjai, Ireland
Groups of 10+ attendees will receive an additional 10% discount. To request an invoice for a Full ticket, please contact us at email@example.com
For more information, please call +370 618 00 999
Become a Volunteer
We are looking for enthusiastic supporters to help at the upcoming Big Data Conference Vilnius 2017 on November 29-30. All accepted volunteers will receive a free ticket for the conference and much more in return. To become a Helper, please fill in the registration form.
Become a Sponsor
You are invited to be a part of an exciting event: actively contribute to the success of Big Data Conference Vilnius 2017, target a specific, high profile market and reinforce your brand’s presence by making yourself known among Big Data, High Load, Data Science and Machine Learning experts. Do not miss out on the opportunity to be noticed and get involved in this event.
UNIVERSITY OF APPLIED SOCIAL SCIENCES
Kalvarijų str. 137E, LT–08221 Vilnius