Machine Learning is transforming the way businesses chart their path forward. Organizations across industry verticals are leveraging the technology to accelerate decision-making, mitigate risk and enhance customer experience. However, Machine Learning requires the appropriate underlying technology foundation, such as a Data Lake on AWS, which breaks down data silos and scales up or down easily.
This blog covers a use case in which one of our customers implemented a Data Lake on the AWS cloud and used it to build predictive ML models.
At present, the customer uses its own trucks to transport finished vehicles for OEMs in India. These trucks are modern and technology-enabled, fitted with sensors such as GPS, fuel, engine telematics, door, FASTag, and tyre pressure and temperature sensors. This sensor data was stored in a Cassandra database.
The customer was facing the following challenges:
The customer was looking for a solution that could create ML models to predict shipment arrival times and help plan and schedule other shipments. They also wanted the solution to be cost-effective enough to justify the ROI from the analyzed data.
The team at MIND analyzed the problem and the associated data, and identified a Data Lake on AWS as the solution to handle the scale, volume and variety of the data. A cost-effective approach using AWS managed services was proposed and implemented. The following are some of the key features of the implemented solution:
Fig 1. Architecture Diagram -1
Fig 2. Architecture Diagram-2
Before arriving at the final solution, we researched the multiple approaches described below. Each of them ran into challenges that made it unworkable. Let's dig deeper into these methodologies and see why they failed.
AWS Database Migration Service (DMS) helps set up and manage a replication instance on AWS. In this approach, we explored DMS for the incremental transfer of data and used the AWS Schema Conversion Tool for setup on the database side. However, DMS could perform Change Data Capture (CDC) only into Amazon DynamoDB and could not carry out incremental data transfer to Amazon S3, which was the customer's primary requirement.
Moreover, Cassandra is not available as a source in DMS as of now. Due to this lack of direct support for Cassandra, we had to drop the DMS approach.
Debezium Server provides a ready-to-use application that streams change events from a source database to messaging infrastructure such as Amazon Kinesis, Google Cloud Pub/Sub or Apache Pulsar. We explored Debezium Server for the incremental transfer of data from Cassandra to S3 to build the data lake. Interestingly, we were able to make this work by streaming the data from Kinesis into S3.
However, as per the official Debezium documentation, Debezium Server is incubating, which implies that the semantics, configuration options, etc. might change in future revisions; it is therefore not recommended for production use.
The Cassandra connector is capable of monitoring a Cassandra cluster and recording all row-level changes. To do this, the connector must be deployed locally on each node of the cluster. As soon as the connector connects to a Cassandra node, it takes a snapshot of all CDC-enabled tables in all keyspaces. The connector then reads the changes written to the Cassandra commit logs and creates the corresponding insert, update and delete events. All events for each table are recorded in a separate Kafka topic, where applications and services can consume them.
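The per-table change events described above can be consumed downstream by any Kafka client. The sketch below interprets one such event in Python; the envelope fields follow Debezium's common format ("op", "before", "after", "source"), but the exact operation codes vary by connector version, and the table and column names here are made up for illustration.

```python
import json

# Map Debezium operation codes to readable names. Both the generic
# codes ("c"/"u"/"d"/"r") and the Cassandra connector's insert code
# ("i") are covered; actual values depend on the connector version.
OP_NAMES = {"c": "insert", "i": "insert", "r": "snapshot",
            "u": "update", "d": "delete"}

def interpret_event(raw):
    """Turn a raw Debezium change-event envelope into (op, table, row)."""
    event = json.loads(raw)
    op = OP_NAMES.get(event["op"], "unknown")
    table = event["source"]["table"]
    # For deletes, "after" is null and "before" holds the old row.
    row = event["after"] if event["after"] is not None else event.get("before")
    return op, table, row

# Example envelope for a row inserted into a hypothetical telemetry table.
sample = json.dumps({
    "op": "c",
    "source": {"table": "truck_telemetry"},
    "before": None,
    "after": {"truck_id": "TRK-101", "fuel_pct": 64.5},
})
```

A consumer service would run `interpret_event` on each message from the table's topic and apply the resulting insert, update or delete to the target store.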
This approach also requires an initial setup on the client side, and since the semantics and configuration API keep changing, we could not risk using it for a production deployment.
AWS Data Pipeline helps users define automated workflows for the movement and transformation of data. In other words, it reliably processes and moves data between different AWS compute and storage services, as well as on-premises data sources.
We implemented an AWS Data Pipeline solution that orchestrated all the activities: launching the EMR cluster, running the job flow for the incremental data transfer, retrying on failure, and sending a notification on a successful run or in case of a failure.
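The retry-and-notify behaviour the pipeline gave us can be summed up in a few lines: run a step, retry it a fixed number of times, and report success or failure. The sketch below is a plain-Python illustration of that pattern, not an AWS API; the function and step names are assumptions for illustration.

```python
import time

def run_with_retry(step, retries=3, delay_s=0, notify=print):
    """Run a callable step, retrying up to `retries` times.

    Sends a notification (here just a callable, standing in for
    SNS/e-mail) on each failed attempt and on success or final failure.
    """
    for attempt in range(1, retries + 1):
        try:
            result = step()
            notify(f"success on attempt {attempt}")
            return result
        except Exception as exc:
            notify(f"attempt {attempt} failed: {exc}")
            time.sleep(delay_s)
    notify("giving up after all retries")
    raise RuntimeError("all retries exhausted")
```

In the actual pipeline, the "step" would be an EMR job-flow activity and the notification target an SNS topic; Data Pipeline wires up the equivalent of this loop declaratively.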
However, this service is not available in the Mumbai region, where the database is located, so it incurred inter-region data transfer costs. To eliminate this cost overhead, we decided not to move forward with this approach and instead used other services for the same purpose.
All in all, to overcome the challenges in the approaches discussed above, we used the Spark Cassandra Connector by DataStax to access the customer's data in Cassandra from EMR on AWS.
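A hedged sketch of this final approach is shown below: a PySpark job on EMR reads a Cassandra table through the DataStax connector and writes partitioned Parquet to S3. The keyspace, table, partition column and S3 path are made-up placeholders, and the cluster would need the connector package available (e.g. via `--packages com.datastax.spark:spark-cassandra-connector_2.12`).

```python
def cassandra_read_options(host, keyspace, table):
    """Options for spark.read.format("org.apache.spark.sql.cassandra")."""
    return {
        "spark.cassandra.connection.host": host,
        "keyspace": keyspace,
        "table": table,
    }

def export_to_s3(spark, host, keyspace, table, s3_path):
    """Pull a Cassandra table via the DataStax connector and land it
    in S3 as partitioned Parquet (sketch; names are illustrative)."""
    df = (spark.read.format("org.apache.spark.sql.cassandra")
          .options(**cassandra_read_options(host, keyspace, table))
          .load())
    # Partitioned Parquet keeps later Athena scans small and cheap.
    df.write.mode("append").partitionBy("event_date").parquet(s3_path)
```

In our setup, a job along these lines handled both the one-time bulk load and the recurring incremental loads, with the Glue catalog providing the schema for Athena queries over the Parquet output.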
The Final Note
Overall, in this blog, we learnt that data partitioning improved query performance and that correctly sizing the nodes of the EMR cluster is required for data pre-processing. The Parquet format also gave us several benefits, such as needing less space for data storage. Moreover, we showed how we ingested a one-time 7 GB load and incremental 600 MB loads from Cassandra into S3, creating a data lake in S3 with Glue holding its corresponding schema. We then analyzed the data in S3 by running queries through Athena, having successfully created a data lake on AWS from the Cassandra database.
This led to a 90% improvement in data analysis and query time and a 20% cost saving from improved driver utilization. There was also a 15% improvement in the on-time shipment of vehicles, which further improved customer satisfaction and reduced delivery penalties.
With the help of the machine learning model, the ETA (estimated time of arrival) predicted using the Random Forest algorithm increased overall customer satisfaction, which in turn brought many intangible benefits. For instance, it helped build trust in the brand, which eventually leads to customer loyalty.
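An ETA model of this kind can be sketched with scikit-learn's Random Forest, as below. The features (trip distance and average speed) and the synthetic training data are assumptions for illustration only; the blog does not publish the real feature set or data.

```python
from sklearn.ensemble import RandomForestRegressor

def train_eta_model(distances_km, avg_speeds_kmph, eta_hours):
    """Fit a Random Forest mapping (distance, avg speed) to ETA in hours."""
    X = list(zip(distances_km, avg_speeds_kmph))
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X, eta_hours)
    return model

# Synthetic trips: ETA is roughly distance divided by average speed.
dist, speed = [], []
for d in range(100, 1001, 50):
    for s in (40, 50, 60):
        dist.append(d)
        speed.append(s)
eta = [d / s for d, s in zip(dist, speed)]

model = train_eta_model(dist, speed, eta)
pred = model.predict([[300, 50]])[0]  # a 300 km trip at ~50 km/h
```

A production model would draw on the richer sensor features in the data lake (telematics, route, traffic history) rather than two toy inputs.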
Dr. Bishan Chauhan – AI / ML Practice Head
Umesh Taneja – Senior Cloud Engineer, AI / ML Practice
Pulkit Garg – Data Scientist, AI / ML Practice