A patent is a type of intellectual property that gives its owner the legal right to exclude others from making, using, or selling an invention for a limited period of time, in exchange for publishing an enabling disclosure of the invention.
There are three types of patents:
- Utility Patents: Granted to anyone who invents or discovers any new and useful process, machine, article of manufacture, or composition of matter, or any new and useful improvement thereof.
- Design Patents: Granted to anyone who invents a new, original, and ornamental design for an article of manufacture.
- Plant Patents: Granted to anyone who invents or discovers and asexually reproduces any distinct and new variety of plant.
About 12 years ago, IBM held the GPFS patent and said you can’t have a truly distributed file system, and thus all the work was done on a hypervisor-like layer technology, the Java Virtual Machine (JVM). To overcome this compute limitation, and with the volumes of data to come, companies had to move beyond the JVM compute level and create plans to have machine learning features and run sets under one of the many artificial intelligence frameworks to come.
In the early days, both leading chip manufacturers, Intel and AMD, invested over $1.5B in ownership of Hadoop and HDFS. Somewhat anticipating the data explosion ahead, they ultimately aimed to design a distributed file system (like HDFS) on the IC chipset itself, in order to meet demand and exceed the compute limitations of a JVM-based distributed computing cluster.
Today, AWS, Intel, and others are racing down this IC chipset path in IC design and manufacturing. Basically, they are taking the concepts of Hadoop and Big Data and building them into the chipset and CPU architecture.
Patents are a big deal and big money. Consider, for example, the little company that never really has to sell, but always sells. IBM, for instance, has been awarded over 7,000 patents. From this we can see that companies like AWS, and even Qubole, think the same way.
Qubole is an Open Data Lake Platform founded in 2011 by Big Data pioneers from Facebook and the co-creators of Apache Hive. Over the last decade, we have had huge market success in many verticals and industries. As a cloud-agnostic platform, supporting the major cloud providers (with AWS leading the charge), we have multi-engine support including the best of breed in Apache Spark, Presto, Hive, and Airflow. This level of flexibility and support is a huge differentiator and is a USP of Qubole since no other player in the market can offer the one-stop solution that Qubole can.
This Open Data Lake Platform gives you the flexibility to run many workloads and use cases, such as Sentiment Analysis, 360 Degree Customer View, Ad Hoc Analysis, Real-time Analysis, Multichannel Marketing, Clickstream analysis, and much more. Some of the major benefits you receive as a client are scalable performance enhancements, cost-effectiveness, and, ultimately, help in controlling cloud spending.
Let’s take a look at some of the patents assigned to Qubole and understand their benefits to customers:
- Systems and methods for auto-tuning big data workloads on cloud platforms
This invention is generally directed to systems and methods of automatically tuning big data workloads across various cloud platforms, the system being in communication with a cloud platform and a user, the cloud platform including data storage, and a data engine. The system may include a system information module in communication with the cloud platform; a static tuner in communication with the system information module; a cloud tuner in communication with the static tuner and the user; and an automation module in communication with the cloud tuner.
Methods may include extracting information impacting or associated with the performance of the big data workload from the cloud platform; determining recommendations based at least in part on the information extracted; iterating through different hardware configurations to determine the optimal hardware and data engine configuration; and applying the determined configuration to the data engine.
- Systems and methods for scheduling and running interactive database queries with service level agreements in a multi-tenant processing system
This invention is directed to systems and methods for scheduling interactive database queries from multiple tenants onto distributed query processing clusters with Service Level Agreements (SLAs). SLAs may be provided through a combination of estimation of resources per query followed by scheduling of that query onto a cluster if enough resources are available or triggering proactive autoscaling to spawn new clusters if they are not.
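The estimate-then-schedule-or-autoscale flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the `Cluster` class, the abstract "resource units", the toy `estimate_resources` heuristic, and the `new_cluster_capacity` parameter are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    free_resources: int  # abstract resource units (an assumption for this sketch)


def estimate_resources(query: str) -> int:
    """Toy estimator (hypothetical): one unit per 10 characters of query text."""
    return max(1, len(query) // 10)


def schedule(query: str, clusters: list, new_cluster_capacity: int = 100) -> Cluster:
    """Place a query on a cluster with enough free resources, or spawn a new one."""
    needed = estimate_resources(query)

    # Schedule onto the first existing cluster that has room for the estimate.
    for cluster in clusters:
        if cluster.free_resources >= needed:
            cluster.free_resources -= needed
            return cluster

    # No cluster has room: autoscale by spawning a new cluster that is
    # at least large enough to hold this query.
    capacity = max(new_cluster_capacity, needed)
    new_cluster = Cluster(name=f"cluster-{len(clusters) + 1}", free_resources=capacity)
    clusters.append(new_cluster)
    new_cluster.free_resources -= needed
    return new_cluster
```

A real system would estimate resources from query plans and history rather than query length; the point of the sketch is only the branch between scheduling on available capacity and triggering autoscaling.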
In some embodiments, systems may include a workflow manager; a resource estimator cluster; one or more execution clusters; and one or more meta-stores. A workflow manager may include an active node and a passive node configured to send a query to the resource estimator cluster and receive a resource estimate. A resource estimator cluster may be in communication with the workflow manager. One or more execution clusters may be scaled by the workflow manager as part of a schedule or autoscale based on workload.
- Caching framework for big-data engines in the cloud
This invention is generally directed to a caching framework that provides a common abstraction across one or more big data engines, comprising a cache file system including a cache file system interface used by applications to access cloud storage through a cache subsystem. The cache file system interface is in communication with a big data engine extension and a cache manager; the big data engine extension provides cluster information to the cache file system and works with the cache file system interface to determine which nodes cache which part of a file; and a cache manager for maintaining metadata about the cache, the metadata comprising the status of blocks for each file.
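As a rough illustration of the two ideas above (deciding which node caches which part of a file, and tracking per-file block status), here is a minimal sketch. Everything specific in it is an assumption made for illustration: the 1 MiB `BLOCK_SIZE`, the hash-modulo placement in `owner_node`, and the `CacheManager` method names do not come from the patent text.

```python
import hashlib

BLOCK_SIZE = 1024 * 1024  # hypothetical block size: 1 MiB


def blocks_for_range(offset, length):
    """Return the block indices touched by a read of `length` bytes at `offset`."""
    if length <= 0:
        return []
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return list(range(first, last + 1))


def owner_node(path, block_index, nodes):
    """Map one block of a file to a node (assumed policy: hash modulo node count)."""
    key = f"{path}:{block_index}".encode("utf-8")
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return nodes[digest % len(nodes)]


class CacheManager:
    """Keeps metadata about the cache: which blocks of each file are cached."""

    def __init__(self):
        self._cached = {}  # path -> set of cached block indices

    def mark_cached(self, path, block_index):
        self._cached.setdefault(path, set()).add(block_index)

    def is_cached(self, path, block_index):
        return block_index in self._cached.get(path, set())

    def missing_blocks(self, path, offset, length):
        """Blocks in the requested range that still need to be fetched from cloud storage."""
        return [b for b in blocks_for_range(offset, length)
                if not self.is_cached(path, b)]
```

Because placement depends only on the file path and block index, every engine that asks the same question gets the same answer, which is one simple way to let engines share cached data.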
The invention may provide a common abstraction across big data engines that does not require changes to the setup of infrastructure or user workloads, allows sharing of cached data, caches only the parts of files that are required, and can process columnar formats.
- Task packing scheduling process for long-running applications
In general, this invention is directed to systems and methods of distributing tasks amongst servers or nodes in a cluster in a cloud-based big data environment. This includes:
- establishing a high_server_threshold
- dividing active servers/nodes into at least three categories:
  - high usage servers, comprising servers on which usage is greater than the high_server_threshold
  - medium usage servers, comprising servers on which usage is less than the high_server_threshold, but is greater than zero
  - low usage servers, comprising servers that are currently not utilized
- receiving one or more tasks to be performed
- scheduling the tasks
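The categorization step above can be sketched in a few lines. The three categories follow the text, but several details are assumptions made for illustration: usage is modeled as a fraction between 0.0 and 1.0, a server exactly at the threshold is treated as medium, and the placement order in `pick_server` (medium first, then low, then high) is a guess at a packing policy, since the text only says that tasks are scheduled.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    usage: float  # fraction of capacity in use, 0.0 to 1.0 (assumed unit)


def categorize(servers, high_server_threshold):
    """Split servers into high, medium, and low usage categories."""
    groups = {"high": [], "medium": [], "low": []}
    for server in servers:
        if server.usage > high_server_threshold:
            groups["high"].append(server)
        elif server.usage > 0:
            # Usage exactly equal to the threshold lands here (an assumption;
            # the text leaves that boundary undefined).
            groups["medium"].append(server)
        else:
            groups["low"].append(server)
    return groups


def pick_server(servers, high_server_threshold):
    """Pick a server for one task.

    Assumed policy, not stated in the text: prefer medium usage servers,
    then low, then high, choosing the busiest server within the category
    so that work packs onto fewer machines.
    """
    groups = categorize(servers, high_server_threshold)
    for category in ("medium", "low", "high"):
        if groups[category]:
            return max(groups[category], key=lambda s: s.usage)
    return None
```

Under this assumed policy, unused servers stay idle as long as a medium usage server is available, which is what would let an autoscaler remove them.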
- High-performance Hadoop with new generation instances
This invention is generally directed to a distributed computing system comprising a plurality of computational clusters and instances. Each instance comprises a local instance of data storage in communication with reserved disk storage, wherein the processing hierarchy provides priority to local instance data storage before providing priority to reserved disk storage.
- Heterogeneous auto-scaling big-data clusters in the cloud
This invention is directed to systems and methods of provisioning and using heterogeneous clusters in a cloud-based big data system, where the heterogeneous clusters are made up of primary instance types and different types of instances. The method includes: determining if there are composition requirements of any heterogeneous cluster, the composition requirements defining instance types permitted for use; determining if any of the permitted different types of instances are required or advantageous for use; determining the number of different types of instances to utilize, based at least in part on an instance weight; and provisioning the heterogeneous cluster comprising both primary instances and permitted different types of instances.
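To make the role of an instance weight concrete, here is a minimal sketch under one possible interpretation: the weight expresses how many primary instances one instance of another type is worth. That interpretation, the function names, and the example instance type names are all assumptions for illustration; the text only says the count is based at least in part on an instance weight.

```python
import math


def instances_needed(target_primary_count, instance_weight):
    """Number of instances of another type needed to replace the given
    number of primary instances, assuming weight = primary-equivalents
    per instance (an interpretation, not taken from the text)."""
    if instance_weight <= 0:
        raise ValueError("instance_weight must be positive")
    return math.ceil(target_primary_count / instance_weight)


def plan_alternatives(target_primary_count, permitted_weights):
    """For each permitted instance type, compute how many instances would
    cover the same capacity as `target_primary_count` primary instances.

    `permitted_weights` maps instance type name -> weight and stands in for
    the cluster's composition requirements.
    """
    return {
        instance_type: instances_needed(target_primary_count, weight)
        for instance_type, weight in permitted_weights.items()
    }
```

For example, with hypothetical types `{"large": 2.0, "small": 0.5}`, covering 10 primary instances would take 5 `large` or 20 `small` instances under this interpretation.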