Requirements

Let's dive into building a simplified version of a Wide Column Store, similar in concept to Cassandra. A Wide Column Store differs significantly from a relational database. Instead of rows with predefined columns, it uses column families that group related columns together. Think of it as a map of maps: a keyspace contains column families, which contain rows, which contain columns. Each row can have a different set of columns, which makes the schema flexible.

This implementation will include support for secondary indexes, a crucial feature for querying data efficiently by values other than the primary key. Without a secondary index, such a query has to scan every row in the column family, which becomes prohibitively slow on large datasets.

Imagine storing user profile data. A column family might be `users`. Each row would represent a user, keyed by `user_id`. Columns could include `name`, `email`, `age`, `city`, etc. With a secondary index on `city`, you can quickly find all users in a specific city.

This problem focuses on the *in-memory* data structure representation, the handling of concurrent operations, and the design of secondary indexes. The core challenge lies in keeping the indexes consistent with the rows when data is updated or deleted. We will simulate the database behavior in memory for the purpose of this problem.

# Requirements

## Functional Requirements

1. **Keyspace Creation:**
    * Ability to create a keyspace with a given name. Keyspace names should be unique.
2. **Column Family Creation:**
    * Ability to create a column family within a keyspace, specifying a primary key. The primary key is a column in the row that uniquely identifies the row.
    * Column family names should be unique within a keyspace.
3. **Data Insertion:**
    * Ability to insert a row into a column family. A row consists of a primary key value and a set of columns (column name, column value pairs).
    * Data types are limited to String for simplicity.
4. **Data Retrieval:**
    * Ability to retrieve a row from a column family given its primary key.
5. **Data Update:**
    * Ability to update existing columns in a row within a column family.
6. **Data Deletion:**
    * Ability to delete an entire row from a column family given its primary key.
    * Ability to delete one or more columns within a specific row of a column family.
7. **Secondary Index Creation:**
    * Ability to create a secondary index on a specific column within a column family. You can assume only one secondary index per column family for simplicity.
    * The secondary index should efficiently map a column value to the primary keys of rows containing that value.
8. **Indexed Data Retrieval:**
    * Ability to retrieve rows from a column family based on a value in the indexed column. This should use the secondary index for the lookup rather than scanning all rows.
9. **Secondary Index Maintenance:**
    * When data is inserted, updated, or deleted, the secondary index should be automatically updated to reflect the changes. The index must remain consistent.

## Non-Functional Requirements

1. **Thread Safety:** The data store should be thread-safe. Multiple threads should be able to concurrently insert, update, delete, and retrieve data without data corruption or race conditions. Use appropriate synchronization mechanisms (e.g., locks, concurrent data structures).
2. **Extensibility:** The design should be extensible to support new data types, indexing strategies, and query operations without requiring major code changes. Consider using interfaces and abstract classes to define extension points.
3. **Modularity:** The code should be well-modularized, with clear separation of concerns between components such as keyspace management, column family management, data storage, and index management.
4. **Testability:** The code should be easy to test, with unit tests covering all major functionality and edge cases. Use dependency injection or other techniques to facilitate mocking and testing of individual components.
5. **Performance:** While this is an in-memory implementation, efficiency is still important. Operations involving secondary indexes should avoid full table scans. Choose appropriate data structures for indexes to enable fast lookups.
6. **Error Handling:** Handle potential errors gracefully, such as attempting to create a keyspace or column family with a duplicate name, or attempting to access a non-existent keyspace or column family. Throw appropriate exceptions to indicate errors.

## Core Entities

1. **`Keyspace`:** Represents a namespace for column families. It should manage the creation, deletion, and retrieval of column families.
2. **`ColumnFamily`:** Represents a collection of rows. It manages the storage and retrieval of rows, as well as the creation and management of secondary indexes. It stores the data rows as `Map<String, Map<String, String>>`.
3. **`Row`:** Represents a single row in a column family. It consists of a primary key and a set of columns (column name, column value pairs). Internally represented as `Map<String, String>`.
4. **`Column`:** Represents a single column in a row. It consists of a column name and a column value.
5. **`PrimaryKey`:** Represents the unique identifier for a row within a column family. In our case, this is a String.
6. **`ColumnName`:** Represents the name of a column. In our case, this is a String.
7. **`ColumnValue`:** Represents the value of a column. In our case, this is a String.
8. **`SecondaryIndex`:** An interface for secondary index implementations. Different indexing strategies can be implemented behind this interface (e.g., hash-based, tree-based).
9. **`HashIndex`:** A concrete implementation of the `SecondaryIndex` interface, using a hash map to store the index data. It maps column values to a set of primary keys.
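To make the index-maintenance requirement concrete, here is a minimal Java sketch of how `ColumnFamily`, `SecondaryIndex`, and `HashIndex` could fit together. It is one possible shape under stated assumptions, not a reference solution. The method names (`upsert`, `get`, `deleteRow`, `deleteColumns`, `createIndex`, `findByIndexedValue`) are hypothetical. Insert and update are folded into a single `upsert`, the primary key is kept as the map key rather than as a column inside the row, and thread safety comes from coarse-grained `synchronized` methods rather than finer-grained locks or concurrent collections.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Extension point: other strategies (e.g. tree-based) could implement this. */
interface SecondaryIndex {
    void add(String value, String primaryKey);
    void remove(String value, String primaryKey);
    Set<String> lookup(String value);
}

/** Hash-based index: column value -> primary keys of rows holding that value. */
class HashIndex implements SecondaryIndex {
    private final Map<String, Set<String>> entries = new HashMap<>();

    public void add(String value, String primaryKey) {
        entries.computeIfAbsent(value, v -> new HashSet<>()).add(primaryKey);
    }

    public void remove(String value, String primaryKey) {
        Set<String> keys = entries.get(value);
        if (keys != null && keys.remove(primaryKey) && keys.isEmpty()) {
            entries.remove(value); // drop empty buckets
        }
    }

    public Set<String> lookup(String value) {
        Set<String> keys = entries.get(value);
        return keys == null ? new HashSet<>() : new HashSet<>(keys); // defensive copy
    }
}

/** Rows plus at most one secondary index; every public method holds the same monitor. */
class ColumnFamily {
    private final Map<String, Map<String, String>> rows = new HashMap<>();
    private String indexedColumn; // null until createIndex is called
    private SecondaryIndex index;

    /** Assumption: inserts a new row or merges columns into an existing one. */
    synchronized void upsert(String primaryKey, Map<String, String> columns) {
        Map<String, String> row = rows.computeIfAbsent(primaryKey, k -> new HashMap<>());
        removeFromIndex(primaryKey, row); // drop the stale index entry first
        row.putAll(columns);
        addToIndex(primaryKey, row);      // then index the current value
    }

    /** Returns a copy of the row, or null if the key is unknown. */
    synchronized Map<String, String> get(String primaryKey) {
        Map<String, String> row = rows.get(primaryKey);
        return row == null ? null : new HashMap<>(row);
    }

    synchronized void deleteRow(String primaryKey) {
        Map<String, String> row = rows.remove(primaryKey);
        if (row != null) removeFromIndex(primaryKey, row);
    }

    synchronized void deleteColumns(String primaryKey, Set<String> columnNames) {
        Map<String, String> row = rows.get(primaryKey);
        if (row == null) return;
        removeFromIndex(primaryKey, row);
        row.keySet().removeAll(columnNames);
        addToIndex(primaryKey, row); // no-op if the indexed column was deleted
    }

    synchronized void createIndex(String columnName, SecondaryIndex newIndex) {
        if (index != null) throw new IllegalStateException("index already exists");
        indexedColumn = columnName;
        index = newIndex;
        rows.forEach(this::addToIndex); // backfill rows inserted before the index
    }

    synchronized List<Map<String, String>> findByIndexedValue(String value) {
        if (index == null) throw new IllegalStateException("no secondary index");
        List<Map<String, String>> result = new ArrayList<>();
        for (String pk : index.lookup(value)) result.add(new HashMap<>(rows.get(pk)));
        return result;
    }

    private void addToIndex(String pk, Map<String, String> row) {
        if (index != null && row.containsKey(indexedColumn)) index.add(row.get(indexedColumn), pk);
    }

    private void removeFromIndex(String pk, Map<String, String> row) {
        if (index != null && row.containsKey(indexedColumn)) index.remove(row.get(indexedColumn), pk);
    }
}
```

Because the row map and the index are only ever modified together inside one monitor, a reader can never observe a row whose indexed value disagrees with the index. The trade-off is that all operations on a column family are serialized. A read-write lock or per-row locking would allow more concurrency, at the cost of more careful reasoning about index consistency. Passing the `SecondaryIndex` into `createIndex` also keeps the indexing strategy swappable and easy to mock in tests.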

Implement a wide column store like Cassandra (bonus: support secondary indexes).
