Distributed ID Generation Scheme#
Content from
Problem Description#
In a single-machine system (for example, a MySQL instance), generating a unique ID is very simple, directly utilizing MySQL's built-in auto-increment ID feature.
However, in a distributed system with multiple shards (for example, a cluster composed of multiple MySQL instances inserting data), this problem becomes complex. The globally unique ID generated must meet the following requirements:
- Ensure that the generated ID is globally unique
- Future data migration between multiple shards should not be limited by the ID generation method
- It is preferable for the generated ID to include time information, for example, the first k bits of the ID are a Timestamp, allowing data to be sorted by time directly through sorting the first k bits of the ID
- The generated ID should preferably not exceed 64 bits
- There are speed requirements for ID generation. For example, in a high-throughput scenario, tens of thousands of IDs need to be generated per second (Twitter's latest peak reached 143,199 Tweets/s, which is over 100,000 per second)
- The entire service should ideally have no single point of failure
Timestamp#
Using time to create unique IDs is basically unfeasible in high concurrency or distributed environments; IDs generated at the same time are duplicates and do not meet global uniqueness.
Using Database Auto-Increment#
Still utilizing the database to generate auto-increment IDs to ensure uniqueness. However, it should be noted that without sharding, a fixed number of database tables can be used solely to generate auto-increment IDs, unrelated to business, and no further sharding will occur.
The advantage is that it is simple to implement and easy to understand. The downside is that the ID generation rate is affected by database performance and the network connecting to the database.
Using Redis Atomic Operations#
- Does not rely on the database, is flexible and convenient, and performs better than the database.
- Numeric IDs are naturally sortable, which is very helpful for pagination or results that require sorting.
Note:
- In a Redis cluster scenario, different increment steps need to be set.
- If the generated ID exceeds the auto-increment growth, a prefix + auto-increment + setting an expiration time can be used.
Using UUID/GUID#
Basic Concept#
UUID refers to a number generated on a single machine, ensuring uniqueness across all machines in the same space-time.
UUID components: current date and time + clock sequence + random number + globally unique IEEE machine identifier.
Advantages and Disadvantages#
Advantages:
- Simple, easy to code
- ID generation performance is very good, basically no performance issues
- Globally unique, can handle data migration, system data merging, or database changes calmly.
Disadvantages: - No sorting, cannot guarantee trend increment
- UUIDs are often stored as strings, resulting in lower query efficiency
- Storage space is relatively large; if it is a massive database, storage capacity must be considered.
- Too long, 128 bits, not suitable for database primary keys.
Twitter Snowflake#
Snowflake is Twitter's open-source distributed ID generation algorithm, resulting in a long-type ID.
Its core idea is: high bits random + milliseconds + machine code (data center + machine ID) + 10 bits of sequence number
ID Generation Process#
- 10 bits of machine number, obtained from a Zookeeper cluster when the ID allocation worker starts (ensuring that all workers do not have duplicate machine numbers)
- 41 bits of Timestamp: Each time a new ID is to be generated, the current Timestamp is obtained, and then the sequence number is generated in two situations:
- If the current Timestamp is the same as the Timestamp of the previously generated ID (within the same millisecond), the previous ID's sequence number + 1 is used as the new sequence number (12 bits); if all IDs in this millisecond are used up, wait for the next millisecond to continue (during this waiting period, no new IDs can be allocated).
- If the current Timestamp is greater than the previous ID's Timestamp, a random initial sequence number (12 bits) is generated as the first sequence number for this millisecond. Throughout the process, there is only external dependency when the worker starts (needs to obtain the worker number from Zookeeper), after which it can work independently, achieving decentralization.