Python queue to list fast

2/9/2024

In this post, I'm going to walk through a way to quickly move a large amount of data to a Database using Python. This post came from a project requirement to read around 20 million JSON records and load their contents to a SQL Server Database.

I'll make a few assumptions:

- First, you're somewhat familiar with Python and some concepts like using modules, importing libraries, and common data structures.
- You're familiar with using Python to move data to a Database and using Database connectors.
- Lastly, I will assume the destination Database is an enterprise-level Relational Database such as Postgres or SQL Server. These have features that are helpful for loading data quickly, such as concurrency and multiple connections.

This post intends to walk through an approach to a problem rather than provide a tutorial.

There are two main challenges when moving large volumes of data to a Database:

- Time - An operation, such as inserting a record, may take microseconds but can add up to minutes or hours when performed millions of times.
- Memory - If a dataset's size is more than our available RAM, we're not going to be able to put it all in a list and process it.

In this post, we'll explore how to deal with these challenges.

Going faster

Before we look at code, let's explore some ways to go faster when loading data to a Database.

Repeatedly performing the same action on data is a good hint that the task could benefit from concurrency, namely Threads. Threading allows us to break up a task into chunks of work and execute them concurrently instead of sequentially. We can do more work in the same amount of time.

Inserting one row at a time might only take a microsecond, but that adds up if done millions of times. To maximize speed, we'll commit the maximum number of records possible in a single Database transaction. The specific INSERT statement used is called a Multi-row Insert.

Usually, it makes sense to use an ORM (Object Relational Mapper) to interact with a Database in Python. However, creating thousands or millions of ORM-specific objects adds overhead we could avoid. Instead, raw SQL will be built from the data and executed with the Database API.
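To make the multi-row insert idea concrete, here is a minimal sketch of building one parameterised INSERT for a whole batch of rows and executing it through a DB-API connection. The function name, table name, column handling, and the pyodbc-style `?` placeholders are assumptions for illustration, not code taken from this post.

```python
# Minimal sketch of a multi-row insert: one statement, one commit, many rows.
# The "?" placeholder style is pyodbc/SQL Server flavoured; psycopg2 would use "%s".
def multi_row_insert(batch, table_name, columns, conn):
    values_clause = "(" + ", ".join(["?"] * len(columns)) + ")"
    sql = (
        f"INSERT INTO {table_name} ({', '.join(columns)}) VALUES "
        + ", ".join([values_clause] * len(batch))
    )
    # Flatten the rows so each placeholder gets exactly one parameter.
    params = [value for row in batch for value in row]
    cursor = conn.cursor()
    cursor.execute(sql, params)
    conn.commit()
    cursor.close()
```

A batch of 1,000 rows becomes a single statement that the Database commits once, rather than 1,000 separate round trips. Note that drivers cap the number of parameters per statement (SQL Server allows roughly 2,100), which is one reason to keep the batch size behind a constant like the MULTI_ROW_INSERT_LIMIT defined below.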
If you want to follow along, you'll need some source data. The source in this walkthrough is a CSV file. This post aims to cover loading data quickly, so I won't detail how this works. However, the script uses the excellent Faker library to generate fake personal data and pandas to write out to CSV.

```python
FIELDS = [...]  # column names for the fake personal data

def create_dummy_data():
    return ...  # builds the fake rows with Faker
```

Initial setup

```python
load_csv("dummy_data.csv", table_def, conn_params)
```

The module starts by importing the required libraries, concurrent.futures, csv, and queue, all from the standard library, followed by the sqlactions module. After the imports, constants get defined. These include the MULTI_ROW_INSERT_LIMIT, determining the queue size (see below), and the number of threads, or WORKERS.

Next comes reading the data from the CSV file. Reading the data is done with a generator. Generators are a particular type of function that returns values one at a time (yield) rather than returning them all at once (return).

```python
def read_csv(csv_file):
    with open(csv_file, encoding="utf-8", newline="") as in_file:
        reader = csv.reader(in_file, delimiter="|")
        next(reader)  # Header row
        for row in reader:
            yield row
```

Using a generator avoids putting all the data in a list and running out of memory. Instead, data is read in chunks (in this case, a single line), discarding each chunk when finished with it. Remember that once data has made it to the Database, there's no need to keep it in memory.

Recall from earlier that the faster approach is to insert as many records as possible in a single transaction, or, put another way, to batch them. Batching requires a data structure to hold row data temporarily. The best choice here is a queue, which has useful features for this kind of task:

- Queues are optimized for pulling data from the start and adding data to the end, i.e., we don't need to access values at any other place in the queue.
- Queues are locked whenever modifications take place on them. If we pull a value off the queue, no other thread can simultaneously modify it. This property is known as being "thread safe" and addresses a problem called a race condition.

Finally, the load_csv function ties these pieces together:

```python
def load_csv(csv_file, table_def, conn_params):
    # Optional, drops table if it exists before creating
    ...
```
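Based on the pieces described so far (the generator, the queue, the thread pool from concurrent.futures, and the multi-row insert), here is a minimal sketch of how a load_csv like this might be put together. The sqlactions function names (make_table, multi_row_insert), the constant values, and the producer/consumer arrangement are all assumptions for illustration, not the post's actual implementation; multi_row_insert is assumed to do roughly what the earlier SQL sketch showed, opening a connection from conn_params and issuing one INSERT per batch.

```python
# Sketch only: the main thread reads rows, consumer threads batch them into
# multi-row inserts. sqlactions.make_table / multi_row_insert and the constant
# values are assumed names and numbers, not this post's actual code.
import concurrent.futures
import csv
import queue

import sqlactions  # assumed to wrap table creation and the raw INSERT execution

MULTI_ROW_INSERT_LIMIT = 1000   # rows per INSERT statement (assumed value)
WORKERS = 6                     # number of consumer threads (assumed value)
SENTINEL = object()             # tells a consumer there are no more rows


def read_csv(csv_file):
    # Same generator as above: one row at a time, never the whole file in memory.
    with open(csv_file, encoding="utf-8", newline="") as in_file:
        reader = csv.reader(in_file, delimiter="|")
        next(reader)  # Header row
        for row in reader:
            yield row


def insert_batches(row_queue, table_def, conn_params):
    # Consumer: accumulate rows into a batch, flush each full batch as one insert.
    batch = []
    while True:
        row = row_queue.get()
        if row is SENTINEL:
            break
        batch.append(row)
        if len(batch) == MULTI_ROW_INSERT_LIMIT:
            sqlactions.multi_row_insert(batch, table_def, conn_params)
            batch = []
    if batch:
        # Leftover rows that never filled a whole batch.
        sqlactions.multi_row_insert(batch, table_def, conn_params)


def load_csv(csv_file, table_def, conn_params):
    # Optional, drops table if it exists before creating
    sqlactions.make_table(table_def, conn_params)

    # Bounded queue: reading pauses when the consumers fall behind (backpressure).
    row_queue = queue.Queue(maxsize=MULTI_ROW_INSERT_LIMIT * WORKERS)

    with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as executor:
        consumers = [
            executor.submit(insert_batches, row_queue, table_def, conn_params)
            for _ in range(WORKERS)
        ]
        for row in read_csv(csv_file):
            row_queue.put(row)
        for _ in range(WORKERS):
            row_queue.put(SENTINEL)  # one stop signal per consumer
        for consumer in consumers:
            consumer.result()  # surface any exception raised in a worker thread
```

The bounded queue is what keeps memory flat: if the Database falls behind, the CSV reader blocks instead of piling rows up in memory, so the footprint stays the same no matter how large the file is.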