Lab_1 - Qcuik

# COMP7095 - Big Data Management ## Lab 3 MongoDB Part 2: MongoDB NoSQL + Python ### Launching Jupyter Notebook We are going to switch the remaining parts to Jupyter Notebook. Let's try it with the following steps: 1. Download the `ipynb version` of this lab. 2. save it to the `comp7095` folder on the desktop. 3. Launch the terminal (macOS) or command prompt (Windows). 4. Run the followings commands: macOS: \ <code>cd Desktop/comp7095 source ./venv/bin/activate jupyter notebook </code> Windows: \ <code>cd comp7095 .\venv\Scripts\activate jupyter notebook </code> 5. On the Jupyter home page, select the `ipnb version` of this lab to run it. Then, you will see this page in notebook format. ## Starting MongoDB Next, we need to start the MongoDB. macOS: 1. Launch a new terminal and run the following command: \ `brew services start mongodb-community@6.0` 2. Wait for the respond to verify MongoDB is running. 3. Leave the terminal aside. Windows: 1. Launch a new Command Prompt and run the following command: \ `"C:\Program Files\MongoDB\Server\6.0\bin\mongod.exe" --dbpath="c:\data\db"` 2. Wait for the respond to verify MongoDB is running. 3. Leave the Command Prompt aside. ## Connecting MongoDB MongoDB is now running locally. We are going to use Python code to connect it. Import the required packages: - `MongoClient` is used to establish connect to MongoDB. With the client, we can insert, update, and delete the documents in MongoDB. - `pprint` provides a capability to “pretty-print” arbitrary Python data structures in a form which can be used as input to the interpreter. First, we need to import the required packages. ```python from pymongo import MongoClient from pprint import pprint ``` To connect MongoDB, we need to use the `MongoClient` function with the connection string. The connection string - `mongodb://localhost:27017`, describes the protocol, host, and port. If you want to connect MongoDB running on another host, you should replace `localhost` with the IP address of the host. ```python client = MongoClient('mongodb://localhost:27017/') ``` You can check the existing databases by using the `list_database_names` function. ```python client.list_database_names() ``` ['admin', 'config', 'local', 'mydb', 'shopDB'] To switch database, we only need to provide the database name as follows: ```python db = client['mydb'] ``` Let's use the `list_collection_names` function to see what collections the database has. ```python db.list_collection_names() ``` ['products'] Now, we are going to use the `products` collection. ```python products = db['products'] ``` In the last lab, we use `mongosh` to create some documents in the `products` collection. Let's see whether we can get them back in Python. We call the `find` function, just like what we have done in `mongosh`. The output of the `find` function is in Cursor format. For convenience, we convert the output of the `find` function to a list. ```python data = list(products.find()) pprint(data) ``` [{'_id': ObjectId('63fd894717191eaefaf3be4f'), 'name': 'apple', 'qty': 15, 'type': 'fruit'}, {'_id': ObjectId('63fd89db17191eaefaf3be50'), 'name': 'orange', 'price': 8, 'qty': 13, 'type': 'fruit'}, {'_id': 10, 'name': 'hard disk', 'qty': 10, 'type': 'it'}, {'_id': 200, 'name': 'dragon fruit', 'price': 5, 'qty': 20, 'type': 'fruit'}, {'_id': ObjectId('63fd8bd517191eaefaf3be51'), 'name': 'pear', 'qty': 20, 'type': 'fruit'}, {'_id': ObjectId('63fd8bd517191eaefaf3be52'), 'name': 'watermelon', 'qty': 100, 'type': 'fruit'}, {'_id': ObjectId('63fd90ad533c273c4e752e39'), 'name': 'mango', 'qty': 100}] The following two lines of code is used to delete some documents created by this notebook, just in case you ran this notebook previously. ```python products.delete_one({'_id':12}) products.delete_many({'type':'unknown'}) ``` <pymongo.results.DeleteResult at 0x7faf5356d430> With the argument, the `find` function filters the output as we want. For example, we want to get the documents about the IT products only. Note that we need to add quotation marks to quote the keys mentioned in the argument. i.e., `'type'` in the following example. ```python data = list(products.find({'type': 'it'})) pprint(data) ``` [{'_id': 10, 'name': 'hard disk', 'qty': 10, 'type': 'it'}] ## Insertion, Update, and Deletion Same as `mongosh`, `PyMongo` provides functions for manipulating the documents including insertion, update, and deletion. To insert a new document, we use the `insert_one` function. ```python products.insert_one({'_id': 12, 'name': 'mouse', 'qty': 20, 'type': 'it'}) ``` <pymongo.results.InsertOneResult at 0x7faf55b2ac40> Let's check whether it works or not! ```python data = products.find_one({'_id': 12}) pprint(data) ``` {'_id': 12, 'name': 'mouse', 'qty': 20, 'type': 'it'} We can also insert multiple documents at once using the `insert_many` function. ```python products.insert_many([ {'_id': 200, 'name': 'something 1', 'qty': 500, 'type': 'unknown'}, {'_id': 201, 'name': 'something 2', 'qty': 500, 'type': 'unknown'}, {'_id': 202, 'name': 'something 3', 'qty': 500, 'type': 'unknown'}, {'_id': 203, 'name': 'something 4', 'qty': 500, 'type': 'unknown'}, {'_id': 204, 'name': 'something 5', 'qty': 500, 'type': 'unknown'} ]) data = list(products.find()) pprint(data) ``` [{'_id': ObjectId('63fd894717191eaefaf3be4f'), 'name': 'apple', 'qty': 15, 'type': 'fruit'}, {'_id': ObjectId('63fd89db17191eaefaf3be50'), 'name': 'orange', 'price': 8, 'qty': 13, 'type': 'fruit'}, {'_id': 10, 'name': 'hard disk', 'qty': 10, 'type': 'it'}, {'_id': ObjectId('63fd8bd517191eaefaf3be51'), 'name': 'pear', 'qty': 20, 'type': 'fruit'}, {'_id': ObjectId('63fd8bd517191eaefaf3be52'), 'name': 'watermelon', 'qty': 100, 'type': 'fruit'}, {'_id': ObjectId('63fd90ad533c273c4e752e39'), 'name': 'mango', 'qty': 100}, {'_id': 12, 'name': 'mouse', 'qty': 20, 'type': 'it'}, {'_id': 200, 'name': 'something 1', 'qty': 500, 'type': 'unknown'}, {'_id': 201, 'name': 'something 2', 'qty': 500, 'type': 'unknown'}, {'_id': 202, 'name': 'something 3', 'qty': 500, 'type': 'unknown'}, {'_id': 203, 'name': 'something 4', 'qty': 500, 'type': 'unknown'}, {'_id': 204, 'name': 'something 5', 'qty': 500, 'type': 'unknown'}] We even can get a value of the specific field in a document. For example, we want to get **the quantity of the product with ID 204**. ```python data = products.find_one({'_id': 204}) qty = data['qty'] pprint(qty) ``` 500 Let's try to update the quantity of product 204 with `qty + 1000`. Note that the `update_one` function will update the first match only. ```python products.update_one({'_id': 204}, {'$set': {'qty': qty + 1000}}) data = products.find_one({'_id': 204}) pprint(data) ``` {'_id': 204, 'name': 'something 5', 'qty': 1500, 'type': 'unknown'} With the `update_many` function, we can update multiple documents at once. For example, we change type "unknown" to "UNKNOWN". ```python products.update_many({'type': 'unknown'}, {'$set':{'type':'UNKNOWN'}}) data = list(products.find()) pprint(data) ``` [{'_id': ObjectId('63fd894717191eaefaf3be4f'), 'name': 'apple', 'qty': 15, 'type': 'fruit'}, {'_id': ObjectId('63fd89db17191eaefaf3be50'), 'name': 'orange', 'price': 8, 'qty': 13, 'type': 'fruit'}, {'_id': 10, 'name': 'hard disk', 'qty': 10, 'type': 'it'}, {'_id': ObjectId('63fd8bd517191eaefaf3be51'), 'name': 'pear', 'qty': 20, 'type': 'fruit'}, {'_id': ObjectId('63fd8bd517191eaefaf3be52'), 'name': 'watermelon', 'qty': 100, 'type': 'fruit'}, {'_id': ObjectId('63fd90ad533c273c4e752e39'), 'name': 'mango', 'qty': 100}, {'_id': 12, 'name': 'mouse', 'qty': 20, 'type': 'it'}, {'_id': 200, 'name': 'something 1', 'qty': 500, 'type': 'UNKNOWN'}, {'_id': 201, 'name': 'something 2', 'qty': 500, 'type': 'UNKNOWN'}, {'_id': 202, 'name': 'something 3', 'qty': 500, 'type': 'UNKNOWN'}, {'_id': 203, 'name': 'something 4', 'qty': 500, 'type': 'UNKNOWN'}, {'_id': 204, 'name': 'something 5', 'qty': 1500, 'type': 'UNKNOWN'}] To delete documents, we use the `delete_one` or `delete_many` functions. Same situation as `update_one`, the `delete_one` function takes effect on the first match only. ```python products.delete_one({'type': 'UNKNOWN'}) data = list(products.find()) pprint(data) ``` [{'_id': ObjectId('63fd894717191eaefaf3be4f'), 'name': 'apple', 'qty': 15, 'type': 'fruit'}, {'_id': ObjectId('63fd89db17191eaefaf3be50'), 'name': 'orange', 'price': 8, 'qty': 13, 'type': 'fruit'}, {'_id': 10, 'name': 'hard disk', 'qty': 10, 'type': 'it'}, {'_id': ObjectId('63fd8bd517191eaefaf3be51'), 'name': 'pear', 'qty': 20, 'type': 'fruit'}, {'_id': ObjectId('63fd8bd517191eaefaf3be52'), 'name': 'watermelon', 'qty': 100, 'type': 'fruit'}, {'_id': ObjectId('63fd90ad533c273c4e752e39'), 'name': 'mango', 'qty': 100}, {'_id': 12, 'name': 'mouse', 'qty': 20, 'type': 'it'}, {'_id': 201, 'name': 'something 2', 'qty': 500, 'type': 'UNKNOWN'}, {'_id': 202, 'name': 'something 3', 'qty': 500, 'type': 'UNKNOWN'}, {'_id': 203, 'name': 'something 4', 'qty': 500, 'type': 'UNKNOWN'}, {'_id': 204, 'name': 'something 5', 'qty': 1500, 'type': 'UNKNOWN'}] To delete all products with "UNKNOWN" type, we should use the `delete_many` function instead. ```python products.delete_many({'type':'UNKNOWN'}) data = list(products.find()) pprint(data) ``` [{'_id': ObjectId('63fd894717191eaefaf3be4f'), 'name': 'apple', 'qty': 15, 'type': 'fruit'}, {'_id': ObjectId('63fd89db17191eaefaf3be50'), 'name': 'orange', 'price': 8, 'qty': 13, 'type': 'fruit'}, {'_id': 10, 'name': 'hard disk', 'qty': 10, 'type': 'it'}, {'_id': ObjectId('63fd8bd517191eaefaf3be51'), 'name': 'pear', 'qty': 20, 'type': 'fruit'}, {'_id': ObjectId('63fd8bd517191eaefaf3be52'), 'name': 'watermelon', 'qty': 100, 'type': 'fruit'}, {'_id': ObjectId('63fd90ad533c273c4e752e39'), 'name': 'mango', 'qty': 100}, {'_id': 12, 'name': 'mouse', 'qty': 20, 'type': 'it'}] ## Importing Data In the previous section, we inserted documents manually. Of course, we may store a lot of data in data files and want to import them into MongoDB. We are going to learn how to import the data from a text file using the Python code. Let's download the data file named `annual-enterprise-survey-2021-financial-year-provisional-csv.txt` and store it in the comp7095 folder on the desktop. The data file is originally from Stats NZ Tatauranga Aotearoa. https://stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2021-financial-year-provisional/Download-data/annual-enterprise-survey-2021-financial-year-provisional-csv.csv To simplify the Python code for loading the data from the data file, the content of the data file is pre-processed using Microsoft Excel. The changes include: 1. Removed the comma from column `Value`. 2. Saved the file in Tab Delimited Text format (`.txt`). First, we use the `open` function to open the data file in `read-only` mode. ```python file = open('annual-enterprise-survey-2021-financial-year-provisional-csv.txt', 'r') ``` #### Process the Heading Line The first line in the data file is a heading line. We use the `readline` function to read it and name it `headings`. ```python headings = file.readline() ``` Let's see what we get from the file! The line contains the headings separated by a tab (`\t`) and ends with a newline character (`\n`). ```python headings ``` 'Year\tIndustry_aggregation_NZSIOC\tIndustry_code_NZSIOC\tIndustry_name_NZSIOC\tUnits\tVariable_code\tVariable_name\tVariable_category\tValue\tIndustry_code_ANZSIC06\n' We delete the `\n` and split the line using the `\t`. So, we have a list that contains the headings. Then, `headings` now refers to a list of the headings. ```python headings = headings.replace('\n', '').split('\t') headings ``` ['Year', 'Industry_aggregation_NZSIOC', 'Industry_code_NZSIOC', 'Industry_name_NZSIOC', 'Units', 'Variable_code', 'Variable_name', 'Variable_category', 'Value', 'Industry_code_ANZSIC06'] #### Process the Data Lines Next, we do the same thing for the data part. The procedure is that: 1. create an empty list and a counter. ```python temp = [] count = 0 ``` 2. Read all remaining lines from the file and name it `lines`. After that we close the file. ```python lines = file.readlines() file.close() ``` 3. Repeatedly do the following for each line: 1. Get a line from the list, delete the `\n` and split it using `\t`. 1. Create a new dictionary. 1. Increate the count by 1 and store it in the dictionary as an ID of the document we are going to create. 1. Use a for loop to put the values with corresponding headings to the dictionary. 1. Put the dictionary to the list. ```python for line in lines: values = line.replace('\n', '').split('\t') doc = {} count += 1 doc['_id'] = count for i in range(len(headings)): doc[headings[i]] = values[i] temp.append(doc) ``` 4. We get a collection named `companies`. To guarantee it is empty, we call its `delete_many` function to delete everything. ```python companies = db['companies'] companies.delete_many({}) ``` <pymongo.results.DeleteResult at 0x7faf535954f0> 5. Call the `insert_many` function with the list to create documents in MongoDB. ```python companies.insert_many(temp) ``` <pymongo.results.InsertManyResult at 0x7faf55cc6eb0> 6. Verify the insertion by getting the first 5 documents using the `find` function and `limit` function. ```python data = list(companies.find().limit(5)) pprint(data) ``` [{'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': '757504', 'Variable_category': 'Financial performance', 'Variable_code': 'H01', 'Variable_name': 'Total income', 'Year': '2021', '_id': 1}, {'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': '674890', 'Variable_category': 'Financial performance', 'Variable_code': 'H04', 'Variable_name': '"Sales, government funding, grants and subsidies"', 'Year': '2021', '_id': 2}, {'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': '49593', 'Variable_category': 'Financial performance', 'Variable_code': 'H05', 'Variable_name': '"Interest, dividends and donations"', 'Year': '2021', '_id': 3}, {'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': '33020', 'Variable_category': 'Financial performance', 'Variable_code': 'H07', 'Variable_name': 'Non-operating income', 'Year': '2021', '_id': 4}, {'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': '654404', 'Variable_category': 'Financial performance', 'Variable_code': 'H08', 'Variable_name': 'Total expenditure', 'Year': '2021', '_id': 5}] 7. By default, the data type of each field is string. We need to convert the `Value` field to integer. We use a for loop to retrieve every document and use the `update_one` function to update the value of the `Value` field. ```python for d in data: id = d['_id'] try: value = int(d['Value']) except: value = 0 ## We set the value to zero if the value cannot be converted to an integer. companies.update_one({'_id':id}, {'$set':{'Value':value}}) ``` Done! Let's see what we have! ```python data = list(companies.find().limit(5)) pprint(data) ``` [{'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': 757504, 'Variable_category': 'Financial performance', 'Variable_code': 'H01', 'Variable_name': 'Total income', 'Year': '2021', '_id': 1}, {'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': 674890, 'Variable_category': 'Financial performance', 'Variable_code': 'H04', 'Variable_name': '"Sales, government funding, grants and subsidies"', 'Year': '2021', '_id': 2}, {'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': 49593, 'Variable_category': 'Financial performance', 'Variable_code': 'H05', 'Variable_name': '"Interest, dividends and donations"', 'Year': '2021', '_id': 3}, {'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': 33020, 'Variable_category': 'Financial performance', 'Variable_code': 'H07', 'Variable_name': 'Non-operating income', 'Year': '2021', '_id': 4}, {'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': 654404, 'Variable_category': 'Financial performance', 'Variable_code': 'H08', 'Variable_name': 'Total expenditure', 'Year': '2021', '_id': 5}] We can use the comparison keywords to filter the documents, just like what we learnt about `mongosh`. ```python data = list(companies.find({'Value':{'$gt': 100000}}).limit(5)) pprint(data) ``` [{'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': 757504, 'Variable_category': 'Financial performance', 'Variable_code': 'H01', 'Variable_name': 'Total income', 'Year': '2021', '_id': 1}, {'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': 674890, 'Variable_category': 'Financial performance', 'Variable_code': 'H04', 'Variable_name': '"Sales, government funding, grants and subsidies"', 'Year': '2021', '_id': 2}, {'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': 654404, 'Variable_category': 'Financial performance', 'Variable_code': 'H08', 'Variable_name': 'Total expenditure', 'Year': '2021', '_id': 5}] We can also sort the output by using the `sort` function. ```python data = list(companies.find({'Value':{'$lte': -1000}}).sort('Value', -1).limit(5)) pprint(data) ``` [] ## Aggregate In Python, PyMongo package provides the `aggregate` function, similar to the `aggregate` function of `mongosh`. But, the argument of the `aggregate` function must be a list of dictionaries. Let's see the following example. We use the `aggregate` function to get the count of the documents in the `companies` collection: ```python data = list(companies.aggregate([{'$group':{'_id':'', 'count(*)':{'$sum':1}}}])) pprint(data) ``` [{'_id': '', 'count(*)': 41715}] You can think the code above as `SELECT count(*) FROM companies`. The following example shows how to retrieve the document counts of each year. You can think it as `SELECT Years, count(*) FROM companies GROUP BY Years`. ```python data = list(companies.aggregate([{'$group':{'_id':'$Year', 'count':{'$sum':1}}}])) pprint(data) ``` [{'_id': '2020', 'count': 4635}, {'_id': '2015', 'count': 4635}, {'_id': '2019', 'count': 4635}, {'_id': '2018', 'count': 4635}, {'_id': '2014', 'count': 4635}, {'_id': '2013', 'count': 4635}, {'_id': '2021', 'count': 4635}, {'_id': '2017', 'count': 4635}, {'_id': '2016', 'count': 4635}] In the following example, we add one more dictionary to sort the output by `_id` in descending order. ```python data = list(companies.aggregate([{'$group':{'_id':'$Year', 'count':{'$sum':1}}}, {'$sort': {'_id': -1}}])) pprint(data) ``` [{'_id': '2021', 'count': 4635}, {'_id': '2020', 'count': 4635}, {'_id': '2019', 'count': 4635}, {'_id': '2018', 'count': 4635}, {'_id': '2017', 'count': 4635}, {'_id': '2016', 'count': 4635}, {'_id': '2015', 'count': 4635}, {'_id': '2014', 'count': 4635}, {'_id': '2013', 'count': 4635}]