Lab_1

# COMP7095 - Big Data Management ## Lab 3 MongoDB Part 2: MongoDB NoSQL + Python ### Launching Jupyter Notebook We are going to switch the remaining parts to Jupyter Notebook. Let's try it with the following steps: 1. Download the `ipynb version` of this lab. 2. save it to the `comp7095` folder on the desktop. 3. Launch the terminal (macOS) or command prompt (Windows). 4. Run the followings commands: macOS: \ <code>cd Desktop/comp7095 source ./venv/bin/activate jupyter notebook </code> Windows: \ <code>cd comp7095 .\venv\Scripts\activate jupyter notebook </code> 5. On the Jupyter home page, select the `ipnb version` of this lab to run it. Then, you will see this page in notebook format. ## Starting MongoDB Next, we need to start the MongoDB. macOS: 1. Launch a new terminal and run the following command: \ `brew services start mongodb-community@6.0` 2. Wait for the respond to verify MongoDB is running. 3. Leave the terminal aside. Windows: 1. Launch a new Command Prompt and run the following command: \ `"C:\Program Files\MongoDB\Server\6.0\bin\mongod.exe" --dbpath="c:\data\db"` 2. Wait for the respond to verify MongoDB is running. 3. Leave the Command Prompt aside. ## Connecting MongoDB MongoDB is now running locally. We are going to use Python code to connect it. Import the required packages: - `MongoClient` is used to establish connect to MongoDB. With the client, we can insert, update, and delete the documents in MongoDB. - `pprint` provides a capability to “pretty-print” arbitrary Python data structures in a form which can be used as input to the interpreter. First, we need to import the required packages. ```python from pymongo import MongoClient from pprint import pprint ``` To connect MongoDB, we need to use the `MongoClient` function with the connection string. The connection string - `mongodb://localhost:27017`, describes the protocol, host, and port. If you want to connect MongoDB running on another host, you should replace `localhost` with the IP address of the host. ```python client = MongoClient('mongodb://localhost:27017/') ``` You can check the existing databases by using the `list_database_names` function. ```python client.list_database_names() ``` ['admin', 'config', 'local', 'mydb', 'shopDB'] To switch database, we only need to provide the database name as follows: ```python db = client['mydb'] ``` Let's use the `list_collection_names` function to see what collections the database has. ```python db.list_collection_names() ``` ['products'] Now, we are going to use the `products` collection. ```python products = db['products'] ``` In the last lab, we use `mongosh` to create some documents in the `products` collection. Let's see whether we can get them back in Python. We call the `find` function, just like what we have done in `mongosh`. The output of the `find` function is in Cursor format. For convenience, we convert the output of the `find` function to a list. ```python data = list(products.find()) pprint(data) ``` [{'_id': ObjectId('63fd894717191eaefaf3be4f'), 'name': 'apple', 'qty': 15, 'type': 'fruit'}, {'_id': ObjectId('63fd89db17191eaefaf3be50'), 'name': 'orange', 'price': 8, 'qty': 13, 'type': 'fruit'}, {'_id': 10, 'name': 'hard disk', 'qty': 10, 'type': 'it'}, {'_id': 200, 'name': 'dragon fruit', 'price': 5, 'qty': 20, 'type': 'fruit'}, {'_id': ObjectId('63fd8bd517191eaefaf3be51'), 'name': 'pear', 'qty': 20, 'type': 'fruit'}, {'_id': ObjectId('63fd8bd517191eaefaf3be52'), 'name': 'watermelon', 'qty': 100, 'type': 'fruit'}, {'_id': ObjectId('63fd90ad533c273c4e752e39'), 'name': 'mango', 'qty': 100}] The following two lines of code is used to delete some documents created by this notebook, just in case you ran this notebook previously. ```python products.delete_one({'_id':12}) products.delete_many({'type':'unknown'}) ``` <pymongo.results.DeleteResult at 0x7faf5356d430> With the argument, the `find` function filters the output as we want. For example, we want to get the documents about the IT products only. <i>Note that we need to add quotation marks to quote the keys mentioned in the argument. i.e., `'type'` in the following example.</i> ```python data = list(products.find({'type': 'it'})) pprint(data) ``` [{'_id': 10, 'name': 'hard disk', 'qty': 10, 'type': 'it'}] ## Insertion, Update, and Deletion Same as `mongosh`, `PyMongo` provides functions for manipulating the documents including insertion, update, and deletion. To insert a new document, we use the `insert_one` function. ```python products.insert_one({'_id': 12, 'name': 'mouse', 'qty': 20, 'type': 'it'}) ``` <pymongo.results.InsertOneResult at 0x7faf55b2ac40> Let's check whether it works or not! ```python data = products.find_one({'_id': 12}) pprint(data) ``` {'_id': 12, 'name': 'mouse', 'qty': 20, 'type': 'it'} We can also insert multiple documents at once using the `insert_many` function. ```python products.insert_many([ {'_id': 200, 'name': 'something 1', 'qty': 500, 'type': 'unknown'}, {'_id': 201, 'name': 'something 2', 'qty': 500, 'type': 'unknown'}, {'_id': 202, 'name': 'something 3', 'qty': 500, 'type': 'unknown'}, {'_id': 203, 'name': 'something 4', 'qty': 500, 'type': 'unknown'}, {'_id': 204, 'name': 'something 5', 'qty': 500, 'type': 'unknown'} ]) data = list(products.find()) pprint(data) ``` [{'_id': ObjectId('63fd894717191eaefaf3be4f'), 'name': 'apple', 'qty': 15, 'type': 'fruit'}, {'_id': ObjectId('63fd89db17191eaefaf3be50'), 'name': 'orange', 'price': 8, 'qty': 13, 'type': 'fruit'}, {'_id': 10, 'name': 'hard disk', 'qty': 10, 'type': 'it'}, {'_id': ObjectId('63fd8bd517191eaefaf3be51'), 'name': 'pear', 'qty': 20, 'type': 'fruit'}, {'_id': ObjectId('63fd8bd517191eaefaf3be52'), 'name': 'watermelon', 'qty': 100, 'type': 'fruit'}, {'_id': ObjectId('63fd90ad533c273c4e752e39'), 'name': 'mango', 'qty': 100}, {'_id': 12, 'name': 'mouse', 'qty': 20, 'type': 'it'}, {'_id': 200, 'name': 'something 1', 'qty': 500, 'type': 'unknown'}, {'_id': 201, 'name': 'something 2', 'qty': 500, 'type': 'unknown'}, {'_id': 202, 'name': 'something 3', 'qty': 500, 'type': 'unknown'}, {'_id': 203, 'name': 'something 4', 'qty': 500, 'type': 'unknown'}, {'_id': 204, 'name': 'something 5', 'qty': 500, 'type': 'unknown'}] We even can get a value of the specific field in a document. For example, we want to get **the quantity of the product with ID 204**. ```python data = products.find_one({'_id': 204}) qty = data['qty'] pprint(qty) ``` 500 Let's try to update the quantity of product 204 with `qty + 1000`. <i>Note that the `update_one` function will update the first match only.</i> ```python products.update_one({'_id': 204}, {'$set': {'qty': qty + 1000}}) data = products.find_one({'_id': 204}) pprint(data) ``` {'_id': 204, 'name': 'something 5', 'qty': 1500, 'type': 'unknown'} With the `update_many` function, we can update multiple documents at once. For example, we change type "unknown" to "UNKNOWN". ```python products.update_many({'type': 'unknown'}, {'$set':{'type':'UNKNOWN'}}) data = list(products.find()) pprint(data) ``` [{'_id': ObjectId('63fd894717191eaefaf3be4f'), 'name': 'apple', 'qty': 15, 'type': 'fruit'}, {'_id': ObjectId('63fd89db17191eaefaf3be50'), 'name': 'orange', 'price': 8, 'qty': 13, 'type': 'fruit'}, {'_id': 10, 'name': 'hard disk', 'qty': 10, 'type': 'it'}, {'_id': ObjectId('63fd8bd517191eaefaf3be51'), 'name': 'pear', 'qty': 20, 'type': 'fruit'}, {'_id': ObjectId('63fd8bd517191eaefaf3be52'), 'name': 'watermelon', 'qty': 100, 'type': 'fruit'}, {'_id': ObjectId('63fd90ad533c273c4e752e39'), 'name': 'mango', 'qty': 100}, {'_id': 12, 'name': 'mouse', 'qty': 20, 'type': 'it'}, {'_id': 200, 'name': 'something 1', 'qty': 500, 'type': 'UNKNOWN'}, {'_id': 201, 'name': 'something 2', 'qty': 500, 'type': 'UNKNOWN'}, {'_id': 202, 'name': 'something 3', 'qty': 500, 'type': 'UNKNOWN'}, {'_id': 203, 'name': 'something 4', 'qty': 500, 'type': 'UNKNOWN'}, {'_id': 204, 'name': 'something 5', 'qty': 1500, 'type': 'UNKNOWN'}] To delete documents, we use the `delete_one` or `delete_many` functions. <i>Same situation as `update_one`, the `delete_one` function takes effect on the first match only.</i> ```python products.delete_one({'type': 'UNKNOWN'}) data = list(products.find()) pprint(data) ``` [{'_id': ObjectId('63fd894717191eaefaf3be4f'), 'name': 'apple', 'qty': 15, 'type': 'fruit'}, {'_id': ObjectId('63fd89db17191eaefaf3be50'), 'name': 'orange', 'price': 8, 'qty': 13, 'type': 'fruit'}, {'_id': 10, 'name': 'hard disk', 'qty': 10, 'type': 'it'}, {'_id': ObjectId('63fd8bd517191eaefaf3be51'), 'name': 'pear', 'qty': 20, 'type': 'fruit'}, {'_id': ObjectId('63fd8bd517191eaefaf3be52'), 'name': 'watermelon', 'qty': 100, 'type': 'fruit'}, {'_id': ObjectId('63fd90ad533c273c4e752e39'), 'name': 'mango', 'qty': 100}, {'_id': 12, 'name': 'mouse', 'qty': 20, 'type': 'it'}, {'_id': 201, 'name': 'something 2', 'qty': 500, 'type': 'UNKNOWN'}, {'_id': 202, 'name': 'something 3', 'qty': 500, 'type': 'UNKNOWN'}, {'_id': 203, 'name': 'something 4', 'qty': 500, 'type': 'UNKNOWN'}, {'_id': 204, 'name': 'something 5', 'qty': 1500, 'type': 'UNKNOWN'}] To delete all products with "UNKNOWN" type, we should use the `delete_many` function instead. ```python products.delete_many({'type':'UNKNOWN'}) data = list(products.find()) pprint(data) ``` [{'_id': ObjectId('63fd894717191eaefaf3be4f'), 'name': 'apple', 'qty': 15, 'type': 'fruit'}, {'_id': ObjectId('63fd89db17191eaefaf3be50'), 'name': 'orange', 'price': 8, 'qty': 13, 'type': 'fruit'}, {'_id': 10, 'name': 'hard disk', 'qty': 10, 'type': 'it'}, {'_id': ObjectId('63fd8bd517191eaefaf3be51'), 'name': 'pear', 'qty': 20, 'type': 'fruit'}, {'_id': ObjectId('63fd8bd517191eaefaf3be52'), 'name': 'watermelon', 'qty': 100, 'type': 'fruit'}, {'_id': ObjectId('63fd90ad533c273c4e752e39'), 'name': 'mango', 'qty': 100}, {'_id': 12, 'name': 'mouse', 'qty': 20, 'type': 'it'}] ## Importing Data In the previous section, we inserted documents manually. Of course, we may store a lot of data in data files and want to import them into MongoDB. We are going to learn how to import the data from a text file using the Python code. Let's download the data file named `annual-enterprise-survey-2021-financial-year-provisional-csv.txt` and store it in the comp7095 folder on the desktop. The data file is originally from Stats NZ Tatauranga Aotearoa. https://stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2021-financial-year-provisional/Download-data/annual-enterprise-survey-2021-financial-year-provisional-csv.csv To simplify the Python code for loading the data from the data file, the content of the data file is pre-processed using Microsoft Excel. The changes include: 1. Removed the comma from column `Value`. 2. Saved the file in Tab Delimited Text format (`.txt`). First, we use the `open` function to open the data file in `read-only` mode. ```python file = open('annual-enterprise-survey-2021-financial-year-provisional-csv.txt', 'r') ``` #### Process the Heading Line The first line in the data file is a heading line. We use the `readline` function to read it and name it `headings`. ```python headings = file.readline() ``` Let's see what we get from the file! The line contains the headings separated by a tab (`\t`) and ends with a newline character (`\n`). ```python headings ``` 'Year\tIndustry_aggregation_NZSIOC\tIndustry_code_NZSIOC\tIndustry_name_NZSIOC\tUnits\tVariable_code\tVariable_name\tVariable_category\tValue\tIndustry_code_ANZSIC06\n' We delete the `\n` and split the line using the `\t`. So, we have a list that contains the headings. Then, `headings` now refers to a list of the headings. ```python headings = headings.replace('\n', '').split('\t') headings ``` ['Year', 'Industry_aggregation_NZSIOC', 'Industry_code_NZSIOC', 'Industry_name_NZSIOC', 'Units', 'Variable_code', 'Variable_name', 'Variable_category', 'Value', 'Industry_code_ANZSIC06'] #### Process the Data Lines Next, we do the same thing for the data part. The procedure is that: 1. create an empty list and a counter. ```python temp = [] count = 0 ``` 2. Read all remaining lines from the file and name it `lines`. After that we close the file. ```python lines = file.readlines() file.close() ``` 3. Repeatedly do the following for each line: 1. Get a line from the list, delete the `\n` and split it using `\t`. 1. Create a new dictionary. 1. Increate the count by 1 and store it in the dictionary as an ID of the document we are going to create. 1. Use a for loop to put the values with corresponding headings to the dictionary. 1. Put the dictionary to the list. ```python for line in lines: values = line.replace('\n', '').split('\t') doc = {} count += 1 doc['_id'] = count for i in range(len(headings)): doc[headings[i]] = values[i] temp.append(doc) ``` 4. We get a collection named `companies`. To guarantee it is empty, we call its `delete_many` function to delete everything. ```python companies = db['companies'] companies.delete_many({}) ``` <pymongo.results.DeleteResult at 0x7faf535954f0> 5. Call the `insert_many` function with the list to create documents in MongoDB. ```python companies.insert_many(temp) ``` <pymongo.results.InsertManyResult at 0x7faf55cc6eb0> 6. Verify the insertion by getting the first 5 documents using the `find` function and `limit` function. ```python data = list(companies.find().limit(5)) pprint(data) ``` [{'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': '757504', 'Variable_category': 'Financial performance', 'Variable_code': 'H01', 'Variable_name': 'Total income', 'Year': '2021', '_id': 1}, {'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': '674890', 'Variable_category': 'Financial performance', 'Variable_code': 'H04', 'Variable_name': '"Sales, government funding, grants and subsidies"', 'Year': '2021', '_id': 2}, {'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': '49593', 'Variable_category': 'Financial performance', 'Variable_code': 'H05', 'Variable_name': '"Interest, dividends and donations"', 'Year': '2021', '_id': 3}, {'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': '33020', 'Variable_category': 'Financial performance', 'Variable_code': 'H07', 'Variable_name': 'Non-operating income', 'Year': '2021', '_id': 4}, {'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': '654404', 'Variable_category': 'Financial performance', 'Variable_code': 'H08', 'Variable_name': 'Total expenditure', 'Year': '2021', '_id': 5}] 7. By default, the data type of each field is string. We need to convert the `Value` field to integer. We use a for loop to retrieve every document and use the `update_one` function to update the value of the `Value` field. ```python for d in data: id = d['_id'] try: value = int(d['Value']) except: value = 0 ## We set the value to zero if the value cannot be converted to an integer. companies.update_one({'_id':id}, {'$set':{'Value':value}}) ``` Done! Let's see what we have! ```python data = list(companies.find().limit(5)) pprint(data) ``` [{'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': 757504, 'Variable_category': 'Financial performance', 'Variable_code': 'H01', 'Variable_name': 'Total income', 'Year': '2021', '_id': 1}, {'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': 674890, 'Variable_category': 'Financial performance', 'Variable_code': 'H04', 'Variable_name': '"Sales, government funding, grants and subsidies"', 'Year': '2021', '_id': 2}, {'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': 49593, 'Variable_category': 'Financial performance', 'Variable_code': 'H05', 'Variable_name': '"Interest, dividends and donations"', 'Year': '2021', '_id': 3}, {'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': 33020, 'Variable_category': 'Financial performance', 'Variable_code': 'H07', 'Variable_name': 'Non-operating income', 'Year': '2021', '_id': 4}, {'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': 654404, 'Variable_category': 'Financial performance', 'Variable_code': 'H08', 'Variable_name': 'Total expenditure', 'Year': '2021', '_id': 5}] We can use the comparison keywords to filter the documents, just like what we learnt about `mongosh`. ```python data = list(companies.find({'Value':{'$gt': 100000}}).limit(5)) pprint(data) ``` [{'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': 757504, 'Variable_category': 'Financial performance', 'Variable_code': 'H01', 'Variable_name': 'Total income', 'Year': '2021', '_id': 1}, {'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': 674890, 'Variable_category': 'Financial performance', 'Variable_code': 'H04', 'Variable_name': '"Sales, government funding, grants and subsidies"', 'Year': '2021', '_id': 2}, {'Industry_aggregation_NZSIOC': 'Level 1', 'Industry_code_ANZSIC06': '"ANZSIC06 divisions A-S (excluding classes K6330, ' 'L6711, O7552, O760, O771, O772, S9540, S9601, ' 'S9602, and S9603)"', 'Industry_code_NZSIOC': '99999', 'Industry_name_NZSIOC': 'All industries', 'Units': 'Dollars (millions)', 'Value': 654404, 'Variable_category': 'Financial performance', 'Variable_code': 'H08', 'Variable_name': 'Total expenditure', 'Year': '2021', '_id': 5}] We can also sort the output by using the `sort` function. ```python data = list(companies.find({'Value':{'$lte': -1000}}).sort('Value', -1).limit(5)) pprint(data) ``` [] ## Aggregate In Python, PyMongo package provides the `aggregate` function, similar to the `aggregate` function of `mongosh`. But, the argument of the `aggregate` function must be a list of dictionaries. Let's see the following example. We use the `aggregate` function to get the count of the documents in the `companies` collection: ```python data = list(companies.aggregate([{'$group':{'_id':'', 'count(*)':{'$sum':1}}}])) pprint(data) ``` [{'_id': '', 'count(*)': 41715}] You can think the code above as `SELECT count(*) FROM companies`. The following example shows how to retrieve the document counts of each year. You can think it as `SELECT Years, count(*) FROM companies GROUP BY Years`. ```python data = list(companies.aggregate([{'$group':{'_id':'$Year', 'count':{'$sum':1}}}])) pprint(data) ``` [{'_id': '2020', 'count': 4635}, {'_id': '2015', 'count': 4635}, {'_id': '2019', 'count': 4635}, {'_id': '2018', 'count': 4635}, {'_id': '2014', 'count': 4635}, {'_id': '2013', 'count': 4635}, {'_id': '2021', 'count': 4635}, {'_id': '2017', 'count': 4635}, {'_id': '2016', 'count': 4635}] In the following example, we add one more dictionary to sort the output by `_id` in descending order. ```python data = list(companies.aggregate([{'$group':{'_id':'$Year', 'count':{'$sum':1}}}, {'$sort': {'_id': -1}}])) pprint(data) ``` [{'_id': '2021', 'count': 4635}, {'_id': '2020', 'count': 4635}, {'_id': '2019', 'count': 4635}, {'_id': '2018', 'count': 4635}, {'_id': '2017', 'count': 4635}, {'_id': '2016', 'count': 4635}, {'_id': '2015', 'count': 4635}, {'_id': '2014', 'count': 4635}, {'_id': '2013', 'count': 4635}]