Lab notebook: Creating a stackoverflow service
Aim:
Render the top answer from a stackoverflow question inside Neera itself. Why, So that you don't have to open 5 tabs to determine if an SO question is worth yourtime.
How:
Use the bing api to get the relevant SO questions for a query
For each of these question, call the stackoverflow service to get the top answer.
Render the top answer inside Neera itself
Steps:
Rendering is easy, won’t talk about it.
Download the zip of all Stackoverflow Post. Stackoverflow updates this data dump every 3 months.
This zip is 17GB, so use some download manager which download parallely, and restart stopped donwloads. We used axel
axel -a https://archive.org/download/stackexchange/stackoverflow.com-Posts.7z
Extract the zip. Extracting the dump takes few hours, so be patient.
The zip extracts into a 86GB xml file with each line containing a stackoverflow post.sudo apt-get install p7zip-full 7z x stackoverflow.com-Posts.7z
Push the xml into a SQLite store. This was the hardest part. Finally used this script https://meta.stackexchange.com/a/286488/992029. Note: You’ll have to rename Posts.xml to lowercase posts.xml
SQLite is really good if the writes are very infrequent.
Create index on ‘Id’ and ‘ParentId’ keys
$ sqlite3 so-dump.db # Connect to sqlite store
CREATE INDEX posts_parent_idx ON posts (ParentId); CREATE UNIQUE INDEX posts_ids_idx ON posts (Id);
Server the answers through a node.js service
const sqlite3 = require('sqlite3').verbose(); const express = require("express") const morgan = require('morgan') const cors = require('cors') const app = express() app.use(express.json()) app.use(cors()) app.options('*', cors()) app.use(morgan('tiny')) let db = new sqlite3.Database('./so-dump.db', (err) => { if (err) { console.error(err.message); } console.log('Connected to the SO database.'); }); app.get("/api/stackoverflow.com/answers", async function(req, res) { const questionId = decodeURIComponent(req.query.question_id); db.all("SELECT * FROM posts where Id= (SELECT AcceptedAnswerId FROM posts where Id=$1)", [questionId], (err, rows) => { if(rows.length > 0) { res.send(rows[0]); } else { db.all("select * from posts where ParentId=$1 ORDER BY Score DESC LIMIT 1;", [questionId], (err2, rows2) => { res.send(rows2[0]); }) } }) }) app.get("/api/stackoverflow.com/questions", async function(req, res) { const questionId = decodeURIComponent(req.query.question_id); try{ db.all("select * from posts where Id=$1;", [questionId], (err, rows) => { res.send(rows[0]); }) } catch (err) { res.status(404) } }) port = process.env.port || process.env.PORT || 5000 app.listen(port)
Appendix:
Things we tried but didn’t work
Offical API
Stackoverflow’s Official API is heavily rate limited my IP got banned for 15 hours within 15 minutes.
Postgres
Did use postgres initally, but 86GB is too much. It got full.
How to use the SQLite dump
Download the sqlite dump zip
axel -a https://archive.org/download/stackover-sqlite.db/copy-so-dump.db.bz2
Extract the zip. (Using lbzip2 because its fast and does the decompresion in all available cores of the CPU)
lbzip2 -d copy-so-dump.db.bz2
Rename the dump to so-dump.db and use the above nodejs code