<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.3">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2023-10-11T12:08:55+00:00</updated><id>/feed.xml</id><title type="html">Ketan’s Blog</title><subtitle>Notes about the stuff that I do or find interesting.</subtitle><entry><title type="html">Running Awk in parallel to process 256M records</title><link href="/posts/2020/05/24/SMC18-Data-Challenge-4.html" rel="alternate" type="text/html" title="Running Awk in parallel to process 256M records" /><published>2020-05-24T00:00:00+00:00</published><updated>2020-05-24T00:00:00+00:00</updated><id>/posts/2020/05/24/SMC18-Data-Challenge-4</id><content type="html" xml:base="/posts/2020/05/24/SMC18-Data-Challenge-4.html">&lt;h3 id=&quot;tldr&quot;&gt;TL;DR&lt;/h3&gt;
&lt;p&gt;Awk crunches massive data; a High Performance Computing (HPC) script runs
hundreds of Awk processes concurrently. The result is a fast, scalable, in-memory
solution on a fat machine.&lt;/p&gt;

&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;This post presents the solution I worked on in 2018 for a &lt;a href=&quot;https://smc-datachallenge.ornl.gov/challenges-2018/&quot;&gt;Data
Challenge&lt;/a&gt; organized at
work. I solve the Scientific Publications Mining challenge (no. 4), which consists
of 5 problems. I use classic Unix tools together with a modern, scalable HPC scripting
tool to work out the solutions. The project is hosted on
&lt;a href=&quot;https://github.com/ketancmaheshwari/SMC18&quot;&gt;GitHub&lt;/a&gt;. About 12 teams entered the
contest.&lt;/p&gt;

&lt;h1 id=&quot;tools&quot;&gt;Tools&lt;/h1&gt;

&lt;h2 id=&quot;software&quot;&gt;Software&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Awk&lt;/strong&gt; (gawk v4.0.2) is used for the bulk of the core processing.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://swift-lang.org/Swift-T&quot;&gt;Swift&lt;/a&gt; (&lt;strong&gt;NOT&lt;/strong&gt; the Apple Swift), an HPC scripting tool
developed at Argonne National Laboratory, is used to run
the Awk programs concurrently over the dataset to radically improve
performance. Swift uses MPI-based communication to parallelize and synchronize
independent tasks.&lt;/p&gt;

&lt;p&gt;Other Unix tools such as &lt;em&gt;sort&lt;/em&gt;, &lt;em&gt;grep&lt;/em&gt;, &lt;em&gt;tr&lt;/em&gt;, &lt;em&gt;sed&lt;/em&gt; and &lt;em&gt;bash&lt;/em&gt; are used as
well. Additionally, &lt;em&gt;jq&lt;/em&gt;, &lt;em&gt;D3&lt;/em&gt;, &lt;em&gt;dot/graphviz&lt;/em&gt;, and &lt;em&gt;ffmpeg&lt;/em&gt; are used.&lt;/p&gt;

&lt;h2 id=&quot;hardware&quot;&gt;Hardware&lt;/h2&gt;

&lt;p&gt;Fortunately, I had access to a large-memory (24 TB) SGI system with 512-core
Intel Xeon (2.5GHz) CPUs. All I/O is memory-bound, i.e. the data
is read from and written to &lt;em&gt;/dev/shm&lt;/em&gt;.&lt;/p&gt;

&lt;h3 id=&quot;rationale&quot;&gt;Rationale&lt;/h3&gt;

&lt;p&gt;Awk is lightweight, concise, expressive, and fast, especially for text-processing
applications. Some people find Awk programs terse and hard to read, so I have
taken care to make the code readable. I wanted to see how far I could go with Awk
(and boy did I go far!). Alternative tools such as modern Python libraries
sometimes have scaling limitations and portability concerns, and some are still
evolving. Swift is used simply because I was familiar with it and confident
that it would scale well in this case.&lt;/p&gt;

&lt;h1 id=&quot;data&quot;&gt;Data&lt;/h1&gt;

&lt;p&gt;The original &lt;a href=&quot;https://www.openacademic.ai/oag&quot;&gt;data&lt;/a&gt; came as 322 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;json&lt;/code&gt; files
in two sets (&lt;em&gt;aminer&lt;/em&gt; and &lt;em&gt;mag&lt;/em&gt;), each file containing a million records. A file
listing the records common to both sets was available. An Awk script
(&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src/filterdup.awk&lt;/code&gt;) is used to exclude these duplicate records from the aminer
dataset. This left about &lt;strong&gt;256 million&lt;/strong&gt; (256,382,605 to be
exact) unique records to process. The total data size is 329GB. Some
fields in the data are &lt;em&gt;null&lt;/em&gt;; such records are skipped where relevant.
Records for non-English publications are also skipped as
needed. A
&lt;a href=&quot;https://raw.githubusercontent.com/ketancmaheshwari/SMC18/master/data/aminer_papers_sample.allcols.excl.txt&quot;&gt;snapshot&lt;/a&gt;
of the tabular data is available. The string &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qwqw&lt;/code&gt; is chosen as the column separator because
it does not clash with text already found in the data. All combinations of three or fewer
characters already occurred in the data, which ruled them out as separators.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;#!/usr/bin/env awk -f&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# $1 magid&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# $2 aminerid&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Filter duplicate papers and remove them&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# from aminer database based on the linking relationship &lt;/span&gt;

BEGIN &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;FS &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; OFS &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;qwqw&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

NR &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; FNR &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;a[&lt;span class=&quot;nv&quot;&gt;$2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;!(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$1&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;a&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; FILENAME ~ /aminer/ &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; print &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NR == FNR&lt;/code&gt; is a cool Awk idiom that is true only while the first file is being read. Awk resets FNR (the record number within the current file) at the start of each input file, whereas NR (the total number of records read so far) keeps counting, so the two are equal only in the first file.&lt;/p&gt;
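
&lt;p&gt;The idiom is easiest to see on toy input. The sketch below is not part of the project: the file names and ids are made up for illustration, and it adds a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;next&lt;/code&gt; so that lines of the first file never reach the second rule. One caveat worth knowing: if the first file is empty, the condition is also true while the second file is read.&lt;/p&gt;

```shell
# Work in a scratch directory so the demo leaves nothing behind.
workdir=$(mktemp -d)

# First file: a lookup table of (hypothetical) ids to exclude.
printf 'id2\nid4\n' > "$workdir/exclude.txt"
# Second file: the records to filter, as "id value" pairs.
printf 'id1 alpha\nid2 beta\nid3 gamma\nid4 delta\n' > "$workdir/records.txt"

# NR == FNR holds only while awk reads the first file, so that rule loads
# the lookup table; "next" keeps those lines away from the second rule.
kept=$(awk 'NR == FNR { skip[$1]; next }
            !($1 in skip) { print $2 }' \
           "$workdir/exclude.txt" "$workdir/records.txt")

echo "$kept"
rm -rf "$workdir"
```

&lt;p&gt;Only the records whose id is absent from the first file are printed, here &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;alpha&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gamma&lt;/code&gt;.&lt;/p&gt;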

&lt;p&gt;In addition to the publications data, I use the following:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;A list of large cities (population 100K+) and their lat-long coordinates (3,517).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A list of countries (190).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A list of world universities and research institutes (8,984).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A list of stop-words to avoid in some of the results (161 words).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id=&quot;solutions&quot;&gt;Solutions&lt;/h1&gt;

&lt;h2 id=&quot;pre--and-post-processing&quot;&gt;Pre- and post-processing&lt;/h2&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jq&lt;/code&gt; is used to transform the JSON data to tabular format
(&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src/json2tabular.sh&lt;/code&gt;). The converted tabular files have 19 original columns
(&lt;strong&gt;id&lt;/strong&gt;, &lt;strong&gt;title&lt;/strong&gt;, &lt;strong&gt;authors&lt;/strong&gt;, &lt;strong&gt;year&lt;/strong&gt;, &lt;strong&gt;citations&lt;/strong&gt;, etc.) and one
additional column called &lt;strong&gt;num_authors&lt;/strong&gt; that holds the number of authors for a
given publication record. The authors column uses a semicolon to separate
multiple authors. The tabular data is curated further by removing
extraneous spaces, square brackets, escape characters and quotes using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sed&lt;/code&gt;.&lt;/p&gt;
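
&lt;p&gt;As a rough illustration of that cleanup step, here is a minimal sketch. It is not the project's actual script: the sample field is made up, and it assumes that &quot;escape characters&quot; means backslashes.&lt;/p&gt;

```shell
# A raw field as it might come out of the JSON conversion (made-up sample).
raw='["Deep  Learning \"survey\""]'

# One sed expression per cleanup rule:
#   1. drop square brackets   2. drop backslashes
#   3. drop double quotes     4. squeeze runs of spaces into one
cleaned=$(printf '%s\n' "$raw" |
  sed -e 's/[][]//g' \
      -e 's/\\//g' \
      -e 's/"//g' \
      -e 's/  */ /g')

echo "$cleaned"
```

&lt;p&gt;For the sample above this prints &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Deep Learning survey&lt;/code&gt;.&lt;/p&gt;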

&lt;p&gt;Some of the results obtained were post-processed for visualization using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;D3&lt;/code&gt;
graphics framework. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ffmpeg&lt;/code&gt; is used to stitch images of trending terms to
create an animation. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dot/graphviz&lt;/code&gt; is used to build the massive citation
network graph of the best paper.&lt;/p&gt;

&lt;h2 id=&quot;scaling-up&quot;&gt;Scaling up&lt;/h2&gt;

&lt;p&gt;Each solution has Awk code run concurrently over the 322 data files on 322 CPU
cores using Swift. This resulted in a radical speedup at scale. None of the
solutions took more than an hour to run; most took less than a minute.&lt;/p&gt;
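
&lt;p&gt;The Swift script itself is not shown here. As a rough single-machine stand-in, the same fan-out pattern can be sketched with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xargs -P&lt;/code&gt;, which is a different tool from the one used in the project: one Awk process per input file, each writing its own partial result, followed by a merge step. The file names, the two-column layout, and the toy data below are all hypothetical.&lt;/p&gt;

```shell
workdir=$(mktemp -d)

# Three tiny stand-ins for the per-file inputs: "author citations" per line.
printf 'smith 900\njones 1500\n' > "$workdir/part1.txt"
printf 'jones 2000\nlee 50\n' > "$workdir/part2.txt"
printf 'smith 1200\njones 10\n' > "$workdir/part3.txt"

# Per-file Awk task: count each author's rows with more than 1000 citations.
printf '%s\n' '$2 > 1000 { n[$1]++ }' \
              'END { for (a in n) print a, n[a] }' > "$workdir/count.awk"

# Fan out: up to 3 Awk processes at a time, one per input file.
# Each task writes to its own .out file, so the tasks share no state.
printf '%s\n' "$workdir"/part*.txt |
  xargs -P 3 -I {} sh -c 'awk -f "$1" "$2" > "$2.out"' _ "$workdir/count.awk" {}

# Merge: add up the partial counts, then sort by count (descending) and name.
merged=$(awk '{ total[$1] += $2 } END { for (a in total) print total[a], a }' \
           "$workdir"/part*.txt.out | sort -k1,1nr -k2,2)

echo "$merged"
rm -rf "$workdir"
```

&lt;p&gt;Because every task reads one file and writes one file, the tasks are independent and the merge is a cheap final pass. Note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-P&lt;/code&gt; is not part of POSIX &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xargs&lt;/code&gt;, although both the GNU and BSD versions support it.&lt;/p&gt;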

&lt;h3 id=&quot;problem-1&quot;&gt;Problem 1&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Identify the individual or group of individuals who appear to be the expert in a particular field or sub-field.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is solved in two ways. The first approach identifies all the entries with
more than 500 citations for a given search topic
(&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;results/meditation_highly_cited.txt&lt;/code&gt;).&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;#!/usr/bin/env awk -f&lt;/span&gt;

BEGIN &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;

    &lt;span class=&quot;c&quot;&gt;# Field Separator&lt;/span&gt;
    FS &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;qwqw&quot;&lt;/span&gt;
    &lt;span class=&quot;c&quot;&gt;# Output field separator&lt;/span&gt;
    OFS &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\t&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
    IGNORECASE &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 1

    &lt;span class=&quot;c&quot;&gt;# Field names for readability&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;2&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;num_authors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;3&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;doi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;4&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;fos_isbn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;5&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;doctype_issn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;6&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;lang&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;7&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;n_citation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;8&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;issue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;9&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;10&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;volume&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;11&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;page_start&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;12&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;page_end&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;13&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;14&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;venue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;15&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;publisher_pdf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;16&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;references&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;17&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;keywords&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;18&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;abstract&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;19&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;authors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;20&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$0&lt;/span&gt;~topic &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$num_authors&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; 0 &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$n_citation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;~/null/ &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$n_citation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;500&lt;span class=&quot;o&quot;&gt;){&lt;/span&gt;
    print &lt;span class=&quot;nv&quot;&gt;$n_citation&lt;/span&gt;, &lt;span class=&quot;nv&quot;&gt;$title&lt;/span&gt;, &lt;span class=&quot;nv&quot;&gt;$authors&lt;/span&gt;, &lt;span class=&quot;nv&quot;&gt;$year&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# How to run:&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# awk -v topic=meditation -f src/prob1_p1.awk data/mag_papers_sample.allcols.txt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The second approach finds the authors whose names appear repeatedly for the
queried topic, counting only entries with more than a given number of citations. This
gives a reasonable idea of who the expert figures are in a given research area.
One such result, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;results/cancer_research_topauths.txt&lt;/code&gt;, lists authors in
cancer research with more than one publication with more than 1,000 citations.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;#!/usr/bin/env awk -f&lt;/span&gt;

BEGIN &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c&quot;&gt;# Field separator&lt;/span&gt;
    FS &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;qwqw&quot;&lt;/span&gt;
    OFS &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\t&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
    IGNORECASE &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 1

    &lt;span class=&quot;c&quot;&gt;# Field names&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;2&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;num_authors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;3&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;doi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;4&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;fos_isbn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;5&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;doctype_issn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;6&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;lang&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;7&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;n_citation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;8&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;issue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;9&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;10&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;volume&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;11&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;page_start&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;12&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;page_end&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;13&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;14&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;venue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;15&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;publisher_pdf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;16&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;references&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;17&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;keywords&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;18&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;abstract&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;19&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;authors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;20&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$0&lt;/span&gt;~topic &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$num_authors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;0 &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$n_citation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;~/null/ &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$n_citation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;1000&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
   
   &lt;span class=&quot;c&quot;&gt;# find the authors whose names are repeating for a particular topic.&lt;/span&gt;
   &lt;span class=&quot;c&quot;&gt;# Those authors will be considered experts. &lt;/span&gt;
   
   gsub&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;,&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;,&lt;span class=&quot;nv&quot;&gt;$authors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
   &lt;span class=&quot;nb&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$authors&lt;/span&gt;, a, &lt;span class=&quot;s2&quot;&gt;&quot;;&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
   
   &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;i &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;a&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
       &lt;span class=&quot;nb&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;a[i], b, &lt;span class=&quot;s2&quot;&gt;&quot;,&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
       &lt;span class=&quot;c&quot;&gt;# auths array will have keys as auth names and the &lt;/span&gt;
       &lt;span class=&quot;c&quot;&gt;# element value increases if the key repeats&lt;/span&gt;
       &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;b[1]!~/null/&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; auths[b[1]]++ 
   &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

END &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;k &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;auths&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;auths[k]&amp;gt;1&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; print auths[k], k &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#How to run:&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# awk -v topic=cancer -f src/prob1_p2.awk data/mag_papers_sample.allcols.txt&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# sort the results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The HPC implementation of this solution finishes in &lt;strong&gt;25 seconds&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The citation &lt;strong&gt;network graph&lt;/strong&gt; of the most cited paper is shown in this
&lt;a href=&quot;https://raw.githubusercontent.com/ketancmaheshwari/SMC18/15b0519d789b0e4b86f66b6bb6199fe24c1a4730/results/best_papers.svg&quot;&gt;diagram&lt;/a&gt;
(too big to fit here). The result of a query for the all-time most cited
papers, with a threshold of 20,000 citations, is in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;results/top_papers.txt&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&quot;problem-2&quot;&gt;Problem 2&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Identify topics that have been researched across all publications.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is solved by identifying the most frequently appearing words in the
collection. Titles, abstracts and keywords are parsed, and the 1,000 most frequently
occurring words across the whole collection are found. Several common words (aka
&lt;em&gt;stop-words&lt;/em&gt;) are filtered from the results. With over 23 million occurrences, the word
“patients” is the most frequent. The full list of the top 1,000 words is
found in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/results/top_1K_words_kw_abs_title.txt&lt;/code&gt;. The target collection of
publications may be narrowed down by criteria such as a range of years.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;#!/usr/bin/env awk -f&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Problem  Statement&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#    Identify topics that have been researched across all publications.  &lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Solution:&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# step1. Filter the input to English language records &lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# step2. Eliminate unnecessary content such as punctuation,&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#        non-printable chars and small words such as &lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#        1 letter and 2 letter words&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# step3. Extract words used in keywords, title and abstract&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# step4. Find most frequently used words &lt;/span&gt;

BEGIN &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    FS &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;qwqw&quot;&lt;/span&gt;
    IGNORECASE &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 1
  
    &lt;span class=&quot;c&quot;&gt;# Field names&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;2&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;num_authors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;3&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;doi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;4&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;fos_isbn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;5&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; 
    &lt;span class=&quot;nv&quot;&gt;doctype_issn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;6&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;lang&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;7&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;n_citation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;8&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;issue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;9&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;10&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;volume&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;11&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;page_start&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;12&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;page_end&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;13&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;14&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;venue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;15&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; 
    &lt;span class=&quot;nv&quot;&gt;publisher_pdf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;16&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;references&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;17&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;keywords&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;18&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;abstract&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;19&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;authors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;20&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#collect stop words&lt;/span&gt;
NR &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; FNR &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;x[&lt;span class=&quot;nv&quot;&gt;$1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;next&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;nv&quot;&gt;$lang&lt;/span&gt;~/en/ &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$keywords&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;~/null/ &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$title&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;~/null/ &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$abstract&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;~/null/&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c&quot;&gt;# treat titles&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;$title&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; tolower&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$title&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$title&lt;/span&gt;, a, &lt;span class=&quot;s2&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;i &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;a&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;length&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;a[i]&lt;span class=&quot;o&quot;&gt;)&amp;gt;&lt;/span&gt;2 &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; match&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;a[i],/[a-z]/&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; a[i] &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; 0&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; kw[a[i]]++

    &lt;span class=&quot;c&quot;&gt;# treat keywords&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;$keywords&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; tolower&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$keywords&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$keywords&lt;/span&gt;, b, &lt;span class=&quot;s2&quot;&gt;&quot;,&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;i &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;b&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;length&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;b[i]&lt;span class=&quot;o&quot;&gt;)&amp;gt;&lt;/span&gt;2 &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; match&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;b[i],/[a-z]/&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; b[i] &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; 0&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; kw[b[i]]++

     &lt;span class=&quot;c&quot;&gt;# treat abstracts (Computationally expensive, results are in:&lt;/span&gt;
     &lt;span class=&quot;c&quot;&gt;# top_1000_words_from_kw_abstract_title_by_freq.txt)&lt;/span&gt;
     &lt;span class=&quot;nv&quot;&gt;$abstract&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; tolower&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$abstract&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
     gsub&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;,&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;,&lt;span class=&quot;nv&quot;&gt;$abstract&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
     gsub&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;,&quot;&lt;/span&gt;,&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;,&lt;span class=&quot;nv&quot;&gt;$abstract&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
     &lt;span class=&quot;nb&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$abstract&lt;/span&gt;, c, &lt;span class=&quot;s2&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;i &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;c&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;length&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;c[i]&lt;span class=&quot;o&quot;&gt;)&amp;gt;&lt;/span&gt;2 &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; match&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;c[i],/[a-z]/&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; c[i] &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; 0&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; kw[c[i]]++
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

END &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;k &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;kw&lt;span class=&quot;o&quot;&gt;){&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;kw[k]&amp;gt;1000&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; print kw[k], k
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# HOW TO RUN:&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# LC_ALL=C awk -f prob2.awk stop_words.txt \&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#     ../aminer_papers_allcols_excl/aminer_papers_*.allcols.excl.txt \&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#     ../mag_papers_allcols/mag_papers_*.allcols.txt \&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#     | sort -nr &amp;gt; freq.txt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The HPC implementation (Swift code shown below) finishes in &lt;strong&gt;9 minutes&lt;/strong&gt;.&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;files&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;cm&quot;&gt;/* app defines what we want to run, the input parameters,
   where the stdout should go, etc.
*/&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;app&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;myawk&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;awkprog&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;file&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stop_words&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;file&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;infile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;
  &lt;span class=&quot;s&quot;&gt;&quot;/usr/bin/awk&quot;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;-f&quot;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;awkprog&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stop_words&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;infile&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stdout&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;cm&quot;&gt;/* populate the input data */&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aminer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;glob&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/dev/shm/aminer_mag_papers/*.txt&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;cm&quot;&gt;/* output for each call will be collected here */&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;outfiles&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[];&lt;/span&gt; 

&lt;span class=&quot;n&quot;&gt;foreach&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aminer&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;outfiles&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;myawk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/home/km0/SMC18/src/prob2.awk&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/home/km0/SMC18/data/stop_words.txt&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; 
                &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;cm&quot;&gt;/* Combine all output in one file */&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;joined&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;joined.txt&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;outfiles&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;cm&quot;&gt;/*
 After running this swift app:
 awk '{a[$2]+=$1} END {for (k in a) print a[k],k}' joined.txt | sort -nr &amp;gt; freq.txt
*/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
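
&lt;p&gt;The post-processing command in the closing comment of the Swift script merges the per-file counts collected in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;joined.txt&lt;/code&gt; into global totals. A minimal sketch of that merge on a tiny made-up input follows; the words and counts are illustrative assumptions, not values from the dataset. One caveat that follows from the code as shown: each parallel Awk call prints only the words that occur more than 1000 times in its own file, so the merged totals may differ slightly from those of a single serial pass.&lt;/p&gt;

```shell
# Toy stand-in for joined.txt: "count word" lines as printed by three
# parallel Awk calls. The words and counts are made up for illustration.
merged=$(printf '%s\n' '1200 protein' '1500 cell' '1100 protein' '3000 cell' '2000 gene' |
    awk '{ total[$2] += $1 } END { for (w in total) print total[w], w }' |
    sort -nr)
printf '%s\n' "$merged"
```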

&lt;h3 id=&quot;problem-3&quot;&gt;Problem 3&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Visualize the geographic distribution of the topics in the publications.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is solved by identifying the author affiliations for the records that have
the search topic in them. Each affiliation is searched against three
databases (cities, universities and countries) to find the geographic
location of that research. The results are aggregated to present a list of
centers for which a given keyword appears most frequently. For cities, the
results are plotted on a world map. One such result is shown below for the topic
of research on “birds”.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/ketancmaheshwari/SMC18/master/results/bird_research_cities.png&quot; alt=&quot;bird research&quot; title=&quot;Bird Research Around the World!&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;results/&lt;/code&gt; directory contains similar results for other topics, such as epilepsy,
opioid and meditation research, by university and by country. The HPC
implementation finishes in &lt;strong&gt;25 seconds&lt;/strong&gt;. The Awk code is shown below.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;#!/usr/bin/env awk -f&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# problem statement&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#    visualize the geographic distribution of the topics in the publications.&lt;/span&gt;

BEGIN &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c&quot;&gt;# Field separator&lt;/span&gt;
    FS &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; OFS &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;qwqw&quot;&lt;/span&gt;
    IGNORECASE &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 1

    &lt;span class=&quot;c&quot;&gt;# Field names&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;2&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;num_authors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;3&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;doi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;4&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;fos_isbn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;5&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;doctype_issn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;6&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;lang&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;7&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;n_citation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;8&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;issue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;9&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;10&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;volume&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;11&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;page_start&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;12&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;page_end&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;13&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;14&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;venue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;15&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;publisher_pdf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;16&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;references&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;17&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;keywords&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;18&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;abstract&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;19&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;authors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;20&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#collect the countries/cities/univs data&lt;/span&gt;
NR &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; FNR &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;a[&lt;span class=&quot;nv&quot;&gt;$1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;next&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; 

&lt;span class=&quot;c&quot;&gt;#treat records with authors whose affiliation is available&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$0&lt;/span&gt;~topic &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$num_authors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;~/null/ &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$authors&lt;/span&gt;~/&lt;span class=&quot;se&quot;&gt;\,&lt;/span&gt;/ &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; 
    &lt;span class=&quot;c&quot;&gt;# extract words from author affiliation and compare with the countries.&lt;/span&gt;
    &lt;span class=&quot;c&quot;&gt;# If a match is found increment that array entry.&lt;/span&gt;
    w &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$authors&lt;/span&gt;, b, &lt;span class=&quot;s2&quot;&gt;&quot;,&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;c&quot;&gt;# split() fills b[1]..b[w], so iterate from 1 to w&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;i&amp;lt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;w&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;i++&lt;span class=&quot;o&quot;&gt;){&lt;/span&gt;
        gsub&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;;&quot;&lt;/span&gt;,&lt;span class=&quot;s2&quot;&gt;&quot; &quot;&lt;/span&gt;,b[i]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; 
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;b[i] &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;a&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; a[b[i]]++
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

END &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;k &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;a&lt;span class=&quot;o&quot;&gt;){&lt;/span&gt; 
     &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;a[k]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; print a[k], k
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# HOW TO RUN:&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# awk -v topic=birds -f prob3.awk cities.txt \&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#     ../mag_papers_allcols/mag_papers_*.allcols.txt \&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#     ../aminer_papers_allcols_excl/aminer_papers_*.allcols.excl.txt&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# awk -v topic=birds -f prob3.awk countries.txt ...&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# awk -v topic=birds -f prob3.awk universities.txt ... &lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Run the following pipeline on the results:&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# sort -nr -k 1 citywise_papers.txt &amp;gt; tmp &amp;amp;&amp;amp; mv tmp citywise_papers.txt &lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# OR&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# After running the swift app:&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# awk -F: '{a[$2]+=$1} END {for (k in a) print a[k],k}' joined_cities.txt \&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#     | sort -nr &amp;gt; tmp &amp;amp;&amp;amp; mv tmp joined_cities.txt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
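
&lt;p&gt;The lookup-then-count pattern of the script above can be tried on a tiny made-up input. The sketch below is a simplification under stated assumptions: it uses two-column records (title and authors) instead of the 20-column layout, calls &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tolower()&lt;/code&gt; in place of gawk's &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IGNORECASE&lt;/code&gt; so that it also runs under other Awk implementations, and the city names, affiliations and temporary file names are placeholders.&lt;/p&gt;

```shell
# Toy data; the city names and affiliation strings are made up for illustration.
tmp=$(mktemp -d)
printf '%s\n' 'Paris' 'Tokyo' 'Lima' > "$tmp/cities.txt"
printf '%s\n' \
    'Bird songqwqwA. Author,Some Lab,Paris' \
    'Bird nestsqwqwB. Author,Tokyo' \
    'Fish farmsqwqwC. Author,Paris' \
    'Bird flightqwqwD. Author,Other Univ,Paris' > "$tmp/papers.txt"

counts=$(awk -v topic=bird '
    BEGIN { FS = "qwqw" }
    NR == FNR { seen[$1] = 0; next }        # first file: load the lookup list
    tolower($0) ~ topic {                   # later files: records on the topic
        split($2, part, ",")                # split() fills part[1] .. part[n]
        for (i in part)
            if (part[i] in seen) seen[part[i]]++
    }
    END { for (c in seen) if (seen[c]) print seen[c], c }
' "$tmp/cities.txt" "$tmp/papers.txt" | sort -nr)
rm -r "$tmp"
printf '%s\n' "$counts"
```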

&lt;h3 id=&quot;problem-4&quot;&gt;Problem 4&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Identify how topics have shifted over time.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This problem may be solved in three distinct ways. The first approach processes
the database to find the year-wise occurrence of any two given topics
&lt;em&gt;together&lt;/em&gt;. It generates a list of years and the number of publications in
each year in which &lt;em&gt;both&lt;/em&gt; topics occur. For example, the plot
below shows how the terms “obesity” and “sugar” have trended together in
publications over the years.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/ketancmaheshwari/SMC18/master/results/obesity_sugar.png&quot; alt=&quot;obesity sugar&quot; title=&quot;papers in which obesity and sugar appears together&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The Awk code is shown below.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;#!/usr/bin/env awk -f&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Problem Statement&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Identify how topics have shifted over time.&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Solution 1 below searches for any two given topics&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# and shows, year-wise, the number of records in which both topics occur&lt;/span&gt;
BEGIN &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c&quot;&gt;# Field Separator&lt;/span&gt;
    FS &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;qwqw&quot;&lt;/span&gt;
    IGNORECASE &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 1
    &lt;span class=&quot;c&quot;&gt;# Field names&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;2&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;num_authors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;3&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;doi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;4&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;fos_isbn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;5&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;doctype_issn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;6&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;lang&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;7&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;n_citation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;8&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;issue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;9&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;10&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;volume&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;11&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;page_start&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;12&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;page_end&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;13&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;14&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;venue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;15&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;publisher_pdf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;16&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;references&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;17&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;keywords&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;18&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;abstract&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;19&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;authors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;20&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;nv&quot;&gt;$lang&lt;/span&gt;~/en/ &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$year&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;~/null/ &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$0&lt;/span&gt;~topic1 &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$0&lt;/span&gt;~topic2 &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    a[&lt;span class=&quot;nv&quot;&gt;$year&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;++
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

END &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    n &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; asorti&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;a,b&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;printf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Trend for topics: %s, %s&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;, topic1, topic2&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;i&amp;lt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;n&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;i++&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;printf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;%d :- %d&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;, b[i], a[b[i]]&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Run as follows:&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# awk -v topic1=obesity -v topic2=sugar -f code/prob4.awk aminer_mag_papers/*.txt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
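
&lt;p&gt;To see the co-occurrence count at work without the full dataset, here is a minimal sketch on a few made-up records. It assumes a reduced two-column layout (year, then text) rather than the 20 columns used above, and it lowercases the text with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tolower()&lt;/code&gt; instead of relying on gawk's &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IGNORECASE&lt;/code&gt;; the records themselves are invented.&lt;/p&gt;

```shell
# Toy records: year, separator, text. The contents are made up for illustration.
trend=$(printf '%s\n' \
    '2001qwqwObesity and sugar intake' \
    '2001qwqwSugar cane farming' \
    '2003qwqwSugar, obesity and diet' \
    '2003qwqwChildhood OBESITY linked to sugar' |
    awk -v topic1=obesity -v topic2=sugar '
        BEGIN { FS = "qwqw" }
        {
            text = tolower($2)
            # count the record only when both topics occur in it
            if (text ~ topic1) if (text ~ topic2) hits[$1]++
        }
        END { for (y in hits) print y, hits[y] }
    ' | sort -n)
printf '%s\n' "$trend"
```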

&lt;p&gt;The second approach finds the paper that has the highest impact in each year and
extracts its keywords. Impact is measured by citation count: for each year, the
paper with the most citations is selected. The results for this task are in
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;results/yearwise_trending_keywords.txt&lt;/code&gt; in the form of (year, keywords,
citations) triplets.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;#!/usr/bin/env awk -f&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Problem Statement&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#    Identify how topics have shifted over time.&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Solution 2 is to find the highest cited paper&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# year-wise and figure out the topics it was based on&lt;/span&gt;
BEGIN &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    FS &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; OFS &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;qwqw&quot;&lt;/span&gt;
    IGNORECASE &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 1

    &lt;span class=&quot;c&quot;&gt;# Field names&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;2&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;num_authors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;3&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;doi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;4&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;fos_isbn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;5&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;doctype_issn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;6&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;lang&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;7&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;n_citation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;8&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;issue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;9&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;10&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;volume&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;11&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;page_start&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;12&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;page_end&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;13&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;14&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;venue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;15&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;publisher_pdf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;16&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;references&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;17&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;keywords&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;18&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;abstract&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;19&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;authors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;20&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;nv&quot;&gt;$lang&lt;/span&gt;~/en/ &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$year&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;~/null/ &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$year&lt;/span&gt;&amp;lt;2020 &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$keywords&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;~/null/ &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$n_citation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;~/null/ &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$n_citation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;max[&lt;span class=&quot;nv&quot;&gt;$year&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    max[&lt;span class=&quot;nv&quot;&gt;$year&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$n_citation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; a[&lt;span class=&quot;nv&quot;&gt;$year&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$keywords&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

END &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    n &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; asorti&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;a,b&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;i&amp;lt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;n&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;i++&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; print b[i], a[b[i]], max[b[i]]
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Run via Swift in parallel. If serial, run like so:&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# awk -f code/prob4_p2.awk aminer_mag_papers/*.txt &amp;gt; yearwise_trending_keywords.txt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
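
&lt;p&gt;The per-year maximum can be illustrated on a handful of made-up records. This sketch assumes a reduced three-column layout (year, citation count, keywords) and invented values; adding 0 to both sides of the comparison is a choice made here to force a numeric comparison, not something taken from the original script.&lt;/p&gt;

```shell
# Toy records: year, citation count, keywords. All values are made up.
top=$(printf '%s\n' \
    '1999qwqw12qwqwgraphs,trees' \
    '1999qwqw40qwqwcaches' \
    '2005qwqw7qwqwcompilers' \
    '2005qwqw3qwqwparsers' |
    awk '
        BEGIN { FS = "qwqw" }
        # $2 + 0 forces a numeric comparison; a missing best[] entry counts as 0
        $2 + 0 > best[$1] + 0 { best[$1] = $2 + 0; kw[$1] = $3 }
        END { for (y in best) print y, kw[y], best[y] }
    ' | sort -n)
printf '%s\n' "$top"
```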

&lt;p&gt;The third approach finds the top 10 most frequently occurring terms each year
to see how topics move in and out of trend over the years. An mkv animation
video showing a bubble plot of words trending between the years 1800 and 2017 is
&lt;a href=&quot;https://github.com/ketancmaheshwari/SMC18/blob/master/results/freqwordsoveryears.mkv?raw=true&quot;&gt;here&lt;/a&gt;.
The full word lists are in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;results/trending_words_by_year&lt;/code&gt;. A
snapshot of the trending-words bubble plot for 2002 is shown below:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/ketancmaheshwari/SMC18/master/results/trending_words_by_year/2002.png&quot; alt=&quot;trending bubble 2002&quot; title=&quot;top 10 research words in 2002&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The Awk code that generates the raw data for the picture above is shown below:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;#!/usr/bin/env awk -f&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# find the top 10 trending topics year-wise and see how they appear/disappear in the trend&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# We achieve this by writing keywords, titles and abstract to files named after &lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# the year they appeared and do postprocessing on those files&lt;/span&gt;

BEGIN &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    FS &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;qwqw&quot;&lt;/span&gt;
    IGNORECASE &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 1

    &lt;span class=&quot;c&quot;&gt;# Field names&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;2&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;num_authors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;3&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;doi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;4&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;fos_isbn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;5&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;doctype_issn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;6&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;lang&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;7&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;n_citation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;8&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;issue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;9&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;10&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;volume&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;11&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;page_start&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;12&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;page_end&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;13&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;14&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;venue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;15&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;publisher_pdf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;16&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;references&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;17&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;keywords&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;18&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;abstract&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;19&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;authors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;20&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#collect stop words&lt;/span&gt;
NR &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; FNR &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;x[&lt;span class=&quot;nv&quot;&gt;$1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;next&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;nv&quot;&gt;$lang&lt;/span&gt;~/en/ &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$n_citation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;0 &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$year&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;yr &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$keywords&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;~/null/&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c&quot;&gt;# write title, keywords and abstract to a file &lt;/span&gt;
    &lt;span class=&quot;c&quot;&gt;#      titled by the year in which they appear&lt;/span&gt;
    
    &lt;span class=&quot;c&quot;&gt;# treat title&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;$title&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; tolower&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$title&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$title&lt;/span&gt;, a, &lt;span class=&quot;s2&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;i &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;a&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;length&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;a[i]&lt;span class=&quot;o&quot;&gt;)&amp;gt;&lt;/span&gt;2 &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; match&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;a[i],/[a-z]/&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; a[i] &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; 0&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; print a[i]

    &lt;span class=&quot;c&quot;&gt;# treat keywords&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;$keywords&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; tolower&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$keywords&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    gsub&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;,&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;,&lt;span class=&quot;nv&quot;&gt;$keywords&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$keywords&lt;/span&gt;, b, &lt;span class=&quot;s2&quot;&gt;&quot;,&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;i &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;b&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;length&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;b[i]&lt;span class=&quot;o&quot;&gt;)&amp;gt;&lt;/span&gt;2 &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; match&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;b[i],/[a-z]/&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; b[i] &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; 0&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; print b[i]

     &lt;span class=&quot;c&quot;&gt;# treat abstract&lt;/span&gt;
     &lt;span class=&quot;nv&quot;&gt;$abstract&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; tolower&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$abstract&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
     gsub&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;,&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;,&lt;span class=&quot;nv&quot;&gt;$abstract&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
     gsub&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;,&quot;&lt;/span&gt;,&lt;span class=&quot;s2&quot;&gt;&quot; &quot;&lt;/span&gt;,&lt;span class=&quot;nv&quot;&gt;$abstract&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
     &lt;span class=&quot;nb&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$abstract&lt;/span&gt;, c, &lt;span class=&quot;s2&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;i &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;c&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;length&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;c[i]&lt;span class=&quot;o&quot;&gt;)&amp;gt;&lt;/span&gt;2 &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; match&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;c[i],/[a-z]/&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; c[i] &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;x &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; 0&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; print c[i]

&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#Do the following for postprocessing:&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#for i in 18?? 19?? 20??&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# do (grep -o -E '\w+' $i | tr [A-Z] [a-z] \&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#     | sed -e 's/null//g' -e 's/^.$//g' -e 's/^..$//g' -e 's/^[0-9]*$//g' \&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#     | awk NF | fgrep -v -w -f stop_words.txt \&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#     | sort | uniq -c | sort -nr \&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#     | head -10 &amp;gt; trending/trending.$i.txt) &amp;amp; done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
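
&lt;p&gt;The post-processing described in the comments above comes down to counting
words and keeping the most frequent ones. Below is a minimal sketch of just that
counting step on a toy word list. The function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;top_terms&lt;/code&gt; and the sample
words are made up for illustration and are not part of the project; lowercasing
and stop-word removal are assumed to have happened already.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#!/usr/bin/env bash
# Hypothetical helper (not part of the project): read one word per line from
# stdin and print the n most frequent words as &quot;count word&quot; pairs.
top_terms() {
    local n=&quot;${1:-10}&quot;
    # Count occurrences in Awk, then rank by count (ties broken alphabetically).
    awk '{ count[$1]++ } END { for (w in count) print count[w], w }' |
        sort -k1,1nr -k2,2 |
        head -n &quot;$n&quot;
}

# Toy input standing in for one year's word file
printf '%s\n' graph neural graph protein neural graph | top_terms 2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Doing the count in an Awk array avoids the padded output of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uniq -c&lt;/code&gt;, and
the secondary sort key keeps the order stable when two words have the same count.&lt;/p&gt;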

&lt;p&gt;Parallelizing the third approach was challenging as it involved a two-level
nested foreach loop. The outer loop iterates over the years and the inner loop
iterates over the input files. The HPC implementation finishes in &lt;strong&gt;48
minutes&lt;/strong&gt;. The Swift code for this is shown below.&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;files&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;io&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;app&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;myawk&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;awkprog&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;file&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;infile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;file&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stopwords&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;string&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;yr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;
  &lt;span class=&quot;s&quot;&gt;&quot;/usr/bin/awk&quot;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;-v&quot;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;yr&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;-f&quot;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;awkprog&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stopwords&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;infile&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stdout&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aminer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;glob&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/dev/shm/aminer_mag_papers/*.txt&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;foreach&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1800&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2017&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]{&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;file&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;yearfiles&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[];&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;foreach&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aminer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;yearfiles&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;myawk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/home/km0/SMC18/src/prob4_p3.awk&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;  
                         &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                         &lt;span class=&quot;n&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/home/km0/SMC18/data/stop_words.txt&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; 
                         &lt;span class=&quot;n&quot;&gt;sprintf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;yr=%s&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toString&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)));&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;file&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;joined&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sprintf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;year%s.txt&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toString&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;yearfiles&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
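
&lt;p&gt;For readers without Swift, the same two-level loop can be sketched serially in
shell: the outer loop walks the years, the inner loop runs Awk once per input
file, and the per-file outputs are joined into one file per year. This is only a
sketch under assumptions. The function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;collect_year_words&lt;/code&gt; and its
argument order are hypothetical, the paths in the example call are placeholders,
and unlike the Swift script above it runs one Awk at a time.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#!/usr/bin/env bash
# Hypothetical serial counterpart of the Swift script above.
# usage: collect_year_words PROG STOPWORDS OUTDIR FIRST_YEAR LAST_YEAR FILE...
collect_year_words() {
    local prog=&quot;$1&quot; stopwords=&quot;$2&quot; outdir=&quot;$3&quot; first=&quot;$4&quot; last=&quot;$5&quot;
    shift 5
    local y f
    mkdir -p &quot;$outdir&quot;
    for y in $(seq &quot;$first&quot; &quot;$last&quot;); do      # outer loop: years
        for f in &quot;$@&quot;; do                      # inner loop: input files
            # Stop words go first so the NR == FNR rule can load them.
            awk -v yr=&quot;$y&quot; -f &quot;$prog&quot; &quot;$stopwords&quot; &quot;$f&quot;
        done &amp;gt; &quot;$outdir/year$y.txt&quot;            # one joined file per year
    done
}

# Example call (all paths are placeholders):
# collect_year_words path/to/prob4_p3.awk path/to/stop_words.txt out 1800 2017 path/to/papers/*.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;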

&lt;h3 id=&quot;problem-5&quot;&gt;Problem 5&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Given a research proposal, determine whether the proposed work has been accomplished previously.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This has a simple solution: find the keywords in a new proposal. If those
keywords appear in an existing publication record, that record is a suspect. A broad
list of suspects may be found with a logical &lt;strong&gt;OR&lt;/strong&gt; between keywords, which can
be narrowed down with a logical &lt;strong&gt;AND&lt;/strong&gt;. The keywords may be arbitrarily combined
in ORs and ANDs. The results file &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/results/suspects.txt&lt;/code&gt; shows over 1,400
suspects for an &lt;strong&gt;AND&lt;/strong&gt; combination of the keywords &lt;em&gt;battery&lt;/em&gt;, &lt;em&gt;electronics&lt;/em&gt;,
&lt;em&gt;lithium&lt;/em&gt;, and &lt;em&gt;energy&lt;/em&gt; from English-language papers. The HPC implementation
finishes in &lt;strong&gt;26 seconds&lt;/strong&gt;. The Awk code is shown below.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;#!/usr/bin/env awk -f&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Problem Statement&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#    Given a research proposal, determine whether the proposed work has been&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#    accomplished previously.&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Solution: Find the keywords in the new proposal. &lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# If those keywords appear in an existing publication record, it is a suspect.&lt;/span&gt;

BEGIN &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    FS &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;qwqw&quot;&lt;/span&gt;
    IGNORECASE &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 1 

    &lt;span class=&quot;c&quot;&gt;# Field names&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;2&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;num_authors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;3&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;doi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;4&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;fos_isbn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;5&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;doctype_issn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;6&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;lang&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;7&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;n_citation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;8&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;issue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;9&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;10&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;volume&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;11&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;page_start&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;12&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;page_end&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;13&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;14&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;venue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;15&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;publisher_pdf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;16&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;references&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;17&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;keywords&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;18&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;abstract&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;19&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;authors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;20&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# topic1 .. topic4 are provided at command line&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$0&lt;/span&gt;~topic1 &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$0&lt;/span&gt;~topic2 &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$0&lt;/span&gt;~topic3 &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$0&lt;/span&gt;~topic4 &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$lang&lt;/span&gt;~/en/ &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$authors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;~/null/&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    print &lt;span class=&quot;nv&quot;&gt;$id&lt;/span&gt;, &lt;span class=&quot;nv&quot;&gt;$title&lt;/span&gt;, &lt;span class=&quot;nv&quot;&gt;$authors&lt;/span&gt;, &lt;span class=&quot;nv&quot;&gt;$year&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
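
&lt;p&gt;To make the AND and OR combinations concrete, here is a minimal sketch with two
topics on toy records. The helper names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;suspects_and&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;suspects_or&lt;/code&gt; and the
sample records are hypothetical. The sketch passes topics with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-v&lt;/code&gt; as the code
above expects, but it leaves out the language and author filters and uses
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tolower&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IGNORECASE&lt;/code&gt;, so the topics are assumed to be given in
lowercase.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#!/usr/bin/env bash
# Hypothetical helpers illustrating AND vs OR keyword matching.
# usage: suspects_and TOPIC1 TOPIC2 [FILE...]   (reads stdin if no FILE)

# AND: a record must mention both topics (narrow list of suspects).
suspects_and() {
    local t1=&quot;$1&quot; t2=&quot;$2&quot;
    shift 2
    awk -v topic1=&quot;$t1&quot; -v topic2=&quot;$t2&quot; \
        'tolower($0) ~ topic1 &amp;amp;&amp;amp; tolower($0) ~ topic2' &quot;$@&quot;
}

# OR: a record may mention either topic (broad list of suspects).
suspects_or() {
    local t1=&quot;$1&quot; t2=&quot;$2&quot;
    shift 2
    awk -v topic1=&quot;$t1&quot; -v topic2=&quot;$t2&quot; \
        'tolower($0) ~ topic1 || tolower($0) ~ topic2' &quot;$@&quot;
}

# Toy records; prints: Lithium battery aging
printf '%s\n' 'Lithium battery aging' 'Solar energy storage' 'Battery recycling' |
    suspects_and battery lithium
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On the same toy records, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;suspects_or battery lithium&lt;/code&gt; also returns the
&quot;Battery recycling&quot; record, which is the broad-then-narrow behaviour described above.&lt;/p&gt;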

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;I show how classic Unix tools may be leveraged to solve modern problems
and that, at scale, millions of records may be processed in under a minute.
As for the data itself, biosciences research seems to dominate
the publications, followed perhaps by physics. I am sure more sophisticated
tools could be used to get refined results and gain better insights – this is
my take. I &lt;a href=&quot;https://twitter.com/SciDatathon/status/1120335746358026240&quot;&gt;won&lt;/a&gt;
the data challenge. Awk is awesome!&lt;/p&gt;</content><author><name></name></author><category term="posts" /><summary type="html">TL;DR Awk crunches massive data; a High Performance Computing (HPC) script calls hundreds of Awk concurrently. Fast and scalable in-memory solution on a fat machine.</summary></entry></feed>